1. Filter Definition and Types
A filter is a set of parameters, written as regular expressions, that defines what the crawler extracts from a website and what it discards.
There are three different kinds of filters:
- Title: to extract the headlines.
- Content: to extract the body content.
- Gallery: to extract the images of a slideshow or gallery content.
Filters must be manually configured for each section of every site in our database to ensure comprehensive monitoring. The Online DataOps team is responsible for setting up these filters using regular expressions, ensuring our crawlers effectively track relevant media content.
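As a rough illustration of what a regex-based title and content filter does, here is a minimal Python sketch. The patterns, the class names (`headline`, `article-body`), the sample HTML, and the `extract` helper are all hypothetical; they are not taken from our production configuration or from the crawler's actual filter syntax.

```python
import re

# Hypothetical filters: each one is a regex whose first group is the part to keep.
TITLE_FILTER = re.compile(r'<h1[^>]*class="headline"[^>]*>(.*?)</h1>', re.DOTALL)
CONTENT_FILTER = re.compile(r'<div[^>]*class="article-body"[^>]*>(.*?)</div>', re.DOTALL)

# Used to strip any markup left inside the captured fragment.
TAG = re.compile(r"<[^>]+>")


def extract(pattern, html):
    """Return the text captured by the filter, or None if the filter does not match."""
    match = pattern.search(html)
    if match is None:
        return None
    text = TAG.sub(" ", match.group(1))
    return " ".join(text.split())  # collapse whitespace


# Hypothetical page: navigation and footer should be discarded.
SAMPLE_PAGE = (
    "<nav>Home | Sports</nav>"
    '<h1 class="headline">City opens new library</h1>'
    '<div class="article-body"><p>The library opens on Monday.</p>'
    "<p>Entry is free.</p></div>"
    "<footer>Cookie notice</footer>"
)

title = extract(TITLE_FILTER, SAMPLE_PAGE)
content = extract(CONTENT_FILTER, SAMPLE_PAGE)
```

Everything outside the two captured groups (here the navigation bar and the footer) is simply never extracted, which is the "discarded" part of the definition above.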
Below is an example of how a title filter and a content filter appear in a specific section:
We are currently transitioning our monitoring system to a new crawler that will work by configuring CSS selectors instead of regex. This will allow us to optimize data extraction results and streamline maintenance processes.
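To give a feel for selector-based extraction, the sketch below emulates a very simple `tag.class` CSS selector using only Python's standard `html.parser` module. This is an assumption-laden illustration: the new crawler's actual selector engine and configuration format are not described here, and the `SelectorExtractor` class, the `select_text` helper, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class SelectorExtractor(HTMLParser):
    """Collects the text inside elements matching a simple `tag.class` selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, self.cls = selector.split(".", 1)
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements with the same tag so we close at the right one.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.cls in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def select_text(html, selector):
    """Return the whitespace-normalized text selected by `selector`."""
    parser = SelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page with extra classes and nested markup.
SAMPLE_HTML = (
    '<div class="post wide">'
    '<h1 class="headline big">City opens <em>new</em> library</h1>'
    '<div class="article-body"><p>The library opens on Monday.</p>'
    '<div class="note">Updated at noon.</div><p>Entry is free.</p></div>'
    "<p>Footer</p></div>"
)

selected_title = select_text(SAMPLE_HTML, "h1.headline")
selected_body = select_text(SAMPLE_HTML, "div.article-body")
```

Unlike a regex tied to the exact attribute string, the selector keeps matching when the site adds extra classes or nested markup, which is one reason selector-based configuration tends to be easier to maintain.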
2. What Are Gallery Filters?
2.1 Definition & Use Cases
Gallery filters work differently from content filters: they identify gallery-style articles and extract images from slideshows or other gallery formats. These filters allow us to track:
- Sliders: Galleries where images are revealed through scrolling or interaction.
- Vertical Galleries: Galleries whose captions or images the content filter cannot adequately track.
When Is a Gallery Filter Required?
- Sliders:
  - Galleries in a slide format that require user interaction to reveal images.
  - Content filters cannot track these images, so a gallery filter is the only way to extract them.
  - The main article and the slides are collected as separate documents (as you can see here).
- Vertical galleries that cannot be tracked with a content filter:
  - The content filter usually captures the body content of an online article, but sometimes it also tracks image captions as part of the article. In this case, we do not create a separate document for the gallery, as you can see in this example document.
  - However, if the content filter fails to extract the captions correctly, a gallery filter must be used instead. In this case, the gallery is tracked as a separate document, as you can see in this example document.
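The decision described in this list can be summarized in a small Python sketch. The `GalleryKind` enum, the function names, and the document labels are hypothetical and exist only to illustrate the rule; they do not correspond to any real crawler setting.

```python
from enum import Enum


class GalleryKind(Enum):
    NONE = "none"          # article without a gallery
    SLIDER = "slider"      # images revealed through user interaction
    VERTICAL = "vertical"  # images and captions stacked in the page


def needs_gallery_filter(kind, content_filter_captures_captions):
    """Return True when a gallery filter has to be configured."""
    if kind is GalleryKind.SLIDER:
        # Content filters cannot track slider images at all.
        return True
    if kind is GalleryKind.VERTICAL:
        # Only needed when the content filter misses the captions.
        return not content_filter_captures_captions
    return False


def expected_documents(kind, content_filter_captures_captions):
    """List the documents produced for one article under this rule."""
    documents = ["article"]
    if needs_gallery_filter(kind, content_filter_captures_captions):
        documents.append("gallery")  # gallery tracked as a separate document
    return documents
```

For example, a vertical gallery whose captions are already captured by the content filter yields a single document, while a slider always yields two.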
3. Gallery Filters: Rules & Limitations
3.1 Key Rules
- Gallery filters are used only when necessary, specifically for:
  - Tracking slideshows.
  - Extracting image captions when the content filter fails.
- They are NOT used to track images alone, because images alone will not match any keywords of interest to clients or any definition set up in a client's feed.
- Every time a gallery filter is applied, a document separate from the main article is created.
3.2 Limitations
Despite their utility, gallery filters have some key technical limitations:
- Multiple Gallery Formats on the Same Page
  - Some media sites use different gallery structures within a single article (e.g., sliders and vertical galleries).
  - A separate gallery filter is needed for each format.
  - The crawler applies only the first matching filter and ignores the others, so some content may be skipped.
  - Ordering filters from most specific to least specific can help but does not guarantee complete monitoring. For example, if a media outlet has both a slider and a vertical gallery, and the slider filter is applied first, the vertical gallery may be ignored, even if its filter would be the more suitable one.
- JavaScript-Based Galleries Are Not Yet Supported
  - The current gallery filter does not work with JavaScript-based galleries.
  - JavaScript content requires the new crawler, and the gallery filter is not yet integrated into it, so these formats cannot be monitored at present.
  - A solution is in development to address this limitation.
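The first-match behaviour described under "Multiple Gallery Formats on the Same Page" can be sketched as follows. This is a simplified model, not the crawler's implementation: the `GalleryFilter` class, the regex patterns, and the sample page are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class GalleryFilter:
    name: str
    pattern: str  # hypothetical regex; group 1 captures a caption


def apply_first_match(filters, html):
    """Apply filters in order and stop at the first one that matches."""
    for gallery_filter in filters:
        captions = re.findall(gallery_filter.pattern, html, flags=re.DOTALL)
        if captions:
            # Later filters are never evaluated, even if they would also match.
            return gallery_filter.name, captions
    return None, []


SLIDER = GalleryFilter("slider", r'<div class="slide">.*?<p class="cap">(.*?)</p>')
VERTICAL = GalleryFilter("vertical", r'<figure class="vertical">.*?<figcaption>(.*?)</figcaption>')

# Hypothetical article containing both gallery formats.
PAGE = (
    '<div class="slide"><img src="a.jpg"><p class="cap">Slide A</p></div>'
    '<figure class="vertical"><img src="b.jpg">'
    "<figcaption>Vertical B</figcaption></figure>"
)

slider_first = apply_first_match([SLIDER, VERTICAL], PAGE)
vertical_first = apply_first_match([VERTICAL, SLIDER], PAGE)
```

Whichever filter comes first wins and the other gallery's captions are skipped, which is why reordering filters can reduce omissions on such pages but cannot eliminate them.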
Because of these limitations, and because our main goal is to avoid omissions, we prioritize content filters whenever possible and resort to gallery filters only when there is no other way to track gallery captions. This is why some galleries appear as separate documents while others are embedded within the main article, which can lead to inconsistencies in the number of documents received.