Filter out relevant pieces from the parent pattern
A Scrubyt extractor is almost like a waterfall: water pours from the top
until it reaches the bottom. The biggest difference is that instead of
water, an HTML document travels down.
Of course, Scrubyt would not make much sense if the document arriving at
the bottom were the same one that was poured in at the top, since in that
case we might as well use an identity transformation (i.e. do nothing with
the input).
This is where filters come in: as their name says, they filter the stuff
that is pouring from above, keeping the interesting parts and discarding
the rest. How a filter works is most easily explained with an example.
Suppose we would like to extract information from a webshop; concretely,
we are interested in the name of each item and the URL pointing to its
image.
To accomplish this, we first select the items with the pattern ‘item’ (a
pattern is a logical grouping of filters; see the Pattern documentation).
The result extracted by the ‘item’ pattern then becomes our new context:
for every match of ‘item’, we further extract the name and the image of
the item; and finally, we extract the href attribute of the image.
Let’s see an illustration:
  root          --> This pattern is called a 'root pattern', It is invisible to you
  |                 and basically it represents the document; it has no filters
  +-- item      --> Filter what's coming from above (the whole document) to get
      |             relevant pieces of data (in this case webshop items)
      +-- name  --> Again, filter what's coming from above (a webshop item) and
      |             leave only item names after this operation
      +-- image --> This time filter the image of the item
          |
          +-- href --> And finally, from the image elements, get the attribute 'href'
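
The cascade in the illustration can be mimicked in a few lines of plain
Ruby. This is only a sketch to make the idea concrete, not Scrubyt's actual
API: the Pattern struct, its apply method, the regular expressions and the
sample markup are all hypothetical and invented for this example. Each
pattern filters what it receives from its parent and hands every resulting
piece down to its children as their new context.

```ruby
# A hand-rolled model of the filter cascade; this is NOT Scrubyt's API.
# Pattern, #apply, filter_proc and the sample markup are hypothetical.

# A pattern has a name, one filter (a lambda mapping a parent fragment to
# the list of matching pieces) and child patterns that run on each piece.
Pattern = Struct.new(:name, :filter_proc, :children) do
  # Filter one parent fragment, then pass every piece that survives down
  # to the child patterns as their new context.
  def apply(parent_fragment)
    filter_proc.call(parent_fragment).map do |piece|
      kids = children.flat_map { |child| child.apply(piece) }
      { name => kids.empty? ? piece : kids }
    end
  end
end

# Sample markup invented for this sketch. The URL sits in an 'href'
# attribute to match the illustration; real <img> tags normally use 'src'.
DOCUMENT = <<~HTML
  <div class="item"><h2>Red mug</h2><img href="/img/mug.png"/></div>
  <p>An advertisement that should be discarded</p>
  <div class="item"><h2>Blue cap</h2><img href="/img/cap.png"/></div>
HTML

href  = Pattern.new(:href,  ->(img)  { img.scan(/href="([^"]*)"/).flatten }, [])
image = Pattern.new(:image, ->(item) { item.scan(/<img[^>]*>/) }, [href])
name  = Pattern.new(:name,  ->(item) { item.scan(%r{<h2>(.*?)</h2>}).flatten }, [])
item  = Pattern.new(:item,  ->(doc)  { doc.scan(%r{<div class="item">.*?</div>}) }, [name, image])
# The root pattern has no real filter: it passes the whole document through.
root  = Pattern.new(:root,  ->(doc)  { [doc] }, [item])

RESULT = root.apply(DOCUMENT)
```

RESULT.first[:root] then holds one hash per webshop item, each containing
the item's name and, nested under image, the href value; the advertisement
paragraph never reaches the bottom because the 'item' filter discards it.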