Hi Markus,

Thanks for the quick response! Please tell me if I should just read the
relevant part of the code instead. Judging from the data stored in HBase
(with Nutch 2.x), though, "parse" changed the "Document" itself: in my
case, it cleaned up the HTML tags in "content".

Do you mean that parse only adds metadata somewhere, which then waits
for the indexing filters to index it into HBase? Maybe I'm not
understanding "indexing" correctly.

I'm trying to use the new jsoup-extractor to parse (and index) certain
fields with CSS selectors. I also want to keep the indexing by
index-basic and index-anchor, and preferably the raw HTML/data as well.
Am I on the right track?

Thank you!

Michael


On 08/02/2017 12:06 PM, Markus Jelsma wrote:
Hi,

A ParseFilter can add metadata to parsed records. An IndexingFilter can
access that metadata and process it before the fields added earlier by
the ParseFilter are indexed.

If you just want to index the values added by the ParseFilter, you can just use 
index-metadata to index it directly. Only use an IndexingFilter if you need 
additional logic.
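
The division of labour described above can be sketched in plain Java.
This is a hypothetical model, not the Nutch API: records are simple
String maps, and the class and method names (FilterChainSketch,
runParseFilters, runIndexFilters, demo) are made up for illustration.
It also assumes that filters within a stage run one after another, each
receiving the result of the previous one, and that a null result from
an index-stage filter drops the document. Check the Nutch source for
the actual behaviour.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Hypothetical model, NOT the Nutch API: records are plain String maps.
public class FilterChainSketch {

    // Parse stage: each filter receives the metadata left by the previous one.
    static Map<String, String> runParseFilters(
            List<UnaryOperator<Map<String, String>>> filters) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (UnaryOperator<Map<String, String>> filter : filters) {
            metadata = filter.apply(metadata);
        }
        return metadata;
    }

    // Index stage: each filter sees the document built so far plus the
    // parse metadata. Assumption: a null result drops the document.
    static Map<String, String> runIndexFilters(
            Map<String, String> parseMetadata,
            List<BiFunction<Map<String, String>, Map<String, String>,
                    Map<String, String>>> filters) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (BiFunction<Map<String, String>, Map<String, String>,
                Map<String, String>> filter : filters) {
            doc = filter.apply(doc, parseMetadata);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    // Two parse filters feed two index filters.
    static Map<String, String> demo() {
        List<UnaryOperator<Map<String, String>>> parseFilters = List.of(
            md -> { md.put("title.raw", "  Hello  "); return md; },
            // The second parse filter builds on what the first one added.
            md -> { md.put("title", md.get("title.raw").trim()); return md; });

        List<BiFunction<Map<String, String>, Map<String, String>,
                Map<String, String>>> indexFilters = List.of(
            // Copies a parse-metadata value into the document as a field.
            (doc, md) -> { doc.put("title", md.get("title")); return doc; },
            // The second index filter reads the field the first one wrote.
            (doc, md) -> {
                doc.put("title_length",
                        String.valueOf(doc.get("title").length()));
                return doc;
            });

        return runIndexFilters(runParseFilters(parseFilters), indexFilters);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the first index-stage lambda plays the role of a plain
"copy metadata into a field" step, while the second stands in for the
additional logic that would justify a custom IndexingFilter.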

Regards,
Markus

-----Original message-----
From: Michael Chen <[email protected]>
Sent: Wednesday 2nd August 2017 20:58
To: [email protected]
Subject: ParseFilter and IndexingFilter

Hi,

Does anyone know how multiple ParseFilters and IndexingFilters work
together? For example, does the first parse filter affect the second,
and does one index operation affect the next? The factories create
multiple filters in the first place, yet I couldn't find a definitive
answer in the docs. It would be great if someone could help answer this
question. Thanks in advance.

Best regards,

Michael



