Hi,

Yes, you should write a plugin that has a parse filter and an indexing filter. To ease maintenance, you would want one file per host/domain containing XPath expressions, which is far easier than switch statements that need to be recompiled. The indexing filter would then index the field values extracted by your parse filter.
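As a rough sketch of the file-per-host idea, using only the JDK and not the Nutch plugin API: the rule format ("field = XPath" in a properties file), the class name HostXPathRules, and the methods loadRules and extract are all made up for illustration. In a real parse filter you would evaluate the expressions against the DOM that Nutch hands you and put the values into the parse metadata, instead of parsing a string as done here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: per-host XPath rules kept in data, not code.
public class HostXPathRules {

    // Assumed rule format: one properties file per host, each line "fieldName = XPath".
    static Map<String, String> loadRules(String propertiesText) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        Map<String, String> rules = new LinkedHashMap<>();
        for (String field : props.stringPropertyNames()) {
            rules.put(field, props.getProperty(field));
        }
        return rules;
    }

    // Evaluate every rule against the page and keep the non-empty results.
    static Map<String, String> extract(Document doc, Map<String, String> rules) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                fields.put(rule.getKey(), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // In practice this text would be read from a file chosen by the URL's host.
        String rulesForHost = "title = //h1\nprice = //span[@class='price']\n";
        // Well-formed markup is assumed here only so the JDK XML parser can read it.
        String page = "<html><body><h1>Widget</h1><span class='price'>9.99</span></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
        System.out.println(extract(doc, loadRules(rulesForHost)));
    }
}
```

Adding a new site then means adding a new rules file, with no recompilation. The indexing filter only has to copy whatever fields the parse filter stored into the index document.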
Cheers,
Markus

-----Original message-----
> From:Tony Mullins <[email protected]>
> Sent: Tue 11-Jun-2013 16:07
> To: [email protected]
> Subject: Data Extraction from 100+ different sites...
>
> Hi,
>
> I have 100+ different sites ( and may be more will be added in near
> future), I have to crawl them and extract my required information from each
> site. So each site would have its own extraction rule ( XPaths).
>
> So far I have seen there is no built-in mechanism in Nutch to fulfill my
> requirement and I may have to write custom HTMLParserFilter extension and
> IndexFilter plugin.
>
> And I may have to write 100+ switch cases in my plugin to handle the
> extraction rules of each site....
>
> Is this the best way to handle my requirement or there is any better way to
> handle it ?
>
> Thanks for your support & help.
>
> Tony.
>

