Hi,

Yes, you should write a plugin that has a parse filter and an indexing filter. To
ease maintenance you would want a file per host/domain containing XPath
expressions, which is far easier than switch statements that need to be
recompiled. The indexing filter would then index the field values extracted by
your parse filter.
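To make the idea concrete, here is a minimal, self-contained sketch of the extraction side in plain Java. The host names, field names, and XPath rules are made up for illustration; in a real Nutch HtmlParseFilter you would load the rules from one file per host and run the expressions against the DOM Nutch hands you (built by NekoHTML/TagSoup, which tolerates messy HTML, unlike the strict XML parser used here):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathRuleExtractor {

    // Hypothetical per-host rules. In practice each host would have its own
    // rules file (e.g. conf/xpath-rules/example.com.txt) loaded at startup,
    // so adding a site means adding a file, not recompiling the plugin.
    static final Map<String, Map<String, String>> RULES = new HashMap<>();
    static {
        Map<String, String> example = new HashMap<>();
        example.put("title", "//h1/text()");
        example.put("price", "//span[@class='price']/text()");
        RULES.put("example.com", example);
    }

    // Apply the rules registered for this host to the page and return the
    // extracted field values; pages from unknown hosts yield no fields.
    static Map<String, String> extract(String host, String xhtml) throws Exception {
        Map<String, String> fields = new HashMap<>();
        Map<String, String> rules = RULES.get(host);
        if (rules == null) {
            return fields; // no rules for this host: nothing to extract
        }
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            fields.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Widget</h1>"
                + "<span class='price'>9.99</span></body></html>";
        System.out.println(extract("example.com", page));
    }
}
```

Your parse filter would put the returned map into the parse metadata, and the indexing filter would copy those metadata entries into NutchDocument fields.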

Cheers,
Markus 
 
-----Original message-----
> From:Tony Mullins <[email protected]>
> Sent: Tue 11-Jun-2013 16:07
> To: [email protected]
> Subject: Data Extraction from 100+ different sites...
> 
> Hi,
> 
> I have 100+ different sites ( and may be more will be added in near
> future), I have to crawl them and extract my required information from each
> site. So each site would have its own extraction rule ( XPaths).
> 
> So far I have seen there is no built-in mechanism in Nutch to fulfill my
> requirement and I may have to write a custom HTMLParserFilter extension and
> an IndexFilter plugin.
> 
> And I may have to write 100+ switch cases in my plugin to handle the
> extraction rules of each site....
> 
> Is this the best way to handle my requirement, or is there a better way to
> handle it?
> 
> Thanks for your support & help.
> 
> Tony.
> 
