Hi Alex,

-----Original message-----
> From:Alex McLintock <[email protected]>
> Sent: Thursday 14th November 2013 14:34
> To: [email protected]
> Subject: Performing Web Scraping within the content of fetched html pages
> 
> Hi Folks,
> 
> I'm reasonably familiar with older versions of Nutch - but have been out of
> the loop for a bit. I've done some googling, and reading docs, and have not
> really understood everything yet.
> 
> Would someone please summarise the state of play if I want to do web
> scraping with Nutch - eg to extract text that is delimited with a specific
> CSS tag, or is found within a particular XPath?
> 
> Now in the past this was totally impossible because if you wanted to write
> a plugin then Nutch had already thrown away anything like html and just
> left the "plain text" content.

Well, Nutch passes a DocumentFragment to the parse filter plugins. When you use
the Tika parser, that fragment contains the normalized document. You can run
XPath and other DOM operations on it, including checking style attributes.
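
For illustration, here is a minimal sketch of evaluating an XPath expression
against a DocumentFragment using only the JDK's own XML classes. The class name
FragmentXPathSketch, the helper sampleFragment and the sample markup are
hypothetical; in a real parse filter the fragment would come from Nutch instead
of being built by hand. Element-name case and namespaces depend on the parser
in use, so the expression may need adjusting (for example with local-name()).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
class FragmentXPathSketch {

    // Evaluate a relative XPath expression with the fragment as context node
    // and return the trimmed text content of every matching node.
    static List<String> extract(DocumentFragment fragment, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, fragment, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    // Stand-in for the fragment Nutch would hand to a plugin: parse a
    // well-formed snippet and move its root element into a new fragment.
    static DocumentFragment sampleFragment(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.getDocumentElement());
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div class=\"price\">42 EUR</div>"
                + "<div class=\"note\">n/a</div></body></html>";
        System.out.println(extract(sampleFragment(page), ".//div[@class='price']"));
    }
}
```

The expression is relative (it starts with "."), so it is resolved against the
fragment itself and not against an owning document.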

> So if I wanted to take that html and push it on to some other task -
> whether Hadoop based or elsewhere, what would I need to learn about? Is
> this still plugin based? or do I just need to learn how to write my own
> Hadoop jobs which read the nutch database?

Using plugins is the easiest option, but you can still write a custom MapReduce
job and read the Content directory of a segment, which contains the raw
unparsed data. You'd still need a suitable parser, so a plugin would be better.
> 
> Presumably people do do this, right? There are many other web scraping
> systems out there, but I'd like to stick with Nutch if possible.

We use Nutch too, with the Tika parser. We moved most of our extraction logic
out of Nutch and into Tika ContentHandlers instead. It is very easy to create
new ContentHandlers, process SAX events there and report the results back to
Nutch. The good part is that the handlers don't depend on Nutch, so you can use
them anywhere. It is also easier to write very specific unit tests for them
than to go through Nutch, as you would have to with plugins.
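
As a rough sketch of that approach (not our actual code), the handler below
collects the text of every element whose class attribute equals a given value.
The names ClassTextHandler and run are made up for this example, and it is
driven by the JDK's SAX parser on well-formed markup so that it runs without
Tika or Nutch on the classpath. With Tika you would pass the same handler to a
parser's parse method.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers text inside elements with a given class value.
class ClassTextHandler extends DefaultHandler {
    private final String wantedClass;
    private final List<String> matches = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a matching element

    ClassTextHandler(String wantedClass) {
        this.wantedClass = wantedClass;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside a match, keep collecting
        } else if (wantedClass.equals(atts.getValue("class"))) {
            depth = 1; // exact attribute match only, no token splitting
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            matches.add(current.toString().trim());
        }
    }

    List<String> getMatches() {
        return matches;
    }

    // Convenience driver for tests: feed well-formed markup through the handler.
    static List<String> run(String xhtml, String wantedClass) throws Exception {
        ClassTextHandler handler = new ClassTextHandler(wantedClass);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getMatches();
    }
}
```

Because the handler only sees SAX events, a unit test can feed it a small
snippet directly, with no crawl or segment involved.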

> 
> Alex
> 
