Hi,

This is always an interesting problem. You can either buy or build your own
extraction software, or settle for what Boilerpipe has to offer. Tika supports
Boilerpipe, and NUTCH-961 has a patch that enables Boilerpipe in 2.x as well.

https://issues.apache.org/jira/browse/NUTCH-961

Be careful: although Boilerpipe does a good job in general, it is not an
all-purpose library and sometimes does a bad job. If your pages are semi-well
structured, it will usually be good enough.
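
To illustrate why page structure matters, here is a minimal sketch of a
block-level filter that drops short blocks such as navigation links. It is
not Boilerpipe's actual algorithm and does not use its API; it only mimics
the general idea of judging blocks by shallow features such as word count.
The class name SnippetSketch, the MIN_WORDS threshold and the sample blocks
are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SnippetSketch {

    // Hypothetical threshold, chosen for illustration only.
    static final int MIN_WORDS = 10;

    /** Counts whitespace-separated tokens in a block of text. */
    static int wordCount(String block) {
        String trimmed = block.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only blocks with at least MIN_WORDS words and joins them. */
    static String mainText(List<String> blocks) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (wordCount(block) >= MIN_WORDS) {
                kept.add(block.trim());
            }
        }
        return String.join(" ", kept);
    }

    /** Cuts text at the last space at or before maxChars and appends "...". */
    static String snippet(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        int cut = text.lastIndexOf(' ', maxChars);
        return text.substring(0, cut > 0 ? cut : maxChars) + "...";
    }

    public static void main(String[] args) {
        // Semi-well structured page: each navigation item is its own block.
        List<String> structured = Arrays.asList(
            "Home",
            "Site Map",
            "Advanced Search",
            "This report describes the results of a ten year study of river sediment.");
        System.out.println(snippet(mainText(structured), 80));

        // Poorly structured page: navigation arrives as one long block,
        // so the word-count test cannot tell it apart from real content.
        List<String> flat = Arrays.asList(
            "Home Publications Home Author's Corner All Publications Advanced Search Site Map");
        System.out.println(snippet(mainText(flat), 80));
    }
}
```

In the first case the short navigation blocks are dropped and only the
sentence survives. In the second case the navigation text passes the filter
unchanged, which is the kind of failure to expect on badly structured pages.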

Cheers
 
-----Original message-----
> From:A Laxmi <[email protected]>
> Sent: Friday 12th July 2013 17:15
> To: [email protected]
> Subject: Nutch(2.2.1) How to extract a proper snippet text from a crawled 
> site to display under search result?
> 
> I could crawl a bunch of urls using Nutch 2.2.1 with data stored in MySQL
> and I could index it using Solr. Now, when I want to display the search
> results on the front-end (using 'ajax-solr'), I am not sure how to display a
> snippet below the title the way Google does.
> 
> When the Nutch crawler crawls a site, it grabs all the data on the site,
> including the text in banners, navigation, etc., into a field called
> 'text' (earlier it used to be 'content'). If I want to use that 'text'
> column to serve as a snippet on the search results page, it looks odd, as
> the snippet looks something like this -
> 
> *Publications [Jump to the main content of this page]  Home Publications
> Home Author's Corner All Publications Advanced Search Site Map   Search
> Online Publications     Ordering printed copies. Electronic Mailing List :
> Keep informed about our new publications. Technical Help : Problems or
> questions with our site? *
> 
> As you can see in the sample snippet above, it shows the text from the
> site's banner along with navigation '[Jump to the main content of this
> page] ' and a lot of unnecessary information, rather than a description of
> the site.
> 
> I have to crawl sites with an unknown/poor structure over which I have no
> control. How can I display a proper snippet with less garbage in a search
> result (something similar to the snippet on a Google search result)?
> 
