I was able to crawl a set of URLs using Nutch 2.2.1, with the data stored
in MySQL, and index it using Solr. Now, when I want to display the search
results on the front end (using 'ajax-solr'), I am not sure how to display
a snippet below the title, the way Google does.

When the Nutch crawler crawls a site, it grabs all the text on each page,
including the text in the banner, navigation, etc., into a field called
'text' (earlier it used to be 'content'). If I use that 'text' field as the
snippet on the search results page, it looks odd, because the snippet
looks something like this:

*Publications [Jump to the main content of this page]  Home Publications
Home Author's Corner All Publications Advanced Search Site Map   Search
Online Publications     Ordering printed copies. Electronic Mailing List :
Keep informed about our new publications. Technical Help : Problems or
questions with our site? *

As the sample snippet above shows, it contains the banner text of the site
along with navigation text such as '[Jump to the main content of this
page]' and a lot of unnecessary information, rather than a description of
the site.

I have to crawl sites with an unknown or poor structure over which I have
no control. How can I display a proper snippet with less garbage in the
search results (something similar to the snippet in a Google search
result)?
