Hi Moumita,
I once used boilerpipe (https://code.google.com/p/boilerpipe/) to strip
common headers, footers, and similar boilerplate from pages.
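
If the header and footer text really is identical on every page, a cruder
alternative to boilerpipe is to drop the leading and trailing words that all
pages share before the text is sent to Solr. The sketch below is only an
illustration of that idea: it is not boilerpipe, it is not part of Nutch or
Solr, and the function names are made up for this example. It assumes you
already have the extracted plain text of each page as a list of strings.

```python
def common_prefix_len(token_lists):
    """Count how many leading tokens are identical in every list."""
    n = 0
    # zip() stops at the shortest list, so we never index past the end.
    for column in zip(*token_lists):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def strip_common_affixes(pages):
    """Remove the words shared at the start and end of every page.

    Hypothetical helper: compares pages word by word, so whitespace in
    the result is normalized to single spaces.
    """
    if len(pages) < 2:
        return list(pages)  # nothing to compare against

    tokens = [page.split() for page in pages]
    head = common_prefix_len(tokens)                    # shared header words
    tail = common_prefix_len([t[::-1] for t in tokens])  # shared footer words

    cleaned = []
    for t in tokens:
        # Never let the footer cut reach back into the header cut.
        end = max(len(t) - tail, head)
        cleaned.append(" ".join(t[head:end]))
    return cleaned
```

For example, given "Home About Contact Welcome to our site Copyright 2014"
and "Home About Contact Product list and prices Copyright 2014", the shared
"Home About Contact" and "Copyright 2014" are dropped from both. Note that
this only works when the shared text matches exactly; a header that varies
per page (a user name, a highlighted menu item) stops the match at that word.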
Ahmet
On Tuesday, November 11, 2014 10:41 AM, Moumita Dhar01
moumita_dha...@infosys.com wrote:
Hi,
I am using Nutch 1.9 and Solr 4.6 to index a web application with approximately
100 distinct URLs and their contents.
Nutch fetches the URLs, follows the links, and crawls the entire web application
to extract the content of every page and send it to Solr.
The problem I have now is that the first 1000 or so characters and the last
400 or so characters of each page, which are the common header and footer,
show up in the search results.
Is there a way to ignore the links or keep only the static text in the content?
Any useful pointers would be highly appreciated.
Regards,
Moumita Dhar