Hi Moumita,

I once used boilerpipe (https://code.google.com/p/boilerpipe/) to remove common headers, footers, and similar boilerplate.
Ahmet

On Tuesday, November 11, 2014 10:41 AM, Moumita Dhar01 <moumita_dha...@infosys.com> wrote:

Hi,

I am using Nutch 1.9 and Solr 4.6 to index a web application with approximately 100 distinct URLs and their contents. Nutch fetches the URLs, follows the links, crawls the entire web application to extract the content of every page, and sends that content to Solr.

The problem I have now is that the first 1000 or so characters and the last 400 or so characters of each page, which are the common header and footer, are showing up in the search results.

Is there a way to ignore the links or keep only the static text in the content?

Any useful pointers would be highly appreciated.

Regards,
Moumita Dhar
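
Since the header and footer described above are said to be identical on every page, one simple alternative to boilerpipe is to strip the longest prefix and suffix that all pages share. The following is only a minimal sketch under that assumption. The function name `strip_common_header_footer` is hypothetical, it works on already-extracted page text held in memory, and it does not show how to hook this into Nutch or Solr.

```python
import os


def strip_common_header_footer(pages):
    """Remove the longest prefix and suffix shared by every page.

    Assumption: the header and footer are byte-for-byte identical across
    all pages. This is NOT how boilerpipe works; it is a simpler stand-in.
    """
    if len(pages) < 2:
        return list(pages)  # nothing to compare against

    # os.path.commonprefix compares strings character by character.
    prefix_len = len(os.path.commonprefix(pages))

    # Reverse what is left of each page so the same call finds the shared suffix.
    reversed_tails = [page[prefix_len:][::-1] for page in pages]
    suffix_len = len(os.path.commonprefix(reversed_tails))

    return [page[prefix_len:len(page) - suffix_len].strip() for page in pages]


if __name__ == "__main__":
    sample = [
        "HEADER nav | Page one body | FOOTER",
        "HEADER nav | Second page text | FOOTER",
    ]
    for body in strip_common_header_footer(sample):
        print(body)
```

One caveat with this approach: if the real page bodies happen to begin or end with the same characters, those characters are trimmed as well, so it is worth checking the result against a few pages before indexing.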