Removing Common Web Page Header and Footer from content

2014-11-11 Thread Moumita Dhar01
Hi,

I am using Nutch 1.9 and Solr 4.6 to index a web application with approximately 
100 distinct URLs and their contents.

Nutch fetches the URLs, follows the links, and crawls the entire web application 
to extract the content of every page and send it to Solr.

The problem I have now is that the first 1000 or so characters and the last 
400 or so characters of each page, which are the common header and footer, are 
showing up in the search results.

Is there a way to ignore the links, or to keep only the static text in the 
content?

Any useful pointers would be highly appreciated.


Regards,
Moumita Dhar


*** CAUTION - Disclaimer ***
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely
for the use of the addressee(s). If you are not the intended recipient, please
notify the sender by e-mail and delete the original message. Further, you are
not to copy, disclose, or distribute this e-mail or its contents to any other
person and any such actions are unlawful. This e-mail may contain viruses.
Infosys has taken every reasonable precaution to minimize this risk, but is
not liable for any damage you may sustain as a result of any virus in this
e-mail. You should carry out your own virus checks before opening the e-mail
or attachment. Infosys reserves the right to monitor and review the content of
all messages sent to or from this e-mail address. Messages sent to or from
this e-mail address may be stored on the Infosys e-mail system.
***INFOSYS End of Disclaimer INFOSYS***


Re: Removing Common Web Page Header and Footer from content

2014-11-11 Thread Ahmet Arslan
Hi Moumita,

I once used boilerpipe (https://code.google.com/p/boilerpipe/) to remove 
common headers, footers, etc.
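
If the header and footer really are character-for-character identical on 
every page, a much cruder approach may also be enough: strip the longest 
prefix and suffix that all page texts share. The following is only a rough 
sketch under that assumption, not something taken from Nutch or boilerpipe; 
the function names and the sample pages are made up for illustration.

```python
import os


def common_prefix_len(texts):
    """Length of the longest prefix shared by every string in texts."""
    return len(os.path.commonprefix(texts))


def common_suffix_len(texts):
    """Length of the longest suffix shared by every string in texts."""
    reversed_texts = [t[::-1] for t in texts]
    return len(os.path.commonprefix(reversed_texts))


def strip_common_header_footer(texts):
    """Remove the leading and trailing text that all pages share.

    Assumes the header and footer are identical on every page; any
    per-page variation (a user name, a date) ends the match early.
    """
    if len(texts) < 2:
        return list(texts)  # nothing to compare against
    head = common_prefix_len(texts)
    shortest = min(len(t) for t in texts)
    # Do not let the suffix overlap the prefix on the shortest page.
    tail = min(common_suffix_len(texts), shortest - head)
    return [t[head:len(t) - tail].strip() for t in texts]


# Hypothetical extracted page texts with a shared header and footer.
pages = [
    "Home | About | Contact  Welcome to the billing guide  (c) Example Corp",
    "Home | About | Contact  How to reset a password  (c) Example Corp",
]
for body in strip_common_header_footer(pages):
    print(body)
```

The match is purely character-based, so if the page bodies happen to begin 
or end with the same characters, those get clipped as well. For anything 
less uniform than a fixed header and footer, a real boilerplate-removal 
library is probably the safer choice.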

Ahmet


