Issue when fetching with multiple threads

Alex Wang Thu, 03 Sep 2015 08:15:03 -0700

Hi,

We are using Nutch 1.9 to crawl an internal website and index the content
into Solr 3.5. We found that the page title indexed for certain HTML
pages is wrong. For example, the "Contact us" page has "Login" as its page
title in the Solr index. This only happens when we fetch with multiple
threads (fetcher.threads.per.queue=5); fetching with a single thread seems
to be OK.


Can someone please point me in the right direction on how to debug this
problem in Nutch? I would like to find out at which stage the title gets
messed up (fetching, parsing, or indexing), but I am not sure where to
start. How can I examine the result of each step for a particular HTML
page?

Any suggestions are really appreciated!


Alex

