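Hi Alex,

You can use the segment reader (`./nutch readseg ...`) to inspect both the raw binary content that was fetched and the data extracted by the parse. That should at least give you some insight into where things went wrong: if the content stored for the "Contact us" URL is already the login page, the problem is in fetching; if the content is right but the parsed title is wrong, it is in parsing.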
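If you dump a segment to text, a small script can list the parsed title per URL so the mismatches stand out. This is only a rough sketch: it assumes the dump marks each record with a `URL:: ` line and that the parse data section contains a `Title: ` line, which you should verify against your own dump. The function name `titles_by_url` is made up for this example. It is safest to run it on a dump that excludes the raw content, since a page body could itself contain a line starting with `Title:`.

```python
import sys


def titles_by_url(dump_text):
    """Map each URL in a segment text dump to its first parsed title.

    Assumed layout (check against your own dump): every record has a
    line starting with "URL:: ", and its parse data has a line starting
    with "Title:".
    """
    titles = {}
    url = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line.startswith("URL:: "):
            # A new record begins; remember which URL we are in.
            url = line[len("URL:: "):].strip()
        elif line.startswith("Title:") and url is not None:
            # Keep only the first title seen for each URL.
            titles.setdefault(url, line[len("Title:"):].strip())
    return titles


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python titles.py <path to the text dump>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for page_url, title in sorted(titles_by_url(handle.read()).items()):
            print(f"{title!r}\t{page_url}")
```

Running this on the segment from a single-threaded crawl and again on one from a multi-threaded crawl, then comparing the two listings, should show which URLs end up with the wrong title after parsing.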
HTH

Julien

On 3 September 2015 at 16:13, Alex Wang wrote:

> Hi,
>
> We are using Nutch 1.9 to crawl an internal website and index the content
> to Solr 3.5. We found that the page title indexed for certain HTML
> pages is wrong. For example, the "Contact us" page has "Login" as its page
> title in the Solr index. This only happens when we use multiple threads to
> fetch (fetcher.threads.per.queue=5); fetching with a single thread seems
> to be OK.
>
> Can someone please point me in the right direction as to how to debug this
> problem in Nutch? I would like to find out at what stage the title gets
> messed up (fetching, parsing or indexing), but I am not sure where to
> start. How can I examine the result of each step for a particular HTML
> page?
>
> Any suggestions are really appreciated!
>
> Alex

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

