Hi Alex

You can use the segment reader (`./nutch readseg ...`) to inspect both the
raw binary content that was fetched and the data extracted by the parser.
That should at least show you at which stage things went wrong.
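
As a rough sketch of what I mean (not a tested recipe): the script below
pulls the record for one URL out of a segment twice, once with only the raw
content and once with only the parse data, and prints the title found in
each. The segment path and page URL are placeholders, and the two helper
functions are my own. They assume that the raw HTML has a lowercase
`<title>` element on a single line and that the parse data is printed with a
`Title: ` line, so adjust them to what your dump actually looks like.

```shell
#!/bin/sh
# Placeholders (assumptions): point these at your own install, segment and page.
NUTCH="${NUTCH:-./nutch}"
SEGMENT="${SEGMENT:-crawl/segments/20150903161300}"
PAGE_URL="${PAGE_URL:-http://intranet.example.com/contact-us}"

# Hypothetical helper: print the parsed title from a readseg record on stdin.
# Assumes the parse data contains a line of the form "Title: ...".
parsed_title() {
  sed -n 's/^Title: //p' | head -n 1
}

# Hypothetical helper: print the first <title> from raw page content on stdin.
# Assumes a lowercase tag that opens and closes on one line.
raw_title() {
  sed -n 's/.*<title[^>]*>\(.*\)<\/title>.*/\1/p' | head -n 1
}

# Only call Nutch if the launcher and the segment directory are really there.
if [ -x "$NUTCH" ] && [ -d "$SEGMENT" ]; then
  # Raw content only: what the fetcher stored for this URL.
  "$NUTCH" readseg -get "$SEGMENT" "$PAGE_URL" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext | raw_title

  # Parse data only: the title the parser extracted for the same URL.
  "$NUTCH" readseg -get "$SEGMENT" "$PAGE_URL" \
    -nocontent -nofetch -nogenerate -noparse -noparsetext | parsed_title
fi
```

If the raw content already carries the wrong title, the mix-up happened
while fetching. If the content is right but the parsed title is wrong, look
at the parsing step. If both are right, the problem is further down, at
indexing.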

HTH

Julien

On 3 September 2015 at 16:13, Alex Wang <[email protected]> wrote:

> Hi,
>
> We are using Nutch 1.9 to crawl an internal website and index the content
> to Solr 3.5. What we found is that the page title indexed for certain HTML
> pages is wrong. For example, the "Contact us" page has "Login" as its page
> title in the Solr index. This only happens when we use multiple threads to
> fetch (fetcher.threads.per.queue=5); fetching with a single thread seems
> to be OK.
>
> Can someone please point me in the right direction as to how to debug this
> problem in Nutch? I would like to find out at what stage the title gets
> messed up (fetching, parsing, or indexing), but I am not sure where to
> start. How can I examine the result of each step for a particular HTML
> page?
>
> Any suggestions are really appreciated!
>
>
> Alex



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
