Hi Sebastian,

Thanks for shedding some light on this issue. At least I now know where to
focus in order to fix it.

In the segment, the login form is stored under different URLs. Sometimes
even pages that do not need authentication have the login form indexed in
Solr instead of the actual page content. Each run of the crawler can
produce different results: in one run page A gets the login form indexed;
in the next run page A might be fine, but page B gets the login form. So
the problem is not consistent and does seem to be a concurrency issue.

You mentioned I should make the methods thread-safe. I will make all
methods in Http.java "synchronized" except for setConf(). Do I need to
make any other classes thread-safe?
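
To make the idea concrete, here is a minimal, self-contained Java sketch of
the kind of race a single shared plugin instance can produce. It is not
Nutch code: the Fetcher interface, the three implementations, and
countMismatches() are hypothetical names made up for illustration, and the
"fetch" is just string building. The assumption is that per-request state
ends up in an instance field that all fetcher threads share.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedPluginSketch {

    /** Hypothetical stand-in for a protocol plugin; NOT Nutch's real interface. */
    interface Fetcher {
        String fetch(String url);
    }

    /** Unsafe: per-request state lives in a field shared by all threads. */
    static class UnsafeFetcher implements Fetcher {
        private String currentBody;

        public String fetch(String url) {
            currentBody = "content of " + url;
            Thread.yield(); // widen the window in which another thread can overwrite the field
            return currentBody;
        }
    }

    /** Same shared field, but only one thread at a time may run fetch(). */
    static class SynchronizedFetcher implements Fetcher {
        private String currentBody;

        public synchronized String fetch(String url) {
            currentBody = "content of " + url;
            Thread.yield();
            return currentBody;
        }
    }

    /** No shared mutable state: each call works on its own local variable. */
    static class StatelessFetcher implements Fetcher {
        public String fetch(String url) {
            String body = "content of " + url;
            Thread.yield();
            return body;
        }
    }

    /** Counts how many calls returned a body that belongs to a different URL. */
    static int countMismatches(Fetcher fetcher, int threads, int requestsPerThread)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                results.add(pool.submit(() -> {
                    int bad = 0;
                    for (int i = 0; i < requestsPerThread; i++) {
                        String url = "http://example.invalid/page-" + id + "-" + i;
                        if (!fetcher.fetch(url).equals("content of " + url)) {
                            bad++;
                        }
                    }
                    return bad;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // wait for each worker and add up its mismatches
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // 5 threads, matching the fetcher.threads.per.queue value used in this thread.
        System.out.println("unsafe:       " + countMismatches(new UnsafeFetcher(), 5, 20000));
        System.out.println("synchronized: " + countMismatches(new SynchronizedFetcher(), 5, 20000));
        System.out.println("stateless:    " + countMismatches(new StatelessFetcher(), 5, 20000));
    }
}
```

On a multi-core machine the unsafe variant may report a non-zero count that
changes from run to run, which mirrors the inconsistent symptom described
above; the other two should always report 0. Note that "synchronized"
serializes calls through the shared instance, so keeping per-request data
in local variables is usually the cheaper fix where it is possible.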

By multi-threaded fetcher, I meant fetcher.threads.per.queue > 1 (in my
case, I set it to 5). I left fetcher.parse at the default value (false),
so parsing is done as a separate step after fetching.

Thanks again for your time. Any further guidance would be greatly
appreciated!

Alex



On Tue, Sep 8, 2015 at 9:53 AM, Sebastian Nagel <[email protected]>
wrote:

> Hi Alex,
>
> > Some of the pages on the site require login. I have enabled
> > HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
> > like the login page title gets indexed into Solr instead of the actual
> > page's title.
>
> Does this mean that one segment contains multiple records under the same
> URL, one for the login form and one for the actual page?
> Or is the login form contained with a different URL?
>
> Concurrency issues with plugins come up from time to time.
> For every plugin there is only one single instance shared among
> threads. In case of a multi-threaded fetcher this instance must
> be thread-safe, more precisely: the methods defined in the interface
> must be, except for setConf().
>
> The HttpFormAuthentication is a new feature and is also not able to work
> with multiple login forms from various sites within a single crawl,
> see NUTCH-1943. It's worth reviewing it with a focus on concurrency.
> But in general, the cause could also be somewhere else.
>
> What does "multi-threaded fetcher" mean?
>  fetcher.threads.per.queue > 1 ?
> What's the setting for fetcher.parse ?
>
> Thanks,
> Sebastian
>
> On 09/03/2015 11:24 PM, Alex Wang wrote:
> > I might have identified the issue, but have no idea how to solve it.
> >
> > Some of the pages on the site require login. I have enabled
> > HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
> > like the login page title gets indexed into Solr instead of the actual
> > page's title.
> >
> > Does anybody have some insight as to how to fix this?
> >
> > Thanks.
> >
> > Alex
> >
> > Alex Wang
> > Technical Architect
> > Crossview
> > C: (647) 409-3066
> > [email protected]
> >
> > On Thu, Sep 3, 2015 at 3:45 PM, Alex Wang <[email protected]> wrote:
> >
> >> Thanks Julien for your suggestion!  I ran the readseg command and
> >> examined the dump. The title for the particular html page was indeed
> >> fetched and parsed correctly even in multithread fetching mode. So it
> >> looks like the problem occurred somewhere after the parsing and/or
> >> during indexing. Do you have any pointers in terms of how to further
> >> isolate the problem? I ran the solrindex command with no luck. It
> >> finished with no errors, but no documents were indexed into Solr either.
> >>
> >> Thanks,
> >>
> >> Alex
> >>
> >>
> >> On Thu, Sep 3, 2015 at 11:52 AM, Julien Nioche <
> >> [email protected]> wrote:
> >>
> >>> Hi Alex
> >>>
> >>> You can use the segment reader to check the binary content and data
> >>> extracted from the parse (`./nutch readseg ...`). This should at least
> >>> give you some insights into where things might have gone wrong.
> >>>
> >>> HTH
> >>>
> >>> Julien
> >>>
> >>> On 3 September 2015 at 16:13, Alex Wang <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> We are using Nutch 1.9 to crawl an internal website and index the
> >>>> content to Solr 3.5. What we found is that the page title indexed for
> >>>> certain HTML pages is wrong. For example, the "Contact us" page has
> >>>> "Login" as its page title in the Solr index. This only happens when we
> >>>> use multiple threads to fetch (fetcher.threads.per.queue=5), while
> >>>> single-threaded fetching seems to be ok.
> >>>>
> >>>> Can someone please point me in the right direction as to how to debug
> >>>> this problem in Nutch? I would like to find out at what stage the title
> >>>> gets messed up (during fetching, parsing, or indexing), but I am not
> >>>> sure where to start. How can I examine the result of each step for a
> >>>> particular HTML page?
> >>>>
> >>>> Any suggestions are really appreciated!
> >>>>
> >>>>
> >>>> Alex
> >>>>
> >>>> --
> >>>>  <http://crossview.com/>
> >>>> www.CrossView.com <http://www.crossview.com/> | Follow us:
> >>>> <https://twitter.com/CrossView_Inc>
> >>>> <https://www.youtube.com/user/CrossViewInc1>
> >>>> <https://www.linkedin.com/company/crossview-inc->
> >>>> <https://plus.google.com/+Crossview>  <http://www.crossview.com/blog/>
> >>>>
> >>>> This message may contain confidential and/or privileged information. If
> >>>> you are not the addressee or authorized to receive this for the
> >>>> addressee, you must not use, copy, disclose, or take any action based on
> >>>> this message or any information herein. If you have received this
> >>>> message in error, please advise the sender immediately by reply e-mail
> >>>> and delete this message. Thank you for your cooperation.
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Open Source Solutions for Text Engineering
> >>>
> >>> http://digitalpebble.blogspot.com/
> >>> http://www.digitalpebble.com
> >>> http://twitter.com/digitalpebble
> >>>
> >>
> >>
> >
>
>

