Hi Alex,

> I will make all methods "synchronized" except for setConf() in Http.java.
This may help but it will effectively disable any parallelism in the
fetcher.

After a quick look at the form authentication of protocol-httpclient:
it looks like the login is performed on every connection, i.e. every time
getResponse() is called. Ideally, it should be done once at the beginning,
when the first page is fetched, and then again after a certain time, when
the cookie may have timed out. Of course, the login itself should be
synchronized. Subsequent fetches can then run in parallel.
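
To illustrate the idea, here is a minimal sketch in Java. It is not the
plugin's actual code: LoginGate, validityMillis and loginAction are
hypothetical names, and the fixed validity window is only an assumed
stand-in for whatever cookie timeout the site really uses. Only the login
is guarded by a lock; threads that find a fresh login return without
locking and can fetch in parallel.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Nutch): runs a login action at most
 * once per validity window, no matter how many threads ask for it.
 */
final class LoginGate {
    private final long validityMillis;   // assumed cookie lifetime
    private final LongSupplier clock;    // e.g. System::currentTimeMillis
    private final Runnable loginAction;  // placeholder for the form submission
    private final Object lock = new Object();

    // Time of the last successful login; -1 means "never logged in".
    // volatile so the lock-free check sees the latest value.
    private volatile long lastLoginMillis = -1L;

    public LoginGate(long validityMillis, LongSupplier clock, Runnable loginAction) {
        this.validityMillis = validityMillis;
        this.clock = clock;
        this.loginAction = loginAction;
    }

    /** To be called by every fetcher thread before it sends a request. */
    public void ensureLoggedIn() {
        // Fast path: the login is still fresh, so no locking is needed.
        if (isFresh(clock.getAsLong())) {
            return;
        }
        synchronized (lock) {
            long now = clock.getAsLong();
            // Re-check: another thread may have logged in while we waited.
            if (isFresh(now)) {
                return;
            }
            loginAction.run();
            lastLoginMillis = now;
        }
    }

    private boolean isFresh(long now) {
        long last = lastLoginMillis;
        return last >= 0 && now - last < validityMillis;
    }
}
```

A caller would create one shared instance, for example
new LoginGate(30 * 60 * 1000L, System::currentTimeMillis, loginAction),
where the 30 minutes and loginAction are again assumptions, and call
ensureLoggedIn() at the start of each fetch.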

Sorry, but it looks like the form authentication needs some improvements.
The same is true for protocol-httpclient in general, which still depends
on an old version of the httpclient library.

If possible, please open a Jira issue. If you find the time to investigate
the problem, that would be great.

For now I would recommend setting
  fetcher.threads.per.queue = 1

Thanks,
Sebastian

2015-09-08 16:57 GMT+02:00 Alex Wang <[email protected]>:

> Hi Sebastian,
>
> Thanks for shedding some light on this issue. At least I now know where to
> focus in order to fix it.
>
> In the segment, the login form is stored under different URLs. Sometimes,
> even pages that do not need authentication have the login form indexed
> in Solr instead of the actual page content. Each run of the crawler can
> produce different results: in one run page A gets the login form indexed;
> in the next run page A might be fine, but page B gets the login form. So
> the problem is not consistent and does seem to be a concurrency issue.
>
> You mentioned I should make the methods thread-safe. I will make all
> methods "synchronized" except for setConf() in Http.java. Do I need to
> make any other classes thread-safe?
>
> By multi-threaded fetcher, I meant fetcher.threads.per.queue > 1. (In my
> case, I set it to 5.) I left fetcher.parse at the default value (false).
> Parsing is done as a separate step after fetching.
>
> Thanks again for your time. Any further guidance would be greatly
> appreciated!
>
> Alex
>
>
>
> On Tue, Sep 8, 2015 at 9:53 AM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi Alex,
> >
> > > Some of the pages on the site require login. I have enabled
> > > HttpFormAuthentication in the protocol-httpclient plugin. However, it
> > > looks like the login page title gets indexed into Solr instead of the
> > > actual page's title.
> >
> > Does this mean that one segment contains multiple records under the same
> > URL, one for the login form and one for the actual page?
> > Or is the login form stored under a different URL?
> >
> > Concurrency issues with plugins come up from time to time.
> > For every plugin there is only one single instance shared among
> > threads. In case of a multi-threaded fetcher this instance must
> > be thread-safe, more precisely: the methods defined in the interface
> > must be, except for setConf().
> >
> > The HttpFormAuthentication is a new feature and is also not able to work
> > with multiple login forms from various sites within a single crawl,
> > see NUTCH-1943. It's worth reviewing it with a focus on concurrency.
> > But in general, the cause could also be somewhere else.
> >
> > What does "multi-threaded fetcher" mean?
> >  fetcher.threads.per.queue > 1 ?
> > What's the setting for fetcher.parse ?
> >
> > Thanks,
> > Sebastian
> >
> > On 09/03/2015 11:24 PM, Alex Wang wrote:
> > > I might have identified the issue, but have no idea how to solve it.
> > >
> > > Some of the pages on the site require login. I have enabled
> > > HttpFormAuthentication in the protocol-httpclient plugin. However, it
> > > looks like the login page title gets indexed into Solr instead of the
> > > actual page's title.
> > >
> > > Does anybody have some insight as to how to fix this?
> > >
> > > Thanks.
> > >
> > > Alex
> > >
> > > On Thu, Sep 3, 2015 at 3:45 PM, Alex Wang <[email protected]> wrote:
> > >
> > >> Thanks Julien for your suggestion! I ran the readseg command and
> > >> examined the dump. The title for the particular html page was indeed
> > >> fetched and parsed correctly even in multi-threaded fetching mode. So it
> > >> looks like the problem occurred somewhere after the parsing and/or
> > >> during indexing. Do you have any pointers on how to further isolate the
> > >> problem? I ran the solrindex command with no luck. It finished with no
> > >> errors, but no documents were indexed into Solr either.
> > >>
> > >> Thanks,
> > >>
> > >> Alex
> > >>
> > >> On Thu, Sep 3, 2015 at 11:52 AM, Julien Nioche <[email protected]> wrote:
> > >>
> > >>> Hi Alex
> > >>>
> > >>> You can use the segment reader to check the binary content and data
> > >>> extracted from the parse (`./nutch readseg ...`). This should at least
> > >>> give you some insights into where things might have gone wrong.
> > >>>
> > >>> HTH
> > >>>
> > >>> Julien
> > >>>
> > >>> On 3 September 2015 at 16:13, Alex Wang <[email protected]> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> We are using Nutch 1.9 to crawl an internal website and index the
> > >>>> content to Solr 3.5. What we found is that the page title indexed for
> > >>>> certain html pages is wrong. For example, the "Contact us" page has
> > >>>> "Login" as its page title in the Solr index. This only happens when we
> > >>>> use multiple threads to fetch (fetcher.threads.per.queue=5), while
> > >>>> fetching with a single thread seems to be ok.
> > >>>>
> > >>>> Can someone please point me in the right direction as to how to debug
> > >>>> this problem in Nutch? I would like to find out at what stage the title
> > >>>> gets messed up (during fetching, parsing or indexing), but I am not sure
> > >>>> where to start. How can I examine the result of each step for a
> > >>>> particular html page?
> > >>>>
> > >>>> Any suggestions are really appreciated!
> > >>>>
> > >>>>
> > >>>> Alex
> > >>>>
> --
>  <http://crossview.com/>
> www.CrossView.com <http://www.crossview.com/> | Follow us:
> <https://twitter.com/CrossView_Inc>
> <https://www.youtube.com/user/CrossViewInc1>
> <https://www.linkedin.com/company/crossview-inc->
> <https://plus.google.com/+Crossview>  <http://www.crossview.com/blog/>
>
> This message may contain confidential and/or privileged information. If you
> are not the addressee or authorized to receive this for the addressee, you
> must not use, copy, disclose, or take any action based on this message or
> any information herein. If you have received this message in error, please
> advise the sender immediately by reply e-mail and delete this message.
> Thank you for your cooperation.
>
