Hi Sebastian,

Sorry for the delay. Unfortunately, we can't afford to set
fetcher.threads.per.queue = 1, since it takes many hours to crawl a site
with about 1000 pages, even if I set fetcher.server.delay = 0. I have to
somehow make the multi-threaded fetching work.

I made Http.getResponse(..) synchronized, which seems to help. And yes, the
form auth is currently done once for every single URL being fetched.
Unfortunately, the total time needed for one crawl / fetch cycle is
longer than the cookie timeout, so doing the auth once at the beginning is
not enough either. We will need to figure out a better strategy.
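
One possible direction, as a minimal sketch rather than a tested fix: log in
once, remember when, and log in again only when the session is older than
some limit below the cookie timeout. Only the login itself is serialized, so
the fetcher threads can still run in parallel. Everything here is
hypothetical: LoginGate, ensureLoggedIn() and maxSessionAgeMillis are names
made up for illustration, none of it is existing Nutch or
protocol-httpclient API, and the real form login would have to be passed in
as the loginAction.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Nutch): runs a login action once and
 * repeats it only after the session is assumed to be stale.
 */
final class LoginGate {
    // Assumption: chosen by the caller to be shorter than the server's cookie timeout.
    private final long maxSessionAgeMillis;
    // Assumption: performs the actual form login; supplied by the caller.
    private final Runnable loginAction;
    // Clock in milliseconds (non-negative); injectable so the logic can be tested.
    private final LongSupplier clock;
    private final Object lock = new Object();
    // Time of the last login, or -1 if no login has happened yet.
    private volatile long lastLoginMillis = -1L;

    LoginGate(long maxSessionAgeMillis, Runnable loginAction, LongSupplier clock) {
        this.maxSessionAgeMillis = maxSessionAgeMillis;
        this.loginAction = loginAction;
        this.clock = clock;
    }

    /** Call before each fetch. Threads block only while a login is in progress. */
    void ensureLoggedIn() {
        if (isFresh(clock.getAsLong())) {
            return; // fast path: session still assumed valid, no locking
        }
        synchronized (lock) {
            long now = clock.getAsLong();
            // Re-check inside the lock: another thread may have just logged in.
            if (!isFresh(now)) {
                loginAction.run();
                lastLoginMillis = now;
            }
        }
    }

    private boolean isFresh(long now) {
        long last = lastLoginMillis;
        return last >= 0 && now - last < maxSessionAgeMillis;
    }
}
```

The idea would be to call ensureLoggedIn() at the start of
Http.getResponse(..), with System::currentTimeMillis as the clock, instead
of doing the form auth on every request. Whether that fits the way the
plugin shares its HttpClient and cookies still needs to be checked.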

I will see if I have the privileges to open a ticket in the Nutch Jira
system, and I will certainly investigate further.

Thanks again for your help!



Alex Wang
Technical Architect
Crossview
C: (647) 409-3066
[email protected]

On Thu, Sep 10, 2015 at 12:12 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Alex,
>
> > I will make all methods "synchronized" except for setConf() in Http.java.
> This may help but it will effectively disable any parallelism in the
> fetcher.
>
> After a quick look at the form authentication of protocol-httpclient:
> it looks like the login is done for every connection, i.e. every time
> getResponse() is called. Ideally, it should be done once at the beginning
> when the first page is fetched, and then again after a certain time when
> the cookie may have timed out. Of course, the login itself should be
> synchronized. Subsequent fetches may then run in parallel.
>
> Sorry, but it looks like the form authentication needs some improvements.
> Same is true for the protocol-httpclient in general, which still depends
> on an old httpclient library version.
>
> If possible, please open a Jira issue. If you find the time to investigate
> the problem, that would be great.
>
> For now I would recommend setting
>   fetcher.threads.per.queue = 1
>
> Thanks,
> Sebastian
>
> 2015-09-08 16:57 GMT+02:00 Alex Wang <[email protected]>:
>
> > Hi Sebastian,
> >
> > Thanks for shedding some light on this issue. At least I now know where
> > to focus in order to fix it.
> >
> > In the segment, the login form is contained under different URLs.
> > Sometimes, even pages that do not need authentication can have the login
> > form indexed in Solr instead of the actual page content. Each run of the
> > crawler can produce different results: in one run page A gets the login
> > form indexed; in the next run page A might be OK, but page B gets the
> > login form. So the problem is not consistent and does seem to be a
> > concurrency issue.
> >
> > You mentioned I should make the methods thread-safe. I will make all
> > methods "synchronized" except for setConf() in Http.java. Do I need to
> > make any other classes thread-safe?
> >
> > By multi-threaded fetcher, I meant fetcher.threads.per.queue > 1. (In my
> > case, I set it to 5.) I left fetcher.parse at the default value (false).
> > Parsing is done as a separate step after fetching.
> >
> > Thanks again for your time. Any further guidance would be greatly
> > appreciated!
> >
> > Alex
> >
> >
> >
> > On Tue, Sep 8, 2015 at 9:53 AM, Sebastian Nagel <[email protected]> wrote:
> >
> > > Hi Alex,
> > >
> > > > Some of the pages on the site require login. I have enabled
> > > > HttpFormAuthentication in the protocol-httpclient plugin. However, it
> > > > looks like the login page title gets indexed into Solr instead of the
> > > > actual page's title.
> > >
> > > Does this mean that one segment contains multiple records under the
> > > same URL, one for the login form and one for the actual page?
> > > Or is the login form contained with a different URL?
> > >
> > > Concurrency issues with plugins and  come up from time to time.
> > > For every plugin there is only one single instance shared among
> > > threads. In case of a multi-threaded fetcher this instance must
> > > be thread-safe, more precisely: the methods defined in the interface
> > > must be, except for setConf().
> > >
> > > HttpFormAuthentication is a new feature, and it is also not yet able to
> > > work with multiple login forms from various sites within a single crawl,
> > > see NUTCH-1943. It is worth reviewing with a focus on concurrency.
> > > But in general, the cause could also be somewhere else.
> > >
> > > What does "multi-threaded fetcher" mean?
> > >  fetcher.threads.per.queue > 1 ?
> > > What's the setting for fetcher.parse ?
> > >
> > > Thanks,
> > > Sebastian
> > >
> > > On 09/03/2015 11:24 PM, Alex Wang wrote:
> > > > I might have identified the issue, but have no idea how to solve it.
> > > >
> > > > > Some of the pages on the site require login. I have enabled
> > > > > HttpFormAuthentication in the protocol-httpclient plugin. However, it
> > > > > looks like the login page title gets indexed into Solr instead of the
> > > > > actual page's title.
> > > > >
> > > > > Does anybody have some insight as to how to fix this?
> > > >
> > > > Thanks.
> > > >
> > > > Alex
> > > >
> > > > Alex Wang
> > > > Technical Architect
> > > > Crossview
> > > > C: (647) 409-3066
> > > > [email protected]
> > > >
> > > > > On Thu, Sep 3, 2015 at 3:45 PM, Alex Wang <[email protected]> wrote:
> > > >
> > > > >> Thanks Julien for your suggestion! I ran the readseg command and
> > > > >> examined the dump. The title for the particular HTML page was indeed
> > > > >> fetched and parsed correctly, even in multi-threaded fetching mode. So
> > > > >> it looks like the problem occurred somewhere after parsing and/or
> > > > >> during indexing. Do you have any pointers on how to further isolate the
> > > > >> problem? I ran the solrindex command with no luck. It finished with no
> > > > >> errors, but no documents were indexed into Solr either.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Alex
> > > >>
> > > >> Alex Wang
> > > >> Technical Architect
> > > >> Crossview
> > > >> C: (647) 409-3066
> > > >> [email protected]
> > > >>
> > > > >> On Thu, Sep 3, 2015 at 11:52 AM, Julien Nioche <[email protected]> wrote:
> > > >>
> > > >>> Hi Alex
> > > >>>
> > > > >>> You can use the segment reader to check the binary content and data
> > > > >>> extracted from the parse (`./nutch readseg ...`). This should at
> > > > >>> least give you some insights into where things might have gone wrong.
> > > >>>
> > > >>> HTH
> > > >>>
> > > >>> Julien
> > > >>>
> > > > >>> On 3 September 2015 at 16:13, Alex Wang <[email protected]> wrote:
> > > >>>
> > > >>>> Hi,
> > > >>>>
> > > > >>>> We are using Nutch 1.9 to crawl an internal website and index the
> > > > >>>> content to Solr 3.5. What we found is that the page title indexed for
> > > > >>>> certain HTML pages is wrong. For example, the "Contact us" page has
> > > > >>>> "Login" as its page title in the Solr index. This only happens when we
> > > > >>>> use multiple threads to fetch (fetcher.threads.per.queue=5), while
> > > > >>>> single-threaded fetching seems to be OK.
> > > >>>>
> > > > >>>> Can someone please point me in the right direction as to how to debug
> > > > >>>> this problem in Nutch? I would like to find out at what stage the title
> > > > >>>> gets messed up (during fetching, parsing or indexing), but I am not sure
> > > > >>>> where to start. How can I examine the result of each step for a
> > > > >>>> particular HTML page?
> > > >>>>
> > > >>>> Any suggestions are really appreciated!
> > > >>>>
> > > >>>>
> > > >>>> Alex
> > > >>>>
> > > >>>> --
> > > >>>>  <http://crossview.com/>
> > > >>>> www.CrossView.com <http://www.crossview.com/> | Follow us:
> > > >>>> <https://twitter.com/CrossView_Inc>
> > > >>>> <https://www.youtube.com/user/CrossViewInc1>
> > > >>>> <https://www.linkedin.com/company/crossview-inc->
> > > > >>>> <https://plus.google.com/+Crossview>  <http://www.crossview.com/blog/>
> > > > >>>>
> > > > >>>> This message may contain confidential and/or privileged information.
> > > > >>>> If you are not the addressee or authorized to receive this for the
> > > > >>>> addressee, you must not use, copy, disclose, or take any action based
> > > > >>>> on this message or any information herein. If you have received this
> > > > >>>> message in error, please advise the sender immediately by reply e-mail
> > > > >>>> and delete this message.
> > > > >>>> Thank you for your cooperation.
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>
> > > >>> Open Source Solutions for Text Engineering
> > > >>>
> > > >>> http://digitalpebble.blogspot.com/
> > > >>> http://www.digitalpebble.com
> > > >>> http://twitter.com/digitalpebble
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> > >
> >
> > --
> >  <http://crossview.com/>
> > www.CrossView.com <http://www.crossview.com/> | Follow us:
> > <https://twitter.com/CrossView_Inc>
> > <https://www.youtube.com/user/CrossViewInc1>
> > <https://www.linkedin.com/company/crossview-inc->
> > <https://plus.google.com/+Crossview>  <http://www.crossview.com/blog/>
> >
> > This message may contain confidential and/or privileged information. If
> you
> > are not the addressee or authorized to receive this for the addressee,
> you
> > must not use, copy, disclose, or take any action based on this message or
> > any information herein. If you have received this message in error,
> please
> > advise the sender immediately by reply e-mail and delete this message.
> > Thank you for your cooperation.
> >
>

