I might have identified the issue, but have no idea how to solve it.

Some of the pages on the site require login. I have enabled
HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
like the login page's title gets indexed into Solr instead of the actual
page's title.
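
One way to confirm which URLs are affected is to scan the text dump produced by readseg for records whose parsed title is the login page's title. The following is only a rough sketch: it assumes the dump marks each record with "Recno:: N", followed by a "URL:: ..." line and a "Title: ..." line in the ParseData block, which may differ between Nutch versions. The sample URLs and the function names are hypothetical; only the "Login" title comes from the symptom described in this thread.

```python
import re

# Assumed (not verified against every Nutch version) layout of a readseg text
# dump: "Recno:: N" opens a record, then "URL:: ..." and "Title: ..." follow.
# The URLs below are made up for illustration.
SAMPLE_DUMP = """\
Recno:: 0
URL:: http://intranet.example.com/contact-us

ParseData::
Version: 5
Status: success(1,0)
Title: Login
Outlinks: 2

Recno:: 1
URL:: http://intranet.example.com/about

ParseData::
Version: 5
Status: success(1,0)
Title: About us
Outlinks: 4
"""

RECORD_HEADER = re.compile(r"^Recno:: \d+[ \t]*$", flags=re.MULTILINE)
URL_LINE = re.compile(r"^URL:: (.+)$", flags=re.MULTILINE)
TITLE_LINE = re.compile(r"^Title: (.*)$", flags=re.MULTILINE)


def titles_by_url(dump_text):
    """Map each URL in the dump to the first parsed title found in its record."""
    titles = {}
    # Everything before the first record header is ignored.
    for record in RECORD_HEADER.split(dump_text)[1:]:
        url = URL_LINE.search(record)
        if url is None:
            continue
        title = TITLE_LINE.search(record)
        titles[url.group(1).strip()] = title.group(1).strip() if title else None
    return titles


def login_titled(dump_text, login_title="Login"):
    """Return the URLs whose parsed title equals the login page's title."""
    return sorted(
        url for url, title in titles_by_url(dump_text).items()
        if title == login_title
    )


if __name__ == "__main__":
    for url in login_titled(SAMPLE_DUMP):
        print(url)
```

To use this on a real crawl, read the dump file into a string and pass it to login_titled() in place of SAMPLE_DUMP. Any URL it reports that is not the login page itself was probably fetched before authentication succeeded.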

Does anybody have any insight into how to fix this?

Thanks.

Alex

Alex Wang
Technical Architect
Crossview
C: (647) 409-3066
[email protected]

On Thu, Sep 3, 2015 at 3:45 PM, Alex Wang <[email protected]> wrote:

> Thanks, Julien, for your suggestion! I ran the readseg command and examined
> the dump. The title for the particular HTML page was indeed fetched and
> parsed correctly, even in multithreaded fetching mode. So it looks like the
> problem occurred somewhere after parsing and/or during indexing. Do you
> have any pointers on how to isolate the problem further? I ran the
> solrindex command with no luck. It finished with no errors, but no
> documents were indexed into Solr either.
>
> Thanks,
>
> Alex
>
> On Thu, Sep 3, 2015 at 11:52 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi Alex
>>
>> You can use the segment reader to check the binary content and data
>> extracted from the parse (`./nutch readseg ...`). This should at least give
>> you some insights into where things might have gone wrong.
>>
>> HTH
>>
>> Julien
>>
>> On 3 September 2015 at 16:13, Alex Wang <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > We are using Nutch 1.9 to crawl an internal website and index the content
>> > to Solr 3.5. What we found is that the page title indexed for certain HTML
>> > pages is wrong. For example, the "Contact us" page has "Login" as its page
>> > title in the Solr index. This only happens when we use multiple threads to
>> > fetch (fetcher.threads.per.queue=5), while single-threaded fetching seems
>> > to be OK.
>> >
>> > Can someone please point me in the right direction as to how to debug this
>> > problem in Nutch? I would like to find out at what stage the title gets
>> > messed up (fetching, parsing, or indexing), but I am not sure where to
>> > start. How can I examine the result of each step for a particular HTML
>> > page?
>> >
>> > Any suggestions are really appreciated!
>> >
>> >
>> > Alex
>> >
>> > --
>> >  <http://crossview.com/>
>> > www.CrossView.com <http://www.crossview.com/> | Follow us:
>> > <https://twitter.com/CrossView_Inc>
>> > <https://www.youtube.com/user/CrossViewInc1>
>> > <https://www.linkedin.com/company/crossview-inc->
>> > <https://plus.google.com/+Crossview>  <http://www.crossview.com/blog/>
>> >
>> > This message may contain confidential and/or privileged information. If you
>> > are not the addressee or authorized to receive this for the addressee, you
>> > must not use, copy, disclose, or take any action based on this message or
>> > any information herein. If you have received this message in error, please
>> > advise the sender immediately by reply e-mail and delete this message.
>> > Thank you for your cooperation.
>> >
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
