Re: Issue when fetching with multiple threads

2015-09-03 Thread Julien Nioche
Hi Alex, You can use the segment reader to check the binary content and the data extracted by the parse (`./nutch readseg ...`). This should at least give you some insight into where things might have gone wrong. HTH, Julien. On 3 September 2015 at 16:13, Alex Wang wrote: >

Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
Hi, We are using Nutch 1.9 to crawl an internal website and index the content to Solr 3.5. What we found is that the page titles indexed for certain HTML pages are wrong. For example, the "Contact us" page has "Login" as its page title in the Solr index. This only happens when we use multiple threads

Only consider content and outlinks from certain html tag

2015-09-03 Thread Camilo Tejeiro
Hi, So I have a page on Wikipedia (e.g. https://en.wikipedia.org/wiki/List_of_free_and_open-source_software_packages) which I am crawling. One problem I have is that I would like to keep Nutch from storing content and outlinks from tags that don't include relevant content (e.g. in my example

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
Thanks Julien for your suggestion! I ran the readseg command and examined the dump. The title for the particular HTML page was indeed fetched and parsed correctly, even in multithreaded fetching mode. So it looks like the problem occurred somewhere after parsing and/or during indexing. Do you

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
I might have identified the issue, but have no idea how to solve it. Some of the pages on the site require login. I have enabled HttpFormAuthentication in the protocol-httpclient plugin. However, it looks like the login page's title gets indexed into Solr instead of the actual page's title. Anybody

Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-03 Thread Lewis John Mcgibbney
Hi Paddy, Some comments in addition to my response. You should try upgrading to Nutch 1.10 when we release very shortly. There has been so much work done since 1.8 that you can benefit from. Keep your ears peeled here for a release candidate and then eventual release. Please see response below.

Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-03 Thread Guy McD
Having a similar problem in getting Nutch and Solr integrated. Newest version of both. Downloaded and installed a few days ago. Following the tut tells me to copy over the schema.xml, but it doesn't appear to be in the directory that the tut says. Or anywhere for that matter. This is probably a