Hi Alex
You can use the segment reader to check the binary content and data
extracted from the parse (`./nutch readseg ...`). This should at least give
you some insights into where things might have gone wrong.
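
As a rough sketch of that check (not necessarily the exact invocation meant above): the helper below dumps one segment with `readseg -dump` and lists each URL next to its parsed title. The function name `dump_titles`, the `NUTCH_BIN` override, and the paths in the usage line are assumptions for illustration, and the `URL::` / `Title:` prefixes assume the usual layout of the text dump.

```shell
# Hypothetical helper: dump one segment and list every URL with its parsed title.
# NUTCH_BIN is an assumed override; it defaults to bin/nutch in the Nutch runtime dir.
dump_titles() {
  segment="$1"   # segment directory, e.g. one folder under crawl/segments/
  out="$2"       # output directory; readseg writes a text file named "dump" into it

  # Skip the fetch/generate records, raw content, and parse text.
  # Drop -nocontent if you also want to inspect the raw fetched content.
  "${NUTCH_BIN:-bin/nutch}" readseg -dump "$segment" "$out" \
    -nofetch -nogenerate -nocontent -noparsetext || return 1

  # Keep only the record URL lines and the Title lines from ParseData.
  grep -E '^(URL::|Title:)' "$out/dump"
}
```

Usage, with a placeholder segment path: `dump_titles crawl/segments/<segment> segdump`. If the titles look right here, the problem is more likely to sit after parsing.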
HTH
Julien
On 3 September 2015 at 16:13, Alex Wang wrote:
>
Hi,
We are using Nutch 1.9 to crawl an internal website, and index the content
to Solr 3.5. What we found is that the page titles indexed for certain HTML
pages are wrong. For example, the "Contact us" page has "Login" as its page
title in the Solr index. This only happens when we use multiple threads
Hi,
So I have a page in wikipedia (e.g.
https://en.wikipedia.org/wiki/List_of_free_and_open-source_software_packages)
which I am crawling. Now one problem I have is that I would like to keep
Nutch from storing content and outlinks from tags that don't include
relevant content (e.g. in my example
Thanks Julien for your suggestion! I ran the readseg command and examined
the dump. The title for the particular html page was indeed fetched and
parsed correctly even in multithread fetching mode. So it looks like the
problem occurred somewhere after the parsing and/or during indexing. Do
you
I might have identified the issue, but have no idea how to solve it.
Some of the pages on the site require login. I have enabled
HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
like the login page title gets indexed into Solr instead of the actual
page's title.
Anybody
Hi Paddy,
Some comments in addition to my response. You should try upgrading to Nutch
1.10 when we release it very shortly. There has been so much work done since
1.8 that you can benefit from. Keep your eyes peeled here for a release
candidate and then eventual release.
Please see response below.
Having a similar problem in getting Nutch and Solr integrated. Newest
version of both. Downloaded and installed a few days ago.
The tutorial tells me to copy over the schema.xml, but it doesn't
appear to be in the directory that the tutorial says. Or anywhere for that
matter.
This is probably a