Re: GENERAL PROBLEMS LEARNING TO USE NUTCH

Lewis John Mcgibbney Wed, 24 Apr 2013 12:54:02 -0700

Hi,

CC: [email protected]

Questions like this should really go to the user@ list; you have a much
better chance of being helped there, as there are many, many eyes.

On Wed, Apr 24, 2013 at 8:57 AM, <[email protected]> wrote:

>
> I would be really grateful if you could provide some links on the
> following topics.
> 1. breaking down the nutch commands into steps (at the mo i just use one
> line)
>

http://wiki.apache.org/nutch/CommandLineOptions
http://wiki.apache.org/nutch/NutchTutorial
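
As a rough sketch of what the tutorial walks through, the one-line crawl
can be split into the individual 1.x jobs roughly as below. The urls/ seed
directory, the crawl/ output directory, the -topN value and the Solr URL are
placeholders I have assumed, not values from your setup; running a command
such as bin/nutch solrindex with no arguments prints its usage if your
version differs.

```shell
#!/bin/sh
# Sketch only: paths and the Solr URL are assumed placeholders.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL="${CRAWL:-crawl}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Seed the crawldb from the URL list in urls/ (run once).
inject_seeds() {
    $NUTCH inject "$CRAWL/crawldb" urls
}

# One generate/fetch/parse/update round.
crawl_round() {
    # Select URLs that are due for fetching into a new segment.
    $NUTCH generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 1000 || return 1
    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$CRAWL"/segments/* | tail -1)
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    # Fold the fetch results back into the crawldb.
    $NUTCH updatedb "$CRAWL/crawldb" "$segment" || return 1
    # Rebuild the link database from all segments.
    $NUTCH invertlinks "$CRAWL/linkdb" -dir "$CRAWL/segments"
}

# Push the parsed segments to Solr.
index_to_solr() {
    $NUTCH solrindex "$SOLR_URL" "$CRAWL/crawldb" \
        -linkdb "$CRAWL/linkdb" "$CRAWL"/segments/*
}
```

Call inject_seeds once, then crawl_round as many times as you want depth,
then index_to_solr. Setting NUTCH=echo prints the commands instead of
running them, which is handy for checking the arguments first.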

> 2. settings such as normalization or full site indexing *.com/ or *.html
>

Please see
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.html
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.html
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.html
Many of the individual commands covered by the links under point 1 above
also accept optional normalization and filtering.

> 3. resetting the crawldb (i use both mysql and solr on different machines)
> - changing the crawl output directory seems to work without sql
>

Can you be more verbose here please? I really don't understand. Do you mean
resetting the fetch time for particular URLs? If you mean completely
resetting the crawldb, then you would be as well dumping the entire URL list
and then injecting it into a fresh crawldb.
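
A minimal sketch of that dump-and-reinject route, assuming the plain-text
dump starts each record with the URL followed by a tab and "Version: N"
(check a part file from your own dump first). The function names and the
work directory layout are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names and directory layout are hypothetical.
NUTCH="${NUTCH:-bin/nutch}"

# Print the URL of every record in one or more crawldb dump files.
# Assumes record header lines look like "<url><TAB>Version: N".
extract_urls() {
    awk -F '\t' '$2 ~ /^Version: / { print $1 }' "$@"
}

# Dump old_db as text, collect its URLs, and inject them into new_db.
reset_crawldb() {
    old_db=$1
    new_db=$2
    work=$3
    $NUTCH readdb "$old_db" -dump "$work/dump" || return 1
    mkdir -p "$work/seeds"
    extract_urls "$work"/dump/part-* > "$work/seeds/urls.txt" || return 1
    $NUTCH inject "$new_db" "$work/seeds"
}
```

Something like reset_crawldb crawl/crawldb crawl-new/crawldb /tmp/reset
would then leave the old crawldb untouched and give you a fresh one in which
every URL is unfetched.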

> 4.
> http://lucene.472066.n3.nabble.com/parse-data-directory-not-found-after-merge-td3635615.html
> I have this problem after starting a full crawl without topN settings and
> then starting a solr crawl to the same crawldb dir, this has started the
> nutch bot to search to *.html level without normalisation however it does
> not complete to solr where I would like to extract text data from the html
> pages
>
> It seems that this was never ever addressed and the thread dies out!!!
>

Can anyone please comment on whether they are losing the parse_data
directory when merging segments? If this can be reproduced then we need to
address it. I don't have time to look at it today, even on a test crawl,
sorry.

Lewis
