Julien,

> the fetch step is likely to take most of the time, and the time it takes is
> mostly a matter of the distribution of hosts/IPs/domains in your fetchlist.
> Search the wiki for details on performance tips.

Thanks. Most of the URLs that I'm fetching are each on their own
IP/host, on unique servers.
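
A quick way to check how the URLs are actually spread across hosts is to
count them per hostname. The following is only a minimal sketch in Python,
assuming the URLs are available as plain strings (for example from the seed
list); the function name and the sample URLs are made up for illustration
and are not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_distribution(urls):
    """Count how many URLs in a fetchlist fall on each host."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if host:  # skip blank or malformed lines
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice read the seed list or a fetchlist dump.
    sample = [
        "http://a.example.com/page1",
        "http://a.example.com/page2",
        "http://b.example.org/",
    ]
    # Print the busiest hosts first.
    for host, n in host_distribution(sample).most_common(10):
        print(host, n)
```

If a handful of hosts dominate the counts, the fetch time is likely bounded
by those hosts rather than by the total number of URLs.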

>
>
>> * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
>> and calling each step individually? Why?
>>
>
> This has been discussed several times on the mailing list: you get more
> control with a script, and the all-in-one crawl command can have issues with
> runaway parsing threads, etc.

Understood.
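
For reference, calling each step individually could look roughly like the
sketch below. It is a rough Python wrapper, not the script from the wiki or
from any patch. It assumes the seeds have already been injected, and it
assumes the Nutch 2.x step names generate/fetch/parse/updatedb with the
arguments shown. Those arguments differ between releases, so check the usage
output of bin/nutch for your version. The helper names and the dry_run
switch are hypothetical.

```python
import subprocess


def crawl_cycle_commands(top_n, nutch="bin/nutch"):
    """Return the commands for one generate/fetch/parse/updatedb cycle."""
    # Assumed Nutch 2.x step names and flags; check the `bin/nutch` usage
    # output for the version you run, since arguments differ between releases.
    return [
        [nutch, "generate", "-topN", str(top_n)],
        [nutch, "fetch", "-all"],
        [nutch, "parse", "-all"],
        [nutch, "updatedb"],
    ]


def run_crawl(depth, top_n, dry_run=True):
    """Run `depth` cycles; with dry_run=True only collect the commands."""
    executed = []
    for _ in range(depth):
        for cmd in crawl_cycle_commands(top_n):
            executed.append(cmd)
            if not dry_run:
                # check=True raises on a failing step, which stops the crawl.
                subprocess.run(cmd, check=True)
    return executed


if __name__ == "__main__":
    # Mirrors the -depth 8 -topN 10000 settings, printing commands only.
    for cmd in run_crawl(depth=8, top_n=10000):
        print(" ".join(cmd))
```

Running the steps separately means a failing step stops the loop at a known
point, and each step can be timed or retried on its own.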

>
>
>> * Are there more recent/improved versions of
>> http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
>> 2.x?
>>
>
> yes, see patch in https://issues.apache.org/jira/browse/NUTCH-1087

Thanks. I'll review that.

>
>
>> * Why would HBase show 64,000 documents but ElasticSearch only 50,000?
>>
>
> Redirections? That sounds like quite a lot, though.

Any thoughts on how I would identify which ones are redirects?

>
> HTH
>
> J
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
