Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Amit Sela Sat, 02 Mar 2013 11:32:57 -0800

I tried setting http.redirect.max=30 (since I saw there is a bug that prevents
using -1 to mean "all"), but it made little difference. It did help a little,
since I now get ~28K, but that is still less than half...
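
As a rough sanity check on the CrawlDB counters quoted further down in this thread, here is a minimal Python sketch that totals them and shows what share sits in each status. The numbers are copied from the quoted job output; the variable names are illustrative only and are not part of Nutch.

```python
# CrawlDB status counters copied from the job output quoted in this thread.
crawldb_counters = {
    "db_redir_temp": 4770,
    "db_redir_perm": 56810,
    "db_notmodified": 5343,
    "db_unfetched": 27385,
    "db_gone": 3741,
    "db_fetched": 22065,
}

total = sum(crawldb_counters.values())

# Print each status with its count and share of the total, largest first.
for status, count in sorted(crawldb_counters.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{status:<15} {count:>7,} {count / total:6.1%}")

# Per the reply quoted below, only db_fetched pages are indexed.
fetched_share = crawldb_counters["db_fetched"] / total
redirect_share = (
    crawldb_counters["db_redir_temp"] + crawldb_counters["db_redir_perm"]
) / total
print(f"total={total:,} fetched={fetched_share:.1%} redirects={redirect_share:.1%}")
```

With these figures the statuses add up to 120,114, which matches the ~120K seed list. Roughly 18% of entries are db_fetched and roughly 51% are redirects, so if only db_fetched pages reach Solr, an index of about 20K documents is what these counters would predict.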


On Sat, Mar 2, 2013 at 9:00 AM, Stefan Scheffler <
[email protected]> wrote:

> Hi Amit.
> As I answered you before, there is a config parameter to activate the
> crawling of redirections (db_redir_temp 4,770, db_redir_perm 56,810). You
> have to activate this in the nutch-site.xml.
> Please have a look at the nutch-default.xml to find out which one it is...
> Only the pages with db_fetched will be indexed.
>
> Regards
> Stefan
>
> Am 02.03.2013 01:01, schrieb Amit Sela:
>
>> I am using the crawl script that executes Solr indexing with:
>>    $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
>> and then executes Solr dedup:
>>    $bin/nutch solrdedup $SOLRURL
>>
>> I think it has something to do with the CrawlDB job. The job counters
>> show:
>> db_redir_temp 4,770
>> db_redir_perm 56,810
>> db_notmodified 5,343
>> db_unfetched 27,385
>> db_gone  3,741
>> db_fetched 22,065
>>
>>
>> On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
>> <[email protected]> wrote:
>>
>>> This looks odd. From what I know, the successfully parsed documents are
>>> sent to Solr. Did you check the logs for any exceptions?
>>>
>>> What command are you using to index?
>>>
>>>
>>> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm running Nutch 1.6 and Solr 3.6.2.
>>>> I'm trying to crawl only the seed list (depth 1) and it seems that the
>>>> process ends with only ~255 of the URLs indexed in Solr.
>>>>
>>>> Seed list is about 120K.
>>>> Fetcher map input is 117K where success is 62K and temp_moved 45K.
>>>> Parse shows success of 62K.
>>>> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
>>>> and db_fetched=22K.
>>>>
>>>> And finally IndexerStatus shows 20K documents added.
>>>> What am I missing ?
>>>>
>>>> Thanks!
>>>>
>>>> my nutch-site.xml includes:
>>>> -----------------------------------------
>>>> <name>plugin.includes</name>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
>>>> <name>metatags.names</name>
>>>> <value>keywords;Keywords;description;Description</value>
>>>> <name>index.parse.md</name>
>>>> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
>>>> <name>db.update.additions.allowed</name>
>>>> <value>false</value>
>>>> <name>generate.count.mode</name>
>>>> <value>domain</value>
>>>> <name>partition.url.mode</name>
>>>> <value>byDomain</value>
>>>> <name>file.content.limit</name>
>>>> <value>262144</value>
>>>> <name>http.content.limit</name>
>>>> <value>262144</value>
>>>> <name>parse.filter.urls</name>
>>>> <value>true</value>
>>>> <name>parse.normalize.urls</name>
>>>> <value>true</value>
>>>>
>>>>
>>>
>>> --
>>> Kiran Chitturi
>>>
>>>
>
