Re: Nutch readdb shows much more fetched urls than parsed

Markus Jelsma Thu, 15 Dec 2011 07:41:25 -0800

You likely have a lot of fetched items that cannot be parsed. Check your URL 
filters and parse plugins.
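
To put a number on the gap, here is a minimal sketch in Python. It is not part of Nutch: the helper name parse_gap is hypothetical, and the counts are simply copied from the "bin/nutch readseg -list" output quoted further down. It only shows how many fetched entries ended up without a parse.

```python
# Counts copied by hand from the quoted "bin/nutch readseg -list" output.
# Assumption: FETCHED and PARSED count entries of the same segment.
segment_stats = {"GENERATED": 1851, "FETCHED": 3363, "PARSED": 275}


def parse_gap(stats):
    """Return (unparsed_count, parsed_fraction) for one segment's counts."""
    fetched = stats["FETCHED"]
    parsed = stats["PARSED"]
    unparsed = fetched - parsed
    # Guard against an empty segment to avoid dividing by zero.
    fraction = parsed / fetched if fetched else 0.0
    return unparsed, fraction


unparsed, fraction = parse_gap(segment_stats)
print(f"{unparsed} of {segment_stats['FETCHED']} fetched entries "
      f"have no parse ({fraction:.1%} parsed)")
```

A parsed share this low suggests looking at what kind of content was fetched before rerunning anything.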


On Thursday 15 December 2011 11:39:21 mikaza wrote:
> I have about 2K links in the urls file, and I just need to load them into
> a solr/lucene index (on a local machine).
> 
> I ran the inject/generate/fetch/parse cycle, and after that "bin/nutch
> readseg -list" gave me these stats:
> 
> NAME           20111214182250
> GENERATED      1851
> FETCHER START  2011-12-14T18:24:08
> FETCHER END    2011-12-14T19:52:25
> FETCHED        3363
> PARSED         275
> 
> So it parsed only 275 out of 3363. Is this normal for Nutch, and how should
> I parse the unparsed data?
> 
> (Running "bin/nutch parse" on the segment again leads to a "Segment already
> parsed" error.)
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-readdb-shows-much-more-fetched-urls-than-parsed-tp3588205p3588205.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
