Hi, 

After my first URL injection (2000 URLs), I generated a first segment
with topN 10000 and no depth option (is the default 5, as for the crawl
command? I didn't see it in the docs).

Then I ran a first fetch/parse/update pass.

The end of parsing took a very long time (see below):
2013-01-31 03:26:26,648 INFO  parse.ParseSegment - ParseSegment: finished at
2013-01-31 03:26:26, elapsed: 29:09:35

The domainstats dump tells me I have 56393 fetched URLs and 517856
unfetched ones.

Then I tried to fetch a second segment with only topN 1000, but the fetch
was stuck at this line:
2013-01-31 09:23:23,969 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
for more than 4 hours, until I cancelled it by mistake.

Why are those steps taking so much time?

I'm using boilerpipe for parsing and I set some metadata on my seed URLs,
but I think those are the only "exotic" things in my configuration.




