Hi,

Is there any way to apply one URL filter for crawl depths 1-5 and a different one from depth 5 onwards? I need to extract PDF files that will only appear beyond a given depth (just to experiment). After the fetch, I believe the PDFs are stored in a compressed binary format under the crawl/segments folder; I would like to pull these PDF files out and store them all in one folder. (Since Nutch uses MapReduce and partitions the data by segment, I assume I will need the Hadoop API that ships in the lib folder. The only tutorial I could find on this is allenday <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>.)
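The only workaround I can think of for the depth-dependent filtering is to run the crawl in two phases with different regex-urlfilter.txt files: one containing `-\.pdf$` for the first five generate/fetch/updatedb rounds, then one containing `+\.pdf$` for the remaining rounds. Is that the intended approach, or is there something built in?

For the extraction step, this is roughly what I have in mind: a rough, untested sketch that reads one segment's content data file (a SequenceFile of <Text, Content>, if I understand the Nutch 1.x segment layout correctly) and writes every record with a PDF MIME type into a local folder. The segment timestamp and the `pdfs` output directory below are placeholders:

```java
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpPdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One segment's fetched-content data file (timestamp is a placeholder).
        Path data = new Path("crawl/segments/20130501123456/content/part-00000/data");

        // SequenceFile.Reader decompresses the records transparently.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();          // key: the fetched URL
        Content content = new Content(); // value: raw bytes plus metadata
        int n = 0;
        while (reader.next(url, content)) {
            // Keep only records whose MIME type says PDF.
            if ("application/pdf".equals(content.getContentType())) {
                FileOutputStream out = new FileOutputStream("pdfs/" + (n++) + ".pdf");
                out.write(content.getContent());
                out.close();
            }
        }
        reader.close();
    }
}
```

I would still need to loop this over every segments/*/content/part-* file, and the pdfs output folder has to exist beforehand. Does this look like the right direction?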
PJ