Hi,

Is there any way to apply one URL filter for crawl depths 1-5 and a different one from depth 5 onwards? I need to extract PDF files that will only appear beyond a given depth (just to experiment). After the fetch, I believe the PDFs are stored in a compressed binary format under the crawl/segments folder; I would like to pull these PDF files out and store them all in one folder. (Since Nutch uses MapReduce and partitions the data by segment, I assume I will need the Hadoop API that ships in the lib folder. The only tutorial I could find on this is allenday <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>.)
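The only workaround I can think of for the depth-dependent filtering is to run the crawl in two phases with different regex-urlfilter.txt files: one containing `-\.pdf$` for the first five generate/fetch/updatedb rounds, then one containing `+\.pdf$` for the remaining rounds. Is that the intended approach, or is there something built in?

For the extraction step, this is roughly what I have in mind: a rough, untested sketch that reads one segment's content data file (a SequenceFile of <Text, Content>, if I understand the Nutch 1.x segment layout correctly) and writes every record with a PDF MIME type into a local folder. The segment timestamp and the `pdfs` output directory below are placeholders:

```java
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpPdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One segment's fetched-content data file (timestamp is a placeholder).
        Path data = new Path("crawl/segments/20130501123456/content/part-00000/data");

        // SequenceFile.Reader decompresses the records transparently.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();          // key: the fetched URL
        Content content = new Content(); // value: raw bytes plus metadata
        int n = 0;
        while (reader.next(url, content)) {
            // Keep only records whose MIME type says PDF.
            if ("application/pdf".equals(content.getContentType())) {
                FileOutputStream out = new FileOutputStream("pdfs/" + (n++) + ".pdf");
                out.write(content.getContent());
                out.close();
            }
        }
        reader.close();
    }
}
```

I would still need to loop this over every segments/*/content/part-* file, and the pdfs output folder has to exist beforehand. Does this look like the right direction?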
PJ