---------- Forwarded message ----------
From: "Tejas Patil" <[email protected]>
> On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani <[email protected]> wrote:
>
> > Hi,
> > Is there any way to perform a urlfilter from level 1-5 and a different one
> > from 5 onwards? I need to extract pdf files, which will only appear after a
> > given level (just to experiment).
>
> You can run 2 crawls over the same crawldb using different urlfilter files.
> The first one would reject pdf files and run up to a depth just before you
> discover pdf files. For the later crawl, modify the regex rule to accept
> pdf files.
>
> > After that I believe the pdf files will be stored in a compressed binary
> > format in the crawl\segment folder. I would like to extract these pdf files
> > and store them all in one folder. (I guess since Nutch uses MapReduce to
> > segment the data, I will need to use the hadoop api present by default in
> > the lib folder. I can not find more tutorials on this except allenday
> > <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>).
>
> I had a peek at the link that you gave and it seems like that code snippet
> should work. It's an old article (from 2010) so it might happen that some
> classes have been replaced with new ones. If you face any issues, please
> feel free to shoot an email to us !!!
>
> > PJ
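
For the two-pass filtering, the two regex-urlfilter.txt variants could look
roughly like this. This is only a sketch assuming the stock regex-urlfilter
syntax (rules are applied top to bottom, '-' rejects, '+' accepts); the level
cut-off and file contents here are just illustrative:

  # regex-urlfilter.txt used for the first crawl (early levels):
  # reject pdf files before the catch-all accept rule
  -\.(pdf|PDF)$
  # accept everything else
  +.

  # regex-urlfilter.txt used for the second crawl (later levels):
  # simply drop the pdf rejection so pdf urls pass through
  +.

The idea would be to run the generate/fetch/parse/updatedb cycle with the
first file for the initial depths, then switch config files and continue the
crawl over the same crawldb.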

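For the extraction step itself, a rough, untested sketch along the lines of
that article might look like the code below. It reads one segment's
content/part-00000/data file as a Hadoop SequenceFile (Text key = URL,
org.apache.nutch.protocol.Content value) and writes every application/pdf
entry to a local folder. The class name, segment path, and output naming
scheme are only placeholders, and as noted above some classes may have moved
between Nutch versions:

  import java.io.FileOutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;

  public class PdfExporter {
    public static void main(String[] args) throws Exception {
      // args[0]: a segment's content data file, e.g.
      //   crawl/segments/<timestamp>/content/part-00000/data
      // args[1]: local output directory for the extracted pdfs
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);

      Text url = new Text();            // key: the fetched URL
      Content content = new Content();  // value: raw fetched bytes + metadata
      int n = 0;
      while (reader.next(url, content)) {
        if ("application/pdf".equals(content.getContentType())) {
          // placeholder naming scheme for the output files
          FileOutputStream out =
              new FileOutputStream(args[1] + "/doc-" + (n++) + ".pdf");
          out.write(content.getContent());
          out.close();
        }
      }
      reader.close();
    }
  }

You would compile this against the jars shipped in Nutch's lib/ directory and
run it once per segment (there is one content/part-NNNNN directory per reduce
task, so loop over all of them).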
