---------- Forwarded message ----------
From: "Tejas Patil" <[email protected]>
> On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani <[email protected]> wrote:
>
> > Hi,
> > Is there any way to perform a urlfilter from level 1-5 and a different one
> > from 5 onwards? I need to extract pdf files, which will only appear after a
> > given level (just to experiment).
>
> You can run 2 crawls over the same crawldb using different urlfilter files.
> The first one would reject pdf files and run up to a depth just before you
> discover pdf files. For the later crawl, modify the regex rule to accept
> pdf files.
>
> > After that I believe the pdf files will be stored in a compressed binary
> > format in the crawl\segment folder. I would like to extract these pdf files
> > and store them all in one folder. (I guess since Nutch uses MapReduce to
> > segment the data, I will need to use the hadoop api present by default in
> > the lib folder. I can not find more tutorials on this except allenday
> > <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>).
>
> I had a peek at the link that you gave and it seems like that code snippet
> should work. It's an old article (from 2010) so it might happen that some
> classes have been replaced with new ones. If you face any issues, please
> feel free to shoot an email to us !!!
>
> > PJ
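
For the two-pass filtering, the two regex-urlfilter.txt variants could look
roughly like this. This is only a sketch assuming the stock regex-urlfilter
syntax (rules are applied top to bottom, '-' rejects, '+' accepts); the level
cut-off and file contents here are just illustrative:

  # regex-urlfilter.txt used for the first crawl (early levels):
  # reject pdf files before the catch-all accept rule
  -\.(pdf|PDF)$
  # accept everything else
  +.

  # regex-urlfilter.txt used for the second crawl (later levels):
  # simply drop the pdf rejection so pdf urls pass through
  +.

The idea would be to run the generate/fetch/parse/updatedb cycle with the
first file for the initial depths, then switch config files and continue the
crawl over the same crawldb.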

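For the extraction step itself, a rough, untested sketch along the lines of
that article might look like the code below. It reads one segment's
content/part-00000/data file as a Hadoop SequenceFile (Text key = URL,
org.apache.nutch.protocol.Content value) and writes every application/pdf
entry to a local folder. The class name, segment path, and output naming
scheme are only placeholders, and as noted above some classes may have moved
between Nutch versions:

  import java.io.FileOutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;

  public class PdfExporter {
    public static void main(String[] args) throws Exception {
      // args[0]: a segment's content data file, e.g.
      //   crawl/segments/<timestamp>/content/part-00000/data
      // args[1]: local output directory for the extracted pdfs
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);

      Text url = new Text();            // key: the fetched URL
      Content content = new Content();  // value: raw fetched bytes + metadata
      int n = 0;
      while (reader.next(url, content)) {
        if ("application/pdf".equals(content.getContentType())) {
          // placeholder naming scheme for the output files
          FileOutputStream out =
              new FileOutputStream(args[1] + "/doc-" + (n++) + ".pdf");
          out.write(content.getContent());
          out.close();
        }
      }
      reader.close();
    }
  }

You would compile this against the jars shipped in Nutch's lib/ directory and
run it once per segment (there is one content/part-NNNNN directory per reduce
task, so loop over all of them).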
