I don't know what you call large, but it's around 21 GB currently.

By the way, thanks for the automaton filter suggestion; it worked great and runs
much faster now. I actually gained a 4x speedup in the generate phase instead of
losing time by adding the regexes.
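
For the archives, a rough sketch of what the switch involves. Treat it as an
illustration rather than a reference config: the plugin.includes value below is
abbreviated (keep whatever other plugins you already load, just swap
urlfilter-regex for urlfilter-automaton), and the property and file names assume
a stock Nutch 1.x layout.

  <!-- conf/nutch-site.xml: load the automaton URL filter instead of the regex one -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-automaton|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  # conf/automaton-urlfilter.txt: dk.brics patterns, matched against the whole URL
  # skip URLs containing query/session characters
  -.*[?*!@=].*
  # accept everything else
  +.*

The generator applies the configured URL filters, which is why the change shows
up in the generate phase.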

2011/6/8 Julien Nioche <[email protected]>

> or you can modify the code from the crawldb reader and get it to dump only
> the keys. If your crawldb is large, regex will take forever
>
> On 7 June 2011 22:31, Markus Jelsma <[email protected]> wrote:
>
> > Well, you can dump the crawldb using the bin/nutch readdb command. You'd
> > still need to parse the output yourself to get a decent list of URLs.
> >
> > > Hi guys,
> > >
> > > I was wondering if there is a quick method to dump all URLs of a merged
> > > index (i.e. a production index).
> > > I want to use them for a 'fresh' seeding of a new crawldb.
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
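
And for completeness, a minimal sketch of what Julien suggests above: read the
crawldb parts directly and print only the keys (the key of each entry is the
URL), rather than post-processing a full readdb -dump. This is an illustration,
not tested code: it assumes a Nutch 1.x crawldb laid out as
<crawldb>/current/part-NNNNN/data, the class name is made up, and error handling
is left out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

/** Dumps only the keys (URLs) of a crawldb, one per line, for use as a seed list. */
public class DumpCrawlDbKeys {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // assumed layout: <crawldb>/current/part-NNNNN/data (SequenceFile of <Text, CrawlDatum>)
    Path current = new Path(args[0], "current");
    for (FileStatus part : fs.listStatus(current)) {
      Path data = new Path(part.getPath(), "data");
      if (!fs.exists(data)) continue;        // skip anything that is not a map file part
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text key = new Text();
      CrawlDatum value = new CrawlDatum();
      try {
        while (reader.next(key, value)) {
          System.out.println(key);           // the key is the URL
        }
      } finally {
        reader.close();
      }
    }
  }
}

Run it with the Nutch and Hadoop jars on the classpath and redirect stdout to a
file, along the lines of: java -cp <nutch+hadoop jars> DumpCrawlDbKeys
crawl/crawldb > seeds.txt. The same loop could just as well be dropped into
CrawlDbReader itself, as Julien suggests.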



-- 
-MilleBii-
