I don't know what you call large, but it's around 21 GB currently. By the way, thanks for the automaton filter, it worked great and runs much faster now. I actually gained a 4x speedup in the generate phase instead of losing time by adding regexes.
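For anyone following along, here is roughly the kind of rules file I mean. This is only a sketch, not my real configuration: I'm assuming the urlfilter-automaton plugin with its default automaton-urlfilter.txt file, +/- prefixed patterns in dk.brics.automaton syntax that match the whole URL, and example.com standing in for my actual hosts.

    # skip common binary extensions (hypothetical rules, adapt to your crawl)
    -.*\.(gif|jpg|png|pdf|zip)
    # keep only URLs on my own hosts (example.com is a placeholder)
    +http://([a-z0-9-]+\.)*example\.com/.*
    # reject everything else
    -.*
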
2011/6/8 Julien Nioche <[email protected]>

> or you can modify the code from the crawldb reader and get it to dump only
> the keys. If your crawldb is large, regex will take forever
>
> On 7 June 2011 22:31, Markus Jelsma <[email protected]> wrote:
>
> > Well, you can dump the crawldb using the bin/nutch readdb command. You'd
> > still need to parse the output yourself to get a decent list of URLs.
> >
> > > Hi guys,
> > >
> > > I was wondering if there is a quick method to dump all urls of a merged
> > > index (ie a production index).
> > > I want to use them for a 'fresh' seeding of a new crawldb
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
-MilleBii-
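P.S. In case it helps anyone else, here is a rough sketch of the "dump only the keys" idea Julien describes above. It is not the actual CrawlDbReader code, just my guess at a minimal standalone version: I'm assuming you can point Hadoop's SequenceFile.Reader at one crawldb part file (e.g. crawldb/current/part-00000/data) and that the records are Text URL keys with CrawlDatum values, as in Nutch 1.x.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    /** Prints only the keys (URLs) of a crawldb part file, skipping the CrawlDatum values. */
    public class DumpCrawlDbKeys {
      public static void main(String[] args) throws Exception {
        // args[0] = path to one part file, e.g. crawldb/current/part-00000/data
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          System.out.println(url);   // the key is the URL itself
        }
        reader.close();
      }
    }

Run it once per part file and concatenate the output, and you have your seed list without any regex parsing of the readdb dump.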

