<< 1. Is it possible to dump from multiple segments ? >> Yes, you can add the -dir <segments> option. Run bin/nutch readseg without arguments to see the usage information.
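For example (the paths and segment name here are made up, and the exact options depend on your Nutch version, so check the readseg usage output first):

# list all segments under a segments directory
bin/nutch readseg -list -dir crawl/segments

# dump one segment as text to an output directory
bin/nutch readseg -dump crawl/segments/20130917123456 dump_out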
<< 2. Is it possible to choose dump format (like with readdb) ? >> No, that is not supported; it can only dump as a text file. Maybe you can write a small program to convert the text file into another format (there is a rough sketch at the bottom of this mail).

On Tue, Sep 17, 2013 at 11:03 PM, Amit Sela <[email protected]> wrote:

> I played with readseg a little and it could fit my use case. I have 2
> questions about it:
>
> 1. Is it possible to dump from multiple segments ?
> 2. Is it possible to choose dump format (like with readdb) ?
>
> Thanks.
>
>
> On Tue, Sep 17, 2013 at 5:26 PM, feng lu <[email protected]> wrote:
>
> > you can use bin/nutch readseg to dump recently crawled data.
> >
> > and the crawldb/current directory is the database of urls; it only stores
> > the CrawlDatum of each url. Its format is MapFileOutputFormat, and you can
> > use this method to load a CrawlDatum from it:
> >
> > String url = "www.example.com";
> > FileSystem fs = FileSystem.get(config);
> > MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, new
> > Path(crawlDb, CrawlDb.CURRENT_NAME), config);
> > Text key = new Text(url);
> > CrawlDatum val = new CrawlDatum();
> > CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(readers, new
> > HashPartitioner<Text, CrawlDatum>(), key, val);
> >
> >
> > On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to MapReduce over the (latest) crawled data.
> > >
> > > Should the input path be crawldb/current/ ?
> > > InputFormatClass = SequenceFileInputFormat.class ?
> > > KV pair = <Text, CrawlDatum> ? where Text represents the URL ?
> > >
> > > Thanks.
> > >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >

--
Don't Grow Old, Grow Up... :-)
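P.S. Here is the rough, untested sketch of such a converter I mentioned above. It assumes the -dump output marks each record's url with a "URL::" line (verify against your actual dump before relying on it) and just extracts every url, one per line:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Hypothetical post-processor for a readseg -dump text file.
// Usage: java DumpToUrls <dump_file> <output_file>
public class DumpToUrls {
  public static void main(String[] args) throws Exception {
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
         PrintWriter out = new PrintWriter(args[1])) {
      String line;
      while ((line = in.readLine()) != null) {
        // "URL:: http://example.com/" -> keep only the url itself
        if (line.startsWith("URL::")) {
          out.println(line.substring("URL::".length()).trim());
        }
      }
    }
  }
}

You could extend the same loop to emit CSV, JSON, or whatever format you need.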

