I played with readseg a little, and it could fit my use case. I have two questions about it:
1. Is it possible to dump from multiple segments?
2. Is it possible to choose the dump format (like with readdb)?

Thanks.

On Tue, Sep 17, 2013 at 5:26 PM, feng lu <[email protected]> wrote:

> You can use bin/nutch readseg to dump recently crawled data.
>
> The crawldb/current directory is the database of URLs; it only stores the
> CrawlDatum of each URL. Its format is MapFileOutputFormat, and you can use
> this method to load a CrawlDatum from it:
>
> String url = "www.example.com";
> FileSystem fs = FileSystem.get(config);
> MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs,
>     new Path(crawlDb, CrawlDb.CURRENT_NAME), config);
> Text key = new Text(url);
> CrawlDatum val = new CrawlDatum();
> CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(readers,
>     new HashPartitioner<Text, CrawlDatum>(), key, val);
>
> On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:
>
> > Hi all,
> >
> > I'd like to MapReduce over the latest crawled data.
> >
> > Should the input path be crawldb/current/?
> > InputFormatClass = SequenceFileInputFormat.class?
> > KV pair = <Text, CrawlDatum>, where Text represents the URL?
> >
> > Thanks.
>
> --
> Don't Grow Old, Grow Up... :-)
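
For anyone trying the quoted snippet: it needs imports and a Configuration to run. A self-contained sketch of the same lookup is below, assuming Nutch 1.x with the legacy org.apache.hadoop.mapred API; the class name CrawlDbLookup, the crawl/crawldb path, and the example URL are placeholders of mine, not Nutch fixtures.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration config = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(config);
    Path crawlDb = new Path("crawl/crawldb"); // placeholder: your crawldb path

    // Open one MapFile reader per partition under crawldb/current.
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
        fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), config);

    // Keys are the stored (normalized) URLs; values are CrawlDatum records.
    Text key = new Text("http://www.example.com/");
    CrawlDatum val = new CrawlDatum();
    CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, CrawlDatum>(), key, val);

    System.out.println(res == null ? "not found" : res.toString());

    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}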

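And on the MapReduce question quoted at the bottom: yes, crawldb/current holds <Text, CrawlDatum> pairs keyed by URL, and SequenceFileInputFormat works as the input format, since it resolves MapFile directories to their data files. A minimal job sketch under the same Nutch 1.x / legacy-mapred assumptions follows; the class names and paths are placeholders, and counting records per fetch status is just an example of iterating the db.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbStatusCount {

  // Input keys are URLs, values are CrawlDatum records, as stored in the db.
  public static class StatusMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      // Emit the human-readable status name (e.g. db_fetched) for counting.
      output.collect(new Text(CrawlDatum.getStatusName(datum.getStatus())), ONE);
    }
  }

  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text status, Iterator<LongWritable> counts,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      output.collect(status, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(NutchConfiguration.create(), CrawlDbStatusCount.class);
    job.setJobName("crawldb-status-count");

    // crawldb/current stores <Text, CrawlDatum> pairs in MapFile format;
    // SequenceFileInputFormat reads the data files inside the MapFiles.
    FileInputFormat.addInputPath(job, new Path("crawl/crawldb/current")); // placeholder
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(StatusMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("crawldb-status-out")); // placeholder

    JobClient.runJob(job);
  }
}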
