You can use bin/nutch readseg to dump recently crawled data. The crawldb/current directory is the URL database; it stores only the CrawlDatum for each URL. Its format is MapFileOutputFormat, so you can load a CrawlDatum from it like this:
    String url = "www.example.com";
    // config is your Hadoop Configuration, crawlDb the Path to the crawldb directory
    FileSystem fs = FileSystem.get(config);
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
        fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), config);
    Text key = new Text(url);
    CrawlDatum val = new CrawlDatum();
    CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, CrawlDatum>(), key, val);

On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:
> Hi all,
>
> I'd like to MapReduce over (latest) crawled data.
>
> Should input path be crawldb/current/ ?
> InputFormatClass = SequenceFileInputFormat.class ?
> KV pair = <Text, CrawlDatum> ? where Text represents the URL ?
>
> Thanks.
> --
> Don't Grow Old, Grow Up... :-)
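As for the original question: yes, pointing a job at crawldb/current with SequenceFileInputFormat and a <Text, CrawlDatum> key/value pair should work, since SequenceFileInputFormat can read the data files inside MapFile directories. A minimal job-setup sketch using the old mapred API that Nutch 1.x uses; the paths and the mapper class name (MyCrawlDbMapper) are placeholders you'd replace with your own:

```java
// Sketch only: assumes Nutch 1.x / Hadoop 1.x on the classpath.
// "crawl/crawldb" and "crawldb-out" are example paths, and
// MyCrawlDbMapper is a hypothetical Mapper<Text, CrawlDatum, ...>.
JobConf job = new JobConf(NutchConfiguration.create());
job.setJobName("read-crawldb");

// crawldb/current holds <Text url, CrawlDatum> pairs
FileInputFormat.addInputPath(job, new Path("crawl/crawldb", CrawlDb.CURRENT_NAME));
job.setInputFormat(SequenceFileInputFormat.class);

job.setMapperClass(MyCrawlDbMapper.class);

job.setOutputFormat(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("crawldb-out"));

JobClient.runJob(job);
```

The mapper then receives each URL as the Text key and its CrawlDatum as the value, so you can filter on fields like fetch status or score directly.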

