You can use bin/nutch readseg -dump to dump recently crawled data from a segment.

The crawldb/current directory is the database of URLs only; it stores the
CrawlDatum of each URL. Its format is MapFileOutputFormat, so you can use
the following code to load a CrawlDatum from it:

// assuming crawlDb is the Path to your crawl db and config is a Hadoop
// Configuration (e.g. from NutchConfiguration.create())
String url = "http://www.example.com/"; // keys are the full, normalized URLs
FileSystem fs = FileSystem.get(config);
MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
    fs, new Path(crawlDb, CrawlDb.CURRENT_NAME), config);
Text key = new Text(url);
CrawlDatum val = new CrawlDatum();
CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(
    readers, new HashPartitioner<Text, CrawlDatum>(), key, val);
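For context on why the HashPartitioner is passed to getEntry: the crawldb was
written with that same partitioner, so the key's hash decides which part-NNNNN
map file holds the entry, and the lookup only has to search that one reader.
A minimal Hadoop-free sketch of that routing logic (the class and method names
here are my own; the formula mirrors HashPartitioner.getPartition):

```java
public class PartitionDemo {

    // Mirrors Hadoop's HashPartitioner: mask off the sign bit so the
    // result is non-negative, then take it modulo the partition count.
    public static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/";
        int parts = 4; // e.g. crawldb/current contains part-00000..part-00003
        int p = getPartition(url, parts);
        // getEntry would open only this one part file's MapFile.Reader
        System.out.println("lookup for " + url + " goes to part-"
            + String.format("%05d", p));
    }
}
```

The same idea is why you must use the identical partitioner (and the same
number of readers) that produced the db; otherwise the lookup searches the
wrong part file and returns null.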


On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:

> Hi all,
>
> I'd like to MapReduce over the (latest) crawled data.
>
> Should input path be crawldb/current/ ?
> InputFormatClass = SequenceFileInputFormat.class ?
> KV pair = <Text, CrawlDatum> ? where Text represents the URL ?
>
> Thanks.
>



-- 
Don't Grow Old, Grow Up... :-)
