I played with readseg a little and it could fit my use case. I have two
questions about it:

1. Is it possible to dump from multiple segments?
2. Is it possible to choose the dump format (as with readdb)?

Thanks.
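For context, what I tried so far is a single-segment dump along these lines (a sketch; the segment timestamp and output path are placeholders, not values from a real crawl):

```shell
# Dump one segment to plain text; -dump takes a single segment directory.
# The -no* flags trim which parts of the segment end up in the dump.
bin/nutch readseg -dump crawl/segments/20130917123456 readseg_out \
  -nocontent -noparse

# List summary stats for all segments under a directory:
bin/nutch readseg -list -dir crawl/segments
```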


On Tue, Sep 17, 2013 at 5:26 PM, feng lu <[email protected]> wrote:

> You can use bin/nutch readseg to dump recently crawled data.
>
> The crawldb/current directory is the database of URLs; it only stores the
> CrawlDatum of each URL. Its format is MapFileOutputFormat, so you can use
> this method to load a CrawlDatum from it:
>
> // Requires Hadoop (fs, io, mapred) and Nutch (org.apache.nutch.crawl)
> // imports; config is a Hadoop Configuration, crawlDb the crawl db Path.
> String url = "www.example.com";
> FileSystem fs = FileSystem.get(config);
> MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs,
>     new Path(crawlDb, CrawlDb.CURRENT_NAME), config);
> Text key = new Text(url);
> CrawlDatum val = new CrawlDatum();
> CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(readers,
>     new HashPartitioner<Text, CrawlDatum>(), key, val);
>
>
> On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:
>
> > Hi all,
> >
> > I'd like to MapReduce over the (latest) crawled data.
> >
> > Should the input path be crawldb/current/ ?
> > InputFormatClass = SequenceFileInputFormat.class ?
> > KV pair = <Text, CrawlDatum>, where Text represents the URL?
> >
> > Thanks.
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>