<< 1. Is it possible to dump from multiple segments ? >> Yes, you can add the -dir <segments> option. Run bin/nutch readseg without arguments to see the usage information.
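For example (the paths and segment name here are made up, and the exact options depend on your Nutch version, so check the readseg usage output first):

# list all segments under a segments directory
bin/nutch readseg -list -dir crawl/segments

# dump one segment as text to an output directory
bin/nutch readseg -dump crawl/segments/20130917123456 dump_out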
<< 2. Is it possible to choose dump format (like with readdb) ? >> No, that is not supported; it can only dump as a text file. Maybe you can write a small program to convert the text file into another format (there is a rough sketch at the bottom of this mail).

On Tue, Sep 17, 2013 at 11:03 PM, Amit Sela <[email protected]> wrote:

> I played with readseg a little and it could fit my use case. I have 2
> questions about it:
>
> 1. Is it possible to dump from multiple segments ?
> 2. Is it possible to choose dump format (like with readdb) ?
>
> Thanks.
>
>
> On Tue, Sep 17, 2013 at 5:26 PM, feng lu <[email protected]> wrote:
>
> > you can use bin/nutch readseg to dump recently crawled data.
> >
> > and the crawldb/current directory is the database of urls; it only stores
> > the CrawlDatum of each url. Its format is MapFileOutputFormat, and you can
> > use this method to load a CrawlDatum from it:
> >
> > String url = "www.example.com";
> > FileSystem fs = FileSystem.get(config);
> > MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, new
> > Path(crawlDb, CrawlDb.CURRENT_NAME), config);
> > Text key = new Text(url);
> > CrawlDatum val = new CrawlDatum();
> > CrawlDatum res = (CrawlDatum) MapFileOutputFormat.getEntry(readers, new
> > HashPartitioner<Text, CrawlDatum>(), key, val);
> >
> >
> > On Tue, Sep 17, 2013 at 4:45 PM, Amit Sela <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to MapReduce over the (latest) crawled data.
> > >
> > > Should the input path be crawldb/current/ ?
> > > InputFormatClass = SequenceFileInputFormat.class ?
> > > KV pair = <Text, CrawlDatum> ? where Text represents the URL ?
> > >
> > > Thanks.
> > >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >

--
Don't Grow Old, Grow Up... :-)
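P.S. Here is the rough, untested sketch of such a converter I mentioned above. It assumes the -dump output marks each record's url with a "URL::" line (verify against your actual dump before relying on it) and just extracts every url, one per line:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Hypothetical post-processor for a readseg -dump text file.
// Usage: java DumpToUrls <dump_file> <output_file>
public class DumpToUrls {
  public static void main(String[] args) throws Exception {
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
         PrintWriter out = new PrintWriter(args[1])) {
      String line;
      while ((line = in.readLine()) != null) {
        // "URL:: http://example.com/" -> keep only the url itself
        if (line.startsWith("URL::")) {
          out.println(line.substring("URL::".length()).trim());
        }
      }
    }
  }
}

You could extend the same loop to emit CSV, JSON, or whatever format you need.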

