Yes, just take a look at the Solr, Elastic and Dummy indexing plugins,
especially the Dummy one, because it is by far the simplest implementation.
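The contract those plugins implement looks roughly like the following. This is a self-contained mock of the pattern only: the real interface is org.apache.nutch.indexer.IndexWriter, and its exact method signatures differ between Nutch versions, so check the Dummy plugin's source for the authoritative shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A mock of the indexing-backend plugin pattern. The method set mirrors what
// Nutch's IndexWriter implementations provide (open/write/delete/commit/close),
// but the names and signatures here are illustrative, not the real Nutch API.
interface BackendWriter {
    void open();                         // connect to the indexing backend
    void write(Map<String, String> doc); // send one document for indexing
    void delete(String key);             // remove a document (e.g. a new 404)
    void commit();                       // flush any buffered writes
    void close();                        // release connections/resources
}

// A Dummy-style writer: instead of talking to a real backend it just records
// each action, much like the Dummy plugin logs actions to a file.
class RecordingWriter implements BackendWriter {
    final List<String> actions = new ArrayList<>();

    public void open() { actions.add("open"); }
    public void write(Map<String, String> doc) { actions.add("write " + doc.get("url")); }
    public void delete(String key) { actions.add("delete " + key); }
    public void commit() { actions.add("commit"); }
    public void close() { actions.add("close"); }

    public static void main(String[] args) {
        RecordingWriter w = new RecordingWriter();
        w.open();
        w.write(Map.of("url", "http://example.com/", "title", "Example"));
        w.delete("http://example.com/gone.html");
        w.commit();
        w.close();
        w.actions.forEach(System.out::println);
    }
}
```

A real plugin would replace the recording calls with calls to your indexing service's client, and would be registered through Nutch's plugin descriptor files the same way the Solr and Elastic writers are.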
Markus

 
 
-----Original message-----
> From:shakiba davari <[email protected]>
> Sent: Friday 22nd July 2016 22:21
> To: [email protected]
> Subject: Re: mapping files created by: nutch dump to the URL from which each 
> file has been dumped.
> 
> Thanks so much for your response Markus.
> I had no idea about this concept. I will read up on it and try to do it.
> If you know of any pages or tutorials for it, I would appreciate it if you
> could send me some links.
> 
> Bests
> Shakiba Davari
> 
> 
> On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > Hello Shakiba - the best solution, and one that solves a lot of additional
> > problems, is to write an indexing backend plugin specifically for your
> > indexing service. The coding involved is quite straightforward, apart from
> > any nuances your indexing backend might have.
> >
> > Dumping files and reprocessing them for indexing purposes is not a good
> > idea, because you lose information such as the CrawlDB status, including
> > new 404s.
> >
> > M.
> >
> >
> >
> > -----Original message-----
> > > From:shakiba davari <[email protected]>
> > > Sent: Friday 22nd July 2016 0:57
> > > To: [email protected]
> > > Subject: mapping files created by: nutch dump to the URL from which each
> > file has been dumped.
> > >
> > > Hi, I am trying to crawl some URLs with Apache Nutch and then index them
> > > with the Bluemix Retrieve and Rank service. To do so, I crawl my data
> > > with Nutch and dump the crawled data as files (mostly HTML files) into a
> > > directory:
> > >
> > > bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
> > >
> > > Then I send these files to the Bluemix Document Converter service to
> > > create a JSON file compatible with the Bluemix R&R service, and post
> > > this JSON file to my R&R service. My question is: how can I map each of
> > > these files back to the URL it was dumped from?
> > >
> > > The commoncrawldump command creates JSON files that have a URL field in
> > > them, but I couldn't use those files for my purpose for various reasons,
> > > the first being that they contain binary artifacts.
> > >
> > > Bests
> > > Shakiba Davari
> > >
> >
> 
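On the mapping question in the quoted thread: in the 1.x FileDumper, the dumped filename is derived from the MD5 digest of the source URL (see DumpFileUtil in your version's source; the exact naming scheme has changed over time, so verify it before relying on this). If plain md5-hex naming holds for your version, you can precompute the expected filename for every crawled URL and invert the map; the URL list itself can be obtained with `bin/nutch readdb crawl/crawldb -dump <outdir>`. A minimal sketch, assuming that naming scheme:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Rebuilds a dumped-filename -> URL map, assuming the dump tool names each
 * file after the MD5 hex digest of its source URL (verify against your
 * Nutch version's DumpFileUtil before trusting the mapping).
 */
public class DumpUrlMapper {

    /** Hex-encoded MD5 of a URL string. */
    static String md5Hex(String url) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b)); // unsigned two-digit hex per byte
        }
        return sb.toString();
    }

    /** Map each expected dump filename (without extension) back to its URL. */
    static Map<String, String> mapFilesToUrls(List<String> crawledUrls) throws Exception {
        Map<String, String> byName = new HashMap<>();
        for (String url : crawledUrls) {
            byName.put(md5Hex(url), url);
        }
        return byName;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> m = mapFilesToUrls(
                List.of("http://example.com/a.html", "http://example.com/b.html"));
        for (Map.Entry<String, String> e : m.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With that map in hand, the URL can be attached to each converted document before posting it to the R&R service, although the indexing-plugin route Markus describes avoids the problem entirely.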
