Yes, just take a look at the Solr, Elastic and Dummy indexing plugins, especially Dummy because it is by far the simplest implementation.

Markus
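
For illustration, here is a minimal sketch of such an indexing backend plugin, modeled loosely on the indexer-dummy plugin from the Nutch 1.x tree. The package, class name, and the upload logic are hypothetical placeholders, and the IndexWriter method signatures have changed between Nutch releases, so check them against the interface shipped with your version:

package org.apache.nutch.indexwriter.rr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

/** Sketch of a custom indexing backend, modeled on indexer-dummy. */
public class RRIndexWriter implements IndexWriter {

  private Configuration conf;

  @Override
  public void open(JobConf job, String name) throws IOException {
    // Open an HTTP client or connection to the indexing service here.
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    // Every document arrives here with its URL already attached, so the
    // file-to-URL mapping problem of the dump/reprocess approach never
    // comes up. "url" and "content" are standard fields added by the
    // index-basic filter.
    String url = (String) doc.getFieldValue("url");
    String content = (String) doc.getFieldValue("content");
    // POST url + content to the backend (e.g. the R&R service) here.
  }

  @Override
  public void delete(String key) throws IOException {
    // Remove documents that are gone (e.g. new 404's) from the backend.
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc);
  }

  @Override
  public void commit() throws IOException {
    // Flush any buffered documents to the backend.
  }

  @Override
  public void close() throws IOException {
    // Release the connection opened in open().
  }

  @Override
  public String describe() {
    return "RRIndexWriter - sketch of a custom indexing backend";
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Like indexer-solr and indexer-elastic, such a plugin is registered through a plugin.xml descriptor and activated by adding its id to the plugin.includes property in nutch-site.xml.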
-----Original message-----
> From: shakiba davari <[email protected]>
> Sent: Friday 22nd July 2016 22:21
> To: [email protected]
> Subject: Re: mapping files created by: nutch dump to the URL from which each
> file has been dumped.
>
> Thanks so much for your response Markus.
> I had no idea about this concept. I will search for it more and try to do
> it. If you know of any pages or tutorials for it, I would appreciate it
> if you could please send me some links or...
>
> Bests
> Shakiba Davari
>
>
> On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma <[email protected]>
> wrote:
>
> > Hello shakiba - the best solution, and one that solves a lot of
> > additional problems, is to make an indexing backend plugin,
> > specifically for your indexing service. The coding involved is quite
> > straightforward, except for any nuances your indexing backend might
> > have.
> >
> > Dumping files and reprocessing them for indexing purposes is not a
> > good idea because you lose information such as CrawlDB status,
> > including new 404's.
> >
> > M.
> >
> >
> >
> > -----Original message-----
> > > From: shakiba davari <[email protected]>
> > > Sent: Friday 22nd July 2016 0:57
> > > To: [email protected]
> > > Subject: mapping files created by: nutch dump to the URL from which
> > > each file has been dumped.
> > >
> > > Hi, I am trying to crawl some URLs with Apache Nutch and then index
> > > them with the Bluemix Retrieve and Rank service. To do so, I crawl
> > > my data with Nutch and dump the crawled data as files (mostly HTML
> > > files) in a directory:
> > >
> > > bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
> > >
> > > Then I send these files to the Bluemix Document Converter service to
> > > create a JSON file compatible with the Bluemix R&R service, and post
> > > this JSON file to my R&R service. My question is: how can I map each
> > > one of these files to its related URL?
> > >
> > > The commoncrawldump command creates JSON files that have a URL field
> > > in them, but I couldn't use these files for my purpose for various
> > > reasons, the first being that they contain some binary characters.
> > >
> > > Bests
> > > Shakiba Davari
> > >
> > >
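
For reference, once a plugin like the sketch above is enabled, the indexing job runs against the same crawl data that the dump command reads, so every document carries its URL with it. Assuming Nutch 1.12-era command syntax, the invocation looks something like:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/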

