Hello Shakiba - the best solution, and one that also solves a lot of additional problems, is to write an indexing backend plugin specifically for your indexing service. The coding involved is quite straightforward, apart from any nuances your indexing backend might have.
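
For reference, a Nutch indexing backend is a plugin built around the org.apache.nutch.indexer.IndexWriter contract, so every document reaches your writer with its URL and the fields produced by the indexing filters still attached - no file-to-URL mapping step is needed. Below is a minimal sketch of that idea, assuming a hypothetical Retrieve and Rank REST endpoint and using a plain Map as a stand-in for NutchDocument so it compiles on its own; the exact IndexWriter method signatures differ between Nutch versions, so treat this as an outline rather than a drop-in plugin.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Sketch of an indexing backend shaped like Nutch's IndexWriter contract.
 * A real plugin would implement org.apache.nutch.indexer.IndexWriter and
 * receive NutchDocument instances; here a Map<String, String> stands in
 * for the document so the sketch is self-contained.
 */
public class RetrieveAndRankWriter {

  // Hypothetical R&R collection endpoint; a real plugin reads this from its configuration.
  private String endpoint;

  public void open(String endpoint) {
    this.endpoint = endpoint;
  }

  /**
   * Called once per document. The document still carries its URL (plus title,
   * content and any other indexing-filter fields), so the URL travels with
   * the content all the way to the backend.
   */
  public void write(Map<String, String> doc) throws IOException {
    String json = String.format(
        "{\"id\":\"%s\",\"title\":\"%s\",\"body\":\"%s\"}",
        escape(doc.get("url")), escape(doc.get("title")), escape(doc.get("content")));
    post(json);
  }

  public void delete(String url) throws IOException {
    // A real backend would issue a delete-by-id request here (e.g. for gone pages).
  }

  public void commit() {
    // Flush any buffered documents if the backend supports batching.
  }

  public void close() {
    // Release HTTP clients, connection pools, etc.
  }

  private void post(String json) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    if (conn.getResponseCode() >= 300) {
      throw new IOException("Indexing request failed: " + conn.getResponseCode());
    }
    conn.disconnect();
  }

  private static String escape(String s) {
    return s == null ? "" : s.replace("\\", "\\\\").replace("\"", "\\\"");
  }
}

In a real plugin the endpoint and credentials would come from nutch-site.xml and the plugin's own configuration, the class is declared in the plugin's plugin.xml descriptor, and the plugin is enabled through the plugin.includes property.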
Dumping files and reprocessing them for indexing purposes is not a good idea because you lose information such as the CrawlDB status, including new 404s.

M.

-----Original message-----
> From: shakiba davari <[email protected]>
> Sent: Friday 22nd July 2016 0:57
> To: [email protected]
> Subject: mapping files created by: nutch dump to the URL from which each file
> has been dumped.
>
> Hi, I am trying to crawl some URLs in Apache Nutch, and then index them
> with the Bluemix Retrieve and Rank service. To do so I crawl my data with Nutch
> and dump the crawled data as files (mostly html files) in a directory:
>
> bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
>
> Then I send these files to the Bluemix Document Converter service to create
> a json file "compatible with Bluemix R&R Service", and post this json file
> to my R&R service. My question is: how can I find a way to map each one of
> these files to its related URL?
>
> The commoncrawldump command creates json files that have a URL field in them,
> but I couldn't use these files for my purpose for various reasons, the first
> being that they contain some binary signs.
>
> Bests
> Shakiba Davari

