Hi,

Did you try Heritrix?

It stores the crawled documents as HTML inside a WARC file, which can be post-processed easily.
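
As a rough illustration of that post-processing step, here is a minimal sketch in Python that pulls the HTML payloads out of a WARC file using only the standard library. It assumes the usual WARC record layout (a "WARC/x.y" version line, header lines, a blank line, then Content-Length bytes of body) and that "response" records contain the raw HTTP response. The function names (iter_warc_records, iter_html_pages, dump_pages) and the output file naming are made up for this example and are not part of Heritrix. It does not handle folded header lines, chunked transfer encoding, or compressed HTTP bodies, so treat it as a starting point rather than a complete reader.

```python
import gzip
import os


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC version line: %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers["content-length"]))
        yield headers, body


def iter_html_pages(stream):
    """Yield (url, html_bytes) for HTTP responses that look like HTML."""
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The record body is the raw HTTP response:
        # status line and headers, a blank line, then the payload.
        http_head, sep, payload = body.partition(b"\r\n\r\n")
        if not sep:
            continue
        # Crude check (assumption): any mention of text/html in the
        # HTTP headers is taken to mean the payload is HTML.
        if b"text/html" in http_head.lower():
            yield headers.get("warc-target-uri"), payload


def dump_pages(warc_path, out_dir):
    """Write each HTML payload to out_dir as page-00000.html, page-00001.html, ..."""
    os.makedirs(out_dir, exist_ok=True)
    opener = gzip.open if warc_path.endswith(".gz") else open
    with opener(warc_path, "rb") as stream:
        for n, (url, html) in enumerate(iter_html_pages(stream)):
            with open(os.path.join(out_dir, "page-%05d.html" % n), "wb") as out:
                out.write(html)
```

Calling dump_pages("crawl.warc.gz", "pages") (both names are placeholders) would write one .html file per HTML response found in the archive. From there the files can be converted to XML or fed to any other tool.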


Cheers,
Markus


On 11.02.2013 12:16, SivaKarthik wrote:
Dear Erick,
    Thanks for your reply.
    Yes, Nutch can meet my requirement,
    but the problem is that I want to store the crawled documents in HTML or XML
format instead of the MapReduce format.
    I am not sure whether Nutch plugins are available to convert them into XML files.
    Please share any ideas you have.

Thank you




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4039619.html
Sent from the Solr - User mailing list archive at Nabble.com.
