Hi, did you try Heritrix?
It stores the crawled documents as HTML inside a WARC file, which can be post-processed easily.
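As a rough illustration of that post-processing step, here is a minimal sketch in Python that pulls the HTML bodies out of a WARC file using only the standard library. The function names (`open_warc`, `iter_warc_records`, `html_pages`) are made up for this example and are not part of Heritrix. It assumes each record carries a `Content-Length` header, that header lines are not folded, and that the HTTP bodies are not chunked or compressed; for real crawls a dedicated WARC library is probably the safer choice.

```python
import gzip


def open_warc(path):
    """Open a WARC file, transparently handling the usual .warc.gz form."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the record header
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Assumption: every record states its block size in Content-Length.
        length = int(headers["content-length"])
        yield headers, stream.read(length)


def html_pages(stream):
    """Yield (url, body_bytes) for every HTTP response record."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # skip warcinfo, request, metadata records, etc.
        # The block is the raw HTTP response: status line, headers,
        # an empty line, then the body (the HTML for web pages).
        _, _, body = block.partition(b"\r\n\r\n")
        yield headers.get("warc-target-uri"), body
```

To use it, pass the file object returned by `open_warc(path)` to `html_pages` and write each body to its own .html file, or feed it to whatever converts it to XML.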
Cheers,
Markus

On 11.02.2013 12:16, SivaKarthik wrote:
Dear Erick,

Thanks for your reply. Yes, Nutch can meet my requirement, but the problem is that I want to store the crawled documents in HTML or XML format instead of the MapReduce format. I am not sure whether any Nutch plugins are available to convert them into XML files. Please share any ideas you have.

Thank you

--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4039619.html
Sent from the Solr - User mailing list archive at Nabble.com.