Hi, I was looking at Nutch as a crawler for indexing into Indri.  Indri's
docs list "warc" as a corpus class option, described as "WARC (Web
ARChive) format, such as is output by the Nutch webcrawler" (cf.
http://lemur.sourceforge.net/indri/IndriIndexer.html).

After finishing a short crawl using Nutch (v1.2), I found no way to produce
WARC output: neither the native data store nor any of the export/dump
options appears to be WARC.  I've asked on the Indri/Lemur forums about this,
but I thought I'd also check here in case anyone knows what the docs might be
referring to, or how else I might proceed.

Thanks!
-Michael
