In theory, assuming I find and remove the "bad" Excel files (which I may
already have done) to get past this hurdle, is it even possible to crawl a
large file system (approx. 350,000 files) with such a small box?

Here is my config: 

Windows XP SP3
Core 2 Duo CPU @ 3.00 GHz
1.95 GB of RAM

If it is possible, what should the mapred settings be to make this all work,
and where should they be set? I have seen various posts pointing to different
locations (mapred-site.xml, hadoop-site.xml, nutch-site.xml). My best guess
at the actual setting is below.
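
Purely as a sketch, and going off the stock Hadoop docs rather than anything
Nutch-specific (so the property name and the 512m value are just my
assumptions), I imagine it would be something like this inside the
<configuration> element of one of those files:

    <property>
      <name>mapred.child.java.opts</name>
      <!-- raise the per-task JVM heap above the default -->
      <value>-Xmx512m</value>
    </property>

Is that the right property, and is nutch-site.xml the right place for it when
running a local crawl?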

I am still getting an OutOfMemoryError during the map-reduce phase of the crawl.
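
I am also wondering whether just raising the overall JVM heap before launching
would help. Assuming bin/nutch still honours the NUTCH_HEAPSIZE variable
mentioned in its header comments, and assuming a Cygwin shell on this box, I
imagine the launch would look roughly like this (the urls/crawl directories
and the depth are only placeholders from my setup):

    # give the single local JVM roughly 1.5 GB instead of the default
    export NUTCH_HEAPSIZE=1500
    bin/nutch crawl urls -dir crawl -depth 3

Would that even matter for the map-reduce tasks in local mode, or do they only
look at mapred.child.java.opts?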

ANY help would be greatly appreciated! 

