In theory, if I find and get rid of the "bad" Excel files to overcome this hurdle (which I may have done already), is it even possible to crawl a large file system (approx. 350,000 files) with such a small box?
Here is my config: Windows XP SP3, Core 2 Duo CPU @ 3.00 GHz, 1.95 GB of RAM.

If it is possible, what should the mapred settings be to make this work, and where should they be set? I have seen various posts pointing to different locations (mapred-site.xml, hadoop-site.xml, nutch-site.xml). I am still getting an OutOfMemoryError during the map/reduce phase of the crawl.

ANY help would be greatly appreciated!
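P.S. In case a concrete example helps, this is the kind of setting I have been guessing at so far (assuming mapred.child.java.opts is even the right knob to turn, and that conf/nutch-site.xml is an acceptable place for it when running a local crawl):

  <!-- Heap size for the map/reduce child JVMs (the default is -Xmx200m). -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>

With only ~2 GB of physical RAM I don't know whether 512m is already too much for this box, or whether it is still too little for Tika to get through the larger Excel files.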