Hello List! I am trying to find the best combination of topN and depth settings for running the crawl script on a very large internal filesystem.
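For context, my invocation looks roughly like the following (the urls seed directory, the crawl output directory, the thread count, and the topN value are placeholders from my setup):

  bin/nutch crawl urls -dir crawl -threads 10 -depth 1000 -topN 50000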
I have tried setting the depth to a very high number (1000), but the crawl never completes. The main reason is the number of "bad" PowerPoint and PDF files we have. Some of the PDF files cause the script to hang and consume all the memory on the machine. Once I track down those bad files (I wish the script would exit from that condition more cleanly; a config idea for that is sketched at the end of this message) and remove them, what should those two parameters really be?

I also plan to add other sources to my crawl once the filesystem crawling is done. Most of them are Oracle databases, websites, etc. that I need to crawl, but I can't start on those until this large shared filesystem is completely crawled.
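For the hanging parsers, I am considering capping parse time with the parser.timeout property in nutch-site.xml, assuming my Nutch version supports it (I believe recent releases do); the value below is just a guess:

  <property>
    <name>parser.timeout</name>
    <value>30</value>
    <description>Maximum time in seconds the parser may spend on one document before it is skipped.</description>
  </property>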