Hello List! 

I am trying to find the best combination of topN and depth settings for
running the crawl script on a very large internal filesystem.
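
For reference, I am invoking the crawl roughly like this (the seed
directory, output directory and topN value are only placeholders):

    bin/nutch crawl urls -dir crawl -depth 1000 -topN 1000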

I have tried setting the depth to a very high number (1000), but the crawl
never completes.  The main reason is the number of "bad" PowerPoint and PDF
files that we have; some of the PDF files cause the script to hang and
consume all the memory on the machine.
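
I am assuming that capping the parser and the fetched content size in
conf/nutch-site.xml might help contain those bad files, roughly like the
sketch below, if the parser.timeout and file.content.limit properties exist
in my version (the values are only examples):

    <!-- inside the <configuration> element of conf/nutch-site.xml -->
    <property>
      <name>parser.timeout</name>
      <value>30</value>
      <description>Give up parsing a document after this many seconds.</description>
    </property>
    <property>
      <name>file.content.limit</name>
      <value>1048576</value>
      <description>Cap content fetched over file:// at 1 MB.</description>
    </property>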

Once I track down those bad files and remove them (though I wish the script
would exit from that condition more cleanly), what should those parameters
really be?

I also plan to add other sources to my crawl once the filesystem crawling is
done.  Most of them are Oracle databases, websites, etc. that I also need to
crawl, but I can't start on those until this large shared filesystem is
completely crawled.
