Hi,

we have run into a small problem with the segment folders created during crawling.

Our situation is as follows: we are trying to set up a Nutch crawler that crawls and recrawls on a regular basis with a fixed depth. How to do this is already clear to us and works as intended (http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html). From a process point of view, our general solution looks like this:

1. Inject
Loop Recrawl {
    Loop (depth) {
        2. Generate
        3. Fetch
        4. Parse
        5. UpdateDB
    }
    6. InvertLinks
    7. SOLRIndex
    8. SOLRDedup
}

The problem is that a new segment folder is created for each crawl / recrawl and for each iteration of the depth loop (which is in fact nothing other than a normal crawl).

Our main questions are:
1) When can we delete, or possibly merge, these segment folders?
2) What are they used for later on?

For now we automatically delete all segment folders after each complete crawl (after step 8, SOLRDedup), and it seems to work fine for us. Does this even make sense?

We have to admit that we are not entirely sure what information is stored in the crawl DB and in the segment folders.

Thanks a lot in advance for your help, and kind regards,
Alex
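
P.S. To make the process concrete, here is a rough Python sketch of one recrawl round (steps 2 to 8) as described above. It is only an illustration, not our actual script. It assumes the Nutch 1.x command-line tools are reachable as bin/nutch, a crawl/crawldb, crawl/linkdb, crawl/segments layout, and a Solr instance at http://localhost:8983/solr/. The names recrawl, newest_segment, cleanup and run are made up for this sketch.

```python
import shutil
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"  # assumption: run from the Nutch 1.x runtime directory


def newest_segment(segments_dir):
    """Return the newest segment folder.

    Nutch names segment folders by timestamp, so the largest name is the latest.
    """
    folders = (p for p in Path(segments_dir).iterdir() if p.is_dir())
    return max(folders, key=lambda p: p.name)


def recrawl(depth, crawl_dir="crawl", solr_url="http://localhost:8983/solr/",
            cleanup=False, run=subprocess.check_call):
    """Run one recrawl round and return the segment folders it created.

    `run` executes one command given as an argument list; it is a parameter
    only so the flow can be dry-run without a Nutch installation.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments_dir = f"{crawl_dir}/segments"

    created = []
    for _ in range(depth):
        # 2. Generate: this is the step that creates a new segment folder.
        run([NUTCH, "generate", crawldb, segments_dir])
        segment = str(newest_segment(segments_dir))
        run([NUTCH, "fetch", segment])              # 3. Fetch
        run([NUTCH, "parse", segment])              # 4. Parse
        run([NUTCH, "updatedb", crawldb, segment])  # 5. UpdateDB
        created.append(segment)

    run([NUTCH, "invertlinks", linkdb] + created)   # 6. InvertLinks
    run([NUTCH, "solrindex", solr_url, crawldb,
         "-linkdb", linkdb] + created)              # 7. SOLRIndex
    run([NUTCH, "solrdedup", solr_url])             # 8. SOLRDedup

    if cleanup:
        # What we do today: drop this round's segments after step 8.
        # Whether this is safe is exactly the open question of this post.
        for segment in created:
            shutil.rmtree(segment)
    return created
```

Step 1 (Inject) would run once before the first round, for example with bin/nutch inject crawl/crawldb urls, where urls is an assumed seed folder. Calling recrawl(depth=3, cleanup=True) then corresponds to one pass of the outer loop including our current cleanup.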

