Hi,

We are currently running into a small problem with the segment folders
created during crawling.

Our situation is as follows:
We are trying to set up a Nutch crawler that crawls / recrawls on a regular
basis with a fixed depth. How to set this up is already clear to us and
working as intended.
(http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html)

Our general solution looks (from the process point of view) like this:

  1. Inject
  Loop Recrawl {
      Loop (depth) {
        2. Generate
        3. Fetch
        4. Parse
        5. UpdateDB
      }
    6. InvertLinks
    7. SOLRIndex
    8. SOLRDedup
  }
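
To make the loop concrete, here is a minimal sketch of one crawl / recrawl
pass in Python. The sub-commands (generate, fetch, parse, updatedb,
invertlinks, solrindex, solrdedup) are the ones offered by the Nutch 1.x
bin/nutch script; the argument order for solrindex assumes Nutch 1.5. The
function names, the paths and the Solr URL are assumptions made for
illustration, not our exact setup. Step 1 (inject) runs once beforehand and
is left out, and the case where generate selects no URLs is not handled.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumption: started from the Nutch 1.x runtime directory


def newest_segment(segments_dir):
    """Return the path of the most recently generated segment.

    Nutch names segment folders by timestamp (yyyyMMddHHmmss), so the
    lexicographically largest name is the newest one.
    """
    names = sorted(os.listdir(segments_dir))
    return os.path.join(segments_dir, names[-1])


def crawl_round(crawldb, linkdb, segments_dir, solr_url, depth,
                run=subprocess.check_call, latest=newest_segment):
    """Run steps 2-8 once: `depth` generate/fetch/parse/updatedb cycles,
    then link inversion, Solr indexing and Solr deduplication."""
    for _ in range(depth):
        # Step 2 creates a new segment folder; steps 3-5 work on it.
        run([NUTCH, "generate", crawldb, segments_dir])
        segment = latest(segments_dir)
        run([NUTCH, "fetch", segment])
        run([NUTCH, "parse", segment])
        run([NUTCH, "updatedb", crawldb, segment])
    # Steps 6-8 operate on all segments currently in segments_dir.
    run([NUTCH, "invertlinks", linkdb, "-dir", segments_dir])
    run([NUTCH, "solrindex", solr_url, crawldb, "-linkdb", linkdb,
         "-dir", segments_dir])
    run([NUTCH, "solrdedup", solr_url])
```

Passing a different `run` (for example `print`) shows the commands without
executing them, which makes it easy to see that every depth iteration adds
one more segment folder.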

The problem we now have is that a new segment (folder) is created for each
crawl / recrawl and for each depth loop (which is in fact nothing other
than a normal crawl).

Our main questions now are:
   1) when can we delete / possibly merge these segment folders, and
   2) what are they used for in the future?

For now we automatically delete all segment folders after each complete
crawl (after each step 8. SOLRDedup) and it seems to work fine for us. Does
this even make sense?
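
For clarity, the cleanup we run after step 8 amounts to roughly the
following sketch. It assumes the segments live on the local filesystem (not
HDFS); the function name and the directory layout are hypothetical.

```python
import os
import shutil


def delete_segments(segments_dir):
    """Remove every segment folder below segments_dir.

    Returns the names of the removed folders. Plain files are left alone.
    """
    removed = []
    for name in sorted(os.listdir(segments_dir)):
        path = os.path.join(segments_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)  # deletes the segment and all its data
            removed.append(name)
    return removed
```

The crawl DB and link DB are not touched by this; only the per-round
segment folders are removed.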

I think we have to admit that we are not entirely aware of what kind of
information is contained in the crawl DB and in the segment folders.

Thanks a lot in advance for your help, and kind regards,
Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-tp4008865.html
Sent from the Nutch - User mailing list archive at Nabble.com.
