-----Original message-----
> From:Alexandre <[email protected]>
> Sent: Wed 19-Sep-2012 13:18
> To: [email protected]
> Subject: Recrawling and segment cleanup
> 
> Hi,
> 
> we currently encounter a little problem with the segment folders created
> during crawling.
> 
> Our situation is as follows:
> We are trying to set up a Nutch crawler that crawls / recrawls on a regular
> basis with a fixed depth. How to establish this is already clear to us and
> working as intended.
> (http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html)
> 
> Our general solution looks (from the process point of view) like this:
> 
>   1. Inject
>   Loop Recrawl {
>       Loop (depth) {
>         2. Generate
>         3. Fetch
>         4. Parse
>         5. UpdateDB
>       }
>     6. InvertLinks
>     7. SOLRIndex
>     8. SOLRDedup
>   }
> 
> The problem we now have is that a new segment (folder) is created for
> each crawl / recrawl and each depth loop (which is in fact nothing other than
> a normal crawl).
> 
> Our main question now is, 
>    1) when can we delete / eventually merge these segment folders and

You can merge them whenever you want. We merge all segments daily and monthly
because we may have to reindex occasionally.
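
A minimal sketch of such a merge step, assuming the Nutch 1.x usage
`bin/nutch mergesegs <output_dir> -dir <segments_dir>`. The paths and the
helper name `merge_segments` are hypothetical, and by default the helper only
builds the command without running it:

```python
import subprocess

def merge_segments(segments_dir, merged_dir, nutch="bin/nutch", dry_run=True):
    """Build (and optionally run) a command merging all segments in segments_dir.

    Assumes the Nutch 1.x usage: mergesegs <output_dir> -dir <segments_dir>.
    """
    cmd = [nutch, "mergesegs", merged_dir, "-dir", segments_dir]
    if not dry_run:
        # Raises CalledProcessError if Nutch exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd

# Hypothetical layout: crawl/segments holds the per-round segment folders.
print(" ".join(merge_segments("crawl/segments", "crawl/segments_merged")))
```

Once the merged segment has been verified, the old per-round folders can be
removed, since the merged one carries the same records.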

>    2) what are they used for in the future.

They are only used for reindexing or rebuilding data structures such as the
crawldb, webgraph or linkdb.

> 
> For now we automatically delete all segment folders after each complete
> crawl (after each step 8. SOLRDedup) and it seems to work fine for us. Does
> this even make sense?

Sure. If you don't need them.

> 
> I think we have to admit that we are not entirely aware of what kind of
> information is contained within the crawl DB and the segment folder.

All the databases contain <url, object> key/value pairs. The CrawlDB contains
the state of every URL, and the segments contain structures such as the
generated fetch list, info on the fetched records, parse data (outlinks and
such) and parsed text. All this information is key/value based.
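
As a rough illustration of that key/value layout, here is a toy model in plain
Python. It is not Nutch's actual storage format: the part names
(crawl_generate, crawl_fetch, parse_data, parse_text) follow the segment
subdirectories I would expect in Nutch 1.x, and the values are simplified,
hypothetical stand-ins. It also shows why segments are needed to rebuild a
linkdb, since the outlinks only live in the parse data:

```python
# Toy model: every structure maps a URL key to some value object.
crawldb = {
    "http://example.com/": {"status": "fetched"},
    "http://example.com/about": {"status": "unfetched"},
}

segment = {
    "crawl_generate": {"http://example.com/": {"scheduled": True}},  # fetch list
    "crawl_fetch": {"http://example.com/": {"status": "success"}},   # fetch info
    "parse_data": {"http://example.com/": {"outlinks": ["http://example.com/about"]}},
    "parse_text": {"http://example.com/": "Example page text"},
}

def rebuild_linkdb(segments):
    """Invert the outlinks in parse_data into: target URL -> set of source URLs."""
    linkdb = {}
    for seg in segments:
        for source, data in seg["parse_data"].items():
            for target in data["outlinks"]:
                linkdb.setdefault(target, set()).add(source)
    return linkdb

print(rebuild_linkdb([segment]))
```

If the segments have been deleted, this inversion has no input left, which is
the trade-off of cleaning up after every crawl.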

> 
> Thanks a lot for your help in advance and kind regards,
> Alex
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-tp4008865.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
