Alexander,

I can't afford to run the segment merger after every crawl-fetch-index
cycle, because I am using a single Linux box with 1 GiB RAM and a 32 GiB HDD.

I set the Java heap size to 768 MiB to run Nutch. Still, once I reach 3 GiB of
segments, the temp files Hadoop creates for segment merging are just too much
for my resources. They took up all the HDD space.

That is why I am afraid of the segment merger.

I still need to return summaries, so I need to keep the segments
that are still in the index.
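
To at least see which segments are old, something like the sketch below could
list candidates without deleting anything. It assumes segment directories are
named by their creation timestamp (yyyyMMddHHmmss); the function name
old_segments and the keep_days parameter are made up for illustration. It does
not check whether a segment is still referenced by the index, so a listed
segment is only a candidate, not something that is safe to rm.

```python
from datetime import datetime, timedelta
from pathlib import Path


def old_segments(segments_dir, keep_days, now=None):
    """Return names of segment directories older than keep_days (dry run only).

    Assumes each segment directory is named with a yyyyMMddHHmmss timestamp;
    anything that does not parse as such is skipped.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=keep_days)
    result = []
    for entry in sorted(Path(segments_dir).iterdir()):
        if not entry.is_dir():
            continue
        try:
            stamp = datetime.strptime(entry.name, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # not a timestamp-named directory, leave it alone
        if stamp < cutoff:
            result.append(entry.name)
    return result
```

For example, old_segments("crawl/segments", keep_days=2) would list everything
older than two days, where "crawl/segments" is a placeholder path.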

I have not tried the Solr indexer; I might explore that option.

Thanks for your pointers.

Y.T. Thet

On Sat, Nov 13, 2010 at 3:33 AM, Alexander Aristov <
[email protected]> wrote:

> Why are you so afraid of the segment merger? It appears to be the only
> "official" way to get rid of excess folders. Of course it's time- and
> resource-consuming, but is your system really that heavily loaded?
>
> Also, I might be wrong, but if you are not planning to return summaries and
> content from Nutch, then you can remove the folders with rm.
>
> And you can completely get rid of segments by using the Solr indexer. After
> you perform indexing, you can delete the fetched segments. I presume this is
> what you saw in other threads.
>
> Best Regards
> Alexander Aristov
>
>
> On 12 November 2010 21:27, ytthet <[email protected]> wrote:
>
> >
> > Hi All,
> >
> > I would like to know when and how to delete segments (directories) in
> > Nutch 1.0.
> >
> > I searched through the mailing list archive, but I can't find the answers.
> >
> > Following is my background information.
> >
> > My crawl-fetch-index process is executed once a day by a scheduled job. My
> > "db.fetch.interval.max" is 1, so I am expecting URLs to be fetched and
> > indexed every day. I am not merging segments in my crawl-fetch-index
> > process because I can't afford the storage space and RAM. (Merging segments
> > is one of the popular discussions on this list, I guess.)
> >
> > On the first day, I have 6 folders in /segments/ (because I crawled to
> > depth 6), 1 GiB in total. On the second day I have 6 more folders, a bit
> > over 1 GiB, so now I have about 2 GiB in total. On the third day, another
> > 1 GiB or so, and now I have a bit over 3 GiB.
> >
> > My question is: when can I remove those old folders from /segments/? And
> > how do I remove them?
> >
> > I tried deleting a previous segment (e.g. from the first day) with the
> > Linux "rm" command, and it is gone. But the searcher no longer works.
> >
> > I saw a suggestion in one entry: "segments are no longer being referenced
> > by indexes which are using in searches, simply delete the
> > segments/xxxxxxxxxx directory." Is that correct?
> >
> > If so how exactly?
> >
> > Thanks for your time,
> >
> > YT Thet
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/When-and-how-properly-to-delete-segments-directory-Nutch-1-0-tp1890600p1890600.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
