Alexander, I can't afford to run the segment merger after every crawl-fetch-index cycle, because I am using a single Linux box with 1 GiB RAM and a 32 GiB HDD.
I set the Java heap size to 768 MiB to run Nutch. Still, once I reach 3 GiB of segments, the temp files Hadoop uses for segment merging are simply too much for my resources: they took up all the HDD space. That is why I am afraid of the segment merger.

I still need to return summaries, so I need to keep the segments that are still in the index. I have not tried the Solr indexer; I might explore that option.

Thanks for your pointers.

Y.T. Thet

On Sat, Nov 13, 2010 at 3:33 AM, Alexander Aristov <[email protected]> wrote:

> why are you so afraid of segment merger? It appears to be the only
> "official" way to get rid of excessive folders. of course it's
> time/resource consuming but is your system so high loaded?
>
> Also I might be wrong but if you are not planning to return summaries and
> content from nutch when you can remove folders by rm.
>
> And you can completely get rid of segments by using the solr indexer. After
> that you perform indexing you can delete fetched segments. I presume this
> is what you saw in other threads.
>
> Best Regards
> Alexander Aristov
>
> On 12 November 2010 21:27, ytthet <[email protected]> wrote:
>
> > Hi All,
> >
> > I like to know when and how to delete segments (directories) in Nutch 1.0.
> >
> > I searched through mailing list archive, but I can't find the answers.
> >
> > Following is my background information.
> >
> > My crawl-fetch-index process is executed once a day by scheduled job. My
> > "db.fetch.interval.max" is 1, so I am expecting urls to be fetched and
> > indexed everyday. I am not merging segments in my crawl-fetch-index process
> > because I can't afford Storage Space and RAM. (Merging segment is one of
> > the popular discussion in this thread I guess).
> >
> > On First day, I have 6 folders in /segments/ (because i crawled 6 depth).
> > Total of 1 GiB. Second day I have another 6 more folders worth of 1 GiB++
> > Now I have total of 2 GiB. Third day, 1 GiB++ and now I have around 3GIB++.
> >
> > My question is when can I remove those old folder from /segments/? And how
> > do I remove it?
> >
> > I tried deleting previous segment (e.g from first day) by linux "rm"
> > command and they are gone. But searcher no longer works.
> >
> > I saw suggestion on one entry "segments are no longer being referenced by
> > indexes which are using in searches, simply delete the segments/xxxxxxxxxx
> > directory. " Is that correct?
> >
> > If so how exactly?
> >
> > Thanks for your time,
> >
> > YT Thet
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/When-and-how-properly-to-delete-segments-directory-Nutch-1-0-tp1890600p1890600.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
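
P.S. For anyone following the thread, here is a rough sketch of how one might list old segment directories as candidates for removal. It is only a dry run and deletes nothing. It assumes segment directories are named by their creation timestamp in yyyyMMddHHmmss form; the function names, the age threshold and the path in the usage note are made up for illustration. As noted above, removing a segment that the index still references breaks the searcher and loses summaries, so age alone does not make a segment safe to delete.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Assumption: segment directories are named by creation time, e.g. 20101113033300.
SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"


def segments_older_than(names, now, max_age_days):
    """Return the segment names whose timestamp is more than max_age_days before `now`."""
    cutoff = now - timedelta(days=max_age_days)
    old = []
    for name in sorted(names):
        try:
            created = datetime.strptime(name, SEGMENT_NAME_FORMAT)
        except ValueError:
            continue  # not a timestamp-named directory; leave it alone
        if created < cutoff:
            old.append(name)
    return old


def list_candidates(segments_dir, max_age_days):
    """Dry run: print old segment directories under segments_dir; nothing is deleted."""
    names = [p.name for p in Path(segments_dir).iterdir() if p.is_dir()]
    for name in segments_older_than(names, datetime.now(), max_age_days):
        print(name)
```

Calling `list_candidates("crawl/segments", 2)` (hypothetical path) would print the directories older than two days, which can then be checked against whatever the current index was built from before anything is removed.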

