If you can determine when whatever is done with the parsed data has finished, you can delete the segment as soon as that job completes.
As I mentioned earlier, if the data is to be pushed to Solr (e.g. with "bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT"), then after indexing is done you can get rid of the segment.

On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan <[email protected]> wrote:
> Thanks.
>
> How do I definitively determine if a segment has been completely
> parsed, if I were to set up an hourly crontab to delete the segments
> from HDFS? I have seen that the presence of the crawl_parse directory
> in the segments directory at least indicates that the parsing has
> started, but I think that directory is created as soon as parsing
> begins.
>
> So as not to delete a segment prematurely, while it is still being
> fetched, what should I be looking for in my script?
>
> On Sun, Nov 2, 2014 at 7:58 PM, remi tassing <[email protected]>
> wrote:
> > The next fetch time is computed after "updatedb" is issued with that
> > segment.
> >
> > So as long as you don't need the parsed data anymore, you can delete
> > the segment (e.g. after indexing through Solr...).
> >
> > On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> I am deleting the segments as soon as they are fetched and parsed. I
> >> have read in previous posts that it is safe to delete a segment
> >> only once it is older than db.default.fetch.interval; my
> >> understanding is that one does not have to wait for the segment to
> >> become older than db.default.fetch.interval, but can delete it as
> >> soon as the segment is parsed.
> >>
> >> Is my understanding correct? I want to delete each segment as soon
> >> as possible, to save as much disk space as possible.
> >>
> >> Thanks.
> >>
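A minimal sketch of the crontab check asked about above. The six subdirectory names are the standard Nutch 1.x segment layout (an assumption, not something this thread states); checking for the full set, rather than crawl_parse alone (which, as noted above, appears as soon as parsing starts), is a more conservative heuristic, not a guarantee. The paths in the loop are hypothetical.

```shell
#!/bin/sh
# Hedged sketch: decide whether a Nutch 1.x segment looks fully fetched
# and parsed before deleting it. Checking all six standard subdirectories
# (not just crawl_parse, which is created when parsing begins) is a more
# conservative heuristic than crawl_parse alone.
segment_complete() {
    seg="$1"
    for d in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
        [ -d "$seg/$d" ] || return 1   # a piece is missing: not safe to delete yet
    done
    return 0
}

# Example cron-driven cleanup over a hypothetical segments directory.
for seg in crawl/segments/*; do
    if segment_complete "$seg"; then
        echo "would delete: $seg"      # replace with: hadoop fs -rm -r "$seg"
    fi
done
```

As the thread itself suggests, the definitive signal is still to delete the segment from inside your own crawl script, right after `updatedb` and `solrindex` exit successfully (e.g. `bin/nutch updatedb crawl/crawldb $SEGMENT && bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT && hadoop fs -rm -r $SEGMENT`), rather than inferring completeness from directory contents.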

