Thanks. How can I definitively determine whether a segment has been completely parsed, if I were to set up an hourly crontab to delete segments from HDFS? I have seen that the presence of the crawl_parse directory inside a segment's directory at least indicates that parsing has started, but I think that directory is created as soon as parsing begins.
So as not to delete segments prematurely, while they are still being fetched or parsed, what should I be looking for in my script?

On Sun, Nov 2, 2014 at 7:58 PM, remi tassing <[email protected]> wrote:

> The next fetching time is computed after "updatedb" is issued with that
> segment.
>
> So as long as you don't need the parsed data anymore, you can delete
> the segment (e.g. after indexing through Solr...).
>
> On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan <[email protected]> wrote:
>
>> Hi All,
>>
>> I am deleting the segments as soon as they are fetched and parsed. I
>> have read in previous posts that it is safe to delete a segment
>> only if it is older than db.default.fetch.interval. My
>> understanding is that one does not have to wait for the segment to be
>> older than db.default.fetch.interval, but can delete it as soon as the
>> segment is parsed.
>>
>> Is my understanding correct? I want to delete each segment as soon as
>> possible so as to save as much disk space as possible.
>>
>> Thanks.
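For reference, here is a rough sketch of the kind of check I have in mind for the cron script. It is only a heuristic under assumptions of mine, not something Nutch provides: it assumes the three parse outputs of a Nutch 1.x segment (crawl_parse, parse_data, parse_text) must all be present, that the crawl script could drop a marker entry (the name `_INDEXED` is hypothetical) once updatedb and indexing have finished, and that otherwise a quiet period with no modification is a reasonable fallback. The function name, the `quiet_seconds` parameter, and the input mapping (entry name to modification time, for example built from a directory listing of the segment) are likewise made up for illustration.

```python
# Parse outputs written into a Nutch 1.x segment directory.
PARSE_DIRS = {"crawl_parse", "parse_data", "parse_text"}


def safe_to_delete(entry_mtimes, now, quiet_seconds=3600, marker="_INDEXED"):
    """Heuristic check on one segment.

    entry_mtimes: dict mapping each entry name directly under the segment
                  directory to its last-modification time (epoch seconds).
    now:          current time (epoch seconds).
    marker:       hypothetical entry the crawl script creates after
                  updatedb and indexing have completed for this segment.
    """
    names = set(entry_mtimes)

    # If any parse output is missing, parsing cannot have finished.
    if not PARSE_DIRS <= names:
        return False

    # An explicit marker from the crawl script is the strongest signal.
    if marker in names:
        return True

    # Fallback (assumption): nothing under the segment changed recently.
    newest = max(entry_mtimes.values())
    return now - newest >= quiet_seconds


if __name__ == "__main__":
    segment = {"crawl_generate": 0, "crawl_fetch": 50, "content": 50,
               "crawl_parse": 100, "parse_data": 100, "parse_text": 100}
    print(safe_to_delete(segment, now=200))
    print(safe_to_delete(segment, now=100 + 3600))
```

The marker route seems the more dependable of the two, since a directory's modification time does not necessarily change while files deeper inside it are still being written; the quiet-period fallback is only a guess at "nothing is happening anymore".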

