I am only indexing the parsed data in Solr, so there is no way for me to know when to delete a segment in an automated fashion by considering the parsed data alone. However, I just realized that a _SUCCESS file is created within the segment once it is fetched. I will use that as an indicator to automate the deletion of the segment folders.
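A minimal sketch of such a cleanup, suitable for an hourly crontab, might look like the following. It assumes the _SUCCESS marker sits directly inside each segment directory, as described above, and that the segments live in HDFS under a directory passed in through SEGMENTS_DIR. The function names and the SEGMENTS_DIR variable are hypothetical, chosen only for illustration. Note that the marker only signals that a job finished writing its output; if the segment still has to be parsed or indexed into Solr, the script should run only after that step is done.

```shell
#!/bin/sh
# Sketch only: delete Nutch segments that contain a _SUCCESS marker.
# Assumption: the marker is at <segment>/_SUCCESS (not confirmed for every setup).

# Print one segment path per line (directory entries of "hadoop fs -ls").
list_segments() {
    hadoop fs -ls "$1" | awk '/^d/ {print $NF}'
}

# Succeed if the segment contains the _SUCCESS marker file.
has_success_marker() {
    hadoop fs -test -e "$1/_SUCCESS"
}

# Recursively remove one segment directory from HDFS.
remove_segment() {
    hadoop fs -rm -r "$1"
}

# Remove every segment under the given directory that has the marker;
# segments without it (possibly still being written) are left alone.
purge_completed_segments() {
    list_segments "$1" | while read -r seg; do
        if has_success_marker "$seg" < /dev/null; then
            remove_segment "$seg" < /dev/null
        fi
    done
}

# Run only when SEGMENTS_DIR is set, e.g. SEGMENTS_DIR=crawl/segments.
if [ -n "${SEGMENTS_DIR:-}" ]; then
    purge_completed_segments "$SEGMENTS_DIR"
fi
```

A crontab entry could then set SEGMENTS_DIR and call the script once an hour. Keeping the three HDFS calls in small functions makes it easy to replace them with local equivalents for a dry run before pointing the script at real data.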

On Mon, Nov 3, 2014 at 12:56 AM, remi tassing <[email protected]> wrote:

> If you are able to determine what is done with the parsed data, then you
> could delete the segment as soon as that job is completed.
>
> As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
> "bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT"),
> then after indexing is done you can get rid of the segment.
>
> On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan <[email protected]> wrote:
>
>> Thanks.
>>
>> How do I definitively determine if a segment has been completely
>> parsed, if I were to set up an hourly crontab to delete the segments
>> from HDFS? I have seen that the presence of the crawl_parse directory
>> in the segments directory at least indicates that the parsing has
>> started, but I think the directory would be created as soon as the
>> parsing begins.
>>
>> So as to not delete a segment prematurely, while it is still being
>> fetched, what should I be looking for in my script?
>>
>> On Sun, Nov 2, 2014 at 7:58 PM, remi tassing <[email protected]>
>> wrote:
>> > The next fetching time is computed after "updatedb" is issued with that
>> > segment.
>> >
>> > So as long as you don't need the parsed data anymore, you can delete
>> > the segment (e.g. after indexing through Solr...).
>> >
>> > On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan <[email protected]> wrote:
>> >
>> >> Hi All,
>> >>
>> >> I am deleting the segments as soon as they are fetched and parsed. I
>> >> have read in previous posts that it is safe to delete the segments
>> >> only if they are older than the db.default.fetch.interval. My
>> >> understanding is that one does not have to wait for the segment to be
>> >> older than db.default.fetch.interval, but can delete it as soon as the
>> >> segment is parsed.
>> >>
>> >> Is my understanding correct? I want to delete the segment as soon as
>> >> possible so as to save as much disk space as possible.
>> >>
>> >> Thanks.

