If you can determine what is done with the parsed data, then you can
delete the segment as soon as that job completes.

As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
"bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT"),
then you can get rid of the segment once indexing is done.
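The flow above (index the segment into Solr, then delete it only if indexing succeeded) can be sketched as a small shell script suitable for a cron job. The variable names, the `index_then_delete` helper, and the default paths are assumptions for illustration, not part of Nutch itself; adjust them to your crawl layout (a Nutch 1.x segment on HDFS is assumed):

```shell
#!/bin/sh
# Sketch only: index one segment into Solr, then delete it if indexing
# succeeded. The defaults below are assumptions -- adjust to your setup.
NUTCH="${NUTCH:-bin/nutch}"       # path to the nutch launcher
HADOOP="${HADOOP:-hadoop}"        # hadoop CLI used to remove the segment
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr}"
CRAWL_DB="${CRAWL_DB:-crawl/crawldb}"

index_then_delete() {
  seg="$1"
  # Push the parsed segment to Solr; only a clean exit means it is safe
  # to reclaim the disk space.
  if "$NUTCH" solrindex "$SOLR_URL" "$CRAWL_DB" "$seg"; then
    "$HADOOP" fs -rm -r "$seg"
  else
    # Keep the segment so indexing can be retried later.
    echo "solrindex failed for $seg; keeping segment" >&2
    return 1
  fi
}

# Example (hypothetical segment path):
#   index_then_delete crawl/segments/20141103121600
[ -n "${SEGMENT:-}" ] && index_then_delete "$SEGMENT"
```

Tying deletion to the exit status of solrindex avoids guessing at parse completion from directory contents: if indexing fails, the segment survives for a retry.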

On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan <[email protected]> wrote:

> Thanks.
>
> How do I definitively determine whether a segment has been completely
> parsed, if I set up an hourly crontab to delete segments from HDFS?
> I have seen that the presence of the crawl_parse directory in the
> segment directory at least indicates that parsing has started, but I
> think that directory is created as soon as parsing begins.
>
> So as not to delete a segment prematurely while it is still being
> fetched, what should I be looking for in my script?
>
> On Sun, Nov 2, 2014 at 7:58 PM, remi tassing <[email protected]>
> wrote:
> > The next fetch time is computed after "updatedb" is issued with that
> > segment.
> >
> > So as long as you no longer need the parsed data, you can delete
> > the segment (e.g. after indexing through Solr...).
> >
> >
> >
> > On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> I am deleting the segments as soon as they are fetched and parsed. I
> >> have read in previous posts that it is safe to delete a segment only
> >> once it is older than db.default.fetch.interval, but my understanding
> >> is that one does not have to wait for the segment to be older than
> >> db.default.fetch.interval and can delete it as soon as the segment is
> >> parsed.
> >>
> >> Is my understanding correct? I want to delete each segment as soon
> >> as possible so as to save as much disk space as possible.
> >>
> >> Thanks.
> >>
>
