I am only indexing the parsed data in Solr, so there is no way for me
to know, from the parsed data alone, when a segment can be deleted
automatically. However, I just realized that a _SUCCESS file is
created within the segment once it is fetched. I will use that as an
indicator to automate the deletion of the segment folders.
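
A rough sketch of how that check could be scripted for the hourly cron
job. It assumes the segments sit under a single HDFS directory
(crawl/segments here is only a placeholder), that the marker is at
<segment>/_SUCCESS, and that the "hdfs dfs" shell is available (older
installs may need "hadoop fs" instead). The function name and the DRY_RUN
variable are made up for illustration. It only looks for the marker; it
does not verify that the segment has already been indexed into Solr.

```shell
#!/bin/sh
# Sketch only: delete Nutch segments that contain a _SUCCESS marker.
# Assumptions (not confirmed in this thread): the marker lives at
# <segment>/_SUCCESS and segment paths contain no whitespace.

# DRY_RUN=1 (the default) only prints what would be deleted.
DRY_RUN="${DRY_RUN:-1}"

delete_completed_segments() {
    segments_dir="$1"

    # "hdfs dfs -ls" prints one entry per line; directory entries start
    # with "d" and the path is the last field.
    hdfs dfs -ls "$segments_dir" | awk '/^d/ {print $NF}' |
    while read -r segment; do
        # "-test -e" exits 0 when the path exists.
        if hdfs dfs -test -e "$segment/_SUCCESS" < /dev/null; then
            if [ "$DRY_RUN" = "1" ]; then
                echo "would delete $segment"
            else
                # -skipTrash frees the space immediately instead of
                # moving the segment to the HDFS trash.
                hdfs dfs -rm -r -skipTrash "$segment" < /dev/null
            fi
        fi
    done
}

# Run only when a segments directory is passed on the command line.
if [ $# -gt 0 ]; then
    delete_completed_segments "$1"
fi
```

Running it first with the default DRY_RUN=1 against the segments
directory shows which segments would go, before switching to DRY_RUN=0
in the crontab entry.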



On Mon, Nov 3, 2014 at 12:56 AM, remi tassing <[email protected]> wrote:
> If you are able to determine what is done with the parsed data, then you
> could delete the segment as soon as that job is completed.
>
> As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
> "bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT"),
> then after indexing is done you can get rid of the segment.
>
> On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan <[email protected]> wrote:
>
>> Thanks.
>>
>> How do I definitively determine whether a segment has been completely
>> parsed, if I were to set up an hourly crontab to delete the segments
>> from HDFS? I have seen that the presence of the crawl_parse directory
>> in the segments directory at least indicates that the parsing has
>> started, but I think the directory would be created as soon as the
>> parsing begins.
>>
>> So as not to delete a segment prematurely, while it is still being
>> fetched, what should I be looking for in my script?
>>
>> On Sun, Nov 2, 2014 at 7:58 PM, remi tassing <[email protected]>
>> wrote:
>> > The next fetching time is computed after "updatedb" is issued with that
>> > segment.
>> >
>> > So as long as you don't need the parsed data anymore then you can delete
>> > the segment (e.g. after indexing through Solr...).
>> >
>> >
>> >
>> > On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan <[email protected]> wrote:
>> >
>> >> Hi All,
>> >>
>> >> I am deleting the segments as soon as they are fetched and parsed. I
>> >> have read in previous posts that it is safe to delete a segment
>> >> only if it is older than db.default.fetch.interval. My
>> >> understanding is that one does not have to wait for the segment to be
>> >> older than db.default.fetch.interval, but can delete it as soon as the
>> >> segment is parsed.
>> >>
>> >> Is my understanding correct? I want to delete the segment as soon as
>> >> possible so as to save as much disk space as possible.
>> >>
>> >> Thanks.
>> >>
>>
