Re: When to delete the segments?

2014-11-03 Thread Meraj A. Khan
I am only indexing the parsed data in Solr, so there is no way for me to know when to delete a segment in an automated fashion by considering the parsed data alone. However, I just realized that there is a _SUCCESS file being created within the segment once it is fetched. I will use that as an
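A minimal sketch of the marker check described above, assuming the segments are reachable as ordinary directories and that the _SUCCESS file sits directly inside each segment directory. The function name `fetched_segments` is hypothetical and not part of Nutch. On HDFS the same test would have to go through an HDFS client (for example `hdfs dfs -test -e`) rather than the local filesystem.

```python
from pathlib import Path


def fetched_segments(segments_dir, marker="_SUCCESS"):
    """Return the segment directories that contain the marker file.

    Assumption: segments_dir is a local (or locally mounted) path and the
    marker is written directly inside each segment directory.
    """
    root = Path(segments_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / marker).is_file()
    )
```

Segments without the marker are left out of the result, so a cleanup job built on this would not touch them.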

When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than the db.default.fetch.interval; my understanding is that one does have to wait for the segment to be older than

Re: When to delete the segments?

2014-11-02 Thread remi tassing
The next fetching time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...). On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am

Re: When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Thanks. How do I definitively determine whether a segment has been completely parsed, if I were to set up an hourly crontab to delete the segments from HDFS? I have seen that the presence of the crawl_parse directory in the segments directory at least indicates that the parsing has started, but I

Re: When to delete the segments?

2014-11-02 Thread remi tassing
If you are able to determine what is done with the parsed data, then you could delete the segment as soon as that job is completed. As I mentioned earlier, if the data is to be pushed to Solr (e.g. with bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT), then after indexing is
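The "delete once the consuming job is completed" idea can be sketched as follows. This is an illustration, not a recommended procedure: it assumes updatedb has already been run for the segment, that the segment lives on the local filesystem (on HDFS the removal would go through something like `hdfs dfs -rm -r` instead), and that a zero exit code from the indexing command means indexing succeeded. The function name `index_then_delete` and the `run` parameter are hypothetical; the command line is the one quoted in the message above.

```python
import shutil
import subprocess


def index_then_delete(segment,
                      solr_url="http://127.0.0.1:8983/solr",
                      crawldb="crawl/crawldb",
                      run=subprocess.run):
    """Index one segment into Solr and delete it only if indexing succeeded.

    Assumption: `run` behaves like subprocess.run and returns an object
    with a `returncode` attribute; it is injectable so the logic can be
    exercised without a Nutch installation.
    """
    cmd = ["bin/nutch", "solrindex", solr_url, crawldb, str(segment)]
    result = run(cmd)
    if result.returncode != 0:
        # Indexing failed: keep the segment so the job can be retried.
        return False
    shutil.rmtree(segment)
    return True
```

Keeping the segment on failure means a failed indexing run never causes data loss; the next scheduled run can simply try again.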