Re: Reduce phase in Fetcher taking excessive time to finish.

2014-11-02 Thread Meraj A. Khan
Julien, Do we need to consider any data loss (URLs) in this scenario? no, why? Thank you for confirming. J. On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj You can control the # of URLs per segment with property
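The name of the property Julien mentions is cut off in the archive snippet. For illustration only (this may not be the exact property he meant), segment size in Nutch is commonly bounded either with the `-topN` flag on the generate job or with `generate.max.count` in `nutch-site.xml`:

```shell
# Hedged sketch: two common ways to cap how many URLs go into a segment.
# Paths and values are placeholders, not taken from the original thread.

# Option 1: hard cap on segment size at generate time
bin/nutch generate crawl/crawldb crawl/segments -topN 50000

# Option 2: cap URLs per host (or domain) via nutch-site.xml:
#   <property>
#     <name>generate.max.count</name>
#     <value>1000</value>
#   </property>
#   <property>
#     <name>generate.count.mode</name>
#     <value>host</value>
#   </property>
```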

When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than the db.default.fetch.interval; my understanding is that one does have to wait for the segment to be older than

Re: When to delete the segments?

2014-11-02 Thread remi tassing
The next fetching time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...). On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am

Re: When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Thanks. How do I definitively determine if a segment has been completely parsed, if I were to set up an hourly crontab to delete the segments from HDFS? I have seen that the presence of the crawl_parse directory in the segments directory at least indicates that the parsing has started, but I
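As a rough cron-friendly check along the lines Meraj describes, a script could treat a segment as fully parsed only when all of the standard post-parse subdirectories exist. This is a heuristic sketch under that assumption, not an official Nutch completeness API, and the paths are hypothetical:

```shell
#!/bin/sh
# Heuristic: a parsed Nutch segment normally contains these subdirectories
# once the parse job has finished writing output. The directory list and
# the base path are assumptions for illustration.
segment_is_parsed() {
  seg="$1"
  for d in crawl_fetch crawl_parse parse_data parse_text; do
    [ -d "$seg/$d" ] || return 1
  done
  return 0
}

for seg in /data/crawl/segments/*; do
  if segment_is_parsed "$seg"; then
    echo "complete: $seg"    # candidate for indexing, then deletion
  else
    echo "incomplete: $seg"  # still being fetched/parsed; skip
  fi
done
```

On HDFS the `[ -d ... ]` tests would become `hadoop fs -test -d <path>` calls, but the logic is the same. Note the caveat from the thread still applies: presence of `crawl_parse` alone only shows parsing has started, so checking the full set of output directories is safer.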

Re: When to delete the segments?

2014-11-02 Thread remi tassing
If you are able to determine what is done with the parsed data, then you could delete the segment as soon as that job is completed. As I mentioned earlier, if the data is to be pushed to Solr (e.g. with bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT), then after indexing is
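Putting Remi's two replies together, the safe ordering is: updatedb first (so next fetch times are computed), then index, then delete. A hedged sketch, where the segment name and Solr URL are placeholders:

```shell
# Hypothetical segment name; in practice this comes from your crawl script.
SEGMENT=crawl/segments/20141102123456

# 1. Fold the segment's fetch results back into the crawldb so the next
#    fetching time is computed (per the earlier reply in this thread).
bin/nutch updatedb crawl/crawldb $SEGMENT

# 2. Push the parsed data to Solr (command as quoted in the thread).
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT

# 3. Only after both jobs succeed is the segment safe to delete.
hadoop fs -rm -r $SEGMENT
```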