Julien,
Do we need to consider any data loss (URLs) in this scenario?
no, why?
Thank you for confirming.
J.
On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Hi Meraj
You can control the # of URLs per segment with
property
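For example, one way to cap segment size at generate time (the exact property is cut off above, so this shows the generate job's -topN option instead, with placeholder paths and count):

# Sketch only: -topN caps how many top-scoring URLs are selected
# into the new segment; it may not be the property referred to above.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000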
Hi All,
I am deleting the segments as soon as they are fetched and parsed. I
have read in previous posts that it is safe to delete the segments
only if they are older than the db.default.fetch.interval; my
understanding is that one does have to wait for the segment to be
older than
The next fetching time is computed after updatedb is issued with that
segment.
So as long as you don't need the parsed data anymore, you can delete
the segment (e.g. after indexing through Solr...).
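For example, a rough sketch of that ordering (segment path and Solr URL are placeholders, and it assumes both updatedb and indexing have run against the segment before it is removed):

# Sketch: update the crawldb from the segment, index it into Solr,
# then remove the segment from HDFS once nothing else needs it.
SEGMENT=crawl/segments/20141103084100   # placeholder segment name
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT
hadoop fs -rm -r $SEGMENT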
On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:
Hi All,
I am
Thanks.
How do I definitively determine if a segment has been completely
parsed, if I were to set up an hourly crontab to delete the segments
from HDFS? I have seen that the presence of the crawl_parse directory
in the segments directory at least indicates that the parsing has
started, but I
If you are able to determine what is done with the parsed data, then you
could delete the segment as soon as that job is completed.
As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT),
then after indexing is done you can delete the segment.
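As a rough illustration of the hourly cron idea (a heuristic sketch only: it assumes that the presence of the parse output directories means the parse job finished writing, which, as noted above, is not a strict guarantee, and it does not check that indexing has run):

#!/bin/sh
# Heuristic sketch for an hourly cron job: delete segments whose parse
# output directories exist. This does NOT prove downstream indexing is
# finished; adapt the check to whatever consumes the parsed data.
for SEGMENT in $(hadoop fs -ls crawl/segments | grep '^d' | awk '{print $NF}'); do
  if hadoop fs -test -d $SEGMENT/parse_data && hadoop fs -test -d $SEGMENT/parse_text; then
    hadoop fs -rm -r $SEGMENT
  fi
done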