I am only indexing the parsed data in Solr, so there is no way for me
to know when to delete a segment in an automated fashion by
considering the parsed data alone. However, I just realized that there
is a _SUCCESS file being created within the segment once it is
fetched. I will use that as an
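
A minimal sketch of the _SUCCESS check described in this message, in Python. It assumes the marker sits directly under the segment directory (the message only says it is created "within the segment") and that the `hdfs` command is on the PATH. The function names and the segment paths in the usage note are hypothetical.

```python
import subprocess
from typing import Callable, List


def hdfs_path_exists(path: str) -> bool:
    """Return True if `hdfs dfs -test -e <path>` exits with status 0."""
    result = subprocess.run(["hdfs", "dfs", "-test", "-e", path])
    return result.returncode == 0


def fetched_segments(
    segments: List[str],
    exists: Callable[[str], bool] = hdfs_path_exists,
) -> List[str]:
    """Keep only the segments that contain a _SUCCESS marker.

    Assumption: the marker is at <segment>/_SUCCESS. Adjust the path if
    your Nutch/Hadoop version writes it into a subdirectory instead.
    """
    return [s for s in segments if exists(s.rstrip("/") + "/_SUCCESS")]
```

For example, `fetched_segments(["crawl/segments/20141103084100"])` would return that segment only if the marker exists. Note that the marker signals that the fetch finished; it does not by itself show that the segment has been parsed or indexed.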
Hi All,
I am deleting the segments as soon as they are fetched and parsed. I
have read in previous posts that it is safe to delete the segments
only if they are older than the db.default.fetch.interval; my
understanding is that one does have to wait for the segment to be
older than
The next fetching time is computed after updatedb is issued with that
segment.
So as long as you don't need the parsed data anymore then you can delete
the segment (e.g. after indexing through Solr...).
On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:
Hi All,
I am
Thanks.
How do I definitively determine whether a segment has been completely
parsed, if I were to set up an hourly crontab to delete the segments
from HDFS? I have seen that the presence of the crawl_parse directory
in the segments directory at least indicates that the parsing has
started, but I
If you are able to determine what is done with the parsed data, then you
could delete the segment as soon as that job is completed.
As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT),
then after indexing is
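
The "index, then delete" order suggested above could be sketched as follows. This is only an illustration, assuming updatedb has already been run for the segment, that the segment lives on HDFS, and that both `bin/nutch` and `hdfs` can be invoked from the working directory. The function name `index_then_delete` and the `run` parameter are hypothetical; the Solr URL and crawldb path are the ones from the command quoted above.

```python
import subprocess
from typing import Callable, List


def index_then_delete(
    segment: str,
    solr_url: str = "http://127.0.0.1:8983/solr",
    crawldb: str = "crawl/crawldb",
    run: Callable[[List[str]], int] = subprocess.call,
) -> bool:
    """Index one segment into Solr and delete it only if indexing succeeded.

    `run` takes a command as a list of arguments and returns its exit code.
    It defaults to subprocess.call and can be replaced for a dry run.
    """
    index_cmd = ["bin/nutch", "solrindex", solr_url, crawldb, segment]
    if run(index_cmd) != 0:
        # Indexing failed: keep the segment so the job can be retried.
        return False

    # Indexing finished, so the parsed data is no longer needed.
    delete_cmd = ["hdfs", "dfs", "-rm", "-r", segment]
    return run(delete_cmd) == 0
```

Passing a `run` function that only prints the command and returns 0 gives a dry run, which is worth doing before scheduling anything that deletes data from cron.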