Hello,

I have some log files coming in, and they are named like my.log.1,
my.log.2, etc.

When I run my Pig script, I store the results like this:

STORE HITS INTO '/var/results/site.hits' USING PigStorage();
STORE UNQVISITS INTO '/var/results/site.visits' USING PigStorage();

which creates two directories, site.hits and site.visits, each
containing a file named part-r-00000.

When I run my script a second time (with different data loaded, such
as my.log.2), Pig gives me an error saying that the directories
site.visits and site.hits already exist.

What I need is a cumulative count of hits and unique visitors per
item. So if the second file has a hit on an item that was already
counted in part-r-00000, it seems I would have to reprocess the first
log file.
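
To make the requirement concrete, here is a toy sketch in Python (not
Pig). The records, item names, and visitor ids are made up for
illustration, since my real log format is not shown here. It only
demonstrates why hit counts can be summed across files while unique
visitor counts cannot.

```python
from collections import Counter

# Hypothetical (item, visitor) records standing in for my.log.1 and my.log.2.
log1 = [("itemA", "v1"), ("itemA", "v2"), ("itemB", "v1")]
log2 = [("itemA", "v1"), ("itemA", "v3"), ("itemB", "v2")]

def hits(records):
    """Number of hits per item."""
    return Counter(item for item, _ in records)

def visitors(records):
    """Set of distinct visitors per item."""
    seen = {}
    for item, visitor in records:
        seen.setdefault(item, set()).add(visitor)
    return seen

# Hits are additive: per-file counts can simply be summed.
cumulative_hits = hits(log1) + hits(log2)

# Unique visitors are not additive: v1 visited itemA in both files,
# so summing the per-file unique counts counts v1 twice.
v1, v2 = visitors(log1), visitors(log2)
naive_unique = {item: len(v1[item]) + len(v2[item]) for item in v1}

# The correct cumulative figure needs the visitor sets themselves.
true_unique = {item: len(ids) for item, ids in visitors(log1 + log2).items()}
```

In this toy data, summing the per-file unique counts for itemA gives 4,
while the real number of distinct visitors is 3. That is why keeping
only the counts from the first run does not seem to be enough.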

How can I do this counting business incrementally?

Best Regards,
C.B.
