Hi Cam,

Depending on the statistics you need to maintain about the data in your log files, incrementally processing the files can be easy or a bit more involved.
For example, if you need a cumulative count of hits (or another distributive or algebraic aggregate such as sum, avg, stddev, min, or max), you can join the previous output file with the stats computed from your latest incoming log file:

    PREV_HITS = LOAD '/var/results/site.hits.$PREV' USING PigStorage() AS (site, hits);
    -- HITS is the (site, hits) relation computed from the latest log file
    A = JOIN HITS BY site, PREV_HITS BY site;

Now you can just add up the hits per site and store the totals in the new output file:

    NEW_HITS = FOREACH A GENERATE HITS::site AS site, HITS::hits + PREV_HITS::hits AS hits;
    STORE NEW_HITS INTO '/var/results/site.hits.$CURR' USING PigStorage();

($PREV and $CURR are counters you can pass in as parameters, so that the current output also becomes an input for your next round of processing. Note that the inner join above only keeps sites present in both inputs; use a full outer join if new sites can appear in a log file.)

Calculating unique visitors incrementally is not as easy, since "distinct count" is a holistic aggregate: you need to keep the user ids around in order to remove duplicate visits. You can look at approximate ways of doing this (for example, see Flajolet-Martin sketches and similar papers on hash-based algorithms).

Cheers,
Laukik

On 1/17/11 5:26 AM, "Cam Bazz" <[email protected]> wrote:

Hello,

I have some log files coming in, named like my.log.1, my.log.2, etc. When I run the Pig script, I store the results like this:

    STORE HITS INTO '/var/results/site.hits' USING PigStorage();
    STORE UNQVISITS INTO '/var/results/site.visits' USING PigStorage();

which in turn makes directories named site.hits and site.visits, each containing a file named part-r-00000. When I run my script a second time (with different data loaded, like my.log.2), Pig gives me an error saying the directories site.hits and site.visits already exist.

What I need is a cumulative count of hits and unique visitors per item. So if the second file has a hit on an item that was already counted in part-r-00000, it would seem to require reprocessing the first log file. How can I do this counting incrementally?

Best Regards,
C.B.
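[Editor's note: the Flajolet-Martin pointer above can be made concrete with a short sketch. The class below is a minimal, illustrative Python implementation, not part of the original thread; the class name, MD5-based seeded hash, and the default of 16 bitmaps are all assumptions made for the example. The property that matters for this use case is that two sketches merge with a bitwise OR, so each day's log can be sketched independently and the sketches combined, which is exactly what makes the distinct count incrementally maintainable.]

```python
import hashlib

class FMSketch:
    """Minimal Flajolet-Martin distinct-count sketch (illustrative only).

    Sketch each log file separately, OR the bitmaps together with merge(),
    then call estimate() for an approximate number of distinct items.
    """

    PHI = 0.77351  # Flajolet-Martin correction constant

    def __init__(self, num_sketches: int = 16):
        # Several bitmaps (stochastic averaging) reduce the variance.
        self.bitmaps = [0] * num_sketches

    @staticmethod
    def _rank(item: str, seed: int) -> int:
        # Position of the lowest set bit of a seeded hash of the item.
        h = int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
        return (h & -h).bit_length() - 1 if h else 127

    def add(self, item: str) -> None:
        for s in range(len(self.bitmaps)):
            self.bitmaps[s] |= 1 << self._rank(item, s)

    def merge(self, other: "FMSketch") -> None:
        # Union of the underlying sets: just OR the bitmaps.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        # R = position of the lowest zero bit per bitmap; average the R
        # values across bitmaps, then apply 2^R / phi.
        ranks = [((~b) & -(~b)).bit_length() - 1 for b in self.bitmaps]
        return 2 ** (sum(ranks) / len(ranks)) / self.PHI
```

For example, one sketch per daily log file: build `day1` and `day2` sketches from the raw user ids, merge them, and estimate the cumulative unique-visitor count without ever storing the ids themselves.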
