Hi Cam,

Depending on the statistics you need to maintain about the data in your log files, incrementally processing the files can be easy or a bit more involved.
For example, if you need a cumulative count of hits (or another distributive or algebraic aggregate such as sum, avg, stddev, min, or max), you can join the previous output file with the stats computed from your latest incoming log file:

    PREV_HITS = LOAD '/var/results/site.hits.$PREV' USING PigStorage() AS (site, hits);
    -- HITS is the (site, hits) relation computed from the latest log file
    A = JOIN HITS BY site, PREV_HITS BY site;

Now you can just add up the hits per site and store the totals in the new output file:

    NEW_HITS = FOREACH A GENERATE HITS::site AS site, HITS::hits + PREV_HITS::hits AS hits;
    STORE NEW_HITS INTO '/var/results/site.hits.$CURR' USING PigStorage();

($PREV and $CURR are counters you can pass in as parameters, so that the current output also becomes an input for your next round of processing. Note that the inner join above only keeps sites present in both inputs; use a full outer join if new sites can appear in a log file.)

Calculating unique visitors incrementally is not as easy, since "distinct count" is a holistic aggregate: you need to keep the user ids around in order to remove duplicate visits. You can look at approximate ways of doing this (for example, see Flajolet-Martin sketches and similar papers on hash-based algorithms).

Cheers,
Laukik

On 1/17/11 5:26 AM, "Cam Bazz" <[email protected]> wrote:

Hello,

I have some log files coming in, named like my.log.1, my.log.2, etc. When I run the Pig script, I store the results like this:

    STORE HITS INTO '/var/results/site.hits' USING PigStorage();
    STORE UNQVISITS INTO '/var/results/site.visits' USING PigStorage();

which in turn makes directories named site.hits and site.visits, each containing a file named part-r-00000. When I run my script a second time (with different data loaded, like my.log.2), Pig gives me an error saying the directories site.hits and site.visits already exist.

What I need is a cumulative count of hits and unique visitors per item. So if the second file has a hit on an item that was already counted in part-r-00000, it would seem to require reprocessing the first log file. How can I do this counting incrementally?

Best Regards,
C.B.
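[Editor's note: the Flajolet-Martin pointer above can be made concrete with a short sketch. The class below is a minimal, illustrative Python implementation, not part of the original thread; the class name, MD5-based seeded hash, and the default of 16 bitmaps are all assumptions made for the example. The property that matters for this use case is that two sketches merge with a bitwise OR, so each day's log can be sketched independently and the sketches combined, which is exactly what makes the distinct count incrementally maintainable.]

```python
import hashlib

class FMSketch:
    """Minimal Flajolet-Martin distinct-count sketch (illustrative only).

    Sketch each log file separately, OR the bitmaps together with merge(),
    then call estimate() for an approximate number of distinct items.
    """

    PHI = 0.77351  # Flajolet-Martin correction constant

    def __init__(self, num_sketches: int = 16):
        # Several bitmaps (stochastic averaging) reduce the variance.
        self.bitmaps = [0] * num_sketches

    @staticmethod
    def _rank(item: str, seed: int) -> int:
        # Position of the lowest set bit of a seeded hash of the item.
        h = int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
        return (h & -h).bit_length() - 1 if h else 127

    def add(self, item: str) -> None:
        for s in range(len(self.bitmaps)):
            self.bitmaps[s] |= 1 << self._rank(item, s)

    def merge(self, other: "FMSketch") -> None:
        # Union of the underlying sets: just OR the bitmaps.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        # R = position of the lowest zero bit per bitmap; average the R
        # values across bitmaps, then apply 2^R / phi.
        ranks = [((~b) & -(~b)).bit_length() - 1 for b in self.bitmaps]
        return 2 ** (sum(ranks) / len(ranks)) / self.PHI
```

For example, one sketch per daily log file: build `day1` and `day2` sketches from the raw user ids, merge them, and estimate the cumulative unique-visitor count without ever storing the ids themselves.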
