Re: Iterating over data set

Amit Tue, 30 Jul 2013 06:19:47 -0700

Hello Xuri,
I second what Jacob says. I believe writing your own loadfunc makes sense..just 
put whatever code you have in Shell script into the getNext of LoadFunc and you 
should be good.




 
Regards,
Amit



________________________________
 From: Jacob Perkins <[email protected]>
To: [email protected] 
Sent: Tuesday, July 30, 2013 8:34 AM
Subject: Re: Iterating over data set
 

Xuri,

I don't think you can use functions in the load statement like that. To do 
something like that you'd need to write your own LoadFunc. As far as I can tell 
at a glance, and I haven't used Pig 0.11 much, the new DateTime functions are 
eval functions. That means they only operate on tuples during execution 
(map-reduce or whatever emulates map-reduce in local mode) and _after_ the 
input location has been resolved. 

--jacob
@thedatachef 


On Jul 30, 2013, at 1:30 AM, Xuri Nagarin wrote:

> Thanks Jacob.
> 
> I threw in a little bash shell hack to make it simpler. Before I run the
> pig script, I run a bash script that stores timestamps in a day for every
> 20 minutes:
> 
> *Shell*
> 
> dt1=`date -u +%Y-%m-%dT00:00:00.000Z -d "10 days ago"` #10 days ago because
> I get data 10 days late :-)
> 
> 
> for ((i=0;i<1430; i=i+10)) ; do date -u "+%Y-%m-%dT%H:%M:%S.000Z" -d "$dt1
> +$i mins"; done
> 
> Gives me:
> .
> .
> 2013-07-20T22:00:00.000Z
> 2013-07-20T22:10:00.000Z
> 2013-07-20T22:20:00.000Z
> 2013-07-20T22:30:00.000Z
> .
> .
> 
> If the file above is generated by using the date as filename then I call it
> in my pig script as:
> 
> %declare filepath `date -u +%Y-%m-%d -d "10 days ago"`;
> A1 = LOAD '$filepath.ts' USING PigStorage() AS (dt:datetime);
> 
> Now, I can iterate over it:
> 
> B = FOREACH A1 {
> 
>    C = FILTER A BY timestamp > 'dt' AND timestamp <
> 'AddDuration(ToDate(dt),PT20M)' ;
>    .
> do something()
> }
> 
> What I want to do is not use the bash command and instead use Pig's
> datetime functions. Unfortunately, I am stuck in syntactical hell.
> 
> A = LOAD
> '/path/to/logs/ToDate(SubtractDuration(CurrentTime(),'P3D'),'yyyy-MM-dd')'
> USING PigStorage();
> 
> yields:
> "2013-07-29 23:28:05,565 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 1, column 66> mismatched input 'P3D' expecting SEMI_COLON"
> 
> I have tried various combinations of enclosing the date calculation
> functions in single quotes, brackets etc but can't seem to get anything to
> work :(
> 
> 
> 
> 
> On Mon, Jul 29, 2013 at 6:05 AM, Jacob Perkins 
> <[email protected]>wrote:
> 
>> Hi Xuri,
>> 
>> This illustrates the use case for a UDF I've had to implement in one form
>> or another called 'FilterBag'. It's essentially just Pig's builtin "FILTER"
>> but would work like so (using your pseudocode):
>> 
>> 
>> A = load 'input' as (timestamp, worker, output);
>> 
>> --
>> -- Assuming you want to restrict each calculation to a day. ToDay is most
>> likely going to be Piggybank's
>> -- ISOToDay truncation udf
>> --
>> with_day = foreach A generate timestamp, ToDay(timestamp) as day, worker,
>> output;
>> 
>> --
>> -- First you'll have to get all output for a given worker on a given day
>> into single bag
>> --
>> worker_output = foreach (group with_day by (worker, day)) {
>>                          -- this relation (worker_output) will have one
>> tuple per unique worker, day, and timestamp
>>                          timestamps = distinct with_day.timestamp;
>>                          generate
>>                            flatten(group)                        as
>> (worker, day),
>>                            flatten(timestamps)               as t1,
>>                            with_day.(timestamp, output) as outputs; -- A
>> bag that contains all of this workers output and their timestamps for this
>> day
>>                         };
>> 
>> --
>> -- Next, filter each "outputs" bag to contain only outputs that occurred
>> within a 10 minute (or whatever time unit of interest) window from the
>> -- timestamp, looking forward (whether you look forward, back, or both is
>> up to you)
>> --
>> windowed = foreach worker_output {
>>                    -- FilterBag(bag, field_num, comparison_string,
>> to_compare)
>>                    -- bag: bag to filter
>>                    -- field_num: 0 indexed field num of the tuples in the
>> bag to use for comparison to "to_compare"
>>                    -- comparison_string: one of 'lt', 'lte', 'e', 'gte',
>> 'gt' corresponding to less than, less than or equal to and so on
>>                    -- to_compare: the object to compare to
>> 
>>                    outputs_after         = FilterBag(outputs, 0, 'gte',
>> t1);
>>                    outputs_windowed = FilterBag(outputs_after, 0, 'lt',
>> t1+$TIME_UNIT);
>> 
>>                    -- what we WANT to do is this:
>>                    --
>>                    -- outputs_windowed = filter outputs by timestamp >=
>> t1 and timestamp < t1+$TIME_UNIT;
>>                    --
>>                    -- but, I have never been able to make pig happy with
>> this, thus FilterBag.
>> 
>>                    generate
>>                     worker, day, t1, SUM(outputs_windowed.output) as
>> summed_output, COUNT(outputs_windowed) as count;
>>                  };
>> 
>> dump windowed;
>> 
>> 
>> 
>> Notice that you'll have one record for each worker and timestamp that was
>> actually measured. You'll have to do something more fancy if you want
>> smoothing (eg. a record for timestamps where no data was recorded).
>> 
>> Importantly, it would be fantastic to be able to do this without a udf and
>> just using Pig's filter command as shown in the comments above. However,
>> I've tried this in several different ways and never gotten Pig to be happy
>> with it. Instead, I've written a udf called "FilterBag" to accomplish this.
>> Perhaps another Pig user can illuminate the situation better?
>> 
>> I'll see about publishing a simple version of FilterBag if it seems the
>> pig community would use it.
>> 
>> --jacob
>> @thedatachef
>> 
>> On Jul 28, 2013, at 8:34 PM, Xuri Nagarin wrote:
>> 
>>> Hi,
>>> 
>>> Lets say I have a data set of units of output per worker per second
>> that's
>>> in chronological order for a whole day
>>> 
>>> Example:
>>> 2013-07-26T14:00:00, Joe, 50
>>> 2013-07-26T14:10:00, Jane,60
>>> 2013-07-26T14:15:00, Joe, 55
>>> 2013-07-26T14:20:00, Jane,60
>>> 
>>> I create the data set above by loading a larger data set and getting
>> these
>>> three attributes in a relation.
>>> 
>>> Now, I want to count output per user per unit of time period, say every
>> ten
>>> minutes but as a rolling count with a window that moves by the minute.
>> The
>>> pseudo-code would be something along the lines of:
>>> 
>>> -----------xxxxxxxxxxxxxxxxx-------------------
>>> A = LOAD 'input' AS (timestamp, worker, output);
>>> 
>>> ts1=0
>>> ts2=1440 (24 hours x 60 mins/hr)
>>> 
>>> for (i=ts1, i<=(ts2-10), i++)
>>>  {
>>>    R1 = FILTER A BY timestamp > $i AND timestamp < ($i + 10);
>>>    GRP = R1 BY (worker, output);
>>>    CNT = FOREACH GRP GENERATE group, COUNT(GRP);
>>>    DUMP CNT;
>>>   }
>>> -----------xxxxxxxxxxxxxxxxx-------------------
>>> 
>>> But I can't figure out how to do this simple iteration in pig using
>>> FOREACH. I think the answer is create a relation that has a data set that
>>> has all the minutes in a day {0.....1440} and then iterate over it?
>>> 
>>> Sorry if my Pig terminology isn't correct. I have been using it only for
>> a
>>> day now.
>>> 
>>> Any pointers will be highly appreciated.
>>> 
>>> TIA.
>> 
>>

Re: Iterating over data set

Reply via email to