Hi,
Re:
"Also, I know that I can join the two files on *Channel_Number *and thenfilter 
for records whose *date_time *is between *start_date*_time and *end_date_time*. 
However, this process takes a long time as my actual datacontains ~10 million 
rows in file1 with additional fields and file2contains ~ 50,000 records (using 
mapside join)."
I believe you could make the join run faster by first grouping the rows in each 
relation by channel number, then joining the two relations on channel number.
  Rodrick
> Date: Thu, 27 Jun 2013 17:53:16 +0530
> Subject: Using symlink in Pig to perform a join with the help of a python UDF
> From: [email protected]
> To: [email protected]
> 
> Hi,
> 
> I have two files whose content are as follows:
> 
> *File 1:* Field Names: eventid,clientid,channel_number,date_time
> 
> 114,00001,5003,4/15/2013 11:10
> 
> 114,00001,5003,4/15/2013 1:22
> 
> 100,00001,5003,4/15/2013 23:08
> 
> 114,00002,5002,4/16/2013 8:55
> 100,00002,5002,4/16/2013 8:15
> 
> *File 2:* Field Names: ChannelNumber,ProgramID,Start_Date_Time,End_Date_Time
> 
> *5002,112311,4/16/2013 8:00,4/16/2013 8:30*
> 
> *5002,124313,4/16/2013 8:30,4/16/2013 9:00*
> 
> *5003,113214,4/15/2013 23:00,4/15/2013 23:30*
> 
> *5003,123213,4/15/2013 1:00,4/15/2013 2:00*
> 
> *5003,123343,4/15/2013 10:30,4/15/2013 11:30*
> 
> 
> I want to check if the *channel_number *from *File 1 *matches with
> *ChannelNumber
> *from *File 2* and if the *Date_time* from *File 1* is between *
> Start_Date_Time* and *End_Date_Time* from *File2*.
> 
> ***Required Output:***
> 
> *114,00001,5003,4/15/2013 11:10,5003,123343,4/15/2013 10:30,4/15/2013 11:30
> 114,00001,5003,4/15/2013 1:22,5003,123213,4/15/2013 1:00,4/15/2013 2:00
> 100,00001,5003,4/15/2013 23:08,5003,113214,4/15/2013 23:00,4/15/2013 23:30
> 114,00002,5002,4/16/2013 8:55,5002,124313,4/16/2013 8:30,4/16/2013 9:00
> 100,00002,5002,4/16/2013 8:15,5002,112311,4/16/2013 8:00,4/16/2013 8:30*
> 
> 
> I tried to write a UDF to perform this action and it is as follows:
> *Python UDF:*
> 
> *def myudf(lst):
>  output = [];
>  f = open("epg");
>  for item in lst:
>   for line in f:
>    tup = eval(line);
>    if item[2]== tup[0] and item[3] >= tup[2] and item[3] <= tup[3]:
>     newtup = tuple(lst + tup);
>     output.append(newtup);
>  return output;
>  f.close();*
> 
> I have created a Symbolic Link in Pig which links the second file. The
> procedure that I've used to create a symlink is as follows:
> 
> *    pig -Dmapred.cache.files=hdfs://path/to/file/filename#**epg**
> -Dmapred.create.symlink=yes script.pig*
> 
> The **script.pig** file contains the Pig script which is executed upon
> running the above code.
> The Pig script is as follows:
> 
> *file1 = load '/user/swaroop.sharma/sampledata_for_udf_test/activitydata/'
> using PigStorage(',') as(eventid: chararray,clientid:
> chararray,channel_number: chararray,date_time: chararray);
> 
>     register udf1.py using jython as test;
> 
>     grp_file1 = group file1 by clientid;*
> 
> *    finalfile = foreach grp_file1 generate group, test.myudf(file1) as
> file1: {(eventid: chararray,clientid: chararray,channelnumber:
> chararray,chararray,id: chararray,date_time: chararray,channelnumber:
> chararray,programid: chararray,start_date_time: chararray,end_date_time:
> chararray)};*
> 
>     *store **finalfile** into '/path/where/file/will/be/stored/' using
> PigStorage(',');*
> 
> My Python UDF gets registered successfully. I have also tested the UDF by
> running it on python with the same data by creating two lists.
> I am also able to load the file successfully.
> 
> However, the *finalfile* is not being created. An error occurs while
> running the same.
> 
> I have been trying to debug this from past two days with no success. I am
> new to Pig and currently, I am in a deadlock.
> 
> Also, I know that I can join the two files on *Channel_Number *and then
> filter for records whose *date_time *is between *start_date*_time and *
> end_date_time*. However, this process takes a long time as my actual data
> contains ~10 million rows in file1 with additional fields and file2
> contains ~ 50,000 records (using mapside join).
> 
> Therefore, I'm planning to use the *symlink* feature in Pig to reduce the
> process time.
> Any help regarding this matter will be appreciated and I would be extremely
> grateful!
> 
> Thanks,
> Swaroop Sharma
                                          

Reply via email to