If you can describe the layout of your input files more thoroughly, it would help.
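
In the meantime, since the script is already launched from a shell script, one way to avoid the repeated FILTER statements from the original question is to generate them. Below is a minimal sketch in Python, assuming the patterns are available as a list and that each match gets its own output directory. The helper name `generate_pig_statements`, the `out` prefix, and the sample patterns are hypothetical; `BagName` and `Filtered_Data_N` are taken from the question. It only builds the script text and does not call Pig itself.

```python
def generate_pig_statements(patterns, bag="BagName", out_prefix="out"):
    """Return Pig Latin text with one FILTER and one STORE per pattern.

    `bag` and `out_prefix` are illustrative placeholders, not Pig built-ins.
    """
    lines = []
    for i, pattern in enumerate(patterns, start=1):
        if "'" in pattern:
            # Keep the sketch simple: refuse patterns that would end the
            # single-quoted Pig string literal early.
            raise ValueError(f"pattern contains a single quote: {pattern!r}")
        # Assumption: Pig string literals treat backslash as an escape
        # character, so each backslash in the regex is doubled.
        escaped = pattern.replace("\\", "\\\\")
        alias = f"Filtered_Data_{i}"
        lines.append(f"{alias} = FILTER {bag} BY ($0 matches '{escaped}');")
        lines.append(f"STORE {alias} INTO '{out_prefix}/{alias}';")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file-name patterns, one per CSV to extract.
    print(generate_pig_statements([r".*sales.*", r".*users.*"]))
```

The generated text can be appended to the rest of the script before the shell wrapper invokes pig. The same loop could also live in an embedded Pig script, as suggested further down the thread.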

On Monday, October 6, 2014, Pradeep Gollakota wrote:

> It looks like the best option at this point is to write a custom UDF that
> loads a set of regular expressions from a file and runs the data through
> all of them.
>
> On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal wrote:
>
> > Thanks for replying, everyone. A few comments on everyone's suggestions.
> >
> > 1> I am processing a sequence file which consists of many CSV files. I
> > need to extract only a few among all the CSVs. That is why I am using
> > 'SelectFieldByValue', where the value is the file name in my case,
> > rather than selecting by field directly.
> >
> > 2> All selected files (different RegEx) are stored in HDFS separately,
> > so there is one STORE statement for each extracted file in a bag.
> >
> > 3> I cannot do a cross join, as all input files would get combined, and
> > I do not want that.
> >
> > 4> I cannot use the AND/OR operators, as I need a different bag for each
> > selected file (RegEx).
> >
> > Let me know if anyone has any other suggestions.
> > Sorry for not being clear with the specification in the first place.
> >
> > Thanks.
> >
> > On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota wrote:
> >
> >> In case you haven't seen this already, take a look at
> >> http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies
> >> on optimizing your pig scripts.
> >>
> >> On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney wrote:
> >>
> >> > Actually, I don't think you need SelectFieldByValue. Just use the
> >> > name of the field directly.
> >> >
> >> > On Monday, October 6, 2014, Prashant Kommireddi wrote:
> >> >
> >> > > Are these regex static? If yes, this is easily achieved by
> >> > > embedding your script in Java or any other language that Pig
> >> > > supports:
> >> > > http://pig.apache.org/docs/r0.13.0/cont.html
> >> > >
> >> > > You could also possibly write a UDF that loops through all the
> >> > > regexes and returns the result.
> >> > >
> >> > > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I have written a ‘Pig Script’ which processes Sequence files
> >> > > > given as input.
> >> > > >
> >> > > > It is working fine, but there is one problem, described below.
> >> > > >
> >> > > > I have repetitive statements in my pig script, as shown below:
> >> > > >
> >> > > > - Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1');
> >> > > > - Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
> >> > > > - Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
> >> > > > - So on…
> >> > > >
> >> > > > Question:
> >> > > >
> >> > > > Is there any way I can write the above statement once and then
> >> > > > loop through all possible “RegEx” values, substituting each one
> >> > > > into the Pig script?
> >> > > >
> >> > > > For example:
> >> > > >
> >> > > > Filtered_Data_X = FILTER BagName BY ($0 matches 'RegEx');
> >> > > > (have this statement once)
> >> > > >
> >> > > > (loop through all possible RegEx and substitute the value in the
> >> > > > statement)
> >> > > >
> >> > > > Right now I am calling the Pig script from a shell script, so any
> >> > > > way to do this from the shell script is also welcome.
> >> > > >
> >> > > > Thanks in advance.
> >> > > >
> >> > > > Happy Pigging!!!!
> >> >
> >> > --
> >> > Russell Jurney twitter.com/rjurney datasyndrome.com

--
Russell Jurney twitter.com/rjurney datasyndrome.com
