I have a bunch of CSV files that some application combines into a single
sequence file, along with the metadata associated with that sequence
file.

So, in short, the sequence file consists of a bunch of CSVs plus their
metadata.

My Pig script extracts only a few of the CSVs present in the sequence
file. Hope that helps.
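
For illustration only, here is a minimal sketch of how one FILTER/STORE pair
per file-name pattern could be generated outside Pig instead of being written
out by hand. It is plain Python that only builds the script text. The relation
name BagName and the Filtered_Data_ prefix come from the original question;
the function name, the pattern names, the regular expressions, and the output
directory are hypothetical placeholders.

```python
def build_pig_statements(patterns, relation="BagName", out_dir="/output"):
    """Return Pig Latin text with one FILTER and one STORE per pattern.

    patterns: list of (name, regex) pairs. Each name is appended to the
    alias, so it must contain only letters, digits, and underscores.
    All example values used with this function are hypothetical.
    """
    lines = []
    for name, regex in patterns:
        # Assumption: Pig string literals use backslash escapes, so
        # backslashes and single quotes in the regex are escaped here.
        escaped = regex.replace("\\", "\\\\").replace("'", "\\'")
        alias = f"Filtered_Data_{name}"
        lines.append(f"{alias} = FILTER {relation} BY ($0 matches '{escaped}');")
        lines.append(f"STORE {alias} INTO '{out_dir}/{name}';")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder patterns; the real file-name regexes are not given in the thread.
    print(build_pig_statements([("1", "RegEx-1"), ("2", "RegEx-2")]))
```

The calling shell script could write this text to a file, after the LOAD
statement that defines BagName, and pass that file to pig. This keeps a
separate bag and a separate STORE for each extracted file.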

Let me know if needed; I can send the Pig script code as well.

Thanks.

On Mon, Oct 6, 2014 at 7:05 PM, Russell Jurney <russell.jur...@gmail.com>
wrote:

> If you can describe the layout of your input files more thoroughly, it
> would help.
>
> On Monday, October 6, 2014, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > It looks like the best option at this point is to write a custom UDF that
> > loads a set of regular expressions from a file and runs the data
> > through all of them.
> >
> > On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal <
> > ankur.kasliwal...@gmail.com>
> > wrote:
> >
> > > Thanks for replying, everyone. A few comments on everyone's suggestions.
> > >
> > > 1>  I am processing a sequence file that consists of many CSV files. I need
> > > to extract only a few of those CSVs. That is why I am using
> > > 'SelectFieldByValue', where the value is the file name in my case, rather
> > > than selecting by field directly.
> > >
> > > 2>  All selected files (a different RegEx each) are stored in HDFS
> > > separately, so there is one STORE statement for each extracted file in a bag.
> > >
> > > 3>  Cannot do a cross join, as all the input files would get combined, and
> > > I do not want that.
> > >
> > > 4>  Cannot use the AND/OR operators, as I need a different bag for each
> > > selected file (RegEx).
> > >
> > >
> > >
> > > Let me know if anyone has any other suggestions.
> > > Sorry for not being clear with the specification in the first place.
> > >
> > > Thanks.
> > >
> > > On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > >> In case you haven't seen this already, take a look at
> > >> http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
> > >> optimizing your Pig scripts.
> > >>
> > >> On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney <russell.jur...@gmail.com>
> > >> wrote:
> > >>
> > >> > Actually, I don't think you need SelectFieldByValue. Just use the name
> > >> > of the field directly.
> > >> >
> > >> > On Monday, October 6, 2014, Prashant Kommireddi <prash1...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Are these regexes static? If yes, this is easily achieved by embedding
> > >> > > your script in Java or any other language that Pig supports:
> > >> > > http://pig.apache.org/docs/r0.13.0/cont.html
> > >> > >
> > >> > > You could also possibly write a UDF that loops through all the regexes
> > >> > > and returns the result.
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal <
> > >> > > ankur.kasliwal...@gmail.com> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > I have written a Pig script that processes sequence files given as
> > >> > > > input.
> > >> > > >
> > >> > > > It is working fine, but there is one problem, described below.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > I have repetitive statements in my Pig script, as shown below:
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >    - Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1');
> > >> > > >    - Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
> > >> > > >    - Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
> > >> > > >    - So on…
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > Question :
> > >> > > >
> > >> > > > So is there any way I can write the above statement once and then loop
> > >> > > > through all possible “RegEx” values, substituting each into the Pig
> > >> > > > script?
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > For Example:
> > >> > > >
> > >> > > >
> > >> > > > Filtered_Data_X = FILTER BagName BY ($0 matches 'RegEx');
> > >> > > > (have this statement once)
> > >> > > >
> > >> > > > (loop through all possible RegEx values and substitute each into the
> > >> > > > statement)
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > Right now I am calling the Pig script from a shell script, so any
> > >> > > > approach from the shell script would also be welcome.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > Thanks in advance.
> > >> > > >
> > >> > > > Happy Pigging!!!!
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> > >> > datasyndrome.com
> > >> >
> > >>
> > >
> > >
> >
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>
