Try this:

table = LOAD stuff AS (n1:chararray, n2:chararray, other irrelevant stuff);
pared = foreach table generate n1, n2;
grouped = group pared by n1;
counted = foreach grouped generate group, (double)COUNT(pared.n2)/COUNT_STAR(pared.n2) as ratio;
ordered = order counted by ratio desc;
limited = limit ordered 200;
dump limited;

Daniel

Yves Roy wrote:
Thanks Dmitriy.

Regarding your suggestion to use PigStorage('|') and STRSPLIT:

a) Yes, PigStorage('|') works fine (I started from there), but how do I make
it work with the AS clause, which contains 5 fields (A, B, C, D, E) and not
only the 3 corresponding to the split on the delimiter '|'?

b) As for STRSPLIT, when and where should it be used so that its output
matches the AS clause (A, B, C, D, E) once the data has been loaded?

(will this work with 5 fields?)
data = LOAD 'mydata.log' USING PigStorage('|') AS (A, B, C, D, E);

(then, where and how does STRSPLIT come in?)

or should I start with only 4 fields:

data = LOAD 'mydata.log' USING PigStorage('|') AS (A, B, CD, E);

and then use STRSPLIT (how?) so that the following commands work as
expected:

data_cfoo = FILTER data BY C == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY D == 'doe';

Thanks
Yves

YVES
DE FJORD

   YVES ROY DÉVELOPPEUR LOGICIEL DE FJORD
2100, RUE DRUMMOND, MONTRÉAL, QUÉBEC H3G 1X1 CANADA
T 514 270 8782 #4572 / F 514 270 4162 / cossette.com



On Tue, Nov 30, 2010 at 12:58 PM, Dmitriy Ryaboy wrote:

An easier approach would be to just use PigStorage('|') to get the
pipe-delimited fields, and use STRSPLIT to break up the third column into
multiple columns.
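
A minimal sketch of that approach, assuming a Pig release in which STRSPLIT
and TRIM are available as built-ins, that the spaces around '|' in the sample
are really in the file, and that the third column always has the form
c=<value>&d=<value> in that order. The aliases raw, trimmed and split_cd and
the throwaway fields ckey and dkey are names made up for this example:

```pig
-- Load the four pipe-delimited columns; the third one still holds "c=...&d=...".
raw = LOAD 'mydata.log' USING PigStorage('|')
      AS (A:chararray, B:chararray, CD:chararray, E:chararray);

-- Assumption: fields carry the spaces shown around '|', so trim them off.
trimmed = FOREACH raw GENERATE TRIM(A) AS A, TRIM(B) AS B,
                               TRIM(CD) AS CD, TRIM(E) AS E;

-- STRSPLIT returns a tuple; splitting on '=' or '&' gives (c, foo, d, bar).
-- FLATTEN promotes the tuple elements to top-level fields.
split_cd = FOREACH trimmed GENERATE A, B,
           FLATTEN(STRSPLIT(CD, '[=&]', 4))
               AS (ckey:chararray, C:chararray, dkey:chararray, D:chararray),
           E;

-- Drop the key names to end up with the five fields (A, B, C, D, E).
data = FOREACH split_cd GENERATE A, B, C, D, E;

data_cfoo = FILTER data BY C == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY D == 'doe';
```

So the LOAD itself only declares four fields, and the fifth appears in the
FOREACH that follows it. If the key order can vary or a key can be missing,
this positional split would put values in the wrong field, and a custom
loader or a regex per key would be the safer choice.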

-D

On Tue, Nov 30, 2010 at 9:26 AM, John Hui wrote:

You can try using a custom storage parser.

You can see a bunch of examples here:

pig-0.7.0/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage

I wrote one for JSON.

On Tue, Nov 30, 2010 at 12:16 PM, Yves Roy wrote:
Hello:

I hope this is not double posting.

I want to do something simple:

I have a data file, mydata.log, formatted like this:

a1 | b1 | c=foo&d=bar | e1
a2 | b2 | c=john&d=doe | e2
a3 | b3 | c=foo&d=doe | e3
...

and I want to LOAD the data USING <something> in order to get the AS to be
(A, B, C, D, E), i.e. extract 2 fields from the third one.

For example :

data = LOAD 'mydata.log' USING <something> AS (A, B, C, D, E);

i.e. I want the third field (the one formatted as 'cx=foox&dx=barx') to be
parsed to yield the C and D in my AS list of fields, so that later on I can
do things like:

data_cfoo = FILTER data BY C == 'foo';
data_cfoo_ddoe = FILTER data_cfoo BY D == 'doe';


There has to be a simple way to do that. Passing a regex, a Ruby script or
something else as a parameter to PigStorage, or using something other than
PigStorage?

Many thanks

Yves
