BinStorage is more efficient and doesn't have the trouble with nested data
representations that you encountered with PigStorage. The only downsides are
that it's not human-readable and that its format might change between
versions of Pig (though so far we have resisted the urge, IIRC).
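
Something like the following, as a rough sketch against the script quoted
below. The path 'output.bin' and the alias t3_again are made up for
illustration, and I'm assuming the load happens in a later script than the
store:

```pig
-- Script 1: build the nested relation and store it in binary form.
t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
t3 = GROUP t2 BY group;
STORE t3 INTO 'output.bin' USING BinStorage();

-- Script 2 (run later): read the data back with the same function.
-- No AS clause is given here, so fields are referenced by position.
t3_again = LOAD 'output.bin' USING BinStorage();
keys = FOREACH t3_again GENERATE $0;
DUMP keys;
```

If you want named fields after the load, you would still need to supply an
AS clause yourself; the sketch sidesteps that by using $0.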

D

On Tue, Dec 28, 2010 at 3:24 PM, Jonathan Coveney <[email protected]> wrote:

> Thanks. Is there any particular downside to this once you get to the
> millions and hundreds of millions of rows, or is it just the lack of simple
> use with non-Pig systems?
>
> Sent via BlackBerry
>
> -----Original Message-----
> From: Dmitriy Ryaboy <[email protected]>
> Date: Tue, 28 Dec 2010 15:08:15
> To: <[email protected]>
> Reply-To: [email protected]
> Subject: Re: Possible deficiency in describe?
>
> Try using BinStorage instead of the text-based PigStorage
>
> D
>
> On Tue, Dec 28, 2010 at 2:08 PM, Jonathan Coveney <[email protected]> wrote:
>
> > So, I made a dumb little Python script that parses a Pig script, sees what
> > stores there are, uses Pig's DESCRIBE to get the schema of each object
> > being stored, and then uses that info to make a new file that has the
> > proper loader/schema. I felt this was useful because I found myself making
> > intermediate stores, and it was then pretty difficult to write the proper
> > loader if there were a lot of columns (especially remembering the types).
> >
> > However, it seems that the result from DESCRIBE is not adequate to do a
> > load. For example, I have test.txt, which is literally just random pairs
> > of numbers, i.e.
> >
> > 1 2
> > 1 3
> > 1 4
> > 2 5
> > 2 6
> > 3 7
> > 3 8
> > 4 9
> > 5 10
> > 6 11
> > 7 12
> > 8 13
> > 8 14
> > 8 15
> >
> > and so on.
> >
> > I do this:
> >
> > t1 = LOAD 'test.txt' AS (n1:int, n2:int);
> > t2 = GROUP t1 BY n1;
> > t3 = GROUP t2 BY group;
> >
> > DESCRIBE t3;
> > STORE t3 INTO 'output.txt';
> >
> > The query runs without a hitch; however, there is an issue.
> >
> > This is what describe gives:
> >
> > t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}
> >
> > However, this won't let you load the file...
> >
> > The output has the form
> > x{(y,{(a,b)}
> >
> > And I'm not really sure how to go about even creating a loader that would
> > properly load it. Suffice it to say, it seems pretty complicated to store
> > and then load anything that isn't a flat file... is this by design? Is
> > there an easier way to go from the schema, as per DESCRIBE, to the schema
> > you'd use to load it?
> >
> > I'm curious what people do in practice. I could probably extend the script
> > I made to go from DESCRIBE schema -> loading schema (if the Pig loader can
> > load things that have brackets and all that?), but I want to know what the
> > limitations are.
> >
> > As always, I apologize if there is an easy answer to this. Thanks.
> >
>
>
