BinStorage is more efficient and doesn't have the trouble with nested data representations you encountered in PigStorage. The only downsides are that it isn't human-readable and that its format might change between versions of Pig (though so far we have resisted the urge, IIRC).
D

On Tue, Dec 28, 2010 at 3:24 PM, Jonathan Coveney <[email protected]> wrote:

> Thanks. Is there any particular downside to this if you get to the millions
> and hundreds of millions of rows, or is it just the lack of simple use with
> nonpig systems?
>
> Sent via BlackBerry
>
> -----Original Message-----
> From: Dmitriy Ryaboy <[email protected]>
> Date: Tue, 28 Dec 2010 15:08:15
> To: <[email protected]>
> Reply-To: [email protected]
> Subject: Re: Possible deficiency in describe?
>
> Try using BinStorage instead of the text-based PigStorage
>
> D
>
> On Tue, Dec 28, 2010 at 2:08 PM, Jonathan Coveney <[email protected]> wrote:
>
> > So, I made a dumb little python script that parses a pig script, see's what
> > stores there are, and then uses pig's describe function to get the schema of
> > the object being stored and then uses that info to make a new file that has
> > the proper loader/schema. I felt this was useful because I found myself
> > making intermediate stores, and then it being pretty difficult to make the
> > proper loader if there were a lot of columns (especially remembering the
> > type).
> >
> > However, it seems that the result from DESCRIBE is not adequate to do a
> > load. For example, I have test.txt which is literally just random pairs of
> > numbers
> >
> > ie
> >
> > 1 2
> > 1 3
> > 1 4
> > 2 5
> > 2 6
> > 3 7
> > 3 8
> > 4 9
> > 5 10
> > 6 11
> > 7 12
> > 8 13
> > 8 14
> > 8 15
> >
> > and so on.
> >
> > I do this:
> >
> > t1 = LOAD 'test.txt' AS (n1:int, n2:int);
> > t2 = GROUP t1 BY n1;
> > t3 = GROUP t2 BY group;
> >
> > DESCRIBE t3;
> > STORE t3 INTO 'output.txt';
> >
> > The query runs without a hitch, however, there is an issue
> >
> > This is what describe gives:
> >
> > t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}
> >
> > However, this won't let you load the file...
> >
> > the output has form
> > x{(y,{(a,b)}
> >
> > And I'm not really sure how to go about even creating a loader that would
> > properly load it. Suffice it to say, it seems pretty complicated to store
> > and then load anything that isn't a flat file...is this by design? Is there
> > an easier way to go from the schema, as per describe, to the schema you'd
> > use to load it?
> >
> > I'm curious what people do in practice. I could probably extend the script I
> > made to go from describe schema -> loading schema (if the pig loader can
> > load things that have brackets and all that?), but I want to know what the
> > limitations are.
> >
> > As always, I apologize if there is an easy answer to this. Thanks.
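
For the "describe schema -> loading schema" step asked about in the quoted message, here is a minimal Python sketch. It assumes DESCRIBE prints bags without tuple parentheses, exactly as in the t3 output quoted above, and that the loader wants each nested bag spelled as {t:(...)}. The function name describe_to_load_schema and the tuple alias t are made up for illustration, and whether a given Pig version accepts the resulting schema (for example a field named group) is not checked here.

```python
def describe_to_load_schema(describe_line, tuple_alias="t"):
    """Turn one line of DESCRIBE output into a candidate AS-clause schema.

    Assumption: bags are printed as bare braces with no tuple parentheses,
    e.g. "t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}".
    """
    # Drop the leading relation alias ("t3: ") and keep the schema body.
    _, _, schema = describe_line.partition(": ")
    schema = schema.strip()
    if not (schema.startswith("{") and schema.endswith("}")):
        raise ValueError("expected a schema wrapped in braces")

    out = []
    depth = 0
    for ch in schema:
        if ch == "{":
            # Depth 0 is the relation itself, which becomes the AS (...) list;
            # deeper braces are nested bags and get an explicit tuple wrapper.
            out.append("(" if depth == 0 else "{" + tuple_alias + ":(")
            depth += 1
        elif ch == "}":
            depth -= 1
            out.append(")" if depth == 0 else ")}")
        elif ch == " ":
            continue  # spacing is normalized below
        else:
            out.append(ch)
    return "".join(out).replace(",", ", ")


if __name__ == "__main__":
    line = "t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}"
    print(describe_to_load_schema(line))
```

For the t3 line this prints (group:int, t2:{t:(group:int, t1:{t:(n1:int, n2:int)})}), which is the text that would go after AS in the LOAD statement. It only rewrites the schema string; it does nothing about how the stored data itself is parsed.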
