I assume by scala you mean Scalding?
If so, yeah, Scalding should be much easier for working with custom data
types.

Pig doesn't handle generic "objects" well. You have to write converters to
and from Pig data types, like the ones we created in ElephantBird for Protocol
Buffers and Thrift (and a bunch of writables, as Pradeep pointed out).
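
To make the "converters to and from" idea concrete, here is a minimal sketch, assuming nothing about ElephantBird's real interfaces. It has no Hadoop or Pig dependencies: VisitKey is a hypothetical stand-in with invented fields (visitorId, timestamp), VisitKeyConverter is an illustrative name, and a plain List<Object> stands in for a Pig tuple of (chararray, long). A real converter would implement whatever contract the loader expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

public class ConverterSketch {

    // Hypothetical stand-in for a custom Writable such as VisitKey.
    // The field names are invented; the real class is not shown in this thread.
    static final class VisitKey {
        final String visitorId;
        final long timestamp;

        VisitKey(String visitorId, long timestamp) {
            this.visitorId = visitorId;
            this.timestamp = timestamp;
        }

        // Mirrors the shape of Writable.write(DataOutput).
        void write(DataOutputStream out) throws IOException {
            out.writeUTF(visitorId);
            out.writeLong(timestamp);
        }

        // Mirrors the shape of Writable.readFields(DataInput).
        static VisitKey read(DataInputStream in) throws IOException {
            String id = in.readUTF();
            long ts = in.readLong();
            return new VisitKey(id, ts);
        }
    }

    // Hypothetical converter: serialized bytes <-> a list of simple values
    // (String, Long) standing in for a Pig tuple of (chararray, long).
    static final class VisitKeyConverter {

        // Deserialize the custom type and expose its fields as simple values.
        List<Object> toTuple(byte[] serialized) {
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(serialized))) {
                VisitKey key = VisitKey.read(in);
                return Arrays.<Object>asList(key.visitorId, key.timestamp);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Rebuild the custom type from simple values and serialize it again.
        byte[] fromTuple(List<Object> tuple) {
            VisitKey key = new VisitKey((String) tuple.get(0), (Long) tuple.get(1));
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                key.write(out);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toByteArray();
        }
    }

    public static void main(String[] args) {
        VisitKeyConverter converter = new VisitKeyConverter();
        byte[] bytes = converter.fromTuple(Arrays.<Object>asList("visitor-1", 1379351668L));
        List<Object> tuple = converter.toTuple(bytes);
        System.out.println(tuple);
    }
}
```

The point is only that the loader needs both directions spelled out for each custom type; Pig will not infer them from toString() or from the type being Writable.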

D


On Tue, Sep 17, 2013 at 9:20 AM, Yang <[email protected]> wrote:

> Thanks Pradeep.
>
> it seems in this case just using scala/cascalog is easier for my purposes.
> I tried out scala yesterday, works fine for me in local mode
>
>
> On Mon, Sep 16, 2013 at 7:47 PM, Pradeep Gollakota <[email protected]> wrote:
>
> > It doesn't look like the SequenceFileLoader from the piggybank has much
> > support for custom Writable types. The elephant bird version looks like it
> > does what you need it to do.
> >
> > https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
> >
> > You'll have to write the converters from your types to Pig data types and
> > pass them into the constructor of the SequenceFileLoader.
> >
> > Hope this helps!
> >
> >
> > On Mon, Sep 16, 2013 at 6:56 PM, Pradeep Gollakota <[email protected]> wrote:
> >
> > > That's correct...
> > >
> > > The "load ... AS (k:chararray, v:chararray);" doesn't actually do what you
> > > think it does. The AS clause tells Pig what the schema types are, so it
> > > will call the appropriate LoadCaster method to get each field into the
> > > right type. A LoadCaster object defines how to map byte[] into appropriate
> > > Pig datatypes. If the LoadFunc is not schema aware and you don't have the
> > > schema defined when you load, everything will be loaded as a bytearray.
> > >
> > > The problem you have is that the custom writable isn't a Pig datatype. I
> > > don't think you'll be able to do this without writing some custom code.
> > > I'll take a look at the source code for the SequenceFileLoader and see if
> > > there's a way to specify your own LoadCaster. If there is, then you'll just
> > > have to write a custom LoadCaster and specify it in the configuration. If
> > > not, you'll have to extend and roll out your own SequenceFileLoader.
> > >
> > >
> > > On Mon, Sep 16, 2013 at 6:43 PM, Yang <[email protected]> wrote:
> > >
> > >> I think my custom type has toString(), well at least writable() says it's
> > >> writable to bytes, so supposedly if I force it to bytes or string, pig
> > >> should be able to cast
> > >> like
> > >>
> > >> load ... AS ( k:chararray, v:chararray);
> > >>
> > >> but this actually fails
> > >>
> > >>
> > >> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota <[email protected]> wrote:
> > >>
> > >> > The problem is that pig only speaks its data types. So you need to tell it
> > >> > how to translate from your custom writable to a pig datatype.
> > >> >
> > >> > Apparently elephant-bird has some support for doing this type of thing...
> > >> > take a look at this SO post
> > >> >
> > >> > http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
> > >> >
> > >> >
> > >> > On Mon, Sep 16, 2013 at 5:37 PM, Yang <[email protected]> wrote:
> > >> >
> > >> > > I tried to do a quick and dirty inspection of some of our data feeds, which
> > >> > > are encoded in gzipped SequenceFile.
> > >> > >
> > >> > > basically I did
> > >> > >
> > >> > > a = load 'myfile' using ......SequenceFileLoader() AS ( mykey, myvalue);
> > >> > >
> > >> > > but it gave me some error:
> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > >> > > 2013-09-16 17:34:28,915 [Thread-5] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > >> > > 2013-09-16 17:34:28,961 [Thread-5] WARN org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to translate key class com.mycompany.model.VisitKey to a Pig datatype
> > >> > > 2013-09-16 17:34:28,962 [Thread-5] WARN org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
> > >> > > 2013-09-16 17:34:28,963 [Thread-5] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> > >> > > org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class com.mycompany.model.VisitKey to a Pig datatype
> > >> > >  at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
> > >> > >  at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
> > >> > >
> > >> > >
> > >> > > in the pig file, I have already REGISTERED the jar that contains the class
> > >> > > com.mycompany.model.VisitKey
> > >> > >
> > >> > >
> > >> > > if PIG doesn't work, the only other approach is probably to use some of the
> > >> > > newer "pseudo-scripting" languages like cascalog or scala
> > >> > > thanks
> > >> > > Yang
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
