We managed to piece this together.  It's not fully generic (it assumes the
schema string describes a single field and converts only that first field),
but it gets the job done for unit testing.
--------------
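package com.civicscience.util;

import org.apache.pig.ResourceSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.impl.util.CastUtils;
import org.apache.pig.impl.util.Utils;
import org.apache.pig.newplan.logical.relational.LogicalSchema;

import java.io.IOException;

public class CSPigUtils {
    /**
     * Converts the text form of a Pig value (e.g. "{(apples,1024)}") into the
     * matching Pig object, using the first field of the given schema string.
     */
    public static Object getPigRepresentation(String schema, String data)
            throws IOException {
        // The same converter PigStorage uses to turn UTF-8 bytes into Pig types.
        Utf8StorageConverter caster = new Utf8StorageConverter();
        LogicalSchema ls = Utils.parseSchema(schema);
        ResourceSchema rs = new ResourceSchema(ls);
        ResourceSchema.ResourceFieldSchema[] fields = rs.getFields();
        // Only the first field is converted; any further fields are ignored.
        // Encode explicitly as UTF-8 instead of relying on the platform default.
        return CastUtils.convertToType(caster, data.getBytes("UTF-8"), fields[0],
                fields[0].getType());
    }
}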
---------------
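
For contrast, here is a rough sketch of the hand-rolled stack approach
described in the quoted thread below.  It is not the UDF parser we wrote, just
an illustration that uses plain Java collections and no Pig classes.  The
names BagTextSketch and parse are made up for this example.  It builds nested
Lists of Strings, ignores the schema entirely, does not handle '#' maps or
empty fields, and does not check that '{' is closed by '}' rather than ')'.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of Pig: turns "{(a,(b,d),f)}" into nested Lists.
class BagTextSketch {

    /** Returns a nested List for bracketed input, or the raw String for a scalar. */
    static Object parse(String text) {
        Deque<List<Object>> open = new ArrayDeque<List<Object>>(); // containers still being filled
        StringBuilder token = new StringBuilder();                 // current scalar field
        Object result = null;

        for (char c : text.toCharArray()) {
            switch (c) {
                case '{':
                case '(':
                    // A bag or tuple starts: push a fresh container.
                    open.push(new ArrayList<Object>());
                    break;
                case '}':
                case ')': {
                    if (open.isEmpty()) {
                        throw new IllegalArgumentException("Unbalanced input: " + text);
                    }
                    flush(token, open.peek());
                    // The container is complete: attach it to its parent, if any.
                    List<Object> closed = open.pop();
                    if (open.isEmpty()) {
                        result = closed;
                    } else {
                        open.peek().add(closed);
                    }
                    break;
                }
                case ',':
                    if (open.isEmpty()) {
                        throw new IllegalArgumentException("Separator outside a container: " + text);
                    }
                    flush(token, open.peek());
                    break;
                default:
                    token.append(c);
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unbalanced input: " + text);
        }
        return result != null ? result : token.toString();
    }

    // Adds the pending scalar (if any) to the container and resets the buffer.
    // Empty tokens are skipped, so empty fields are silently dropped.
    private static void flush(StringBuilder token, List<Object> target) {
        if (token.length() > 0) {
            target.add(token.toString());
            token.setLength(0);
        }
    }
}
```

Calling BagTextSketch.parse("{(a,(b,d),f)}") yields one outer list holding one
inner list with the elements a, [b, d] and f.  The weakness mentioned in the
thread shows up as soon as a field contains one of the delimiter characters:
a serialized JSON chararray such as {"k":1} inside a tuple is read as a nested
container rather than as a single field.  That is the main reason we preferred
going through Utf8StorageConverter with a schema.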


On Tue, Mar 19, 2013 at 1:20 PM, Dan DeCapria, CivicScience <
[email protected]> wrote:

> I'll give it an honest try, and any additional from the community is
> greatly appreciated!
>
> I've been on this idea for a few days now.  I even implemented my own UDF
> parser by converting the input to a char[] array and pushing/popping on a
> Stack of Node Objects to generate the nested inner complex DataTypes as a
> Node tree. This worked well from a Node-linking standpoint, with a DFS
> traversal on the Node tree to rebuild the DataBag Object. But it has
> its caveats, as I have to create a UDF to generate the input for another
> input, and it assumes the fields are type safe from elements "{(})#," which
> isn't the case (ie, a serialized json chararray for a field). So I was
> hoping for a more OTS solution using existing classes and methods given the
> String and its Schema a priori.
>
> Thank you for your help, and I'll keep this post updated on my progress
> towards a solution.
>
> -Dan
>
> On Tue, Mar 19, 2013 at 12:54 PM, Jonathan Coveney <[email protected]
> >wrote:
>
> > Ack, hit enter. I'd look at the LoadFunc interface, the PigStorage class,
> > and if you can't make it work without playing a little, let me know.
> >
> >
> > 2013/3/19 Jonathan Coveney <[email protected]>
> >
> > > doing "new PigStorage()" is possible, but tricky. Maybe some of the
> other
> > > contributors have an easier way of doing this, but in the short term,
> I'd
> > > work on getting that to work. It's mainly just making sure you
> initialize
> > > it properly.
> > >
> > >
> > > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> > >
> > >> This would work, but the goal would be to *not* invoke local
> interactive
> > >> pig to execute a LOAD USING PigStorage() and pass the data into the
> UDF.
> > >>  I
> > >> was hoping to keep this completely in the Java and JUnit testing
> > universe.
> > >>
> > >> Looking over the PigStorage()
> > >> doc<
> > >>
> >
> https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
> > >> >,
> > >> would you know how to construct this process from a baseline
> PigStorage
> > >> Object, such as:
> > >>
> > >> PigStorage pigstorage = new PigStorage();
> > >>
> > >> Any ideas?
> > >>
> > >> -Dan
> > >>
> > >> On Tue, Mar 19, 2013 at 12:08 PM, Jonathan Coveney <
> [email protected]
> > >> >wrote:
> > >>
> > >> > I definitely understand the benefits, I just wanted to understand
> your
> > >> > workflow so could weigh in with what I would do.
> > >> >
> > >> > In your case, if you're going to be making these by hand, then I
> would
> > >> > mimic what PigStorage outputs, and then just load it in using
> > >> PigStorage.
> > >> >
> > >> >
> > >> > 2013/3/19 Dan DeCapria, CivicScience <[email protected]
> >
> > >> >
> > >> > > By hand; creating a new JUnit method to test a specific use case
> > >> against
> > >> > a
> > >> > > functional requirement in the UDF.
> > >> > >
> > >> > > The UDFs I am testing are part of a larger ETL testing initiative
> I
> > >> have
> > >> > > been undertaking.  To ensure that the various states of legacy
> data
> > >> are
> > >> > > correctly extracted and transformed into a Pig context, I am
> > creating
> > >> > > specific JUnit tests per each UDF containing specific use cases as
> > >> > testing
> > >> > > methods.
> > >> > >
> > >> > > Motivation to use String inputs for the Data Objects and Schema
> > >> Objects
> > >> > is
> > >> > > the improvement on the conventional approach - creating Java
> Objects
> > >> and
> > >> > > adding and appending nested Objects to create the desired complex
> > type
> > >> > > DataBag Object to pass to the UDF as use case input. This simpler
> > >> process
> > >> > > I'm looking for should improve scalability and rapid prototyping
> > >> within
> > >> > > the testing scripts.  It will also make the process more
> > approachable
> > >> for
> > >> > > another programmer to write additional unit tests.
> > >> > >
> > >> > > -Dan
> > >> > >
> > >> > > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney <
> > >> [email protected]
> > >> > > >wrote:
> > >> > >
> > >> > > > How are you planning on generating these cases? By hand? Or
> > >> automated?
> > >> > > >
> > >> > > >
> > >> > > > 2013/3/19 Dan DeCapria, CivicScience <
> > [email protected]
> > >> >
> > >> > > >
> > >> > > > > String string_databag in this example was typed out by me, as
> > the
> > >> > input
> > >> > > > > String for a JUnit test method. I am considering generating
> many
> > >> of
> > >> > > these
> > >> > > > > for case specific unit testing of my UDFs.
> > >> > > > >
> > >> > > > > -Dan
> > >> > > > >
> > >> > > > > On Tue, Mar 19, 2013 at 11:27 AM, Jonathan Coveney <
> > >> > [email protected]
> > >> > > > > >wrote:
> > >> > > > >
> > >> > > > > > how was string_databag generated?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > 2013/3/19 Dan DeCapria, CivicScience <
> > >> > [email protected]>
> > >> > > > > >
> > >> > > > > > > Expanding upon this, the following use case's Schema
> Object
> > >> can
> > >> > be
> > >> > > > > > resolved
> > >> > > > > > > from inputs:
> > >> > > > > > >
> > >> > > > > > >         String string_databag = "{(a,(b,d),f)}";
> > >> > > > > > >         String string_schema =
> > >> > > > > > >
> > >> > >
> "b1:bag{t1:tuple(a:chararray,t2:tuple(b:chararray,d:long),f:long)}";
> > >> > > > > > >         Schema schema =
> > >> Utils.getSchemaFromString(string_schema);
> > >> > > > > > >
> > >> > > > > > > Next step is to resolve a DataBag Object from String
> > >> > string_databag
> > >> > > > and
> > >> > > > > > the
> > >> > > > > > > Schema Object.
> > >> > > > > > >
> > >> > > > > > > -Dan
> > >> > > > > > >
> > >> > > > > > > On Tue, Mar 19, 2013 at 9:37 AM, Dan DeCapria,
> CivicScience
> > <
> > >> > > > > > > [email protected]> wrote:
> > >> > > > > > >
> > >> > > > > > > > Thank you for your reply.
> > >> > > > > > > >
> > >> > > > > > > > The problem is I cannot find a methodology to go from a
> > >> String
> > >> > > > > > > > representation of a complex data type to a nested Object
> > of
> > >> pig
> > >> > > > > > > DataTypes.
> > >> > > > > > > > I looked over the pig 0.10.1 docs, but cannot find a way
> > to
> > >> go
> > >> > > from
> > >> > > > > > > String
> > >> > > > > > > > and Schema to pig DataType Object.
> > >> > > > > > > >
> > >> > > > > > > > For context, I am generating these Strings for my own
> > JUnit
> > >> > > testing
> > >> > > > > of
> > >> > > > > > > > other UDFs.  Currently, for complex types, I have to
> > >> generate
> > >> > > each
> > >> > > > > > > nesting
> > >> > > > > > > > from Tuple and DataBag factories, append data, and nest
> > them
> > >> > > > > manually.
> > >> > > > > > >  For
> > >> > > > > > > > larger unit tests, this process becomes unwieldy
> (hundreds
> > >> of
> > >> > > lines
> > >> > > > > per
> > >> > > > > > > > method, non-dynamic), and it would be much simpler to go
> > >> > directly
> > >> > > > > from
> > >> > > > > > a
> > >> > > > > > > > String and a Schema to a DataBag Object for UDF testing
> > (few
> > >> > > lines
> > >> > > > of
> > >> > > > > > > code,
> > >> > > > > > > > easily modifiable).
> > >> > > > > > > >
> > >> > > > > > > > -Dan
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Mar 18, 2013 at 6:31 PM, Jonathan Coveney <
> > >> > > > > [email protected]
> > >> > > > > > > >wrote:
> > >> > > > > > > >
> > >> > > > > > > >> Why not just use PigStorage? This is essentially what
> it
> > >> does.
> > >> > > It
> > >> > > > > > saves
> > >> > > > > > > a
> > >> > > > > > > >> bag as text, and then loads it again.
> > >> > > > > > > >>
> > >> > > > > > > >> I suppose the question becomes: why do you need to do
> > this?
> > >> > > > > > > >>
> > >> > > > > > > >>
> > >> > > > > > > >> 2013/3/18 Dan DeCapria, CivicScience <
> > >> > > > [email protected]
> > >> > > > > >
> > >> > > > > > > >>
> > >> > > > > > > >> > In Java, I am trying to convert a DataBag from its
> > >> String
> > >> > > > > > > >> representation
> > >> > > > > > > >> > with its schema String to a valid DataBag Object:
> > >> > > > > > > >> >
> > >> > > > > > > >> > String databag_string = "{(apples,1024)}";
> > >> > > > > > > >> > String schema_string =
> > >> > "b1:bag{t1:tuple(a:chararray,b:long)}";
> > >> > > > > > > >> >
> > >> > > > > > > >> > I've tried implementing something along the lines of
> > >> this,
> > >> > > but I
> > >> > > > > > > believe
> > >> > > > > > > >> > it's in the wrong direction, and then I get stuck:
> > >> > > > > > > >> >
> > >> > > > > > > >> >         String[] aliases = {"b1", "t1", "a", "b"};
> > >> > > > > > > >> >         byte[] types = {DataType.BAG, DataType.TUPLE,
> > >> > > > > > > >> DataType.CHARARRAY,
> > >> > > > > > > >> > DataType.LONG};
> > >> > > > > > > >> >         List<Schema.FieldSchema> fsList = new
> > >> > > > > > > >> > ArrayList<Schema.FieldSchema>();
> > >> > > > > > > >> >         for (int i = 0; i < aliases.length; i++) {
> > >> > > > > > > >> >             fsList.add(new
> > Schema.FieldSchema(aliases[i],
> > >> > > > > > types[i])) ;
> > >> > > > > > > >> >         }
> > >> > > > > > > >> >         Schema origSchema = new Schema(fsList);
> > >> > > > > > > >> >         ResourceSchema rsSchema = new
> > >> > > > ResourceSchema(origSchema);
> > >> > > > > > > >> >         Schema genSchema =
> > Schema.getPigSchema(rsSchema);
> > >> > > > > > > >> >         ResourceSchema.ResourceFieldSchema[]
> rfschema =
> > >> > > > > > > >> > rsSchema.getFields();
> > >> > > > > > > >> >         ... lost here, maybe Utf8StorageConverter c =
> > new
> > >> > > > > > > >> > Utf8StorageConverter(); ???
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > An ideal process would be along the lines of:
> > >> > > > > > > >> >
> > >> > > > > > > >> > DataBag d = BagFactory.getInstance().newDefaultBag();
> > >> > > > > > > >> > d.something(databag_string, schema_string);    // ???
> > no
> > >> > idea
> > >> > > > what
> > >> > > > > > > this
> > >> > > > > > > >> > process could be
> > >> > > > > > > >> > d.toString().equals(databag_string) == true.
> > >> > > > > > > >> >
> > >> > > > > > > >> > Thanks, -Dan
> > >> > > > > > > >> >
> > >> > > > > > > >>
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > > Dan DeCapria
> > >> > > > > > > > CivicScience, Inc.
> > >> > > > > > > > Senior Informatics / DM / ML / BI Specialist
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > Dan DeCapria
> > >> > > > > > > CivicScience, Inc.
> > >> > > > > > > Senior Informatics / DM / ML / BI Specialist
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Dan DeCapria
> > >> > > > > CivicScience, Inc.
> > >> > > > > Senior Informatics / DM / ML / BI Specialist
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Dan DeCapria
> > >> > > CivicScience, Inc.
> > >> > > Senior Informatics / DM / ML / BI Specialist
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Dan DeCapria
> > >> CivicScience, Inc.
> > >> Senior Informatics / DM / ML / BI Specialist
> > >>
> > >
> > >
> >
>
>
>
> --
> Dan DeCapria
> CivicScience, Inc.
> Senior Informatics / DM / ML / BI Specialist
>
