I'll give it an honest try, and any additional help from the community is
greatly appreciated!
I've been working on this idea for a few days now. I even implemented my own
UDF parser by converting the input to a char[] array and pushing/popping on a
Stack of Node Objects to generate the nested complex DataTypes as a
Node tree. This worked well from a Node-linking standpoint, with a DFS
traversal of the Node tree to rebuild the DataBag Object. But it has
its caveats: I have to create a UDF to generate the input for another
UDF, and it assumes that no field contains any of the characters "{(})#,",
which isn't the case (e.g., a serialized JSON chararray field). So I was
hoping for a more off-the-shelf solution using existing classes and methods,
given the String and its Schema a priori.
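
For illustration only, here is a minimal sketch of the stack-based idea
described above. It is not the actual parser from this thread: the class name
NestedTextParser and its parse method are hypothetical, it uses plain
java.util Lists instead of Pig's DataBag and Tuple types, it ignores the
Schema and the map marker '#', and it does not check that bracket types match.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: parses text such as "{(a,(b,d),f)}" into nested Lists. */
final class NestedTextParser {

    private NestedTextParser() {}

    /**
     * Bags "{...}" and tuples "(...)" both become List<Object>; leaf fields stay Strings.
     * Assumption: no field value contains any of the characters { } ( ) ,
     */
    static Object parse(String text) {
        Deque<List<Object>> stack = new ArrayDeque<>(); // containers still open
        StringBuilder field = new StringBuilder();      // leaf field being read
        Object result = null;
        boolean justClosed = false; // true right after a nested container closed

        for (char c : text.toCharArray()) {
            switch (c) {
                case '{':
                case '(':
                    stack.push(new ArrayList<>());
                    justClosed = false;
                    break;
                case ',':
                    if (stack.isEmpty()) {
                        throw new IllegalArgumentException("comma outside a container");
                    }
                    // A comma ends a leaf field unless it follows a nested container.
                    if (!justClosed) {
                        stack.peek().add(field.toString());
                    }
                    field.setLength(0);
                    justClosed = false;
                    break;
                case '}':
                case ')': {
                    if (stack.isEmpty()) {
                        throw new IllegalArgumentException("unbalanced input: " + text);
                    }
                    List<Object> done = stack.pop();
                    // Flush the last leaf field, but keep "()" and "{}" empty.
                    if (!justClosed && (field.length() > 0 || !done.isEmpty())) {
                        done.add(field.toString());
                    }
                    field.setLength(0);
                    if (stack.isEmpty()) {
                        result = done;          // outermost container finished
                    } else {
                        stack.peek().add(done); // link child into its parent
                    }
                    justClosed = true;
                    break;
                }
                default:
                    field.append(c);
                    justClosed = false;
            }
        }
        if (!stack.isEmpty() || result == null) {
            throw new IllegalArgumentException("unbalanced input: " + text);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("{(a,(b,d),f)}")); // prints [[a, [b, d], f]]
    }
}
```

The sketch also shows the caveat mentioned above: an input such as
{({"k":1,"v":2})} is split at the braces and comma inside the JSON text, so
the field comes back as a nested container rather than as a single string.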
Thank you for your help, and I'll keep this post updated on my progress
towards a solution.
-Dan
On Tue, Mar 19, 2013 at 12:54 PM, Jonathan Coveney <[email protected]> wrote:
> Ack, hit enter. I'd look at the LoadFunc interface, the PigStorage class,
> and if you can't make it work without playing a little, let me know.
>
>
> 2013/3/19 Jonathan Coveney <[email protected]>
>
> > doing "new PigStorage()" is possible, but tricky. Maybe some of the other
> > contributors have an easier way of doing this, but in the short term, I'd
> > work on getting that to work. It's mainly just making sure you initialize
> > it properly.
> >
> >
> > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> >
> >> This would work, but the goal would be to *not* invoke local interactive
> >> pig to execute a LOAD USING PigStorage() and pass the data into the UDF.
> >> I was hoping to keep this completely in the Java and JUnit testing
> >> universe.
> >>
> >> Looking over the PigStorage() doc
> >> <https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html>,
> >> would you know how to construct this process from a baseline PigStorage
> >> Object, such as:
> >>
> >> PigStorage pigstorage = new PigStorage();
> >>
> >> Any ideas?
> >>
> >> -Dan
> >>
> >> On Tue, Mar 19, 2013 at 12:08 PM, Jonathan Coveney <[email protected]> wrote:
> >>
> >> > I definitely understand the benefits, I just wanted to understand your
> >> > workflow so I could weigh in with what I would do.
> >> >
> >> > In your case, if you're going to be making these by hand, then I would
> >> > mimic what PigStorage outputs, and then just load it in using
> >> > PigStorage.
> >> >
> >> >
> >> > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> >> >
> >> > > By hand; creating a new JUnit method to test a specific use case
> >> > > against a functional requirement in the UDF.
> >> > >
> >> > > The UDFs I am testing are part of a larger ETL testing initiative I
> >> > > have been undertaking. To ensure that the various states of legacy
> >> > > data are correctly extracted and transformed into a Pig context, I am
> >> > > creating specific JUnit tests for each UDF, containing specific use
> >> > > cases as testing methods.
> >> > >
> >> > > The motivation for using String inputs for the Data Objects and Schema
> >> > > Objects is to improve on the conventional approach: creating Java
> >> > > Objects and adding and appending nested Objects to create the desired
> >> > > complex type DataBag Object to pass to the UDF as use case input. The
> >> > > simpler process I'm looking for should improve scalability and
> >> > > rapid prototyping within the testing scripts. It will also make the
> >> > > process more approachable for another programmer to write additional
> >> > > unit tests.
> >> > >
> >> > > -Dan
> >> > >
> >> > > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney <[email protected]> wrote:
> >> > >
> >> > > > How are you planning on generating these cases? By hand? Or
> >> > > > automated?
> >> > > >
> >> > > >
> >> > > > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> >> > > >
> >> > > > > String string_databag in this example was typed out by me, as the
> >> > > > > input String for a JUnit test method. I am considering generating
> >> > > > > many of these for case specific unit testing of my UDFs.
> >> > > > >
> >> > > > > -Dan
> >> > > > >
> >> > > > > On Tue, Mar 19, 2013 at 11:27 AM, Jonathan Coveney <[email protected]> wrote:
> >> > > > >
> >> > > > > > how was string_databag generated?
> >> > > > > >
> >> > > > > >
> >> > > > > > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> >> > > > > >
> >> > > > > > > Expanding upon this, the following use case's Schema Object
> >> > > > > > > can be resolved from inputs:
> >> > > > > > >
> >> > > > > > > String string_databag = "{(a,(b,d),f)}";
> >> > > > > > > String string_schema = "b1:bag{t1:tuple(a:chararray,t2:tuple(b:chararray,d:long),f:long)}";
> >> > > > > > > Schema schema = Utils.getSchemaFromString(string_schema);
> >> > > > > > >
> >> > > > > > > Next step is to resolve a DataBag Object from String
> >> > > > > > > string_databag and the Schema Object.
> >> > > > > > >
> >> > > > > > > -Dan
> >> > > > > > >
> >> > > > > > > On Tue, Mar 19, 2013 at 9:37 AM, Dan DeCapria, CivicScience <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > Thank you for your reply.
> >> > > > > > > >
> >> > > > > > > > The problem is I cannot find a methodology to go from a
> >> > > > > > > > String representation of a complex data type to a nested
> >> > > > > > > > Object of pig DataTypes. I looked over the pig 0.10.1 docs,
> >> > > > > > > > but cannot find a way to go from String and Schema to pig
> >> > > > > > > > DataType Object.
> >> > > > > > > >
> >> > > > > > > > For context, I am generating these Strings for my own JUnit
> >> > > > > > > > testing of other UDFs. Currently, for complex types, I have
> >> > > > > > > > to generate each nesting from Tuple and DataBag factories,
> >> > > > > > > > append data, and nest them manually. For larger unit tests,
> >> > > > > > > > this process becomes unwieldy (hundreds of lines per method,
> >> > > > > > > > non-dynamic), and it would be much simpler to go directly
> >> > > > > > > > from a String and a Schema to a DataBag Object for UDF
> >> > > > > > > > testing (few lines of code, easily modifiable).
> >> > > > > > > >
> >> > > > > > > > -Dan
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Mon, Mar 18, 2013 at 6:31 PM, Jonathan Coveney <[email protected]> wrote:
> >> > > > > > > >
> >> > > > > > > >> Why not just use PigStorage? This is essentially what it
> >> > > > > > > >> does. It saves a bag as text, and then loads it again.
> >> > > > > > > >>
> >> > > > > > > >> I suppose the question becomes: why do you need to do this?
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> 2013/3/18 Dan DeCapria, CivicScience <[email protected]>
> >> > > > > > > >>
> >> > > > > > > >> > In Java, I am trying to convert a DataBag from its String
> >> > > > > > > >> > representation with its schema String to a valid DataBag
> >> > > > > > > >> > Object:
> >> > > > > > > >> >
> >> > > > > > > >> > String databag_string = "{(apples,1024)}";
> >> > > > > > > >> > String schema_string = "b1:bag{t1:tuple(a:chararray,b:long)}";
> >> > > > > > > >> >
> >> > > > > > > >> > I've tried implementing something along the lines of this,
> >> > > > > > > >> > but I believe it's in the wrong direction, and then I get
> >> > > > > > > >> > stuck:
> >> > > > > > > >> >
> >> > > > > > > >> > String[] aliases = {"b1", "t1", "a", "b"};
> >> > > > > > > >> > byte[] types = {DataType.BAG, DataType.TUPLE, DataType.CHARARRAY, DataType.LONG};
> >> > > > > > > >> > List<Schema.FieldSchema> fsList = new ArrayList<Schema.FieldSchema>();
> >> > > > > > > >> > for (int i = 0; i < aliases.length; i++) {
> >> > > > > > > >> >     fsList.add(new Schema.FieldSchema(aliases[i], types[i]));
> >> > > > > > > >> > }
> >> > > > > > > >> > Schema origSchema = new Schema(fsList);
> >> > > > > > > >> > ResourceSchema rsSchema = new ResourceSchema(origSchema);
> >> > > > > > > >> > Schema genSchema = Schema.getPigSchema(rsSchema);
> >> > > > > > > >> > ResourceSchema.ResourceFieldSchema[] rfschema = rsSchema.getFields();
> >> > > > > > > >> > ... lost here, maybe Utf8StorageConverter c = new Utf8StorageConverter(); ???
> >> > > > > > > >> >
> >> > > > > > > >> > An ideal process would be along the lines of:
> >> > > > > > > >> >
> >> > > > > > > >> > DataBag d = BagFactory.getInstance().newDefaultBag();
> >> > > > > > > >> > d.something(databag_string, schema_string); // ??? no idea what this process could be
> >> > > > > > > >> > d.toString().equals(databag_string) == true.
> >> > > > > > > >> >
> >> > > > > > > >> > Thanks, -Dan
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Dan DeCapria
> >> > > > > > > > CivicScience, Inc.
> >> > > > > > > > Senior Informatics / DM / ML / BI Specialist
--
Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist