This would work, but the goal is to *not* invoke a local interactive Pig session to execute a LOAD USING PigStorage() and pass the data into the UDF. I was hoping to keep this entirely within the Java and JUnit testing universe.
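
For illustration only, a minimal sketch of what a pure-Java route could look like: a small recursive-descent parser that turns the text form of a bag into nested java.util.List values with String leaves. BagTextParser and its methods are hypothetical names, not part of Pig. The sketch assumes plain Lists as stand-ins for DataBag and Tuple, ignores the schema (so 1024 stays a String rather than becoming a long), and expects no whitespace around the delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Pig: parses the text form of a bag,
// e.g. "{(a,(b,d),f)}", into nested Lists with String leaves.
final class BagTextParser {
    private final String text;
    private int pos = 0;

    private BagTextParser(String text) {
        this.text = text;
    }

    /** Parses one complete bag, tuple, or atom and rejects trailing input. */
    static Object parse(String text) {
        BagTextParser p = new BagTextParser(text.trim());
        Object value = p.value();
        if (p.pos != p.text.length()) {
            throw new IllegalArgumentException("Unexpected input at index " + p.pos);
        }
        return value;
    }

    // value := bag | tuple | atom
    private Object value() {
        if (peek() == '{') return group('{', '}');
        if (peek() == '(') return group('(', ')');
        return atom();
    }

    // Bags and tuples share one shape here: open, comma-separated values, close.
    private List<Object> group(char open, char close) {
        expect(open);
        List<Object> items = new ArrayList<>();
        if (peek() == close) {   // empty bag or tuple
            pos++;
            return items;
        }
        do {
            items.add(value());
        } while (consumeIf(','));
        expect(close);
        return items;
    }

    // An atom runs until the next ',' ')' '}' or the end of the input.
    private String atom() {
        int start = pos;
        while (pos < text.length() && ",)}".indexOf(text.charAt(pos)) < 0) {
            pos++;
        }
        return text.substring(start, pos).trim();
    }

    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }

    private boolean consumeIf(char c) {
        if (peek() == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at index " + pos);
        }
        pos++;
    }
}
```

In a JUnit method, BagTextParser.parse("{(a,(b,d),f)}") would return a one-element outer list whose single tuple holds "a", a nested list of "b" and "d", and "f". Mapping those lists onto real Tuple and DataBag instances, and casting each leaf according to the Schema, would be the remaining step.
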

Looking over the PigStorage() doc
<https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html>,
would you know how to construct this process from a baseline PigStorage
Object, such as:

    PigStorage pigstorage = new PigStorage();

Any ideas?

-Dan

On Tue, Mar 19, 2013 at 12:08 PM, Jonathan Coveney <[email protected]> wrote:

> I definitely understand the benefits, I just wanted to understand your
> workflow so I could weigh in with what I would do.
>
> In your case, if you're going to be making these by hand, then I would
> mimic what PigStorage outputs, and then just load it in using PigStorage.
>
> 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
>
> > By hand; creating a new JUnit method to test a specific use case against
> > a functional requirement in the UDF.
> >
> > The UDFs I am testing are part of a larger ETL testing initiative I have
> > been undertaking. To ensure that the various states of legacy data are
> > correctly extracted and transformed into a Pig context, I am creating
> > specific JUnit tests per each UDF containing specific use cases as
> > testing methods.
> >
> > Motivation to use String inputs for the Data Objects and Schema Objects
> > is the improvement on the conventional approach - creating Java Objects
> > and adding and appending nested Objects to create the desired complex
> > type DataBag Object to pass to the UDF as use case input. This simpler
> > process I'm looking for should improve scalability and rapid prototyping
> > within the testing scripts. It will also make the process more
> > approachable for another programmer to write additional unit tests.
> >
> > -Dan
> >
> > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney <[email protected]> wrote:
> >
> > > How are you planning on generating these cases? By hand? Or automated?
> > >
> > > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> > >
> > > > String string_databag in this example was typed out by me, as the
> > > > input String for a JUnit test method. I am considering generating
> > > > many of these for case specific unit testing of my UDFs.
> > > >
> > > > -Dan
> > > >
> > > > On Tue, Mar 19, 2013 at 11:27 AM, Jonathan Coveney <[email protected]> wrote:
> > > >
> > > > > how was string_databag generated?
> > > > >
> > > > > 2013/3/19 Dan DeCapria, CivicScience <[email protected]>
> > > > >
> > > > > > Expanding upon this, the following use case's Schema Object can
> > > > > > be resolved from inputs:
> > > > > >
> > > > > >     String string_databag = "{(a,(b,d),f)}";
> > > > > >     String string_schema = "b1:bag{t1:tuple(a:chararray,t2:tuple(b:chararray,d:long),f:long)}";
> > > > > >     Schema schema = Utils.getSchemaFromString(string_schema);
> > > > > >
> > > > > > Next step is to resolve a DataBag Object from String
> > > > > > string_databag and the Schema Object.
> > > > > >
> > > > > > -Dan
> > > > > >
> > > > > > On Tue, Mar 19, 2013 at 9:37 AM, Dan DeCapria, CivicScience <[email protected]> wrote:
> > > > > >
> > > > > > > Thank you for your reply.
> > > > > > >
> > > > > > > The problem is I cannot find a methodology to go from a String
> > > > > > > representation of a complex data type to a nested Object of pig
> > > > > > > DataTypes. I looked over the pig 0.10.1 docs, but cannot find a
> > > > > > > way to go from String and Schema to pig DataType Object.
> > > > > > >
> > > > > > > For context, I am generating these Strings for my own JUnit
> > > > > > > testing of other UDFs. Currently, for complex types, I have to
> > > > > > > generate each nesting from Tuple and DataBag factories, append
> > > > > > > data, and nest them manually. For larger unit tests, this
> > > > > > > process becomes unwieldy (hundreds of lines per method,
> > > > > > > non-dynamic), and it would be much simpler to go directly from
> > > > > > > a String and a Schema to a DataBag Object for UDF testing (few
> > > > > > > lines of code, easily modifiable).
> > > > > > >
> > > > > > > -Dan
> > > > > > >
> > > > > > > On Mon, Mar 18, 2013 at 6:31 PM, Jonathan Coveney <[email protected]> wrote:
> > > > > > >
> > > > > > > > Why not just use PigStorage? This is essentially what it
> > > > > > > > does. It saves a bag as text, and then loads it again.
> > > > > > > >
> > > > > > > > I suppose the question becomes: why do you need to do this?
> > > > > > > >
> > > > > > > > 2013/3/18 Dan DeCapria, CivicScience <[email protected]>
> > > > > > > >
> > > > > > > > > In Java, I am trying to convert a DataBag from its String
> > > > > > > > > representation with its schema String to a valid DataBag
> > > > > > > > > Object:
> > > > > > > > >
> > > > > > > > >     String databag_string = "{(apples,1024)}";
> > > > > > > > >     String schema_string = "b1:bag{t1:tuple(a:chararray,b:long)}";
> > > > > > > > >
> > > > > > > > > I've tried implementing something along the lines of this,
> > > > > > > > > but I believe it's in the wrong direction, and then I get
> > > > > > > > > stuck:
> > > > > > > > >
> > > > > > > > >     String[] aliases = {"b1", "t1", "a", "b"};
> > > > > > > > >     byte[] types = {DataType.BAG, DataType.TUPLE, DataType.CHARARRAY, DataType.LONG};
> > > > > > > > >     List<Schema.FieldSchema> fsList = new ArrayList<Schema.FieldSchema>();
> > > > > > > > >     for (int i = 0; i < aliases.length; i++) {
> > > > > > > > >         fsList.add(new Schema.FieldSchema(aliases[i], types[i]));
> > > > > > > > >     }
> > > > > > > > >     Schema origSchema = new Schema(fsList);
> > > > > > > > >     ResourceSchema rsSchema = new ResourceSchema(origSchema);
> > > > > > > > >     Schema genSchema = Schema.getPigSchema(rsSchema);
> > > > > > > > >     ResourceSchema.ResourceFieldSchema[] rfschema = rsSchema.getFields();
> > > > > > > > >     ... lost here, maybe Utf8StorageConverter c = new Utf8StorageConverter(); ???
> > > > > > > > >
> > > > > > > > > An ideal process would be along the lines of:
> > > > > > > > >
> > > > > > > > >     DataBag d = BagFactory.getInstance().newDefaultBag();
> > > > > > > > >     d.something(databag_string, schema_string); // ??? no idea what this process could be
> > > > > > > > >     d.toString().equals(databag_string) == true.
> > > > > > > > >
> > > > > > > > > Thanks, -Dan

--
Dan DeCapria
CivicScience, Inc.
Senior Informatics / DM / ML / BI Specialist
