First poster here! Really excited to get some feedback and contribute to
Pig!
I am trying to simplify how I build UDF inputs so that my JUnit tests
scale. So far, to create a valid Pig input for a UDF under test, I have had
to build every nesting level of the input by hand from
org.apache.pig.data.* constructs, separately for each use case. I am
looking for a quicker way to do this that scales to additional unit tests.
A use case is defined below:
Assume the input schema is defined a priori, and that outputSchema is
properly defined in the UDF under test. From running illustrate on the
prior Pig process, I have the InputData in the form of InputSchema, as
expected by the UDF under test. Conceptually, the unit testing approach is
as follows:
InputSchema:
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}
OutputSchema:
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray),tuple_d2:tuple(field_c:chararray,field_d:chararray)),field_e:chararray)}
Prior (non-scalable) methodology:
Create bag_a DataBag.
Create tuple_b Tuple.
Create tuple_c1 Tuple.
Create tuple_d1 Tuple.
Append data field_a to tuple_d1. Append data field_b to tuple_d1.
Append tuple_d1 to tuple_c1.
Append tuple_c1 to tuple_b. Append data field_e to tuple_b.
Append tuple_b to bag_a.
Unit test UDF(bag_a).
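For illustration, the step-by-step construction above could be collapsed into one nested expression with two small varargs helpers. This is only a sketch under assumptions: plain java.util.List values stand in for Pig's Tuple and DataBag so the example runs without Pig on the classpath, and the names NestedInputSketch, tuple, bag and sampleBagA, as well as the sample field values, are hypothetical. In a real test the helper bodies would presumably delegate to TupleFactory.getInstance().newTuple(List) and BagFactory.getInstance().newDefaultBag(List<Tuple>).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helpers; java.util.List stands in for Pig's Tuple and DataBag. */
final class NestedInputSketch {

    /** Stand-in for a Pig Tuple: an ordered list of fields. */
    static List<Object> tuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    /** Stand-in for a Pig DataBag: a list of tuples. */
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return new ArrayList<>(Arrays.asList(tuples));
    }

    /** Builds bag_a{tuple_b(tuple_c1(tuple_d1(field_a, field_b)), field_e)}. */
    static List<List<Object>> sampleBagA() {
        return bag(                                                // bag_a
            tuple(                                                 // tuple_b
                tuple(                                             // tuple_c1
                    tuple("field_a_value", "field_b_value")),      // tuple_d1
                "field_e_value"));                                 // field_e
    }
}
```

The nesting of the calls mirrors the nesting of the schema, so each new test case is a single expression rather than a sequence of create and append steps.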
Is there a way to 'pigify' the data String in InputSchema form, as it
appears in the illustrate output of the prior Pig process, so that it can
be fed into UDF(InputData) without performing the prior methodology
explicitly? An ideal solution would have the form:
Awesome methodology:
String_of_data_in_inputFormat:
bag_a:bag{tuple_b:tuple(tuple_c1:tuple(tuple_d1:tuple(field_a:chararray,field_b:chararray)),field_e:chararray)}
DataBag bag_a = pigify(String_of_data_in_inputFormat);
Unit test UDF(bag_a).
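To make the idea concrete, here is a minimal sketch of what such a pigify helper might look like, assuming the data string uses Pig's textual form with {...} for bags, (...) for tuples and commas between fields. It is a hypothetical stand-in, not an existing Pig API: the class name PigifySketch is invented, every scalar is kept as a String (matching the all-chararray schema above), and both bags and tuples are returned as plain java.util.List so the example runs without Pig. A Pig-backed version would presumably build a DataBag in the '{' branch and a Tuple in the '(' branch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses Pig-style literal text into nested lists. */
final class PigifySketch {
    private final String text;
    private int pos = 0;

    private PigifySketch(String text) {
        this.text = text;
    }

    /** Parses a bag "{...}", a tuple "(...)", or a bare scalar. */
    static Object pigify(String text) {
        PigifySketch parser = new PigifySketch(text.trim());
        Object value = parser.parseValue();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("Trailing input at index " + parser.pos);
        }
        return value;
    }

    private Object parseValue() {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == '{') {
            return parseItems('}');   // bag (a DataBag in a Pig-backed version)
        }
        if (pos < text.length() && text.charAt(pos) == '(') {
            return parseItems(')');   // tuple (a Tuple in a Pig-backed version)
        }
        return parseScalar();
    }

    /** Reads comma-separated values up to the given closing delimiter. */
    private List<Object> parseItems(char close) {
        pos++; // consume the opening '{' or '('
        List<Object> items = new ArrayList<>();
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == close) {
            pos++; // empty bag or tuple
            return items;
        }
        while (true) {
            items.add(parseValue());
            skipSpaces();
            if (pos >= text.length()) {
                throw new IllegalArgumentException("Missing '" + close + "'");
            }
            char c = text.charAt(pos++);
            if (c == close) {
                return items;
            }
            if (c != ',') {
                throw new IllegalArgumentException(
                    "Unexpected '" + c + "' at index " + (pos - 1));
            }
        }
    }

    /** Reads a scalar up to the next delimiter and returns it trimmed. */
    private String parseScalar() {
        int start = pos;
        while (pos < text.length() && ",(){}".indexOf(text.charAt(pos)) < 0) {
            pos++;
        }
        return text.substring(start, pos).trim();
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }
}
```

For the schema above, calling PigifySketch.pigify("{(((a,b)),e)}") returns nested lists that print as [[[[a, b]], e]]: one bag holding tuple_b, whose first field is tuple_c1 wrapping tuple_d1. The sketch does not handle quoting, escaping, maps or typed (non-chararray) fields.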
Thanks in advance,
-Dan DeCapria