Recently I uncovered a nasty situation in my data that caused an
IndexOutOfBoundsException.
I am including a sample Pig script and data (at the bottom) that
illustrate the problem.

Succinctly: records JOINed from one relation to another would throw an
IndexOutOfBoundsException if 1) the columns were derived from a
PigStorage() load of one large data:chararray followed by a STRSPLIT()
of that data into the proper number of columns, and 2) there were bad
records with too few fields (per the STRSPLIT delimiter).

This is interesting because if I were to use PigStorage() with the
delimiter directly, the bad records would be silently dropped and the
JOIN would proceed WITHOUT throwing an exception (which is always good).
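
As far as I can tell (an assumption on my part about PigStorage()'s
semantics, not something I've confirmed in the source), the short record
still loads, padded with nulls for the missing trailing fields, and the
inner JOIN then drops it because its key is null. A quick sketch to
check, using my_data from the script below:

     -- sketch: the bad record should surface with null trailing fields
     short_rows = FILTER my_data BY name IS NULL;
     DUMP short_rows;   -- I'd expect something like (23,JCVD,,)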

Once I discovered that the semantically equivalent (IMHO) notions of
loading a line as one big chararray and STRSPLIT()-ing on the delimiter
versus loading with PigStorage() and the delimiter directly are in fact
slightly different, I realized I had to use a workaround like this:

     GOOD_RECORDS = FILTER RELATION_FROM_STRSPLIT
                    BY SIZE(*) == <my expected column count>;

This was a silver lining of sorts, as now I could do something like

SPLIT RELATION_FROM_STRSPLIT INTO
     GOOD_RECORDS IF SIZE(*) == <my expected column count>,
     BAD_RECORDS  IF SIZE(*) != <my expected column count>;

and store the bad records for later analysis and remediation.
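
For completeness, here's roughly what that looks like in practice (a
sketch; the column count of 4 and the 'bad_records' output path are
just taken from my sample below):

     SPLIT my_data_from_split INTO
          good_records IF SIZE(*) == 4,
          bad_records  IF SIZE(*) != 4;

     STORE bad_records INTO 'bad_records' USING PigStorage(',');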

So, my questions: First, I feel I should file a bug for the exception
(exceptions are just never a good thing to see). Second, I am thinking
of applying the "load first, STRSPLIT second" pattern consistently
whenever I load my data, since it lets me report on bad data.

How does everyone feel about the performance of such a pattern? I would
think the difference should be negligible.

thanks for any insight,
daniel


Pig script
-----------

-- load with the delimiter given to PigStorage() directly
my_data = LOAD 'test.txt' USING PigStorage(',')
          AS (age       :int,
              eye_color :chararray,
              height    :int,
              name      :chararray);


-- load each line as one big chararray, then STRSPLIT on the delimiter
my_data_raw = LOAD 'test.txt' AS (data:chararray);
my_data_from_split = FOREACH my_data_raw GENERATE
        FLATTEN(STRSPLIT(data, ','))
        AS (age       :int,
            eye_color :chararray,
            height    :int,
            name      :chararray);


-- lookup relation for the JOINs below
my_names = LOAD 'name.txt' USING PigStorage(',')
           AS (name_key :chararray,
               first    :chararray,
               last     :chararray);

-- this one has no exception
joined = JOIN my_data BY name,
              my_names BY name_key;

-- this one throws an exception
bad_joined = JOIN my_data_from_split BY name,
                  my_names BY name_key;
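
-- Workaround applied to this script (a sketch; good_from_split and
-- fixed_joined are just illustrative names): keep only records with
-- the expected field count before joining. This join should then
-- proceed without the exception.
good_from_split = FILTER my_data_from_split BY SIZE(*) == 4;
fixed_joined = JOIN good_from_split BY name,
                    my_names BY name_key;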


-------- Sample test.txt ----
24,brown,56,daniel
24,blue,57,janice
34,blue,23,arthi
43,blue,53,john
33,brown,23,apu
33,brown,64,ponce
34,green,23,jeaninine
25,brown,23,rachael
35,brown,43,Wolde
32,brown,33,gregory
35,brown,53,vlad
23,brown,64,emilda
33,blue,43,ravi
33,green,53,brendan
15,blue,43,ravichandra
15,brown,46,leonor
18,blue,23,caeser
23,JCVD             <-- here is the bad data
33,blue,46,anthony
23,blue,13,xavier
18,blue,33,patrick
33,brown,44,sang
18,brown,45,ari
24,green,46,vance
33,brown,23,qi
29,green,24,eloise
33,blue ,29,elaine



--- Exception thrown ---
java.lang.IndexOutOfBoundsException: Index: 14, Size: 14
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:158)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getValueTuple(POFRJoin.java:403)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:261)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:241)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:241)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
