Hello,

New to Pig/Hadoop; using Cloudera CDH5.

I'm seeing some weird behavior that I am trying to figure out. When I try to JOIN
two relations, I get the error below stating that DataByteArray cannot be cast to
String.

I load the first relation with a schema where $0 is nppes:chararray and have
no problem, and I load the second relation the same way ($0 is npi:chararray), also
no problem. I can FILTER and STORE both relations without error, but when I try to
JOIN them I get the error. If I change both schemas to :bytearray, the JOIN works fine.
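My best guess from the stack trace: CSVExcelStorage hands every field to Pig as a bytearray, and declaring a type in the schema may only relabel the field rather than convert the underlying data, so at shuffle time the JOIN key is still a DataByteArray even though Pig expects a String. If that's what's happening, an explicit cast should force the conversion. A sketch of what I mean (untested, using the same paths as my script below):

```pig
n0 = LOAD '/user/nppes'
   USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- (chararray)$0 actually converts the bytes;
-- "AS nppes_id:chararray" alone may only label them
n = FOREACH n0 GENERATE (chararray)$0 AS nppes_id, (chararray)$47 AS specialty;
```

Does that match anyone's understanding of how the cast insertion works here?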
 

So, I thought it was an error in my files. I ran another Pig script loading
each file and applying a regex FILTER to validate the strings in $0 (10 digits). I
found no invalid fields in either file.
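For reference, the validation pass looked roughly like this (reconstructed from memory; the output path is a placeholder):

```pig
v0 = LOAD '/user/nppes'
   USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- keep only rows whose first field is NOT exactly 10 digits
bad = FILTER v0 BY NOT ((chararray)$0 MATCHES '\\d{10}');

STORE bad INTO '/user/nppes_bad';
```

Both files produced an empty output, so the key fields themselves look clean.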

Any thoughts on why the weird behavior?

Thanks,
Steven


SCRIPT (WORKS, but if I change $0 to chararray in both files I get the cast error)
=============================================================
n0 = LOAD '/user/nppes'
   USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

n = FOREACH n0 GENERATE $0 AS nppes_id:bytearray, $47 AS specialty:chararray;


p0 = LOAD '/user/payment'
  AS (npi:bytearray, nppes_provider_last_org_name:chararray, 
  nppes_provider_first_name:chararray, nppes_provider_mi:chararray, 
  nppes_credentials:chararray, nppes_provider_gender:chararray, 
nppes_entity_code:chararray, 
  nppes_provider_street1:chararray, nppes_provider_street2:chararray, 
  nppes_provider_city:chararray, nppes_provider_zip:chararray, 
  nppes_provider_state:chararray, nppes_provider_country:chararray, 
  provider_type:chararray, medicare_participation_indicator:chararray, 
place_of_service:chararray, 
  hcpcs_code:chararray, hcpcs_description:chararray, line_srvc_cnt:int, 
  bene_unique_cnt:int, bene_day_srvc_cnt:int, 
  average_Medicare_allowed_amt:double, stdev_Medicare_allowed_amt:double, 
  average_submitted_chrg_amt:double, stdev_submitted_chrg_amt:double, 
  average_Medicare_payment_amt:double, stdev_Medicare_payment_amt:double);
 
p = FILTER p0 BY ($0 != 'npi');


np = JOIN n BY nppes_id, p BY npi; -- THE LINE CAUSING THE ERROR

o = FOREACH np GENERATE npi, specialty, nppes_provider_city, 
nppes_provider_state, 
  nppes_provider_zip, hcpcs_code, hcpcs_description, 
  line_srvc_cnt, bene_unique_cnt, bene_day_srvc_cnt, 
  average_Medicare_allowed_amt, stdev_Medicare_allowed_amt, 
  average_submitted_chrg_amt, stdev_submitted_chrg_amt, 
  average_Medicare_payment_amt, stdev_Medicare_payment_amt;

  
STORE o INTO '/user/out' 
   USING PigStorage('\t', '-schema');


ERROR:
==================================================================
Error: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
    at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:106)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:111)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:284)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
