Hello,
I'm new to Pig/Hadoop, using Cloudera CDH5.
I'm seeing some odd behavior that I can't figure out. When I try to JOIN
two relations I get the error below, stating that DataByteArray cannot be cast
to String.
I load the first relation with a schema where $0 is nppes:chararray with no
problem, and I load the second relation the same way ($0 is npi:chararray), also
with no problem. I can FILTER and STORE both relations without error, but when I
try to JOIN them I get the error. If I change both schemas to :bytearray, the
JOIN works fine.
So I thought there might be bad data in my files. I ran another Pig script that
loads each file and applies a regex FILTER to validate the strings in $0 (they
should be exactly 10 digits); a rough sketch of it is below. I found no invalid
fields in either file.
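The validation script was roughly like this (same loaders and paths as in the
main script below; I've simplified it here, so the variable names may not match
exactly):

-- Rough sketch of the validation: flag any row whose $0 is not exactly 10 digits.
v_n0 = LOAD '/user/nppes'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
bad_n = FILTER v_n0 BY NOT ((chararray)$0 MATCHES '[0-9]{10}');

v_p0 = LOAD '/user/payment';
v_p  = FILTER v_p0 BY ($0 != 'npi');   -- drop the header row first
bad_p = FILTER v_p BY NOT ((chararray)$0 MATCHES '[0-9]{10}');

DUMP bad_n;   -- came back empty
DUMP bad_p;   -- came back empty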
Any thoughts on what might be causing this?
Thanks,
Steven
SCRIPT (WORKS, but if I change $0 to chararray in both files I get the cast error)
=============================================================
n0 = LOAD '/user/nppes'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
n = FOREACH n0 GENERATE $0 AS nppes_id:bytearray, $47 AS specialty:chararray;
p0 = LOAD '/user/payment'
AS (npi:bytearray, nppes_provider_last_org_name:chararray,
nppes_provider_first_name:chararray, nppes_provider_mi:chararray,
nppes_credentials:chararray, nppes_provider_gender:chararray,
nppes_entity_code:chararray,
nppes_provider_street1:chararray, nppes_provider_street2:chararray,
nppes_provider_city:chararray, nppes_provider_zip:chararray,
nppes_provider_state:chararray, nppes_provider_country:chararray,
provider_type:chararray, medicare_participation_indicator:chararray,
place_of_service:chararray,
hcpcs_code:chararray, hcpcs_description:chararray, line_srvc_cnt:int,
bene_unique_cnt:int, bene_day_srvc_cnt:int,
average_Medicare_allowed_amt:double, stdev_Medicare_allowed_amt:double,
average_submitted_chrg_amt:double, stdev_submitted_chrg_amt:double,
average_Medicare_payment_amt:double, stdev_Medicare_payment_amt:double);
p = FILTER p0 BY ($0 != 'npi');
np = JOIN n by nppes_id, p by npi; -- THE LINE CAUSING THE ERROR
o = FOREACH np GENERATE npi, specialty, nppes_provider_city,
nppes_provider_state,
nppes_provider_zip, hcpcs_code, hcpcs_description,
line_srvc_cnt, bene_unique_cnt, bene_day_srvc_cnt,
average_Medicare_allowed_amt, stdev_Medicare_allowed_amt,
average_submitted_chrg_amt, stdev_submitted_chrg_amt,
average_Medicare_payment_amt, stdev_Medicare_payment_amt;
STORE o INTO '/user/out'
USING PigStorage('\t', '-schema');
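For completeness, the failing variant differs only in the type of $0 in the two
relations (schema shown abbreviated; the remaining fields are exactly as above):

n = FOREACH n0 GENERATE $0 AS nppes_id:chararray, $47 AS specialty:chararray;

p0 = LOAD '/user/payment'
     AS (npi:chararray, ...);   -- rest of the schema unchanged

np = JOIN n by nppes_id, p by npi;   -- this is where the ClassCastException is thrown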
ERROR:
==================================================================
Error: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
    at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:106)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:111)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:284)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143