What are you using to load the data? It sounds like your loader is reporting a desired schema but not actually converting the data to that schema, so it tells Pig to expect ints but hands it bytearrays.
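If you are on the default PigStorage (or can switch to it), one way around this, just a sketch assuming tab-delimited files and the column types from your illustrate output, is to declare the schema in the LOAD so the loader itself does the conversion, or to cast the bytearray fields explicitly before the GROUP:

    -- schema on the LOAD: PigStorage converts each field to the declared type
    premium_data = LOAD 'premium_data' USING PigStorage('\t')
                   AS (ID:int, premium:float, start_year:int, end_year:int);
    grouped_ID   = GROUP premium_data BY ID;
    totals       = FOREACH grouped_ID GENERATE group AS ID,
                       SUM(premium_data.premium) AS total_premium;

    -- or keep the untyped LOAD and cast the bytearrays before grouping
    raw        = LOAD 'premium_data';
    typed      = FOREACH raw GENERATE (int)$0 AS ID, (float)$1 AS premium,
                     (int)$2 AS start_year, (int)$3 AS end_year;
    grouped_ID = GROUP typed BY ID;

Either way the group key reaches the map output as a real Integer rather than a DataByteArray, which is what the ClassCastException below is complaining about.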
On Feb 6, 2012, at 10:48 AM, praveenesh kumar <[email protected]> wrote:

> Hi everyone,
>
> I have a question about how GROUP BY and JOIN behave in Pig.
>
> Suppose I have two data files:
>
> 1. cust_info
> 2. premium_data
>
> cust_info:
>
> ID    name    region
> 2321  Austin  Pondicherry
> 2375  Martin  California
> 4286  Lisa    Chennai
>
> premium_data:
>
> ID    premium  start_year  end_year
> 2321  345      2009        2010
> 2375  845      2009        2011
> 4286  286      2010        2012
> 2321  213      2001        2004
> 3041  452      2010        2013
> 3041  423      2006        2009
>
> ================================
> Load premium_data, group it by ID, and sum the total premium:
>
> grunt> premium_data = load 'premium_data';
> grunt> illustrate premium_data;
>
> --------------------------------------------------------------------------
> | premium_data | ID:int | premium:float | start_year:int | end_year:int |
> --------------------------------------------------------------------------
> |              | 4286   | 286           | 2010           | 2012         |
> --------------------------------------------------------------------------
>
> grunt> cust_info = load 'cust_info';
> grunt> illustrate cust_info;
>
> -----------------------------------------------------------
> | cust_info | ID:int | name:chararray | region:chararray |
> -----------------------------------------------------------
> |           | 2375   | Martin         | California       |
> -----------------------------------------------------------
>
> grunt> grouped_ID = group premium_data by ID;
>
> When I give a schema inside my LOAD statement, I get errors when using
> GROUP BY and JOIN. But if I don't give a schema, my fields are treated as
> bytearrays and everything works fine. I don't think this is the usual
> behavior. Am I doing something wrong in the way I use JOIN and GROUP BY?
>
> grunt> illustrate grouped_ID;   -- throws errors
>
> grunt> illustrate grouped_ID;
> 2012-02-06 22:47:31,452 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
> 2012-02-06 22:47:31,651 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2012-02-06 22:47:31,698 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-02-06 22:47:31,719 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-02-06 22:47:31,850 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-02-06 22:47:31,851 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2012-02-06 22:47:31,867 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-02-06 22:47:31,884 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=292
> 2012-02-06 22:47:31,885 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
>         at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:81)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:117)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:273)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:205)
>         at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
>         at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
>         at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
>         at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
>         at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
>         at org.apache.pig.PigServer.getExamples(PigServer.java:1202)
>         at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:700)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:597)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>         at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
>         at org.apache.pig.Main.run(Main.java:523)
>         at org.apache.pig.Main.main(Main.java:148)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> 2012-02-06 22:47:31,936 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
> Details at logfile: /usr/local/hadoop/pig/trunk/learning/insurance/pig_1328548575504.log
>
> Thanks,
> Praveenesh
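The join against cust_info should behave the same way once the key carries a real type on both sides. A rough sketch along the same lines (again assuming tab-delimited input; names are only illustrative):

    cust_info = LOAD 'cust_info' USING PigStorage('\t')
                AS (ID:int, name:chararray, region:chararray);
    -- both relations now declare ID as int, so the join key types match
    joined    = JOIN premium_data BY ID, cust_info BY ID;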
