Hello,
  I have a dataset with more than 180 columns that I want to join (on two 
columns) with another dataset. 

  I would rather not enumerate all 180 column names in a schema.  
What other options do I have?

  Here is my script:

-- This has 180 columns which I do not want to explicitly declare
wide_data = LOAD '/wide/' USING PigStorage('\t');
DESCRIBE wide_data ;

narrow_data =
        LOAD '/narrow/'
        USING PigStorage('\t')
        AS (
                a   : chararray,
                b   : chararray,
                c   : long,
                d   : double
        );

narrow_data = FOREACH narrow_data GENERATE a, c, d ;
DESCRIBE narrow_data;

-- join based on two columns
j = JOIN wide_data BY ((chararray)$20, (long)$172), narrow_data BY (a, c) 
PARALLEL 1800 ;
DESCRIBE j;

STORE j INTO '/output/';
====

When I run this with pig -x, it fails because the schema is unknown:

Schema for wide_data unknown.
narrow_data: {a: chararray,c: long,d: double}
Schema for j unknown.
2010-11-18 15:58:03,589 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2189: Expect schema
Details at logfile: /hadoop/home/amallya/JARs/pig_1290121082836.log


The log file says:

more pig_1290121082836.log
Pig Stack Trace
---------------
ERROR 2189: Expect schema

org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2185: Unable to prune columns when processing node (Name: ForEach 1-72 Operator Key: 1-72)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.processNode(PruneColumns.java:515)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.transform(PruneColumns.java:150)
        at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:232)
        at org.apache.pig.PigServer.compileLp(PigServer.java:857)
        at org.apache.pig.PigServer.compileLp(PigServer.java:793)
        at org.apache.pig.PigServer.execute(PigServer.java:762)
        at org.apache.pig.PigServer.access$100(PigServer.java:90)
        at org.apache.pig.PigServer$Graph.execute(PigServer.java:952)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:249)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:386)
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2185: Unable to prune columns when processing node (Name: Load 1-47 Operator Key: 1-47)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.processNode(PruneColumns.java:515)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.processNode(PruneColumns.java:510)
        ... 13 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2188: Cannot prune columns for (Name: LOJoin 1-62 Operator Key: 1-62)
        at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:226)
        at org.apache.pig.impl.logicalLayer.ColumnPruner.visit(ColumnPruner.java:251)
        at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:206)
        at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.pruneLoader(PruneColumns.java:762)
        at org.apache.pig.impl.logicalLayer.optimizer.PruneColumns.processNode(PruneColumns.java:198)
        ... 14 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2189: Expect schema
        at org.apache.pig.impl.logicalLayer.ColumnPruner.prune(ColumnPruner.java:78)
        ... 21 more
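
If it turns out that a schema has to be declared after all, one fallback I 
can think of is to generate the AS clause instead of typing it by hand. Below 
is a minimal Python sketch under that assumption. The function name 
pig_schema, the "col" prefix, the field names key_a and key_c, and the column 
count of 180 are all hypothetical placeholders (the real file has more than 
180 columns), and I have not confirmed that a generated schema avoids ERROR 
2189.

```python
def pig_schema(num_columns, typed=None, prefix="col"):
    """Build a Pig 'AS (...)' clause with generated field names.

    num_columns: total number of columns in the file (placeholder value).
    typed: optional dict mapping a zero-based column position to a
           (name, pig_type) tuple for columns that need a real type.
    prefix: hypothetical prefix for the generated field names.
    """
    typed = typed or {}
    fields = []
    for i in range(num_columns):
        # Columns without an override get a generated name and stay bytearray.
        name, pig_type = typed.get(i, (f"{prefix}{i}", "bytearray"))
        fields.append(f"{name}: {pig_type}")
    return "AS (" + ", ".join(fields) + ")"


if __name__ == "__main__":
    # Positions 20 and 172 are the join keys used in the script above;
    # the names key_a and key_c are made up for illustration.
    clause = pig_schema(180, {20: ("key_a", "chararray"), 172: ("key_c", "long")})
    print("wide_data = LOAD '/wide/' USING PigStorage('\\t') " + clause + ";")
```

The printed LOAD statement could then be pasted into the script in place of 
the schema-less one, and the join written against key_a and key_c rather than 
$20 and $172.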

Thanks
Ashok.
