Hi Marcos,

Just a quick question: have you checked whether every row in your data actually has all the fields? You may be dealing with sparse data and not noticing it because of the sheer volume. First of all, what does your data look like? I would start by trying a subset of the data, and then write my own UDF to parse each line and return just the values I want.
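For the subset check, something like the following could work outside of Pig. It is only a minimal sketch in Python, assuming the log lines are tab-delimited (the default for Pig's PigStorage loader); the function name `field_count_histogram` and the 1000-line limit are hypothetical choices for illustration, not anything from your script.

```python
from collections import Counter
from itertools import islice


def field_count_histogram(lines, delimiter="\t", limit=1000):
    """Count how many of the first `limit` lines have each number of fields.

    Assumes one record per line and a single-character delimiter (tab here).
    """
    counts = Counter()
    for line in islice(lines, limit):
        # Strip only the newline so trailing empty fields are still counted.
        fields = line.rstrip("\n").split(delimiter)
        counts[len(fields)] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory sample; in practice, pass an open file handle instead.
    sample = ["a\tb\tc\n", "a\tb\n", "a\tb\tc\n"]
    for n_fields, n_rows in sorted(field_count_histogram(sample).items()):
        print(f"{n_fields} fields: {n_rows} rows")
```

If more than one field count shows up in the result, some rows are shorter than others, and positional references such as $94 will not exist on every row.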
Renato M.

2010/10/20 Marcos Medrado Rubinelli <[email protected]>

> Hi everybody,
>
> I'm trying to use vanilla Pig 0.7.0 to generate monthly consolidations of
> log files with relatively long lines: 95 fields and growing, of which I'll
> be using just 7. Just so I didn't have to declare all the fields in the LOAD
> command, I tried to define the schema in my first FOREACH...GENERATE, so the
> first lines of my script look like this:
>
> input = LOAD '/tmp/test.log';
> A = FILTER input BY SIZE(*) >= 95;
> B = FOREACH A GENERATE (long)$94, (chararray)$93, (long)$16, (long)$27,
> (long)$23, (int)$2, (int)$3
> AS publisher, associate, site, category,
> story, hits, comments;
>
> As you can guess by now, Pig complains while still parsing:
>
> ERROR 1000: Error during parsing. Invalid alias: category in null
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
> during parsing. Invalid alias: associate in null
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:73)
>
> Am I overlooking anything? Should I give up and declare a 95-field schema?
> Write a LOAD UDF? Or is there a simpler way to do what I want?
>
> Thank you!
> Marcos Rubinelli
