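Further examination shows that the problematic emails I am encoding are in ISO-8859-1, not UTF-8. That is why I am getting character problems: the 0xa0 byte in the Python traceback is a non-breaking space in ISO-8859-1, but it is not a valid start byte in UTF-8. It looks like this is not an Avro problem after all. Thanks! :)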
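For anyone who hits the same UnicodeDecodeError, here is a minimal sketch of one way to normalize the message bodies before they are written as Avro strings. It is written for Python 3 rather than the Python 2.6 in the traceback, and the helper name to_unicode and the sample bytes are made up for illustration; this is not what my Ruby writer actually does.

```python
def to_unicode(raw, fallback="iso-8859-1"):
    """Decode bytes as UTF-8, falling back to ISO-8859-1 if that fails.

    `to_unicode` and `fallback` are hypothetical names for this sketch.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # decode cannot raise, but it may guess the wrong charset.
        return raw.decode(fallback)


# Illustrative ISO-8859-1 body containing 0xe9 (e-acute) and 0xa0
# (non-breaking space), the byte reported in the traceback.
latin1_body = b"caf\xe9\xa0au lait"

text = to_unicode(latin1_body)

# Re-encode as UTF-8, which is what an Avro string field must contain.
utf8_body = text.encode("utf-8")
```

The fallback is only a guess: ISO-8859-1 bytes that happen to form valid UTF-8 would be decoded as UTF-8. If the message declares a charset in its Content-Type header, using that value is more reliable than guessing.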
On Thu, Feb 2, 2012 at 2:49 PM, Russell Jurney <[email protected]> wrote:

> A little bit more searching shows this:
>
> http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
>
> On Thu, Feb 2, 2012 at 2:48 PM, Russell Jurney <[email protected]> wrote:
>
>> The jars being used are:
>>
>> REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
>> REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
>> REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
>> REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
>> REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
>>
>> On Thu, Feb 2, 2012 at 2:41 PM, James Baldassari <[email protected]> wrote:
>>
>>> HI Russell,
>>>
>>> I'm not sure about the Python error, but the Java error looks like a
>>> classpath problem, not a schema parsing issue. The NoSuchMethodError in
>>> the stack trace indicates that Avro was trying to invoke a method in the
>>> Jackson library that wasn't present at run-time. My guess is that your
>>> program (or Pig?) either has two incompatible versions of the Jackson
>>> library on its classpath or maybe Avro's Jackson dependency has been
>>> excluded and a version that is incompatible with Avro is on the classpath.
>>>
>>> Which version of Avro is being used? Running 'mvn dependency:tree' in
>>> Avro trunk I see that it's depending on Jackson 1.8.6. Can you verify that
>>> only one version of Jackson is on the classpath and that it's the version
>>> that is required by whatever version of Avro is on the classpath?
>>>
>>> -James
>>>
>>> On Thu, Feb 2, 2012 at 5:21 PM, Russell Jurney <[email protected]> wrote:
>>>
>>>> Correction: when I read the file in Python, I get the error below. It
>>>> looks like a unicode problem? Can one tell Avro how to handle this?
>>>>
>>>> Traceback (most recent call last):
>>>>   File "./cat_avro", line 21, in <module>
>>>>     for record in df_reader:
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py", line 354, in next
>>>>     datum = self.datum_reader.read(self.datum_decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 445, in read
>>>>     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 490, in read_data
>>>>     return self.read_record(writers_schema, readers_schema, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 690, in read_record
>>>>     field_val = self.read_data(field.type, readers_field.type, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 488, in read_data
>>>>     return self.read_union(writers_schema, readers_schema, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 654, in read_union
>>>>     return self.read_data(selected_writers_schema, readers_schema, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 458, in read_data
>>>>     return self.read_data(writers_schema, s, decoder)
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 468, in read_data
>>>>     return decoder.read_utf8()
>>>>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py", line 233, in read_utf8
>>>>     return unicode(self.read_bytes(), "utf-8")
>>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 543: invalid start byte
>>>>
>>>> On Thu, Feb 2, 2012 at 2:06 PM, Russell Jurney <[email protected]> wrote:
>>>>
>>>>> I am writing Avro records in Ruby using the avro ruby gem in 1.8.7. I
>>>>> have problems with loading these files sometimes. As a result, I am
>>>>> unable to write large files that are readable.
>>>>>
>>>>> The exception I get is below. Anyone have an idea what this means?
>>>>> It looks like Avro is having trouble parsing the schema. The avro files
>>>>> parse in Ruby and Python, just not Pig. Are there more rigorous checks in
>>>>> Java?
>>>>>
>>>>> Pig Stack Trace
>>>>> ---------------
>>>>> ERROR 2998: Unhandled internal error. org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
>>>>>
>>>>> java.lang.NoSuchMethodError: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
>>>>>     at org.apache.avro.Schema.<clinit>(Schema.java:82)
>>>>>     at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.<clinit>(AvroStorageUtils.java:49)
>>>>>     at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163)
>>>>>     at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
>>>>>     at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269)
>>>>>     at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
>>>>>     at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
>>>>>     at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
>>>>>     at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
>>>>>     at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>>>>>     at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>>>>>     at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
>>>>>     at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679)
>>>>>     at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610)
>>>>>     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582)
>>>>>     at org.apache.pig.PigServer.registerQuery(PigServer.java:584)
>>>>>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942)
>>>>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
>>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
>>>>>     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
>>>>>     at org.apache.pig.Main.run(Main.java:495)
>>>>>     at org.apache.pig.Main.main(Main.java:111)
>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>
>>>>> ================================================================================
>>>>>
>>>>> --
>>>>> Russell Jurney
>>>>> twitter.com/rjurney
>>>>> [email protected]
>>>>> datasyndrome.com
>>>>
>>>> --
>>>> Russell Jurney
>>>> twitter.com/rjurney
>>>> [email protected]
>>>> datasyndrome.com
>>
>> --
>> Russell Jurney
>> twitter.com/rjurney
>> [email protected]
>> datasyndrome.com
>
> --
> Russell Jurney
> twitter.com/rjurney
> [email protected]
> datasyndrome.com

--
Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com
