[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907280#action_12907280 ] Doug Cutting commented on PIG-794: -- Jeff, please instead use current trunk or the 1.4.0 build that I expect to be released tomorrow (http://people.apache.org/~cutting/avro-1.4.0-rc4/). There was a bug that caused a similar failure in the snapshot you're using, but that should only happen in multi-threaded applications, which I doubt yours is, but it's better to either test against trunk or a release so we don't chase ghosts. Further, while debugging a DatumWriter and DatumReader, you might use a ValidatingEncoder and ValidatingDecoder to ensure that what you write and read conforms to your schema. You might also test by reading and printing your data with GenericDatumReader to see that you've written what you meant to write. If you've written data that does not conform to your declared schema then it cannot be read correctly. If this is the case, we should attempt to improve the error message here. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906871#action_12906871 ] Doug Cutting commented on PIG-794: -- Jeff, what version of Avro are you using? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907049#action_12907049 ] Jeff Zhang commented on PIG-794: Doug, I am using avro trunk revision 988779 Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906671#action_12906671 ] Jeff Zhang commented on PIG-794: Dmitriy, In my patch I turn InternalMap as an avro array whose element is a record having two datums(one is key and the other is value). But it occurred weird exception , not know what's wrong with my code {code} Exception in thread main java.lang.NullPointerException at org.apache.avro.io.parsing.Parser.advance(Parser.java:86) at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:121) at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:77) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106) at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:66) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106) at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:184) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:108) at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:81) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106) at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:184) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:108) at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:83) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:97) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:198) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:185) at org.apache.pig.impl.io.avro.PigData.main(PigData.java:224) {code} Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905612#action_12905612 ] Dmitriy V. Ryaboy commented on PIG-794: --- Doug and Scott will know better of course, but afaik, Avro doesn't support Object keys. You can cheat and turn Object keys into strings by Base64-encoding their serialized representations.. you'd have to know to reverse the process when deserializing, though. Or we can try to get rid of InternalMap. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905663#action_12905663 ] Doug Cutting commented on PIG-794: -- Some quick comments on the new patch: - you might define a java enum type for the union elements, using Enum#ordinal() for the union indexes - instead of name.equals(union), s.getType()==Type.UNION would be faster, but better yet would be to simply call read() recursively, since it already handles unions, no? - peekArray() can simply return null, and that might be faster Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905775#action_12905775 ] Dmitriy V. Ryaboy commented on PIG-794: --- Jeff, that's what I am saying -- since they are writables, we can turn them into strings and not need InternalMap at all. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904551#action_12904551 ] Jeff Zhang commented on PIG-794: I did some experiment on Avro, Avro_Storage_2.patch is the detail implementation. Here I use avro as the data storage between map reduce jobs to replace InterStorage which has been optimized compared to BinStorage. I use a simple pig script which will been translate into 2 mapred jobs {code} a = load '/a.txt'; b = load '/b.txt'; c = join a by $0, b by $0; d = group c by $0; dump d; {code} The following table shows my experiment result (1 master + 3 slaves) || Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total spent time on pig script | AvroStorage | 5min 57 sec | 7.97G | 120 | 16min 50 sec | 22min 47 sec| | InterStorage | 4min 33 sec | 9.55G | 143 | 17min 17 sec | 21min 50 sec| The experiment shows that AvroStorage has more compact format than InterStorage ( according the output size of job_1), but has more overhead on serialization ( according the time spent on job_1). I think the time spent on job_2 using AvroStorage is less than that using InterStorage is because the input size of job_2 (the output of job_1) which using AvroStorage is much less than that using InterStorage, so it need less mapper task. Overall, AvroStorage is not so good as expected. One reason is maybe I do not use Avro's API correctly (hope avro guys can review my code), another reason is maybe avro's serialization performance is not so good. BTW, I use avro trunk. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904615#action_12904615 ] Dmitriy V. Ryaboy commented on PIG-794: --- Jeff, have you checkoed out Scott Carey's work here: https://issues.apache.org/jira/browse/AVRO-592 ? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904674#action_12904674 ] Scott Carey commented on PIG-794: - AVRO-592 creates an AvroStorage class for writing and reading M/R inputs and outputs but does not deal with intermediate M/R output. I have some updates to that in progress that simplify it more. Some aspects may be re-usable for this too. One thing to note is that Avro cannot be completely optimal for intermediate M/R output because the Hadoop API for this has a performance flaw that prevents efficient use of buffers and input/output streams there. This would affect InterStorage as well though. I'll take a look at the patch here and see if I can see any performance optimizations. Note, that there are still several performance optimizations left to do in Avro itself. For example, the BinaryDecoder has been optimized, but not the Encoder yet. Also, I'm somewhat blocked with AVRO-592 due to lack of Pig 0.7 maven availability. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904680#action_12904680 ] Scott Carey commented on PIG-794: - So a summary of the differences I can see quickly are: h5. Schema usage: This creates a 'generic' Avro schema that can be used for any pig data. Each field in a Tuple is a Union of all possible pig types, and each Tuple is a list of fields. It does not preserve the field names or types -- these are not important for intermediate data anyway. AVRO-592 translates the Pig schema into a specific Avro schema that persists the field names and types, so that: STORE foo INTO 'file' USING AvroStorage(); Will create a file that foo2 = LOAD 'file' USING AvroStorage(); will be able to re-create the exact schema for use in a script. h5. Serialization and Deserialization: This uses the same style as Avro's GenericRecord, which traverses the schema on the fly and writes fields for each record. AVRO-592 constructs a state machine for each specific schema to optimally traverse a Tuple to serialize a record or create a Tuple when deserializing. This should be faster but the code is definitely harder to read (but easy to unit test -- AVRO-592 has 98% unit test code coverage on that portion). Integrating these should not be too hard. I'll try and put my latest version of AVRO-592 up there late today or tomorrow. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904687#action_12904687 ] Doug Cutting commented on PIG-794: -- A few comments about the attached code: - is there a reason you don't subclass GenericDatumReader and GenericDatumWriter, overriding readRecord() and writeRecord()? That would simplify things and better guarantee that you're conforming to a schema. Currently, e.g., your writeMap() doesn't appear to write a valid Avro map, writeArray() doesn't write a valid Avro array, etc., so the data written is not interoperable,. - my guess is that a lot of time is spent in findSchemaIndex(). if that's right, you might improve this in various ways, e.g.: -- sort this by the most common types. the order in Pig's DataType.java is probably a good one. -- try using a static MapClass,Integer cache of indexes - have you run this under a profiler? I don't see where this specifies an Avro schema for Pig data. It's possible to construct a generic schema for all Pig data. In this, a Bag should be record with a single field, an array of Tuples. A Tuple should be a record with a single field, an array of a union of all types. Given such a schema, one could then write a DatumReader/Writer using the control logic of Pig's DataReaderWriter (i.e., a switch based on the value of DataType.findType(), but, instead of calling DataInput/Output methods, use Encoder/Decoder methods with a ValidatingEncoder (at least while debugging) to ensure you conform to that schema. Alternately, in Avro 1.4 (snapshot in Maven now, release this week, hopefully) Avro arrays can be arbitrary Collection implementations. Bag already implements all of the required Collection methods -- clear(), add(), size(), iterator(), so there's no reason I can see for Bag not to implement CollectionTuple. So then one could subclass GenericData, GenericDatumReader Writer, overriding: {code} protected boolean isRecord(Object datum) { return datum instanceof Tuple || datum instanceof Bag; } protected void writeRecord(Schema schema, Object datum, Encoder out) throws IOException { if (TUPLE_NAME.equals(schema.getFullName())) datum = ((Tuple)datum.getAll(); writeArray(schema.getFields().get(0).getType(), datum, out); } protected Object readRecord(Object old, Schema expected, ResolvingDecoder in) throws IOException { Object result; if (TUPLE_NAME.equals(schema.getFullName())) { old = new ArrayList(); result = new Tuple(old); } else { old = result = new Bag(); } readArray(old, expected.getFields().get(0).getType(), in); return result; } {code} Finally, if you knew the schema for the dataset being processed, rather than using a fully-general Pig schema, then you could translate that schema to an Avro schema. This schema in most cases would not likely have a huge, compute-intensive-to-write union in it . Or you might use something like what Scott's proposed in AVRO-592. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884577#action_12884577 ] Jeff Zhang commented on PIG-794: We can leverage AvroInputFormat and AvroOutputFormat in Avro trunk, (see AVRO-493) Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847594#action_12847594 ] Allen Wittenauer commented on PIG-794: -- What is the latest on getting Avro support in pig? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847606#action_12847606 ] Alan Gates commented on PIG-794: It depends on what you mean by support. As far as Pig using Avro for serialization between Map and Reduce and MR jobs, we haven't done anything on that front lately. Last time we tested the performance was comparable to our own BinStorage so we weren't motivated to move yet. Now that Avro has matured a bit maybe we should test again. As far as using Avro to store user data, with Pig 0.7 it should become quite easy to write Avro load and store functions. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847607#action_12847607 ] Jeff Hammerbacher commented on PIG-794: --- bq. Last time we tested the performance was comparable to our own BinStorage so we weren't motivated to move yet. Hey Alan, There should be benefits to using Avro besides just performance. Either way, looking forward to seeing you on the Avro lists when you decide to test again! Regards, Jeff Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847613#action_12847613 ] Alan Gates commented on PIG-794: Jeff, Beyond performance what do you see as the big wins of using Avro? I'm just thinking here of moving data between MR jobs in a Pig script and between Map and Reduce phases. I see lots of advantages to users using Avro to store their data. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847614#action_12847614 ] Dmitriy V. Ryaboy commented on PIG-794: --- I'll take a crack at it. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729700#action_12729700 ] Alan Gates commented on PIG-794: I agree with Doug's comments that it's better to use an API to build the schema that will give us compile time checking. I think it will also (hopefully) be easier to figure out the schema when reading the code, as it will avoid the need to read JSON directly. I have a general question on the approach. This is a direct port of Pig's BinStorage to use Avro, including the writing of indicator bytes for types. I do not have a deep knowledge of Avro. But I had assumed that since it was a de/serialization framework with types, part of what it would provide was type recognition. That is, can't this code rely on Avro to set the type for it? Do we need to be writing those indicator bytes ourselves? Perhaps this is the same comment that Doug is making about using GenericDatumReader and addField. In response to Hong's comment, the sync marks are vulnerable as you point out. But the loader needs some way to find a proper starting place when it's handed any block but the initial block of a file. I wonder if we could create a new sync type. It would always consist of a 100 byte marker (say the first 25 prime numbers, or the first 25 digits of pi or something). We could then write a tuple with that sync type every 1000 records in the data. Loaders that don't start at position 0 could then seek to the first sync type it found before it began reading. All loaders would read past the end of their position until they saw a sync type. As for this being compatible with with non-pig apps, that isn't the purpose of this AvroStorage function. This is for pig to pass data between MR jobs for itself. Having a tool independent storage format is a bigger project, as it requires agreeing on things like sync marks, how to represent different Avro objects, etc. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723812#action_12723812 ] Alan Gates commented on PIG-794: PIG-734 has been committed. This will allow this patch to simplify its handling of maps to match avro maps, since Pig maps now only allow strings as keys. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712397#action_12712397 ] Hong Tang commented on PIG-794: --- - It appears that the code added a three-byte sync-mark \1\2\3 before every tuple. - There is no escaping of sync-mark collisions in user code. - The introduction of the sync mark also defeats the purpose of using Avro in the first place (sharing a common serialization format). Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709608#action_12709608 ] Doug Cutting commented on PIG-794: -- Looking at the patch, I have a few questions and remarks: - Why not name the records Tuple and Bag instead of T and B? The names are not written in the data, so there's little advantage to shorter names. - Why not, instead of parsing the schema from Json, construct the schema using the Java Schema API? Then you would not need to walk the schema afterwards to find union indexes, and you'd get compile-time API checking rather than potential load-time JSON parse errors. - Why not extend GenericDatumReader and override newRecord() to create either a Bag or a Tuple, then override addField() to add values to either a bag or tuple? This would make the patch much smaller, and potentially permit you to eventually take advantage of GenericDatumReader features like projection and object reuse. - Finally, since you're using a pre-release version of Avro, you should probably name the jar with the subversion revision number. Also note that, since Avro is not yet stable, it should not be yet used for persistent data in production systems. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709078#action_12709078 ] Rakesh Setty commented on PIG-794: -- Olga, The reproduced the issue. It is because the schema parsing is failing on the Avro side. I checked that the Avro codebase has changed which is causing this issue. I will work with the Avro team to understand the changes that I need to make for AvroStorage. Thanks, Rakesh Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709081#action_12709081 ] Rakesh Setty commented on PIG-794: -- I also noticed that the Avro codebase supports only Strings for maps. As mentioned earlier I have modified the AvroStorage to have key as any object. Does Avro need to have only strings as map keys? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708701#action_12708701 ] Olga Natkovich commented on PIG-794: I integrated the latest patch and run unit tests. All the AVRO unit tests failed with the following stack trace: Could not initialize class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroTupleSchema java.lang.NoClassDefFoundError: Could not initialize class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroTupleSchema at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.writeDatum(AvroStorage.java:359) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.writeTuple(AvroStorage.java:408) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.write(AvroStorage.java:353) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroStorage.putNext(AvroStorage.java:571) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:121) at org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:129) at org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:102) at org.apache.pig.test.TestAvroStorage.store(TestAvroStorage.java:117) at org.apache.pig.test.TestAvroStorage.testLoadStoreComplexDataWithNull(TestAvroStorage.java:178) ~ Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707570#action_12707570 ] Rakesh Setty commented on PIG-794: -- The new patch has unit tests. The comments are already in javadoc format. Please let me know if I have missed somewhere. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707576#action_12707576 ] Rakesh Setty commented on PIG-794: -- There was one important change I had to do in AvroStorage to the Avro format to get it working. The map keys were stored as String objects. I had to change it so that both key and value can be Object instances. Please let me know if this is an issue. Thanks, Rakesh Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Fix For: 0.2.0 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707063#action_12707063 ] Olga Natkovich commented on PIG-794: Rakesh, Could you please convert the comments to javadoc and at unit tests before we commit the code. Thanks Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706473#action_12706473 ] Rakesh Setty commented on PIG-794: -- The attached jar files should go to the lib directory. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, jackson-asl-0.9.4.jar We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706144#action_12706144 ] Rakesh Setty commented on PIG-794: -- While trying to address the comment about eliminating the AvroValueReader, I noticed that the way pos (current position in the stream) is being handled is wrong. The position in the stream can only be handled by the ValueReader (Avro codebase) due to the non-standard (not making use of DataOutput's methods to store data) way of storing data by Avro. For example, an integer can be stored in anywhere between 1 - 5 bytes while a long can be stored in anywhere between 1 - 10 bytes. I think we have to ask the Avro team to support this (current position in the stream) for us to proceed with this. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706197#action_12706197 ] Doug Cutting commented on PIG-794: -- I think we have to ask the Avro team to support this (current position in the stream) for us to proceed with this. ValueReader performs no buffering, so its position is always the same as the InputStream that it wraps. See DataFileReader#SeekableBufferedInput for an example of an input stream that tracks its position. Note that AVRO-25 proposes to add buffering to ValueWriter, so that the position of the underlying stream might be different than that of the ValueWriter, but I do not forsee a need to add this to ValueReader, the concern here. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706220#action_12706220 ] Rakesh Setty commented on PIG-794: -- This works. Will update the patch. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706226#action_12706226 ] Hadoop QA commented on PIG-794: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12407285/AvroStorage.patch against trunk revision 771844. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/30/console This message is automatically generated. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch, AvroStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706244#action_12706244 ] Olga Natkovich commented on PIG-794: Doug, if there is no buffering then the position in the inout stream can be used for now. However, if you are planning to do buffering in the future, it might be good to have an API that just gives the position so that later we don't need to change the code. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch, AvroStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706278#action_12706278 ] Olga Natkovich commented on PIG-794: Hi Rakesh, Thanks for the update. A few comments: (1) Thanks for adding comments. They need to be of javadoc style so that we get free documentation from it. You can see examples in other files (2) Looks like there is at least one System.println statement that got in I assume by mistake. (3) Looks like you have some traces as log.error instead of log.debug (4) You need to attach AVRO library separately. Patches don't work well with binary data Also I am curious if removing wrapper class made a performance difference? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch, AvroStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706284#action_12706284 ] Olga Natkovich commented on PIG-794: One more thing: since we are adding avro library, lets add some unit tests as well. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch, AvroStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705351#action_12705351 ] Olga Natkovich commented on PIG-794: Hi Rakesh, Thanks for the patch. A few comments below. First, a few general comments: (1) AVROBinStorage should not be in builtins. We don't want to expose to the end user because in the past we had issues with backward compatibility (with BinStorage) when the same function was used both internally ane externally. (2) Every new file needs to have an apache license header. You can get one from a file in SVN. (3) I would just call the class AVROStorage (4) Once we are fully integrated with AVRO, we should at unit tests but for now this is fine (5) It would be nice to have javadoc comments in the data. At a minimum a header for each class on what it does and each public method. Also, it would be good to document any non-obvious code. Now, code related comments: what is the reason for having AVROValueReader. It seems to be a streight wrapper around ValueReader + position which we can keep track separately. I am concerned with the performance overhad that happens on each call. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Attachments: AvroBinStorage.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.