Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/#review26066 --- Ship it! Ship It! - Jakob Homan On Aug. 30, 2013, 11:49 a.m., Mohammad Islam wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated Aug. 30, 2013, 11:49 a.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated Aug. 30, 2013, 6:49 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Changes --- Updated with Jakob's comments Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote: serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java, line 529 https://reviews.apache.org/r/12480/diff/3/?file=338097#file338097line529 Weird spacing... 2x below as well. Done On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote: serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 49 https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line49 And this would indicate a bug. Done. On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote: serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 38 https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line38 These should never be null, not even in testing. It's better to change the tests to correctly populate the data structure. AvroGenericRecordWriteable needs the record reader ID. Since we are instantiating one here. We will need to set it w/o any checking. Remove unnecessary null checkings. - Mohammad --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/#review25537 --- On Aug. 30, 2013, 6:49 p.m., Mohammad Islam wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated Aug. 30, 2013, 6:49 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/#review25537 --- One issue in the testing and a few formatting issues. Otherwise looks good. serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java https://reviews.apache.org/r/12480/#comment49986 Weird spacing... 2x below as well. serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java https://reviews.apache.org/r/12480/#comment49984 These should never be null, not even in testing. It's better to change the tests to correctly populate the data structure. serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java https://reviews.apache.org/r/12480/#comment49985 And this would indicate a bug. - Jakob Homan On Aug. 6, 2013, 7:13 p.m., Mohammad Islam wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated Aug. 6, 2013, 7:13 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated Aug. 7, 2013, 2:13 a.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Changes --- Add logic to avoid excessive logging for each record. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
On July 12, 2013, 10:44 p.m., Jakob Homan wrote: Do you have after-optimization performance numbers? Can you add a test to verify that the reencoder cache is working correctly? Feed in a record with one uuid, then another with a different and verify that the cache has two elements. Adding a third record with the original UUID shouldn't increase the size of the cache. Also, that adding n records all with the same schema creates only one reencoder... Yes we have the number after optimization. For example, each record used to take nearly 50 micro-second. After this patch, it becomes nearly 31 micro-seconds. Added the test case as proposed. On July 12, 2013, 10:44 p.m., Jakob Homan wrote: serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, line 66 https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line66 verifiedRecordReaders - noReencodingNeeded ? Done On July 12, 2013, 10:44 p.m., Jakob Homan wrote: serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, line 155 https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line155 readability: pull out getRecordReaderID into its own var Done On July 12, 2013, 10:44 p.m., Jakob Homan wrote: serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java, line 78 https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line78 Need to write out the uuid too Done On July 12, 2013, 10:44 p.m., Jakob Homan wrote: serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java, line 92 https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line92 Need to read in the uuid too Done - Mohammad --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/#review23113 --- On July 11, 2013, 10:31 p.m., Mohammad Islam wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated July 11, 2013, 10:31 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated July 15, 2013, 11:48 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Changes --- Incorporated Jakob's comment. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 79c9646 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/#review23113 --- Do you have after-optimization performance numbers? Can you add a test to verify that the reencoder cache is working correctly? Feed in a record with one uuid, then another with a different and verify that the cache has two elements. Adding a third record with the original UUID shouldn't increase the size of the cache. Also, that adding n records all with the same schema creates only one reencoder... serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java https://reviews.apache.org/r/12480/#comment46953 verifiedRecordReaders - noReencodingNeeded ? serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java https://reviews.apache.org/r/12480/#comment46956 readability: pull out getRecordReaderID into its own var serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java https://reviews.apache.org/r/12480/#comment46958 Need to write out the uuid too serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java https://reviews.apache.org/r/12480/#comment46959 Need to read in the uuid too - Jakob Homan On July 11, 2013, 3:31 p.m., Mohammad Islam wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- (Updated July 11, 2013, 3:31 p.m.) Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam
Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/12480/ --- Review request for hive, Ashutosh Chauhan and Jakob Homan. Bugs: HIVE-4732 https://issues.apache.org/jira/browse/HIVE-4732 Repository: hive-git Description --- From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible. At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time. Diffs - ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb Diff: https://reviews.apache.org/r/12480/diff/ Testing --- Thanks, Mohammad Islam