Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-09-12 Thread Jakob Homan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review26066
---

Ship it!


Ship It!

- Jakob Homan


On Aug. 30, 2013, 11:49 a.m., Mohammad Islam wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/12480/
 ---
 
 (Updated Aug. 30, 2013, 11:49 a.m.)
 
 
 Review request for hive, Ashutosh Chauhan and Jakob Homan.
 
 
 Bugs: HIVE-4732
 https://issues.apache.org/jira/browse/HIVE-4732
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 From our performance analysis, we found AvroSerde's schema.equals() call 
 consumed a substantial amount ( nearly 40%) of time. This patch intends to 
 minimize the number schema.equals() calls by pushing the check as late/fewer 
 as possible.
 
 At first, we added a unique id for each record reader which is then included 
 in every AvroGenericRecordWritable. Then, we introduce two new data 
 structures (one hashset and one hashmap) to store intermediate data to avoid 
 duplicates checkings. Hashset contains all the record readers' IDs that don't 
 need any re-encoding. On the other hand, HashMap contains the already used 
 re-encoders. It works as cache and allows re-encoders reuse. With this 
 change, our test shows nearly 40% reduction in Avro record reading time.
  

 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
 ed2a9af 
   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
 e994411 
   
 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
  66f0348 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
 3828940 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
 9af751b 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
 
 Diff: https://reviews.apache.org/r/12480/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Mohammad Islam
 




Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-08-30 Thread Mohammad Islam

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
---

(Updated Aug. 30, 2013, 6:49 p.m.)


Review request for hive, Ashutosh Chauhan and Jakob Homan.


Changes
---

Updated with Jakob's comments


Bugs: HIVE-4732
https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
---

From our performance analysis, we found AvroSerde's schema.equals() call 
consumed a substantial amount ( nearly 40%) of time. This patch intends to 
minimize the number schema.equals() calls by pushing the check as late/fewer 
as possible.

At first, we added a unique id for each record reader which is then included in 
every AvroGenericRecordWritable. Then, we introduce two new data structures 
(one hashset and one hashmap) to store intermediate data to avoid duplicates 
checkings. Hashset contains all the record readers' IDs that don't need any 
re-encoding. On the other hand, HashMap contains the already used re-encoders. 
It works as cache and allows re-encoders reuse. With this change, our test 
shows nearly 40% reduction in Avro record reading time.
 
   


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
ed2a9af 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
e994411 
  
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
3828940 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
---


Thanks,

Mohammad Islam



Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-08-30 Thread Mohammad Islam


 On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java,
   line 529
  https://reviews.apache.org/r/12480/diff/3/?file=338097#file338097line529
 
  Weird spacing... 2x below as well.

Done


 On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 49
  https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line49
 
  And this would indicate a bug.

Done.


 On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 38
  https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line38
 
  These should never be null, not even in testing.  It's better to change 
  the tests to correctly populate the data structure.

AvroGenericRecordWriteable needs the record reader ID. Since we are 
instantiating one here. We will need to set it w/o any checking.
Remove unnecessary null checkings.


- Mohammad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review25537
---


On Aug. 30, 2013, 6:49 p.m., Mohammad Islam wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/12480/
 ---
 
 (Updated Aug. 30, 2013, 6:49 p.m.)
 
 
 Review request for hive, Ashutosh Chauhan and Jakob Homan.
 
 
 Bugs: HIVE-4732
 https://issues.apache.org/jira/browse/HIVE-4732
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 From our performance analysis, we found AvroSerde's schema.equals() call 
 consumed a substantial amount ( nearly 40%) of time. This patch intends to 
 minimize the number schema.equals() calls by pushing the check as late/fewer 
 as possible.
 
 At first, we added a unique id for each record reader which is then included 
 in every AvroGenericRecordWritable. Then, we introduce two new data 
 structures (one hashset and one hashmap) to store intermediate data to avoid 
 duplicates checkings. Hashset contains all the record readers' IDs that don't 
 need any re-encoding. On the other hand, HashMap contains the already used 
 re-encoders. It works as cache and allows re-encoders reuse. With this 
 change, our test shows nearly 40% reduction in Avro record reading time.
  

 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
 ed2a9af 
   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
 e994411 
   
 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
  66f0348 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
 3828940 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
 9af751b 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
 
 Diff: https://reviews.apache.org/r/12480/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Mohammad Islam
 




Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-08-25 Thread Jakob Homan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review25537
---


One issue in the testing and a few formatting issues.  Otherwise looks good.


serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java
https://reviews.apache.org/r/12480/#comment49986

Weird spacing... 2x below as well.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
https://reviews.apache.org/r/12480/#comment49984

These should never be null, not even in testing.  It's better to change the 
tests to correctly populate the data structure.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
https://reviews.apache.org/r/12480/#comment49985

And this would indicate a bug.


- Jakob Homan


On Aug. 6, 2013, 7:13 p.m., Mohammad Islam wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/12480/
 ---
 
 (Updated Aug. 6, 2013, 7:13 p.m.)
 
 
 Review request for hive, Ashutosh Chauhan and Jakob Homan.
 
 
 Bugs: HIVE-4732
 https://issues.apache.org/jira/browse/HIVE-4732
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 From our performance analysis, we found AvroSerde's schema.equals() call 
 consumed a substantial amount ( nearly 40%) of time. This patch intends to 
 minimize the number schema.equals() calls by pushing the check as late/fewer 
 as possible.
 
 At first, we added a unique id for each record reader which is then included 
 in every AvroGenericRecordWritable. Then, we introduce two new data 
 structures (one hashset and one hashmap) to store intermediate data to avoid 
 duplicates checkings. Hashset contains all the record readers' IDs that don't 
 need any re-encoding. On the other hand, HashMap contains the already used 
 re-encoders. It works as cache and allows re-encoders reuse. With this 
 change, our test shows nearly 40% reduction in Avro record reading time.
  

 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
 ed2a9af 
   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
 e994411 
   
 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
  66f0348 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
 3828940 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
 9af751b 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
 
 Diff: https://reviews.apache.org/r/12480/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Mohammad Islam
 




Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-08-06 Thread Mohammad Islam

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
---

(Updated Aug. 7, 2013, 2:13 a.m.)


Review request for hive, Ashutosh Chauhan and Jakob Homan.


Changes
---

Add logic to avoid excessive logging for each record.


Bugs: HIVE-4732
https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
---

From our performance analysis, we found AvroSerde's schema.equals() call 
consumed a substantial amount ( nearly 40%) of time. This patch intends to 
minimize the number schema.equals() calls by pushing the check as late/fewer 
as possible.

At first, we added a unique id for each record reader which is then included in 
every AvroGenericRecordWritable. Then, we introduce two new data structures 
(one hashset and one hashmap) to store intermediate data to avoid duplicates 
checkings. Hashset contains all the record readers' IDs that don't need any 
re-encoding. On the other hand, HashMap contains the already used re-encoders. 
It works as cache and allows re-encoders reuse. With this change, our test 
shows nearly 40% reduction in Avro record reading time.
 
   


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
ed2a9af 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
e994411 
  
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
3828940 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
---


Thanks,

Mohammad Islam



Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-07-15 Thread Mohammad Islam


 On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
  Do you have after-optimization performance numbers?  Can you add a test to 
  verify that the reencoder cache is working correctly?  Feed in a record 
  with one uuid, then another with a different and verify that the cache has 
  two elements.  Adding a third record with the original UUID shouldn't 
  increase the size of the cache.  Also, that adding n records all with the 
  same schema creates only one reencoder...

Yes we have the number after optimization. For example, each record used to 
take nearly 50 micro-second. After this patch, it becomes nearly 31 
micro-seconds.
Added the test case as proposed. 


 On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, 
  line 66
  https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line66
 
  verifiedRecordReaders - noReencodingNeeded ?

Done


 On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, 
  line 155
  https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line155
 
  readability: pull out getRecordReaderID into its own var

Done


 On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java,
   line 78
  https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line78
 
  Need to write out the uuid too

Done


 On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java,
   line 92
  https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line92
 
  Need to read in the uuid too

Done


- Mohammad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review23113
---


On July 11, 2013, 10:31 p.m., Mohammad Islam wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/12480/
 ---
 
 (Updated July 11, 2013, 10:31 p.m.)
 
 
 Review request for hive, Ashutosh Chauhan and Jakob Homan.
 
 
 Bugs: HIVE-4732
 https://issues.apache.org/jira/browse/HIVE-4732
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 From our performance analysis, we found AvroSerde's schema.equals() call 
 consumed a substantial amount ( nearly 40%) of time. This patch intends to 
 minimize the number schema.equals() calls by pushing the check as late/fewer 
 as possible.
 
 At first, we added a unique id for each record reader which is then included 
 in every AvroGenericRecordWritable. Then, we introduce two new data 
 structures (one hashset and one hashmap) to store intermediate data to avoid 
 duplicates checkings. Hashset contains all the record readers' IDs that don't 
 need any re-encoding. On the other hand, HashMap contains the already used 
 re-encoders. It works as cache and allows re-encoders reuse. With this 
 change, our test shows nearly 40% reduction in Avro record reading time.
  

 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
 dbc999f 
   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
 c85ef15 
   
 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
  66f0348 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
 9af751b 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
 
 Diff: https://reviews.apache.org/r/12480/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Mohammad Islam
 




Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-07-15 Thread Mohammad Islam

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
---

(Updated July 15, 2013, 11:48 p.m.)


Review request for hive, Ashutosh Chauhan and Jakob Homan.


Changes
---

Incorporated Jakob's comment.


Bugs: HIVE-4732
https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
---

From our performance analysis, we found AvroSerde's schema.equals() call 
consumed a substantial amount ( nearly 40%) of time. This patch intends to 
minimize the number schema.equals() calls by pushing the check as late/fewer 
as possible.

At first, we added a unique id for each record reader which is then included in 
every AvroGenericRecordWritable. Then, we introduce two new data structures 
(one hashset and one hashmap) to store intermediate data to avoid duplicates 
checkings. Hashset contains all the record readers' IDs that don't need any 
re-encoding. On the other hand, HashMap contains the already used re-encoders. 
It works as cache and allows re-encoders reuse. With this change, our test 
shows nearly 40% reduction in Avro record reading time.
 
   


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
dbc999f 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
c85ef15 
  
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 
79c9646 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
---


Thanks,

Mohammad Islam



Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-07-12 Thread Jakob Homan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review23113
---


Do you have after-optimization performance numbers?  Can you add a test to 
verify that the reencoder cache is working correctly?  Feed in a record with 
one uuid, then another with a different and verify that the cache has two 
elements.  Adding a third record with the original UUID shouldn't increase the 
size of the cache.  Also, that adding n records all with the same schema 
creates only one reencoder...


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
https://reviews.apache.org/r/12480/#comment46953

verifiedRecordReaders - noReencodingNeeded ?



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
https://reviews.apache.org/r/12480/#comment46956

readability: pull out getRecordReaderID into its own var



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
https://reviews.apache.org/r/12480/#comment46958

Need to write out the uuid too



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
https://reviews.apache.org/r/12480/#comment46959

Need to read in the uuid too


- Jakob Homan


On July 11, 2013, 3:31 p.m., Mohammad Islam wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/12480/
 ---
 
 (Updated July 11, 2013, 3:31 p.m.)
 
 
 Review request for hive, Ashutosh Chauhan and Jakob Homan.
 
 
 Bugs: HIVE-4732
 https://issues.apache.org/jira/browse/HIVE-4732
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 From our performance analysis, we found AvroSerde's schema.equals() call 
 consumed a substantial amount ( nearly 40%) of time. This patch intends to 
 minimize the number schema.equals() calls by pushing the check as late/fewer 
 as possible.
 
 At first, we added a unique id for each record reader which is then included 
 in every AvroGenericRecordWritable. Then, we introduce two new data 
 structures (one hashset and one hashmap) to store intermediate data to avoid 
 duplicates checkings. Hashset contains all the record readers' IDs that don't 
 need any re-encoding. On the other hand, HashMap contains the already used 
 re-encoders. It works as cache and allows re-encoders reuse. With this 
 change, our test shows nearly 40% reduction in Avro record reading time.
  

 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
 dbc999f 
   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
 c85ef15 
   
 serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
  66f0348 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
 9af751b 
   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
 
 Diff: https://reviews.apache.org/r/12480/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Mohammad Islam
 




Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

2013-07-11 Thread Mohammad Islam

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
---

Review request for hive, Ashutosh Chauhan and Jakob Homan.


Bugs: HIVE-4732
https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
---

From our performance analysis, we found AvroSerde's schema.equals() call 
consumed a substantial amount ( nearly 40%) of time. This patch intends to 
minimize the number schema.equals() calls by pushing the check as late/fewer 
as possible.

At first, we added a unique id for each record reader which is then included in 
every AvroGenericRecordWritable. Then, we introduce two new data structures 
(one hashset and one hashmap) to store intermediate data to avoid duplicates 
checkings. Hashset contains all the record readers' IDs that don't need any 
re-encoding. On the other hand, HashMap contains the already used re-encoders. 
It works as cache and allows re-encoders reuse. With this change, our test 
shows nearly 40% reduction in Avro record reading time.
 
   


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java 
dbc999f 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java 
c85ef15 
  
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 
9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
---


Thanks,

Mohammad Islam