[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-05-10 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.9.patch

Fixed the unit tests. 

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
  Labels: orcfile
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, 
 HIVE-10036.7.patch, HIVE-10036.8.patch, HIVE-10036.9.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-05-07 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.8.patch

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
  Labels: orcfile
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, 
 HIVE-10036.7.patch, HIVE-10036.8.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-05-05 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.8.patch

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
  Labels: orcfile
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, 
 HIVE-10036.7.patch, HIVE-10036.8.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-05-03 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.7.patch

Fixed ql/pom.xml

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
  Labels: orcfile
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-04-15 Thread Damien Carol (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Carol updated HIVE-10036:

Labels: orcfile  (was: )

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
  Labels: orcfile
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-04-10 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.5.patch

Thanks Mithun and Prasanth! Uploaded modified patch. 

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
 HIVE-10036.3.patch, HIVE-10036.5.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-03-23 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.2.patch

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
 Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10036) Writing ORC format big table causes OOM - too many fixed sized stream buffers

2015-03-20 Thread Selina Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selina Zhang updated HIVE-10036:

Attachment: HIVE-10036.1.patch

 Writing ORC format big table causes OOM - too many fixed sized stream buffers
 -

 Key: HIVE-10036
 URL: https://issues.apache.org/jira/browse/HIVE-10036
 Project: Hive
  Issue Type: Improvement
Reporter: Selina Zhang
Assignee: Selina Zhang
 Attachments: HIVE-10036.1.patch


 ORC writer keeps multiple out steams for each column. Each output stream is 
 allocated fixed size ByteBuffer (configurable, default to 256K). For a big 
 table, the memory cost is unbearable. Specially when HCatalog dynamic 
 partition involves, several hundreds files may be open and writing at the 
 same time (same problems for FileSinkOperator). 
 Global ORC memory manager controls the buffer size, but it only got kicked in 
 at 5000 rows interval. An enhancement could be done here, but the problem is 
 reducing the buffer size introduces worse compression and more IOs in read 
 path. Sacrificing the read performance is always not a good choice. 
 I changed the fixed size ByteBuffer to a dynamic growth buffer which up bound 
 to the existing configurable buffer size. Most of the streams does not need 
 large buffer so the performance got improved significantly. Comparing to 
 Facebook's hive-dwrf, I monitored 2x performance gain with this fix. 
 Solving OOM for ORC completely maybe needs lots of effort , but this is 
 definitely a low hanging fruit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)