[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2019-01-04 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: (was: HIVE-20760.13.patch)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.12.patch, HIVE-20760.13.patch, HIVE-20760.4.patch, 
> HIVE-20760.5.patch, HIVE-20760.6.patch, HIVE-20760.7.patch, 
> HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>
> The issue is that every Hive task has to load its own copy of 
> {{HiveConf}}. When running with a large number of cores per executor (Hive on 
> Spark, HoS), a significant amount of memory (~10%) is wasted due to this 
> duplication. 
> I looked into the problem and found a way to reduce the overhead caused by 
> the multiple HiveConf objects.
> I've created an implementation of Properties, somewhat similar to 
> CopyOnFirstWriteProperties. CopyOnFirstWriteProperties can't be used to solve 
> this problem, because it drops the interned Properties right after we add a 
> new property.
> So my implementation looks like this:
>  * When we create a new HiveConf from an existing one (copy constructor), we 
> replace the properties object stored by HiveConf with the new Properties 
> implementation (HiveConfProperties). There are two possible ways to do this: 
> either we change the visibility of the properties field in the ancestor class 
> (Configuration, which comes from Hadoop) to protected, or, more simply, we 
> set the field to the new implementation using reflection.
>  * HiveConfProperties immediately interns the given properties. After this, 
> every time we add a new property to HiveConf, we add it to an additional 
> Properties object. This way, if we create multiple HiveConf objects with the 
> same base properties, they share the same interned Properties object, but each 
> session/task can still add its own unique properties.
>  * Getting a property from HiveConfProperties would look like this (I stored 
> the non-interned properties in the superclass):
>                 String property = super.getProperty(key);
>                 if (property == null) property = interned.getProperty(key);
>                 return property;
> Tests showed that the interning works (50 connections to HiveServer2, heap 
> dumps taken after the sessions were created for queries): 
> Overall memory:
>         original: 34,599K              interned: 20,582K
> Retained memory of HiveConfs:
>         original: 16,366K              interned: 10,804K
> I have attached the JXray reports for the heap dumps.
> What are your thoughts about this solution? 
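
The layered lookup described in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class name OverlayProperties and the key "example.mode" are hypothetical, and only getProperty is layered here.

```java
import java.util.Properties;

/**
 * Hypothetical sketch of the layered lookup described above.
 * It is NOT the HiveConfProperties class from the patches.
 */
public class OverlayProperties extends Properties {
    // Shared base properties; many instances may reference the same object.
    private final Properties interned;

    public OverlayProperties(Properties interned) {
        this.interned = interned;
    }

    @Override
    public String getProperty(String key) {
        // Values set on this instance are stored in the superclass and win.
        String property = super.getProperty(key);
        if (property == null) {
            // Fall back to the shared base.
            property = interned.getProperty(key);
        }
        return property;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("example.mode", "shared"); // hypothetical key

        OverlayProperties sessionA = new OverlayProperties(base);
        OverlayProperties sessionB = new OverlayProperties(base);
        sessionA.setProperty("example.mode", "local"); // per-session override

        System.out.println(sessionA.getProperty("example.mode")); // local
        System.out.println(sessionB.getProperty("example.mode")); // shared
    }
}
```

In this sketch, methods such as size() or stringPropertyNames() see only the per-instance entries, so a complete implementation would need to layer those as well. The JDK's own Properties(Properties defaults) constructor offers a comparable fallback lookup.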



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2019-01-04 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.13.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.12.patch, HIVE-20760.13.patch, HIVE-20760.4.patch, 
> HIVE-20760.5.patch, HIVE-20760.6.patch, HIVE-20760.7.patch, 
> HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2019-01-04 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.13.patch

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.12.patch, HIVE-20760.13.patch, HIVE-20760.4.patch, 
> HIVE-20760.5.patch, HIVE-20760.6.patch, HIVE-20760.7.patch, 
> HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2019-01-04 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.12.patch, HIVE-20760.13.patch, HIVE-20760.4.patch, 
> HIVE-20760.5.patch, HIVE-20760.6.patch, HIVE-20760.7.patch, 
> HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-18 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.12.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.12.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, 
> HIVE-20760.9.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-18 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.4.patch, HIVE-20760.5.patch, HIVE-20760.6.patch, 
> HIVE-20760.7.patch, HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-14 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.11.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.4.patch, HIVE-20760.5.patch, HIVE-20760.6.patch, 
> HIVE-20760.7.patch, HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-14 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.11.patch, 
> HIVE-20760.4.patch, HIVE-20760.5.patch, HIVE-20760.6.patch, 
> HIVE-20760.7.patch, HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-11 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.10.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.10.patch, HIVE-20760.4.patch, 
> HIVE-20760.5.patch, HIVE-20760.6.patch, HIVE-20760.7.patch, 
> HIVE-20760.8.patch, HIVE-20760.9.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-11 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, 
> HIVE-20760.9.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-10 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.9.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, 
> HIVE-20760.9.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-12-10 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, 
> HIVE-20760.9.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-26 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-26 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.8.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.8.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.7.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.7.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.6.patch
Status: Patch Available  (was: Open)

Fixed a clone problem in HiveConfProperties caused by cloning already removed 
Properties.

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, 
> HIVE-20760.6.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-13 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.5.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.5.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-13 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-06 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.4.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.4.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html
>
>
> The issue is that every Hive task has to load its own copy of
> {{HiveConf}}. When running with a large number of cores per executor (HoS),
> a significant amount of memory (~10%) is wasted on this duplication.
> I looked into the problem and found a way to reduce the overhead caused by
> the multiple HiveConf objects.
> I've created an implementation of Properties, somewhat similar to
> CopyOnFirstWriteProperties. CopyOnFirstWriteProperties can't be used to solve
> this problem, because it drops the interned Properties as soon as we add a
> new property.
> My implementation works like this:
>  * When we create a new HiveConf from an existing one (copy constructor), we
> replace the properties object stored by HiveConf with the new Properties
> implementation (HiveConfProperties). There are two possible ways to do this:
> either change the visibility of the properties field in the ancestor class
> (Configuration, which comes from Hadoop) to protected or, more simply,
> replace the object using reflection.
>  * HiveConfProperties immediately interns the given properties. After this,
> every time we add a new property to HiveConf, it goes into an additional
> Properties object. This way, if we create multiple HiveConfs with the same
> base properties, they share the same Properties object, but each session/task
> can still add its own unique properties.
>  * Getting a property from HiveConfProperties looks like this (the
> non-interned properties are stored in the superclass):
>                 String property = super.getProperty(key);
>                 if (property == null) property = interned.getProperty(key);
>                 return property;
> Tests showed that the interning works (50 connections to HiveServer2,
> heap dumps taken after the sessions were created for queries):
> Overall memory:
>          original: 34,599K              interned: 20,582K
> Retained memory of HiveConfs:
>         original: 16,366K               interned: 10,804K
> I have attached the JXray reports for the heap dumps.
> What are your thoughts about this solution?
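
The layered lookup described in the quoted text can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class name LayeredProperties is hypothetical and stands in for HiveConfProperties, and the shared base is passed in directly instead of being obtained from an interner.

```java
import java.util.Properties;

// Hypothetical stand-in for the HiveConfProperties class described above.
class LayeredProperties extends Properties {
    // Shared base; many instances may point at the same object.
    private final Properties interned;

    LayeredProperties(Properties interned) {
        this.interned = interned;
    }

    // New or overridden values go only into this instance's own table
    // (the superclass); the shared base is never modified.
    @Override
    public synchronized Object setProperty(String key, String value) {
        String previous = getProperty(key);
        super.setProperty(key, value);
        return previous;
    }

    // Look in the per-instance overlay first, then fall back to the shared base.
    @Override
    public String getProperty(String key) {
        String property = super.getProperty(key);
        if (property == null) {
            property = interned.getProperty(key);
        }
        return property;
    }
}
```

Two LayeredProperties instances built on the same base both see the base values, while a setProperty call on one of them stays invisible to the other and to the base.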



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-11-06 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-30 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760-3.patch
Status: Patch Available  (was: Open)

HIVE-20760-3.patch: Fixes HiveConfProperties.size() and prevents a chain of 
HiveConfProperties objects from forming when a HiveConf is created from a conf 
whose base is already interned.
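
One way to read the two fixes above is sketched below. This is an assumption about the intent, not the patch itself: FlatLayeredProperties, wrap() and base() are hypothetical names. size() counts distinct keys across both layers, and wrap() reuses the shared base of an already layered object instead of stacking a second wrapper on top of it. The copy loop is not synchronized on the source, which is acceptable for a sketch only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for HiveConfProperties; not the code from the patch.
class FlatLayeredProperties extends Properties {
    private final Properties interned;   // shared base, never itself layered

    private FlatLayeredProperties(Properties interned) {
        this.interned = interned;
    }

    // Wrap a source without stacking one layered object on top of another.
    static FlatLayeredProperties wrap(Properties source) {
        if (source instanceof FlatLayeredProperties) {
            FlatLayeredProperties other = (FlatLayeredProperties) source;
            // Reuse the same shared base and copy only the per-instance overlay.
            FlatLayeredProperties copy = new FlatLayeredProperties(other.interned);
            for (Map.Entry<Object, Object> e : other.overlayEntries()) {
                copy.put(e.getKey(), e.getValue());
            }
            return copy;
        }
        return new FlatLayeredProperties(source);
    }

    // Entries stored in this instance only (the superclass table).
    private Set<Map.Entry<Object, Object>> overlayEntries() {
        return super.entrySet();
    }

    // Exposed so callers can check which base object is shared.
    Properties base() {
        return interned;
    }

    @Override
    public String getProperty(String key) {
        String property = super.getProperty(key);
        return property != null ? property : interned.getProperty(key);
    }

    // Count distinct keys across both layers, so a key that is overridden
    // in the overlay is not counted twice.
    @Override
    public synchronized int size() {
        Set<Object> keys = new HashSet<>(interned.keySet());
        keys.addAll(super.keySet());
        return keys.size();
    }
}
```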

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760-3.patch, HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-30 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760.patch, hiveconf_interned.html, hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-25 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760-2.patch
Status: Patch Available  (was: Open)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760-2.patch, 
> HIVE-20760.patch, hiveconf_interned.html, hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-25 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Barnabas Maidics
>Assignee: Barnabas Maidics
>Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760-1.patch
Status: Patch Available  (was: Open)

Fixing checkstyle issues.

Note: findbugs reports 2 problems: 
 # "getProperty is unsynchronized, setProperty is synchronized": Here I 
followed the Properties class, where getProperty is not synchronized 
either. 
 # "clone() does not call super.clone()": We don't need to call super.clone(), 
because clone() has to merge super and interned and then clone the merged object.
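
The clone() behaviour described in the second point might look roughly like this. It is a sketch under assumptions: MergingProperties is a hypothetical name, and returning a plain Properties built from both layers is one plausible reading of "merge super and interned", not necessarily what the patch does.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for HiveConfProperties; not the code from the patch.
class MergingProperties extends Properties {
    private final Properties interned;   // shared base

    MergingProperties(Properties interned) {
        this.interned = interned;
    }

    @Override
    public String getProperty(String key) {
        String property = super.getProperty(key);
        return property != null ? property : interned.getProperty(key);
    }

    // Deliberately does not call super.clone(): that would copy only the
    // per-instance overlay. Instead, both layers are flattened into a fresh,
    // independent Properties object.
    @Override
    public synchronized Object clone() {
        Properties merged = new Properties();
        merged.putAll(interned);                      // base values first
        for (Map.Entry<Object, Object> e : super.entrySet()) {
            merged.put(e.getKey(), e.getValue());     // overlay values win
        }
        return merged;
    }
}
```

Because the result is a new object, later changes to the clone do not reach the shared base.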

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760-1.patch, HIVE-20760.patch, 
> hiveconf_interned.html, hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-19 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Status: Open  (was: Patch Available)

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html


[jira] [Updated] (HIVE-20760) Reducing memory overhead due to multiple HiveConfs

2018-10-17 Thread Barnabas Maidics (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barnabas Maidics updated HIVE-20760:

Attachment: HIVE-20760.patch
Status: Patch Available  (was: Open)

Attached the patch.

Notes:
 * I only implemented the functions we really use inside the Configuration 
class, and I don't see any use case where we would need all of the Properties 
functions. So I think the rest can just throw a NotImplementedException().
 * The equals method looks scary, but it's much faster than merging the two 
parts together and then comparing them. I followed the logic of the equals 
method in Hashtable, but adapted it to the data being split across two 
Hashtables.
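
A comparison along these lines could be sketched as below. This illustrates the idea only and is not the equals method from the patch: SplitProperties is a hypothetical name, and methods such as entrySet() and keySet() are left untouched, so it is not a complete Map implementation. As in Hashtable.equals, the sizes are compared first and then every logical entry is looked up in the other map, so no merged copy is built.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for HiveConfProperties; not the code from the patch.
class SplitProperties extends Properties {
    private final Properties interned;   // shared base

    SplitProperties(Properties interned) {
        this.interned = interned;
    }

    // Logical lookup: overlay first, then the shared base.
    @Override
    public synchronized Object get(Object key) {
        Object value = super.get(key);
        return value != null ? value : interned.get(key);
    }

    // Logical size: overlay entries plus base entries that are not shadowed.
    @Override
    public synchronized int size() {
        int n = super.size();
        for (Object key : interned.keySet()) {
            if (!super.containsKey(key)) n++;
        }
        return n;
    }

    // Same size, and every logical entry of this map has an equal value in
    // the other map; no merged copy is created.
    @Override
    public synchronized boolean equals(Object o) {
        if (o == this) return true;
        if (!(o instanceof Map)) return false;
        Map<?, ?> other = (Map<?, ?>) o;
        if (other.size() != size()) return false;
        for (Map.Entry<Object, Object> e : super.entrySet()) {
            if (!e.getValue().equals(other.get(e.getKey()))) return false;
        }
        for (Map.Entry<Object, Object> e : interned.entrySet()) {
            if (super.containsKey(e.getKey())) continue;   // shadowed by overlay
            if (!e.getValue().equals(other.get(e.getKey()))) return false;
        }
        return true;
    }

    // Sum of entry hashes over the logical entries, consistent with equals.
    @Override
    public synchronized int hashCode() {
        int h = 0;
        for (Map.Entry<Object, Object> e : super.entrySet()) {
            h += e.getKey().hashCode() ^ e.getValue().hashCode();
        }
        for (Map.Entry<Object, Object> e : interned.entrySet()) {
            if (super.containsKey(e.getKey())) continue;
            h += e.getKey().hashCode() ^ e.getValue().hashCode();
        }
        return h;
    }
}
```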

> Reducing memory overhead due to multiple HiveConfs
> --
>
> Key: HIVE-20760
> URL: https://issues.apache.org/jira/browse/HIVE-20760
> Project: Hive
> Issue Type: Improvement
> Components: Configuration
> Reporter: Barnabas Maidics
> Assignee: Barnabas Maidics
> Priority: Major
> Attachments: HIVE-20760.patch, hiveconf_interned.html, 
> hiveconf_original.html