[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-28 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, 
> HIVE-20330.5.patch, HIVE-20330.6.patch
>
>
> While running performance tests on Pig (0.12 and 0.17), we observed a large
> performance drop in a workload that reads multiple inputs through HCatLoader.
> The cause is that, for an MR job with multiple Hive tables as input, Pig
> calls {{setLocation}} on each {{LoadFunc}} ({{HCatLoader}}) instance, but only
> one table's information (an {{InputJobInfo}} instance) is tracked in the
> JobConf, under the config key {{HCatConstants.HCAT_KEY_JOB_INFO}}.
> Each such call overwrites the previous value, so only the last table's
> information is considered when Pig calls {{getStatistics}} to estimate the
> required reducer count.
> With two input tables of 256GB and 1MB, for example, Pig queries HCat for
> the size of both, but it sees either 1MB+1MB=2MB or 256GB+256GB=0.5TB,
> depending on the input order in the execution plan's DAG.
> It should instead see 256.00097GB in total and use 257 reducers by default.
> In the unlucky case the input is seen as 2MB, and a single reducer has to
> struggle with the actual 256.00097GB.
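
The failure mode described above can be sketched in a few lines of Java. This is an illustration only and does not use the real Hive or Pig classes: `JobInfoSketch`, the plain `Map` standing in for the JobConf, the `job.info` key strings, and the one-reducer-per-GB divisor are all assumptions (the divisor is picked because it reproduces the 257 reducers quoted in the issue). The sketch contrasts a single shared key, which every `setLocation`-style call overwrites, with one entry per input table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the behaviour in the issue; not the Hive/Pig API.
public class JobInfoSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;
    // Assumed reducer sizing: one reducer per started GB of input.
    static final long BYTES_PER_REDUCER = GB;

    // Buggy pattern: every input writes its size under the same key,
    // so the last writer wins and every loader later reads that one value.
    static long sizeSeenWithSharedKey(List<Long> tableSizes) {
        Map<String, Long> conf = new LinkedHashMap<>();
        for (long size : tableSizes) {
            conf.put("job.info", size); // overwrites the previous table's entry
        }
        long total = 0;
        for (int i = 0; i < tableSizes.size(); i++) {
            total += conf.get("job.info"); // each loader reports the same size
        }
        return total;
    }

    // Intended pattern: one entry per input, so each loader reads its own size.
    static long sizeSeenWithPerInputKeys(List<Long> tableSizes) {
        Map<String, Long> conf = new LinkedHashMap<>();
        for (int i = 0; i < tableSizes.size(); i++) {
            conf.put("job.info." + i, tableSizes.get(i));
        }
        long total = 0;
        for (int i = 0; i < tableSizes.size(); i++) {
            total += conf.get("job.info." + i);
        }
        return total;
    }

    // Ceiling division, with a floor of one reducer.
    static long reducersFor(long totalBytes) {
        return Math.max(1, (totalBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }

    public static void main(String[] args) {
        List<Long> bigThenSmall = List.of(256 * GB, 1 * MB);
        List<Long> smallThenBig = List.of(1 * MB, 256 * GB);
        System.out.println(reducersFor(sizeSeenWithSharedKey(bigThenSmall)));    // 1
        System.out.println(reducersFor(sizeSeenWithSharedKey(smallThenBig)));    // 512
        System.out.println(reducersFor(sizeSeenWithPerInputKeys(bigThenSmall))); // 257
    }
}
```

With the shared key the estimate depends only on which table was registered last, which matches the 2MB and 0.5TB outcomes in the description; with per-input entries the order no longer matters.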



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.6.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, 
> HIVE-20330.5.patch, HIVE-20330.6.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, 
> HIVE-20330.5.patch, HIVE-20330.6.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, HIVE-20330.5.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, HIVE-20330.5.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, HIVE-20330.5.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.5.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, HIVE-20330.5.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.4.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.3.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch, HIVE-20330.3.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: (was: HIVE-20330.4.patch)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.4.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch


[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-27 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.2.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-26 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-26 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-26 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-26 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: (was: HIVE-20330.2.patch)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-26 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.2.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
>Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, 
> HIVE-20330.2.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-23 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: In Progress  (was: Patch Available)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-23 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-23 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.1.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-20 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Attachment: HIVE-20330.0.patch

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-11-20 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Status: Patch Available  (was: In Progress)

> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
> Attachments: HIVE-20330.0.patch
>
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...





[jira] [Updated] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-08-07 Thread Adam Szita (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated HIVE-20330:
--
Description: 
While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
performance drop in a workload that has multiple inputs from HCatLoader.

The reason is that for a particular MR job with multiple Hive tables as input, 
Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance but only 
one table's information (InputJobInfo instance) gets tracked in the JobConf. 
(This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).

Any such call overwrites preexisting values, and thus only the last table's 
information will be considered when Pig calls {{getStatistics}} to calculate 
and estimate required reducer count.

In cases when there are 2 input tables, 256GB and 1MB in size respectively, Pig 
will query the size information from HCat for both of them, but it will either 
see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the execution 
plan's DAG.
It should of course see 256.00097GB in total and use 257 reducers by default 
accordingly.

In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
with the actual 256.00097GB...

  was:
While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
performance drop in a workload that has multiple inputs from HCatLoader.

The reason is that for a particular MR job with multiple Hive tables as input, 
Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance but only 
one table's information (InputJobInfo instance) gets tracked in the JobConf. 
(This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).

Any such call overwrites preexisting values, and thus only the last table's 
information will be considered when Pig calls {{getStatistics}} to calculate 
and estimate required reducer count.

In cases when there are 2 input tables, 256GB and 1MB in size respectively, Pig 
will query the size information from HCat for both of them, but it will either 
see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the execution 
plan's DAG.
It should of course see 256.00097GB in total and use 257 reducers by default 
accordingly.

In unlucky cases this will be 2MB and 1 reducer will have to struggle with 
256GB...


> HCatLoader cannot handle multiple InputJobInfo objects for a job with 
> multiple inputs
> -
>
> Key: HIVE-20330
> URL: https://issues.apache.org/jira/browse/HIVE-20330
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Reporter: Adam Szita
> Assignee: Adam Szita
> Priority: Major
>
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
> performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as 
> input, Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance 
> but only one table's information (InputJobInfo instance) gets tracked in the 
> JobConf. (This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}).
> Any such call overwrites preexisting values, and thus only the last table's 
> information will be considered when Pig calls {{getStatistics}} to calculate 
> and estimate required reducer count.
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, 
> Pig will query the size information from HCat for both of them, but it will 
> either see 1MB+1MB=2MB or 256GB+256GB=0.5TB depending on input order in the 
> execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default 
> accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle 
> with the actual 256.00097GB...


