[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-11-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=687072&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-687072
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 28/Nov/21 00:13
Start Date: 28/Nov/21 00:13
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2611:
URL: https://github.com/apache/hive/pull/2611


   




Issue Time Tracking
---

Worklog Id: (was: 687072)
Time Spent: 3h  (was: 2h 50m)

> Add support for combiner in hash mode group aggregation 
> 
>
> Key: HIVE-24471
> URL: https://issues.apache.org/jira/browse/HIVE-24471
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> In map-side group aggregation, a partial grouped aggregation is computed to 
> reduce the data written to disk by the map task. In the case of hash 
> aggregation, where the input data is not sorted, a hash table is used (with 
> sorting also performed before flushing). If the hash table grows beyond a 
> configurable limit, its contents are flushed to disk and a new hash table is 
> created. If the reduction achieved by the hash table is less than the minimum 
> hash-aggregation reduction estimated at compile time, map-side aggregation is 
> converted to streaming mode. So if the first few batches of records do not 
> yield a significant reduction, the mode is switched to streaming. This can 
> hurt performance if subsequent batches of records have fewer distinct values. 
> To improve performance in both hash and streaming mode, a combiner can be 
> added to the map task after the keys are sorted. This ensures that the 
> aggregation is performed where possible and reduces the data written to disk.
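The mechanics are straightforward to sketch outside Hive: once the map output is sorted on the group-by key, all partial aggregates for a given key are adjacent, so a combiner can merge them in one pass while holding state for only one group at a time. A minimal, self-contained illustration in plain Java (a hypothetical SumCombiner with a sum aggregate; this is not Hive's actual GroupByCombiner API):

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Illustrative only: merges partial sums whose keys arrive in sorted order,
// mimicking what a combiner does between the map-side sort and the disk write.
public class SumCombiner {

  static void combine(List<Entry<String, Long>> sortedPartials) {
    String currentKey = null;
    long runningSum = 0;
    for (Entry<String, Long> e : sortedPartials) {
      if (currentKey != null && !currentKey.equals(e.getKey())) {
        emit(currentKey, runningSum);  // key changed: the previous group is complete
        runningSum = 0;
      }
      currentKey = e.getKey();
      runningSum += e.getValue();      // merge this partial aggregate into the group
    }
    if (currentKey != null) {
      emit(currentKey, runningSum);    // flush the final group
    }
  }

  static void emit(String key, long sum) {
    System.out.println(key + " -> " + sum);  // stand-in for writing to the output file
  }

  public static void main(String[] args) {
    combine(Arrays.<Entry<String, Long>>asList(
        new SimpleEntry<>("a", 1L),
        new SimpleEntry<>("a", 2L),   // same key as above: merged, not rewritten
        new SimpleEntry<>("b", 5L))); // prints a -> 3 then b -> 5
  }
}

Because the input is already sorted, the merge costs a single comparison per record and constant memory, which is why the description above calls the post-sort aggregation cheap.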





[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-11-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=684414&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-684414
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 21/Nov/21 19:37
Start Date: 21/Nov/21 19:37
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2611:
URL: https://github.com/apache/hive/pull/2611#issuecomment-974730493


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.




Issue Time Tracking
---

Worklog Id: (was: 684414)
Time Spent: 2h 50m  (was: 2h 40m)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-11-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=684267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-684267
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 21/Nov/21 00:11
Start Date: 21/Nov/21 00:11
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2611:
URL: https://github.com/apache/hive/pull/2611#issuecomment-974730493


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.




Issue Time Tracking
---

Worklog Id: (was: 684267)
Time Spent: 2h 40m  (was: 2.5h)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=644280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-644280
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 31/Aug/21 15:06
Start Date: 31/Aug/21 15:06
Worklog Time Spent: 10m 
  Work Description: maheshk114 opened a new pull request #2611:
URL: https://github.com/apache/hive/pull/2611


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 644280)
Time Spent: 2.5h  (was: 2h 20m)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=643854&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-643854
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 31/Aug/21 04:39
Start Date: 31/Aug/21 04:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 opened a new pull request #2611:
URL: https://github.com/apache/hive/pull/2611


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 643854)
Time Spent: 2h 20m  (was: 2h 10m)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-03-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=565827&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-565827
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 14/Mar/21 00:53
Start Date: 14/Mar/21 00:53
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #1736:
URL: https://github.com/apache/hive/pull/1736


   





Issue Time Tracking
---

Worklog Id: (was: 565827)
Time Spent: 2h 10m  (was: 2h)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-03-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=561887&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561887
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 07/Mar/21 00:52
Start Date: 07/Mar/21 00:52
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #1736:
URL: https://github.com/apache/hive/pull/1736#issuecomment-792135255


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.





Issue Time Tracking
---

Worklog Id: (was: 561887)
Time Spent: 2h  (was: 1h 50m)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=530507&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530507
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 04:07
Start Date: 04/Jan/21 04:07
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r551109482



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByCombiner.java
##
@@ -0,0 +1,282 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByCombiner;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceWork;
+import org.apache.hadoop.hive.ql.plan.TableDesc;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.serde2.AbstractSerDe;
+import org.apache.hadoop.hive.serde2.Deserializer;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.hive.serde2.SerDeUtils;
+import org.apache.hadoop.hive.serde2.Serializer;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.io.BytesWritable;
+import org.apache.hadoop.io.DataInputBuffer;
+import org.apache.hadoop.util.ReflectionUtils;
+import org.apache.tez.runtime.api.TaskContext;
+import org.apache.tez.runtime.library.common.sort.impl.IFile;
+import org.apache.tez.runtime.library.common.sort.impl.TezRawKeyValueIterator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.ArrayList;
+
+import static org.apache.hadoop.hive.ql.exec.Utilities.HAS_REDUCE_WORK;
+import static org.apache.hadoop.hive.ql.exec.Utilities.REDUCE_PLAN_NAME;
+
+// Combiner for normal group by operator. In case of map side aggregate, the partially
+// aggregated records are sorted based on group by key. If because of some reasons, like hash
+// table memory exceeded the limit or the first few batches of records have less ndvs, the
+// aggregation is not done, then here the aggregation can be done cheaply as the records
+// are sorted based on group by key.
+public class GroupByCombiner extends VectorGroupByCombiner {
+
+  private static final Logger LOG = LoggerFactory.getLogger(
+  org.apache.hadoop.hive.ql.exec.GroupByCombiner.class.getName());
+
+  private transient GenericUDAFEvaluator[] aggregationEvaluators;
+  Deserializer valueDeserializer;
+  GenericUDAFEvaluator.AggregationBuffer[] aggregationBuffers;
+  GroupByOperator groupByOperator;
+  Serializer valueSerializer;
+  ObjectInspector aggrObjectInspector;
+  DataInputBuffer valueBuffer;
+  Object[] cachedValues;
+
+  public GroupByCombiner(TaskContext taskContext) throws HiveException, IOException {
+super(taskContext);
+if (rw != null) {
+  try {
+groupByOperator = (GroupByOperator) rw.getReducer();
+
+ArrayList<ObjectInspector> ois = new ArrayList<ObjectInspector>();
+ois.add(keyObjectInspector);
+ois.add(valueObjectInspector);
+ObjectInspector[] rowObjectInspector = new ObjectInspector[1];
+rowObjectInspector[0] =
+ObjectInspectorFactory.getStandardStructObjectInspector(Utilities.reduceFieldNameList,
+ois);
+groupByOperator.setInputObjInspectors(rowObjectInspector);
+groupByOperator.initializeOp(conf);
+aggregationBuffers = groupByOperator.getAggregationBuffers();
+aggregationEvaluators = groupByOperator.getAggregationEvaluator();
+
+TableDesc valueTableDesc = rw.getTagToValueDesc().get(0);
+if 
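As context for the constructor above: the combiner presents each (key, value) pair to the GroupByOperator as a single two-field struct, which is why a standard struct ObjectInspector is assembled from the key and value inspectors. A standalone sketch of that pattern (the field names and types here are invented for illustration; only the factory calls mirror the quoted code):

import java.util.Arrays;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Illustrative only: builds a two-field struct inspector the same way the
// constructor above does, with made-up field names and primitive types.
public class StructInspectorSketch {
  public static void main(String[] args) {
    StructObjectInspector rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("key", "value"),   // cf. Utilities.reduceFieldNameList in the quoted code
        Arrays.<ObjectInspector>asList(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector,
            PrimitiveObjectInspectorFactory.javaLongObjectInspector));
    System.out.println(rowOI.getTypeName()); // struct<key:string,value:bigint>
  }
}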

[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2021-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=530508&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-530508
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 04/Jan/21 04:07
Start Date: 04/Jan/21 04:07
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r551109525




[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=525007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-525007
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 16/Dec/20 12:52
Start Date: 16/Dec/20 12:52
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r544256032



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -1790,6 +1790,10 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
 HIVEALIAS("hive.alias", "", ""),
 HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side aggregation in Hive Group By queries"),
 HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew in data to optimize group by queries"),
+
+HIVE_ENABLE_COMBINER_FOR_GROUP_BY("hive.enable.combiner.for.groupby", true,
+"Whether to enable tez combiner to aggregate the records after sorting 
is done"),

Review comment:
   Maybe clarify that it is only used for map-side aggregation? Is there any case where this would not be beneficial?
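For reference, the proposed flag is a plain boolean property, so under the name in the diff above it could be toggled through Hadoop's Configuration API (a hypothetical harness; Hive itself would read it via HiveConf, and whether disabling it helps is workload-dependent):

import org.apache.hadoop.conf.Configuration;

// Hypothetical harness: only the property name comes from the patch above;
// the surrounding code is illustrative.
public class CombinerFlagSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("hive.enable.combiner.for.groupby", false);  // opt out of the combiner
    System.out.println(conf.getBoolean("hive.enable.combiner.for.groupby", true));  // false
  }
}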


[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522547&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522547
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:51
Start Date: 10/Dec/20 04:51
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539843531



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
##
@@ -712,6 +751,12 @@ private void processKey(Object row,
 
   @Override
   public void process(Object row, int tag) throws HiveException {
+if (hashAggr) {
+  if (getConfiguration().get("forced.streaming.mode", "false").equals("true")) {

Review comment:
   I have removed it in the next commit; it had been added for testing only.







Issue Time Tracking
---

Worklog Id: (was: 522547)
Time Spent: 1h 20m  (was: 1h 10m)






[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522546&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522546
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:50
Start Date: 10/Dec/20 04:50
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539843194



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByCombiner.java
##
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByCombiner;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceWork;
+import org.apache.hadoop.hive.ql.plan.TableDesc;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.serde2.AbstractSerDe;
+import org.apache.hadoop.hive.serde2.Deserializer;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.hive.serde2.SerDeUtils;
+import org.apache.hadoop.hive.serde2.Serializer;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.io.BytesWritable;
+import org.apache.hadoop.io.DataInputBuffer;
+import org.apache.hadoop.util.ReflectionUtils;
+import org.apache.tez.runtime.api.TaskContext;
+import org.apache.tez.runtime.library.common.sort.impl.IFile;
+import org.apache.tez.runtime.library.common.sort.impl.TezRawKeyValueIterator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.ArrayList;
+
+import static org.apache.hadoop.hive.ql.exec.Utilities.HAS_REDUCE_WORK;
+import static org.apache.hadoop.hive.ql.exec.Utilities.REDUCE_PLAN_NAME;
+
+// Combiner for normal group by operator. In case of map side aggregate, the partially
+// aggregated records are sorted based on group by key. If because of some reasons, like hash
+// table memory exceeded the limit or the first few batches of records have less ndvs, the
+// aggregation is not done, then here the aggregation can be done cheaply as the records
+// are sorted based on group by key.
+public class GroupByCombiner extends VectorGroupByCombiner {
+
+  private static final Logger LOG = LoggerFactory.getLogger(
+  org.apache.hadoop.hive.ql.exec.GroupByCombiner.class.getName());
+
+  private transient GenericUDAFEvaluator[] aggregationEvaluators;
+  Deserializer valueDeserializer;
+  GenericUDAFEvaluator.AggregationBuffer[] aggregationBuffers;
+  GroupByOperator groupByOperator;
+  Serializer valueSerializer;
+  ObjectInspector aggrObjectInspector;
+  DataInputBuffer valueBuffer;
+  Object[] cachedValues;
+
+  public GroupByCombiner(TaskContext taskContext) throws HiveException, IOException {
+super(taskContext);
+if (rw != null) {
+  try {
+groupByOperator = (GroupByOperator) rw.getReducer();
+
+ArrayList<ObjectInspector> ois = new ArrayList<ObjectInspector>();
+ois.add(keyObjectInspector);
+ois.add(valueObjectInspector);
+ObjectInspector[] rowObjectInspector = new ObjectInspector[1];
+rowObjectInspector[0] =
+ObjectInspectorFactory.getStandardStructObjectInspector(Utilities.reduceFieldNameList,
+ois);
+groupByOperator.setInputObjInspectors(rowObjectInspector);
+groupByOperator.initializeOp(conf);
+aggregationBuffers = groupByOperator.getAggregationBuffers();
+aggregationEvaluators = groupByOperator.getAggregationEvaluator();
+
+TableDesc valueTableDesc = rw.getTagToValueDesc().get(0);
+valueSerializer 

[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522544&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522544
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:47
Start Date: 10/Dec/20 04:47
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539842254



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByCombiner.java
##
@@ -0,0 +1,377 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.exec.mr.ExecReducer;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.aggregates.VectorAggregateExpression;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.ReduceWork;
+import org.apache.hadoop.hive.ql.plan.TableDesc;
+import org.apache.hadoop.hive.serde2.AbstractSerDe;
+import org.apache.hadoop.hive.serde2.ByteStream;
+import org.apache.hadoop.hive.serde2.Deserializer;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.hive.serde2.SerDeUtils;
+import org.apache.hadoop.hive.serde2.lazybinary.fast.LazyBinaryDeserializeRead;
+import org.apache.hadoop.hive.serde2.lazybinary.fast.LazyBinarySerializeWrite;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.io.DataInputBuffer;
+import org.apache.hadoop.mapreduce.TaskCounter;
+import org.apache.hadoop.util.ReflectionUtils;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.tez.common.TezUtils;
+import org.apache.tez.common.counters.TezCounter;
+import org.apache.tez.mapreduce.combine.MRCombiner;
+import org.apache.tez.runtime.api.TaskContext;
+import org.apache.tez.runtime.library.common.sort.impl.IFile;
+import org.apache.tez.runtime.library.common.sort.impl.TezRawKeyValueIterator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import java.io.IOException;
+
+import static org.apache.hadoop.hive.ql.exec.Utilities.HAS_REDUCE_WORK;
+import static org.apache.hadoop.hive.ql.exec.Utilities.MAPRED_REDUCER_CLASS;
+import static org.apache.hadoop.hive.ql.exec.Utilities.REDUCE_PLAN_NAME;
+import static org.apache.hadoop.hive.serde2.lazy.fast.LazySimpleDeserializeRead.byteArrayCompareRanges;
+
+// Combiner for vectorized group by operator. In case of map side aggregate, the partially
+// aggregated records are sorted based on group by key. If because of some reasons, like hash
+// table memory exceeded the limit or the first few batches of records have less ndvs, the
+// aggregation is not done, then here the aggregation can be done cheaply as the records
+// are sorted based on group by key.
+public class VectorGroupByCombiner extends MRCombiner {
+  private static final Logger LOG = LoggerFactory.getLogger(
+  VectorGroupByCombiner.class.getName());
+  protected final Configuration conf;
+  protected final TezCounter combineInputRecordsCounter;
+  protected final TezCounter combineOutputRecordsCounter;
+  VectorAggregateExpression[] aggregators;
+  VectorAggregationBufferRow aggregationBufferRow;
+  protected transient LazyBinarySerializeWrite valueLazyBinarySerializeWrite;
+
+  // This helper object serializes LazyBinary format reducer values from columns of a row
+  // in a vectorized row batch.
+  protected transient VectorSerializeRow valueVectorSerializeRow;
+
+  // The output buffer used to serialize a value into.
+  protected transient ByteStream.Output valueOutput;
+  DataInputBuffer valueBytesWritable;
+
+  // Only required minimal configs are copied to the worker nodes. This hack (file.) is
+  // done to include these configs to be copied to the worker node.
+  protected static String 

[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522542&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522542
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:46
Start Date: 10/Dec/20 04:46
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539842047




[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522540&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522540
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:45
Start Date: 10/Dec/20 04:45
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539841639



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByCombiner.java
##
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByCombiner;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceWork;
+import org.apache.hadoop.hive.ql.plan.TableDesc;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.serde2.AbstractSerDe;
+import org.apache.hadoop.hive.serde2.Deserializer;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.hive.serde2.SerDeUtils;
+import org.apache.hadoop.hive.serde2.Serializer;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.io.BytesWritable;
+import org.apache.hadoop.io.DataInputBuffer;
+import org.apache.hadoop.util.ReflectionUtils;
+import org.apache.tez.runtime.api.TaskContext;
+import org.apache.tez.runtime.library.common.sort.impl.IFile;
+import org.apache.tez.runtime.library.common.sort.impl.TezRawKeyValueIterator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.ArrayList;
+
+import static org.apache.hadoop.hive.ql.exec.Utilities.HAS_REDUCE_WORK;
+import static org.apache.hadoop.hive.ql.exec.Utilities.REDUCE_PLAN_NAME;
+
+// Combiner for the normal (row-mode) group by operator. In the map-side
+// aggregate case, the partially aggregated records are sorted on the group by
+// key. If the aggregation was not done earlier, for example because the hash
+// table exceeded its memory limit or the first few batches of records had few
+// distinct values (NDVs), it can be done cheaply here, since the records are
+// sorted on the group by key.
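+// A minimal illustrative sketch of the per-key merge this class performs
+// (partialValues is an assumed name; the real code first deserializes the
+// incoming value row with valueDeserializer):
+//
+//   for (int i = 0; i < aggregationEvaluators.length; i++) {
+//     // fold one partial aggregate into the running buffer, as in
+//     // GenericUDAFEvaluator.Mode.PARTIAL2 processing
+//     aggregationEvaluators[i].merge(aggregationBuffers[i], partialValues[i]);
+//   }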
+public class GroupByCombiner extends VectorGroupByCombiner {
+
+  private static final Logger LOG = LoggerFactory.getLogger(
+      org.apache.hadoop.hive.ql.exec.GroupByCombiner.class.getName());
+
+  private transient GenericUDAFEvaluator[] aggregationEvaluators;
+  Deserializer valueDeserializer;
+  GenericUDAFEvaluator.AggregationBuffer[] aggregationBuffers;
+  GroupByOperator groupByOperator;
+  Serializer valueSerializer;
+  ObjectInspector aggrObjectInspector;
+  DataInputBuffer valueBuffer;
+  Object[] cachedValues;
+
+  public GroupByCombiner(TaskContext taskContext) throws HiveException, IOException {
+    super(taskContext);
+    if (rw != null) {
+      try {
+        groupByOperator = (GroupByOperator) rw.getReducer();
+
+        ArrayList<ObjectInspector> ois = new ArrayList<ObjectInspector>();
+        ois.add(keyObjectInspector);
+        ois.add(valueObjectInspector);
+        ObjectInspector[] rowObjectInspector = new ObjectInspector[1];
+        rowObjectInspector[0] =
+            ObjectInspectorFactory.getStandardStructObjectInspector(Utilities.reduceFieldNameList,
+                ois);
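+        // The combiner consumes reducer-style rows: a two-field (key, value)
+        // struct named via Utilities.reduceFieldNameList, matching the key and
+        // value inspectors registered above.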
+        groupByOperator.setInputObjInspectors(rowObjectInspector);
+        groupByOperator.initializeOp(conf);
+        aggregationBuffers = groupByOperator.getAggregationBuffers();
+        aggregationEvaluators = groupByOperator.getAggregationEvaluator();
+
+        TableDesc valueTableDesc = rw.getTagToValueDesc().get(0);
+        valueSerializer 

[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=522535=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-522535
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 10/Dec/20 04:39
Start Date: 10/Dec/20 04:39
Worklog Time Spent: 10m 
  Work Description: t3rmin4t0r commented on a change in pull request #1736:
URL: https://github.com/apache/hive/pull/1736#discussion_r539838422



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByCombiner.java
##
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByCombiner;
+import org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceWork;
+import org.apache.hadoop.hive.ql.plan.TableDesc;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import org.apache.hadoop.hive.serde2.AbstractSerDe;
+import org.apache.hadoop.hive.serde2.Deserializer;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.hive.serde2.SerDeUtils;
+import org.apache.hadoop.hive.serde2.Serializer;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import org.apache.hadoop.io.BytesWritable;
+import org.apache.hadoop.io.DataInputBuffer;
+import org.apache.hadoop.util.ReflectionUtils;
+import org.apache.tez.runtime.api.TaskContext;
+import org.apache.tez.runtime.library.common.sort.impl.IFile;
+import org.apache.tez.runtime.library.common.sort.impl.TezRawKeyValueIterator;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.ArrayList;
+
+import static org.apache.hadoop.hive.ql.exec.Utilities.HAS_REDUCE_WORK;
+import static org.apache.hadoop.hive.ql.exec.Utilities.REDUCE_PLAN_NAME;
+
+// Combiner for the normal (row-mode) group by operator. In the map-side
+// aggregate case, the partially aggregated records are sorted on the group by
+// key. If the aggregation was not done earlier, for example because the hash
+// table exceeded its memory limit or the first few batches of records had few
+// distinct values (NDVs), it can be done cheaply here, since the records are
+// sorted on the group by key.
+public class GroupByCombiner extends VectorGroupByCombiner {
+
+  private static final Logger LOG = LoggerFactory.getLogger(
+      org.apache.hadoop.hive.ql.exec.GroupByCombiner.class.getName());
+
+  private transient GenericUDAFEvaluator[] aggregationEvaluators;
+  Deserializer valueDeserializer;
+  GenericUDAFEvaluator.AggregationBuffer[] aggregationBuffers;
+  GroupByOperator groupByOperator;
+  Serializer valueSerializer;
+  ObjectInspector aggrObjectInspector;
+  DataInputBuffer valueBuffer;
+  Object[] cachedValues;
+
+  public GroupByCombiner(TaskContext taskContext) throws HiveException, IOException {
+    super(taskContext);
+    if (rw != null) {
+      try {
+        groupByOperator = (GroupByOperator) rw.getReducer();
+
+        ArrayList<ObjectInspector> ois = new ArrayList<ObjectInspector>();
+        ois.add(keyObjectInspector);
+        ois.add(valueObjectInspector);
+        ObjectInspector[] rowObjectInspector = new ObjectInspector[1];
+        rowObjectInspector[0] =
+            ObjectInspectorFactory.getStandardStructObjectInspector(Utilities.reduceFieldNameList,
+                ois);
+        groupByOperator.setInputObjInspectors(rowObjectInspector);
+        groupByOperator.initializeOp(conf);
+        aggregationBuffers = groupByOperator.getAggregationBuffers();
+        aggregationEvaluators = groupByOperator.getAggregationEvaluator();
+
+        TableDesc valueTableDesc = rw.getTagToValueDesc().get(0);
+        valueSerializer 

[jira] [Work logged] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24471?focusedWorklogId=519750=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519750
 ]

ASF GitHub Bot logged work on HIVE-24471:
-

Author: ASF GitHub Bot
Created on: 03/Dec/20 15:57
Start Date: 03/Dec/20 15:57
Worklog Time Spent: 10m 
  Work Description: maheshk114 opened a new pull request #1736:
URL: https://github.com/apache/hive/pull/1736


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 519750)
Remaining Estimate: 0h
Time Spent: 10m

> Add support for combiner in hash mode group aggregation 
> 
>
> Key: HIVE-24471
> URL: https://issues.apache.org/jira/browse/HIVE-24471
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In map side group aggregation, a partial grouped aggregation is calculated to 
> reduce the data written to disk by the map task. In case of hash aggregation, 
> where the input data is not sorted, a hash table is used. If the hash table 
> size grows beyond a configurable limit, the data is flushed to disk and a new 
> hash table is generated. If the reduction achieved by the hash table is less 
> than the minimum hash aggregation reduction estimated at compile time, the map 
> side aggregation is converted to streaming mode. So if the first few batches 
> of records do not yield a significant reduction, the mode is switched to 
> streaming mode. This may hurt performance if the subsequent batches of records 
> have fewer distinct values. To mitigate this situation, a combiner can be 
> added to the map task after the keys are sorted. This ensures that the 
> aggregation is done where possible and reduces the data written to disk.
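
A minimal sketch of the mode switch described above, under assumed names
(minReduction mirrors hive.map.aggr.hash.min.reduction; Hive's actual check
lives in GroupByOperator's hash aggregation path and differs in detail):

  // Decide whether map-side hash aggregation is paying off. Returns true
  // when the operator should fall back to streaming mode.
  static boolean shouldSwitchToStreaming(long distinctKeysInHashTable,
      long rowsProcessed, float minReduction) {
    if (rowsProcessed == 0) {
      return false;            // nothing observed yet
    }
    // A ratio close to 1.0 means almost every row produced a new key,
    // i.e. the hash table achieves little reduction.
    float observed = (float) distinctKeysInHashTable / (float) rowsProcessed;
    return observed > minReduction;
  }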



--
This message was sent by Atlassian Jira
(v8.3.4#803005)