[jira] [Commented] (TEZ-4577) SortSpan could be created real small, resulting in eventual job failure

2024-08-22 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875758#comment-17875758
 ] 

Chenyu Zheng commented on TEZ-4577:
---

[~yigress] 

I got it! If the remaining buffer is too small after a span ends, the current 
span may be very small. When we create a new span, we may size it according to 
that small span.

I have submitted [https://github.com/apache/tez/pull/367]. Can you please 
review it? 
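
To make the failure mode concrete, here is a minimal sketch of how sizing a new span from a tiny previous span can get stuck at one item. It is an illustration only, not the code in the pull request: the class, the method names, and the 16 bytes of per-record metadata are all assumptions made for the example.

```java
final class SpanSizingSketch {
    // Assumed metadata bytes per record; illustrative only.
    static final int METASIZE = 16;

    /** Number of items that fit in the given bytes, never less than one. */
    static int itemsThatFit(long bytes, int perItem) {
        long items = bytes / ((long) perItem + METASIZE);
        return (int) Math.max(1L, Math.min(items, (long) Integer.MAX_VALUE));
    }

    /** Hypothetical policy: reuse the previous span's item count for the next span. */
    static int nextMaxItemsFromPreviousSpan(int previousSpanItems) {
        return Math.max(1, previousSpanItems);
    }

    /** Hypothetical alternative: size the next span from the capacity it actually gets. */
    static int nextMaxItemsFromCapacity(long capacity, int perItem) {
        return itemsThatFit(capacity, perItem);
    }

    public static void main(String[] args) {
        int perItem = 138;          // average record size, as in the sample logs
        long leftover = 200;        // hypothetical tail of the buffer after a span ends
        long fresh = 268_435_456L;  // hypothetical buffer available to a new span

        int tinySpan = itemsThatFit(leftover, perItem);
        System.out.println("items in the tiny span:       " + tinySpan);
        System.out.println("next span, from tiny span:    " + nextMaxItemsFromPreviousSpan(tinySpan));
        System.out.println("next span, from its capacity: " + nextMaxItemsFromCapacity(fresh, perItem));
    }
}
```

With the first policy a one-item span keeps producing one-item spans, which matches the reported behaviour of maxItems staying at 1; the second policy recovers as soon as a larger buffer is available.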

> SortSpan could be created real small, resulting in eventual job failure
> ---
>
>                 Key: TEZ-4577
>                 URL: https://issues.apache.org/jira/browse/TEZ-4577
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.4
>            Reporter: Yi Zhang
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We ran into an overflow issue like the one in TEZ-4542. With the TEZ-4542 fix 
> applied, we then ran into very small sort spans (one per record in this 
> case), and eventually the job failed due to a timeout.
> From the sample logs, it looks like once
>  
> SortSpan(ByteBuffer source, int maxItems, int perItem, RawComparator 
> comparator)
>  
> gets into a situation of maxItems=1, it persists with maxItems=1.
>  
> (As a side issue, the logging in this situation becomes huge.)
>  
> sample logs:
> 2024-08-19 19:01:37,704 [INFO] 
> [TezTaskEventRouter\{attempt_1724090939581_0001_1_00_97_2}] 
> |input.MRInput|: scope-20 -> scope-302 initialized RecordReader from event
> 2024-08-19 19:01:37,709 [INFO] [TezChild] |runtime.PigProcessor|: Starting 
> output 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput@e3a121c to 
> vertex scope-308
> 2024-08-19 19:01:37,742 [INFO] [TezChild] |impl.ExternalSorter|: scope-302 -> 
> scope-308 using: memoryMb=256, keySerializerClass=class 
> org.apache.pig.impl.io.NullableTuple, 
> valueSerializerClass=org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer@1fe3d5ed,
>  
> comparator=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator@3d696d6,
>  partitioner=org.apache.tez.mapreduce.partition.MRPartitioner, 
> serialization=org.apache.hadoop.io.serializer.WritableSerialization, 
> org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, 
> org.apache.hadoop.io.serializer.avro.AvroReflectSerialization, 
> reportPartitionStats=MEMORY_OPTIMIZED
> 2024-08-19 19:01:37,758 [INFO] [TezChild] |partition.MRPartitioner|: Using 
> newApi, 
> MRpartitionerClass=org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
> 2024-08-19 19:01:37,758 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up 
> PipelinedSorter for scope-302 -> scope-308: , UsingHashComparator=false
> 2024-08-19 19:01:37,800 [INFO] [TezChild] |impl.PipelinedSorter|: Newly 
> allocated block size=268435456, index=0, Number of buffers=1, 
> currentAllocatableMemory=0, currentBufferSize=268435456, total=268435456
> 2024-08-19 19:01:37,800 [INFO] [TezChild] |impl.PipelinedSorter|: Pre 
> allocating rest of memory buffers upfront
> 2024-08-19 19:01:37,800 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up 
> PipelinedSorter for scope-302 -> scope-308: , 
> UsingHashComparator=false#blocks=1, maxMemUsage=268435456, 
> lazyAllocateMem=false, minBlockSize=2097152000, initial BLOCK_SIZE=268435456, 
> finalMergeEnabled=true, pipelinedShuffle=false, sendEmptyPartitions=true, 
> tez.runtime.io.sort.mb=256
> 2024-08-19 19:01:37,802 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268435456, reserved.metasize=16777216
> 2024-08-19 19:01:37,827 [INFO] [TezChild] |operator.POLocalRearrangeTez|: 
> Attached output to vertex scope-308 : 
> output=org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput@e3a121c,
>  
> writer=org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1@148db3c7
> 2024-08-19 19:01:37,827 [INFO] [TezChild] |runtime.PigProcessor|: Aliases 
> being processed per job phase (AliasName[line,offset]): 
> prev_partition_data[13,28],prev_partition_req_fields[24,28],prev_curr_grp[95,16]
> 2024-08-19 19:01:45,632 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span0.length = 1048573, perItem = 138
> 2024-08-19 19:01:45,633 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=106530636, reserved.metasize=11068112
> 2024-08-19 19:01:45,633 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span1.length = 691757, perItem = 138, counter:1048573
> 2024-08-19 19:01:49,491 [INFO] [Sorter \{scope_302 -> scope_308} #0] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=0, 
> length=1048573, time=3857
> 2024-08-19 19:01:51,495 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span1.length = 689460, perItem = 138
> 

[jira] [Comment Edited] (TEZ-4577) SortSpan could be created real small, resulting in eventual job failure

2024-08-22 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875758#comment-17875758
 ] 

Chenyu Zheng edited comment on TEZ-4577 at 8/22/24 7:24 AM:


[~yigress] 

I got it! If the remaining buffer is too small after a span ends, the current 
span may be very small. When we create a new span, we may size it according to 
that small span.

I have submitted [https://github.com/apache/tez/pull/367]. Can you please 
review it? 


was (Author: zhengchenyu):
[~yigress] 

I got it! If the remaining buffer is too small after a span ends, the current 
span may be very small. When we create a new span, we may size it according to 
that small span.

I have submitted [[https://github.com/apache/tez/pull/367],] can you please 
review this? 


[jira] [Comment Edited] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-21 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875389#comment-17875389
 ] 

Chenyu Zheng edited comment on TEZ-4542 at 8/21/24 7:41 AM:


[~abstractdog] [~glapark] [~yigress] 

I submitted [https://github.com/apache/tez/pull/367] to try to fix this problem 
in another way, which will also solve the problem described in TEZ-4577.
As for the earlier discussion about a single particularly large record, we can 
come back to it later and first fix the problem in TEZ-4577. What do you think?


was (Author: zhengchenyu):
[~abstractdog] [~glapark] [~yigress] 

I submitted 
[https://github.com/apache/tez/pull/367|https://github.com/apache/tez/pull/367.]
 to try to fix this problem in another way, which will also solve the problem 
described in TEZ-4577.
As for the earlier discussion about a single particularly large record, we can 
come back to it later and first fix the problem in TEZ-4577. What do you think?

> Tez application may fail due to int overflow when record size is large and 
> sort memory is low.
> --
>
>                 Key: TEZ-4542
>                 URL: https://issues.apache.org/jira/browse/TEZ-4542
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>             Fix For: 0.10.4
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Tez application failed, and we found this error stack:
> {code:java}
>   at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException
>   at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
>   at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
>   at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
>   at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
>   at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
>   at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
>   at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
>   at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
>   at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
>   ... 19 more
> Caused by: java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
>   at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
>   at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
>   at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
>   at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
>   at org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
>   at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
>   at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
>   ... 28 more {code}
> After adding debug logging, the problem is easy to find: the variable 
> `dataSize` in {{PipelinedSorter::SortSpan}} overflows. 
> The problem is triggered when the following two conditions are met at the 
> same time:
>  * The vertex has too many inputs/outputs, so the memory allocated to each 
> I/O for sorting is too small.
>  * The average record size is larger than 2K. `dataSize` in 
> {{PipelinedSorter::SortSpan}} then overflows, so the code does not try to 
> allocate less meta space, and an exception is raised.
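
The overflow described above can be reproduced in isolation. The following is a minimal sketch rather than the Tez code: the class and method names, the 16 bytes of per-record metadata, the item budget of about one million, and the 64 MiB capacity are assumptions chosen so that a 2K record size pushes the 32-bit product past Integer.MAX_VALUE.

```java
import java.nio.ByteBuffer;

final class DataSizeOverflowSketch {
    // Assumed metadata bytes per record; illustrative, not taken from the Tez source.
    static final int METASIZE = 16;

    /** 32-bit product: silently wraps around once it exceeds Integer.MAX_VALUE. */
    static int dataSizeInt(int maxItems, int perItem) {
        return maxItems * perItem;
    }

    /** 64-bit product: widen one operand before multiplying so nothing wraps. */
    static long dataSizeLong(int maxItems, int perItem) {
        return (long) maxItems * perItem;
    }

    /** The "allocate less meta space" check, evaluated with int arithmetic. */
    static boolean shouldShrinkMetaInt(int capacity, int metasize, int dataSize) {
        return capacity < (metasize + dataSize);
    }

    /** The same check, evaluated with long arithmetic. */
    static boolean shouldShrinkMetaLong(long capacity, long metasize, long dataSize) {
        return capacity < (metasize + dataSize);
    }

    public static void main(String[] args) {
        int maxItems = 1 << 20;   // hypothetical item budget (about one million)
        int perItem = 2048;       // 2K average record size
        int capacity = 64 << 20;  // hypothetical 64 MiB sort buffer
        int metasize = METASIZE * maxItems;

        int wrapped = dataSizeInt(maxItems, perItem);
        long exact = dataSizeLong(maxItems, perItem);
        System.out.println("int dataSize  = " + wrapped);
        System.out.println("long dataSize = " + exact);
        System.out.println("shrink (int)  = " + shouldShrinkMetaInt(capacity, metasize, wrapped));
        System.out.println("shrink (long) = " + shouldShrinkMetaLong(capacity, metasize, exact));

        // A negative offset handed to Buffer.position is rejected.
        try {
            ByteBuffer.allocate(16).position(-1);
        } catch (IllegalArgumentException e) {
            System.out.println("position(-1) -> IllegalArgumentException");
        }
    }
}
```

Because the wrapped value is negative, the int version of the check never fires, so the meta space is not reduced; a negative size that later reaches a buffer position call would be rejected with the IllegalArgumentException seen at the bottom of the stack trace.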



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-21 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875389#comment-17875389
 ] 

Chenyu Zheng edited comment on TEZ-4542 at 8/21/24 7:39 AM:


[~abstractdog] [~glapark] [~yigress] 

I submitted 
[https://github.com/apache/tez/pull/367|https://github.com/apache/tez/pull/367.]
 to try to fix this problem in another way, which will also solve the problem 
described in TEZ-4577.
As for the earlier discussion about a single particularly large record, we can 
come back to it later and first fix the problem in TEZ-4577. What do you think?


was (Author: zhengchenyu):
[~abstractdog] [~glapark] [~yigress] 

I submitted [https://github.com/apache/tez/pull/367.] to try to fix this problem 
in another way, which will also solve the problem described in TEZ-4577.
As for the earlier discussion about a single particularly large record, we can 
come back to it later and first fix the problem in TEZ-4577. What do you think?



[jira] [Commented] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-21 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875389#comment-17875389
 ] 

Chenyu Zheng commented on TEZ-4542:
---

[~abstractdog] [~glapark] [~yigress] 

I submitted [https://github.com/apache/tez/pull/367.] to try to fix this problem 
in another way, which will also solve the problem described in TEZ-4577.
As for the earlier discussion about a single particularly large record, we can 
come back to it later and first fix the problem in TEZ-4577. What do you think?



[jira] [Commented] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-20 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875368#comment-17875368
 ] 

Chenyu Zheng commented on TEZ-4542:
---

[~glapark] [~abstractdog] 

If we revert this patch, we may still have this problem. Consider an extreme 
case where one particular record is very large and the other records are 
normal. With the code below, metasize will still be small. I think we may need 
to delete the metasize optimization.
{code:java}
if(capacity < (metasize+dataSize)) {
  // try to allocate less meta space, because we have sample data
  metasize = METASIZE*(capacity/(perItem+METASIZE));
} {code}
 We could delete this code, even though that may waste more memory. Or we could 
set a minimum value for metasize.
 
[~rbalamohan] Can you give us some advice?
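
The second option, a minimum value for metasize, could look roughly like the sketch below. It is only an illustration of the idea: `MIN_METASIZE`, its value, the method names, and the 16 bytes of per-record metadata are hypothetical and not taken from the Tez source or from any patch.

```java
final class MetaSizeFloorSketch {
    // Assumed metadata bytes per record; illustrative only.
    static final int METASIZE = 16;
    // Hypothetical floor: always keep room for metadata of at least 1024 records.
    static final int MIN_METASIZE = METASIZE * 1024;

    /** The shrink formula quoted in the comment above, unchanged. */
    static int shrunkMetasize(int capacity, int perItem) {
        return METASIZE * (capacity / (perItem + METASIZE));
    }

    /** The same formula, clamped to the floor but never above the capacity itself. */
    static int shrunkMetasizeWithFloor(int capacity, int perItem) {
        int shrunk = shrunkMetasize(capacity, perItem);
        return Math.min(capacity, Math.max(MIN_METASIZE, shrunk));
    }

    public static void main(String[] args) {
        int capacity = 64 << 20;   // hypothetical 64 MiB buffer
        int hugeSample = 32 << 20; // perItem skewed by one very large record
        int normal = 138;          // a typical record size

        System.out.println("huge sample, no floor:    " + shrunkMetasize(capacity, hugeSample));
        System.out.println("huge sample, with floor:  " + shrunkMetasizeWithFloor(capacity, hugeSample));
        System.out.println("normal sample, with floor: " + shrunkMetasizeWithFloor(capacity, normal));
    }
}
```

With a skewed sample the unclamped formula leaves metadata room for a single record, while the floor keeps room for 1024; for normal record sizes the floor has no effect. The cost is the extra reserved memory mentioned above.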



[jira] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-20 Thread Chenyu Zheng (Jira)


[ https://issues.apache.org/jira/browse/TEZ-4542 ]


Chenyu Zheng deleted comment on TEZ-4542:
---

was (Author: zhengchenyu):
[~glapark] Do you have any performance test results after reverting this patch?



[jira] [Commented] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-20 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875356#comment-17875356
 ] 

Chenyu Zheng commented on TEZ-4542:
---

[~glapark] Do you have any performance test results after reverting this patch?

> Tez application may fail due to int overflow when record size is large and 
> sort memory is low.
> --
>
> Key: TEZ-4542
> URL: https://issues.apache.org/jira/browse/TEZ-4542
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.2
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
> Fix For: 0.10.4
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A Tez application failed, and this error stack was found:
> {code:java}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
>   ... 19 more
> Caused by: java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
>   ... 28 more {code}
> After adding a debug log, the problem is easy to find: the variable 
> `dataSize` in PipelinedSorter::SortSpan overflows. 
> This problem is triggered when the following two conditions are met at the 
> same time:
>  * The vertex has too many inputs and outputs (I/O), so the memory allocated 
> to each I/O for sorting is too small.
>  * The average record size is larger than 2K. `dataSize` in 
> PipelinedSorter::SortSpan then overflows, so the sorter does not try to 
> allocate less meta space and raises an exception.





[jira] [Commented] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-08-20 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875355#comment-17875355
 ] 

Chenyu Zheng commented on TEZ-4542:
---

[~glapark] OK, let's revert this patch first and then solve this problem 
another way.

cc [~abstractdog] 






[jira] [Commented] (TEZ-4577) SortSpan could be created real small, resulting in eventual job failure

2024-08-20 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875335#comment-17875335
 ] 

Chenyu Zheng commented on TEZ-4577:
---

[~yigress]

If the maxItems of a new span is 1, the kvmeta of the new span will be very 
small. PipelinedSorter::sort will then be triggered frequently, which makes the 
job slow. Am I right? If so, I think it needs to be fixed. Do you have any 
plans to fix it?

In addition, I am curious: since the first span size is 16*1024*1024, why does 
maxItems become 1? Can you add some logs to the affected application to print 
the relevant calls to PipelinedSorter::sort?
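
To put a rough number on "triggered frequently", here is a minimal sketch. It 
assumes a simplified model in which sort is called once per full span and a 
span holds at most maxItems records; that is my reading of this comment, not a 
confirmed description of PipelinedSorter. The class and method names are 
hypothetical, the record count is the counter value from the sample logs 
quoted below, and 1024*1024 is only an assumed comparison point.

```java
final class SortCallEstimate {

    // Hypothetical model: one sort call per span, each span holding up to
    // maxItems records. Not taken from the Tez source.
    static long sortCalls(long records, int maxItems) {
        if (records < 0 || maxItems <= 0) {
            throw new IllegalArgumentException("records must be >= 0 and maxItems > 0");
        }
        // Ceiling division: a partially filled last span still needs a sort.
        return (records + maxItems - 1) / maxItems;
    }

    public static void main(String[] args) {
        long records = 5_307_003L; // "counter" value from the sample logs

        // Degenerate case from the report: every span holds one record.
        System.out.println("maxItems=1       -> " + sortCalls(records, 1));

        // Assumed comparison point: about one million items per span.
        System.out.println("maxItems=1048576 -> " + sortCalls(records, 1024 * 1024));
    }
}
```

Under this model the sort count equals the record count when maxItems is 1, 
which would be consistent with both the slowdown and the very large log 
volume described in the issue.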

> SortSpan could be created real small, resulting in eventual job failure
> ---
>
> Key: TEZ-4577
> URL: https://issues.apache.org/jira/browse/TEZ-4577
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.10.4
>Reporter: Yi Zhang
>Priority: Major
>
> We ran into an overflow issue as in TEZ-4542. With TEZ-4542 applied, we 
> then ran into an issue of very small sort spans (one per record in this 
> case), and eventually the job failed due to a timeout.
> From the sample logs, it looks like the problem is in 
>  
> SortSpan(ByteBuffer source, int maxItems, int perItem, RawComparator 
> comparator)
>  
> Once it gets into a situation where maxItems=1, it persists with maxItems=1.
>  
> (Also, as a side issue, the logging in this situation becomes huge.)
>  
> sample logs:
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span260.length = 1, perItem = 139
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268396925, reserved.metasize=16
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span261.length = 1, perItem = 139, counter:5307003
> 2024-08-19 19:02:28,157 [INFO] [Sorter {scope_302 -> scope_308} #1|#1] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=260, 
> length=1, time=0
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span261.length = 1, perItem = 128
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268396781, reserved.metasize=16
> 2024-08-19 19:02:28,157 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span262.length = 1, perItem = 128, counter:5307004
> 2024-08-19 19:02:28,158 [INFO] [Sorter {scope_302 -> scope_308} #0|#0] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=261, 
> length=1, time=0
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span262.length = 1, perItem = 145
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268396620, reserved.metasize=16
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span263.length = 1, perItem = 145, counter:5307005
> 2024-08-19 19:02:28,158 [INFO] [Sorter {scope_302 -> scope_308} #1|#1] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=262, 
> length=1, time=0
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span263.length = 1, perItem = 139
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268396465, reserved.metasize=16
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span264.length = 1, perItem = 139, counter:5307006
> 2024-08-19 19:02:28,158 [INFO] [Sorter {scope_302 -> scope_308} #0|#0] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=263, 
> length=1, time=0
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: Span264.length = 1, perItem = 129
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: reserved.remaining()=268396320, reserved.metasize=16
> 2024-08-19 19:02:28,158 [INFO] [TezChild] |impl.PipelinedSorter|: scope-302 
> -> scope-308: New Span265.length = 1, perItem = 129, counter:5307007
> 2024-08-19 19:02:28,158 [INFO] [Sorter {scope_302 -> scope_308} #1|#1] 
> |impl.PipelinedSorter|: scope-302 -> scope-308: done sorting span=264, 
> length=1, time=0
>  
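
The figures in the sample logs above can be cross-checked with a little 
arithmetic. The sketch below pairs each reserved.remaining() value with the 
perItem value printed on the line just before it and tests whether the drop 
between consecutive spans equals perItem plus reserved.metasize (16). The 
class and method names are hypothetical, and the reading that each span holds 
one record plus one meta entry is an interpretation of the log values, not 
something confirmed from the Tez source.

```java
final class SpanLogCheck {

    // Returns true when every drop in remaining space between consecutive
    // spans equals the later span's perItem plus the per-record meta size.
    static boolean consumesOneRecordPerSpan(long[] remaining, int[] perItem, int metaSize) {
        if (remaining.length != perItem.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        for (int i = 1; i < remaining.length; i++) {
            long consumed = remaining[i - 1] - remaining[i];
            if (consumed != (long) perItem[i] + metaSize) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Values copied from the sample logs for Span260 through Span264.
        long[] remaining = {268396925L, 268396781L, 268396620L, 268396465L, 268396320L};
        int[] perItem = {139, 128, 145, 139, 129};
        int metaSize = 16; // reserved.metasize in the logs

        System.out.println("one record + one meta entry per span: "
                + consumesOneRecordPerSpan(remaining, perItem, metaSize));
    }
}
```

For these five log entries the check holds, which matches the reported 
Span.length = 1 on every line.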





[jira] [Commented] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-05-14 Thread Chenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846208#comment-17846208
 ] 

Chenyu Zheng commented on TEZ-4542:
---

Thanks [~abstractdog] and [~rbalamohan] for the review!

[~abstractdog] BTW, do you mind taking a look at HIVE-27985?






[jira] [Updated] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-02-22 Thread Chenyu Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyu Zheng updated TEZ-4542:
--
Description: 
A Tez application failed, and this error stack was found:
{code:java}
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
  ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
  ... 19 more
Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
  at 
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
  at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
  ... 28 more {code}
After adding a debug log, the problem is easy to find: the variable 
`dataSize` in PipelinedSorter::SortSpan overflows. 

This problem is triggered when the following two conditions are met at the 
same time:
 * The vertex has too many inputs and outputs (I/O), so the memory allocated 
to each I/O for sorting is too small.
 * The average record size is larger than 2K. `dataSize` in 
PipelinedSorter::SortSpan then overflows, so the sorter does not try to 
allocate less meta space and raises an exception.

  was:
A Tez application failed, and this error stack was found:
{code:java}
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
  ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
  ... 19 more
Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(Pipelined

[jira] [Updated] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-02-22 Thread Chenyu Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyu Zheng updated TEZ-4542:
--
Description: 
A Tez application failed, and this error stack was found:
{code:java}
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
  ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
  ... 19 more
Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
  at 
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
  at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
  ... 28 more {code}
After adding a debug log, the problem is easy to find: the variable 
`dataSize` in PipelinedSorter::SortSpan overflows. 

This problem is triggered when the following two conditions are met at the 
same time:
 * The vertex has too many inputs and outputs (I/O), so the memory allocated 
to each I/O for sorting is too small.
 * The average record size is larger than 2K. `dataSize` in 
PipelinedSorter::SortSpan then overflows, so the sorter does not try to 
allocate less meta space and raises an exception.

Solution: change dataSize to long

  was:
A Tez application failed, and this error stack was found:
{code:java}
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
  ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
  ... 19 more
Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
  at 
org.apache.tez.runtime.library.common.sort.i

[jira] [Created] (TEZ-4542) Tez application may fail due to int overflow when record size is large and sort memory is low.

2024-02-22 Thread Chenyu Zheng (Jira)
Chenyu Zheng created TEZ-4542:
-

 Summary: Tez application may fail due to int overflow when record 
size is large and sort memory is low.
 Key: TEZ-4542
 URL: https://issues.apache.org/jira/browse/TEZ-4542
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.9.2
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng


A Tez application failed, and this error stack was found:
{code:java}
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
  ... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
  at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
  at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
  at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
  ... 19 more
Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
  at 
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
  at 
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
  at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
  at 
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
  ... 28 more {code}
After adding a debug log, the problem is easy to find: the variable 
`dataSize` in PipelinedSorter::SortSpan overflows. 

This problem is triggered when the following two conditions are met at the 
same time:
 * The vertex has too many inputs and outputs (I/O), so the memory allocated 
to each I/O for sorting is too small.
 * The average record size is larger than 2K. `dataSize` in 
PipelinedSorter::SortSpan then overflows, so the sorter does not try to 
allocate less meta space and raises an exception.

Solution: change dataSize to long
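
As an illustration of the failure mode and of the proposed change, here is a 
minimal sketch. It assumes that `dataSize` is derived from an int product of 
an item count and a per-item size; the exact expression in 
PipelinedSorter.SortSpan is not quoted in this issue, so the class, the 
method names, and the input values below are hypothetical. The values are 
chosen so that the product exceeds Integer.MAX_VALUE (note that 
1024*1024 items at exactly 2K each is already 2^31).

```java
import java.nio.ByteBuffer;

final class DataSizeOverflowSketch {

    // Hypothetical stand-in for an int-based size calculation.
    // A 32-bit multiply wraps around silently when the result does not fit.
    static int dataSizeAsInt(int maxItems, int perItem) {
        return maxItems * perItem;
    }

    // Same calculation with the operand widened first, as in
    // "change dataSize to long".
    static long dataSizeAsLong(int maxItems, int perItem) {
        return (long) maxItems * perItem;
    }

    // Buffer.position(int) rejects negative positions (and positions past
    // the limit) with an IllegalArgumentException.
    static boolean positionRejects(ByteBuffer buffer, int newPosition) {
        try {
            buffer.position(newPosition);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int maxItems = 1024 * 1024; // assumed item count
        int perItem = 2049;         // assumed record size, just above 2K

        int wrapped = dataSizeAsInt(maxItems, perItem);
        long exact = dataSizeAsLong(maxItems, perItem);
        System.out.println("int  dataSize = " + wrapped);
        System.out.println("long dataSize = " + exact);

        ByteBuffer buffer = ByteBuffer.allocate(1024);
        System.out.println("position(wrapped) rejected: "
                + positionRejects(buffer, wrapped));
    }
}
```

With a long, the size can be compared against the available buffer space 
before any call to Buffer.position, so an oversized request can be detected 
instead of wrapping to a negative value.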


