[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-26 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Attachment: java heap2.png
java heap1.png

> tez unable to control the memory size when UDF occupies 100MB memory 
> -
>
> Key: TEZ-4442
> URL: https://issues.apache.org/jira/browse/TEZ-4442
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.1
> Environment: CDP7.1.7SP1
> tez 0.9.1
> hive 3.1.3
> Reporter: Authur Wang
> Priority: Critical
> Attachments: app.log, application_1659706606596_0047.log.gz, 
> hiveserver2.out, java heap1.png, java heap2.png, 
> spark-udf-0.0.1-SNAPSHOT.jar, spark-udf-src.zip
>
>
>           We have a UDF which loads about 5 million records into memory, 
> matches them against the user's input, and returns the result. Each input 
> record to the UDF produces exactly one output record.
>           Based on heap-dump analysis, this UDF occupies about 100 MB of 
> memory. It runs stably on Hive on MR, Hive on Spark, and native Spark, 
> needing only about 4 GB of memory in those cases. With the Tez engine, 
> however, the task fails even after we raise the memory from 4 GB to 8 GB, 
> and it still fails with high probability at 12 GB. Why does the Tez engine 
> need so much more memory than MR and Spark? Is there a good tuning method 
> to control the amount of memory?
>  
>  
> The command is as follows:
> beeline -u 
> 'jdbc:hive2://bg21146.hadoop.com:1/default;principal=hive/bg21146.hadoop@bg.com'
>  --hiveconf tez.queue.name=root.000kjb.bdhmgmas_bas -e "
>  
> create temporary function get_card_rank as 
> 'com.unionpay.spark.udf.GenericUDFCupsCardMediaProc' using jar 
> 'hdfs:///user/lib/spark-udf-0.0.1-SNAPSHOT.jar';
>  
> set tez.am.log.level=debug;
> set tez.am.resource.memory.mb=8192;
> set hive.tez.container.size=8192;
> set tez.task.resource.memory.mb=2048;
> set tez.runtime.io.sort.mb=1200;
> set hive.auto.convert.join.noconditionaltask.size=5;
> set tez.runtime.unordered.output.buffer.size-mb=800;
> set tez.grouping.min-size=33554432;
> set tez.grouping.max-size=536870912;
> set hive.tez.auto.reducer.parallelism=true;
> set hive.tez.min.partition.factor=0.25;
> set hive.tez.max.partition.factor=2.0;
> set hive.exec.reducers.bytes.per.reducer=268435456;
> set mapreduce.map.memory.mb=4096;
> set ipc.maximum.response.length=153600;
>  
>  
> select
>  get_card_rank(ext_pri_acct_no) as ext_card_media_proc_md,
>  count(*)
> from bs_comdb.tmp_bscom_glhis_ct_settle_dtl_bas_swt a
> where a.hp_settle_dt = '20200910'
> group by get_card_rank(ext_pri_acct_no)
> ;
> "
>  

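A note on where the heap goes under the settings above: hive.tez.container.size 
only sizes the YARN container, while the task JVM heap normally comes from 
hive.tez.java.opts or, when that is unset, from a default fraction of the 
container (tez.container.max.java.heap.fraction, typically 0.8, so about 
6554 MB for an 8192 MB container here), and tez.runtime.io.sort.mb is carved 
out of that heap. A minimal, hedged sketch of settings worth trying; the exact 
values are illustrative assumptions, not a confirmed fix:

set hive.tez.container.size=8192;
-- pin the task heap explicitly instead of relying on the default heap fraction
set hive.tez.java.opts=-Xmx6g;
-- shrink the sort buffer so more heap is left for the UDF's in-memory table
set tez.runtime.io.sort.mb=512;

Given that the UDF table itself is only about 100 MB per task, the sort buffer 
and unordered output buffer (1200 MB and 800 MB in the command above) are 
plausible sources of the extra heap pressure and may be worth lowering before 
raising the container size further.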




[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Environment: 
CDP7.1.7SP1
tez 0.9.1
hive 3.1.3

  was: we use CDP7.1.7SP1 with the 0.91 tez version




[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Attachment: hiveserver2.out



[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Attachment: app.log
application_1659706606596_0047.log.gz



[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Attachment: spark-udf-src.zip
spark-udf-0.0.1-SNAPSHOT.jar



[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Description: 

[jira] [Updated] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

2022-08-07 Thread Authur Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Authur Wang updated TEZ-4442:
-
Environment: we use CDP7.1.7SP1 with the 0.91 tez version




--
This message was sent by Atlassian Jira
(v8.20.10#820010)