Hello Clinton:

We have encountered the same issue: in many cases, querying Iceberg does not 
meet expected efficiency and falls short of regular ORC/Parquet tables in 
speed. Because the current HiveIcebergInputSplit does not support splitting 
by file size, reading can be slow when individual data files in an Iceberg 
table are excessively large; this needs improvement from community developers 
in future iterations. Additionally, if an Iceberg table uses zstd compression, 
Hive currently handles it via the aircompressor library, a pure-Java 
implementation that is notably less efficient than JNI-based ones. This may 
only improve after a rewrite leveraging the JDK's SIMD support (the Vector 
API). Furthermore, we analyzed execution latency with flame graphs and found 
potential issues in the implementation of 
VectorGroupByOperator$ProcessingModeHashAggregate, which performs exceedingly 
poorly. Consequently, for now we work around the problem on Iceberg tables by 
increasing the number of map tasks and reducing the size of individual data 
files in the table. We hope these issues can be resolved in subsequent 
development iterations.
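
As a concrete illustration of the file-size workaround mentioned above: Iceberg exposes table properties controlling the target data file size on write and the target split size on read. The property names below are standard Iceberg options; the table name is a placeholder and the values are examples only, to be tuned per workload:

```sql
-- Shrink newly written data files (Iceberg's default target is 512 MB)
-- and the read split target (default 128 MB); example values only.
ALTER TABLE db.iceberg_table SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',  -- 128 MB
  'read.split.target-size'       = '67108864'    -- 64 MB
);
```

Note that `write.target-file-size-bytes` only affects files written after the change; existing oversized files must be rewritten (e.g. via compaction) to benefit.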

On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:

Thanks, Okumin.

I am new to Hive and Tez, and I have struggled to deploy a high-performance 
Dockerized Hive setup. I followed the documentation for setting up a remote 
Metastore. I have a single node with 32 GB of RAM and 8 cores, and a dataset 
of about 2 GB (an Iceberg table partitioned on one column). However, when I 
run SELECT queries, performance has not been as fast as expected. Could 
someone share some insights, especially regarding hive-site.xml and custom 
Tez configuration?

Any help would be appreciated.
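
For readers landing on this thread, a few knobs commonly tuned for a small single-node Tez setup. The property names below are standard Hive/Tez options, but the values are only illustrative guesses for a 32 GB / 8-core machine, not recommendations:

```sql
-- Session-level settings; the same properties can also be set in
-- hive-site.xml / tez-site.xml. Values are illustrative starting points.
SET hive.execution.engine=tez;
SET hive.tez.container.size=4096;            -- MB per Tez container
SET hive.vectorized.execution.enabled=true;  -- vectorized query execution
SET hive.cbo.enable=true;                    -- cost-based optimizer
SET tez.am.resource.memory.mb=2048;          -- Tez ApplicationMaster memory
```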


On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:

Hi Clinton,

I tested MERGE INTO with minimal reproduction. I saw the same error.

```
CREATE TABLE src (col1 INT, col2 INT);
CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2
INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;

MERGE INTO dst d USING src s ON s.col1 = d.col1
WHEN MATCHED THEN UPDATE SET col2 = s.col2
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
```

The following query, which explicitly inserts `id`, succeeded on my
machine. The DEFAULT keyword does not appear to be applied to the INSERT
branch of MERGE INTO. I've yet to investigate whether ANSI SQL allows
omitting it.

```
MERGE INTO dst d USING src s ON s.col1 = d.col1
WHEN MATCHED THEN UPDATE SET col2 = s.col2
WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(),
s.col1, s.col2);
```

As another point, SURROGATE_KEY might not work as you expect: in my
attempts it did not generate globally unique ids.
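
A simple way to check that uniqueness claim on the `dst` table from the repro above (plain counting, nothing Hive-specific):

```sql
-- If total_rows exceeds distinct_ids, SURROGATE_KEY() produced duplicates.
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT id) AS distinct_ids FROM dst;
```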

Regards,
Okumin

On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
<clintonfun...@gmail.com> wrote:
>
> Dear Team,
>
> Any help will be much appreciated.
>
> Error SQL Error [40000] [42000]: Error while compiling statement: FAILED: 
> SemanticException Schema of both sides of union should match.
>
> I have an ETL workload that stores data into temp_table with the schema as 
> shown below.
>
> CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>     VC_ITEM_CODE STRING,
>     VC_SUB_GROUP STRING,
>     VC_PRODUCT_NAME STRING,
>     VC_PRODUCT_UNIT STRING,
>     VC_GROUP_CODE STRING,
>     DT_VAT_START TIMESTAMP,
>     VC_BAND_CODE STRING,
>     VC_SEMI_BAND_CODE STRING,
>     VC_DIVISIONS STRING,
>     NU_UNIT_FACTOR DECIMAL(30, 0),
>     VC_DIVISION_SEG_CODE STRING,
>     VC_COLOR_COMB STRING,
>     DT_MOD_DATE TIMESTAMP,
>     VC_INACTIVE_PRODUCT STRING,
>     RN DECIMAL(10, 0),
>     country STRING
>     )
> STORED AS PARQUET
> LOCATION '${path}'
>
> Then i want to load it to the final table
>
> CREATE TABLE product_dimension (
>    `ID` BIGINT DEFAULT SURROGATE_KEY(),
>     VC_ITEM_CODE STRING,
>     VC_SUB_GROUP STRING,
>     VC_PRODUCT_NAME STRING,
>     VC_PRODUCT_UNIT STRING,
>     VC_GROUP_CODE STRING,
>     DT_VAT_START TIMESTAMP,
>     VC_BAND_CODE STRING,
>     VC_SEMI_BAND_CODE STRING,
>     VC_DIVISIONS STRING,
>     NU_UNIT_FACTOR DECIMAL(30, 0),
>     VC_DIVISION_SEG_CODE STRING,
>     VC_COLOR_COMB STRING,
>     DT_MOD_DATE TIMESTAMP,
>     VC_INACTIVE_PRODUCT STRING,
>     RN DECIMAL(10, 0),
>     country STRING,
>     PRIMARY KEY (ID) DISABLE NOVALIDATE)
> STORED BY ICEBERG;
>
> When I attempt to perform a merge operation on column vc_item_code, I get 
> the error shown above:
>
> MERGE INTO product_dimension AS c
> USING (SELECT * FROM temp_table) AS s
>     ON s.vc_item_code = c.vc_item_code AND s.country = c.country
> WHEN MATCHED THEN UPDATE SET
>     vc_item_code = s.vc_item_code,
>     vc_sub_group = s.vc_sub_group,
>     vc_product_name = s.vc_product_name,
>     vc_product_unit = s.vc_product_unit,
>     vc_group_code = s.vc_group_code,
>     dt_vat_start = s.dt_vat_start,
>     vc_band_code = s.vc_band_code,
>     vc_semi_band_code = s.vc_semi_band_code,
>     vc_divisions = s.vc_divisions,
>     nu_unit_factor = s.nu_unit_factor,
>     vc_division_seg_code = s.vc_division_seg_code,
>     vc_color_comb = s.vc_color_comb,
>     dt_mod_date = s.dt_mod_date,
>     vc_inactive_product = s.vc_inactive_product,
>     rn = s.rn,
>     country = s.country
> WHEN NOT MATCHED THEN INSERT (
>     vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
>     vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
>     vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
>     dt_mod_date, vc_inactive_product, rn, country
> ) VALUES (
>     s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
>     s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code,
>     s.vc_divisions, s.nu_unit_factor, s.vc_division_seg_code, s.vc_color_comb,
>     s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
> );
>
> Warm Regards
