Hello Clinton: We have run into the same issue. In many cases, querying Iceberg does not reach the expected efficiency and falls short of regular ORC/Parquet tables in speed. Because the current HiveIcebergInputSplit does not support splitting by file size, reads can be slow when individual data files in an Iceberg table are excessively large; this is something that needs improvement from community developers in future iterations. In addition, if an Iceberg table uses zstd compression, Hive currently handles it through the aircompressor library, a pure-Java implementation that is noticeably less efficient than JNI-based ones; this may only improve after a rewrite on top of the JDK's SIMD support. We have also analyzed execution latency with flame graphs and found a potential problem in the implementation of VectorGroupByOperator$ProcessingModeHashAggregate, which performs very poorly. For now, with Iceberg tables, we work around the issue by increasing the number of map tasks and reducing the size of individual data files in the Iceberg table, as sketched below. We hope these issues can be resolved in subsequent iterations.
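For reference, here is a rough sketch of that interim workaround, assuming Hive on Tez. The table name is a placeholder, 'write.target-file-size-bytes' is a standard Iceberg table property, and the tez.grouping.* settings control how Tez groups input splits; whether the split settings take effect for your Iceberg reads depends on the Hive/Tez versions in use, so treat this as a starting point rather than definitive tuning.

```
-- Sketch only: placeholder table name, example values.
-- Ask Iceberg to write smaller data files (~128 MB here) so a single file
-- becomes a smaller unit of work for a map task.
ALTER TABLE my_iceberg_table SET TBLPROPERTIES ('write.target-file-size-bytes'='134217728');

-- Encourage more map tasks by lowering the Tez split-grouping limits (bytes).
SET tez.grouping.min-size=16777216;
SET tez.grouping.max-size=134217728;
```

Note that the file-size property only affects newly written data files; existing oversized files stay large until they are rewritten.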
On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:

Thanks Okumin. I am new to Hive and Tez and have struggled to deploy a high-performance Dockerized Hive setup. I followed the documentation for setting up a remote Metastore. I have a single node with 32 GB of RAM and 8 cores, and a dataset of about 2 GB (an Iceberg table partitioned on one column). However, when I run SELECT queries, performance has not been as fast as expected. Could someone share some insights, especially regarding hive-site.xml and custom Tez configuration? Any help would be appreciated.

On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:

Hi Clinton,

I tested MERGE INTO with a minimal reproduction and saw the same error.

```
CREATE TABLE src (col1 INT, col2 INT);

CREATE TABLE dst (
  id BIGINT DEFAULT SURROGATE_KEY(),
  col1 INT,
  col2 INT,
  PRIMARY KEY (id) DISABLE NOVALIDATE)
STORED BY ICEBERG;

MERGE INTO dst d USING src s ON s.col1 = d.col1
WHEN MATCHED THEN UPDATE SET col2 = s.col2
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
```

The following query, which explicitly inserts `id`, succeeded on my machine. The DEFAULT keyword is unlikely to work for the INSERT branch of MERGE INTO; I've yet to investigate whether ANSI allows us to omit it.

```
MERGE INTO dst d USING src s ON s.col1 = d.col1
WHEN MATCHED THEN UPDATE SET col2 = s.col2
WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(), s.col1, s.col2);
```

As another point, SURROGATE_KEY might not work as you expect: it did not generate globally unique ids in my attempts.

Regards,
Okumin

On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata <clintonfun...@gmail.com> wrote:
>
> Dear Team,
>
> Any help will be much appreciated.
>
> Error: SQL Error [40000] [42000]: Error while compiling statement: FAILED:
> SemanticException Schema of both sides of union should match.
>
> I have an ETL workload that stores data into temp_table with the schema as
> shown below.
>
> CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>   VC_ITEM_CODE STRING,
>   VC_SUB_GROUP STRING,
>   VC_PRODUCT_NAME STRING,
>   VC_PRODUCT_UNIT STRING,
>   VC_GROUP_CODE STRING,
>   DT_VAT_START TIMESTAMP,
>   VC_BAND_CODE STRING,
>   VC_SEMI_BAND_CODE STRING,
>   VC_DIVISIONS STRING,
>   NU_UNIT_FACTOR DECIMAL(30, 0),
>   VC_DIVISION_SEG_CODE STRING,
>   VC_COLOR_COMB STRING,
>   DT_MOD_DATE TIMESTAMP,
>   VC_INACTIVE_PRODUCT STRING,
>   RN DECIMAL(10, 0),
>   country STRING
> )
> STORED AS PARQUET
> LOCATION '${path}';
>
> Then I want to load it into the final table:
>
> CREATE TABLE product_dimension (
>   `ID` BIGINT DEFAULT SURROGATE_KEY(),
>   VC_ITEM_CODE STRING,
>   VC_SUB_GROUP STRING,
>   VC_PRODUCT_NAME STRING,
>   VC_PRODUCT_UNIT STRING,
>   VC_GROUP_CODE STRING,
>   DT_VAT_START TIMESTAMP,
>   VC_BAND_CODE STRING,
>   VC_SEMI_BAND_CODE STRING,
>   VC_DIVISIONS STRING,
>   NU_UNIT_FACTOR DECIMAL(30, 0),
>   VC_DIVISION_SEG_CODE STRING,
>   VC_COLOR_COMB STRING,
>   DT_MOD_DATE TIMESTAMP,
>   VC_INACTIVE_PRODUCT STRING,
>   RN DECIMAL(10, 0),
>   country STRING,
>   PRIMARY KEY (ID) DISABLE NOVALIDATE)
> STORED BY ICEBERG;
>
> When I attempt to perform a merge operation on column vc_item_code, I get the error shown above:
>
> MERGE INTO product_dimension AS c
> USING (SELECT * FROM temp_table) AS s
>   ON s.vc_item_code = c.vc_item_code AND s.country = c.country
> WHEN MATCHED THEN UPDATE SET
>   vc_item_code = s.vc_item_code,
>   vc_sub_group = s.vc_sub_group,
>   vc_product_name = s.vc_product_name,
>   vc_product_unit = s.vc_product_unit,
>   vc_group_code = s.vc_group_code,
>   dt_vat_start = s.dt_vat_start,
>   vc_band_code = s.vc_band_code,
>   vc_semi_band_code = s.vc_semi_band_code,
>   vc_divisions = s.vc_divisions,
>   nu_unit_factor = s.nu_unit_factor,
>   vc_division_seg_code = s.vc_division_seg_code,
>   vc_color_comb = s.vc_color_comb,
>   dt_mod_date = s.dt_mod_date,
>   vc_inactive_product = s.vc_inactive_product,
>   rn = s.rn,
>   country = s.country
> WHEN NOT MATCHED THEN INSERT (
>   vc_item_code, vc_sub_group, vc_product_name, vc_product_unit, vc_group_code,
>   dt_vat_start, vc_band_code, vc_semi_band_code, vc_divisions, nu_unit_factor,
>   vc_division_seg_code, vc_color_comb, dt_mod_date, vc_inactive_product, rn, country
> ) VALUES (
>   s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit, s.vc_group_code,
>   s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code, s.vc_divisions, s.nu_unit_factor,
>   s.vc_division_seg_code, s.vc_color_comb, s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
> );
>
> Warm Regards
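For anyone following this thread later, applying the workaround from Okumin's reply to the product_dimension table above means listing ID explicitly in the INSERT branch and populating it with SURROGATE_KEY() instead of relying on the column's DEFAULT. A rough sketch against the schema quoted above, not tested here:

```
-- Sketch only: the original MERGE with ID added to the INSERT branch,
-- following Okumin's suggestion; untested against this exact schema.
MERGE INTO product_dimension AS c
USING (SELECT * FROM temp_table) AS s
  ON s.vc_item_code = c.vc_item_code AND s.country = c.country
WHEN MATCHED THEN UPDATE SET
  vc_item_code = s.vc_item_code, vc_sub_group = s.vc_sub_group,
  vc_product_name = s.vc_product_name, vc_product_unit = s.vc_product_unit,
  vc_group_code = s.vc_group_code, dt_vat_start = s.dt_vat_start,
  vc_band_code = s.vc_band_code, vc_semi_band_code = s.vc_semi_band_code,
  vc_divisions = s.vc_divisions, nu_unit_factor = s.nu_unit_factor,
  vc_division_seg_code = s.vc_division_seg_code, vc_color_comb = s.vc_color_comb,
  dt_mod_date = s.dt_mod_date, vc_inactive_product = s.vc_inactive_product,
  rn = s.rn, country = s.country
WHEN NOT MATCHED THEN INSERT (
  ID, vc_item_code, vc_sub_group, vc_product_name, vc_product_unit, vc_group_code,
  dt_vat_start, vc_band_code, vc_semi_band_code, vc_divisions, nu_unit_factor,
  vc_division_seg_code, vc_color_comb, dt_mod_date, vc_inactive_product, rn, country
) VALUES (
  SURROGATE_KEY(), s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
  s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code, s.vc_divisions,
  s.nu_unit_factor, s.vc_division_seg_code, s.vc_color_comb, s.dt_mod_date,
  s.vc_inactive_product, s.rn, s.country
);
```

Given Okumin's note that SURROGATE_KEY() did not produce globally unique ids in his attempts, it is worth verifying the result afterwards, for example with SELECT ID, COUNT(*) FROM product_dimension GROUP BY ID HAVING COUNT(*) > 1.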