Hi Clinton,

Thanks for sharing your problem. If you provide more information, such as a
dataset or queries, we can try to reproduce it and file a ticket.
Hi Lisoda,

Thanks for giving us real examples. Interesting. Do I understand correctly
that the first problem occurs when an Iceberg table contains a very large
data file? I could not reproduce it[1], and I am curious about the detailed
conditions. As for the map-side aggregation, we also found and resolved a
similar problem[2]. It could be worth checking. The other issues are also
interesting. I'd file tickets if I had evidence.

- [1] https://gist.github.com/okumin/4fccec45109fc9927a22f40c166fe7f9
- [2] https://issues.apache.org/jira/browse/HIVE-28428

Regards,
Okumin

On Mon, Sep 2, 2024 at 2:59 PM clinton chikwata <clintonfun...@gmail.com> wrote:
>
> Hello Lisoda,
>
> Thanks for this information.
>
> On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:
>>
>> Hello Clinton:
>>
>> We have actually encountered the same issue: in many cases, querying
>> Iceberg does not reach the expected efficiency and falls short of regular
>> ORC/Parquet tables in speed. Since the current HiveIcebergInputSplit does
>> not support splitting by file size, reading can be slow when individual
>> data files in an Iceberg table are excessively large. This needs
>> improvement from community developers in future iterations.
>>
>> Additionally, if Iceberg tables employ zstd compression, the current
>> handling via Hive's aircompressor library, a pure-Java implementation, is
>> notably less efficient than JNI implementations. This may only improve
>> after a rewrite leveraging JDK SIMD support.
>>
>> Furthermore, we have analyzed execution latency using flame graphs and
>> discovered potential issues in the implementation of
>> VectorGroupByOperator$ProcessingModeHashAggregate, which exhibits
>> exceedingly poor performance.
>>
>> Consequently, as of now, with Iceberg tables we can temporarily work
>> around the issue by increasing the number of map tasks and reducing the
>> size of individual data files in the Iceberg table. We hope these issues
>> can be resolved in subsequent iterative development.
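[The workarounds Lisoda describes above can be sketched at the SQL level.
`write.target-file-size-bytes` is a standard Iceberg write property and
`hive.map.aggr` a standard Hive setting, but the table name and the 128 MB
target below are illustrative placeholders, not recommendations.]

```
-- Sketch of the workarounds described above; placeholder table name and
-- file-size target, to be tuned for the actual workload.

-- Keep newly written Iceberg data files small so each one maps to its own
-- split and therefore its own map task (134217728 bytes = 128 MB):
ALTER TABLE my_iceberg_table
SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728');

-- If the hash aggregation in VectorGroupByOperator is the bottleneck,
-- map-side aggregation can be disabled per session to compare plans:
SET hive.map.aggr = false;
```

Note this only affects newly written files; existing oversized files would
still need to be rewritten (for example by an INSERT OVERWRITE) to benefit.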
>>
>> On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:
>>
>> Thanks Okumin.
>>
>> I am new to Hive and Tez, and I have struggled to deploy a high-performance
>> Dockerized Hive setup. I followed the documentation for setting up a remote
>> Metastore. I have a single node with 32 GB of RAM and 8 cores, and a
>> dataset of about 2 GB (an Iceberg table partitioned on one column). However,
>> when I run select queries, the performance has not been as fast as expected.
>> Could someone share some insights, especially regarding hive-site.xml and
>> custom Tez configuration?
>>
>> Any help would be appreciated.
>>
>> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
>>>
>>> Hi Clinton,
>>>
>>> I tested MERGE INTO with a minimal reproduction. I saw the same error.
>>>
>>> ```
>>> CREATE TABLE src (col1 INT, col2 INT);
>>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT,
>>>   col2 INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;
>>>
>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
>>> ```
>>>
>>> The following query, which explicitly inserts `id`, succeeded on my
>>> machine. The DEFAULT keyword does not seem to work for the INSERT branch
>>> of MERGE INTO. I've yet to investigate whether ANSI allows us to omit it.
>>>
>>> ```
>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>> WHEN NOT MATCHED THEN INSERT (id, col1, col2)
>>>   VALUES (SURROGATE_KEY(), s.col1, s.col2);
>>> ```
>>>
>>> As another point, SURROGATE_KEY might not work as you expect. It did not
>>> generate globally unique ids in my attempts.
>>>
>>> Regards,
>>> Okumin
>>>
>>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
>>> <clintonfun...@gmail.com> wrote:
>>> >
>>> > Dear Team,
>>> >
>>> > Any help will be much appreciated.
>>> >
>>> > Error: SQL Error [40000] [42000]: Error while compiling statement: FAILED:
>>> > SemanticException Schema of both sides of union should match.
>>> >
>>> > I have an ETL workload that stores data into temp_table with the schema
>>> > shown below.
>>> >
>>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>>> >   VC_ITEM_CODE STRING,
>>> >   VC_SUB_GROUP STRING,
>>> >   VC_PRODUCT_NAME STRING,
>>> >   VC_PRODUCT_UNIT STRING,
>>> >   VC_GROUP_CODE STRING,
>>> >   DT_VAT_START TIMESTAMP,
>>> >   VC_BAND_CODE STRING,
>>> >   VC_SEMI_BAND_CODE STRING,
>>> >   VC_DIVISIONS STRING,
>>> >   NU_UNIT_FACTOR DECIMAL(30, 0),
>>> >   VC_DIVISION_SEG_CODE STRING,
>>> >   VC_COLOR_COMB STRING,
>>> >   DT_MOD_DATE TIMESTAMP,
>>> >   VC_INACTIVE_PRODUCT STRING,
>>> >   RN DECIMAL(10, 0),
>>> >   country STRING
>>> > )
>>> > STORED AS PARQUET
>>> > LOCATION 'S{path}'
>>> >
>>> > Then I want to load it into the final table:
>>> >
>>> > CREATE TABLE product_dimension (
>>> >   `ID` BIGINT DEFAULT SURROGATE_KEY(),
>>> >   VC_ITEM_CODE STRING,
>>> >   VC_SUB_GROUP STRING,
>>> >   VC_PRODUCT_NAME STRING,
>>> >   VC_PRODUCT_UNIT STRING,
>>> >   VC_GROUP_CODE STRING,
>>> >   DT_VAT_START TIMESTAMP,
>>> >   VC_BAND_CODE STRING,
>>> >   VC_SEMI_BAND_CODE STRING,
>>> >   VC_DIVISIONS STRING,
>>> >   NU_UNIT_FACTOR DECIMAL(30, 0),
>>> >   VC_DIVISION_SEG_CODE STRING,
>>> >   VC_COLOR_COMB STRING,
>>> >   DT_MOD_DATE TIMESTAMP,
>>> >   VC_INACTIVE_PRODUCT STRING,
>>> >   RN DECIMAL(10, 0),
>>> >   country STRING,
>>> >   PRIMARY KEY (ID) DISABLE NOVALIDATE)
>>> > STORED BY ICEBERG;
>>> >
>>> > When I attempt to perform a merge operation on column vc_item_code, I get
>>> > the error shown above:
>>> >
>>> > MERGE INTO product_dimension AS c
>>> > USING (SELECT * FROM temp_table) AS s
>>> >   ON s.vc_item_code = c.vc_item_code AND s.country = c.country
>>> > WHEN MATCHED THEN UPDATE SET
>>> >   vc_item_code = s.vc_item_code,
>>> >   vc_sub_group = s.vc_sub_group,
>>> >   vc_product_name = s.vc_product_name,
>>> >   vc_product_unit = s.vc_product_unit,
>>> >   vc_group_code = s.vc_group_code,
>>> >   dt_vat_start = s.dt_vat_start,
>>> >   vc_band_code = s.vc_band_code,
>>> >   vc_semi_band_code = s.vc_semi_band_code,
>>> >   vc_divisions = s.vc_divisions,
>>> >   nu_unit_factor = s.nu_unit_factor,
>>> >   vc_division_seg_code = s.vc_division_seg_code,
>>> >   vc_color_comb = s.vc_color_comb,
>>> >   dt_mod_date = s.dt_mod_date,
>>> >   vc_inactive_product = s.vc_inactive_product,
>>> >   rn = s.rn,
>>> >   country = s.country
>>> > WHEN NOT MATCHED THEN INSERT (
>>> >   vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
>>> >   vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
>>> >   vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
>>> >   dt_mod_date, vc_inactive_product, rn, country
>>> > ) VALUES (
>>> >   s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
>>> >   s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code,
>>> >   s.vc_divisions, s.nu_unit_factor, s.vc_division_seg_code, s.vc_color_comb,
>>> >   s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
>>> > );
>>> >
>>> > Warm Regards
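[Per Okumin's reply earlier in the thread, the failure likely comes from the
NOT MATCHED branch omitting `ID` and relying on its DEFAULT SURROGATE_KEY().
A sketch of the same workaround applied to this schema follows — listing `ID`
explicitly and calling SURROGATE_KEY() in the VALUES clause. This is untested
against this exact table and only adapts Okumin's minimal example.]

```
-- Sketch: rewrite only the NOT MATCHED branch to supply ID explicitly,
-- mirroring Okumin's working minimal example above. Untested on this schema.
WHEN NOT MATCHED THEN INSERT (
  ID, vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
  vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
  vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
  dt_mod_date, vc_inactive_product, rn, country
) VALUES (
  SURROGATE_KEY(), s.vc_item_code, s.vc_sub_group, s.vc_product_name,
  s.vc_product_unit, s.vc_group_code, s.dt_vat_start, s.vc_band_code,
  s.vc_semi_band_code, s.vc_divisions, s.nu_unit_factor,
  s.vc_division_seg_code, s.vc_color_comb, s.dt_mod_date,
  s.vc_inactive_product, s.rn, s.country
);
```

As Okumin also notes, SURROGATE_KEY() may not produce globally unique ids, so
the generated `ID` values would still need to be verified.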