Hi Clinton,

Thanks for sharing your problem. If you can provide more information, such
as a dataset or the queries involved, we can try to reproduce the issue
and file a ticket.
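
In the meantime, since you asked about hive-site.xml and Tez tuning: for
a single node with 32 GB of RAM and 8 cores, a minimal starting point
might look like the following. The values are illustrative assumptions,
not tuned recommendations, so please adjust them to your workload.

```xml
<!-- hive-site.xml: illustrative starting values for a 32 GB / 8-core node -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <!-- memory for the Tez ApplicationMaster (illustrative) -->
  <name>tez.am.resource.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <!-- memory per Tez task container (illustrative) -->
  <name>hive.tez.container.size</name>
  <value>4096</value>
</property>
<property>
  <!-- vectorized execution usually helps columnar formats like ORC/Parquet -->
  <name>hive.vectorized.execution.enabled</name>
  <value>true</value>
</property>
```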

Hi Lisoda,

Thanks for giving us concrete examples. Interesting. Regarding the first
problem, am I right that it happens when an Iceberg table contains a
single large data file? I could not reproduce it[1], so I am curious
about the detailed conditions. As for the map-side aggregation, we also
found and resolved a similar problem[2]; it may be worth checking. The
other issues are also interesting. I'd be happy to file tickets once we
have evidence.

- [1] https://gist.github.com/okumin/4fccec45109fc9927a22f40c166fe7f9
- [2] https://issues.apache.org/jira/browse/HIVE-28428
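
Also, as a stopgap for the oversized-file case until split-by-size is
supported, lowering the Iceberg write target file size might help. A
sketch below; the property name is the standard Iceberg
`write.target-file-size-bytes`, the table name is hypothetical, and the
128 MB value is only illustrative:

```sql
-- Ask Iceberg writers to target ~128 MB data files (value is illustrative).
ALTER TABLE my_iceberg_table
SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728');

-- Note: existing oversized files are unaffected until they are rewritten,
-- e.g. by an INSERT OVERWRITE of the affected partitions.
```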

Regards,
Okumin

On Mon, Sep 2, 2024 at 2:59 PM clinton chikwata <clintonfun...@gmail.com> wrote:
>
> Hello Lisoda,
>
> Thanks for this information.
>
>
> On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:
>>
>> Hello Clinton:
>>
>> We have actually encountered the same issue: in many cases, querying 
>> Iceberg does not meet the expected efficiency and falls short of regular 
>> ORC/Parquet tables in speed. Since the current HiveIcebergInputSplit does 
>> not support splitting based on file size, reads can be slow when individual 
>> data files in an Iceberg table are excessively large. This will need 
>> improvements from community developers in future iterations.
>>
>> Additionally, if Iceberg tables use zstd compression, Hive currently 
>> handles it via the aircompressor library, a pure-Java implementation that 
>> is notably less efficient than JNI implementations. This may only improve 
>> after a rewrite leveraging JDK SIMD.
>>
>> Furthermore, we have analyzed execution latency using flame graphs and 
>> discovered potential issues with the implementation of 
>> VectorGroupByOperator$ProcessingModeHashAggregate, which exhibits 
>> exceedingly poor performance. Consequently, for now, we can temporarily 
>> work around the issue by increasing the number of map tasks and reducing 
>> the size of individual data files in the Iceberg table. We hope these 
>> issues can be resolved in subsequent development iterations.
>>
>>
>> On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:
>>
>> Thanks, Okumin.
>>
>> I am new to Hive and Tez, and I have struggled to deploy a high-performance 
>> Dockerized Hive setup. I followed the documentation for setting up a remote 
>> Metastore. I have a single node with 32 GB of RAM and 8 cores, and a 
>> dataset of about 2 GB (an Iceberg table partitioned on one column). However, 
>> when I run SELECT queries, the performance has not been as fast as expected. 
>> Could someone share some insights, especially regarding hive-site.xml and 
>> custom Tez configuration?
>>
>> Any help would be appreciated.
>>
>> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
>>>
>>> Hi Clinton,
>>>
>>> I tested MERGE INTO with a minimal reproduction and saw the same error.
>>>
>>> ```
>>> CREATE TABLE src (col1 INT, col2 INT);
>>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2
>>> INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;
>>>
>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
>>> ```
>>>
>>> The following query, which explicitly inserts `id`, succeeded on my
>>> machine. The DEFAULT clause apparently does not take effect for the
>>> INSERT branch of MERGE INTO. I've yet to investigate whether ANSI
>>> allows us to omit it.
>>>
>>> ```
>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>> WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(),
>>> s.col1, s.col2);
>>> ```
>>>
>>> As another point, SURROGATE_KEY might not work as you expect: in my
>>> attempts, it did not generate globally unique ids.
>>>
>>> Regards,
>>> Okumin
>>>
>>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
>>> <clintonfun...@gmail.com> wrote:
>>> >
>>> > Dear Team,
>>> >
>>> > Any help will be much appreciated.
>>> >
>>> > Error SQL Error [40000] [42000]: Error while compiling statement: FAILED: 
>>> > SemanticException Schema of both sides of union should match.
>>> >
>>> > I have an ETL workload that stores data into temp_table with the schema 
>>> > as shown below.
>>> >
>>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>>> >     VC_ITEM_CODE STRING,
>>> >     VC_SUB_GROUP STRING,
>>> >     VC_PRODUCT_NAME STRING,
>>> >     VC_PRODUCT_UNIT STRING,
>>> >     VC_GROUP_CODE STRING,
>>> >     DT_VAT_START TIMESTAMP,
>>> >     VC_BAND_CODE STRING,
>>> >     VC_SEMI_BAND_CODE STRING,
>>> >     VC_DIVISIONS STRING,
>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>>> >     VC_DIVISION_SEG_CODE STRING,
>>> >     VC_COLOR_COMB STRING,
>>> >     DT_MOD_DATE TIMESTAMP,
>>> >     VC_INACTIVE_PRODUCT STRING,
>>> >     RN DECIMAL(10, 0),
>>> >     country STRING
>>> >     )
>>> > STORED AS PARQUET
>>> > LOCATION '${path}';
>>> >
>>> > Then I want to load it into the final table:
>>> >
>>> > CREATE TABLE product_dimension (
>>> >    `ID` BIGINT DEFAULT SURROGATE_KEY(),
>>> >     VC_ITEM_CODE STRING,
>>> >     VC_SUB_GROUP STRING,
>>> >     VC_PRODUCT_NAME STRING,
>>> >     VC_PRODUCT_UNIT STRING,
>>> >     VC_GROUP_CODE STRING,
>>> >     DT_VAT_START TIMESTAMP,
>>> >     VC_BAND_CODE STRING,
>>> >     VC_SEMI_BAND_CODE STRING,
>>> >     VC_DIVISIONS STRING,
>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>>> >     VC_DIVISION_SEG_CODE STRING,
>>> >     VC_COLOR_COMB STRING,
>>> >     DT_MOD_DATE TIMESTAMP,
>>> >     VC_INACTIVE_PRODUCT STRING,
>>> >     RN DECIMAL(10, 0),
>>> >     country STRING,
>>> >     PRIMARY KEY (ID) DISABLE NOVALIDATE)
>>> > STORED BY ICEBERG;
>>> >
>>> > When I attempt to perform a merge operation on column vc_item_code, I get 
>>> > the error shown above:
>>> >
>>> > MERGE INTO product_dimension AS c
>>> > USING (SELECT * FROM temp_table) AS s
>>> >   ON s.vc_item_code = c.vc_item_code
>>> >  AND s.country = c.country
>>> > WHEN MATCHED THEN UPDATE SET
>>> >     vc_item_code = s.vc_item_code,
>>> >     vc_sub_group = s.vc_sub_group,
>>> >     vc_product_name = s.vc_product_name,
>>> >     vc_product_unit = s.vc_product_unit,
>>> >     vc_group_code = s.vc_group_code,
>>> >     dt_vat_start = s.dt_vat_start,
>>> >     vc_band_code = s.vc_band_code,
>>> >     vc_semi_band_code = s.vc_semi_band_code,
>>> >     vc_divisions = s.vc_divisions,
>>> >     nu_unit_factor = s.nu_unit_factor,
>>> >     vc_division_seg_code = s.vc_division_seg_code,
>>> >     vc_color_comb = s.vc_color_comb,
>>> >     dt_mod_date = s.dt_mod_date,
>>> >     vc_inactive_product = s.vc_inactive_product,
>>> >     rn = s.rn,
>>> >     country = s.country
>>> > WHEN NOT MATCHED THEN INSERT (
>>> >     vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
>>> >     vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
>>> >     vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
>>> >     dt_mod_date, vc_inactive_product, rn, country
>>> > ) VALUES (
>>> >     s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
>>> >     s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code,
>>> >     s.vc_divisions, s.nu_unit_factor, s.vc_division_seg_code, s.vc_color_comb,
>>> >     s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
>>> > );
>>> >
>>> > Warm Regards
