Re: performance issue on big table join

Hongxu Ma Thu, 02 Nov 2017 00:22:33 -0700

Thanks LL. Your query options look good.

As Xu Cheng mentioned, I have also noticed that Impala performs hash joins
slowly in some big-data situations.
I am very curious about the root cause.


On 02/11/2017 10:00, 俊杰陈 wrote:

+user list

2017-11-02 9:57 GMT+08:00 俊杰陈 <[email protected]>:



Hi Mostafa

Cheng already posted the profile in the thread.

Here is another profile, taken on the Impala release version. You can also
find it in the attachment.


2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <[email protected]>:



Attaching the query profile will be most helpful to investigate this
issue.

If you can capture the profile from the WebUI on the coordinator node it
would be great.

On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <[email protected]> wrote:



Thanks Hongxu,

Here are the configurations on my cluster; most of them are default values.
Which items do you think could have an impact?

        ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
        ABORT_ON_ERROR: [0]
        ALLOW_UNSUPPORTED_FORMATS: [0]
        APPX_COUNT_DISTINCT: [0]
        BATCH_SIZE: [0]
        COMPRESSION_CODEC: [NONE]
        DEBUG_ACTION: []
        DEFAULT_ORDER_BY_LIMIT: [-1]
        DISABLE_CACHED_READS: [0]
        DISABLE_CODEGEN: [0]
        DISABLE_OUTERMOST_TOPN: [0]
        DISABLE_ROW_RUNTIME_FILTERING: [0]
        DISABLE_STREAMING_PREAGGREGATIONS: [0]
        DISABLE_UNSAFE_SPILLS: [0]
        ENABLE_EXPR_REWRITES: [1]
        EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
        EXPLAIN_LEVEL: [1]
        HBASE_CACHE_BLOCKS: [0]
        HBASE_CACHING: [0]
        MAX_BLOCK_MGR_MEMORY: [0]
        MAX_ERRORS: [100]
        MAX_IO_BUFFERS: [0]
        MAX_NUM_RUNTIME_FILTERS: [10]
        MAX_SCAN_RANGE_LENGTH: [0]
        MEM_LIMIT: [0]
        MT_DOP: [0]
        NUM_NODES: [0]
        NUM_SCANNER_THREADS: [0]
        OPTIMIZE_PARTITION_KEY_SCANS: [0]
        PARQUET_ANNOTATE_STRINGS_UTF8: [0]
        PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
        PARQUET_FILE_SIZE: [0]
        PREFETCH_MODE: [1]
        QUERY_TIMEOUT_S: [0]
        REPLICA_PREFERENCE: [0]
        REQUEST_POOL: []
        RESERVATION_REQUEST_TIMEOUT: [0]
        RM_INITIAL_MEM: [0]
        RUNTIME_BLOOM_FILTER_SIZE: [1048576]
        RUNTIME_FILTER_MAX_SIZE: [16777216]
        RUNTIME_FILTER_MIN_SIZE: [1048576]
        RUNTIME_FILTER_MODE: [2]
        RUNTIME_FILTER_WAIT_TIME_MS: [0]
        S3_SKIP_INSERT_STAGING: [1]
        SCAN_NODE_CODEGEN_THRESHOLD: [1800000]
        SCHEDULE_RANDOM_REPLICA: [0]
        SCRATCH_LIMIT: [-1]
        SEQ_COMPRESSION_MODE: [0]
        STRICT_MODE: [0]
        SUPPORT_START_OVER: [false]
        SYNC_DDL: [0]
        V_CPU_CORES: [0]
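
A listing like the one above can be checked mechanically against the three options Hongxu suggested looking at. The following is only a minimal sketch, assuming the `NAME: [value]` layout shown in this message; `parse_options`, `unexpected` and `EXPECTED` are hypothetical names introduced for illustration, and treating `RUNTIME_FILTER_MODE` 2 as the numeric form of GLOBAL is an assumption here.

```python
import re

# Values Hongxu suggested checking. RUNTIME_FILTER_MODE "2" is assumed
# to be the numeric form of GLOBAL.
EXPECTED = {
    "BATCH_SIZE": "0",
    "DISABLE_CODEGEN": "0",
    "RUNTIME_FILTER_MODE": "2",
}

# Matches lines such as "        BATCH_SIZE: [0]" or "DEBUG_ACTION: []".
OPTION_LINE = re.compile(r"^\s*([A-Z0-9_]+):\s*\[(.*)\]\s*$")


def parse_options(text):
    """Parse 'NAME: [value]' lines into a dict; other lines are ignored."""
    options = {}
    for line in text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def unexpected(options, expected=EXPECTED):
    """Return {name: (actual, wanted)} for options that differ or are missing."""
    return {
        name: (options.get(name), wanted)
        for name, wanted in expected.items()
        if options.get(name) != wanted
    }


# A few lines copied from the listing above.
sample = """
        BATCH_SIZE: [0]
        DEBUG_ACTION: []
        DISABLE_CODEGEN: [0]
        RUNTIME_FILTER_MODE: [2]
"""
opts = parse_options(sample)
print(unexpected(opts))
```

With the values posted in this message the check finds no differences for those three options, which matches the observation that the query options look fine.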

2017-10-31 15:30 GMT+08:00 Hongxu Ma <[email protected]>:



Hi JJ
Considering it only takes 3 minutes on SparkSQL, maybe there are some
mistakes in the query options.
Try running "set;" in impala-shell and check all query options, e.g.:
    BATCH_SIZE: [0]
    DISABLE_CODEGEN: [0]
    RUNTIME_FILTER_MODE: GLOBAL

Just a guess, thanks.

On 27/10/2017 10:25, 俊杰陈 wrote:
The profile file is damaged. Here is a screenshot of the exec summary:
[cid:ii_j999ymep1_15f5ba563aeabb91]


2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
Hi Devs

I ran into a performance issue on a big table join. The query takes more
than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5-node
cluster. While the query is running, the left scanner and exchange node are
very slow. Did I miss some key arguments?

You can see the profile file in the attachment.

[cid:ii_j9998pph2_15f5b92f2cf47020]

--
Thanks & Best Regards



--
Regards,
Hongxu.
