[jira] [Created] (IMPALA-6422) Compute stats tablesample spends a lot of time in powf()
Alexander Behm created IMPALA-6422:

Summary: Compute stats tablesample spends a lot of time in powf()
Key: IMPALA-6422
URL: https://issues.apache.org/jira/browse/IMPALA-6422
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm
Assignee: Alexander Behm

[~mmokhtar] did perf profiling for COMPUTE STATS TABLESAMPLE and discovered that a lot of time is spent on finalizing HLL intermediates. Most of that time is spent in powf(). Relevant snippet from AggregateFunctions::HllFinalEstimate() in aggregate-functions-ir.cc:

{code}
for (int i = 0; i < num_buckets; ++i) {
  harmonic_mean += powf(2.0f, -buckets[i]);
  if (buckets[i] == 0) ++num_zero_registers;
}
{code}

Since we're computing a power of 2, using ldexp() should be much more efficient. I did a microbenchmark and found that ldexp() is >10x faster than powf() for this scenario.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
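The proposed change relies on the identity 2^-b == ldexp(1.0, -b). A minimal Python sketch of that identity follows, applied to the same accumulation loop. The function names are hypothetical and this is not the Impala code; it only checks that the two formulations agree and does not reproduce the C++ microbenchmark timings.

```python
import math


def finalize_with_pow(buckets):
    """Accumulate sum(2^-b) with a general-purpose pow(), as in the original loop."""
    harmonic_sum = 0.0
    num_zero_registers = 0
    for b in buckets:
        harmonic_sum += math.pow(2.0, -b)
        if b == 0:
            num_zero_registers += 1
    return harmonic_sum, num_zero_registers


def finalize_with_ldexp(buckets):
    """Same accumulation, but ldexp(1.0, -b) only adjusts the float's exponent."""
    harmonic_sum = 0.0
    num_zero_registers = 0
    for b in buckets:
        harmonic_sum += math.ldexp(1.0, -b)
        if b == 0:
            num_zero_registers += 1
    return harmonic_sum, num_zero_registers
```

Both functions should return the same pair for any list of non-negative integer bucket values, which is why the swap is safe for a power-of-2 base.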
[jira] [Created] (IMPALA-6421) Improve the way the pytest --update_results flag works
Taras Bobrovytsky created IMPALA-6421:

Summary: Improve the way the pytest --update_results flag works
Key: IMPALA-6421
URL: https://issues.apache.org/jira/browse/IMPALA-6421
Project: IMPALA
Issue Type: Bug
Reporter: Taras Bobrovytsky

Currently there are several problems with running py.test with the --update_results flag.

It does not always output the test sections in a consistent order. For example, sometimes TYPES comes before RESULTS, and sometimes vice versa. We should come up with a canonical order for the sections and update all the .test files so that this order is followed everywhere.

Another problem is that stray spaces or newlines are sometimes inserted at arbitrary places in the new file with the updated results.

In general, we should make generating the expected results much more robust.
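One way to enforce a canonical section order is sketched below. It assumes each section starts with a header line of the form `---- NAME`, and the order in CANONICAL_ORDER is a hypothetical choice, since the issue does not specify one. The helper names are illustrative and not part of the Impala test framework.

```python
import re

# Hypothetical canonical order; the issue leaves the actual order open.
CANONICAL_ORDER = ["QUERY", "RESULTS", "TYPES"]

# Assumed header syntax: four dashes, a space, then the section name.
SECTION_HEADER = re.compile(r"^---- (\w+)")


def split_sections(test_case):
    """Split one test case into (name, lines) pairs; lines include the header.

    Any text before the first header is ignored in this sketch.
    """
    sections = []
    for line in test_case.splitlines():
        match = SECTION_HEADER.match(line)
        if match:
            sections.append((match.group(1), [line]))
        elif sections:
            sections[-1][1].append(line)
    return sections


def canonicalize(test_case):
    """Return the test case with its sections in canonical order.

    Sections not listed in CANONICAL_ORDER keep their relative order
    and are placed last, because sorted() is stable.
    """
    def rank(section):
        name = section[0]
        if name in CANONICAL_ORDER:
            return CANONICAL_ORDER.index(name)
        return len(CANONICAL_ORDER)

    lines = []
    for _, section_lines in sorted(split_sections(test_case), key=rank):
        lines.extend(section_lines)
    return "\n".join(lines) + "\n"
```

Because the function is idempotent, it could be run over existing .test files once and then applied to every file that --update_results writes.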
[jira] [Resolved] (IMPALA-6419) hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
[ https://issues.apache.org/jira/browse/IMPALA-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-6419.

Resolution: Fixed
Fix Version/s: Impala 2.12.0

IMPALA-6419: Revert "IMPALA-6383: free memory after skipping parquet row groups"

This reverts commit 10fb24afb966c567adcf632a314f6af1826f19fc.

Change-Id: I4dd62380d02b61ca46f856b4eb40670b71e28140
Reviewed-on: http://gerrit.cloudera.org:8080/9054
Reviewed-by: Alex Behm <alex.b...@cloudera.com>
Tested-by: Impala Public Jenkins

> hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
>
> Key: IMPALA-6419
> URL: https://issues.apache.org/jira/browse/IMPALA-6419
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 2.12.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Blocker
> Labels: broken-build
> Fix For: Impala 2.12.0
>
> Hit this during GVO
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/975/artifact/Impala/logs_static/logs/ee_tests/impalad.ip-172-31-46-11.ubuntu.log.ERROR.20180118-025218.43759
> {noformat}
> F0118 03:59:51.644770 21603 hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
> *** Check failure stack trace: ***
> @ 0x3c0611d google::LogMessage::Fail()
> @ 0x3c079c2 google::LogMessage::SendToLog()
> @ 0x3c05af7 google::LogMessage::Flush()
> @ 0x3c090be google::LogMessageFatal::~LogMessageFatal()
> @ 0x1bf5f26 impala::HdfsParquetScanner::NextRowGroup()
> @ 0x1bf4c52 impala::HdfsParquetScanner::GetNextInternal()
> @ 0x1bf3264 impala::HdfsParquetScanner::ProcessSplit()
> @ 0x1b7cf42 impala::HdfsScanNode::ProcessSplit()
> @ 0x1b7c36d impala::HdfsScanNode::ScannerThread()
> @ 0x1b7b7f0 _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv
> @ 0x1b7d793 _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
> @ 0x17d669a boost::function0<>::operator()()
> @ 0x1ad2993 impala::Thread::SuperviseThread()
> @ 0x1adb6a8 boost::_bi::list4<>::operator()<>()
> @ 0x1adb5eb boost::_bi::bind_t<>::operator()()
> @ 0x1adb5ae boost::detail::thread_data<>::run()
> @ 0x2d8faea thread_proxy
> @ 0x7f64f6b1f6ba start_thread
> @ 0x7f64f685541d clone
> {noformat}
> Repro:
> {noformat}
> $ SCANNER_FUZZ_SEED=1516247966 impala-py.test tests/query_test/test_scanners_fuzz.py -n 4 --verbose -k parquet
> {noformat}
[jira] [Resolved] (IMPALA-6083) Clarify scope of STRAIGHT_JOIN query hint in Impala docs
[ https://issues.apache.org/jira/browse/IMPALA-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Russell resolved IMPALA-6083.

Resolution: Fixed
Fix Version/s: Impala 2.10.0

Yes, SHA = 3b5a36337f190481cef7ebef398324805d7e4485

I forgot that 'github bot' doesn't apply, so mentioning the JIRA number in the commit message doesn't create an obvious linkage.

> Clarify scope of STRAIGHT_JOIN query hint in Impala docs
>
> Key: IMPALA-6083
> URL: https://issues.apache.org/jira/browse/IMPALA-6083
> Project: IMPALA
> Issue Type: Improvement
> Components: Docs
> Affects Versions: Impala 2.10.0
> Reporter: Alexander Behm
> Assignee: John Russell
> Priority: Major
> Fix For: Impala 2.10.0
>
> The documentation for STRAIGHT_JOIN is missing one very important detail:
> The scope of the hint is limited to the FROM clause of the query block where the hint appears. In particular, the hint does not apply recursively to all subqueries.
> Existing docs are here:
> https://impala.incubator.apache.org/docs/build/html/topics/impala_perf_joins.html
> *Examples*
> 1. The hint here will prevent reordering of "v1" and "v2", but the joins between "t1" and "t2" and "t3" and "t4" can still be reordered.
> {code}
> select /* +straight_join */ count(*) from
> (select t1.id as id from t1 join t2 on t1.id = t2.id) v1
> join
> (select t3.id as id from t3 join t4 on t3.id = t4.id) v2
> on (v1.id = v2.id)
> {code}
> 2. Fully hinted query. No join ordering is applied whatsoever.
> {code}
> select /* +straight_join */ count(*) from
> (select /* +straight_join */ t1.id as id from t1 join t2 on t1.id = t2.id) v1
> join
> (select /* +straight_join */ t3.id as id from t3 join t4 on t3.id = t4.id) v2
> on (v1.id = v2.id)
> {code}
[jira] [Resolved] (IMPALA-6386) Dataload can fail due to "invalidate metadata" concurrent with DDLs
[ https://issues.apache.org/jira/browse/IMPALA-6386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-6386.

Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit d9b6fd073055b436c7404d49454dc215b2c7a369
Author: Joe McDonnell
Date: Thu Jan 11 15:09:52 2018 -0800

IMPALA-6386: Invalidate metadata at table level for dataload

Dataload currently executes bin/load-data.py for TPC-H, TPC-DS, and functional-query concurrently. One of the final steps for bin/load-data.py is to run a global "invalidate metadata". Global "invalidate metadata" commands are known to cause problems on concurrent systems. See IMPALA-5087. For dataload, if TPC-H executes "invalidate metadata" while TPC-DS is still creating tables and adding partitions, the TPC-DS executor might erroneously believe that a table does not exist.

This changes dataload to invalidate metadata at an individual table level rather than globally. This prevents the concurrency issue.

This also changes the names of some of the intermediate SQL files generated by generate-schema-statements.py and consumed by load-data.py to make them less confusing.

Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell
Tested-by: Impala Public Jenkins

> Dataload can fail due to "invalidate metadata" concurrent with DDLs
>
> Key: IMPALA-6386
> URL: https://issues.apache.org/jira/browse/IMPALA-6386
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 2.11.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Critical
> Fix For: Impala 2.12.0
>
> testdata/bin/create-load-data.sh runs bin/load-data.py on TPC-H, TPC-DS, and functional-query in parallel. One of the final steps of bin/load-data.py is to run a universal "invalidate metadata". However, universal "invalidate metadata" is an error-prone operation in a concurrent system. When "invalidate metadata" happens during the DDL statements for another dataset (i.e. TPC-H finishes and runs "invalidate metadata" while TPC-DS is still creating tables and adding partitions), it can lead to errors.
> Thread 1: create external table foo ... ;
> Thread 2: invalidate metadata;
> Thread 1: alter table foo add partition bar; <-- Hits error because it can't find foo
> This is a known issue: IMPALA-5087. This has been seen in my development environment and one automated build, but it is relatively rare.
> Dataload needs to switch to using "invalidate metadata {table_name}" to avoid this issue. This is also a good time to consider using "refresh {table_name}".
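The table-level approach described above can be sketched as follows. This is an illustration only and not the code from the commit: the function name and the example database and table names are hypothetical, and the only thing taken from the issue is the "invalidate metadata {table_name}" statement form.

```python
def table_level_invalidates(db_name, table_names):
    """Build one "invalidate metadata" statement per table.

    Each statement touches only the named table, so tables that another
    dataset's loader is still creating are left alone.
    """
    return [
        "invalidate metadata {0}.{1};".format(db_name, table)
        for table in table_names
    ]
```

A loader for one dataset would run only the statements for its own tables as its final step, in place of a single global "invalidate metadata".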
[jira] [Created] (IMPALA-6419) hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
Tim Armstrong created IMPALA-6419:

Summary: hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
Key: IMPALA-6419
URL: https://issues.apache.org/jira/browse/IMPALA-6419
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 2.12.0
Reporter: Tim Armstrong
Assignee: Tim Armstrong

{noformat}
F0118 03:59:51.644770 21603 hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)
*** Check failure stack trace: ***
@ 0x3c0611d google::LogMessage::Fail()
@ 0x3c079c2 google::LogMessage::SendToLog()
@ 0x3c05af7 google::LogMessage::Flush()
@ 0x3c090be google::LogMessageFatal::~LogMessageFatal()
@ 0x1bf5f26 impala::HdfsParquetScanner::NextRowGroup()
@ 0x1bf4c52 impala::HdfsParquetScanner::GetNextInternal()
@ 0x1bf3264 impala::HdfsParquetScanner::ProcessSplit()
@ 0x1b7cf42 impala::HdfsScanNode::ProcessSplit()
@ 0x1b7c36d impala::HdfsScanNode::ScannerThread()
@ 0x1b7b7f0 _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv
@ 0x1b7d793 _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
@ 0x17d669a boost::function0<>::operator()()
@ 0x1ad2993 impala::Thread::SuperviseThread()
@ 0x1adb6a8 boost::_bi::list4<>::operator()<>()
@ 0x1adb5eb boost::_bi::bind_t<>::operator()()
@ 0x1adb5ae boost::detail::thread_data<>::run()
@ 0x2d8faea thread_proxy
@ 0x7f64f6b1f6ba start_thread
@ 0x7f64f685541d clone
{noformat}