[jira] [Created] (IMPALA-6422) Compute stats tablesample spends a lot of time in powf()

2018-01-18 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6422:
--

 Summary: Compute stats tablesample spends a lot of time in powf()
 Key: IMPALA-6422
 URL: https://issues.apache.org/jira/browse/IMPALA-6422
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm
Assignee: Alexander Behm


[~mmokhtar] did perf profiling for COMPUTE STATS TABLESAMPLE and discovered 
that a lot of time is spent on finalizing HLL intermediates. Most of that time 
is spent in powf().

Relevant snippet from AggregateFunctions::HllFinalEstimate() in 
aggregate-functions-ir.cc:
{code}
  for (int i = 0; i < num_buckets; ++i) {
    harmonic_mean += powf(2.0f, -buckets[i]);
    if (buckets[i] == 0) ++num_zero_registers;
  }
{code}

Since we're computing a power of 2, using ldexp() should be much more efficient.

I did a microbenchmark and found that ldexp() is >10x faster than powf() for 
this scenario.
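
As a rough sanity check of the proposed substitution (this is not the microbenchmark mentioned above and says nothing about C++ performance), the following Python sketch compares a general power call with ldexp() for the per-register term 2^-b. The register values are made up for illustration, and Python's math.pow and math.ldexp stand in for the C functions.

```python
import math

def harmonic_sum_pow(buckets):
    """Sum 2^-b over all registers using a general power function."""
    return sum(math.pow(2.0, -b) for b in buckets)

def harmonic_sum_ldexp(buckets):
    """Same sum, but scale 1.0 by a power of two via ldexp."""
    return sum(math.ldexp(1.0, -b) for b in buckets)

# Hypothetical register values; real HLL intermediates have many more buckets.
buckets = [0, 3, 1, 7, 0, 12, 5, 2]
num_zero_registers = sum(1 for b in buckets if b == 0)

print(harmonic_sum_pow(buckets), harmonic_sum_ldexp(buckets), num_zero_registers)
```

Both functions walk the registers in the same order, so any difference in the result would come from the per-term computation alone.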




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IMPALA-6421) Improve the way the pytest --update_results flag works

2018-01-18 Thread Taras Bobrovytsky (JIRA)
Taras Bobrovytsky created IMPALA-6421:
-

 Summary: Improve the way the pytest --update_results flag works
 Key: IMPALA-6421
 URL: https://issues.apache.org/jira/browse/IMPALA-6421
 Project: IMPALA
  Issue Type: Bug
Reporter: Taras Bobrovytsky


Currently there are several problems with running py.test with the 
--update_results flag. It does not always output the test sections in a 
consistent order. For example sometimes TYPES is before RESULTS, and sometimes 
vice versa. We should come up with a canonical order of the sections and 
update all the .test files so that this order is followed everywhere.

Another problem is that sometimes random spaces or newlines are inserted into 
random places in the new file with the updated results.

In general, we should make generating the expected results much more robust.
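
One possible way to get a consistent order is to sort parsed sections by a fixed ranking before writing them back. The sketch below assumes the sections are already available as (name, body) pairs; the order in CANONICAL_ORDER and the helper name order_sections are hypothetical, since the issue does not say which order should be chosen.

```python
# Hypothetical canonical order; the issue does not specify one.
CANONICAL_ORDER = ["QUERY", "TYPES", "RESULTS"]

def order_sections(sections):
    """Return (name, body) pairs sorted into CANONICAL_ORDER.

    Names missing from CANONICAL_ORDER sort last. sorted() is stable,
    so those unknown sections keep their original relative order.
    """
    rank = {name: i for i, name in enumerate(CANONICAL_ORDER)}
    return sorted(sections, key=lambda item: rank.get(item[0], len(CANONICAL_ORDER)))

# Made-up test case content.
sections = [("RESULTS", "1"), ("QUERY", "select 1"), ("TYPES", "tinyint")]
for name, body in order_sections(sections):
    print(name, body)
```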





[jira] [Resolved] (IMPALA-6419) hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)

2018-01-18 Thread Tim Armstrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-6419.
---
   Resolution: Fixed
Fix Version/s: Impala 2.12.0



IMPALA-6419: Revert "IMPALA-6383: free memory after skipping parquet row groups"

This reverts commit 10fb24afb966c567adcf632a314f6af1826f19fc.

Change-Id: I4dd62380d02b61ca46f856b4eb40670b71e28140
Reviewed-on: http://gerrit.cloudera.org:8080/9054
Reviewed-by: Alex Behm <alex.b...@cloudera.com>
Tested-by: Impala Public Jenkins


> hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 
> 11)
> -
>
> Key: IMPALA-6419
> URL: https://issues.apache.org/jira/browse/IMPALA-6419
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.12.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Blocker
>  Labels: broken-build
> Fix For: Impala 2.12.0
>
>
> Hit this during GVO
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/975/artifact/Impala/logs_static/logs/ee_tests/impalad.ip-172-31-46-11.ubuntu.log.ERROR.20180118-025218.43759
> {noformat}
> F0118 03:59:51.644770 21603 hdfs-parquet-scanner.cc:624] Check failed: 0 == 
> context_->NumStreams() (0 vs. 11) 
> *** Check failure stack trace: ***
> @  0x3c0611d  google::LogMessage::Fail()
> @  0x3c079c2  google::LogMessage::SendToLog()
> @  0x3c05af7  google::LogMessage::Flush()
> @  0x3c090be  google::LogMessageFatal::~LogMessageFatal()
> @  0x1bf5f26  impala::HdfsParquetScanner::NextRowGroup()
> @  0x1bf4c52  impala::HdfsParquetScanner::GetNextInternal()
> @  0x1bf3264  impala::HdfsParquetScanner::ProcessSplit()
> @  0x1b7cf42  impala::HdfsScanNode::ProcessSplit()
> @  0x1b7c36d  impala::HdfsScanNode::ScannerThread()
> @  0x1b7b7f0  
> _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv
> @  0x1b7d793  
> _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
> @  0x17d669a  boost::function0<>::operator()()
> @  0x1ad2993  impala::Thread::SuperviseThread()
> @  0x1adb6a8  boost::_bi::list4<>::operator()<>()
> @  0x1adb5eb  boost::_bi::bind_t<>::operator()()
> @  0x1adb5ae  boost::detail::thread_data<>::run()
> @  0x2d8faea  thread_proxy
> @ 0x7f64f6b1f6ba  start_thread
> @ 0x7f64f685541d  clone
> {noformat}
> Repro:
> {noformat}
> $ SCANNER_FUZZ_SEED=1516247966 impala-py.test 
> tests/query_test/test_scanners_fuzz.py -n 4 --verbose -k parquet
> {noformat}





[jira] [Resolved] (IMPALA-6083) Clarify scope of STRAIGHT_JOIN query hint in Impala docs

2018-01-18 Thread John Russell (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Russell resolved IMPALA-6083.
--
   Resolution: Fixed
Fix Version/s: Impala 2.10.0

Yes, SHA = 3b5a36337f190481cef7ebef398324805d7e4485

I forgot that 'github bot' doesn't apply, so mentioning the JIRA number in the 
commit message doesn't create an obvious linkage.

> Clarify scope of STRAIGHT_JOIN query hint in Impala docs
> 
>
> Key: IMPALA-6083
> URL: https://issues.apache.org/jira/browse/IMPALA-6083
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Docs
>Affects Versions: Impala 2.10.0
>Reporter: Alexander Behm
>Assignee: John Russell
>Priority: Major
> Fix For: Impala 2.10.0
>
>
> The documentation for STRAIGHT_JOIN is missing one very important detail:
> The scope of the hint is limited to the FROM clause of the query block where 
> the hint appears. In particular, the hint does not apply recursively to all 
> subqueries.
> Existing docs are here:
> https://impala.incubator.apache.org/docs/build/html/topics/impala_perf_joins.html
> *Examples*
> 1. The hint here will prevent reordering of "v1" and "v2", but the joins 
> between "t1" and "t2" and "t3" and "t4" can still be reordered.
> {code}
> select /* +straight_join */ count(*) from
> (select t1.id as id from t1 join t2 on t1.id = t2.id) v1
> join
> (select t3.id as id from t3 join t4 on t3.id = t4.id) v2
> on (v1.id = v2.id)
> {code}
> 2. Fully hinted query. No join ordering is applied whatsoever.
> {code}
> select /* +straight_join */ count(*) from
> (select /* +straight_join */ t1.id as id from t1 join t2 on t1.id = t2.id) v1
> join
> (select /* +straight_join */ t3.id as id from t3 join t4 on t3.id = t4.id) v2
> on (v1.id = v2.id)
> {code}





[jira] [Resolved] (IMPALA-6386) Dataload can fail due to "invalidate metadata" concurrent with DDLs

2018-01-18 Thread Joe McDonnell (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-6386.
---
   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit d9b6fd073055b436c7404d49454dc215b2c7a369
Author: Joe McDonnell 
Date: Thu Jan 11 15:09:52 2018 -0800

IMPALA-6386: Invalidate metadata at table level for dataload
 
 Dataload currently executes bin/load-data.py for TPC-H,
 TPC-DS, and functional-query concurrently. One of the final
 steps for bin/load-data.py is to run a global "invalidate
 metadata". Global "invalidate metadata" commands are known
 to cause problems on concurrent systems. See IMPALA-5087.
 For dataload, if TPC-H executes "invalidate metadata" while
 TPC-DS is still creating tables and adding partitions,
 the TPC-DS executor might erroneously believe that a table
 does not exist.
 
 This changes dataload to invalidate metadata at an
 individual table level rather than globally. This
 prevents the concurrency issue.
 
 This also changes the names of some of the intermediate
 SQL files generated by generate-schema-statements.py
 and consumed by load-data.py to make them less confusing.
 
 Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
 Reviewed-on: http://gerrit.cloudera.org:8080/9009
 Reviewed-by: Joe McDonnell 
 Tested-by: Impala Public Jenkins

> Dataload can fail due to "invalidate metadata" concurrent with DDLs
> ---
>
> Key: IMPALA-6386
> URL: https://issues.apache.org/jira/browse/IMPALA-6386
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.11.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Critical
> Fix For: Impala 2.12.0
>
>
> testdata/bin/create-load-data.sh runs bin/load-data.py on TPC-H, TPC-DS, and 
> functional-query in parallel. One of the final steps of bin/load-data.py is 
> to run a universal "invalidate metadata". However, universal "invalidate 
> metadata" is an error-prone operation in a concurrent system. When 
> "invalidate metadata" happens during the DDL statements for another dataset 
> (i.e. TPC-H finishes and runs "invalidate metadata" while TPC-DS is still 
> creating tables and adding partitions), it can lead to errors.
> Thread 1: create external table foo ... ;
> Thread 2: invalidate metadata;
> Thread 1: alter table foo add partition bar; <-- Hits error because it can't 
> find foo
> This is a known issue: IMPALA-5087. This has been seen in my development 
> environment and one automated build, but it is relatively rare.
> Dataload needs to switch to using "invalidate metadata {table_name}" to avoid 
> this issue. This is also a good time to consider using "refresh {table_name}".
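
The table-level approach described above can be sketched as generating one statement per table instead of a single global command. The helper name table_level_invalidate and the database and table names below are made up for illustration; this is not code from bin/load-data.py.

```python
def table_level_invalidate(db_name, table_names):
    """Build one 'invalidate metadata' statement per table.

    Hypothetical helper for illustration; not taken from bin/load-data.py.
    """
    return ["invalidate metadata {0}.{1}".format(db_name, t) for t in table_names]

# Made-up database and table names.
statements = table_level_invalidate("tpch", ["lineitem", "orders"])
for stmt in statements:
    print(stmt)
```

Because each statement names a single table, a loader working on one dataset no longer invalidates tables that another loader is still creating.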





[jira] [Created] (IMPALA-6419) hdfs-parquet-scanner.cc:624] Check failed: 0 == context_->NumStreams() (0 vs. 11)

2018-01-18 Thread Tim Armstrong (JIRA)
Tim Armstrong created IMPALA-6419:
-

 Summary: hdfs-parquet-scanner.cc:624] Check failed: 0 == 
context_->NumStreams() (0 vs. 11)
 Key: IMPALA-6419
 URL: https://issues.apache.org/jira/browse/IMPALA-6419
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.12.0
Reporter: Tim Armstrong
Assignee: Tim Armstrong


{noformat}
F0118 03:59:51.644770 21603 hdfs-parquet-scanner.cc:624] Check failed: 0 == 
context_->NumStreams() (0 vs. 11) 
*** Check failure stack trace: ***
@  0x3c0611d  google::LogMessage::Fail()
@  0x3c079c2  google::LogMessage::SendToLog()
@  0x3c05af7  google::LogMessage::Flush()
@  0x3c090be  google::LogMessageFatal::~LogMessageFatal()
@  0x1bf5f26  impala::HdfsParquetScanner::NextRowGroup()
@  0x1bf4c52  impala::HdfsParquetScanner::GetNextInternal()
@  0x1bf3264  impala::HdfsParquetScanner::ProcessSplit()
@  0x1b7cf42  impala::HdfsScanNode::ProcessSplit()
@  0x1b7c36d  impala::HdfsScanNode::ScannerThread()
@  0x1b7b7f0  
_ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_17ThreadResourceMgr12ResourcePoolEENKUlvE_clEv
@  0x1b7d793  
_ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_17ThreadResourceMgr12ResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
@  0x17d669a  boost::function0<>::operator()()
@  0x1ad2993  impala::Thread::SuperviseThread()
@  0x1adb6a8  boost::_bi::list4<>::operator()<>()
@  0x1adb5eb  boost::_bi::bind_t<>::operator()()
@  0x1adb5ae  boost::detail::thread_data<>::run()
@  0x2d8faea  thread_proxy
@ 0x7f64f6b1f6ba  start_thread
@ 0x7f64f685541d  clone
{noformat}


