[jira] [Created] (IMPALA-6233) Document the column definitions list in CREATE VIEW

2017-11-21 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6233:
--

 Summary: Document the column definitions list in CREATE VIEW
 Key: IMPALA-6233
 URL: https://issues.apache.org/jira/browse/IMPALA-6233
 Project: IMPALA
  Issue Type: Improvement
  Components: Docs
Affects Versions: Impala 2.10.0
Reporter: Alexander Behm
Assignee: John Russell


Looking at this page:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_view.html#create_view

It appears we do not have an example for the "columns_list" that shows adding a 
comment to a column. We should add that.

Example:
{code}
create table t1 (c1 int, c2 int);
create view v (x comment 'hello world', y) as select * from t1;
describe v;
+------+------+-------------+
| name | type | comment     |
+------+------+-------------+
| x    | int  | hello world |
| y    | int  |             |
+------+------+-------------+
{code}





[jira] [Created] (IMPALA-6232) Short circuit reads disabled when using Impala HDFS file handle cache

2017-11-21 Thread Joe McDonnell (JIRA)
Joe McDonnell created IMPALA-6232:
-

 Summary: Short circuit reads disabled when using Impala HDFS file 
handle cache
 Key: IMPALA-6232
 URL: https://issues.apache.org/jira/browse/IMPALA-6232
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.10.0
Reporter: Joe McDonnell
Assignee: Joe McDonnell
Priority: Blocker


In Impala 2.10, the HDFS file handle cache was enabled by default. However, 
testing has revealed that in cases where files are overwritten or appended to, the 
file handle can encounter an error that causes HDFS to disable short circuit 
reads for 10 minutes. See 
[HDFS-12528|https://issues.apache.org/jira/browse/HDFS-12528].

Due to this performance impact and the associated unpredictability, Impala 
should disable the file handle cache by default until this issue is resolved.





[jira] [Resolved] (IMPALA-6172) KRPC w/ TLS doesn't work on remote clusters after rebase

2017-11-21 Thread Sailesh Mukil (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sailesh Mukil resolved IMPALA-6172.
---
   Resolution: Fixed
Fix Version/s: Impala 2.11.0

Commit in:
https://github.com/apache/incubator-impala/commit/32baa695f499a936b72c5a51ae3649c408aa5a85

> KRPC w/ TLS doesn't work on remote clusters after rebase
> 
>
> Key: IMPALA-6172
> URL: https://issues.apache.org/jira/browse/IMPALA-6172
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Security
>Reporter: Sailesh Mukil
>Assignee: Sailesh Mukil
>Priority: Blocker
>  Labels: broken-build, security
> Fix For: Impala 2.11.0
>
>
> It looks like depending on who initializes OpenSSL (KRPC or us), the behavior 
> changes. After some cherry-picks, we're unable to run Impala on remote 
> clusters with TLS with certain certificate types.
> We get the following when we use intermediate CAs:
> {code:java}
> "F1108 10:47:36.532202 93303 impalad-main.cc:79] Could not build messenger: 
> Runtime error: certificate does not match private key: error:0B080074:x509 
> certificate routines:X509_check_private_key:key values 
> mismatch:x509_cmp.c:331"
> {code}
> And we get the following when we use self-signed certificates:
> "self signed certificate in certificate chain"





[jira] [Resolved] (IMPALA-5019) DECIMAL V2 add/sub result type

2017-11-21 Thread Taras Bobrovytsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taras Bobrovytsky resolved IMPALA-5019.
---
   Resolution: Fixed
Fix Version/s: Impala 2.11.0

{code}
commit bc12a9eb35ff60d7a7e0f6732e9ab6a1d4538f2a
Author: Taras Bobrovytsky 
Date:   Tue Sep 19 16:23:24 2017 -0700

IMPALA-5019: Decimal V2 addition

In this patch, we implement the new decimal return type rules for
addition expressions. These rules become active when the query option
DECIMAL_V2 is enabled. The algorithm for determining the type of the
result is described in the JIRA.

DECIMAL V1:
+----------------------------------------------------------------+
| typeof(cast(1 as decimal(38,0)) + cast(0.1 as decimal(38,38))) |
+----------------------------------------------------------------+
| DECIMAL(38,38)                                                  |
+----------------------------------------------------------------+

DECIMAL V2:
+----------------------------------------------------------------+
| typeof(cast(1 as decimal(38,0)) + cast(0.1 as decimal(38,38))) |
+----------------------------------------------------------------+
| DECIMAL(38,6)                                                   |
+----------------------------------------------------------------+

This patch required backend changes. We implement an algorithm where
we handle the whole and fractional parts separately, and then combine
them to get the final result. This is more complex and slower. We try
to avoid this by first checking if the result would fit into int128.

Testing:
- Added expr tests.
- Tested locally on my machine with a script that generates random
  decimal numbers and checks that Impala adds them correctly.

Performance:

For the common case, performance remains the same.
  select cast(2.2 as decimal(18, 1)) + cast(2.2 as decimal(18, 1))

  BEFORE: 4.74s
  AFTER:  4.73s

In this case, we check whether it is necessary to do the complex addition,
and it turns out not to be necessary. We still see a slowdown because the
result needs to be scaled down by dividing.
  select cast(2.2 as decimal(38, 19)) + cast(2.2 as decimal(38, 19))

  BEFORE: 1.63s
  AFTER:  13.57s

In the following case, we take the most complex path and see the most
significant performance hit.
  select cast(7.5 as decimal(38,37)) + cast(2.2 as decimal(38,37))

  BEFORE: 1.63s
  AFTER:  20.57s
{code}

> DECIMAL V2 add/sub result type
> --
>
> Key: IMPALA-5019
> URL: https://issues.apache.org/jira/browse/IMPALA-5019
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.0
>Reporter: Dan Hecht
>Assignee: Taras Bobrovytsky
> Fix For: Impala 2.11.0
>
>
> For decimal_v2=true, we should revisit the add/sub result type. Currently, we 
> set result scale to max(S1, S2) (potentially losing precision).  Other 
> systems (e.g. SQL Server) seem to choose either S1 or S2 depending on whether 
> digits to the left of the decimal point would be lost.  This would require 
> changes to the backend implementation of add/sub, however.
> Currently we compute rP and rS as follows:
> {code}
> rS = max(s1, s2)
> rP = max(s1, s2) + max(p1 - s1, p2 - s2) + 1
> {code}
> We currently handle the case where rP > 38 as follows:
> {code}
> if (rP > 38):
>   rP = 38
>   rS = min(38, rS)
> {code}
> This basically truncates the digits to the left of the decimal point.
> The proposed result under V2 is:
> {code}
> if (rP > 38):
>   minS = min(rS, 6)
>   rS = rS - (rP - 38)
>   rS = max(minS, rS)
> {code}
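> As a worked example, applying the proposed rule to the typeof() expression shown in 
> the commit message above, decimal(38,0) + decimal(38,38), gives the DECIMAL(38,6) 
> reported there:
> {code}
> rS = max(0, 38) = 38
> rP = 38 + max(38 - 0, 38 - 38) + 1 = 77
> rP > 38, so:
>   minS = min(rS, 6) = 6
>   rS = rS - (rP - 38) = 38 - 39 = -1
>   rS = max(minS, rS) = 6
> result type: DECIMAL(38,6)
> {code}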





[jira] [Closed] (IMPALA-6200) Flakiness in Planner Tests

2017-11-21 Thread Zach Amsden (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Amsden closed IMPALA-6200.
---
Resolution: Fixed

I'm gonna go ahead and close this as a dup. We can continue the investigation in 
IMPALA-3887.

> Flakiness in Planner Tests
> --
>
> Key: IMPALA-6200
> URL: https://issues.apache.org/jira/browse/IMPALA-6200
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.11.0
>Reporter: Taras Bobrovytsky
>Assignee: Zach Amsden
>Priority: Blocker
>  Labels: broken-build, flaky
>
> Sometimes we are seeing random small variations in Planner tests, which cause 
> builds to be flaky.
> Actual:
> {code}
> F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=48.00MB mem-reservation=0B
> ^^
> WRITE TO HDFS [functional.alltypes, OVERWRITE=false, PARTITION-KEYS=(CAST(3 + 
> year AS INT),CAST(month - -1 AS INT))]
> |  partitions=4
> |  mem-estimate=1.56KB mem-reservation=0B
> {code}
> Expected:
> {code}
> F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B
> WRITE TO HDFS [functional.alltypes, OVERWRITE=false, PARTITION-KEYS=(CAST(3 + 
> year AS INT),CAST(month - -1 AS INT))]
> |  partitions=4
> |  mem-estimate=1.56KB mem-reservation=0B
> {code}
> Actual:
> {code}
> |  F01:PLAN FRAGMENT [RANDOM] hosts=2 instances=2
> ^
> |  Per-Host Resources: mem-estimate=16.00MB mem-reservation=0B
> |  01:SCAN HDFS [functional_parquet.alltypestiny, RANDOM]
> | partitions=4/4 files=4 size=9.75KB
> | stats-rows=unavailable extrapolated-rows=disabled
> | table stats: rows=unavailable size=unavailable
> | column stats: unavailable
> | mem-estimate=16.00MB mem-reservation=0B
> | tuple-ids=1 row-size=88B cardinality=unavailable
> {code}
> Expected:
> {code}
> |  F01:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> |  Per-Host Resources: mem-estimate=16.00MB mem-reservation=0B
> |  01:SCAN HDFS [functional_parquet.alltypestiny, RANDOM]
> | partitions=4/4 files=4 size=10.48KB
> | stats-rows=unavailable extrapolated-rows=disabled
> | table stats: rows=unavailable size=unavailable
> | column stats: unavailable
> | mem-estimate=16.00MB mem-reservation=0B
> | tuple-ids=1 row-size=88B cardinality=unavailable
> {code}





[jira] [Closed] (IMPALA-3436) Round(double, int) should return decimal

2017-11-21 Thread Taras Bobrovytsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taras Bobrovytsky closed IMPALA-3436.
-
Resolution: Won't Fix

[~dhecht], I agree that doubles in general have problems (such as there being 
different ways to represent the same double). If customers want to avoid these 
problems, they should avoid using the double type. Also, as Tim mentioned, they 
can still get correct rounding behavior by casting to decimal first.
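
As a minimal sketch of that cast-first workaround (the literal values here are 
illustrative, not taken from this issue, and client output formatting may vary):
{code}
-- round(double, int) returns a DOUBLE, which different clients may display differently
select round(cast(1 as double) / 10, 1);
-- casting to DECIMAL first makes round() return an exact DECIMAL
select round(cast(cast(1 as double) / 10 as decimal(20, 10)), 1);
{code}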

I created IMPALA-6230 to keep track of the required round() changes.

> Round(double, int) should return decimal
> 
>
> Key: IMPALA-3436
> URL: https://issues.apache.org/jira/browse/IMPALA-3436
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.5.0
>Reporter: Tim Armstrong
>Assignee: Taras Bobrovytsky
>Priority: Minor
>  Labels: compatibility, usability
>
> Impala has several versions of round: round(double a), round(double a, int 
> d), round(decimal a, int_type d)
> round(double a) returns a BIGINT, which makes sense because it rounds to the 
> nearest int.
> round(decimal a, int_type d) returns a DECIMAL, which makes sense because it 
> rounds to a decimal digit.
> round(double a, int d) predates DECIMAL support, so it returns a DOUBLE. It 
> is specified to return the nearest double value. 
> E.g. round(cast(1 as DOUBLE) / 10, 1) returns the binary floating point value 
> closest to 0.1. This number has no exact decimal representation. Both 
> 0.100 and 0.1555 are valid decimal 
> representations of this floating point number.  I.e. if you convert them back 
> to float, you will get the same number.
> This is correct according to floating point conversion rules and the Impala 
> documentation, but it is confusing for two reasons:
> * round() returning a double is a little surprising, because it can't 
> precisely represent the result
> * Impala clients can display the floating-point result in multiple *valid* 
> ways. Different clients have different algorithms for converting 
> floating-point to decimal for display, so even if Impala returns the same 
> result it may appear as 0.1 in one client and 0.1555 in another. We 
> don't specify that clients have to use a particular algorithm, so it's valid 
> as long as it converts back to the same float as part of a round-trip.
> We should consider changing the spec of round() in Impala to always return a 
> decimal to avoid this confusion.





[jira] [Resolved] (IMPALA-5624) ProcessStateInfo::ReadProcFileDescriptorInfo() should not fork a process

2017-11-21 Thread Csaba Ringhofer (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-5624.
-
   Resolution: Done
Fix Version/s: Impala 2.11.0

IMPALA-5624: Replace "ls -l" with opendir() in ProcessStateInfo

Running shell commands from impalad can be problematic, because using popen()
leads to forking, which causes a spike in virtual memory. To avoid this, "ls" is
replaced with POSIX API calls.

FileDescriptorMap fd_desc_ was only used to get the number of file descriptors,
so it was unnecessary work to initialize it. It is removed, and only the number
of file descriptors is computed.

The automatic test for this function is only a sanity check, because there is no
way to know the "expected value" in advance, and the number of file descriptors
can change at any time.

Change-Id: Ibffae8069a62e100abbfa7d558b49040b095ddc0
Reviewed-on: http://gerrit.cloudera.org:8080/8546
Reviewed-by: Lars Volker 
Tested-by: Impala Public Jenkins

> ProcessStateInfo::ReadProcFileDescriptorInfo() should not fork a process
> 
>
> Key: IMPALA-5624
> URL: https://issues.apache.org/jira/browse/IMPALA-5624
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.10.0
>Reporter: Tim Armstrong
>Assignee: Csaba Ringhofer
> Fix For: Impala 2.11.0
>
>
> Forking processes from the Impala daemon after startup is problematic because 
> of the spike in virtual memory it causes (see IMPALA-2294). We should avoid 
> doing this in ProcessStateInfo::ReadProcFileDescriptorInfo(), which is 
> invoked from the web server debug pages.





[jira] [Created] (IMPALA-6231) Do some fuzz testing of decimal v2 operations

2017-11-21 Thread Taras Bobrovytsky (JIRA)
Taras Bobrovytsky created IMPALA-6231:
-

 Summary: Do some fuzz testing of decimal v2 operations
 Key: IMPALA-6231
 URL: https://issues.apache.org/jira/browse/IMPALA-6231
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Taras Bobrovytsky
Assignee: Taras Bobrovytsky


After all decimal v2 patches go in, we need to develop and run a fuzz tester 
that checks the correctness of decimal v2 operations. For example, the fuzzer 
could generate two random decimals, add them together, and verify that the 
result is correct.
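
For illustration, one concrete instance of the kind of check such a fuzzer might 
generate (the values are made up for this sketch; the expected sum would come from 
an independent reference implementation):
{code}
-- the fuzzer would verify that Impala returns exactly 6.02 here
select cast(1.23 as decimal(4,2)) + cast(4.79 as decimal(4,2));
{code}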





[jira] [Created] (IMPALA-6230) The output type of a round() function should match the input type

2017-11-21 Thread Taras Bobrovytsky (JIRA)
Taras Bobrovytsky created IMPALA-6230:
-

 Summary: The output type of a round() function should match the 
input type
 Key: IMPALA-6230
 URL: https://issues.apache.org/jira/browse/IMPALA-6230
 Project: IMPALA
  Issue Type: Bug
  Components: Backend, Frontend
Affects Versions: Impala 2.10.0
Reporter: Taras Bobrovytsky
Assignee: Taras Bobrovytsky


At the next compatibility-breaking version we should revisit the output types 
of the round() functions. To match the behavior of most other database systems, 
the output type of round() should be the same as the input type.

For example, today, round(double) returns a bigint. We should return a double 
instead.
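
A quick sketch of the current behavior using typeof() (the literal value is 
arbitrary and only illustrative):
{code}
-- today this reports BIGINT; under the proposed change it would report DOUBLE
select typeof(round(cast(2.5 as double)));
{code}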





[jira] [Resolved] (IMPALA-6054) Parquet dictionary pages should be freed on dictionary construction

2017-11-21 Thread Csaba Ringhofer (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-6054.
-
   Resolution: Done
Fix Version/s: Impala 2.11.0

IMPALA-6054: Parquet dictionary pages should be freed on dictionary construction

During dictionary construction, most types are copied from the Parquet
dictionary page, but StringValues keep pointers to it. In this case,
the dictionary page must be kept and attached to the last row batch
that references it. For other types, it is safe to delete
the dictionary page after the dictionary has been constructed.

This patch contains two optimizations:
- dictionary pages are deleted as soon as possible for non-string types
- in the non-compressed, non-string case, an unnecessary copy is avoided

Change-Id: I4d9d5f4da1028d961155dafdac0028a1c3641004
Reviewed-on: http://gerrit.cloudera.org:8080/8436
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins

> Parquet dictionary pages should be freed on dictionary construction
> ---
>
> Key: IMPALA-6054
> URL: https://issues.apache.org/jira/browse/IMPALA-6054
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.10.0
>Reporter: Joe McDonnell
>Assignee: Csaba Ringhofer
>Priority: Minor
>  Labels: resource-management
> Fix For: Impala 2.11.0
>
>
> The Parquet scanner uses the dictionary_pool_ to allocate memory for the 
> dictionary page (see BaseScalarColumnReader::InitDictionary()). This 
> dictionary page is used to initialize the dictionary in 
> CreateDictionaryDecoder(). The resulting dictionary is a vector of values. 
> For some datatypes, such as strings, the resulting dictionary has an array of 
> StringValue's that contain pointers into the dictionary page (see the 
> StringValue specialization in ParquetPlainEncoder::Decode()). In this case, 
> the dictionary page must be kept and attached to the last row batch that 
> references it. However, for other datatypes, the values are copied into the 
> dictionary and the dictionary page is no longer needed after the dictionary 
> is constructed.
> Currently, these dictionary pages remain in the dictionary_pool_ and are 
> attached to the last row batch to be passed to other ExecNodes (see 
> FlushRowGroupResources()). Only StringValue dictionary pages (or pages of other 
> types that point to data in the page) should be passed on the row batch. Pages for 
> the other types should be freed immediately once the dictionary has been constructed.





[jira] [Created] (IMPALA-6229) Different test results between test/run-tests.py and impala-py.test

2017-11-21 Thread Jinchul Kim (JIRA)
Jinchul Kim created IMPALA-6229:
---

 Summary: Different test results between test/run-tests.py and 
impala-py.test
 Key: IMPALA-6229
 URL: https://issues.apache.org/jira/browse/IMPALA-6229
 Project: IMPALA
  Issue Type: Bug
  Components: Infrastructure
Reporter: Jinchul Kim
Priority: Minor


I copied this from the dev mailing list because it might be an infra issue.

I am trying to look into the build error:
https://jenkins.impala.io/job/gerrit-verify-dryrun/1472/
(The relevant code change: https://gerrit.cloudera.org/#/c/8355/13). You can 
reproduce the issue using the patch set.

There was a test failure at "TestAllocFail.test_alloc_fail_init". I ran the
following command but it always passed on my change: ./tests/run-tests.py
tests/custom_cluster/test_alloc_fail.py

The "run-tests.py" command below looks fine to me because it reports that the
tests finished successfully, but I suspect that is a false positive. If
"tests/custom_cluster/test_alloc_fail.py" cannot be run with "run-tests.py", the
run should finish with an error because of the incompatibility. Would you please
check two things?
1. The false positive result
2. Does "tests/custom_cluster/test_alloc_fail.py" actually run with "run-tests.py"?

$ ./tests/run-tests.py tests/custom_cluster/test_alloc_fail.py
...
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-2.9.2, py-1.4.32, pluggy-0.3.1 --
/home/jinchulkim/Impala/bin/../infra/python/env/bin/python
cachedir: .cache
rootdir: /home/jinchulkim/Impala/tests, inifile: pytest.ini
plugins: xdist-1.15.0, random-0.2
collected 2 items

verifiers/test_verify_metrics.py::TestValidateMetrics::test_metrics_are_zero
PASSED
verifiers/test_verify_metrics.py::TestValidateMetrics::test_num_unused_buffers
PASSED
=========================== 2 passed in 0.11 seconds ===========================



