[jira] [Commented] (IMPALA-10073) Create shaded dependency for S3A and aws-java-sdk-bundle

2020-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186878#comment-17186878
 ] 

ASF subversion and git services commented on IMPALA-10073:
--

Commit 5daff3472440dc6174f0f31a28bbdafee4f68716 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5daff34 ]

IMPALA-10073: Create shaded dependency for S3A and aws-java-sdk-bundle

The aws-java-sdk-bundle is one of the largest dependencies in the Impala
Docker images and continues to grow. The jar includes SDKs for
every single AWS service.

This patch removes most of the unnecessary SDKs from the
aws-java-sdk-bundle, thus drastically decreasing the size of the
dependency. The Maven shade plugin is used to do this, and the
implementation is similar to what is currently done for the hive-exec
jar.

This patch takes a conservative approach to removing packages from the
aws-java-sdk-bundle jar, and I ensured no direct dependencies of the S3
SDK were removed. The idea is to only remove dependencies that S3A would
never conceivably need. Given the huge number of AWS services, I only
focused on removing the largest SDKs (the size of each SDK is estimated
by the number of classes in the SDK).
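For illustration, that estimation can be done with a short script; a minimal
sketch, assuming the bundle's usual com/amazonaws/services/<service>/ package
layout (the jar filename below is a placeholder):

{code:python}
# Sketch: estimate each AWS SDK's size by counting its classes in the
# bundle jar, mirroring the estimation approach described above.
import zipfile
from collections import Counter

def sdk_class_counts(jar_path):
    counts = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # Each service SDK lives under com/amazonaws/services/<service>/.
            if name.startswith("com/amazonaws/services/") and name.endswith(".class"):
                counts[name.split("/")[3]] += 1
    return counts

if __name__ == "__main__":
    # The ten largest SDKs are the best candidates for exclusion.
    for service, n in sdk_class_counts("aws-java-sdk-bundle.jar").most_common(10):
        print(service, n)
{code}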

This decreases the size of the Docker images by about 100 MB.

Testing:
* Ran core tests against S3

Change-Id: I0939f73be986f83cc1fd07921563b4d9201780f2
Reviewed-on: http://gerrit.cloudera.org:8080/16342
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Create shaded dependency for S3A and aws-java-sdk-bundle
> 
>
> Key: IMPALA-10073
> URL: https://issues.apache.org/jira/browse/IMPALA-10073
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
>
> One of the largest dependencies in Impala Docker containers is the 
> aws-java-sdk-bundle jar. One way to decrease the size of this dependency is 
> to apply a similar technique used for the hive-exec shaded jar: 
> [https://github.com/apache/impala/blob/master/shaded-deps/pom.xml]
> The aws-java-sdk-bundle contains SDKs for all AWS services, even though 
> Impala-S3A only requires a few of the more basic SDKs.
> IMPALA-10028 and HADOOP-17197 both discuss this a bit as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-10112) Consider skipping FpRateTooHigh() check for bloom filters

2020-08-28 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186811#comment-17186811
 ] 

Tim Armstrong commented on IMPALA-10112:


cc [~drorke]

> Consider skipping FpRateTooHigh() check for bloom filters
> -
>
> Key: IMPALA-10112
> URL: https://issues.apache.org/jira/browse/IMPALA-10112
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: performance
>
> This check disables bloom filters on the sender side.
> It is inaccurate in cases where there are duplicate values of the filter key 
> on the build side, e.g. a many-to-many join or a join with multiple keys. This 
> could be fixed with some effort, but is probably not worth it, because:
> * Partition filters are probably still worth evaluating even if there are 
> false positives, because it's cheap and eliminating a partition is still 
> beneficial.
> * Runtime filters are dynamically disabled on the scan side if they are 
> ineffective. I think we still also "evaluate" the always true filter, which 
> is cheaper than doing the hashing and bloom evaluation, but still not 
> entirely free.
> * The disabling is fairly unlikely to kick in for partitioned joins because 
> it's only applied to a small subset of the filter, before the Or() operation.
> So it's potentially harmful and only likely beneficial for broadcast join 
> filters, in which case it saves a small amount of scan CPU and, for global 
> filters, coordinator RPCs and broadcasting. It's unclear that the complexity 
> is worth it for this relatively small and uncertain benefit.
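For background on why duplicates matter here: the standard false-positive
estimate for a Bloom filter with m bits, k hash functions, and n distinct
inserted keys is

{noformat}
p \approx \left(1 - e^{-kn/m}\right)^{k}
{noformat}

If the sender estimates n from the build-side row count, duplicate filter keys
inflate n and therefore the estimated p, so the check can disable a filter
whose true false-positive rate is acceptable.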



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9989) Improve admission control pool stats logging

2020-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186809#comment-17186809
 ] 

ASF subversion and git services commented on IMPALA-9989:
-

Commit 2ef6184ee1010d29fdaa5cd4ba5f0c95ef9abc0d in impala's branch 
refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2ef6184 ]

IMPALA-9989 Improve admission control pool stats logging

This work addresses a limitation in the admission controller by
appending the last known memory consumption statistics for the set of
queries running or waiting on a host or in a pool to the existing memory
exhaustion message. The statistics are logged in impalad.INFO when a
query is queued, or queued and then timed out, due to memory pressure in
the pool or on the host. The statistics can also be part of the query
profile.

The new memory consumption statistics can be either stats on a host or
aggregated pool stats. The stats on a host describe memory consumption
for every pool on that host. The aggregated pool stats describe the
aggregated memory consumption on all hosts for a pool. For each stats
type, the query IDs and memory consumption of up to the top 5 queries
are provided, in addition to the min, max, average, and total memory
consumption for the query set.
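As a rough illustration of the reported shape (a minimal Python sketch; the
function and field names are assumptions, not the actual C++ implementation):

{code:python}
# Sketch: summarize per-query memory consumption the way the new log
# message is described: top-5 queries plus min/max/avg/total.
import heapq

def summarize_pool_mem(query_mem):
    """query_mem: dict mapping query_id -> memory consumed in bytes."""
    values = list(query_mem.values())
    return {
        "top5": heapq.nlargest(5, query_mem.items(), key=lambda kv: kv[1]),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "total": sum(values),
    }
{code}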

When a query request is queued due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set
to 2.

When a query request is timed out due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set
to 1.

Testing:
1. Added a new test TopNQueryCheck in admission-controller-test.cc to
   verify that the topN query memory consumption details are reported
   correctly.
2. Added two new tests in test_admission_controller.py to simulate
   queries being queued and then timed out due to pool or host memory
   pressure.
3. Added a new test TopN in mem-tracker-test.cc to
   verify that the topN query memory consumption details are computed
   correctly from a mem tracker hierarchy.
4. Ran Core tests successfully.

Change-Id: Id995a9d044082c3b8f044e1ec25bb4c64347f781
Reviewed-on: http://gerrit.cloudera.org:8080/16220
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Improve admission control pool stats logging
> 
>
> Key: IMPALA-9989
> URL: https://issues.apache.org/jira/browse/IMPALA-9989
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Vincent Tran
>Assignee: Qifan Chen
>Priority: Major
>
> Information that should be explicit to log consumers:
> 1) Global pool stats at the time of admission. The stats from 
> 'admission-controller.cc:515' only aggregate from queries admitted by this 
> host.
> 2) Local host's memory - since it is also a factor in the admission decision.
> 3) Any other info that would factor into the admission decision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-10050) DCHECK was hit possibly while executing TestFailpoints::test_failpoints

2020-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186680#comment-17186680
 ] 

ASF subversion and git services commented on IMPALA-10050:
--

Commit 3733c4cc2cfb78d7f13463fb1ee9e1c4560d4a3d in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3733c4c ]

IMPALA-10050: Fixed DCHECK error for backend in terminal state.

A recent patch for IMPALA-6788 makes the coordinator cancel in-flight
query fragment instances when it receives a failure report from one
backend. It's possible that BackendState::Cancel() is called for
a fragment instance before the first execution status report
from its backend is received and processed by the coordinator.
Since the status of BackendState is set to Cancelled after Cancel()
is called, the execution of the fragment instance is treated as
Done in that case, so the status report will NOT be processed.
Hence the backend receives an OK response from the coordinator even
though it sent a report with an execution error. This makes the backend
hit a DCHECK error when it is in a terminal state with an error.
This patch fixes the issue by making the coordinator send a CANCELLED
status in the response to the status report if the backend status is not
OK and the execution status report was not applied.
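A hedged sketch of the race and the fix described above (class and method
names are illustrative, not the actual Impala C++ symbols):

{code:python}
class BackendState:
    def __init__(self):
        self.cancelled = False

    def cancel(self):
        # Coordinator cancels this backend before its first report arrives.
        self.cancelled = True

    def handle_status_report(self, report_has_error):
        if not self.cancelled:
            return "OK"          # report applied normally
        # Report arrives after Cancel(): it is not applied. Before the fix
        # the coordinator answered OK even for an error report, and the
        # backend hit a DCHECK in its terminal error state. The fix answers
        # CANCELLED instead when the backend reported an error.
        return "CANCELLED" if report_has_error else "OK"
{code}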

Testing:
 - The issue could be reproduced by running test_failpoints for about
   20 iterations. Verified the fix by running test_failpoints over
   200 iterations without a DCHECK failure.
 - Passed TestProcessFailures::test_kill_coordinator.
 - Passed TestRPCException::test_state_report_error.
 - Passed exhaustive tests.

Change-Id: Iba6a72f98c0f9299c22c58830ec5a643335b966a
Reviewed-on: http://gerrit.cloudera.org:8080/16303
Reviewed-by: Thomas Tauber-Marshall 
Tested-by: Impala Public Jenkins 


> DCHECK was hit possibly while executing TestFailpoints::test_failpoints
> ---
>
> Key: IMPALA-10050
> URL: https://issues.apache.org/jira/browse/IMPALA-10050
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Attila Jeges
>Assignee: Wenzhe Zhou
>Priority: Blocker
>  Labels: broken-build, crash, flaky
> Fix For: Impala 4.0
>
>
> A DCHECK was hit during ASAN core e2e tests. The time frame suggests that it 
> happened while executing the TestFailpoints::test_failpoints e2e test.
> {code}
> 10:56:38  TestFailpoints.test_failpoints[protocol: beeswax | table_format: 
> avro/snap/block | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | 
> location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: select 1 from 
> alltypessmall a join alltypessmall b on a.id = b.id] 
> 10:56:38 failure/test_failpoints.py:128: in test_failpoints
> 10:56:38 self.execute_query(query, vector.get_value('exec_option'))
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:811:
>  in wrapper
> 10:56:38 return function(*args, **kwargs)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:843:
>  in execute_query
> 10:56:38 return self.__execute_query(self.client, query, query_options)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:909:
>  in __execute_query
> 10:56:38 return impalad_client.execute(query, user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_connection.py:205:
>  in execute
> 10:56:38 return self.__beeswax_client.execute(sql_stmt, user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:187:
>  in execute
> 10:56:38 handle = self.__execute_query(query_string.strip(), user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:365:
>  in __execute_query
> 10:56:38 self.wait_for_finished(handle)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:386:
>  in wait_for_finished
> 10:56:38 raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 10:56:38 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 10:56:38 EQuery aborted:RPC from 127.0.0.1:27000 to 127.0.0.1:27002 failed
> 10:56:38 E   TransmitData() to 127.0.0.1:27002 failed: Network error: Client 
> connection negotiation failed: client connection to 127.0.0.1:27002: connect: 
> Connection refused (error 111)
> {code}
> Impalad log:
> {code}
> Log file created at: 

[jira] [Commented] (IMPALA-6788) Abort ExecFInstance() RPC loop early after query failure

2020-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186681#comment-17186681
 ] 

ASF subversion and git services commented on IMPALA-6788:
-

Commit 3733c4cc2cfb78d7f13463fb1ee9e1c4560d4a3d in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3733c4c ]

IMPALA-10050: Fixed DCHECK error for backend in terminal state.

A recent patch for IMPALA-6788 makes the coordinator cancel in-flight
query fragment instances when it receives a failure report from one
backend. It's possible that BackendState::Cancel() is called for
a fragment instance before the first execution status report
from its backend is received and processed by the coordinator.
Since the status of BackendState is set to Cancelled after Cancel()
is called, the execution of the fragment instance is treated as
Done in that case, so the status report will NOT be processed.
Hence the backend receives an OK response from the coordinator even
though it sent a report with an execution error. This makes the backend
hit a DCHECK error when it is in a terminal state with an error.
This patch fixes the issue by making the coordinator send a CANCELLED
status in the response to the status report if the backend status is not
OK and the execution status report was not applied.

Testing:
 - The issue could be reproduced by running test_failpoints for about
   20 iterations. Verified the fix by running test_failpoints over
   200 iterations without a DCHECK failure.
 - Passed TestProcessFailures::test_kill_coordinator.
 - Passed TestRPCException::test_state_report_error.
 - Passed exhaustive tests.

Change-Id: Iba6a72f98c0f9299c22c58830ec5a643335b966a
Reviewed-on: http://gerrit.cloudera.org:8080/16303
Reviewed-by: Thomas Tauber-Marshall 
Tested-by: Impala Public Jenkins 


> Abort ExecFInstance() RPC loop early after query failure
> 
>
> Key: IMPALA-6788
> URL: https://issues.apache.org/jira/browse/IMPALA-6788
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Distributed Exec
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Wenzhe Zhou
>Priority: Major
>  Labels: krpc, rpc
> Fix For: Impala 4.0
>
> Attachments: connect_thread_busy_queries_failing.txt, 
> impalad.va1007.foo.com.impala.log.INFO.20180401-200453.1800807.zip
>
>
> Logs from a large cluster show that query startup can take a long time; then, 
> once the startup completes, the query is cancelled because one of the 
> intermediate RPCs failed. 
> It's not clear what the right answer is, as fragments are started 
> asynchronously; possibly a timeout?
> {code}
> I0401 21:25:30.776803 1830900 coordinator.cc:99] Exec() 
> query_id=334cc7dd9758c36c:ec38aeb4 stmt=with customer_total_return as
> I0401 21:25:30.813993 1830900 coordinator.cc:357] starting execution on 644 
> backends for query_id=334cc7dd9758c36c:ec38aeb4
> I0401 21:29:58.406466 1830900 coordinator.cc:370] started execution on 644 
> backends for query_id=334cc7dd9758c36c:ec38aeb4
> I0401 21:29:58.412132 1830900 coordinator.cc:896] Cancel() 
> query_id=334cc7dd9758c36c:ec38aeb4
> I0401 21:29:59.188817 1830900 coordinator.cc:906] CancelBackends() 
> query_id=334cc7dd9758c36c:ec38aeb4, tried to cancel 643 backends
> I0401 21:29:59.189177 1830900 coordinator.cc:1092] Release admission control 
> resources for query_id=334cc7dd9758c36c:ec38aeb4
> {code}
> {code}
> I0401 21:23:48.218379 1830386 coordinator.cc:99] Exec() 
> query_id=e44d553b04d47cfb:28f06bb8 stmt=with customer_total_return as
> I0401 21:23:48.270226 1830386 coordinator.cc:357] starting execution on 640 
> backends for query_id=e44d553b04d47cfb:28f06bb8
> I0401 21:29:58.402195 1830386 coordinator.cc:370] started execution on 640 
> backends for query_id=e44d553b04d47cfb:28f06bb8
> I0401 21:29:58.403818 1830386 coordinator.cc:896] Cancel() 
> query_id=e44d553b04d47cfb:28f06bb8
> I0401 21:29:59.255903 1830386 coordinator.cc:906] CancelBackends() 
> query_id=e44d553b04d47cfb:28f06bb8, tried to cancel 639 backends
> I0401 21:29:59.256251 1830386 coordinator.cc:1092] Release admission control 
> resources for query_id=e44d553b04d47cfb:28f06bb8
> {code}
> Checked the coordinator and threads appear to be spending lots of time 
> waiting on exec_complete_barrier_
> {code}
> #0  0x7fd928c816d5 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x01222944 in impala::Promise::Get() ()
> #2  0x01220d7b in impala::Coordinator::StartBackendExec() ()
> #3  0x01221c87 in impala::Coordinator::Exec() ()
> #4  0x00c3a925 in 
> 

[jira] [Commented] (IMPALA-10092) Some tests in custom_cluster/test_kudu.py do not run even though they are not explicitly disabled.

2020-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186679#comment-17186679
 ] 

ASF subversion and git services commented on IMPALA-10092:
--

Commit 34668fab878c224632710670618b3f0176cbc78d in impala's branch 
refs/heads/master from Fang-Yu Rao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=34668fa ]

IMPALA-10092: Do not skip test vectors of Kudu tests in a custom cluster

We found that the following 4 tests do not run even when we remove all
the decorators like "@SkipIfKudu.no_hybrid_clock" or
"@SkipIfHive3.kudu_hms_notifications_not_supported" that skip them.
This is because the 3 classes these tests belong to inherit from
CustomClusterTestSuite, which adds a constraint that only allows test
vectors with 'file_format' and 'compression_codec' being "text" and
"none", respectively, to be run.

1. TestKuduOperations::test_local_tz_conversion_ops
2. TestKuduClientTimeout::test_impalad_timeout
3. TestKuduHMSIntegration::test_create_managed_kudu_tables
4. TestKuduHMSIntegration::test_kudu_alter_table

To address this issue, this patch creates a parent class for those
3 classes and overrides add_custom_cluster_constraints() in the newly
created parent class so that test vectors with 'file_format' and
'compression_codec' being "kudu" and "none", respectively, are not
skipped.
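A hedged sketch of that override (the class and method names follow the
Impala test framework referenced in this patch, but treat the exact
signatures as assumptions):

{code:python}
# Sketch only: a shared parent that keeps kudu/none test vectors instead of
# the text/none constraint imposed by CustomClusterTestSuite.
from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

class CustomKuduTestSuite(CustomClusterTestSuite):

    @classmethod
    def add_custom_cluster_constraints(cls):
        # Unlike the base class, accept 'kudu' as the file_format.
        cls.ImpalaTestMatrix.add_constraint(lambda v:
            v.get_value('table_format').file_format == 'kudu' and
            v.get_value('table_format').compression_codec == 'none')
{code}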

This patch also removes a redundant call to
super(CustomClusterTestSuite, cls).add_test_dimensions() in
CustomClusterTestSuite.add_custom_cluster_constraints(), since that
method had already been called immediately before the call to
add_custom_cluster_constraints() in
CustomClusterTestSuite.add_test_dimensions().

Testing:
 - Manually verified that after removing the skip decorators, those
   tests now run.

Change-Id: I60a4bd4ac5a9026629fb840ab9cc7b5f9948290c
Reviewed-on: http://gerrit.cloudera.org:8080/16348
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Some tests in custom_cluster/test_kudu.py do not run even though they are 
> not explicitly disabled.
> ---
>
> Key: IMPALA-10092
> URL: https://issues.apache.org/jira/browse/IMPALA-10092
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Fang-Yu Rao
>Assignee: Fang-Yu Rao
>Priority: Minor
>  Labels: kudu, test
>
> We found that the following tests in 
> https://github.com/apache/impala/blob/master/tests/custom_cluster/test_kudu.py
>  do not run even when we remove all the decorators like 
> "{{@SkipIfKudu.no_hybrid_clock}}" or 
> "{{@SkipIfHive3.kudu_hms_notifications_not_supported}}" that skip the tests.
> # {{TestKuduOperations::test_local_tz_conversion_ops}}
> # {{TestKuduClientTimeout::test_impalad_timeout}}
> # {{TestKuduHMSIntegration::test_create_managed_kudu_tables}}
> # {{TestKuduHMSIntegration::test_kudu_alter_table}}
> This may be due to the fact that at 
> https://github.com/apache/impala/blob/master/tests/common/custom_cluster_test_suite.py#L78-L80
>  we add a constraint that only allows '{{table_format}}' to be "{{text/none}}", 
> i.e., '{{file_format}}' has to be "{{text}}" and '{{compression_codec}}' has 
> to be "{{none}}". It can be verified that after removing this constraint and 
> all the decorators, the tests above run with '{{table_format}}' being 
> "{{kudu/none}}", which is added according to 
> https://github.com/apache/impala/blob/master/tests/common/test_dimensions.py#L116-L119.
>  
> In this regard, we may need to override the class method 
> add_test_dimensions() of the classes {{TestKuduOperations}}, 
> {{TestKuduClientTimeout}}, and {{TestKuduHMSIntegration}} so that once those 
> currently disabled Kudu tests are re-enabled, the tests involving this test 
> vector run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-10115) Impala should check file schema as well to detect full ACIDv2 files

2020-08-28 Thread Jira
Zoltán Borók-Nagy created IMPALA-10115:
--

 Summary: Impala should check file schema as well to detect full 
ACIDv2 files
 Key: IMPALA-10115
 URL: https://issues.apache.org/jira/browse/IMPALA-10115
 Project: IMPALA
  Issue Type: Bug
Reporter: Zoltán Borók-Nagy


Currently Impala checks the file metadata key 'hive.acid.version' to decide 
whether a file has the full ACID schema.

There are cases where Hive forgets to set this value for full ACID files, e.g. 
major query-based compactions.

So if 'hive.acid.version' is not present, Impala should still look at the 
schema elements to be sure about the file format.
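A hedged sketch of the proposed fallback (the column names below are the
standard full-ACID ORC layout; the helper itself is illustrative, not Impala's
actual implementation):

{code:python}
# Full-ACID (ACIDv2) ORC files wrap row data in a fixed set of struct fields.
ACID_COLUMNS = {"operation", "originalTransaction", "bucket",
                "rowId", "currentTransaction", "row"}

def looks_like_full_acid(metadata, root_field_names):
    # Trust the writer's marker when present; otherwise fall back to the
    # schema elements themselves.
    if "hive.acid.version" in metadata:
        return True
    return ACID_COLUMNS.issubset(set(root_field_names))
{code}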



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Work started] (IMPALA-10107) Implement HLL functions to have full compatibility with Hive

2020-08-28 Thread Adam Tamas (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-10107 started by Adam Tamas.
---
> Implement HLL functions to have full compatibility with Hive
> 
>
> Key: IMPALA-10107
> URL: https://issues.apache.org/jira/browse/IMPALA-10107
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Gabor Kaszab
>Assignee: Adam Tamas
>Priority: Minor
>
> ds_hll_estimate_bounds
> ds_hll_stringify
> ds_hll_union_f
> For parameters and expected behaviour, check Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9967) Scan orc failed when table contains timestamp column

2020-08-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/IMPALA-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186464#comment-17186464
 ] 

Zoltán Borók-Nagy commented on IMPALA-9967:
---

So the problem is that the writer writes TIMESTAMP_INSTANT, which is "timestamp 
with local time zone".

The C++ ORC library doesn't support this type yet, only TIMESTAMP.

TIMESTAMP_INSTANT is a relatively new addition to ORC; it is not even mentioned 
in the spec currently: [https://orc.apache.org/specification/ORCv2/]

It was added by ORC-189 and is part of the 1.6 release. This means Hive also 
can't read such files, since it currently uses ORC 1.5.10:
{noformat}
0: jdbc:hive2://localhost:11050/default> select * from orc_test;
Error: java.io.IOException: java.lang.RuntimeException: ORC split generation 
failed with exception: java.io.IOException: Type 4 has an unknown kind. 
(state=,code=0){noformat}

> Scan orc failed when table contains timestamp column
> 
>
> Key: IMPALA-9967
> URL: https://issues.apache.org/jira/browse/IMPALA-9967
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: WangSheng
>Priority: Minor
> Attachments: 00031-31-26ff2064-c8f2-467f-ab7e-1949cb30d151-0.orc, 
> 00031-31-334beaba-ef4b-4d13-b338-e715cdf0ef85-0.orc
>
>
> Recently, when I tested an Impala query on an ORC table, I found that scanning 
> failed when the table contains a timestamp column; here is the exception: 
> {code:java}
> I0717 08:31:47.179124 78759 status.cc:129] 68436a6e0883be84:53877f720002] 
> Encountered parse error in tail of ORC file 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc:
>  Unknown type kind
> @  0x1c9f753  impala::Status::Status()
> @  0x27aa049  impala::HdfsOrcScanner::ProcessFileTail()
> @  0x27a7fb3  impala::HdfsOrcScanner::Open()
> @  0x27365fe  
> impala::HdfsScanNodeBase::CreateAndOpenScannerHelper()
> @  0x28cb379  impala::HdfsScanNode::ProcessSplit()
> @  0x28caa7d  impala::HdfsScanNode::ScannerThread()
> @  0x28c9de5  
> _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_18ThreadResourcePoolEENKUlvE_clEv
> @  0x28cc19e  
> _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_18ThreadResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
> @  0x205  boost::function0<>::operator()()
> @  0x2675d93  impala::Thread::SuperviseThread()
> @  0x267dd30  boost::_bi::list5<>::operator()<>()
> @  0x267dc54  boost::_bi::bind_t<>::operator()()
> @  0x267dc15  boost::detail::thread_data<>::run()
> @  0x3e3c3c1  thread_proxy
> @ 0x7f32360336b9  start_thread
> @ 0x7f3232bfe41c  clone
> I0717 08:31:47.325670 78759 hdfs-scan-node.cc:490] 
> 68436a6e0883be84:53877f720002] Error preparing scanner for scan range 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc(0:582).
>  Encountered parse error in tail of ORC file 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc:
>  Unknown type kind
> {code}
> When I remove the timestamp column from the table and regenerate the test 
> data, the query succeeds. By the way, my test data is generated by Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-10107) Implement HLL functions to have full compatibility with Hive

2020-08-28 Thread Adam Tamas (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Tamas reassigned IMPALA-10107:
---

Assignee: Adam Tamas

> Implement HLL functions to have full compatibility with Hive
> 
>
> Key: IMPALA-10107
> URL: https://issues.apache.org/jira/browse/IMPALA-10107
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Gabor Kaszab
>Assignee: Adam Tamas
>Priority: Minor
>
> ds_hll_estimate_bounds
> ds_hll_stringify
> ds_hll_union_f
> For parameters and expected behaviour, check Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Comment Edited] (IMPALA-9967) Scan orc failed when table contains timestamp column

2020-08-28 Thread WangSheng (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186208#comment-17186208
 ] 

WangSheng edited comment on IMPALA-9967 at 8/28/20, 7:17 AM:
-

Hi [~boroknagyz], here is the data file:
{code:java}
create external table orc_test(
id int, user string, action string, event_time timestamp) 
stored as orc 
location 'hdfs://localhost:20500/orc_table_test';
{code}
This file contains a timestamp column; creating an external table over this 
file and running a select throws an exception.
 [^00031-31-26ff2064-c8f2-467f-ab7e-1949cb30d151-0.orc] 


{code:java}
create external table orc_test2(
id int, user string, action string) 
stored as orc 
location 'hdfs://localhost:20500/orc_table_test2';
{code}
This file does not contain a timestamp column; creating an external table over 
this file and running a select succeeds.
 [^00031-31-334beaba-ef4b-4d13-b338-e715cdf0ef85-0.orc] 





> Scan orc failed when table contains timestamp column
> 
>
> Key: IMPALA-9967
> URL: https://issues.apache.org/jira/browse/IMPALA-9967
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: WangSheng
>Priority: Minor
> Attachments: 00031-31-26ff2064-c8f2-467f-ab7e-1949cb30d151-0.orc, 
> 00031-31-334beaba-ef4b-4d13-b338-e715cdf0ef85-0.orc
>
>
> Recently, when I tested an Impala query on an ORC table, I found that scanning 
> failed when the table contains a timestamp column; here is the exception: 
> {code:java}
> I0717 08:31:47.179124 78759 status.cc:129] 68436a6e0883be84:53877f720002] 
> Encountered parse error in tail of ORC file 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc:
>  Unknown type kind
> @  0x1c9f753  impala::Status::Status()
> @  0x27aa049  impala::HdfsOrcScanner::ProcessFileTail()
> @  0x27a7fb3  impala::HdfsOrcScanner::Open()
> @  0x27365fe  
> impala::HdfsScanNodeBase::CreateAndOpenScannerHelper()
> @  0x28cb379  impala::HdfsScanNode::ProcessSplit()
> @  0x28caa7d  impala::HdfsScanNode::ScannerThread()
> @  0x28c9de5  
> _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_18ThreadResourcePoolEENKUlvE_clEv
> @  0x28cc19e  
> _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_18ThreadResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
> @  0x205  boost::function0<>::operator()()
> @  0x2675d93  impala::Thread::SuperviseThread()
> @  0x267dd30  boost::_bi::list5<>::operator()<>()
> @  0x267dc54  boost::_bi::bind_t<>::operator()()
> @  0x267dc15  boost::detail::thread_data<>::run()
> @  0x3e3c3c1  thread_proxy
> @ 0x7f32360336b9  start_thread
> @ 0x7f3232bfe41c  clone
> I0717 08:31:47.325670 78759 hdfs-scan-node.cc:490] 
> 68436a6e0883be84:53877f720002] Error preparing scanner for scan range 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc(0:582).
>  Encountered parse error in tail of ORC file 
> hdfs://localhost:20500/test-warehouse/orc_scanner_test/00031-31-ac3cccf1-3ce7-40c6-933c-4fbd7bd57550-0.orc:
>  Unknown type kind
> {code}
> When I remove the timestamp column from the table and regenerate the test 
> data, the query succeeds. By the way, my test data is generated by Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org