INVALIDATE METADATA behaviour

2017-11-28 Thread Antoni Ivanov
Hi,

I am wondering: if I run INVALIDATE METADATA for the whole database on node1 and then run a query on node2, would the query on node2 use the cached metadata for the tables, or would it know the metadata has been invalidated?

And second, how safe is it to run it for a database with many tables: say 30 tables with over 10,000 partitions each, and 2,000 more tables with under 5,000 partitions (most of them under 100), where each Impala Daemon node has relatively little memory (32G, below the Cloudera recommendation)?

Thanks,
Antoni


Difference between LOAD DATA and REFRESH

2018-01-08 Thread Antoni Ivanov
Hi,

We are wondering if we can reduce the impact of
https://issues.apache.org/jira/browse/IMPALA-5058
Currently we use INSERT statements via Spark, and then run REFRESH on partition x.
We are now thinking of using the LOAD DATA statement directly.

I imagine LOAD DATA doesn't require communicating with the Hive Metastore DB (it only updates the HDFS block locations)?
Thanks,
Antoni


Does Impala support or plan to support late materialization

2018-03-20 Thread Antoni Ivanov
I don't mean partition pruning, but late materialization as described in
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/

It basically fetches the filter columns first, applies the filter, and then fetches the data from the remaining columns only for the rows that pass the filter.
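
To make the idea concrete, here is a toy Python sketch of that access pattern (the table, column names, and data are all made up for illustration; this just shows the scan order, it is not Impala code):

```python
# Toy sketch of late materialization over a columnar "table".
# Instead of materializing whole rows and then filtering, we scan only
# the filter column first, then fetch the remaining columns just for the
# rows that passed the predicate.
table = {
    "id":     [1, 2, 3, 4, 5],
    "amount": [10, 250, 30, 400, 50],
    "note":   ["a", "b", "c", "d", "e"],
}

def late_materialized_scan(table, filter_col, predicate, project_cols):
    # Step 1: read only the filter column, collecting matching positions.
    matching = [i for i, v in enumerate(table[filter_col]) if predicate(v)]
    # Step 2: fetch the projected columns only at those positions.
    return [{c: table[c][i] for c in project_cols} for i in matching]

rows = late_materialized_scan(table, "amount", lambda v: v > 100, ["id", "note"])
print(rows)
```

The saving is that values from the projected columns are never fetched for non-matching rows, which matters most when the filter is selective and the remaining columns are wide.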

Thanks


Query status "Session Closed"

2019-08-05 Thread Antoni Ivanov
Hi,

I am investigating the most common errors we see in our Impala cluster.
The most common is query status = 'Session Closed'.

I can see from the code
(https://github.com/apache/impala/blob/72c9370856d7436885adbee3e8da7e7d9336df15/be/src/service/impala-server.cc#L1435)
that it is set when the session is closed, which happens when the connection is closed (ConnectionEnd), which in turn is called when the Thrift transport is closed. If the query has not already completed or failed in some other way by then, it is marked as 'Session Closed'.

Does this mean that the remote end has simply dropped the connection?
E.g. there has been a network interruption, or someone killed (SIGKILL) the remote process?
We have a (TCP) load balancer (HAProxy) in front, and I am wondering whether, for example, the load balancer's TCP timeout can cause such an error. Or can a client socket timeout cause it?

I'd be grateful for any insights into the semantics of when "Session Closed" is set.
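
For what it's worth, here is a minimal socket-level sketch (not Impala code) of what "the remote end dropped the connection" looks like to a server: recv() returning b''. An idle-timeout in a TCP proxy such as HAProxy would look the same from the server's side, since the transport just closes with no application-level goodbye:

```python
# A tiny server/client pair: the client sends some bytes and then closes
# the connection. The server only learns about it when recv() returns
# b'' -- there is no application-level notification, which is analogous
# to a query being marked "Session Closed" when the transport goes away.
import socket
import threading

received = []           # what the server managed to read before the drop

def server(sock):
    conn, _ = sock.accept()
    chunks = []
    while True:
        data = conn.recv(1024)
        if not data:    # b'' means the peer closed the connection
            break
        chunks.append(data)
    received.append(b"".join(chunks))
    conn.close()
    sock.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))              # any free port
srv.listen(1)
t = threading.Thread(target=server, args=(srv,))
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
cli.sendall(b"half-finished query")
cli.close()             # the client (or a proxy in between) drops the link
t.join()

print("server read:", received[0])      # then it just saw the transport close
```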



Thanks,
Antoni


Parsing the Query execution plan (profile and summary)

2019-08-06 Thread Antoni Ivanov
Hi,

We'd like to parse the query execution plan after queries have completed, for telemetry purposes, to get better visibility into how queries behave. For example, you can see per-node utilization in the query profile.


RE: How to parse a query plan /summary/profile

2019-08-08 Thread Antoni Ivanov
Hi,

We did some research on the topic; the answer we've arrived at so far is:

Impala tracks two sets of information on the coordinator node for each query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy to parse. A Thrift format is theoretically available, but there was a bug, https://issues.apache.org/jira/browse/IMPALA-8252, which is resolved in v3.2.0. So you need version >= 3.2.

After that, the Thrift JSON encoder from Twitter commons may be used:
https://github.com/twitter/commons/blob/06905dc0f1a26440a79ff1164831c85ce2d1bdf0/src/python/twitter/thrift/text/thrift_json_encoder.py

The Thrift profile can be obtained in several ways:
- Downloaded from the coordinator node, e.g. http://coord-node:25000/query_profile_encoded?query_id=442c057197d9c0d:81810ccd (442c057197d9c0d:81810ccd is the query ID)
- Downloaded via the Cloudera REST API (if using Cloudera)
- Or, if using the impyla Python library (https://github.com/cloudera/impyla), fetched after execution:
  cur.execute(sql)
  profile = cur.get_profile(profile_format=TRuntimeProfileFormat.THRIFT)
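
For reference, the payload served by /query_profile_encoded is, as far as I can tell, base64-encoded, zlib-compressed Thrift (worth verifying against your version). Deserializing the resulting bytes into TRuntimeProfileTree additionally needs Impala's generated Thrift classes, so this sketch only demonstrates the outer two layers, on synthetic bytes:

```python
# Sketch of unwrapping an encoded profile: base64 -> zlib -> thrift bytes.
# (Assumption: this matches the /query_profile_encoded format; the final
# thrift-deserialization step is out of scope here.)
import base64
import zlib

def decode_profile(encoded: str) -> bytes:
    """Return the raw Thrift bytes from an encoded profile string."""
    return zlib.decompress(base64.b64decode(encoded))

# No real profile at hand, so round-trip synthetic bytes to show the layers:
fake_thrift = b"pretend-TRuntimeProfileTree-bytes"
downloaded = base64.b64encode(zlib.compress(fake_thrift)).decode("ascii")
assert decode_profile(downloaded) == fake_thrift
```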


Just posting here in case it's helpful to anyone following the user group.

-Antoni

From: Antoni Ivanov
Sent: Wednesday, August 7, 2019 10:13 AM
To: user@impala.apache.org
Cc: dev@impala ; Jenny Kwan (c) 
Subject: How to parse a query plan /summary/profile

Hi,

We'd like to get better visibility into the way our Impala cluster is used.
For example, there's per-node utilization: sometimes fragments on a given node are slower, and this is visible in the profile. And some statistics are available only in the profile (like runtime filters used, or Parquet file-pruning stats).

I think you can download it as Thrift? But is it easily de-serializable (we would need the Thrift schema at least, I think)?
Thanks,
Antoni



RE: Generating a fixed size parquet file when doing Insert select *

2020-03-25 Thread Antoni Ivanov
Hi,

The Impala team can correct me, but:

Even if you set PARQUET_FILE_SIZE to 256MB, Impala may, and likely will, create smaller files (e.g. 128MB or even smaller).
As far as I understand, that's because while Impala is writing a Parquet file it has to guess the eventual on-disk file size, and it cannot accurately account for compression and encoding effectiveness. For example, for tables with many columns we have seen files as small as 32MB even with the Parquet file size set to 256MB.
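
To illustrate why that guess is hard (a toy sketch, not Impala's actual estimator): compression effectiveness depends entirely on the data, so the same in-memory volume can land almost anywhere on disk. This uses zlib; Parquet typically uses Snappy or similar, but the point is the same:

```python
# The same 1 MB of raw data compresses to wildly different sizes
# depending on its content, so a writer that sizes files by buffered raw
# bytes cannot predict the on-disk result accurately.
import os
import zlib

raw_size = 1_000_000  # 1 MB of in-memory data in both cases

repetitive = b"0123456789" * (raw_size // 10)   # highly compressible
random_ish = os.urandom(raw_size)               # essentially incompressible

print(len(zlib.compress(repetitive)))   # a few KB
print(len(zlib.compress(random_ish)))   # close to 1 MB
```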

If you want strict control over the exact file size, I do not think that's possible with Impala currently (I may be wrong, though).

The /*+shuffle*/ hint is indeed more scalable, as it enables all nodes to read the data instead of one, and more than one node to write the data if you update more than one partition. At least that's my understanding.


From: Tim Armstrong 
Sent: Wednesday, March 18, 2020 8:35 AM
To: user@impala.apache.org
Subject: Re: Generating a fixed size parquet file when doing Insert select *

Hi Ravi,
  There's a few details that could help understand the problem better. Is the 
destination table partitioned? How many partitions does it have? If you could 
share some query profiles or at least the explain plans from the different 
queries you're running that would be helpful too.

This is a guess, but the /*+shuffle*/ hint for the insert might solve your 
problem - it forces redistribution of the data based on the partition key, so 
all of the data for each partition will land on the same node. Impala's planner 
tries to be intelligent about whether to redistribute data when inserting into 
partitioned tables, but sometimes the decision won't be right for your needs. 
Here are example plans of the same insert, first without the hint (no shuffle) and then with it.

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall
Explain String
------------------------------------------------------------------------
Max Per-Host Resource Reservation: Memory=12.01MB Threads=2
Per-Host Resource Estimates: Memory=92MB
Codegen disabled by planner

WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]
|  partitions=4
|
01:SORT
|  order by: year ASC NULLS LAST, month ASC NULLS LAST
|  row-size=89B cardinality=100
|
00:SCAN HDFS [functional.alltypessmall]
   HDFS partitions=4/4 files=4 size=6.32KB
   row-size=89B cardinality=100

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall
Explain String
------------------------------------------------------------------------
Max Per-Host Resource Reservation: Memory=12.01MB Threads=3
Per-Host Resource Estimates: Memory=92MB
Codegen ...

Re: Data is being inserted even though an INSERT INTO query fails

2021-11-16 Thread Antoni Ivanov
Hi,

Are insert queries supposed to be atomic?

Thanks,
Antoni

From: Antoni Ivanov 
Reply to: "user@impala.apache.org" 
Date: Friday, 12 November 2021, 12:52
To: "user@impala.apache.org" 
Subject: Data is being inserted even though an INSERT INTO query fails

Hi,

A colleague of mine opened 
https://issues.apache.org/jira/browse/IMPALA-11014

It seems there is a bug in Impala which can cause an INSERT query to populate data even if the query fails. That seems pretty serious, since it violates the atomicity of a single query operation.
Are you aware of this (we tried to find a past Jira issue on a similar problem but could not)? Why could it happen?

Thanks,
Antoni


Data is being inserted even though an INSERT INTO query fails

2021-11-12 Thread Antoni Ivanov
Hi,

A colleague of mine opened https://issues.apache.org/jira/browse/IMPALA-11014

It seems there is a bug in Impala which can cause an INSERT query to populate data even if the query fails. That seems pretty serious, since it violates the atomicity of a single query operation.
Are you aware of this (we tried to find a past Jira issue on a similar problem but could not)? Why could it happen?

Thanks,
Antoni


Re: Data is being inserted even though an INSERT INTO query fails

2021-12-09 Thread Antoni Ivanov
Thanks for the answers so far.

We have asked a few follow-up questions in
https://issues.apache.org/jira/browse/IMPALA-11014
If someone can spare the time to look at them I’d be pretty grateful.

Regards,
Antoni

From: Wenzhe Zhou 
Reply to: "user@impala.apache.org" 
Date: Friday, 19 November 2021, 18:41
To: "user@impala.apache.org" 
Subject: Re: Data is being inserted even though an INSERT INTO query fails

Kudu supports transactions for "insert" and "CTAS" now. But transactions for 
"UPDATE/UPSERT/DELETE" are not done yet.

Wenzhe Zhou
wz...@cloudera.com
408-568-0101


On Fri, Nov 19, 2021 at 6:57 AM Csaba Ringhofer <csringho...@cloudera.com> wrote:
Hi!

I have added a comment to the Jira.
Generally INSERTs in Impala are not really atomic, unless we use Hive ACID 
tables or Iceberg (I am not sure about Kudu).

- Csaba

On Tue, Nov 16, 2021 at 12:36 PM Antoni Ivanov <aiva...@vmware.com> wrote:
Hi,

Are insert queries supposed to be atomic?

Thanks,
Antoni

From: Antoni Ivanov <aiva...@vmware.com>
Reply to: "user@impala.apache.org" <user@impala.apache.org>
Date: Friday, 12 November 2021, 12:52
To: "user@impala.apache.org" <user@impala.apache.org>
Subject: Data is being inserted even though an INSERT INTO query fails

Hi,

A colleague of mine opened 
https://issues.apache.org/jira/browse/IMPALA-11014

It seems there is a bug in Impala which can cause an INSERT query to populate data even if the query fails. That seems pretty serious, since it violates the atomicity of a single query operation.
Are you aware of this (we tried to find a past Jira issue on a similar problem but could not)? Why could it happen?

Thanks,
Antoni


Docker container image of Impala

2022-04-16 Thread Antoni Ivanov
Hi,

We are actively using Impala and we have a lot of tests running against it. We'd like to be able to run those tests against a Docker container; that way they can easily be started locally, run in any environment, and are better isolated and reproducible.

Are there Docker images published for specific versions of Impala that we can use?

Thanks,
Antoni