Re: impala 3.3 and hive metastore

2020-01-21 Thread Jeszy
Hey Cliff,

Thanks.
In general:
- you should have a hive-site.xml pointing to the metastore URIs on
the classpath of all daemons (I'm not sure if it's strictly required
for impalad - maybe others can clear that up - but assuming yes for
now).
- impalad should be configured with the catalog and statestore
services through its startup flags (see the sketch below)
- after this, catalogd will manage HMS interactions and propagate
metadata to the impalads. Local catalog alters the workings of the
impalad-internal catalog cache and the interaction between catalogd,
statestored, and impalad, but catalogd remains the source of truth.
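
As a rough sketch of both points (hostnames and port are placeholders, not
taken from your setup), the hive-site.xml on each daemon's classpath would
carry the metastore URI, and the impalad startup flags would point at the
catalog and statestore services:

  <!-- hive-site.xml -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>

  # impalad startup flags
  -catalog_service_host=catalogd-host
  -state_store_host=statestored-host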

The logs indicate that your impalad doesn't find a hive-site.xml, and the
weird errors could be the result of a misconfiguration of the catalog
and statestore services.

HTH!

On Tue, Jan 14, 2020 at 5:41 PM Cliff Resnick  wrote:
>
> The failed attempt to initialize the metastore was near the very start of the 
> impalad log
> I0114 16:31:24.238524 11902 HiveConf.java:188] Found configuration file null
> I0114 16:31:24.498625 11902 QueryEventHookManager.java:92] QueryEventHook 
> config:
> I0114 16:31:24.498726 11902 QueryEventHookManager.java:93] - 
> query_event_hook_nthreads=1
> I0114 16:31:24.498816 11902 QueryEventHookManager.java:94] - 
> query_event_hook_classes=
> I0114 16:31:24.553463 11902 HiveMetaStoreClient.java:308] HMS client 
> filtering is enabled.
> I0114 16:31:24.618201 11902 HiveMetaStore.java:674] 0: Opening raw store with 
> implementation class:org.apache.hadoop.hive.metastore.ObjectStore
>
> After this came the usual 'metastore tables not found' errors that you get if 
> you have an empty DB.
>
> I tried the flag --use_local_catalog=false, even though it's the default.
>
> Setting up a metastore with a proper hive-site.xml will get impalad running 
> but weird errors ensue on create table, as you would expect. I'm doing a 
> straightforward build and deploy, never seen anything like this before 3.3.
>
>
>
> On Tue, Jan 14, 2020, 5:12 AM Jeszy  wrote:
>>
>> 'Unlike previous versions, in 3.3 impalad would not start unless
>> installed a local hive metastore db on the impalad instance, not just
>> the catalogd instance.' - this sounds weird and is unexpected, and
>> multiple metastores can easily lead to issues further down the line.
>> What was the end of the log when the impalad failed to start?
>>
>> On Tue, Jan 14, 2020 at 1:24 AM Cliff Resnick  wrote:
>> >
>> > I just built Impala 3.3 from source with Kudu 1.11. Unlike previous 
>> > versions, in 3.3 impalad would not start unless installed a local hive 
>> > metastore db on the impalad instance, not just the catalogd instance. I'm 
>> > now getting a strange "IllegalArgumentException:null" error when creating 
>> > kudu tables from  impala, where  the kudu table is created but impala does 
>> > not create metadata. Logs show nothing but I'm guessing this has something 
>> > to do with metastore integration. Can I somehow disable metastore 
>> > integration from impalad?
>> >
>> > x-posting with kudu in case there might be any insight from that side.
>> >
>> > Thanks,
>> > Cliff


Re: impala 3.3 and hive metastore

2020-01-14 Thread Jeszy
'Unlike previous versions, in 3.3 impalad would not start unless
installed a local hive metastore db on the impalad instance, not just
the catalogd instance.' - this sounds weird and is unexpected, and
multiple metastores can easily lead to issues further down the line.
What was the end of the log when the impalad failed to start?

On Tue, Jan 14, 2020 at 1:24 AM Cliff Resnick  wrote:
>
> I just built Impala 3.3 from source with Kudu 1.11. Unlike previous versions, 
> in 3.3 impalad would not start unless installed a local hive metastore db on 
> the impalad instance, not just the catalogd instance. I'm now getting a 
> strange "IllegalArgumentException:null" error when creating kudu tables from  
> impala, where  the kudu table is created but impala does not create metadata. 
> Logs show nothing but I'm guessing this has something to do with metastore 
> integration. Can I somehow disable metastore integration from impalad?
>
> x-posting with kudu in case there might be any insight from that side.
>
> Thanks,
> Cliff


Re: Impala and Tableau Connectivity error

2020-01-08 Thread Jeszy
Looks like you're using Cloudera's ODBC driver, which is not part of
Impala's codebase, so Cloudera support could provide assistance.
(Also, the logs attached aren't helpful - the driver logs would be
more meaningful in this instance.)

HTH

On Wed, Jan 8, 2020 at 12:43 PM Mallika Singhi
 wrote:
>
> Hello Team,
>
> Have implemented Impala connectivity with Tableau and for some reason some 
> dashboards are failing with error as :
> com.tableausoftware.nativeapi.exceptions.DataSourceException: 
> [Cloudera][ImpalaODBC] (120) Error while retrieving data from in Impala: 
> [08S01] : ImpalaThriftAPICallFailed
>
> Dashboard is getting refreshed on Tableau Desktop but giving above error on 
> Tableau Server.
>
> Impala version -ClouderaImpalaODBC 2.6.2.1002-1.x86_64
> Tableau Server - 2018.3 and Centos OS
>
> Impala Log and Tableau Log file as attached. Please let me know how can this 
> be resolved and fixed.
>
> We can discuss over call as well if needed any inputs from my end
>
>
> Thanks,
> Mallika Singhi
> 9920028229
> BI Team


Re: Impala and kudu without HDFS

2019-09-09 Thread Jeszy
Hey,

Impala right now needs HDFS out of the box. It would probably be a lot of
work to remove that dependency.
Kudu doesn't have an HDFS dependency; you could use Spark standalone or
Kudu's API to query it.

HTH

On 2019. Sep 5., Thu at 6:31, Dinushka  wrote:

> Hi..
> I'm trying to only using Impala and Kudu without HDFS. But i get an error
> saying "Currently configured default filesystem: ProxyLocalFileSystem.
> fs.defaultFS (file:///) is not supported" only goes away when i install and
> start HDFS. can Impala and Kudu work without HDFS?


Re: Impala Error: IllegalStateException: null

2019-08-12 Thread Jeszy
Unfortunately there are many ways to trigger this error depending on
version, etc. - notably, the error reported in the coordinator log may
not include 'IllegalStateException' at all. You can follow the thread
number in the log until the query fails; maybe there's a different
(more helpful) stack trace there.

On Mon, Aug 12, 2019 at 8:26 PM Tim Armstrong  wrote:
>
> It you have a query profile, that's usually helpful to understand what 
> happened. Also the Impala version is helpful to diagnose.
>
> It's likely a bug in the frontend or in the metadata layer but hard to know 
> what it could be without more info.
>
> On Sun, Aug 11, 2019 at 9:01 PM Hendry Suwanda  
> wrote:
>>
>> Hi All,
>>
>> We got below intermittent error when running the query:
>>
>>> IllegalStateException: null
>>
>>
>> the error not show in the log file, we got the error on the client side 
>> (impala-shell or impyla)
>>
>> after trying to re-run the query several times, the error is gone.
>>
>> when i check on the log, actually the query has been success to analyze
>>
>>> I0808 12:36:11.082648 26229 Frontend.java:1245] 
>>> 6d4a7a8d9810b706:84c74718] Analyzing query:
>>>  query statement 
>>> I0808 12:36:11.089841 26229 Frontend.java:1285] 
>>> 6d4a7a8d9810b706:84c74718] Analysis finished.
>>> W0808 12:36:11.098770 26229 PlanNode.java:656] 
>>> 6d4a7a8d9810b706:84c74718] overflow when adding longs: 
>>> 6769770590837407744, 6769770590837407744
>>> W0808 12:36:11.101025 26229 PlanNode.java:656] 
>>> 6d4a7a8d9810b706:84c74718] overflow when adding longs: 
>>> 6769770590837407744, 6769770590837407744
>>
>>
>> i have check on all log (executor, coordinator & catalog) and not found the 
>> error related to that. On that day, i just found many log like below
>>
>>> Cancelling fragment instances as directed by the coordinator. Returned 
>>> status: Cancelled
>>
>>
>> do you have suggestion, what else need to check?
>>
>> --
>> Regards,
>>
>>
>> Hendry Suwanda
>>
>> Github: https://github.com/hendrysuwanda
>> Blog: http://hendrysuwanda.github.io/


Re: can't submit query after enable admission control

2019-05-13 Thread Jeszy
Hey,

The most likely causes are:
- your pool was misconfigured and has 0 MB of RAM available
- other running queries use up all the resources allocated to the pool
- hitting IMPALA-8469, if you have a coordinator-only impalad on a
recent release. You can work around that one by marking the impalad as
both executor and coordinator.

FWIW, it's usually more scalable to set the daemon's mem_limit as a
specific value instead of a percentage. Also, admission control
configuration is a bit more nuanced than the startup flags you posted
- see https://impala.apache.org/docs/build3x/html/topics/impala_admission.html
for examples.
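
As a sketch only (the memory value is made up; size it to your hardware),
the relevant startup flags would look like:

  -mem_limit=48g        # fixed limit instead of -mem_limit=70%
  -is_coordinator=true  # together with -is_executor=true, works around IMPALA-8469
  -is_executor=true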

HTH

On Tue, May 14, 2019 at 6:20 AM Hendry Suwanda  wrote:
>
> Hi All,
> I got below error after enable Impala admission control
>
>> ERROR: Rejected query from pool default-pool: request memory needed 20.02 MB 
>> per node is greater than memory available for admission 0 of 
>> centos-kudu-impala:22000.
>>
>> Use the MEM_LIMIT query option to indicate how much memory is required per 
>> node.
>
>
> below is my config:
> IMPALA_KUDU_MASTERS=
> IMPALA_CATALOG_SERVICE_HOST=
> IMPALA_STATE_STORE_HOST=
> IMPALA_STATE_STORE_PORT=24000
> IMPALA_BACKEND_PORT=22000
> IMPALA_LOG_DIR=/var/log/impala
>
> IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} "
> IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} 
> -state_store_port=${IMPALA_STATE_STORE_PORT}"
> IMPALA_SERVER_ARGS=" \
> -is_coordinator=false \
> -is_executor=true \
> -scratch_dirs=/tmp/impala/scratch \
> -mem_limit=70% \
> --queue_wait_timeout_ms=18 \
> --default_pool_max_requests=2 \
> --default_pool_max_queued=2 \
> --default_pool_mem_limit=3g \
> -log_dir=${IMPALA_LOG_DIR} \
> -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
> -state_store_port=${IMPALA_STATE_STORE_PORT} \
> -state_store_host=${IMPALA_STATE_STORE_HOST} \
> -be_port=${IMPALA_BACKEND_PORT} \
> -kudu_master_hosts=${IMPALA_KUDU_MASTERS}"
>
> could you suggest me, what i miss?
>
> --
> Regards,
>
>
> Hendry Suwanda
>
> Github: https://github.com/hendrysuwanda
> Blog: http://hendrysuwanda.github.io/


Re: Impala groups from LDAP

2019-03-03 Thread Jeszy
Yeah that helps. Impala relies on Hadoop's UserGroupInformation class
for user:group mappings. You can configure Impala the same way you
would configure any other HDFS client (see
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/GroupsMapping.html),
through core-site.xml. Each daemon's web UI /hadoop_varz page lists
the currently picked-up values.
Most setups I've seen use a service such as SSSD or Centrify instead
of hitting the LDAP server directly. That way the default user:group
mapping (ShellBasedUnixGroupsMapping) works well.
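
For reference, a minimal core-site.xml sketch for the default shell-based
mapping would be:

  <property>
    <name>hadoop.security.group.mapping</name>
    <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
  </property>

Swapping the value for org.apache.hadoop.security.LdapGroupsMapping (plus
the hadoop.security.group.mapping.ldap.* properties described on the page
above) makes Hadoop query the LDAP server directly instead.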

HTH

On Fri, 1 Mar 2019 at 16:42, Grzegorz Solecki  wrote:
>
> Jeszy,
>
> I appreciate your response and thank you for sharing the link.
> Unfortunately, the link you sent tells about how to configure the 
> transformation of the username before Impala sends it to the LDAP server for 
> authentication against a user list e.g. DN 'ou=People,dc=localnet,dc=com'
> But my question goes a step further, i.e. if the user has already been 
> authenticated, the Impala needs to know groups the user is a member of.
> That means Impala needs to know the location of a group list  e.g. DN 
> 'ou=Group,dc=localnet,dc=com' or 'ou=UserGroups,dc=localnet,dc=com' or 
> something like that.
> So where is the place to configure groups DN in Impala.
> I hope makes my question clear.
>
>
> On Thu, Feb 28, 2019, 18:18 Jeszy  wrote:
>>
>> Hello,
>>
>> Does this link help?
>> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_ldap.html#ldap_bind_strings
>> Specifically, ldap_bind_pattern.
>>
>> Jeszy
>>
>> On Thu, 28 Feb 2019 at 22:57, Grzegorz Solecki  wrote:
>> >
>> > I am setting up Impala with LDAP but I do not see ability to configure 
>> > Groups CN.
>> > Could you please let me know how impala with LDAP know groups for a 
>> > particular user?
>> > Thanks in advance.


Re: Impala groups from LDAP

2019-02-28 Thread Jeszy
Hello,

Does this link help?
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_ldap.html#ldap_bind_strings
Specifically, ldap_bind_pattern.

Jeszy

On Thu, 28 Feb 2019 at 22:57, Grzegorz Solecki  wrote:
>
> I am setting up Impala with LDAP but I do not see ability to configure Groups 
> CN.
> Could you please let me know how impala with LDAP know groups for a 
> particular user?
> Thanks in advance.


Re: recover partitions vs refresh

2019-02-06 Thread Jeszy
Worth a try, although you'd definitely gain the most by compacting
these files before doing a refresh.

On Wed, 6 Feb 2019 at 09:56, Fawze Abujaber  wrote:
>
> In my case the partition has alot of small files and each 2 hours there is 
> more than 20,000 files added, today running recover partition on each table 
> exceeding 3 minutes, for that i'm trying to think if using refresh with 
> partitions can reduce this running time
>
> On Wed, Feb 6, 2019 at 10:51 AM Jeszy  wrote:
>>
>> Recover will recognize the newly added files in a newly added
>> partition. It doesn't touch already existing partitions.
>> How much you gain by using recover depends on the amount of files and
>> partitions, in the vast majority of cases I've seen it's not worth the
>> added complexity of having to use two commands instead of one.
>> Per-partition refresh is usually good enough.
>>
>> On Wed, 6 Feb 2019 at 09:39, Fawze Abujaber  wrote:
>> >
>> > Thanks Jezy for your quick response,
>> >
>> > Is it means the best that i need to run alter recover partitions once a 
>> > day and all the others in the same day to run refresh?
>> >
>> > Does both provide the same result? according to the documntation the 
>> > recover will recognize the newly added files under the partition.
>> >
>> >
>> > On Wed, Feb 6, 2019 at 10:35 AM Jeszy  wrote:
>> >>
>> >> Hey Fawze,
>> >>
>> >> RECOVER PARTITIONS is cheaper to execute, but it works only once for
>> >> each new partition. If you keep adding files to existing partitions,
>> >> per-partition REFRESH is the best bet.
>> >>
>> >> HTH
>> >>
>> >> On Wed, 6 Feb 2019 at 09:27, Fawze Abujaber  wrote:
>> >> >
>> >> > Hi Community,
>> >> >
>> >> > I'm all the time working to enhance our impala usage and resource 
>> >> > consumption, and here i would like to think which to use between alter 
>> >> > table recover partitions and refresh statement, in terms of running 
>> >> > time and resources, specially that refresh can be run on specific 
>> >> > partitions, i have spark job that adding files at the HDFS partitioned 
>> >> > by year,month and day.
>> >> >
>> >> > To automatically detect new partition directories added through Hive or 
>> >> > HDFS operations:
>> >> >
>> >> > In CDH 5.5 / Impala 2.3 and higher, the RECOVER PARTITIONS clause scans 
>> >> > a partitioned table to detect if any new partition directories were 
>> >> > added outside of Impala, such as by Hive ALTER TABLE statements or by 
>> >> > hdfs dfs or hadoop fs commands. The RECOVER PARTITIONS clause 
>> >> > automatically recognizes any data files present in these new 
>> >> > directories, the same as the REFRESH statement does.
>> >> >
>> >> >
>> >> > --
>> >> > Take Care
>> >> > Fawze Abujaber
>> >
>> >
>> >
>> > --
>> > Take Care
>> > Fawze Abujaber
>
>
>
> --
> Take Care
> Fawze Abujaber


Re: recover partitions vs refresh

2019-02-06 Thread Jeszy
Recover will recognize the newly added files in a newly added
partition. It doesn't touch already existing partitions.
How much you gain by using recover depends on the amount of files and
partitions, in the vast majority of cases I've seen it's not worth the
added complexity of having to use two commands instead of one.
Per-partition refresh is usually good enough.

On Wed, 6 Feb 2019 at 09:39, Fawze Abujaber  wrote:
>
> Thanks Jezy for your quick response,
>
> Is it means the best that i need to run alter recover partitions once a day 
> and all the others in the same day to run refresh?
>
> Does both provide the same result? according to the documntation the recover 
> will recognize the newly added files under the partition.
>
>
> On Wed, Feb 6, 2019 at 10:35 AM Jeszy  wrote:
>>
>> Hey Fawze,
>>
>> RECOVER PARTITIONS is cheaper to execute, but it works only once for
>> each new partition. If you keep adding files to existing partitions,
>> per-partition REFRESH is the best bet.
>>
>> HTH
>>
>> On Wed, 6 Feb 2019 at 09:27, Fawze Abujaber  wrote:
>> >
>> > Hi Community,
>> >
>> > I'm all the time working to enhance our impala usage and resource 
>> > consumption, and here i would like to think which to use between alter 
>> > table recover partitions and refresh statement, in terms of running time 
>> > and resources, specially that refresh can be run on specific partitions, i 
>> > have spark job that adding files at the HDFS partitioned by year,month and 
>> > day.
>> >
>> > To automatically detect new partition directories added through Hive or 
>> > HDFS operations:
>> >
>> > In CDH 5.5 / Impala 2.3 and higher, the RECOVER PARTITIONS clause scans a 
>> > partitioned table to detect if any new partition directories were added 
>> > outside of Impala, such as by Hive ALTER TABLE statements or by hdfs dfs 
>> > or hadoop fs commands. The RECOVER PARTITIONS clause automatically 
>> > recognizes any data files present in these new directories, the same as 
>> > the REFRESH statement does.
>> >
>> >
>> > --
>> > Take Care
>> > Fawze Abujaber
>
>
>
> --
> Take Care
> Fawze Abujaber


Re: recover partitions vs refresh

2019-02-06 Thread Jeszy
Hey Fawze,

RECOVER PARTITIONS is cheaper to execute, but it works only once for
each new partition. If you keep adding files to existing partitions,
per-partition REFRESH is the best bet.
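
To make the difference concrete (table name and partition values are just
placeholders):

  -- picks up partition directories created outside Impala, once per new partition
  ALTER TABLE events RECOVER PARTITIONS;

  -- picks up new files added to an already existing partition
  REFRESH events PARTITION (year=2019, month=2, day=6);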

HTH

On Wed, 6 Feb 2019 at 09:27, Fawze Abujaber  wrote:
>
> Hi Community,
>
> I'm all the time working to enhance our impala usage and resource 
> consumption, and here i would like to think which to use between alter table 
> recover partitions and refresh statement, in terms of running time and 
> resources, specially that refresh can be run on specific partitions, i have 
> spark job that adding files at the HDFS partitioned by year,month and day.
>
> To automatically detect new partition directories added through Hive or HDFS 
> operations:
>
> In CDH 5.5 / Impala 2.3 and higher, the RECOVER PARTITIONS clause scans a 
> partitioned table to detect if any new partition directories were added 
> outside of Impala, such as by Hive ALTER TABLE statements or by hdfs dfs or 
> hadoop fs commands. The RECOVER PARTITIONS clause automatically recognizes 
> any data files present in these new directories, the same as the REFRESH 
> statement does.
>
>
> --
> Take Care
> Fawze Abujaber


Re: Killing long running queries

2019-01-24 Thread Jeszy
Hey Quanlong,

Have you tried setting it in the pool-level default query options?
I expect that to work seamlessly.
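
A sketch of what that could look like, assuming the llama-site.xml property
naming from the admission control docs (pool names and limits are made up):

  <property>
    <name>impala.admission-control.pool-default-query-options.root.etl</name>
    <value>EXEC_TIME_LIMIT_S=1800</value>
  </property>
  <property>
    <name>impala.admission-control.pool-default-query-options.root.adhoc</name>
    <value>EXEC_TIME_LIMIT_S=60</value>
  </property>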

On Thu, 24 Jan 2019 at 08:43, Quanlong Huang  wrote:
>
> Yes, we have the same pain point too :)
>
> On Thu, Jan 24, 2019 at 10:32 PM Boris Tyukin  wrote:
>>
>> I see, I was hoping admission control would be more robust as we are looking 
>> for the same control as you are. We have users, who are  not very 
>> technical and have a talent to bring systems down :)
>>
>> On Thu, Jan 24, 2019 at 9:11 AM Quanlong Huang  
>> wrote:
>>>
>>> Not YARN. I mean admission control actually. Resource pool is a term of 
>>> admission control.
>>>
>>> On Thu, Jan 24, 2019 at 9:56 PM Boris Tyukin  wrote:

 good timing guys as we are looking for a good solution as well. Quanlong, 
 when you say resource pool, do you mean YARN? HAve you looked into Impala 
 admission control feature instead?

 On Thu, Jan 24, 2019 at 7:18 AM Quanlong Huang  
 wrote:
>
> It's quite a useful option!
>
> Looks like it cannot be set in per resource pool level. We have the use 
> case that queries from different resource pools need different timeout 
> limits. For example, some systems leverage Impala to build pre-aggregate 
> tables or other light-weight ETL jobs. EXEC_TIME_LIMIT_S of this pool may 
> be set as 30 minutes, while EXEC_TIME_LIMIT_S of adhoc query pool may be 
> set as 1 minutes.
> Hopes this can be supported too. I just created IMPALA-8107 for this.
>
> Thanks,
> Quanlong
>
> On Thu, Jan 24, 2019 at 2:59 PM Tim Armstrong  
> wrote:
>>
>> Yeah we got a lot of feedback asking for a solution to this :)
>>
>> On Wed, Jan 23, 2019 at 12:54 PM Fawze Abujaber  
>> wrote:
>>>
>>> Amazing,
>>>
>>> Just on time because i'm planning to upgrade our clusters next week to 
>>> CDH 5.16.1 which includes impala 2.12
>>>
>>> On Wed, Jan 23, 2019 at 7:39 PM Tim Armstrong  
>>> wrote:

 EXEC_TIME_LIMIT_S is what you want: 
 https://impala.apache.org/docs/build/html/topics/impala_exec_time_limit_s.html

 Those other configurations are based on idle time, but 
 EXEC_TIME_LIMIT_S is based on the time spent executing.

 On Tue, Jan 22, 2019 at 6:36 PM Fawze Abujaber  
 wrote:
>
> Hi all,
>
> How i can kill a query that running beyond specific time even if it 
> really doing calculations?
>
> I'm aware of the 3 timeout configuration that can be used by like 
> idle sessio timeout,idle,query timeout and query timeout S.
>
> My goal to kill anything that running/idle beyond 20 minutes because 
> for sure there is something to enhance in the query, i'm using the 3 
> parameters and i see queries running for few hours and i want some 
> config that can kill such queries.
>
> --
> Take Care
> Fawze Abujaber
>>>
>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber


Re: catalogd and UserGroupInformation.getCurrentUser();

2019-01-02 Thread Jeszy
Hey,

If I understand your question correctly, this is a limitation. IMPALA-2177
looks to be the appropriate jira.
Most users use Impala together with Sentry, where the recommended
approach is to disable impersonation (even in services that allow it,
like Hive).

HTH

On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada  wrote:
>
> Hi,
>
> Can you add the stack trace here if possible? It is not super clear where 
> exactly the problem is.
>
> Thanks,
> Bharath
>
> On Tue, Jan 1, 2019 at 6:34 PM mhd wrk  wrote:
>>
>> we have our own implementation of Hadoop FileSystem which relies on current 
>> user in a kerberosied environment to locate user specific files in HDFS.  
>> This custom file system works fine inside hive to create external tables and 
>> query them. However trying to access the same tables via Impala (jdbc 
>> driver) fails. Watching the log messages seems that when impalad sends 
>> requests to catalogd to get meta data of a given table the current user 
>> returned by  UserGroupInformation is the service account running the server 
>> (impala/hostn...@example.com) instead of the currently connected user.
>>
>> Is this a known issue or limitation of Impala?


Re: Hanging impala query

2018-11-26 Thread Jeszy
Hello Fawze,

Impala considers a query 'FINISHED' when it is ready to return rows. The
query only closes when the client closes it or when it times out; until
then, the profile shows the query as still running (no end time, as you
mentioned). In your version, the query will hold on to resources as long
as the query is open (even if all rows have been fetched). This is
tracked as IMPALA-1575, and gets much better in Impala 2.11.

HTH

On Fri, 23 Nov 2018 at 23:29, Fawze Abujaber  wrote:

> Hi Community,
>
> I'm looking on issue with CM and Impala where i see in impala UI the query
> still executing and in the query state it shown as FINISHED
>
> Looking at the query profile ( attached here), i see the query has no
> finished time and from the other side i see the first rows fecthed and all
> rows available.
>
> What i'm intersting in here to know if such query still hanging some
> impala resources or not, and why my configuration didn't help here and the
> query wasn't killed.
>
> BTW:
>
> I have the configuration of:
>
> Impala Command Line Argument Advanced Configuration Snippet (Safety Valve)
> Impala (Service-Wide)
>
> idle_session_timeout=600
> idle_query_timeout=600
>
>
>
> [image: image.png]
>
> Part of the query profile:
>
> Query Timeline
>   Query submitted: 119.16us (119160)
>   Planning finished: 482ms (482689616)
>   Ready to start on 7 backends: 561ms (561747089)
>   All 7 execution backends (8 fragment instances) started: 581ms
> (581185065)
> *  Rows available: 34.73s (34732649614)*
> *  First row fetched: 36.81s (36806528086)*
> --
> Take Care
> Fawze Abujaber
>


Re: Long running impala query

2018-10-01 Thread Jeszy
Yes, First row fetched was the key, along with 'ClientFetchWaitTimer'.
Unfortunately it's a bit more complicated than that. Impala won't release
the resources after first row is fetched since it doesn't buffer results
internally - meaning if you tear down the query, you won't be able to serve
future client fetches. The parameter 'idle_query_timeout' is used to
control how long Impala should wait for the client to fetch before moving
the query to the 'cancelled' state.
This gets a bit more shaky in your version. The 'cancelled' state is
different from the 'closed' state, and in 2.10 Impala will hold on to some
of the backend resources even when cancelled, only releasing everything
when the query is actually 'closed'. If you want to force query closure,
and not just cancellation, the setting is idle_session_timeout - closing
the session will close all associated queries. This option works well if
the users submit a single query at a time from a single session (which is
not always the case) - otherwise, even though query A has been idle for a
long time, actions on query B in the same session can prevent query A from
closing (though it will be cancelled, if idle_query_timeout is set).

The good news is that all of this is much simpler in versions having
IMPALA-1575, which is 2.11+. In those versions, cancelling the query will
release all backend resources and just keep some state on the coordinator
so the client has a chance to know what's going on. This way the
'idle_query_timeout' option works like a charm. If you're having trouble
with this, I'd strongly recommend upgrading to 2.11+.
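
As a sketch, the two startup flags together (values are just examples):

  -idle_query_timeout=1200    # cancel queries idle for 20 minutes; on 2.11+ this releases backend resources
  -idle_session_timeout=3600  # close idle sessions, which also closes their queries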

Cheers

On 1 October 2018 at 09:10, Fawze Abujaber  wrote:

> Hi Jezy,
>
> S appreciating your response, i leaned something new :)
>
> Quick questions? does this mean that impala realised the resources after 2
> seconds or it will onhold the resources for the 18 minutes, indeed yes we
> are running this query through MSTR, i will look for a solution to this.
>
> And i understand that you looked at First row fetched: 1.79s (1793452923)
> which telling us it finished in less than 2 seconds, right?
>
>
>
> On Mon, Oct 1, 2018 at 10:01 AM Jeszy  wrote:
>
>> Hey Fawze,
>>
>> Your profile has:
>>
>> Query Timeline
>>   Query submitted: 115.23us (115234)
>>   Planning finished: 11ms (11653012)
>>   Ready to start on 51 backends: 22ms (22019435)
>>   All 51 execution backends (163 fragment instances) started: 62ms 
>> (62790505)
>>   First dynamic filter received: 428ms (428128089)
>>   Rows available: 1.79s (1793346081)
>>   First row fetched: 1.79s (1793452923)
>>   Unregister query: 18.7m (1124881163639)
>>   ImpalaServer
>> - ClientFetchWaitTimer: 18.5m (1108994435686)
>> - InactiveTotalTime: 0ns (0)
>> - RowMaterializationTimer: 14.03s (14027078174)
>>
>> This means that Impala was ready to return rows less than 2 seconds after
>> the query was submitted, but whoever submitted the query didn't want to
>> look at the results for some time. This often happens with tools like Hue,
>> which page through the results on demand versus fetching them all at once.
>>
>> HTH
>>
>>
>> On 29 September 2018 at 14:04, Fawze Abujaber  wrote:
>>
>>> Hi Community.
>>>
>>> I'm investigating one of my impala queries that running for 18 minutes
>>> at specific time and when i run it as ad hoc at different times it only
>>> runs for few seconds.
>>>
>>>  Looked at the cluser metrics,the cluster resources and impala metrics
>>> and don't see anything our of the reqular load.
>>>
>>>  In the query profile i see that one of the exchange steps of the data
>>> between the nodes took most of the query time,
>>>
>>>  When i run the query as ad hoc now the query is taking few seconds to
>>> finish, so i'm intersting to understand what can cause such issue
>>> specifically on the cluster as general that may cause this.
>>>
>>>  I'm thinking to change the query schedule time but i prefer to
>>> understand the root cause so i can avoid this in the next times.
>>>
>>>  Below is the query metrics and attached the query profile.
>>>
>>>  Thanks in advance.
>>>
>>> Duration: 18.7m
>>> Rows Produced: 2822780
>>> Aggregate Peak Memory Usage: 683.4 MiB
>>> Bytes Streamed: 1 GiB
>>> HDFS Average Scan Range: 752 KiB
>>> HDFS Bytes Read: 1 GiB
>>> HDFS Bytes Read From Cache: 0 B
>>> Threads: CPU Time: 89.44s
>>> Threads: Network Receive Wait Time: 11h
>>> Threads: Network Receive Wait Time Percentage: 39
>>> Threads: Network Send Wait Time:
>>> The sum of the time spent waiting to send data over the network by all
>>> threads of the query. Called 'thread_network_send_wait_time' in
>>> searches.
>>>  16.9h
>>>  Threads: Network Send Wait Time Percentage: 60
>>>  Threads: Storage Wait Time: 4.3m
>>>  Threads: Storage Wait Time Percentage: 0
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>>
>>
>>
>
> --
> Take Care
> Fawze Abujaber
>


Re: Long running impala query

2018-10-01 Thread Jeszy
Hey Fawze,

Your profile has:

Query Timeline
  Query submitted: 115.23us (115234)
  Planning finished: 11ms (11653012)
  Ready to start on 51 backends: 22ms (22019435)
  All 51 execution backends (163 fragment instances) started: 62ms
(62790505)
  First dynamic filter received: 428ms (428128089)
  Rows available: 1.79s (1793346081)
  First row fetched: 1.79s (1793452923)
  Unregister query: 18.7m (1124881163639)
  ImpalaServer
- ClientFetchWaitTimer: 18.5m (1108994435686)
- InactiveTotalTime: 0ns (0)
- RowMaterializationTimer: 14.03s (14027078174)

This means that Impala was ready to return rows less than 2 seconds after
the query was submitted, but whoever submitted the query didn't want to
look at the results for some time. This often happens with tools like Hue,
which page through the results on demand versus fetching them all at once.

HTH


On 29 September 2018 at 14:04, Fawze Abujaber  wrote:

> Hi Community.
>
> I'm investigating one of my impala queries that running for 18 minutes at
> specific time and when i run it as ad hoc at different times it only runs
> for few seconds.
>
>  Looked at the cluser metrics,the cluster resources and impala metrics and
> don't see anything our of the reqular load.
>
>  In the query profile i see that one of the exchange steps of the data
> between the nodes took most of the query time,
>
>  When i run the query as ad hoc now the query is taking few seconds to
> finish, so i'm intersting to understand what can cause such issue
> specifically on the cluster as general that may cause this.
>
>  I'm thinking to change the query schedule time but i prefer to understand
> the root cause so i can avoid this in the next times.
>
>  Below is the query metrics and attached the query profile.
>
>  Thanks in advance.
>
> Duration: 18.7m
> Rows Produced: 2822780
> Aggregate Peak Memory Usage: 683.4 MiB
> Bytes Streamed: 1 GiB
> HDFS Average Scan Range: 752 KiB
> HDFS Bytes Read: 1 GiB
> HDFS Bytes Read From Cache: 0 B
> Threads: CPU Time: 89.44s
> Threads: Network Receive Wait Time: 11h
> Threads: Network Receive Wait Time Percentage: 39
> Threads: Network Send Wait Time:
> The sum of the time spent waiting to send data over the network by all
> threads of the query. Called 'thread_network_send_wait_time' in searches.
>  16.9h
>  Threads: Network Send Wait Time Percentage: 60
>  Threads: Storage Wait Time: 4.3m
>  Threads: Storage Wait Time Percentage: 0
>
>
>
>
>
>
> --
> Take Care
> Fawze Abujaber
>


Re: Avoiding running Recover partitions very frequent

2018-08-13 Thread Jeszy
I'd try to trace the update through catalog and statestore. SYNC_DDL=1
can be a problem, especially if there's a slow impalad or a lot of
concurrent catalog updates (a lot of data to stream from the statestore
node). The namenode can also become a bottleneck. The catalog logs will
help point these out.
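
One quick way to check whether SYNC_DDL is the expensive part (table name is
a placeholder) is to run the same statement with it turned off for one session
and compare timings:

  set SYNC_DDL=0;
  alter table events_x recover partitions;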

On 13 August 2018 at 14:30, Fawze Abujaber  wrote:
> Thanks Jezy for your quick response, We are far away from moving to Kudu.
>
> Trying to figure out what can cause Recover partition to run for a long time
> on some of the events.
>
> ===
> Query (id=9a4ed4eabe44c9e5:3f0cde63)
>   Summary
> Session ID: 4c40102c98913f44:780678e7979e929b
> Session Type: BEESWAX
> Start Time: 2018-08-13 08:00:02.757409000
> End Time: 2018-08-13 08:09:15.258627000
> Query Type: DDL
> Query State: FINISHED
> Query Status: OK
> Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
> 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> User: 
> Connected User: 
> Delegated User:
> Network Address: :::172.16.136.1:48037
> Default Db: default
> Sql Statement: alter table  recover partitions
> Coordinator: :22000
> Query Options (set by configuration): SYNC_DDL=1
> Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> DDL Type: ALTER_TABLE
>
> Query Timeline
>   Query submitted: 316.03us (316031)
>   Planning finished: 5.65s (5649375629)
>   Request finished: 9.2m (552495528168)
>   Unregister query: 9.2m (552500950559)
>   ImpalaServer
> - CatalogOpExecTimer: 10.73s (10730796494)
> - ClientFetchWaitTimer: 5ms (5411560)
> - InactiveTotalTime: 0ns (0)
> - RowMaterializationTimer: 0ns (0)
> - TotalTime: 0ns (0)
>
> ==
>
> Query (id=ae4266aad3cea1ed:754c9c34)
>   Summary
> Session ID: 24401399943bebf8:96c267d02619e7ac
> Session Type: BEESWAX
> Start Time: 2018-08-13 08:00:10.625885000
> End Time: 2018-08-13 08:09:15.194417000
> Query Type: DDL
> Query State: FINISHED
> Query Status: OK
> Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
> 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> User: 
> Connected User: 
> Delegated User:
> Network Address: :::172.16.136.1:48044
> Default Db: default
> Sql Statement: alter table  recover partitions
> Coordinator: :22000
> Query Options (set by configuration): SYNC_DDL=1
> Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> DDL Type: ALTER_TABLE
>
> Query Timeline
>   Query submitted: 502.36us (502357)
>   Planning finished: 1ms (1077718)
>   Request finished: 9.1m (544563396235)
>   Unregister query: 9.1m (544568289284)
>   ImpalaServer
> - CatalogOpExecTimer: 8.5m (511375736191)
> - ClientFetchWaitTimer: 4ms (4882019)
> - InactiveTotalTime: 0ns (0)
> - RowMaterializationTimer: 0ns (0)
> - TotalTime: 0ns (0)
>
>
>
>
> On Mon, Aug 13, 2018 at 12:55 PM Jeszy  wrote:
>>
>> Hey Fawze,
>>
>> Hm.
>> Just to make sure I got this right: you have 100 tables, each
>> partitioned by y/m/d, and you're updating a single partition of all
>> 100 tables every 20 minutes via a Spark job. Is that correct? I can't
>> think of a way to optimize your current setup for statement count
>> specifically (no way to refresh 100 tables in less than 100
>> statements).
>> However, it sounds like you would benefit from using Kudu in this
>> case. With Kudu, you don't need to REFRESH / RECOVER to pick up new
>> data, it becomes available immediately after ingestion. You could
>> create a landing table in Kudu, then migrate data to HDFS daily (or
>> so), and query a view UNIONing these two tables. With the daily
>> Kudu->HDFS move, you also remove the need for compaction on the HDFS
>> side.
>>
>> HTH
>> Jeszy
>>
>> On 13 August 2018 at 11:08, Fawze Abujaber  wrote:
>> > Hi Community,
>> >
>> > I have a Spark Job that producing parquet files at the HDFS with
>> > partitions
>> > Year, Month  and Day.
>> > The HDFS structure has 100 folders ( 1 event per folder, and these
>> > events
>> > partitioned by Year, month and day).
>> > The job is running each 20 minutes and writes files in the 100 events
>> > folders ( adding one file under the relevant partition for each event).
>> > In top of each event i have an external impala table that i defined
>> > using
>> > impala with pa

Re: Avoiding running Recover partitions very frequent

2018-08-13 Thread Jeszy
Hey Fawze,

Hm.
Just to make sure I got this right: you have 100 tables, each
partitioned by y/m/d, and you're updating a single partition of all
100 tables every 20 minutes via a Spark job. Is that correct? I can't
think of a way to optimize your current setup for statement count
specifically (no way to refresh 100 tables in less than 100
statements).
However, it sounds like you would benefit from using Kudu in this
case. With Kudu, you don't need to REFRESH / RECOVER to pick up new
data, it becomes available immediately after ingestion. You could
create a landing table in Kudu, then migrate data to HDFS daily (or
so), and query a view UNIONing these two tables. With the daily
Kudu->HDFS move, you also remove the need for compaction on the HDFS
side.
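
A rough sketch of that layout (table and column names are made up):

  -- Kudu landing table that ingestion writes into; new rows are queryable immediately
  CREATE TABLE events_landing (
    id BIGINT,
    event_time TIMESTAMP,
    payload STRING,
    PRIMARY KEY (id)
  )
  PARTITION BY HASH (id) PARTITIONS 4
  STORED AS KUDU;

  -- queries go through a view that unions the landing table with the
  -- historical partitioned Parquet table on HDFS (events_hdfs)
  CREATE VIEW events AS
  SELECT id, event_time, payload FROM events_landing
  UNION ALL
  SELECT id, event_time, payload FROM events_hdfs;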

HTH
Jeszy

On 13 August 2018 at 11:08, Fawze Abujaber  wrote:
> Hi Community,
>
> I have a Spark Job that producing parquet files at the HDFS with partitions
> Year, Month  and Day.
> The HDFS structure has 100 folders ( 1 event per folder, and these events
> partitioned by Year, month and day).
> The job is running each 20 minutes and writes files in the 100 events
> folders ( adding one file under the relevant partition for each event).
> In top of each event i have an external impala table that i defined using
> impala with partitions year, month and day.
>
> Is there away to avoid running ALTER TABLE  Recover partitions on the
> 100 tables in each 20 minutes? ( The Recover statement running using
> external cron that the main folder and run recover partitions on all the
> events under the folder)
>
> I know that RECOVER PARTITIONS clause scans a partitioned table to detect if
> any new partition directories were added outside of Impala, but wondering if
> there is any other ways to avoid running 4800 statements per a day while
> keeping the refreshment rate is high.
>
>
>
> --
> Take Care
> Fawze Abujaber


Re: What means if the sql statements append "limit 0" automatically?

2018-08-01 Thread Jeszy
Impala shouldn't append limit 0 automatically. It's probably done by
the tool you use to submit the queries. Do you observe the same issue
with impala-shell?

Thanks!

On 1 August 2018 at 09:46, meij...@yandex.com  wrote:
> Hi, Every One
>
> Impala version 2.11
>
> During press test, I  found many join sql statement like blew append "limit
> 0" automatically. is there any  resource limit reached?
>
> execute sql statement:
> select count(1) cnt from kudu_session_7 s left join kudu_event_detail_7 e on
> e.uuid = s.uuid and e.st = s.st group by s.browser order by cnt desc limit
> 1000;
>
> the related sql  statement at impala log:
> select count(1) cnt from kudu_session_7 s left join kudu_event_detail_7 e on
> e.uuid = s.uuid and e.st = s.st group by s.browser order by cnt desc limit
> 1000;
>
> 
> meij...@yandex.com


Re: Impala query cann’t submitted if the estimated memory beyonds the configured memory

2018-06-19 Thread Jeszy
That sounds weird (maybe a reporting issue?). Please send a profile
that shows the mem_limit being set and the error message referring to the
estimated memory instead - thanks!

On 19 June 2018 at 04:59, Fawze Abujaber  wrote:
> Hi Jezy,
>
> Thanks for your quick response.
>
> I’m using impala 2.10, I think since I’m using max memory per pool
> with the default memory limit, and here the estimate have to take place in
> order to estimate the max concurrent queries, and if the estimate beyond the
> max memory then the query will not submitted.
>
> In my case I set both values and getting errors of high memory needed for
> the query, it’s not occurring very often.
>
> On Mon, 18 Jun 2018 at 16:41 Jeszy  wrote:
>>
>> Hey Fawze,
>>
>> Default Query Memory Limit applies here, yes.
>> If you submit a query to a pool with that setting, you should see
>> something like this in the profile:
>> Query Options (set by configuration): MEM_LIMIT=X
>>
>> (YMMV based on version - what version are you running on?)
>> If MEM_LIMIT is present in that line, Impala will (should) disregard
>> estimates.
>>
>> Thanks!
>>
>> On 17 June 2018 at 21:04, Fawze Abujaber  wrote:
>> > Hi Jeszy,
>> >
>> > Thanks for your response, Indeed this is what i was thinking about but,
>> > I
>> > have  Default Query Memory Limit and Max memory set per pool which i
>> > think
>> > should be enough to cover this, shouldnot it?  or i should pass the
>> > mem_limit in the default query options?
>> >
>> > On Sun, Jun 17, 2018 at 8:36 PM, Jeszy  wrote:
>> >>
>> >> Hello Fawze,
>> >>
>> >> Disabling this, per se, is not an option, but an equally simple
>> >> workaround
>> >> is using MEM_LIMIT.
>> >>
>> >> The estimated stats are often very far from actual memory usage and
>> >> shouldn't be relied on - a best practice is to set MEM_LIMIT as a query
>> >> option (preferably have a default value set for each pool). Having that
>> >> set
>> >> will cause Impala to ignore the estimates and rely on this limit for
>> >> admission control purposes. This works decently for well-understood
>> >> workloads (ie. where the memory consumption is known to fit within
>> >> certain
>> >> limits). For ad-hoc workloads, if the query can't be executed within
>> >> the
>> >> default limit of the pool, you can override the limit on a per-query
>> >> basis
>> >> (just issue 'set MEM_LIMIT=...' before running the query).
>> >>
>> >> HTH
>> >>
>> >> On 16 June 2018 at 13:31, Fawze Abujaber  wrote:
>> >>>
>> >>> Hi Community ,
>> >>>
>> >>> In the last impala versions, impala is estimating the memory required
>> >>> for
>> >>> the query, and in case the estimated memory required beyonds the
>> >>> configured
>> >>> memory or  the configured memory per pool, impala is not submitting
>> >>> this
>> >>> query, taking the fact that many times and specially running query on
>> >>> tables
>> >>> without stats there is a huge difference between the estimated and the
>> >>> actual memory used, the estimated can be 31 GB per node and the actual
>> >>> use
>> >>> is 1 or 2 GB, that’s mean to submit a query I need at least 1.5 T
>> >>> memory
>> >>> configured which I see it too much.
>> >>>
>> >>> I’m curios to know if there is an option making this to configuration
>> >>> (
>> >>> Un submitting query if we he estimated memory required beyond the
>> >>> configured
>> >>> memory) as an optional choice.
>> >>>
>> >>> Such issue can block using impala dynamic resource pools.
>> >>> --
>> >>> Take Care
>> >>> Fawze Abujaber
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Take Care
>> > Fawze Abujaber
>
> --
> Take Care
> Fawze Abujaber


Re: Impala query cann’t submitted if the estimated memory beyonds the configured memory

2018-06-18 Thread Jeszy
Hey Fawze,

Default Query Memory Limit applies here, yes.
If you submit a query to a pool with that setting, you should see
something like this in the profile:
Query Options (set by configuration): MEM_LIMIT=X

(YMMV based on version - what version are you running on?)
If MEM_LIMIT is present in that line, Impala will (should) disregard estimates.

Thanks!

On 17 June 2018 at 21:04, Fawze Abujaber  wrote:
> Hi Jeszy,
>
> Thanks for your response, Indeed this is what i was thinking about but, I
> have  Default Query Memory Limit and Max memory set per pool which i think
> should be enough to cover this, shouldnot it?  or i should pass the
> mem_limit in the default query options?
>
> On Sun, Jun 17, 2018 at 8:36 PM, Jeszy  wrote:
>>
>> Hello Fawze,
>>
>> Disabling this, per se, is not an option, but an equally simple workaround
>> is using MEM_LIMIT.
>>
>> The estimated stats are often very far from actual memory usage and
>> shouldn't be relied on - a best practice is to set MEM_LIMIT as a query
>> option (preferably have a default value set for each pool). Having that set
>> will cause Impala to ignore the estimates and rely on this limit for
>> admission control purposes. This works decently for well-understood
>> workloads (ie. where the memory consumption is known to fit within certain
>> limits). For ad-hoc workloads, if the query can't be executed within the
>> default limit of the pool, you can override the limit on a per-query basis
>> (just issue 'set MEM_LIMIT=...' before running the query).
>>
>> HTH
>>
>> On 16 June 2018 at 13:31, Fawze Abujaber  wrote:
>>>
>>> Hi Community ,
>>>
>>> In the last impala versions, impala is estimating the memory required for
>>> the query, and in case the estimated memory required beyonds the configured
>>> memory or  the configured memory per pool, impala is not submitting this
>>> query, taking the fact that many times and specially running query on tables
>>> without stats there is a huge difference between the estimated and the
>>> actual memory used, the estimated can be 31 GB per node and the actual use
>>> is 1 or 2 GB, that’s mean to submit a query I need at least 1.5 T memory
>>> configured which I see it too much.
>>>
>>> I’m curios to know if there is an option making this to configuration (
>>> Un submitting query if we he estimated memory required beyond the configured
>>> memory) as an optional choice.
>>>
>>> Such issue can block using impala dynamic resource pools.
>>> --
>>> Take Care
>>> Fawze Abujaber
>>
>>
>
>
>
> --
> Take Care
> Fawze Abujaber


Re: Impala query cann’t submitted if the estimated memory beyonds the configured memory

2018-06-17 Thread Jeszy
Hello Fawze,

Disabling this, per se, is not an option, but an equally simple workaround
is using MEM_LIMIT.

The estimated stats are often very far from actual memory usage and
shouldn't be relied on - a best practice is to set MEM_LIMIT as a query
option (preferably have a default value set for each pool). Having that set
will cause Impala to ignore the estimates and rely on this limit for
admission control purposes. This works decently for well-understood
workloads (ie. where the memory consumption is known to fit within certain
limits). For ad-hoc workloads, if the query can't be executed within the
default limit of the pool, you can override the limit on a per-query basis
(just issue 'set MEM_LIMIT=...' before running the query).
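
As a sketch (limit value and table name are just examples):

  set MEM_LIMIT=2g;
  -- admission now uses the 2 GB per-node limit instead of the estimate
  select count(*) from big_table;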

HTH

On 16 June 2018 at 13:31, Fawze Abujaber  wrote:

> Hi Community ,
>
> In the last impala versions, impala is estimating the memory required for
> the query, and in case the estimated memory required beyonds the configured
> memory or  the configured memory per pool, impala is not submitting this
> query, taking the fact that many times and specially running query on
> tables without stats there is a huge difference between the estimated and
> the actual memory used, the estimated can be 31 GB per node and the actual
> use is 1 or 2 GB, that’s mean to submit a query I need at least 1.5 T
> memory configured which I see it too much.
>
> I’m curios to know if there is an option making this to configuration ( Un
> submitting query if we he estimated memory required beyond the configured
> memory) as an optional choice.
>
> Such issue can block using impala dynamic resource pools.
> --
> Take Care
> Fawze Abujaber
>


Re: Schema read issue

2018-05-16 Thread Jeszy
Hey Fawze,

This is expected to work seamlessly (although that's a pretty big
upgrade). Do the logs report any issues? Does comparing the schemas of
the before/after upgrade-written files using 'parquet-tools schema
' show any differences?

Thanks!

On 16 May 2018 at 12:47, Fawze Abujaber  wrote:
> Hello Community,
>
> We are running in schema read issue after upgrading Impala from 2.3 to 2.10
> using Cloudera CDH 5.13.0
>
>
> Our tables have alot of struct and arrays and after the upgrade all the
> historical data at arrays and structs shown as NULL while the new written
> data are shown correctly.
>
> Is it a known issue? or there was any action that i should care about before
> i upgrade?
>
>
>
> --
> Take Care
> Fawze Abujaber


Re: Issue in data loading in Impala + Kudu

2018-05-10 Thread Jeszy
As suggested over on the Kudu list, this is likely due to key
duplication (which is fine on HDFS, but won't work for Kudu). The
profile has the following error that confirms this:
Errors: Key already present in Kudu table
'impala::kudu_impala_500.LINEITEM'. (1 of -1831809966 similar)

Raised IMPALA-7007 to address the overflow(?) of the counter.
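
If you want to confirm that from the source side, something like the following
shows duplicated keys (a sketch - substitute the actual PRIMARY KEY columns of
your Kudu table; l_orderkey alone is just a placeholder):

  select l_orderkey, count(*) as dup_count
  from parquetimpala500.lineitem
  group by l_orderkey
  having count(*) > 1
  limit 10;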

On 10 May 2018 at 08:00, Mostafa Mokhtar  wrote:
> Can you try rerunning the query again against the Kudu database instead of
> default?
>
> select count(*) from kudu_impala_500.LINEITEM;
>
>
> On Wed, May 9, 2018 at 10:13 PM, Geetika Gupta 
> wrote:
>>
>> Please find below the query profile :
>>
>> Query (id=9649f7ab3bcc5fb8:f4d6a607):
>>   Summary:
>> Session ID: b04a9080d1e1724d:41f2b0d261e8f280
>> Session Type: HIVESERVER2
>> HiveServer2 Protocol Version: V6
>> Start Time: 2018-05-08 17:55:23.181264000
>> End Time: 2018-05-10 00:34:17.784273000
>> Query Type: DML
>> Query State: FINISHED
>> Query Status: OK
>> Impala Version: impalad version 3.0.0-SNAPSHOT RELEASE (build
>> b68e06997c1f49f6b723d78e217efddec4f56f3a)
>> User: root
>> Connected User: root
>> Delegated User:
>> Network Address: :::46.4.88.233:59862
>> Default Db: kudu_impala_500
>> Sql Statement: insert into LINEITEM select * from
>> PARQUETIMPALA500.LINEITEM
>> Coordinator: slave2:22000
>> Query Options (set by configuration):
>> Query Options (set by configuration and planner): MT_DOP=0
>> Plan:
>> 
>> Max Per-Host Resource Reservation: Memory=0B
>> Per-Host Resource Estimates: Memory=704.00MB
>> WARNING: The following tables are missing relevant table and/or column
>> statistics.
>> parquetimpala500.lineitem
>>
>> F00:PLAN FRAGMENT [RANDOM] hosts=7 instances=7
>> |  Per-Host Resources: mem-estimate=704.00MB mem-reservation=0B
>> INSERT INTO KUDU [kudu_impala_500.lineitem]
>> |  mem-estimate=0B mem-reservation=0B
>> |
>> 00:SCAN HDFS [parquetimpala500.lineitem, RANDOM]
>>partitions=1/1 files=396 size=97.29GB
>>stored statistics:
>>  table: rows=unavailable size=unavailable
>>  columns: unavailable
>>extrapolated-rows=disabled
>>mem-estimate=704.00MB mem-reservation=0B
>>tuple-ids=0 row-size=171B cardinality=unavailable
>> 
>> Estimated Per-Host Mem: 738197504
>> Tables Missing Stats: parquetimpala500.lineitem
>> Per Host Min Reservation: slave1:22000(0) slave2:22000(0)
>> slave3:22000(0) slave4:22000(0) slave5:22000(0) slave6:22000(0)
>> slave7:22000(0)
>> Request Pool: default-pool
>> Admission result: Admitted immediately
>> ExecSummary:
>> Operator      #Hosts  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
>> ----------------------------------------------------------------------------------------------
>> 00:SCAN HDFS  7       4s417ms   6s154ms   3.00B  -1          1.39 GB   704.00 MB      parquetimpala500.lineitem
>> Errors: Key already present in Kudu table
>> 'impala::kudu_impala_500.LINEITEM'. (1 of -1831809966 similar)
>>
>> Query Compilation: 6s413ms
>>- Metadata load started: 14.443ms (14.443ms)
>>- Metadata load finished. loaded-tables=2/2 load-requests=1
>> catalog-updates=7: 6s298ms (6s283ms)
>>- Analysis finished: 6s301ms (3.659ms)
>>- Value transfer graph computed: 6s302ms (282.554us)
>>- Single node plan created: 6s363ms (61.624ms)
>>- Runtime filters computed: 6s363ms (98.878us)
>>- Distributed plan created: 6s366ms (2.832ms)
>>- Planning finished: 6s413ms (46.751ms)
>> Query Timeline: 30h38m
>>- Query submitted: 56.829us (56.829us)
>>- Planning finished: 6s431ms (6s431ms)
>>- Submit for admission: 6s432ms (821.867us)
>>- Completed admission: 6s432ms (14.519us)
>>- Ready to start on 7 backends: 6s432ms (99.515us)
>>- All 7 execution backends (7 fragment instances) started: 6s535ms
>> (103.320ms)
>>- Released admission control resources: 30h38m (30h38m)
>>- DML data written: 30h38m (934.739us)
>>- DML Metastore update finished: 30h38m (157.938us)
>>- Request finished: 30h38m (41.379us)
>>- First row fetched: 30h38m (173.124us)
>>- First row fetched: 30h38m (1.750ms)
>>- First row fetched: 30h38m (1.361ms)
>>- Unregister query: 30h38m (1.456ms)
>>  - ComputeScanRangeAssignmentTimer: 558.440us
>>   ImpalaServer:
>>  - ClientFetchWaitTimer: 4.721ms
>>  - MetastoreUpdateTimer: 191.424us
>>  - RowMaterializationTimer: 0.000ns
>>   Execution Profile 9649f7ab3bcc5fb8:f4d6a607:(Total: 30h38m,
>> non-child: 0.000ns, % non-child: 0.00%)
>> Number of filters: 0
>> Filter routing table:
>>  ID  Src. Node  Tgt. Node(s)  Target type  Partition filter  Pending (Expected)  First arrived  Completed   Enabled
>> ---

Re: Issue in data loading in Impala + Kudu

2018-05-06 Thread Jeszy
Impala doesn't store the data itself, so you can switch versions
without rewriting data. But you don't have to do that; you would just
have to build Impala using the -release flag (of buildall.sh) and run
it using the release binaries (versus the debug ones). If you are
looking at performance, using the release version is highly
recommended anyway.
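
As a sketch, from the Impala source checkout:

  # build optimized (release) binaries instead of the default debug ones
  ./buildall.sh -release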

On 7 May 2018 at 08:30, Geetika Gupta  wrote:
> Hi Jeszy,
>
> Currently, we are using the apache impala's Github master branch code. We
> tried using the released version but we encountered some errors related to
> downloading of dependencies and could not complete the installation.
>
> The current version of impala we are using: 2.12
>
> We can't try with the new release as we have already loaded 500GB of TPCH
> data on our cluster.
>
> On Mon, May 7, 2018 at 11:43 AM, Jeszy  wrote:
>>
>> What version of Impala are you using?
>> DCHECKs won't be triggered if you run a release build. Looking at the
>> code, it should work with bad values if not for the DCHECK. Can you
>> try using a release build?
>>
>> On 7 May 2018 at 08:04, Geetika Gupta  wrote:
>> > Hi community,
>> >
>> > I was trying to load 500GB of TPCH data into kudu table using the
>> > following
>> > query:
>> >
>> > insert into lineitem select * from PARQUETIMPALA500.LINEITEM
>> >
>> > While executing the query for around 17 hrs it got cancelled as the
>> > impalad
>> > process of that machine got aborted. Here are the logs of the impalad
>> > process.
>> >
>> > impalad.ERROR
>> >
>> > Log file created at: 2018/05/06 13:40:34
>> > Running on machine: slave2
>> > Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> > E0506 13:40:34.097759 28730 logging.cc:121] stderr will be logged to
>> > this
>> > file.
>> > SLF4J: Class path contains multiple SLF4J bindings.
>> > SLF4J: Found binding in
>> >
>> > [jar:file:/root/softwares/impala/fe/target/dependency/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> > SLF4J: Found binding in
>> >
>> > [jar:file:/root/softwares/impala/testdata/target/dependency/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> > explanation.
>> > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> > 18/05/06 13:40:34 WARN util.NativeCodeLoader: Unable to load
>> > native-hadoop
>> > library for your platform... using builtin-java classes where applicable
>> > 18/05/06 13:40:36 WARN shortcircuit.DomainSocketFactory: The
>> > short-circuit
>> > local reads feature cannot be used because libhadoop cannot be loaded.
>> > tcmalloc: large alloc 1073741824 bytes == 0x484434000 @  0x4135176
>> > 0x7fd9e9fc3929
>> > tcmalloc: large alloc 2147483648 bytes == 0x7fd540f18000 @  0x4135176
>> > 0x7fd9e9fc3929
>> > F0507 09:46:12.673912 29258 error-util.cc:148] Check failed:
>> > log_entry.count
>> >> 0 (-1831809966 vs. 0)
>> > *** Check failure stack trace: ***
>> > @  0x3fc0c0d  google::LogMessage::Fail()
>> > @  0x3fc24b2  google::LogMessage::SendToLog()
>> > @  0x3fc05e7  google::LogMessage::Flush()
>> > @  0x3fc3bae  google::LogMessageFatal::~LogMessageFatal()
>> > @  0x1bbcb31  impala::PrintErrorMap()
>> > @  0x1bbcd07  impala::PrintErrorMapToString()
>> > @  0x2decbd7  impala::Coordinator::GetErrorLog()
>> > @  0x1a8d634  impala::ImpalaServer::UnregisterQuery()
>> > @  0x1b29264  impala::ImpalaServer::CloseOperation()
>> > @  0x2c5ce86
>> >
>> > apache::hive::service::cli::thrift::TCLIServiceProcessor::process_CloseOperation()
>> > @  0x2c56b8c
>> > apache::hive::service::cli::thrift::TCLIServiceProcessor::dispatchCall()
>> > @  0x2c2fcb1
>> > impala::ImpalaHiveServer2ServiceProcessor::dispatchCall()
>> > @  0x16fdb20  apache::thrift::TDispatchProcessor::process()
>> > @  0x18ea6b3
>> > apache::thrift::server::TAcceptQueueServer::Task::run()
>> > @  0x18e2181  impala::ThriftThread::RunRunnable()
>> > @  0x18e3885  boost::_mfi::mf2<>::operator()()
>> > @  0x18e371b  boost::_bi::list3<>::operator()<>()
>> > @  

Re: Issue in data loading in Impala + Kudu

2018-05-06 Thread Jeszy
What version of Impala are you using?
DCHECKs won't be triggered if you run a release build. Looking at the
code, it should work with bad values if not for the DCHECK. Can you
try using a release build?

On 7 May 2018 at 08:04, Geetika Gupta  wrote:
> Hi community,
>
> I was trying to load 500GB of TPCH data into kudu table using the following
> query:
>
> insert into lineitem select * from PARQUETIMPALA500.LINEITEM
>
> While executing the query for around 17 hrs it got cancelled as the impalad
> process of that machine got aborted. Here are the logs of the impalad
> process.
>
> impalad.ERROR
>
> Log file created at: 2018/05/06 13:40:34
> Running on machine: slave2
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> E0506 13:40:34.097759 28730 logging.cc:121] stderr will be logged to this
> file.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/root/softwares/impala/fe/target/dependency/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/root/softwares/impala/testdata/target/dependency/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 18/05/06 13:40:34 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 18/05/06 13:40:36 WARN shortcircuit.DomainSocketFactory: The short-circuit
> local reads feature cannot be used because libhadoop cannot be loaded.
> tcmalloc: large alloc 1073741824 bytes == 0x484434000 @  0x4135176
> 0x7fd9e9fc3929
> tcmalloc: large alloc 2147483648 bytes == 0x7fd540f18000 @  0x4135176
> 0x7fd9e9fc3929
> F0507 09:46:12.673912 29258 error-util.cc:148] Check failed: log_entry.count
>> 0 (-1831809966 vs. 0)
> *** Check failure stack trace: ***
> @  0x3fc0c0d  google::LogMessage::Fail()
> @  0x3fc24b2  google::LogMessage::SendToLog()
> @  0x3fc05e7  google::LogMessage::Flush()
> @  0x3fc3bae  google::LogMessageFatal::~LogMessageFatal()
> @  0x1bbcb31  impala::PrintErrorMap()
> @  0x1bbcd07  impala::PrintErrorMapToString()
> @  0x2decbd7  impala::Coordinator::GetErrorLog()
> @  0x1a8d634  impala::ImpalaServer::UnregisterQuery()
> @  0x1b29264  impala::ImpalaServer::CloseOperation()
> @  0x2c5ce86
> apache::hive::service::cli::thrift::TCLIServiceProcessor::process_CloseOperation()
> @  0x2c56b8c
> apache::hive::service::cli::thrift::TCLIServiceProcessor::dispatchCall()
> @  0x2c2fcb1
> impala::ImpalaHiveServer2ServiceProcessor::dispatchCall()
> @  0x16fdb20  apache::thrift::TDispatchProcessor::process()
> @  0x18ea6b3
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x18e2181  impala::ThriftThread::RunRunnable()
> @  0x18e3885  boost::_mfi::mf2<>::operator()()
> @  0x18e371b  boost::_bi::list3<>::operator()<>()
> @  0x18e3467  boost::_bi::bind_t<>::operator()()
> @  0x18e337a
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x192761c  boost::function0<>::operator()()
> @  0x1c3ebf7  impala::Thread::SuperviseThread()
> @  0x1c470cd  boost::_bi::list5<>::operator()<>()
> @  0x1c46ff1  boost::_bi::bind_t<>::operator()()
> @  0x1c46fb4  boost::detail::thread_data<>::run()
> @  0x2eedb4a  thread_proxy
> @ 0x7fda1dbb16ba  start_thread
> @ 0x7fda1d8e741d  clone
> Wrote minidump to
> /tmp/minidumps/impalad/a9113d9b-bc3d-488a-1feebf9b-47b42022.dmp
>
> impalad.FATAL
>
> Log file created at: 2018/05/07 09:46:12
> Running on machine: slave2
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> F0507 09:46:12.673912 29258 error-util.cc:148] Check failed: log_entry.count
>> 0 (-1831809966 vs. 0)
>
> Impalad.INFO
> edentials={real_user=root}} blocked reactor thread for 34288.6us
> I0507 09:38:14.943245 29882 outbound_call.cc:288] RPC callback for RPC call
> kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050
> (slave5), user_credentials={real_user=root}} blocked reactor thread for
> 35859.8us
> I0507 09:38:15.942150 29882 outbound_call.cc:288] RPC callback for RPC call
> kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050
> (slave5), user_credentials={real_user=root}} blocked reactor thread for
> 40664.9us
> I0507 09:38:17.495046 29882 outbound_call.cc:288] RPC callback for RPC call
> kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050
> (slave5), user_credentials={real_user=root}} blocked reactor thread for
> 49514.6us
> I0507 09:46:12.664149  4507 coordinator.cc:783] Release admission control
> resources for query_id=3e4a4c64

Re: How To Install Apche Impala on Ubuntu

2018-03-29 Thread Jeszy
Hello Anubhav,

Please include the steps that you've tried on your own to resolve this
problem, the blockers you've faced, and in general provide as much
information as possible about the problem at hand to encourage
community members to spend their own time working with you.
Answering the questions you were asked earlier is the least you can do, but I'd
strongly encourage you to try to make sense of the error messages and
attempt to take the issue as far as you can by reading Impala's wiki
pages and searching online.

HTH



On 29 March 2018 at 12:36, Anubhav Tarar  wrote:
> I resolved all your points, but when I ran the development script this is the
> error I got from bootstrap_development.sh
>
> [INFO] BUILD SUCCESS
> [ 86%] Built target fe
> Makefile:94: recipe for target 'all' failed
> make: *** [all] Error 2
> Error in /home/anubhav/Impala/bin/make_impala.sh at line 178: ${MAKE_CMD}
> ${MAKE_ARGS}
>
> now when i tried to run my impalad this is the error
>
> anubhav@anubhav-Vostro-3559:~/Impala/bin$ ./start-impalad.sh
> ./start-impalad.sh: line 89: /admin: No such file or directory
> /home/anubhav/Impala/be/build/latest/service/impalad: error while loading
> shared libraries: libjvm.so: cannot open shared object file: No such file or
> directory
>
>
> someone please help
>
> On Wed, Mar 28, 2018 at 8:21 PM, Gabor Kaszab 
> wrote:
>>
>> A few questions:
>> - Is /home/anubhav/Impala the directory where you expect your Impala sources to be?
>> (It can happen that bootstrap_development.sh checks out Impala to a different
>> directory than you'd expect.)
>> - Can you find that kudu/client/client.h file somewhere in your toolchain
>> directory? (To rule out that the script searches for it on a wrong path.)
>> - Do you have a toolchain directory at all?
>> - Have you executed those commands I sent you to set env variables?
>>
>> Gabor
>>
>> On Wed, Mar 28, 2018 at 8:23 AM, Anubhav Tarar 
>> wrote:
>>>
>>> hi i cloned the apache impala then hit command
>>> ./bin/bootstrap_development.sh
>>>
>>> still getting the same error
>>>
>>>   File "/home/anubhav/Impala/infra/python/bootstrap_virtualenv.py", line
>>> 379, in 
>>> kudu_client_dir = find_kudu_client_install_dir()
>>>   File "/home/anubhav/Impala/infra/python/bootstrap_virtualenv.py", line
>>> 315, in find_kudu_client_install_dir
>>> error_if_kudu_client_not_found(install_dir)
>>>   File "/home/anubhav/Impala/infra/python/bootstrap_virtualenv.py", line
>>> 322, in error_if_kudu_client_not_found
>>> raise Exception("Kudu client header not found at %s" % header_path)
>>> Exception: Kudu client header not found at
>>> /home/anubhav/Impala/toolchain/kudu-0eef8e0/debug/include/kudu/client/client.h
>>> Error in /home/anubhav/Impala/bin/impala-python at line 25:
>>>
>>>
>>> On Tue, Mar 27, 2018 at 1:48 PM, Jeszy  wrote:
>>>>
>>>> Hello Anubhav,
>>>>
>>>> Impala doesn't ship its releases in binary format, so you will have
>>>> to build it locally after downloading the source. For instructions on
>>>> how to do that, see Gabor's comments.
>>>>
>>>> If you don't have an existing Hadoop cluster on which you want to run
>>>> Impala, you can just use one of the distributor's packaged (binary)
>>>> versions. Cloudera's CDH is also open source.
>>>>
>>>> HTH
>>>>
>>>> On 27 March 2018 at 10:14, Gabor Kaszab 
>>>> wrote:
>>>> > I have already provided you the sourcing steps in my first mail:
>>>> > "Also, did you source the Impala configs? These are the ones I usually
>>>> > use:
>>>> >
>>>> > . bin/impala-config.sh;
>>>> >
>>>> > . bin/set-pythonpath.sh;
>>>> >
>>>> > . bin/set-classpath.sh
>>>> >
>>>> > "
>>>> >
>>>> >
>>>> > Have you tried following the Impala build page?
>>>> >
>>>> > https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>>>> >
>>>> >
>>>> > Gabor
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Mar 27, 2018 at 10:08 AM, Anubhav Tarar
>>>> > 
>>>> > wrote:
>>>> >>
>>>> >> I didn't clone the open source repo; I downloaded the Apache Impala 2.11
>>>> >> release from the download page of the official documentation,
>>>> >> https://impala.apache.org/downloads.html. What do you mean by sourcing
>>>> >> the configs, and what are the steps? Can you please provide the complete
>>>> >> steps? I have been stuck for 2 days.
>>>> >
>>>> >
>>>
>>>
>>>
>>>
>>> --
>>> Thanks and Regards
>>> Anubhav Tarar
>>>
>>>  Software Consultant
>>>   Knoldus Software LLP
>>>LinkedIn Twitterfb
>>>   mob : 8588915184
>>
>>
>
>
>
> --
> Thanks and Regards
> Anubhav Tarar
>
>  Software Consultant
>   Knoldus Software LLP
>LinkedIn Twitterfb
>   mob : 8588915184


Re: How To Install Apche Impala on Ubuntu

2018-03-27 Thread Jeszy
Hello Anubhav,

Impala doesn't ship its releases in binary format, so you will have
to build it locally after downloading the source. For instructions on
how to do that, see Gabor's comments.

If you don't have an existing Hadoop cluster on which you want to run
Impala, you can just use one of the distributor's packaged (binary)
versions. Cloudera's CDH is also open source.

HTH

On 27 March 2018 at 10:14, Gabor Kaszab  wrote:
> I have already provided you the sourcing steps in my first mail:
> "Also, did you source the Impala configs? These are the ones I usually use:
>
> . bin/impala-config.sh;
>
> . bin/set-pythonpath.sh;
>
> . bin/set-classpath.sh
>
> "
>
>
> Have you tried following the Impala build page?
>
> https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala
>
>
> Gabor
>
>
>
> On Tue, Mar 27, 2018 at 10:08 AM, Anubhav Tarar 
> wrote:
>>
>> I didn't clone the open source repo; I downloaded the Apache Impala 2.11 release
>> from the download page of the official documentation,
>> https://impala.apache.org/downloads.html. What do you mean by sourcing the
>> configs, and what are the steps? Can you please provide the complete steps?
>> I have been stuck for 2 days.
>
>


Re: Estimate peak memory VS used peak memory

2018-03-04 Thread Jeszy
>> admitted.
>>>>>
>>>>> That support is going to enable future enhancements to memory-based
>>>>> admission control to make it easier for cluster admins like yourself to
>>>>> configure admission control. It is definitely tricky to pick a good value
>>>>> for mem_limit when pools can contain a mix of queries and I think Impala 
>>>>> can
>>>>> do better at making these decisions automatically.
>>>>>
>>>>> - Tim
>>>>>
>>>>> On Fri, Feb 23, 2018 at 9:05 AM, Alexander Behm
>>>>>  wrote:
>>>>>>
>>>>>> For a given query the logic for determining the memory that will be
>>>>>> required from admission is:
>>>>>> - if the query has mem_limit use that
>>>>>> - otherwise, use memory estimates from the planner
>>>>>>
>>>>>> A query may be assigned a mem_limit by:
>>>>>> - taking the default mem_limit from the pool it was submitted to (this
>>>>>> is the recommended practice)
>>>>>> - manually setting one for the query (in case you want to override the
>>>>>> pool default for a single query)
>>>>>>
>>>>>> In that setup, the memory estimates from the planner are irrelevant
>>>>>> for admission decisions and only serve for informational purposes.
>>>>>> Please do not read too much into the memory estimates from the
>>>>>> planner. They can be totally wrong (like your 8TB example).
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 23, 2018 at 3:47 AM, Jeszy  wrote:
>>>>>>>
>>>>>>> Again, the 8TB estimate would not be relevant if the query had a
>>>>>>> mem_limit set.
>>>>>>> I think all that we discussed is covered in the docs, but if you feel
>>>>>>> like specific parts need clarification, please file a jira.
>>>>>>>
>>>>>>> On 23 February 2018 at 11:51, Fawze Abujaber 
>>>>>>> wrote:
>>>>>>> > Sorry for  asking many questions, but i see your answers are
>>>>>>> > closing the
>>>>>>> > gaps that i cannot find in the documentation.
>>>>>>> >
>>>>>>> > So how we can explain that there was an estimate for 8T per node
>>>>>>> > and impala
>>>>>>> > decided to submit this query?
>>>>>>> >
>>>>>>> > My goal that each query running beyond the actual limit per node to
>>>>>>> > fail (
>>>>>>> > and this is what i setup in the default memory per node per pool)
>>>>>>> > an want
>>>>>>> > all other queries to be queue and not killed, so what i understand
>>>>>>> > that i
>>>>>>> > need to setup the max queue query to unlimited and the queue
>>>>>>> > timeout to
>>>>>>> > hours.
>>>>>>> >
>>>>>>> > And in order to reach that i need to setup the default memory per
>>>>>>> > node for
>>>>>>> > each pool and setting either max concurrency or the max memory per
>>>>>>> > pool that
>>>>>>> > will help to measure the max concurrent queries that can run in
>>>>>>> > specific
>>>>>>> > pool.
>>>>>>> >
>>>>>>> > I think reaching this goal will close all my gaps.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Feb 23, 2018 at 11:49 AM, Jeszy  wrote:
>>>>>>> >>
>>>>>>> >> > Do queuing query or not is based on the prediction which based
>>>>>>> >> > on the
>>>>>>> >> > estimate and of course the concurrency that can run in a pool.
>>>>>>> >>
>>>>>>> >> Yes, it is.
>>>>>>> >>
>>>>>>> >> > If I have memory limit per pool and memory limit per node for a
>>>>>>> >> > pool, so
>>>>>>> >> > it
>>>>>>> >> > can be used to estimate number of queries that can run
>>>>>>> >> > concurrently, is
>>>>>>> >> > this
>>>>>>> >> > also based on the prediction and not the actual use.
>>>>>>> >>
>>>>>>> >> Also on prediction.
>>>>>>> >
>>>>>>> >
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Estimate peak memory VS used peak memory

2018-02-23 Thread Jeszy
Again, the 8TB estimate would not be relevant if the query had a mem_limit set.
I think all that we discussed is covered in the docs, but if you feel
like specific parts need clarification, please file a jira.

On 23 February 2018 at 11:51, Fawze Abujaber  wrote:
> Sorry for asking so many questions, but I see your answers are closing the
> gaps that I cannot find in the documentation.
>
> So how can we explain that there was an estimate of 8T per node and Impala
> still decided to admit this query?
>
> My goal is for each query that runs beyond the actual per-node limit to fail
> (and this is what I set up in the default memory per node per pool), and I want
> all other queries to be queued and not killed. So what I understand is that I
> need to set the max queued queries to unlimited and the queue timeout to
> hours.
>
> And in order to reach that I need to set up the default memory per node for
> each pool and set either the max concurrency or the max memory per pool, which
> will help determine the max number of concurrent queries that can run in a
> specific pool.
>
> I think reaching this goal will close all my gaps.
>
>
>
> On Fri, Feb 23, 2018 at 11:49 AM, Jeszy  wrote:
>>
>> > Do queuing query or not is based on the prediction which based on the
>> > estimate and of course the concurrency that can run in a pool.
>>
>> Yes, it is.
>>
>> > If I have memory limit per pool and memory limit per node for a pool, so
>> > it
>> > can be used to estimate number of queries that can run concurrently, is
>> > this
>> > also based on the prediction and not the actual use.
>>
>> Also on prediction.
>
>


Re: Estimate peak memory VS used peak memory

2018-02-23 Thread Jeszy
> Whether a query gets queued or not is based on the prediction, which is based on
> the estimate and of course the concurrency that can run in a pool.

Yes, it is.

> If I have a memory limit per pool and a memory limit per node for a pool, it
> can be used to estimate the number of queries that can run concurrently. Is this
> also based on the prediction and not the actual use?

Also on prediction.


Re: Estimate peak memory VS used peak memory

2018-02-23 Thread Jeszy
Queries will be killed based on actual usage (peak memory usage across
hosts), so the 200mb is the interesting value in your example.

Compare the pool's available memory to the query's mem requirement
(based on estimate or mem_limit, as discussed) to predict admission.

On 23 February 2018 at 10:06, Fawze Abujaber  wrote:
> Thanks Jeszy for your detailed response.
>
> Yes I read the documentation.
>
> Let me simplify my question:
>
> I have pools set up with memory limit per node and concurrency.
>
> If I'm looking at the historical Impala queries that I have and the metrics
> I have per query, which metrics tell me that Impala will kill
> the query? For example, if I have a query with an estimate of 2GB and the usage
> per node is 200mb, what default memory value do I need to set up so that
> the query will not fail?
>
> The second one is the distribution between pools: if one query is running,
> which metrics do I have to look into to know whether a query I submit will fail
> or not?
>
> On Fri, 23 Feb 2018 at 10:48 Jeszy  wrote:
>>
>> Hey Fawze,
>>
>> Answers inline.
>>
>> On 23 February 2018 at 01:23, Fawze Abujaber  wrote:
>> > There is no option in the admission control to setup memory limit per
>> > query,
>> > the memory limit is per pool and there is a default memory per node for
>> > query.
>>
>> per node for query memory limit multiplied by number of nodes gives
>> you a per query memory limit. I agree its confusing that the
>> configurations mix and match between per-node and aggregated values.
>> In this case there's a good reason though, as a single node running
>> out of memory will lead to query failure, meaning that in addition to
>> total memory used, distribution of memory usage between hosts also
>> matters.
>>
>> > I have hundreds of impala queries and more add hoc queries, making a
>> > pool
>> > for each query is not a visible solution.
>> >
>> > still waiting to understand how the estimate per node related to the
>> > default
>> > memory per node I set up per pool, is it used in the decision of queuing
>> > and
>> > killing the query? and if this is true how it was not kill a query that
>> > was
>> > estimated it needs 8.2TB memory per node.
>> >
>> > Understanding on which parameters impala decides to kill a query can
>> > help
>> > understand to define and divide the memory between the pools.
>>
>> If you set mem_limit at any level (service level, pool level, or query
>> level), it will be used for admission control purposes instead of
>> estimates. So a 8.2TB estimate would not be a problem, if impala can
>> reserve mem_limit amount on each host, it will start running the
>> query.
>>
>> > Passing memory limit per query manually is also not visible and such
>> > settings not needs admission control.
>> >
>> > I have support pool that runs ad hoc query and I can not ask them to use
>> > memory limit per query, and I have analytics pool which is fully
>> > business
>> > and I can rely on admission control if it extremely in accurate.
>>
>> It's a bit tricky to use memory-based admission control with
>> non-trivial ad hoc queries. For simple ad-hoc queries, you can try to
>> come up with a 'good enough' mem_limit, or omit mem_limit and trust
>> impala's estimations. You can check the estimated vs. actual values
>> for a representative set of ad hoc queries to see what would work in
>> your case. I've found that people tend to go with a large enough
>> mem_limit for the ad hoc pool.
>>
>> > Can someone explain me exactly which recommended setting to use per pool
>> > and
>> > which of them rely on impala memory estimates?
>>
>> The documentation of admission control
>> (https://impala.apache.org/docs/build/html/topics/impala_admission.html)
>> gives you a good view on how stuff works, but you will have to figure
>> out how to use these features for your specific use case. That said,
>> when using memory based admission control, it is best practice to
>> always use a mem_limit due to potential inaccuracy of estimates as
>> well as potential variance of estimates between Impala releases. Keep
>> in mind that you can opt to set a default mem_limit for one pool and
>> leave it unset for another.
>>
>> > So my conclusion right now to avoid using any settings rely on the
>> > estimates
>> > and to ignore the estimates when I want to evaluate query.
>>

Re: Estimate peak memory VS used peak memory

2018-02-23 Thread Jeszy
Hey Fawze,

Answers inline.

On 23 February 2018 at 01:23, Fawze Abujaber  wrote:
> There is no option in admission control to set a memory limit per query;
> the memory limit is per pool and there is a default memory per node for a
> query.

The per-node query memory limit multiplied by the number of nodes gives
you a per-query memory limit. I agree it's confusing that the
configurations mix and match between per-node and aggregated values.
In this case there's a good reason though, as a single node running
out of memory will lead to query failure, meaning that in addition to
total memory used, the distribution of memory usage between hosts also
matters.
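(As a worked example under assumed numbers: with a 2 GB per-node mem_limit on a
10-node cluster, admission control budgets 10 x 2 GB = 20 GB for the query in
aggregate, yet the query still fails if any single node needs more than its 2 GB share.)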

> I have hundreds of Impala queries and more ad hoc queries; making a pool
> for each query is not a viable solution.
>
> I am still waiting to understand how the per-node estimate relates to the default
> memory per node I set up per pool. Is it used in the decision to queue or
> kill the query? And if this is true, how did it not kill a query that was
> estimated to need 8.2TB of memory per node?
>
> Understanding which parameters Impala uses to decide to kill a query can help
> me define and divide the memory between the pools.

If you set mem_limit at any level (service level, pool level, or query
level), it will be used for admission control purposes instead of
estimates. So an 8.2TB estimate would not be a problem: if Impala can
reserve the mem_limit amount on each host, it will start running the
query.
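A minimal sketch of that (the limit and table name below are illustrative, not
taken from this thread):

  -- with an explicit per-node limit set, admission ignores the planner estimate
  SET MEM_LIMIT=2g;
  SELECT count(*) FROM sales_fact;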

> Passing a memory limit per query manually is also not viable, and such a
> setting does not need admission control.
>
> I have a support pool that runs ad hoc queries and I cannot ask those users to
> use a memory limit per query, and I have an analytics pool which is fully business;
> I cannot rely on admission control if it is extremely inaccurate.

It's a bit tricky to use memory-based admission control with
non-trivial ad hoc queries. For simple ad-hoc queries, you can try to
come up with a 'good enough' mem_limit, or omit mem_limit and trust
impala's estimations. You can check the estimated vs. actual values
for a representative set of ad hoc queries to see what would work in
your case. I've found that people tend to go with a large enough
mem_limit for the ad hoc pool.
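(For a concrete idea of what to compare, see the query details quoted later in
this thread: a single query showing 'Estimated per Node Peak Memory: 8.2 TiB'
against 'Per Node Peak Memory Usage: 1.1 GiB'.)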

> Can someone explain to me exactly which settings are recommended per pool, and
> which of them rely on Impala's memory estimates?

The documentation of admission control
(https://impala.apache.org/docs/build/html/topics/impala_admission.html)
gives you a good view on how stuff works, but you will have to figure
out how to use these features for your specific use case. That said,
when using memory based admission control, it is best practice to
always use a mem_limit due to potential inaccuracy of estimates as
well as potential variance of estimates between Impala releases. Keep
in mind that you can opt to set a default mem_limit for one pool and
leave it unset for another.

> So my conclusion right now is to avoid using any settings that rely on the
> estimates, and to ignore the estimates when I want to evaluate a query.

Sounds good.

> @Mostafa, since my issue is with all the queries, I think the profile will not
> help me solve such a broad issue.
>
> I'm planning to move away from Vertica and rely on Impala as a SQL engine,
> and am now fully confused about how I can do this if I can't use admission
> control.
>
> Last thing: is it recommended to use Impala admission control?

Yes. Admission control can take a while to understand, but if done
right, it works.

HTH

> On Fri, 23 Feb 2018 at 1:56 Alexander Behm  wrote:
>>
>> The planner memory estimates are conservative and sometimes extremely
>> inaccurate. In their current form, they are rarely appropriate for admission
>> decisions.
>>
>> The recommended practice for memory-based admission control is to set a
>> mem_limit for every query. You can make this easier by setting up different
>> pools with different mem_limits, e.g. a small/medium/big queries pool or
>> similar.
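A minimal sketch of that pattern (the pool and table names are hypothetical, and the
pool's default mem_limit itself is configured in the admission control settings):

  -- a query routed to the 'small' pool picks up that pool's default mem_limit
  SET REQUEST_POOL=root.small_queries;
  SELECT count(*) FROM web_logs;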
>>
>> On Thu, Feb 22, 2018 at 3:00 PM, Mostafa Mokhtar 
>> wrote:
>>>
>>> It is recommended to set a per query memory limit as part of admission
>>> and not rely on estimates as they are sometimes inaccurate.
>>> Can you please include the full query profile?
>>>
>>>
>>> On Thu, Feb 22, 2018 at 12:13 PM, Fawze Abujaber 
>>> wrote:

 Hi Mostafa,

 It's not a specific query; almost all the queries have such a difference
 between the 2 values.

 I can see even queries showing the estimate per node is 8.2 Tib

 User: psanalytics

 Database: default

 Query Type: QUERY
 Coordinator: slpr-dhc014.lpdomain.com

 Duration: 6.48s

 Rows Produced: 708
 Estimated per Node Peak Memory: 8.2 TiB

 Per Node Peak Memory Usage: 1.1 GiB

 Pool: root.impanalytics
 Threads: CPU Time: 20.1m



 How can you explain this behavior? For sure I don't have 8.2 TiB of
 memory per node to give, and neither do you.

 Can

Re: access control port 25000

2018-02-20 Thread Jeszy
Hey Sunil,

No, there is no way to do this currently.
The cancel part is tracked as:
https://issues.apache.org/jira/browse/IMPALA-1762. I don't think we
have jiras for the rest, though I agree they would be nice additions.

HTH

On 20 February 2018 at 07:27, Sunil Parmar  wrote:
> Impalad's port 25000 provides an easy way to monitor queries and other debug
> information about the node. Most things that can be done there are read-only,
> except a few...
>
> - Cancel the running query.
> - Close Hive connections
> - Set log level
>
> We're trying to expose this page to devs for easy debugging of an issue, but it
> exposes these capabilities, which are not ideal for a production setup. Is
> there a way to set access control on what gets exposed on port 25000?
>
> Thanks,
> Sunil Parmar


Re: Debugging Impala query that consistently hangs

2018-02-08 Thread Jeszy
Not sure that's what you're referring to, but scan progress isn't
necessarily indicative of overall query progress. Can you attach the text
profile of the cancelled query?
If you cannot upload attachments, the Summary section is the best starting
point, so please include that.
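(For reference, both can be printed from the same impala-shell session that ran the
query; these are impala-shell commands, shown only as a sketch:

  [impala-shell] > summary;
  [impala-shell] > profile;

The profile of a finished or cancelled query is also available from the coordinator's
debug web page on port 25000, under /queries.)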

On 8 February 2018 at 20:53, Piyush Narang  wrote:

> Hi folks,
>
>
>
> I have a query that I’m running on Impala that seems to consistently stop
> making progress after reaching 45-50%. It stays at that split number for a
> couple of hours (before I cancel it).  I don’t see any progress on the
> summary page either. I’m running 2.11.0-cdh5.14.0 RELEASE (build
> d68206561bce6b26762d62c01a78e6cd27aa7690). It seems to not be making
> progress from an exchange hash step.
>
> Has anyone run into this in the past? Any suggestions on what’s the best
> way to debug this? (I could take stack dumps on the coordinator / workers,
> but not sure if there’s any other way).
>
>
>
> Thanks,
>
>
>
> -- Piyush
>
>
>


Re: Question about using LDAP

2018-02-02 Thread Jeszy
Is the difference in ending (dc=ldapserver,dc=com versus dc=ldapserver,dc=local)
intentional?

On 2 February 2018 at 20:48, Jason Mcswain  wrote:

> Sunil,
> Just in case you meant "ldap_tls", that property is disabled.
>
> -Jason-
>
> On Fri, Feb 2, 2018 at 1:43 PM, Jason Mcswain 
> wrote:
>
>> Hello Sunil,
>>
>> Thank you for the quick response.  Yes, this deployment is not secure,
>> i'm just trying to get the env working, and then later i will consider
>> using TLS.  The property you mentioned "ldap_ls",  is that an ldap property
>> or an impala property?  Do you have an example of how i might disable this?
>>
>> Thank you,
>> -Jason McSwain-
>>
>> -- Forwarded message --
>> From: Sunil Parmar 
>> To: user@impala.apache.org
>> Cc:
>> Bcc:
>> Date: Fri, 2 Feb 2018 10:57:23 -0800
>> Subject: Re: Question about using LDAP
>> I'm assuming you're not using tls because you're sending password in
>> clear text. Can you try disabling the property ldap_ls , unless you already
>> did?
>>
>> Sunil Parmar
>>
>> On Fri, Feb 2, 2018 at 11:55 AM, Jason Mcswain 
>> wrote:
>>
>>> Hello Impala User Group,
>>>
>>> I am trying to configure Impala to use existing LDAP service, but i'm
>>> running into some kind of error.  I am able to do an ldapsearch from the
>>> same node that is running impalad, but when i run impala-shell i get an
>>> erorr that looks like auth failed.
>>>
>>> ---
>>> impala-shell query request - failed with related impalad.INFO log file.
>>> ---
>>>
>>> [root@mycdhcluster-2 ~]# impala-shell -i 127.0.0.1:21000
>>> --auth_creds_ok_in_clear -u bob -l -q "select * from testdb.accounts"
>>> Starting Impala Shell using LDAP-based authentication
>>> LDAP password for bob:
>>> Error connecting: TTransportException, TSocket read 0 bytes
>>> Not connected to Impala, could not execute queries.
>>> [root@mycdhcluster-2 ~]#
>>> [root@mycdhcluster-2 ~]# tail /var/log/impalad/impalad.INFO
>>> I0202 09:39:49.781989 17168 authentication.cc:249] Trying simple LDAP
>>> bind for: uid=bob,ou=users,dc=ldapserver,dc=com
>>> W0202 09:39:49.834450 17168 authentication.cc:256] LDAP authentication
>>> failure for uid=bob,ou=users,dc=ldapserver,dc=com : Invalid credentials
>>> E0202 09:39:49.835139 17168 authentication.cc:159] SASL message (LDAP):
>>> Password verification failed
>>> I0202 09:39:49.835741 17168 thrift-util.cc:123] TThreadPoolServer:
>>> Caught TException: SASL(-13): user not found: Password verification failed
>>> [root@mycdhcluster-2 ~]#
>>> [root@mycdhcluster-2 ~]#
>>>
>>> ---
>>> ldap search on impala cluster node. - Success.
>>> ---
>>> [root@mycdhcluster-2 ~]# ldapsearch -W -h ldapserver.gce.cloudera.com
>>> -D "uid=bob,ou=users,dc=ldapserver,dc=local" -b
>>> "dc=ldapserver,dc=local" "uid=bob"
>>> Enter LDAP Password:
>>> # extended LDIF
>>> #
>>> # LDAPv3
>>> # base  with scope subtree
>>> # filter: uid=bob
>>> # requesting: ALL
>>> #
>>>
>>> # bob, users, ldapserver.local
>>> dn: uid=bob,ou=users,dc=ldapserver,dc=local
>>> uid: bob
>>> cn: bob
>>> objectClass: account
>>> objectClass: posixAccount
>>> objectClass: top
>>> uidNumber: 504
>>> gidNumber: 502
>>> loginShell: /bin/bash
>>> homeDirectory: /home/bob
>>> userPassword:: Ymx1ZXRhbG9u
>>>
>>> # search result
>>> search: 2
>>> result: 0 Success
>>>
>>> # numResponses: 2
>>> # numEntries: 1
>>> [root@mycdhcluster-2 ~]# echo $?
>>> 0
>>>
>>> -
>>> Here is the configuration that i have done via CDH:
>>> -
>>>
>>> [configuration screenshots attached as inline images]
>>>
>>> Based on this configuration and the output, does anyone know what i'm
>>> doing wrong here?  I feel like i'm really close to getting impala working
>>> with ldap, but i'm missing something.
>>>
>>> BTW my environment:
>>>
>>>- i'm on CDH5.12.1
>>>- statestored version 2.9.0-cdh5.12.1 RELEASE (build
>>>5131a031f4aa38c1e50c430373c55ca53e0517b9)
>>>- (Impala Shell v2.9.0-cdh5.12.1 (5131a03) built on Thu Aug 24
>>>09:27:32 PDT 2017)
>>>
>>> Any assistance you can provide will be greatly appreciated,
>>>
>>> Warm Regards,
>>> -Jason McSwain-
>>>
>>
>>
>


Re: Re: Can not start impala Catalog and Daemons due to JVM FATAL

2018-01-30 Thread Jeszy
This is most likely caused by a RedHat bug - see this article:
https://access.redhat.com/solutions/3091371

On 30 January 2018 at 12:11, Dong Bo 董博  wrote:

> The strange thing is that I can start the statestore on the same server.
>
>
>
> From: Dong Bo 董博
> Sent: 30 January 2018 19:07
> To: user@impala.apache.org
> Subject: Re: Can not start impala Catalog and Daemons due to JVM FATAL
>
>
>
> I could not find these 2 options in /etc/default/impala or /etc/impala/conf;
> where else should I check?
>
> Btw, I am not using Cloudera Manager. Impala is installed with yum from the
> Cloudera repo.
>
>
>
> From: Fawze Abujaber [mailto:fawz...@gmail.com ]
> Sent: 30 January 2018 17:31
> To: user@impala.apache.org
> Subject: Re: Can not start impala Catalog and Daemons due to JVM FATAL
>
>
>
> Go to the Impala configuration and make the following parameters empty:
>
>
>
> Catalog Server Core Dump Directory
>
>
>
> Catalog Server Breakpad Dump Directory
>
>
>
>
>
> On Tue, Jan 30, 2018 at 10:59 AM, Dong Bo 董博  wrote:
>
> I am trying to integrate Impala with a kerberized Hadoop platform, and failed
> to start catalogd and the daemons.
>
>
>
> The error message is not specific; all related files are attached.
>
>
>
> Impala Version : 2.11.0+cdh5.14.0+0-1.cdh5.14.0.p0.50.el7
>
> JDK : jdk1.8.0_91
>
>
>
> Error Content :
>
>
>
> #
>
> # A fatal error has been detected by the Java Runtime Environment:
>
> #
>
> #  SIGBUS (0x7) at pc=0x7f23019d1c1f, pid=10116, tid=139788732049792
>
> #
>
> # JRE version:  (8.0_91-b14) (build )
>
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.91-b14 mixed mode
> linux-amd64 compressed oops)
>
> # Problematic frame:
>
> # j  java.lang.Object.()V+0
>
> #
>
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
>
> #
>
> # An error report file with more information is saved as:
>
> # /home/impala/hs_err_pid10116.log
>
> #
>
> # If you would like to submit a bug report, please visit:
>
> #   http://bugreport.java.com/bugreport/crash.jsp
>
> #
>
>
>
>
>
> Thanks in Advance
>
>
>
> Carl Dong
>
>
>


Re: Re: Configuration for Admission Control

2018-01-24 Thread Jeszy
Hey Quanlong,

1. Impala estimates the memory usage at planning time, and runtime
statistics for a specific run aren't reused on subsequent runs, so the
estimate changes only when the plan changes, or when statistics
change. Also, estimates are often wrong (usually overestimating). The
'mem_limit' query option will override estimates; it's good practice
to apply it at the pool level, so you can get deterministic
concurrency. This can be difficult though, as it requires you to
assign queries to pools based on memory usage / allowance.

2. No, this isn't possible currently.

HTH!

On 25 January 2018 at 07:43, Quanlong Huang  wrote:
> Thanks, Tim!
>
> 1. The soft limit is exactly what we want. I have another question: how
> does Impala estimate the memory usage of a query? It seems that it won't
> change the estimate even after the query runs again and the actual usage is
> much smaller than the estimate.
>
> 2. I think we need a template for configuration, something like Presto
> provides: https://prestodb.io/docs/current/admin/queue.html. Every new user
> corresponds to a new pool. The admin doesn't need to create a pool manually
> for him/her. For detailed limit types, I think more are welcome. Currently,
> we need these two requirements:
>   a. Each user can run no more than 5 queries in parallel.
>   b. The total number of queries running in parallel across the whole system
> should be no more than 20.
> Am I right that I can't configure this right now?
>
> Thanks,
> Quanlong
>
>
> At 2018-01-25 02:49:33,"Tim Armstrong"  wrote:
>
> Hi Quanlong,
>
>  1. Admission control memory limits for pools actually behave as soft limits
> already - admission control won't kill queries if the pool's limits are
> exceeded. It is limited to admitting/queueing/rejecting a query.
>
> The hard limits are the query and process memory limits. If an individual
> query's mem_limit is exceeded, it will be killed, or if the Impala daemon
> process's total memory limit is exceeded, queries will be killed until it gets
> under the limit.
>
> In the short-to-medium term we're working to avoid that kind of
> out-of-memory as much as possible. If a query can't run with a given
> mem_limit, it shouldn't be admitted. If it is admitted, it should regulate
> its own memory consumption by spilling to disk, etc, to stay under the
> mem_limit. We have a lot of pieces for that already (e.g. the big revamp of
> spill-to-disk in IMPALA-3200 and the HDFS scanner patches I have out for
> review right now).
>
> 2. We don't support this right now, but that is a very good idea. I'm not
> sure what exactly the right policy is. Maybe limiting each user to a fixed
> number of queries is reasonable, or maybe there should also be some kind of
> fairness (e.g. a user can't consume more than x% of the remaining resources
> in the pool). Would be interested in your thoughts.
>
> - Tim
>
> On Wed, Jan 24, 2018 at 5:53 AM, Quanlong Huang 
> wrote:
>>
>> Hi all,
>>
>> We're going to use Admission Control to support multi-tenancy. I have
>> several questions about the configuration:
>>
>> 1. Is there a config for a soft memory limit of a queue? I.e., when
>> queries in a pool have in total consumed more memory than the soft
>> limit, they won't fail directly, but the queries submitted later for this
>> pool will be queued.
>>
>> 2. Can we configure that the max number of concurrently running queries for each
>> user should be no more than a limit (e.g. 10)? Currently, I have to create a pool
>> for a user to do this. This is not scalable if we have tens of users. And we
>> have to add a new pool for each new user.
>>
>> Thanks,
>> Quanlong
>>
>>
>>
>
>
>
>
>


Re: Impala queries running with empty user

2018-01-11 Thread Jeszy
Hey Fawze,

You should consider Impala's authentication options, as described in:
http://impala.apache.org/docs/build/html/topics/impala_authentication.html

The admission control based approach will be breakable since users
(including the anonymous ones you want to exclude) can just specify another
user's pool via the command line (SET REQUEST_POOL - the first option in your
placement rules).
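For illustration (the pool name is hypothetical):

  -- any client can route itself into another pool explicitly:
  SET REQUEST_POOL=root.analytics;
  SELECT count(*) FROM some_table;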

HTH

On 11 January 2018 at 15:47, Fawze Abujaber  wrote:

> Hello Guys,
>
>
>
> I have some Impala queries that are running with no user, and I want to block
> such queries.
>
>
>
> Is there any setting that I can configure to block this?
>
>
>
>
>
> Here are my placement rules:
>
>
>
>
> 1. Use the pool specified at run time, only if the pool exists.
> 2. Use the pool root.[username], only if the pool exists.
> 3. Use the pool root.default. This rule is always satisfied; subsequent rules
> are not used.
>


Impala on Kudu vs. Impala on HDFS

2018-01-08 Thread Jeszy
Hey,

Boris Tyukin recently shared

their experience on exploring what Impala on Kudu is capable of, compared
to Impala on HDFS:
http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/

It would be interesting to hear more user stories like that - let us know
if you have one!

Thanks!

Balazs


Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

2017-12-18 Thread Jeszy
Hello Franco,

Thanks for your feedback! I agree there are pain points with using
timestamps, especially together with other systems.
Is there any particular approach or solution you propose that would
work well for you? Have you found any jiras on issues.apache.org that
describe what you're asking for? Commenting on a jira will help the
team track your input better.

Regards,
Balazs

On 17 December 2017 at 00:38, Franco Venturi  wrote:
> Please note that the discussion below refers to the following versions:
>   - Impala: v2.10.0-cdh5.13.0
>   - Kudu: 1.5.0-cdh5.13.0
>   - Everything runs on a standard Cloudera 5.13 installation
>
>
> A few days ago I was writing some Java code to migrate several tables
> directly from Oracle to Kudu (to be queried later on by our developers and
> BI tools using Impala). Most of these tables have columns that are of type
> "timestamp" (to be exact, they come in as instances of class
> oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> to make things simple).
> As you probably know, Kudu, starting I think with version 1.3.1, has a type
> called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> TIMESTAMP" data type
> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
>
>
> A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> 'Apache Kudu Schema Design' document
> (https://kudu.apache.org/docs/schema_design.html), which says:
>
>
>   unixtime_micros (64-bit microseconds since the Unix epoch)
>
>
> where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
>
>
> With this understanding I went ahead and wrote my Java code; when I ran the
> first few tests, I noticed that the timestamp values returned by Impala (I
> created in Impala an 'external' table 'stored as kudu') were off by several
> hours compared to the values returned by the original table in Oracle (our
> servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> in the 'America/New_York' timezone).
>
>
> To investigate this difference, I created a simple table in Kudu with just
> two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> timestamp. I ran a few inserts and selects over this table in Impala and
> figured out that Impala stores a value that is more or less defined as
> follow:
>
>
>   number of microseconds since the Unix epoch (i.e. what I was expecting
> originally)
>   + offset of the timestamp I inserted with respect to GMT (in my case
> this offset is the offset for EST or EDT depending if that timestamp was
> during EST (winter) or EDT (summer))
>
>
> This is how Impala achieves what is described as:
>
>
>   Impala does not store timestamps using the local timezone, to avoid
> undesired results from unexpected time zone issues
>
>
> That same page has caveats like the following, that sent a shiver down my
> spine:
>
>
>   If that value was written to a data file, and shipped off to a distant
> server to be analyzed alongside other data from far-flung locations, the
> dates and times would not match up precisely because of time zone
> differences
>
>
> This means that if anyone is using (or even thinking about using) "Impala
> timestamps" to say store financial or health services (or security) events,
> they'll find some nasty "surprises" (even if they don't plan to ever move
> their servers and only do business in one timezone).
>
>
> Consider for instance the case of anything that occurred between 1am and 2am
> EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> to EST) - there's no way to store the timestamps for these events in Kudu
> via Impala.
>
> To prove this I wrote this simple piece of Java code (which uses Java 8 and
> all well documented and non-deprecated classes and methods) to do just an
> insert and a select via Impala JDBC of a timestamp row in the simple table
> that I mentioned above (primary key + timestamp column):
>
>
>
>   // run insert
>   long primaryKey = 1L;
>   PreparedStatement insert = connection.prepareStatement("insert into "
> + table + " values (?, ?)");
>   insert.setLong(1, primaryKey);
>   Timestamp timestampIn = new Timestamp(150985980L);
>   System.out.println("TimestampIn: " + timestampIn + " - getTime(): " +
> timestampIn.getTime());
>   insert.setTimestamp(2, timestampIn);
>   insert.executeUpdate();
>   insert.close();
>
>
>   // run select
>   PreparedStatement select = connection.prepareStatement("select " +
> timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
>   select.setLong(1, primaryKey);
>   ResultSet resultSet = select.executeQuery();
>   while (resultSet.next()) {
>   Timestamp timestampOut = resultSet.getTimestamp(1);
>   System.out.println("TimestampOut: " + timestampOut + "

Re: English Mixed Chinese Substring

2017-12-18 Thread Jeszy
Hello Carl,

There's no UTF-8 support in Impala yet, but you could write your own
UDF to handle it (or contribute a patch).
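(Once such a UDF is compiled into a shared library, registering it would look roughly
like this; the function name, library path, and symbol below are purely illustrative:

  CREATE FUNCTION utf8_substr(STRING, INT, INT) RETURNS STRING
  LOCATION '/user/impala/udfs/libutf8udf.so' SYMBOL='Utf8Substr';)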

Regards

On 18 December 2017 at 07:38, Dong Bo 董博  wrote:
> Hi Forks,
>
>
>
> Impala treats a Chinese character as 3 letters and an English character as 1 letter.
> Is there any setting to make it easy to substring a mixed string just like
> Hive does?
>
>
>
> Eg :  select substr('test测试test', 1 ,5 ) ,
>
> returns :  test测  ,  NOT : test�
>
>
>
>
>
> Thanks
>
> Carl


Re: Questions about Statestore and Catalogservice

2017-12-11 Thread Jeszy
Thanks for pointing out the docs issue! I opened IMPALA-6303 to track it.

On 10 December 2017 at 15:47, Lars Francke  wrote:
> Thank you Bharath & Dimitris!
>
> That answers all the questions I have right now, thank you so much for
> taking the time to write it up.
>
> Regarding the docs:
> 
>
>> The Impala component known as the catalog service relays the metadata
>> changes from Impala SQL statements to all the DataNodes in a cluster. It is
>> physically represented by a daemon process named catalogd; you only need
>> such a process on one host in the cluster. Because the requests are passed
>> through the statestore daemon, it makes sense to run the statestored and
>> catalogd services on the same host.
>
> Reading it again now it also says "DataNodes" which is not correct.
>
> Cheers,
> Lars
>
>
> On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada 
> wrote:
>>
>> Looks like a topic for dev@.
>>
>> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke 
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to understand how the communication between the components
>>> works.
>>>
>>> I understand that an impala daemon subscribes to the statestore. The
>>> statestore seems to have the concept of heartbeats and topics. But I'm not
>>> sure what topics are all about.
>>
>>
>> Statestore follows the standard pub-sub pattern where a publisher
>> publishes messages and subscribers subscribe to the messages/categories they
>> are interested in.  Like you mentioned, statestore is like a mediator
>> between the publishers and the subscribers.
>>
>> "Topic" is an abstraction that makes the content of these messages opaque
>> to the statestore. The publishers (like Catalog server for example)
>> serialize the messages (metadata for example) into a "Topic" to ship them to
>> the statestore which then broadcasts that to the interested subscribers
>> (coordinators). The coordinators then unpack/deserialize the topic into the
>> corresponding object classes (like Tables/Functions etc.) and apply those
>> updates locally.
>>
>> In Impala, currently we have the following topics:
>>
>> catalog-update - For Catalog metadata
>> impala-membership - For tracking liveness of the coordinators/executors
>> impala-request-queue -  For admission control
>>
>> You can see these in the statestore web UI (/topics page)
>>
>>>
>>>
>>> The docs also say that only the statestore communicates with the catalog
>>> service. How does that happen?
>>
>>
>> Can you point us to which doc you are referring to here?
>>
>> Technically speaking, the coordinators also connect to the Catalog service
>> for executing DDLs, but I'm assuming you are speaking here in terms of the
>> broadcast of the table updates, in which case Catalog sends those tables to
>> the statestore (as a part of catalog-update topic) and those are broadcast
>> by the statestore to all the coordinators. (described above)
>>
>> How is an INVALIDATE/REFRESH statement routed from a daemon to the catalog
>> service and back?
>>
>> I'll take the example of REFRESH here.  The metadata flow looks something
>> like this
>>
>> - coordinator 'coo' gets 'refresh foo'
>> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
>> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
>> v1 to v2 (Internally Catalog versions all the objects to track which objects
>> changed over time)
>> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
>> result of RPC) which then applies the update locally.
>> - Additionally 'cat' also has a thread running in the background that
>> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
>> into a "Topic" update and sends it to the statestore.
>> - Statestore then broadcasts the new updates to all the coordinators.
>>
>> INVALIDATE is slightly different in the sense that the coordinator doesn't
>> get foo(v2) back as the result of the rpc, instead it gets an
>> "IncompleteTable" (Impala terminology) which means that the table is either
>> missing the catalog metadata/it has been invalidated.
>>
>> There are many minor details on how the entire system works but "most"
>> Catalog updates work as above (with some exceptions).
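At the SQL level, the two statements the flows above describe are simply (reusing
the example table name 'foo'):

  REFRESH foo;              -- reload file/partition metadata for a table Impala already knows
  INVALIDATE METADATA foo;  -- drop the cached metadata; it is reloaded lazily on next access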
>>
>>>
>>> I'm sure I'll have follow-up questions but this would already be very
>>> helpful. Thank you!
>>
>>
>> Sure, feel free to ask the list. Here are some code pointers incase you
>> are interested.
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
>> (Topic/TopicEntry and other SS  abstractions)
>>
>>
>> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
>> (thrift definitions for most Catalog operations)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
>> (how coordinators apply the Catalog updates)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
>> (An

Re: invalidate metadata behaviour

2017-11-29 Thread Jeszy
Hey,

On 29 November 2017 at 11:12, Antoni Ivanov  wrote:
> Thanks,
> I hope you don't mind a few more questions:
>
>> Node2 would also eventually consider these invalidated
> - How exactly does it work? E.g. when I issue INVALIDATE METADATA, does it tell
> the catalogd to invalidate metadata, or is this information broadcast
> through the statestored?

That's right, it's invalidated on the catalogd first, then propagated
to the daemons through statestore.

>> stored in the catalog daemon centrally
> - Oh so metadata is stored in the catalogD . I thought it was stored only in 
> the statestore (and cached in each ImpalaD) and catalog facilitate fetching 
> metadata from Hive Metastore and Block information from HDFS Namenode.
> What was I wrong ?

Statestore is only responsible for pushing the catalog changes to the
impala daemons, catalogd is the central store. I'm not 100% sure about
how the catalog deltas are stored in the statestore so I'll let others
comment.

> - Does INVALIDATE METADATA have any impact on the Hive Metastore? I don't
> believe so, right? E.g. instead of running INVALIDATE METADATA (say after an HDFS
> rebalance) I can restart Impala to clear caches (including the statestore catalog
> topic) so that new data is loaded lazily again.

A global 'invalidate metadata' (i.e. not a table-specific one) will
have the same impact on the HMS as a catalog / service restart would.
Impala will have to fetch a list of tables in both cases, so there is
some work on the HMS side, but it's miniscule. It matters when HMS is
unreachable, for example.
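(I.e., the difference between the two forms; the table name is just an example:

  INVALIDATE METADATA;           -- global: discards all cached metadata and re-lists tables from HMS
  INVALIDATE METADATA my_table;  -- table-specific: discards cached metadata for one table only)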

One addition: I just noticed that earlier you said that most of the
2000 tables have far fewer partitions than 5000, so YMMV. Partition
and file count will directly impact metadata size, the less you have,
the better (metadata-wise, at least).

Regards

> -Antoni
>
> -Original Message-
> From: Jeszy [mailto:jes...@gmail.com]
> Sent: Wednesday, November 29, 2017 9:56 AM
> To: user@impala.apache.org
> Cc: u...@impala.incubator.apache.org
> Subject: Re: invalidate metadata behaviour
>
> Hey Antoni,
>
> On 29 November 2017 at 07:42, Antoni Ivanov  wrote:
>> Hi,
>>
>>
>>
>> I am wondering if I run INVALIDATE METADATA for the whole database on
>> node1
>>
>> Then I ran a query on node2 – would the query on node2 used the cached
>> metadata for the tables or it would know it’s invalidated?
>
> Node2 would also eventually consider these invalidated.
>
>> And second how safe it is to run it for a database with many (say 30)
>> tables over 10,000 partitions and 2000 more under 5000 partitions
>> (most of the under 100)
>>
>> And each Impala Deamon node has a little (below Cloudera recommended)
>> memory
>> (32G)
>
> These numbers influence the size of the catalog cache, which is stored in the 
> catalog daemon centrally, and then replicated on each impalad, or on each 
> coordinator in more recent versions. The metadata you mention (2000 tables * 
> 5000 partitions each, plus the big tables) is in the 10 million partitions 
> range. Each of those will have at least one file with 3 blocks, probably 
> more, so all this adds up to a sizeable metadata. The cached version will 
> require a large amount of memory (on the catalog as well as the 
> daemons/coordinators), which could easily lead to even small queries running 
> out of memory with only 32gb.
>
>> Thanks,
>>
>> Antoni
>
> HTH


Re: invalidate metadata behaviour

2017-11-28 Thread Jeszy
Hey Antoni,

On 29 November 2017 at 07:42, Antoni Ivanov  wrote:
> Hi,
>
>
>
> I am wondering if I run INVALIDATE METADATA for the whole database on node1
>
> Then I ran a query on node2 – would the query on node2 use the cached
> metadata for the tables, or would it know it's invalidated?

Node2 would also eventually consider these invalidated.

> And second, how safe is it to run it for a database with many tables: say 30 tables
> with over 10,000 partitions and 2000 more with under 5000 partitions (most of them
> under 100)?
>
> And each Impala Daemon node has little (below Cloudera recommended) memory
> (32G)

These numbers influence the size of the catalog cache, which is stored
in the catalog daemon centrally, and then replicated on each impalad,
or on each coordinator in more recent versions. The metadata you
mention (2000 tables * 5000 partitions each, plus the big tables) is
in the 10 million partitions range. Each of those will have at least
one file with 3 blocks, probably more, so all this adds up to a
sizeable metadata. The cached version will require a large amount of
memory (on the catalog as well as the daemons/coordinators), which
could easily lead to even small queries running out of memory with
only 32gb.

> Thanks,
>
> Antoni

HTH


Re: Any plans for approximate topN query?

2017-11-28 Thread Jeszy
Hello Jason,

IMPALA-5300 (https://issues.apache.org/jira/browse/IMPALA-5300) is in
the works, and I think it fits your use case. Can you take a look?

Thanks!

On 28 November 2017 at 15:11, Jason Heo  wrote:
> Hi,
>
> I'm wondering impala team has any plans for approximate topN for single
> dimension.
>
> My web analytics system mostly serves top-n URLs. Such a "GROUP BY url ORDER
> BY pageview LIMIT n" query is slow, especially for a high-cardinality field.
> Approximate topN can be used instead of GROUP BY for a single dimension with
> much lower latency.
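(Spelled out, the query shape in question is something like the following; the table
and column names are illustrative only:

  SELECT url, count(*) AS pageview
  FROM pageviews
  GROUP BY url
  ORDER BY pageview DESC
  LIMIT 100;)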
>
> Elasticsearch, Druid, and ClickHouse already provide this feature.
>
> It would be great if I could use it on Impala.
>
> Thanks.
>
> Regards,
>
> Jason