Re: How to tune Livy for fast queries

2018-08-02 Thread Harsch, Tim
I've looked a little deeper and now see my error: those parameters are for the 
Python and Java clients (clearly).  I forgot there were clients in the code 
base.   Just wishful thinking on my part, I guess...


In any case, I'm still hoping to understand where Livy overhead on queries is 
coming from.



From: Harsch, Tim 
Sent: Thursday, August 2, 2018 8:28:58 AM
To: user@livy.incubator.apache.org
Subject: Re: How to tune Livy for fast queries


Thank you Saisai for your response.


I did have a chance to investigate further and I should give a little 
background on why I feel network cost is not the issue:
I added Livy to our application Kylo (http://kylo.io) as an optional Spark 
server that can be used as a replacement for our existing Spark server.  I 
noticed the performance issues when I use Livy instead of our pre-existing 
server.  Kylo's spark-shell would consistently execute queries quickly (e.g. 
<100ms), while the same queries would take longer (>1500ms) through Livy with a 
500ms polling interval (0ms initial query).  This led me to write Python code 
that polls Livy quickly (every 50ms) and to wrap the Scala code executed in Livy 
with a timer method that writes the time taken to the Livy logs.  I notice that 
my faster queries execute in Livy in <50ms, yet Livy does not return the results 
for at least 350ms (7 queries for results made, 6 returned to the client as 
pending).  I feel fairly confident that Livy has some overhead other than 
network.
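
A minimal sketch of the kind of fast polling loop described above, counting how 
many polls come back as pending before the result is available.  The function 
name, its parameters, and the injected fetch_state callable are hypothetical; 
in a real setup fetch_state would issue a GET against the Livy REST statement 
endpoint and return its "state" field.  A canned sequence of states stands in 
for the server here so the example runs without a cluster.

```python
import time
from typing import Callable, Tuple


def poll_until_available(
    fetch_state: Callable[[], str],
    interval_s: float = 0.05,
    timeout_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Poll until fetch_state() reports "available".

    Returns (total_polls, pending_polls). fetch_state is a stand-in for
    an HTTP call that returns the statement state (an assumption of this
    sketch, not code from the thread).
    """
    polls = 0
    pending = 0
    waited_s = 0.0
    while True:
        polls += 1
        state = fetch_state()
        if state == "available":
            return polls, pending
        if state in ("error", "cancelled"):
            raise RuntimeError(f"statement ended in state {state!r}")
        pending += 1
        if waited_s >= timeout_s:
            raise TimeoutError(f"no result after {polls} polls")
        sleep(interval_s)
        waited_s += interval_s


# Canned responses: six pending answers, then the result.
fake_states = iter(["running"] * 6 + ["available"])
polls, pending = poll_until_available(
    lambda: next(fake_states),
    sleep=lambda _seconds: None,  # skip real sleeping in the demo
)
print(polls, pending)
```

Logging the pending count next to the execution time measured inside Livy is 
one way to separate the time spent executing from the time spent waiting for 
the result to become visible.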


I've since discovered these settings in livy-client.conf.template:

# Initial interval before polling for Job results
# livy.client.http.job.initial-poll-interval = 100ms
# Maximum interval between successive polls
# livy.client.http.job.max-poll-interval = 5s

and I looked at the Livy source and noticed that it seems to use a geometric 
interval for polling:
https://github.com/cloudera/livy/blob/5de6cf21c61db4093646a23c65c37c8b52202dc8/client-http/src/main/java/com/cloudera/livy/client/http/JobHandleImpl.java#L266

I'm thinking that could be the source of my issue but I need a chance to dive 
deeper.  Do you think tuning those parameters could improve the situation?
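
To make the effect of a geometric polling interval concrete, here is a small 
simulation.  It is a sketch under assumptions: the growth factor of 2 is a 
guess (the thread only says the interval looks geometric), and the function 
names are hypothetical.  The 100ms initial interval and 5s cap are the 
defaults shown in the template above, which govern the bundled client's own 
polling.

```python
from typing import List


def poll_times_ms(initial_ms: float, max_ms: float, factor: float,
                  horizon_ms: float) -> List[float]:
    """Times (ms after submission) at which polls fire, up to horizon_ms."""
    times: List[float] = []
    t = 0.0
    interval = float(initial_ms)
    while True:
        t += interval          # wait the current interval, then poll
        if t > horizon_ms:
            break
        times.append(t)
        interval = min(interval * factor, max_ms)  # grow, capped at max
    return times


def first_poll_after(done_ms: float, initial_ms: float = 100.0,
                     max_ms: float = 5000.0, factor: float = 2.0) -> float:
    """Time of the first poll that fires at or after the job finishes."""
    t = 0.0
    interval = float(initial_ms)
    while True:
        t += interval
        if t >= done_ms:
            return t
        interval = min(interval * factor, max_ms)


schedule = poll_times_ms(100, 5000, 2.0, 2000)
print(schedule)
print(first_poll_after(50), first_poll_after(150))
```

Under these assumptions a job that finishes just after one poll is not seen 
until the next, so the gap between polls adds directly to the observed 
latency.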


Thanks,

Tim




From: Saisai Shao 
Sent: Wednesday, August 1, 2018 7:23:55 PM
To: user@livy.incubator.apache.org
Subject: Re: How to tune Livy for fast queries

Some network cost should probably also be counted. There is no configuration 
for tuning this. If you find a performance issue, you can create a JIRA or even 
submit a patch to fix Livy.

Harsch, Tim <tim.har...@teradata.com> wrote on Wednesday, August 1, 2018 at 
8:04 AM:

I have a Livy application that I'm trying to tune, as I'm seeing a 
performance issue when the queries are fast.  I've wrapped my queries with a 
timer that logs the time taken.  The Spark code executed typically takes 50ms 
to 150ms.  I'm querying Livy every 500ms looking for my response, and generally 
it doesn't succeed until the third check.   It seems Livy itself is spending up 
to an extra 1000ms.  Where is Livy spending this time?  Are there any tuning 
parameters I can adjust?
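
The arithmetic behind that estimate can be sketched as follows.  The function 
name is hypothetical, and the sketch assumes the first check happens at 0ms 
and ignores network round-trip time.  If the result first shows up on a given 
check, it became visible somewhere between the previous check and that one, 
which bounds the time not explained by the Spark execution itself.

```python
from typing import Tuple


def overhead_bounds_ms(success_check: int, interval_ms: float,
                       exec_ms: float,
                       first_check_ms: float = 0.0) -> Tuple[float, float]:
    """Bounds on latency not explained by execution time.

    success_check is 1-based: 3 means the third check returned the result.
    """
    seen_at = first_check_ms + (success_check - 1) * interval_ms
    prev = seen_at - interval_ms if success_check > 1 else 0.0
    # The result became visible in the window (prev, seen_at].
    low = max(prev - exec_ms, 0.0)
    high = max(seen_at - exec_ms, 0.0)
    return low, high


slow = overhead_bounds_ms(3, 500, 150)  # 150ms query, seen on third check
fast = overhead_bounds_ms(3, 500, 50)   # 50ms query, seen on third check
print(slow, fast)
```

With a 500ms interval the bounds are wide, which is why polling at a shorter 
interval gives a much tighter measurement of where the time goes.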


Also, I am having difficulty changing any of the settings in livy-client.conf.  
I placed the file in /etc/hadoop/conf and livy/conf folder but my settings seem 
to get ignored.


Thanks

Tim


Security Questions

2018-08-02 Thread Harun Zengin
Hi,
we are trying to build a setup in which a server submits jobs
from different users to the Livy server via the REST API. We set up a
Kerberos server to authenticate against Livy, with one superuser that
makes the requests on behalf of the users. But we want to prevent
users from accessing other users' data, the filesystem, and the
network.
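
As an illustration of a superuser submitting on behalf of an end user, the
sketch below builds the JSON body for a session-creation request that names
the end user in the proxyUser field of the Livy REST API. The helper name is
hypothetical, and the sketch assumes impersonation is enabled on the Livy
server and permitted for the superuser on the Hadoop side. It only constructs
the payload; sending it with Kerberos authentication is left out.

```python
import json
from typing import Any, Dict


def session_request_body(end_user: str, kind: str = "spark") -> Dict[str, Any]:
    """Build the body for creating a session that runs as end_user.

    Hypothetical helper: only the "kind" and "proxyUser" fields are set,
    and whether impersonation is honored depends on server configuration.
    """
    if not end_user:
        raise ValueError("end_user must be a non-empty user name")
    return {"kind": kind, "proxyUser": end_user}


body = session_request_body("alice")  # "alice" is a made-up user name
print(json.dumps(body, sort_keys=True))
```

Running the session as the end user lets the cluster's own file permissions
apply to that user's jobs, but it does not by itself restrict what the code
can do on the host or the network.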

My question, then, is: how secure is Livy? Users can inject custom
code to run on Livy, but this gives them the ability to access the
filesystem on the host where the Livy server resides. Even if we run Livy
as a different Unix user that has very few permissions on the
filesystem, that could be dangerous from my point of view:
they could potentially also access the keytab on the Livy server, and
they could potentially inject malware and run it.

I know that creating a session also creates a JVM, so each session lives
in its own JVM, and it is impossible to see another session without having
the Kerberos ticket. But could I change the security settings of that JVM so
that it can access only specific paths and specific IP addresses? Would that
require me to change the Livy source code?

And if we use HDFS with Active Directory to secure the
data system, so that users need to present a Kerberos key to access their
files, how could I manage multiple principals on one server to get this
working?

Any help with any of these questions is very much appreciated.

Thanks in advance,

Harun