How to display column names in spark-sql output

2015-12-11 Thread Ashwin Shankar
Hi,
When we run spark-sql, is there a way to get column names/headers with the
result?
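
(In case it helps frame the question: Hive's CLI has a hive.cli.print.header
setting, and I'm wondering whether spark-sql honors it, e.g. something like

$ spark-sql --hiveconf hive.cli.print.header=true

or, from inside the spark-sql shell,

SET hive.cli.print.header=true;

but I haven't been able to confirm that this is the intended way.)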

-- 
Thanks,
Ashwin


Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
All,
I was wondering if any of you have solved this problem :

I have pyspark (IPython mode) running in a Docker container, talking to
a YARN cluster (the AM/executors are NOT running in Docker).

When I start pyspark in the Docker container, it binds to port 49460.

Once the app is submitted to YARN, the app (AM) on the cluster side fails
with the following error message:
ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense: the AM is trying to talk to the container directly, which
it cannot do; it should be talking to the Docker host instead.

Question: how do we make the Spark AM talk to host1:port1 on the Docker host
(not the container), which would then route traffic to the container that is
running pyspark on host2:port2?

One solution I could think of: after starting the driver (say on
hostA:portA), and before submitting the app to YARN, we could reset the
driver's advertised host/port to the host machine's IP/port. The AM would
then talk to the host machine's IP/port, which would be mapped to the
container.
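
A rough sketch of what I have in mind (the image name, host IP, and port
numbers below are placeholders, not something I've verified):

# on the docker host: publish a fixed driver port into the container
$ docker run -p 49460:49460 <pyspark-image>

# inside the container: advertise the docker host's address/port to YARN
$ pyspark --conf spark.driver.host=<docker-host-ip> \
          --conf spark.driver.port=49460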

Thoughts ?
-- 
Thanks,
Ashwin


How to pass system properties in spark ?

2015-06-03 Thread Ashwin Shankar
Hi,
I'm trying to use property substitution in my log4j.properties, so that
I can choose where to write Spark logs at runtime.
The problem is that a system property passed to spark-shell
doesn't seem to be getting propagated to log4j.

Here is log4j.properties (partial) with a parameter 'spark.log.path':
log4j.appender.logFile=org.apache.log4j.FileAppender
log4j.appender.logFile.File=${spark.log.path}
log4j.appender.logFile.layout=org.apache.log4j.PatternLayout
log4j.appender.logFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Here is how I pass the 'spark.log.path' variable on the command line:
$ spark-shell --conf spark.driver.extraJavaOptions=-Dspark.log.path=/tmp/spark.log

I also tried:
$ spark-shell -Dspark.log.path=/tmp/spark.log

Result: /tmp/spark.log does not get created when I run Spark.

Any ideas why this is happening?

When I enable log4j debug, I see the following:
log4j: Setting property [file] to [].
log4j: setFile called: , true
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException:  (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
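
In case it's relevant: the next thing I plan to try is passing the property
straight to the driver JVM at launch instead of via --conf (just a guess on
my part that log4j reads the property before the --conf value is applied):

$ spark-shell --driver-java-options "-Dspark.log.path=/tmp/spark.log"

or

$ SPARK_SUBMIT_OPTS="-Dspark.log.path=/tmp/spark.log" spark-shell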

-- 
Thanks,
Ashwin


Spark on Yarn : Map outputs lifetime ?

2015-05-12 Thread Ashwin Shankar
Hi,
In Spark on YARN, when running spark_shuffle as an auxiliary service on the
node manager, do the map outputs (spills) of a stage get cleaned up once the
next stage completes, or are they preserved until the app completes (i.e.,
until all stages have completed)?

-- 
Thanks,
Ashwin


Building spark targz

2014-11-12 Thread Ashwin Shankar
Hi,
I just cloned Spark from GitHub and I'm trying to build it to generate a
tarball.
I'm running: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
-DskipTests clean package

Although the build is successful, I don't see the targz generated.

Am I running the wrong command ?
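
For what it's worth, the alternative I plan to try is the make-distribution.sh
script in the repo root, which I believe is what actually produces the .tgz;
the flags below are my guess at the equivalent of the mvn command above:

$ ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive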

-- 
Thanks,
Ashwin


Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
Yes, I'm looking at assembly/target. I don't see the tarball.
I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar, classes,
test-classes, maven-shared-archive-resources, and spark-test-classpath.txt.

On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Just making sure but are you looking for the tar in assembly/target dir ?

 On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar ashwinshanka...@gmail.com
  wrote:

 Hi,
 I just cloned spark from the github and I'm trying to build to generate a
 tar ball.
 I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
 -DskipTests clean package

 Although the build is successful, I don't see the targz generated.

 Am I running the wrong command ?

 --
 Thanks,
 Ashwin






-- 
Thanks,
Ashwin


Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Hi Spark devs/users,
One of the things we are investigating here at Netflix is whether Spark would
suit us for our ETL needs, and one of our requirements is multi-tenancy.
I did read the official doc
http://spark.apache.org/docs/latest/job-scheduling.html and the book, but
I'm still not clear on certain things.

Here are my questions:
1. Sharing a Spark context: How exactly can multiple users share the cluster
using the same Spark context? UserA wants to run AppA, UserB wants to run
AppB. How do they talk to the same context? How exactly are each of their
jobs scheduled and run in the same context?
   Is preemption supported in this scenario? How are user names passed on
to the Spark context?

2. Different Spark contexts in YARN: assume I have a YARN cluster with
queues and preemption configured. Are there problems if executors/containers
of a Spark app are preempted to allow a higher-priority Spark app to
execute? Would the preempted app get stuck, or would it continue to
make progress? How are user names passed on from Spark to YARN (say I'm
using the nested user queues feature in the fair scheduler)?

3. Can RDDs be shared in scenarios 1 and 2 above?

4. Is there anything else I should know about user/job isolation?

I know I'm asking a lot of questions. Thanks in advance :) !

-- 
Thanks,
Ashwin
Netflix


Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Thanks Marcelo, that was helpful! I had some follow-up questions:

That's not something you might want to do usually. In general, a
 SparkContext maps to a user application

My question was basically this: the
http://spark.apache.org/docs/latest/job-scheduling.html page in the
official doc, under the 'Scheduling within an application' section, talks
about multi-user fair sharing within an app. How does multi-user use within
an application work (how do users connect to an app and run their stuff)?
When would I want to use this?
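
For concreteness, my current (possibly wrong) reading of that section is that
all users would go through one SparkContext, with each user's requests
running in a separate thread that is tagged with a scheduler pool, roughly
like this in pyspark (the pool name "userA" and the input path are made up
for illustration, and this assumes spark.scheduler.mode=FAIR is set):

# 'sc' is the SparkContext created by the pyspark shell / the shared app
sc.setLocalProperty("spark.scheduler.pool", "userA")  # jobs from this thread go to pool "userA"
sc.textFile("/some/input").count()                    # scheduled fairly against other pools
sc.setLocalProperty("spark.scheduler.pool", None)     # detach this thread from the pool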

As far as I understand, this will cause executors to be killed, which
 means that Spark will start retrying tasks to rebuild the data that
 was held by those executors when needed.

I basically wanted to find out whether there are any gotchas related to
preemption in Spark. For example, if half of an application's executors get
preempted, say while doing a reduceByKey, will the application still make
progress with the remaining resources/fair share?

I'm new to Spark, sorry if I'm asking something very obvious :).

Thanks,
Ashwin

On Wed, Oct 22, 2014 at 12:07 PM, Marcelo Vanzin van...@cloudera.com
wrote:

 Hi Ashwin,

 Let me try to answer to the best of my knowledge.

 On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
 ashwinshanka...@gmail.com wrote:
  Here are my questions :
  1. Sharing spark context : How exactly multiple users can share the
 cluster
  using same spark
  context ?

 That's not something you might want to do usually. In general, a
 SparkContext maps to a user application, so each user would submit
 their own job which would create its own SparkContext.

 If you want to go outside of Spark, there are projects which allow you
 to manage SparkContext instances outside of applications and
 potentially share them, such as
 https://github.com/spark-jobserver/spark-jobserver. But be sure you
 actually need it - since you haven't really explained the use case,
 it's not very clear.

  2. Different spark context in YARN: assuming I have a YARN cluster with
  queues and preemption
  configured. Are there problems if executors/containers of a spark app
  are preempted to allow a
  high priority spark app to execute ?

 As far as I understand, this will cause executors to be killed, which
 means that Spark will start retrying tasks to rebuild the data that
 was held by those executors when needed. Yarn mode does have a
 configurable upper limit on the number of executor failures, so if
 your job keeps getting preempted it will eventually fail (unless you
 tweak the settings).

 I don't recall whether Yarn has an API to cleanly allow clients to
 stop executors when preempted, but even if it does, I don't think
 that's supported in Spark at the moment.

  How are user names passed on from spark to yarn(say I'm
  using nested user queues feature in fair scheduler) ?

 Spark will try to run the job as the requesting user; if you're not
 using Kerberos, that means the processes themselves will be run as
 whatever user runs the Yarn daemons, but the Spark app will be run
 inside a UserGroupInformation.doAs() call as the requesting user. So
 technically nested queues should work as expected.

  3. Sharing RDDs in 1 and 2 above ?

 I'll assume you don't mean actually sharing RDDs in the same context,
 but between different SparkContext instances. You might (big might
 here) be able to checkpoint an RDD from one context and load it from
 another context; that's actually how some HA-like features for Spark
 drivers are being addressed.

 The job server I mentioned before, which allows different apps to
 share the same Spark context, has a feature to share RDDs by name,
 also, without having to resort to checkpointing.

 Hope this helps!

 --
 Marcelo




-- 
Thanks,
Ashwin