How to display column names in spark-sql output
Hi,
When we run spark-sql, is there a way to get the column names/headers along with the result?
--
Thanks,
Ashwin
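One possible direction (a sketch only, not verified here): the spark-sql shell wraps the Hive CLI driver, so the usual Hive CLI header property may be what applies; the flag name could differ by version.

    # a sketch: ask the CLI to print column headers (property name assumed from the Hive CLI)
    $ spark-sql --hiveconf hive.cli.print.header=true

    -- or set it from within an already running spark-sql session
    spark-sql> SET hive.cli.print.header=true;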
Problem with pyspark on Docker talking to YARN cluster
All,
I was wondering if any of you have solved this problem: I have pyspark (IPython mode) running in Docker, talking to a YARN cluster (the AM/executors are NOT running in Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM) on the cluster side fails with the following error message:

    ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense, because the AM is trying to talk to the container directly and it cannot; it should be talking to the Docker host instead.

Question: How do we make the Spark AM talk to host1:port1 on the Docker host (not the container), which would then route to the container running pyspark on host2:port2?

One solution I could think of is: after starting the driver (say on hostA:portA), and before submitting the app to YARN, we could reset the driver's host/port to the host machine's ip/port. The AM could then talk to the host machine's ip/port, which would be mapped to the container.

Thoughts?
--
Thanks,
Ashwin
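One direction that matches the proposal above, sketched with made-up values (the Docker host IP, the published port, and the exact pyspark invocation are illustrative assumptions, not a tested recipe): publish a fixed driver port from the container and have Spark advertise the Docker host's address instead of the container's.

    # publish a fixed driver port when starting the container (port number is hypothetical)
    docker run -p 49460:49460 <image>

    # inside the container: advertise the Docker host's IP and bind the driver to the published port
    pyspark --master yarn-client \
      --conf spark.driver.host=<docker-host-ip> \
      --conf spark.driver.port=49460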
How to pass system properties in spark ?
Hi,
I'm trying to use property substitution in my log4j.properties so that I can choose where to write Spark logs at runtime. The problem is that a system property passed to spark-shell doesn't seem to get propagated to log4j.

Here is log4j.properties (partial) with a parameter 'spark.log.path':

    log4j.appender.logFile=org.apache.log4j.FileAppender
    log4j.appender.logFile.File=${spark.log.path}
    log4j.appender.logFile.layout=org.apache.log4j.PatternLayout
    log4j.appender.logFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Here is how I pass the 'spark.log.path' variable on the command line:

    $ spark-shell --conf spark.driver.extraJavaOptions=-Dspark.log.path=/tmp/spark.log

I also tried:

    $ spark-shell -Dspark.log.path=/tmp/spark.log

Result: /tmp/spark.log does not get created when I run Spark. Any ideas why this is happening?

When I enable log4j debug I see the following:

    log4j: Setting property [file] to [].
    log4j: setFile called: , true
    log4j:ERROR setFile(null,true) call failed.
    java.io.FileNotFoundException:  (No such file or directory)
            at java.io.FileOutputStream.open(Native Method)

--
Thanks,
Ashwin
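A hedged guess at a workaround (the flag and environment variable below are the standard spark-submit ones, but this hasn't been verified against the setup above): in client mode the driver JVM may already be launching before spark.driver.extraJavaOptions from --conf takes effect, so passing the -D flag through the launcher itself is worth trying.

    # pass the system property to the driver JVM via the launcher flag
    $ spark-shell --driver-java-options "-Dspark.log.path=/tmp/spark.log"

    # or via the environment variable the launch scripts read
    $ SPARK_SUBMIT_OPTS="-Dspark.log.path=/tmp/spark.log" spark-shell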
Spark on Yarn : Map outputs lifetime ?
Hi,
In Spark on YARN, when running spark_shuffle as an auxiliary service on the node managers, do the map outputs (shuffle spills) of a stage get cleaned up once the next stage completes, or are they preserved until the app completes (i.e., until all stages finish)?
--
Thanks,
Ashwin
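For context, this is roughly the setup being referred to (a sketch using the commonly documented property names, not copied from this cluster):

    # yarn-site.xml on each NodeManager, sketched as key/value pairs
    yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

    # spark-defaults.conf on the Spark side
    spark.shuffle.service.enabled  true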
Building spark targz
Hi,
I just cloned Spark from GitHub and I'm trying to build it to generate a tarball. I'm running:

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

Although the build is successful, I don't see the tar.gz generated. Am I running the wrong command?
--
Thanks,
Ashwin
Re: Building spark targz
Yes, I'm looking at assembly/target. I don't see the tarball; I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar, classes, test-classes, maven-shared-archive-resources, and spark-test-classpath.txt.

On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
> Just making sure, but are you looking for the tar in the assembly/target dir?
>
> On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar <ashwinshanka...@gmail.com> wrote:
>> Hi,
>> I just cloned Spark from GitHub and I'm trying to build it to generate a tarball. I'm running:
>>
>>     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
>>
>> Although the build is successful, I don't see the tar.gz generated. Am I running the wrong command?
>> --
>> Thanks,
>> Ashwin

--
Thanks,
Ashwin
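For anyone hitting the same thing, a hedged pointer: the Maven `package` goal only builds the assembly jar, while the distribution tarball is produced by the make-distribution.sh script shipped in the source tree (flags below mirror the mvn command above and may differ by version):

    # from the top of the Spark checkout
    ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive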
Multitenancy in Spark - within/across spark context
Hi Spark devs/users,
One of the things we are investigating here at Netflix is whether Spark would suit our ETL needs, and one of the requirements is multi-tenancy. I did read the official doc, http://spark.apache.org/docs/latest/job-scheduling.html, and the book, but I'm still not clear on certain things. Here are my questions:

1. Sharing spark context: How exactly can multiple users share the cluster using the same spark context? UserA wants to run AppA, UserB wants to run AppB. How do they talk to the same context? How exactly are each of their jobs scheduled and run in the same context? Is preemption supported in this scenario? How are user names passed on to the spark context? (The sketch after this mail illustrates the within-application fair-scheduling piece.)

2. Different spark contexts in YARN: assuming I have a YARN cluster with queues and preemption configured, are there problems if executors/containers of a Spark app are preempted to allow a higher-priority Spark app to execute? Would the preempted app get stuck, or would it continue to make progress? How are user names passed on from Spark to YARN (say I'm using the nested user queues feature in the fair scheduler)?

3. Sharing RDDs in 1 and 2 above?

4. Anything else about user/job isolation?

I know I'm asking a lot of questions. Thanks in advance :)!
--
Thanks,
Ashwin
Netflix
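On the within-application part of question 1, here is a minimal Scala sketch of fair sharing inside one SparkContext (pool names, app name, and paths are made up; how users would actually reach a shared context, e.g. through a job server, is a separate question):

    import org.apache.spark.{SparkConf, SparkContext}

    // one long-running context configured for fair scheduling across pools
    val conf = new SparkConf()
      .setAppName("shared-etl-context")          // hypothetical app name
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // the pool property is thread-local, so jobs submitted on behalf of
    // different users (each on their own thread) land in different pools
    sc.setLocalProperty("spark.scheduler.pool", "userA")
    sc.textFile("hdfs:///data/userA/input").count()    // hypothetical path

    sc.setLocalProperty("spark.scheduler.pool", "userB")
    sc.textFile("hdfs:///data/userB/input").count()    // hypothetical path

Pool definitions themselves (weights, minShare, per-pool scheduling mode) can go in a fairscheduler.xml referenced via spark.scheduler.allocation.file.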
Re: Multitenancy in Spark - within/across spark context
Thanks Marcelo, that was helpful! I had some follow-up questions:

> That's not something you might want to do usually. In general, a SparkContext maps to a user application

My question was basically this. On this page in the official doc, http://spark.apache.org/docs/latest/job-scheduling.html, under the "Scheduling within an application" section, it talks about multi-user and fair sharing within an app. How does multi-user within an application work (how do users connect to an app and run their stuff)? When would I want to use this?

> As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed.

I basically wanted to find out if there were any gotchas related to preemption on Spark. Things like: say half of an application's executors got preempted, say while doing a reduceByKey, will the application still make progress with the remaining resources/fair share?

I'm new to Spark, sorry if I'm asking something very obvious :).

Thanks,
Ashwin

On Wed, Oct 22, 2014 at 12:07 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hi Ashwin,
>
> Let me try to answer to the best of my knowledge.
>
> On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar <ashwinshanka...@gmail.com> wrote:
>> Here are my questions :
>> 1. Sharing spark context : How exactly can multiple users share the cluster using the same spark context ?
>
> That's not something you might want to do usually. In general, a SparkContext maps to a user application, so each user would submit their own job, which would create its own SparkContext.
>
> If you want to go outside of Spark, there are projects which allow you to manage SparkContext instances outside of applications and potentially share them, such as https://github.com/spark-jobserver/spark-jobserver. But be sure you actually need it - since you haven't really explained the use case, it's not very clear.
>
>> 2. Different spark context in YARN: assuming I have a YARN cluster with queues and preemption configured, are there problems if executors/containers of a spark app are preempted to allow a high priority spark app to execute ?
>
> As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed.
>
> Yarn mode does have a configurable upper limit on the number of executor failures, so if your job keeps getting preempted it will eventually fail (unless you tweak the settings). I don't recall whether Yarn has an API to cleanly allow clients to stop executors when preempted, but even if it does, I don't think that's supported in Spark at the moment.
>
>> How are user names passed on from spark to yarn (say I'm using the nested user queues feature in the fair scheduler) ?
>
> Spark will try to run the job as the requesting user; if you're not using Kerberos, that means the processes themselves will be run as whatever user runs the Yarn daemons, but the Spark app will be run inside a UserGroupInformation.doAs() call as the requesting user. So technically nested queues should work as expected.
>
>> 3. Sharing RDDs in 1 and 2 above ?
>
> I'll assume you don't mean actually sharing RDDs in the same context, but between different SparkContext instances. You might (big might here) be able to checkpoint an RDD from one context and load it from another context; that's actually how some HA-like features for Spark drivers are being addressed.
>
> The job server I mentioned before, which allows different apps to share the same Spark context, also has a feature to share RDDs by name, without having to resort to checkpointing.
>
> Hope this helps!
>
> --
> Marcelo

--
Thanks,
Ashwin
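On the "configurable upper limit on the number of executor failures" mentioned above, the relevant YARN-mode setting is, to the best of my knowledge, the one below (the default scales with the executor count, and the value shown is only an example):

    # spark-defaults.conf sketch: tolerate more lost/preempted executors before failing the app
    spark.yarn.max.executor.failures   100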