Shuffle Files

2014-03-03 Thread Usman Ghani
Where on the filesystem does spark write the shuffle files?

Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Vector is an enhanced Array[Double]. You can compare it like Array[Double]. E.g., scala> val v1 = Vector(1.0, 2.0) v1: org.apache.spark.util.Vector = (1.0, 2.0) scala> val v2 = Vector(1.0, 2.0) v2: org.apache.spark.util.Vector = (1.0, 2.0) scala> val exactResult = v1.elements.sameElements(v2.ele
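Spelled out, the comparison Shixiong describes looks like this — a minimal self-contained sketch using plain Array[Double] values to stand in for the vectors' elements (o.a.s.u.Vector wraps an Array[Double], but Spark is not assumed on the classpath here):

```scala
// On the JVM, == on arrays compares references, not contents,
// so two equal-valued element arrays are "not equal" under ==.
// sameElements does the element-wise comparison instead.
object VectorEquality {
  def main(args: Array[String]): Unit = {
    val e1 = Array(1.0, 2.0) // stands in for v1.elements
    val e2 = Array(1.0, 2.0) // stands in for v2.elements

    println(e1 == e2)            // false: reference comparison
    println(e1.sameElements(e2)) // true: element-wise comparison
  }
}
```

This is why `v1 == v2` on two freshly built vectors prints false while `v1.elements.sameElements(v2.elements)` returns true.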

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
To be more precise, the difference depends on the de-serialization overhead from Kryo for your data structures. On Mon, Mar 3, 2014 at 8:21 PM, Koert Kuipers wrote: > yes, tachyon is in memory serialized, which is not as fast as cached in > memory in spark (not serialized). the difference really de

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
Yes, Tachyon is in-memory serialized, which is not as fast as cached in memory in Spark (not serialized). The difference really depends on your job type. On Mon, Mar 3, 2014 at 7:10 PM, polkosity wrote: > Thats exciting! Will be looking into that, thanks Andrew. > > Related topic, has anyone

Re: pyspark crash on mesos

2014-03-03 Thread Josh Rosen
Brad and I looked into this error and I have a few hunches about what might be happening. We didn't observe any failed tasks in the logs. For some reason, the Python driver is failing to acknowledge an accumulator update from a successfully-completed task. Our program doesn't explicitly use accu

Re: Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi Ognen/Mayur, Thanks for the reply, and it is good to know how easy it is to set up Spark on an AWS cluster. My situation is a bit different from yours: our company already has a cluster, and it really doesn't make much sense not to use it. That is why I have been "going through" this. I real

Re: filter operation in pyspark

2014-03-03 Thread Mayur Rustagi
Could be a number of issues.. maybe your csv is not allowing map tasks to be broken up, or the file is not process-node local.. how many tasks are you seeing in the Spark web UI for map & store data. are all the nodes being used when you look at the task level .. is the time taken by each task roughly equal o

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Mayur Rustagi
+1 Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Mar 3, 2014 at 4:10 PM, polkosity wrote: > Thats exciting! Will be looking into that, thanks Andrew. > > Related topic, has anyone had any experience running Spa

filter operation in pyspark

2014-03-03 Thread Mohit Singh
Hi, I have a csv file... (say "n" columns ) I am trying to do a filter operation like: query = rdd.filter(lambda x:x[1] == "1234") query.take(20) Basically this would return me rows with that specific value? This manipulation is taking quite some time to execute.. (if i can compare.. maybe slo
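The shape of that filter can be sketched locally against a plain collection (Scala is used here to match the other examples in this digest; the sample rows and the column value "1234" are illustrative, not from the original data):

```scala
// A local stand-in for rdd.filter(lambda x: x[1] == "1234"):
// split CSV lines into fields and keep rows whose second column
// equals "1234".
object CsvFilter {
  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "a,1234,x",
      "b,9999,y",
      "c,1234,z"
    ).map(_.split(","))

    val matched = rows.filter(r => r(1) == "1234")
    matched.foreach(r => println(r.mkString(",")))
    // prints:
    // a,1234,x
    // c,1234,z
  }
}
```

In Spark itself the filter is lazy: no work happens until an action such as take(20), so the time observed there includes reading and parsing the input plus task scheduling, not just the predicate.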

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread polkosity
That's exciting! Will be looking into that, thanks Andrew. Related topic: has anyone had any experience running Spark on the Tachyon in-memory filesystem, and could offer their views on using it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initializati

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced? I think it's very similar to what you're proposing with a REST API and re-using a SparkContext. https://github.com/apache/incubator-spark/pull/222 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server On Mon, Mar

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread polkosity
We're thinking of creating a Spark job server with a REST API, which would enable us to manage jobs and re-use the Spark context as you suggest. Thanks Koert! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Sandy Ryza
Are you running in yarn-standalone mode or yarn-client mode? Also, which YARN scheduler, and what NodeManager heartbeat interval? On Sun, Mar 2, 2014 at 9:41 PM, polkosity wrote: > Thanks for the advice Mayur. > > I thought I'd report back on the performance difference... Spark > standalone > mode has e

Re: Missing Spark URL after starting the master

2014-03-03 Thread Ognen Duzlevski
I should add that in this setup you really do not need to look for the printout of the master node's IP - you set it yourself a priori. If anyone is interested, let me know, I can write it all up so that people can follow some set of instructions. Who knows, maybe I can come up with a set of sc

Re: Missing Spark URL after starting the master

2014-03-03 Thread Ognen Duzlevski
I have a standalone Spark cluster running in an Amazon VPC that I set up by hand. All I did was provision the machines from a common AMI image (my underlying distribution is Ubuntu), I created a "sparkuser" on each machine and I have a /home/sparkuser/spark folder where I downloaded Spark. I did

Re: Missing Spark URL after starting the master

2014-03-03 Thread Mayur Rustagi
I think you have been through enough :). Basically you have to download spark-ec2 scripts & run them. It'll just need your amazon secret key & access key, start your cluster, install everything, create security groups & give you the url, just login & go ahead... Mayur Rustagi Ph: +1 (760) 203 3257

o.a.s.u.Vector instances for equality

2014-03-03 Thread Oleksandr Olgashko
Hello. How should I best check two Vectors for equality? val a = new Vector(Array(1)) val b = new Vector(Array(1)) println(a == b) // false

Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi there, I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turned out it doesn't even have the run-example shell command in the bin folder. Then I removed it from the cluster and cloned incubator-spark onto the name node of my cluster, and buil

pyspark crash on mesos

2014-03-03 Thread bmiller1
Hi All, After switching from standalone Spark to Mesos I'm experiencing some instability. I'm running pyspark interactively through an IPython notebook, and get this crash non-deterministically (although pretty reliably in the first 2000 tasks, often much sooner). Exception in thread "DAGScheduler"

Blog : Why Apache Spark is a Crossover Hit for Data Scientists

2014-03-03 Thread Sean Owen
I put together a little opinion piece on why Spark is cool for data science. There is, I think, a nice example of using ALS with Stack Overflow tags in here too. Hope Spark folks out there might enjoy... http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need quick responses, re-use your Spark context between queries and cache RDDs in memory. On Mar 3, 2014 12:42 AM, "polkosity" wrote: > Thanks for the advice Mayur. > > I thought I'd report back on the performance difference... Spark > standalone > mode has executors processing at capacity i

Re: Beginners Hadoop question

2014-03-03 Thread goi cto
Thanks. I will try it! On Mon, Mar 3, 2014 at 1:19 PM, Alonso Isidoro Roman wrote: > Hi, i am a beginner too, but as i have learned, hadoop works better with > big files, at least with 64MB, 128MB or even more. I think you need to > aggregate all the files into a new big one. Then you must copy

Re: Beginners Hadoop question

2014-03-03 Thread Mohit Singh
Not sure whether I understand your question correctly or not? If you are trying to use Hadoop (as in the MapReduce programming model), then basically you would have to use Hadoop APIs to solve your problem. But if you have data stored in HDFS, and you want to use Spark to process that data, then jus

Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as I have learned, Hadoop works better with big files, at least 64MB, 128MB or even more. I think you need to aggregate all the files into a new big one. Then you must copy it to HDFS using this command: hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE hadoop just c

Beginners Hadoop question

2014-03-03 Thread goi cto
Hi, I am sorry for the beginner's question but... I have Spark Java code which reads a file (c:\my-input.csv), processes it and writes an output file (my-output.csv). Now I want to run it on Hadoop in a distributed environment. 1) Should my input file be one big file or separate smaller files? 2) if w

Problem with "delete spark temp dir" on spark 0.8.1

2014-03-03 Thread goi cto
Hi, I am running a Spark Java program on a local machine. When I try to write the output to a file (RDD.saveAsTextFile) I am getting this exception: Exception in thread "Delete Spark temp dir ..." This is running on my local Windows machine. Any ideas? -- Eran | CTO

Error: Could not find or load main class org.apache.spark.repl.Main on GitBash

2014-03-03 Thread goi cto
Hi, I am trying to run Spark-shell on GitBash on windows with Spark 0.9 I am getting "*Error: Could not find or load main class org.apache.spark.repl.Main*" I tried running sbt/sbt clean assembly which completed successfully but the problem still exist. Any other ideas? Which path variables shou

Re: OutOfMemoryError when loading input file

2014-03-03 Thread Yonathan Perez
Thanks for your answer yxzhao, but setting SPARK_MEM doesn't solve the problem. I also understand that setting SPARK_MEM is the same as calling SparkConf.set("spark.executor.memory",..) which I do. Any additional advice would be highly appreciated. -- View this message in context: http://apac