[ANNOUNCE] Welcoming two new Spark committers: Tom Graves and Prashant Sharma

2013-11-13 Thread Matei Zaharia
Hi folks, The Apache Spark PPMC is happy to welcome two new PPMC members and committers: Tom Graves and Prashant Sharma. Tom has been maintaining and expanding the YARN support in Spark over the past few months, including adding big features such as support for YARN security, and recently cont

Re: executor failures w/ scala 2.10

2013-11-13 Thread Matei Zaharia
meout if it associates again we >> keep moving else we shut down the executor. This timeout can of course be >> configurable. >> >> Thoughts ? >> >> >> On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia >> wrote: >> Hey Imran, >> >> Good

Re: problems with sbt

2013-11-12 Thread Matei Zaharia
It’s hard to tell, but maybe you’ve run out of space in your working directory? The assembly command will try to write stuff in assembly/target. Matei On Nov 11, 2013, at 2:54 PM, Umar Javed wrote: > I keep getting these IOException Permission denied errors when building with > sbt assembly:

Re: shark-shell not launching in cluster

2013-11-12 Thread Matei Zaharia
It might mean one of your JARs is corrupted. Try doing sbt clean and then sbt assembly again. Matei On Nov 12, 2013, at 10:48 AM, Josh Rosen wrote: > I've seen this "error: error while loading , error in opening zip file" > before, but I'm not exactly sure what causes it. Here's a JIRA discu

Re: spark.akka.threads recommendations?

2013-11-11 Thread Matei Zaharia
Actually it doesn’t matter a lot from what I’ve seen. Only do it if you see a lot of communication going to the master (these threads do the serialization of tasks). I’ve never put more than 8 or so. Matei On Nov 11, 2013, at 12:13 PM, Walrus theCat wrote: > Hi, > > The docs say that we shou

Re: Anyway to monitor the shuffling size?

2013-11-11 Thread Matei Zaharia
Yes, just look at the application UI on http://<driver>:4040 Matei On Nov 11, 2013, at 12:26 AM, Wenlei Xie wrote: > Hi, > > I have some shuffling task which is supposed to have many repeated values, > thus I assume shuffle compression would help the performance. > > However I get very similar ru

Re: Spark Summit agenda posted

2013-11-08 Thread Matei Zaharia
> Atte. > Rafael R. > > > > 2013/11/7 Matei Zaharia > Hi everyone, > > We're glad to announce the agenda of the Spark Summit, which will happen on > December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined > up, from 18 diffe

Re: Not caching rdds, spark.storage.memoryFraction setting

2013-11-08 Thread Matei Zaharia
Hi Grega, This memory is not taken away from the application in any way, so the setting doesn’t matter if you don’t use caching. You don’t need to configure it in any special way. Matei On Nov 8, 2013, at 8:01 AM, Grega Kešpret wrote: > Hi, > > The docs say: Fraction of Java heap to use for

Re: Where is reduceByKey?

2013-11-07 Thread Matei Zaharia
ode > verbatim that doesn't have the necessary import statements > > > On 11/7/2013 4:05 PM, Matei Zaharia wrote: >> Yeah, this is confusing and unfortunately as far as I know it’s API >> specific. Maybe we should add this to the documentation page for RDD. >&g

Re: Where is reduceByKey?

2013-11-07 Thread Matei Zaharia
Yeah, this is confusing and unfortunately as far as I know it’s API specific. Maybe we should add this to the documentation page for RDD. The reason for these conversions is to only allow some operations based on the underlying data type of the collection. For example, Scala collections support

Spark Summit agenda posted

2013-11-07 Thread Matei Zaharia
Hi everyone, We're glad to announce the agenda of the Spark Summit, which will happen on December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined up, from 18 different companies. Check out the agenda here: http://spark-summit.org/agenda/. This will be the biggest Spark even

Re: PMML support in spark

2013-11-07 Thread Matei Zaharia
Hi Pranay, I don’t think anyone’s working on this right now, but contributions would be welcome if this is a thing we could plug into MLlib. Matei On Nov 6, 2013, at 8:44 PM, Pranay Tonpay wrote: > Hi, > Wanted to know if PMML support in Spark is there in the roadmap for Spark… > PMML has b

Re: rdd.foreach doesn't act as expected

2013-11-06 Thread Matei Zaharia
In general, you shouldn’t be mutating data in RDDs. That will make it impossible to recover from faults. In this particular case, you got 1 and 2 because the RDD isn’t cached. You just get the same list you called parallelize() with each time you iterate through it. But caching it and modifying
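A minimal sketch of what this answer describes (assuming a SparkContext named sc; the buffer type and values are just for illustration):

    import scala.collection.mutable.ArrayBuffer

    val data = Seq(ArrayBuffer(1), ArrayBuffer(2))
    val rdd = sc.parallelize(data)
    rdd.foreach(buf => buf += 99)  // mutates task-local copies only
    rdd.collect()                  // rebuilt from `data`, so still [1] and [2]
    // After rdd.cache() and an action, later iterations would see the mutated
    // cached partitions -- which is exactly why mutating RDD data is unsafe.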

Re: executor failures w/ scala 2.10

2013-11-01 Thread Matei Zaharia
t; Thoughts ? > > > On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia wrote: > Hey Imran, > > Good to know that Akka 2.1 handles this — that at least will give us a start. > > In the old code, executors certainly did get flagged as “down” occasionally, > but that

Re: executor failures w/ scala 2.10

2013-11-01 Thread Matei Zaharia
> the disassociation events; what do we do to fix it? How can we diagnose the > problem, and figure out which of the configuration variables to tune? > clearly, there *will be* long gc pauses, and the networking layer needs to be > able to deal with them. > > still I under

Re: executor failures w/ scala 2.10

2013-10-31 Thread Matei Zaharia
rrect me if I am > wrong. > > > > On Fri, Nov 1, 2013 at 10:08 AM, Matei Zaharia > wrote: > It’s true that Akka’s delivery guarantees are in general at-most-once, but if > you look at the text there it says that they differ by transport. In the > previous ve

Re: executor failures w/ scala 2.10

2013-10-31 Thread Matei Zaharia
t; just had more robust defaults or something, but I bet it could still have the > same problems. Even before, I have seen the driver thinking there were > running tasks, but nothing happening on any executor -- it was just rare > enough (and hard to reproduce) that I never bothered lookin

Re: How to exclude a library from "sbt assembly"

2013-10-30 Thread Matei Zaharia
Looking at https://github.com/sbt/sbt-assembly, it seems you can add the following into extraAssemblySettings: assemblyOption in assembly ~= { _.copy(includeScala = false) } Matei On Oct 30, 2013, at 9:58 AM, Mingyu Kim wrote: > Hi, > > In order to work around the library dependency problem,
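As a sketch, in a 2013-era project/Build.scala using sbt-assembly 0.9.x (the setting itself comes from the message above; the surrounding scaffolding is illustrative):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    lazy val extraAssemblySettings = assemblySettings ++ Seq(
      assemblyOption in assembly ~= { _.copy(includeScala = false) }
    )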

Re: spark-0.8.0 and hadoop-2.1.0-beta

2013-10-29 Thread Matei Zaharia
ken, cmAddress) to > ConverterUtils.convertFromYarn(containerToken, cmAddress). > > Not 100% sure that my changes are correct. > > Hope that helps, > Viren > > > On Sun, Sep 29, 2013 at 8:59 AM, Matei Zaharia > wrote: > Hi Terence, > > YARN's API changed in an incompati

Re: Questions about the files that Spark will produce during its running

2013-10-28 Thread Matei Zaharia
The error is from a worker node -- did you check that /data2 is set up properly on the worker nodes too? In general that should be the only directory used. Matei On Oct 28, 2013, at 6:52 PM, Shangyu Luo wrote: > Hello, > I have some questions about the files that Spark will create and use duri

Re: Task output before a shuffle

2013-10-28 Thread Matei Zaharia
Hi Ufuk, Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written out before any stage that reads it can start. The main reason is simplicity when there are faults, as well as more flexible scheduling (you don't have to decide where each reduce task is in advance,

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
age it. So of course we develop features and optimizations as we see demand for them, but if there's a lot of demand for this, we can do it. Matei On Oct 28, 2013, at 5:51 PM, Matei Zaharia wrote: > FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY > ca

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY caching is the input to each reduce task. Those currently don't spill to disk. The solution if datasets are large is to add more reduce tasks, whereas Hadoop would run along with a small number of tasks that do lots of
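A sketch of the suggested fix, assuming an RDD of key-value pairs named pairs (the partition count of 200 is an arbitrary illustrative value):

    // More reduce tasks => smaller per-task input that has to fit in memory.
    val counts = pairs.reduceByKey((a, b) => a + b, 200)
    // groupByKey, join, etc. accept the same optional numPartitions argument.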

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
Hi Philip, Indeed, Spark's API allows direct creation of complex workflows the same way Cascading would. Cascading built that functionality on top of MapReduce (translating user operations down to a series of MapReduce jobs), but Spark's engine supports complex workflows from the start and the

Re: Compare with Storm

2013-10-27 Thread Matei Zaharia
Hey Howard, Great to hear that you're looking at Spark Streaming! > We have some in-house real-time streaming jobs written for Storm and want to > see the possibility to migrate to Spark Streaming in the future as our team > all think Spark is a very promising technology (one platform to exec

Re: accessing spark ui over an ssh tunnel

2013-10-25 Thread Matei Zaharia
Hey Stephen, SSH actually supports creating an HTTP proxy through the -D flag. Take a look at the -D option on our spark-ec2 script for example, which just exposes the -D option of ssh. With this feature you can do stuff like ssh -D 8088 <host> and then configure localhost:8088 as a proxy in your web

Re: understanding spark internals

2013-10-25 Thread Matei Zaharia
Hi Umar, The Spark wiki at https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage has a few pages on Spark internals (specifically the Python and Java APIs) and on how to build and contribute to Spark (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark). Hopefull

Re: Take last k elements from RDD?

2013-10-24 Thread Matei Zaharia
Are you doing this because it's sorted somehow, or you have a file where you want the last K? For that you could probably use the lower-level API of SparkContext.runJob() to run a job on just the last partition and then return the last elements from there. I'm just curious how general this need

Re: Failed to build Spark with YARN 2.2.0

2013-10-24 Thread Matei Zaharia
Yup, unfortunately YARN changed its API upon releasing 2.2, which puts us in an awkward position because all the major current users are on the old YARN API (from 0.23.x and 2.0.x) but new users will try this one. We'll probably change the default version in Spark 0.8.1 or 0.8.2. If you look on

Re: solution to write data to S3?

2013-10-23 Thread Matei Zaharia
at 18:28, Ayush Mishra wrote: > >> You can check >> http://blog.knoldus.com/2013/09/09/running-standalone-scala-job-on-amazon-ec2-spark-cluster/. >> >> >> On Thu, Oct 24, 2013 at 6:54 AM, Nan Zhu wrote: >> Great!!! >> >> >> On Wed, Oc

Re: solution to write data to S3?

2013-10-23 Thread Matei Zaharia
Yes, take a look at http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3 Matei On Oct 23, 2013, at 6:17 PM, Nan Zhu wrote: > Hi, all > > Is there any solution running Spark with Amazon S3? > > Best, > > Nan

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Matei Zaharia
em to run fine until you try something with 500GB of data > etc. > > I was wondering if you could write up a little white paper or some guide > lines on how to set memory values, and what to look at when something goes > wrong? E.g. I would never have guessed that countByValue happe

Re: Spark unit test question

2013-10-21 Thread Matei Zaharia
Yup, local mode also catches serialization errors. The issue with local variables in the function happens only if they're not Serializable, and even then, Spark's closure cleaner tries to eliminate references to them in some cases. But for example here's one thing that wouldn't work: class C {
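The preview cuts the example off at "class C {"; below is a sketch of the classic case this kind of example illustrates (not necessarily the exact code from the message): a method closure that references a field, which silently captures the whole non-serializable enclosing object.

    class C {                       // does NOT extend Serializable
      val factor = 2
      def scale(rdd: org.apache.spark.rdd.RDD[Int]) =
        rdd.map(x => x * factor)    // `factor` is really `this.factor`, so the
                                    // closure drags in the non-serializable C
    }
    // Workaround: copy the field into a local val first:
    //   def scale(rdd: RDD[Int]) = { val f = factor; rdd.map(x => x * f) }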

Re: Help with Initial Cluster Configuration / Tuning

2013-10-21 Thread Matei Zaharia
Hi there, The problem is that countByValue happens in only a single reduce task -- this is probably something we should fix but it's basically not designed for lots of values. Instead, do the count in parallel as follows: val counts = mapped.map(str => (str, 1)).reduceByKey((a, b) => a + b) If
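A short usage sketch of the quoted fix, assuming mapped is an RDD[String]:

    val counts = mapped.map(str => (str, 1)).reduceByKey((a, b) => a + b)
    counts.collect()                  // Array[(String, Int)], computed in parallel
    // or keep it distributed and write it out instead of collecting:
    counts.saveAsTextFile("hdfs://...")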

Re: Kryo Serializer

2013-10-19 Thread Matei Zaharia
This line here is the problem: >System.setProperty("spark.serializer", > "org.apache.spark.serializer.KryoRegistrator") It should say org.apache.spark.serializer.KryoSerializer, not Registrator. Matei >System.setProperty("spark.kryo.registrator", > classOf[EdgeWithIDRegistrator].getNam
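Put together, the corrected pair of properties (EdgeWithIDRegistrator is the registrator class from the quoted code):

    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", classOf[EdgeWithIDRegistrator].getName)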

Re: help on SparkContext.sequenceFile()

2013-10-18 Thread Matei Zaharia
.hadoop.io.Text] > [ERROR] Error occurred in an application involving default arguments. > [INFO] val rdd = sc.sequenceFile[org.apache.hadoop.io.Text, > org.apache.hadoop.io.BytesWritable](uri) > > > > On Fri, Oct 18, 2013 at 9:37 AM, Matei Zaharia > wrote: &

Re: help on SparkContext.sequenceFile()

2013-10-18 Thread Matei Zaharia
Don't worry about the implicit params, those are filled in by the compiler. All you need to do is provide a key and value type, and a path. Look at how sequenceFile gets used in this test: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=core/src/test/scala/spark/FileSuite.

Re: spark 0.8

2013-10-18 Thread Matei Zaharia
for hadoop 1.0.4, but the actual installed > version of spark is build against cdh4.3.0-mr1. this also used to work, and i > prefer to do this so i compile against a generic spark build. could this be > the issue? > > > On Thu, Oct 17, 2013 at 8:06 PM, Koert Kuipers wrot

Re: spark 0.8

2013-10-17 Thread Matei Zaharia
Koert, did you link your Spark job to the right version of HDFS as well? In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for your version of Hadoop. See http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala for example. Matei On Oct 17, 2
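As a sketch, the sbt equivalent of that Maven dependency (the CDH version string follows the one mentioned in this thread; match it to the HDFS you actually run, and add Cloudera's repository only if you use a CDH build):

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.3.0"
    resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"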

Re: pyspark memory usage

2013-10-17 Thread Matei Zaharia
Hi there, I'm not sure I understand your problem -- is it that Spark used *less* memory than the 2 GB? That out of memory message seems to be from your operating system, so maybe there were other things using RAM on that machine, or maybe Linux is configured to kill tasks quickly when the memor

Re: Kafka dependency issues

2013-10-17 Thread Matei Zaharia
rk Streaming dependency if the goal is to > keep size down and you don't want to confuse new adopters who aren't using > Kafka as part of their tech stack. > > -Ryan > > > On Sat, Oct 12, 2013 at 10:52 AM, Matei Zaharia > wrote: > Hi Ryan, > > Spark St

Re: Announcing the first Spark Summit, Mon Dec 2, 2013

2013-10-16 Thread Matei Zaharia
Hey folks, FYI, the talk submission deadline for this is October 25th. We've gotten a lot of great submissions already. If you'd like to submit one, go to http://www.spark-summit.org/submit/. It can be about anything -- projects you're doing with Spark, open source development within the project

Re: Suggested Filesystem Layout for Spark Cluster Node

2013-10-15 Thread Matei Zaharia
tings correctly in a Spark-on-Mesos > environment. Can you describe the differences for Mesos? > > Thanks again, > Craig > > > On Mon, Oct 14, 2013 at 6:15 PM, Matei Zaharia > wrote: > Hi Craig, > > The best configuration is to have multiple disks configured

Re: Suggested Filesystem Layout for Spark Cluster Node

2013-10-14 Thread Matei Zaharia
Hi Craig, The best configuration is to have multiple disks configured as separate filesystems (so no RAID), and set the spark.local.dir property, which configures Spark's scratch space directories, to be a comma-separated list of directories, one per disk. In 0.8 we've written a bit on how to c

Re: Spark & Spark Streaming, how to get started for local development?

2013-10-13 Thread Matei Zaharia
Hi Ryan, If you're only going to run in local mode, there's no need to package the app with sbt and pass a JAR. You can just run it straight out of your IDE. Matei On Oct 13, 2013, at 9:17 PM, Ryan Chan wrote: > Hi, > > Are there any guide on teaching how to get started for local rapid > de

Re: Spark REPL produces error on a piece of scala code that works in pure Scala REPL

2013-10-12 Thread Matei Zaharia
We're still not using macros in the 2.10 branch, so this issue will still happen there. We may do macros later but it's a fair bit of work so I wouldn't guarantee that it happens in our first 2.10 release. Matei On Oct 12, 2013, at 2:33 PM, Mark Hamstra wrote: > That's a TODO that is either n

Re: Output configuration

2013-10-12 Thread Matei Zaharia
Hi Alex, Unfortunately there seems to be something wrong with how the generics on that method get seen by Java. You can work around it by calling this with: plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed", "csv", String.class, String.class, (Class) TextOutputFormat.cla

Re: Kafka dependency issues

2013-10-12 Thread Matei Zaharia
Hi Ryan, Spark Streaming ships with a special version of the Kafka 0.7.2 client that we ported to Scala 2.9, and you need to add that as a JAR explicitly in your project. The JAR is in streaming/lib/org/apache/kafka/kafka/0.7.2-spark/kafka-0.7.2-spark.jar under Spark. The streaming/lib directo

Re: Write to HBase from spark job

2013-10-12 Thread Matei Zaharia
Hi Eugen, You should use saveAsHadoopDataset, to which you pass a JobConf object that you've configured with TableOutputFormat the same way you would for a MapReduce job. The saveAsHadoopFile methods are specifically for output formats that go to a filesystem (e.g. HDFS), but HBase isn't a file
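A sketch of that setup, assuming rows is an RDD of (rowKey, value) string pairs; the table and column family names are purely illustrative, and the classes are the standard HBase mapred API of that era:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf

    val conf = new JobConf(HBaseConfiguration.create())
    conf.setOutputFormat(classOf[TableOutputFormat])
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    conf.setOutputKeyClass(classOf[ImmutableBytesWritable])
    conf.setOutputValueClass(classOf[Put])

    val puts = rows.map { case (k, v) =>
      val put = new Put(Bytes.toBytes(k))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(v))
      (new ImmutableBytesWritable(Bytes.toBytes(k)), put)
    }
    puts.saveAsHadoopDataset(conf)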

Re: Spark 0.8.0 on Mesos 0.13.0 (clustered) : NoClassDefFoundError

2013-10-12 Thread Matei Zaharia
Hey, this seems to be a problem in the docs about how to set the executor URI. It looks like the SPARK_EXECUTOR_URI variable is not actually used. Instead, set the spark.executor.uri Java system property using System.setProperty("spark.executor.uri", "<URI of the Spark executor tarball>") before you create a SparkContext. Matei

Re: Execution time of spark job

2013-10-10 Thread Matei Zaharia
Take a look at the org.apache.spark.scheduler.SparkListener class. You can register your own SparkListener with SparkContext that listens for job-start and job-end events. Matei On Oct 10, 2013, at 9:04 PM, prabeesh k wrote: > Is there any way to get execution time in the program? > Actually
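A sketch of such a listener (the event-class names have shifted across Spark versions, so check the SparkListener trait in the version you build against):

    import org.apache.spark.scheduler._

    class JobTimer extends SparkListener {
      private var start = 0L
      override def onJobStart(jobStart: SparkListenerJobStart) {
        start = System.currentTimeMillis
      }
      override def onJobEnd(jobEnd: SparkListenerJobEnd) {
        println("Job took " + (System.currentTimeMillis - start) + " ms")
      }
    }

    sc.addSparkListener(new JobTimer)   // register before running jobs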

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-10 Thread Matei Zaharia
Yeah, Christopher answered this before I could, but you can list the directory in the driver nodes, find out all the filenames, and then use SparkContext.parallelize() on an array of filenames to split the set of filenames among tasks. After that, run a foreach() on the parallelized RDD and hav
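A sketch of that pattern, assuming every worker can reach the same storage (the shared filesystem path here is illustrative):

    // List the files on the driver, then fan the names out across tasks.
    val fileNames = new java.io.File("/shared/input").list().toSeq
    val names = sc.parallelize(fileNames, 20)   // 20 slices: illustrative
    names.foreach { name =>
      // open and process /shared/input/<name> here, inside the task
    }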

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-10 Thread Matei Zaharia
Hey, sorry, for this question, there's a similar answer to the previous one. You'll have to move the files from the output directories into a common directory by hand, possibly renaming them. The Hadoop InputFormat and OutputFormat APIs that we use are just designed to work at the level of dire

Re: Controlling the name of the output file

2013-10-10 Thread Matei Zaharia
Hi Ramkumar, I don't think there's a good way to give them different names other than opening and writing the files yourself. You could do that with a foreach(). For example, suppose you created an RDD of records (say (key, listOfValues)) and you wanted to save each one to a different file bas
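A sketch of that approach, assuming records is an RDD of (key, listOfValues) pairs as described, with an illustrative HDFS URI:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    records.foreach { case (key, values) =>
      val path = new Path("hdfs://namenode:8020/out/" + key)  // file named by key
      val fs = FileSystem.get(path.toUri, new Configuration())
      val out = fs.create(path)
      values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
      out.close()
    }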

Re: 0.8 in Maven Repo?

2013-10-09 Thread Matei Zaharia
Yes, the organization name just changed because we moved to Apache. Here's the right Maven info: http://spark.incubator.apache.org/downloads.html. Matei On Oct 9, 2013, at 5:25 PM, Erik Freed wrote: > Did the 0.8 release get into a maven repo? Did this change for apache status? > thanks! > Eri

Re: GPU-awareness

2013-10-09 Thread Matei Zaharia
Hi Patrick, This is indeed pretty application specific. While you could modify Spark to list GPUs and assign tasks to them, I think a simpler solution would be to manage use of GPUs at the application level. Create a static object GPUManager that lists the GPUs on each machine (somehow) and rec

Re: spark_ec2 script in 0.8.0 and mesos

2013-10-08 Thread Matei Zaharia
Hi Shay, We actually don't support Mesos in the EC2 scripts anymore -- sorry about that. If you want to deploy Mesos on EC2, I'd recommend looking at Mesos's own EC2 scripts. Then it's fairly easy to launch Spark on there. If you want to deploy Mesos locally you can go through the Spark docs fo

Re: Spark dependency library causing problems with conflicting versions at import

2013-10-07 Thread Matei Zaharia
se (we prefer to stick to official releases), > and > It's 33 commits behind master. > Are there plans to actively maintain this branch and eventually release it > officially? > > -Matt Cheah > > From: Matei Zaharia > Date: Monday, October 7, 2013 7:49 PM > To: &quo

Re: Spark dependency library causing problems with conflicting versions at import

2013-10-07 Thread Matei Zaharia
Hi Mingyu, The latest version of Spark works with Scala 2.9.3, which is the latest Scala-2.9 version. There's also a branch called branch-2.10 on GitHub that uses 2.10.3. What specific libraries are you having trouble with? > I see other open source projects private-namespacing the dependencies

Re: Roadblock with Spark 0.8.0 ActorStream

2013-10-04 Thread Matei Zaharia
the remote node from it. Hopefully one of these works. Anyway, thanks for bringing up this issue -- it's a confusing one and we should have a recommended solution for it. Matei On Oct 4, 2013, at 1:13 PM, Paul Snively wrote: > Hi Matei! > > On Oct 4, 2013, at 12:03 PM, Matei Z

Re: Roadblock with Spark 0.8.0 ActorStream

2013-10-04 Thread Matei Zaharia
Hi Paul, Just FYI, I'm not sure Akka was designed to pass ActorSystems across closures the way you're doing. Also, there's a bit of a misunderstanding about closures on RDDs. Consider this change you made to ActorWordCount: lines.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).for

Re: Sort order of RDD rows

2013-10-03 Thread Matei Zaharia
Yes, it is for these map-like operations. The only time when it isn't is when you change the RDD's partitioner, e.g. by doing sortByKey or groupByKey. It would definitely be good to document this more formally. Matei On Oct 3, 2013, at 3:33 PM, Mingyu Kim wrote: > Hi all, > > Is the sort ord

Re: Troubleshooting and how to interpret the logs

2013-10-03 Thread Matei Zaharia
Hi Ashish, Those "removing" messages mean that the node in question didn't communicate with your application for 45 seconds. Most likely the executor process on the node died, though there's also a chance that it was doing a super-long garbage collection or that there was a network problem. Loo

Re: Some questions about task distribution and execution in Spark

2013-10-02 Thread Matei Zaharia
Hi Shangyu, > (1) When we read in a local file by SparkContext.textFile and do some > map/reduce job on it, how will spark decide to send data to which worker > node? Will the data be divided/partitioned equally according to the number of > worker node and each worker node get one piece of data

Re: Building Spark 0.8

2013-10-02 Thread Matei Zaharia
Nope, I don't think it matters there. Matei On Oct 2, 2013, at 5:18 AM, Stuart Layton wrote: > Should shark 0.8 be built with sbt/sbt assembly as well? > > On Oct 2, 2013 1:32 AM, "Matei Zaharia" wrote: > Assembly packages all into one big JAR, which does a bett

Re: Building Spark 0.8

2013-10-01 Thread Matei Zaharia
Assembly packages everything into one big JAR, which does a better job of capturing only the needed dependencies and simplifies deployment. Package won't work anymore because all the scripts expect this JAR. Matei On Oct 1, 2013, at 8:34 PM, Stuart Layton wrote: > I noticed that the build instructio

Re: spark-0.8.0 and hadoop-2.1.0-beta

2013-09-29 Thread Matei Zaharia
Hi Terence, YARN's API changed in an incompatible way in Hadoop 2.1.0, so I'd suggest sticking with 2.0.x for now. We may create a different branch for this version. Unfortunately due to the API change it may not be possible to support this version while also supporting other widely-used versio

Re: Wrong result with mapPartitions example

2013-09-28 Thread Matei Zaharia
This was actually a bug in the parallelize() version for Python that should be fixed in Spark 0.8. It may also be fixed in 0.7.3. Matei On Sep 27, 2013, at 8:59 PM, Reynold Xin wrote: > It worked for me: > > a=[] > for i in range(0,1): >a.append(i) > > def f(iterator): yield sum(1 fo

Re: Spark does not build with latest version of Hadoop

2013-09-27 Thread Matei Zaharia
Hi Sergey, Because this was a breaking API change on YARN's part, I'd recommend just sticking with 2.0.x for now if possible. Otherwise, we'll likely add support for this, and remove support for older versions of YARN, in the next major version of Spark. Before that, it's possible that we can m

Re: take sample with replacement

2013-09-27 Thread Matei Zaharia
Hi Sebastian, I believe the reasoning was as follows. The actual number of times we expect an element to occur in sampling with replacement is given by the binomial distribution (http://en.wikipedia.org/wiki/Binomial_distribution), but for rare events this can be approximated with a Poisson dis
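In symbols (my notation, sketching the approximation the message describes, with n draws and per-element probability p):

    \Pr[X = k] = \binom{n}{k}\, p^k (1-p)^{n-k} \approx e^{-np}\,\frac{(np)^k}{k!},
    \qquad \text{for large } n \text{ and small } p, \text{ with rate } \lambda = np.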

Re: Scala 2.10?

2013-09-19 Thread Matei Zaharia
Hey Paul, 2.10 is definitely on our roadmap, and you can actually find a scala-2.10 branch in the repo that has a bunch of the changes done. However, as Mark said, it won't be in 0.8 mostly because we've had a lot of other changes in that release. One challenge for us is that we also make some

Re: job failing in standalone mode

2013-09-13 Thread Matei Zaharia
Have you looked at the stdout and stderr files created for the job on the worker nodes? By default they're in the "work" directory under SPARK_HOME. In my experience this either means no write permissions to the filesystem, or no Java found. Matei On Sep 12, 2013, at 10:59 PM, Vipul Pandey wr

Re: Deploy spark to ec2 eu-west-1

2013-09-12 Thread Matei Zaharia
Hi Han, The AMI in the master branch works with the version of the EC2 script there. Matei On Sep 12, 2013, at 11:21 AM, Han JU wrote: > Hi all, > > I'd like to deploy a spark 0.7.3 cluster on ec2 eu-west-1. > The ec2 script bundled in 0.7.3 cannot find the AMI. I tried to point to > the A

Re: sbt assembly question

2013-09-11 Thread Matei Zaharia
t sbt compile. Matei On Sep 11, 2013, at 7:21 PM, "Shao, Saisai" wrote: > Hi Matei, > > Thanks a lot. My colleague meets the same problem, so I’m just wondering > whether this command is so slow. I will try it on SSD or in-memory FS. > > Thanks > Jerry > &

Re: RDD.SaveAsObject will always use Java Serialization?

2013-09-11 Thread Matei Zaharia
Hi Wenlei, This was actually semi-intentional because we wanted a forward-compatible format across Spark versions. I'm not sure whether that was a good idea (and we didn't promise it will be compatible), so later we can change it. But for now, if you'd like to use Kryo, I recommend implementing

Re: sbt assembly question

2013-09-11 Thread Matei Zaharia
That's weird, it takes 30-60 seconds for me. If you can put this on an SSD or in-memory filesystem in any way that would help a lot. I have an SSD on my laptop. Matei On Sep 11, 2013, at 6:40 PM, "Shao, Saisai" wrote: > Hi all, > > Now Spark changes sbt package to sbt assembly, and class pa

Re: Checking which RDDs still might be cached?

2013-09-11 Thread Matei Zaharia
You can actually do SparkContext.getExecutorStorageStatus to get a list of stored blocks. These have a special name when they belong to an RDD, using that RDD's id field. But unfortunately there's no way to get this info from the RDD itself. Matei On Sep 11, 2013, at 4:52 PM, Dmitriy Lyubimov
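A sketch of that lookup (the field and block names follow the 0.8-era API as I understand it: RDD partitions are stored under block names like "rdd_<rddId>_<partition>", so treat the details as an assumption):

    val statuses = sc.getExecutorStorageStatus
    val myBlocks = statuses.flatMap(_.blocks.keys)
                           .filter(_.startsWith("rdd_" + myRdd.id + "_"))
    println(myBlocks.mkString("\n"))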

Re: Split RDD and save as separate files

2013-09-10 Thread Matei Zaharia
Hi Nicholas, Right now the best way to do this is probably to run foreach() on each value and then use the Hadoop FileSystem API directly to write a file. It has a pretty simple API based on OutputStreams: http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html. You just

Re: When using spark shell, classpath on workers does not seem to see all of my custom classes

2013-09-09 Thread Matei Zaharia
9.3-0.1-SNAPSHOT-assembly.jar > with timestamp 1378683857701 > > I can also confirm that the 'verrazano' jar (my custom one) is in a mesos > slave temp directory on all of the slave nodes. > > > > > On Sun, Sep 8, 2013 at 7:01 PM, Matei Zaharia wrote: > Whi

Re: When using spark shell, classpath on workers does not seem to see all of my custom classes

2013-09-08 Thread Matei Zaharia
Which version of Spark is this with? Did the logs print something about sending the JAR you added with ADD_JARS to the cluster? Matei On Sep 8, 2013, at 8:56 AM, Gary Malouf wrote: > I built a custom jar with among other things, nscalatime and joda time packed > inside of it. Using the ADD_J

[INPUT WANTED!] Spark user survey and Powered By page

2013-09-05 Thread Matei Zaharia
Hi folks, As we continue developing Spark, we would love to get feedback from users and hear what you'd like us to work on next. We've decided that a good way to do that is a survey -- we hope to run this at regular intervals. If you have a few minutes to participate, do you mind filling it in

Re: how to push a jar file to workers

2013-09-04 Thread Matei Zaharia
Hi Daniel, Either add this to the "jars" parameter of SparkContext (see http://spark.incubator.apache.org/docs/latest/quick-start.html), or use SparkContext.addJar. Those methods are preferable to SPARK_CLASSPATH. Sorry for the somewhat poor docs on this -- we added these methods later so some
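Both routes, sketched (paths and master URL are illustrative; the four-argument constructor matches the quick-start guide linked above):

    // Route 1: pass the JARs when creating the context:
    val sc = new SparkContext("spark://master:7077", "MyApp",
                              System.getenv("SPARK_HOME"),
                              Seq("target/my-job-assembly.jar"))

    // Route 2: add a JAR to an existing context:
    sc.addJar("target/extra-dep.jar")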

Re: Timezone Conversion Utilities

2013-09-04 Thread Matei Zaharia
t 10:21 AM, Gary Malouf wrote: > That's how I do it now, list is getting lengthy but we are automating the > retrieving of the jars and list build up in ansible. > > > On Wed, Sep 4, 2013 at 12:55 PM, Matei Zaharia > wrote: > Hi Gary, > > Just to be clear, i

Re: Timezone Conversion Utilities

2013-09-04 Thread Matei Zaharia
Hi Gary, Just to be clear, if you want to use third-party libraries in Spark (or even your own code), you *don't* need to modify SparkBuild.scala. Just pass a list of JARs containing your dependencies when you create your SparkContext. See http://spark.incubator.apache.org/docs/latest/quick-sta

Re: AmpCamp 3 blog posts

2013-09-03 Thread Matei Zaharia
Cool, thanks for this really detailed writeup! It's great that you're also covering how to set this up on your own. Regarding YouTube videos -- the group that recorded it is working on those, but I don't actually know the ETA yet. I'll let you know if I find out. Matei On Sep 3, 2013, at 11:47

Re:

2013-09-02 Thread Matei Zaharia
So I think the problem might be that BytesWritable.getBytes() can return an array bigger than the actual bytes used (see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BytesWritable.html#getBytes() ). It just returns a backing array that can be reused across records. Try using c
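The preview cuts off mid-suggestion; one safe way to materialize only the valid bytes (copyOfRange here is my stand-in for whatever the message went on to recommend, using getLength from the same javadoc):

    import java.util.Arrays

    // getBytes() may return a longer, reused backing array; copy only the
    // region that getLength() says is valid.
    val valid: Array[Byte] = Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)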

Re:

2013-09-01 Thread Matei Zaharia
What's your code for loading the SequenceFile? You may also want to check that you're using the right version of protobuf in Spark. Matei On Sep 1, 2013, at 10:52 AM, Gary Malouf wrote: > We are using Spark 0.7.3 compiled (and running) against Hadoop > 2.0.0-mr1-cdh4.2.1. > > When I read a

AMP Camp 3 video stream registration

2013-08-28 Thread Matei Zaharia
Hi everyone, As we've mentioned before, we're holding a 2-day training camp on Spark and related projects at Berkeley tomorrow and Friday: http://ampcamp.berkeley.edu/3/. A video stream will be available *for free* to anyone who wants to watch. If you'd like to watch it, please register before

Re: AMP Camp 3 cluster setup

2013-08-28 Thread Matei Zaharia
By the way, an important note: Make sure you *shut down* your cluster after using it. Otherwise, Amazon will keep charging you money for it! I've seen some people get caught by that in the past. For others following this list, it's probably fine to start the cluster tomorrow morning (Pacific ti

Re: Rolling window cache

2013-08-27 Thread Matei Zaharia
ry, and you won't get OutOfMemoryErrors. But we are the ones controlling when we unreference them, and the GC just picks up from there when it decides to clean stuff up. Matei > > Thanks, > Grega > > > On Wed, Aug 14, 2013 at 12:40 AM, Matei Zaharia > wrote: >

Re: sample data & code for performance tests

2013-08-24 Thread Matei Zaharia
Hi Mike, This project contains some small synthetic benchmarks: https://github.com/amplab/spark-perf. Otherwise, for ML algorithms, look in mllib -- it comes with driver programs for K-means, logistic regression, matrix factorization, etc, as well as data generators for them. Matei On Aug 23,

Re: jmx visualvm profiling

2013-08-22 Thread Matei Zaharia
What are the failures? Matei On Aug 22, 2013, at 2:57 PM, Aaron Babcock wrote: > Hi, > > Does anyone have any experience using jmx and visualvm instead of yourkit to > remotely profile spark workers. > > I tried the following in spark-env.sh but I get all kinds of failures when > workers sp

Re: Streaming JSON From S3?

2013-08-21 Thread Matei Zaharia
Hi Paul, On Aug 21, 2013, at 6:11 PM, Paul Snively wrote: >> Just to understand, are you trying to do a real-time application (which is >> what the streaming in Spark Streaming is for), or just to read an input file >> into a batch job? > > Well, it's an interesting case. I'm trying to take a

Re: Streaming JSON From S3?

2013-08-20 Thread Matei Zaharia
Hi Paul, Just to understand, are you trying to do a real-time application (which is what the streaming in Spark Streaming is for), or just to read an input file into a batch job? For the latter, you can pass an s3n:// URL to any of Spark's file input methods (e.g. SparkContext.textFile). The e

Re: ML Algos

2013-08-16 Thread Matei Zaharia
On Aug 15, 2013, at 7:13 PM, Lijie Xu wrote: > 3) MLBase may require Spark to provide some new features for implementing > some specific algorithms. Is there any? Or you have added some new > fundamental features which are not supported in Spark-0.7? On this particular aspect, we actually have

Re: New release of Spark and Shark on Amazon EMR

2013-08-16 Thread Matei Zaharia
Cool, thanks for doing this! Matei On Aug 16, 2013, at 11:27 AM, Parviz deyhim wrote: > Amazon EMR now has the latest version of Spark 0.7.3 and Shark 0.7 > > Let me know if you have any questions. > > Thanks, > Parviz

Re: Bagel?

2013-08-14 Thread Matei Zaharia
Hmm, it's weird that it built two. It should just be spark-0.7.3/bagel/target. Matei On Aug 14, 2013, at 2:29 PM, Ryan Compton wrote: > spark-0.7.3/bagel/target or spark-0.7.3/bagel/bagel/target ? > > On Wed, Aug 7, 2013 at 9:17 PM, Matei Zaharia wrote: >> Hi Ryan, >

Re: Rolling window cache

2013-08-13 Thread Matei Zaharia
Hi Grega, You'll need to create a new cached RDD for each batch, and then create the union of those on every window. So for example, if you have rdd0, rdd1, and rdd2, you might first take the union of 0 and 1, then of 1 and 2. This will let you use just the subset of RDDs you care about instead
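A sketch of that pattern (loadBatch is a placeholder for however each batch's RDD gets created):

    val rdd0 = loadBatch(0).cache()
    val rdd1 = loadBatch(1).cache()
    val window01 = rdd0.union(rdd1)   // first window: batches 0 and 1

    val rdd2 = loadBatch(2).cache()
    val window12 = rdd1.union(rdd2)   // slide forward: batches 1 and 2
    // Then drop all references to rdd0; per the follow-up in this thread,
    // unreferenced cached RDDs get cleaned up rather than causing OOMs.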

Re: Standalone deploy mode not working

2013-08-12 Thread Matei Zaharia
Yes, you have a hostname (stepreach-lm) that doesn't seem to resolve to any IP address. You can fix it by adding export SPARK_LOCAL_IP=<the machine's IP address>. Note that this will have to be set to the right IP on each machine. Matei On Aug 12, 2013, at 2:22 PM, Gowtham N wrote: > Hi, > > I downloaded spark and it

Re: when should I copy object coming out of RDD

2013-08-10 Thread Matei Zaharia
D the only one that has this > optimization of reusing Writable objects? > > Ameet > > On Sat, Aug 10, 2013 at 12:07 AM, Matei Zaharia > wrote: > What happens is that as we iterate through the SequenceFile, we reuse the > same IntegerWritable (or other Writable)
