Hi folks,
The Apache Spark PPMC is happy to welcome two new PPMC members and committers:
Tom Graves and Prashant Sharma.
Tom has been maintaining and expanding the YARN support in Spark over the past
few months, including adding big features such as support for YARN security,
and recently cont
meout if it associates again we
>> keep moving else we shut down the executor. This timeout can of course be
>> configurable.
>>
>> Thoughts ?
>>
>>
>> On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia
>> wrote:
>> Hey Imran,
>>
>> Good
It’s hard to tell, but maybe you’ve run out of space in your working directory?
The assembly command will try to write stuff in assembly/target.
Matei
On Nov 11, 2013, at 2:54 PM, Umar Javed wrote:
> I keep getting these io.Exception Permission denied errors when building with
> sbt assembly:
It might mean one of your JARs is corrupted. Try doing sbt clean and then sbt
assembly again.
Matei
On Nov 12, 2013, at 10:48 AM, Josh Rosen wrote:
> I've seen this "error: error while loading , error in opening zip file"
> before, but I'm not exactly sure what causes it. Here's a JIRA discu
Actually it doesn’t matter a lot from what I’ve seen. Only do it if you see a
lot of communication going to the master (these threads do the serialization of
tasks). I’ve never put more than 8 or so.
Matei
On Nov 11, 2013, at 12:13 PM, Walrus theCat wrote:
> Hi,
>
> The docs say that we shou
Yes, just look at the application UI on http://:4040
Matei
On Nov 11, 2013, at 12:26 AM, Wenlei Xie wrote:
> Hi,
>
> I have some shuffling task which is supposed to have many repeated values,
> thus I assume the shuffle compression would help the performance.
>
> However I get very similar ru
> Atte.
> Rafael R.
>
>
>
> 2013/11/7 Matei Zaharia
> Hi everyone,
>
> We're glad to announce the agenda of the Spark Summit, which will happen on
> December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined
> up, from 18 diffe
Hi Grega,
This memory is not taken away from the application in any way, so the setting
doesn’t matter if you don’t use caching. You don’t need to configure it in any
special way.
Matei
On Nov 8, 2013, at 8:01 AM, Grega Kešpret wrote:
> Hi,
>
> The docs say: Fraction of Java heap to use for
ode
> verbatim that doesn't have the necessary import statements
>
>
> On 11/7/2013 4:05 PM, Matei Zaharia wrote:
>> Yeah, this is confusing and unfortunately as far as I know it’s API
>> specific. Maybe we should add this to the documentation page for RDD.
>>
Yeah, this is confusing and unfortunately as far as I know it’s API specific.
Maybe we should add this to the documentation page for RDD.
The reason for these conversions is to only allow some operations based on the
underlying data type of the collection. For example, Scala collections support
Hi everyone,
We're glad to announce the agenda of the Spark Summit, which will happen on
December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined
up, from 18 different companies. Check out the agenda here:
http://spark-summit.org/agenda/.
This will be the biggest Spark even
Hi Pranay,
I don’t think anyone’s working on this right now, but contributions would be
welcome if this is a thing we could plug into MLlib.
Matei
On Nov 6, 2013, at 8:44 PM, Pranay Tonpay wrote:
> Hi,
> Wanted to know if PMML support in Spark is there in the roadmap for Spark…
> PMML has b
In general, you shouldn’t be mutating data in RDDs. That will make it
impossible to recover from faults.
In this particular case, you got 1 and 2 because the RDD isn’t cached. You just
get the same list you called parallelize() with each time you iterate through
it. But caching it and modifying
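For illustration, a minimal sketch of that behavior (Counter is a made-up type here, and sc is an existing SparkContext):
// On a cluster, each task works on deserialized copies of the records, and a
// later action rebuilds the RDD from the original collection, so mutations
// done inside foreach() are simply lost.
class Counter(var value: Int) extends Serializable
val rdd = sc.parallelize(Seq(new Counter(1), new Counter(2)))   // not cached
rdd.foreach(c => c.value += 10)       // mutates task-local copies
rdd.map(_.value).collect()            // still Array(1, 2)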
> Thoughts?
>
>
> On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia wrote:
> Hey Imran,
>
> Good to know that Akka 2.1 handles this — that at least will give us a start.
>
> In the old code, executors certainly did get flagged as “down” occasionally,
> but that
> the disassociation events; what do we do to fix it? How can we diagnose the
> problem, and figure out which of the configuration variables to tune?
> clearly, there *will be* long gc pauses, and the networking layer needs to be
> able to deal with them.
>
> still I under
rrect me if I am
> wrong.
>
>
>
> On Fri, Nov 1, 2013 at 10:08 AM, Matei Zaharia
> wrote:
> It’s true that Akka’s delivery guarantees are in general at-most-once, but if
> you look at the text there it says that they differ by transport. In the
> previous ve
> just had more robust defaults or something, but I bet it could still have the
> same problems. Even before, I have seen the driver thinking there were
> running tasks, but nothing happening on any executor -- it was just rare
> enough (and hard to reproduce) that I never bothered lookin
Looking at https://github.com/sbt/sbt-assembly, it seems you can add the
following into extraAssemblySettings:
assemblyOption in assembly ~= { _.copy(includeScala = false) }
Matei
On Oct 30, 2013, at 9:58 AM, Mingyu Kim wrote:
> Hi,
>
> In order to work around the library dependency problem,
ken, cmAddress) to
> ConverterUtils.convertFromYarn(containerToken, cmAddress).
>
> Not 100% sure that my changes are correct.
>
> Hope that helps,
> Viren
>
>
> On Sun, Sep 29, 2013 at 8:59 AM, Matei Zaharia
> wrote:
> Hi Terence,
>
> YARN's API changed in an incompati
The error is from a worker node -- did you check that /data2 is set up properly
on the worker nodes too? In general that should be the only directory used.
Matei
On Oct 28, 2013, at 6:52 PM, Shangyu Luo wrote:
> Hello,
> I have some questions about the files that Spark will create and use duri
Hi Ufuk,
Yes, we still write out data after these tasks in Spark 0.8, and it needs to be
written out before any stage that reads it can start. The main reason is
simplicity when there are faults, as well as more flexible scheduling (you
don't have to decide where each reduce task is in advance,
age it. So of course we
develop features and optimizations as we see demand for them, but if there's a
lot of demand for this, we can do it.
Matei
On Oct 28, 2013, at 5:51 PM, Matei Zaharia wrote:
> FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
> ca
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
caching is the input to each reduce task. Those currently don't spill to disk.
The solution if datasets are large is to add more reduce tasks, whereas Hadoop
would run along with a small number of tasks that do lots of
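As a sketch of "add more reduce tasks": pass an explicit task count to the shuffle operation so that each reduce task's input stays small enough to fit in memory (here "pairs" is assumed to be an existing RDD of key-value pairs, and 2000 is just an illustrative number):
val counts = pairs.reduceByKey(_ + _, 2000)   // 2000 reduce tasks instead of the default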
Hi Philip,
Indeed, Spark's API allows direct creation of complex workflows the same way
Cascading would. Cascading built that functionality on top of MapReduce
(translating user operations down to a series of MapReduce jobs), but Spark's
engine supports complex workflows from the start and the
Hey Howard,
Great to hear that you're looking at Spark Streaming!
> We have some in house real time streaming jobs written for Storm and want to
> see the possibility to migrate to Spark Streaming in the future as our team
> all think Spark is a very promising technology (one platform to exec
Hey Stephen,
SSH actually supports creating a SOCKS proxy through the -D flag. Take a look at
the -D option on our spark-ec2 script for example, which just exposes the -D
option of ssh. With this feature you can do stuff like ssh -D 8088 and
then configure localhost:8088 as a proxy in your web
Hi Umar,
The Spark wiki at
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage has a few pages
on Spark internals (specifically the Python and Java APIs) and on how to build
and contribute to Spark
(https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).
Hopefull
Are you doing this because it's sorted somehow, or you have a file where you
want the last K? For that you could probably use the lower-level API of
SparkContext.runJob() to run a job on just the last partition and then return
the last elements from there. I'm just curious how general this need
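A rough sketch of that idea (the path and k are placeholders): run a job on just the last partition and take the last k elements from it.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
val lastIndex = lines.partitions.length - 1
val k = 10
val result: Array[Seq[String]] = sc.runJob(
  lines,
  (it: Iterator[String]) => it.toSeq.takeRight(k),   // runs only on the chosen partition
  Seq(lastIndex),
  allowLocal = false)
val lastK = result.head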
Yup, unfortunately YARN changed its API upon releasing 2.2, which puts us in an
awkward position because all the major current users are on the old YARN API
(from 0.23.x and 2.0.x) but new users will try this one. We'll probably change
the default version in Spark 0.8.1 or 0.8.2. If you look on
at 18:28, Ayush Mishra wrote:
>
>> You can check
>> http://blog.knoldus.com/2013/09/09/running-standalone-scala-job-on-amazon-ec2-spark-cluster/.
>>
>>
>> On Thu, Oct 24, 2013 at 6:54 AM, Nan Zhu wrote:
>> Great!!!
>>
>>
>> On Wed, Oc
Yes, take a look at
http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
Matei
On Oct 23, 2013, at 6:17 PM, Nan Zhu wrote:
> Hi, all
>
> Is there any solution running Spark with Amazon S3?
>
> Best,
>
> Nan
em to run fine until you try something with 500GB of data
> etc.
>
> I was wondering if you could write up a little white paper or some guide
> lines on how to set memory values, and what to look at when something goes
> wrong? E.g. I would never have guessed that countByValue happe
Yup, local mode also catches serialization errors. The issue with local
variables in the function happens only if they're not Serializable, and even
then, Spark's closure cleaner tries to eliminate references to them in some
cases. But for example here's one thing that wouldn't work:
class C {
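  // (The rest of the example is cut off in the archive; what follows is a
  //  hedged guess at the kind of thing meant here, not the original text.)
  val log = new java.io.PrintWriter("out.txt")   // PrintWriter is not Serializable
  def run(sc: org.apache.spark.SparkContext) {
    // The closure below references the field `log`, which drags in `this`,
    // and C itself is not serializable -- so shipping the task fails with a
    // NotSerializableException.
    sc.parallelize(1 to 10).map(x => { log.println(x); x }).count()
  }
}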
Hi there,
The problem is that countByValue happens in only a single reduce task -- this
is probably something we should fix but it's basically not designed for lots of
values. Instead, do the count in parallel as follows:
val counts = mapped.map(str => (str, 1)).reduceByKey((a, b) => a + b)
If
This line here is the problem:
>System.setProperty("spark.serializer",
> "org.apache.spark.serializer.KryoRegistrator")
It should say org.apache.spark.serializer.KryoSerializer, not Registrator.
Matei
>System.setProperty("spark.kryo.registrator",
> classOf[EdgeWithIDRegistrator].getNam
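Putting the corrected settings together, as a sketch (EdgeWithIDRegistrator is the class from the quoted code, assumed to extend org.apache.spark.serializer.KryoRegistrator):
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", classOf[EdgeWithIDRegistrator].getName)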
.hadoop.io.Text]
> [ERROR] Error occurred in an application involving default arguments.
> [INFO] val rdd = sc.sequenceFile[org.apache.hadoop.io.Text,
> org.apache.hadoop.io.BytesWritable](uri)
>
>
>
> On Fri, Oct 18, 2013 at 9:37 AM, Matei Zaharia
> wrote:
>
Don't worry about the implicit params, those are filled in by the compiler. All
you need to do is provide a key and value type, and a path. Look at how
sequenceFile gets used in this test:
https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=core/src/test/scala/spark/FileSuite.
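A minimal sketch of the same thing (the path is a placeholder); the implicit parameters mentioned in the compiler error are filled in automatically once the SparkContext implicits are in scope:
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkContext._   // brings in the implicit WritableConverters
val rdd = sc.sequenceFile[Text, BytesWritable]("hdfs://namenode:8020/data/seqfile")
// convert to plain types right away, since Hadoop reuses Writable objects
rdd.map { case (k, v) => (k.toString, v.getLength) }.take(5)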
for hadoop 1.0.4, but the actual installed
> version of spark is build against cdh4.3.0-mr1. this also used to work, and i
> prefer to do this so i compile against a generic spark build. could this be
> the issue?
>
>
> On Thu, Oct 17, 2013 at 8:06 PM, Koert Kuipers wrot
Koert, did you link your Spark job to the right version of HDFS as well? In
Spark 0.8, you have to add a Maven dependency on "hadoop-client" for your
version of Hadoop. See
http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
for example.
Matei
On Oct 17, 2
Hi there,
I'm not sure I understand your problem -- is it that Spark used *less* memory
than the 2 GB? That out of memory message seems to be from your operating
system, so maybe there were other things using RAM on that machine, or maybe
Linux is configured to kill tasks quickly when the memor
rk Streaming dependency if the goal is to
> keep size down and you don't want to confuse new adopters who aren't using
> Kafka as part of their tech stack.
>
> -Ryan
>
>
> On Sat, Oct 12, 2013 at 10:52 AM, Matei Zaharia
> wrote:
> Hi Ryan,
>
> Spark St
Hey folks, FYI, the talk submission deadline for this is October 25th. We've
gotten a lot of great submissions already. If you'd like to submit one, go to
http://www.spark-summit.org/submit/. It can be about anything -- projects
you're doing with Spark, open source development within the project
tings correctly in a Spark-on-Mesos
> environment. Can you describe the differences for Mesos?
>
> Thanks again,
> Craig
>
>
> On Mon, Oct 14, 2013 at 6:15 PM, Matei Zaharia
> wrote:
> Hi Craig,
>
> The best configuration is to have multiple disks configured
Hi Craig,
The best configuration is to have multiple disks configured as separate
filesystems (so no RAID), and set the spark.local.dir property, which
configures Spark's scratch space directories, to be a comma-separated list of
directories, one per disk. In 0.8 we've written a bit on how to c
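A sketch of what that looks like (the paths are placeholders); set the property before creating the SparkContext, with one scratch directory per physical disk:
System.setProperty("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")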
Hi Ryan,
If you're only going to run in local mode, there's no need to package the app
with sbt and pass a JAR. You can just run it straight out of your IDE.
Matei
On Oct 13, 2013, at 9:17 PM, Ryan Chan wrote:
> Hi,
>
> Are there any guide on teaching how to get started for local rapid
> de
We're still not using macros in the 2.10 branch, so this issue will still
happen there. We may do macros later but it's a fair bit of work so I wouldn't
guarantee that it happens in our first 2.10 release.
Matei
On Oct 12, 2013, at 2:33 PM, Mark Hamstra wrote:
> That's a TODO that is either n
Hi Alex,
Unfortunately there seems to be something wrong with how the generics on that
method get seen by Java. You can work around it by calling this with:
plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed",
"csv", String.class, String.class, (Class) TextOutputFormat.cla
Hi Ryan,
Spark Streaming ships with a special version of the Kafka 0.7.2 client that we
ported to Scala 2.9, and you need to add that as a JAR explicitly in your
project. The JAR is in
streaming/lib/org/apache/kafka/kafka/0.7.2-spark/kafka-0.7.2-spark.jar under
Spark. The streaming/lib directo
Hi Eugen,
You should use saveAsHadoopDataset, to which you pass a JobConf object that
you've configured with TableOutputFormat the same way you would for a MapReduce
job. The saveAsHadoopFile methods are specifically for output formats that go
to a filesystem (e.g. HDFS), but HBase isn't a file
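A hedged sketch of that setup (the table name is a placeholder, class names are from HBase's old mapred API as I recall them, and "puts" is assumed to be an RDD[(ImmutableBytesWritable, Put)] built by the caller):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext._   // for saveAsHadoopDataset on pair RDDs
val jobConf = new JobConf(HBaseConfiguration.create())
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
puts.saveAsHadoopDataset(jobConf)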
Hey, this seems to be a problem in the docs about how to set the executor URI.
It looks like the SPARK_EXECUTOR_URI variable is not actually used. Instead,
set the spark.executor.uri Java system property using
System.setProperty("spark.executor.uri", "") before you create a
SparkContext.
Matei
Take a look at the org.apache.spark.scheduler.SparkListener class. You can
register your own SparkListener with SparkContext that listens for job-start
and job-end events.
Matei
On Oct 10, 2013, at 9:04 PM, prabeesh k wrote:
> Is there any way to get execution time in the program?
> Actually
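As a concrete illustration of the SparkListener suggestion above, a rough sketch (event class names as I recall them from the 0.8-era API; check the SparkListener source in your version):
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerJobEnd}
class JobTimer extends SparkListener {
  private var startTime = 0L
  override def onJobStart(jobStart: SparkListenerJobStart) {
    startTime = System.currentTimeMillis
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd) {
    println("Job finished in " + (System.currentTimeMillis - startTime) + " ms")
  }
}
sc.addSparkListener(new JobTimer)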
Yeah, Christopher answered this before I could, but you can list the directory
in the driver nodes, find out all the filenames, and then use
SparkContext.parallelize() on an array of filenames to split the set of
filenames among tasks. After that, run a foreach() on the parallelized RDD and
hav
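A sketch of that approach (the directory path and processFile are hypothetical; processFile stands in for whatever per-file work the job does):
def processFile(path: String) {
  // open the file and handle it however your job needs
}
val fileNames = new java.io.File("/data/input").list().toSeq
sc.parallelize(fileNames, 100).foreach(name => processFile("/data/input/" + name))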
Hey, sorry, for this question, there's a similar answer to the previous one.
You'll have to move the files from the output directories into a common
directory by hand, possibly renaming them. The Hadoop InputFormat and
OutputFormat APIs that we use are just designed to work at the level of
dire
Hi Ramkumar,
I don't think there's a good way to give them different names other than
opening and writing the files yourself. You could do that with a foreach(). For
example, suppose you created an RDD of records (say (key, listOfValues)) and
you wanted to save each one to a different file bas
Yes, the organization name just changed because we moved to Apache. Here's the
right Maven info: http://spark.incubator.apache.org/downloads.html.
Matei
On Oct 9, 2013, at 5:25 PM, Erik Freed wrote:
> Did the 0.8 release get into a maven repo? Did this change for apache status?
> thanks!
> Eri
Hi Patrick,
This is indeed pretty application specific. While you could modify Spark to
list GPUs and assign tasks to them, I think a simpler solution would be to
manage use of GPUs at the application level. Create a static object GPUManager
that lists the GPUs on each machine (somehow) and rec
Hi Shay,
We actually don't support Mesos in the EC2 scripts anymore -- sorry about that.
If you want to deploy Mesos on EC2, I'd recommend looking at Mesos's own EC2
scripts. Then it's fairly easy to launch Spark on there. If you want to deploy
Mesos locally you can go through the Spark docs fo
se (we prefer to stick to official releases),
> and
> It's 33 commits behind master.
> Are there plans to actively maintain this branch and eventually release it
> officially?
>
> -Matt Cheah
>
> From: Matei Zaharia
> Date: Monday, October 7, 2013 7:49 PM
> To:
Hi Mingyu,
The latest version of Spark works with Scala 2.9.3, which is the latest
Scala-2.9 version. There's also a branch called branch-2.10 on GitHub that uses
2.10.3. What specific libraries are you having trouble with?
> I see other open source projects private-namespacing the dependencies
the remote node from it. Hopefully one of these
works.
Anyway, thanks for bringing up this issue -- it's a confusing one and we should
have a recommended solution for it.
Matei
On Oct 4, 2013, at 1:13 PM, Paul Snively wrote:
> Hi Matei!
>
> On Oct 4, 2013, at 12:03 PM, Matei Z
Hi Paul,
Just FYI, I'm not sure Akka was designed to pass ActorSystems across closures
the way you're doing. Also, there's a bit of a misunderstanding about closures
on RDDs. Consider this change you made to ActorWordCount:
lines.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).for
Yes, it is for these map-like operations. The only time when it isn't is when
you change the RDD's partitioner, e.g. by doing sortByKey or groupByKey. It
would definitely be good to document this more formally.
Matei
On Oct 3, 2013, at 3:33 PM, Mingyu Kim wrote:
> Hi all,
>
> Is the sort ord
Hi Ashish,
Those "removing" messages mean that the node in question didn't communicate
with your application for 45 seconds. Most likely the executor process on the
node died, though there's also a chance that it was doing a super-long garbage
collection or that there was a network problem. Loo
Hi Shangyu,
> (1) When we read in a local file by SparkContext.textFile and do some
> map/reduce job on it, how will spark decide to send data to which worker
> node? Will the data be divided/partitioned equally according to the number of
> worker node and each worker node get one piece of data
Nope, I don't think it matters there.
Matei
On Oct 2, 2013, at 5:18 AM, Stuart Layton wrote:
> Should shark 0.8 be built with sbt/sbt assembly as well?
>
> On Oct 2, 2013 1:32 AM, "Matei Zaharia" wrote:
> Assembly packages all into one big JAR, which does a bett
Assembly packages everything into one big JAR, which does a better job of capturing
only the needed dependencies and simplifies deployment. Package won't work
anymore because all the scripts expect this JAR.
Matei
On Oct 1, 2013, at 8:34 PM, Stuart Layton wrote:
> I noticed that the build instructio
Hi Terence,
YARN's API changed in an incompatible way in Hadoop 2.1.0, so I'd suggest
sticking with 2.0.x for now. We may create a different branch for this version.
Unfortunately due to the API change it may not be possible to support this
version while also supporting other widely-used versio
This was actually a bug in the parallelize() version for Python that should be
fixed in Spark 0.8. It may also be fixed in 0.7.3.
Matei
On Sep 27, 2013, at 8:59 PM, Reynold Xin wrote:
> It worked for me:
>
> a=[]
> for i in range(0,1):
>a.append(i)
>
> def f(iterator): yield sum(1 fo
Hi Sergey,
Because this was a breaking API change on YARN's part, I'd recommend just
sticking with 2.0.x for now if possible. Otherwise, we'll likely add support
for this, and remove support for older versions of YARN, in the next major
version of Spark. Before that, it's possible that we can m
Hi Sebastian,
I believe the reasoning was as follows. The actual number of times we expect an
element to occur in sampling with replacement is given by the binomial
distribution (http://en.wikipedia.org/wiki/Binomial_distribution), but for rare
events this can be approximated with a Poisson dis
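(For reference, the approximation being described, in standard notation: if an element can be selected in each of n independent draws with probability p, the number of times it appears is binomial,
P(X = k) = C(n, k) * p^k * (1 - p)^(n - k),
and for large n and small p this is approximately Poisson with mean lambda = n*p:
P(X = k) ≈ e^(-lambda) * lambda^k / k! .)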
Hey Paul,
2.10 is definitely on our roadmap, and you can actually find a scala-2.10
branch in the repo that has a bunch of the changes done. However, as Mark said,
it won't be in 0.8 mostly because we've had a lot of other changes in that
release. One challenge for us is that we also make some
Have you looked at the stdout and stderr files created for the job on the
worker nodes? By default they're in the "work" directory under SPARK_HOME.
In my experience this either means no write permissions to the filesystem, or
no Java found.
Matei
On Sep 12, 2013, at 10:59 PM, Vipul Pandey wr
Hi Han,
The AMI in the master branch works with the version of the EC2 script there.
Matei
On Sep 12, 2013, at 11:21 AM, Han JU wrote:
> Hi all,
>
> I'd like to deploy a spark 0.7.3 cluster on ec2 eu-west-1.
> The ec2 script bundled in 0.7.3 cannot find the AMI. I tried to point to
> the A
t sbt compile.
Matei
On Sep 11, 2013, at 7:21 PM, "Shao, Saisai" wrote:
> Hi Matei,
>
> Thanks a lot. My colleague ran into the same problem, so I’m just wondering
> whether this command is so slow. I will try it on SSD or in-memory FS.
>
> Thanks
> Jerry
>
>
Hi Wenlei,
This was actually semi-intentional because we wanted a forward-compatible
format across Spark versions. I'm not sure whether that was a good idea (and we
didn't promise it will be compatible), so later we can change it. But for now,
if you'd like to use Kryo, I recommend implementing
That's weird, it takes 30-60 seconds for me. If you can put this on an SSD or
in-memory filesystem in any way that would help a lot. I have an SSD on my
laptop.
Matei
On Sep 11, 2013, at 6:40 PM, "Shao, Saisai" wrote:
> Hi all,
>
> Now Spark changes sbt package to sbt assembly, and class pa
You can actually do SparkContext.getExecutorStorageStatus to get a list of
stored blocks. These have a special name when they belong to an RDD, using that
RDD's id field. But unfortunately there's no way to get this info from the RDD
itself.
Matei
On Sep 11, 2013, at 4:52 PM, Dmitriy Lyubimov
Hi Nicholas,
Right now the best way to do this is probably to run foreach() on each value
and then use the Hadoop FileSystem API directly to write a file. It has a
pretty simple API based on OutputStreams:
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html.
You just
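A hedged sketch of that pattern (the base path is a placeholder, and "pairs" is assumed to be an RDD of (key, values) records): open one output file per record from inside foreach().
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
pairs.foreach { case (key, values) =>
  val path = new Path("hdfs://namenode:8020/output/" + key)
  val fs = path.getFileSystem(new Configuration())
  val out = fs.create(path)
  values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
  out.close()
}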
9.3-0.1-SNAPSHOT-assembly.jar
> with timestamp 1378683857701
>
> I can also confirm that the 'verrazano' jar (my custom one) is in a mesos
> slave temp directory on all of the slave nodes.
>
>
>
>
> On Sun, Sep 8, 2013 at 7:01 PM, Matei Zaharia wrote:
> Whi
Which version of Spark is this with? Did the logs print something about sending
the JAR you added with ADD_JARS to the cluster?
Matei
On Sep 8, 2013, at 8:56 AM, Gary Malouf wrote:
> I built a custom jar with among other things, nscalatime and joda time packed
> inside of it. Using the ADD_J
Hi folks,
As we continue developing Spark, we would love to get feedback from users and
hear what you'd like us to work on next. We've decided that a good way to do
that is a survey -- we hope to run this at regular intervals. If you have a few
minutes to participate, do you mind filling it in
Hi Daniel,
Either add this to the "jars" parameter of SparkContext (see
http://spark.incubator.apache.org/docs/latest/quick-start.html), or use
SparkContext.addJar. Those methods are preferable to SPARK_CLASSPATH. Sorry for
the somewhat poor docs on this -- we added these methods later so some
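A sketch of both options (the master URL and jar paths are placeholders):
import org.apache.spark.SparkContext
val sc = new SparkContext("spark://master:7077", "MyApp",
  System.getenv("SPARK_HOME"), Seq("/path/to/my-deps.jar"))
// or, once the context exists:
sc.addJar("/path/to/another-dep.jar")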
t 10:21 AM, Gary Malouf wrote:
> That's how I do it now, list is getting lengthy but we are automating the
> retrieving of the jars and list build up in ansible.
>
>
> On Wed, Sep 4, 2013 at 12:55 PM, Matei Zaharia
> wrote:
> Hi Gary,
>
> Just to be clear, i
Hi Gary,
Just to be clear, if you want to use third-party libraries in Spark (or even
your own code), you *don't* need to modify SparkBuild.scala. Just pass a list
of JARs containing your dependencies when you create your SparkContext. See
http://spark.incubator.apache.org/docs/latest/quick-sta
Cool, thanks for this really detailed writeup! It's great that you're also
covering how to set this up on your own.
Regarding YouTube videos -- the group that recorded it is working on those, but
I don't actually know the ETA yet. I'll let you know if I find out.
Matei
On Sep 3, 2013, at 11:47
So I think the problem might be that BytesWritable.getBytes() can return an
array bigger than the actual bytes used (see
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BytesWritable.html#getBytes()
). It just returns a backing array that can be reused across records. Try
using c
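The suggestion is cut off above; one common fix, as a hedged sketch, is to copy only the valid prefix of the backing array (here "rdd" is assumed to be an RDD[(Text, BytesWritable)] loaded from the SequenceFile):
import java.util.Arrays
val fixed = rdd.map { case (k, v) =>
  (k.toString, Arrays.copyOfRange(v.getBytes, 0, v.getLength))
}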
What's your code for loading the SequenceFile?
You may also want to check that you're using the right version of protobuf in
Spark.
Matei
On Sep 1, 2013, at 10:52 AM, Gary Malouf wrote:
> We are using Spark 0.7.3 compiled (and running) against Hadoop
> 2.0.0-mr1-cdh4.2.1.
>
> When I read a
Hi everyone,
As we've mentioned before, we're holding a 2-day training camp on Spark and
related projects at Berkeley tomorrow and Friday:
http://ampcamp.berkeley.edu/3/. A video stream will be available *for free* to
anyone who wants to watch. If you'd like to watch it, please register
before
By the way, an important note: Make sure you *shut down* your cluster after
using it. Otherwise, Amazon will keep charging you money for it! I've seen some
people get caught by that in the past.
For others following this list, it's probably fine to start the cluster
tomorrow morning (Pacific ti
ry, and you won't get
OutOfMemoryErrors. But we are the ones controlling when we unreference them,
and the GC just picks up from there when it decides to clean stuff up.
Matei
>
> Thanks,
> Grega
>
>
> On Wed, Aug 14, 2013 at 12:40 AM, Matei Zaharia
> wrote:
>
Hi Mike,
This project contains some small synthetic benchmarks:
https://github.com/amplab/spark-perf. Otherwise, for ML algorithms, look in
mllib -- it comes with driver programs for K-means, logistic regression, matrix
factorization, etc, as well as data generators for them.
Matei
On Aug 23,
What are the failures?
Matei
On Aug 22, 2013, at 2:57 PM, Aaron Babcock wrote:
> Hi,
>
> Does anyone have any experience using jmx and visualvm instead of yourkit to
> remotely profile spark workers.
>
> I tried the following in spark-env.sh but I get all kinds of failures when
> workers sp
Hi Paul,
On Aug 21, 2013, at 6:11 PM, Paul Snively wrote:
>> Just to understand, are you trying to do a real-time application (which is
>> what the streaming in Spark Streaming is for), or just to read an input file
>> into a batch job?
>
> Well, it's an interesting case. I'm trying to take a
Hi Paul,
Just to understand, are you trying to do a real-time application (which is what
the streaming in Spark Streaming is for), or just to read an input file into a
batch job?
For the latter, you can pass an s3n:// URL to any of Spark's file input methods
(e.g. SparkContext.textFile). The e
On Aug 15, 2013, at 7:13 PM, Lijie Xu wrote:
> 3) MLBase may require Spark to provide some new features for implementing
> some specific algorithms. Is there any? Or you have added some new
> fundamental features which are not supported in Spark-0.7?
On this particular aspect, we actually have
Cool, thanks for doing this!
Matei
On Aug 16, 2013, at 11:27 AM, Parviz deyhim wrote:
> Amazon EMR now has the latest version of Spark 0.7.3 and Shark 0.7
>
> Let me know if you have any questions.
>
> Thanks,
> Parviz
Hmm, it's weird that it built two. It should just be spark-0.7.3/bagel/target.
Matei
On Aug 14, 2013, at 2:29 PM, Ryan Compton wrote:
> spark-0.7.3/bagel/target or spark-0.7.3/bagel/bagel/target ?
>
> On Wed, Aug 7, 2013 at 9:17 PM, Matei Zaharia wrote:
>> Hi Ryan,
>
Hi Grega,
You'll need to create a new cached RDD for each batch, and then create the
union of those on every window. So for example, if you have rdd0, rdd1, and
rdd2, you might first take the union of 0 and 1, then of 1 and 2. This will let
you use just the subset of RDDs you care about instead
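A rough sketch of the idea (batch0/batch1/batch2 stand for the per-batch RDDs):
val rdd0 = batch0.cache()
val rdd1 = batch1.cache()
val rdd2 = batch2.cache()
val window01 = rdd0.union(rdd1)   // first window: batches 0 and 1
val window12 = rdd1.union(rdd2)   // next window: reuses the cached rdd1, drops rdd0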
Yes, you have a hostname (stepreach-lm) that doesn't seem to resolve to any IP
address. You can fix it by adding export SPARK_LOCAL_IP=. Note
that this will have to be set to the right IP on each machine.
Matei
On Aug 12, 2013, at 2:22 PM, Gowtham N wrote:
> Hi,
>
> I downloaded spark and it
D the only one that has this
> optimization of reusing Writable objects?
>
> Ameet
>
> On Sat, Aug 10, 2013 at 12:07 AM, Matei Zaharia
> wrote:
> What happens is that as we iterate through the SequenceFile, we reuse the
> same IntegerWritable (or other Writable)