Well, it says that the jar was successfully added but can't reference
classes from it. Does this have anything to do with this bug?
http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar
On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza sandy.r...@cloudera.com
On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote:
Actually, looking closer, it is stranger than I thought:
in the Spark UI, one executor has executed 4 tasks, and one has executed
1928.
Can anyone explain the workings of a KafkaInputStream wrt kafka partitions
and mapping to
Evgeniy Shishkin wrote
So, at the bottom — kafka input stream just does not work.
That was the conclusion I was coming to as well. Are there open tickets
around fixing this up?
That bug only appears to apply to spark-shell.
Do things work in yarn-client mode or on a standalone cluster? Are you
passing a path with parent directories to addJar?
On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung
coded...@cs.stanford.eduwrote:
Well, it says that the jar was successfully
Seems like the configuration of the Spark worker is not right. Either the
worker has not been given enough memory or the allocation of the memory to
the RDD storage needs to be fixed. If configured correctly, the Spark
workers should not get OOMs.
On Thu, Mar 27, 2014 at 2:52 PM, Evgeny
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote:
The more I think about it, the problem is not about /tmp; it's more about the
workers not having enough memory. Blocks of received data could be falling
out of memory before they are processed.
BTW, what is the
Thanks everyone for the discussion.
Just to note, I restarted the job yet again, and this time there are indeed
tasks being executed by both worker nodes. So the behavior does seem
inconsistent/broken atm.
Then I added a third node to the cluster, and a third executor came up, and
everything
Christopher
Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.
Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?
@dmpour
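A minimal sketch of one common way to do this, assuming the TaskNonce helper named in the question (the stub below exists only to make the sketch self-contained): run a throwaway job with at least one partition per executor core, and make the call inside the closure so it executes once in each executor JVM.
import org.apache.spark.SparkContext
object TaskNonce {
  private val done = new java.util.concurrent.atomic.AtomicBoolean(false)
  def getSingleton = this
  def doThisOnce(): Unit = if (done.compareAndSet(false, true)) {
    // one-time, per-JVM initialization goes here
  }
}
def runOnAllExecutors(sc: SparkContext, numSlots: Int): Unit =
  // numSlots should be at least the total number of executor cores so every executor picks up a task
  sc.parallelize(1 to numSlots, numSlots).foreachPartition(_ => TaskNonce.getSingleton.doThisOnce())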
On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote:
Thanks everyone for the discussion.
Just to note, I restarted the job yet again, and this time there are indeed
tasks being executed by both worker nodes. So the behavior does seem
inconsistent/broken atm.
Then I added
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right?
My doubt was: between two different experiments, do the RDDs cached in
memory need to be unpersisted?
Or does it not matter?
On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi shas...@cloudera.comwrote:
Which storage
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll
try to look into these, seems like a serious error.
Matei
On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote:
Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop
1.0.4 from GitHub on
Anyone can help?
How can I configure a different spark.local.dir for each executor?
On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
Each of my worker nodes has its own unique spark.local.dir.
However, when I run spark-shell, the shuffle writes are always
Hi,
My worker nodes have more memory than the host that I'm submitting my driver
program from, but it seems that SPARK_MEM is also setting the Xmx of the spark shell?
$ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: INFO:
Assuming you're using a new enough version of Spark, you should use
spark.executor.memory to set the memory for your executors, without
changing the driver memory. See the docs for your version of Spark.
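For example, a minimal sketch of setting executor memory through SparkConf without touching the driver's heap (master URL and size are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
// Give each executor 100g while the driver JVM keeps its default heap.
val conf = new SparkConf()
  .setMaster("spark://XXX:7077")
  .setAppName("big-executors")
  .set("spark.executor.memory", "100g")
val sc = new SparkContext(conf)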
On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.comwrote:
Hi,
My worker
Hi,
Thanks! I found out that I wasn’t setting the SPARK_JAVA_OPTS correctly..
I took a look at the process table and saw that the
“org.apache.spark.executor.CoarseGrainedExecutorBackend” didn’t have the
-Dspark.local.dir set.
On 28 Mar, 2014, at 1:05 pm, Matei Zaharia
Hi David,
I am sorry but your question is not clear to me. Are you talking about
taking some value and sharing it across your cluster so that it is present
on all the nodes? You can look at Spark's broadcasting in that case. On the
other hand, if you want to take one item and create an RDD of 100
Hi,
I am a newbie with Spark.
I tried installing 2 virtual machines, one as a client and one as standalone
mode worker+master.
Everything seems to run and connect fine, but when I try to run a simple
script, I get weird errors.
Here is the traceback, notice my program is just a one-liner:
Have you tried setting the partitioning ?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.comwrote:
Hi all,
I've got something which I think should be straightforward but
Hi,
I just ran a simple example to generate some data for the ALS
algorithm. My Spark version is 0.9, in local mode, and the memory of my
node is 108G.
But when I set conf.set("spark.akka.frameSize", "4096"), the
following problem occurred; when I do not set this, it runs
well.
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.comwrote:
Hi all,
I notice that RDD.cartesian has a strange behavior with cached and
uncached data. More
I sorted it out.
Turns out that if the client uses Python 2.7 and the server is Python 2.6,
you get some weird errors, like this and others.
So you would probably want to avoid doing that...
If you are learning about Spark Streaming, as I am, you've probably used
netcat (nc) as mentioned in the Spark Streaming programming guide. I
wanted something a little more useful, so I modified the
ClickStreamGenerator code to make a very simple script that simply reads a
file off disk and passes
I've played around with it. The CSV file looks like it gives 130
partitions. I'm assuming that's the standard 64MB split size for HDFS
files. I have increased the number of partitions and the number of tasks for
things like groupByKey and such. Usually I start blowing up on GC
Overlimit or sometimes
Classes are serialized and sent to all the workers as Akka messages.
For singletons and case classes, I am not sure whether they are Java-serialized or
Kryo-serialized by default.
But your own classes, if serialized by Kryo, will definitely be much more
efficient. There is a comparison that Matei did for all
As long as the amount of state being passed is relatively small, it's
probably easiest to send it back to the driver and to introduce it into RDD
transformations as the zero value of a fold.
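A minimal sketch of that pattern, assuming the driver-side state merges idempotently (a set here), since the fold's zero value is applied once per partition and once more in the final merge (sc is the usual SparkContext):
// State computed on the driver (hypothetical), e.g. IDs seen in a previous batch.
val previouslySeen: Set[String] = Set("user-1", "user-2")
// Fold it back in as the zero value; set union is idempotent, so repeating it per partition is safe.
val ids = sc.parallelize(Seq("user-2", "user-3"))
val allSeen = ids.map(Set(_)).fold(previouslySeen)(_ ++ _)
// allSeen == Set("user-1", "user-2", "user-3")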
On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu amoc...@verticalscope.comwrote:
I'd like to resurrect
Ok. Based on Sonal's message I dived more into memory and partitioning and
got it to work.
For the CSV file I used 1024 partitions [textFile(path, 1024)] which cut
the partition size down to 8MB (based on standard HDFS 64MB splits). For
the key file I also adjusted partitions to use about 8MB.
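For reference, a minimal sketch of that partitioning (paths and counts are placeholders):
// Ask for 1024 partitions when reading the CSV so each partition is ~8MB instead of a 64MB HDFS split.
val csv = sc.textFile("hdfs:///data/big.csv", 1024)
// Size the key file's partitions similarly before joining.
val keys = sc.textFile("hdfs:///data/keys.txt", 128)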
I'd like to resurrect this thread since I don't have an answer yet.
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 10:04 AM
To: u...@spark.incubator.apache.org
Subject: function state lost when next RDD is processed
Is there a way to pass a custom function to spark to
Hi Aureliano,
I followed this thread to create a custom saveAsObjectFile.
The following is the code.
new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable,
BytesWritable](saveRDD.mapPartitions(iter =>
iter.grouped(10).map(_.toArray)).map(x => (NullWritable.get(), new
There is also this quote from the Tuning guide
(http://spark.incubator.apache.org/docs/latest/tuning.html):
Finally, if you don't register your classes, Kryo will still work, but
it will have to store the full class name with each object, which is
wasteful.
It implies that you don't really
Thanks!
Ya that's what I'm doing so far, but I wanted to see if it's possible to keep
the tuples inside Spark for fault tolerance purposes.
-A
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: March-28-14 10:45 AM
To: user@spark.apache.org
Subject: Re: function state lost when next RDD
Thanks a lot Ognen!
It's not a fancy class that I wrote, and now I realized I neither extend
Serializable nor register with Kryo, and that's why it is not working.
Hi,
Thanks Nanzhu. I tried to implement your suggestion on the following scenario. I
have an RDD of, say, 24 elements. When I partitioned it into two groups of 12
elements each, there is a loss of order of elements in the partitions. Elements are
partitioned randomly. I need to preserve the order such that the first
The cleaner ttl was introduced as a brute force method to clean all old
data and metadata in the system, so that the system can run 24/7. The
cleaner ttl should be set to a large value, so that RDDs older than that
are not used. Though there are some cases where you may want to use an RDD
again
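For example, one way to set it, assuming the standard spark.cleaner.ttl property (value in seconds):
import org.apache.spark.SparkConf
// Keep metadata and RDDs for 24 hours; anything older may be cleaned up, so no RDD older than this should be reused.
val conf = new SparkConf().set("spark.cleaner.ttl", (24 * 3600).toString)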
I think you should sort each RDD
-Original Message-
From: yh18190 [mailto:yh18...@gmail.com]
Sent: March-28-14 4:44 PM
To: u...@spark.incubator.apache.org
Subject: Re: Splitting RDD and Grouping together to perform computation
Hi,
Thanks Nanzhu. I tried to implement your suggestion on
I say you need to remap so you have a key for each tuple that you can sort on.
Then call rdd.sortByKey(true), like this: mystream.transform(rdd =>
rdd.sortByKey(true))
For this function to be available you need to import
org.apache.spark.rdd.OrderedRDDFunctions
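Putting the two pieces together, a minimal sketch (the key function is a made-up placeholder):
import org.apache.spark.SparkContext._            // implicit conversions that add sortByKey to pair RDDs (Spark 0.9/1.x)
import org.apache.spark.streaming.dstream.DStream
def sorted(mystream: DStream[String]): DStream[(String, String)] =
  mystream
    .map(x => (x.take(10), x))                    // hypothetical key, e.g. a timestamp prefix
    .transform(rdd => rdd.sortByKey(true))        // sort every RDD in the stream by that key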
-Original Message-
From: yh18190
From the gist of it, it seems like you need to override the default
partitioner to control how your data is distributed among partitions. Take
a look at the different Partitioners available (Default, Range, Hash); if none
of these gets you the desired result, you might want to provide your own.
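A minimal sketch of a custom Partitioner as a starting point (the bucketing rule is only an illustration):
import org.apache.spark.Partitioner
// Route each key to a fixed bucket so related elements land in the same partition.
class FixedRangePartitioner(override val numPartitions: Int, groupSize: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case i: Int => (i / groupSize) % numPartitions   // e.g. keys 0-11 -> partition 0, 12-23 -> partition 1
    case _      => 0
  }
}
// usage on a hypothetical pair RDD: pairRdd.partitionBy(new FixedRangePartitioner(2, 12))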
On Fri,
Hi Andriana,
Thanks for the suggestion. Could you please modify my code where I need to
do so? I apologise for the inconvenience; because I am new to Spark I couldn't apply it
appropriately. I would be thankful to you.
Not sure how to change your code because you'd need to generate the keys where
you get the data. Sorry about that.
I can tell you where to put the code to remap and sort though.
import org.apache.spark.rdd.OrderedRDDFunctions
val res2 = reduced_hccg.map(_._2)
.map(x => (newkey, x)).sortByKey(true)
Hey guys,
I need to tag individual RDD lines with some values. This tag value would
change at every iteration. Is this possible with RDD (I suppose this is
sort of like mutable RDD, but it's more) ?
If not, what would be the best way to do something like this? Basically, we
need to keep mutable
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
get what you want is to transform to another RDD. But you might look at
MutablePair (
https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
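A minimal sketch of the transform-to-a-new-RDD approach for per-iteration tags (tag type and data are placeholders; sc is the usual SparkContext):
// RDDs are immutable, so "mutating" a tag means producing a new tagged RDD each iteration.
var data = sc.parallelize(Seq("a", "b", "c")).map(line => (0, line))   // (tag, line)
for (iteration <- 1 to 3) {
  data = data.map { case (_, line) => (iteration, line) }              // re-tag every line this iteration
  // ... use `data` in this iteration's computation ...
}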
Weird, how exactly are you pulling out the sample? Do you have a small program
that reproduces this?
Matei
On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
What does your saveRDD contain? If you are using custom objects, they
should be serializable.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.comwrote:
Hi Aureliano,
I
Are you referring to Spark Streaming?
Can you save the sum as an RDD and keep joining the two RDDs together?
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Mar 28, 2014 at 10:47 AM, Adrian Mocanu
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if
you tried it, it would (mostly) work, and my uncertainty with respect to
guarantees on the semantics. Definitely there would be no fault tolerance
if the mutations depend on state that is not captured in the RDD lineage.
Thanks Patrick,
I was thinking about that... Upon analysis I realized it would be
something similar to the way the Hive Context uses the CustomCatalog stuff.
I will review it again, on the lines of implementing SchemaRDD with
Cassandra. Thanks for the pointer.
Upon discussion with couple of
That helps! Thank you.
On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi David,
I am sorry but your question is not clear to me. Are you talking about
taking some value and sharing it across your cluster so that it is present
on all the nodes? You can look at
Got it.
Thanks for your help!!
Chieh-Yen
On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng chenghe...@gmail.com wrote:
Hi~ I wrote a program to test. The non-idempotent compute function in
foreach does change the value of the RDD. It may look a little crazy to do so,
since modifying the RDD will make it
I'm trying to create an RDD from multiple scans.
I tried to set the configuration this way:
Configuration config = HBaseConfiguration.create();
config.setStrings(MultiTableInputFormat.SCANS,scanStrings);
And creating each scan string in the array scanStrings this way:
Scan scan = new Scan();
Hi,
I have an RDD of elements and want to create a new RDD by zipping another RDD
in order.
result[RDD] with a sequence of 10,20,30,40,50 ... elements.
I am facing problems as the index is not an RDD... it gives an error... Could
anyone help me with how we can zip it or map it in order to obtain the following
From my limited knowledge, all classes involved with the RDD operations
should extend Serializable if you want Java serialization (the default).
However, if you want Kryo serialization, you can
use conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
If you also want to perform
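A small sketch putting that together, with a hypothetical class and registrator (names are made up):
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
class MyRecord(val id: Int, val payload: Array[Double])   // hypothetical class used in the RDD
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])                      // avoids Kryo storing the full class name with every object
  }
}
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")         // fully qualified name if the class lives in a package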
zipWithIndex works on the git clone; not sure if it's part of a released
version.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
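For reference, a minimal sketch of what that gives you once zipWithIndex is available:
val k = sc.parallelize(Seq(10, 20, 30, 40, 40, 60))
// zipWithIndex keeps partition order and returns (value, index); swap to get (counter, value)
val withCounter = k.zipWithIndex().map { case (value, i) => (i, value) }   // (0,10), (1,20), (2,30), ...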
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Sat, Mar 29,
Thanks Sonal. Is there any other way to map values with increasing
indexes... so that I can map(t => (i, t)) where the value of 'i' increases after
each map operation on an element...
Please help me in this aspect.
Hi,
I want to perform a map operation on an RDD of elements such that the resulting
RDD is a key-value pair (counter, value).
For example var k: RDD[Int] = 10,20,30,40,40,60...
k.map(t => (i, t)) where 'i' should be like a counter whose value
increments after each map operation...
Please help me..
I tried
Thanks so much Sonal! I am much clearer now.
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug that's fixed in 0.9.1 that might cause you to have too few
results in that join.
Sent from my mobile phone
On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Weird, how exactly are you
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai ro...@tuplejump.com wrote:
Upon discussion with a couple of our clients, it seems the reason they would
prefer using Hive is that they have already invested a lot in it. Mostly in
UDFs and HiveQL.
1. Are there any plans to develop the SQL Parser to
Hi
Is there any workaround to this problem?
I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of
Kafka and handle the partition assignment manually. The easiest setup in
this case would be to bind the number of parallel jobs to the number of
partitions in Kafka. This is
I've only tried 0.9, in which I ran into the `stdin writer to Python
finished early` error so frequently I wasn't able to load even a 1GB file.
Let me know if I can provide any other info!
On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
I see, did this also fail with
Hi everyone,
I'm using Spark on machines where I can't change the maximum number of open
files. As a result, I'm limiting the number of reducers to 500. I'm also
only using a single machine that has 32 cores and emulating a cluster by
running 4 worker daemons with 8 cores (maximum) each.
What
Hi,
I noticed the Spark machine learning examples use training data to validate
regression models. For instance, in the linear regression example
(http://spark.apache.org/docs/0.9.0/mllib-guide.html):
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map {
I think the problem I ran into in 0.9 is covered in
https://issues.apache.org/jira/browse/SPARK-1323
When I kill the python process, the stacktrace I gets indicates that
this happens at initialization. It looks like the initial write to
the Python process does not go through, and then the
This is a great question. We are in the same position, having not invested
in Hive yet and looking at various options for SQL-on-Hadoop.
On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.comwrote:
Hi,
In context of the recent Spark SQL announcement (
The GraphX team has been using Wikipedia dumps from
http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
convenient format than the Freebase dumps. In particular, an article may
span multiple lines, so more involved input parsing is required.
Dan Crankshaw (cc'd) wrote a driver
In particular, we are using this dataset:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Ankur http://www.ankurdave.com/
On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote:
The GraphX team has been using Wikipedia dumps from
Aureliano, you're correct that this is not validation error, which is
computed as the residuals on out-of-training-sample data, and helps
minimize overfit variance.
However, in this example, the errors are correctly referred to as training
error, which is what you might compute on a per-iteration
Hi,
Can we directly convert a Scala collection to a Spark RDD data type without
using the parallelize method?
Is there any way to create a custom converted RDD datatype from a Scala type
using some typecast like that?
Please suggest.
Hi,
On
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
I am trying to run code on Writing Language-Integrated Relational Queries
( I have 1.0.0 Snapshot ).
I am running into error on
val people: RDD[Person] // An RDD of case class objects, from the first
example.
Hi,
I am trying SparkSQL based on the example on doc ...
val people =
sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
val olderThanTeans = people.where('age > 19)
val youngerThanTeans = people.where('age < 13)
val
Hi,
If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
Double works ...
scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String)
defined class JournalLine
Is there a way to see 'Application Detail UI' page (at master:4040) for
completed applications? Currently, I can see that page only for running
applications, I would like to see various numbers for the application after
it has completed.
Can I get the whole operation? Then I can try to locate the error.
smallmonkey...@hotmail.com
From: Manoj Samel
Date: 2014-03-31 01:16
To: user
Subject: SparkSQL where with BigDecimal type gives stacktrace
Hi,
If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
- Patrick
On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:
Is there a way to see 'Application
Hi,
Spark-ec2 uses rsync to deploy many applications. It seems over time more
and more applications have been added to the script, which has
significantly slowed down the setup time.
Perhaps the script could be restructured this way: instead of rsyncing
N times per application, we could have
That is a good idea, though I am not sure how much it will help as time to
rsync is also dependent just on data size being copied. The other problem
is that sometimes we have dependencies across packages, so the first needs
to be running before the second can start etc.
However I agree that it
The Scala object needs to be sent to workers to be used as an RDD;
parallelize is a way to do that. What are you looking to do?
You can also serialize the Scala object to HDFS/disk and load it from there.
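For completeness, a minimal sketch of both options (the HDFS path is a placeholder):
val local = Seq(1, 2, 3, 4)
val fromCollection = sc.parallelize(local)                                       // local Scala collection -> RDD
sc.parallelize(local.map(_.toString)).saveAsTextFile("hdfs:///tmp/my-object")    // or persist it ...
val reloaded = sc.textFile("hdfs:///tmp/my-object")                              // ... and load it back later as lines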
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
+1 Have done a few installations of Shark with customers using Hive, they
love it. Would be good to maintain compatibility with Metastore QL till
we have substantial reason to break off (like BlinkDB).
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Hi,
Would the same issue be present for other Java types like Date?
Converting the person/teenager example on Patrick's page reproduces the
problem ...
Thanks,
scala> import scala.math
import scala.math
scala> case class Person(name: String, age: BigDecimal)
defined class Person
scala> val
Hi Manoj,
At the current time, for drop-in replacement of Hive, it will be best to stick
with Shark. Over time, Shark will use the Spark SQL backend, but should remain
deployable the way it is today (including launching the SharkServer, using the
Hive CLI, etc). Spark SQL is better for
Hi,
If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
resulting RDD should have 'a, 'foo and 'bar.
The result RDD just shows 'foo and 'bar and is missing 'a
Thoughts?
Thanks,
Manoj
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue.
any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:
We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with
Kafka stream setup. I have protocol Buffer 2.5
Hi,
I need to batch the values in my final RDD before writing out to hdfs. The idea
is to batch multiple rows in a protobuf and write those batches out - mostly
to save some space as a lot of metadata is the same.
e.g. 1,2,3,4,5,6 just batch them (1,2), (3,4),(5,6) and save three records
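One minimal way to sketch that, with the protobuf packing replaced by a stand-in and the path made up:
val rows = sc.parallelize(Seq("1", "2", "3", "4", "5", "6"))
val batched = rows.mapPartitions(_.grouped(2))                 // Seq(1,2), Seq(3,4), Seq(5,6), grouped within each partition
batched.map(batch => batch.mkString(","))                      // stand-in for packing one protobuf per batch
  .saveAsTextFile("hdfs:///tmp/batched-output")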
Hi
I am new to Spark and I encountered this error when I try to map RDD[A] =>
RDD[Array[Double]] and then collect the results.
A is a custom class that extends Serializable. (Actually it's just a wrapper
class which wraps a few variables that are all serializable.)
I also tried KryoSerializer according
Hi Sonal,
There are no custom objects in saveRDD, it is of type RDD[(String, String)].
Thanks,
Pradeep
I am facing different kinds of java.lang.ClassNotFoundException when trying to
run spark on mesos. One error has to do with
org.apache.spark.executor.MesosExecutorBackend. Another has to do with
org.apache.spark.serializer.JavaSerializer. I see other people complaining
about similar issues.
I
What versions are you running?
There is a known protobuf 2.5 mismatch, depending on your versions.
Cheers,
Tim
- Original Message -
From: Bharath Bhushan manku.ti...@outlook.com
To: user@spark.apache.org
Sent: Monday, March 31, 2014 8:16:19 AM
Subject:
Hi,
I've just tested Spark in YARN mode, but something confused me.
When I *delete* the yarn.application.classpath configuration in
yarn-site.xml, the following command works well.
*bin/spark-class org.apache.spark.deploy.yarn.Client --jar
I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and
the latest git tree.
Thanks
On 31-Mar-2014, at 7:24 pm, Tim St Clair tstcl...@redhat.com wrote:
What versions are you running?
There is a known protobuf 2.5 mismatch, depending on your versions.
Cheers,
Howdy-doody,
I have a single, very large file sitting in S3 that I want to read in with
sc.textFile(). What are the best practices for reading in this file as
quickly as possible? How do I parallelize the read as much as possible?
Similarly, say I have a single, very large RDD sitting in memory
* unionAll preserves duplicates vs. union, which does not
This is true; if you want to eliminate duplicate items you should follow
the union with a distinct()
* SQL union and unionAll result in the same output format, i.e. another SQL
relation, vs. different RDD types here.
* Understand the existing union
This is similar to how SQL works, items in the GROUP BY clause are not
included in the output by default. You will need to include 'a in the
second parameter list (which is similar to the SELECT clause) as well if
you want it included in the output.
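Concretely, assuming the same DSL and imports as in the question (the SchemaRDD name is a placeholder), the fix would look something like:
val grouped = mySchemaRdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)   // 'a now appears alongside 'foo and 'bar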
On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel
val people: RDD[Person] // An RDD of case class objects, from the first
example. is just a placeholder to avoid cluttering up each example with
the same code for creating an RDD. The : RDD[People] is just there to
let you know the expected type of the variable 'people'. Perhaps there is
a
Hi Michael,
Thanks for the clarification. My question is about the error above (error:
class $iwC needs to be abstract) and what the RDD declaration brings, since I can
do the DSL without the people: org.apache.spark.rdd.RDD[Person]
Thanks,
On Mon, Mar 31, 2014 at 9:13 AM, Michael Armbrust
Note that you may have minSplits set to more than the number of cores in
the cluster, and Spark will just run as many as possible at a time. This is
better if certain nodes may be slow, for instance.
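For example, a minimal sketch of asking for more splits than cores when reading a single large file (bucket and count are placeholders):
// Ask for at least 200 splits; Spark schedules only as many tasks at a time as there are cores.
val big = sc.textFile("s3n://my-bucket/very-large-file.txt", 200)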
In general, it is not necessarily the case that doubling the number of
cores doing IO will double
OK sweet. Thanks for walking me through that.
I wish this were StackOverflow so I could bestow some nice rep on all you
helpful people.
On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote:
Note that you may have minSplits set to more than the number of cores in
the
How about London?
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski andykonwin...@gmail.comwrote:
Hi folks,
We have seen a lot of community growth outside of the Bay Area and we are
looking to help spur even
Not sure what data you are sending in. You could try calling
lines.print() instead which should just output everything that comes in
on the stream. Just to test that your socket is receiving what you think
you are sending.
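A minimal sketch of that debugging step, assuming a socket text stream like the one in the streaming guide (host and port are placeholders, sc is an existing SparkContext):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()          // dump whatever arrives on the socket, just to verify the input
ssc.start()
ssc.awaitTermination()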
On Mon, Mar 31, 2014 at 12:18 PM, eric perler
Responses about London, Montreal/Toronto, DC, Chicago. Great coverage so
far, and keep 'em coming! (still looking for an NYC connection)
I'll reply to each of you off-list to coordinate next-steps for setting up
a Spark meetup in your home area.
Thanks again, this is super exciting.
Andy
On
We'd love to see a Spark user group in Los Angeles and connect with others
working with it here.
Ping me if you're in the LA area and use Spark at your company (
ch...@retentionscience.com ).
Chris
Retention Science
call: 734.272.3099
visit: Site | like: Facebook | follow: Twitter
On Mar
It sounds like the protobuf issue.
So FWIW, you might want to try updating the 0.9.0 build with POM mods for the Mesos
protobuf:
mesos 0.17.0, protobuf 2.5
Cheers,
Tim
- Original Message -
From: Bharath Bhushan manku.ti...@outlook.com
To: user@spark.apache.org
Sent: Monday, March 31, 2014
Dear list,
I was wondering how Spark handles congestion when the upstream is
generating dstreams faster than downstream workers can handle?
Thanks
-Mo