The code is
here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala
I've changed it from Broadcast to Serializable. Now it works :) But there
are too many RDD caches. Is that the problem?
Hi, all
I installed spark-0.9.1 and zeromq 4.0.1, and then ran the examples below:
./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher
tcp://127.0.1.1:1234 foo.bar
./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount
local[2] tcp://127.0.1.1:1234 foo
Unfortunately zeromq 4.0.1 is not supported.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63 says
which version is required. You will need that version of zeromq to see it
work. Basically I have seen it working nicely with
Well, that is not going to be easy, simply because we depend on akka-zeromq
for zeromq support. And since akka does not support the latest zeromq
library yet, I doubt there is anything simple that can be done to
support it.
Prashant Sharma
On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu
Very interesting.
One of Spark's attractive features is being able to do things interactively
via spark-shell. Is something like that still available via Ooyala's job
server?
Or do you use spark-shell independently of it? If the latter, how
do you manage custom jars for spark-shell? Our
Hi,
By default a fraction of the executor memory (60%) is reserved for RDD
caching. So if there's no explicit caching in the code (e.g. rdd.cache()),
or if we persist RDDs with StorageLevel.DISK_ONLY, is this part of
memory wasted? Does Spark allocate the RDD cache memory dynamically? Or
does
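For reference, a minimal sketch of tuning the fraction in question, assuming the 0.9/1.0-era config key spark.storage.memoryFraction:

import org.apache.spark.SparkConf

// Reserve 30% instead of the default 60% of executor memory for the RDD cache.
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.3")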
Create a key and join on that.
val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day,
p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) =>
BillRow(c.customer, c.hour, c.minutes *
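A self-contained sketch of the same pattern; the case classes and sample data are hypothetical stand-ins, since the message truncates before BillRow's full argument list:

import org.apache.spark.SparkContext

case class CallPrice(year: Int, month: Int, day: Int, hour: Int, pricePerMinute: Double)
case class Call(customer: String, year: Int, month: Int, day: Int, hour: Int, minutes: Int)
case class BillRow(customer: String, hour: Int, amount: Double)

val sc = new SparkContext("local", "BillJoin")
val callPrices = sc.parallelize(Seq(CallPrice(2014, 4, 29, 10, 0.05)))
val calls = sc.parallelize(Seq(Call("alice", 2014, 4, 29, 10, 12)))

// Key both RDDs by (year, month, day, hour), join, and build one bill row per call.
val pricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = pricesByHour.join(callsByHour).mapValues { case (p, c) =>
  BillRow(c.customer, c.hour, c.minutes * p.pricePerMinute)
}
bills.values.collect().foreach(println) // BillRow(alice,10,0.6)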
There's no easy way to do this currently. The pieces are there in the PySpark
regression code, which should be adaptable.
But you'd have to roll your own solution.
This is something I also want, so I intend to put together a pull request for
it soon
—
Sent from Mailbox
On Tue, Apr
SparkContext.getRDDStorageInfo
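A sketch of using it (field names per the RDDInfo class of this era; treat the exact names as approximate):

sc.getRDDStorageInfo.foreach { info =>
  // One RDDInfo per persisted RDD: partitions cached vs. total, bytes in memory and on disk.
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}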
On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hi,
Is it possible to know from code whether an RDD is cached, and more
precisely, how many of its partitions are cached in memory and how many are
cached on disk? I
The return type should be RDD[(Int, Int, Int)] because sc.textFile()
returns an RDD. Try adding an import for the RDD type to get rid of the
compile error.
import org.apache.spark.rdd.RDD
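For illustration, a minimal sketch of the fixed signature; the file name and parsing are assumptions, since the original code is truncated:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// sc.textFile(...).map(...) yields an RDD, so the declared return type must be
// RDD[(Int, Int, Int)], not the tuple type (Int, Int, Int).
def input(master: String): RDD[(Int, Int, Int)] = {
  val sc = new SparkContext(master, "Test")
  sc.textFile("ratings.csv").map { line => // hypothetical input file
    val Array(a, b, c) = line.split(",").map(_.trim.toInt)
    (a, b, c)
  }
}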
On Mon, Apr 28, 2014 at 6:22 PM, SK skrishna...@gmail.com wrote:
Hi,
I am a new user of Spark. I have
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN?
From the Python SparkContext class, it doesn't seem to have such an option.
Thank you,
- Guanhua
===
Guanhua Yan, Ph.D.
Information Sciences Group (CCS-3)
Los Alamos National Laboratory
Hi,
I am a new user of Spark. I have a class that defines a function as follows.
It returns a tuple: (Int, Int, Int).
class Sim extends VectorSim {
  override def input(master: String): (Int, Int, Int) = {
    sc = new SparkContext(master, "Test")
    val ratings =
What is Seq[V] in updateStateByKey?
Does this store the collected tuples of the RDD in a collection?
Method signature:
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the
Each time I run sbt/sbt assembly to compile my program, the packaging
takes about 370 seconds (about 6 minutes). How can I reduce this time?
thanks
Hi All,
I have replication factor 3 in my HDFS.
I ran my experiments with 3 datanodes. Now I just added another node
with no data on it.
When I ran again, Spark launched non-local tasks on the new node, and the run
took longer than it did on the 3-node cluster.
I think delay scheduling fails here
You need to merge reference.conf files, and then it's no longer an issue.
See the build for Spark itself:
case "reference.conf" => MergeStrategy.concat
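In an sbt-assembly build of this era that looks roughly like the following sketch (0.13-style syntax; the catch-all case is an assumption modeled on Spark's own build):

import sbtassembly.Plugin._
import AssemblyKeys._

mergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat // concatenate the akka/spark reference.conf files
  case _ => MergeStrategy.first // assumption: keep the first copy of everything else
}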
On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote:
Hello folks,
I was going to post this question to spark user group as
Tip: read the wiki --
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Tips from my experience. Disable scaladoc:
sources in doc in Compile := List()
Do not package the source:
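Collected as a build.sbt sketch (the packageSrc setting is an assumption, since the message is cut off here):

sources in doc in Compile := List() // skip scaladoc generation
publishArtifact in (Compile, packageSrc) := false // assumed: skip building the sources jar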
The original DStream is of (K,V). This function creates a DStream of
(K,S). Each time slice brings one or more new V for each K. The old
state S (can be different from V!) for each K -- possibly non-existent
-- is updated in some way by a bunch of new V, to produce a new state
S -- which also
You may have already seen it, but I will mention it anyway. This example
may help.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Here the state is essentially a running count of the words seen. So the
value
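For concreteness, a minimal sketch of such an update function, where wordDstream is an assumed DStream[(String, Int)] of (word, 1) pairs:

// newValues holds every V that arrived for a key in the current batch;
// the Option[S] is the previous state, None for keys never seen before.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)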
This will be possible in 1.0 after this pull request:
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote:
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN? From
the Python SparkContext class, it
Hi TD,
In my tests with Spark streaming, I'm using (modified) JavaNetworkWordCount code
and a program I wrote that sends words to the Spark worker, using TCP as the
transport. I verified that after starting Spark, it connects to my source, which
actually starts sending, but the first word count
Hi,
I started with a text file (CSV) of data sorted by the first column, and parsed
it into Scala objects using a map operation in Scala. Then I used more maps to
add some extra info to the data and saved it as a text file.
The final text file is not sorted. What do I need to do to keep the order
from the
Hi TD,
We are not using the streaming context with master local; we have 1 master and
8 workers and 1 word source. The command line that we are using is:
bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount
spark://192.168.0.13:7077
On Apr 30, 2014, at 0:09, Tathagata Das
Thanks, Matei. Will take a look at it.
Best regards,
Guanhua
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: user@spark.apache.org
Date: Tue, 29 Apr 2014 14:19:30 -0700
To: user@spark.apache.org
Subject: Re: Python Spark on YARN
This will be possible in 1.0 after this pull request:
Strange! Can you just do lines.print() to print the raw data instead of
doing the word count? Beyond that, we can do two things.
1. Can you check the Spark stage UI to see whether there are stages running
during the 30-second period you referred to?
2. If you upgrade to using the Spark master branch (or Spark
Hi Daniel
Thanks for your reply. I think reduceByKey will also do a map-side
combine, so the result is the same: for each partition,
one entry per distinct word. In my case, with JavaSerializer, a 240MB dataset
yields around 70MB of shuffle data. Only that shuffle
I hit the same issue when updating to spark 0.9.1
(svn checkout https://github.com/apache/spark/)
Exception in thread main java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq;
at
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code takes a lot of CPU time. On a
16-core/32-thread node, the total throughput is around 50MB/s with JavaSerializer,
and if I switch to KryoSerializer, it doubles to around
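For reference, a sketch of how the serializer is switched in this era of Spark (key name per the 0.9.x configuration docs):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Registering frequently shuffled classes via spark.kryo.registrator can speed Kryo up further.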
Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code
For all the tasks, say 32 tasks in total
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Is this the serialization throughput per task or the serialization throughput
for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond
Hi Patrick,
I'm a little confused about your comment that RDDs are not ordered. As far
as I know, RDDs keep a list of partitions that are ordered, and this is why I
can call RDD.take() and get the same first k rows every time I call it, and
RDD.take() returns the same entries as RDD.map().take()
By the way, to be clear, I run repartition first to make all data go through
the shuffle instead of running reduceByKey etc. directly (which reduces the data
that needs to be shuffled and serialized), so all 50MB/s of data from HDFS goes
to the serializer. (In fact, I also tried generating the data in memory
This class was made to be Java-friendly so that we wouldn't have to
maintain two versions. The class itself is simple. But I agree adding Java
setters would be nice.
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote:
There is a JavaSparkContext, but no JavaSparkConf object. I
The latter case: total throughput aggregated from all cores.
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, April 30, 2014 1:22 PM
To: user@spark.apache.org
Subject: Re: How fast would you expect shuffle serialize to be?
Hm -
You are right: once you sort() the RDD, then yes, it has a well-defined ordering.
But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
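A small illustration of the point, as a sketch:

val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
sorted.take(2) // deterministic: the two smallest keys, (1,"a") and (2,"b")

// After a union the result is no longer sorted by key: the (0, "z") pair
// sits after the sorted data, so the sort-order guarantee is gone.
val unioned = sorted.union(sc.parallelize(Seq((0, "z"))))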
On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote:
Hi Patrick,
I'm a little confused