The code is
here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala
I've changed it from Broadcast to Serializable. Now it works :) But there
are too many RDD caches. Is that the problem?
Hi, all
I installed spark-0.9.1 and zeromq 4.0.1, and then ran the examples below:
./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher
tcp://127.0.1.1:1234 foo.bar
./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount
local[2] tcp://127.0.1.1:1234 foo
Unfortunately zeromq 4.0.1 is not supported.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63 says
which version is required. You will need that version of zeromq to see it
work. Basically I have seen it working nicely with
Well, that is not going to be easy, simply because we depend on akka-zeromq
for zeromq support. And since akka does not support the latest zeromq
library yet, I doubt there is anything simple that can be done to
support it.
Prashant Sharma
On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu
Very interesting.
One of Spark's attractive features is being able to do things interactively
via spark-shell. Is something like that still available via Ooyala's job
server?
Or do you use spark-shell independently of it? If the latter, how
do you manage custom jars for spark-shell? Our
Hi,
By default a fraction of the executor memory (60%) is reserved for RDD
caching. So if there's no explicit caching in the code (e.g. rdd.cache()),
or if we persist RDDs with StorageLevel.DISK_ONLY, is this part of
memory wasted? Does Spark allocate the RDD cache memory dynamically? Or
does
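For reference, a minimal sketch of tuning the fraction in question, assuming the 0.9/1.0-era config key spark.storage.memoryFraction:

import org.apache.spark.SparkConf

// Reserve 30% instead of the default 60% of executor memory for the RDD cache.
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.3")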
Create a key and join on that.
val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day,
p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) =>
BillRow(c.customer, c.hour, c.minutes *
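A self-contained sketch of the same pattern; the case classes and sample data are hypothetical stand-ins, since the message truncates before BillRow's full argument list:

import org.apache.spark.SparkContext

case class CallPrice(year: Int, month: Int, day: Int, hour: Int, pricePerMinute: Double)
case class Call(customer: String, year: Int, month: Int, day: Int, hour: Int, minutes: Int)
case class BillRow(customer: String, hour: Int, amount: Double)

val sc = new SparkContext("local", "BillJoin")
val callPrices = sc.parallelize(Seq(CallPrice(2014, 4, 29, 10, 0.05)))
val calls = sc.parallelize(Seq(Call("alice", 2014, 4, 29, 10, 12)))

// Key both RDDs by (year, month, day, hour), join, and build one bill row per call.
val pricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = pricesByHour.join(callsByHour).mapValues { case (p, c) =>
  BillRow(c.customer, c.hour, c.minutes * p.pricePerMinute)
}
bills.values.collect().foreach(println) // BillRow(alice,10,0.6)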
There's no easy way to do this currently. The pieces are there in the PySpark
regression code, which should be adaptable.
But you'd have to roll your own solution.
This is something I also want, so I intend to put together a pull request for
it soon
—
Sent from Mailbox
On Tue, Apr
SparkContext.getRDDStorageInfo
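A sketch of using it (field names per the RDDInfo class of this era; treat the exact names as approximate):

sc.getRDDStorageInfo.foreach { info =>
  // One RDDInfo per persisted RDD: partitions cached vs. total, bytes in memory and on disk.
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}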
On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hi,
Is it possible to know from code whether an RDD is cached, and more
precisely, how many of its partitions are cached in memory and how many are
cached on disk? I
The return type should be RDD[(Int, Int, Int)] because sc.textFile()
returns an RDD. Try adding an import for the RDD type to get rid of the
compile error.
import org.apache.spark.rdd.RDD
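For illustration, a minimal sketch of the fixed signature; the file name and parsing are assumptions, since the original code is truncated:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// sc.textFile(...).map(...) yields an RDD, so the declared return type must be
// RDD[(Int, Int, Int)], not the tuple type (Int, Int, Int).
def input(master: String): RDD[(Int, Int, Int)] = {
  val sc = new SparkContext(master, "Test")
  sc.textFile("ratings.csv").map { line => // hypothetical input file
    val Array(a, b, c) = line.split(",").map(_.trim.toInt)
    (a, b, c)
  }
}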
On Mon, Apr 28, 2014 at 6:22 PM, SK skrishna...@gmail.com wrote:
Hi,
I am a new user of Spark. I have
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN?
From the Python SparkContext class, it doesn't seem to have such an option.
Thank you,
- Guanhua
===
Guanhua Yan, Ph.D.
Information Sciences Group (CCS-3)
Los Alamos National Laboratory
Hi,
I am a new user of Spark. I have a class that defines a function as follows.
It returns a tuple: (Int, Int, Int).
class Sim extends VectorSim {
  override def input(master: String): (Int, Int, Int) = {
    sc = new SparkContext(master, "Test")
    val ratings =
What is Seq[V] in updateStateByKey?
Does this store the collected tuples of the RDD in a collection?
Method signature:
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the
Each time I run sbt/sbt assembly to compile my program, the packaging
takes about 370 seconds (about 6 minutes). How can I reduce this time?
thanks
Hi All,
I have replication factor 3 in my HDFS.
I ran my experiments with 3 datanodes. Now I just added another node
with no data on it.
When I ran again, Spark launched non-local tasks on the new node, and the run
took longer than it did on the 3-node cluster.
I think delay scheduling fails here
You need to merge reference.conf files, and then it's no longer an issue.
See the build for Spark itself:
case "reference.conf" => MergeStrategy.concat
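In an sbt-assembly build of this era that looks roughly like the following sketch (0.13-style syntax; the catch-all case is an assumption modeled on Spark's own build):

import sbtassembly.Plugin._
import AssemblyKeys._

mergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat // concatenate the akka/spark reference.conf files
  case _ => MergeStrategy.first // assumption: keep the first copy of everything else
}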
On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote:
Hello folks,
I was going to post this question to spark user group as
Tip: read the wiki --
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Tips from my experience. Disable scaladoc:
sources in doc in Compile := List()
Do not package the source:
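Collected as a build.sbt sketch (the packageSrc setting is an assumption, since the message is cut off here):

sources in doc in Compile := List() // skip scaladoc generation
publishArtifact in (Compile, packageSrc) := false // assumed: skip building the sources jar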
The original DStream is of (K,V). This function creates a DStream of
(K,S). Each time slice brings one or more new V for each K. The old
state S (can be different from V!) for each K -- possibly non-existent
-- is updated in some way by a bunch of new V, to produce a new state
S -- which also
You may have already seen it, but I will mention it anyway. This example
may help.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Here the state is essentially a running count of the words seen. So the
value
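For concreteness, a minimal sketch of such an update function, where wordDstream is an assumed DStream[(String, Int)] of (word, 1) pairs:

// newValues holds every V that arrived for a key in the current batch;
// the Option[S] is the previous state, None for keys never seen before.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)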
This will be possible in 1.0 after this pull request:
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote:
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN? From
the Python SparkContext class, it
Hi TD,
In my tests with Spark streaming, I'm using (modified) JavaNetworkWordCount code
and a program I wrote that sends words to the Spark worker, using TCP as the
transport. I verified that after starting Spark, it connects to my source, which
actually starts sending, but the first word count
Hi,
I started with a text file (CSV) of data sorted by the first column, and parsed
it into Scala objects using a map operation in Scala. Then I used more maps to
add some extra info to the data and saved it as a text file.
The final text file is not sorted. What do I need to do to keep the order
from the
Hi TD,
We are not using the streaming context with master local; we have 1 master and
8 workers and 1 word source. The command line that we are using is:
bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount
spark://192.168.0.13:7077
On Apr 30, 2014, at 0:09, Tathagata Das
Thanks, Matei. Will take a look at it.
Best regards,
Guanhua
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: user@spark.apache.org
Date: Tue, 29 Apr 2014 14:19:30 -0700
To: user@spark.apache.org
Subject: Re: Python Spark on YARN
This will be possible in 1.0 after this pull request:
Strange! Can you just do lines.print() to print the raw data instead of
doing the word count? Beyond that, we can do two things.
1. Can you check the Spark stage UI to see whether there are stages running
during the 30-second period you referred to?
2. If you upgrade to using the Spark master branch (or Spark
Hi Daniel
Thanks for your reply. I think reduceByKey will also do a map-side
combine, so the result is the same: for each partition,
one entry per distinct word. In my case, with JavaSerializer, a 240MB dataset
yields around 70MB of shuffle data. Only that shuffle
I hit the same issue when updating to spark 0.9.1
(svn checkout https://github.com/apache/spark/)
Exception in thread main java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq;
at
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code takes a lot of CPU time. On a
16-core/32-thread node, the total throughput is around 50MB/s with JavaSerializer,
and if I switch to KryoSerializer, it doubles to around
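For reference, a sketch of how the serializer is switched in this era of Spark (key name per the 0.9.x configuration docs):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Registering frequently shuffled classes via spark.kryo.registrator can speed Kryo up further.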
Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code
For all the tasks, say 32 tasks in total
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Is this the serialization throughput per task or the serialization throughput
for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond
Hi Patrick,
I'm a little confused about your comment that RDDs are not ordered. As far
as I know, RDDs keep a list of partitions that are ordered, and this is why I
can call RDD.take() and get the same first k rows every time I call it, and
RDD.take() returns the same entries as RDD.map().take()
By the way, to be clear, I run repartition first to make all data go through
the shuffle instead of running reduceByKey etc. directly (which reduces the data
that needs to be shuffled and serialized), so all 50MB/s of data from HDFS goes
to the serializer. (In fact, I also tried generating the data in memory
This class was made to be Java-friendly so that we wouldn't have to
maintain two versions. The class itself is simple. But I agree adding Java
setters would be nice.
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote:
There is a JavaSparkContext, but no JavaSparkConf object. I
The latter case: total throughput aggregated from all cores.
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, April 30, 2014 1:22 PM
To: user@spark.apache.org
Subject: Re: How fast would you expect shuffle serialize to be?
Hm -
You are right: once you sort() the RDD, then yes, it has a well-defined ordering.
But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
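A small illustration of the point, as a sketch:

val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
sorted.take(2) // deterministic: the two smallest keys, (1,"a") and (2,"b")

// After a union the result is no longer sorted by key: the (0, "z") pair
// sits after the sorted data, so the sort-order guarantee is gone.
val unioned = sorted.union(sc.parallelize(Seq((0, "z"))))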
On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote:
Hi Patrick,
I'm a little confused