Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sung Hwan Chung
Well, it says that the jar was successfully added but can't reference classes from it. Does this have anything to do with this bug? http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza sandy.r...@cloudera.com

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote: Actually looking closer it is stranger than I thought, in the spark UI, one executor has executed 4 tasks, and one has executed 1928 Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Evgeniy Shishkin wrote So, at the bottom — kafka input stream just does not work. That was the conclusion I was coming to as well. Are there open tickets around fixing this up? -- View this message in context:

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sandy Ryza
That bug only appears to apply to spark-shell. Do things work in yarn-client mode or on a standalone cluster? Are you passing a path with parent directories to addJar? On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung coded...@cs.stanford.eduwrote: Well, it says that the jar was successfully

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the worker has not been given enough memory or the allocation of the memory to the RDD storage needs to be fixed. If configured correctly, the Spark workers should not get OOMs. On Thu, Mar 27, 2014 at 2:52 PM, Evgeny

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote: The more I think about it, the problem is not about /tmp; it's more about the workers not having enough memory. Blocks of received data could be falling out of memory before they get processed. BTW, what is the

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added a third node to the cluster, and a third executor came up, and everything

Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher, Sorry, I might be missing the obvious, but how do I get my function called on all Executors used by the app? I don't want to use RDDs unless necessary. Once I start my shell or app, how do I get TaskNonce.getSingleton().doThisOnce() executed on each executor? @dmpour
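A best-effort sketch of one common workaround, in Scala: run a throwaway job with many small partitions and let a JVM-wide singleton guard the work. TaskNonce is the hypothetical singleton already mentioned in this thread, and nothing here guarantees that every executor actually receives a partition.

    // Spread many tiny tasks across the cluster; the singleton ensures the work
    // happens at most once per executor JVM, no matter how many tasks land there.
    val slots = 1000  // assumption: comfortably larger than the total executor cores
    sc.parallelize(0 until slots, slots).foreachPartition { _ =>
      TaskNonce.getSingleton().doThisOnce()
    }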

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote: Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right? My doubt was: between two different experiments, do the RDDs cached in memory need to be unpersisted? Or does it not matter? On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi shas...@cloudera.com wrote: Which storage

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll try to look into these, seems like a serious error. Matei On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote: Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 from GitHub on

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Can anyone help? How can I configure a different spark.local.dir for each executor? On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always

Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi, My worker nodes have more memory than the host that I’m submitting my driver program, but it seems that SPARK_MEM is also setting the Xmx of the spark shell? $ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell Java HotSpot(TM) 64-Bit Server VM warning: INFO:

Re: Setting SPARK_MEM higher than available memory in driver

2014-03-28 Thread Aaron Davidson
Assuming you're using a new enough version of Spark, you should use spark.executor.memory to set the memory for your executors, without changing the driver memory. See the docs for your version of Spark. On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.comwrote: Hi, My worker
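For reference, a minimal sketch of that suggestion, assuming a Spark version new enough to have SparkConf; the master URL and the 4g value are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each executor 4g of heap while leaving the driver/shell JVM alone.
    val conf = new SparkConf()
      .setMaster("spark://XXX:7077")
      .setAppName("executor-memory-example")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)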

Re: Configuring shuffle write directory

2014-03-28 Thread Tsai Li Ming
Hi, Thanks! I found out that I wasn’t setting the SPARK_JAVA_OPTS correctly.. I took a look at the process table and saw that the “org.apache.spark.executor.CoarseGrainedExecutorBackend” didn’t have the -Dspark.local.dir set. On 28 Mar, 2014, at 1:05 pm, Matei Zaharia

Re: Replicating RDD elements

2014-03-28 Thread Sonal Goyal
Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at Spark's broadcasting in that case. On the other hand, if you want to take one item and create an RDD of 100

Exception on simple pyspark script

2014-03-28 Thread idanzalz
Hi, I am a newbie with Spark. I tried installing 2 virtual machines, one as a client and one as standalone mode worker+master. Everything seems to run and connect fine, but when I try to run a simple script, I get weird errors. Here is the traceback, notice my program is just a one-liner:

Re: Not getting it

2014-03-28 Thread Sonal Goyal
Have you tried setting the partitioning ? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.comwrote: Hi all, I've got something which I think should be straightforward but

spark.akka.frameSize setting problem

2014-03-28 Thread lihu
Hi, I just ran a simple example to generate some data for the ALS algorithm. My Spark version is 0.9, in local mode, and the memory of my node is 108G. But when I set conf.set("spark.akka.frameSize", "4096"), the following problem occurred, and when I do not set this, it runs well.

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.comwrote: Hi all, I notice that RDD.cartesian has a strange behavior with cached and uncached data. More

Re: Exception on simple pyspark script

2014-03-28 Thread idanzalz
I sorted it out. Turns out that if the client uses Python 2.7 and the server is Python 2.6, you get some weird errors, like this and others. So you would probably want not to do that... -- View this message in context:

streaming: code to simulate a network socket data source

2014-03-28 Thread Diana Carroll
If you are learning about Spark Streaming, as I am, you've probably used netcat (nc) as mentioned in the Spark Streaming programming guide. I wanted something a little more useful, so I modified the ClickStreamGenerator code to make a very simple script that simply reads a file off disk and passes
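A minimal Scala sketch of that idea (the port, file path, and delay are placeholders): accept a connection, then replay the file line by line with a small pause so it looks like a live feed.

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.io.Source

    object SimpleSocketSource {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)            // port the streaming job connects to
        while (true) {
          val socket = server.accept()                 // wait for a receiver to connect
          val out = new PrintWriter(socket.getOutputStream, true)
          Source.fromFile("/tmp/events.log").getLines().foreach { line =>
            out.println(line)                          // one event per line
            Thread.sleep(100)                          // throttle to simulate a live stream
          }
          socket.close()
        }
      }
    }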

Re: Not getting it

2014-03-28 Thread lannyripple
I've played around with it. The CSV file looks like it gives 130 partitions. I'm assuming that's the standard 64MB split size for HDFS files. I have increased the number of partitions and the number of tasks for things like groupByKey and such. Usually I start blowing up on GC Overlimit or sometimes

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Debasish Das
Classes are serialized and sent to all the workers as Akka messages. For singletons and case classes I am not sure if they are Java-serialized or Kryo-serialized by default, but your own classes, if serialized by Kryo, will definitely be much more efficient. There is a comparison that Matei did for all

Re: function state lost when next RDD is processed

2014-03-28 Thread Mark Hamstra
As long as the amount of state being passed is relatively small, it's probably easiest to send it back to the driver and to introduce it into RDD transformations as the zero value of a fold. On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu amoc...@verticalscope.comwrote: I'd like to resurrect

Re: Not getting it

2014-03-28 Thread lannyripple
Ok. Based on Sonal's message I dived more into memory and partitioning and got it to work. For the CSV file I used 1024 partitions [textFile(path, 1024)] which cut the partition size down to 8MB (based on standard HDFS 64MB splits). For the key file I also adjusted partitions to use about 8MB.

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
I'd like to resurrect this thread since I don't have an answer yet. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 10:04 AM To: u...@spark.incubator.apache.org Subject: function state lost when next RDD is processed Is there a way to pass a custom function to spark to

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread pradeeps8
Hi Aureliano, I followed this thread to create a custom saveAsObjectFile. The following is the code: new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray)).map(x => (NullWritable.get(), new

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Ognen Duzlevski
There is also this quote from the Tuning guide (http://spark.incubator.apache.org/docs/latest/tuning.html): Finally, if you don't register your classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful. It implies that you don't really

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
Thanks! Ya that's what I'm doing so far, but I wanted to see if it's possible to keep the tuples inside Spark for fault tolerance purposes. -A From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: March-28-14 10:45 AM To: user@spark.apache.org Subject: Re: function state lost when next RDD

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen! It's not a fancy class that I wrote, and now I realized I neither extend Serializable nor register with Kryo, and that's why it is not working. -- View this message in context:

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi, Thanks Nanzhu. I tried to implement your suggestion on the following scenario: I have an RDD of, say, 24 elements. When I partitioned it into two groups of 12 elements each, the order of elements within a partition was lost; elements are partitioned randomly. I need to preserve the order such that the first

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
The cleaner ttl was introduced as a brute force method to clean all old data and metadata in the system, so that the system can run 24/7. The cleaner ttl should be set to a large value, so that RDDs older than that are not used. Though there are some cases where you may want to use an RDD again
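For reference, the setting under discussion is spark.cleaner.ttl, a duration in seconds; the one-hour value below is only an illustration, and conf is assumed to be the application's SparkConf.

    // Metadata and data older than this many seconds become eligible for cleanup.
    conf.set("spark.cleaner.ttl", "3600")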

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I think you should sort each RDD -Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 4:44 PM To: u...@spark.incubator.apache.org Subject: Re: Splitting RDD and Grouping together to perform computation Hi, Thanks Nanzhu.I tried to implement your suggestion on

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I say you need to remap so you have a key for each tuple that you can sort on. Then call rdd.sortByKey(true), like this: mystream.transform(rdd => rdd.sortByKey(true)). For this fn to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions -Original Message- From: yh18190

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Syed A. Hashmi
From the gist of it, it seems like you need to override the default partitioner to control how your data is distributed among partitions. Take a look at the different Partitioners available (Default, Range, Hash); if none of these gets you the desired result, you might want to provide your own. On Fri,
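For illustration, a minimal custom partitioner sketch that keeps consecutive integer keys together; the group size and partition count are placeholders, and it assumes the RDD has been re-keyed by element position.

    import org.apache.spark.Partitioner

    // Keys 0-11 go to partition 0, keys 12-23 to partition 1, and so on.
    class FixedGroupPartitioner(groupSize: Int, parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = (key.asInstanceOf[Int] / groupSize) % parts
    }

    // usage sketch, assuming indexedRdd: RDD[(Int, V)] keyed by element position
    // indexedRdd.partitionBy(new FixedGroupPartitioner(12, 2))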

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi Andriana, Thanks for the suggestion. Could you please modify the part of my code where I need to do so? I apologise for the inconvenience; because I am new to Spark I couldn't apply it appropriately. I would be thankful to you. -- View this message in context:

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
Not sure how to change your code because you'd need to generate the keys where you get the data. Sorry about that. I can tell you where to put the code to remap and sort though. import org.apache.spark.rdd.OrderedRDDFunctions val res2 = reduced_hccg.map(_._2).map(x => (newkey, x)).sortByKey(true)

Mutable tagging RDD rows ?

2014-03-28 Thread Sung Hwan Chung
Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like mutable RDD, but it's more) ? If not, what would be the best way to do something like this? Basically, we need to keep mutable

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this? Matei On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.comwrote: Hi Aureliano, I

Re: function state lost when next RDD is processed

2014-03-28 Thread Mayur Rustagi
Are you referring to Spark Streaming? Can you save the sum as an RDD and keep joining the two RDDs together? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Mar 28, 2014 at 10:47 AM, Adrian Mocanu

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if you tried it, it would (mostly) work, and my uncertainty with respect to guarantees on the semantics. Definitely there would be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage.

Re: Announcing Spark SQL

2014-03-28 Thread Rohit Rai
Thanks Patrick, I was thinking about that... Upon analysis I realized (on date) it would be something similar to the way the Hive Context uses the CustomCatalog stuff. I will review it again, along the lines of implementing SchemaRDD with Cassandra. Thanks for the pointer. Upon discussion with a couple of

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at

Re: 答复: 答复: RDD usage

2014-03-29 Thread Chieh-Yen
Got it. Thanks for your help!! Chieh-Yen On Tue, Mar 25, 2014 at 6:51 PM, hequn cheng chenghe...@gmail.com wrote: Hi~ I wrote a program to test. The non-idempotent compute function in foreach does change the value of the RDD. It may look a little crazy to do so since modifying the RDD will make it

working with MultiTableInputFormat

2014-03-29 Thread Livni, Dana
I'm trying to create an RDD from multiple scans. I tried to set the configuration this way: Configuration config = HBaseConfiguration.create(); config.setStrings(MultiTableInputFormat.SCANS,scanStrings); And creating each scan string in the array scanStrings this way: Scan scan = new Scan();
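For what it's worth, a rough Scala sketch of turning that configuration into an RDD; it assumes scanStrings is already the array of serialized scan strings built as in the snippet above.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.MultiTableInputFormat

    val config = HBaseConfiguration.create()
    config.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)  // one entry per scan

    // Each record is a (row key, result) pair drawn from all configured scans.
    val hbaseRdd = sc.newAPIHadoopRDD(
      config,
      classOf[MultiTableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])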

Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Hi, I have an RDD of elements and want to create a new RDD by zipping another RDD in order: result[RDD] with a sequence of 10,20,30,40,50 ... elements. I am facing problems as the index is not an RDD... it gives an error... Could anyone help me with how we can zip or map it in order to obtain the following

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread Sonal Goyal
From my limited knowledge, all classes involved with the RDD operations should extend Serializable if you want Java serialization (the default). However, if you want Kryo serialization, you can use conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). If you also want to perform
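A minimal registration sketch to go with that; MyClass and MyRegistrator are placeholders for your own types.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass(val id: Int)                          // placeholder for your own type

    class MyRegistrator extends KryoRegistrator {
      def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass])                 // avoids storing full class names
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")   // use the fully qualified name in a real app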

Re: Zip or map elements to create new RDD

2014-03-29 Thread Sonal Goyal
zipWithIndex works on the git clone; not sure if it's part of a released version. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29,

Re: Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Thanks Sonal. Is there any other way to map values with increasing indexes... so that I can map(t => (i, t)) where the value of 'i' increases after each map operation on an element... Please help me in this aspect -- View this message in context:

How to index each map operation????

2014-03-29 Thread yh18190
Hi, I want to perform a map operation on an RDD of elements such that the resulting RDD is a key-value pair (counter, value). For example var k: RDD[Int] = 10,20,30,40,40,60... k.map(t => (i, t)) where 'i' should be like a counter whose value increments after each map operation... Please help me.. I tried
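If your build has RDD.zipWithIndex (it is in the master branch Sonal linked, not in the 0.9 release), a sketch of producing (counter, value) pairs:

    // zipWithIndex assigns 0, 1, 2, ... following the RDD's partition order.
    val k = sc.parallelize(Seq(10, 20, 30, 40, 40, 60))
    val indexed = k.zipWithIndex().map { case (value, i) => (i, value) }
    // indexed: RDD[(Long, Int)] containing (0,10), (1,20), (2,30), ...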

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread anny9699
Thanks so much Sonal! I am much clearer now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3472.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this spark 0.9.0? Try setting spark.shuffle.spill=false There was a hash collision bug that's fixed in 0.9.1 that might cause you to have too few results in that join. Sent from my mobile phone On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Weird, how exactly are you

Re: Announcing Spark SQL

2014-03-29 Thread Michael Armbrust
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai ro...@tuplejump.com wrote: Upon discussion with couple of our clients, it seems the reason they would prefer using hive is that they have already invested a lot in it. Mostly in UDFs and HiveQL. 1. Are there any plans to develop the SQL Parser to

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-29 Thread Nicolas Bär
Hi Is there any workaround to this problem? I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of Kafka and handle the partition assignment manually. The easiest setup in this case would be to bind the number of parallel jobs to the number of partitions in Kafka. This is

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I've only tried 0.9, in which I ran into the `stdin writer to Python finished early` error so frequently that I wasn't able to load even a 1GB file. Let me know if I can provide any other info! On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I see, did this also fail with

Limiting number of reducers performance implications

2014-03-29 Thread Matthew Cheah
Hi everyone, I'm using Spark on machines where I can't change the maximum number of open files. As a result, I'm limiting the number of reducers to 500. I'm also only using a single machine that has 32 cores and emulating a cluster by running 4 worker daemons with 8 cores (maximum) each. What
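For reference, a sketch of how the reducer cap is usually expressed, assuming pairs is a key/value RDD; 500 matches the limit mentioned above.

    import org.apache.spark.SparkContext._   // brings reduceByKey/groupByKey into scope

    // Cap the shuffle at 500 reduce partitions.
    val counts = pairs.reduceByKey(_ + _, 500)
    // groupByKey, join, etc. accept the same optional numPartitions argument:
    val grouped = pairs.groupByKey(500)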

Cross validation is missing in machine learning examples

2014-03-29 Thread Aureliano Buendia
Hi, I noticed the Spark machine learning examples use training data to validate regression models. For instance, in the linear regression example (http://spark.apache.org/docs/0.9.0/mllib-guide.html): // Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map {

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I think the problem I ran into in 0.9 is covered in https://issues.apache.org/jira/browse/SPARK-1323 When I kill the python process, the stacktrace I gets indicates that this happens at initialization. It looks like the initial write to the Python process does not go through, and then the

Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Nicholas Chammas
This is a great question. We are in the same position, having not invested in Hive yet and looking at various options for SQL-on-Hadoop. On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.comwrote: Hi, In context of the recent Spark SQL announcement (

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required. Dan Crankshaw (cc'd) wrote a driver

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
In particular, we are using this dataset: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 Ankur http://www.ankurdave.com/ On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote: The GraphX team has been using Wikipedia dumps from

Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data, and helps minimize overfit variance. However, in this example, the errors are correctly referred to as training error, which is what you might compute on a per-iteration
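A minimal holdout-evaluation sketch along those lines, assuming the parsedData: RDD[LabeledPoint] from the MLlib regression example and a Spark release that has RDD.randomSplit; the 80/20 split, seed, and iteration count are arbitrary.

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // Hold out 20% of the data and measure error on points the model never saw.
    val Array(training, test) = parsedData.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = LinearRegressionWithSGD.train(training.cache(), 100)
    val testMSE = test.map { p =>
      val err = p.label - model.predict(p.features)
      err * err
    }.mean()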

Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread yh18190
Hi, Can we convert a Scala collection directly to a Spark RDD data type without using the parallelize method? Is there any way to create a custom converted RDD datatype from a Scala type using some typecast like that? Please suggest. -- View this message in context:

Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi, On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under Writing Language-Integrated Relational Queries (I have a 1.0.0 Snapshot). I am running into an error on val people: RDD[Person] // An RDD of case class objects, from the first example.

Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi, I am trying SparkSQL based on the example in the doc ... val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) val olderThanTeans = people.where('age > 19) val youngerThanTeans = people.where('age < 13) val

SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ... scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String) defined class JournalLine

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications, I would like to see various numbers for the application after it has completed.

Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread smallmonkey...@hotmail.com
Can I get the whole operation? Then I can try to locate the error. smallmonkey...@hotmail.com From: Manoj Samel Date: 2014-03-31 01:16 To: user Subject: SparkSQL where with BigDecimal type gives stacktrace Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote: Is there a way to see 'Application
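For the curious, the 1.0 mechanism is an application event log that the UI replays after the fact; a sketch of the relevant settings, assuming the spark.eventLog.* properties that shipped with 1.0 (the HDFS path is a placeholder).

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")                              // record UI events
      .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory") // placeholder path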

Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Aureliano Buendia
Hi, Spark-ec2 uses rsync to deploy many applications. It seems over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times per application, we could have

Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help, as the time to rsync also depends on the data size being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start, etc. However, I agree that it

Re: Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread Mayur Rustagi
The Scala object needs to be sent to workers to be used as an RDD; parallelize is a way to do that. What are you looking to do? You can serialize the Scala object to HDFS/disk and load it from there. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
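Concretely, a minimal sketch of the parallelize route (the buffer contents are just an illustration):

    import scala.collection.mutable.ArrayBuffer

    val buf = ArrayBuffer((1, 0.5), (2, 1.5), (3, 2.5))  // ArrayBuffer[(Int, Double)]
    val rdd = sc.parallelize(buf)                        // org.apache.spark.rdd.RDD[(Int, Double)]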

Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Mayur Rustagi
+1 Have done a few installations of Shark with customers using Hive, they love it. Would be good to maintain compatibility with Metastore QL till we have substantial reason to break off (like BlinkDB). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, Would the same issue be present for other Java types, like Date? Converting the person/teenager example on Patrick's page reproduces the problem ... Thanks, scala> import scala.math import scala.math scala> case class Person(name: String, age: BigDecimal) defined class Person scala> val

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj, At the current time, for drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc). Spark SQL is better for

groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo and 'bar. The result RDD just shows 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-30 Thread Vipul Pandey
I'm using ScalaBuff (which depends on protobuf2.5) and facing the same issue. any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote: We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with Kafka stream setup. I have protocol Buffer 2.5

batching the output

2014-03-30 Thread Vipul Pandey
Hi, I need to batch the values in my final RDD before writing them out to HDFS. The idea is to batch multiple rows into a protobuf and write those batches out, mostly to save some space as a lot of the metadata is the same. e.g. given 1,2,3,4,5,6, just batch them as (1,2), (3,4), (5,6) and save three records
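One way to sketch that batching, grouping consecutive rows within each partition before writing; rows stands for the final RDD, the batch size and output path are placeholders, and wrapping each batch in a protobuf is left out.

    // Pack consecutive rows into fixed-size batches inside each partition,
    // then write one record per batch.
    val batched = rows.mapPartitions(_.grouped(2).map(_.toSeq))
    batched.saveAsTextFile("hdfs:///tmp/batched-output")   // placeholder path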

Re: Task not serializable?

2014-03-31 Thread Daniel Liu
Hi, I am new to Spark and I encountered this error when I try to map RDD[A] => RDD[Array[Double]] and then collect the results. A is a custom class that extends Serializable. (Actually it's just a wrapper class which wraps a few variables that are all serializable.) I also tried KryoSerializer according

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-31 Thread pradeeps8
Hi Sonal, There are no custom objects in saveRDD, it is of type RDD[(String, String)]. Thanks, Pradeep -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3508.html Sent from the Apache

java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I am facing different kinds of java.lang.ClassNotFoundException when trying to run spark on mesos. One error has to do with org.apache.spark.executor.MesosExecutorBackend. Another has to do with org.apache.spark.serializer.JavaSerializer. I see other people complaining about similar issues. I

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
What versions are you running? There is a known protobuf 2.5 mismatch, depending on your versions. Cheers, Tim - Original Message - From: Bharath Bhushan manku.ti...@outlook.com To: user@spark.apache.org Sent: Monday, March 31, 2014 8:16:19 AM Subject:

yarn.application.classpath in yarn-site.xml

2014-03-31 Thread Dan
Hi, I've just tested Spark in yarn mode, but something confused me. When I *delete* the yarn.application.classpath configuration in yarn-site.xml, the following command works well. *bin/spark-class org.apache.spark.deploy.yarn.Client --jar

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and the latest git tree. Thanks On 31-Mar-2014, at 7:24 pm, Tim St Clair tstcl...@redhat.com wrote: What versions are you running? There is a known protobuf 2.5 mismatch, depending on your versions. Cheers,

Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
Howdy-doody, I have a single, very large file sitting in S3 that I want to read in with sc.textFile(). What are the best practices for reading in this file as quickly as possible? How do I parallelize the read as much as possible? Similarly, say I have a single, very large RDD sitting in memory

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust
* unionAll preserves duplicates vs. union, which does not: This is true; if you want to eliminate duplicate items you should follow the union with a distinct(). * SQL union and unionAll result in the same output format, i.e. another SQL relation, vs. different RDD types here. * Understand the existing union

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Michael Armbrust
This is similar to how SQL works, items in the GROUP BY clause are not included in the output by default. You will need to include 'a in the second parameter list (which is similar to the SELECT clause) as well if you want it included in the output. On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel
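In other words, something like this sketch of the DSL from the catalyst docs, where rdd stands for the SchemaRDD being grouped:

    // The grouping column must also appear in the second (select-like) parameter list.
    val grouped = rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)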

Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
val people: RDD[Person] // An RDD of case class objects, from the first example. is just a placeholder to avoid cluttering up each example with the same code for creating an RDD. The : RDD[People] is just there to let you know the expected type of the variable 'people'. Perhaps there is a

Re: Error in SparkSQL Example

2014-03-31 Thread Manoj Samel
Hi Michael, Thanks for the clarification. My question is about the error above (error: class $iwC needs to be abstract) and what the RDD annotation brings, since I can do the DSL without the people: org.apache.spark.rdd.RDD[Person] declaration. Thanks, On Mon, Mar 31, 2014 at 9:13 AM, Michael Armbrust

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
Note that you may have minSplits set to more than the number of cores in the cluster, and Spark will just run as many as possible at a time. This is better if certain nodes may be slow, for instance. In general, it is not necessarily the case that doubling the number of cores doing IO will double
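A small sketch of both directions (the bucket, paths, and counts are placeholders):

    // Ask for at least 100 input splits so many tasks can read the S3 object in parallel.
    val lines = sc.textFile("s3n://my-bucket/big-file.txt", 100)

    // Writing: each partition is written as its own part-file, in parallel.
    lines.repartition(32).saveAsTextFile("s3n://my-bucket/output/")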

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
OK sweet. Thanks for walking me through that. I wish this were StackOverflow so I could bestow some nice rep on all you helpful people. On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote: Note that you may have minSplits set to more than the number of cores in the

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Martin Goodson
How about London? -- Martin Goodson | VP Data Science (0)20 3397 1240 [image: Inline image 1] On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski andykonwin...@gmail.comwrote: Hi folks, We have seen a lot of community growth outside of the Bay Area and we are looking to help spur even

Re: network wordcount example

2014-03-31 Thread Diana Carroll
Not sure what data you are sending in. You could try calling lines.print() instead which should just output everything that comes in on the stream. Just to test that your socket is receiving what you think you are sending. On Mon, Mar 31, 2014 at 12:18 PM, eric perler

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Andy Konwinski
Responses about London, Montreal/Toronto, DC, Chicago. Great coverage so far, and keep 'em coming! (still looking for an NYC connection) I'll reply to each of you off-list to coordinate next-steps for setting up a Spark meetup in your home area. Thanks again, this is super exciting. Andy On

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Chris Gore
We'd love to see a Spark user group in Los Angeles and connect with others working with it here. Ping me if you're in the LA area and use Spark at your company ( ch...@retentionscience.com ). Chris Retention Science call: 734.272.3099 visit: Site | like: Facebook | follow: Twitter On Mar

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
It sounds like the protobuf issue. So FWIW, you might want to try updating the 0.9.0 pom with mods for the mesos/protobuf versions: mesos 0.17.0, protobuf 2.5. Cheers, Tim - Original Message - From: Bharath Bhushan manku.ti...@outlook.com To: user@spark.apache.org Sent: Monday, March 31, 2014

how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Dear list, I was wondering how Spark handles congestion when the upstream is generating DStreams faster than the downstream workers can handle? Thanks -Mo
