Re: how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Thanks -Mo 2014-03-31 13:16 GMT-05:00 Evgeny Shishkin itparan...@gmail.com: On 31 Mar 2014, at 21:05, Dong Mo monted...@gmail.com wrote: Dear list, I was wondering how Spark handles congestion when the upstream is generating dstreams faster than downstream workers can handle? It

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Yana Kadiyska
Nicholas, I'm in Boston and would be interested in a Spark group. Not sure if you know this -- there was a meetup that never got off the ground. Anyway, I'd be +1 for attending. Not sure what is involved in organizing. Seems a shame that a city like Boston doesn't have one. On Mon, Mar 31, 2014

Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nicholas Chammas
My fellow Bostonians and New Englanders, We cannot allow New York to beat us to having a banging Spark meetup. Respond to me (and I guess also Andy?) if you are interested. Yana, I'm not sure either what is involved in organizing, but we can figure it out. I didn't know about the meetup that

Re: Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nick Pentreath
I would offer to host one in Cape Town but we're almost certainly the only Spark users in the country apart from perhaps one in Johannesburg :) — Sent from Mailbox for iPhone On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: My fellow Bostonians and New

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Jeremy Freeman
Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, but am back in NYC quite often, and have been turning several computational people at Columbia / NYU / Simons Foundation onto Spark; there'd definitely be interest in those communities. -- Jeremy

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup! Sent from my iPhone On Mar 31, 2014, at 3:07 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, but am back in NYC quite often, and have been turning several

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Denny Lee
If you have any questions on helping to get a Spark Meetup off the ground, please do not hesitate to ping me (denny.g@gmail.com).  I helped jump start the one here in Seattle (and tangentially have been helping the Vancouver and Denver ones as well).  HTH! On March 31, 2014 at 12:35:38

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Your suggestion took me past the ClassNotFoundException. I then hit akka.actor.ActorNotFound exception. I patched in PR 568 into my 0.9.0 spark codebase and everything worked. So thanks a lot, Tim. Is there a JIRA/PR for the protobuf issue? Why is it not fixed in the latest git tree? Thanks.

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be getting pulled in unless you are directly using akka yourself. Are you? Does your project have other dependencies that might be indirectly pulling in protobuf 2.4.1? It would be helpful if you could list all of your

Calling Spark enthusiasts in Austin, TX

2014-03-31 Thread Ognen Duzlevski
In the spirit of everything being bigger and better in TX ;) = if anyone is in Austin and interested in meeting up over Spark - contact me! There seems to be a Spark meetup group in Austin that has never met and my initial email to organize the first gathering was never acknowledged. Ognen On

Re: network wordcount example

2014-03-31 Thread Chris Fregly
@eric- i saw this exact issue recently while working on the KinesisWordCount. are you passing local[2] to your example as the MASTER arg versus just local or local[1]? you need at least 2. it's documented as n>1 in the scala source docs - which is easy to mistake for n=1. i just ran the
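A minimal sketch of the point above, assuming the standard socketTextStream API (this is not the shipped example code): with "local" or "local[1]" the single thread is consumed by the receiver, so nothing is left to process batches; "local[2]" leaves one thread for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream functions

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for batch processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```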

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I was talking about the protobuf version issue as not fixed. I could not find any reference to the problem or the fix. Reg. SPARK-1052, I could pull in the fix into my 0.9.0 tree (from the tar ball on the website) and I see the fix in the latest git. Thanks On 01-Apr-2014, at 3:28 am, deric

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Sonal Goyal
Hi Andy, I would be interested in setting up a meetup in Delhi/NCR, India. Can you please let me know how to go about organizing it? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Apr 1, 2014 at 10:04 AM, giive chen

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Another problem I noticed is that the current 1.0.0 git tree still gives me the ClassNotFoundException. I see that the SPARK-1052 is already fixed there. I then modified the pom.xml for mesos and protobuf and that still gave the ClassNotFoundException. I also tried modifying pom.xml only for

Re: Hadoop LR comparison

2014-04-01 Thread DB Tsai
Hi Li-Ming, This binary logistic regression using SGD is in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala We're working on multinomial logistic regression using Newton and L-BFGS optimizer now. Will be released
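For readers landing on this thread, a hedged sketch of calling that SGD-based binary logistic regression (0.9.x-era API, where LabeledPoint wraps an Array[Double]; later releases use MLlib vectors instead), with a toy dataset:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext("local[2]", "lr-sketch")
// Toy training set: label (0.0 or 1.0) plus a small feature array.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Array(1.0, 0.1)),
  LabeledPoint(0.0, Array(0.0, 0.9))
))
val model = LogisticRegressionWithSGD.train(training, 100)   // 100 SGD iterations
println(model.predict(Array(0.8, 0.2)))                      // predicted class for one point
```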

Re: Hadoop LR comparison

2014-04-01 Thread Tsai Li Ming
Thanks. What will be equivalent code in Hadoop where Spark published the 110s/0.9s comparison? On 1 Apr, 2014, at 2:44 pm, DB Tsai dbt...@alpinenow.com wrote: Hi Li-Ming, This binary logistic regression using SGD is in

Re: Configuring distributed caching with Spark and YARN

2014-04-01 Thread santhoma
I think with addJar() there is no 'caching', in the sense files will be copied everytime per job. Whereas in hadoop distributed cache, files will be copied only once, and a symlink will be created to the cache file for subsequent runs:

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be getting pulled in unless you are directly using akka yourself. Are you? No i'm not. Although I see that protobuf libraries are directly pulled into the 0.9.0 assembly jar - I do see the shaded version as well. e.g.

SSH problem

2014-04-01 Thread Sai Prasanna
Hi All, I have a five node spark cluster, Master, s1,s2,s3,s4. I have passwordless ssh to all slaves from master and vice-versa. But only one machine, s2, what happens is after 2-3 minutes of my connection from master to slave, the write-pipe is broken. So if try to connect again from master i

Sliding Subwindows

2014-04-01 Thread aecc
Hello, I would like to have a kind of sub-windows. The idea is to have 3 consecutive windows (w1, w2, w3) laid out from future to past, so I can do some processing with the

foreach not working

2014-04-01 Thread eric perler
hello.. i am on my second day with spark.. and im having trouble getting the foreach function to work with the network wordcount example.. i can see that the flatMap and map methods are being invoked.. but i dont seem to be getting into the foreach method... not sure if what i am doing even

Re: Unable to submit an application to standalone cluster which on hdfs.

2014-04-01 Thread haikal.pribadi
How do you remove the validation blocker from the compilation? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-submit-an-application-to-standalone-cluster-which-on-hdfs-tp1730p3574.html Sent from the Apache Spark User List mailing list

custom receiver in java

2014-04-01 Thread eric perler
i would like to write a custom receiver to receive data from a Tibco RV subject i found this scala example.. http://spark.incubator.apache.org/docs/0.8.0/streaming-custom-receivers.html but i cant seem to find a java example does anybody know of a good java example for creating a custom receiver

Use combineByKey and StatCount

2014-04-01 Thread Jaonary Rabarisoa
Hi all; Can someone give me some tips to compute mean of RDD by key , maybe with combineByKey and StatCount. Cheers, Jaonary
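Not from the thread itself, but one common pattern (a sketch, assuming an existing SparkContext sc): carry a (sum, count) pair through combineByKey and divide at the end.

```scala
import org.apache.spark.SparkContext._   // pair RDD functions

val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val meanByKey = pairs.combineByKey(
    (v: Double) => (v, 1L),                                               // createCombiner
    (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1),         // mergeValue
    (x: (Double, Long), y: (Double, Long)) => (x._1 + y._1, x._2 + y._2)  // mergeCombiners
  ).mapValues { case (sum, count) => sum / count }

meanByKey.collect().foreach(println)   // (a,2.0), (b,2.0)
```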

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mark Hamstra
Some related discussion: https://github.com/apache/spark/pull/246 On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.com wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any

Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues with 0.9. Matei On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote: Hi there, For some reason the

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
Yes I'm using akka as well. But if that is the problem then I should have been facing this issue in my local setup as well. I'm only running into this error on using the spark standalone cluster. But will try out your suggestion and let you know. Thanks Kanwal -- View this message in context:

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Kevin Markey
The discussion there hits on the distinction of jobs and stages. When looking at one application, there are hundreds of stages, sometimes thousands. Depends on the data and the task. And the UI seems to track stages. And one could independently track them for such a job.

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
I've removed the dependency on akka in a separate project but still running into the same error. In the POM Dependency Hierarchy I do see 2.4.1 - shaded and 2.5.0 being included. If there is a conflict with project dependency I would think I should be getting the same error in my local setup as
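One way to chase that down is a build-side exclusion; the sketch below uses sbt syntax, and the group/artifact coordinates are placeholders rather than anything named in the thread: exclude the unshaded protobuf from whichever dependency the hierarchy shows pulling it in.

```scala
// build.sbt fragment -- replace the placeholder coordinates with the real offender
libraryDependencies += ("org.example" % "offending-artifact" % "1.0")
  .exclude("com.google.protobuf", "protobuf-java")
```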

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote: Vidal - could you show exactly what flags/commands you are using when you build spark to produce this assembly? On Tue, Apr 1, 2014 at 12:53 AM,

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright, so I've upped the minSplits parameter on my call to textFile, but the resulting RDD still has only 1 partition, which I assume means it was read in on a single process. I am checking the number of partitions in pyspark by using the rdd._jrdd.splits().size() trick I picked up on this list.

Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
When my tuple type includes a generic type parameter, the pair RDD functions aren't available. Take for example the following (a join on two RDDs, taking the sum of the values): def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) : RDD[(String, Int)] = { rddA.join(rddB).map {

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Aaron Davidson
Looks like you're right that gzip files are not easily splittable [1], and also about everything else you said. [1] http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E On Tue, Apr 1, 2014 at 1:51 PM, Nicholas

Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
import org.apache.spark.SparkContext._ import org.apache.spark.rdd.RDD import scala.reflect.ClassTag def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)] = { rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) } } On Tue, Apr 1, 2014 at 4:55 PM, Daniel

Re: Generic types and pair RDDs

2014-04-01 Thread Aaron Davidson
Koert's answer is very likely correct. This implicit definition which converts an RDD[(K, V)] to provide PairRDDFunctions requires a ClassTag is available for K: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124 To fully understand what's

PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail: a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2) a._jrdd.splits().size() a.count() b = a.partitionBy(5)

Cannot Access Web UI

2014-04-01 Thread yxzhao
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging As the above shows: Monitoring and Logging Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job

Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Are you trying to access the UI from another machine? If so, first confirm that you don't have a network issue by opening the UI from the master node itself. For example: yum -y install lynx; lynx ip_address:8080 If this succeeds, then you likely have something blocking you from accessing the

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote: SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote: Vidal - could you show

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is repartition(), which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that, I also didn't realize partitionBy() had

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright! Thanks for that link. I did little research based on it and it looks like Snappy or LZO + some container would be better alternatives to gzip. I confirmed that gzip was cramping my style by trying sc.textFile() on an uncompressed version of the text file. With the uncompressed file,
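A hedged sketch of the workaround this thread converges on (the path is a placeholder): a gzipped text file comes in as a single partition regardless of minSplits, so repartition it before doing any heavy work.

```scala
val raw = sc.textFile("s3n://my-bucket/big-file.gz")   // gzip is not splittable => 1 partition
val spread = raw.repartition(32)                       // shuffle the data into 32 partitions
println(spread.partitions.size)                        // now 32
```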

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition(). Trying rdd.repartition() or rdd.repartition(10) also fail. I'm on 0.9.0. The approach I'm going with to partition my MappedRDD is to key it by a random int, and then partition it. So something like: rdd =

Issue with zip and partitions

2014-04-01 Thread Patrick_Nicolas
I got an exception can't zip RDDs with unusual numbers of Partitions when I apply any action (reduce, collect) on a dataset created by zipping two datasets of 10 million entries each. The problem occurs independently of the number of partitions or when I let

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mayur Rustagi
You can get detailed information through Spark listener interface regarding each stage. Multiple jobs may be compressed into a single stage so jobwise information would be same as Spark. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
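As a rough sketch of that listener approach (method names below follow the 1.x SparkListener trait and differ slightly in 0.9.x; the class and counter are illustrative only):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class ProgressListener extends SparkListener {
  @volatile var tasksEnded = 0
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    tasksEnded += 1                        // count every finished task
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println(s"stage ${stageCompleted.stageInfo.stageId} finished; $tasksEnded tasks seen so far")
  }
}

sc.addSparkListener(new ProgressListener)  // register before running the job
```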

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
mllib has been part of Spark distribution (under mllib directory), also check http://spark.apache.org/docs/latest/mllib-guide.html and for JIRA, because of the recent migration to apache JIRA, I think all mllib-related issues should be under the Spark umbrella,

Re: Status of MLI?

2014-04-01 Thread Krakna H
Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks! On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] ml-node+s1001560n3611...@n3.nabble.com

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
Ah, I see. I’m sorry, I didn’t read your email carefully. I have no idea about the progress on MLBase. Best, -- Nan Zhu On Tuesday, April 1, 2014 at 11:05 PM, Krakna H wrote: Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively

Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan On Tue, Apr 1, 2014 at 8:05 PM, Krakna H

Re: Issue with zip and partitions

2014-04-02 Thread Xiangrui Meng
From API docs: Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the
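A tiny illustration of that contract (assuming an existing SparkContext sc):

```scala
val a = sc.parallelize(1 to 6, 3)                             // 3 partitions, 2 elements each
val b = sc.parallelize(Seq("u", "v", "w", "x", "y", "z"), 3)  // same layout
val zipped = a.zip(b)                                         // RDD[(Int, String)] -- safe

// Safe: map() preserves both the partition count and the elements per partition.
val alsoFine = a.zip(a.map(_ * 10))
// Not safe in general: a filter() or repartition() on one side breaks the assumption.
```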

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Patrick Wendell
It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package On Tue, Apr 1, 2014 at 5:15 PM, Vipul Pandey vipan...@gmail.com wrote: how do you recommend building that - it says [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly

Re: How to index each map operation????

2014-04-02 Thread Shixiong Zhu
Hi Thierry, Your code does not work if @yh18190 wants a global counter. An RDD may have more than one partition. For each partition, cnt will be reset to -1. You can try the following code: scala> val rdd = sc.parallelize( (1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil) rdd:

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Debasish Das
I think multiply by ratings is a heuristic that worked on rating related problems like netflix dataset or any other ratings datasets but the scope of NMF is much more broad than that @Sean please correct me in case you don't agree... Definitely it's good to add all the rating dataset related

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Vipul Pandey
I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus generated also has both shaded and real version of protobuf classes Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv ./assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar |

Re: How to index each map operation????

2014-04-02 Thread yh18190
Hi Thierry, Thanks for the above responses. I implemented it using RangePartitioner; we need to use one of the custom partitioners in order to perform this task. Normally you can't maintain a counter because count operations should be performed on each partitioned block of data... -- View this message in

CDH5 Spark on EC2

2014-04-02 Thread Denny Lee
I’ve been able to get CDH5 up and running on EC2 and according to Cloudera Manager, Spark is running healthy. But when I try to run spark-shell, I eventually get the error: 14/04/02 07:18:18 INFO client.AppClient$ClientActor: Connecting to master  spark://ip-172-xxx-xxx-xxx:7077... 14/04/02

Re: Status of MLI?

2014-04-02 Thread Krakna H
Thanks for the update Evan! In terms of using MLI, I see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions? On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] ml-node+s1001560n3615...@n3.nabble.com

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Sean Owen
It should be kept in mind that different implementations are rarely strictly better, and that what works well in one type of data might not in another. It also bears keeping in mind that several of these differences just amount to different amounts of regularization, which need not be a

ActorNotFound problem for mesos driver

2014-04-02 Thread Leon Zhang
Hi, Spark Devs: I encounter a problem which shows error message akka.actor.ActorNotFound on our mesos mini-cluster. mesos : 0.17.0 spark : spark-0.9.0-incubating spark-env.sh: #!/usr/bin/env bash export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=hdfs://

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
Heya, Yep this is a problem in the Mesos scheduler implementation that has been fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052 = MesosSchedulerBackend) So several options, like applying the patch, upgrading to 0.9.1 :-/ Cheers, Andy On Wed, Apr 2, 2014 at 5:30 PM,

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread Leon Zhang
Aha, thank you for your kind reply. Upgrading to 0.9.1 is a good choice. :) On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.com wrote: Heya, Yep this is a problem in the Mesos scheduler implementation that has been fixed after 0.9.0

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
np ;-) On Wed, Apr 2, 2014 at 5:50 PM, Leon Zhang leonca...@gmail.com wrote: Aha, thank you for your kind reply. Upgrading to 0.9.1 is a good choice. :) On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.comwrote: Heya, Yep this is a problem in the Mesos scheduler

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how RDD is resilient? If one of the partition is lost, who is responsible to recreate that partition - is it the driver program?

Print line in JavaNetworkWordCount

2014-04-02 Thread Eduardo Costa Alfaia
Hi Guys, I would like to print the content of line in: JavaDStream<String> lines = ssc.socketTextStream(args[1], Integer.parseInt(args[2])); JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() { @Override public Iterable<String> call(String x) {

Re: Need suggestions

2014-04-02 Thread andy petrella
TL;DR Your classes are missing on the workers, pass the jar containing the class main.scala.Utils to the SparkContext Longer: I miss some information, like how the SparkContext is configured but my best guess is that you didn't provided the jars (addJars on SparkConf or use the SC's constructor

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file: # Python 2.6 def partitionRDD(rdd, numPartitions): counter = {'a': 0} def count_up(x): counter['a'] += 1 return counter['a'] return (rdd.keyBy(count_up)

Re: Need suggestions

2014-04-02 Thread andy petrella
Sorry I was not clear perhaps, anyway, could you try with the path in the *List* to be the absolute one; e.g. List(/home/yh/src/pj/spark-stuffs/target/scala-2.10/simple-project_2.10-1.0.jar) In order to provide a relative path, you need first to figure out your CWD, so you can do (to be really

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
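A hedged sketch of that overload as it appears in the then-current master (the output path is a placeholder): pass a Hadoop compression codec class when writing text output.

```scala
import org.apache.hadoop.io.compress.GzipCodec

val lines = sc.parallelize(1 to 100).map(_.toString)
lines.saveAsTextFile("hdfs:///tmp/compressed-output", classOf[GzipCodec])  // gzipped part files
```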

Optimal Server Design for Spark

2014-04-02 Thread Stephen Watt
Hi Folks I'm looking to buy some gear to run Spark. I'm quite well versed in Hadoop Server design but there does not seem to be much Spark related collateral around infrastructure guidelines (or at least I haven't been able to find them). My current thinking for server design is something

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Is this a Scala-only feature? (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile) On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I believe we overload it and let you set a codec directly:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
There is a repartition method in pyspark master: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128 On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Update: I'm now using this ghetto function to partition the RDD I get back when I call

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Thanks for pointing that out. On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com wrote: First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point. On Wed,

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you. On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.com wrote: There is a repartition method in pyspark master:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
Will be in 1.0.0 On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you. On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Philip Ogren
What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there

Measure the Total Network I/O, Cpu and Memory Consumed by Spark Job

2014-04-02 Thread yxzhao
Hi All, I am interested in measuring the total network I/O, CPU and memory consumed by a Spark job. I tried to find the related information in the logs and Web UI, but there seems to be no sufficient information. Could anyone give me any suggestion? Thanks very much in advance. -- View this

Efficient way to aggregate event data at daily/weekly/monthly level

2014-04-02 Thread K Koh
Hi, I want to aggregate (time-stamped) event data at daily, weekly and monthly level, stored in a directory in data/yyyy/mm/dd/dat.gz format. For example: Each dat.gz file contains tuples in (datetime, id, value) format. I can perform aggregation as follows: but this code doesn't seem to be
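A hedged sketch of one way to do the daily roll-up (the line format, separator and field positions below are assumptions, since the original code block did not survive the archive):

```scala
import org.apache.spark.SparkContext._   // pair RDD functions

val events = sc.textFile("data/2014/04/02/dat.gz")   // lines like "2014-04-02T10:31:00,id42,3.5"
val daily = events.map { line =>
  val Array(ts, id, value) = line.split(",")
  ((ts.take(10), id), value.toDouble)                // key by (day, id)
}.reduceByKey(_ + _)                                 // weekly/monthly: map the day key up and reduce again
```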

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
Hey Phillip, Right now there is no mechanism for this. You have to go in through the low level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely so it's fairly decoupled from the UI. The concern is this is a semi-internal piece of

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Andrew Or
Hi Philip, In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala,

Re: Efficient way to aggregate event data at daily/weekly/monthly level

2014-04-02 Thread Nicholas Chammas
Watch out with loading data from gzipped files. Spark cannot parallelize the load of gzipped files, and if you do not explicitly repartition your RDD created from such a file, everything you do on that RDD will run on a single core. On Wed, Apr 2, 2014 at 8:22 PM, K Koh den...@gmail.com wrote:

Re: Status of MLI?

2014-04-02 Thread Evan R. Sparks
Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days. On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I

Re: Optimal Server Design for Spark

2014-04-02 Thread Mayur Rustagi
I would suggest to start with cloud hosting if you can, depending on your usecase, memory requirement may vary a lot . Regards Mayur On Apr 2, 2014 3:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Steve, This configuration sounds pretty good. The one thing I would consider is having

Re: Optimal Server Design for Spark

2014-04-02 Thread Debasish Das
Hi Matei, How can I run multiple Spark workers per node? I am running an 8 core 10 node cluster but I do have 8 more cores on each node. So having 2 workers per node will definitely help my usecase. Thanks. Deb On Wed, Apr 2, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey

How to ask questions on Spark usage?

2014-04-02 Thread weida xu
Hi, Shall I send my questions to this Email address? Sorry for bothering, and thanks a lot!

Re: How to ask questions on Spark usage?

2014-04-02 Thread Andrew Or
Yes, please do. :) On Wed, Apr 2, 2014 at 7:36 PM, weida xu xwd0...@gmail.com wrote: Hi, Shall I send my questions to this Email address? Sorry for bothering, and thanks a lot!

Spark RDD to Shark table IN MEMORY conversion

2014-04-02 Thread abhietc31
Hi, We are placing business logic in incoming data stream using Spark streaming. Here I want to point Shark table to use data coming from Spark Streaming. Instead of storing Spark streaming to HDFS or other area, is there a way I can directly point Shark in-memory table to take data from Spark

Shark Direct insert into table value (?)

2014-04-02 Thread abhietc31
Hi, I'm trying to run script in SHARK(0.81) insert into emp (id,name) values (212,Abhi) but it doesn't work. I urgently need direct insert as it is show stopper. I know that we can do insert into emp select * from xyz. Here requirement is direct insert. Does any one tried it ? Or is there

Submitting to yarn cluster

2014-04-02 Thread Ron Gonzalez
Hi,   I have a small program but I cannot seem to make it connect to the right properties of the cluster.   I have the SPARK_YARN_APP_JAR, SPARK_JAR and SPARK_HOME set properly.   If I run this scala file, I am seeing that this is never using the yarn.resourcemanager.address property that I set

Error when run Spark on mesos

2014-04-02 Thread felix
I deployed mesos and tested it using the example/test-framework script; mesos seems OK. But when running spark on the mesos cluster, the mesos slave nodes report the following exception. Can anyone help me fix this? Thanks in advance: 14/04/03 11:24:39 INFO Slf4jLogger: Slf4jLogger started 14/04/03

Re: Error when run Spark on mesos

2014-04-02 Thread panfei
any advice ? 2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com: I deployed mesos and test it using the exmaple/test-framework script, mesos seems OK. but when runing spark on the mesos cluster, the mesos slave nodes report the following exception, any one can help me to fix this ? thanks

Re: Error when run Spark on mesos

2014-04-02 Thread Ian Ferreira
I think this is related to a known issue (regression) in 0.9.0. Try using explicit IP other than loop back. Sent from a mobile device On Apr 2, 2014, at 8:53 PM, panfei cnwe...@gmail.com wrote: any advice ? 2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com: I deployed mesos and test

Example of creating expressions for SchemaRDD methods

2014-04-02 Thread All In A Days Work
For various schemaRDD functions like select, where, orderby, groupby etc. I would like to create expression objects and pass these to the methods for execution. Can someone show some examples of how to create expressions for case class and execute ? E.g., how to create expressions for select,

Re: Error when run Spark on mesos

2014-04-03 Thread panfei
after upgrading to 0.9.1 , everything goes well now. thanks for the reply. 2014-04-03 13:47 GMT+08:00 andy petrella andy.petre...@gmail.com: Hello, It's indeed due to a known bug, but using another IP for the driver won't be enough (other problems will pop up). A easy solution would be to

How to stop system info output in spark shell

2014-04-03 Thread weida xu
Hi, all. When I start Spark in the shell, it automatically outputs some system info every minute, see below. Can I stop or block the output of this info? I tried the :silent command, but the automatic output remains. 14/04/03 19:34:30 INFO MetadataCleaner: Ran metadata cleaner for

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is
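For the first part of the question, the explicit persist-to-disk options look like this (a sketch with a placeholder path); the shuffle-spill behaviour asked about is separate and governed by configuration such as spark.shuffle.spill.

```scala
import org.apache.spark.storage.StorageLevel

val big = sc.textFile("hdfs:///data/events")
big.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that don't fit in RAM to disk
// big.persist(StorageLevel.DISK_ONLY)      // keep the cached blocks on disk only
big.count()
```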

Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
You can find here a gist that illustrates this issue https://gist.github.com/jrabary/9953562 I got this with spark from master branch. On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash and...@andrewash.com wrote: Is this spark 0.9.0? Try setting spark.shuffle.spill=false There was a hash collision

Re: Avro serialization

2014-04-03 Thread FRANK AUSTIN NOTHAFT
We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala Also, Matt Massie has a good blog post
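The general wiring for a registrator like the one linked above looks roughly like this (a sketch; the class names are placeholders, not taken from the ADAM code):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyAvroRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register each Avro SpecificRecord class with its Kryo serializer here, e.g.:
    // kryo.register(classOf[MyRecord], new MySpecificRecordSerializer)
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyAvroRegistrator")
```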

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
This is great news thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git rather than trying to pull it out of JobLogger or creating my own SparkListener. On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along and can be a real pain when trying to maintain the level of backwards compatibility that

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
Indeed, that's how mesos works actually. So the tarball just has to be somewhere accessible by the mesos slaves. That's why it is often put in hdfs. On 3 Apr 2014 18:46, felix cnwe...@gmail.com wrote: So, if I set this parameter, there is no need to copy the spark tarball to every mesos
