How to get well-distributed partitions

2014-02-24 Thread zhaoxw12
I use spark-0.8.0. This is my code in Python:

    list = [(i, i*i) for i in xrange(0, 16)] * 10
    rdd = sc.parallelize(list, 80)
    temp = rdd.collect()
    temp2 = rdd.partitionBy(16, lambda x: x)
    count = 0
    for i in temp2.glom().collect():
        print count, "**", i
        count += 1

This will get the result below: 0 *
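A small follow-up sketch, using the temp2 defined above: instead of printing every element, mapping each glommed partition through len() shows just the per-partition sizes, which makes any skew easy to spot.

    # Sketch only: print the per-partition sizes of the partitioned RDD above.
    # With 16 distinct keys, 16 partitions and lambda x: x as the partition
    # function, an even layout would be 10 pairs in every partition.
    sizes = temp2.glom().map(len).collect()
    print sizes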

Re: apparently non-critical errors running spark-ec2 launch

2014-02-24 Thread Shivaram Venkataraman
Replies inline. On Mon, Feb 24, 2014 at 5:26 PM, nicholas.chammas wrote: > I'm seeing a bunch of (apparently) non-critical errors when launching new > clusters with spark-ec2 0.9.0. > > Here are some of them (emphasis added; names redacted): > > Generating cluster's SSH key on master... > > ssh: c

apparently non-critical errors running spark-ec2 launch

2014-02-24 Thread nicholas.chammas
I'm seeing a bunch of (apparently) non-critical errors when launching new clusters with spark-ec2 0.9.0. Here are some of them (emphasis added; names redacted): Generating cluster's SSH key on master... ssh: connect to host ec2-.compute-1.amazonaws.com port 22: Connection refused *Error executi

Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
Yeah, so the problem is that countByValue returns *all* values and their counts to your machine. If you just want the top 10, try this:

    # do a distributed count using reduceByKey
    counts = data.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
    # reverse the (key, count) pairs into (count, key
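The preview is cut off there; a hedged sketch of how the rest of that pattern usually goes (the sortByKey/take step below is my reconstruction, not recovered text):

    # Continue the sketch above: reverse to (count, key), sort descending, take 10.
    top10 = (counts.map(lambda kv: (kv[1], kv[0]))
                   .sortByKey(ascending=False)
                   .take(10))
    print top10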

Re: ETL on pyspark

2014-02-24 Thread Chengi Liu
Hi, I am using pyspark for the first time on a realistic dataset (a few hundred GBs) but have been seeing a lot of errors in the pyspark shell. This might be because I am not using pyspark correctly? But here is what I was trying: extract_subs.take(2) //returns [u'867430', u'867429'] extract_subs_co

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Christopher. I too am not comfortable with halving the random UUIDs, and thanks to your response, I don't need to do the math :-). The StackOverflow link you suggested had a different set of ideas that I am a bit more comfortable with, but the one I am still hoping for is the use of UUIDs as V

Re: ETL on pyspark

2014-02-24 Thread Chengi Liu
It's around 10 GB big. All I want is to do a frequency count and then get the top 10 entries based on count. How do I do this (again on pyspark)? Thanks On Mon, Feb 24, 2014 at 1:19 PM, Matei Zaharia wrote: > collect() means to bring all the data back to the master node, and there > might just be to

Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
collect() means to bring all the data back to the master node, and there might just be too much of it for that. How big is your file? If you can’t bring it back to the master node try saveAsTextFile to write it out to a filesystem (in parallel). Matei On Feb 24, 2014, at 1:08 PM, Chengi Liu w

cached rdd in memory eviction

2014-02-24 Thread Koert Kuipers
I was under the impression that running jobs could not evict cached RDDs from memory as long as they are below spark.storage.memoryFraction. However, what I observe seems to indicate the opposite. Did anything change? Thanks! Koert
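For reference, a minimal sketch of where that fraction is configured, assuming the SparkConf API that shipped with 0.9 in PySpark; the value, master, and app name are only examples:

    # Sketch: spark.storage.memoryFraction bounds the share of executor heap
    # that block storage may use for cached RDDs. Values here are illustrative.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("cache-test")
            .set("spark.storage.memoryFraction", "0.6"))
    sc = SparkContext(conf=conf)

    cached = sc.parallelize(range(1000000)).cache()
    cached.count()   # materializes the cached blocks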

ETL on pyspark

2014-02-24 Thread Chengi Liu
Hi, A newbie here. I am trying to do ETL on Spark. A few questions. I have a CSV file with a header. 1) How do I parse this file (as it has a header)? 2) I was trying to follow the tutorial here: http://ampcamp.berkeley.edu/3/exercises/data-exploration-using-spark.html 3) I am trying to do a frequen
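On question 1, a common PySpark pattern (sketched here with a placeholder path; it assumes a simple comma-separated layout) is to grab the first line as the header and filter it out before splitting the rest:

    # Sketch: drop the header line, then split each remaining line on commas.
    data = sc.textFile("hdfs:///tmp/input.csv")
    header = data.first()                      # the header line
    rows = (data.filter(lambda line: line != header)
                .map(lambda line: line.split(",")))
    print rows.take(2)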

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, to be sure, I was referring to sequential guarantees with the longs. I would suggest being careful with taking half the UUID, as the probability of collision can be unexpectedly high. Many bits of the UUID are typically time-based, so collision among those bits is virtually guaranteed with pr

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
A lot of great suggestions here that I am going to investigate. In parallel, I would like to explore the possibility of having GraphX be parameterized on the VertexId type. Is that a question for the developer mailing list? Thanks. -deepak -- View this message in context: http://apache-spark

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Christopher, I will look into the StackOverflow suggestion of generating 64-bit UUIDs in the same fashion as 128-bit UUIDs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-with-UUID-vertex-IDs-instead-of-Long-tp1953p1990.html Sent from the Apa

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Ewen, I will look into using half the UUID (we are indeed using random (version 4) UUIDs). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-with-UUID-vertex-IDs-instead-of-Long-tp1953p1989.html Sent from the Apache Spark User List mailing list

Re: standalone spark app build.sbt compilation error

2014-02-24 Thread Spark Storm
Thanks TD - that indeed worked! On Sat, Feb 22, 2014 at 6:28 PM, Tathagata Das wrote: > This is what a minimalistic sbt with Spark Streaming would look like > https://github.com/amplab/training/blob/ampcamp4/streaming/java/build.sbt > So "0.9.0" should be "0.9.0-incubating" > > TD > > > On Sat,

metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized

2014-02-24 Thread Grega Kešpret
Hi, I'm seeing the output below in the logs when I start a job on the driver (with master = local) when I package the driver sources + the pre-built spark-core-assembly-v0.8.1-incubating.jar in a fat jar with sbt/sbt assembly. However, when I start with sbt/sbt run, it works fine. I've tried rm -rf target

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, depending on your use case, you might find it appropriate and certainly easy to create a lightweight sequence number service that serves requests from parallel clients. http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation/5685869 There's no shame in using non-Spa
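Purely as an illustration of that idea (plain Python, no Spark; every name below is hypothetical): a tiny block-allocating counter lets parallel clients fetch a whole range of IDs at a time, so the central service is contacted once per block rather than once per vertex. In practice it would sit behind an RPC or HTTP endpoint.

    import threading

    class SequenceBlockService(object):
        """Hands out disjoint blocks of sequential IDs to parallel clients."""

        def __init__(self, block_size=10000):
            self._lock = threading.Lock()
            self._next = 0
            self._block_size = block_size

        def next_block(self):
            # Returns a half-open range [start, end) reserved for one client.
            with self._lock:
                start = self._next
                self._next += self._block_size
            return start, start + self._block_size

    # A client grabs a block once and assigns IDs locally from it.
    service = SequenceBlockService()
    start, end = service.next_block()
    local_ids = iter(xrange(start, end))
    print next(local_ids), next(local_ids)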

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
In addition, you can easily verify there are no collisions with Spark before running anything through GraphX -- create the mapping and then groupByKey to find any keys with multiple mappings. Ewen Cheslack-Postava February 24, 2014 at 10:58 AM You can almost certainly
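A sketch of that check in PySpark terms (a Scala version would be analogous; uuid_strings below is an assumed RDD of the original 128-bit UUID strings):

    # Sketch: detect halved-ID collisions before building the graph.
    import uuid

    def half_id(u):
        # Top 64 bits of the UUID as the candidate vertex ID.
        return uuid.UUID(u).int >> 64

    pairs = uuid_strings.map(lambda u: (half_id(u), u))
    collisions = (pairs.groupByKey()
                       .filter(lambda kv: len(set(kv[1])) > 1))
    print collisions.take(10)   # empty means no two UUIDs share a halved ID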

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
You can almost certainly take half of the UUID safely, assuming you're using random UUIDs. You could work out the math if you're really concerned, but the probability of a collision in 64 bits is probably pretty low even with a very large data set. If your UUIDs aren't version 4, you probably j
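For illustration, taking the top half of a version-4 UUID in plain Python; whether 64 (really 60) random bits are enough for your collision budget is exactly the judgment call above:

    # Sketch: derive a 64-bit ID from a random (version 4) UUID.
    import uuid

    u = uuid.uuid4()
    half = u.int >> 64          # top 64 bits; 60 of them are random for v4,
                                # since the 4 version bits sit in this half
    as_signed = half - 2**63    # offset into signed 64-bit range if the
                                # consumer (e.g. a JVM Long) needs a signed value
    print u, half, as_signed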

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Hi Josh, Thanks for your quick response. Yes, it is a practical option, but my concern is the need to centralize this mapping. Please see my response to Evan's response. Thanks. -deepak -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-with-UUID-ver

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Hi Evan, Thanks for the quick response. The only mapping between UUIDs and Longs that I can think of is one where I sequentially assign Longs as I load the UUIDs from the DB. But this results in having to centralize this mapping. I am guessing that centralizing this is not a good idea for a distri
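One way to avoid a central counter entirely, offered here only as a hedged sketch (PySpark syntax; the input path is a placeholder): derive each ID from the partition index plus the element's position within its partition, so every partition hands out a disjoint set of Longs with no coordination. On 0.9-era PySpark the hook is mapPartitionsWithSplit (newer releases call it mapPartitionsWithIndex).

    # Sketch: assign unique IDs without a central counter.
    def count_partitions(rdd):
        # One small job just to learn how many partitions the RDD has.
        return rdd.mapPartitions(lambda it: [1]).count()

    uuid_strings = sc.textFile("hdfs:///tmp/uuids.txt")   # assumed input
    n = count_partitions(uuid_strings)

    def assign_ids(partition_index, iterator):
        # Element i of partition p gets ID i * n + p; the ranges produced by
        # different partitions cannot overlap.
        for i, u in enumerate(iterator):
            yield (i * n + partition_index, u)

    id_to_uuid = uuid_strings.mapPartitionsWithSplit(assign_ids)
    print id_to_uuid.take(3)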

Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
Figured it out. I did sbt/sbt assembly on the same jobserver branch and am running that as a standalone spark cluster. I am then running a separate jobserver from the same branch - it all works now. Ognen On 2/24/14, 9:02 AM, Ognen Duzlevski wrote: In any case, I am running the same version

Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
In any case, I am running the same version of spark standalone on the cluster as the jobserver (I compiled the master branch as opposed to the jobserver branch, not sure if this matters). I then proceeded to change the application.conf file to reflect the spark://master_ip:7077 as the master.

Nothing happens when executing on cluster

2014-02-24 Thread Anders Bennehag
Hello there, I'm having some trouble with my spark-cluster consisting of master.censored.dev and spark-worker-0. Reading from the output of pyspark, the master, and the worker node, it seems like the cluster is formed correctly and pyspark connects to it. But for some reason, nothing happens after "TaskSc

Re: Shark server crashes-[Thrift Error]: java.net.SocketException: Socket closed

2014-02-24 Thread Mayur Rustagi
Can you give a bigger stack trace? Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Mon, Feb 24, 2014 at 4:08 AM, Arpit Tak wrote: > > Shark-server crashes after some-times when running big queries ... > An

java.lang.ClassNotFoundException

2014-02-24 Thread Terance Dias
Hi, I'm trying the Spark on YARN example at https://spark.incubator.apache.org/docs/latest/running-on-yarn.html When I try to run the SparkPi example using the spark-class command, the job fails, and in the stderr file of the job logs I see the following error: java.lang.ClassNotFoundException: o

Shark server crashes-[Thrift Error]: java.net.SocketException: Socket closed

2014-02-24 Thread Arpit Tak
The Shark server crashes after some time when running big queries. Any suggestion on how to get rid of it? show tables exception: [Thrift Error]: java.net.SocketException: Socket closed [Thrift Error]: Hive server is not cleaned due to thrift exception: java.net.SocketException: Socket closed

Re: java.io.NotSerializableException

2014-02-24 Thread yaoxin
At the end is my exception stack. The class Spark complains about is a company-internal class. org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: com.mycompany.util.xxx at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$D

Re: java.io.NotSerializableException

2014-02-24 Thread leosand...@gmail.com
Which class is not Serializable? I ran Shark 0.9 and hit a similar exception: java.io.NotSerializableException (java.io.NotSerializableException: shark.execution.ReduceKeyReduceSide) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) java.io.ObjectOutputStream.defaultWriteField

java.io.NotSerializableException

2014-02-24 Thread yaoxin
I got an error: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: But the class it complains about is a Java library class that I depend on, so I can't change it to be Serializable. Is there any way to work around this? I am using Spark 0.9, spa

Re: How to use FlumeInputDStream in spark cluster?

2014-02-24 Thread anoldbrain
Tracked more code, and here are my findings. 1. TaskLocation(Some("node-005"), None) is not incorrect. 2. The problem is caused by a weird 'malfunction' of the .contains call on all the HashMaps used for storing tasks in TaskSetManager.scala