graph reduceByKey

2014-08-05 Thread Omer Holzinger
get something like: v2 --> v4 v2 --> v6 v4 --> v2 v4 --> v6 v6 --> v2 v6 --> v4 Does anyone have advice on what will be the best way to do that over a graph instance? I attempted to do it using mapReduceTriplets but I need the reduce function to work like reduceByKey, which I wasn't able to do. Thank you. -- Omer
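A possible pattern for this, offered as a minimal sketch with illustrative types (Spark 1.x-era GraphX, where mapReduceTriplets was the idiom before aggregateMessages replaced it): the vertex id acts as the key, so the reduce function behaves exactly like the merge function of reduceByKey.

    import org.apache.spark.graphx._

    // For every edge, send each endpoint the id of the other endpoint,
    // then merge the per-vertex messages the way reduceByKey would merge values.
    def neighborSets[VD, ED](graph: Graph[VD, ED]): VertexRDD[Set[VertexId]] =
      graph.mapReduceTriplets[Set[VertexId]](
        triplet => Iterator(
          (triplet.srcId, Set(triplet.dstId)),
          (triplet.dstId, Set(triplet.srcId))),
        (a, b) => a ++ b)  // the "reduceByKey-like" step: set union per vertex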

Re: Spark ReduceByKey - Working in Java

2014-08-02 Thread Sean Owen
m happens. > > I am trying to understand the working of reduceByKey in Spark using Java > as the programming language. > > Say if I have a sentence "I am who I am" I break the sentence into words and > store as list [I, am, who, I, am] > > now this function assi

Spark ReduceByKey - Working in Java

2014-08-02 Thread Anil Karamchandani
Hi, I am a complete newbie to Spark and map-reduce frameworks and have a basic question on the reduce function. I was working on the word count example and was stuck at the reduce stage where the sum happens. I am trying to understand the working of reduceByKey in Spark using Java as the
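For other newcomers with the same question, a minimal sketch of what the reduce stage does, written in Scala for brevity (the Java API via JavaPairRDD.mapToPair/reduceByKey is analogous); the data and names are illustrative, and sc is assumed to be the spark-shell SparkContext.

    val words = sc.parallelize(Seq("I", "am", "who", "I", "am"))
    val pairs = words.map(w => (w, 1))      // ("I",1), ("am",1), ("who",1), ("I",1), ("am",1)

    // reduceByKey repeatedly combines two values that share the same key:
    // for "I" it computes 1 + 1 = 2, for "am" 1 + 1 = 2, and "who" keeps its single 1.
    val counts = pairs.reduceByKey((a, b) => a + b)
    counts.collect().foreach(println)       // (I,2), (am,2), (who,2) in some order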

Re: reduceByKey Not Being Called by Spark Streaming

2014-07-03 Thread Dan H.
Hi All, I was able to resolve this matter with a simple fix. It seems that in order to process the reduceByKey and flatMap operations at the same time, the only way to resolve it was to increase the number of threads to more than 1. Since I'm developing on my personal machine for speed,
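The fix Dan describes, as a minimal sketch for a local run (names and batch interval are illustrative): the streaming receiver pins one thread, so the master URL needs at least two before reduceByKey/flatMap output appears.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (pre-1.3 Spark)

    // "local[2]" (or more) leaves a core for processing after the receiver takes one;
    // with "local" or "local[1]" the reduceByKey stage never runs and nothing prints.
    val conf = new SparkConf().setAppName("streaming-reduce").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()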

reduceByKey Not Being Called by Spark Streaming

2014-07-02 Thread Dan H.
Hi all, I recently picked up Spark and am trying to work through a coding issue that involves the reduceByKey method. After various debugging efforts, it seems that the reduceByKey method never gets called. Here's my workflow, which is followed by my code and results: My parsed

Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
ut reduce function. I will appreciate any help, thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7837.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
I guess this is a basic question about the usage of reduce. Please shed some lights, thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7834p7836.html Sent from the Apache Spark User List mailing

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Andrew Ash
>> I have a simple Java class as follows, that I want to use as a key while >> applying groupByKey or reduceByKey functions: >> >> private static class FlowId { >> public String dcxId; >> public String trxId; >>

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Sean Owen
ey while > applying groupByKey or reduceByKey functions: > > private static class FlowId { > public String dcxId; > public String trxId; > public String msgType; > > public FlowId(String dcx

Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Gaurav Jain
I have a simple Java class as follows, that I want to use as a key while applying groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType
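The requirement discussed in the replies is that a key type provide value-based equals and hashCode and be serializable. A minimal sketch in Scala, where a case class gives all three for free; in Java the equivalent is overriding equals/hashCode on FlowId and implementing java.io.Serializable. The sample records are invented, and sc is assumed to be a spark-shell SparkContext.

    // Case classes get structural equals, hashCode and Serializable for free,
    // which is what groupByKey/reduceByKey need so equal keys land on the same reducer.
    case class FlowId(dcxId: String, trxId: String, msgType: String)

    val events = sc.parallelize(Seq(
      (FlowId("dcx1", "trx1", "REQ"), 10L),
      (FlowId("dcx1", "trx1", "REQ"), 30L),
      (FlowId("dcx2", "trx9", "RSP"), 5L)))

    val totals = events.reduceByKey(_ + _)   // keys compare by field values, not object identity
    totals.collect().foreach(println)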

Re: When to use CombineByKey vs reduceByKey?

2014-06-12 Thread Diana Hu
iners, mergeValue, etc) and return that instead of > allocating a new object. So it should work with mutable objects — please > post what problems you had with that. reduceByKey actually also allows this > if your types are the same. > > Matei > > > On Jun 11, 2014, at 3:21 PM,

Re: HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
> (10.20.11.3,Set(10.10.61.95)) > ... > > > > What I want is a SET of (sourceIP -> Set(all the destination Ips)) Using > a set because as you can see above, the same source may have the same > destination multiple times and I want to eliminate dupes on the destination
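A minimal sketch of the aggregation described here, assuming a DStream of (sourceIP, destinationIP) string pairs named srcDst (the stream itself and its source are not shown): wrapping each destination in a one-element Set lets reduceByKey take unions, which drops duplicate destinations per source.

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (pre-1.3 Spark)

    // srcDst: DStream[(String, String)] of (sourceIP, destinationIP) pairs (assumed)
    val srcToDests = srcDst
      .mapValues(dst => Set(dst))   // each record becomes (source, Set(one destination))
      .reduceByKey(_ ++ _)          // set union per source de-duplicates destinations

    srcToDests.print()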

HELP!? Re: Streaming trouble (reduceByKey causes all printing to stop)

2014-06-12 Thread Michael Campbell
sourceIP -> Set(all the destination Ips)) Using a set because as you can see above, the same source may have the same destination multiple times and I want to eliminate dupes on the destination side. When I call the reduceByKey() method, I never get any output. When I do a "srcDestinat

Re: When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Matei Zaharia
allocating a new object. So it should work with mutable objects — please post what problems you had with that. reduceByKey actually also allows this if your types are the same. Matei On Jun 11, 2014, at 3:21 PM, Diana Hu wrote: > Hello all, > > I've seen some performance impr
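The pattern Matei describes, as a minimal sketch with invented data (sc assumed from spark-shell): mergeValue and mergeCombiners may mutate their first argument and return it, so no new buffer is allocated per record.

    import scala.collection.mutable.ArrayBuffer

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val grouped = pairs.combineByKey[ArrayBuffer[Int]](
      (v: Int) => ArrayBuffer(v),                                       // createCombiner: buffer for a key's first value
      (buf: ArrayBuffer[Int], v: Int) => { buf += v; buf },             // mergeValue: mutate the buffer and return it
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => { b1 ++= b2; b1 }) // mergeCombiners across partitions

    // reduceByKey allows the same in-place trick only when value and result types match:
    val sums = pairs.reduceByKey(_ + _)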

When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Diana Hu
Hello all, I've seen some performance improvements using combineByKey as opposed to reduceByKey or a groupByKey+map function. I have a couple of questions; it'd be great if anyone can shed some light on this. 1) When should I use combineByKey vs reduceByKey? 2) Do the containers

Re: Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Thanks. It worked. 2014-05-30 1:53 GMT+08:00 Matei Zaharia : > That hash map is just a list of where each task ran, it’s not the actual > data. How many map and reduce tasks do you have? Maybe you need to give the > driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ +

Re: Driver OOM while using reduceByKey

2014-05-29 Thread Matei Zaharia
That hash map is just a list of where each task ran, it’s not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao
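Matei's suggestion as a minimal sketch (path and counts are illustrative, sc from spark-shell): reduceByKey accepts an explicit number of reduce tasks, and since the driver tracks status for every task, fewer tasks means less to hold in driver memory.

    val pairs = sc.textFile("hdfs:///path/to/input")       // illustrative path
      .map(line => (line.split(",")(0), 1L))

    // The second argument caps the reduce side at 100 tasks instead of the default,
    // shrinking the per-task bookkeeping the driver has to keep.
    val counts = pairs.reduceByKey(_ + _, 100)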

Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Hi, I used 1g of memory for the driver Java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory. Here's my question: 1. Are there any optimization switches that I can

Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread Daniel Mahler
; roughly 30 nodes. > > I was planning to test it with Spark. > > I'm very interested in your findings. > > > > > > > > - > > Madhu > > https://www.linkedin.com/in/msiddalingaiah > > -- > > View this message in context: > http:/

Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread lukas nalezenec
> > Madhu > > https://www.linkedin.com/in/msiddalingaiah > > -- > > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html > > Sent from the Apache Spark User List mailing list archive at Nabble.com. > >

Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-17 Thread Matei Zaharia
I'm very interested in your findings. > > > > - > Madhu > https://www.linkedin.com/in/msiddalingaiah > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.

Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-17 Thread Madhu
https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Configuring Spark for reduceByKey on on massive data sets

2014-05-17 Thread Daniel Mahler
I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions like addition for building histograms, so the reduction process should be
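For the histogram case Daniel describes, a minimal sketch (bucket width, partition count, path, and input format are assumptions to tune; sc from spark-shell): simple additive reductions scale as long as the shuffle is spread over enough reduce tasks.

    // One numeric value per input line (illustrative format).
    val values = sc.textFile("hdfs:///path/to/data").map(_.toDouble)

    val bucketWidth = 0.5                                   // illustrative bin size
    val histogram = values
      .map(v => (math.floor(v / bucketWidth).toLong, 1L))   // (bucket, 1)
      .reduceByKey(_ + _, 2048)                             // explicit task count spreads the shuffle

    histogram.take(20).foreach(println)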

Re: java.net.SocketException on reduceByKey() in pyspark

2014-04-22 Thread benlaird
sparkcontext object solved the problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-tp2184p4612.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: reduceByKey issue in example wordcount (scala)

2014-04-18 Thread Ian Bonnycastle
fference at all. > /usr/local/pkg/spark is also on all the nodes... it has to be in order to > get all the nodes up and running in the cluster. Also, I'm confused by what > you mean with "That's most probably the class that implements the closure > you're passi

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
using println to debug is great for me to explore spark. >>>> thank you very much for your kindly help. >>>> >>>> >>>> >>>> On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos < >>>> daniel.dara...@lynxanalytics.com> wrote:

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
Apr 18, 2014 at 12:54 AM, Daniel Darabos < >>> daniel.dara...@lynxanalytics.com> wrote: >>> >>>> Here's a way to debug something like this: >>>> >>>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => {

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos < >> daniel.dara...@lynxanalytics.com> wrote: >> >>> Here's a way to debug something like this: >>> >>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => { >>>

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
me to explore spark. > thank you very much for your kindly help. > > > > On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote: > >> Here's a way to debug something like this: >> >> scala> d5.keyBy(_.split(

Re: confused by reduceByKey usage

2014-04-17 Thread 诺铁
d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => { >println("v1: " + v1) >println("v2: " + v2) >(v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString >}).collect > > You get: > v1

Re: confused by reduceByKey usage

2014-04-17 Thread Daniel Darabos
Here's a way to debug something like this: scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => { println("v1: " + v1) println("v2: " + v2) (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString

confused by reduceByKey usage

2014-04-17 Thread 诺铁
dd.RDD[String] = MappedRDD[91] at textFile at <console>:12 scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString).first then error occurs: 14/04/18 00:20:11 ERROR Executor: Exception in task ID 36 java.lang.ArrayIndexOutOfBoundsException
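For readers landing on this thread: the ArrayIndexOutOfBoundsException comes from re-splitting the running result inside the reduce function. After the first merge the accumulated value is just a number string like "3" with no second field, so the next split(" ")(1) fails. A minimal sketch of the usual fix, assuming input lines of the form "key value" (path invented, sc from spark-shell): parse the value once, before reducing.

    val d5 = sc.textFile("data.txt")                        // illustrative path, lines like "a 1"

    val summed = d5
      .map { line =>
        val parts = line.split(" ")
        (parts(0), parts(1).toInt)                          // parse each record exactly once
      }
      .reduceByKey(_ + _)                                   // values are already Ints, so merging is safe

    summed.first()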

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
s the closure you're passing as an argument to the reduceByKey() method". The only thing I'm passing to it is "_ + _".. and as you mentioned, its pretty much the same as the map() method. If I run the following code, it runs 100% properly on the cluster: val nu

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
hat's most probably the class that implements the closure you're passing as an argument to the reduceByKey() method. Although I can't really explain why the same isn't happening for the closure you're passing to map()... Sorry I can't be more helpful. > I still get the err

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
s trouble finding the code that is itself. And why only with the reduceByKey function is it occurring? I have no problems with any other code running except for that. (BTW, I don't use in my code above... I just removed it for security purposes.) Thanks, Ian On Mon, Apr 14, 2014 at 12:45 PM

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
pr 14, 2014 at 9:17 AM, Ian Bonnycastle wrote: > Good afternoon, > > I'm attempting to get the wordcount example working, and I keep getting an > error in the "reduceByKey(_ + _)" call. I've scoured the mailing lists, and > haven't been able to find a sure fire

reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
Good afternoon, I'm attempting to get the wordcount example working, and I keep getting an error in the "reduceByKey(_ + _)" call. I've scoured the mailing lists, and haven't been able to find a sure fire solution, unless I'm missing something big. I did find somethi

Too many tasks in reduceByKey() when do PageRank iteration

2014-04-11 Thread 张志齐
iteration is the same. However, I found that during the first iteration the reduceByKey() (line 162) has 6 tasks and during the second iteration it has 18 tasks and third iteration 54 tasks, fourth iteration 162 tasks.. during the sixth iteration it has 1458 tasks which almost costs more
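One common cause of this geometric growth, offered as a hedged sketch rather than a diagnosis of the code above: if each iteration's join and reduceByKey inherit partition counts from their parents instead of a fixed partitioner, the number of shuffle tasks can grow every round. Pinning one partitioner (and caching the static link data) keeps the task count constant; all names, sizes, input format, and the loop structure here are assumptions about a hand-rolled PageRank, and sc comes from spark-shell.

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(6)   // match the first iteration's 6 tasks (illustrative)

    // Adjacency lists parsed from "src dst1 dst2 ..." lines (assumed format).
    val links = sc.textFile("links.txt")
      .map { line => val p = line.split("\\s+"); (p.head.toLong, p.tail.map(_.toLong).toSeq) }
      .partitionBy(partitioner)
      .cache()

    var ranks = links.mapValues(_ => 1.0)
    for (i <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (outEdges, rank) => outEdges.map(dst => (dst, rank / outEdges.size))
      }
      // Same partitioner every round, so the shuffle stays at 6 tasks per iteration.
      ranks = contribs.reduceByKey(partitioner, _ + _).mapValues(v => 0.15 + 0.85 * v)
    }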

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming wrote: > Thanks again. > >> If you use the KMeans implementation from MLlib, the >> initialization stage is done on master, > > The "master" here is the app/driver/spark-sh

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks again. > If you use the KMeans implementation from MLlib, the > initialization stage is done on master, The “master” here is the app/driver/spark-shell? Thanks! On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote: > Number of rows doesn't matter much as long as you have enough workers >

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Number of rows doesn't matter much as long as you have enough workers to distribute the work. K-means has complexity O(n * d * k), where n is number of points, d is the dimension, and k is the number of clusters. If you use the KMeans implementation from MLlib, the initialization stage is done on m

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks, Let me try with a smaller K. Does the size of the input data matters for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark? On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng wrote: > K = 50 is certainly a large number for k-means.

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
K = 50 is certainly a large number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem. -Xiangrui
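Xiangrui's pointer to MLlib, as a minimal sketch against a Spark 1.x-style API (file path, feature format, and parameters are illustrative; sc from spark-shell).

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Whitespace-separated numeric features, one point per line (assumed format).
    val points = sc.textFile("hdfs:///path/to/points")
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .cache()                                              // k-means makes many passes over the data

    val model = KMeans.train(points, 50, 20)                // k = 50, maxIterations = 20 (illustrative)
    model.clusterCenters.foreach(println)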

Re: Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi, This is on a 4 nodes cluster each with 32 cores/256GB Ram. (0.9.0) is deployed in a stand alone mode. Each worker is configured with 192GB. Spark executor memory is also 192GB. This is on the first iteration. K=50. Here’s the code I use: http://pastebin.com/2yXL3y8i , which is a copy-

Re: Kmeans example reduceByKey slow

2014-03-23 Thread Xiangrui Meng
Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? It can help solve your issues. Thanks! Best, Xiangrui On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote: > Hi, > > At the reduceByKey stage, it takes a few minutes befo

Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi, At the reduceByKey stage, it takes a few minutes before the tasks start working. I have -Dspark.default.parallelism=127 cores (n-1). CPU/Network/IO is idling across all nodes when this is happening. And there is nothing particular on the master log file. From the spark-shell: 14/03/23 1

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Ameet Kini
www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Thu, Mar 20, 2014 at 3:20 PM, Ameet Kini wrote: > >> >> val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function) >> >> I see that rdd2's partitions are not

Re: sort order after reduceByKey / groupByKey

2014-03-20 Thread Mayur Rustagi
y(my partitioner).reduceByKey(some function) > > I see that rdd2's partitions are not internally sorted. Can someone > confirm that this is expected behavior? And if so, the only way to get > partitions internally sorted is to follow it with something like this > > val rdd2 = rdd.

sort order after reduceByKey / groupByKey

2014-03-20 Thread Ameet Kini
val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function) I see that rdd2's partitions are not internally sorted. Can someone confirm that this is expected behavior? And if so, the only way to get partitions internally sorted is to follow it with something like this val
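Confirming the observation: reduceByKey keeps the partitioner but promises nothing about order inside a partition. A minimal sketch of the follow-up step, with an invented rdd2 standing in for the one in the question (sc from spark-shell); later Spark releases added repartitionAndSortWithinPartitions for the same purpose.

    import org.apache.spark.HashPartitioner

    val rdd2 = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (3, "d")))
      .partitionBy(new HashPartitioner(2))
      .reduceByKey(_ + _)

    // Sort each partition locally; preservesPartitioning keeps the partitioner metadata,
    // since no key moves to a different partition.
    val sortedWithinPartitions = rdd2.mapPartitions(
      iter => iter.toArray.sortBy(_._1).iterator,
      preservesPartitioning = true)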

Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-19 Thread Uri Laserson
I have the exact same error running on a bare metal cluster with CentOS6 and Python 2.6.6. Any other thoughts on the problem here? I only get the error on operations that require communication, like reduceByKey or groupBy. On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas wrote: > Alright,

Re: How to work with ReduceByKey?

2014-03-13 Thread Shixiong Zhu
l = new ArrayList>(); for (Tuple2> value : values) { count += value._1; l.add(value._2); } return new Tuple2>>( count, l); } }); Or "combineByKey" which often has better performance. Best Regards, Shixiong Zhu 2014-03-14 0:56 GMT+08:00 goi cto : > Hi, > > I have

How to work with ReduceByKey?

2014-03-13 Thread goi cto
Hi, I have an RDD with > which I want to reduceByKey and get I+I and List of List (add the integers and build a list of the lists. BUT reduce by key requires that the return value is of the same type of the input so I can combine the lists. JavaPairRDD>>> callCount = byCaller.*reduc
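A hedged Scala rendering of the two options from Shixiong's reply, with guessed types for the data described (sc from spark-shell): either reshape each value into the output type before reduceByKey so input and output types match, or use combineByKey, whose combined type may differ from the value type.

    // Guessed shape of the data: (caller, (count, listOfEvents)).
    val byCaller = sc.parallelize(Seq(
      ("alice", (1, List("a1"))),
      ("alice", (2, List("a2", "a3"))),
      ("bob",   (1, List("b1")))))

    // Option 1: make each value the output type first, then reduceByKey type-checks.
    val callCount = byCaller
      .mapValues { case (n, events) => (n, List(events)) }          // (Int, List[List[String]])
      .reduceByKey((a, b) => (a._1 + b._1, a._2 ++ b._2))

    // Option 2: combineByKey lets the combined type differ from the value type directly.
    val callCount2 = byCaller.combineByKey[(Int, List[List[String]])](
      v => (v._1, List(v._2)),                       // createCombiner
      (c, v) => (c._1 + v._1, v._2 :: c._2),         // mergeValue
      (c1, c2) => (c1._1 + c2._1, c1._2 ++ c2._2))   // mergeCombiners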

Re: "Too many open files" exception on reduceByKey

2014-03-11 Thread Matthew Cheah
>> a job with X reducers then Spark will open C*X files in parallel and >> start writing. Shuffle consolidation will help decrease the total >> number of files created but the number of file handles open at any >> time doesn't change so it won't help the ulimit pro

Re: "Too many open files" exception on reduceByKey

2014-03-11 Thread Matthew Cheah
X reducers then Spark will open C*X files in parallel and > start writing. Shuffle consolidation will help decrease the total > number of files created but the number of file handles open at any > time doesn't change so it won't help the ulimit problem. > > This mea

Re: "Too many open files" exception on reduceByKey

2014-03-10 Thread Patrick Wendell
doesn't change so it won't help the ulimit problem. This means you'll have to use fewer reducers (e.g. pass reduceByKey a number of reducers) or use fewer cores on each machine. - Patrick On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah wrote: > Hi everyone, > > My team (cc'
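The two levers from Patrick's reply, as a minimal sketch (values are illustrative; the master URL is expected to come from spark-submit): pass reduceByKey an explicit, smaller number of reducers, and/or enable shuffle consolidation, which reduces the number of files created though not the number open at once.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("reduce-under-ulimit")
      .set("spark.shuffle.consolidateFiles", "true")   // Spark 0.9/1.x-era flag: fewer shuffle files written
    val sc = new SparkContext(conf)

    val pairs  = sc.textFile("hdfs:///path/to/input").map(line => (line, 1))
    val counts = pairs.reduceByKey(_ + _, 64)          // fewer reducers => fewer files open per core at a time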

"Too many open files" exception on reduceByKey

2014-03-10 Thread Matthew Cheah
Hi everyone, My team (cc'ed in this e-mail) and I are running a Spark reduceByKey operation on a cluster of 10 slaves where I don't have the privileges to set "ulimit -n" to a higher number. I'm running on a cluster where "ulimit -n" returns 1024 on each machi

Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-02 Thread Nicholas Chammas
h.Mailbox.processMailbox(Mailbox.scala:237) at >> akka.dispatch.Mailbox.run(Mailbox.scala:219) at >> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java

Re: java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread Nicholas Chammas
t.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > >>> > > > The lambda passed to flatMap() returns a list of tuples; take() works fine > just on the flatMap(). > > Where would I start to troubleshoot this error? > > The error output includes m

java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread nicholas.chammas
, I upgraded the cluster to Python 2.7 using the instructions here <https://spark-project.atlassian.net/browse/SPARK-922>. Also, I am running Spark 0.9.0, though I notice that in the error output is mention of 0.8.1 files. Nick -- View this message in context: http://apache-spark-user-l

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread Mayur Rustagi
rect approach, sortByKey > maybe what I am looking for; any insight would be helpful... there are not > many examples out there for newbies such as myself > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2110.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >
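For the counting question in this thread, a minimal sketch of the usual trade-off (invented data, sc from spark-shell): reduceByKey combines counts on the map side before shuffling, groupByKey ships every record across the network, and sortByKey matters only if the output must be ordered.

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))

    // Preferred: partial sums are computed before the shuffle.
    val counts = words.reduceByKey(_ + _)

    // Works, but moves every (word, 1) pair across the network before counting.
    val countsViaGroup = words.groupByKey().mapValues(_.size)

    // Only if the result must come back ordered by key.
    val ordered = counts.sortByKey().collect()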

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread dmpour23
in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2110.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread Mayur Rustagi
context: > http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2078.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread dmpour23
ser-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2078.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
