Re: orc read issue in spark

2015-11-18 Thread Reynold Xin
What do you mean by starts delay scheduling? Are you saying it is no longer doing local reads? If that's the case you can increase the spark.locality.wait timeout. On Wednesday, November 18, 2015, Renu Yadav wrote: > Hi , > I am using spark 1.4.1 and saving orc file using >
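A minimal sketch of raising that locality timeout via spark.locality.wait (the 10s value and app name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("orc-read")
      // wait up to 10 seconds for a node-local slot before scheduling the task elsewhere
      .set("spark.locality.wait", "10s")
    val sc = new SparkContext(conf)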

Re: ISDATE Function

2015-11-18 Thread Ruslan Dautkhanov
You could write your own UDF isdate(). -- Ruslan Dautkhanov On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote: > Hi Ted Yu, > > Thanks for your response. Is there any other way to achieve this in a Spark query? > > > Regards, > Ravi > > On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu
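A rough sketch of such a UDF, assuming a SQLContext named sqlContext and a yyyy-MM-dd format (both illustrative):

    import java.text.SimpleDateFormat
    import scala.util.Try

    // register an isdate(col) UDF that returns true when the string parses as a date
    sqlContext.udf.register("isdate", (s: String) => {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      fmt.setLenient(false)
      Try(fmt.parse(s)).isSuccess
    })

    sqlContext.sql("SELECT col, isdate(col) FROM tempTable").show()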

Re: how can evenly distribute my records in all partition

2015-11-18 Thread prateek arora
Hi, thanks for the help. In my case I want to perform an operation on 30 records per second using Spark Streaming. The difference between the keys of consecutive records is around 33-34 ms, and my RDD that has 30 records already has 4 partitions. Right now my algorithm takes around 400 ms to perform the operation on 1

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread Ted Yu
Interesting. I will be watching your PR. On Wed, Nov 18, 2015 at 7:51 AM, 임정택 wrote: > Ted, > > I suspect I hit the issue > https://issues.apache.org/jira/browse/SPARK-11818 > Could you refer to the issue and verify that it makes sense? > > Thanks, > Jungtaek Lim (HeartSaVioR) > >

Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
TD, thank you for your reply. I agree on the data store requirement. I am using HBase as the underlying store. So for every batch interval of say 10 seconds: - Calculate the time dimension (minutes, hours, day, week, month and quarter) along with other dimensions and metrics - Update relevant base

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread 임정택
Ted, I suspect I hit the issue https://issues.apache.org/jira/browse/SPARK-11818 Could you refer to the issue and verify that it makes sense? Thanks, Jungtaek Lim (HeartSaVioR) 2015-11-18 20:32 GMT+09:00 Ted Yu : > Here is related code: > > private static void

Unable to load native-hadoop library for your platform - already loaded in another classloader

2015-11-18 Thread Deenar Toraskar
Hi, I want to make sure we use short-circuit local reads for performance. I have set LD_LIBRARY_PATH correctly and confirmed that the native libraries match our platform (i.e. they are 64-bit and are loaded successfully). When I start Spark, I get the following message after increasing the logging

Unable to import SharedSparkContext

2015-11-18 Thread njoshi
Hi, Doesn't *SharedSparkContext* come with spark-core? Do I need to include any special package in the library dependencies for using SharedSparkContext? I am trying to write a test suite similar to the *LogisticRegressionSuite* test in Spark ML. Unfortunately, I am unable to import any of

newbie simple app, small data set: Py4JJavaError java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-18 Thread Andy Davidson
Hi, I am working on a Spark POC. I created an EC2 cluster on AWS using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2. Below is a simple Python program. I am running it using an IPython notebook. The notebook server is running on my Spark master. If I run my program more than once using my large data set, I

Re: Unable to import SharedSparkContext

2015-11-18 Thread Sourigna Phetsarath
Nikhil, Please take a look at: https://github.com/holdenk/spark-testing-base On Wed, Nov 18, 2015 at 2:12 PM, Marcelo Vanzin wrote: > On Wed, Nov 18, 2015 at 11:08 AM, njoshi wrote: > > Doesn't *SharedSparkContext* come with spark-core? Do I need
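A hypothetical build.sbt line for pulling that project in (the group and artifact come from its README; the version shown is only illustrative and should match your Spark version):

    // test-only dependency on holdenk's spark-testing-base
    libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.5.1_0.2.1" % "test"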

DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Eran Medan
I understand that the following are equivalent df.filter('account === "acct1") sql("select * from tempTableName where account = 'acct1'") But is Spark SQL smart enough to also push filter predicates down for the initial load? e.g. sqlContext.read.jdbc(…).filter('account === "acct1")

Re: Unable to import SharedSparkContext

2015-11-18 Thread Sourigna Phetsarath
Plus this article: http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/ On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath < gna.phetsar...@teamaol.com> wrote: > Nikhil, > > Please take a look at: https://github.com/holdenk/spark-testing-base > > On

GraphX stopped without finishing and with no ERRORs !

2015-11-18 Thread Khaled Ammar
Hi all, I have a problem running some algorithms on GraphX. Occasionally, it stops running without any errors. The task state is FINISHED, but the executors' state is KILLED. However, I can see that one job is not finished yet. It took too much time (minutes) while every job/iteration were

Apache Groovy and Spark

2015-11-18 Thread tog
Hi, I started playing with both Apache projects and quickly got this exception. Is anyone able to give some hint on the problem so that I can dig further? It seems to be a problem for Spark to load some of the groovy classes ... Any idea? Thanks Guillaume tog GroovySpark $

Re: Unable to import SharedSparkContext

2015-11-18 Thread Nikhil Joshi
Thanks Marcelo and Sourigna. I see the spark-testing-base being part of Spark, but it has been included under the test package of spark-core. That caused the trouble :(. On Wed, Nov 18, 2015 at 11:26 AM, Sourigna Phetsarath < gna.phetsar...@teamaol.com> wrote: > Plus this article: >

Re: How to disable SparkUI programmatically?

2015-11-18 Thread Ted Yu
Refer to core/src/test/scala/org/apache/spark/ui/UISuite.scala, around line 41: val conf = new SparkConf() .setMaster("local") .setAppName("test") .set("spark.ui.enabled", "true") Cheers On Wed, Nov 18, 2015 at 3:05 AM, Ted Yu wrote: > You can set

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
Hi, Thanks for the suggestion -- but those classpath config options only affect the driver and executor processes, not the standalone mode daemons (master and slave). Incidentally, we already have the extra jars we need set there. I went through the docs but couldn't find a place to set extra

Re: How to disable SparkUI programmatically?

2015-11-18 Thread Ted Yu
You can set the spark.ui.enabled config parameter to false. Cheers > On Nov 18, 2015, at 1:29 AM, Alex Luya wrote: > > I noticed that the below bug has been fixed: > > https://issues.apache.org/jira/browse/SPARK-2100 > but how to do it (I mean disabling SparkUI)
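A minimal sketch of turning the UI off through the setting Ted mentions (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("no-ui")
      .set("spark.ui.enabled", "false")   // do not start the web UI for this application
    val sc = new SparkContext(conf)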

Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hey, I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound exception with shuffle.index files? It’s been cropping up with very large joins and aggregations, and causing all of our jobs to fail towards the end. The memory limit for the executors (we’re running on mesos)

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Romi Kuntsman
Take the executor memory times spark.shuffle.memoryFraction, and divide the data so that each partition is smaller than that. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Wed, Nov 18, 2015 at 2:09 PM, Tom Arnfeld wrote: > Hi Romi, > > Thanks! Could you give me an
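A back-of-the-envelope sketch of that sizing rule (the memory and data sizes are made-up; 0.2 is the spark.shuffle.memoryFraction default in 1.5):

    // keep each partition below executorMemory * spark.shuffle.memoryFraction
    val executorMemBytes = 8L * 1024 * 1024 * 1024          // assumed 8 GB executors
    val shuffleFraction  = 0.2                              // default spark.shuffle.memoryFraction
    val dataSizeBytes    = 50L * 1024 * 1024 * 1024         // assumed total shuffle input
    val numPartitions    = math.ceil(dataSizeBytes / (executorMemBytes * shuffleFraction)).toInt
    val repartitioned    = rdd.repartition(numPartitions)   // rdd is the dataset being shuffled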

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread Ted Yu
Here is related code: private static void checkDefaultsVersion(Configuration conf) { if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE)) return; String defaultsVersion = conf.get("hbase.defaults.for.version"); String thisVersion = VersionInfo.getVersion();

Re: thought experiment: use spark ML to real time prediction

2015-11-18 Thread Nick Pentreath
One such "lightweight PMML in JSON" is here - https://github.com/bigmlcom/json-pml. At least for the schema definitions. But nothing available in terms of evaluation/scoring. Perhaps this is something that can form a basis for such a new undertaking. I agree that distributed models are only

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hi Romi, Thanks! Could you give me an indication of how much to increase the partitions by? We’ll take a stab in the dark; the input data is around 5M records (though each record is fairly small). We’ve had trouble both with DataFrames and RDDs. Tom. > On 18 Nov 2015, at 12:04, Romi Kuntsman

How to disable SparkUI programmatically?

2015-11-18 Thread Alex Luya
I noticed that the below bug has been fixed: https://issues.apache.org/jira/browse/SPARK-2100 but how do I do it (I mean disabling SparkUI) programmatically? Is it by sparkContext.setLocalProperty(?,?)? And I checked the below link, but can't figure out which property to set

(send this email to subscribe)

2015-11-18 Thread Alex Luya

Re: (send this email to subscribe)

2015-11-18 Thread Nick Pentreath
To subscribe to the list, you need to send a mail to user-subscr...@spark.apache.org (see http://spark.apache.org/community.html for details and a subscribe link). On Wed, Nov 18, 2015 at 11:23 AM, Alex Luya wrote: > >

subscribe

2015-11-18 Thread Alex Luya
subscribe

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Fengdong Yu
The simplest way is to remove all “provided” scopes in your pom, then run “sbt assembly” to build your final package, then get rid of “--jars” because the assembly already includes all dependencies. > On Nov 18, 2015, at 2:15 PM, Jack Yang wrote: > > So weird. Is there anything wrong with

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread 金国栋
Have you tried to change the scope to `compile`? 2015-11-18 14:15 GMT+08:00 Jack Yang : > So weird. Is there anything wrong with the way I made the pom file (I > labelled them as *provided*)? > > > > Is there a missing jar I forgot to add in “--jar”? > > > > See the trace below: >

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread swetha kasireddy
Looks like I can use mapPartitions but can it be done using forEachPartition? On Tue, Nov 17, 2015 at 11:51 PM, swetha wrote: > Hi, > > How to return an RDD of key/value pairs from an RDD that has > foreachPartition applied. I have my code something like the

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread swetha kasireddy
It works fine after some changes. -Thanks, Swetha On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das wrote: > Can you verify that the cluster is running the correct version of Spark. > 1.5.2. > > On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy < > swethakasire...@gmail.com>

orc read issue in spark

2015-11-18 Thread Renu Yadav
Hi, I am using Spark 1.4.1 and saving an orc file using df.write.format("orc").save("outputlocation") The output location size is 440GB, and while reading df.read.format("orc").load("outputlocation").count it has 2618 partitions. The count operation runs fine until 2500 but starts delay scheduling after

Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Zhan Zhang
When you have the following query, 'account === “acct1” will be pushed down to generate a new query with “where account = 'acct1'”. Thanks. Zhan Zhang On Nov 18, 2015, at 11:36 AM, Eran Medan wrote: I understand that the following are equivalent
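A quick way to check that pushdown, sketched against the Spark 1.5 DataFrame reader (the URL, table and column names are made-up):

    import java.util.Properties

    val props = new Properties()
    val df = sqlContext.read.jdbc("jdbc:postgresql://host/db", "accounts", props)
    val filtered = df.filter(df("account") === "acct1")
    filtered.explain(true)   // inspect the physical plan for the pushed-down JDBC predicate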

getting different results from same line of code repeated

2015-11-18 Thread Walrus theCat
Hi, I'm launching a Spark cluster with the spark-ec2 script and playing around in spark-shell. I'm running the same line of code over and over again, and getting different results, and sometimes exceptions. Towards the end, after I cache the first RDD, it gives me the correct result multiple

Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Feng, Does airflow allow remote submissions of spark jobs via spark-submit? On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu wrote: > Hi, > > we use ‘Airflow' as our job workflow scheduler. > > > > > On Nov 19, 2015, at 9:47 AM, Vikram Kone

Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
Thank you TD for your time and help. SM > On 19-Nov-2015, at 6:58 AM, Tathagata Das wrote: > > There are different ways to do the rollups. Either update rollups from the > streaming application, or you can generate roll ups in a later process - say > periodic Spark job

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Hi, we use ‘Airflow' as our job workflow scheduler. > On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote: > > Hi Nick, > Quick question about spark-submit command executed from azkaban with command > job type. > I see that when I press kill in azkaban portal on a

Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
The following works against a hive table from spark sql: hc.sql("select id,r from (select id, name, rank() over (order by name) as r from tt2) v where v.r >= 1 and v.r <= 12") But when using a standard sql context against a temporary table, the following occurs: Exception in thread "main"

RE: how to group timestamp data and filter on it

2015-11-18 Thread Tim Barthram
Hi LCassa, Try: Map to pair, then reduce by key. The spark documentation is a pretty good reference for this & there are plenty of word count examples on the internet. Warm regards, TimB From: Cassa L [mailto:lcas...@gmail.com] Sent: Thursday, 19 November 2015 11:27 AM To: user Subject: how
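A minimal scala sketch of the map-to-pair / reduce-by-key approach Tim describes (the original post uses a JavaDStream; here the stream is assumed to carry (timestamp, Map[String, String]) records):

    // pair each record by its timestamp, group the maps that share a timestamp,
    // then keep only the groups containing key1=value1
    val grouped = stream
      .map { case (timestamp, kvs) => (timestamp, List(kvs)) }
      .reduceByKey(_ ++ _)
      .filter { case (_, maps) => maps.exists(m => m.get("key1") == Some("value1")) }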

Re: Spark LogisticRegression returns scaled coefficients

2015-11-18 Thread robert_dodier
njoshi wrote > I am testing the LogisticRegression performance on a synthetically > generated data. Hmm, seems like a good idea. Can you give the code for generating the training data? best, Robert Dodier -- View this message in context:

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread Tathagata Das
If possible, could you give us the root cause and solution for future readers of this thread? On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy wrote: > It works fine after some changes. > > -Thanks, > Swetha > > On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Yes, you can submit job remotely. > On Nov 19, 2015, at 10:10 AM, Vikram Kone wrote: > > Hi Feng, > Does airflow allow remote submissions of spark jobs via spark-submit? > > On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu

DataFrame.insertIntoJDBC throws AnalysisException -- cannot save

2015-11-18 Thread jonpowell
For this simple example, we are importing 4 lines of 3 columns of a CSV file: Administrator,FiveHundredAddresses1,92121 Ann,FiveHundredAddresses2,92109 Bobby,FiveHundredAddresses3,92101 Charles,FiveHundredAddresses4,92111 We are running spark-1.5.1-bin-hadoop2.6 with master and one slave, and

Re: Calculating Timeseries Aggregation

2015-11-18 Thread Tathagata Das
There are different ways to do the rollups. Either update rollups from the streaming application, or you can generate roll ups in a later process - say periodic Spark job every hour. Or you could just generate rollups on demand, when it is queried. The whole thing depends on your downstream

Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Nick, Quick question about spark-submit command executed from azkaban with command job type. I see that when I press kill in azkaban portal on a spark-submit job, it doesn't actually kill the application on spark master and it continues to run even though azkaban thinks that it's killed. How do

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread Sathish Kumaran Vairavelu
I think you can use mapPartitions that returns a pair RDD, followed by foreachPartition for saving it. On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy wrote: > Looks like I can use mapPartitions but can it be done using > forEachPartition? > > On Tue, Nov 17, 2015 at
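A rough sketch of that split, with a made-up key extractor and a stand-in for the real sink:

    // build key/value pairs once per partition...
    val pairs = rdd.mapPartitions { iter =>
      iter.map(record => (record.hashCode, record))   // stand-in for a real key extractor
    }
    // ...then persist them, opening one connection per partition
    pairs.foreachPartition { iter =>
      // a real sink (HBase, JDBC, etc.) would be opened here and closed afterwards
      iter.foreach { case (k, v) => println(s"$k -> $v") }   // stand-in for the actual write
    }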

unsubscribe

2015-11-18 Thread VJ Anand
-- *VJ Anand* *Founder * *Sankia* vjan...@sankia.com 925-640-1340 www.sankia.com *Confidentiality Notice*: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use,

Re: Apache Groovy and Spark

2015-11-18 Thread Steve Loughran
Looks like groovy scripts don't serialize over the wire properly. Back in 2011 I hooked up groovy to mapreduce, so that you could do mappers and reducers there; "grumpy" https://github.com/steveloughran/grumpy slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy What I ended up doing

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
We solved this by adding to the spark-class script. At the bottom, before the exec statement, we intercepted the command that was constructed and injected our additional classpath: for ((i=0; i<${#CMD[@]}; i++)); do if [[ ${CMD[$i]} == *"$SPARK_ASSEMBLY_JAR"* ]] then

RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
If I tried to change “provided” to “compile”, then the error changed to: Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800)

how to group timestamp data and filter on it

2015-11-18 Thread Cassa L
Hi, I have a data stream (JavaDStream) in the following format: timestamp=second1, map(key1=value1, key2=value2) timestamp=second2, map(key1=value3, key2=value4) timestamp=second2, map(key1=value1, key2=value5) I want to group the data by 'timestamp' first and then filter each RDD for key1=value1 or

How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-18 Thread swetha
Hi, We have a lot of temp files that get created due to shuffles caused by group by. How do we clear the files that get created by intermediate operations in the group by? Thanks, Swetha -- View this message in context:

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Checked out 1.6.0-SNAPSHOT 60 minutes ago 2015-11-18 19:19 GMT-08:00 Jack Yang : > Which version of spark are you using? > > > > *From:* Stephen Boesch [mailto:java...@gmail.com] > *Sent:* Thursday, 19 November 2015 2:12 PM > *To:* user > *Subject:* Do windowing functions

Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that this might not be as bad as it seems; I may be incorrect in my understanding though. First, the "additional step" you're referring to is not likely to be adding any overhead; the "extra map" is really just materializing the

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Zhiliang Zhu
Dear Ted, I just looked at the link you provided, it is great! For my understanding, I could also directly use other Breeze parts (besides the spark mllib linalg package) in a spark (scala or java) program after importing the Breeze package, is that right? Thanks a lot in advance again! Zhiliang On

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
But to focus the attention properly: I had already tried out 1.5.2. 2015-11-18 19:46 GMT-08:00 Stephen Boesch : > Checked out 1.6.0-SNAPSHOT 60 minutes ago > > 2015-11-18 19:19 GMT-08:00 Jack Yang : > >> Which version of spark are you using? >> >> >> >>

Re: Do windowing functions require hive support?

2015-11-18 Thread Michael Armbrust
Yes they do. On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch wrote: > But to focus the attention properly: I had already tried out 1.5.2. > > 2015-11-18 19:46 GMT-08:00 Stephen Boesch : > >> Checked out 1.6.0-SNAPSHOT 60 minutes ago >> >> 2015-11-18 19:19

RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
SQLContext only implements a subset of the SQL functions, not including the window functions. In HiveContext it is fine though. From: Stephen Boesch [mailto:java...@gmail.com] Sent: Thursday, 19 November 2015 3:01 PM To: Michael Armbrust Cc: Jack Yang; user Subject: Re: Do windowing functions
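A small sketch of the same rank() query going through a HiveContext instead (the sample rows are made-up):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // register the temp table through the HiveContext so the window function is available
    hc.createDataFrame(Seq((1, "alice"), (2, "bob"), (3, "carol"))).toDF("id", "name")
      .registerTempTable("tt2")
    hc.sql("select id, r from (select id, name, rank() over (order by name) as r from tt2) v " +
           "where v.r >= 1 and v.r <= 12").show()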

Re: Apache Groovy and Spark

2015-11-18 Thread Nick Pentreath
Given there is no existing Groovy integration out there, I'd tend to agree to use Scala if possible - the basics of functional-style Groovy are fairly similar to Scala. — Sent from Mailbox On Wed, Nov 18, 2015 at 11:52 PM, Steve Loughran wrote: > Looks like groovy

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Zhiliang Zhu
Dear Jack, As is known, Breeze is a numerical calculation package written in Scala; spark mllib also uses it as the underlying package for linear algebra. Here I am also preparing to use Breeze for nonlinear equation optimization; however, it seems that I could not find the exact doc or API for Breeze

Re: SequenceFile and object reuse

2015-11-18 Thread Sandy Ryza
Hi Jeff, Many access patterns simply take the result of hadoopFile and use it to create some other object, and thus have no need for each input record to refer to a different object. In those cases, the current API is more performant than an alternative that would create an object for each
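For context, a common way to handle the reuse Sandy describes, when you do need each record as a distinct object, is to copy the reused Writables into immutable values before caching or collecting; a minimal sketch (the path and Writable types are assumptions):

    import org.apache.hadoop.io.Text

    // sequenceFile hands back reused Writable instances, so copy into plain strings
    // before caching or collecting
    val records = sc.sequenceFile("/path/to/data", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }
      .cache()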

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Ted Yu
Have you looked at https://github.com/scalanlp/breeze/wiki Cheers > On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu wrote: > > Dear Jack, > > As is known, Breeze is numerical calculation package wrote by scala , spark > mllib also use it as underlying package for algebra
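Since the question is about nonlinear optimization, here is a tiny Breeze-only sketch (independent of spark mllib) minimizing f(x) = x·x with breeze.optimize.LBFGS; the dimension, iteration count and starting point are arbitrary:

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // value and gradient of f(x) = x . x
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]) = (x dot x, x * 2.0)
    }
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 5)
    val xOpt  = lbfgs.minimize(f, DenseVector.fill(3)(1.0))   // starts from (1, 1, 1)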

RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
Which version of spark are you using? From: Stephen Boesch [mailto:java...@gmail.com] Sent: Thursday, 19 November 2015 2:12 PM To: user Subject: Do windowing functions require hive support? The following works against a hive table from spark sql hc.sql("select id,r from (select id, name,

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Why is the same query (and actually I tried several variations) working against a HiveContext and not against the SQLContext? 2015-11-18 19:57 GMT-08:00 Michael Armbrust : > Yes they do. > > On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch wrote: > >>

RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
Back to my question. If I use “provided”, the jar file expects some libraries to be provided by the system. However, “compile” is the default scope, which means the third-party library will be included inside the jar file after compiling. So when I use “provided”, the error is they

Spark Monitoring to get Spark GCs and records processed

2015-11-18 Thread rakesh rakshit
Hi all, I want to monitor Spark to get the following: 1. All the GC stats for the Spark JVMs 2. Records successfully processed in a batch 3. Records failed in a batch 4. Historical data for batches, jobs, stages, tasks, etc. Please let me know how I can get this information in Spark. Regards,
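One common starting point, covering mainly the GC part, is to turn on GC logging for the executors and point Spark at a metrics config; both settings below are standard Spark options, but the JVM flags and the path are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // surface executor GC activity in the executor logs
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      // route Spark's JVM and streaming metrics to a sink (console, CSV, JMX, Graphite, ...)
      .set("spark.metrics.conf", "/path/to/metrics.properties")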

Re: Apache Groovy and Spark

2015-11-18 Thread tog
Hi Steve, since you are familiar with groovy I will go a bit deeper into the details. My (simple) groovy scripts are working fine with Apache Spark - a closure (when dehydrated) will nicely serialize. My issue comes when I want to use GroovyShell to run my scripts (my ultimate goal is to integrate with

Spark twitter streaming in Java

2015-11-18 Thread Soni spark
Dear Friends, I am struggling with Spark Twitter streaming. I am not getting any data. Please correct the code below if you find any mistakes. import org.apache.spark.*; import org.apache.spark.api.java.function.*; import org.apache.spark.streaming.*; import org.apache.spark.streaming.api.java.*;
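For comparison, a minimal scala sketch of the same setup using TwitterUtils from spark-streaming-twitter (the OAuth values are placeholders; passing None makes it read the twitter4j system properties):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    System.setProperty("twitter4j.oauth.consumerKey", "<consumer-key>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumer-secret>")
    System.setProperty("twitter4j.oauth.accessToken", "<access-token>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<access-token-secret>")

    val ssc = new StreamingContext(sc, Seconds(10))
    val tweets = TwitterUtils.createStream(ssc, None)   // None -> use the system-property credentials
    tweets.map(_.getText).print()                        // print a few tweet texts per batch
    ssc.start()
    ssc.awaitTermination()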

Spark JDBCRDD query

2015-11-18 Thread sunil m
Hello Spark experts! I am new to Spark and I have the following query... What I am trying to do: run a Spark 1.5.1 job with local[*] on a 4-core CPU. This will ping an Oracle database and fetch 5000 records at a time into a JdbcRDD; I increase the number of partitions by 1 for every 5000 records I fetch. I
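A rough sketch of a partitioned JdbcRDD fetch against Oracle (the connection string, table and bounds are all made-up; JdbcRDD splits the [lowerBound, upperBound] range evenly across numPartitions):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/service", "user", "pass"),
      "SELECT id, payload FROM my_table WHERE id >= ? AND id <= ?",
      lowerBound = 1L, upperBound = 1000000L, numPartitions = 200,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))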

Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-18 Thread Ted Yu
Have you seen SPARK-5836? Note TD's comment at the end. Cheers On Wed, Nov 18, 2015 at 7:28 PM, swetha wrote: > Hi, > > We have a lot of temp files that gets created due to shuffles caused by > group by. How to clear the files that gets created due to intermediate >

Re: getting different results from same line of code repeated

2015-11-18 Thread Dean Wampler
Methods like first() and take(n) can't guarantee to return the same result in a distributed context, because Spark uses an algorithm to grab data from one or more partitions that involves running a distributed job over the cluster, with tasks on the nodes where the chosen partitions are located.
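A small sketch of forcing a stable answer when sampling like this, assuming the elements have an ordering:

    // take(n)/first() depend on which partitions happen to be read, so order first
    val stableFive = rdd.sortBy(identity).take(5)
    // or let Spark do the ordering for you
    val topFive    = rdd.takeOrdered(5)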

Re: Incorrect results with reduceByKey

2015-11-18 Thread tovbinm
Deep copying the data solved the issue: data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id, List(t)) }).reduceByKey(_ ++ _) (noted here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003) Thanks Igor Berman, for