Re: Running in cluster mode causes native library linking to fail

2015-10-13 Thread Deenar Toraskar
Hi Bernardo, Is the native library installed on all machines of your cluster, and are you setting both spark.driver.extraLibraryPath and spark.executor.extraLibraryPath? Deenar On 14 October 2015 at 05:44, Bernardo Vecchia Stein <bernardovst...@gmail.com> wrote: > Hello, > > I am trying
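For reference, a minimal sketch of passing both settings at submit time (the library path and class name are placeholders, not from the original thread):

    spark-submit \
      --conf spark.driver.extraLibraryPath=/opt/native/lib \
      --conf spark.executor.extraLibraryPath=/opt/native/lib \
      --class com.example.MyApp \
      myapp.jar

Both properties must point at a directory that exists on every node, since each executor resolves the .so locally.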

Re: When is the Python program started in pyspark

2015-10-13 Thread canan chen
I think PythonRunner is launched when executing a Python script; PythonGatewayServer is the entry point for the Python Spark shell:

    if (args.isPython && deployMode == CLIENT) {
      if (args.primaryResource == PYSPARK_SHELL) {
        args.mainClass = "org.apache.spark.api.python.PythonGatewayServer"
      } else {

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Michael, Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined? BR, Todd Leo On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust wrote: > import org.apache.spark.sql.functions._ > > df.groupBy("category") >

Re: Building with SBT and Scala 2.11

2015-10-13 Thread Adrian Tanase
Do you mean hadoop-2.4 or 2.6? Not sure if this is the issue, but I'm also compiling the 1.5.1 version with Scala 2.11 and Hadoop 2.6 and it works. -adrian Sent from my iPhone On 14 Oct 2015, at 03:53, Jakob Odersky wrote: I'm having trouble

When is the Python program started in pyspark

2015-10-13 Thread canan chen
I looked at the source code of Spark, but didn't find where the Python program is started. It seems spark-submit will call PythonGatewayServer, but where is the Python program started? Thanks

Running in cluster mode causes native library linking to fail

2015-10-13 Thread Bernardo Vecchia Stein
Hello, I am trying to run some scala code in cluster mode using spark-submit. This code uses addLibrary to link with a .so that exists in the machine, and this library has a function to be called natively (there's a native definition as needed in the code). The problem I'm facing is: whenever I

Re: A problem with zipPartitions

2015-10-13 Thread Saisai Shao
Maybe you could try "localCheckpoint" instead. On Wednesday, 14 October 2015, 张仪yf1 wrote: > Thank you for your reply. It helped a lot. But when the data became > bigger, the action cost more; is there any way to optimize? > > > > *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com >
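A minimal sketch of the suggestion, assuming Spark 1.5+ where RDD.localCheckpoint() is available (rdd3 stands for the accumulated RDD from the thread):

    rdd3.localCheckpoint()  // marks the RDD for lineage truncation using executor-local storage
    rdd3.count()            // the next action materializes the checkpoint

Unlike checkpoint(), localCheckpoint() does not write to a reliable filesystem, so it is faster, but the data is lost if an executor dies.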

OutOfMemoryError When Reading Many json Files

2015-10-13 Thread SLiZn Liu
Hey Spark Users, I kept getting java.lang.OutOfMemoryError: Java heap space as I read a massive amount of json files, iteratively via read.json(). Even though the resulting RDD is rather small, I still get the OOM error. The brief structure of my program reads as follows, in pseudo-code:

Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
Hi all, I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey's surface area is relatively small and it seems to be used only for the Spark UI and the JSON API, shading the dependency might make sense, similar to what's done for the Jetty dependencies at

Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread Robineast
What you have done should work. A couple of things to try: 1) You should have a lib directory in your Spark deployment containing a jar file called lib/spark-assembly-1.5.1-hadoop2.6.0.jar. Is it there? 2) Have you set the JAVA_HOME variable to point to your Java 8 deployment? If not try

Re: HiveThriftServer not registering with Zookeeper

2015-10-13 Thread Xiaoyu Wang
I have the same issue. I think the Spark Thrift Server does not support HA with ZooKeeper yet. On 2015-09-01 18:10, sreeramvenkat wrote: Hi, I am trying to set up dynamic service discovery for HiveThriftServer in a two node cluster. In the thrift server logs, I am not seeing it registering itself with

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-13 Thread Akhil Das
You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the dependencies list. Thanks Best Regards On Tue, Oct 6, 2015 at 3:20 PM, shahab wrote: > Hi, > > I am trying to use Spark 1.5, MLlib, but I keep getting > "sbt.ResolveException: unresolved
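A minimal build.sbt sketch with the missing dependency added (versions assume Scala 2.10 and Spark 1.5.0, matching the error message):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10"      % "1.5.0",
      "org.apache.spark" % "spark-mllib_2.10"     % "1.5.0",
      "org.apache.spark" % "spark-streaming_2.10" % "1.5.0"
    )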

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, if the fix went in after the 1.5.1 release, then how come it works with the 1.5.1 binary in spark-shell? On Oct 13, 2015 1:32 PM, "Ted Yu" wrote: > Looks like the fix went in after 1.5.1 was released. > > You may verify using master branch build. > > Cheers > > On Oct 13, 2015,

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, thanks much. I tried using percentile_approx in spark-shell like you mentioned and it works using 1.5.1, but it doesn't compile in Java using the 1.5.1 Maven libraries; it still complains that callUdf can take only String and Column types. Please guide. On Oct 13, 2015 12:34 AM, "Ted Yu"

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Pardon me. I didn't read your previous response clearly. I will try to reproduce the compilation error on master branch. Right now, I have some other high priority task on hand. BTW I was looking at SPARK-10671 FYI On Tue, Oct 13, 2015 at 1:42 AM, Umesh Kacha wrote: >

writing to hive

2015-10-13 Thread Hafiz Mujadid
hi! I am following this tutorial to read and write from hive. But i am facing following exception when i run the code. 15/10/12 14:57:36 INFO storage.BlockManagerMaster: Registered BlockManager 15/10/12 14:57:38

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Looks like the fix went in after 1.5.1 was released. You may verify using master branch build. Cheers > On Oct 13, 2015, at 12:21 AM, Umesh Kacha wrote: > > Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you > mentioned it works using 1.5.1

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra
Hi Zhan Zhang, Could my problem (the ORC predicate is not generated from the WHERE clause even though spark.sql.orc.filterPushdown=true) be related to some of the factors below? - orc file version (File Version: 0.12 with HIVE_8732) - hive version (using Hive 1.2.1.2.3.0.0-2557) - orc table is
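For context, a minimal sketch of the setup under discussion, assuming Spark 1.5.x with a HiveContext (the table path and column name are placeholders):

    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
    val df = sqlContext.read.format("orc").load("/path/to/orc/table")
    // with pushdown working, this filter should appear as an ORC predicate
    df.filter(df("x") > 10).count()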

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Rishitesh Mishra
Hi Liu, I could not see any operator on DataFrame which will give the desired result. The DataFrame API, as expected, works on the Row format with a fixed set of operators. However, you can achieve the desired result by accessing the internal RDD, as below: val s = Seq(Test("A",1),
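The quoted code is truncated above; a hedged reconstruction of the approach (the case class and sample data are illustrative, not the exact code from the thread):

    case class Test(category: String, id: Int)

    val df = sqlContext.createDataFrame(Seq(
      Test("A", 1), Test("A", 2), Test("B", 3), Test("B", 4), Test("C", 5)))

    val grouped = df.rdd
      .map(row => (row.getString(0), row.getInt(1)))
      .groupByKey()
      .mapValues(_.toList)  // ("A", List(1, 2)), ("B", List(3, 4)), ("C", List(5))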

A problem with zipPartitions

2015-10-13 Thread 张仪yf1
Hi there, I ran into an issue when using zipPartitions. First I created an RDD from a Seq, then created another one and zipPartitioned them into rdd3, then cached rdd3, then created a new RDD and zipPartitioned it with rdd3. I repeated this operation many times, and I found that,

How can I use dynamic resource allocation option in spark-jobserver?

2015-10-13 Thread JUNG YOUSUN
Hi all, I have some questions about spark-jobserver. I deployed spark-jobserver in yarn-client mode using Docker. I'd like to use the dynamic resource allocation option for YARN in spark-jobserver. How can I add this option? And when will the 1.5.x version be supported?
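For reference, a sketch of the Spark properties involved (spark-defaults.conf style; how they are injected into spark-jobserver's context configuration depends on the jobserver version, and the executor bounds here are arbitrary):

    spark.dynamicAllocation.enabled       true
    spark.shuffle.service.enabled         true
    spark.dynamicAllocation.minExecutors  1
    spark.dynamicAllocation.maxExecutors  20

On YARN this also requires the auxiliary shuffle service to be configured in yarn-site.xml on each NodeManager.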

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Rishitesh, I did it with combineByKey, but your solution is clearer and more readable, and at least doesn't require 3 lambda functions to get confused by. Will definitely try it out tomorrow, thanks. Plus, OutOfMemoryError keeps bothering me as I read a massive amount of json files, whereas the

Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Joshua Fox
I am accessing the Spark Jobs Web GUI, running on AWS EMR. I can access this webapp (port 4040 as per default), but it only half-renders, producing "Uncaught SyntaxError: Unexpected token <" Here is a screenshot including Chrome Developer Console. [image:

Re: localhost webui port

2015-10-13 Thread Saisai Shao
By setting "spark.ui.port" to a port you can bind. On Tue, Oct 13, 2015 at 8:47 PM, Langston, Jim wrote: > Hi all, > > Is there any way to change the default port 4040 for the localhost webUI, > unfortunately, that port is blocked and I have no control of
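A minimal sketch (4050 is an arbitrary port assumed to be open on the host):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("my-app").set("spark.ui.port", "4050")
    val sc = new SparkContext(conf)

The same property can also be passed on the command line with --conf spark.ui.port=4050.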

Re: A problem with zipPartitions

2015-10-13 Thread Saisai Shao
You have to call checkpoint regularly on rdd0 to cut the dependency chain, otherwise you will hit the problem you mentioned, and eventually even a stack overflow. This is a classic problem for highly iterative jobs; you could google it for solutions. On Tue, Oct 13, 2015 at 7:09 PM, 张仪yf1
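A minimal sketch of the pattern for an iterative zipPartitions job (sc, initialSeq, nextRdd, iterations, and the interval of 10 are all illustrative names, not from the thread):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    var acc = sc.parallelize(initialSeq)
    for (i <- 1 to iterations) {
      acc = acc.zipPartitions(nextRdd(i)) { (a, b) => a ++ b }
      if (i % 10 == 0) {
        acc.cache()       // avoid recomputing the chain when the checkpoint job runs
        acc.checkpoint()  // writes to the checkpoint dir and drops the parent lineage
        acc.count()       // force materialization so the checkpoint actually happens
      }
    }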

localhost webui port

2015-10-13 Thread Langston, Jim
Hi all, Is there any way to change the default port 4040 for the localhost webUI? Unfortunately, that port is blocked and I have no control of that. I have not found any configuration parameter that would enable me to change it. Thanks, Jim

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Can you pastebin your Java code and the command you used to compile? Thanks > On Oct 13, 2015, at 1:42 AM, Umesh Kacha wrote: > > Hi Ted if fix went after 1.5.1 release then how come it's working with 1.5.1 > binary in spark-shell. > >> On Oct 13, 2015 1:32 PM, "Ted

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
OK, thanks much Ted. Looks like some issue with using the Maven dependencies in Java code for 1.5.1. I still can't understand: if the Spark 1.5.1 binary in spark-shell can recognize callUdf, why does callUdf not compile when using the Maven build? On Oct 13, 2015 2:20 PM, "Ted Yu"

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jean-Baptiste Onofré
Hi Joshua, What's the Spark version and what's your browser? I just tried on Spark 1.6-SNAPSHOT with Firefox and it works fine. Thanks Regards JB On 10/13/2015 02:17 PM, Joshua Fox wrote: I am accessing the Spark Jobs Web GUI, running on AWS EMR. I can access this webapp (port 4040 as per

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, I am using the following line of code. Sorry, I can't paste the entire code, but this single line doesn't compile in my Spark job: sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25))) I am using the IntelliJ editor, Java, and Maven dependencies of spark-core, spark-sql, spark
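For readability, the line in question with the imports it needs (a sketch assuming Spark 1.5.x; whether it compiles against the 1.5.1 Maven artifacts is exactly what this thread is trying to establish):

    import org.apache.spark.sql.functions.{callUDF, col, lit}

    val result = sourceframe.select(
      callUDF("percentile_approx", col("mycol"), lit(0.25)))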

Spark shuffle service does not work in standalone

2015-10-13 Thread Saif.A.Ellafi
Has anyone tried the shuffle service in standalone cluster mode? I want to enable it for dynamic allocation, but my jobs never start when I submit them. This happens with all my jobs. 15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at DataLoader.scala:86, took 16.318615 s Exception in thread "main"

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jonathan Kelly
Joshua, Since Spark is configured to run on YARN in EMR, instead of viewing the Spark application UI at port 4040, you should start from the YARN ResourceManager (on port 8088), then click on the ApplicationMaster link for the Spark application you are interested in. This will take you to

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jean-Baptiste Onofré
Thanks for the update Joshua. Let me try with Spark 1.4.1. I'll keep you posted. Regards JB On 10/13/2015 04:17 PM, Joshua Fox wrote: * Spark 1.4.1, part of EMR emr-4.0.0 * Chrome Version 41.0.2272.118 (64-bit) on Ubuntu On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Joshua Fox
- Spark 1.4.1, part of EMR emr-4.0.0 - Chrome Version 41.0.2272.118 (64-bit) on Ubuntu On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré wrote: > Hi Joshua, > > What's the Spark version and what's your browser ? > > I just tried on Spark 1.6-SNAPSHOT with firefox

Re: Spark shuffle service does not work in standalone

2015-10-13 Thread Jean-Baptiste Onofré
Hi, AFAIK, the shuffle service makes sense only to delegate the shuffle to mapreduce (as the mapreduce shuffle is most of the time faster than the Spark shuffle). As you run in standalone mode, the shuffle service will use the Spark shuffle. Not 100% sure, though. Regards JB On 10/13/2015 04:23 PM,

Why is my spark executor terminated?

2015-10-13 Thread Wang, Ningjun (LNG-NPV)
We use Spark on Windows 2008 R2 servers. We use one Spark context which creates one Spark executor. We run the Spark master, slave, driver, and executor on one single machine. From time to time, we found that the executor Java process was terminated. I cannot figure out why it was terminated. Can

Re: Why is my spark executor terminated?

2015-10-13 Thread Jean-Baptiste Onofré
Hi Ningjun, Nothing special in the master log ? Regards JB On 10/13/2015 04:34 PM, Wang, Ningjun (LNG-NPV) wrote: We use spark on windows 2008 R2 servers. We use one spark context which create one spark executor. We run spark master, slave, driver, executor on one single machine. From time

Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread David Bess
Got it working! Thank you for confirming my suspicion that this issue was related to Java. When I dug deeper I found multiple versions and some other issues. I worked on it a while before deciding it would be easier to just uninstall all Java and reinstall clean JDK, and now it works perfectly.

Re: Problem installing Spark on Windows 8

2015-10-13 Thread Steve Loughran
On 12 Oct 2015, at 23:11, Marco Mistroni wrote: Hi all, I have downloaded spark-1.5.1-bin-hadoop.2.4 and extracted it on my machine, but when I go to the \bin directory and invoke spark-shell I get the following exception. Could anyone

Conf setting for Java Spark

2015-10-13 Thread Ramkumar V
Hi, I'm using Java over Spark for processing 30 GB of data every hour. I'm doing spark-submit in cluster mode. I have a cluster of 11 machines (9 with 64 GB memory and 2 with 32 GB memory), but it takes 30 mins to process 30 GB of data every hour. How can I optimize this? How to compute the driver and

Re: Spark shuffle service does not work in standalone

2015-10-13 Thread Marcelo Vanzin
It would probably be more helpful if you looked for the executor error and posted it. The screenshot you posted is the driver exception caused by the task failure, which is not terribly useful. On Tue, Oct 13, 2015 at 7:23 AM, wrote: > Has anyone tried shuffle

Re: sql query orc slow

2015-10-13 Thread Zhan Zhang
Hi Patcharee, I am not sure which side is wrong, driver or executor. If it is executor side, the reason you mentioned may be possible. But if the driver side didn’t set the predicate at all, then somewhere else is broken. Can you please file a JIRA with a simple reproduce step, and let me know

RE: Spark shuffle service does not work in standalone

2015-10-13 Thread Saif.A.Ellafi
Hi, thanks. Executors are simply failing to connect to the shuffle server: 15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager 15/10/13 08:29:34 INFO BlockManager: Registering executor with local external shuffle service. 15/10/13 08:29:34 ERROR BlockManager: Failed to connect to

RE: Spark shuffle service does not work in standalone

2015-10-13 Thread Saif.A.Ellafi
I believe the confusion here is self-answered. The thing is that in the documentation the Spark shuffle service runs only under YARN, while here we are speaking about a standalone cluster. The proper question is: how to launch a shuffle service for standalone? Saif From:

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra
Hi Zhan Zhang, Here is the issue https://issues.apache.org/jira/browse/SPARK-11087 BR, Patcharee On 10/13/2015 06:47 PM, Zhan Zhang wrote: Hi Patcharee, I am not sure which side is wrong, driver or executor. If it is executor side, the reason you mentioned may be possible. But if the

Re: Spark shuffle service does not work in standalone

2015-10-13 Thread Marcelo Vanzin
You have to manually start the shuffle service if you're not running YARN. See the "sbin/start-shuffle-service.sh" script. On Tue, Oct 13, 2015 at 10:29 AM, wrote: > I believe the confusion here is self-answered. > > The thing is that in the documentation, the
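A sketch of the standalone setup per the reply above (the conf lines are the usual companion settings, not quoted from this thread):

    # on every worker node
    ./sbin/start-shuffle-service.sh

    # in spark-defaults.conf (or via --conf) for the application
    spark.shuffle.service.enabled   true
    spark.dynamicAllocation.enabled true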

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
I am currently dealing with a high priority bug in another project. Hope to get back to this soon. On Tue, Oct 13, 2015 at 11:56 AM, Umesh Kacha wrote: > Hi Ted sorry for asking again. Did you get chance to look at compilation > issue? Thanks much. > > Regards. > On Oct

TTL for saveAsObjectFile()

2015-10-13 Thread antoniosi
Hi, I am using RDD.saveAsObjectFile() to save the RDD dataset to Tachyon. In version 0.8, Tachyon will support TTL for saved files. Is that supported from Spark as well? Is there a way I could specify a TTL for a saved object file? Thanks. Antonio. -- View this message in context:

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Michael Armbrust
import org.apache.spark.sql.functions._

    df.groupBy("category")
      .agg(callUDF("collect_set", df("id")).as("id_list"))

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu wrote: > Hey Spark users, > > I'm trying to group by a dataframe, by appending occurrences into a list >
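A self-contained sketch of the suggestion (the sample data is illustrative; collect_set resolves through Hive's function registry, so a HiveContext is assumed):

    import org.apache.spark.sql.functions._

    val df = sqlContext.createDataFrame(Seq(
      ("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))).toDF("category", "id")

    val grouped = df.groupBy("category")
      .agg(callUDF("collect_set", df("id")).as("id_list"))

Note that collect_set drops duplicate ids within a group; Hive's collect_list keeps them.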

RE: Spark shuffle service does not work in standalone

2015-10-13 Thread Saif.A.Ellafi
Thanks, I missed that one. From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, October 13, 2015 2:36 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: Spark shuffle service does not work in stand alone You have to manually start the shuffle service if you're not running

Generated ORC files cause NPE in Hive

2015-10-13 Thread Daniel Haviv
Hi, We are inserting streaming data into a Hive ORC table via a simple insert statement passed to HiveContext. When trying to read the generated files using Hive 1.2.1, we get an NPE: at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91) at

Re: Generated ORC files cause NPE in Hive

2015-10-13 Thread Alexander Pivovarov
Daniel, Looks like we already have Jira for that error https://issues.apache.org/jira/browse/HIVE-11431 Could you put details on how to reproduce the issue to the ticket? Thank you Alex On Tue, Oct 13, 2015 at 11:14 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > We are

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, sorry for asking again. Did you get a chance to look at the compilation issue? Thanks much. Regards. On Oct 13, 2015 18:39, "Umesh Kacha" wrote: > Hi Ted I am using the following line of code I can't paste entire code > sorry but the following only line doesn't compile

Changing application log level in standalone cluster

2015-10-13 Thread Tom Graves
I would like to change the logging level for my application running on a standalone Spark cluster. Is there an easy way to do that without changing the log4j.properties on each individual node? Thanks, Tom
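One common approach is to ship a custom log4j.properties with the job rather than editing each node; a sketch, with file and jar names as placeholders:

    spark-submit \
      --files my-log4j.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
      myapp.jar

For the driver side alone, SparkContext.setLogLevel("WARN") (Spark 1.4+) also works without touching any files.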

Machine learning with spark (book code example error)

2015-10-13 Thread Zsombor Egyed
Hi! I was reading the Machine Learning with Spark book, and I was very interested in chapter 9 (text mining), so I tried the code examples. Everything was fine, but with this line:

    val testLabels = testRDD.map { case (file, text) =>
      val topic = file.split("/").takeRight(2).head
      newsgroupsMap(topic)
    }

I

Fwd: Problem about cannot open shared object file

2015-10-13 Thread 赵夏
Hello Everyone: I am new to Spark. I have met a problem which I cannot solve using Google. I have run the Pi example on Hadoop 2.6 successfully. After setting up the Spark platform and trying to run Pi using Spark "./bin/spark-submit --class org.apache.spark.examples.SparkPi --master

Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Spark Newbie
Hi Spark users, I'm seeing the below exception in my Spark Streaming application. It happens in the first stage, where the Kinesis receivers receive records and perform a flatMap operation on the unioned DStream. A coalesce step also happens as a part of that stage for optimizing the performance.

Any plans to support Spark Streaming within an interactive shell?

2015-10-13 Thread YaoPau
I'm seeing products that allow you to interact with a stream in real time (write code and see the streaming output change automatically), which I think makes it easier to test streaming code, although running in batch first and then turning streaming on is certainly a good way as well. I played around

Re: SPARK SQL Error

2015-10-13 Thread pnpritchard
Your app jar should be at the end of the command, without the --jars prefix. That option is only necessary if you have more than one jar to put on the classpath (i.e. dependency jars that aren't packaged inside your app jar). spark-submit --master yarn --class org.spark.apache.CsvDataSource
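A sketch of the corrected invocation (jar names and arguments are placeholders):

    spark-submit --master yarn --class org.spark.apache.CsvDataSource \
      --jars dep1.jar,dep2.jar \
      my-app.jar arg1 arg2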

Announcement: Hackathon at Netherlands Cancer Institute next week

2015-10-13 Thread Kees van Bochove
Dear all, I'd like to point out that 19-21 October we have a hackathon at the Netherlands Cancer Institute in Amsterdam, which is about connecting a major open source bioinformatics / medical informatics research data warehouse (called tranSMART) to SparkR, by implementing the RDD interface via

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
I modified DataFrameSuite, in the master branch, to call percentile_approx instead of simpleUDF:

- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function percentile_approx; at

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Tathagata Das
Is this happening too often? Is it slowing things down or blocking progress? Failures once in a while are part of the norm, and the system should take care of itself. On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie wrote: > Hi Spark users, > > I'm seeing the below

Re: updateStateByKey and stack overflow

2015-10-13 Thread Tian Zhang
It turns out that our HDFS checkpoint failed, but Spark Streaming kept running and building up a long lineage ... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/updateStateByKey-and-stack-overflow-tp25015p25054.html Sent from the Apache Spark User List

Re: Any plans to support Spark Streaming within an interactive shell?

2015-10-13 Thread Tathagata Das
StreamingContext is not designed to be reused, and reuse is not on the immediate roadmap; you have to create a new StreamingContext. Also, you should stop the StreamingContext without stopping the SparkContext of the shell. On Tue, Oct 13, 2015 at 1:47 PM, YaoPau wrote: > I'm seeing
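A minimal sketch of that pattern in the shell (the batch interval is illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    ssc.stop(stopSparkContext = false)   // keep the shell's SparkContext alive
    val ssc2 = new StreamingContext(sc, Seconds(10))
    // ...re-declare the DStreams and transformations, then:
    ssc2.start()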

Building with SBT and Scala 2.11

2015-10-13 Thread Jakob Odersky
I'm having trouble compiling Spark with SBT for Scala 2.11. The command I use is: dev/change-version-to-2.11.sh build/sbt -Pyarn -Phadoop-2.11 -Dscala-2.11 followed by compile in the sbt shell. The error I get specifically is:

Re: Building with SBT and Scala 2.11

2015-10-13 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtY7aX22B44dB On Tue, Oct 13, 2015 at 5:53 PM, Jakob Odersky wrote: > I'm having trouble compiling Spark with SBT for Scala 2.11. The command I > use is: > > dev/change-version-to-2.11.sh > build/sbt -Pyarn -Phadoop-2.11
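For comparison, a build invocation along the lines of what the replies describe (note -Phadoop-2.6, an existing profile, rather than -Phadoop-2.11; the exact flags are an assumption based on this thread):

    dev/change-version-to-2.11.sh
    build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 compile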

Fwd: [Streaming] join events in last 10 minutes

2015-10-13 Thread Daniel Li
Re-posting to the right group. -- Forwarded message -- From: Daniel Li Date: Tue, Oct 13, 2015 at 5:14 PM Subject: [Streaming] join events in last 10 minutes To: d...@spark.apache.org We have a scenario in which events from three Kafka topics sharing the same
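Since the message is truncated, a hedged sketch of one way to join keyed DStreams over the last 10 minutes (the stream names, the txnId key, and the window size are assumptions, not from the original mail):

    import org.apache.spark.streaming.Minutes

    val a = topicA.map(e => (e.txnId, e)).window(Minutes(10))
    val b = topicB.map(e => (e.txnId, e)).window(Minutes(10))
    val c = topicC.map(e => (e.txnId, e)).window(Minutes(10))

    val joined = a.join(b).join(c)  // (txnId, ((eventA, eventB), eventC))

The window duration must be a multiple of the batch interval.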

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-13 Thread Ted Yu
Still 404 as of a moment ago. On Mon, Oct 12, 2015 at 9:04 PM, Ted Yu wrote: > I checked commit history of streaming/pom.xml > > There should be no difference between 1.5.0 and 1.5.1 > > You can download 1.5.1's pom.xml and rename it so that you get unblocked. > > On Mon,

Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hey Spark users, I'm trying to group a dataframe by a column, appending occurrences into a list instead of counting them. Let's say we have a dataframe as shown below:

| category | id |
|:--------:|:--:|
|    A     |  1 |
|    A     |  2 |
|    B     |  3 |
|    B     |  4 |
|    C     |  5 |

ideally, after