Re: groupByKey does not work?

2016-01-05 Thread Sean Owen
I suspect this is another instance of case classes not working as expected between the driver and executor when used with spark-shell. Search JIRA for some back story. On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra wrote: > Spark 1.5.0 > > data: > >

Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Chandan Verma

finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Hi, is there any function to find the distinct count of all the variables in a dataframe? val sc = new SparkContext(conf) // spark context val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" -> "true") val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context

Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
It seems that currently spark.scheduler.pool must be set as a local property (associated with a thread). Is there any reason why spark.scheduler.pool cannot be set globally? My scenario is that I want my thriftserver started with the fair scheduler as the default pool, without using the set command to set the pool. Is

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Sean Owen
+juliet for an additional opinion, but FWIW I think it's safe to say that future CDH will have a more consistent Python story and that story will support 2.7 rather than 2.6. On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin wrote: > Does anybody here care about us dropping

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe.
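
A minimal spark-shell sketch of the suggestion above (the DataFrame and column names are illustrative):

import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}
import sqlContext.implicits._                      // sqlContext: the SQLContext available in spark-shell

val df = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "a"))).toDF("id", "name")

// exact distinct count over one or more columns
df.select(countDistinct(df("id"), df("name"))).show()

// approximate distinct count, cheaper on very large data
df.select(approxCountDistinct(df("id"))).show()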

Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi there, I have been using Spark 1.5.2 on my cluster without a problem and wanted to try Spark 1.6.0. I have the exact same configuration on both clusters. I am able to start the Standalone Cluster but I fail to submit a job getting errors like the following: 16/01/05 14:24:14 INFO

RE: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
Thanks a lot for your prompt response. I am pushing one message. HashMap<String, String> kafkaParams = new HashMap<String, String>(); kafkaParams.put("metadata.broker.list","localhost:9092"); kafkaParams.put("zookeeper.connect", "localhost:2181"); JavaPairInputDStream

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
Still, it would be good to know what happened exactly. Why did the netty dependency expect Java 8? Did you build your app on a machine with Java 8 and deploy on a Java 7 machine? Anyway, I played with the 1.6.0 spark-shell using Java 7 and it worked fine. I also looked at the distribution's

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Prasad Ravilla
I am using Spark 1.5.2. I am not using Dynamic allocation. Thanks, Prasad. On 1/5/16, 3:24 AM, "Ted Yu" wrote: >Which version of Spark do you use ? > >This might be related:

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Yes, that was the case, the app was built with Java 8. But that was the case with Spark 1.5.2 as well and it didn't complain. On 5 January 2016 at 16:40, Dean Wampler wrote: > Still, it would be good to know what happened exactly. Why did the netty > dependency expect

Re: sparkR ORC support.

2016-01-05 Thread Prem Sure
Yes Sandeep, also copy hive-site.xml too to spark conf directory. On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana wrote: > Also, do I need to setup hive in spark as per the link > http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark > ? > > We might

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Jean-Baptiste Onofré
Hi Rachana, don't you have two messages on the kafka broker ? Regards JB On 01/05/2016 05:14 PM, Rachana Srivastava wrote: I have a very simple two lines program. I am getting input from Kafka and save the input in a file and counting the input received. My code looks like this, when I run

Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
Hi everyone, considering the new Datasets API, will there be Encoders defined for reading and writing Avro files ? Will it be possible to use already generated Avro classes ? Regards, -- *Olivier Girardot*

Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two-line program. I am getting input from Kafka, saving the input in a file and counting the input received. My code looks like this; when I run it I am getting two accumulator counts for each input. HashMap<String, String> kafkaParams = new HashMap

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Annabel Melongo
Vijay, Are you closing the fileinputstream at the end of each loop ( in.close())? My guess is those streams aren't closed and thus the "too many open files" exception. On Tuesday, January 5, 2016 8:03 AM, Priya Ch wrote: Can some one throw light on this ?

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1 Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python

Re: finding distinct count using dataframe

2016-01-05 Thread Kristina Rogale Plazonic
I think it's an expression, rather than a function you'd find in the API (as a function you could do df.select(col).distinct.count) This will give you the number of distinct rows in both columns: scala> df.select(countDistinct("name", "age")) res397: org.apache.spark.sql.DataFrame =

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Also, do I need to setup hive in spark as per the link http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ? We might need to copy hdfs-site.xml file to spark conf directory ? On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana wrote: > Deepak > > Tried

Re: Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Ted Yu
Please take a look at the following for examples: R/pkg/R/RDD.R R/pkg/R/pairRDD.R Cheers On Tue, Jan 5, 2016 at 2:36 AM, Chandan Verma wrote: > >

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi Dean, thanks so much for the response! It works without a problem now! On 5 January 2016 at 14:33, Dean Wampler wrote: > ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The > Java 7 method returns a Set. Are you running Java 7? What happens if

Handling futures from foreachPartitionAsync in Spark Streaming

2016-01-05 Thread Trevor
Hi everyone, I'm working on a spark streaming program where I need to asynchronously apply a complex function across the partitions of an RDD. I'm currently using foreachPartitionAsync to achieve this. What is the idiomatic way of handling the FutureAction that returns from the

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The Java 7 method returns a Set. Are you running Java 7? What happens if you run Java 8? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I don't understand. If you're using fair scheduling and don't set a pool, the default pool will be used. On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote: > > It seems currently spark.scheduler.pool must be set as localProperties > (associate with thread). Any reason why

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesn't it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think it's a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Shixiong(Ryan) Zhu
Did you enable "spark.speculation"? On Tue, Jan 5, 2016 at 9:14 AM, Prasad Ravilla wrote: > I am using Spark 1.5.2. > > I am not using Dynamic allocation. > > Thanks, > Prasad. > > > > > On 1/5/16, 3:24 AM, "Ted Yu" wrote: > > >Which version of Spark do

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Shixiong(Ryan) Zhu
Hey Rachana, There are two jobs in your code actually: `rdd.isEmpty` and `rdd.saveAsTextFile`. Since you don't cache or checkpoint this rdd, it will execute your map function twice for each record. You can move "accum.add(1)" to "rdd.saveAsTextFile" like this: JavaDStream lines =
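
A hedged Scala sketch of the fix described above (the original code is Java; the socket source stands in for the Kafka stream). Caching the batch RDD and counting in a single place keeps the accumulator from being incremented once per action:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))          // sc: an existing SparkContext
val accum = sc.accumulator(0L, "record count")
val lines = ssc.socketTextStream("localhost", 9999)      // stand-in for the Kafka DStream

lines.foreachRDD { rdd =>
  val cached = rdd.cache()                               // materialize once, reuse for both actions
  if (!cached.isEmpty()) {
    accum.add(cached.count())                            // count exactly once, from the cached data
    cached.saveAsTextFile(s"/tmp/out-${System.currentTimeMillis}")
  }
  cached.unpersist()
}

ssc.start()
ssc.awaitTermination()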

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1 > On Jan 5, 2016, at 10:49 AM, Davies Liu wrote: > > +1 > > On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas > wrote: >> +1 >> >> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python >> 2.6 is ancient history and

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
You could try with the `Encoders.bean` method. It detects classes that have getters and setters. Please report back! On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > considering the new Datasets API, will there be Encoders defined for
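
A hedged Scala sketch of that suggestion; `MyAvroRecord` is a hypothetical Avro-generated class (no-arg constructor plus getters/setters) and the parquet path is illustrative:

import org.apache.spark.sql.{Dataset, Encoders}

// bean encoding inspects the getters/setters of the (hypothetical) generated class
val encoder = Encoders.bean(classOf[MyAvroRecord])

// view an existing DataFrame of matching shape as a typed Dataset
val ds: Dataset[MyAvroRecord] = sqlContext.read.parquet("/path/to/records").as[MyAvroRecord](encoder)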

Spark on Apache Ignite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems promising. https://ignite.apache.org/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884.html Sent from the Apache Spark User List mailing list archive at

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in 2013. RHEL 5 is not a good enough reason

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Andy Davidson
Hi Michael, I am not sure you understand my code correctly. I am trying to implement the org.apache.spark.ml.Transformer interface in Java 8. My understanding is that the pseudocode for transformers is something like @Override public DataFrame transform(DataFrame df) { 1. Select the input column

spark-itemsimilarity No FileSystem for scheme error

2016-01-05 Thread roy
Hi, we are using CDH 5.4.0 with Spark 1.5.2 (doesn't come with CDH 5.4.0). I am following this link https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to try to test/create a new algorithm with mahout item-similarity. I am running the following command ./bin/mahout

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Deepak, tried this. Getting this error now: Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") : unused argument ("") On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma wrote: > Hi Sandeep > can you try this ? > > results <- sql(hivecontext, "FROM test

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
Another option would be to try rdd.toLocalIterator(), not sure if it will help though. I had the same problem and ended up moving all parts to local disk (with the Hadoop FileSystem API) and then processing them locally. On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try

coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
hi, I am trying to save many partitions of a Dataframe into one CSV file and it takes forever for large data sets of around 5-6 GB. sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop") For small data the above code works well but for large data it hangs

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true). On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > hi I am trying to save many partitions of Dataframe into one CSV file and > it > take forever for large data sets of around 5-6 GB. > > >

Out of memory issue

2016-01-05 Thread babloo80
Hello there, I have a spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in different stages of execution and creates a result parquet of 9 GB (about 27 million rows containing 165 columns; some columns are map-based, containing at most 200-value histograms). The stages involve, Step 1:

RE: Spark on Apache Ignite?

2016-01-05 Thread Umesh Kacha
Hi Nate, thanks much. I have the exact same use cases mentioned by you. My spark job does heavy writing involving group by and huge data shuffling. Can you please provide any pointers on how I can make my existing spark job, which is running on yarn, run on Ignite? Please guide. Thanks again. On

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I tried Ted's solution and it works. But I keep hitting the JVM out of memory problem. And grouping the key causes a lot of data shuffling. So I am trying to order the data based on ID first and save as Parquet. Is there a way to make sure that the data is partitioned so that each ID's data is

Re: How to accelerate reading json file?

2016-01-05 Thread VISHNU SUBRAMANIAN
Hi, you can try this: sqlContext.read.format("json").option("samplingRatio","0.1").load("path") If it still takes time, feel free to experiment with the samplingRatio. Thanks, Vishnu On Wed, Jan 6, 2016 at 12:43 PM, Gavin Yue wrote: > I am trying to read json files

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I found that in 1.6 a dataframe can do repartition. Do I still need to do orderBy first, or can I just repartition? On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue wrote: > I tried Ted's solution and it works. But I keep hitting the JVM out > of memory

Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
Deenar, I have not resolved this issue. Why do you think it's from different versions of Derby? I was playing with this as a fun experiment and my setup was on a clean machine -- no other versions of hive/hadoop/etc... On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar

Re: pyspark Dataframe and histogram through ggplot (python)

2016-01-05 Thread Felix Cheung
Hi, select() returns a new Spark DataFrame; I would imagine ggplot would not work with it. Could you try df.select("something").toPandas()? _ From: Snehotosh Banerjee Sent: Tuesday, January 5, 2016 4:32 AM Subject: pyspark Dataframe

RE: aggregateByKey vs combineByKey

2016-01-05 Thread LINChen
Hi Marco, In your case, since you don't need to perform an aggregation (such as a sum or average) over each key, using groupByKey may perform better. groupByKey inherently utilizes CompactBuffer, which is much more efficient than ArrayBuffer. Thanks. LIN Chen Date: Tue, 5 Jan 2016 21:13:40 +

How to accelerate reading json file?

2016-01-05 Thread Gavin Yue
I am trying to read json files following the example: val path = "examples/src/main/resources/jsonfile" val people = sqlContext.read.json(path) I have 1 TB of files in the path. It took 1.2 hours to finish reading them to infer the schema. But I already know the schema. Could I make this
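
Since the schema is already known, one way to skip the inference pass is to supply it explicitly; a hedged spark-shell sketch with illustrative field names:

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("created", StringType)))

// read.schema(...) avoids the full scan that schema inference would otherwise do
val people = sqlContext.read.schema(schema).json("examples/src/main/resources/jsonfile")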

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jim Lohse
Hey Python 2.6 don't let the door hit you on the way out! haha Drop It No Problem On 01/05/2016 12:17 AM, Reynold Xin wrote: Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha
Yes there is. It is called a window function over partitions. Equivalent SQL would be: select * from (select a,b,c, rank() over (partition by a order by b) r from df) x where r = 1 You can register your DF as a temp table and use the sql form. Or, (>Spark 1.4) you can use window methods
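
A hedged Scala sketch of the window-method form (the question is PySpark; columns follow the a/b/c example, and in 1.5/1.6 window functions need a HiveContext, which the spark-shell sqlContext is when Spark is built with Hive):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import sqlContext.implicits._

val df = sc.parallelize(Seq((1,1,1), (1,2,2), (2,1,4), (2,2,5), (3,1,7))).toDF("a", "b", "c")

val w = Window.partitionBy("a").orderBy("b")

// keep the top-ranked row (minimum b) within each value of a
val minPerGroup = df.withColumn("r", rank().over(w)).filter("r = 1").drop("r")
minPerGroup.show()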

Re: sparkR ORC support.

2016-01-05 Thread Felix Cheung
Firstly I don't have ORC data to verify but this should work: df <- loadDF(sqlContext, "data/path", "orc") Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should be called after sparkR.init() - please check if there is any error message there.

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the fileinputstream is closed. Maybe I didn't show it in the screenshot. As spark implements sort-based shuffle, there is a parameter called maximum merge factor which decides the number of files that can be merged at once, and this avoids too many open files. I am suspecting that it is

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
The other way to do it is to build a custom version of Spark where you have changed the value of DEFAULT_SCHEDULING_MODE -- and if you were paying close attention, I accidentally let it slip that that is what I've done. I previously wrote "schedulingMode = DEFAULT_SCHEDULING_MODE -- i.e.

Re: How to use Java8

2016-01-05 Thread Andy Davidson
Hi Sea From: Sea <261810...@qq.com> Date: Tuesday, January 5, 2016 at 6:16 PM To: "user @spark" Subject: How to use Java8 > Hi, all > I want to support java8, I use JDK1.8.0_65 in production environment, but > it doesn't work. Should I build spark using jdk1.8,

Re: How to use Java8

2016-01-05 Thread Sea
thanks ------ Original Message ------ From: "Andy Davidson"; Date: 2016-01-06 11:04 To: "Sea"<261810...@qq.com>; "user"; Subject: Re: How to use Java8 Hi Sea From: Sea <261810...@qq.com> Date:

Re: UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Tathagata Das
Both mapWithState and updateStateByKey by default use the HashPartitioner, and hash the key in the key-value DStream on which the state operation is applied. The new data and state are partitioned by the exact same partitioner, so that the same keys from the new data (from the input DStream) get
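
A hedged Scala sketch showing the partitioner being passed explicitly to updateStateByKey (the socket source and running count are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/state-checkpoint")                  // stateful operations require a checkpoint dir

val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

val update: (Seq[Int], Option[Long]) => Option[Long] =
  (values, state) => Some(state.getOrElse(0L) + values.sum)

// new batch data and state are co-partitioned by this partitioner,
// so the same keys land in the same partition on every batch
val counts = pairs.updateStateByKey(update, new HashPartitioner(sc.defaultParallelism))
counts.print()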

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Thanks Mark, a custom configuration file would be better for me. Changing the code would affect all the applications; this is too risky for me. On Wed, Jan 6, 2016 at 10:50 AM, Mark Hamstra wrote: > The other way to do it is to build a custom version of Spark where

[Spark-SQL] Custom aggregate function for GrouppedData

2016-01-05 Thread Abhishek Gayakwad
Hello Hivemind, Referring to this thread - https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html. I have learnt that we cannot do much with grouped data apart from using existing aggregate functions. This blog post was written in May 2015, I don't

Re: aggregateByKey vs combineByKey

2016-01-05 Thread Ted Yu
Looking at PairRDDFunctions.scala : def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope { ... combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp,
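
A hedged spark-shell sketch of the two calls side by side for the word-length data from the question; as the snippet above shows, aggregateByKey is a thin wrapper over combineByKeyWithClassTag:

val kv = sc.parallelize(Seq((2, "Hi"), (1, "i"), (2, "am"), (1, "a"), (4, "test"), (6, "string")))

// aggregateByKey: a zero value plus seqOp/combOp
val viaAggregate = kv.aggregateByKey(List.empty[String])(
  (acc, w) => w :: acc,
  (a, b) => a ::: b)

// combineByKey: the same result, spelled out with an explicit createCombiner
val viaCombine = kv.combineByKey(
  (w: String) => List(w),
  (acc: List[String], w: String) => w :: acc,
  (a: List[String], b: List[String]) => a ::: b)

viaAggregate.collect().foreach(println)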

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until 2020. So I'm assuming these large companies will have the option of riding out Python 2.6 until then. Are we seriously saying that Spark should likewise support Python 2.6 for the next several years? Even though the core

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
yeah, the practical concern is that we have no control over the java or python version on large company clusters. our current reality for the vast majority of them is java 7 and python 2.6, no matter how outdated that is. i don't like it either, but i cannot change it. we currently don't use pyspark

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without

How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
Hey, For example, a table df with two columns id name 1 abc 1 bdf 2 ab 2 cd I want to group by the id and concat the strings into an array of strings, like this: id 1 [abc,bdf] 2 [ab, cd] How could I achieve this in a dataframe? I am stuck on df.groupBy("id"). ??? Thanks

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
i do not think so. does python 2.7 need to be installed on all slaves? if so, we do not have direct access to those. also, spark is easy for us to ship with our software since it's apache 2 licensed, and it only needs to be present on the machine that launches the app (thanks to yarn). even

Re: Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
val sparkConf = new SparkConf() .setMaster("local[*]") .setAppName("Dataframe Test") val sc = new SparkContext(sparkConf) val sql = new SQLContext(sc) val dataframe = sql.read.json("orders.json") val expanded = dataframe .explode[::[Long], Long]("items", "item1")(row => row)

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
Hi, a dataframe has no boolean option for coalesce; it is only for RDDs, I believe. sourceFrame.coalesce(1,true) // gives a compilation error On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov wrote: > try coalesce(1, true). > > On Tue, Jan 5, 2016 at 11:58 AM, unk1102

RE: Spark on Apache Ignite?

2016-01-05 Thread nate
We started playing with Ignite to back Hadoop, Hive and Spark services, and are looking to move to it as our default for deployment going forward. Still early, but so far it's been pretty nice, and we're excited for the flexibility it will provide for our particular use cases. Would say in general it's worth

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Michael Armbrust
> > I am trying to implement org.apache.spark.ml.Transformer interface in > Java 8. > My understanding is that the pseudocode for transformers is something like > > @Override > > public DataFrame transform(DataFrame df) { > > 1. Select the input column > > 2. Create a new column > > 3. Append the

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Ted Yu
Which version of Spark do you use ? This might be related: https://issues.apache.org/jira/browse/SPARK-8560 Do you use dynamic allocation ? Cheers > On Jan 4, 2016, at 10:05 PM, Prasad Ravilla wrote: > > I am seeing negative active tasks in the Spark UI. > > Is anyone

sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Hello, I need to read ORC files in hdfs in R using spark. I am not able to find a package to do that. Can anyone help with documentation or an example for this purpose? -- Architect Infoworks.io http://Infoworks.io

Re: finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Thanks Yanbo, thanks for the help. But I'm not able to find the countDistinct or approxCountDistinct functions. Are these functions within dataframe or in some other package? On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang wrote: > Hi Arunkumar, > > You can use

RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Hello All, After investigating further using a test program, we were able to read the kafka input messages using spark streaming. Once we add a particular line which performs map and reduce - and groupByKey (all written in a single line), we are not seeing the input message details in the logs.

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
Hi Unk1102, I also had trouble when I used coalesce(). Repartition() worked much better. Keep in mind if you have a large number of partitions you are probably going to have high communication costs. Also my code works a lot better on 1.6.0. DataFrame memory was not able to be spilled in 1.5.2. In 1.6.0
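
A hedged sketch of the repartition variant discussed here (the path and the spark-csv compression option are illustrative); repartition(1) shuffles down to one partition while keeping the upstream stages parallel:

sourceFrame                                    // sourceFrame: the DataFrame from the question
  .repartition(1)                              // single output partition, added via a shuffle
  .write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip")                     // assumption: spark-csv's compression option name
  .save("/path/hadoop")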

aggregateByKey vs combineByKey

2016-01-05 Thread Marco Mistroni
Hi all, i have the following dataSet kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)] It's a simple list of tuples containing (word_length, word). What i wanted to do was to group the result by key in order to have a result in the form [(word_length_1, [word1, word2, word3],

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
I'll do, but if you want my two cents, creating a dedicated "optimised" encoder for Avro would be great (especially if it's possible to do better than plain AvroKeyValueOutputFormat with saveAsNewAPIHadoopFile :) ) Thanks for your time Michael, and happy new year :-) Regards, Olivier.

sortBy transformation shows as a job

2016-01-05 Thread Soumitra Kumar
Fellows, I have a simple piece of code. sc.parallelize (Array (1, 4, 3, 2), 2).sortBy (i=>i).foreach (println) This results in 2 jobs (sortBy, foreach) in Spark's application master ui. I thought there was a one-to-one relationship between RDD actions and jobs. Here, the only action is foreach, so there should be only

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
On Tue, Jan 5, 2016 at 1:31 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > I'll do, but if you want my two cents, creating a dedicated "optimised" > encoder for Avro would be great (especially if it's possible to do better > than plain AvroKeyValueOutputFormat with

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Can someone throw light on this? Regards, Padma Ch On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch wrote: > Chris, we are using spark version 1.3.0. we have not set > the spark.streaming.concurrentJobs > parameter. It takes the default value. > > Vijay, > > From

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep can you try this ? results <- sql(hivecontext, "FROM test SELECT id","") Thanks Deepak On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana wrote: > Thanks Deepak. > > I tried this as well. I created a hivecontext with "hivecontext <<- > sparkRHive.init(sc) "

problem building spark on centos

2016-01-05 Thread Jade Liu
Hi, All: I'm trying to build spark 1.5.2 from source using maven with the following command: ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests I got the following error: + VERSION='[ERROR] [Help 2]

pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen
Hi, I am trying to retrieve the rows with a minimum value of a column for each group. For example: the following dataframe: a | b | c -- 1 | 1 | 1 1 | 2 | 2 1 | 3 | 3 2 | 1 | 4 2 | 2 | 5 2 | 3 | 6 3 | 1 | 7 3 | 2 | 8 3 | 3 | 9 -- I group by 'a', and want the rows with the

Re: problem building spark on centos

2016-01-05 Thread Ted Yu
Which version of maven are you using ? It should be 3.3.3+ On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu wrote: > Hi, All: > > I’m trying to build spark 1.5.2 from source using maven with the following > command: > > ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn

DataFrame withColumnRenamed throwing NullPointerException

2016-01-05 Thread Prasad Ravilla
I am joining two data frames as shown in the code below. This is throwing a NullPointerException. I have a number of different joins throughout the program and the SparkContext throws this NullPointerException randomly on one of the joins. The two data frames are very large data frames (

UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi, I am relatively new to Spark and am using the updateStateByKey() operation to maintain state in my Spark Streaming application. The input data is coming through a Kafka topic. 1. I want to understand how DStreams are partitioned. 2. How does the partitioning work with mapWithState() or

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Right, I can override the root pool in the configuration file. Thanks Mark. On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra wrote: > Just configure with > FAIR in fairscheduler.xml (or > in spark.scheduler.allocation.file if you have overridden the default name > for the

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
+1 On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland wrote: > Most admins I talk to about python and spark are already actively (or on > their way to) managing their cluster python installations. Even if people > begin using the system python with pyspark, there is

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
hey evil admin:) i think the bit about java was from me? if so, i meant to indicate that the reality for us is that java is 1.7 on most (all?) clusters. i do not believe spark prefers java 1.8. my point was that even though java 1.7 is getting old as well, it would be a major issue for me if spark

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Sorry, I didn't make it clear. What I want is for the default pool to use fair scheduling. But it seems that if I want to use fair scheduling now, I have to set spark.scheduler.pool explicitly. On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra wrote: > I don't understand. If you're using

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0. Personally, I would recommend using Java 8 if you're running Spark 1.5.0+ and are using SQL/DataFrames so that you can benefit from improvements to code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
Just configure the default pool with schedulingMode FAIR in fairscheduler.xml (or in spark.scheduler.allocation.file if you have overridden the default name for the config file.) `buildDefaultPool()` will only build the pool named "default" with the default properties (such as schedulingMode = DEFAULT_SCHEDULING_MODE -- i.e.
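
A sketch of a fairscheduler.xml that makes the default pool use fair scheduling (the weight and minShare values are illustrative):

<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>

This only takes effect when the application itself runs with spark.scheduler.mode=FAIR, as in the thriftserver scenario discussed in this thread.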

Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
Hi All, I have the following spark sql query and would like to convert it to use the dataframes api (spark 1.6). The eee, eep and pfep are all maps of (int -> float). select e.counterparty, epe, mpfe, eepe, noOfMonthseep, teee as effectiveExpectedExposure, teep as expectedExposure , tpfep

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
interesting, i didn't know that! On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas wrote: > even if python 2.7 was needed only on this one machine that launches the > app we can not ship it with our software because its gpl licensed > > Not to nitpick, but maybe this

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python installed since they run Python code in PySpark jobs natively. On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
> > Note that you _can_ use a Python 2.7 `ipython` executable on the driver > while continuing to use a vanilla `python` executable on the executors Whoops, just to be clear, this should actually read "while continuing to use a vanilla `python` 2.7 executable". On Tue, Jan 5, 2016 at 3:07 PM,

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the app we can not ship it with our software because its gpl licensed Not to nitpick, but maybe this is important. The Python license is GPL-compatible but not GPL : Note GPL-compatible

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Ted Yu
Something like the following: val zeroValue = collection.mutable.Set[String]() val aggregated = data.aggregateByKey (zeroValue)((set, v) => set += v, (setOne, setTwo) => setOne ++= setTwo) On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue wrote: > Hey, > > For example, a table
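
A slightly fuller, hedged version of that sketch that can be pasted into spark-shell against the example table from the question:

val data = sc.parallelize(Seq((1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd")))

val zeroValue = collection.mutable.Set[String]()
val aggregated = data.aggregateByKey(zeroValue)(
  (set, v) => set += v,            // fold each value into the per-partition set
  (s1, s2) => s1 ++= s2)           // merge the per-partition sets

aggregated.collect().foreach { case (id, names) =>
  println(s"$id -> [${names.mkString(",")}]")
}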

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) then that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < >

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I think that there are some bytecode-level incompatibilities between 2.6 and 2.7 which would impact the deserialization of Python closures, so I think you need to be running the same 2.x version for all communicating Spark

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Michael Armbrust
This would also be possible with an Aggregator in Spark 1.6: https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu wrote: > Something like the following: > > val zeroValue =

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep, I am not sure if ORC can be read directly in R. But there can be a workaround. First create a hive table on top of the ORC files and then access that hive table in R. Thanks Deepak On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote: > Hello > > I need to read an ORC

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Thanks Deepak. I tried this as well. I created a hivecontext with "hivecontext <<- sparkRHive.init(sc)". When I tried to read a hive table from this, results <- sql(hivecontext, "FROM test SELECT id") I get the below error: Error in callJMethod(sqlContext, "sql", sqlQuery) : Invalid jobj
