Re: Groupby is faster in Impala than Spark SQL - any suggestions

2017-03-28 Thread Ryan
And could you paste the stage and task information from the Spark UI? On Wed, Mar 29, 2017 at 11:30 AM, Ryan wrote: > How long does it take if you remove the repartition and just collect the result? I don't think repartition is needed here. There's already a shuffle for

Re: Groupby is faster in Impala than Spark SQL - any suggestions

2017-03-28 Thread Ryan
How long does it take if you remove the repartition and just collect the result? I don't think repartition is needed here. There's already a shuffle for the group by. On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote: > Hi, I am working on a requirement where i
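Ryan's suggestion amounts to letting the group-by shuffle do the partitioning. Below is a minimal Scala sketch of the shape he describes; the table names, join key and aggregated field are placeholders, not taken from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object JoinGroupByNoRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JoinGroupBy").getOrCreate()

    // Hypothetical tables standing in for the 10 GB / 96 GB inputs from the thread.
    val table1 = spark.table("db.table1")
    val table2 = spark.table("db.table2")

    // The group-by already introduces a shuffle, so no explicit repartition() is added.
    val result = table1.join(table2, Seq("join_key"))
      .groupBy("join_key")
      .agg(max("some_field").as("max_value"))

    result.write.mode("overwrite").parquet("/tmp/groupby_result")
    spark.stop()
  }
}
```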

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Richard Xin
Maybe you could try something like that: SparkSession sparkSession = SparkSession.builder().appName("Rows2DataSet").master("local").getOrCreate(); List results = new LinkedList(); JavaRDD jsonRDD =
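One common approach, sketched here in Scala for brevity (the Java SparkSession API has an analogous createDataFrame(List<Row>, StructType) overload), is to pair the collected rows with an explicit schema. The column names and types below are invented for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object RowsToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Rows2DataSet").master("local").getOrCreate()

    // Rows collected from some computation, as in the thread.
    val results = new java.util.LinkedList[Row]()
    results.add(Row("a", 1.0))
    results.add(Row("b", 2.0))

    // An explicit schema is needed because a plain Row carries no type information.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("score", DoubleType, nullable = false)))

    val df = spark.createDataFrame(results, schema)
    df.show()
    spark.stop()
  }
}
```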

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Shixiong(Ryan) Zhu
mapPartitionsWithSplit was removed in Spark 2.0.0. You can use mapPartitionsWithIndex instead. On Tue, Mar 28, 2017 at 3:52 PM, Anahita Talebi wrote: > Thanks. I tried this one, as well. Unfortunately I still get the same error. On Wednesday, March 29, 2017,
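For reference, a minimal sketch of the replacement Shixiong mentions; the data and the tagging logic are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsWithIndexExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mapPartitionsWithIndex").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 4)

    // mapPartitionsWithIndex replaces the removed mapPartitionsWithSplit:
    // the first argument passed to the function is the partition index.
    val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(value => (partitionIndex, value))
    }

    tagged.collect().foreach(println)
    spark.stop()
  }
}
```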

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
Hi Spark dev & users, For those who use GraphFrames (DataFrame-based graphs), we have published a new release 0.4.0. It adds support for Apache Spark 2.1, with versions published for Spark 2.1 and 2.0 and for Scala 2.10 and 2.11. *Docs*:
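For anyone trying the release, a minimal usage sketch, assuming the GraphFrames 0.4.0 package is on the classpath (for example via --packages graphframes:graphframes:0.4.0-spark2.1-s_2.11); the toy vertices and edges are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphFramesQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GraphFramesQuickStart").master("local[*]").getOrCreate()

    // Vertices need an "id" column; edges need "src" and "dst" columns.
    val vertices = spark.createDataFrame(Seq(
      ("a", "Alice"), ("b", "Bob"), ("c", "Charlie"))).toDF("id", "name")
    val edges = spark.createDataFrame(Seq(
      ("a", "b", "follows"), ("b", "c", "follows"))).toDF("src", "dst", "relationship")

    val graph = GraphFrame(vertices, edges)
    graph.inDegrees.show()
    spark.stop()
  }
}
```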

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Thanks. I tried this one, as well. Unfortunately I still get the same error. On Wednesday, March 29, 2017, Marco Mistroni wrote: > 1.7.5 On 28 Mar 2017 10:10 pm, "Anahita Talebi"

Re: Multiple cores/executors in Pyspark standalone mode

2017-03-28 Thread Gourav Sengupta
Hi, any particular reason why you would not use the Spark server and then create your own Spark session? Regards, Gourav. On Fri, Mar 24, 2017 at 7:43 PM, Li Jin wrote: > Hi, I am wondering, does pyspark standalone (local) mode support multiple cores/executors?

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
1.7.5 On 28 Mar 2017 10:10 pm, "Anahita Talebi" wrote: > Hi, Thanks for your answer. What is the version of "org.slf4j" % "slf4j-api" in your sbt file? I think the problem might come from this part. On Tue, Mar 28, 2017 at 11:02 PM, Marco Mistroni

dataframe join questions?

2017-03-28 Thread shyla deshpande
Following are my questions. Thank you. 1. When joining dataframes, is it a good idea to repartition on the key column that is used in the join, or is the optimizer smart enough that we can forget it? 2. In an RDD join, wherever possible we do reduceByKey before the join to avoid a big shuffle of data. Do we need
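A short Scala sketch of the two options behind question 1, with placeholder table names; this shows general Spark behaviour rather than a reply from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategies {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JoinStrategies").getOrCreate()
    val large = spark.table("db.large_table") // placeholder
    val small = spark.table("db.small_table") // placeholder

    // 1. Plain DataFrame join: Catalyst plans the shuffle itself,
    //    so an explicit repartition on the key is usually unnecessary.
    val plain = large.join(small, Seq("key"))

    // 2. If one side is small, a broadcast hint avoids shuffling the large side at all.
    val broadcasted = large.join(broadcast(small), Seq("key"))

    plain.explain()
    broadcasted.explain()
    spark.stop()
  }
}
```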

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Hello again, I just tried to change the version to 3.0.0 and remove the libraries breeze, netlib and scoopt, but I still get the same error. On Tue, Mar 28, 2017 at 11:02 PM, Marco Mistroni wrote: > Hello, uhm, I have a project whose build.sbt is closest to yours, where I am

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Hi, Thanks for your answer. What is the version of "org.slf4j" % "slf4j-api" in your sbt file? I think the problem might come from this part. On Tue, Mar 28, 2017 at 11:02 PM, Marco Mistroni wrote: > Hello, uhm, I have a project whose build.sbt is closest to yours, where I

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
Hello, uhm, I have a project whose build.sbt is closest to yours, where I am using Spark 2.1, Scala 2.11 and scalatest (I upgraded to 3.0.0), and it works fine in my projects, though I don't have any of the following libraries that you mention: breeze, netlib-all, scoopt. HTH. On Tue, Mar 28, 2017

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Hi, Thanks for your answer. I first changed the Scala version to 2.11.8 and kept the Spark version at 1.5.2 (the old version). Then I changed the scalatest version to "3.0.1". With this configuration, I could compile the code, run it, and generate the .jar file. When I changed the Spark version

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Jörn Franke
I personally never add the _scala version suffix to the dependency but always cross-compile. This seems to be the cleanest. Additionally, Spark dependencies and Hadoop dependencies should be "provided", not "compile". Scalatest seems to be outdated. I would also not use a local repo, but either an artefact

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Hi, Thanks for your answer. I just changed the sbt file and set the Scala version to 2.10.4, but I still get the same error: [info] Compiling 4 Scala sources to /Users/atalebi/Desktop/new_version_proxcocoa-master/target/scala-2.10/classes... [error]

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
Hello, that looks to me like there's something dodgy with your Scala installation. Though Spark 2.0 is built on Scala 2.11, it still supports 2.10... I suggest you change one thing at a time in your sbt. First the Spark version: run it and see if it works. Then amend the Scala version. HTH, Marco. On Tue,

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Anahita Talebi
Hello, Thank you all for your informative answers. I actually changed the Scala version to 2.11.8 and the Spark version to 2.1.0 in the build.sbt. Except for these two (the Scala and Spark versions), I kept the same values for the rest of the build.sbt file.

apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Karin Valisova
Hello! I am running Spark on Java and bumped into a problem I can't solve or find anything helpful among answered questions, so I would really appreciate your help. I am running some calculations, creating rows for each result: List results = new LinkedList(); for(something){

Groupby is faster in Impala than Spark SQL - any suggestions

2017-03-28 Thread KhajaAsmath Mohammed
Hi, I am working on a requirement where I need to join two tables and do a group by to get the max value on some fields. Table1: 10 GB of data. Table2: 96 GB of data. The same query in Impala takes around 20 minutes, and it took almost 3 hours to run in Spark SQL. I have added repartition to the dataframe,

question on DStreams

2017-03-28 Thread kant kodali
Hi All, I have the following question. Imagine there is a DStream of JSON strings coming in and I apply a few different filters in parallel on the same DStream (so these filters are not applied one after the other). For example, here is the pseudocode if that helps: dstream.filter(x -> { check for
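A sketch of the pattern being described: several independent filters branching off one DStream, with a cache() so each batch is computed once. The socket source and string-matching filters are stand-ins for the real JSON checks.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParallelFilters {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParallelFilters").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source of JSON strings; any receiver or direct stream would do.
    val jsonStream = ssc.socketTextStream("localhost", 9999)

    // Caching lets several independent filters reuse the same batch RDDs
    // instead of recomputing the source for each branch.
    jsonStream.cache()

    val typeA = jsonStream.filter(_.contains("\"type\":\"A\""))
    val typeB = jsonStream.filter(_.contains("\"type\":\"B\""))

    typeA.print()
    typeB.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```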

problem reading binary source with Spark Streaming when using JavaStreamingContext.binaryRecordsStream()

2017-03-28 Thread Hamza HACHANI
Hi all, I have a binary file composed of messages whose length is 57 bytes. The binary file contains exactly 10 messages and its size is about 44 MB (I've already verified that). What I simply do is read the file via JavaStreamingContext.binaryRecordsStream("folder", 57), so I
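For comparison, the same call in a minimal Scala sketch (the Java StreamingContext wraps the same method); the folder name is the placeholder from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BinaryRecords {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BinaryRecords").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each record is a fixed 57-byte message, as in the thread.
    // Note: like the other file-based stream sources, this only picks up files
    // that are moved into the folder after the streaming context has started.
    val records = ssc.binaryRecordsStream("folder", recordLength = 57)

    records.foreachRDD { rdd =>
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```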

Writing dataframe to a final path using another temporary path

2017-03-28 Thread yohann jardin
Hello, I’m using Spark 2.1. Once a job completes, I want to write a Parquet file to, let’s say, the folder /user/my_user/final_path/. However, I have other jobs reading files in that specific folder, so I need those files to be completely written when they are in that folder. So while the
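One common pattern for this, sketched below with hypothetical paths: write to a staging directory first, then rename it into the watched folder. A directory rename is atomic on HDFS (though not on S3-style object stores), so readers never see a half-written dataset.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object WriteThenPublish {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WriteThenPublish").getOrCreate()
    val df = spark.table("db.some_table") // placeholder for the job's result

    val tmpPath = "/user/my_user/tmp_path/run_001"     // hypothetical staging location
    val finalPath = "/user/my_user/final_path/run_001" // readers only look here

    // 1. Write the Parquet output completely to the staging folder first.
    df.write.parquet(tmpPath)

    // 2. Then publish with a single rename.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.rename(new Path(tmpPath), new Path(finalPath))

    spark.stop()
  }
}
```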

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Dinko Srkoč
Adding to the advice given by others ... Spark 2.1.0 works with Scala 2.11, so set: scalaVersion := "2.11.8". When you see something like: "org.apache.spark" % "spark-core_2.10" % "1.5.2", that means the library `spark-core` is compiled against Scala 2.10, so you would have to change that to
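Putting this together with the earlier advice about cross-compiling and "provided" scope, a build.sbt sketch for the upgrade; the project name comes from the thread, and the module list and non-Spark versions are illustrative only.

```scala
// build.sbt sketch: Scala 2.11 + Spark 2.1.0, cross-compiled with %% and "provided" scope
name := "proxcocoa"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"   % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided",
  "org.scalatest"    %% "scalatest"   % "3.0.1" % "test"
)
```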

Utilities for Twitter Analysis?

2017-03-28 Thread Gaurav1809
Hello all, I want to know the utilities (and complete pipeline) that I can use for twitter Analysis in Spark? Also I want to know if Kafka is needed Or Spark streaming will be able to do work? Thanks and regards, Gaurav Pandya -- View this message in context: