Re: use case reading files split per id

2016-11-14 Thread ayan guha
How about the following approach: get the list of IDs, get one RDD for each of them using wholeTextFiles, map and flatMap to generate a pair RDD with ID as key and list as value, union all the RDDs together, then group by key. On 15 Nov 2016 16:43, "Mo Tao" wrote: > Hi ruben, > > You may
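
A rough spark-shell sketch of the suggested approach (the id list, base path, and line splitting below are hypothetical placeholders):

    // Hypothetical inputs: the list of ids and a base directory with one folder of files per id.
    val ids = Seq("id1", "id2", "id3")
    val basePath = "hdfs:///data/split-by-id"

    // One RDD per id via wholeTextFiles, flatMapped into (id, value) pairs.
    val perId = ids.map { id =>
      sc.wholeTextFiles(s"$basePath/$id/*").flatMap { case (_, content) =>
        content.split("\n").map(line => (id, line))
      }
    }

    // Union all the per-id RDDs and group by key.
    val grouped = sc.union(perId).groupByKey()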

RE: How to read a Multi Line json object via Spark

2016-11-14 Thread Kappaganthu, Sivaram (ES)
Hello, Please find attached the old mail on this subject. Thanks, Sivaram From: Sree Eedupuganti [mailto:s...@inndata.in] Sent: Tuesday, November 15, 2016 12:51 PM To: user Subject: How to read a Multi Line json object via Spark I tried from Spark-Shell and i am getting the following error:

How to read a Multi Line json object via Spark

2016-11-14 Thread Sree Eedupuganti
I tried from spark-shell and I am getting the following error: Here is the test.json file: { "colorsArray": [{ "red": "#f00", "green": "#0f0", "blue": "#00f", "cyan": "#0ff", "magenta": "#f0f", "yellow": "#ff0", "black": "#000" }]}
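
Spark's JSON reader expects one JSON object per line, so a common workaround for multi-line objects is to read each file whole and feed the strings to read.json; a minimal spark-shell sketch, with a hypothetical path:

    // Each file is read as a single (path, content) pair, so multi-line objects stay intact.
    val jsonRDD = sc.wholeTextFiles("/path/to/test.json").map { case (_, content) => content }
    val df = spark.read.json(jsonRDD)
    df.printSchema()
    df.show(false)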

HiveContext.getOrCreate not accessible

2016-11-14 Thread Praseetha
Hi All, I have a streaming app, and when I try invoking HiveContext.getOrCreate it fails with the following error: 'object HiveContext in package hive cannot be accessed in package org.apache.spark.sql.hive'. I would require HiveContext instead of SQLContext for my application and
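
In Spark 2.x the usual replacement is a SparkSession with Hive support rather than constructing HiveContext directly; a minimal sketch (the app and table names are hypothetical):

    import org.apache.spark.sql.SparkSession

    // getOrCreate returns the existing session if one is already running,
    // which also works from inside a streaming foreachRDD block.
    val spark = SparkSession.builder()
      .appName("StreamingWithHive")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT COUNT(*) FROM some_hive_table").show()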

mapWithState job slows down & exceeds yarn's memory limits

2016-11-14 Thread Daniel Haviv
Hi, I have a fairly simple stateful streaming job that suffers from high GC, and its executors are killed as they exceed the size of the requested container. My current executor-memory is 10G, the Spark overhead is 2G, and it's running with one core. At first the job begins running at a rate

Re: use case reading files split per id

2016-11-14 Thread Mo Tao
Hi ruben, You may try sc.binaryFiles, which is designed for lots of small files and can map paths into input streams. Each input stream keeps only the path and some configuration, so it would be cheap to shuffle them. However, I'm not sure whether Spark takes data locality into account
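
A small sketch of sc.binaryFiles, with a hypothetical input path:

    // binaryFiles returns (path, PortableDataStream) pairs; the bytes are only read
    // on the executor that processes the record.
    val files = sc.binaryFiles("hdfs:///data/small-files/*")
    val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }
    sizes.take(5).foreach(println)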

Re: Cannot find Native Library in "cluster" deploy-mode

2016-11-14 Thread Mo Tao
Hi jtgenesis, UnsatisfiedLinkError can be caused by a missing library that your .so files require, so have a look at the exception message. You can also try setExecutorEnv("LD_LIBRARY_PATH", ".:" + sys.env("LD_LIBRARY_PATH")) and then submit your job with your .so files using the
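
A sketch of that suggestion, assuming the .so files are shipped so they land in each executor's working directory (the app name and library file name are hypothetical):

    import org.apache.spark.SparkConf

    // Prepend the executor working directory to LD_LIBRARY_PATH.
    val conf = new SparkConf()
      .setAppName("NativeLibJob")
      .setExecutorEnv("LD_LIBRARY_PATH", ".:" + sys.env.getOrElse("LD_LIBRARY_PATH", ""))

    // Ship the library with the job, e.g.:
    //   spark-submit --deploy-mode cluster --files /local/path/libmylib.so ... your-app.jar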

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2! Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with Kafka 0.10 support and runtime metrics for Structured Streaming. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all

Re: AVRO File size when caching in-memory

2016-11-14 Thread Prithish
I am using 2.0.1 and databricks avro library 3.0.1. I am running this on the latest AWS EMR release. On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > spark version? Are you using tungsten? > > > On 14 Nov 2016, at 10:05, Prithish wrote: > > > >

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
It seems it is not a good design to frequently restart workers within a minute because their initialization and shutdown take much time, as you said (e.g., interconnection overheads with dynamodb and graceful shutdown). Anyway, since this is a kind of question about the aws kinesis library, you'd

Spark SQL UDF - passing map as a UDF parameter

2016-11-14 Thread Nirav Patel
I am trying to use the following API from functions to convert a map into a column so I can pass it to a UDF: map(cols: Column*): Column
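
A minimal spark-shell sketch of combining functions.map with a UDF that takes a Map argument (the column names and values are made up):

    import org.apache.spark.sql.functions._

    // UDF that receives the map column as a Scala Map.
    val lookup = udf((m: Map[String, Int], key: String) => m.getOrElse(key, -1))

    val df = Seq("a", "b", "z").toDF("key")
    val withMap = df.withColumn("m", map(lit("a"), lit(1), lit("b"), lit(2)))
    withMap.select($"key", lookup($"m", $"key").as("value")).show()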

Re: sbt shenanigans for a Spark-based project

2016-11-14 Thread Don Drake
I would remove your entire local Maven repo (~/.m2/repo in linux) and try again. I'm able to compile sample code with your build.sbt and sbt v.0.13.12. -Don On Mon, Nov 14, 2016 at 3:11 PM, Marco Mistroni wrote: > uhm.sorry.. still same issues. this is hte new version

Cannot find Native Library in "cluster" deploy-mode

2016-11-14 Thread jtgenesis
hey guys, I'm hoping someone could provide some assistance. I'm having issues (UnsatisfiedLinkError) calling some native libraries when submitting the application in "cluster" mode. When running the application in local mode, the application runs fine. Here's what my setup looks like. The .so

Re: sbt shenanigans for a Spark-based project

2016-11-14 Thread Marco Mistroni
Uhm, sorry... still the same issues. This is the new version: name := "SparkExamples" version := "1.0" scalaVersion := "2.11.8" val sparkVersion = "2.0.1" // Add a single dependency libraryDependencies += "junit" % "junit" % "4.8" % "test" libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" %
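
For reference, a trimmed build.sbt along these lines that should compile against Spark 2.0.1 (the "provided" scoping and the selection of modules are assumptions):

    name := "SparkExamples"
    version := "1.0"
    scalaVersion := "2.11.8"

    val sparkVersion = "2.0.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
      "junit"             % "junit"      % "4.8"        % "test"
    )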

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
Typically you pass in the result of a model transform to the evaluator. So: val model = estimator.fit(data) val auc = evaluator.evaluate(model.transform(testData)) Check the Scala API docs for some details:

Re: Grouping Set

2016-11-14 Thread ayan guha
And, run the same SQL in hive and post any difference. On 15 Nov 2016 07:48, "ayan guha" wrote: > It should be A,yes. Can you please reproduce this with small data and > exact SQL? > On 15 Nov 2016 02:21, "Andrés Ivaldi" wrote: > >> Hello, I'm tryin to

Re: Grouping Set

2016-11-14 Thread ayan guha
It should be A,yes. Can you please reproduce this with small data and exact SQL? On 15 Nov 2016 02:21, "Andrés Ivaldi" wrote: > Hello, I'm tryin to use Grouping Set, but I dont know if it is a bug or > the correct behavior. > > Givven the above example > Select a,b,sum(c)

Re: Newbie question - Best way to bootstrap with Spark

2016-11-14 Thread Jon Gregg
Piggybacking off this - how are you guys teaching DataFrames and Datasets to new users? I haven't taken the edx courses but I don't see Spark SQL covered heavily in the syllabus. I've dug through the Databricks documentation but it's a lot of information for a new user I think - hoping there is

Re: Spark Streaming: question on sticky session across batches ?

2016-11-14 Thread Manish Malhotra
Sending again. Any help is appreciated! Thanks in advance. On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra < manish.malhotra.w...@gmail.com> wrote: > Hello Spark Devs/Users, > > Im trying to solve the use case with Spark Streaming 1.6.2 where for every > batch ( say 2 mins) data needs to go

Pasting oddity with Spark 2.0 (scala)

2016-11-14 Thread jggg777
This one has stumped the group here, hoping to get some insight into why this error is happening. I'm going through the Databricks DataFrames scala docs

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
Can you please suggest how I can use BinaryClassificationEvaluator? I tried: scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator scala> val evaluator = new BinaryClassificationEvaluator() evaluator:
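
A minimal sketch of configuring and running the evaluator (the column names and the predictions DataFrame are assumptions about your pipeline):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")

    val predictions = model.transform(testData)   // model/testData from your own pipeline
    val auc = evaluator.evaluate(predictions)
    println(s"AUC = $auc")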

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator. On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma
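
A sketch of the RDD route, if you stick with BinaryClassificationMetrics (the score and label column names are assumptions about the testResults schema):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // DataFrame.rdd yields RDD[Row]; pull the two doubles out explicitly.
    val scoreAndLabels = testResults
      .select("score", "label")
      .rdd
      .map(row => (row.getDouble(0), row.getDouble(1)))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(metrics.areaUnderROC())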

Re: Two questions about running spark on mesos

2016-11-14 Thread Michael Gummelt
1. I had never even heard of conf/slaves until this email, and I only see it referenced in the docs next to Spark Standalone, so I doubt that works. 2. Yes. See the --kill option in spark-submit. Also, we're considering dropping the Spark dispatcher in DC/OS in favor of Metronome, which will be

scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
I am getting scala.MatchError in the code below. I'm not able to see why this would be happening. I am using Spark 2.0.1 scala> testResults.columns res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score,

Spark streaming data loss due to timeout in writing BlockAdditionEvent to WAL by the driver

2016-11-14 Thread Arijit
Hi, We are seeing another case of data loss/drop when the following exception happens. This particular exception, treated as a WARN, resulted in dropping 2095 events from processing. 16/10/26 19:24:08 WARN ReceivedBlockTracker: Exception thrown while writing record:

Re: Convert SparseVector column to Densevector column

2016-11-14 Thread janardhan shetty
This worked thanks Maropu. On Sun, Nov 13, 2016 at 9:34 PM, Takeshi Yamamuro wrote: > Hi, > > How about this? > > import org.apache.spark.ml.linalg._ > val toSV = udf((v: Vector) => v.toDense) > val df = Seq((0.1, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3))), > (0.2,
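
The complete snippet, filled out as a spark-shell sketch (the second row of sample data is made up to keep it runnable):

    import org.apache.spark.ml.linalg._
    import org.apache.spark.sql.functions.udf

    val toDense = udf((v: Vector) => v.toDense)

    val df = Seq(
      (0.1, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3))),
      (0.2, Vectors.sparse(16, Array(1, 5), Array(0.2, 0.5)))
    ).toDF("label", "features")

    df.withColumn("features_dense", toDense($"features")).show(false)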

Re: took more time to get data from spark dataset to driver program

2016-11-14 Thread Rishikesh Teke
Again, if you run the Spark cluster in standalone mode with an optimal number of executors and a balanced cores/memory configuration, it will run faster as more parallel operations take place. -- View this message in context:

Re: Newbie question - Best way to bootstrap with Spark

2016-11-14 Thread Rishikesh Teke
Integrate Spark with Apache Zeppelin https://zeppelin.apache.org/; it's again a very handy way to bootstrap with Spark. -- View this message in context:

Re: Nearest neighbour search

2016-11-14 Thread Nick Pentreath
LSH-based NN search and similarity join should be out in Spark 2.1 - there's a little work being done still to clear up the APIs and some functionality. Check out https://issues.apache.org/jira/browse/SPARK-5992 On Mon, 14 Nov 2016 at 16:12, Kevin Mellott wrote: >

Grouping Set

2016-11-14 Thread Andrés Ivaldi
Hello, I'm trying to use Grouping Set, but I don't know if it is a bug or the correct behavior. Given the example below: Select a,b,sum(c) from table group by a,b grouping set ( (a), (a,b) ) What should be the expected result? A:
A  | B    | sum(c)
xx | null |
xx | yy   |
xx | zz   |
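
For comparison, a self-contained spark-shell reproduction with made-up data (Spark SQL spells the clause GROUPING SETS):

    val t = Seq(("xx", "yy", 1), ("xx", "zz", 2)).toDF("a", "b", "c")
    t.createOrReplaceTempView("t")

    spark.sql("""
      SELECT a, b, SUM(c)
      FROM t
      GROUP BY a, b GROUPING SETS ((a), (a, b))
    """).show()
    // Option A corresponds to one row per (a, b) pair plus one row per a with b = null.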

Re: Nearest neighbour search

2016-11-14 Thread Kevin Mellott
You may be able to benefit from Soundcloud's open source implementation, either as a solution or as a reference implementation. https://github.com/soundcloud/cosine-lsh-join-spark Thanks, Kevin On Sun, Nov 13, 2016 at 2:07 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > That was

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
1. No, I want to implement a low-level consumer on the Kinesis stream, so I need to stop the worker once it has read the latest sequence number sent by the driver. 2. What is the cost of frequent register and deregister of a worker node? Is it that when the worker's shutdown is called it will terminate the run method but

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not enough for your usecase? On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora wrote: > Thanks! > Is there a way to get the latest sequence number of all shards of a > kinesis stream? > > > > On Mon, Nov

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
Thanks! Is there a way to get the latest sequence number of all shards of a kinesis stream? On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro wrote: > Hi, > > The time interval can be controlled by `IdleTimeBetweenReadsInMillis` in > KinesisClientLibConfiguration >

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Hi, The time interval can be controlled by `IdleTimeBetweenReadsInMillis` in KinesisClientLibConfiguration, though it is not configurable in the current implementation. The details can be found in;

Spark hash function

2016-11-14 Thread Rohit Verma
Hi All, One of the miscellaneous functions in Spark SQL is the hash expression [Murmur3Hash] ("hash"). I was wondering which variant of murmurhash3 it is: murmurhash3_x64_128 or murmurhash3_x86_32 (the latter is also part of the spark unsafe package). Also, what is the seed for the hash function? I am
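
The DataFrame-level entry point is org.apache.spark.sql.functions.hash, backed by the Catalyst Murmur3Hash expression; as far as I recall it is the 32-bit x86 variant with a fixed seed of 42, but that is worth confirming in the source. A small spark-shell sketch:

    import org.apache.spark.sql.functions.hash

    val df = Seq("spark", "hash").toDF("word")
    df.select($"word", hash($"word").as("h")).show()

    // Equivalent SQL form:
    spark.sql("SELECT hash('spark')").show()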

Two questions about running spark on mesos

2016-11-14 Thread Yu Wei
Hi Guys, Two questions about running Spark on Mesos. 1. Does the Spark configuration conf/slaves still work when running Spark on Mesos? According to my observations, it seemed that conf/slaves still took effect when running spark-shell. However, it doesn't take effect when deploying

SparkSQL: intra-SparkSQL-application table registration

2016-11-14 Thread Mohamed Nadjib Mami
Hello, I've asked the following question [1] on Stackoverflow but didn't get an answer, yet. I use now this channel to give it more visibility, and hopefully find someone who can help. "*Context.* I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an

Re: AVRO File size when caching in-memory

2016-11-14 Thread Jörn Franke
spark version? Are you using tungsten? > On 14 Nov 2016, at 10:05, Prithish wrote: > > Can someone please explain why this happens? > > When I read a 600kb AVRO file and cache this in memory (using cacheTable), it > shows up as 11mb (storage tab in Spark UI). I have tried

AVRO File size when caching in-memory

2016-11-14 Thread Prithish
Can someone please explain why this happens? When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using
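
A sketch of the settings that govern the in-memory columnar cache, in case they help narrow this down (the table name is hypothetical and the values shown are the usual defaults):

    // Columnar cache compression and batch size for cached tables.
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    spark.catalog.cacheTable("avro_table")
    spark.table("avro_table").count()   // force materialization, then check the Storage tab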