Re: Streaming 2.1.0 - window vs. batch duration

2017-03-17 Thread Cody Koeninger
Probably easier if you show some more code, but if you just call dstream.window(Seconds(60)) you haven't specified a slide duration, so it's going to default to your batch duration of 1 second. So yeah, if you're just using e.g. foreachRDD to output every message in the window, every second it's
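A minimal sketch of passing the slide duration explicitly, so the window only emits every 60 seconds rather than every batch (the socket source and app name are just placeholders, not from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-example")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval, as in the thread
val lines = ssc.socketTextStream("localhost", 9999)

// window(windowDuration, slideDuration): without the second argument the
// slide duration defaults to the batch duration (1 second here)
val windowed = lines.window(Seconds(60), Seconds(60))
windowed.foreachRDD(rdd => println(s"messages in the last 60s: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
```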

Getting 2.0.2 for the link http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz

2017-03-17 Thread George Obama
Hello, I downloaded spark-2.1.0-bin-hadoop2.7.tgz from http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz and got Spark 2.0.2: verified for Scala, Python and R. The link is from the download page

HyperLogLogMonoid for unique visitor count in Spark Streaming

2017-03-17 Thread SRK
Hi, We have a requirement to calculate unique visitors in Spark Streaming. Can HyperLogLogMonoid be applied to a sliding window in Spark Streaming to calculate unique visitors? Any example on how to do that would be of great help. Thanks, Swetha
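Not from the thread, but a rough sketch of how Algebird's HyperLogLogMonoid is typically combined with a sliding window (the 12-bit precision, window sizes and input DStream are assumptions; method names vary slightly across Algebird versions):

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// visitorIds: a DStream[String] of visitor identifiers parsed from the event stream
def approxUniqueVisitors(visitorIds: DStream[String]): DStream[HLL] = {
  val hll = new HyperLogLogMonoid(bits = 12)        // ~1.6% standard error
  visitorIds
    .map(id => hll.create(id.getBytes("UTF-8")))    // one small sketch per id
    .reduceByWindow(hll.plus(_, _), Seconds(300), Seconds(60)) // merge sketches over a 5-minute window sliding every minute
}

// usage: approxUniqueVisitors(ids).foreachRDD(_.foreach(h => println(h.estimatedSize.toLong)))
```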

How to redistribute dataset without full shuffle

2017-03-17 Thread Artur R
Hi! I use Spark heavily for various workloads and often end up in the situation where there is some skewed dataset (without any partitioner assigned) and I just want to "redistribute" its data more evenly. For example, say there is an RDD of X partitions with Y rows each, except one large partition
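For reference, the two standard options look roughly like this; coalesce avoids the shuffle but can only merge partitions, so it cannot split the one oversized partition (names below are illustrative):

```scala
import org.apache.spark.rdd.RDD

// No shuffle: existing partitions are merged, but the skewed partition stays skewed
def mergeOnly[T](rdd: RDD[T], n: Int): RDD[T] = rdd.coalesce(n, shuffle = false)

// Full shuffle: rows are redistributed evenly across n partitions
def fullRebalance[T](rdd: RDD[T], n: Int): RDD[T] = rdd.repartition(n)
```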

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Another option that would avoid a shuffle would be to use assign and coalesce, running two separate streams. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("assign", """{t0: {"0": }, t1:{"0": x}}""") .load() .coalesce(1) .writeStream
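The preview above is cut off; per the Kafka source options, "assign" takes a JSON map of topic to partition list, so the two pinned streams would look roughly like this (bootstrap servers, topic names and partition numbers are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-assign").getOrCreate()

// First stream pinned to partition 0 of topic t0, squeezed into a single task
val s0 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("assign", """{"t0":[0]}""")   // format: {"topic":[partition, ...]}
  .load()
  .coalesce(1)

// Second, independent stream for topic t1
val s1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("assign", """{"t1":[0]}""")
  .load()
  .coalesce(1)
```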

Spark master IP on Kubernetes

2017-03-17 Thread ffarozan
We are deploying Spark on k8s cluster. We are facing one issue with respect to Spark master IP from a worker perspective. The Spark master is exposed as a service @ 10.3.0.175:7077. Spark worker registers with the master, but saves the pod IP, instead of the service IP. Following are related

Spark Streaming from Kafka, deal with initial heavy load.

2017-03-17 Thread sagarcasual .
Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct approach. The streaming part works fine, but when we initially start the job we have to deal with a really huge Kafka message backlog, millions of messages, and that first batch runs for over 40 hours, and after 12 hours or so
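A common way to tame that first batch with the 1.6 direct stream is to cap the per-partition ingest rate and enable backpressure; a sketch, with placeholder numbers:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-direct-backlog")
  // upper bound on records read per Kafka partition per second for each batch
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // let Spark adapt the ingestion rate to the observed processing rate
  .set("spark.streaming.backpressure.enabled", "true")
```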

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Ravindra
Thanks a lot Yong for the explanation. But it sounds like an API behaviour change. For now I do the count != 0 check on both dataframes before these operations. Not good from a performance point of view, hence I have created a JIRA (SPARK-20008) to track it. Thanks, Ravindra. On Fri, Mar 17, 2017 at 8:51 PM

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-17 Thread darin
I added this code in the foreachRDD block. ``` rdd.persist(StorageLevel.MEMORY_AND_DISK) ``` The exception does not occur again, but many dead executors show up in the Spark Streaming UI. ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Sorry, typo. Should be a repartition not a groupBy. > spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", "...") > .option("subscribe", "t0,t1") > .load() > .repartition($"partition") > .writeStream > .foreach(... code to write to cassandra ...) >

Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Ravindra
Can someone please explain why println("Empty count " + hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()) *prints* Empty count 1? This was not the case in Spark 1.5.2... I am upgrading to Spark 2.0.2 and found this. It causes my tests to fail. Is there another way to

org.apache.spark.ui.jobs.UIData$TaskMetricsUIData

2017-03-17 Thread Jiří Syrový
Hi, is there a good way to get rid of UIData completely? I have switched off the UI and decreased the retainedXXX settings to a minimum, but there still seem to be a lot of instances of this class ($SUBJ) held in memory. Any ideas? Thanks, J. S. spark { master = "local[2]" master = ${?SPARK_MASTER} info =
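For reference, the settings being tuned look roughly like this when set programmatically (the values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster(sys.env.getOrElse("SPARK_MASTER", "local[2]"))
  .set("spark.ui.enabled", "false")       // switch the UI off entirely
  .set("spark.ui.retainedJobs", "10")     // retained job entries
  .set("spark.ui.retainedStages", "10")   // retained stage entries
  .set("spark.ui.retainedTasks", "100")   // retained task metrics (available from Spark 2.0.1)
```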

Re: Dataset : Issue with Save

2017-03-17 Thread Yong Zhang
Looks like the current fix reduces the accumulator data being sent to the driver, but there is still a lot more statistics data being sent to the driver. It is arguable how much data is reasonable for 3.7k tasks. You can attach your heap dump file in that JIRA and follow it. Yong

Re: RDD can not convert to df, thanks

2017-03-17 Thread Yong Zhang
You also need to import the sqlContext implicits: import sqlContext.implicits._ Yong From: 萝卜丝炒饭 <1427357...@qq.com> Sent: Friday, March 17, 2017 1:52 AM To: user-return-68576-1427357147=qq.com; user Subject: Re: RDD can not convert to df, thanks More info, I
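A minimal sketch of the pattern Yong is pointing at, assuming an existing SparkContext named sc:

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._             // brings the rdd.toDF() conversions into scope

val df = sc.parallelize(Seq(Person("a", 1), Person("b", 2))).toDF()
df.show()
```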

UI Metrics data memory consumption

2017-03-17 Thread xjrk
Hi, is there a good way to get rid of UIData completely? I have switched off the UI and decreased the retainedXXX settings to a minimum, but there still seem to be a lot of instances of this class (org.apache.spark.ui.jobs.UIData$TaskMetricsUIData) held in memory. Any ideas? Thanks, J. S. spark { master =

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread OUASSAIDI, Sami
@Cody : Duly noted. @Michael Armbrust : A repartition is out of the question for our project as it would be a fairly expensive operation. We tried looking into targeting a specific executor so as to avoid this extra cost and directly have well-partitioned data after consuming the kafka topics. Also

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-17 Thread Yanbo Liang
Hi Adrian, Did you try SQLTransformer? Your preprocessing steps are SQL operations and can be handled by SQLTransformer in MLlib pipeline scope. Thanks Yanbo On Thu, Mar 9, 2017 at 11:02 AM, aATv wrote: > I want to start using PySpark Mllib pipelines, but I don't understand
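A small sketch of a SQLTransformer stage inside an ML Pipeline (the column names and SQL are made up):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}

// __THIS__ is substituted with the input DataFrame at transform time
val preprocess = new SQLTransformer()
  .setStatement("SELECT *, LOWER(country) AS country_norm, clicks / impressions AS ctr FROM __THIS__")

val assemble = new VectorAssembler()
  .setInputCols(Array("ctr"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(preprocess, assemble))
// val model = pipeline.fit(trainingDf)   // trainingDf assumed to contain country, clicks, impressions
```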

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Yong Zhang
Starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using RDD operations directly. The same issue also shows up on sqlContext. scala> spark.version res25: String = 2.0.2 spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) ==
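For context, a minimal reproduction of the reported behaviour plus a guard along the lines Ravindra describes, assuming a SparkSession named spark as in the shell snippet above (hedged; exact plans differ by version):

```scala
import org.apache.spark.sql.DataFrame

// Reported in the thread: returns 1 in Spark 2.0.2, 0 in 1.5.2
spark.emptyDataFrame.except(spark.emptyDataFrame).count()

// Skip except() when either side is empty, avoiding the spurious row
def safeExcept(a: DataFrame, b: DataFrame): DataFrame =
  if (a.head(1).isEmpty) a        // empty minus anything is empty
  else if (b.head(1).isEmpty) a   // anything minus empty is itself
  else a.except(b)
```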