Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi, Use subscribepattern You haven't googled well enough --> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.
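A minimal sketch of the suggestion, assuming Spark 2.2 with a running SparkSession `spark` and a local Kafka broker (both assumptions, not from the thread):

```scala
// Read from several topics at once: either list them explicitly with
// "subscribe", or match them by regex with "subscribePattern".
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribePattern", "topic-.*")              // or: .option("subscribe", "topic1,topic2")
  .load()
```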

Uses of avg hash probe metric in HashAggregateExec?

2017-09-19 Thread Jacek Laskowski
itbook or any other place you point to :) Thanks! [1] https://github.com/apache/spark/commit/18066f2e61f430b691ed8a777c9b4e5786bf9dbc Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Maste

Re: Is watermark always set using processing time or event time or both?

2017-09-04 Thread Jacek Laskowski
Hi, https://stackoverflow.com/q/46032001/1305344 :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https

Re: Is watermark always set using processing time or event time or both?

2017-09-04 Thread Jacek Laskowski
ps://youtu.be/JAb4FIheP28 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Sep
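For reference, a watermark is always declared on an event-time column, never on processing time — a hedged sketch (the `events` Dataset and its `eventTime` column are assumptions):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// withWatermark takes an event-time column name and a delay threshold;
// processing time plays no part in the declaration itself.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
```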

[SS] How to know what events were late in a streaming batch?

2017-09-03 Thread Jacek Laskowski
offer! [1] https://stackoverflow.com/q/46022876/1305344 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://t

[SS] StateStoreSaveExec in Complete output mode and metrics in stateOperators

2017-08-30 Thread Jacek Laskowski
d appreciate any help. Thanks! [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L249 [2] https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-StateStoreSaveExec.html [3] https

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-22 Thread Jacek Laskowski
and would appreciate some more help. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Aug 19, 2017 at 12:10 AM, Burak Yavuz wrote: > Hi Jacek, > >

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
sink accepting the flag as enabled which would make memory sink the only one left with the flag enabled for Complete output. And I thought I've been close to understand Structured Streaming :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 http

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
-memory-hungry memory sink require yet another thing to get the query working. On to exploring the bits... Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 18

[SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
rk.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:249) ... 57 elided Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/master

Re: Structured Streaming + Kafka Integration unable to read new messages after sometimes

2017-08-11 Thread Jacek Laskowski
Hi, Any logs you could share? Anything about the query itself? Watermarked? Aggregation? How long does it work fine? Is this somehow stable in its instability? What version of Spark and Kafka? Pozdrawiam, Jacek Laskowski http://blog.japila.pl On 11 Aug 2017 11:29, "NikhilP" wrote:


Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-08 Thread Jacek Laskowski
Hi Michael, That reflects my sentiments so well. Thanks for having confirmed my thoughts! https://issues.apache.org/jira/browse/SPARK-21667 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https

[SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Jacek Laskowski
81de598ed657de7R277. Why is this needed? I can't think of a use case where console sink could not recover from checkpoint location (since all the information is available). I'm lost on it and would appreciate some help (to recover :)) Pozdrawiam, Jacek Laskowski https://medium.com/

Re: Solutions.Hamburg conference

2017-07-18 Thread Jacek Laskowski
Hi Myrle, You're welcome. Pleasure's all mine. Could you please change Spark Streaming (technically a dead end) with the modern Structured Streaming. That's what I'd be shooting at. Thanks. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering

Re: Solutions.Hamburg conference

2017-07-18 Thread Jacek Laskowski
g publicly to invite others to have their chance. I could co-present if that's your first talk. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Jul 18, 2

Re: Is GraphX really deprecated?

2017-05-13 Thread Jacek Laskowski
Hi, I'd like to hear the official statement too. My take on GraphX and Spark Streaming is that they are long dead projects with GraphFrames and Structured Streaming taking their place, respectively. Jacek On 13 May 2017 3:00 p.m., "Sergey Zhemzhitsky" wrote: > Hello Spark users, > > I just wo

Re: Spark books

2017-05-05 Thread Jacek Laskowski
Thanks Stephen! I appreciate it very much. And yeah...Stephen is right on this. Go and read the notes and let me know where you're missing things :-) p.s. Holden has just announced that her book is complete and I think Matei is also quite far along with his writing. Jacek On 4 May 2017 2:52 a.m., "Step

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Jacek Laskowski
sion point). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Apr 27, 2017 at 3:22 PM, Lavelle, Shawn wrote: > Hi Jacek, > > > > I know that

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, Good progress! Can you remove metastore_db directory and start ./bin/pyspark over? I don't think starting from ~ is necessary. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitte

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-26 Thread Jacek Laskowski
explain it and you'll know what happens under the covers. i.e. Use explain on the Dataset. Jacek On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" wrote: > Hello Spark Users! > >Does the Spark Optimization engine reduce overlapping column ranges? > If so, should it push this down to a Data Sourc
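The advice above in code form — a sketch assuming any Dataset `q`:

```scala
q.explain()                // physical plan only
q.explain(extended = true) // parsed, analyzed, optimized and physical plans
```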

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, You've got two Spark sessions up and running (and since Spark SQL uses a Derby-managed Hive Metastore, hence the issue). Please don't start spark-submit from inside bin. Rather bin/spark-submit... Jacek On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" wrote: I’m having issues when I fire up pyspar

Re: Returning DataFrame for text file

2017-04-07 Thread Jacek Laskowski
Hi, What's the alternative? Dataset? You've got textFile then. It's an older API from the ages when Dataset was merely experimental. Jacek On 29 Mar 2017 8:58 p.m., "George Obama" wrote: > Hi, > > I saw that the API, either R or Scala, we are returning DataFrame for > sparkSession.read.text()

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jacek Laskowski
Hi, If your Spark app uses snappy in the code, define an appropriate library dependency to have it on classpath. Don't rely on transitive dependencies. Jacek On 7 Apr 2017 8:34 a.m., "satishl" wrote: Hi, I am planning to process spark app eventlogs with another spark app. These event logs are

Re: how do i force unit test to do whole stage codegen

2017-04-05 Thread Jacek Laskowski
Thanks Koert for the kind words. That part however is easy to fix and was surprised to have seen the old style referenced (!) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com

Re: how do i force unit test to do whole stage codegen

2017-04-05 Thread Jacek Laskowski
Hi, I'm very sorry for not being up to date with the current style (and "promoting" the old style) and am going to review that part soon. I'm very close to touch it again since I'm with Optimizer these days. Jacek On 5 Apr 2017 6:08 a.m., "Kazuaki Ishizaki" wrote: > Hi, > The page in the URL e

Re: Does Apache Spark use any Dependency Injection framework?

2017-04-03 Thread Jacek Laskowski
Hi, Answering your question from the title (that seems different from what's in the email) and leaving the other part of how to do it using a DI framework to others. Spark does not use any DI framework internally and wires components itself. Jacek On 2 Apr 2017 3:29 p.m., "kant kodali" wrote:

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Jacek Laskowski
Hi Hyukjin, It was a false alarm as I had a local change to `def schema` in `Dataset` that caused the issue. I apologize for the noise. Sorry and thanks a lot for the prompt response. I appreciate. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2

Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Jacek Laskowski
ng (nullable = false) p.s. http://stackoverflow.com/q/43041975/1305344 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski --

Re: Is there any limit on number of tasks per stage attempt?

2017-02-24 Thread Jacek Laskowski
Hi, I think it's the size of the type used to count partitions, which is Int. I don't think there's another reason. Jacek On 23 Feb 2017 5:01 a.m., "Parag Chaudhari" wrote: > Hi, > > Is there any limit on number of tasks per stage attempt? > > > *Thanks,* > > *​Parag​* >
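Since partition counts are plain Scala Ints, the ceiling can be stated directly:

```scala
// Partitions (and hence tasks per stage) are indexed by Int,
// so the hard upper bound is Int.MaxValue.
val theoreticalMax: Int = Int.MaxValue // 2147483647
```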

Re: Is there a list of missing optimizations for typed functions?

2017-02-24 Thread Jacek Laskowski
Hi Justin, I have never seen such a list. I think the area is in heavy development esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now Column

Re: RDD blocks on Spark Driver

2017-02-24 Thread Jacek Laskowski
Hi, Guess you're using local mode, which has only one executor, called driver. Is my guessing correct? Jacek On 23 Feb 2017 2:03 a.m., wrote: > Hello, > > Had a question. When I look at the executors tab in Spark UI, I notice > that some RDD blocks are assigned to the driver as well. Can someone p

Re: [SparkSQL] pre-check syntex before running spark job?

2017-02-21 Thread Jacek Laskowski
Hi, Never heard about such a tool before. You could use Antlr to parse SQLs (just as Spark SQL does while parsing queries). I think it's a one-hour project. Jacek On 21 Feb 2017 4:44 a.m., "Linyuxin" wrote: Hi All, Is there any tool/api to check the sql syntax without running spark job actuall

Re: Executor tab values in Spark Application UI

2017-02-18 Thread Jacek Laskowski
Hi, Yes, it's the "sum of values for all tasks" (it's based on TaskMetrics which are accumulators behind the scenes). Why "it appears that value isnt much of help while debugging?" ? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering A

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Jacek Laskowski
"Something like that" I've never tried it out myself so I'm only guessing having a brief look at the API. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jac

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-09 Thread Jacek Laskowski
Hi, Yes, that's ForeachWriter. Yes, it works with element by element. You're looking for mapPartition and ForeachWriter has partitionId that you could use to implement a similar thing. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http
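A skeleton of the ForeachWriter contract described above (Spark 2.x API; the element-writing bodies are left as assumptions):

```scala
import org.apache.spark.sql.ForeachWriter

val writer = new ForeachWriter[String] {
  // open is called once per partition per trigger; partitionId lets you
  // do mapPartitions-style setup (connections, buffers, ...).
  def open(partitionId: Long, version: Long): Boolean = true
  def process(value: String): Unit = { /* write a single element */ }
  def close(errorOrNull: Throwable): Unit = { /* release resources */ }
}
// streamingDF.writeStream.foreach(writer).start()
```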

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
better than window (there were more exchanges in play for windows I reckon). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Feb 7, 2017 at 10:54 PM, Everett A

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Jacek Laskowski
Hi, Could groupBy and withColumn or UDAF work perhaps? I think window could help here too. Jacek On 7 Feb 2017 8:02 p.m., "Everett Anderson" wrote: > Hi, > > I'm trying to un-explode or denormalize a table like > > +---++-+--++ > |id |name|extra|data |priority| > +---+

Re: submit a spark code on google cloud

2017-02-07 Thread Jacek Laskowski
Hi, I know nothing about Spark in GCP so answering this for a pure Spark. Can you use web UI and Executors tab or a SparkListener? Jacek On 7 Feb 2017 5:33 p.m., "Anahita Talebi" wrote: Hello Friends, I am trying to run a spark code on multiple machines. To this aim, I submit a spark code on

Re: How to get a spark sql statement implement duration ?

2017-02-07 Thread Jacek Laskowski
On 7 Feb 2017 4:17 a.m., "Mars Xu" wrote: Hello All, Some spark sqls will produce one or more jobs, I have 2 questions, 1, How the cc.sql(“sql statement”) divided into one or more jobs ? It's an implementation detail. You can have zero or more jobs for a single structured quer

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-07 Thread Jacek Laskowski
Hi, Have you considered foreach sink? Jacek On 6 Feb 2017 8:39 p.m., "Egor Pahomov" wrote: > Hi, I'm thinking of using Structured Streaming instead of old streaming, > but I need to be able to save results to Hive table. Documentation for file > sink says(http://spark.apache.org/docs/latest/st

Re: NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-07 Thread Jacek Laskowski
Hi, I may have seen this issue already... What's the cluster manager? How do you spark-submit? Jacek On 7 Feb 2017 7:44 p.m., "dgoldenberg" wrote: Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the same code when run from a Spar

Re: using an alternative slf4j implementation

2017-02-06 Thread Jacek Laskowski
to logback eventually). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Feb 6, 2017 at 9:06 AM, Mendelson, Assaf wrote: > Shading doesn’t help (we a

Re: using an alternative slf4j implementation

2017-02-05 Thread Jacek Laskowski
Hi, Shading conflicting dependencies? Jacek On 5 Feb 2017 3:56 p.m., "Mendelson, Assaf" wrote: > Hi, > > Spark seems to explicitly use log4j. > > This means that if I use an alternative backend for my application (e.g. > ch.qos.logback) I have a conflict. > > Sure I can exclude logback but tha

Re: High Availability/DR options for Spark applications

2017-02-05 Thread Jacek Laskowski
o resurrect it a few times. The other "components", i.e. map shuffle stages, partitions/tasks, are handled by Spark itself. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.co

Re: NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-04 Thread Jacek Laskowski
Hi, I'd say the error says it all : Caused by: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{XX.XXX.XXX.XX}{XX.XXX.XXX.XX:9300}]] Jacek On 3 Feb 2017 7:58 p.m., "Anastasios Zouzias" wrote: Hi there, Are you sure that the cluster nodes where the executo

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Jacek Laskowski
Hi, ➜ spark git:(master) ✗ ./bin/spark-submit whatever || echo $? Error: Cannot load main class from JAR file:/Users/jacek/dev/oss/spark/whatever Run with --help for usage help or --verbose for debug output 1 I see 1 and there are other cases for 1 too. Pozdrawiam, Jacek Laskowski https

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Jacek Laskowski
u see "There is an exception in the script exiting with status 1" printed out to stdout? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Feb 3,

Re: sqlContext vs spark.

2017-02-03 Thread Jacek Laskowski
Hi, Yes. Forget about SQLContext. It's been merged into SparkSession as of Spark 2.0 (same about HiveContext). Long live SparkSession! :-) Jacek On 3 Feb 2017 7:48 p.m., "☼ R Nair (रविशंकर नायर)" < ravishankar.n...@gmail.com> wrote: All, In Spark 1.6.0, we used val jdbcDF = sqlContext.read.

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Jacek Laskowski
Hi, I think you have to upgrade to 2.1.0. There were few changes wrt the ERROR since. Jacek On 29 Jan 2017 9:24 a.m., "Chetan Khatri" wrote: Hello Spark Users, I am getting error while saving Spark Dataframe to Hive Table: Hive 1.2.1 Spark 2.0.0 Local environment. Note: Job is getting execut

Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Jacek Laskowski
Hi, Wonder if you have any adblocker enabled in your browser? Is this the only version giving you this behavior? All Spark jobs have no visualization? Jacek On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" < rezaul.ka...@insight-centre.org> wrote: Hi All, I am running a Spark job on my local machi

Re: issue with running Spark streaming with spark-shell

2017-01-28 Thread Jacek Laskowski
Hi, How did you start spark-shell? Jacek On 28 Jan 2017 11:20 a.m., "Mich Talebzadeh" wrote: > > Hi, > > My spark-streaming application works fine when compiled with Maven with > uber jar file. > > With spark-shell this program throws an error as follows: > > scala> val dstream = KafkaUtils.cr

Re: How to reduce number of tasks and partitions in Spark job?

2017-01-26 Thread Jacek Laskowski
Repartition Jacek On 26 Jan 2017 6:13 p.m., "Md. Rezaul Karim" < rezaul.ka...@insight-centre.org> wrote: > Hi All, > > When I run a Spark job on my local machine (having 8 cores and 16GB of > RAM) on an input data of 6.5GB, it creates 193 parallel tasks and put > the output into 193 partitions.
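Spelled out, under the assumption the goal is 8 output partitions:

```scala
val fewer   = df.repartition(8) // full shuffle into exactly 8 partitions
val cheaper = df.coalesce(8)    // narrows existing partitions, no shuffle
```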

Re: Cached table details

2017-01-26 Thread Jacek Laskowski
Hi, I think that the only way to get the information about a cached RDD is to use SparkListener and intercept respective events about cached blocks on BlockManagers. Jacek On 25 Jan 2017 5:54 a.m., "kumar r" wrote: Hi, I have cached some table in Spark Thrift Server. I want to get all cached

Re: spark intermediate data fills up the disk

2017-01-26 Thread Jacek Laskowski
Hi, The files are for shuffle blocks. Where did you find the docs about them? Jacek On 25 Jan 2017 8:41 p.m., "kant kodali" wrote: oh sorry its actually in the documentation. I should just set spark.worker.cleanup.enabled = true On Wed, Jan 25, 2017 at 11:30 AM, kant kodali wrote: > I have

Re: can we plz open up encoder on dataset

2017-01-26 Thread Jacek Laskowski
Hi Koert, map will take the value that has an implicit Encoder to any value that may or may not have an encoder in scope. That's why I'm asking about the map function to see what it does. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark

Re: can we plz open up encoder on dataset

2017-01-26 Thread Jacek Laskowski
Hi, Can you show the code from map to reproduce the issue? You can create encoders using Encoders object (I'm using it all over the place for schema generation). Jacek On 25 Jan 2017 10:19 p.m., "Koert Kuipers" wrote: > i often run into problems like this: > > i need to write a Dataset[T] => D
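The Encoders object mentioned here can build an encoder (and hence a schema) without any implicits in scope — a small sketch:

```scala
import org.apache.spark.sql.Encoders

// Derive an encoder for a product type and inspect its schema.
val enc = Encoders.product[(String, Int)]
enc.schema.printTreeString()
```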

Re: where is mapWithState executed?

2017-01-26 Thread Jacek Laskowski
Hi, Shooting in the dark...it's executed on executors (it's old tech RDD-based so not many extra optimizations like in Spark SQL now). Can you show the code as I'm scared to hear that you're trying to broadcast inside a transformation which I'd believe is impossible. Jacek On 26 Jan 2017 12:18

Re: Spark Streaming proactive monitoring

2017-01-24 Thread Jacek Laskowski
Hi, My impression is to use StreamingListener to track metrics and react appropriately. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0

Re: best practice for paralleling model training

2017-01-24 Thread Jacek Laskowski
s/api/java/util/concurrent/ExecutorService.html Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Jan 24, 2017 at 10:48 PM, Shiyuan wrote: > Hi spark us

Re:

2017-01-21 Thread Jacek Laskowski
Executors are "dumb", i.e. they execute TaskRunners for tasks and...that's it. Your logic should be on the driver that can intercept events and...trigger cleanup. I don't think there's another way to do it. Pozdrawiam, Jacek Laskowski https://medium.com/@jacekla

Re: New runtime exception after switch to Spark 2.1.0

2017-01-20 Thread Jacek Laskowski
Thanks for sharing! A very interesting reading indeed. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jan 20, 2017 at 10:17 PM, Morten Hornbech wrote

Re: New runtime exception after switch to Spark 2.1.0

2017-01-20 Thread Jacek Laskowski
Hi, I'd be very interested in how you figured it out. Mind sharing? Jacek On 18 Jan 2017 9:51 p.m., "mhornbech" wrote: > For anyone revisiting this at a later point, the issue was that Spark 2.1.0 > upgrades netty to version 4.0.42 which is not binary compatible with > version > 4.0.37 used by

Re:

2017-01-20 Thread Jacek Laskowski
Hi, (redirecting to users as it has nothing to do with Spark project development) Monitor jobs and stages using SparkListener and submit cleanup jobs where a condition holds. Jacek On 20 Jan 2017 3:57 a.m., "Keith Chapman" wrote: > Hi , > > Is it possible for an executor (or slave) to know wh

Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Jacek Laskowski
? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Jan 15, 2017 at 11:48 PM, ayan guha wrote: > archive.apache.org will always have all the

Re: Spark and Kafka integration

2017-01-12 Thread Jacek Laskowski
Hi Phadnis, I found this in http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html: > This version of the integration is marked as experimental, so the API is > potentially subject to change. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mas

Re: Storage history in web UI

2017-01-08 Thread Jacek Laskowski
Hi, A possible workaround...Use SparkListener and save the results to a custom sink. After all web UI is a mere bag of SparkListeners + excellent visualizations. Jacek On 3 Jan 2017 4:14 p.m., "Joseph Naegele" wrote: Hi all, Is there any way to observe Storage history in Spark, i.e. which RD

Re: What's the best practice to load data from RDMS to Spark

2017-01-02 Thread Jacek Laskowski
FYI option works with boolean literals directly. Jacek On 30 Dec 2016 9:32 p.m., "Palash Gupta" wrote: > Hi, > > If you want to load from csv, you can use below procedure. Of course you > need to define spark context first. (Given example to load all csv under a > folder, you can use specific n

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Jacek Laskowski
Hi Yan, I've been surprised the first time when I noticed rxin stepped back and a new release manager stepped in. Congrats on your first ANNOUNCE! I can only expect even more great stuff coming in to Spark from the dev team after Reynold spared some time 😉 Can't wait to read the changes... Jace

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-26 Thread Jacek Laskowski
Thanks a LOT, Michael! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Dec 26, 2016 at 10:04 PM, Michael Gummelt wrote: > In fine-grained mode (which

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-26 Thread Jacek Laskowski
Hi Michael, That caught my attention... Could you please elaborate on "elastically grow and shrink CPU usage" and how it really works under the covers? It seems that CPU usage is just a "label" for an executor on Mesos. Where's this in the code? Pozdrawiam, J

Re: Spark Storage Tab is empty

2016-12-26 Thread Jacek Laskowski
Hi David, Can you use persist instead? Perhaps with some other StorageLevel? It worked with Spark 2.2.0-SNAPSHOT I use and don't remember how it worked back then in 1.6.2. You could also check the Executors tab and see how many blocks you have in their BlockManagers. Pozdrawiam, Jacek Lask

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Jacek Laskowski
Hi, What's the entire spark-submit + Spark properties you're using? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Dec 2, 2016 at 6:28 P

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Jacek Laskowski
ad pulling and in the master > spark UI I see the executor thread id is showing as 0 and that’s it. > > > > Thanks, > > Gabe > > > > > > *From: *Jacek Laskowski > *Date: *Friday, December 2, 2016 at 11:47 AM > *To: *Gabriel Perez > *Cc: *user >

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Jacek Laskowski
Hi, How many partitions does the topic have? How do you check how many executors read from the topic? Jacek On 2 Dec 2016 2:44 p.m., "gabrielperez2484" wrote: Hello, I am trying to perform a POC between Kafka 0.10 and Spark 2.0.2. Currently I am running into an issue, where only one executor

Re: Any equivalent method lateral and explore

2016-11-25 Thread Jacek Laskowski
Hi, Interesting, but I personally would opt for withColumn since it'd be less to type (and also be consistent with ticks (')) as follows: df.withColumn(explode('myArray) as 'arrayItem) (Spark SQL made my SQL developer's life so easy these days :)) Pozdrawiam, J
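Note that withColumn takes a column name plus a Column, so the idea above compiles as (sketch; the `myArray` column is assumed from the thread):

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

val exploded = df.withColumn("arrayItem", explode($"myArray"))
```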

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
Hi Luciano, Mind sharing why to have a structured streaming source/sink for Akka if Kafka's available and Akka Streams has a Kafka module? #curious Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
//github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter

Re: How to impersonate a user from a Spark program

2016-11-09 Thread Jacek Laskowski
ache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L163-L164 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Nov 9, 2016

Physical plan for windows and joins - how to know which is faster?

2016-11-09 Thread Jacek Laskowski
tial_sum(cast(id#15 as bigint))]) +- *Project [_1#12 AS id#15, (_1#12 % 3) AS ID % 3#681] +- *Filter isnotnull((_1#12 % 3)) +- LocalTableScan [_1#12, _2#13] Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spar

Re: Running jobs against remote cluster from scala eclipse ide

2016-09-26 Thread Jacek Laskowski
Hi, Remove .setMaster("spark://spark-437-1-5963003:7077"). set("spark.driver.host","11.104.29.106") and start over. Can you also run the following command to check out Spark Standalone: run-example --master spark://spark-437-1-5963003:7077 SparkPi Pozdrawi

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi, How did you install Spark 1.6? It's usually as simple as rm -rf $SPARK_1.6_HOME, but it really depends on how you installed it in the first place. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
You've got two Spark runtimes up that may or may not contribute to the issue. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Sep 25, 2016 at 8:36 AM, v

Re: Equivalent to --files for driver?

2016-09-22 Thread Jacek Laskowski
Hi Everett, I'd bet on --driver-class-path (but didn't check that out myself). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Sep 21, 2016 a

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread Jacek Laskowski
Hi Janardhan, What's the command to build the project (sbt package or sbt assembly)? What's the command you execute to run the application? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow m

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Jacek Laskowski
On Mon, Sep 19, 2016 at 11:36 AM, Mich Talebzadeh wrote: > Spark UI on port 4040 by default That's exactly *a* SparkListener + web UI :) Jacek

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Jacek Laskowski
Hi Cristina, http://blog.jaceklaskowski.pl/spark-workshop/slides/08_Monitoring_using_SparkListeners.html http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener Let me know if you've got more questions. Pozdrawiam, Jacek Laskowski
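A minimal listener along those lines (Spark 2.x TaskMetrics; the totals and the registration line are a sketch, not the thread's code):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class ShuffleTotalsListener extends SparkListener {
  var bytesRead, bytesWritten = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      bytesRead    += m.shuffleReadMetrics.totalBytesRead
      bytesWritten += m.shuffleWriteMetrics.bytesWritten
    }
  }
}
// spark.sparkContext.addSparkListener(new ShuffleTotalsListener)
```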

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-18 Thread Jacek Laskowski
SparkListener perhaps? Jacek On 15 Sep 2016 1:41 p.m., "Cristina Rozee" wrote: > Hello, > > I am running a spark application and I would like to know the total amount > of shuffle data (read + write ) so could anyone let me know how to get this > information? > > Thank you > Cristina. >

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread Jacek Laskowski
Hi Janardhan, Can you share the code that you execute? What's the command? Mind sharing the complete project on github? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitte


Re: Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Jacek Laskowski
Hi Advait, It's due to https://issues.apache.org/jira/browse/SPARK-15565. See http://stackoverflow.com/a/38945867/1305344 for a solution (that's spark.sql.warehouse.dir away). Upvote if it works for you. Thanks! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/
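The SPARK-15565 workaround boils down to setting `spark.sql.warehouse.dir` explicitly when building the session. A sketch, assuming a local run on Windows (the warehouse path is just an example):

```scala
import org.apache.spark.sql.SparkSession

// On Windows the default warehouse location trips SPARK-15565;
// pointing spark.sql.warehouse.dir at an explicit file: URI avoids it
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("warehouse-dir-workaround")
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()
```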

Re: Spark Interview questions

2016-09-14 Thread Jacek Laskowski
s/TODOs in my Spark notes... Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Sep 14, 2016 at 4:09 PM, Mich Talebzadeh wrote: > Hi Ashok, > > I am

Re: Spark metrics when running with YARN?

2016-09-11 Thread Jacek Laskowski
d to handle the single Spark application. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov wrote: > Hello Ja

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
Hi Muhammad, sep or delimiter should both work fine. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Sep 10, 2016 at 10:42 AM, Muhammad Asif Abbasi

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
eve got its own file format and support @ spark-packages. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Sep 10, 2016 at 8:00 AM, Mich Talebzadeh wrote:

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
ps://issues.apache.org/jira/browse/SPARK. Have you run into any issues with CSV and Java? Share the code. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat

Re: Spark metrics when running with YARN?

2016-09-09 Thread Jacek Laskowski
Hi, That's correct. One app one web UI. Open 4041 and you'll see the other app. Jacek On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov" < vladimir.tretya...@sematext.com> wrote: > Hello again. > > I am trying to play with Spark version "2.11-2.0.0". > > Problem that REST API and UI shows me differ
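Each running application serves its own web UI: the first binds port 4040 and later ones probe 4041, 4042, and so on. A sketch of pinning the port explicitly instead (port value is just an example):

```scala
import org.apache.spark.sql.SparkSession

// One SparkContext = one web UI. If 4040 is taken, Spark tries 4041,
// 4042, ...; spark.ui.port pins a specific port up front.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("second-app")
  .config("spark.ui.port", "4041")
  .getOrCreate()
```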

Re: Spark ML 2.1.0 new features

2016-09-06 Thread Jacek Laskowski
Hi, https://issues.apache.org/jira/browse/SPARK-17363?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0%20AND%20component%20%3D%20MLlib Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https

Re: reuse the Spark SQL internal metrics

2016-08-30 Thread Jacek Laskowski
es in onExecutorMetricsUpdate. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklask

Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread Jacek Laskowski
. scala> s"I'm using $spark in ${spark.version}" res0: String = I'm using org.apache.spark.sql.SparkSession@1fc1c7e in 2.1.0-SNAPSHOT Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me

Re: Spark 2.0 - Join statement compile error

2016-08-28 Thread Jacek Laskowski
Hi Mich, This is Scala's string interpolation which allow for replacing $-prefixed expressions with their values. It's what cool kids use in Scala to do templating and concatenation 😁 Jacek On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh" wrote: > What is --> s below before the text of sql? > >
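A minimal plain-Scala illustration of the `s` interpolator: `$name` splices a value and `${...}` evaluates an arbitrary expression inside the string (the SQL text below is made up for the example):

```scala
// The s prefix turns a string literal into an interpolated string
val table = "sales"
val year  = 2016
val sql   = s"SELECT * FROM $table WHERE year = ${year + 0}"
// sql == "SELECT * FROM sales WHERE year = 2016"
```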

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Jacek Laskowski
may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 26 August 2016 at 23:21, Jacek Laskowski wrote: > > Hi Mich, &
