[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

2022-05-18 Thread Gavin Ray
I did some basic testing of multi-source queries with the most recent Spark: https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115 The output of "spark.time()" surprised me: SELECT p.id, p.name, t.id, t.title FROM db…
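
A minimal sketch of how to separate one-time costs from steady-state cost in a measurement like this (the view names and join condition below are placeholders, not Gavin's actual schema): the first run pays Catalyst planning, code generation and JDBC connection setup, so timing a second run shows how much of the ~150-200ms is fixed overhead rather than execution.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // "persons" and "todos" stand in for the two JDBC-backed views.
    val query =
      "SELECT p.id, p.name, t.id, t.title FROM persons p JOIN todos t ON p.id = t.person_id"

    spark.time { spark.sql(query).collect() }  // cold: planning + codegen + connection setup
    spark.time { spark.sql(query).collect() }  // warm: closer to pure execution cost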

Spark SQL join optimizations

2019-02-26 Thread Akhilanand
Hello, I recently noticed that Spark doesn't optimize joins when we limit the result. Say we have payment.join(customer, Seq("customerId"), "left").limit(1).explain(true): Spark doesn't optimize it. > == Physical Plan == > CollectLimit 1 > +- *(5) Project [customerId#29, paymentId#28,…
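
A hedged workaround sketch: since the optimizer does not push the limit below the join here, apply it to the left side by hand. This rewrite is only equivalent when customerId is unique in customer, so the join cannot multiply left-side rows (and limit without an ordering is non-deterministic either way).

    // payment and customer are the DataFrames from the question above.
    val limited = payment
      .limit(1)                                  // bound the scan before the join
      .join(customer, Seq("customerId"), "left")
    limited.explain(true)  // the limit now bounds the payment scan, not the join output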

Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread ayan guha
On Thu, Jul 6, 2017 at 5:28 PM, kant kodali wrote: > Hi All, I am wondering: if I pass a raw SQL string to a DataFrame, do I still get the Spark SQL optimizations? Why or why not? Thanks! -- Best Regards, Ayan Guha

Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread Michael Armbrust
It goes through the same optimization pipeline. More in this video <https://youtu.be/1a4pgYzeFwE?t=608>. On Thu, Jul 6, 2017 at 5:28 PM, kant kodali wrote: > Hi All, I am wondering: if I pass a raw SQL string to a DataFrame, do I still get the Spark SQL optimizations?…
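
A quick way to verify this yourself is to compare the two optimized plans; they should come out identical. A minimal sketch (df and the "people" view are placeholder names):

    import spark.implicits._

    df.createOrReplaceTempView("people")
    // Both statements go through the same Catalyst analyzer/optimizer,
    // so the plans printed by explain(true) should match.
    spark.sql("SELECT name FROM people WHERE age > 21").explain(true)
    df.filter($"age" > 21).select("name").explain(true)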

If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread kant kodali
Hi All, I am wondering: if I pass a raw SQL string to a DataFrame, do I still get the Spark SQL optimizations? Why or why not? Thanks!

Re: Is there a list of missing optimizations for typed functions?

2017-02-27 Thread lihu
> …the area is in heavy development, esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now Column-based untyped…

Re: Disable Spark SQL Optimizations for unit tests

2017-02-26 Thread Stefan Ackermann
…if (castToInts.contains(c)) { dfIn(c).cast(IntegerType) } else { dfIn(c) } } dfIn.select(columns: _*) } As I consistently applied this to other similar functions, the unit tests went down from 60 to 18 minutes. Another way to break SQL optimizations was to just…
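
A reconstruction of the pattern the fragment above describes, as a hedged sketch (the function name and signature are guesses from the snippet): build all column expressions first and issue a single select, instead of chaining many withColumn calls, which keeps the Catalyst plan small and the analysis cheap.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.IntegerType

    def castColumns(dfIn: DataFrame, castToInts: Set[String]): DataFrame = {
      val columns = dfIn.columns.toSeq.map { c =>
        if (castToInts.contains(c)) dfIn(c).cast(IntegerType) else dfIn(c)
      }
      dfIn.select(columns: _*)  // one projection instead of N withColumn steps
    }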

Re: Is there a list of missing optimizations for typed functions?

2017-02-24 Thread Jacek Laskowski
Hi Justin, I have never seen such a list. I think the area is in heavy development, esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now Column-based untyped…

Is there a list of missing optimizations for typed functions?

2017-02-22 Thread Justin Pihony
…so I have two questions really: 1.) Is there a list of the methods that lose some of the optimizations that you get from non-functional methods? Is it any method that accepts a generic function? 2.) Is there any work to attempt reflection and gain some of these optimizations back? I couldn't…
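
For context, a small sketch of the gap the question is about (the file path and case class are placeholders): a Column expression stays visible to Catalyst and can be pushed into the scan, while a Scala lambda is an opaque closure that forces deserialization and blocks those optimizations.

    import spark.implicits._

    case class Person(name: String, age: Int)
    val ds = spark.read.parquet("/data/people.parquet").as[Person]

    ds.filter($"age" > 21).explain()  // Column expression: pushed into the Parquet scan
    ds.filter(_.age > 21).explain()   // typed lambda: opaque to the optimizer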

Disable Spark SQL Optimizations for unit tests

2017-02-11 Thread Stefan Ackermann
Hi, can the Spark SQL optimizations be disabled somehow? In our project we started writing Scala / Spark / DataFrame code 4 weeks ago. We currently have only around 10% of the planned project scope, and we are already waiting 10 (Spark 2.1.0, everything cached) to 30 (Spark 1.6, nothing cached)…
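
There is no single off switch for Catalyst, but two knobs go in that direction; a hedged sketch (the first config exists in this era, the second only arrived in Spark 2.4):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Cap the fixed-point optimizer iterations (default 100).
      .config("spark.sql.optimizer.maxIterations", "10")
      .getOrCreate()

    // Spark 2.4+ only: exclude individual optimizer rules by class name.
    spark.conf.set(
      "spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")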

Are ser/de optimizations relevant with Dataset API and Encoders ?

2016-06-18 Thread Amit Sela
With the RDD API, you could optimize shuffling by making sure that bytes are shuffled instead of objects, using the appropriate ser/de mechanism before and after the shuffle. For example: before parallelize, transform to bytes using a dedicated serializer, parallelize, and immediately after, deserialize…
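
A sketch of why that manual step matters less with Datasets (the names are illustrative): a product Encoder keeps rows in Tungsten binary format end to end, so the shuffle below already moves serialized bytes without a hand-written ser/de pass around it.

    import org.apache.spark.sql.SparkSession

    case class Click(userId: Long, url: String)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset(Seq(Click(1L, "/a"), Click(1L, "/b"), Click(2L, "/a")))
    ds.groupByKey(_.userId).count().show()  // shuffles the encoder's binary rows directly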

Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi, I read the release notes and a few slideshares on the latest optimizations done in the Spark 1.4 and 1.5 releases, part of which are optimizations from Project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical RDD structure into a byte array before shuffle, for optimized GC and memory. My…
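
Tungsten's unsafe-row layout and code generation largely apply to DataFrames (and later Datasets), not to plain RDDs, so the usual advice is to convert before heavy shuffles. A hedged sketch (modern API shown for brevity; Person is illustrative):

    case class Person(id: Int, name: String)

    val rdd = spark.sparkContext.parallelize(Seq(Person(1, "a"), Person(2, "b")))
    val df = spark.createDataFrame(rdd)  // rows now held as Tungsten binary
    df.groupBy("name").count().show()    // shuffle moves compact binary rows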

SparkSQL optimizations and Spark streaming

2015-11-19 Thread Sela, Amit
There is a lot of work done on SparkSQL and DataFrames which optimizes the execution, with some of it working at the data source level, i.e., optimizing reads from Parquet. I was wondering if using SparkSQL with streaming (in transform/foreachRDD) could benefit from these optimizations? Although (currently)…
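
A minimal sketch of the pattern in question, close to the streaming guide's example (wordStream is an assumed DStream[String]): each micro-batch's DataFrame query does go through the Catalyst optimizer, but the plan is derived per batch, so per-batch planning overhead is part of the trade-off.

    import org.apache.spark.sql.SQLContext

    wordStream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      val words = rdd.toDF("word")
      words.groupBy("word").count().show()  // optimized like any batch query
    }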

Re: Optimizations

2015-07-03 Thread Marius Danciu
…Date: Friday, July 3, 2015 at 3:13 AM > To: user > Subject: Optimizations > Hi all, if I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this…

Re: Optimizations

2015-07-03 Thread Silvio Fiorito
…on the size of your data whether it's worth it or not. From: Marius Danciu Date: Friday, July 3, 2015 at 3:13 AM To: user Subject: Optimizations > Hi all, if I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage?

Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
This is the basic design of Spark: it runs all actions in different stages... Not sure you can achieve what you're looking for. On Jul 3, 2015 12:43 PM, "Marius Danciu" wrote: > Hi all, > If I have something like: > rdd.join(...).mapPartitionToPair(...) > It looks like mapPartitionToPair…

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, if I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after join is passed to the mapPartitionToPair function…
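
A hedged sketch of what actually fuses (rdd and other are assumed pair RDDs; process is a placeholder for the per-record work): stage boundaries are drawn at shuffles, so a narrow transformation chained after the join typically executes inside the join's post-shuffle stage rather than starting a new one. (mapPartitionToPair in the question is the Java-API counterpart of mapPartitions here.)

    val joined = rdd.join(other)                 // shuffle: the stage boundary is here
    val result = joined.mapPartitions { iter =>  // narrow: runs in the join's reduce-side stage
      iter.map { case (k, (left, right)) => (k, process(left, right)) }
    }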

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Guillaume Pitel
…HotSpot optimizations that are applied during the first reading. Do you have any idea how to confirm/solve this performance problem? Thanks for the advice! P.S. I have a billion HotSpot optimizations shown with -XX:+PrintCompilation but cannot figure out which are important and which are not

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
t; > As you can see second reading is 10x times faster then first. Most of the > > query time spent to work with parquet file. > > > > This problem is really annoying, because most of my spark task contains > just > > 1 sql query and data processing and to speedup my j

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Sean Owen
> …the second reading is 10x faster than the first. Most of the query time is spent working with the Parquet file. > This problem is really annoying, because most of my Spark tasks contain just one SQL query and data processing, and to speed up my jobs I put a special warmup query in front of any job.

Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
…I put a special warmup query in front of any job. My assumption is that it is HotSpot optimizations that are applied during the first reading. Do you have any idea how to confirm/solve this performance problem? Thanks for the advice! P.S. I have a billion HotSpot optimizations shown with -XX:+PrintCompilation but cannot figure out which…
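
A sketch of the warm-up workaround described above (the path, view name and queries are placeholders; modern DataFrame API shown for brevity): the first pass pays JIT compilation and code-cache costs, so the timed second query reflects steady-state performance.

    val df = spark.read.parquet("/data/events.parquet")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) FROM events").collect()  // warm-up pass

    val start = System.nanoTime()
    spark.sql("SELECT name, COUNT(*) FROM events GROUP BY name").collect()
    println(s"warm query took ${(System.nanoTime() - start) / 1e6} ms")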