Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Hemant Bhanawat
+1 (non-binding)

I found the suggestion from Andrew Ash and James about plan push-down quite
interesting. However, I am not clear about join push-down support at the data
source level. Shouldn't it be the responsibility of the join node to carry out
a data-source-specific join? That is, the join node and the data source scans
on its two sides can (at least in theory) be coalesced into a single node.
This can be done by providing a Strategy that replaces the join node with a
data-source-specific join node. We do it that way for our data sources, and I
find it more intuitive.
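
For illustration, here is a minimal sketch of that Strategy-based approach,
assuming the Spark 2.x planner APIs. It is only a sketch: the same-data-source
check and the physical node it would emit (MyDataSourceJoinExec) are
hypothetical placeholders, not real classes.

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

object DataSourceJoinStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Join(left: LogicalRelation, right: LogicalRelation, _, _)
        if bothFromSameDataSource(left, right) =>
      // A real implementation would return a single data-source-specific physical
      // node here (e.g. `MyDataSourceJoinExec(...) :: Nil`) so that the join and
      // both scans execute inside the data source rather than in Spark.
      Nil
    case _ => Nil // fall through to Spark's built-in join strategies
  }

  // Placeholder: e.g. check that both relations resolve to the same data source.
  private def bothFromSameDataSource(l: LogicalRelation, r: LogicalRelation): Boolean =
    false
}

// Registered e.g. via spark.experimental.extraStrategies = Seq(DataSourceJoinStrategy)

The strategy simply falls through (returns Nil) for every other plan shape, so
Spark's built-in join planning still applies when nothing can be pushed down.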

BTW, aggregate push-down support is desirable and should be considered as
an enhancement going forward.

Hemant Bhanawat 
www.snappydata.io


Supporting Apache Aurora as a cluster manager

2017-09-10 Thread karthik padmanabhan
Hi Spark Devs,

We are using Aurora (http://aurora.apache.org/) as our Mesos framework for
running stateless services. We would like to use Aurora to deploy big data and
batch workloads as well, and for this we have forked Spark and implemented the
ExternalClusterManager trait.

The reason for doing this, rather than running Spark on Mesos directly, is to
leverage the roles and quotas Aurora already provides for admission control,
and also to leverage Aurora features such as priority and preemption.
Additionally, we would like Aurora to be the only deployment/orchestration
system our users have to interact with.

We have a working POC where Spark launches jobs through Aurora as the cluster
manager. Is this something that can be merged upstream? If so, I can create a
design document and an associated JIRA ticket.
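
For reference, a minimal sketch of what plugging a cluster manager in through
the ExternalClusterManager extension point could look like. This is purely
illustrative and not Karthik's actual fork: AuroraClusterManager,
AuroraSchedulerBackend and the aurora:// URL scheme are made-up names, and the
trait itself is a private, unstable Spark API.

package org.apache.spark.scheduler.aurora

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

// Hypothetical backend that would translate executor requests into Aurora jobs,
// honouring Aurora roles/quotas as well as priority and preemption.
private[spark] class AuroraSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masterURL: String)
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv) {
  // start()/stop() would create and tear down the Aurora jobs for the executors.
}

private[spark] class AuroraClusterManager extends ExternalClusterManager {

  // Selected when the master URL uses a dedicated scheme, e.g. "aurora://...".
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("aurora://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    new AuroraSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc, masterURL)

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}

If I remember correctly, Spark discovers ExternalClusterManager implementations
through Java's ServiceLoader, so the class would also need to be listed in
META-INF/services/org.apache.spark.scheduler.ExternalClusterManager.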

Thanks
Karthik


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread vaquar khan
+1

Regards,
Vaquar khan

Re: [SS] Bug in StreamExecution? currentBatchId and getBatchDescriptionString for web UI

2017-09-10 Thread Jacek Laskowski
Hi,

Please disregard my finding. It does not seem to be a bug, just a small piece
of dead code: "init" will never be displayed in the web UI because the minimum
batch id can only ever be 0, so getBatchDescriptionString could be slightly
improved.
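
For reference, a paraphrased sketch of the branch in question (not a verbatim
copy of StreamExecution, just the shape of the logic behind the [3]/[4] links
in my earlier message quoted below):

// currentBatchId is initialized to -1 when StreamExecution is created, but
// populateStartOffsets() bumps it to 0 (or the recovered batch id) before the
// job description is first set, so the "init" branch below is effectively dead.
val batchDescription = if (currentBatchId < 0) "init" else currentBatchId.toString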

Sorry for the noise.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Sep 9, 2017 at 9:21 PM, Jacek Laskowski  wrote:
> Hi,
>
> While reviewing StreamExecution and how batches are displayed in the web
> UI, I've noticed that currentBatchId is -1 when StreamExecution is
> created [1] and becomes 0 when no offsets are available [2].
>
> That leads to my question about setting the job description for a
> query using getBatchDescriptionString [3]. It branches on
> currentBatchId and, when it is -1, gives "init" [4], which never
> happens, as shown above.
>
> That leads to the PR for SPARK-20464 "Add a job group and description
> for streaming queries and fix cancellation of running jobs using the
> job group" that sets the job description after populateStartOffsets
> [5].
>
> Shouldn't it be set before populateStartOffsets, so that
> getBatchDescriptionString has a chance of producing "init" and we
> don't see batch 0 twice?
>
> Help appreciated.
>
> [1] 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L116
> [2] 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L516
> [3] 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L878-L883
> [4] 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L879
> [5] 
> https://github.com/apache/spark/commit/6fc6cf88d871f5b05b0ad1a504e0d6213cf9d331#diff-6532dd3b63bdab0364fbcf2303e290e4R294
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming (Apache Spark 2.2+)
> https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Noman Khan
+1

From: wangzhenhua (G) 
Sent: Friday, September 8, 2017 2:20:07 AM
To: Dongjoon Hyun; 蒋星博
Cc: Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot 
Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

+1 (non-binding). Great to see the data source API is going to be improved!

best regards,
-Zhenhua(Xander)

From: Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
Sent: September 8, 2017 4:07
To: 蒋星博
Cc: Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot 
Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 wrote:
+1


Reynold Xin wrote on Thursday, September 7, 2017 at 12:04 PM:
+1 as well

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote:
+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote:

+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be great to 
see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics right 
for what many datasources are already doing in API V1, combined with other 
private APIs, than on pushing forward the state of the art for performance.

I think that’s the right approach for this SPIP. We can add the support you’re 
talking about later with a more specific plan that doesn’t block fixing the 
problems that this addresses.

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier wrote:
+1 (binding)

I personally believe that there is quite a big difference between having a 
generic data source interface with a low surface area and pushing down a 
significant part of query processing into a data source. The latter has a much 
wider surface area and would require us to stabilize most of the internal 
Catalyst APIs, which would be a significant burden on the community to maintain 
and has the potential to slow development velocity significantly. If you want 
to write such integrations then you should be prepared to work with Catalyst 
internals and own up to the fact that things might change across minor versions 
(and in some cases even maintenance releases). If you are willing to go down 
that road, then your best bet is to use the already existing Spark session 
extensions, which allow you to write such integrations and can be used as 
an `escape hatch`.
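
For context, a minimal sketch (assuming Spark 2.2+) of that session-extensions
escape hatch. The injected strategy here is a do-nothing placeholder; a real
integration would pattern-match on plan shapes it can push down and emit
data-source-specific physical nodes, accepting that these Catalyst APIs may
change between releases.

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Placeholder strategy: matches nothing, so planning falls through to Spark.
object NoOpPushDownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(extensions => extensions.injectPlannerStrategy(_ => NoOpPushDownStrategy))
  .getOrCreate()

If I recall correctly, the same hook can also be wired in without touching
application code by pointing the spark.sql.extensions configuration at a class
that performs the injections.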


On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash wrote:
+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources into 
a common public API for sure.  It will serve as a forcing function to ensure 
that those internal datasources aren't advantaged vs datasources developed 
externally as plugins to Spark, and that all Spark features are available to 
all datasources.

But I also think this read-path proposal avoids the more difficult questions 
around how to continue pushing datasource performance forwards.  James Baker 
(my colleague) had a number of questions about advanced pushdowns (combined 
sorting and filtering), and Reynold also noted that pushdown of aggregates and 
joins are desirable on longer timeframes as well.  The Spark community saw 
similar requests, for aggregate pushdown in SPARK-12686, join pushdown in 
SPARK-20259, and arbitrary plan pushdown in SPARK-12449.  Clearly a number of 
people are interested in this kind of performance work for datasources.

To leave enough space for datasource developers to continue experimenting with 
advanced interactions between Spark and their datasources, I'd propose we leave 
some sort of escape valve that enables these datasources to keep pushing the 
boundaries without forking Spark.  Possibly that looks like an additional 
unsupported/unstable interface that pushes down an entire (unstable API) 
logical plan, which is expected to break API on every release.   (Spark 
attempts this full-plan pushdown, and if that fails Spark ignores it and 
continues on with the rest of the V2 API for compatibility).  Or maybe it looks 
like something else that we don't know of yet.  Possibly this falls outside of 
the desired goals for the V2 API and instead should be a separate SPIP.
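
(Purely to make that idea concrete, a hypothetical sketch of such an unstable
interface; none of these names exist in the proposal, and the shape is only a
guess at what an escape valve could look like:)

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical escape valve: the data source gets a chance to claim the whole
// (Catalyst-internal, unstable) logical plan; returning None means Spark
// silently falls back to the stable V2 read path.
trait UnstableFullPlanPushDown {
  def pushDownFullPlan(plan: LogicalPlan): Option[FullPlanScan]
}

// Hypothetical handle describing how Spark should read the pushed plan's result.
trait FullPlanScan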

If we had a plan for this kind of escape valve for advanced datasource 
developers I'd be an unequivocal +1.  Right now it feels like this SPIP is 
focused more on getting the basics right for what many datasources are already 
doing in API V1 combined with other private APIs, vs pushing forward state of 
the art for performance.

Andrew

On Wed, Sep 6, 2017