Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Reynold Xin
Personally I'd love this, but I agree with some of the earlier comments that 
this should not be Python-specific (meaning I should be able to implement a 
data source in Python and then make it usable across all languages Spark 
supports). I think we should find a way to make this reusable beyond Python 
(especially for SQL).

Python is the most popular programming language by a large margin, in general 
and among Spark users. Many of the organizations that use Spark don't even have 
a single person who knows Scala. What if they want to implement a custom data 
source to fetch some data? Today we'd have to tell them to learn Scala/Java and 
the fairly complex data source API (v1 or v2).

Maciej - I understand your concern about endpoint throttling etc., and it goes 
well beyond querying REST endpoints. I personally had that concern too when we 
were adding the JDBC data source (what if somebody launches a 512-node Spark 
cluster to query my single-node MySQL instance?!). But the built-in JDBC data 
source is one of the most popular data sources (I just looked up its usage on 
Databricks and it's by far the #1 data source outside of files, used by > 
1 organizations every day).


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Maciej

Thanks for your feedback, Martin.

However, if the primary intended purpose of this API is to provide an 
interface for endpoint querying, then I find this proposal even less 
convincing.


Neither the Spark execution model nor the data source API (full or 
restricted as proposed here) is a good fit for handling problems 
arising from massive endpoint requests, including, but not limited to, 
handling quotas and rate limiting.


Consistency and streamlined development are, of course, valuable. 
Nonetheless, they are not sufficient, especially if they cannot deliver 
the expected user experience in terms of reliability and execution cost.


Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-24 Thread Martin Grund
Hey,

I would like to express my strong support for Python Data Sources even
though they might not be immediately as powerful as Scala-based data
sources. One element that is easily lost in this discussion is how much
faster the iteration speed is with Python compared to Scala. Due to the
dynamic nature of Python, you can design and build a data source while
running in a notebook and continuously change the code until it works as
you want. This behavior is unparalleled!

There exists a litany of Python libraries connecting to all kinds of
different endpoints that could provide data that is usable with Spark. I
personally can imagine implementing a data source on top of the AWS SDK to
extract EC2 instance information. Now I don't have to switch tools and can
keep my pipeline consistent.
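
As a concrete, hypothetical illustration of that EC2 idea: the AWS-SDK half is 
already easy today, it is the Spark half that stays ad hoc (assumes an active 
SparkSession `spark` and AWS credentials):

```python
import boto3  # AWS SDK for Python

def ec2_rows(region="us-east-1"):
    # Yield (instance_id, instance_type, state) tuples straight from the AWS SDK.
    ec2 = boto3.client("ec2", region_name=region)
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                yield (inst["InstanceId"], inst["InstanceType"], inst["State"]["Name"])

# Today: a one-off local createDataFrame. A Python data source would let the
# same logic plug into spark.read directly.
df = spark.createDataFrame(
    ec2_rows(), "instance_id STRING, instance_type STRING, state STRING"
)
```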

Let's say you want to query an API in parallel from Spark using Python.
Today's way would be to create a Python RDD, implement the planning and
execution process manually, and finally call `toDF` at the end. While
the actual code of the DS and the RDD-based implementation would be very
similar, the abstraction provided by the DS is much more powerful
and future-proof. Dynamic partition elimination and filter
push-down can all be implemented at a later point in time.
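
To make that concrete, a rough sketch of the RDD-based pattern described above; 
the endpoint, the `fetch_page` helper, and the schema are made up for 
illustration, and an active SparkSession `spark` is assumed:

```python
def fetch_page(page):
    # Manual "execution": fetch one page of records from a hypothetical API.
    import requests  # imported inside so it runs on the executors
    resp = requests.get(f"https://api.example.com/items?page={page}")
    for item in resp.json()["items"]:
        yield (item["id"], item["name"])

# Manual "planning": one page per partition.
pages = spark.sparkContext.parallelize(range(10), numSlices=10)
# Fetch each partition's data, then finally call toDF at the end.
df = pages.flatMap(fetch_page).toDF("id BIGINT, name STRING")
```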

Comparing a DS to batch calling from a UDF is not a fair comparison because the
execution pattern would be very brittle. Imagine something like
`spark.range(10).withColumn("data",
fetch_api).explode(col("data")).collect()`. Here you're encoding
partitioning logic and data transformation in simple ways, but you can't
reason about the structural integrity of the query, and tiny changes in the
UDF interface might already cause a lot of downstream issues.
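
A runnable approximation of that snippet (in PySpark the explode goes through 
`select`; `fetch_api` here is a stand-in UDF rather than a real API client):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def fetch_api(partition_id):
    # Stand-in for a real API call: return a small batch of records per input row.
    return [f"record-{partition_id}-{i}" for i in range(3)]

# Partitioning logic (range(10)) and the transformation (explode) are baked into
# the query itself; any change to what fetch_api returns ripples downstream.
rows = (
    spark.range(10)
    .withColumn("data", fetch_api(F.col("id")))
    .select(F.explode("data").alias("data"))
    .collect()
)
```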


Martin



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-24 Thread Maciej
With such limited scope (both language availability and features), do we 
have any representative examples of sources that could significantly 
benefit from providing this API, compared to other available options such 
as batch imports, direct queries from vectorized UDFs, or even 
interfacing sources through 3rd-party FDWs?
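
For reference, the vectorized-UDF route mentioned above might look roughly like 
this; the endpoint and response shape are hypothetical, and an active 
SparkSession `spark` is assumed:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def fetch_batch(ids: pd.Series) -> pd.Series:
    # One HTTP call per Arrow batch of ids instead of one per row.
    import requests
    resp = requests.post("https://api.example.com/lookup", json={"ids": ids.tolist()})
    payload = resp.json()  # assumed to map str(id) -> value
    return ids.astype(str).map(payload.get)

df = spark.range(100).withColumn("value", fetch_batch(F.col("id")))
```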


Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Wenchen Fan
In an ideal world, every data source you want to connect to already has a
Spark data source implementation (either v1 or v2), and then this Python API is
useless. But I feel it's common that people want to do quick data
exploration, and the target data system is not popular enough to have an
existing Spark data source implementation. It will be useful if people can
quickly implement a Spark data source using their favorite language, Python.

I'm +1 to this proposal, assuming that we will keep it simple and won't
copy all the complicated features we built in DS v2 to this new Python API.



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Maciej
Similarly to Jacek, I feel the proposal fails to document an actual community 
need for such a feature.


Currently, any data source implementation has the potential to benefit 
Spark users across all supported and third-party clients. For generally 
available sources, this is advantageous for the whole Spark community 
and avoids creating 1st- and 2nd-tier citizens. This is even more 
important with new officially supported languages being added through 
Spark Connect.


Instead, we might rather document in detail the process of implementing 
a new source using current APIs and work towards easily extensible or 
customizable sources, in case there is such a need.


--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC




Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Cheng Pan
This API looks like it starts from scratch and has no relationship with the existing 
Java/Scala DataSourceV2 API. In particular, how can such sources support SQL?
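
For context, an existing Java/Scala data source on the classpath is reachable from 
SQL today via its format name; the `myformat` name and options below are placeholders:

```python
# Assumes an active SparkSession `spark`; "myformat" stands in for a registered
# DataSourceV2 implementation's short name.
spark.sql("""
    CREATE TABLE events
    USING myformat
    OPTIONS (path '/data/events')
""")
spark.sql("SELECT count(*) FROM events").show()
```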

We have gone back and forth on the DataSource V2 design since 2.3; I believe 
there are lessons to learn from that when introducing the Python DataSource API.

Thanks,
Cheng Pan









Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Hyukjin Kwon
Actually, I support this idea because Python developers wouldn't have to learn
Scala (or deal with separate packaging) to write their own source.
This is especially crucial when you want to write a simple data source
that interacts with the Python ecosystem.



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Denny Lee
Slightly biased, but per my conversations - this would be awesome to have!



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Abdeali Kothari
I would definitely use it - if it's available :)



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Jacek Laskowski
Hi Allison and devs,

Although I was against this idea at first sight (probably because I'm a
Scala dev), I think it could work as long as there are people who'd be
interested in such an API. Were there any? I'm just curious. I've seen no
emails requesting it.

I also doubt that Python devs would like to work on new data sources, but I
support their wishes wholeheartedly :)

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski






[DISCUSS] SPIP: Python Data Source API

2023-06-15 Thread Allison Wang
Hi everyone,

I would like to start a discussion on “Python Data Source API”.

This proposal aims to introduce a simple API in Python for Data Sources.
The idea is to enable Python developers to create data sources without
having to learn Scala or deal with the complexities of the current data
source APIs. The goal is to make a Python-based API that is simple and easy
to use, thus making Spark more accessible to the wider Python developer
community. This proposed approach is based on the recently introduced
Python user-defined table functions with extensions to support data sources.
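
Purely as an illustration of the shape such an API could take, a toy source might
look something like the sketch below; the class and method names here are
hypothetical placeholders, and the actual interface is the one specified in the
SPIP doc linked below:

```python
# Hypothetical interface -- names are illustrative only, not the SPIP's final API.
class ExampleDataSource:
    """A toy source producing (id, value) rows from plain Python code."""

    def schema(self):
        return "id INT, value STRING"

    def partitions(self):
        # Independent chunks of work that Spark could read in parallel.
        return [0, 1, 2, 3]

    def read(self, partition):
        # Produce the rows for one partition; a real source would call a Python
        # library (REST client, cloud SDK, ...) here instead.
        for i in range(10):
            yield (partition * 10 + i, f"value-{partition}-{i}")
```

Registration and the `spark.read.format(...)` wiring on top of a class like this
are exactly what the proposal would have to define.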

*SPIP Doc*:
https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing

*SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076

Looking forward to your feedback.

Thanks,
Allison