Re: Support User Defined Types in pandas_udf for Spark's own Python API

Hyukjin Kwon Tue, 06 Apr 2021 21:17:25 -0700

Yeah, we should still improve the PySpark APIs together. I am currently tied
up with other work and with porting Koalas, so I haven't had a chance to
take a very close look (I can only skim and drop some comments).


On Tue, Apr 6, 2021 at 5:31 PM, Darcy Shen <sad...@zoho.com.cn> wrote:

> was: [DISCUSS] Support pandas API layer on PySpark
>
>
> I'm working on [SPARK-34600] Support user defined types in Pandas UDF -
> ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-34600>.
>
> I'm wondering if we are still working on improving Spark's own Python API.
>
> SPARK-34600 is a relatively big feature for PySpark. I split it into
> several small tickets and submitted the first small PR:
>
> [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support
> Enabled by sadhen · Pull Request #32026 · apache/spark (github.com)
> <https://github.com/apache/spark/pull/32026>
>
> I'm afraid that the Spark community is busy working on the pandas API layer
> on PySpark and that improvements to Spark's own Python API will keep being
> postponed.
>
> As Dongjoon Hyun said:
> > BTW, what is the future plan for the existing APIs?
>
> If we are keeping these existing APIs, will we add new features for
> Spark's own Python API?
>
> Or will we fix bugs for Spark's own Python API?
>
> Specifically, will we add support for User Defined Types in pandas_udf for
> Spark's own Python API?
>
>
> ---- On Mon, 2021-03-15 14:12:28 *Reynold Xin <r...@databricks.com
> <r...@databricks.com>>* wrote ----
>
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to support.
> It has a pretty large number of users and a lot of existing code. It is
> also pretty easy for data engineers to learn.
>
> The pandas API is great for data science, but isn't that great for some
> other tasks. It's super wide. It's great for data scientists who have
> learned it, or for copy-pasting from Stack Overflow.
>
>
>
>
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate them eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> > complaints such as "why does PySpark follow pascalCase?" or "PySpark
> > APIs are difficult to learn", and APIs are very difficult to change
> > in Spark (as I emphasized above).
>
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Firstly, my biggest reason is that I would like to promote this as
> built-in support, because it matters to a large group of users and the
> need for it is growing, as the charts indicate. I usually think that
> features or add-ons stay as third-party projects when they serve a
> smaller set of users, address a corner case, etc. I think this is similar
> to the data sources we have added. Spark ported CSV and Avro because more
> and more people use them, and it became important to have them as
> built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts in the bigger community. The Koalas team are not experts in all
> of these areas, and there are many missing corner cases to fix. Some
> require deep expertise in specific areas.
>
> One example is type hints. Koalas uses type hints for schema inference.
> Because Python lacked a standard way to express such type hints, Koalas
> added its own (hacky) way
> <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>.
> Fortunately, the approach Koalas took is now partially proposed for Python
> officially (PEP 646).
> But Koalas could have done better by interacting more with the Python
> community and actively joining the design discussions, leading to an
> outcome that benefits both projects and others.
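
To make "type hints for schema inference" concrete, here is a rough,
standard-library-only sketch: it reads a function's return annotation and
maps it to column names and type names. Everything in it is hypothetical
and for illustration only: `infer_schema`, the `Output` description, the
`transform` function and the type-name table are assumptions, not Koalas'
actual syntax or implementation (the linked user guide describes that).

```python
from typing import TypedDict, get_type_hints

# Hypothetical mapping from Python types to SQL-like type names.
_TYPE_NAMES = {int: "bigint", float: "double", str: "string", bool: "boolean"}


class Output(TypedDict):
    # Hypothetical description of the columns a function returns.
    id: int
    score: float


def transform(rows) -> Output:
    # Placeholder body; only the return annotation matters for this sketch.
    ...


def infer_schema(func):
    """Return [(column, type_name), ...] read from func's return annotation."""
    returns = get_type_hints(func).get("return")
    if returns is None:
        raise TypeError(f"{func.__name__} has no return type hint")
    fields = get_type_hints(returns)  # column name -> Python type
    return [(name, _TYPE_NAMES[tp]) for name, tp in fields.items()]


schema = infer_schema(transform)
```

The point of the sketch is only that the schema is known from the
annotation alone, without running `transform` on any data.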
>
> Thirdly, I would like to contribute to the growth of PySpark. Koalas is
> growing very fast according to both internal and external stats. The
> number of users has roughly doubled every 4 to 6 months.
> I think Koalas will provide good momentum to keep Spark growing.
>
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does PySpark follow pascalCase?" or "PySpark APIs
> are difficult to learn", and APIs are very difficult to change in Spark
> (as I emphasized above). This set of Koalas APIs will be able to address
> these concerns in PySpark.
>
> Lastly, I really think PySpark needs native plotting features. As I
> explained in detail before, I do think this is an important feature
> missing in PySpark that users need.
> I do think Koalas completes what PySpark is currently missing.
>
>
>
> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <sro...@gmail.com> wrote:
>
> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add-on? Usually the argument is that it'll be
> maintained better in Spark, but it's already well maintained. Conversely,
> it adds some overhead to maintaining Spark. On the upside, it makes it a
> little more discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I would like to start the discussion on supporting a pandas API layer on
> Spark.
>
>
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
>
> I would appreciate knowing whether you support this or not before I start
> the SPIP.
>
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project that is essentially: pandas API support on Spark, and I would like
> to propose embracing Koalas in PySpark.
>
>
>
> More specifically, I am thinking about adding a separate package to
> PySpark for pandas APIs on PySpark. Therefore, it wouldn't break anything
> in the existing code. The overview would look as below:
>
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
>
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usage as pandas. Users can leverage their existing Spark cluster to
> scale their pandas workloads. It works interchangeably with PySpark by
> making both the pandas and PySpark APIs available to users.
>
> The project has been developed separately for more than two years, and it
> has been going successfully. With version 1.7.0, Koalas has greatly
> improved in maturity and stability. Its usability has been proven by
> adoption among numerous users and by reaching more than 75% API coverage
> of pandas' Index, Series and DataFrame.
>
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
>
> Why do we need it?
>
>    - Python has grown dramatically in the last few years and has become
>      one of the most popular languages; see also the Stack Overflow trend
>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>      for the Python, Java, R and Scala languages.
>    - pandas has become almost the standard library of data science. Please
>      also see the Stack Overflow trend
>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>      for pandas, Apache Spark and PySpark.
>    - PySpark is not Pythonic enough. At least I myself hear a lot of
>      complaints. That initiated Project Zen
>      <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>      greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according to
> the trends, but APIs are hard to change in PySpark. We would have to
> redesign all the APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to model PySpark on a successful
> case, namely pandas, which does not support distributed execution. Once
> PySpark supports pandas-like APIs, it can be a good option for pandas users
> to scale their workloads easily. I do believe this is a win-win strategy
> for the growth of both pandas and PySpark.
>
> In fact, there are already similar efforts such as Dask <https://dask.org/>
> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
> <https://github.com/databricks/koalas>). They are all growing fast and
> successfully, and I find that people compare them to PySpark from time to
> time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.
>
>
>
>    - There are many important features missing that are very common in data
>      science. One of the most important features is plotting and drawing a
>      chart. Almost every data scientist plots and draws a chart to understand
>      their data quickly and visually in their daily work but this is missing in
>      PySpark. Please see one example in pandas:
>
>
>
>
>
> I do recommend taking a quick look at the blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far better.
>
>
>
>
>
