Yeah, we should still improve PySpark APIs together. I am currently tied up with some other work and with porting Koalas, so I haven't had a chance to take a very close look (beyond skimming and dropping some comments).
On Tue, Apr 6, 2021 at 5:31 PM, Darcy Shen <sad...@zoho.com.cn> wrote:

> was: [DISCUSS] Support pandas API layer on PySpark
>
> I'm working on [SPARK-34600] Support user defined types in Pandas UDF -
> ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-34600>.
>
> I'm wondering if we are still working on improving Spark's own Python API.
>
> SPARK-34600 is a relatively big feature for PySpark. I split it into
> several small tickets and submitted the first small PR:
>
> [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support
> Enabled by sadhen · Pull Request #32026 · apache/spark (github.com)
> <https://github.com/apache/spark/pull/32026>
>
> I'm afraid that the Spark community is busy working on the pandas API
> layer on PySpark, and that improvements to Spark's own Python API will
> keep being postponed.
>
> As Dongjoon Hyun said:
>
> > BTW, what is the future plan for the existing APIs?
>
> If we are keeping these existing APIs, will we add new features to
> Spark's own Python API?
>
> Or will we fix bugs in Spark's own Python API?
>
> Specifically, will we add support for User Defined Types in pandas_udf
> for Spark's own Python API?
>
> ---- On Mon, 2021-03-15 14:12:28 Reynold Xin <r...@databricks.com> wrote ----
>
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to support.
> It has a pretty large number of users and a lot of existing code. It is
> also pretty easy for data engineers to learn.
>
> The pandas API is great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists who have learned
> it, or great for copy-paste from Stack Overflow.
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate them eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> > complaints such as "why does PySpark follow pascalCase?" or "PySpark
> > APIs are difficult to learn", and APIs are very difficult to change
> > in Spark (as I emphasized above).
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Firstly, my biggest reason is that I would like to promote this as
> built-in support, because it is important to have it given its impact on
> a large user group, and the need is increasing, as the charts indicate.
> I usually think that features or add-ons stay as third parties when they
> are for a smaller set of users, address a corner case of needs, etc. I
> think this is similar to the datasources we have added. Spark ported CSV
> and Avro because more and more people use them, and it became important
> to have them as built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts from the bigger community. The Koalas team aren't experts in all
> the areas, and there are many missing corner cases to fix. Some require
> deep expertise in specific areas.
>
> One example is type hints. Koalas uses type hints for schema inference.
> Because Python lacked a suitable way to express such type hints, Koalas
> added its own (hacky) way
> <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>.
> Fortunately, the way Koalas implemented it is now partially proposed to
> Python officially (PEP 646). But Koalas could have been better if it had
> interacted with the Python community more and actively joined the design
> discussions, leading to the best outcome for both and for more projects.
>
> Thirdly, I would like to contribute to the growth of PySpark.
> The growth of Koalas is very fast given the internal and external stats.
> The number of users has almost doubled every 4 to 6 months. I think
> Koalas will provide good momentum to keep Spark up.
>
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does PySpark follow pascalCase?" or "PySpark APIs
> are difficult to learn", and APIs are very difficult to change in Spark
> (as I emphasized above). This set of Koalas APIs will be able to address
> these concerns in PySpark.
>
> Lastly, I really think PySpark needs native plotting features. As I
> emphasized before with elaboration, I do think this is an important
> feature missing in PySpark that users need. I do think Koalas completes
> what PySpark is currently missing.
>
> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <sro...@gmail.com> wrote:
>
> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add-on? Usually the argument is that it'll be
> maintained better in Spark, but it's well maintained. Conversely, it adds
> some overhead to maintaining Spark. On the upside, it makes it a little
> more discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I would like to start the discussion on supporting a pandas API layer on
> Spark.
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation of the implementation's
> overview and structure.
>
> I would appreciate it if I could know whether you guys support this or
> not before starting the SPIP.
>
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project that is essentially: pandas API support on Spark, and I would
> like to propose embracing Koalas in PySpark.
>
> More specifically, I am thinking about adding a separate package to
> PySpark for pandas APIs on PySpark. Therefore it wouldn't break anything
> in the existing code. The overview would look as below:
>
>     pyspark_dataframe.[... PySpark APIs ...]
>     pandas_dataframe.[... pandas APIs (local) ...]
>
>     # The package names will change in the final proposal and during review.
>     koalas_dataframe = koalas.from_pandas(pandas_dataframe)
>     koalas_dataframe = koalas.from_spark(pyspark_dataframe)
>     koalas_dataframe.[... pandas APIs on Spark ...]
>
>     pyspark_dataframe = koalas_dataframe.to_spark()
>     pandas_dataframe = koalas_dataframe.to_pandas()
>
> Koalas provides a pandas API layer on PySpark. It supports almost the
> same API usages. Users can leverage their existing Spark cluster to scale
> their pandas workloads. It works interchangeably with PySpark by making
> both pandas and PySpark APIs available to users.
>
> The project has grown separately for more than two years, and it has been
> going successfully. With version 1.7.0 Koalas has greatly improved
> maturity and stability. Its usability has been proven by adoption among
> numerous users and by reaching more than 75% API coverage in pandas'
> Index, Series and DataFrame.
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
>
> Why do we need it?
>
> - Python has grown dramatically in the last few years and has become one
>   of the most popular languages; see also the Stack Overflow trend
>   <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>   for the Python, Java, R and Scala languages.
>
> - pandas has become almost the standard library of data science. Please
>   also see the Stack Overflow trend
>   <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>   for pandas, Apache Spark and PySpark.
>
> - PySpark is not Pythonic enough.
>   At least I myself hear a lot of complaints. That initiated Project Zen
>   <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>   greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according
> to the trends, but APIs are hard to change in PySpark. We would have to
> redesign all APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to benchmark a successful case,
> and pandas does not support distributed execution. Once PySpark supports
> pandas-like APIs, it can be a good option for pandas users to scale their
> workloads easily. I do believe this is a win-win strategy for the growth
> of both pandas and PySpark.
>
> In fact, there are already similar attempts such as Dask
> <https://dask.org/> and Modin <https://modin.readthedocs.io/en/latest/>
> (other than Koalas <https://github.com/databricks/koalas>). They are all
> growing fast and successfully, and I find that people compare them to
> PySpark from time to time; for example, see "Beyond Pandas: Spark, Dask,
> Vaex and other big data technologies battling head to head"
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.
>
> - There are many important features missing that are very common in data
>   science. One of the most important is plotting and drawing charts.
>   Almost every data scientist plots and draws charts in their daily work
>   to understand their data quickly and visually, but this is missing in
>   PySpark. Please see one example in pandas:
>
> I do recommend taking a quick look at the blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain far better why we need this.
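
For readers unfamiliar with the type-hint point raised earlier in the thread (Koalas uses type hints for schema inference), here is a minimal sketch of the general idea using only the Python standard library. It is not Koalas' actual implementation or syntax: `ScoredUser`, `score_user`, `infer_schema` and the `SPARK_TYPE_NAMES` mapping are hypothetical names made up for illustration, and the mapping covers only a small, assumed subset of types.

```python
from typing import NamedTuple, get_type_hints

# Illustrative subset only (an assumption for this sketch):
# Python types mapped to Spark SQL type names.
SPARK_TYPE_NAMES = {int: "bigint", float: "double", str: "string", bool: "boolean"}


class ScoredUser(NamedTuple):
    """Hypothetical record type used purely as a return annotation."""
    id: int
    name: str
    score: float


def score_user(raw) -> ScoredUser:
    """Hypothetical user function; only its annotation is inspected below."""
    return ScoredUser(int(raw["id"]), str(raw["name"]), float(raw["score"]))


def infer_schema(func):
    """Derive (column name, type name) pairs from func's return annotation."""
    return_type = get_type_hints(func)["return"]  # e.g. ScoredUser
    fields = get_type_hints(return_type)          # field name -> Python type
    return [(name, SPARK_TYPE_NAMES[tp]) for name, tp in fields.items()]


if __name__ == "__main__":
    print(infer_schema(score_user))
```

The point of the sketch is that the output schema is known from the annotation alone, without running the function on any data, which is presumably why this matters for a distributed engine that needs the schema up front.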