Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
> 1. Does this suggestion imply the Python API implementation will be the new
blocker in the future in terms of feature parity among languages? Until
now, Python API feature parity was one of the audit items because it's not
enforced. In other words, Scala and Java have been feature-complete because
they are the underlying main development languages, while the Python/R/SQL
environments were nice-to-have.

I think it wouldn't be treated as a blocker ... but I do believe we have
added all new features on the Python side for the last couple of
releases. So, I wouldn't worry about this at this moment - we have been
doing fine in terms of feature parity.

> 2. Does this suggestion assume that the Python environment is always easier
for users than Scala/Java? Given that we support Python 3.8 to 3.11, the
support matrix for Python library dependencies is a problem the Apache
Spark community has to solve in order to claim that. As we saw at SPARK-41454,
the Python language has also historically introduced breaking changes for us,
and we have many `Pinned` Python library issues.

Yes. In fact, regardless of this change, I do believe we should test more
versions, etc. - at least via scheduled jobs, like we're doing for JDK and
Scala versions.


FWIW, my take on this change is: people use Python and PySpark more
(according to the charts and stats provided), so let's put those examples
first :-).


On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun  wrote:

> I have two questions to clarify the scope and boundaries.
>
> 1. Does this suggestion imply the Python API implementation will be the new
> blocker in the future in terms of feature parity among languages? Until
> now, Python API feature parity was one of the audit items because it's not
> enforced. In other words, Scala and Java have been feature-complete because
> they are the underlying main development languages, while the Python/R/SQL
> environments were nice-to-have.
>
> 2. Does this suggestion assume that the Python environment is always easier
> for users than Scala/Java? Given that we support Python 3.8 to 3.11, the
> support matrix for Python library dependencies is a problem the Apache
> Spark community has to solve in order to claim that. As we saw at SPARK-41454,
> the Python language has also historically introduced breaking changes for us,
> and we have many `Pinned` Python library issues.
>
> Changing documentation is easy, but I hope we can give clear
> communication and direction in this effort because this is one of the most
> user-facing changes.
>
> Dongjoon.
>
> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com 
> wrote:
>
>> +1 LGTM
>>
>> --
>> Ruifeng Zheng
>> ruife...@foxmail.com
>>
>> 
>>
>>
>>
>> -- Original --
>> *From:* "Xinrong Meng" ;
>> *Date:* Thu, Feb 23, 2023 09:17 AM
>> *To:* "Allan Folting";
>> *Cc:* "dev";
>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark
>> documentation
>>
>> +1 Good idea!
>>
>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson 
>> wrote:
>>
>>> Good idea. At the company I work at we discussed using Scala as our
>>> primary language because technically it is slightly stronger than Python,
>>> but we ultimately chose Python as it's easier for other devs to be
>>> onboarded to our platform, and future hiring for the team etc. would be
>>> easier.
>>>
>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon 
>>> wrote:
>>>
 +1 I like this idea too.

 On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
 wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
> .
>
> Also, this change aligns with Python already being the first tab on
> our home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the
> other tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
> page as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this
> proposal.
>
>
> Thanks a lot,
> Allan Folting
>



Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Dongjoon Hyun
I have two questions to clarify the scope and boundaries.

1. Does this suggestion imply the Python API implementation will be the new
blocker in the future in terms of feature parity among languages? Until
now, Python API feature parity was one of the audit items because it's not
enforced. In other words, Scala and Java have been feature-complete because
they are the underlying main development languages, while the Python/R/SQL
environments were nice-to-have.

2. Does this suggestion assume that the Python environment is always easier
for users than Scala/Java? Given that we support Python 3.8 to 3.11, the
support matrix for Python library dependencies is a problem the Apache
Spark community has to solve in order to claim that. As we saw at SPARK-41454,
the Python language has also historically introduced breaking changes for us,
and we have many `Pinned` Python library issues.

Changing documentation is easy, but I hope we can give clear
communication and direction in this effort because this is one of the most
user-facing changes.

Dongjoon.

On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com 
wrote:

> +1 LGTM
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Xinrong Meng" ;
> *Date:* Thu, Feb 23, 2023 09:17 AM
> *To:* "Allan Folting";
> *Cc:* "dev";
> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark
> documentation
>
> +1 Good idea!
>
> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson 
> wrote:
>
>> Good idea. At the company I work at we discussed using Scala as our
>> primary language because technically it is slightly stronger than Python,
>> but we ultimately chose Python as it's easier for other devs to be
>> onboarded to our platform, and future hiring for the team etc. would be
>> easier.
>>
>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon 
>> wrote:
>>
>>> +1 I like this idea too.
>>>
>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
>>> wrote:
>>>
 Hi all,

 I would like to propose that we show Python code examples first in the
 Spark documentation where we have multiple programming language examples.
 An example is on the Quick Start page:
 https://spark.apache.org/docs/latest/quick-start.html

 I propose this change because Python has become more popular than the
 other languages supported in Apache Spark. There are a lot more users of
 Spark in Python than Scala today and Python attracts a broader set of new
 users.
 For Python usage data, see https://www.tiobe.com/tiobe-index/ and
 https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
 .

 Also, this change aligns with Python already being the first tab on our
 home page:
 https://spark.apache.org/

 Anyone who wants to use another language can still just click on the
 other tabs.

 I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
 page as a first step:
 https://github.com/apache/spark/pull/40087


 I would appreciate it if you could share your thoughts on this proposal.


 Thanks a lot,
 Allan Folting

>>>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread 416161...@qq.com
+1 LGTM




Ruifeng Zheng
ruife...@foxmail.com


-- Original --
From: "Xinrong Meng"
Date: Thu, Feb 23, 2023 09:17 AM
To: "Allan Folting"
Cc: "dev"
Subject: Re: [DISCUSS] Show Python code examples first in Spark documentation

+1 Good idea!

Hi all,

I would like to propose that we show Python code examples first in the
Spark documentation where we have multiple programming language examples.
An example is on the Quick Start page:
https://spark.apache.org/docs/latest/quick-start.html


I propose this change because Python has become more popular than the other
languages supported in Apache Spark. There are a lot more users of Spark
in Python than Scala today and Python attracts a broader set of new users.
For Python usage data, see https://www.tiobe.com/tiobe-index/ and
https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.


Also, this change aligns with Python already being the first tab on our home 
page:
https://spark.apache.org/



Anyone who wants to use another language can still just click on the other tabs.


I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page as a 
first step:

https://github.com/apache/spark/pull/40087




I would appreciate it if you could share your thoughts on this proposal.




Thanks a lot,
Allan Folting

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Xinrong Meng
+1 Good idea!

On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson  wrote:

> Good idea. At the company I work at we discussed using Scala as our
> primary language because technically it is slightly stronger than Python,
> but we ultimately chose Python as it's easier for other devs to be
> onboarded to our platform, and future hiring for the team etc. would be
> easier.
>
> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon  wrote:
>
>> +1 I like this idea too.
>>
>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to propose that we show Python code examples first in the
>>> Spark documentation where we have multiple programming language examples.
>>> An example is on the Quick Start page:
>>> https://spark.apache.org/docs/latest/quick-start.html
>>>
>>> I propose this change because Python has become more popular than the
>>> other languages supported in Apache Spark. There are a lot more users of
>>> Spark in Python than Scala today and Python attracts a broader set of new
>>> users.
>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
>>> .
>>>
>>> Also, this change aligns with Python already being the first tab on our
>>> home page:
>>> https://spark.apache.org/
>>>
>>> Anyone who wants to use another language can still just click on the
>>> other tabs.
>>>
>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
>>> page as a first step:
>>> https://github.com/apache/spark/pull/40087
>>>
>>>
>>> I would appreciate it if you could share your thoughts on this proposal.
>>>
>>>
>>> Thanks a lot,
>>> Allan Folting
>>>
>>


Unsubscribe

2023-02-22 Thread Tang Jinxin
Unsubscribe


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Jack Goodson
Good idea. At the company I work at we discussed using Scala as our primary
language because technically it is slightly stronger than Python, but we
ultimately chose Python as it's easier for other devs to be onboarded to
our platform, and future hiring for the team etc. would be easier.

On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon  wrote:

> +1 I like this idea too.
>
> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
> wrote:
>
>> Hi all,
>>
>> I would like to propose that we show Python code examples first in the
>> Spark documentation where we have multiple programming language examples.
>> An example is on the Quick Start page:
>> https://spark.apache.org/docs/latest/quick-start.html
>>
>> I propose this change because Python has become more popular than the
>> other languages supported in Apache Spark. There are a lot more users of
>> Spark in Python than Scala today and Python attracts a broader set of new
>> users.
>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>
>> Also, this change aligns with Python already being the first tab on our
>> home page:
>> https://spark.apache.org/
>>
>> Anyone who wants to use another language can still just click on the
>> other tabs.
>>
>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide
>> page as a first step:
>> https://github.com/apache/spark/pull/40087
>>
>>
>> I would appreciate it if you could share your thoughts on this proposal.
>>
>>
>> Thanks a lot,
>> Allan Folting
>>
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
+1 I like this idea too.

On Thu, Feb 23, 2023 at 6:00 AM Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Tom Graves
It looks like there are still blockers open; we need to make sure they are
addressed before doing a release:
https://issues.apache.org/jira/browse/SPARK-41793
https://issues.apache.org/jira/browse/SPARK-42444

Tom

On Tuesday, February 21, 2023 at 10:35:45 PM CST, Xinrong Meng wrote:
 Please vote on releasing the following candidate as Apache Spark version 3.4.0.

The vote is open until 11:59pm Pacific time February 27th and passes if a
majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.4.0-rc1 (commit 
e2484f626bb338274665a49078b528365ea18c3b):
https://github.com/apache/spark/tree/v3.4.0-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1435

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/

The list of bug fixes going into 3.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351465

This release is using the release script of the tag v3.4.0-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.4.0?
===
The current list of open tickets targeted at 3.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

Thanks,
Xinrong Meng
  

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Jonathan Kelly
Thanks! I was wondering about that ClientE2ETestSuite failure today, so I'm
glad to know that it's also being experienced by others.

On a similar note, I am experiencing the following error when running the
Python tests with Python 3.7:

+ ./python/run-tests --python-executables=python3
Running PySpark tests. Output is in
/home/ec2-user/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-connect', 'pyspark-core',
'pyspark-errors', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas',
'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql',
'pyspark-streaming']
python3 python_implementation is CPython
python3 version is: Python 3.7.16
Starting test(python3): pyspark.ml.tests.test_feature (temp output:
/home/ec2-user/spark/python/target/8ca9ab1a-05cc-4845-bf89-30d9001510bc/python3__pyspark.ml.tests.test_feature__kg6sseie.log)
Starting test(python3): pyspark.ml.tests.test_base (temp output:
/home/ec2-user/spark/python/target/f2264f3b-6b26-4e61-9452-8d6ddd7eb002/python3__pyspark.ml.tests.test_base__0902zf9_.log)
Starting test(python3): pyspark.ml.tests.test_algorithms (temp output:
/home/ec2-user/spark/python/target/d1dc4e07-e58c-4c03-abe5-09d8fab22e6a/python3__pyspark.ml.tests.test_algorithms__lh3wb2u8.log)
Starting test(python3): pyspark.ml.tests.test_evaluation (temp output:
/home/ec2-user/spark/python/target/3f42dc79-c945-4cf2-a1eb-83e72b40a9ee/python3__pyspark.ml.tests.test_evaluation__89idc7fa.log)
Finished test(python3): pyspark.ml.tests.test_base (16s)
Starting test(python3): pyspark.ml.tests.test_functions (temp output:
/home/ec2-user/spark/python/target/5a3b90f0-216b-4edd-9d15-6619d3e03300/python3__pyspark.ml.tests.test_functions__g5u1290s.log)
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/ec2-user/spark/python/pyspark/ml/tests/test_functions.py",
line 21, in 
from pyspark.ml.functions import predict_batch_udf
  File "/home/ec2-user/spark/python/pyspark/ml/functions.py", line 38, in

from typing import Any, Callable, Iterator, List, Mapping, Protocol,
TYPE_CHECKING, Tuple, Union
ImportError: cannot import name 'Protocol' from 'typing'
(/usr/lib64/python3.7/typing.py)
Had test failures in pyspark.ml.tests.test_functions with python3; see logs.

I know we should move on to a newer version of Python, but isn't Python 3.7
still officially supported?
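
For reference, Protocol only landed in the standard-library typing module in
Python 3.8; on 3.7 it has to come from the typing_extensions backport. A
minimal compatibility shim (a sketch, assuming typing_extensions is available
as a dependency) looks like this:

import sys

# Protocol was added to typing in Python 3.8. On Python 3.7 it is only
# available from the typing_extensions backport, so fall back to that.
if sys.version_info >= (3, 8):
    from typing import Protocol
else:
    from typing_extensions import Protocol


class HasPredict(Protocol):
    """Illustrative structural type: any object with predict() conforms."""

    def predict(self, batch: list) -> list:
        ...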

Thank you,
Jonathan Kelly

On Wed, Feb 22, 2023 at 1:47 PM Herman van Hovell
 wrote:

> Hi All,
>
> Thanks for testing the 3.4.0 RC! I apologize for the maven testing
> failures for the Spark Connect Scala Client. We will try to get those
> sorted as soon as possible.
>
> This is an artifact of having multiple build systems, and only running CI
> for one (SBT). That, however, is a debate for another day :)...
>
> Cheers,
> Herman
>
> On Wed, Feb 22, 2023 at 5:32 PM Bjørn Jørgensen 
> wrote:
>
>> ./build/mvn clean package
>>
>> I'm using Ubuntu rolling, Python 3.11, OpenJDK 17
>>
>> CompatibilitySuite:
>> - compatibility MiMa tests *** FAILED ***
>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>   at scala.Predef$.assert(Predef.scala:223)
>>   at
>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>   ...
>> - compatibility API tests: Dataset *** FAILED ***
>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>   at scala.Predef$.assert(Predef.scala:223)
>>   at
>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110)
>>   at
>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>   at 

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Mich Talebzadeh
I have seen many data engineering teams start out with Scala because
technically it is the best choice for many good reasons and, basically, it
is what Spark is written in. I also concur that Python is more popular than
Scala because of the advent of data science. A majority of the use cases we
see these days are data science or related use cases where people mostly use
Python. Most cloud data warehouses offer embedded modeling tools that rely
extensively on Python packages. So, if you need those two worlds to
share code and even hand over code, you do not want the ideological battle
of Scala vs Python. Often we chose Python for the sake of everybody
speaking the same language.


With regard to the Spark docs showing Python code in the first tab etc., in my
opinion it is a moot point. The law of diminishing returns may indicate
that the time spent on changing the order may not be worth it.


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 22 Feb 2023 at 21:00, Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Herman van Hovell
Hi All,

Thanks for testing the 3.4.0 RC! I apologize for the maven testing failures
for the Spark Connect Scala Client. We will try to get those sorted as soon
as possible.

This is an artifact of having multiple build systems, and only running CI
for one (SBT). That, however, is a debate for another day :)...

Cheers,
Herman

On Wed, Feb 22, 2023 at 5:32 PM Bjørn Jørgensen 
wrote:

> ./build/mvn clean package
>
> I'm using Ubuntu rolling, Python 3.11, OpenJDK 17
>
> CompatibilitySuite:
> - compatibility MiMa tests *** FAILED ***
>   java.lang.AssertionError: assertion failed: Failed to find the jar
> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>   at scala.Predef$.assert(Predef.scala:223)
>   at
> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   ...
> - compatibility API tests: Dataset *** FAILED ***
>   java.lang.AssertionError: assertion failed: Failed to find the jar
> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>   at scala.Predef$.assert(Predef.scala:223)
>   at
> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>   at
> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   ...
> SparkConnectClientSuite:
> - Placeholder test: Create SparkConnectClient
> - Test connection
> - Test connection string
> - Check URI: sc://host, isCorrect: true
> - Check URI: sc://localhost/, isCorrect: true
> - Check URI: sc://localhost:1234/, isCorrect: true
> - Check URI: sc://localhost/;, isCorrect: true
> - Check URI: sc://host:123, isCorrect: true
> - Check URI: sc://host:123/;user_id=a94, isCorrect: true
> - Check URI: scc://host:12, isCorrect: false
> - Check URI: http://host, isCorrect: false
> - Check URI: sc:/host:1234/path, isCorrect: false
> - Check URI: sc://host/path, isCorrect: false
> - Check URI: sc://host/;parm1;param2, isCorrect: false
> - Check URI: sc://host:123;user_id=a94, isCorrect: false
> - Check URI: sc:///user_id=123, isCorrect: false
> - Check URI: sc://host:-4, isCorrect: false
> - Check URI: sc://:123/, isCorrect: false
> - Non user-id parameters throw unsupported errors
> DatasetSuite:
> - limit
> - select
> - filter
> - write
> UserDefinedFunctionSuite:
> - udf and encoder serialization
> Run completed in 21 seconds, 944 milliseconds.
> Total number of tests run: 389
> Suites: completed 10, aborted 0
> Tests: succeeded 386, failed 3, canceled 0, ignored 0, pending 0
> *** 3 TESTS FAILED ***
> [INFO]
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.4.0:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
> 47.096 s]
> [INFO] Spark Project Tags . SUCCESS [
> 14.759 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 21.628 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 20.311 s]
> [INFO] Spark Project Networking ... SUCCESS [01:07
> min]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 15.921 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 16.020 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 10.873 s]
> [INFO] Spark Project Core . SUCCESS [37:10
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 40.841 s]
> [INFO] Spark Project GraphX ... SUCCESS [02:39
> min]
> [INFO] Spark Project Streaming  SUCCESS [05:53
> min]
> [INFO] Spark Project Catalyst . SUCCESS [11:22
> min]
> 

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Bjørn Jørgensen
./build/mvn clean package

I'm using Ubuntu rolling, Python 3.11, OpenJDK 17

CompatibilitySuite:
- compatibility MiMa tests *** FAILED ***
  java.lang.AssertionError: assertion failed: Failed to find the jar inside
folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
  at scala.Predef$.assert(Predef.scala:223)
  at
org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- compatibility API tests: Dataset *** FAILED ***
  java.lang.AssertionError: assertion failed: Failed to find the jar inside
folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
  at scala.Predef$.assert(Predef.scala:223)
  at
org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
  at
org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
SparkConnectClientSuite:
- Placeholder test: Create SparkConnectClient
- Test connection
- Test connection string
- Check URI: sc://host, isCorrect: true
- Check URI: sc://localhost/, isCorrect: true
- Check URI: sc://localhost:1234/, isCorrect: true
- Check URI: sc://localhost/;, isCorrect: true
- Check URI: sc://host:123, isCorrect: true
- Check URI: sc://host:123/;user_id=a94, isCorrect: true
- Check URI: scc://host:12, isCorrect: false
- Check URI: http://host, isCorrect: false
- Check URI: sc:/host:1234/path, isCorrect: false
- Check URI: sc://host/path, isCorrect: false
- Check URI: sc://host/;parm1;param2, isCorrect: false
- Check URI: sc://host:123;user_id=a94, isCorrect: false
- Check URI: sc:///user_id=123, isCorrect: false
- Check URI: sc://host:-4, isCorrect: false
- Check URI: sc://:123/, isCorrect: false
- Non user-id parameters throw unsupported errors
DatasetSuite:
- limit
- select
- filter
- write
UserDefinedFunctionSuite:
- udf and encoder serialization
Run completed in 21 seconds, 944 milliseconds.
Total number of tests run: 389
Suites: completed 10, aborted 0
Tests: succeeded 386, failed 3, canceled 0, ignored 0, pending 0
*** 3 TESTS FAILED ***
[INFO]

[INFO] Reactor Summary for Spark Project Parent POM 3.4.0:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [
47.096 s]
[INFO] Spark Project Tags . SUCCESS [
14.759 s]
[INFO] Spark Project Sketch ... SUCCESS [
21.628 s]
[INFO] Spark Project Local DB . SUCCESS [
20.311 s]
[INFO] Spark Project Networking ... SUCCESS [01:07
min]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
15.921 s]
[INFO] Spark Project Unsafe ... SUCCESS [
16.020 s]
[INFO] Spark Project Launcher . SUCCESS [
10.873 s]
[INFO] Spark Project Core . SUCCESS [37:10
min]
[INFO] Spark Project ML Local Library . SUCCESS [
40.841 s]
[INFO] Spark Project GraphX ... SUCCESS [02:39
min]
[INFO] Spark Project Streaming  SUCCESS [05:53
min]
[INFO] Spark Project Catalyst . SUCCESS [11:22
min]
[INFO] Spark Project SQL .. SUCCESS [
 02:27 h]
[INFO] Spark Project ML Library ... SUCCESS [22:45
min]
[INFO] Spark Project Tools  SUCCESS [
 7.263 s]
[INFO] Spark Project Hive . SUCCESS [
 01:21 h]
[INFO] Spark Project REPL . SUCCESS [02:07
min]
[INFO] Spark Project Assembly . SUCCESS [
11.704 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
26.748 s]
[INFO] Spark Integration for Kafka 

Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Please see inline comments.

So you union two tables, union the result with another one, and finally
with a last one?

First Union of 2 tables = Result1
2nd Union of another 2 tables = Result2
3rd: Result1 Union Result2 = finalResult

How many columns do all these tables have?

Each has around 700 columns.

Are you sure creating the plan depends on the number of rows?

As the explain plan keeps on increasing along with metadata ...

On Wed, Feb 22, 2023 at 3:23 PM Enrico Minack 
wrote:

> So you union two tables, union the result with another one, and finally
> with a last one?
>
> How many columns do all these tables have?
>
> Are you sure creating the plan depends on the number of rows?
>
> Enrico
>
>
> On 22.02.23 at 19:08, Prem Sahoo wrote:
>
> here is the missing information:
> 1. Spark 3.2.0
> 2. it is Scala based
> 3. size of tables will be ~60G
> 4. the Catalyst explain plan shows lots of time is being spent in
> creating the plan
> 5. the number of unioned tables is 2, and another 2, then finally 2
>
> the slowness in providing the result grows as the data size & column size
> increase.
>
> On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack 
> wrote:
>
>> Plus number of unioned tables would be helpful, as well as which
>> downstream operations are performed on the unioned tables.
>>
>> And what "performance issues" do you exactly measure?
>>
>> Enrico
>>
>>
>>
>> On 22.02.23 at 16:50, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> Few details will help
>>
>>1. Spark version
>>2. Spark SQL, Scala or PySpark
>>3. size of tables in join.
>>4. What does explain() or the joining operation show?
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 22 Feb 2023 at 15:42, Prem Sahoo  wrote:
>>
>>> Hello Team,
>>> We are observing Spark Union performance issues when unioning big tables
>>> with lots of rows. Do we have any option apart from the Union ?
>>>
>>
>>
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen
FWIW I agree with this.

On Wed, Feb 22, 2023 at 2:59 PM Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>


Re: Spark Union performance issue

2023-02-22 Thread Zhiyuan Lin
Hi Spark devs,

I'm experiencing a Union performance degradation as well. Since this email
thread is very related, posting it here to see if anyone has any insights.

*Background*:
After upgrading a Spark job from Spark 2.4 to Spark 3.1 without any code
change, we saw *big performance degradation* (5-8 times slower). After
enabling DEBUG log, we found the Spark 3.1 job takes significantly longer
in fetching file metadata from S3. E.g., Spark 2.4 takes 5 minutes fetching
all metadata, and Spark 3.1 needs 40 minutes.
*Findings*:
After closely monitoring *thread dumps* for both Spark versions, we found
the reason for the slowness is that Spark 2.4 has *a pool of 8 threads*
(spawned from the driver main thread) to do the metadata fetch, whereas Spark
3 only has the dag-scheduler-event-loop thread to do the work.
More details: Spark 2.4 calls 2.4/core/rdd/UnionRDD.scala#L76 to get partition
metadata, which triggers the thread pool, whereas Spark 3.1 calls
3.1/core/rdd/UnionRDD.scala#L94, and the thread pool is not triggered.

Any insights into why this happens, or how to fix the performance issue, are
welcome. In the meantime, I will do more investigation to see if I can fix
the root cause.
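
One experiment worth trying (a sketch, not a confirmed fix): UnionRDD in both
branches appears to gate the parallel listing behind the internal
spark.rdd.parallelListingThreshold setting, which only spawns the thread pool
when the number of unioned RDDs exceeds the threshold. Lowering it should show
whether that code path is what changed:

import time
from pyspark.sql import SparkSession

# Assumption: spark.rdd.parallelListingThreshold is the internal, undocumented
# knob read by UnionRDD.getPartitions (default 10 in the code linked above).
# Setting it low forces the parallel listing path even for a small union.
spark = (
    SparkSession.builder
    .config("spark.rdd.parallelListingThreshold", "2")
    .getOrCreate()
)

# Toy reproduction: union a few RDDs and time the partition computation.
rdds = [spark.sparkContext.parallelize(range(100)) for _ in range(4)]
unioned = spark.sparkContext.union(rdds)

start = time.time()
unioned.count()
print(f"count() took {time.time() - start:.2f}s")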

Thanks,
Zoe

On Wed, Feb 22, 2023 at 12:24 PM Enrico Minack 
wrote:

> So you union two tables, union the result with another one, and finally
> with a last one?
>
> How many columns do all these tables have?
>
> Are you sure creating the plan depends on the number of rows?
>
> Enrico
>
>
> On 22.02.23 at 19:08, Prem Sahoo wrote:
>
> here is the missing information:
> 1. Spark 3.2.0
> 2. it is Scala based
> 3. size of tables will be ~60G
> 4. the Catalyst explain plan shows lots of time is being spent in
> creating the plan
> 5. the number of unioned tables is 2, and another 2, then finally 2
>
> the slowness in providing the result grows as the data size & column size
> increase.
>
> On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack 
> wrote:
>
>> Plus number of unioned tables would be helpful, as well as which
>> downstream operations are performed on the unioned tables.
>>
>> And what "performance issues" do you exactly measure?
>>
>> Enrico
>>
>>
>>
>> On 22.02.23 at 16:50, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> Few details will help
>>
>>1. Spark version
>>2. Spark SQL, Scala or PySpark
>>3. size of tables in join.
>>4. What does explain() or the joining operation show?
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 22 Feb 2023 at 15:42, Prem Sahoo  wrote:
>>
>>> Hello Team,
>>> We are observing Spark Union performance issues when unioning big tables
>>> with lots of rows. Do we have any option apart from the Union ?
>>>
>>
>>
>


[DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Allan Folting
Hi all,

I would like to propose that we show Python code examples first in the
Spark documentation where we have multiple programming language examples.
An example is on the Quick Start page:
https://spark.apache.org/docs/latest/quick-start.html

I propose this change because Python has become more popular than the other
languages supported in Apache Spark. There are a lot more users of Spark in
Python than Scala today and Python attracts a broader set of new users.
For Python usage data, see https://www.tiobe.com/tiobe-index/ and
https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.

Also, this change aligns with Python already being the first tab on our
home page:
https://spark.apache.org/

Anyone who wants to use another language can still just click on the other
tabs.

I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
as a first step:
https://github.com/apache/spark/pull/40087


I would appreciate it if you could share your thoughts on this proposal.


Thanks a lot,
Allan Folting


Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Mridul Muralidharan
Signatures, digests, etc check out fine - thanks for updating them !
Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes


The test ClientE2ETestSuite.simple udf failed [1] in "Connect Client "
module ... yet to test "Spark Protobuf" module due to the failure.


Regards,
Mridul

[1]

- simple udf *** FAILED ***

  io.grpc.StatusRuntimeException: INTERNAL:
org.apache.spark.sql.ClientE2ETestSuite

  at io.grpc.Status.asRuntimeException(Status.java:535)

  at
io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)

  at org.apache.spark.sql.connect.client.SparkResult.org
$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:50)

  at
org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:95)

  at
org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:112)

  at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2037)

  at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2267)

  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2036)

  at
org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$5(ClientE2ETestSuite.scala:65)

  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

  ...





On Wed, Feb 22, 2023 at 2:07 AM Mridul Muralidharan 
wrote:

>
> Thanks Xinrong !
> The signature verifications are fine now ... will continue with testing
> the release.
>
>
> Regards,
> Mridul
>
>
> On Wed, Feb 22, 2023 at 1:27 AM Xinrong Meng 
> wrote:
>
>> Hi Mridul,
>>
>> Would you please try that again? It should work now.
>>
>> On Wed, Feb 22, 2023 at 2:04 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Hi Xinrong,
>>>
>>>   Was it signed with the same key as present in KEYS [1] ?
>>> I am seeing errors with gpg when validating. For example:
>>>
>>>
>>> $ gpg --verify pyspark-3.4.0.tar.gz.asc
>>>
>>> gpg: assuming signed data in 'pyspark-3.4.0.tar.gz'
>>>
>>> gpg: Signature made Tue 21 Feb 2023 05:56:05 AM CST
>>>
>>> gpg: using RSA key
>>> CC68B3D16FE33A766705160BA7E57908C7A4E1B1
>>>
>>> gpg: issuer "xinr...@apache.org"
>>>
>>> gpg: Can't check signature: No public key
>>>
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>>
>>> On Tue, Feb 21, 2023 at 10:36 PM Xinrong Meng 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 3.4.0.

 The vote is open until 11:59pm Pacific time *February 27th* and passes
 if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is *v3.4.0-rc1* (commit
 e2484f626bb338274665a49078b528365ea18c3b):
 https://github.com/apache/spark/tree/v3.4.0-rc1

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1435

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/

 The list of bug fixes going into 3.4.0 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12351465

 This release is using the release script of the tag v3.4.0-rc1.


 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env, install
 the current RC, and see if anything important breaks; in Java/Scala,
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.4.0?
 ===
 The current list of open tickets targeted at 3.4.0 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.4.0

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.


Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
So you union two tables, union the result with another one, and finally 
with a last one?


How many columns do all these tables have?

Are you sure creating the plan depends on the number of rows?

Enrico


On 22.02.23 at 19:08, Prem Sahoo wrote:

here is the missing information:
1. Spark 3.2.0
2. it is Scala based
3. size of tables will be ~60G
4. the Catalyst explain plan shows lots of time is being spent in
creating the plan
5. the number of unioned tables is 2, and another 2, then finally 2

the slowness in providing the result grows as the data size & column size
increase.

On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack 
 wrote:


Plus number of unioned tables would be helpful, as well as which
downstream operations are performed on the unioned tables.

And what "performance issues" do you exactly measure?

Enrico



On 22.02.23 at 16:50, Mich Talebzadeh wrote:

Hi,

Few details will help

 1. Spark version
 2. Spark SQL, Scala or PySpark
 3. size of tables in join.
 4. What does explain() or the joining operation show?


HTH


view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility
for any loss, damage or destruction of data or any other property
which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for
any monetary damages arising from such loss, damage or destruction.



On Wed, 22 Feb 2023 at 15:42, Prem Sahoo 
wrote:

Hello Team,
We are observing Spark Union performance issues when unioning
big tables with lots of rows. Do we have any option apart
from the Union ?





Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
here is the missing information:
1. Spark 3.2.0
2. it is Scala based
3. size of tables will be ~60G
4. the Catalyst explain plan shows lots of time is being spent in creating
the plan
5. the number of unioned tables is 2, and another 2, then finally 2

the slowness in providing the result grows as the data size & column size
increase.
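
To quantify where the time goes, here is a minimal reproduction sketch (toy
data and illustrative names; the real tables have ~700 columns) that times
how long Catalyst takes to produce the plan as the unions accumulate:

import time
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the real tables: union two, then another two, then the
# results, mirroring the job described above.
dfs = [spark.range(1000) for _ in range(4)]
result1 = dfs[0].unionByName(dfs[1])
result2 = dfs[2].unionByName(dfs[3])
final = result1.unionByName(result2)

start = time.time()
final.explain(mode="extended")  # parsed/analyzed/optimized/physical plans
print(f"explain() took {time.time() - start:.2f}s")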

On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack 
wrote:

> Plus number of unioned tables would be helpful, as well as which
> downstream operations are performed on the unioned tables.
>
> And what "performance issues" do you exactly measure?
>
> Enrico
>
>
>
> On 22.02.23 at 16:50, Mich Talebzadeh wrote:
>
> Hi,
>
> Few details will help
>
>1. Spark version
>2. Spark SQL, Scala or PySpark
>3. size of tables in join.
>4. What does explain() or the joining operation show?
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 22 Feb 2023 at 15:42, Prem Sahoo  wrote:
>
>> Hello Team,
>> We are observing Spark Union performance issues when unioning big tables
>> with lots of rows. Do we have any option apart from the Union ?
>>
>
>


Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
Plus number of unioned tables would be helpful, as well as which 
downstream operations are performed on the unioned tables.


And what "performance issues" do you exactly measure?

Enrico



On 22.02.23 at 16:50, Mich Talebzadeh wrote:

Hi,

Few details will help

 1. Spark version
 2. Spark SQL, Scala or PySpark
 3. size of tables in join.
 4. What does explain() or the joining operation show?


HTH


view my Linkedin profile




https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.




On Wed, 22 Feb 2023 at 15:42, Prem Sahoo  wrote:

Hello Team,
We are observing Spark Union performance issues when unioning big
tables with lots of rows. Do we have any option apart from the Union ?



Re: Spark Union performance issue

2023-02-22 Thread Mich Talebzadeh
Hi,

Few details will help

   1. Spark version
   2. Spark SQL, Scala or PySpark
   3. size of tables in join.
   4. What does explain() or the joining operation show?


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 22 Feb 2023 at 15:42, Prem Sahoo  wrote:

> Hello Team,
> We are observing Spark Union performance issues when unioning big tables
> with lots of rows. Do we have any option apart from the Union ?
>


Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Hello Team,
We are observing Spark Union performance issues when unioning big tables
with lots of rows. Do we have any option apart from the Union ?


Re: Pandas UDF cogroup.applyInPandas with multiple dataframes

2023-02-22 Thread Santosh Pingale
I have opened two PRs:
One that tries to maintain backwards compatibility: 
https://github.com/apache/spark/pull/39902 

One that breaks the API to make it cleaner: 
https://github.com/apache/spark/pull/40122 


Note this API has been marked experimental, so breaking changes are a
possibility at the moment; whether we do that in practice is something we
need to decide.
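
For context, the API these PRs generalize currently accepts exactly two
grouped DataFrames. A minimal sketch of today's two-frame form (toy data and
illustrative names):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
df2 = spark.createDataFrame([(1, 10.0), (2, 20.0)], ("id", "v2"))

def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Called once per key, with the co-grouped rows as pandas DataFrames.
    return pd.merge(left, right, on="id")

result = (
    df1.groupby("id")
    .cogroup(df2.groupby("id"))
    .applyInPandas(merge, schema="id long, v1 double, v2 double")
)
result.show()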

> On 7 Feb 2023, at 22:52, Li Jin  wrote:
> 
> I am not a Spark committer and haven't been working on Spark for a while. 
> However, I was heavily involved in the original cogroup work and we are using 
> cogroup functionality pretty heavily and I want to give my two cents here.
> 
> I think this is a nice improvement and I hope someone from the PySpark side 
> can take a look at this.
> 
> On Mon, Feb 6, 2023 at 5:29 AM Santosh Pingale 
>  wrote:
> Created  a PR: https://github.com/apache/spark/pull/39902 
> 
> 
> 
>> On 24 Jan 2023, at 15:04, Santosh Pingale wrote:
>> 
>> Hey all
>> 
>> I have an interesting problem in hand. We have cases where we want to pass
>> multiple (20 to 30) data frames to the cogroup.applyInPandas function.
>>
>> RDD currently supports cogroup with up to 4 dataframes (ZippedPartitionsRDD4),
>> whereas cogroup with pandas can handle only 2 dataframes (with
>> ZippedPartitionsRDD2). In our use case, we do not have much control over how
>> many data frames we may need in the cogroup.applyInPandas function.
>> 
>> To achieve this, we can:
>> (a) Implement ZippedPartitionsRDD5, 
>> ZippedPartitionsRDD..ZippedPartitionsRDD30..ZippedPartitionsRDD50 with 
>> respective iterators, serializers and so on. This ensures we keep type 
>> safety intact but a lot more boilerplate code has to be written to achieve 
>> this.
>> (b) Do not use cogroup.applyInPandas; rather, use RDD.keyBy.cogroup and then
>> getItem in a nested fashion, converting the data to pandas DataFrames in the
>> Python function (a sketch of this workaround appears after this message).
>> This looks like a good workaround but mistakes can easily happen. We also
>> lose type safety here from the user's point of view.
>> (c) Implement ZippedPartitionsRDDN and NaryLike with the childrenNodes type
>> set to Seq[T], which allows an arbitrary number of children to be set. Here
>> we have very little boilerplate but we sacrifice type safety.
>> (d) ... some new suggestions... ?
>> 
>> I have done preliminary work on option (c). It works like a charm but before 
>> I proceed, is my concern about sacrificed type safety overblown, and do we 
>> have an approach (d)?
>> (a) is too much of an investment for it to be useful. (b)
>> is an okay workaround, but it is not very efficient.
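
A rough sketch of workaround (b) mentioned above (illustrative names; RDD's
groupWith already accepts an arbitrary number of keyed RDDs):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the 20-30 input DataFrames.
dfs = [spark.createDataFrame([(1, float(i))], ("id", f"v{i}")) for i in range(5)]

# Key each RDD by the grouping column, cogroup them all at once, then rebuild
# pandas DataFrames inside the function - no ZippedPartitionsRDDN needed.
keyed = [df.rdd.keyBy(lambda row: row["id"]) for df in dfs]
cogrouped = keyed[0].groupWith(*keyed[1:])  # RDD[(key, (iter_0, ..., iter_n))]

def to_pandas_frames(record):
    key, groups = record
    return key, [pd.DataFrame([r.asDict() for r in g]) for g in groups]

result = cogrouped.map(to_pandas_frames)
print(result.take(1))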
>> 
> 





Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Mridul Muralidharan
Thanks Xinrong !
The signature verifications are fine now ... will continue with testing the
release.


Regards,
Mridul


On Wed, Feb 22, 2023 at 1:27 AM Xinrong Meng 
wrote:

> Hi Mridul,
>
> Would you please try that again? It should work now.
>
> On Wed, Feb 22, 2023 at 2:04 PM Mridul Muralidharan 
> wrote:
>
>>
>> Hi Xinrong,
>>
>>   Was it signed with the same key as present in KEYS [1] ?
>> I am seeing errors with gpg when validating. For example:
>>
>>
>> $ gpg --verify pyspark-3.4.0.tar.gz.asc
>>
>> gpg: assuming signed data in 'pyspark-3.4.0.tar.gz'
>>
>> gpg: Signature made Tue 21 Feb 2023 05:56:05 AM CST
>>
>> gpg: using RSA key
>> CC68B3D16FE33A766705160BA7E57908C7A4E1B1
>>
>> gpg: issuer "xinr...@apache.org"
>>
>> gpg: Can't check signature: No public key
>>
>>
>>
>> Regards,
>> Mridul
>>
>> [1] https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>>
>> On Tue, Feb 21, 2023 at 10:36 PM Xinrong Meng 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.4.0.
>>>
>>> The vote is open until 11:59pm Pacific time *February 27th* and passes
>>> if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v3.4.0-rc1* (commit
>>> e2484f626bb338274665a49078b528365ea18c3b):
>>> https://github.com/apache/spark/tree/v3.4.0-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1435
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/
>>>
>>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>>
>>> This release is using the release script of the tag v3.4.0-rc1.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env, install
>>> the current RC, and see if anything important breaks; in Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.4.0?
>>> ===
>>> The current list of open tickets targeted at 3.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Thanks,
>>> Xinrong Meng
>>>
>>