[jira] [Resolved] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38073. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35396 [https://github.com/apache/spark/pull/35396] > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > This results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)?
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in <module> > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This can easily be fixed by capturing the `sc` instance: > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
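The late-binding pitfall and the capture-based fix from the patch can be reproduced outside Spark. The sketch below uses a hypothetical `FakeContext` class as a stand-in for SparkContext; only the two lambda forms come from the patch itself.

```python
class FakeContext:
    """Hypothetical stand-in for SparkContext, just for illustration."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = FakeContext()

# Problematic form: the lambda resolves the name 'sc' only when it is
# finally called, so if the defining scope has already been torn down
# (as happens before atexit callbacks run under IPython on Python >= 3.8),
# calling it raises NameError.
late_bound = lambda: sc.stop()

# Patched form: the outer lambda is invoked immediately and captures the
# current 'sc' object, so the inner callback holds its own reference and
# no longer depends on the name 'sc' existing at exit time.
captured = (lambda sc: lambda: sc.stop())(sc)

saved = sc
del sc                  # simulate the interpreter clearing the scope

try:
    late_bound()        # fails: the name 'sc' no longer exists
except NameError:
    pass

captured()              # works: the object was captured at registration
assert saved.stopped
```

In `shell.py` the callback is handed to `atexit.register`; here it is called directly instead of exiting the interpreter.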
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487415#comment-17487415 ] Apache Spark commented on SPARK-36665: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/35400 > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators by applying the following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
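The three rewrite rules can be sketched with a toy expression tree. This is a hypothetical Python mini-AST for illustration only; the class and function names do not correspond to Spark's actual Catalyst expressions or to the `BooleanSimplification` rule's implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Not:
    child: Any

@dataclass(frozen=True)
class IsNull:
    child: Any

@dataclass(frozen=True)
class Eq:
    left: Any
    right: Any

@dataclass(frozen=True)
class Ne:
    left: Any
    right: Any

def simplify(e: Any) -> Any:
    # Rule 1: Not(null) is null, so IsNull(Not(x)) can become IsNull(x).
    if isinstance(e, IsNull) and isinstance(e.child, Not):
        return simplify(IsNull(e.child.child))
    # Rule 2: (Not(a) = b) == (a = Not(b)); for a boolean literal b,
    # Not(b) is just the negated literal.
    if isinstance(e, Eq) and isinstance(e.left, Not) and isinstance(e.right, bool):
        return simplify(Eq(e.left.child, not e.right))
    # Rule 3: (a != b) == (a = Not(b)), again folding a boolean literal.
    if isinstance(e, Ne) and isinstance(e.right, bool):
        return simplify(Eq(e.left, not e.right))
    return e

assert simplify(IsNull(Not("col"))) == IsNull("col")
assert simplify(Eq(Not("col"), True)) == Eq("col", False)
assert simplify(Ne("col", True)) == Eq("col", False)
```

Each rule removes one `Not` (or one negated comparison), which is the point of the optimization: fewer operators for downstream rules and codegen to process.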
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487413#comment-17487413 ] Apache Spark commented on SPARK-36665: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/35400 > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators by applying the following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38073: - Assignee: Maciej Szymkiewicz > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > This results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)?
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in <module> > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This can easily be fixed by capturing the `sc` instance: > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38073: -- Issue Type: Bug (was: Improvement) > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > This results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)?
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in <module> > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This can easily be fixed by capturing the `sc` instance: > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version to 1.15
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38082: -- Summary: Update minimum numpy version to 1.15 (was: Update minimum numpy version) > Update minimum numpy version to 1.15 > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}. > However, 1.7 was released almost 9 years ago, and since then some methods > that we use have been deprecated in favor of new additions, and a new API > ({{numpy.typing}}) that is of some interest to us has been added. > We should update the minimum version requirement to one of the following: > - {{>=1.9.0}} ‒ this is the minimum reasonable bound, which will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is a reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
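The `tostring`/`tobytes` point mentioned above is easy to demonstrate: `ndarray.tostring` was merely an alias of `tobytes` and is deprecated in newer NumPy, so code targeting a raised minimum version can call `tobytes` directly. A minimal sketch (plain NumPy, nothing Spark-specific):

```python
import numpy as np

a = np.arange(4, dtype=np.int32)

# tostring() was an alias of tobytes() and emits a DeprecationWarning on
# newer NumPy releases, so serialization code should use tobytes().
raw = a.tobytes()
assert len(raw) == 4 * np.dtype(np.int32).itemsize  # 4 int32 values = 16 bytes

# Round-trip: the byte string can be rebuilt into an identical array.
b = np.frombuffer(raw, dtype=np.int32)
assert (a == b).all()
```

Since `tobytes` exists in all NumPy versions in the proposed range (it was added long before 1.9), the rename is safe regardless of which lower bound is chosen.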
[jira] [Assigned] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38082: - Assignee: Maciej Szymkiewicz > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > > Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}. > However, 1.7 was released almost 9 years ago, and since then some methods > that we use have been deprecated in favor of new additions, and a new API > ({{numpy.typing}}) that is of some interest to us has been added. > We should update the minimum version requirement to one of the following: > - {{>=1.9.0}} ‒ this is the minimum reasonable bound, which will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is a reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38082. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35398 [https://github.com/apache/spark/pull/35398] > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}. > However, 1.7 was released almost 9 years ago, and since then some methods > that we use have been deprecated in favor of new additions, and a new API > ({{numpy.typing}}) that is of some interest to us has been added. > We should update the minimum version requirement to one of the following: > - {{>=1.9.0}} ‒ this is the minimum reasonable bound, which will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is a reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36837) Upgrade Kafka to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36837. --- Fix Version/s: 3.3.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/34089 > Upgrade Kafka to 3.1.0 > -- > > Key: SPARK-36837 > URL: https://issues.apache.org/jira/browse/SPARK-36837 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Kafka 3.1.0 has official Java 17 support. We had better align with it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37408) Inline type hints for python/pyspark/ml/image.py
[ https://issues.apache.org/jira/browse/SPARK-37408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37408: -- Assignee: Maciej Szymkiewicz (was: Apache Spark) > Inline type hints for python/pyspark/ml/image.py > > > Key: SPARK-37408 > URL: https://issues.apache.org/jira/browse/SPARK-37408 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/image.pyi to > python/pyspark/ml/image.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487376#comment-17487376 ] Sean R. Owen commented on SPARK-6305: - Spark doesn't use JDBCAppender, and doesn't use Chainsaw, so I don't believe either of those apply. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Tal Sliwowicz >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.3.0 > > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py
[ https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487359#comment-17487359 ] Apache Spark commented on SPARK-37416: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35399 > Inline type hints for python/pyspark/ml/wrapper.py > -- > > Key: SPARK-37416 > URL: https://issues.apache.org/jira/browse/SPARK-37416 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/wrapper.pyi to > python/pyspark/ml/wrapper.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py
[ https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487358#comment-17487358 ] Apache Spark commented on SPARK-37416: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35399 > Inline type hints for python/pyspark/ml/wrapper.py > -- > > Key: SPARK-37416 > URL: https://issues.apache.org/jira/browse/SPARK-37416 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/wrapper.pyi to > python/pyspark/ml/wrapper.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py
[ https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37416: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/ml/wrapper.py > -- > > Key: SPARK-37416 > URL: https://issues.apache.org/jira/browse/SPARK-37416 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/wrapper.pyi to > python/pyspark/ml/wrapper.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37416) Inline type hints for python/pyspark/ml/wrapper.py
[ https://issues.apache.org/jira/browse/SPARK-37416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37416: Assignee: Apache Spark > Inline type hints for python/pyspark/ml/wrapper.py > -- > > Key: SPARK-37416 > URL: https://issues.apache.org/jira/browse/SPARK-37416 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Inline type hints from python/pyspark/ml/wrapper.pyi to > python/pyspark/ml/wrapper.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36986) Improving external schema management flexibility
[ https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Boavida updated SPARK-36986: Docs Text: Schema management improvements 1 - Retrieving a field name and type from a schema based on its index was: Schema management improvements 1 - Retrieving a field name and type from a schema based on its index 2 - Allowing external DataSet schemas to be provided, as well as their externally generated rows. > Improving external schema management flexibility > > > Key: SPARK-36986 > URL: https://issues.apache.org/jira/browse/SPARK-36986 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rodrigo Boavida >Priority: Major > > Our Spark usage requires us to build an external schema and pass it in when > creating a DataSet. > While working through this, I found a couple of optimizations that would > greatly improve Spark's flexibility in handling external schema management. > Scope: the ability to retrieve a field's name and type in a single call that > returns a tuple including the index. > This means extending the StructType class to support an additional method. > This is what the function would look like: > /** > * Returns the index and field structure by name. > * If it doesn't find it, returns None. > * Avoids two client calls/loops to obtain consolidated field info. > */ > def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = { > val field = nameToField.get(name) > if (field.isDefined) { > Some((fieldIndex(name), field.get)) > } else { > None > } > } > This is particularly useful from an efficiency perspective, when we're > parsing a JSON structure and want to check, for every field, the name and > field type already defined in the schema. > I will create a corresponding branch for PR review, assuming that there are > no concerns with the above proposal.
> -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36986) Improving external schema management flexibility
[ https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Boavida updated SPARK-36986: Description: Our Spark usage requires us to build an external schema and pass it in when creating a DataSet. While working through this, I found a couple of optimizations that would greatly improve Spark's flexibility in handling external schema management. Scope: the ability to retrieve a field's name and type in a single call that returns a tuple including the index. This means extending the StructType class to support an additional method. This is what the function would look like: /** * Returns the index and field structure by name. * If it doesn't find it, returns None. * Avoids two client calls/loops to obtain consolidated field info. */ def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = { val field = nameToField.get(name) if (field.isDefined) { Some((fieldIndex(name), field.get)) } else { None } } This is particularly useful from an efficiency perspective, when we're parsing a JSON structure and want to check, for every field, the name and field type already defined in the schema. I will create a corresponding branch for PR review, assuming that there are no concerns with the above proposal. was: Our Spark usage requires us to build an external schema and pass it in when creating a DataSet. While working through this, I found a couple of optimizations that would greatly improve Spark's flexibility in handling external schema management. 1 - The ability to retrieve a field's name and type in a single call that returns a tuple including the index. This means extending the StructType class to support an additional method. This is what the function would look like: /** * Returns the index and field structure by name. * If it doesn't find it, returns None. * Avoids two client calls/loops to obtain consolidated field info. */ def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = { val field = nameToField.get(name) if (field.isDefined) { Some((fieldIndex(name), field.get)) } else { None } } This is particularly useful from an efficiency perspective, when we're parsing a JSON structure and want to check, for every field, the name and field type already defined in the schema. 2 - Allowing a dataset to be created from a schema, passing the corresponding internal rows whose internal types map to the schema already defined externally. This allows creating Spark fields from any data structure, without depending on Spark's internal conversions (in particular for JSON parsing), and improves performance by skipping the CatalystTypeConverters step of converting native Java types into Spark types. This is what the function would look like: /** * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows * the caller to create the InternalRow set externally, as well as define the schema externally. * * @since 3.3.0 */ def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = { val attributes = schema.toAttributes val plan = LogicalRDD(attributes, data)(self) val qe = sessionState.executePlan(plan) qe.assertAnalyzed() new Dataset[Row](this, plan, RowEncoder(schema)) } This is similar to this function: def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame but it doesn't depend on Spark internally creating the RDD by inferring it from, for example, a JSON structure, which is not useful if we're managing the schema externally. It also skips the Catalyst conversions and the corresponding object overhead, making internal-row generation much more efficient, since it is done explicitly by the caller.
> Improving external schema management flexibility > > > Key: SPARK-36986 > URL: https://issues.apache.org/jira/browse/SPARK-36986 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rodrigo Boavida >Priority: Major > > Our Spark usage requires us to build an external schema and pass it in when > creating a DataSet. > While working through this, I found a couple of optimizations that would > greatly improve Spark's flexibility in handling external schema management. > Scope: the ability to retrieve a field's name and type in a single call that > returns a tuple including the index. > This means extending the StructType class to support an additional method. > This is what the function would look like: > /** > * Returns the index and field structure by name. > * If it doesn't find it, returns None. >
[jira] [Updated] (SPARK-36986) Improving external schema management flexibility
[ https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Boavida updated SPARK-36986: Priority: Major (was: Minor) > Improving external schema management flexibility > > > Key: SPARK-36986 > URL: https://issues.apache.org/jira/browse/SPARK-36986 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rodrigo Boavida >Priority: Major > > Our Spark usage requires us to build an external schema and pass it in when > creating a DataSet. > While working through this, I found a couple of optimizations that would > greatly improve Spark's flexibility in handling external schema management. > 1 - The ability to retrieve a field's name and type in a single call that > returns a tuple including the index. > This means extending the StructType class to support an additional method. > This is what the function would look like: > /** > * Returns the index and field structure by name. > * If it doesn't find it, returns None. > * Avoids two client calls/loops to obtain consolidated field info. > */ > def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = { > val field = nameToField.get(name) > if (field.isDefined) { > Some((fieldIndex(name), field.get)) > } else { > None } > } > This is particularly useful from an efficiency perspective, when we're > parsing a JSON structure and want to check, for every field, the name and > field type already defined in the schema. > > 2 - Allowing a dataset to be created from a schema, passing the > corresponding internal rows whose internal types map to the schema > already defined externally. This allows creating Spark fields from any > data structure, without depending on Spark's internal conversions (in > particular for JSON parsing), and improves performance by skipping the > CatalystTypeConverters step of converting native Java types into Spark types.
> This is what the function would look like: > > /** > * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This > method allows > * the caller to create the InternalRow set externally, as well as define the > schema externally. > * > * @since 3.3.0 > */ > def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = > { val attributes = schema.toAttributes > val plan = LogicalRDD(attributes, data)(self) > val qe = sessionState.executePlan(plan) > qe.assertAnalyzed() > new Dataset[Row](this, plan, RowEncoder(schema)) } > > This is similar to this function: > def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame > but it doesn't depend on Spark internally creating the RDD by inferring it > from, for example, a JSON structure, which is not useful if we're managing the > schema externally. > It also skips the Catalyst conversions and the corresponding object overhead, > making internal-row generation much more efficient, since it is done > explicitly by the caller. > > I will create a corresponding branch for PR review, assuming that there are > no concerns with the above proposals. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
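The single-lookup accessor proposed for StructType is easy to model in plain Python. The classes below are hypothetical stand-ins for Spark's StructField/StructType (not the real PySpark API); the point is that one dictionary, built once, can serve both the index and the field in a single lookup instead of two.

```python
from typing import Iterable, Optional, Tuple

class StructField:
    """Hypothetical stand-in for Spark's StructField."""
    def __init__(self, name: str, dtype: str):
        self.name = name
        self.dtype = dtype

class StructType:
    """Hypothetical stand-in for Spark's StructType."""
    def __init__(self, fields: Iterable[StructField]):
        self.fields = list(fields)
        # One map, built once, holding both index and field per name:
        # the single-lookup equivalent of getIndexAndFieldByName.
        self._by_name = {f.name: (i, f) for i, f in enumerate(self.fields)}

    def get_index_and_field_by_name(
        self, name: str
    ) -> Optional[Tuple[int, StructField]]:
        """Return (index, field) for the given name, or None if absent."""
        return self._by_name.get(name)

schema = StructType([StructField("id", "long"), StructField("name", "string")])

hit = schema.get_index_and_field_by_name("name")
assert hit is not None and hit[0] == 1 and hit[1].dtype == "string"
assert schema.get_index_and_field_by_name("missing") is None
```

This mirrors the efficiency argument in the issue: a per-field lookup during JSON parsing touches one hash map instead of calling `fieldIndex(name)` and `nameToField.get(name)` separately.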
[jira] [Updated] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem
[ https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kk updated SPARK-38115: --- Description: No default spark conf or param to control the '_temporary' path when writing to filesystem. (was: There is default spark conf or param to control the '_temporary' path when writing to filesystem.) > No spark conf to control the path of _temporary when writing to target > filesystem > - > > Key: SPARK-38115 > URL: https://issues.apache.org/jira/browse/SPARK-38115 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.4.8, 3.2.1 >Reporter: kk >Priority: Major > Labels: spark, spark-conf, spark-sql, spark-submit > > No default spark conf or param to control the '_temporary' path when writing > to filesystem. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem
Karthik created SPARK-38115: --- Summary: No spark conf to control the path of _temporary when writing to target filesystem Key: SPARK-38115 URL: https://issues.apache.org/jira/browse/SPARK-38115 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, Spark Shell, Spark Submit Affects Versions: 3.2.1, 2.4.8 Reporter: Karthik There is no default spark conf or param to control the '_temporary' path when writing to the filesystem. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38114) Spark build fails in Windows
[ https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SOUVIK PAUL updated SPARK-38114: Description: java.lang.NoSuchMethodError: org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) A similar issue is being faced by the quarkus project with latest Maven. [https://github.com/quarkusio/quarkus/issues/19491] Upgrading the scala-maven-plugin seems to resolve the issue but this ticket can be a blocker https://issues.apache.org/jira/browse/SPARK-36547 was: java.lang.NoSuchMethodError: org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) A similar issue is being faced by the quarkus project with latest Maven. [https://github.com/quarkusio/quarkus/issues/19491] Upgrading the scala-maven-plugin seems to resolve the issue https://issues.apache.org/jira/browse/SPARK-36547 > Spark build fails in Windows > > > Key: SPARK-38114 > URL: https://issues.apache.org/jira/browse/SPARK-38114 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3 >Reporter: SOUVIK PAUL >Priority: Major > > java.lang.NoSuchMethodError: > org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; > jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) > jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) > > A similar issue is being faced by the quarkus project with latest Maven. 
> [https://github.com/quarkusio/quarkus/issues/19491] > > Upgrading the scala-maven-plugin seems to resolve the issue but this ticket > can be a blocker > https://issues.apache.org/jira/browse/SPARK-36547 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38114) Spark build fails in Windows
[ https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SOUVIK PAUL updated SPARK-38114: Description: java.lang.NoSuchMethodError: org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) A similar issue is being faced by the quarkus project with latest Maven. [https://github.com/quarkusio/quarkus/issues/19491] Upgrading the scala-maven-plugin seems to resolve the issue https://issues.apache.org/jira/browse/SPARK-36547 was: java.lang.NoSuchMethodError: org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) A similar issue is being faced by the quarkus project with latest Maven. https://github.com/quarkusio/quarkus/issues/19491 > Spark build fails in Windows > > > Key: SPARK-38114 > URL: https://issues.apache.org/jira/browse/SPARK-38114 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3 >Reporter: SOUVIK PAUL >Priority: Major > > java.lang.NoSuchMethodError: > org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; > jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) > jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) > > A similar issue is being faced by the quarkus project with latest Maven. > [https://github.com/quarkusio/quarkus/issues/19491] > > Upgrading the scala-maven-plugin seems to resolve the issue > https://issues.apache.org/jira/browse/SPARK-36547 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2
[ https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487287#comment-17487287 ]

John Crowe commented on SPARK-37814:

Did you use some sort of title in your message so that they know you're also a dev and have customers of your own?

Regards,
John Crowe

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
> Issue Type: Umbrella
> Components: Build
> Affects Versions: 3.3.0
> Reporter: L. C. Hsieh
> Assignee: L. C. Hsieh
> Priority: Major
> Labels: releasenotes
> Fix For: 3.3.0
>
> This is umbrella ticket for all tasks related to migrating to log4j2.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487286#comment-17487286 ]

James Inlow commented on SPARK-6305:

As we wait for Spark to be released with log4j v2, how can we know whether Spark is affected by any other, more recent CVEs impacting log4j 1.x? Specifically:
* [https://nvd.nist.gov/vuln/detail/CVE-2022-23307]
* [https://nvd.nist.gov/vuln/detail/CVE-2022-23305]

I am not sure whether this is the correct platform to ask these questions.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 3.3.0
> Reporter: Tal Sliwowicz
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.3.0
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the classpath. Since there are shaded jars, it must be done during the build.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2
[ https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487285#comment-17487285 ] Stephen L. De Rudder commented on SPARK-37814: -- With log4j 1.x line having several CVEs reported against it too; please consider doing one (or both) of the following: * Consider porting this to the 3.2 line and releasing a Spark 3.2.2 to address the log4j CVEs sooner * Consider expediting the 3.3.0 release to address the log4j CVEs Log4j 1.x CVEs info: [logging-log4j1/README.md at main · apache/logging-log4j1 · GitHub|https://github.com/apache/logging-log4j1/blob/main/README.md#unfixed-vulnerabilities] > Migrating from log4j 1 to log4j 2 > - > > Key: SPARK-37814 > URL: https://issues.apache.org/jira/browse/SPARK-37814 > Project: Spark > Issue Type: Umbrella > Components: Build >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > This is umbrella ticket for all tasks related to migrating to log4j2. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487279#comment-17487279 ] James Inlow commented on SPARK-37630: - [~pj.fanning] Thanks, I have seen that Spark has switched to log4jv2, but since the release won't be for a few months, I am looking for how to identify if the current release of Spark is OK as new CVE's are released relating to log4j v1 until that time. > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122]] > > This version has been deprecated and since [then have a known issue that > hasn't been adressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0 which correct all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion
[ https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-38098: Summary: Add support for ArrayType of nested StructType to arrow-based conversion (was: Support Array of Struct for Pandas UDFs) > Add support for ArrayType of nested StructType to arrow-based conversion > > > Key: SPARK-38098 > URL: https://issues.apache.org/jira/browse/SPARK-38098 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Luca Canali >Priority: Minor > > This is to allow Pandas UDFs (and mapInArrow UDFs) to operate on columns of > type Array of Struct via arrow serialization. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion
[ https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-38098: Description: This proposes to add support for ArrayType of nested StructType to arrow-based conversion. This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns of type Array of Struct, via arrow serialization. was:This is to allow Pandas UDFs (and mapInArrow UDFs) to operate on columns of type Array of Struct via arrow serialization. > Add support for ArrayType of nested StructType to arrow-based conversion > > > Key: SPARK-38098 > URL: https://issues.apache.org/jira/browse/SPARK-38098 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Luca Canali >Priority: Minor > > This proposes to add support for ArrayType of nested StructType to > arrow-based conversion. > This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns > of type Array of Struct, via arrow serialization. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487275#comment-17487275 ] PJ Fanning commented on SPARK-37630: [~jinlow] there is little point commenting on this closed issue - please look at https://issues.apache.org/jira/browse/SPARK-6305 - this issue is marked as a duplicate of that and progress has been made on the switch to log4jv2 > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122]] > > This version has been deprecated and since [then have a known issue that > hasn't been adressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0 which correct all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487271#comment-17487271 ]

James Inlow commented on SPARK-37630:
-

How can we know whether Spark is impacted by any other, more recent CVEs affecting log4j 1.x? Specifically:
* [https://nvd.nist.gov/vuln/detail/CVE-2022-23307]
* [https://nvd.nist.gov/vuln/detail/CVE-2022-23305]

I am not sure whether this is the correct platform to ask these questions.

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.2.0
> Reporter: Ismail H
> Priority: Major
> Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>
> This version has been deprecated and since [then have a known issue that hasn't been adressed in 1.X versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>
> *Solution:*
> * Upgrade log4j to version 2.15.0 which correct all known issues. [Last known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]

--
This message was sent by Atlassian Jira (v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38114) Spark build fails in Windows
SOUVIK PAUL created SPARK-38114: --- Summary: Spark build fails in Windows Key: SPARK-38114 URL: https://issues.apache.org/jira/browse/SPARK-38114 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3 Reporter: SOUVIK PAUL java.lang.NoSuchMethodError: org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) A similar issue is being faced by the quarkus project with latest Maven. https://github.com/quarkusio/quarkus/issues/19491 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations
[ https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487258#comment-17487258 ] L. C. Hsieh commented on SPARK-38101: - Thanks for reporting this issue, [~eejbyfeldt]. > MetadataFetchFailedException due to decommission block migrations > - > > Key: SPARK-38101 > URL: https://issues.apache.org/jira/browse/SPARK-38101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.3.0, 3.2.2 >Reporter: Emil Ejbyfeldt >Priority: Major > > As noted in SPARK-34939 there is race when using broadcast for map output > status. Explanation from SPARK-34939 > > After map statuses are broadcasted and the executors obtain serialized > > broadcasted map statuses. If any fetch failure happens after, Spark > > scheduler invalidates cached map statuses and destroy broadcasted value of > > the map statuses. Then any executor trying to deserialize serialized > > broadcasted map statuses and access broadcasted value, IOException will be > > thrown. Currently we don't catch it in MapOutputTrackerWorker and above > > exception will fail the application. > But if running with `spark.decommission.enabled=true` and > `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way > to hit this race, when a node is decommissioning and the shuffle blocks are > migrated. After a block has been migrated an update will be sent to the > driver for each block and the map output caches will be invalidated. 
> Here are a driver when we hit the race condition running with spark 3.2.0: > {code:java} > 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as > values in memory (estimated size 5.5 MiB, free 11.0 GiB) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None) > 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for > 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None) > 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for > 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None) > 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for > 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None) > 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 > stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB) > 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added > broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 > MiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 > stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB) > 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added > broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: > 1520.4 KiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast 
outputstatuses > size = 416, actual size = 5747443 > 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for > 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None) > 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying > Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on > disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed > broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 > MiB, free: 11.0 GiB) > {code} > While the Broadcast is being constructed we have updates coming in and the > broadcast is destroyed almost immediately. On this particular job we ended up > hitting the race condition a lot of times and it caused ~18 task failures and > stage retries within 20 seconds causing us to hit our stage retry limit and > the job to fail. > As far I understand this was the expected behavior for handling this case > after SPARK-34939. But it seems like when combined with decommissioning > hitting the race is a bit too common. > We have observed this behavior running 3.
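The failure mode described in the report reduces to a simple ordering problem, which can be sketched as a toy model (a simplified illustration in plain Python, not Spark's actual Broadcast/MapOutputTracker classes; all names here are made up): the driver destroys the broadcast between the moment an executor obtains the serialized reference and the moment it reads the value.

```python
# Toy model of the race: a broadcast whose value becomes unreadable once the
# driver destroys it. Class and message names are hypothetical.
class ToyBroadcast:
    def __init__(self, value):
        self._value = value
        self._destroyed = False

    def destroy(self):
        # Driver side: a migrated-block update invalidates the cached map
        # statuses and destroys the broadcast almost immediately.
        self._destroyed = True
        self._value = None

    @property
    def value(self):
        if self._destroyed:
            # Executor side: this is what surfaces as the fetch failure.
            raise IOError("Attempted to use destroyed broadcast")
        return self._value


bcast = ToyBroadcast({"shuffle_27": "serialized map statuses"})
bcast.destroy()      # block-migration update arrives; driver tears it down
try:
    bcast.value      # executor dereferences the broadcast too late
except IOError as exc:
    print("executor sees:", exc)
```

With decommissioning enabled, every migrated shuffle block triggers such an invalidation, so the destroy step fires far more often than with plain fetch failures, making the window much easier to hit.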
[jira] [Updated] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations
[ https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38101: -- Affects Version/s: 3.3.0 (was: 3.3) > MetadataFetchFailedException due to decommission block migrations > - > > Key: SPARK-38101 > URL: https://issues.apache.org/jira/browse/SPARK-38101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.3.0, 3.2.2 >Reporter: Emil Ejbyfeldt >Priority: Major > > As noted in SPARK-34939 there is race when using broadcast for map output > status. Explanation from SPARK-34939 > > After map statuses are broadcasted and the executors obtain serialized > > broadcasted map statuses. If any fetch failure happens after, Spark > > scheduler invalidates cached map statuses and destroy broadcasted value of > > the map statuses. Then any executor trying to deserialize serialized > > broadcasted map statuses and access broadcasted value, IOException will be > > thrown. Currently we don't catch it in MapOutputTrackerWorker and above > > exception will fail the application. > But if running with `spark.decommission.enabled=true` and > `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way > to hit this race, when a node is decommissioning and the shuffle blocks are > migrated. After a block has been migrated an update will be sent to the > driver for each block and the map output caches will be invalidated. 
> Here are a driver when we hit the race condition running with spark 3.2.0: > {code:java} > 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as > values in memory (estimated size 5.5 MiB, free 11.0 GiB) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None) > 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for > 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None) > 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for > 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None) > 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for > 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None) > 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 > stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB) > 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added > broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 > MiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 > stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB) > 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added > broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: > 1520.4 KiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast 
outputstatuses > size = 416, actual size = 5747443 > 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for > 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None) > 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying > Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on > disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed > broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 > MiB, free: 11.0 GiB) > {code} > While the Broadcast is being constructed we have updates coming in and the > broadcast is destroyed almost immediately. On this particular job we ended up > hitting the race condition a lot of times and it caused ~18 task failures and > stage retries within 20 seconds causing us to hit our stage retry limit and > the job to fail. > As far I understand this was the expected behavior for handling this case > after SPARK-34939. But it seems like when combined with decommissioning > hitting the race is a bit too common. > We have observed this behavior running 3.2.0 and 3.2.1, but I think other > c
[jira] [Commented] (SPARK-38101) MetadataFetchFailedException due to decommission block migrations
[ https://issues.apache.org/jira/browse/SPARK-38101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487246#comment-17487246 ] Dongjoon Hyun commented on SPARK-38101: --- Thank you for filing a JIRA, [~eejbyfeldt]. > MetadataFetchFailedException due to decommission block migrations > - > > Key: SPARK-38101 > URL: https://issues.apache.org/jira/browse/SPARK-38101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.1.3, 3.2.1, 3.2.2, 3.3 >Reporter: Emil Ejbyfeldt >Priority: Major > > As noted in SPARK-34939 there is race when using broadcast for map output > status. Explanation from SPARK-34939 > > After map statuses are broadcasted and the executors obtain serialized > > broadcasted map statuses. If any fetch failure happens after, Spark > > scheduler invalidates cached map statuses and destroy broadcasted value of > > the map statuses. Then any executor trying to deserialize serialized > > broadcasted map statuses and access broadcasted value, IOException will be > > thrown. Currently we don't catch it in MapOutputTrackerWorker and above > > exception will fail the application. > But if running with `spark.decommission.enabled=true` and > `spark.storage.decommission.shuffleBlocks.enabled=true` there is another way > to hit this race, when a node is decommissioning and the shuffle blocks are > migrated. After a block has been migrated an update will be sent to the > driver for each block and the map output caches will be invalidated. 
> Here are a driver when we hit the race condition running with spark 3.2.0: > {code:java} > 2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 stored as > values in memory (estimated size 5.5 MiB, free 11.0 GiB) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, None) > 2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output for > 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, None) > 2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output for > 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, None) > 2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output for > 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, None) > 2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output for > 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None) > 2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output for > 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, None) > 2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block broadcast_27_piece0 > stored as bytes in memory (estimated size 4.0 MiB, free 10.9 GiB) > 2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added > broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 (size: 4.0 > MiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block broadcast_27_piece1 > stored as bytes in memory (estimated size 1520.4 KiB, free 10.9 GiB) > 2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added > broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 (size: > 1520.4 KiB, free: 11.0 GiB) > 2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast 
outputstatuses > size = 416, actual size = 5747443 > 2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output for > 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, None) > 2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying > Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added rdd_65_20310 on > disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB) > 2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed > broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory (size: 4.0 > MiB, free: 11.0 GiB) > {code} > While the Broadcast is being constructed we have updates coming in and the > broadcast is destroyed almost immediately. On this particular job we ended up > hitting the race condition a lot of times and it caused ~18 task failures and > stage retries within 20 seconds causing us to hit our stage retry limit and > the job to fail. > As far I understand this was the expected behavior for handling this case > after SPARK-34939. But it seems like when combined with decommissioning > hitting the race is a bit too common. > We have observed this behavior running 3.2.0
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487244#comment-17487244 ]

Dongjoon Hyun commented on SPARK-6305:
--

This is part of Apache Spark 3.3, and we will start voting on the release candidate in April. Please see the community release plan:
- https://spark.apache.org/versioning-policy.html

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 3.3.0
> Reporter: Tal Sliwowicz
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.3.0
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the classpath. Since there are shaded jars, it must be done during the build.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487235#comment-17487235 ] Anton Okolnychyi commented on SPARK-36665: -- Thanks for the prompt reply, [~kazuyukitanimura]! > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators applying following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
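The three rewrite rules quoted in the issue can be checked mechanically with a toy three-valued model (an illustrative sketch in plain Python with None standing in for SQL NULL; this is not Spark's optimizer code). Note that Python's two-valued `==` cannot model SQL's NULL-propagating equality, so the rules involving `=` and `!=` are checked on non-null operands only:

```python
def not3(v):
    """Three-valued (Kleene) NOT: NOT(NULL) is NULL."""
    return None if v is None else (not v)


def is_null(v):
    return v is None


for e in (True, False, None):
    # Rule 1: Not(null) == null, hence IsNull(Not(e)) can be IsNull(e)
    assert is_null(not3(e)) == is_null(e)

for e in (True, False):          # non-null operands only; see note above
    for b in (True, False):
        # Rule 2: (Not(e) = b) == (e = Not(b)), e.g. Not(e) = true -> e = false
        assert (not3(e) == b) == (e == not3(b))
        # Rule 3: (e != b) == (e = Not(b)), e.g. e != true -> e = false
        assert (e != b) == (e == not3(b))

print("all identities hold")
```

Enumerating the truth table like this is essentially what makes such rules safe for BooleanSimplification: each rewrite must agree with the original expression for every combination of operand values, including NULL.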
[jira] [Created] (SPARK-38113) Use error classes in the execution errors of pivoting
Max Gekk created SPARK-38113:
-

Summary: Use error classes in the execution errors of pivoting
Key: SPARK-38113
URL: https://issues.apache.org/jira/browse/SPARK-38113
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk

Migrate the following errors in QueryExecutionErrors onto the use of error classes:
* repeatedPivotsUnsupportedError
* pivotNotAfterGroupByUnsupportedError

Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
Max Gekk created SPARK-38112: Summary: Use error classes in the execution errors of date/timestamp handling Key: SPARK-38112 URL: https://issues.apache.org/jira/browse/SPARK-38112 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryExecutionErrors: * sparkUpgradeInReadingDatesError * sparkUpgradeInWritingDatesError * timeZoneIdNotSpecifiedForTimestampTypeError * cannotConvertOrcTimestampToTimestampNTZError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487222#comment-17487222 ] Kazuyuki Tanimura commented on SPARK-36665: --- Understood, thank you [~aokolnychyi] I am preparing a fix. I am sorry for the inconvenience. > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators applying following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487200#comment-17487200 ] Anton Okolnychyi edited comment on SPARK-36665 at 2/4/22, 5:47 PM: --- [~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate subquery but it is IN subquery instead of NOT IN. In SQL, NOT IN subqueries have to be treated in a special way. As a result, we are getting wrong query results right now. was (Author: aokolnychyi): [~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate subquery but it is IN subquery instead of NOT IN. In SQL, NOT IN subquery have to be treated in a special way. As a result, we are getting wrong query results right now. > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators applying following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487200#comment-17487200 ] Anton Okolnychyi commented on SPARK-36665: -- [~kazuyukitanimura] {{RewritePredicateSubquery}} still rewrites the predicate subquery but it is an IN subquery instead of NOT IN. In SQL, NOT IN subqueries have to be treated in a special way. As a result, we are getting wrong query results right now. > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > Attachments: Pasted Graphic 3.png > > > {{BooleanSimplification should be able to do more simplifications for Not > operators applying following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
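The three simplification rules quoted in the issue description can be checked exhaustively under SQL's three-valued logic. Below is a minimal Python model with None standing in for SQL NULL; it is an illustrative check of the rules, not Spark's BooleanSimplification code:

```python
# Three-valued (Kleene) logic model: None plays the role of SQL NULL.

def not3(x):
    """SQL NOT: NULL stays NULL."""
    return None if x is None else (not x)

def eq3(a, b):
    """SQL '=': any NULL operand yields NULL."""
    return None if a is None or b is None else (a == b)

def ne3(a, b):
    """SQL '!=': the negation of '=' under three-valued logic."""
    return not3(eq3(a, b))

VALS = [True, False, None]
for a in VALS:
    # Rule 1: Not(null) == null, hence IsNull(Not(x)) can become IsNull(x).
    assert (not3(a) is None) == (a is None)
    for b in VALS:
        # Rule 2: (Not(a) = b) == (a = Not(b)),
        # e.g. Not(...) = true can become (...) = false.
        assert eq3(not3(a), b) == eq3(a, not3(b))
        # Rule 3: (a != b) == (a = Not(b)),
        # e.g. (...) != true can become (...) = false.
        assert ne3(a, b) == eq3(a, not3(b))
```

The exhaustive loop over {true, false, null} is what makes the rewrite safe to apply: both sides agree on every input, including NULLs.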
[jira] [Updated] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches
[ https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabien updated SPARK-38111: --- Labels: arrow (was: ) > Retrieve a Spark dataframe as Arrow batches > --- > > Key: SPARK-38111 > URL: https://issues.apache.org/jira/browse/SPARK-38111 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 3.2.0 > Environment: Java 11 > Spark 3 >Reporter: Fabien >Priority: Minor > Labels: arrow > > Using the Java API, is there a way to efficiently retrieve a dataframe as > Arrow batches ? > I have a pretty large dataset on my cluster so I cannot collect it using > [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--] > which download every thing at once and saturate my JVM memory > Seeing that Arrow is becoming a standard to transfer large datasets and that > Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with > Arrow batches ? > This would be ideal to process the data batch per batch and avoid saturating > the memory. > > I am looking for an API like this (in Java) > > {code:java} > var stream = dataframe.collectAsArrowStream() > while (stream.hasNextBatch()) { > var batch = stream.getNextBatch() > // do some stuff with the arrow batch > } > {code} > It would be even better if I can split the dataframe into several streams so > I can download and process it in parallel -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches
[ https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabien updated SPARK-38111: --- Description: Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches ? I have a pretty large dataset on my cluster so I cannot collect it using [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--] which download every thing at once and saturate my JVM memory Seeing that Arrow is becoming a standard to transfer large datasets and that Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with Arrow batches ? This would be ideal to process the data batch per batch and avoid saturating the memory. I am looking for an API like this (in Java) {code:java} var stream = dataframe.collectAsArrowStream() while (stream.hasNextBatch()) { var batch = stream.getNextBatch() // do some stuff with the arrow batch } {code} It would be even better if I can split the dataframe into several streams so I can download and process it in parallel was: Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches ? I have a pretty large dataset on my cluster so I cannot collect it using [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--] which download every thing at once and saturate the my JVM memory Seeing that Arrow is becoming a standard to transfer large datasets and that Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with Arrow batches ? This would be ideal to process the data batch per batch and avoid saturating the memory. 
I am looking for an API like this (in Java) {code:java} var stream = dataframe.collectAsArrowStream() while (stream.hasNextBatch()) { var batch = stream.getNextBatch() // do some stuff with the arrow batch } {code} It would be even better if I can split the dataframe into several streams so I can download and process it in parallel > Retrieve a Spark dataframe as Arrow batches > --- > > Key: SPARK-38111 > URL: https://issues.apache.org/jira/browse/SPARK-38111 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 3.2.0 > Environment: Java 11 > Spark 3 >Reporter: Fabien >Priority: Minor > > Using the Java API, is there a way to efficiently retrieve a dataframe as > Arrow batches ? > I have a pretty large dataset on my cluster so I cannot collect it using > [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--] > which download every thing at once and saturate my JVM memory > Seeing that Arrow is becoming a standard to transfer large datasets and that > Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with > Arrow batches ? > This would be ideal to process the data batch per batch and avoid saturating > the memory. > > I am looking for an API like this (in Java) > > {code:java} > var stream = dataframe.collectAsArrowStream() > while (stream.hasNextBatch()) { > var batch = stream.getNextBatch() > // do some stuff with the arrow batch > } > {code} > It would be even better if I can split the dataframe into several streams so > I can download and process it in parallel -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches
Fabien created SPARK-38111: -- Summary: Retrieve a Spark dataframe as Arrow batches Key: SPARK-38111 URL: https://issues.apache.org/jira/browse/SPARK-38111 Project: Spark Issue Type: Question Components: Java API Affects Versions: 3.2.0 Environment: Java 11 Spark 3 Reporter: Fabien Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches? I have a pretty large dataset on my cluster, so I cannot collect it using [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--], which downloads everything at once and saturates my JVM memory. Seeing that Arrow is becoming a standard for transferring large datasets and that Spark already uses Arrow heavily, is there a way to transfer my Spark dataframe as Arrow batches? This would be ideal for processing the data batch by batch without saturating memory. I am looking for an API like this (in Java) {code:java} var stream = dataframe.collectAsArrowStream() while (stream.hasNextBatch()) { var batch = stream.getNextBatch() // do some stuff with the arrow batch } {code} It would be even better if I could split the dataframe into several streams so I can download and process them in parallel -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
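The `collectAsArrowStream` API in the report is the reporter's wished-for interface, not an existing one. In the meantime, batch-by-batch processing can be approximated by chunking a row iterator that does not materialize the whole dataset, such as the one returned by PySpark's `DataFrame.toLocalIterator()`. The chunking helper below is plain Python; feeding it a Spark iterator is the assumed usage, not something the ticket confirms:

```python
# Chunk any iterable into fixed-size batches without holding the whole
# dataset in memory at once.

def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Assumed usage with Spark (not executed here):
#   for batch in iter_batches(df.toLocalIterator(), 10_000):
#       process(batch)
```

This trades throughput for bounded driver memory: each batch is fetched, processed, and discarded before the next one is pulled.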
[jira] [Assigned] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38082: Assignee: (was: Apache Spark) > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}. > However, 1.7 was released almost 9 years ago, and since then some methods > that we use have been deprecated in favor of new additions, and a new API > ({{numpy.typing}}), which is of some interest to us, has been added. > We should update the minimum version requirement to one of the following: > - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38082: Assignee: Apache Spark > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released almost 9 years ago and since then some methods > that we use have been deprecated in favor of new additions and anew API > ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to one of the following > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487180#comment-17487180 ] Apache Spark commented on SPARK-38082: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35398 > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released almost 9 years ago and since then some methods > that we use have been deprecated in favor of new additions and anew API > ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to one of the following > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38110) Use error classes in the compilation errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38110: - Description: Migrate the following errors in QueryCompilationErrors: * windowSpecificationNotDefinedError * windowAggregateFunctionWithFilterNotSupportedError * windowFunctionInsideAggregateFunctionNotAllowedError * expressionWithoutWindowExpressionError * expressionWithMultiWindowExpressionsError * windowFunctionNotAllowedError * cannotSpecifyWindowFrameError * windowFrameNotMatchRequiredFrameError * windowFunctionWithWindowFrameNotOrderedError * multiTimeWindowExpressionsNotSupportedError * sessionWindowGapDurationDataTypeError * invalidLiteralForWindowDurationError * emptyWindowExpressionError * foundDifferentWindowFunctionTypeError onto use error classes. Throw an implementation of SparkThrowable. Also write a test per every error in QueryCompilationErrorsSuite. *Feel free to split this to sub-tasks.* was: Migrate the following errors in QueryCompilationErrors: * windowSpecificationNotDefinedError * windowAggregateFunctionWithFilterNotSupportedError * windowFunctionInsideAggregateFunctionNotAllowedError * expressionWithoutWindowExpressionError * expressionWithMultiWindowExpressionsError * windowFunctionNotAllowedError * cannotSpecifyWindowFrameError * windowFrameNotMatchRequiredFrameError * windowFunctionWithWindowFrameNotOrderedError * multiTimeWindowExpressionsNotSupportedError * sessionWindowGapDurationDataTypeError * invalidLiteralForWindowDurationError * emptyWindowExpressionError * foundDifferentWindowFunctionTypeError onto use error classes. Throw an implementation of SparkThrowable. Also write a test per every error in QueryCompilationErrorsSuite. 
> Use error classes in the compilation errors of windows > -- > > Key: SPARK-38110 > URL: https://issues.apache.org/jira/browse/SPARK-38110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * windowSpecificationNotDefinedError > * windowAggregateFunctionWithFilterNotSupportedError > * windowFunctionInsideAggregateFunctionNotAllowedError > * expressionWithoutWindowExpressionError > * expressionWithMultiWindowExpressionsError > * windowFunctionNotAllowedError > * cannotSpecifyWindowFrameError > * windowFrameNotMatchRequiredFrameError > * windowFunctionWithWindowFrameNotOrderedError > * multiTimeWindowExpressionsNotSupportedError > * sessionWindowGapDurationDataTypeError > * invalidLiteralForWindowDurationError > * emptyWindowExpressionError > * foundDifferentWindowFunctionTypeError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryCompilationErrorsSuite. > *Feel free to split this to sub-tasks.* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38110) Use error classes in the compilation errors of windows
Max Gekk created SPARK-38110: Summary: Use error classes in the compilation errors of windows Key: SPARK-38110 URL: https://issues.apache.org/jira/browse/SPARK-38110 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryCompilationErrors: * windowSpecificationNotDefinedError * windowAggregateFunctionWithFilterNotSupportedError * windowFunctionInsideAggregateFunctionNotAllowedError * expressionWithoutWindowExpressionError * expressionWithMultiWindowExpressionsError * windowFunctionNotAllowedError * cannotSpecifyWindowFrameError * windowFrameNotMatchRequiredFrameError * windowFunctionWithWindowFrameNotOrderedError * multiTimeWindowExpressionsNotSupportedError * sessionWindowGapDurationDataTypeError * invalidLiteralForWindowDurationError * emptyWindowExpressionError * foundDifferentWindowFunctionTypeError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1
[ https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ss updated SPARK-38109: --- Description: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. Minimal example: {{replace_dict = \{'wrong': 'right'}}} {{df = spark.createDataFrame(}} {{ [['wrong', 'wrong']], }} {{ schema=['case_matched', 'case_unmatched']}} {{)}} {{df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])}} In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. was: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. 
Minimal example: {{ replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) }} In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. > pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 > but not in 3.1 > -- > > Key: SPARK-38109 > URL: https://issues.apache.org/jira/browse/SPARK-38109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0, 3.2.1 >Reporter: ss >Priority: Minor > > The `subset` argument for `DataFrame.replace()` accepts one or more column > names. In pyspark 3.2 the case of the column names must match the column > names in the schema exactly or the replacements will not take place. In > earlier versions (3.1.2 was tested) the argument is case insensitive. > Minimal example: > {{replace_dict = \{'wrong': 'right'}}} > {{df = spark.createDataFrame(}} > {{ [['wrong', 'wrong']], }} > {{ schema=['case_matched', 'case_unmatched']}} > {{)}} > {{df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])}} > > In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|wrong| > While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|right| > I believe the expected behaviour is that shown in pyspark 3.1 as in all other > situations column names are accepted in a case insensitive manner. 
[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1
[ https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ss updated SPARK-38109: --- Description: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. Minimal example: {{ replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) }} In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. was: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. 
Minimal example: ```python replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) ``` In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. > pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 > but not in 3.1 > -- > > Key: SPARK-38109 > URL: https://issues.apache.org/jira/browse/SPARK-38109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0, 3.2.1 >Reporter: ss >Priority: Minor > > The `subset` argument for `DataFrame.replace()` accepts one or more column > names. In pyspark 3.2 the case of the column names must match the column > names in the schema exactly or the replacements will not take place. In > earlier versions (3.1.2 was tested) the argument is case insensitive. > Minimal example: > {{ > replace_dict = {'wrong': 'right'} > df = spark.createDataFrame( > [['wrong', 'wrong']], > schema=['case_matched', 'case_unmatched'] > ) > df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) > }} > In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|wrong| > While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|right| > I believe the expected behaviour is that shown in pyspark 3.1 as in all other > situations column names are accepted in a case insensitive manner. 
[jira] [Created] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1
ss created SPARK-38109: -- Summary: pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1 Key: SPARK-38109 URL: https://issues.apache.org/jira/browse/SPARK-38109 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1, 3.2.0 Reporter: ss The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. Minimal example: ```python replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) ``` In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |-|-| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |-|-| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
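A possible workaround for the 3.2 behaviour, assuming the case-insensitive resolution of 3.1 is what is wanted: resolve the requested subset names against `df.columns` before calling `replace()`. The `resolve_subset` helper below is a hypothetical sketch, not part of the PySpark API:

```python
# Hypothetical helper: map requested subset names onto the DataFrame's actual
# column names, ignoring case. Names with no case-insensitive match are
# passed through unchanged so replace() can raise its usual error for them.

def resolve_subset(columns, subset):
    """Return subset with each name replaced by its case-insensitive match."""
    lookup = {c.lower(): c for c in columns}
    return [lookup.get(s.lower(), s) for s in subset]

# Assumed usage (not executed here):
#   df2 = df.replace(replace_dict,
#                    subset=resolve_subset(df.columns, ['Case_Unmatched']))
```

This normalization only touches the `subset` argument, so it is safe regardless of which resolution behaviour a given Spark version implements.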
[jira] [Updated] (SPARK-38109) pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1
[ https://issues.apache.org/jira/browse/SPARK-38109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ss updated SPARK-38109: --- Description: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. Minimal example: ```python replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) ``` In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. was: The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the case of the column names must match the column names in the schema exactly or the replacements will not take place. In earlier versions (3.1.2 was tested) the argument is case insensitive. 
Minimal example: ```python replace_dict = {'wrong': 'right'} df = spark.createDataFrame( [['wrong', 'wrong']], schema=['case_matched', 'case_unmatched'] ) df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) ``` In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is: |case_matched|case_unmatched| |-|-| |right|wrong| While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is: |case_matched|case_unmatched| |-|-| |right|right| I believe the expected behaviour is that shown in pyspark 3.1 as in all other situations column names are accepted in a case insensitive manner. > pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 > but not in 3.1 > -- > > Key: SPARK-38109 > URL: https://issues.apache.org/jira/browse/SPARK-38109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0, 3.2.1 >Reporter: ss >Priority: Minor > > The `subset` argument for `DataFrame.replace()` accepts one or more column > names. In pyspark 3.2 the case of the column names must match the column > names in the schema exactly or the replacements will not take place. In > earlier versions (3.1.2 was tested) the argument is case insensitive. > Minimal example: > ```python > replace_dict = {'wrong': 'right'} > df = spark.createDataFrame( > [['wrong', 'wrong']], > schema=['case_matched', 'case_unmatched'] > ) > df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched']) > ``` > In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|wrong| > While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on > Databricks) the result is: > |case_matched|case_unmatched| > |right|right| > I believe the expected behaviour is that shown in pyspark 3.1 as in all other > situations column names are accepted in a case insensitive manner. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
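The case-insensitive resolution the reporter expects can be checked without Spark. The helper below is a hypothetical sketch (names `resolve_subset` and `schema` are illustrative, not PySpark API); in real Spark the resolution happens in the JVM analyzer and is governed by `spark.sql.caseSensitive`, which defaults to false:

```python
def resolve_subset(subset, schema_columns):
    """Resolve user-supplied column names against the schema, ignoring case,
    mirroring the behaviour the reporter observed in pyspark 3.1."""
    lookup = {name.lower(): name for name in schema_columns}
    return [lookup[name.lower()] for name in subset if name.lower() in lookup]

schema = ['case_matched', 'case_unmatched']
requested = ['case_matched', 'Case_Unmatched']

# Case-insensitive resolution (the 3.1 behaviour): both columns match.
assert resolve_subset(requested, schema) == schema

# Exact-match resolution (the behaviour reported against 3.2):
# the name with mismatched case silently drops out of the subset.
exact = [c for c in requested if c in schema]
assert exact == ['case_matched']
```

This also illustrates why the 3.2 behaviour is surprising: the mismatched column is silently skipped rather than rejected with an error.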
[jira] [Commented] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487170#comment-17487170 ] Max Gekk commented on SPARK-38107: -- [~hyukjin.kwon] Do you know someone who could be interested in implementing this? > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > * usePythonUDFInJoinConditionUnsupportedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38107: - Summary: Use error classes in the compilation errors of python/pandas UDFs (was: Use error classes in the compilation errors of pandas UDFs) > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38107: - Description: Migrate the following errors in QueryCompilationErrors: * pandasUDFAggregateNotSupportedInPivotError * groupAggPandasUDFUnsupportedByStreamingAggError * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError * usePythonUDFInJoinConditionUnsupportedError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite. was: Migrate the following errors in QueryCompilationErrors: * pandasUDFAggregateNotSupportedInPivotError * groupAggPandasUDFUnsupportedByStreamingAggError * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite. > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > * usePythonUDFInJoinConditionUnsupportedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38108) Use error classes in the compilation errors of UDF/UDAF
Max Gekk created SPARK-38108: Summary: Use error classes in the compilation errors of UDF/UDAF Key: SPARK-38108 URL: https://issues.apache.org/jira/browse/SPARK-38108 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryCompilationErrors: * noHandlerForUDAFError * unexpectedEvalTypesForUDFsError * usingUntypedScalaUDFError * udfClassDoesNotImplementAnyUDFInterfaceError * udfClassNotAllowedToImplementMultiUDFInterfacesError * udfClassWithTooManyTypeArgumentsError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38107) Use error classes in the compilation errors of pandas UDFs
Max Gekk created SPARK-38107: Summary: Use error classes in the compilation errors of pandas UDFs Key: SPARK-38107 URL: https://issues.apache.org/jira/browse/SPARK-38107 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryCompilationErrors: * pandasUDFAggregateNotSupportedInPivotError * groupAggPandasUDFUnsupportedByStreamingAggError * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487154#comment-17487154 ] Apache Spark commented on SPARK-38102: -- User 'ocworld' has created a pull request for this issue: https://github.com/apache/spark/pull/35397 > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is > meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. > [https://spark.apache.org/docs/latest/cloud-integration.html] > > A custom commitProtocolClass should be supported via an option when using > saveAsNewAPIHadoopDataset. For example, > {code:java} > spark.hadoop.mapreduce.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487153#comment-17487153 ] Apache Spark commented on SPARK-38102: -- User 'ocworld' has created a pull request for this issue: https://github.com/apache/spark/pull/35397 > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is > meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. > [https://spark.apache.org/docs/latest/cloud-integration.html] > > A custom commitProtocolClass should be supported via an option when using > saveAsNewAPIHadoopDataset. For example, > {code:java} > spark.hadoop.mapreduce.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38102: Assignee: (was: Apache Spark) > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is > meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. > [https://spark.apache.org/docs/latest/cloud-integration.html] > > A custom commitProtocolClass should be supported via an option when using > saveAsNewAPIHadoopDataset. For example, > {code:java} > spark.hadoop.mapreduce.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38102: Assignee: Apache Spark > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Assignee: Apache Spark >Priority: Major > > There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is > meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. > [https://spark.apache.org/docs/latest/cloud-integration.html] > > A custom commitProtocolClass should be supported via an option when using > saveAsNewAPIHadoopDataset. For example, > {code:java} > spark.hadoop.mapreduce.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keunhyun Oh updated SPARK-38102: Description: There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. [https://spark.apache.org/docs/latest/cloud-integration.html] A custom commitProtocolClass should be supported via an option when using saveAsNewAPIHadoopDataset. For example, {code:java} spark.hadoop.mapreduce.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} was: A custom commitProtocolClass should be supported when using saveAsNewAPIHadoopDataset, because there is no way to apply spark-hadoop-cloud's commitProtocolClass, which is meant to avoid object storage problems. > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > There is no way to apply spark-hadoop-cloud's commitProtocolClass, which is > meant to avoid object storage problems, when using saveAsNewAPIHadoopDataset. > [https://spark.apache.org/docs/latest/cloud-integration.html] > > A custom commitProtocolClass should be supported via an option when using > saveAsNewAPIHadoopDataset. For example, > {code:java} > spark.hadoop.mapreduce.sources.commitProtocolClass > org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
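As a sketch of how the proposed option would be passed, the key/value pair below is quoted from the issue; whether saveAsNewAPIHadoopDataset actually honors it is precisely the improvement being requested, so treat this as a hypothetical configuration, not current Spark behavior:

```python
# Hypothetical configuration entries, normally applied via SparkConf.set(...)
# or `spark-submit --conf` before calling saveAsNewAPIHadoopDataset.
# The commitProtocolClass key/value is quoted from this issue.
cloud_committer_conf = {
    "spark.hadoop.mapreduce.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}

# Sanity check: each entry is a well-formed dotted config key with a value.
for key, value in cloud_committer_conf.items():
    assert key.startswith("spark.") and value
```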
[jira] [Created] (SPARK-38106) Use error classes in the parsing errors of functions
Max Gekk created SPARK-38106: Summary: Use error classes in the parsing errors of functions Key: SPARK-38106 URL: https://issues.apache.org/jira/browse/SPARK-38106 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryParsingErrors: * functionNameUnsupportedError * showFunctionsUnsupportedError * showFunctionsInvalidPatternError * createFuncWithBothIfNotExistsAndReplaceError * defineTempFuncWithIfNotExistsError * unsupportedFunctionNameError * specifyingDBInCreateTempFuncError * invalidNameForDropTempFunc to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38105) Use error classes in the parsing errors of joins
Max Gekk created SPARK-38105: Summary: Use error classes in the parsing errors of joins Key: SPARK-38105 URL: https://issues.apache.org/jira/browse/SPARK-38105 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryParsingErrors: * joinCriteriaUnimplementedError * naturalCrossJoinUnsupportedError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38104) Use error classes in the parsing errors of windows
Max Gekk created SPARK-38104: Summary: Use error classes in the parsing errors of windows Key: SPARK-38104 URL: https://issues.apache.org/jira/browse/SPARK-38104 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryParsingErrors: * repetitiveWindowDefinitionError * invalidWindowReferenceError * cannotResolveWindowReferenceError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38103) Use error classes in the parsing errors of transform
Max Gekk created SPARK-38103: Summary: Use error classes in the parsing errors of transform Key: SPARK-38103 URL: https://issues.apache.org/jira/browse/SPARK-38103 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Migrate the following errors in QueryParsingErrors: * transformNotSupportQuantifierError * transformWithSerdeUnsupportedError * tooManyArgumentsForTransformError * notEnoughArgumentsForTransformError * invalidTransformArgumentError to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487134#comment-17487134 ] Rob D commented on SPARK-6305: -- Thank you for this update. Has this change been included in an official release yet? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Tal Sliwowicz >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.3.0 > > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keunhyun Oh updated SPARK-38102: Summary: Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset (was: Supporting custom commitProtocolClass and committer class when using saveAsNewAPIHadoopDataset) > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > A custom commitProtocolClass and committer class should be supported when > using saveAsNewAPIHadoopDataset, because there is no way to apply > spark-hadoop-cloud's commitProtocolClass and committer classes, which are > meant to avoid object storage problems. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
[ https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keunhyun Oh updated SPARK-38102: Description: A custom commitProtocolClass should be supported when using saveAsNewAPIHadoopDataset, because there is no way to apply spark-hadoop-cloud's commitProtocolClass, which is meant to avoid object storage problems. was: A custom commitProtocolClass and committer class should be supported when using saveAsNewAPIHadoopDataset, because there is no way to apply spark-hadoop-cloud's commitProtocolClass and committer classes, which are meant to avoid object storage problems. > Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset > -- > > Key: SPARK-38102 > URL: https://issues.apache.org/jira/browse/SPARK-38102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Keunhyun Oh >Priority: Major > > A custom commitProtocolClass should be supported when using > saveAsNewAPIHadoopDataset, because there is no way to apply > spark-hadoop-cloud's commitProtocolClass, which is meant to avoid object > storage problems. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38102) Supporting custom commitProtocolClass and committer class when using saveAsNewAPIHadoopDataset
Keunhyun Oh created SPARK-38102: --- Summary: Supporting custom commitProtocolClass and committer class when using saveAsNewAPIHadoopDataset Key: SPARK-38102 URL: https://issues.apache.org/jira/browse/SPARK-38102 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Keunhyun Oh A custom commitProtocolClass and committer class should be supported when using saveAsNewAPIHadoopDataset, because there is no way to apply spark-hadoop-cloud's commitProtocolClass and committer classes, which are meant to avoid object storage problems. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487055#comment-17487055 ] Apache Spark commented on SPARK-38073: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35396 > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > It results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)? 
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This could be easily fixed by capturing `sc` instance > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
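The closure-capture fix in the diff above can be demonstrated in plain Python, without Spark. `FakeContext` is a hypothetical stand-in for the real SparkContext, used only to make the sketch self-contained:

```python
import atexit

class FakeContext:
    """Hypothetical stand-in for pyspark's SparkContext."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

sc = FakeContext()

# Original shell.py form: `sc` is looked up only when the callback runs,
# so it fails if the name is gone from the enclosing scope by then.
late_bound = lambda: sc.stop()

# Patched form: the inner lambda closes over its own `ctx` parameter,
# keeping a direct reference to the instance regardless of the name `sc`.
captured = (lambda ctx: lambda: ctx.stop())(sc)
atexit.register(captured)  # mirrors shell.py's registration; stop() is harmless if repeated

saved = sc
del sc  # simulate the name disappearing from scope at interpreter shutdown

try:
    late_bound()
except NameError:
    pass  # NameError: name 'sc' is not defined, matching the traceback above

captured()  # still works: the instance was captured at registration time
assert saved.stopped
```

This is why `(lambda sc: lambda: sc.stop())(sc)` fixes the exit error: the callback no longer depends on `sc` still being resolvable in the shell module's namespace when atexit runs.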
[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38073: Assignee: (was: Apache Spark) > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > It results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)? 
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This could be easily fixed by capturing `sc` instance > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38073: Assignee: Apache Spark > NameError: name 'sc' is not defined when running driver with IPython and Python > > 3.7 > --- > > Key: SPARK-38073 > URL: https://issues.apache.org/jira/browse/SPARK-38073 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Shell >Affects Versions: 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with > Python >= 3.8, the function registered with atexit seems to be executed in a > different scope than in Python 3.7. > It results in {{NameError: name 'sc' is not defined}} on exit: > {code:python} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > Using Python version 3.8.12 (default, Oct 12 2021 21:57:06) > Spark context Web UI available at http://192.168.0.198:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1643555855409). > SparkSession available as 'spark'. > In [1]: > > > Do you really want to exit ([y]/n)? 
y > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/path/to/spark/python/pyspark/shell.py", line 49, in > atexit.register(lambda: sc.stop()) > NameError: name 'sc' is not defined > {code} > This could be easily fixed by capturing `sc` instance > {code:none} > diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py > index f0c487877a..4164e3ab0c 100644 > --- a/python/pyspark/shell.py > +++ b/python/pyspark/shell.py > @@ -46,7 +46,7 @@ except Exception: > > sc = spark.sparkContext > sql = spark.sql > -atexit.register(lambda: sc.stop()) > +atexit.register((lambda sc: lambda: sc.stop())(sc)) > > # for compatibility > sqlContext = spark._wrapped > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org