[jira] [Resolved] (SPARK-29258) parity between ml.evaluator and mllib.metrics
[ https://issues.apache.org/jira/browse/SPARK-29258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29258. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25940 [https://github.com/apache/spark/pull/25940] > parity between ml.evaluator and mllib.metrics > - > > Key: SPARK-29258 > URL: https://issues.apache.org/jira/browse/SPARK-29258 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > 1, expose {{BinaryClassificationMetrics.numBins}} in > {{BinaryClassificationEvaluator}} > 2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}} > 3, add metric {{explainedVariance}} in {{RegressionEvaluator}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
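For reference, MLlib's {{RegressionMetrics}} documents explained variance as the mean squared deviation of the predictions from the mean label, explainedVariance = Σ(ŷᵢ − ȳ)² / n. A pure-Python sketch of the metric newly exposed in {{RegressionEvaluator}} (the formula is taken from the RegressionMetrics docs; the helper name is illustrative, check the docs of your Spark version):

```python
def explained_variance(labels, preds):
    # explainedVariance = sum((pred_i - mean(labels))^2) / n,
    # the definition documented for MLlib's RegressionMetrics.
    n = len(labels)
    label_mean = sum(labels) / n
    return sum((p - label_mean) ** 2 for p in preds) / n

# a perfect fit explains exactly the variance of the labels:
# mean = 2, sum of squared deviations = 2, divided by n = 3
assert abs(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) - 2.0 / 3) < 1e-9
```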
[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939109#comment-16939109 ] Huaxin Gao commented on SPARK-29269: Yes. I am happy to work on this. Thanks! [~podongfeng] > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks!
[jira] [Resolved] (SPARK-27715) SQL query details in UI does not show in correct format.
[ https://issues.apache.org/jira/browse/SPARK-27715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-27715. -- Fix Version/s: 3.0.0 Assignee: Genmao Yu Resolution: Fixed Resolved by https://github.com/apache/spark/pull/24609 > SQL query details in UI does not show in correct format. > > > Key: SPARK-27715 > URL: https://issues.apache.org/jira/browse/SPARK-27715 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable
[ https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29175. --- Fix Version/s: 3.0.0 Assignee: Yuanjian Li Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25849 > Make maven central repository in IsolatedClientLoader configurable > -- > > Key: SPARK-29175 > URL: https://issues.apache.org/jira/browse/SPARK-29175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > We need to connect to a central repository in IsolatedClientLoader for > downloading Hive jars. This adds a new config, > `spark.sql.additionalRemoteRepositories`: a comma-delimited string of > optional additional remote Maven mirror repositories, used in addition to > the default Maven central repository.
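The new config is a plain comma-delimited string, with the listed mirrors used alongside the default central repo. A minimal sketch of how such a value could be parsed (the default URL, ordering, and function name are assumptions for illustration, not Spark's actual resolver code):

```python
DEFAULT_CENTRAL = "https://repo1.maven.org/maven2"  # assumed default central repo

def resolve_repositories(additional: str) -> list:
    # Split the comma-delimited config value, trim whitespace, drop blanks,
    # and keep the default Maven central repo as the final fallback.
    mirrors = [r.strip() for r in additional.split(",") if r.strip()]
    return mirrors + [DEFAULT_CENTRAL]

repos = resolve_repositories("https://mirror.example.org/maven2, ")
```

With the empty trailing entry dropped, `repos` holds the mirror followed by the default central repo.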
[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939099#comment-16939099 ] Hyukjin Kwon commented on SPARK-29245: -- BTW, can we confirm that this is the last fix we need from Hive 2.3.x for JDK 11? We cannot keep asking the Hive dev community to make a minor release for every bug fix ... It might be better to wait until we are closer to the Spark 3 release, see if more bugs are found, and ask for them in a batch. cc [~alangates] > CCE during creating HiveMetaStoreClient > > > Key: SPARK-29245 > URL: https://issues.apache.org/jira/browse/SPARK-29245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 > Environment: CDH 6.3 >Reporter: Dongjoon Hyun >Priority: Blocker > > From a `master` branch build, when I try to connect to an external HMS, I hit > the following. > {code} > 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException > class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; > ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader > 'bootstrap') > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70) > {code} > With HIVE-21508, I can get the following. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sql("show databases").show > ++ > |databaseName| > ++ > | . | > ... 
> {code}
[jira] [Updated] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset
[ https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26431: -- Fix Version/s: (was: 2.4.0) > Update availableSlots by availableCpus for barrier taskset > -- > > Key: SPARK-26431 > URL: https://issues.apache.org/jira/browse/SPARK-26431 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > > availableCpus decreases as tasks are allocated, so we should derive > availableSlots from availableCpus for a barrier taskset to avoid an > unnecessary resourceOffer process.
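The fix amounts to computing slots from the CPUs still free on each offer rather than from the executors' total cores, since a barrier taskset launches only if there is a slot for every task at once. A tiny sketch of that accounting (names are illustrative, not Spark's scheduler code):

```python
def available_slots(available_cpus, cpus_per_task):
    # Each worker offer contributes floor(free_cpus / cpus_per_task) slots;
    # summing over offers gives the slots a barrier stage can use right now.
    return sum(cpus // cpus_per_task for cpus in available_cpus)

# 3 executors with 4, 2 and 3 free CPUs at 2 CPUs per task -> 2 + 1 + 1 = 4 slots
assert available_slots([4, 2, 3], cpus_per_task=2) == 4
```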
[jira] [Resolved] (SPARK-29246) Remove unnecessary imports in `core` module
[ https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29246. --- Fix Version/s: 3.0.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25927 > Remove unnecessary imports in `core` module > --- > > Key: SPARK-29246 > URL: https://issues.apache.org/jira/browse/SPARK-29246 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jiaqi Li >Assignee: Jiaqi Li >Priority: Trivial > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-29269) Pyspark ALSModel support getters/setters
zhengruifeng created SPARK-29269: Summary: Pyspark ALSModel support getters/setters Key: SPARK-29269 URL: https://issues.apache.org/jira/browse/SPARK-29269 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng ping [~huaxingao], would you like to work on this? This is similar to your previous work. Thanks!
[jira] [Resolved] (SPARK-29142) Pyspark clustering models support column setters/getters/predict
[ https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29142. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25859 [https://github.com/apache/spark/pull/25859] > Pyspark clustering models support column setters/getters/predict > > > Key: SPARK-29142 > URL: https://issues.apache.org/jira/browse/SPARK-29142 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > Unlike the reg/clf models, clustering models do not share a common base > class, so we need to add them one by one.
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939082#comment-16939082 ] zhengruifeng commented on SPARK-29212: -- [~zero323] I had not noticed the base hierarchy without a JVM backend in SPARK-28985; thank you for pointing it out. I think we have reached consensus on: 1, adding base classes without a JVM backend and making the JVM-backed classes extend them (perhaps limited at first to the classes modified in SPARK-28985); 2, renaming private class names following PEP-8. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copied from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am a Python user and want to implement a pure Python ML classifier without > providing a JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high-level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run-time parameter validation for both user-defined (see above) and > existing class hierarchies, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that a user-passed argument is indeed a classifier: > *** For built-in objects, using the "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and the Scala counterparts are public, shouldn't > these be too?). 
> ** For user-defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for a classifier, on > {{numClasses}} for a classification model), but it is hardly satisfying. > *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That, however, would require separate logic for checking > built-in and user-provided classes. > *** Provide a parallel abstract type hierarchy, register all existing built-in > classes, and require users to do the same. > Clearly these are not satisfying solutions, as they require either defensive > programming or reinventing the same functionality for different 3rd-party > APIs. > * Static type checking > ** I am either an end user or a library developer and want to use PEP-484 > annotations to indicate components that require a classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... 
> def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words, it should be consistent with the existing approach, where we have > ABCs reflecting the Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code}
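The structural-subtyping option discussed above can be made concrete with typing.Protocol: a runtime_checkable protocol lets a library validate "is a classifier-like object" by method shape, without registering classes. A minimal sketch (class and method names are illustrative, not PySpark's actual hierarchy):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HasRawPredictionCol(Protocol):
    # Structural check: anything exposing this method matches the protocol.
    def setRawPredictionCol(self, value: str): ...

class PurePythonClassifier:
    # A Python-only implementation with no JVM backend.
    def setRawPredictionCol(self, value):
        self.rawPredictionCol = value
        return self

class NotAClassifier:
    pass

assert isinstance(PurePythonClassifier(), HasRawPredictionCol)
assert not isinstance(NotAClassifier(), HasRawPredictionCol)
```

Note the caveat from the typing docs: runtime_checkable isinstance checks only verify that the methods exist, not their signatures, so this remains a weaker guarantee than a shared abstract base class.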
[jira] [Commented] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk
[ https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939075#comment-16939075 ] Kent Yao commented on SPARK-29257: -- This issue should not be a problem anymore after this fix - [https://github.com/apache/spark/pull/25620] . Each index and data file will have a unique name that includes the task attempt id. > All Task attempts scheduled to the same executor inevitably access the same > bad disk > > > Key: SPARK-29257 > URL: https://issues.apache.org/jira/browse/SPARK-29257 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Priority: Major > Attachments: image-2019-09-26-16-44-48-554.png > > > We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 8T * 12 > local disks for storage and shuffle. Sometimes, one or more disks go into a > bad state during computation. Sometimes this causes a job-level failure, > sometimes it does not. > The following picture shows one failed job in which 4 task attempts were > all delivered to the same node and failed with almost the same exception while > writing the temporary index file to the same bad disk. > > This is caused by two reasons: > # As we can see in the figure, the data and the node have the best data > locality for reading from HDFS. With the default spark.locality.wait (3s) taking > effect, there is a high probability that those attempts will be scheduled to > this node. > # The index or data file name for a particular shuffle map task is > fixed. It is formed from the shuffle id, the map id, and the noop reduce id, > which is always 0. The root local dir is picked by the fixed file name's > non-negative hash code % the number of disks, so this value is also fixed. > Even when we have 12 disks in total and only one of them is broken, once the > broken one is picked, all the following attempts of this task will > inevitably pick the broken one. > > > !image-2019-09-26-16-44-48-554.png! 
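The deterministic disk pick described in reason 2 is easy to reproduce: a non-negative hash of a fixed file name modulo the disk count never changes across attempts, while salting the name with the attempt id (as the linked PR does) lets retries land on other disks. A sketch of the effect (crc32 stands in for Spark's actual hash-mod dir selection; names are illustrative):

```python
import zlib

def pick_local_dir(file_name: str, num_dirs: int) -> int:
    # Stand-in for Spark's non-negative hash of the file name % local dir count.
    return zlib.crc32(file_name.encode()) % num_dirs

fixed_name = "shuffle_3_41_0.index"  # shuffleId_mapId_reduceId: same for every attempt
picks = {pick_local_dir(fixed_name, 12) for _ in range(4)}
assert len(picks) == 1  # all 4 attempts hit the same (possibly bad) disk

# Including the task attempt id in the name gives each retry a fresh pick.
salted_picks = {pick_local_dir(f"{fixed_name}.{attempt}", 12) for attempt in range(4)}
```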
[jira] [Commented] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939073#comment-16939073 ] Sandeep Katta commented on SPARK-29268: --- I will work on this > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled > -- > > Key: SPARK-29268 > URL: https://issues.apache.org/jira/browse/SPARK-29268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled({{spark.sql.hive.metastore.jars != builtin}}). > How to reproduce: > {code:sh} > bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf > spark.sql.hive.metastore.jars=maven > {code} > Logs: > {noformat} > ... > Caused by: java.sql.SQLException: Failed to start database 'metastore_db' > with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see > the next exception for details. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 108 more > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 105 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > ... 
> {noformat}
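For context, embedded Derby guards a database directory with a lock file (db.lck), so a second boot of the same directory fails with XSDB6; two isolated classloaders pointing at the same metastore_db trigger exactly that. A loose sketch of such a single-booter guard (O_EXCL is a simplification for illustration, not Derby's actual locking mechanism):

```python
import os, tempfile

def boot_embedded_db(db_dir: str) -> str:
    # Take an exclusive lock file; a second boot of the same directory fails,
    # loosely mirroring Derby's db.lck single-booter guard.
    lock_path = os.path.join(db_dir, "db.lck")
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return lock_path

db = tempfile.mkdtemp()
boot_embedded_db(db)          # first "classloader" boots fine
try:
    boot_embedded_db(db)      # second boot of the same dir is rejected
except FileExistsError:
    print("double boot rejected (Derby reports this as XSDB6)")
```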
[jira] [Updated] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29268: Description: Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: {code:sh} bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf spark.sql.hive.metastore.jars=maven {code} Logs: {noformat} ... Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the next exception for details. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 108 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) ... 105 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. ... {noformat} was: Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: {noformat} bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf spark.sql.hive.metastore.jars=maven {noformat} Logs: {noformat} ... Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the next exception for details. 
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 108 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) ... 105 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. ... {noformat} > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled > -- > > Key: SPARK-29268 > URL: https://issues.apache.org/jira/browse/SPARK-29268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled({{spark.sql.hive.metastore.jars != builtin}}). > How to reproduce: > {code:sh} > bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf > spark.sql.hive.metastore.jars=maven > {code} > Logs: > {noformat} > ... > Caused by: java.sql.SQLException: Failed to start database 'metastore_db' > with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see > the next exception for details. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 
108 more > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 105 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > ... > {noformat}
[jira] [Updated] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29268: Description: Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: {noformat} bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf spark.sql.hive.metastore.jars=maven {noformat} Logs: {noformat} ... Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the next exception for details. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 108 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) ... 105 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. ... {noformat} was: How to reproduce: {noformat} bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf spark.sql.hive.metastore.jars=maven {noformat} Logs: {noformat} ... Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the next exception for details. 
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 108 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) ... 105 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. ... {noformat} > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled > -- > > Key: SPARK-29268 > URL: https://issues.apache.org/jira/browse/SPARK-29268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled({{spark.sql.hive.metastore.jars != builtin}}). > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf > spark.sql.hive.metastore.jars=maven > {noformat} > Logs: > {noformat} > ... > Caused by: java.sql.SQLException: Failed to start database 'metastore_db' > with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see > the next exception for details. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 
108 more > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 105 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > ... > {noformat}
[jira] [Created] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
Yuming Wang created SPARK-29268: --- Summary: Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled Key: SPARK-29268 URL: https://issues.apache.org/jira/browse/SPARK-29268 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 2.3.4, 3.0.0 Reporter: Yuming Wang How to reproduce: {noformat} bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf spark.sql.hive.metastore.jars=maven {noformat} Logs: {noformat} ... Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the next exception for details. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 108 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) ... 105 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. ... {noformat}
[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939068#comment-16939068 ] angerszhu commented on SPARK-29254: --- [~cltlfcjin] Not same reason. > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: 
Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) > 2019-09-25 22:11:42.854 - stderr> at >
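The failure above can be modeled without Spark: when isolated loading is enabled, the Hive metastore client resolves classes through its own loader chain, which never consults the application loader that received the {{--jars}} entries. A plain-Python sketch of that lookup chain (toy names and classpaths, not Spark's real class-loading code):

```python
class ToyLoader:
    """Toy stand-in for a JVM class loader: resolve a class name from this
    loader's own classpath, else delegate to the parent, else fail."""
    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = set(classes)
        self.parent = parent

    def load(self, cls):
        if cls in self.classes:
            return f"{cls} via {self.name}"
        if self.parent is not None:
            return self.parent.load(cls)
        raise LookupError(f"ClassNotFoundException: Class {cls} not found")

# The application-side loader sees the classes shipped with --jars ...
app = ToyLoader("app", {"org.apache.hive.hcatalog.data.JsonSerDe"})

# ... but the isolated metastore loader only sees the downloaded metastore
# jars and delegates to the boot loader, never to `app`.
boot = ToyLoader("boot", {"java.lang.Object"})
isolated = ToyLoader("hive-isolated",
                     {"org.apache.hadoop.hive.ql.metadata.Table"},
                     parent=boot)

try:
    isolated.load("org.apache.hive.hcatalog.data.JsonSerDe")
except LookupError as err:
    print(err)  # mirrors the ClassNotFoundException in the log above
```

In this toy model, a fix amounts to also registering the {{--jars}} entries with the isolated loader's classpath (here, adding them to {{isolated.classes}}) before any metastore call; whether the actual Spark fix takes that shape is not shown in this thread.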
[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939067#comment-16939067 ] Wenchen Fan commented on SPARK-29262: - There is a `DataFrameWriterV2`, we can consider adding this API there. > DataFrameWriter insertIntoPartition function > > > Key: SPARK-29262 > URL: https://issues.apache.org/jira/browse/SPARK-29262 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > > Do we have plan to support insertIntoPartition function for dataFrameWriter? > [~cloud_fan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29267) rdd.countApprox should stop when 'timeout'
Kangtian created SPARK-29267: Summary: rdd.countApprox should stop when 'timeout' Key: SPARK-29267 URL: https://issues.apache.org/jira/browse/SPARK-29267 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: Kangtian The approximate-counting API is {{org.apache.spark.rdd.RDD#countApprox}}: +countApprox(timeout: Long, confidence: Double = 0.95)+ But when the timeout is reached, the job keeps running until it actually finishes. We want: *when the timeout is reached, the job should stop immediately*, without computing the final value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
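The requested contract can be sketched outside Spark. This toy model (plain Python, hypothetical names, not Spark's PartialResult machinery) stops scheduling partitions once the deadline passes and scales the partial count into an estimate; the issue asks that real Spark likewise cancel the remaining work instead of letting the job run to completion:

```python
import time

def count_approx(partitions, timeout_ms):
    """Toy model of countApprox's contract: once the timeout elapses,
    return an estimate based on the partitions counted so far."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    counted, done = 0, 0
    for part in partitions:
        if time.monotonic() >= deadline:
            break  # deadline hit: stop scheduling further partitions
        counted += len(part)
        done += 1
    if done == 0:
        return 0, done  # nothing finished before the deadline
    # Naive scale-up of the partial count to the full partition count.
    return counted * len(partitions) // done, done
```

The real {{countApprox}} additionally attaches a confidence interval (a {{BoundedDouble}}); that part is omitted here.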
[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets
[ https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29266: --- Description: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record (due to how global limits are planned / executed in this query). We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset, hence the long runtime for this). I could instruct the job author to use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself (reducing the burden placed on users to understand internal details of Spark's execution model). 
Maybe we can special-case {{isEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). was: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset, hence the long runtime for this). 
I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself (reducing the burden placed on users to understand internal details of Spark's execution model). Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). > Optimize Dataset.isEmpty for base relations / unfiltered datasets > - > > Key: SPARK-29266 > URL: https://issues.apache.org/jira/browse/SPARK-29266 > Project: Spark >
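The trade-off described above can be made concrete with a partition-level cost model (plain Python, toy code, not Spark's planner): the current implementation launches one task per partition, each stopping after its first record, while a {{limit(1).collect()}}-style implementation scans 1 partition, then 2 more, then 4, doubling until it finds a row:

```python
def is_empty_all_partitions(parts):
    """Current implementation's shape: one task per partition; each task
    stops after its first record (global limit 1 mid-plan)."""
    tasks = len(parts)                       # every partition gets a task
    found = any(len(p) > 0 for p in parts)   # each task reads at most 1 row
    return (not found), tasks

def is_empty_collect_limit(parts):
    """limit(1).collect() shape: try 1 partition, then 2 more, then 4, ..."""
    tasks, start, batch = 0, 0, 1
    while start < len(parts):
        for p in parts[start:start + batch]:
            tasks += 1
            if len(p) > 0:
                return False, tasks          # first row found: stop early
        start += batch
        batch *= 2
    return True, tasks
```

On an empty 7-partition dataset both strategies run 7 tasks, but when the very first partition has data the doubling strategy runs 1 task where the current one runs one per partition — matching the "faster in some cases, slower in others" observation above.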
[jira] [Updated] (SPARK-29246) Remove unnecessary imports in `core` module
[ https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29246: -- Summary: Remove unnecessary imports in `core` module (was: Remove unnecessary imports in CoarseGrainedExecutorBackendSuite) > Remove unnecessary imports in `core` module > --- > > Key: SPARK-29246 > URL: https://issues.apache.org/jira/browse/SPARK-29246 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: Jiaqi Li >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:54 PM: Sorry, I'm going to test again... I'm seeing something that doesn't match my guesses in my test case. I thought the cause was that it triggered a global sort, but the plan is similar to, yet different from, the one generated when ordering by another column. Can anyone guess why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and only used one executor? (Not a problem with the data having only one value; that key had multiple values evenly distributed.) I'm going to keep investigating to see if I can reproduce this in a dummy test, but removing the orderBy in a similar case just fixed the performance problem in production :$ (it was causing 50+min single tasks) was (Author: fsainz): Sorry, ima test again... I'm seeing something that doesn't match my guessings in my test case. I thought the cause was because it triggered a global sort, but plan is similar different to the one generated when ordering by another column, someone knows why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and only used one executor? (not a problem with data having only one value, that key had multiple values evenly distributed). 
I'm going to keep investigating to see if I can reproduce this in a dummy test, but removing the orderBy in a similar case just fixed performance problem in production :$ (it was causing 50+min single-tasks) > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
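The redundancy the reporter expects (case 2 above) can be checked with a toy window model in plain Python (not Spark's window execution): within a window partitioned by {{word}}, every row carries the same {{word}}, so ordering the window by {{word}} cannot reorder anything, and any global sort it triggers is pure overhead.

```python
from itertools import groupby

def collect_list_over(rows, part_key, order_key):
    """Toy window: sort rows by the partition key (stand-in for the
    shuffle), sort each window by order_key, collect 'number' per window."""
    shuffled = sorted(rows, key=part_key)        # stable sort
    out = {}
    for key, window in groupby(shuffled, key=part_key):
        ordered = sorted(window, key=order_key)  # stable sort
        out[key] = [r["number"] for r in ordered]
    return out

rows = [{"word": w, "number": n}
        for w, n in [("a", 3), ("b", 1), ("a", 2), ("b", 5)]]

# Ordering each window by its own partition key is a no-op: all rows in a
# window share that key, so the stable sort leaves them untouched.
by_word = collect_list_over(rows, lambda r: r["word"], lambda r: r["word"])
by_num = collect_list_over(rows, lambda r: r["word"], lambda r: r["number"])
```

Here {{by_word}} preserves each window's input order while {{by_num}} actually reorders it, which is why an {{orderBy}} on the partition column buys nothing semantically.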
[jira] [Assigned] (SPARK-29246) Remove unnecessary imports in `core` module
[ https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29246: - Assignee: Jiaqi Li > Remove unnecessary imports in `core` module > --- > > Key: SPARK-29246 > URL: https://issues.apache.org/jira/browse/SPARK-29246 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jiaqi Li >Assignee: Jiaqi Li >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29246) Remove unnecessary imports in `core` module
[ https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29246: -- Component/s: (was: Tests) > Remove unnecessary imports in `core` module > --- > > Key: SPARK-29246 > URL: https://issues.apache.org/jira/browse/SPARK-29246 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jiaqi Li >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode
[ https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29259: - Assignee: Rahij Ramsharan > Filesystem.exists is called even when not necessary for append save mode > > > Key: SPARK-29259 > URL: https://issues.apache.org/jira/browse/SPARK-29259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rahij Ramsharan >Assignee: Rahij Ramsharan >Priority: Minor > Fix For: 3.0.0 > > > When saving a dataframe into Hadoop > ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]), > spark first checks if the file exists before inspecting the SaveMode to > determine if it should actually insert data. However, the pathExists variable > is actually not used in the case of SaveMode.Append. In some file systems, > the exists call can be expensive and hence this PR makes that call only when > necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode
[ https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29259. --- Fix Version/s: 3.0.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25928 > Filesystem.exists is called even when not necessary for append save mode > > > Key: SPARK-29259 > URL: https://issues.apache.org/jira/browse/SPARK-29259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rahij Ramsharan >Priority: Minor > Fix For: 3.0.0 > > > When saving a dataframe into Hadoop > ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]), > spark first checks if the file exists before inspecting the SaveMode to > determine if it should actually insert data. However, the pathExists variable > is actually not used in the case of SaveMode.Append. In some file systems, > the exists call can be expensive and hence this PR makes that call only when > necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
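The change can be sketched in a few lines (plain Python, with {{os.path.exists}} standing in for the potentially expensive {{FileSystem.exists}}; mode names mirror {{SaveMode}}): the existence check is skipped entirely for append and performed only for the modes whose behavior depends on the answer.

```python
import os

def prepare_write(path, mode):
    """Sketch of the SPARK-29259 idea: only ask the (possibly expensive)
    filesystem whether `path` exists when the save mode needs the answer."""
    if mode == "append":
        return "write"                       # no exists() call at all
    exists = os.path.exists(path)            # needed for the other modes
    if mode == "overwrite":
        return "delete-then-write" if exists else "write"
    if mode == "ignore":
        return "noop" if exists else "write"
    if mode == "errorifexists":
        if exists:
            raise FileExistsError(path)
        return "write"
    raise ValueError(f"unknown mode: {mode}")
```

This is a toy dispatcher, not Spark's {{InsertIntoHadoopFsRelationCommand}}; it only illustrates why the append branch can avoid the call.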
[jira] [Updated] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode
[ https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29259: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Filesystem.exists is called even when not necessary for append save mode > > > Key: SPARK-29259 > URL: https://issues.apache.org/jira/browse/SPARK-29259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rahij Ramsharan >Priority: Minor > Fix For: 3.0.0 > > > When saving a dataframe into Hadoop > ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]), > spark first checks if the file exists before inspecting the SaveMode to > determine if it should actually insert data. However, the pathExists variable > is actually not used in the case of SaveMode.Append. In some file systems, > the exists call can be expensive and hence this PR makes that call only when > necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets
[ https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29266: --- Description: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset, hence the long runtime for this). I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself (reducing the burden placed on users to understand internal details of Spark's execution model). 
Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). was: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset, hence the long runtime for this). 
I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). > Optimize Dataset.isEmpty for base relations / unfiltered datasets > - > > Key: SPARK-29266 > URL: https://issues.apache.org/jira/browse/SPARK-29266 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > SPARK-23627
[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets
[ https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29266: --- Description: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset). I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). 
In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). was: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition. I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. 
Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). > Optimize Dataset.isEmpty for base relations / unfiltered datasets > - > > Key: SPARK-29266 > URL: https://issues.apache.org/jira/browse/SPARK-29266 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented > as > {code:java} > def isEmpty: Boolean = withAction("isEmpty", > limit(1).groupBy().count().queryExecution) { plan => >
[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets
[ https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29266: --- Description: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset, hence the long runtime for this). I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). 
In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). was: SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition (this is an enormous dataset). I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. 
Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). > Optimize Dataset.isEmpty for base relations / unfiltered datasets > - > > Key: SPARK-29266 > URL: https://issues.apache.org/jira/browse/SPARK-29266 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented > as > {code:java} > def isEmpty: Boolean = withAction("isEmpty", >
[jira] [Created] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets
Josh Rosen created SPARK-29266: -- Summary: Optimize Dataset.isEmpty for base relations / unfiltered datasets Key: SPARK-29266 URL: https://issues.apache.org/jira/browse/SPARK-29266 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as {code:java} def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } {code} which has a global limit of 1 embedded in the middle of the query plan. As a result, this will end up computing *all* partitions of the Dataset but each task can stop early once it's computed a single record. We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that will go through the "CollectLimit" execution strategy which first computes 1 partition, then 2, then 4, and so on. That will be faster in some cases but slower in others: if the dataset is indeed empty then that method will be slower than one which checks all partitions in parallel, but if it's non-empty (and most tasks' output is non-empty) then it can be much faster. There's not an obviously-best implementation here. However, I think there's high value (and low downside) to optimizing for the special case where the Dataset is an unfiltered, untransformed input dataset (e.g. the output of {{spark.read.parquet}}): I found a production job which calls {{isEmpty}} on the output of {{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to complete because it needed to launch hundreds of thousands of tasks to compute a single record of each partition. I could instruct the job author use a different, more efficient method of checking for non-emptiness, but this feels like the type of optimization that Spark should handle itself. 
Maybe we can special-case {{IsEmpty}} for the case where plan consists of only a file source scan (or a file source scan followed by a projection, but without any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} implementation (under assumption that we don't have a ton of empty input files) or something fancier (metadata-only query, looking at Parquet footers, delegating to some datasource API, etc). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
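The trade-off the SPARK-29266 ticket describes (scan all partitions in parallel with a per-task limit vs. a CollectLimit-style incremental scan) can be modeled outside Spark. A minimal pure-Python sketch; the partition-as-list representation and the doubling schedule are illustrative assumptions, not Spark's actual implementation:

```python
# Pure-Python model of the two Dataset.isEmpty strategies discussed above.
# A "dataset" is modeled as a list of partitions; each partition is a list
# of rows. Neither function is Spark code; they only illustrate the scan
# patterns described in the ticket.

def is_empty_scan_all(partitions):
    """Current behavior: every partition is inspected (every task runs),
    but each inspection stops after at most one row."""
    return not any(p[:1] for p in partitions)

def is_empty_incremental(partitions):
    """CollectLimit-style behavior: scan 1 partition, then 2, then 4, ...
    returning as soon as any scanned partition yields a row."""
    scanned, batch_size = 0, 1
    while scanned < len(partitions):
        batch = partitions[scanned:scanned + batch_size]
        if any(batch):  # a non-empty list in the batch means a row exists
            return False
        scanned += len(batch)
        batch_size *= 2
    return True
```

On an all-empty input the incremental variant touches every partition sequentially (slower than checking them all in parallel), while on a mostly non-empty input it typically stops after the first small batch, which is the trade-off the ticket describes.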
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Attachment: (was: Test.scala) > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29202) --driver-java-options are not passed to driver process in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-29202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29202. --- Fix Version/s: 3.0.0 Assignee: Sandeep Katta Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25889 > --driver-java-options are not passed to driver process in yarn client mode > -- > > Key: SPARK-29202 > URL: https://issues.apache.org/jira/browse/SPARK-29202 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.0 > > > Run the command below: > ./bin/spark-sql --master yarn > --driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=" > > In Spark 2.3.3 > /opt/softwares/Java/jdk1.8.0_211/bin/java -cp > /opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/conf/:/opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ > -Xmx1g -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= > org.apache.spark.deploy.SparkSubmit --master yarn --conf > spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= > --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver > spark-internal > > In Spark 3.0 > /opt/softwares/Java/jdk1.8.0_211/bin/java -cp > /opt/apache/git/sparkSourceCode/spark/conf/:/opt/apache/git/sparkSourceCode/spark/assembly/target/scala-2.12/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ > org.apache.spark.deploy.SparkSubmit --master yarn --conf > spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 > --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver > spark-internal > We can see that the Java options are not passed to the driver process in Spark 3. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Attachment: (was: TestSpark.zip) > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Affects Version/s: (was: 2.4.4) > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Affects Version/s: 2.4.4 > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow)) {code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows (which have the same value on word) inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy on the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows (which have the same value on word) inside each Window but doesn't affect performance too much. 
Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy on the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > collect_list($"number").over(myWindow)) > {code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. 
Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
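The reporter's expectation 2 above (ordering by a partitionBy column should at most be a cheap per-group no-op, never a global sort) can be illustrated without Spark. A small pure-Python sketch; the row model and function names are illustrative, not Spark's implementation:

```python
# Model of Window.partitionBy("word"): bucket rows by the partition key.
# Sorting each bucket by that same key can never reorder it, because all
# rows in a bucket share the key and Python's sort is stable -- which is
# why a whole-DataFrame sort for this case is surprising.

def partition_by(rows, key):
    # Group rows into buckets keyed by the partition key, preserving
    # the original relative order within each bucket.
    buckets = {}
    for row in rows:
        buckets.setdefault(key(row), []).append(row)
    return buckets

rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
buckets = partition_by(rows, key=lambda r: r[0])

for word, group in buckets.items():
    # Ordering by the partition key within its own group is a no-op.
    assert sorted(group, key=lambda r: r[0]) == group
```

Under this model the window orderBy on the partition column adds no information, which is why the reporter expected either a warning or a harmless per-group operation rather than a global sort.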
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:18 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating to see if I can reproduce this in a dummy test, but removing the orderBy in a similar case just fixed the performance problem in production :$ was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating, but removing the orderBy in a similar case just fixed the performance problem :$ > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). 
> Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:18 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating to see if I can reproduce this in a dummy test, but removing the orderBy in a similar case just fixed the performance problem in production :$ (it was causing single tasks of 50+ minutes) was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) 
I'm going to keep investigating to see if I can reproduce this in a dummy test, but removing the orderBy in a similar case just fixed performance problem in production :$ > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:07 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating, but removing the orderBy in a similar case just fixed the performance problem :$ was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating, but removing the orderBy in a similar case just fixed it :$ > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). 
> Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:07 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) I'm going to keep investigating, but removing the orderBy in a similar case just fixed it :$ was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). 
> Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:03 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? (It's not a problem of the data having only one value; that key had multiple values, evenly distributed.) was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). 
> Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:01 PM: Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match my guesses. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something in my test case that doesn't match. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and used only one executor? > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows (which have the same > value on word) inside each Window but doesn't affect performance too much. 
> > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy on the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Affects Version/s: (was: 2.4.4) (was: 2.4.3) (was: 2.3.0) 2.4.0
[jira] [Updated] (SPARK-29242) Check results of expression examples automatically
[ https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29242: -- Issue Type: Improvement (was: Test) > Check results of expression examples automatically > -- > > Key: SPARK-29242 > URL: https://issues.apache.org/jira/browse/SPARK-29242 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Expression examples demonstrate how to use the associated functions, and show > expected results. We need to write a test which executes the examples and > compares actual and expected results. For example: > https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043
[jira] [Updated] (SPARK-29242) Check results of expression examples automatically
[ https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29242: -- Component/s: Tests
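The SPARK-29242 idea above (execute the `Examples:` text embedded in each expression's description and compare actual vs. expected results) can be sketched in plain Scala. This is a hypothetical illustration, not the actual Spark test: `evaluate` is a stand-in for a real SQL runner, and the parsing assumes the simple "`> query` followed by its result" layout the examples use.

{code:scala}
// Hypothetical sketch: parse example text and diff actual vs. expected results.
object ExampleChecker {
  // Pair each "> SELECT ...;" line with the result line that follows it.
  def parseExamples(text: String): Seq[(String, String)] = {
    val lines = text.linesIterator.map(_.trim).filter(_.nonEmpty).toList
    lines.sliding(2).collect {
      case List(q, expected) if q.startsWith("> ") => (q.drop(2), expected)
    }.toList
  }

  // Return a mismatch report for every example whose actual output differs.
  def check(text: String, evaluate: String => String): Seq[String] =
    parseExamples(text).collect {
      case (q, expected) if evaluate(q) != expected =>
        s"'$q' produced '${evaluate(q)}', expected '$expected'"
    }

  def main(args: Array[String]): Unit = {
    val examples =
      """> SELECT 2 + 3;
        | 5
        |> SELECT upper('spark');
        | SPARK""".stripMargin
    // Toy evaluator covering just the two examples above.
    val eval: String => String = {
      case "SELECT 2 + 3;"          => "5"
      case "SELECT upper('spark');" => "SPARK"
    }
    assert(check(examples, eval).isEmpty)
    println("all examples match")
  }
}
{code}

In the real test, the evaluator would run each example through a SparkSession and compare the formatted output.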
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 9:54 PM: --- Sorry, I'm going to test again... I'm seeing something that doesn't match in my test case. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy tanked my performance and only used one executor? was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something that doesn't match in my test case.
[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz edited comment on SPARK-29265 at 9/26/19 9:54 PM: --- Sorry, I'm going to test again... I'm seeing something that doesn't match in my test case. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy (ordering by one of the partitionBy columns) tanked my performance and only used one executor? was (Author: fsainz): Sorry, I'm going to test again... I'm seeing something that doesn't match in my test case. I thought the cause was that it triggered a global sort, but the plan is different from the one generated when ordering by another column. Does anyone know why a similar Window+orderBy tanked my performance and only used one executor?
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows (which have the same value on word) inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy on the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy on the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip.
[jira] [Commented] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989 ] Florentino Sainz commented on SPARK-29265: -- Sorry, I'm going to test again... I'm seeing something that doesn't match in my test case.
[jira] [Resolved] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz resolved SPARK-29265. -- Resolution: Not A Bug
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy on the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip.
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too.
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Attachment: TestSpark.zip > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.3, 2.4.4 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows inside each Window > but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy of the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala was: Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. 
Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {code:scala} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} import scala.collection.mutable object Test { case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) def main(args: Array[String]): Unit = { val spark = SparkSession.builder .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.sql.autoBroadcastJoinThreshold", -1) .master("local[4]") .appName("Word Count") .getOrCreate() import org.apache.spark.sql.functions._ import spark.implicits._ val sc = spark.sparkContext val expectedSchema = List( StructField("number", IntegerType, false), StructField("word", StringType, false), StructField("dummyColumn", StringType, false) ) val expectedData = Seq( Row(8, "bat", "test"), Row(64, "mouse", "test"), Row(-27, "horse", "test") ) val filtrador = spark.createDataFrame( spark.sparkContext.parallelize(expectedData), StructType(expectedSchema) ).withColumn("dummy", explode(array((1 until 100).map(lit): _*))) //spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank") //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos") 
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres") //val filtrador2=filtrador.crossJoin(filtrador) //filtrador2.cache() //filtrador2.union(filtrador2).count val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)) filt2.show filt2.rdd.mapPartitions(iter => Iterator(iter.size), true).collect().foreach(println) } } {code} > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0,
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Attachment: Test.scala > Window orderBy causing full-DF orderBy > --- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.3, 2.4.4 > Environment: Any >Reporter: Florentino Sainz >Priority: Minor > Attachments: Test.scala > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > (attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows inside each Window > but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy of the whole DataFrame. 
Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) <-- You can pase it in Intellij/Any other and it should work: > > {code:scala} > import java.io.ByteArrayOutputStream > import java.net.URL > import java.nio.charset.Charset > import org.apache.commons.io.IOUtils > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql._ > import org.apache.spark.sql.catalyst.encoders.RowEncoder > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.types.{IntegerType, StringType, StructField, > StructType} > import scala.collection.mutable > object Test { > case class Bank(age:Integer, job:String, marital : String, education : > String, balance : Integer) > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder > .config("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > .config("spark.sql.autoBroadcastJoinThreshold", -1) > .master("local[4]") > .appName("Word Count") > .getOrCreate() > import org.apache.spark.sql.functions._ > import spark.implicits._ > val sc = spark.sparkContext > val expectedSchema = List( > StructField("number", IntegerType, false), > StructField("word", StringType, false), > StructField("dummyColumn", StringType, false) > ) > val expectedData = Seq( > Row(8, "bat", "test"), > Row(64, "mouse", "test"), > Row(-27, "horse", "test") > ) > val filtrador = spark.createDataFrame( > spark.sparkContext.parallelize(expectedData), > StructType(expectedSchema) > ).withColumn("dummy", explode(array((1 until 100).map(lit): _*))) > > //spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank") 
> //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos") > //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres") > //val filtrador2=filtrador.crossJoin(filtrador) > //filtrador2.cache() > //filtrador2.union(filtrador2).count > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)) > filt2.show > filt2.rdd.mapPartitions(iter => Iterator(iter.size), > true).collect().foreach(println) > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29265) Window orderBy causing full-DF orderBy
Florentino Sainz created SPARK-29265: Summary: Window orderBy causing full-DF orderBy Key: SPARK-29265 URL: https://issues.apache.org/jira/browse/SPARK-29265 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4, 2.4.3, 2.3.0 Environment: Any Reporter: Florentino Sainz Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:java} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {quote} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, StructType} import scala.collection.mutable object Test { case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) def 
main(args: Array[String]): Unit = { val spark = SparkSession.builder .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.sql.autoBroadcastJoinThreshold", -1) .master("local[4]") .appName("Word Count") .getOrCreate() import org.apache.spark.sql.functions._ import spark.implicits._ val sc = spark.sparkContext val expectedSchema = List( StructField("number", IntegerType, false), StructField("word", StringType, false), StructField("dummyColumn", StringType, false) ) val expectedData = Seq( Row(8, "bat", "test"), Row(64, "mouse", "test"), Row(-27, "horse", "test") ) val filtrador = spark.createDataFrame( spark.sparkContext.parallelize(expectedData), StructType(expectedSchema) ).withColumn("dummy", explode(array((1 until 100).map(lit): _*))) val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)) filt2.show filt2.rdd.mapPartitions(iter => Iterator(iter.size), true).collect().foreach(println) } } {quote} {code:java} {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {code:scala} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} import scala.collection.mutable object Test { case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) def main(args: Array[String]): Unit = { val spark = SparkSession.builder .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer") .config("spark.sql.autoBroadcastJoinThreshold", -1) .master("local[4]") .appName("Word Count") .getOrCreate() import org.apache.spark.sql.functions._ import spark.implicits._ val sc = spark.sparkContext val expectedSchema = List( StructField("number", IntegerType, false), StructField("word", StringType, false), StructField("dummyColumn", StringType, false) ) val expectedData = Seq( Row(8, "bat", "test"), Row(64, "mouse", "test"), Row(-27, "horse", "test") ) val filtrador = spark.createDataFrame( spark.sparkContext.parallelize(expectedData), StructType(expectedSchema) ).withColumn("dummy", explode(array((1 until 100).map(lit): _*))) //spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank") //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos") //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres") //val filtrador2=filtrador.crossJoin(filtrador) //filtrador2.cache() //filtrador2.union(filtrador2).count val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)) filt2.show filt2.rdd.mapPartitions(iter => Iterator(iter.size), true).collect().foreach(println) } } {code} was: Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:java} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. 
Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {code: scala} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import
[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy
[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-29265: - Description: Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:java} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {code: scala} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} import scala.collection.mutable object Test { case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) def main(args: Array[String]): Unit = { val spark = SparkSession.builder .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer") .config("spark.sql.autoBroadcastJoinThreshold", -1) .master("local[4]") .appName("Word Count") .getOrCreate() import org.apache.spark.sql.functions._ import spark.implicits._ val sc = spark.sparkContext val expectedSchema = List( StructField("number", IntegerType, false), StructField("word", StringType, false), StructField("dummyColumn", StringType, false) ) val expectedData = Seq( Row(8, "bat", "test"), Row(64, "mouse", "test"), Row(-27, "horse", "test") ) val filtrador = spark.createDataFrame( spark.sparkContext.parallelize(expectedData), StructType(expectedSchema) ).withColumn("dummy", explode(array((1 until 100).map(lit): _*))) //spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank") //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos") //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres") //val filtrador2=filtrador.crossJoin(filtrador) //filtrador2.cache() //filtrador2.union(filtrador2).count val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)) filt2.show filt2.rdd.mapPartitions(iter => Iterator(iter.size), true).collect().foreach(println) } } {code} was: Hi, I had this problem in "real" environments and also made a self-contained test (attached). Having this Window definition: {code:java} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. 
Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) <-- You can pase it in Intellij/Any other and it should work: {quote} import java.io.ByteArrayOutputStream import java.net.URL import java.nio.charset.Charset import org.apache.commons.io.IOUtils import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.encoders.RowEncoder import
[jira] [Resolved] (SPARK-29264) Incorrect example for the RLike expression
[ https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-29264. Resolution: Won't Fix > Incorrect example for the RLike expression > -- > > Key: SPARK-29264 > URL: https://issues.apache.org/jira/browse/SPARK-29264 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The example for RLIKE is incorrect: > https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174 > {code} > spark-sql> SET spark.sql.parser.escapedStringLiterals=true; > spark.sql.parser.escapedStringLiteralstrue > spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'; > 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT > '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'] > java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence > near index 14 > %SystemDrive%\Users.* > ^ > at java.util.regex.Pattern.error(Pattern.java:1957) > at java.util.regex.Pattern.escape(Pattern.java:2473) > at java.util.regex.Pattern.atom(Pattern.java:2200) > at java.util.regex.Pattern.sequence(Pattern.java:2132) > at java.util.regex.Pattern.expr(Pattern.java:1998) > at java.util.regex.Pattern.compile(Pattern.java:1698) > at java.util.regex.Pattern.(Pattern.java:1351) > at java.util.regex.Pattern.compile(Pattern.java:1028) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29264) Incorrect example for the RLike expression
[ https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938952#comment-16938952 ] Maxim Gekk commented on SPARK-29264: Remove Rlike from the ignore set after fix: [https://github.com/apache/spark/pull/25942/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR171] > Incorrect example for the RLike expression > -- > > Key: SPARK-29264 > URL: https://issues.apache.org/jira/browse/SPARK-29264 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The example for RLIKE is incorrect: > https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174 > {code} > spark-sql> SET spark.sql.parser.escapedStringLiterals=true; > spark.sql.parser.escapedStringLiteralstrue > spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'; > 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT > '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'] > java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence > near index 14 > %SystemDrive%\Users.* > ^ > at java.util.regex.Pattern.error(Pattern.java:1957) > at java.util.regex.Pattern.escape(Pattern.java:2473) > at java.util.regex.Pattern.atom(Pattern.java:2200) > at java.util.regex.Pattern.sequence(Pattern.java:2132) > at java.util.regex.Pattern.expr(Pattern.java:1998) > at java.util.regex.Pattern.compile(Pattern.java:1698) > at java.util.regex.Pattern.(Pattern.java:1351) > at java.util.regex.Pattern.compile(Pattern.java:1028) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29264) Incorrect example for the RLike expression
[ https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29264: --- Summary: Incorrect example for the RLike expression (was: Fix examples for the RLike expression) > Incorrect example for the RLike expression > -- > > Key: SPARK-29264 > URL: https://issues.apache.org/jira/browse/SPARK-29264 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The example for RLIKE is incorrect: > https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174 > {code} > spark-sql> SET spark.sql.parser.escapedStringLiterals=true; > spark.sql.parser.escapedStringLiteralstrue > spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'; > 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT > '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'] > java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence > near index 14 > %SystemDrive%\Users.* > ^ > at java.util.regex.Pattern.error(Pattern.java:1957) > at java.util.regex.Pattern.escape(Pattern.java:2473) > at java.util.regex.Pattern.atom(Pattern.java:2200) > at java.util.regex.Pattern.sequence(Pattern.java:2132) > at java.util.regex.Pattern.expr(Pattern.java:1998) > at java.util.regex.Pattern.compile(Pattern.java:1698) > at java.util.regex.Pattern.(Pattern.java:1351) > at java.util.regex.Pattern.compile(Pattern.java:1028) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29264) Fix examples for the RLike expression
Maxim Gekk created SPARK-29264: -- Summary: Fix examples for the RLike expression Key: SPARK-29264 URL: https://issues.apache.org/jira/browse/SPARK-29264 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The example for RLIKE is incorrect: https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174 {code} spark-sql> SET spark.sql.parser.escapedStringLiterals=true; spark.sql.parser.escapedStringLiterals true spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'; 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*'] java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 14 %SystemDrive%\Users.* ^ at java.util.regex.Pattern.error(Pattern.java:1957) at java.util.regex.Pattern.escape(Pattern.java:2473) at java.util.regex.Pattern.atom(Pattern.java:2200) at java.util.regex.Pattern.sequence(Pattern.java:2132) at java.util.regex.Pattern.expr(Pattern.java:1998) at java.util.regex.Pattern.compile(Pattern.java:1698) at java.util.regex.Pattern.(Pattern.java:1351) at java.util.regex.Pattern.compile(Pattern.java:1028) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
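The PatternSyntaxException in the report comes from java.util.regex, not from Spark itself: with spark.sql.parser.escapedStringLiterals=true the pattern reaches the regex engine as %SystemDrive%\Users.*, and \U is an illegal escape for the Java engine. A minimal sketch reproducing the failure and the corrected pattern in plain Scala (no Spark needed; object name is illustrative):

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

object RlikeEscapeSketch {
  def main(args: Array[String]): Unit = {
    // "\\U" in Scala source is the two characters \U -- an illegal
    // escape sequence for java.util.regex, matching the stack trace above.
    val bad = "%SystemDrive%\\Users.*"
    try {
      Pattern.compile(bad)
      println("unexpectedly compiled")
    } catch {
      case e: PatternSyntaxException => println(s"rejected: ${e.getDescription}")
    }

    // Escaping the backslash itself ("\\\\" in source = the two regex
    // characters \\, i.e. a literal backslash) compiles and matches.
    val good = Pattern.compile("%SystemDrive%\\\\Users.*")
    assert(good.matcher("%SystemDrive%\\Users\\John").find())
    println("corrected pattern matches")
  }
}
```

This is why the in-code example needs the backslash doubled once more when escaped string literals are enabled, rather than being a bug in RLIKE evaluation.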
[jira] [Updated] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
[ https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-28917: - Description: {{RDD.dependencies}} stores the precomputed cache value, but it is not thread-safe. This can lead to a race where the value gets overwritten, but the DAGScheduler gets stuck in an inconsistent state. In particular, this can happen when there is a race between the DAGScheduler event loop, and another thread (eg. a user thread, if there is multi-threaded job submission). First, a job is submitted by the user, which then computes the result Stage and its parents: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 Which eventually makes a call to {{rdd.dependencies}}: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 At the same time, the user could also touch {{rdd.dependencies}} in another thread, which could overwrite the stored value because of the race. Then the DAGScheduler checks the dependencies *again* later on in the job submission, via {{getMissingParentStages}} https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 Because it will find new dependencies, it will create entirely different stages. Now the job has some orphaned stages which will never run. The symptoms of this are seeing disjoint sets of stages in the "Parents of final stage" and the "Missing parents" messages on job submission. (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it is not a symptom of a problem at all. It just means the RDD is the *input* to multiple shuffles.) {noformat} [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting job: count at XXX.scala:462 ... 
[INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) ... ... [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List(ShuffleMapStage 4) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Missing parents: List(ShuffleMapStage 6) {noformat} Note that there is a similar issue w/ {{rdd.partitions}}. I don't see a way it could mess up the scheduler (it seems to be used only for {{rdd.partitions.length}}). There is also an issue that {{rdd.storageLevel}} is read and cached in the scheduler, but it could be modified simultaneously by the user in another thread. Similarly, I can't see a way it could affect the scheduler. *WORKAROUND*: (a) call {{rdd.dependencies}} while you know that RDD is only getting touched by one thread (e.g. in the thread that created it, or before you submit multiple jobs touching that RDD from other threads). Then that value will get cached. (b) don't submit jobs from multiple threads.
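The race described above is the classic unsynchronized check-then-act on a lazily cached field. A minimal sketch of the hazard and a lock-protected fix — in Python rather than Spark's Scala, with hypothetical names (`Node`, `dependencies`), not the actual RDD code — might look like:

```python
import threading

class Node:
    """Toy stand-in for an RDD that lazily computes and caches its dependencies."""
    def __init__(self, compute_deps):
        self._compute_deps = compute_deps
        self._deps = None
        self._lock = threading.Lock()

    def dependencies(self):
        # Double-checked locking: without the lock, two threads can both
        # observe _deps is None, compute independently, and overwrite each
        # other's result -- analogous to the race that leaves the
        # DAGScheduler holding two different dependency lists for one RDD.
        if self._deps is None:
            with self._lock:
                if self._deps is None:
                    self._deps = self._compute_deps()
        return self._deps

node = Node(lambda: ["shuffle-dep-1"])
deps = node.dependencies()  # computed once under the lock, then cached
```

With the lock in place, every caller observes the same cached list, which is the property the workaround (a) above tries to obtain by touching {{rdd.dependencies}} from a single thread first.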
[jira] [Created] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset
Juliusz Sompolski created SPARK-29263: - Summary: availableSlots in scheduler can change before being checked by barrier taskset Key: SPARK-29263 URL: https://issues.apache.org/jira/browse/SPARK-29263 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.0.0 Reporter: Juliusz Sompolski availableSlots is computed once before the loop in resourceOffer, but it can change in every iteration.
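The pattern the report describes — a value snapshotted before a loop while the loop itself invalidates it — can be sketched as follows (hypothetical names, not Spark's actual scheduler code):

```python
def schedule_stale(task_needs, total_slots):
    """Buggy: available slots are computed once, before the loop."""
    available = total_slots
    launched = []
    for need in task_needs:
        if available >= need:       # checks a stale snapshot
            launched.append(need)
    return launched

def schedule_fresh(task_needs, total_slots):
    """Fixed: the count is updated as each task consumes slots."""
    available = total_slots
    launched = []
    for need in task_needs:
        if available >= need:
            launched.append(need)
            available -= need       # re-checked on every iteration
    return launched

# With 4 slots and two tasks needing 3 slots each, the stale version
# over-commits and launches both tasks; the fresh version launches one.
stale, fresh = schedule_stale([3, 3], 4), schedule_fresh([3, 3], 4)
```

For a barrier taskset, which must launch all tasks at once or none, acting on the stale count is exactly what makes the check unsound.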
[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938857#comment-16938857 ] Dongjoon Hyun commented on SPARK-29245: --- In addition to [~yumwang]'s configuration, in CDH 6.3, HMS is also running JDK11. > CCE during creating HiveMetaStoreClient > > > Key: SPARK-29245 > URL: https://issues.apache.org/jira/browse/SPARK-29245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 > Environment: CDH 6.3 >Reporter: Dongjoon Hyun >Priority: Blocker > > From `master` branch build, when I try to connect to an external HMS, I hit > the following. > {code} > 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException > class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; > ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader > 'bootstrap') > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70) > {code} > With HIVE-21508, I can get the following. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sql("show databases").show > ++ > |databaseName| > ++ > | . | > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
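The exception above is the standard Java pitfall that `Collection.toArray()` returns an array whose runtime type is `Object[]`, which can never be cast to `URI[]`; the typed overload must be used instead. A minimal sketch of the mechanism (not the HiveMetaStoreClient code itself):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class ArrayCastDemo {
    // toArray(new URI[0]) allocates an array with runtime type URI[],
    // so the result is usable wherever a URI[] is required.
    static URI[] toUriArray(List<URI> uris) {
        return uris.toArray(new URI[0]);
    }

    public static void main(String[] args) {
        List<URI> uris = new ArrayList<>();
        uris.add(URI.create("thrift://hms-host:9083"));

        Object[] raw = uris.toArray();   // runtime type is Object[]
        // URI[] bad = (URI[]) raw;      // would throw ClassCastException
        URI[] good = toUriArray(uris);
        System.out.println(good[0]);
    }
}
```

This is why HIVE-21508 (referenced above) matters: the cast has to be replaced on the Hive side, not worked around by the caller.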
[jira] [Updated] (SPARK-25475) Refactor all benchmark to save the result as a separate file
[ https://issues.apache.org/jira/browse/SPARK-25475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25475: -- Target Version/s: 3.0.0 > Refactor all benchmark to save the result as a separate file > > > Key: SPARK-25475 > URL: https://issues.apache.org/jira/browse/SPARK-25475 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This is an umbrella issue to refactor all benchmarks to use a common style > using main-method (instead of `test` method) and saving the result as a > separate file (instead of embedding as comments). This is not only for > consistency, but also for making the benchmark-automation easy. SPARK-25339 > is finished as a reference model. > *Completed* > - FilterPushdownBenchmark.scala (SPARK-25339) > *Candidates* > - AggregateBenchmark.scala > - AvroWriteBenchmark.scala (SPARK-24777) > - ColumnarBatchBenchmark.scala > - CompressionSchemeBenchmark.scala > - DataSourceReadBenchmark.scala > - DataSourceWriteBenchmark.scala (SPARK-24777) > - DatasetBenchmark.scala > - ExternalAppendOnlyUnsafeRowArrayBenchmark.scala > - HashBenchmark.scala > - HashByteArrayBenchmark.scala > - JoinBenchmark.scala > - KryoBenchmark.scala > - MiscBenchmark.scala > - ObjectHashAggregateExecBenchmark.scala > - OrcReadBenchmark.scala > - PrimitiveArrayBenchmark.scala > - SortBenchmark.scala > - SynthBenchmark.scala > - TPCDSQueryBenchmark.scala > - UDTSerializationBenchmark.scala > - UnsafeArrayDataBenchmark.scala > - UnsafeProjectionBenchmark.scala > - WideSchemaBenchmark.scala > Candidates will be reviewed and converted as a subtask of this JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29242) Check results of expression examples automatically
[ https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29242: - Assignee: Maxim Gekk > Check results of expression examples automatically > -- > > Key: SPARK-29242 > URL: https://issues.apache.org/jira/browse/SPARK-29242 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Expression examples demonstrate how to use associated functions, and show > expected results. Need to write a test which executes the examples and > compare actual and expected results. For example: > https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
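Executing documented examples and comparing actual against expected output is the same mechanism Python's `doctest` module provides; a small illustration of the pattern (a toy `date_add`, not the Scala test the issue proposes):

```python
import doctest
from datetime import date, timedelta

def date_add(start, days):
    """Add days to a date string, mimicking an expression's documented example.

    >>> date_add('2016-07-30', 1)
    '2016-07-31'
    """
    y, m, d = map(int, start.split('-'))
    return str(date(y, m, d) + timedelta(days=days))

# Executes every example embedded in the docstrings above and reports
# mismatches between the expected and the actual result.
results = doctest.testmod()
```

A mismatch between the documented result and the real one fails the run, which is precisely the guarantee the issue wants for the `@ExpressionDescription` examples.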
[jira] [Commented] (SPARK-29248) Pass in number of partitions to BuildWriter
[ https://issues.apache.org/jira/browse/SPARK-29248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938743#comment-16938743 ] Ximo Guanter commented on SPARK-29248: -- [~kabhwan], I have opened [https://github.com/apache/spark/pull/25945] with a proposal on how this could work. Hopefully the PR can help frame the technical discussion better, since I'm not sure I'm explaining myself fully. > Pass in number of partitions to BuildWriter > --- > > Key: SPARK-29248 > URL: https://issues.apache.org/jira/browse/SPARK-29248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ximo Guanter >Priority: Major > > When implementing a ScanBuilder, we require the implementor to provide the > schema of the data and the number of partitions. > However, when someone is implementing WriteBuilder we only pass them the > schema, but not the number of partitions. This is an asymmetrical developer > experience. Passing in the number of partitions on the WriteBuilder would > enable data sources to provision their write targets before starting to > write. For example, it could be used to provision a Kafka topic with a > specific number of partitions.
[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938725#comment-16938725 ] Lantao Jin commented on SPARK-29254: Does it duplicate to https://issues.apache.org/jira/browse/SPARK-29022 ? > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - 
stderr> java.lang.ClassNotFoundException: Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) >
[jira] [Created] (SPARK-29262) DataFrameWriter insertIntoPartition function
feiwang created SPARK-29262: --- Summary: DataFrameWriter insertIntoPartition function Key: SPARK-29262 URL: https://issues.apache.org/jira/browse/SPARK-29262 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.4 Reporter: feiwang Do we have plan to support insertIntoPartition function for dataFrameWriter? [~cloud_fan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29256) Fix typo in building document
[ https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29256. -- Fix Version/s: 3.0.0 Assignee: Tomoko Komiyama Resolution: Fixed Resolved by https://github.com/apache/spark/pull/25937 > Fix typo in building document > -- > > Key: SPARK-29256 > URL: https://issues.apache.org/jira/browse/SPARK-29256 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Tomoko Komiyama >Assignee: Tomoko Komiyama >Priority: Trivial > Fix For: 3.0.0 > > > Typo in 'Building With Hive and JDBC Support' section in Building document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29256) Fix typo in building document
[ https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938594#comment-16938594 ] Sean R. Owen commented on SPARK-29256: -- There's no need to file a JIRA for this kind of thing. > Fix typo in building document > -- > > Key: SPARK-29256 > URL: https://issues.apache.org/jira/browse/SPARK-29256 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Tomoko Komiyama >Priority: Trivial > > Typo in 'Building With Hive and JDBC Support' section in Building document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29256) Fix typo in building document
[ https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29256: - Issue Type: Improvement (was: Bug) > Fix typo in building document > -- > > Key: SPARK-29256 > URL: https://issues.apache.org/jira/browse/SPARK-29256 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Tomoko Komiyama >Priority: Trivial > > Typo in 'Building With Hive and JDBC Support' section in Building document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938588#comment-16938588 ] Amogh Margoor commented on SPARK-29059: --- Pull Request here with all the optimizer changes: [GitHub Pull Request #25773|https://github.com/apache/spark/pull/25773] > [SPIP] Support for Hive Materialized Views in Spark SQL. > > > Key: SPARK-29059 > URL: https://issues.apache.org/jira/browse/SPARK-29059 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Amogh Margoor >Priority: Minor > > Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark > Catalyst does not optimize queries against Hive tables using Materialized > View the way Apache Calcite does it for Hive. This Jira is to add support for > the same. > We have developed it in our internal trunk and would like to open source it. > It would consist of 3 major parts: > # Reading MV related Hive Metadata > # Implication Engine, which would figure out if an expression exp1 implies > another expression exp2, i.e., if exp1 => exp2 is a tautology. This is similar > to the RexImplication checker in Apache Calcite. > # Catalyst rule to replace tables by their Materialized View using the > Implication Engine. For example, if MV 'mv' has been created in Hive using query > 'select * from foo where x > 10 && x < 110' then query 'select * from foo > where x > 70 and x < 100' will be transformed into 'select * from mv where x > > 70 and x < 100' > Note that the Implication Engine and Catalyst rule are generic and can be used even > when Spark decides to have its own Materialized View.
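For the interval predicates used in the example above ('x > 10 && x < 110' implied by 'x > 70 && x < 100'), the implication check reduces to range containment. A toy sketch of that idea (the proposal itself operates on Catalyst expressions, not this simplified form):

```python
def implies(q_lo, q_hi, mv_lo, mv_hi):
    """True if (x > q_lo && x < q_hi) implies (x > mv_lo && x < mv_hi).

    The query's range must be contained in the materialized view's
    range: every x satisfying the query predicate must also satisfy
    the view's predicate, so rewriting the query to scan the MV
    instead of the base table is sound.
    """
    return q_lo >= mv_lo and q_hi <= mv_hi

# MV built with x > 10 && x < 110; the query asks for x > 70 && x < 100,
# so the query can be answered from the MV.
ok = implies(70, 100, 10, 110)
```

The general engine must of course handle conjunctions, disjunctions, and mixed column predicates, which is where the comparison to Calcite's RexImplicationChecker comes in.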
[jira] [Updated] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amogh Margoor updated SPARK-29059: -- Issue Type: New Feature (was: Task) > [SPIP] Support for Hive Materialized Views in Spark SQL. > > > Key: SPARK-29059 > URL: https://issues.apache.org/jira/browse/SPARK-29059 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Amogh Margoor >Priority: Minor > > Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark > Catalyst does not optimize queries against Hive tables using Materialized > View the way Apache Calcite does it for Hive. This Jira is to add support for > the same. > We have developed it in our internal trunk and would like to open source it. > It would consist of 3 major parts: > # Reading MV related Hive Metadata > # Implication Engine which would figure out if an expression exp1 implies > another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar > to RexImplication checker in Apache Calcite. > # Catalyst rule to replace tables by it's Materialized view using > Implication Engine. For e.g., if MV 'mv' has been created in Hive using query > 'select * from foo where x > 10 && x <110' then query 'select * from foo > where x > 70 and x < 100' will be transformed into 'select * from mv where x > >70 and x < 100' > Note that Implication Engine and Catalyst Rule is generic can be used even > when Spark decides to have it's own Materialized View. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amogh Margoor updated SPARK-29059: -- Summary: [SPIP] Support for Hive Materialized Views in Spark SQL. (was: Support for Hive Materialized Views in Spark SQL.) > [SPIP] Support for Hive Materialized Views in Spark SQL. > > > Key: SPARK-29059 > URL: https://issues.apache.org/jira/browse/SPARK-29059 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Amogh Margoor >Priority: Minor > > Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark > Catalyst does not optimize queries against Hive tables using Materialized > View the way Apache Calcite does it for Hive. This Jira is to add support for > the same. > We have developed it in our internal trunk and would like to open source it. > It would consist of 3 major parts: > # Reading MV related Hive Metadata > # Implication Engine which would figure out if an expression exp1 implies > another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar > to RexImplication checker in Apache Calcite. > # Catalyst rule to replace tables by it's Materialized view using > Implication Engine. For e.g., if MV 'mv' has been created in Hive using query > 'select * from foo where x > 10 && x <110' then query 'select * from foo > where x > 70 and x < 100' will be transformed into 'select * from mv where x > >70 and x < 100' > Note that Implication Engine and Catalyst Rule is generic can be used even > when Spark decides to have it's own Materialized View. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938581#comment-16938581 ] Gary D. Gregory commented on SPARK-6305: Log4j 2 provides a clean separation between API and implementation. You can use the Log4j 2 API and plug in other logging implementations. You can use Log4j 2 as the implementation of other logging APIs. Please see https://logging.apache.org/log4j/2.x/manual/index.html. Log4j 2 also provides a Log4j 1 API bridge to Log4j 2. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Resolved] (SPARK-28997) Add `spark.sql.dialect`
[ https://issues.apache.org/jira/browse/SPARK-28997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28997. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25697 [https://github.com/apache/spark/pull/25697] > Add `spark.sql.dialect` > --- > > Key: SPARK-28997 > URL: https://issues.apache.org/jira/browse/SPARK-28997 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > After https://github.com/apache/spark/pull/25158 and > https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are > introduced into Spark. AFAIK, both features are implementation-defined > behaviors, which are not specified in ANSI SQL. > In such a case, this proposal is to add a configuration `spark.sql.dialect` > for choosing a database dialect. > After this PR, > Spark supports two database dialects, `Spark` and `PostgreSQL`. With > `PostgreSQL` dialect, Spark will: > 1. perform integral division with the / operator if both sides are integral > types; > 2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as > input and trim input for the boolean data type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
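The PostgreSQL boolean behaviour described in the last bullet — trimmed input, and unique prefixes of the accepted words — can be sketched as follows (an illustration of the rule as stated in the issue, not Spark's implementation):

```python
def parse_bool(s):
    """Parse a boolean the PostgreSQL-dialect way: trim whitespace, then
    accept any prefix that matches exactly one of the accepted words
    ('t', 'ye', 'fal', ...); ambiguous or unknown input is rejected."""
    TRUE_WORDS = ("true", "yes", "1")
    FALSE_WORDS = ("false", "no", "0")
    v = s.strip().lower()
    if not v:
        raise ValueError("empty boolean literal")
    t = [w for w in TRUE_WORDS if w.startswith(v)]
    f = [w for w in FALSE_WORDS if w.startswith(v)]
    if t and not f:
        return True
    if f and not t:
        return False
    raise ValueError(f"invalid boolean literal: {s!r}")

results = (parse_bool("  ye "), parse_bool("fal"))
```

Under the `Spark` dialect none of this applies; the point of the `spark.sql.dialect` flag is that such implementation-defined behaviours are opt-in per dialect.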
[jira] [Resolved] (SPARK-29255) Rename package pgSQL to postgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-29255. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25936 [https://github.com/apache/spark/pull/25936] > Rename package pgSQL to postgreSQL > -- > > Key: SPARK-29255 > URL: https://issues.apache.org/jira/browse/SPARK-29255 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.0.0 > > > To address the comment in > https://github.com/apache/spark/pull/25697#discussion_r328431070, let's > rename the package pgSQL to postgreSQL -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29261) Support recover live entities from KVStore for (SQL)AppStatusListener
wuyi created SPARK-29261: Summary: Support recover live entities from KVStore for (SQL)AppStatusListener Key: SPARK-29261 URL: https://issues.apache.org/jira/browse/SPARK-29261 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: wuyi To achieve the incremental replay goal in SHS, we need to support recovering live entities from the KVStore for both SQLAppStatusListener and AppStatusListener.
[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29203: Fix Version/s: 2.4.5 > Reduce shuffle partitions in SQLQueryTestSuite > -- > > Key: SPARK-29203 > URL: https://issues.apache.org/jira/browse/SPARK-29203 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > spark.sql.shuffle.partitions=200(default): > {noformat} > [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, > 763 milliseconds) > {noformat} > spark.sql.shuffle.partitions=5: > {noformat} > [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, > 360 milliseconds) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29203: Affects Version/s: 2.4.4 > Reduce shuffle partitions in SQLQueryTestSuite > -- > > Key: SPARK-29203 > URL: https://issues.apache.org/jira/browse/SPARK-29203 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.4, 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > spark.sql.shuffle.partitions=200(default): > {noformat} > [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, > 763 milliseconds) > {noformat} > spark.sql.shuffle.partitions=5: > {noformat} > [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, > 360 milliseconds) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938506#comment-16938506 ] Steve Loughran commented on SPARK-6305: --- bq. Steve, apart from the security issue concern, we also have the logs not rolling properly. Are you confident that this is actually due to log4j 1.x bugs, and that an upgrade will make it go away? If not, unless you can find evidence that other people have been seeing a similar problem, I'd worry about your deployment and configuration before the trauma of a move to 2.x. bq. In any case, my question is "Is Spark going to decouple itself from log4j 1.x and provide a way to configure logging with an alternative implementation?". bq. If yes, what would be the tentative timeline? No idea. I actually seem to recall some discussion about using logback behind the scenes, though you'll have to investigate that yourself. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Updated] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29260: Description: Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. (was: Hive 3.0.x and Hive 3.1.x is supported. Hive 2.2.1, 2.4.0 haven't released.) > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29260: Description: Hive 3.0.x and Hive 3.1.x are supported. Hive 2.2.1 and 2.4.0 haven't been released. > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.0.x and Hive 3.1.x are supported. Hive 2.2.1 and 2.4.0 haven't been released. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
Yuming Wang created SPARK-29260: --- Summary: Enable supported Hive metastore versions once it support altering database location Key: SPARK-29260 URL: https://issues.apache.org/jira/browse/SPARK-29260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938467#comment-16938467 ] Rajiv Bandi edited comment on SPARK-6305 at 9/26/19 10:20 AM: -- Steve, apart from the security issue concern, we also have the logs not rolling properly. In any case, my question is "Is Spark going to decouple itself from log4j 1.x and provide a way to configure logging with an alternative implementation?". If yes, what would be the tentative timeline? was (Author: rajivbandi): Steve, apart from the security issue concern, we also have the logs not rolling properly. In any case, my question is "Is Spark going to decouple itself from log4j 1.x and provided a way to configure logging with an alternative implementation?". If yes, what would be the tentative timeline? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938467#comment-16938467 ] Rajiv Bandi commented on SPARK-6305: Steve, apart from the security issue concern, we also have the logs not rolling properly. In any case, my question is "Is Spark going to decouple itself from log4j 1.x and provided a way to configure logging with an alternative implementation?". If yes, what would be the tentative timeline? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938465#comment-16938465 ] Maciej Szymkiewicz commented on SPARK-29212: [~podongfeng] First of all, thank you for creating this ticket. ??Would you like to help work on this??? If we can reach some consensus here, sure. That's only a dozen LOCs anyway. However, personally I am more interested in general directions. As I tried to point out (not sure if successfully), the current resolution of SPARK-28985, despite its negligible impact as such, sets a dangerous precedent and breaks with existing ML API conventions. At this point I still cannot tell if that happened by accident, or if it is intentional (not being actively involved in the development process, I get an impression that the latter might be true). > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copied from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am a Python user and want to implement a pure Python ML classifier without > providing a JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers.
> ** I want to validate that a user-passed argument is indeed a classifier: > *** For built-in objects using the "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it is hardly satisfying. > *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and user-provided classes. > *** Provide a parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either an end user or library developer and want to use PEP-484 > annotations to indicate components that require a classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
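The shape of the proposal above — public ABCs that both JVM-backed and pure-Python implementations extend — can be sketched with the standard `abc` module. This is an illustrative sketch only, not the actual `pyspark.ml` code; `MyClassifier` and `require_classifier` are hypothetical names:

```python
from abc import ABC, abstractmethod

class Estimator(ABC): ...
class Model(ABC): ...

class Classifier(Estimator):
    @abstractmethod
    def setRawPredictionCol(self, value): ...

class ClassificationModel(Model):
    @abstractmethod
    def numClasses(self): ...

# A pure-Python classifier can now be meaningfully positioned in the hierarchy:
class MyClassifier(Classifier):
    def setRawPredictionCol(self, value):
        self._raw_col = value
        return self

# Library code can then validate arguments with isinstance
# instead of duck typing on setRawPredictionCol:
def require_classifier(est):
    if not isinstance(est, Classifier):
        raise TypeError(f"expected a Classifier, got {type(est).__name__}")
    return est

require_classifier(MyClassifier())   # passes the check
```

With `Java*` wrappers subclassing these ABCs, both built-in and user-defined estimators would pass the same `isinstance` check, which is the consistency the comment argues for.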
[jira] [Resolved] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages
[ https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li resolved SPARK-28845. - Resolution: Won't Do After further investigation, I found that the objective of performing the sort only for the retried indeterminate stage cannot be achieved. It would break our assumption about the `outputDeterministicLevel` of each RDD, which should be defined when the job is submitted, whereas here we would need the output deterministic level to depend on the stage attempt number. As SPARK-25341 is completed, the current behavior depends on the config `spark.sql.execution.sortBeforeRepartition`. When the config is set to true, Spark will sort before repartition and rerun only the failed tasks; otherwise, the whole stage will rerun. > Enable spark.sql.execution.sortBeforeRepartition only for retried stages > > > Key: SPARK-28845 > URL: https://issues.apache.org/jira/browse/SPARK-28845 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > To fix the correctness bug of SPARK-28699, we disabled radix sort for the > repartition scenario in Spark SQL. This causes a performance > regression. > So, to limit the performance overhead, we'll optimize by > enabling the sort only for the repartition operation while stage retries are happening. > This work depends on SPARK-25341. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
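The nondeterminism that motivates sortBeforeRepartition can be illustrated with a toy round-robin partitioner in plain Python (not Spark code; `hash_partition` and the sample rows are hypothetical): if upstream row order changes across a retry, unsorted round-robin repartition assigns rows to different partitions, while sorting first makes the assignment order-independent.

```python
def hash_partition(rows, num_parts):
    # Toy round-robin repartition: which partition a row lands in
    # depends on its position in the incoming order.
    parts = [[] for _ in range(num_parts)]
    for i, row in enumerate(rows):
        parts[i % num_parts].append(row)
    return parts

rows_a = [3, 1, 2, 5, 4]   # arrival order in the first run
rows_b = [1, 2, 3, 4, 5]   # same data, different order after a retry

# Without a sort, the partition contents differ between runs:
assert hash_partition(rows_a, 2) != hash_partition(rows_b, 2)

# Sorting before repartition makes the output independent of arrival order:
assert hash_partition(sorted(rows_a), 2) == hash_partition(sorted(rows_b), 2)
```

This is why, with the config enabled, only the failed tasks need to rerun: the retried task reproduces the same partition contents regardless of upstream ordering.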
[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938439#comment-16938439 ] Yuming Wang commented on SPARK-29245: - How to reproduce: Hive side: {code:sh} export JAVA_HOME=/usr/lib/jdk1.8.0_221 export PATH=$JAVA_HOME/bin:$PATH cd /usr/lib/hive-2.3.5 bin/schematool -dbType derby -initSchema --verbose bin/hive --service metastore {code} Spark side: {code:sh} export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true bin/spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 {code} > CCE during creating HiveMetaStoreClient > > > Key: SPARK-29245 > URL: https://issues.apache.org/jira/browse/SPARK-29245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 > Environment: CDH 6.3 >Reporter: Dongjoon Hyun >Priority: Blocker > > From `master` branch build, when I try to connect to an external HMS, I hit > the following. > {code} > 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException > class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; > ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader > 'bootstrap') > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70) > {code} > With HIVE-21508, I can get the following. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4) > Type in expressions to have them evaluated. 
> Type :help for more information. > scala> sql("show databases").show > ++ > |databaseName| > ++ > | . | > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29254: -- Comment: was deleted (was: [~yumwang] I would like to work on it if you have not started ) > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> 
java.lang.ClassNotFoundException: Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) > 2019-09-25
[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938434#comment-16938434 ] Sandeep Katta commented on SPARK-29254: --- [~yumwang] I would like to work on it if you have not started > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> 
java.lang.ClassNotFoundException: Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) >
[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938433#comment-16938433 ] Lantao Jin commented on SPARK-29254: I am looking into this. > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class > 
org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) > 2019-09-25 22:11:42.854 - stderr> at >
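The failure mechanism reported above — an isolated metastore classloader that cannot resolve classes added via --jars — can be modeled with a toy loader in plain Python. This is a hedged sketch of the general classloader-isolation idea, not Spark's actual `IsolatedClientLoader`; `ToyClassLoader` is a hypothetical name:

```python
class ToyClassLoader:
    """Toy class loader: resolves only names on its own classpath."""
    def __init__(self, visible, parent=None):
        self.visible = set(visible)   # class names this loader can see
        self.parent = parent

    def load(self, name):
        if self.parent is not None:
            try:
                return self.parent.load(name)   # parent-first delegation
            except KeyError:
                pass
        if name in self.visible:
            return name
        raise KeyError(name)          # stand-in for ClassNotFoundException

serde = "org.apache.hive.hcatalog.data.JsonSerDe"

# The application loader has the --jars classes on its classpath...
app_loader = ToyClassLoader({serde})

# ...but an isolated metastore loader built only from the downloaded Hive
# jars does not include the user-supplied jars:
isolated = ToyClassLoader({"org.apache.hadoop.hive.metastore.HiveMetaStoreClient"})

assert app_loader.load(serde) == serde
try:
    isolated.load(serde)              # fails, mirroring the stack trace above
except KeyError:
    pass
```

In this model the fix would amount to making the user jars visible to the isolated loader as well, which is what the test in the report exercises.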
[jira] [Created] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode
Rahij Ramsharan created SPARK-29259: --- Summary: Filesystem.exists is called even when not necessary for append save mode Key: SPARK-29259 URL: https://issues.apache.org/jira/browse/SPARK-29259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Rahij Ramsharan When saving a DataFrame into Hadoop ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]), Spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive, and hence this PR makes that call only when necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
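The proposed change amounts to evaluating the existence check lazily, per save mode. A minimal stand-in in plain Python (`CountingFS` and `insert_into` are hypothetical names, not Spark's `InsertIntoHadoopFsRelationCommand`):

```python
class CountingFS:
    """Toy filesystem that counts exists() calls (stand-in for Hadoop FileSystem)."""
    def __init__(self):
        self.exists_calls = 0
    def exists(self, path):
        self.exists_calls += 1
        return False
    def delete(self, path):
        pass

def insert_into(fs, path, mode):
    """Decide whether to write, consulting exists() only when the mode needs it."""
    if mode == "append":
        return True                    # append never needs the existence check
    if mode == "overwrite":
        if fs.exists(path):
            fs.delete(path)            # clear existing output first
        return True
    if mode == "ignore":
        return not fs.exists(path)     # write only if nothing is there yet
    # default: error-if-exists
    if fs.exists(path):
        raise FileExistsError(path)
    return True

fs = CountingFS()
insert_into(fs, "/out", "append")
assert fs.exists_calls == 0            # append path skipped the exists() call
```

On object stores where `exists` is a remote round trip, skipping it for append-mode writes saves that call entirely.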
[jira] [Updated] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk
[ https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-29257: - Attachment: image-2019-09-26-16-44-48-554.png > All Task attempts scheduled to the same executor inevitably access the same > bad disk > > > Key: SPARK-29257 > URL: https://issues.apache.org/jira/browse/SPARK-29257 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Priority: Major > Attachments: image-2019-09-26-16-44-48-554.png > > > We have an HDFS/YARN cluster with about 2k~3k nodes, each node has 8T * 12 > local disks for storage and shuffle. Sometimes, one or more disks get into > bad status during computations. Sometimes it does cause job level failure, > sometimes it does not. > The following picture shows one failure job caused by 4 task attempts that were > all delivered to the same node and failed with almost the same exception for > writing the index temporary file to the same bad disk. > > This is caused by two reasons: > # As we can see in the figure, the data and the node have the best data > locality for reading from HDFS. As the default spark.locality.wait(3s) takes > effect, there is a high probability that those attempts will be scheduled to > this node. > # The index file or data file name for a particular shuffle map task is > fixed. It is formed by the shuffle id, the map id and the noop reduce id > which is always 0. The root local dir is picked by the fixed file name's > non-negative hash code % the disk number. Thus, this value is also fixed. > Even when we have 12 disks in total and only one of them is broken, once the > broken one is picked, all the following attempts of this task will > inevitably pick the broken one. > > > !image-2019-09-26-16-44-48-554.png!
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
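The second reason can be reproduced in a few lines of plain Python. This is a simplified stand-in for the JVM string hash and Spark's local-dir selection, with a hypothetical file name; the point is that because the shuffle file name is fixed per task, every retry resolves to the same local dir:

```python
def non_negative_hash(s):
    # Simplified stand-in for JVM String.hashCode clamped non-negative.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % (2 ** 31)

def pick_local_dir(file_name, local_dirs):
    # The chosen root dir depends only on the (fixed) file name.
    return local_dirs[non_negative_hash(file_name) % len(local_dirs)]

dirs = [f"/data{i}/spark" for i in range(12)]   # 12 local disks
name = "shuffle_3_42_0.index"                   # shuffleId_mapId_reduceId(0), fixed per task

# Four task attempts all resolve to the same disk, healthy or broken:
picked = {pick_local_dir(name, dirs) for _ in range(4)}
assert len(picked) == 1
```

Since the mapping is a pure function of the fixed name, retrying on the same executor can never escape a bad disk without extra randomization or disk blacklisting.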
[jira] [Updated] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk
[ https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-29257: - Description: We have an HDFS/YARN cluster with about 2k~3k nodes, each node has 8T * 12 local disks for storage and shuffle. Sometimes, one or more disks get into bad status during computations. Sometimes it does cause job level failure, sometimes it does not. The following picture shows one failure job caused by 4 task attempts that were all delivered to the same node and failed with almost the same exception for writing the index temporary file to the same bad disk. This is caused by two reasons: # As we can see in the figure, the data and the node have the best data locality for reading from HDFS. As the default spark.locality.wait(3s) takes effect, there is a high probability that those attempts will be scheduled to this node. # The index file or data file name for a particular shuffle map task is fixed. It is formed by the shuffle id, the map id and the noop reduce id which is always 0. The root local dir is picked by the fixed file name's non-negative hash code % the disk number. Thus, this value is also fixed. Even when we have 12 disks in total and only one of them is broken, once the broken one is picked, all the following attempts of this task will inevitably pick the broken one. !image-2019-09-26-16-44-48-554.png! was: We have an HDFS/YARN cluster with about 2k~3k nodes, each node has 8T * 12 local disks for storage and shuffle. Sometimes, one or more disks get into bad status during computations. Sometimes it does cause job level failure, sometimes does. The following picture shows one failure job caused by 4 task attempts were all delivered to the same node and failed with almost the same exception for writing the index temporary file to the same bad disk. This is caused by two reasons: # As we can see in the figure the data and the node have the best data locality for reading from HDFS. As the default spark.locality.wait(3s) taking effect, there is a high probability that those attempts will be scheduled to this node. # The index file or data file name for a particular shuffle map task is fixed. It is formed by the shuffle id, the map id and the noop reduce id which is always 0. The root local dir is picked by the fixed file name's non-negative hash code % the disk number. Thus, this value is also fixed. Even when we have 12 disks in total and only one of them is broken, if the broken one is once picked, all the following attempts of this task will inevitably pick the broken one. !image-2019-09-26-15-35-29-342.png! > All Task attempts scheduled to the same executor inevitably access the same > bad disk > > > Key: SPARK-29257 > URL: https://issues.apache.org/jira/browse/SPARK-29257 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Priority: Major > Attachments: image-2019-09-26-16-44-48-554.png > > > We have an HDFS/YARN cluster with about 2k~3k nodes, each node has 8T * 12 > local disks for storage and shuffle. Sometimes, one or more disks get into > bad status during computations. Sometimes it does cause job level failure, > sometimes it does not. > The following picture shows one failure job caused by 4 task attempts that were > all delivered to the same node and failed with almost the same exception for > writing the index temporary file to the same bad disk. > > This is caused by two reasons: > # As we can see in the figure, the data and the node have the best data > locality for reading from HDFS. As the default spark.locality.wait(3s) takes > effect, there is a high probability that those attempts will be scheduled to > this node. > # The index file or data file name for a particular shuffle map task is > fixed. It is formed by the shuffle id, the map id and the noop reduce id > which is always 0.
The root local dir is picked by the fixed file name's > non-negative hash code % the disk number. Thus, this value is also fixed. > Even when we have 12 disks in total and only one of them is broken, if the > broken one is once picked, all the following attempts of this task will > inevitably pick the broken one. > > > !image-2019-09-26-16-44-48-554.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
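The deterministic disk pick described in the issue can be sketched in a few lines. This is a minimal Python stand-in, not Spark's actual Scala code: `java_string_hash` mimics `java.lang.String.hashCode`, `non_negative_mod` mirrors the non-negative-modulo idea, and the shuffle file name format is illustrative.

```python
def java_string_hash(s: str) -> int:
    """Mimic java.lang.String.hashCode with 32-bit signed overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def non_negative_mod(x: int, mod: int) -> int:
    """Map a possibly negative hash into [0, mod)."""
    return (x % mod + mod) % mod

def pick_local_dir(file_name: str, num_disks: int) -> int:
    """Pick a root local dir index from the (fixed) shuffle file name."""
    return non_negative_mod(java_string_hash(file_name), num_disks)

# The index file name for a given shuffle map task is fixed (the reduce id
# is always 0), so every retry of the same task lands on the same disk.
name = "shuffle_3_7_0.index"  # hypothetical name: shuffle id 3, map id 7
picks = {pick_local_dir(name, 12) for _ in range(4)}  # 4 task attempts
print(picks)  # always a single disk index out of 12
```

With 12 disks and one bad disk, the picked index never changes across attempts, which is exactly why all retries can hit the same broken disk.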
[jira] [Created] (SPARK-29258) parity between ml.evaluator and mllib.metrics
zhengruifeng created SPARK-29258: Summary: parity between ml.evaluator and mllib.metrics Key: SPARK-29258 URL: https://issues.apache.org/jira/browse/SPARK-29258 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 1. expose {{BinaryClassificationMetrics.numBins}} in {{BinaryClassificationEvaluator}} 2. expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}} 3. add metric {{explainedVariance}} in {{RegressionEvaluator}}
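For item 3, the {{explainedVariance}} metric can be illustrated without Spark. This is a hedged sketch in plain Python, assuming the regression-sum-of-squares form documented for {{RegressionMetrics}}, explainedVariance = sum((y_hat_i - mean(y))^2) / n; the sample labels and predictions are made up.

```python
def explained_variance(labels, predictions):
    """Explained variance in the regression-sum-of-squares form,
    sum((y_hat_i - mean(y))**2) / n (assumed definition)."""
    n = len(labels)
    y_mean = sum(labels) / n
    return sum((p - y_mean) ** 2 for p in predictions) / n

# Hypothetical labels and model predictions.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(explained_variance(y, y_hat))  # 1.125
```

Note that for perfect predictions this reduces to the population variance of the labels, which is one reason the metric is useful as a regression diagnostic.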
[jira] [Created] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk
Kent Yao created SPARK-29257: Summary: All Task attempts scheduled to the same executor inevitably access the same bad disk Key: SPARK-29257 URL: https://issues.apache.org/jira/browse/SPARK-29257 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.4, 2.3.4 Reporter: Kent Yao We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 8T * 12 local disks for storage and shuffle. Sometimes one or more disks get into a bad state during computations. Sometimes this causes a job-level failure, and sometimes it does not. The following picture shows one failed job caused by 4 task attempts that were all delivered to the same node and failed with almost the same exception while writing the index temporary file to the same bad disk. This is caused by two things: # As we can see in the figure, the data and the node have the best data locality for reading from HDFS. With the default spark.locality.wait (3s) taking effect, there is a high probability that those attempts will be scheduled to this node. # The index file or data file name for a particular shuffle map task is fixed: it is formed from the shuffle id, the map id, and the noop reduce id, which is always 0. The root local dir is picked by the fixed file name's non-negative hash code % the number of disks, so this value is also fixed. Even though we have 12 disks in total and only one of them is broken, once the broken disk is picked, all following attempts of this task will inevitably pick it again. !image-2019-09-26-15-35-29-342.png!