[jira] [Resolved] (SPARK-29258) parity between ml.evaluator and mllib.metrics

2019-09-26 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29258.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25940
[https://github.com/apache/spark/pull/25940]

> parity between ml.evaluator and mllib.metrics
> -
>
> Key: SPARK-29258
> URL: https://issues.apache.org/jira/browse/SPARK-29258
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1, expose {{BinaryClassificationMetrics.numBins}} in 
> {{BinaryClassificationEvaluator}}
> 2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}}
> 3, add metric {{explainedVariance}} in {{RegressionEvaluator}}
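
For reference, a minimal spark.ml sketch of how these knobs would be used, assuming the setters follow the usual Param naming convention (setNumBins, setThroughOrigin) and that explained variance is selected via metricName="var":

{code:scala}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

// numBins is forwarded to the underlying BinaryClassificationMetrics.
val binEval = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setNumBins(1000)

// throughOrigin is forwarded to RegressionMetrics; "var" selects explained variance.
val regEval = new RegressionEvaluator()
  .setMetricName("var")
  .setThroughOrigin(true)

// Both evaluators are then used as usual: binEval.evaluate(predictions), etc.
{code}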



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-09-26 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939109#comment-16939109
 ] 

Huaxin Gao commented on SPARK-29269:


Yes. I am happy to work on this. Thanks! [~podongfeng]

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> ping [~huaxingao], would you like to work on this? This is similar to your 
> previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27715) SQL query details in UI does not show in correct format.

2019-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-27715.
--
Fix Version/s: 3.0.0
 Assignee: Genmao Yu
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/24609

> SQL query details in UI does not show in correct format.
> 
>
> Key: SPARK-27715
> URL: https://issues.apache.org/jira/browse/SPARK-27715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29175.
---
Fix Version/s: 3.0.0
 Assignee: Yuanjian Li
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25849

> Make maven central repository in IsolatedClientLoader configurable
> --
>
> Key: SPARK-29175
> URL: https://issues.apache.org/jira/browse/SPARK-29175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> We need to connect to a central repository in IsolatedClientLoader to 
> download Hive jars. Here we add a new config, 
> `spark.sql.additionalRemoteRepositories`, a comma-delimited list of optional 
> additional remote Maven mirror repositories to use in addition to the 
> default Maven central repo.
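
A minimal sketch of wiring this up from a SparkSession, using the config name exactly as described above; the mirror URL is a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hive client jars are downloaded through IsolatedClientLoader when
// spark.sql.hive.metastore.jars=maven; the extra repository below is tried
// in addition to the default Maven central repo.
val spark = SparkSession.builder()
  .appName("isolated-hive-client")
  .config("spark.sql.hive.metastore.version", "2.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("spark.sql.additionalRemoteRepositories",
    "https://repo.example.com/maven2") // placeholder mirror URL
  .enableHiveSupport()
  .getOrCreate()
{code}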



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-09-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939099#comment-16939099
 ] 

Hyukjin Kwon commented on SPARK-29245:
--

BTW, can we confirm that this is the last one we need for JDK 11 from Hive 
2.3.x?
We cannot keep asking the Hive dev community to make a minor release for every bug fix ...
It might be better to wait until we are closer to the Spark 3 release, see if 
more bugs are found, and then ask for them in a batch.
cc [~alangates]

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: CDH 6.3
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26431:
--
Fix Version/s: (was: 2.4.0)

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> availableCpus decreases as tasks are allocated, so we should derive 
> availableSlots from availableCpus for a barrier taskset to avoid an 
> unnecessary resourceOffer process.
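
A rough sketch of the intended calculation (not the exact TaskSchedulerImpl code), assuming spark.task.cpus = 1:

{code:scala}
// Free CPUs per active offer, already reduced by tasks allocated in this round.
val availableCpus: Array[Int] = Array(4, 2, 0, 6)
val CPUS_PER_TASK = 1 // spark.task.cpus, assumed 1 here

// Slots derived from what is still free, so they shrink as tasks are allocated.
val availableSlots = availableCpus.map(_ / CPUS_PER_TASK).sum

// A barrier taskset should only be offered resources when there are enough
// slots for all of its tasks at once; otherwise the offer round is skipped.
def canLaunchBarrier(numBarrierTasks: Int): Boolean = availableSlots >= numBarrierTasks
{code}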



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29246) Remove unnecessary imports in `core` module

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29246.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25927

> Remove unnecessary imports in `core` module
> ---
>
> Key: SPARK-29246
> URL: https://issues.apache.org/jira/browse/SPARK-29246
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Assignee: Jiaqi Li
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-09-26 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29269:


 Summary: Pyspark ALSModel support getters/setters
 Key: SPARK-29269
 URL: https://issues.apache.org/jira/browse/SPARK-29269
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


ping [~huaxingao], would you like to work on this? This is similar to your 
previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29142) Pyspark clustering models support column setters/getters/predict

2019-09-26 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29142.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25859
[https://github.com/apache/spark/pull/25859]

> Pyspark clustering models support column setters/getters/predict
> 
>
> Key: SPARK-29142
> URL: https://issues.apache.org/jira/browse/SPARK-29142
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Unlike the reg/clf models, clustering models do not share a common base 
> class, so we need to add them one by one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-09-26 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939082#comment-16939082
 ] 

zhengruifeng commented on SPARK-29212:
--

[~zero323] I had not noticed the base hierarchy without a JVM backend in 
SPARK-28985; thank you for pointing it out.

I think we have reached consensus on:

1, adding base classes without a JVM backend and making the JVM-backed classes 
extend them (perhaps limited to the classes modified in SPARK-28985 at first);

2, renaming private class names following PEP 8.

 

 

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copied from [https://github.com/apache/spark/pull/25776].
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am a Python user and want to implement a pure Python ML classifier 
> without providing a JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run-time parameter validation for both user-defined (see above) and 
> existing class hierarchies:
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that a user-passed argument is indeed a classifier:
>  *** For built-in objects, using the "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and the Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for a classifier, on 
> {{numClasses}} for a classification model), but it is hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That, however, would require separate logic for checking for 
> built-in and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]):
>     def setRawPredictionCol(self, value: str) -> Classifier: ...
> class Classifier(Protocol, Model):
>     def setRawPredictionCol(self, value: str) -> Model: ...
>     def numClasses(self) -> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
>
> class Predictor(Estimator, PredictorParams):
>     def setLabelCol(self, value): ...
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>
> class Classifier(Predictor, ClassifierParams):
>     def setRawPredictionCol(self, value): ...
>
> class PredictionModel(Model, PredictorParams):
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>     def numFeatures(self): ...
>     def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words, it should be consistent with the existing approach, where we 
> have ABCs reflecting the Scala API (Transformer, Estimator, Model) and so on, 
> and the Java* variants are their subclasses.
>  {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk

2019-09-26 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939075#comment-16939075
 ] 

Kent Yao commented on SPARK-29257:
--

This issue should not be a problem anymore after this fix: 
[https://github.com/apache/spark/pull/25620]. Each index and data file now 
gets a unique name that includes the task attempt id.
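
As background for the root cause described in the quoted ticket below, here is a simplified sketch (not the actual DiskBlockManager source) of how a local dir is chosen from the fixed shuffle file name, which is why every retry of the same task used to land on the same disk:

{code:scala}
// The index file name for a shuffle map task is fixed across attempts,
// e.g. "shuffle_<shuffleId>_<mapId>_0.index".
def nonNegativeHash(name: String): Int = {
  val h = name.hashCode
  if (h == Int.MinValue) 0 else math.abs(h)
}

// Root local dir = hash of the (fixed) file name modulo the number of disks,
// so the chosen disk is also fixed -- even if that disk happens to be broken.
def pickLocalDir(fileName: String, localDirs: Seq[String]): String =
  localDirs(nonNegativeHash(fileName) % localDirs.length)

val dirs = (0 until 12).map(i => s"/data$i/spark") // hypothetical 12-disk layout
pickLocalDir("shuffle_3_42_0.index", dirs)         // same dir on every attempt
{code}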

> All Task attempts scheduled to the same executor inevitably access the same 
> bad disk
> 
>
> Key: SPARK-29257
> URL: https://issues.apache.org/jira/browse/SPARK-29257
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2019-09-26-16-44-48-554.png
>
>
> We have an HDFS/YARN cluster with about 2k~3k nodes, and each node has 8T * 12 
> local disks for storage and shuffle. Sometimes one or more disks go into a bad 
> state during computations; sometimes this causes a job-level failure, sometimes 
> it does not.
> The following picture shows one failed job caused by 4 task attempts that were 
> all delivered to the same node and failed with almost the same exception while 
> writing the temporary index file to the same bad disk.
>  
> There are two reasons for this:
>  # As we can see in the figure, the data and the node have the best data 
> locality for reading from HDFS. With the default spark.locality.wait (3s) 
> taking effect, there is a high probability that those attempts will be 
> scheduled to this node.
>  # The index or data file name for a particular shuffle map task is fixed: it 
> is formed from the shuffle id, the map id and the noop reduce id, which is 
> always 0. The root local dir is picked by the fixed file name's non-negative 
> hash code % the number of disks, so this choice is also fixed. Even with 12 
> disks in total and only one of them broken, once the broken one is picked, 
> all following attempts of this task will inevitably pick the broken one.
>  
>  
> !image-2019-09-26-16-44-48-554.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled

2019-09-26 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939073#comment-16939073
 ] 

Sandeep Katta commented on SPARK-29268:
---

I will work on this

> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled
> --
>
> Key: SPARK-29268
> URL: https://issues.apache.org/jira/browse/SPARK-29268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}).
> How to reproduce:
> {code:sh}
> bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
> spark.sql.hive.metastore.jars=maven
> {code}
> Logs:
> {noformat}
> ...
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see 
> the next exception for details.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 108 more
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   ... 105 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29268:

Description: 
Failed to start spark-sql when using Derby metastore and isolatedLoader is 
enabled({{spark.sql.hive.metastore.jars != builtin}}).
How to reproduce:
{code:sh}
bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
spark.sql.hive.metastore.jars=maven
{code}
Logs:
{noformat}
...
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with 
class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the 
next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 108 more
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 105 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
...
{noformat}


  was:
Failed to start spark-sql when using Derby metastore and isolatedLoader is 
enabled({{spark.sql.hive.metastore.jars != builtin}}).
How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
spark.sql.hive.metastore.jars=maven
{noformat}
Logs:
{noformat}
...
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with 
class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the 
next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 108 more
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 105 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
...
{noformat}



> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled
> --
>
> Key: SPARK-29268
> URL: https://issues.apache.org/jira/browse/SPARK-29268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}).
> How to reproduce:
> {code:sh}
> bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
> spark.sql.hive.metastore.jars=maven
> {code}
> Logs:
> {noformat}
> ...
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see 
> the next exception for details.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 108 more
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   ... 105 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
> ...
> {noformat}



--
This message was 

[jira] [Updated] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29268:

Description: 
Failed to start spark-sql when using Derby metastore and isolatedLoader is 
enabled({{spark.sql.hive.metastore.jars != builtin}}).
How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
spark.sql.hive.metastore.jars=maven
{noformat}
Logs:
{noformat}
...
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with 
class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the 
next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 108 more
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 105 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
...
{noformat}


  was:
How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
spark.sql.hive.metastore.jars=maven
{noformat}
Logs:
{noformat}
...
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with 
class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the 
next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 108 more
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 105 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
...
{noformat}



> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled
> --
>
> Key: SPARK-29268
> URL: https://issues.apache.org/jira/browse/SPARK-29268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to start spark-sql when using Derby metastore and isolatedLoader is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}).
> How to reproduce:
> {noformat}
> bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
> spark.sql.hive.metastore.jars=maven
> {noformat}
> Logs:
> {noformat}
> ...
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see 
> the next exception for details.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 108 more
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   ... 105 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To 

[jira] [Created] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled

2019-09-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29268:
---

 Summary: Failed to start spark-sql when using Derby metastore and 
isolatedLoader is enabled
 Key: SPARK-29268
 URL: https://issues.apache.org/jira/browse/SPARK-29268
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.4, 3.0.0
Reporter: Yuming Wang


How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf 
spark.sql.hive.metastore.jars=maven
{noformat}
Logs:
{noformat}
...
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with 
class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see the 
next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 108 more
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 105 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /root/spark-2.3.4-bin-hadoop2.7/metastore_db.
...
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled

2019-09-26 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939068#comment-16939068
 ] 

angerszhu commented on SPARK-29254:
---

[~cltlfcjin] Not the same reason.

> Failed to include jars passed in through --jars when isolatedLoader is enabled
> --
>
> Key: SPARK-29254
> URL: https://issues.apache.org/jira/browse/SPARK-29254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to include jars passed in through --jars when {{isolatedLoader}} is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce:
> {code:scala}
>   test("SPARK-29254: include jars passed in through --jars when 
> isolatedLoader is enabled") {
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA"))
> val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB"))
> val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath
> val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath
> val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => 
> j.toString).mkString(",")
> val args = Seq(
>   "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"),
>   "--name", "SparkSubmitClassLoaderTest",
>   "--master", "local-cluster[2,1,1024]",
>   "--conf", "spark.ui.enabled=false",
>   "--conf", "spark.master.rest.enabled=false",
>   "--conf", "spark.sql.hive.metastore.version=3.1.2",
>   "--conf", "spark.sql.hive.metastore.jars=maven",
>   "--driver-java-options", "-Dderby.system.durability=test",
>   "--jars", jarsString,
>   unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB")
> runSparkSubmit(args)
>   }
> {code}
> Logs:
> {noformat}
> 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in 
> initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242)
> 2019-09-25 22:11:42.854 - stderr> at 
> 

[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function

2019-09-26 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939067#comment-16939067
 ] 

Wenchen Fan commented on SPARK-29262:
-

There is a `DataFrameWriterV2`, we can consider adding this API there.
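
For reference, a hedged sketch of the existing DataFrameWriterV2 entry points that such an API would sit next to (table names are placeholders, and insertIntoPartition itself does not exist yet); assumes a spark-shell session:

{code:scala}
val df = spark.table("db.src")

// Existing DataFrameWriterV2 calls:
df.writeTo("db.target").append()              // append rows
df.writeTo("db.target").overwritePartitions() // dynamic partition overwrite

// A partition-targeted insert, if added, would presumably live alongside these.
{code}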

> DataFrameWriter insertIntoPartition function
> 
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Minor
>
> Do we have plan to support insertIntoPartition function for dataFrameWriter?
> [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-09-26 Thread Kangtian (Jira)
Kangtian created SPARK-29267:


 Summary: rdd.countApprox should stop when 'timeout'
 Key: SPARK-29267
 URL: https://issues.apache.org/jira/browse/SPARK-29267
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Kangtian


{{org.apache.spark.rdd.RDD#countApprox}} is the API for approximate counting:

+countApprox(timeout: Long, confidence: Double = 0.95)+

But when the timeout is reached, the job keeps running until it actually 
finishes.

We want: *when the timeout is reached, the job should finish immediately*, 
without computing the final value.
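
For reference, a small spark-shell sketch of the current behaviour: the call returns a PartialResult after the timeout, but the underlying job keeps running to completion:

{code:scala}
import org.apache.spark.partial.{BoundedDouble, PartialResult}

val rdd = sc.parallelize(1 to 100000000, 1000)

// Blocks for at most 1000 ms, then returns whatever has been computed so far.
val partial: PartialResult[BoundedDouble] = rdd.countApprox(timeout = 1000L, confidence = 0.95)

val approx = partial.initialValue
println(s"count ~= ${approx.mean} in [${approx.low}, ${approx.high}]")

// The job is not stopped at the timeout; getFinalValue() still waits for it to
// finish completely -- which is the behaviour this ticket asks to change.
// partial.getFinalValue()
{code}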

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets

2019-09-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29266:
---
Description: 
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty", 
limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
  }
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record (due to how global 
limits are planned / executed in this query).

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset, hence the long 
runtime for this).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself (reducing the burden placed on users to understand 
internal details of Spark's execution model).

Maybe we can special-case {{isEmpty}} for the case where plan consists of only 
a file source scan (or a file source scan followed by a projection, but without 
any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} 
implementation (under assumption that we don't have a ton of empty input files) 
or something fancier (metadata-only query, looking at Parquet footers, 
delegating to some datasource API, etc).

 

  was:
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty", 
limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
  }
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset, hence the long 
runtime for this).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself (reducing the burden placed on users to understand 
internal details of Spark's execution model).

Maybe we can special-case {{IsEmpty}} for the case where plan consists of only 
a file source scan (or a file source scan followed by a projection, but without 
any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} 
implementation (under assumption that we don't have a ton of empty input files) 
or something fancier (metadata-only query, looking at Parquet footers, 
delegating to some datasource API, etc).

 


> Optimize Dataset.isEmpty for base relations / unfiltered datasets
> -
>
> Key: SPARK-29266
> URL: https://issues.apache.org/jira/browse/SPARK-29266
> Project: Spark
>  

[jira] [Updated] (SPARK-29246) Remove unnecessary imports in `core` module

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29246:
--
Summary: Remove unnecessary imports in `core` module  (was: Remove 
unnecessary imports in CoarseGrainedExecutorBackendSuite)

> Remove unnecessary imports in `core` module
> ---
>
> Key: SPARK-29246
> URL: https://issues.apache.org/jira/browse/SPARK-29246
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:54 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is 
different from the one generated when ordering by another column. Can someone 
guess why a similar Window+orderBy (ordering by one of the partitionBy columns) 
tanked my performance and used only one executor? (It's not a problem of the 
data having only one value; that key had multiple values evenly distributed.)

I'm going to keep investigating to see if I can reproduce this in a dummy test, 
but removing the orderBy in a similar case just fixed the performance problem 
in production :$ (it was causing 50+ min single tasks)


was (Author: fsainz):
Sorry, ima test again... I'm seeing something that doesn't match my guessings 
in my test case.

I thought the cause was because it triggered a global sort, but plan is similar 
different to the one generated when ordering by another column, someone knows 
why a similar Window+orderBy (ordering by one of the partitionBy columns) 
tanked my performance and only used one executor? (not a problem with data 
having only one value, that key had multiple values evenly distributed).

I'm going to keep investigating to see if I can reproduce this in a dummy test, 
but removing the orderBy in a similar case just fixed performance problem in 
production :$ (it was causing 50+min single-tasks)

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program became very slow or crashed and didn't 
> finish in time (because, as it's a global orderBy, all the data goes to a 
> single executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full code showing the error (see how mapPartitions shows 99 rows in one 
> partition) is attached in Test.scala; the sbt project (src and build.sbt) is 
> attached too in TestSpark.zip.
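
A hedged sketch of the workaround mentioned in the comment above, reusing the reporter's names from the attached Test.scala and assuming a spark-shell session: drop the orderBy from the window spec. collect_list needs no ordering, and since every row in a partition shares the same word (all rows are ordering peers), the collected list should be unchanged:

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list
import spark.implicits._ // assumes a SparkSession named `spark`

// partitionBy only: no window ordering, so no extra sort is required.
val myWindow = Window.partitionBy($"word")

// `filtrador` is the reporter's DataFrame from the attached Test.scala.
val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
{code}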



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29246) Remove unnecessary imports in `core` module

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29246:
-

Assignee: Jiaqi Li

> Remove unnecessary imports in `core` module
> ---
>
> Key: SPARK-29246
> URL: https://issues.apache.org/jira/browse/SPARK-29246
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Assignee: Jiaqi Li
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29246) Remove unnecessary imports in `core` module

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29246:
--
Component/s: (was: Tests)

> Remove unnecessary imports in `core` module
> ---
>
> Key: SPARK-29246
> URL: https://issues.apache.org/jira/browse/SPARK-29246
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29259:
-

Assignee: Rahij Ramsharan

> Filesystem.exists is called even when not necessary for append save mode
> 
>
> Key: SPARK-29259
> URL: https://issues.apache.org/jira/browse/SPARK-29259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rahij Ramsharan
>Assignee: Rahij Ramsharan
>Priority: Minor
> Fix For: 3.0.0
>
>
> When saving a dataframe into Hadoop 
> ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]),
>  spark first checks if the file exists before inspecting the SaveMode to 
> determine if it should actually insert data. However, the pathExists variable 
> is actually not used in the case of SaveMode.Append. In some file systems, 
> the exists call can be expensive and hence this PR makes that call only when 
> necessary.
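
A minimal sketch of the idea as a standalone helper (not Spark's actual InsertIntoHadoopFsRelationCommand code): the filesystem is only queried for the modes whose behaviour depends on the answer, and never for Append:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Returns true if the write should proceed; fs.exists() is only called when the
// save mode actually needs to know whether the path is already there.
def shouldWrite(fs: FileSystem, outputPath: Path, mode: SaveMode): Boolean = mode match {
  case SaveMode.Append =>
    true                            // no exists() check at all
  case SaveMode.Overwrite =>
    if (fs.exists(outputPath)) fs.delete(outputPath, true)
    true
  case SaveMode.ErrorIfExists =>
    if (fs.exists(outputPath)) sys.error(s"$outputPath already exists")
    true
  case SaveMode.Ignore =>
    !fs.exists(outputPath)          // silently skip if the path is present
}
{code}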



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29259.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25928

> Filesystem.exists is called even when not necessary for append save mode
> 
>
> Key: SPARK-29259
> URL: https://issues.apache.org/jira/browse/SPARK-29259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Rahij Ramsharan
>Priority: Minor
> Fix For: 3.0.0
>
>
> When saving a dataframe into Hadoop 
> ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]),
>  spark first checks if the file exists before inspecting the SaveMode to 
> determine if it should actually insert data. However, the pathExists variable 
> is actually not used in the case of SaveMode.Append. In some file systems, 
> the exists call can be expensive and hence this PR makes that call only when 
> necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29259:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Filesystem.exists is called even when not necessary for append save mode
> 
>
> Key: SPARK-29259
> URL: https://issues.apache.org/jira/browse/SPARK-29259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rahij Ramsharan
>Priority: Minor
> Fix For: 3.0.0
>
>
> When saving a dataframe into Hadoop 
> ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]),
>  spark first checks if the file exists before inspecting the SaveMode to 
> determine if it should actually insert data. However, the pathExists variable 
> is actually not used in the case of SaveMode.Append. In some file systems, 
> the exists call can be expensive and hence this PR makes that call only when 
> necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets

2019-09-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29266:
---
Description: 
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty", 
limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
  }
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset, hence the long 
runtime for this).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself (reducing the burden placed on users to understand 
internal details of Spark's execution model).

Maybe we can special-case {{IsEmpty}} for the case where plan consists of only 
a file source scan (or a file source scan followed by a projection, but without 
any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} 
implementation (under assumption that we don't have a ton of empty input files) 
or something fancier (metadata-only query, looking at Parquet footers, 
delegating to some datasource API, etc).

 

  was:
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty", 
limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
  }
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset, hence the long 
runtime for this).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where plan consists of only 
a file source scan (or a file source scan followed by a projection, but without 
any filters, etc.). In those cases, we can use either the {{.limit(1).take()}} 
implementation (under assumption that we don't have a ton of empty input files) 
or something fancier (metadata-only query, looking at Parquet footers, 
delegating to some datasource API, etc).

 


> Optimize Dataset.isEmpty for base relations / unfiltered datasets
> -
>
> Key: SPARK-29266
> URL: https://issues.apache.org/jira/browse/SPARK-29266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> SPARK-23627 

[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets

2019-09-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29266:
---
Description: 
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty",
    limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where the plan consists of 
only a file source scan (or a file source scan followed by a projection, but 
without any filters, etc.). In those cases, we can use either the 
{{.limit(1).take()}} implementation (under the assumption that we don't have a 
ton of empty input files) or something fancier (a metadata-only query, looking 
at Parquet footers, delegating to some datasource API, etc.).

 

  was:
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty",
    limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition.

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where the plan consists of 
only a file source scan (or a file source scan followed by a projection, but 
without any filters, etc.). In those cases, we can use either the 
{{.limit(1).take()}} implementation (under the assumption that we don't have a 
ton of empty input files) or something fancier (a metadata-only query, looking 
at Parquet footers, delegating to some datasource API, etc.).

 


> Optimize Dataset.isEmpty for base relations / unfiltered datasets
> -
>
> Key: SPARK-29266
> URL: https://issues.apache.org/jira/browse/SPARK-29266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented 
> as
> {code:java}
> def isEmpty: Boolean = withAction("isEmpty", 
> limit(1).groupBy().count().queryExecution) { plan =>
> 

[jira] [Updated] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets

2019-09-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29266:
---
Description: 
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty",
    limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset, hence the long 
runtime for this).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where the plan consists of 
only a file source scan (or a file source scan followed by a projection, but 
without any filters, etc.). In those cases, we can use either the 
{{.limit(1).take()}} implementation (under the assumption that we don't have a 
ton of empty input files) or something fancier (a metadata-only query, looking 
at Parquet footers, delegating to some datasource API, etc.).

 

  was:
SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty",
    limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition (this is an enormous dataset).

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where the plan consists of 
only a file source scan (or a file source scan followed by a projection, but 
without any filters, etc.). In those cases, we can use either the 
{{.limit(1).take()}} implementation (under the assumption that we don't have a 
ton of empty input files) or something fancier (a metadata-only query, looking 
at Parquet footers, delegating to some datasource API, etc.).

 


> Optimize Dataset.isEmpty for base relations / unfiltered datasets
> -
>
> Key: SPARK-29266
> URL: https://issues.apache.org/jira/browse/SPARK-29266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented 
> as
> {code:java}
> def isEmpty: Boolean = withAction("isEmpty", 
> 

[jira] [Created] (SPARK-29266) Optimize Dataset.isEmpty for base relations / unfiltered datasets

2019-09-26 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-29266:
--

 Summary: Optimize Dataset.isEmpty for base relations / unfiltered 
datasets
 Key: SPARK-29266
 URL: https://issues.apache.org/jira/browse/SPARK-29266
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


SPARK-23627 added a {{Dataset.isEmpty}} method. This is currently implemented as
{code:java}
def isEmpty: Boolean = withAction("isEmpty",
    limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
{code}
which has a global limit of 1 embedded in the middle of the query plan.

As a result, this will end up computing *all* partitions of the Dataset but 
each task can stop early once it's computed a single record.

We could instead implement this as {{ds.limit(1).collect().isEmpty}} but that 
will go through the "CollectLimit" execution strategy which first computes 1 
partition, then 2, then 4, and so on. That will be faster in some cases but 
slower in others: if the dataset is indeed empty then that method will be 
slower than one which checks all partitions in parallel, but if it's non-empty 
(and most tasks' output is non-empty) then it can be much faster.

There's not an obviously-best implementation here. However, I think there's 
high value (and low downside) to optimizing for the special case where the 
Dataset is an unfiltered, untransformed input dataset (e.g. the output of 
{{spark.read.parquet}}):

I found a production job which calls {{isEmpty}} on the output of 
{{spark.read.parquet()}} and the {{isEmpty}} call took several minutes to 
complete because it needed to launch hundreds of thousands of tasks to compute 
a single record of each partition.

I could instruct the job author to use a different, more efficient method of 
checking for non-emptiness, but this feels like the type of optimization that 
Spark should handle itself.

Maybe we can special-case {{IsEmpty}} for the case where the plan consists of 
only a file source scan (or a file source scan followed by a projection, but 
without any filters, etc.). In those cases, we can use either the 
{{.limit(1).take()}} implementation (under the assumption that we don't have a 
ton of empty input files) or something fancier (a metadata-only query, looking 
at Parquet footers, delegating to some datasource API, etc.).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Attachment: (was: Test.scala)

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29202) --driver-java-options are not passed to driver process in yarn client mode

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29202.
---
Fix Version/s: 3.0.0
 Assignee: Sandeep Katta
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25889

> --driver-java-options are not passed to driver process in yarn client mode
> --
>
> Key: SPARK-29202
> URL: https://issues.apache.org/jira/browse/SPARK-29202
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.0
>
>
> Run the below command 
> ./bin/spark-sql --master yarn 
> --driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address="
>  
> In Spark 2.3.3
> /opt/softwares/Java/jdk1.8.0_211/bin/java -cp 
> /opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/conf/:/opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/
>  -Xmx1g -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= 
> org.apache.spark.deploy.SparkSubmit --master yarn --conf 
> spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=
>  --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver 
> spark-internal
>  
> In Spark 3.0
> /opt/softwares/Java/jdk1.8.0_211/bin/java -cp 
> /opt/apache/git/sparkSourceCode/spark/conf/:/opt/apache/git/sparkSourceCode/spark/assembly/target/scala-2.12/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/
>  org.apache.spark.deploy.SparkSubmit --master yarn --conf 
> spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556
>  --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver 
> spark-internal
> We can see that java options are not passed to driver process in spark3
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Attachment: (was: TestSpark.zip)

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Affects Version/s: (was: 2.4.4)

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Affects Version/s: 2.4.4

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and didn't 
finish in time (because, as it's a global orderBy, it will all go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full code showing the error (see how the mapPartitions shows 99 rows in one 
partition) is attached in Test.scala; the sbt project (src and build.sbt) is 
attached too, in TestSpark.zip.
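
As a sketch of the workaround implied by expectation 2 above (assuming the 
column names from the attached Test.scala): when the ordering column is the 
same as the partitioning column, every row in a partition has the same 
ordering value, so dropping the {{orderBy}} should give the same 
{{collect_list}} result while avoiding any extra sort:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame built in the attached Test.scala.
val filtrador = Seq(("a", 1L), ("a", 2L), ("b", 3L)).toDF("word", "number")

// Partition-only window: without an orderBy the frame is the whole partition,
// so no per-window ordering of the rows is needed.
val myWindow = Window.partitionBy($"word")
val filt2 = filtrador.withColumn("avg_Time",
  collect_list($"number").over(myWindow))

filt2.explain()  // check the physical plan for unexpected global sorts
{code}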

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and didn't 
finish in time (because, as it's a global orderBy, it will all go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full code showing the error (see how the mapPartitions shows 99 rows in one 
partition) is attached in Test.scala; the sbt project (src and build.sbt) is 
attached too, in TestSpark.zip.


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:18 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating to see if I can reproduce this in a dummy 
test, but removing the orderBy in a similar case just fixed the performance 
problem in production :$
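
(A hedged way to check this, assuming the {{spark}} session, the {{filtrador}} 
DataFrame and the column names from the attached Test.scala: compare the 
physical plans of the two window definitions and look for an extra global 
Sort/Exchange in the partition-column variant.)
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// `spark` and `filtrador` are assumed to come from the attached Test.scala.
import spark.implicits._

// Same aggregation, ordered by the partitioning column vs. by another column.
val byPartitionCol = Window.partitionBy($"word").orderBy("word")
val byOtherCol     = Window.partitionBy($"word").orderBy("number")

// If the partition-column orderBy really triggered a global sort, this plan
// would show an extra range-partitioned Exchange/Sort that the other lacks.
filtrador.withColumn("l", collect_list($"number").over(byPartitionCol)).explain()
filtrador.withColumn("l", collect_list($"number").over(byOtherCol)).explain()
{code}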


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating, but removing the orderBy in a similar case 
just fixed the performance problem :$

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:18 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating to see if I can reproduce this in a dummy 
test, but removing the orderBy in a similar case just fixed the performance 
problem in production :$ (it was causing 50+ min single tasks)


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating to see if I can reproduce this in a dummy 
test, but removing the orderBy in a similar case just fixed the performance 
problem in production :$

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:07 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating, but removing the orderBy in a similar case 
just fixed the performance problem :$


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating, but removing the orderBy in a similar case 
just fixed it :$

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:07 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

I'm going to keep investigating, but removing the orderBy in a similar case 
just fixed it :$


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:03 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor? (It's not a problem 
of the data having only one value; that key had multiple values evenly 
distributed.)


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor?

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 10:01 PM:


Sorry, I'm going to test again... I'm seeing something that doesn't match my 
guesses in my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor?


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match in 
my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor?

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)
   (was: 2.3.0)
   2.4.0

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29242) Check results of expression examples automatically

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29242:
--
Issue Type: Improvement  (was: Test)

> Check results of expression examples automatically
> --
>
> Key: SPARK-29242
> URL: https://issues.apache.org/jira/browse/SPARK-29242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Expression examples demonstrate how to use associated functions, and show 
> expected results. Need to write a test which executes the examples and 
> compare actual and expected results. For example: 
> https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29242) Check results of expression examples automatically

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29242:
--
Component/s: Tests

> Check results of expression examples automatically
> --
>
> Key: SPARK-29242
> URL: https://issues.apache.org/jira/browse/SPARK-29242
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Expression examples demonstrate how to use associated functions, and show 
> expected results. Need to write a test which executes the examples and 
> compare actual and expected results. For example: 
> https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 9:54 PM:
---

Sorry, I'm going to test again... I'm seeing something that doesn't match in 
my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy tanked my performance and only used 
one executor?


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match in 
my test case.

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz edited comment on SPARK-29265 at 9/26/19 9:54 PM:
---

Sorry, I'm going to test again... I'm seeing something that doesn't match in 
my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy (ordering by one of the partitionBy 
columns) tanked my performance and only used one executor?


was (Author: fsainz):
Sorry, I'm going to test again... I'm seeing something that doesn't match in 
my test case.

I thought the cause was that it triggered a global sort, but the plan is only 
slightly different from the one generated when ordering by another column. Does 
anyone know why a similar Window+orderBy tanked my performance and only used 
one executor?

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938989#comment-16938989
 ] 

Florentino Sainz commented on SPARK-29265:
--

Sorry, I'm going to test again... I'm seeing something that doesn't match in my 
test case.

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz resolved SPARK-29265.
--
Resolution: Not A Bug

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too.


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy of the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy of the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Attachment: TestSpark.zip

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy of the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{code:scala}

import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable
object Test {

  case class Bank(age:Integer, job:String, marital : String, education : 
String, balance : Integer)


  def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .master("local[4]")
  .appName("Word Count")
  .getOrCreate()

import org.apache.spark.sql.functions._
import spark.implicits._

val sc = spark.sparkContext
val expectedSchema = List(
  StructField("number", IntegerType, false),
  StructField("word", StringType, false),
  StructField("dummyColumn", StringType, false)
)
val expectedData = Seq(
  Row(8, "bat", "test"),
  Row(64, "mouse", "test"),
  Row(-27, "horse", "test")
)

val filtrador = spark.createDataFrame(
  spark.sparkContext.parallelize(expectedData),
  StructType(expectedSchema)
).withColumn("dummy", explode(array((1 until 100).map(lit): _*)))

//spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres")


//val filtrador2=filtrador.crossJoin(filtrador)
//filtrador2.cache()
//filtrador2.union(filtrador2).count


val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow))
filt2.show

filt2.rdd.mapPartitions(iter => Iterator(iter.size), 
true).collect().foreach(println)

  }
}

 {code}
 


> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 

[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Attachment: Test.scala

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
> Attachments: Test.scala
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> (attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow)){code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows inside each Window 
> but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy of the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) <-- You can paste it into IntelliJ or any other IDE and it should work:
>  
> {code:scala}
> import java.io.ByteArrayOutputStream
> import java.net.URL
> import java.nio.charset.Charset
> import org.apache.commons.io.IOUtils
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
> StructType}
> import scala.collection.mutable
> object Test {
>   case class Bank(age:Integer, job:String, marital : String, education : 
> String, balance : Integer)
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.sql.autoBroadcastJoinThreshold", -1)
>   .master("local[4]")
>   .appName("Word Count")
>   .getOrCreate()
> import org.apache.spark.sql.functions._
> import spark.implicits._
> val sc = spark.sparkContext
> val expectedSchema = List(
>   StructField("number", IntegerType, false),
>   StructField("word", StringType, false),
>   StructField("dummyColumn", StringType, false)
> )
> val expectedData = Seq(
>   Row(8, "bat", "test"),
>   Row(64, "mouse", "test"),
>   Row(-27, "horse", "test")
> )
> val filtrador = spark.createDataFrame(
>   spark.sparkContext.parallelize(expectedData),
>   StructType(expectedSchema)
> ).withColumn("dummy", explode(array((1 until 100).map(lit): _*)))
> 
> //spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank")
> //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos")
> //spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres")
> //val filtrador2=filtrador.crossJoin(filtrador)
> //filtrador2.cache()
> //filtrador2.union(filtrador2).count
> val myWindow = Window.partitionBy($"word").orderBy("word")
> val filt2 = filtrador.withColumn("avg_Time", 
> avg($"number").over(myWindow))
> filt2.show
> filt2.rdd.mapPartitions(iter => Iterator(iter.size), 
> true).collect().foreach(println)
>   }
> }
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)
Florentino Sainz created SPARK-29265:


 Summary: Window orderBy causing full-DF orderBy 
 Key: SPARK-29265
 URL: https://issues.apache.org/jira/browse/SPARK-29265
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4, 2.4.3, 2.3.0
 Environment: Any
Reporter: Florentino Sainz


Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:java}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{quote}
import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{IntegerType, StringType, StructField,
StructType}

import scala.collection.mutable

object Test {

  case class Bank(age: Integer, job: String, marital: String,
                  education: String, balance: Integer)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.autoBroadcastJoinThreshold", -1)
      .master("local[4]")
      .appName("Word Count")
      .getOrCreate()

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val sc = spark.sparkContext
    val expectedSchema = List(
      StructField("number", IntegerType, false),
      StructField("word", StringType, false),
      StructField("dummyColumn", StringType, false)
    )
    val expectedData = Seq(
      Row(8, "bat", "test"),
      Row(64, "mouse", "test"),
      Row(-27, "horse", "test")
    )

    val filtrador = spark.createDataFrame(
      spark.sparkContext.parallelize(expectedData),
      StructType(expectedSchema)
    ).withColumn("dummy", explode(array((1 until 100).map(lit): _*)))

    val myWindow = Window.partitionBy($"word").orderBy("word")
    val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow))
    filt2.show

    filt2.rdd.mapPartitions(iter => Iterator(iter.size),
      true).collect().foreach(println)
  }
}
{quote}
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{code:scala}

import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable
object Test {

  case class Bank(age:Integer, job:String, marital : String, education : 
String, balance : Integer)


  def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .master("local[4]")
  .appName("Word Count")
  .getOrCreate()

import org.apache.spark.sql.functions._
import spark.implicits._

val sc = spark.sparkContext
val expectedSchema = List(
  StructField("number", IntegerType, false),
  StructField("word", StringType, false),
  StructField("dummyColumn", StringType, false)
)
val expectedData = Seq(
  Row(8, "bat", "test"),
  Row(64, "mouse", "test"),
  Row(-27, "horse", "test")
)

val filtrador = spark.createDataFrame(
  spark.sparkContext.parallelize(expectedData),
  StructType(expectedSchema)
).withColumn("dummy", explode(array((1 until 100).map(lit): _*)))

//spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres")


//val filtrador2=filtrador.crossJoin(filtrador)
//filtrador2.cache()
//filtrador2.union(filtrador2).count


val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow))
filt2.show

filt2.rdd.mapPartitions(iter => Iterator(iter.size), 
true).collect().foreach(println)

  }
}

 {code}
 

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:java}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{code: scala}

import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import 

[jira] [Updated] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-26 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:java}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed and did not 
finish in time (because, as it is a global orderBy, it will just go to one 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{code: scala}

import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable
object Test {

  case class Bank(age:Integer, job:String, marital : String, education : 
String, balance : Integer)


  def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .master("local[4]")
  .appName("Word Count")
  .getOrCreate()

import org.apache.spark.sql.functions._
import spark.implicits._

val sc = spark.sparkContext
val expectedSchema = List(
  StructField("number", IntegerType, false),
  StructField("word", StringType, false),
  StructField("dummyColumn", StringType, false)
)
val expectedData = Seq(
  Row(8, "bat", "test"),
  Row(64, "mouse", "test"),
  Row(-27, "horse", "test")
)

val filtrador = spark.createDataFrame(
  spark.sparkContext.parallelize(expectedData),
  StructType(expectedSchema)
).withColumn("dummy", explode(array((1 until 100).map(lit): _*)))

//spark.createDataFrame(bank,Bank.getClass).createOrReplaceTempView("bank")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankDos")
//spark.createDataFrame(bank,Bank.getClass).registerTempTable("bankTres")


//val filtrador2=filtrador.crossJoin(filtrador)
//filtrador2.cache()
//filtrador2.union(filtrador2).count


val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow))
filt2.show

filt2.rdd.mapPartitions(iter => Iterator(iter.size), 
true).collect().foreach(println)

  }
}

 {code}
 

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test 
(attached).

Having this Window definition:
{code:java}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows inside each Window but 
doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy of the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) <-- You can paste it into IntelliJ or any other IDE and it should work:

 
{quote}
import java.io.ByteArrayOutputStream
import java.net.URL
import java.nio.charset.Charset

import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import 

[jira] [Resolved] (SPARK-29264) Incorrect example for the RLike expression

2019-09-26 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-29264.

Resolution: Won't Fix

> Incorrect example for the RLike expression
> --
>
> Key: SPARK-29264
> URL: https://issues.apache.org/jira/browse/SPARK-29264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The example for RLIKE is incorrect: 
> https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174
> {code}
> spark-sql> SET spark.sql.parser.escapedStringLiterals=true;
> spark.sql.parser.escapedStringLiterals  true
> spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
> 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT 
> '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*']
> java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence 
> near index 14
> %SystemDrive%\Users.*
>   ^
>   at java.util.regex.Pattern.error(Pattern.java:1957)
>   at java.util.regex.Pattern.escape(Pattern.java:2473)
>   at java.util.regex.Pattern.atom(Pattern.java:2200)
>   at java.util.regex.Pattern.sequence(Pattern.java:2132)
>   at java.util.regex.Pattern.expr(Pattern.java:1998)
>   at java.util.regex.Pattern.compile(Pattern.java:1698)
>   at java.util.regex.Pattern.<init>(Pattern.java:1351)
>   at java.util.regex.Pattern.compile(Pattern.java:1028)
> {code}
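
A small standalone sketch (not from the ticket) of why the pattern fails and how 
it could be written instead; it uses java.util.regex directly, which is what the 
RLike expression relies on:

{code:scala}
import java.util.regex.Pattern

// With escapedStringLiterals=true the backslash reaches the regex engine
// untouched, so the pattern contains "\U", which is not a valid escape:
// Pattern.compile("%SystemDrive%\\Users.*")  // throws PatternSyntaxException

// Doubling the backslash makes the regex match a literal backslash instead.
val p = Pattern.compile("%SystemDrive%\\\\Users.*")
println(p.matcher("%SystemDrive%\\Users\\John").matches())  // prints true
{code}

So the documented example would presumably need the backslash doubled in the SQL 
pattern (i.e. '%SystemDrive%\\Users.*') for the escapedStringLiterals=true case.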



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29264) Incorrect example for the RLike expression

2019-09-26 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938952#comment-16938952
 ] 

Maxim Gekk commented on SPARK-29264:


Remove RLike from the ignore set after the fix: 
[https://github.com/apache/spark/pull/25942/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR171]

> Incorrect example for the RLike expression
> --
>
> Key: SPARK-29264
> URL: https://issues.apache.org/jira/browse/SPARK-29264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The example for RLIKE is incorrect: 
> https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174
> {code}
> spark-sql> SET spark.sql.parser.escapedStringLiterals=true;
> spark.sql.parser.escapedStringLiterals  true
> spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
> 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT 
> '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*']
> java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence 
> near index 14
> %SystemDrive%\Users.*
>   ^
>   at java.util.regex.Pattern.error(Pattern.java:1957)
>   at java.util.regex.Pattern.escape(Pattern.java:2473)
>   at java.util.regex.Pattern.atom(Pattern.java:2200)
>   at java.util.regex.Pattern.sequence(Pattern.java:2132)
>   at java.util.regex.Pattern.expr(Pattern.java:1998)
>   at java.util.regex.Pattern.compile(Pattern.java:1698)
>   at java.util.regex.Pattern.<init>(Pattern.java:1351)
>   at java.util.regex.Pattern.compile(Pattern.java:1028)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29264) Incorrect example for the RLike expression

2019-09-26 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29264:
---
Summary: Incorrect example for the RLike expression  (was: Fix examples for 
the RLike expression)

> Incorrect example for the RLike expression
> --
>
> Key: SPARK-29264
> URL: https://issues.apache.org/jira/browse/SPARK-29264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The example for RLIKE is incorrect: 
> https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174
> {code}
> spark-sql> SET spark.sql.parser.escapedStringLiterals=true;
> spark.sql.parser.escapedStringLiterals  true
> spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
> 19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT 
> '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*']
> java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence 
> near index 14
> %SystemDrive%\Users.*
>   ^
>   at java.util.regex.Pattern.error(Pattern.java:1957)
>   at java.util.regex.Pattern.escape(Pattern.java:2473)
>   at java.util.regex.Pattern.atom(Pattern.java:2200)
>   at java.util.regex.Pattern.sequence(Pattern.java:2132)
>   at java.util.regex.Pattern.expr(Pattern.java:1998)
>   at java.util.regex.Pattern.<init>(Pattern.java:1351)
>   at java.util.regex.Pattern.(Pattern.java:1351)
>   at java.util.regex.Pattern.compile(Pattern.java:1028)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29264) Fix examples for the RLike expression

2019-09-26 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29264:
--

 Summary: Fix examples for the RLike expression
 Key: SPARK-29264
 URL: https://issues.apache.org/jira/browse/SPARK-29264
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The example for RLIKE is incorrect: 
https://github.com/apache/spark/blob/a428f406693f1c372dc0e378f6b413eca9e367ac/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L174
{code}
spark-sql> SET spark.sql.parser.escapedStringLiterals=true;
spark.sql.parser.escapedStringLiterals  true
spark-sql> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
19/09/26 23:33:13 ERROR SparkSQLDriver: Failed in [SELECT 
'%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*']
java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence 
near index 14
%SystemDrive%\Users.*
  ^
at java.util.regex.Pattern.error(Pattern.java:1957)
at java.util.regex.Pattern.escape(Pattern.java:2473)
at java.util.regex.Pattern.atom(Pattern.java:2200)
at java.util.regex.Pattern.sequence(Pattern.java:2132)
at java.util.regex.Pattern.expr(Pattern.java:1998)
at java.util.regex.Pattern.compile(Pattern.java:1698)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1028)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-09-26 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-28917:
-
Description: 
{{RDD.dependencies}} stores the precomputed cache value, but it is not 
thread-safe.  This can lead to a race where the value gets overwritten, but the 
DAGScheduler gets stuck in an inconsistent state.  In particular, this can 
happen when there is a race between the DAGScheduler event loop, and another 
thread (eg. a user thread, if there is multi-threaded job submission).


First, a job is submitted by the user, which then computes the result Stage and 
its parents:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983

Which eventually makes a call to {{rdd.dependencies}}:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519

At the same time, the user could also touch {{rdd.dependencies}} in another 
thread, which could overwrite the stored value because of the race.

Then the DAGScheduler checks the dependencies *again* later on in the job 
submission, via {{getMissingParentStages}}

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025

Because it will find new dependencies, it will create entirely different 
stages.  Now the job has some orphaned stages which will never run.

The symptoms of this are seeing disjoint sets of stages in the "Parents of 
final stage" and the "Missing parents" messages on job submission.

(*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it is 
not a symptom of a problem at all.  It just means the RDD is the *input* to 
multiple shuffles.)

{noformat}
[INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting 
job: count at XXX.scala:462
...
[INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
...
...
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Got job 1 (count at XXX.scala:462) with 40 output partitions
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Final stage: ResultStage 5 (count at XXX.scala:462)
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Parents of final stage: List(ShuffleMapStage 4)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Missing parents: List(ShuffleMapStage 6)
{noformat}

Note that there is a similar issue with {{rdd.partitions}}. I don't see a way it 
could mess up the scheduler (it seems it's only used for 
{{rdd.partitions.length}}).  There is also an issue that {{rdd.storageLevel}} 
is read and cached in the scheduler, but it could be modified simultaneously by 
the user in another thread.   Similarly, I can't see a way it could affect the 
scheduler.

*WORKAROUND*:
(a) call {{rdd.dependencies}} while you know the RDD is only being touched by 
one thread (e.g. in the thread that created it, or before you submit multiple 
jobs touching that RDD from other threads). Then that value will get cached 
(see the sketch below).
(b) don't submit jobs from multiple threads.
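
A minimal sketch of workaround (a); the helper name is purely illustrative, the point is just to force the lazy caches from a single thread before the RDD is shared:

{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): force the lazily cached fields
// from one thread so later concurrent reads only see the cached values.
def warmUpLineage(rdd: RDD[_]): Unit = {
  rdd.dependencies   // computes and caches the dependency list
  rdd.partitions     // computes and caches the partition array
}

// Usage: call warmUpLineage(sharedRdd) in the thread that built sharedRdd,
// before handing it to a thread pool that submits concurrent actions on it.
{code}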

  was:
{{RDD.dependencies}} stores the precomputed cache value, but it is not 
thread-safe.  This can lead to a race where the value gets overwritten, but the 
DAGScheduler gets stuck in an inconsistent state.  In particular, this can 
happen when there is a race between the DAGScheduler event loop, and another 
thread (eg. a user thread, if there is multi-threaded job submission).


First, a job is submitted by the user, which then computes the result Stage and 
its parents:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983

Which eventually makes a call to {{rdd.dependencies}}:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519

At the same time, the user could also touch {{rdd.dependencies}} in another 
thread, which could overwrite the stored value because of the race.

Then the DAGScheduler checks the dependencies *again* later on in the job 
submission, via {{getMissingParentStages}}

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025

Because it will find new dependencies, it will create entirely different 
stages.  Now the job has some orphaned stages which will never run.

The symptoms of this are seeing disjoint sets of 

[jira] [Created] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-26 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-29263:
-

 Summary: availableSlots in scheduler can change before being 
checked by barrier taskset
 Key: SPARK-29263
 URL: https://issues.apache.org/jira/browse/SPARK-29263
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


availableSlots is computed before the loop in resourceOffer, but it changes in 
every iteration, so the barrier taskset can be checked against a stale value.
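
A toy sketch of the problematic shape, with made-up names (this is not the scheduler's actual code):

{code:scala}
// Hypothetical illustration of the race described above, not Spark's code.
var freeSlots = 10                      // e.g. snapshot of cluster capacity

def offerLoop(taskSets: Seq[String]): Unit = {
  val availableSlots = freeSlots        // computed once, before the loop
  taskSets.foreach { ts =>
    // scheduling work in the loop consumes slots, so freeSlots shrinks ...
    freeSlots -= 1
    // ... but the barrier check still uses the stale snapshot taken above
    if (availableSlots >= 3) println(s"launching barrier task set $ts")
  }
}
{code}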



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-09-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938857#comment-16938857
 ] 

Dongjoon Hyun commented on SPARK-29245:
---

In addition to [~yumwang]'s configuration, in CDH 6.3, HMS is also running on 
JDK 11.

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: CDH 6.3
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25475) Refactor all benchmark to save the result as a separate file

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25475:
--
Target Version/s: 3.0.0

> Refactor all benchmark to save the result as a separate file
> 
>
> Key: SPARK-25475
> URL: https://issues.apache.org/jira/browse/SPARK-25475
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is an umbrella issue to refactor all benchmarks to use a common style 
> using main-method (instead of `test` method) and saving the result as a 
> separate file (instead of embedding as comments). This is not only for 
> consistency, but also for making the benchmark-automation easy. SPARK-25339 
> is finished as a reference model.
> *Completed*
> - FilterPushdownBenchmark.scala (SPARK-25339)
> *Candidates*
> - AggregateBenchmark.scala
> - AvroWriteBenchmark.scala (SPARK-24777)
> - ColumnarBatchBenchmark.scala
> - CompressionSchemeBenchmark.scala
> - DataSourceReadBenchmark.scala
> - DataSourceWriteBenchmark.scala (SPARK-24777)
> - DatasetBenchmark.scala
> - ExternalAppendOnlyUnsafeRowArrayBenchmark.scala
> - HashBenchmark.scala
> - HashByteArrayBenchmark.scala
> - JoinBenchmark.scala
> - KryoBenchmark.scala
> - MiscBenchmark.scala
> - ObjectHashAggregateExecBenchmark.scala
> - OrcReadBenchmark.scala
> - PrimitiveArrayBenchmark.scala
> - SortBenchmark.scala
> - SynthBenchmark.scala
> - TPCDSQueryBenchmark.scala
> - UDTSerializationBenchmark.scala
> - UnsafeArrayDataBenchmark.scala
> - UnsafeProjectionBenchmark.scala
> - WideSchemaBenchmark.scala
> Candidates will be reviewed and converted as a subtask of this JIRA.
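
For illustration, a minimal sketch of the target shape with made-up names; the real suites use Spark's shared benchmark helpers rather than raw timing, this only shows the main-method-plus-result-file idea:

{code:scala}
import java.io.{File, PrintWriter}

// Hypothetical example of the "main method + separate result file" style.
object ExampleBenchmark {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter(new File("ExampleBenchmark-results.txt"))
    try {
      val start = System.nanoTime()
      val checksum = (1L to 1000000L).map(_ * 2).sum   // workload being measured
      val elapsedMs = (System.nanoTime() - start) / 1e6
      out.println(f"map+sum over 1M longs (checksum $checksum): $elapsedMs%.1f ms")
    } finally {
      out.close()
    }
  }
}
{code}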



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29242) Check results of expression examples automatically

2019-09-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29242:
-

Assignee: Maxim Gekk

> Check results of expression examples automatically
> --
>
> Key: SPARK-29242
> URL: https://issues.apache.org/jira/browse/SPARK-29242
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Expression examples demonstrate how to use associated functions, and show 
> expected results. Need to write a test which executes the examples and 
> compare actual and expected results. For example: 
> https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29248) Pass in number of partitions to BuildWriter

2019-09-26 Thread Ximo Guanter (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938743#comment-16938743
 ] 

Ximo Guanter commented on SPARK-29248:
--

[~kabhwan], I have opened [https://github.com/apache/spark/pull/25945] with a 
proposal on how this could work. Hopefully the PR can help frame the technical 
discussion better, since I'm not sure I'm explaining myself fully.

> Pass in number of partitions to BuildWriter
> ---
>
> Key: SPARK-29248
> URL: https://issues.apache.org/jira/browse/SPARK-29248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ximo Guanter
>Priority: Major
>
> When implementing a ScanBuilder, we require the implementor to provide the 
> schema of the data and the number of partitions.
> However, when someone is implementing WriteBuilder we only pass them the 
> schema, but not the number of partitions. This is an asymmetrical developer 
> experience. Passing in the number of partitions on the WriteBuilder would 
> enable data sources to provision their write targets before starting to 
> write. For example, it could be used to provision a Kafka topic with a 
> specific number of partitions.
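
For illustration only, a hypothetical shape of what passing the partition count through could look like; the names and signatures below are assumptions, not Spark's DataSource V2 interfaces (see the linked PR for the actual proposal):

{code:scala}
// Hypothetical sketch -- these are NOT Spark's DataSource V2 interfaces.
final case class HypotheticalWriteInfo(schemaDdl: String, numPartitions: Int)

trait HypotheticalWriteBuilder {
  // A sink that sees the incoming partition count up front can provision its
  // target before any task starts writing (e.g. create a Kafka topic with
  // that many partitions).
  def buildFor(info: HypotheticalWriteInfo): Unit
}

object TopicProvisioningBuilder extends HypotheticalWriteBuilder {
  def buildFor(info: HypotheticalWriteInfo): Unit =
    println(s"provision topic with ${info.numPartitions} partitions for ${info.schemaDdl}")
}
{code}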



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled

2019-09-26 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938725#comment-16938725
 ] 

Lantao Jin commented on SPARK-29254:


Is this a duplicate of https://issues.apache.org/jira/browse/SPARK-29022 ?

> Failed to include jars passed in through --jars when isolatedLoader is enabled
> --
>
> Key: SPARK-29254
> URL: https://issues.apache.org/jira/browse/SPARK-29254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to include jars passed in through --jars when {{isolatedLoader}} is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce:
> {code:scala}
>   test("SPARK-29254: include jars passed in through --jars when 
> isolatedLoader is enabled") {
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA"))
> val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB"))
> val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath
> val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath
> val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => 
> j.toString).mkString(",")
> val args = Seq(
>   "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"),
>   "--name", "SparkSubmitClassLoaderTest",
>   "--master", "local-cluster[2,1,1024]",
>   "--conf", "spark.ui.enabled=false",
>   "--conf", "spark.master.rest.enabled=false",
>   "--conf", "spark.sql.hive.metastore.version=3.1.2",
>   "--conf", "spark.sql.hive.metastore.jars=maven",
>   "--driver-java-options", "-Dderby.system.durability=test",
>   "--jars", jarsString,
>   unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB")
> runSparkSubmit(args)
>   }
> {code}
> Logs:
> {noformat}
> 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in 
> initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242)
> 

[jira] [Created] (SPARK-29262) DataFrameWriter insertIntoPartition function

2019-09-26 Thread feiwang (Jira)
feiwang created SPARK-29262:
---

 Summary: DataFrameWriter insertIntoPartition function
 Key: SPARK-29262
 URL: https://issues.apache.org/jira/browse/SPARK-29262
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang


Do we have a plan to support an insertIntoPartition function for DataFrameWriter?

[~cloud_fan]
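
For context, a minimal sketch of what is possible today via SQL, plus a purely hypothetical form of the requested API; the method name below is taken from this ticket and does not exist in Spark:

{code:scala}
// Assumes a SparkSession `spark`, a DataFrame `df` whose columns match the
// non-partition columns of a partitioned table `sales(... , dt string)`.

// What works today: route the write through SQL with an explicit partition spec.
df.createOrReplaceTempView("staging")
spark.sql("INSERT OVERWRITE TABLE sales PARTITION (dt = '2019-09-26') SELECT * FROM staging")

// Hypothetical shape of the requested API -- this method does NOT exist today:
// df.write.insertIntoPartition("sales", Map("dt" -> "2019-09-26"))
{code}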



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29256) Fix typo in building document

2019-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29256.
--
Fix Version/s: 3.0.0
 Assignee: Tomoko Komiyama
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/25937

> Fix typo in building document 
> --
>
> Key: SPARK-29256
> URL: https://issues.apache.org/jira/browse/SPARK-29256
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Assignee: Tomoko Komiyama
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Typo in 'Building With Hive and JDBC Support' section in Building document. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29256) Fix typo in building document

2019-09-26 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938594#comment-16938594
 ] 

Sean R. Owen commented on SPARK-29256:
--

There's no need to file a JIRA for this kind of thing.

> Fix typo in building document 
> --
>
> Key: SPARK-29256
> URL: https://issues.apache.org/jira/browse/SPARK-29256
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Trivial
>
> Typo in 'Building With Hive and JDBC Support' section in Building document. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29256) Fix typo in building document

2019-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29256:
-
Issue Type: Improvement  (was: Bug)

> Fix typo in building document 
> --
>
> Key: SPARK-29256
> URL: https://issues.apache.org/jira/browse/SPARK-29256
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Trivial
>
> Typo in 'Building With Hive and JDBC Support' section in Building document. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.

2019-09-26 Thread Amogh Margoor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938588#comment-16938588
 ] 

Amogh Margoor commented on SPARK-29059:
---

Pull Request here with all the optimizer changes: [GitHub Pull Request 
#25773|https://github.com/apache/spark/pull/25773]

> [SPIP] Support for Hive Materialized Views in Spark SQL.
> 
>
> Key: SPARK-29059
> URL: https://issues.apache.org/jira/browse/SPARK-29059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Amogh Margoor
>Priority: Minor
>
> Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark 
> Catalyst does not optimize queries against Hive tables using Materialized 
> View the way Apache Calcite does it for Hive. This Jira is to add support for 
> the same.
> We have developed it in our internal trunk and would like to open source it. 
> It would consist of 3 major parts:
>  # Reading MV related Hive Metadata
>  # Implication Engine which would figure out if an expression exp1 implies 
> another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar 
> to RexImplication checker in Apache Calcite.
> # Catalyst rule to replace tables by their Materialized View using the 
> Implication Engine. For example, if MV 'mv' has been created in Hive using the 
> query 'select * from foo where x > 10 && x < 110', then the query 'select * 
> from foo where x > 70 and x < 100' will be transformed into 'select * from mv 
> where x > 70 and x < 100'
> Note that the Implication Engine and the Catalyst rule are generic and can be 
> used even when Spark decides to have its own Materialized View.
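
As a toy illustration of the implication check described in part 2, here is a minimal sketch for the single-column range case used in the example above; the names are made up and the real engine would handle general expressions:

{code:scala}
// Hypothetical, heavily simplified implication check for predicates of the
// form lower < x < upper, matching the 'mv' example above. Not the real engine.
case class Interval(lower: Long, upper: Long)   // models: lower < x < upper

// query => mv holds when the query's interval is contained in the MV's,
// i.e. every row the query needs is already in the materialized view.
def implies(query: Interval, mv: Interval): Boolean =
  query.lower >= mv.lower && query.upper <= mv.upper

// The example from the description: MV covers 10 < x < 110, query asks 70 < x < 100.
assert(implies(Interval(70, 100), Interval(10, 110)))   // rewrite to 'mv' is safe
{code}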



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.

2019-09-26 Thread Amogh Margoor (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amogh Margoor updated SPARK-29059:
--
Issue Type: New Feature  (was: Task)

> [SPIP] Support for Hive Materialized Views in Spark SQL.
> 
>
> Key: SPARK-29059
> URL: https://issues.apache.org/jira/browse/SPARK-29059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Amogh Margoor
>Priority: Minor
>
> Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark 
> Catalyst does not optimize queries against Hive tables using Materialized 
> View the way Apache Calcite does it for Hive. This Jira is to add support for 
> the same.
> We have developed it in our internal trunk and would like to open source it. 
> It would consist of 3 major parts:
>  # Reading MV related Hive Metadata
>  # Implication Engine which would figure out if an expression exp1 implies 
> another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar 
> to RexImplication checker in Apache Calcite.
> # Catalyst rule to replace tables by their Materialized View using the 
> Implication Engine. For example, if MV 'mv' has been created in Hive using the 
> query 'select * from foo where x > 10 && x < 110', then the query 'select * 
> from foo where x > 70 and x < 100' will be transformed into 'select * from mv 
> where x > 70 and x < 100'
> Note that the Implication Engine and the Catalyst rule are generic and can be 
> used even when Spark decides to have its own Materialized View.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.

2019-09-26 Thread Amogh Margoor (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amogh Margoor updated SPARK-29059:
--
Summary: [SPIP] Support for Hive Materialized Views in Spark SQL.  (was: 
Support for Hive Materialized Views in Spark SQL.)

> [SPIP] Support for Hive Materialized Views in Spark SQL.
> 
>
> Key: SPARK-29059
> URL: https://issues.apache.org/jira/browse/SPARK-29059
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Amogh Margoor
>Priority: Minor
>
> Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark 
> Catalyst does not optimize queries against Hive tables using Materialized 
> View the way Apache Calcite does it for Hive. This Jira is to add support for 
> the same.
> We have developed it in our internal trunk and would like to open source it. 
> It would consist of 3 major parts:
>  # Reading MV related Hive Metadata
>  # Implication Engine which would figure out if an expression exp1 implies 
> another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar 
> to RexImplication checker in Apache Calcite.
> # Catalyst rule to replace tables by their Materialized View using the 
> Implication Engine. For example, if MV 'mv' has been created in Hive using the 
> query 'select * from foo where x > 10 && x < 110', then the query 'select * 
> from foo where x > 70 and x < 100' will be transformed into 'select * from mv 
> where x > 70 and x < 100'
> Note that the Implication Engine and the Catalyst rule are generic and can be 
> used even when Spark decides to have its own Materialized View.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2019-09-26 Thread Gary D. Gregory (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938581#comment-16938581
 ] 

Gary D. Gregory commented on SPARK-6305:


Log4j 2 provides a clean separation between API and implementation. You can use 
the Log4j 2 API and plug in other logging implementations, and you can use 
Log4j 2 as the implementation of other logging APIs. Please see 
https://logging.apache.org/log4j/2.x/manual/index.html. Log4j 2 also provides a 
bridge from the Log4j 1 API to Log4j 2.
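
For reference, a minimal sketch of the dependency side of that setup in sbt; the artifact coordinates are the ones published by the Log4j project, the version is a placeholder to be replaced by a current 2.x release:

{code:scala}
// build.sbt sketch: swap the SLF4J binding to Log4j 2 and keep a bridge for
// code that still calls the Log4j 1 API. Version is a placeholder.
val log4j2Version = "2.12.1"

libraryDependencies ++= Seq(
  "org.apache.logging.log4j" % "log4j-api"        % log4j2Version,
  "org.apache.logging.log4j" % "log4j-core"       % log4j2Version,
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % log4j2Version, // SLF4J -> Log4j 2
  "org.apache.logging.log4j" % "log4j-1.2-api"    % log4j2Version  // Log4j 1 API bridge
)

// and exclude the old binding, e.g.:
excludeDependencies += ExclusionRule("org.slf4j", "slf4j-log4j12")
{code}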

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28997) Add `spark.sql.dialect`

2019-09-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28997.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25697
[https://github.com/apache/spark/pull/25697]

> Add `spark.sql.dialect`
> ---
>
> Key: SPARK-28997
> URL: https://issues.apache.org/jira/browse/SPARK-28997
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> After https://github.com/apache/spark/pull/25158 and 
> https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are 
> introduced into Spark. AFAIK, both features are implementation-defined 
> behaviors, which are not specified in ANSI SQL.
> In such a case, this proposal is to add a configuration `spark.sql.dialect` 
> for choosing a database dialect.
> After this PR, 
> Spark supports two database dialects, `Spark` and `PostgreSQL`. With 
> `PostgreSQL` dialect, Spark will: 
> 1. perform integral division with the / operator if both sides are integral 
> types; 
> 2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as 
> input and trim input for the boolean data type.
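
A minimal sketch of the effect on integral division, assuming the configuration name and the `PostgreSQL` value exactly as described above and that the conf is settable at the session level:

{code:scala}
// Default (Spark) dialect: / on integer literals yields a fractional result.
spark.sql("SELECT 3 / 2 AS r").show()   // r = 1.5

// Assumption: the new conf is session-settable with the value "PostgreSQL".
spark.conf.set("spark.sql.dialect", "PostgreSQL")
spark.sql("SELECT 3 / 2 AS r").show()   // r = 1, i.e. integral division
{code}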



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29255) Rename package pgSQL to postgreSQL

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29255.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25936
[https://github.com/apache/spark/pull/25936]

> Rename package pgSQL to postgreSQL
> --
>
> Key: SPARK-29255
> URL: https://issues.apache.org/jira/browse/SPARK-29255
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> To address the comment in 
> https://github.com/apache/spark/pull/25697#discussion_r328431070, let's 
> rename the package pgSQL to postgreSQL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29261) Support recover live entities from KVStore for (SQL)AppStatusListener

2019-09-26 Thread wuyi (Jira)
wuyi created SPARK-29261:


 Summary: Support recover live entities from KVStore for 
(SQL)AppStatusListener
 Key: SPARK-29261
 URL: https://issues.apache.org/jira/browse/SPARK-29261
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: wuyi


To achieve the incremental replay goal in SHS, we need to support recovering 
live entities from the KVStore for both SQLAppStatusListener and AppStatusListener.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29203:

Fix Version/s: 2.4.5

> Reduce shuffle partitions in SQLQueryTestSuite
> --
>
> Key: SPARK-29203
> URL: https://issues.apache.org/jira/browse/SPARK-29203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> spark.sql.shuffle.partitions=200(default):
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 
> 763 milliseconds)
> {noformat}
> spark.sql.shuffle.partitions=5:
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 
> 360 milliseconds)
> {noformat}
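
For illustration, a minimal sketch of the knob being tuned (a hypothetical local test session; the suite itself configures this internally):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical local test session with a small shuffle partition count,
// which is what makes the tiny subquery test inputs run so much faster.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sql-query-test-sketch")
  .config("spark.sql.shuffle.partitions", "5")
  .getOrCreate()
{code}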



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29203:

Affects Version/s: 2.4.4

> Reduce shuffle partitions in SQLQueryTestSuite
> --
>
> Key: SPARK-29203
> URL: https://issues.apache.org/jira/browse/SPARK-29203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> spark.sql.shuffle.partitions=200(default):
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 
> 763 milliseconds)
> {noformat}
> spark.sql.shuffle.partitions=5:
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 
> 360 milliseconds)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2019-09-26 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938506#comment-16938506
 ] 

Steve Loughran commented on SPARK-6305:
---

bq. Steve, apart from the security issue concern, we also have the logs not 
rolling properly.
 
Are you confident that this is actually due to log4j 1.x bugs, and that an 
upgrade will make it go away? If not, unless you can find evidence that other 
people have been hitting a similar problem, I'd look at your deployment and 
configuration before going through the trauma of a move to 2.x.

bq. In any case, my question is "Is Spark going to decouple itself from log4j 
1.x and provide a way to configure logging with an alternative implementation?".

bq. If yes, what would be the tentative timeline?

No idea.

I actually seem to recall some discussion about using logback behind the 
scenes, though you'll have to investigate that yourself

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29260:

Description: Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have 
not been released yet.  (was: Hive 3.0.x and Hive 3.1.x is supported. Hive 2.2.1, 
2.4.0 haven't released.)

> Enable supported Hive metastore versions once it support altering database 
> location
> ---
>
> Key: SPARK-29260
> URL: https://issues.apache.org/jira/browse/SPARK-29260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location

2019-09-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29260:

Description: Hive 3.0.x and Hive 3.1.x is supported. Hive 2.2.1, 2.4.0 
haven't released.

> Enable supported Hive metastore versions once it support altering database 
> location
> ---
>
> Key: SPARK-29260
> URL: https://issues.apache.org/jira/browse/SPARK-29260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hive 3.0.x and Hive 3.1.x is supported. Hive 2.2.1, 2.4.0 haven't released.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location

2019-09-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29260:
---

 Summary: Enable supported Hive metastore versions once it support 
altering database location
 Key: SPARK-29260
 URL: https://issues.apache.org/jira/browse/SPARK-29260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark

2019-09-26 Thread Rajiv Bandi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938467#comment-16938467
 ] 

Rajiv Bandi edited comment on SPARK-6305 at 9/26/19 10:20 AM:
--

Steve, apart from the security issue concern, we also have the logs not rolling 
properly.

In any case, my question is "Is Spark going to decouple itself from log4j 1.x 
and provide a way to configure logging with an alternative implementation?".

If yes, what would be the tentative timeline?


was (Author: rajivbandi):
Steve, apart from the security issue concern, we also have the logs not rolling 
properly.

In any case, my question is "Is Spark going to decouple itself from log4j 1.x 
and provided a way to configure logging with an alternative implementation?".

If yes, what would be the tentative timeline?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2019-09-26 Thread Rajiv Bandi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938467#comment-16938467
 ] 

Rajiv Bandi commented on SPARK-6305:


Steve, apart from the security issue concern, we also have the logs not rolling 
properly.

In any case, my question is "Is Spark going to decouple itself from log4j 1.x 
and provided a way to configure logging with an alternative implementation?".

If yes, what would be the tentative timeline?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-09-26 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938465#comment-16938465
 ] 

Maciej Szymkiewicz commented on SPARK-29212:


[~podongfeng] First of all, thank you for creating this ticket.

 

??Would you like to help work on this???

If we can reach some consensus here, sure. That's only a dozen LOCs anyway.

However, personally I am more interested in the general direction. As I tried to 
point out (not sure how successfully), the current resolution of SPARK-28985, 
despite its negligible impact as such, sets a dangerous precedent and breaks 
with existing ML API conventions. At this point I still cannot tell whether that 
happened by accident or is intentional (not being actively involved in the 
development process, I get the impression that the latter might be true).
  

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copyed from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for a classifier, on 
> {{numClasses}} for a classification model), but that is hardly satisfying.
>  *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking 
> built-in and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]):
> def setRawPredictionCol(self, value: str) -> Classifier: ...
> class Classifier(Protocol, Model):
> def setRawPredictionCol(self, value: str) -> Model: ...
> def numClasses(self) -> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* variants are their subclasses.
>  {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Resolved] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages

2019-09-26 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li resolved SPARK-28845.
-
Resolution: Won't Do

After further investigation, I found that the objective of performing the sort 
only for the retried indeterminate stage cannot be achieved. It would break our 
assumption about the `outputDeterministicLevel` of each RDD, which should be 
fixed when the job is submitted, whereas here the output deterministic level 
would depend on the stage attempt number.
With SPARK-25341 completed, the current behavior depends on the config 
`spark.sql.execution.sortBeforeRepartition`: when it is set to true, Spark sorts 
before repartition and only reruns the failed tasks; otherwise the whole stage 
is rerun.
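
For reference, a minimal sketch of the config named above, assuming it can be set at the session level; the trade-off is exactly the one described:

{code:scala}
// spark.sql.execution.sortBeforeRepartition, as named in the discussion above.
// true  -> sort before repartition, so only failed tasks need to be rerun
// false -> skip the sort, but a failure reruns the whole stage
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")

val repartitioned = spark.range(0, 1000000).repartition(8)
repartitioned.count()
{code}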

> Enable spark.sql.execution.sortBeforeRepartition only for retried stages
> 
>
> Key: SPARK-28845
> URL: https://issues.apache.org/jira/browse/SPARK-28845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> For fixing the correctness bug of SPARK-28699, we disable radix sort for the 
> scenario of repartition in Spark SQL. This will cause a performance 
> regression.
> So for limiting the performance overhead, we'll do the optimizing work by 
> only enable sort for the repartition operation while stage retries happening. 
> This work depends on SPARK-25341.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-09-26 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938439#comment-16938439
 ] 

Yuming Wang commented on SPARK-29245:
-

How to reproduce:
Hive side:
{code:sh}
export JAVA_HOME=/usr/lib/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH
cd /usr/lib/hive-2.3.5
bin/schematool -dbType derby -initSchema --verbose
bin/hive --service metastore
{code}

Spark side:
{code:sh}
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
{code}


> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: CDH 6.3
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled

2019-09-26 Thread Sandeep Katta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta updated SPARK-29254:
--
Comment: was deleted

(was: [~yumwang] I would like to work on it  if you have not started )

> Failed to include jars passed in through --jars when isolatedLoader is enabled
> --
>
> Key: SPARK-29254
> URL: https://issues.apache.org/jira/browse/SPARK-29254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to include jars passed in through --jars when {{isolatedLoader}} is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce:
> {code:scala}
>   test("SPARK-29254: include jars passed in through --jars when 
> isolatedLoader is enabled") {
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA"))
> val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB"))
> val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath
> val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath
> val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => 
> j.toString).mkString(",")
> val args = Seq(
>   "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"),
>   "--name", "SparkSubmitClassLoaderTest",
>   "--master", "local-cluster[2,1,1024]",
>   "--conf", "spark.ui.enabled=false",
>   "--conf", "spark.master.rest.enabled=false",
>   "--conf", "spark.sql.hive.metastore.version=3.1.2",
>   "--conf", "spark.sql.hive.metastore.jars=maven",
>   "--driver-java-options", "-Dderby.system.durability=test",
>   "--jars", jarsString,
>   unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB")
> runSparkSubmit(args)
>   }
> {code}
> Logs:
> {noformat}
> 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in 
> initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242)
> 2019-09-25 

[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled

2019-09-26 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938434#comment-16938434
 ] 

Sandeep Katta commented on SPARK-29254:
---

[~yumwang] I would like to work on it if you have not started.

> Failed to include jars passed in through --jars when isolatedLoader is enabled
> --
>
> Key: SPARK-29254
> URL: https://issues.apache.org/jira/browse/SPARK-29254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to include jars passed in through --jars when {{isolatedLoader}} is 
> enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce:
> {code:scala}
>   test("SPARK-29254: include jars passed in through --jars when 
> isolatedLoader is enabled") {
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA"))
> val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB"))
> val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath
> val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath
> val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => 
> j.toString).mkString(",")
> val args = Seq(
>   "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"),
>   "--name", "SparkSubmitClassLoaderTest",
>   "--master", "local-cluster[2,1,1024]",
>   "--conf", "spark.ui.enabled=false",
>   "--conf", "spark.master.rest.enabled=false",
>   "--conf", "spark.sql.hive.metastore.version=3.1.2",
>   "--conf", "spark.sql.hive.metastore.jars=maven",
>   "--driver-java-options", "-Dderby.system.durability=test",
>   "--jars", jarsString,
>   unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB")
> runSparkSubmit(args)
>   }
> {code}
> Logs:
> {noformat}
> 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in 
> initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242)
> 

[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled

2019-09-26 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938433#comment-16938433
 ] 

Lantao Jin commented on SPARK-29254:


I am looking into this.

> Failed to include jars passed in through --jars when isolatedLoader is enabled
> --
>
> Key: SPARK-29254
> URL: https://issues.apache.org/jira/browse/SPARK-29254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Failed to include jars passed in through --jars when {{isolatedLoader}} is 
> enabled ({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce:
> {code:scala}
>   test("SPARK-29254: include jars passed in through --jars when 
> isolatedLoader is enabled") {
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA"))
> val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB"))
> val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath
> val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath
> val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => 
> j.toString).mkString(",")
> val args = Seq(
>   "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"),
>   "--name", "SparkSubmitClassLoaderTest",
>   "--master", "local-cluster[2,1,1024]",
>   "--conf", "spark.ui.enabled=false",
>   "--conf", "spark.master.rest.enabled=false",
>   "--conf", "spark.sql.hive.metastore.version=3.1.2",
>   "--conf", "spark.sql.hive.metastore.jars=maven",
>   "--driver-java-options", "-Dderby.system.durability=test",
>   "--jars", jarsString,
>   unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB")
> runSparkSubmit(args)
>   }
> {code}
> Logs:
> {noformat}
> 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in 
> initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class 
> org.apache.hive.hcatalog.data.JsonSerDe not found
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284)
> 2019-09-25 22:11:42.854 - stderr> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
> 2019-09-25 22:11:42.854 - stderr> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242)
> 2019-09-25 22:11:42.854 - stderr> at 
> 

[jira] [Created] (SPARK-29259) Filesystem.exists is called even when not necessary for append save mode

2019-09-26 Thread Rahij Ramsharan (Jira)
Rahij Ramsharan created SPARK-29259:
---

 Summary: Filesystem.exists is called even when not necessary for 
append save mode
 Key: SPARK-29259
 URL: https://issues.apache.org/jira/browse/SPARK-29259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Rahij Ramsharan


When saving a DataFrame to a Hadoop filesystem 
([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L93]),
 Spark first checks whether the output path exists before inspecting the SaveMode 
to determine if it should actually insert data. However, the pathExists variable 
is not used at all in the case of SaveMode.Append. On some file systems the 
exists call can be expensive, so this PR makes that call only when necessary.
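
A minimal sketch of the idea (the object and method names below are placeholders, and this is not the actual patch): if the existence check is made lazy, {{SaveMode.Append}} never forces the {{FileSystem.exists}} call, while the modes that actually depend on it keep the same behavior.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

object AppendSkipExistsSketch {
  // Decide whether the insertion should proceed, issuing FileSystem.exists
  // only for the save modes whose outcome depends on it.
  def shouldInsert(fs: FileSystem, qualifiedOutputPath: Path, mode: SaveMode): Boolean = {
    // Evaluated at most once, and only if one of the branches below forces it.
    lazy val pathExists = fs.exists(qualifiedOutputPath)
    mode match {
      case SaveMode.Append    => true   // never touches pathExists
      case SaveMode.Overwrite => true   // the real command deletes existing data separately
      case SaveMode.ErrorIfExists =>
        // The real command throws AnalysisException here; a plain error keeps the sketch self-contained.
        if (pathExists) sys.error(s"path $qualifiedOutputPath already exists.")
        true
      case SaveMode.Ignore    => !pathExists
    }
  }
}
{code}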



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk

2019-09-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-29257:
-
Attachment: image-2019-09-26-16-44-48-554.png

> All Task attempts scheduled to the same executor inevitably access the same 
> bad disk
> 
>
> Key: SPARK-29257
> URL: https://issues.apache.org/jira/browse/SPARK-29257
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2019-09-26-16-44-48-554.png
>
>
> We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 * 8T 
> local disks for storage and shuffle. Sometimes, one or more disks get into a 
> bad state during computations. Sometimes this causes job-level failure, 
> sometimes it does not.
> The following picture shows one failed job in which 4 task attempts were all 
> scheduled to the same node and all failed with almost the same exception while 
> writing the temporary index file to the same bad disk.
>  
> There are two reasons for this:
>  # As we can see in the figure, the node holding the data offers the best data 
> locality for reading from HDFS. With the default spark.locality.wait (3s) in 
> effect, there is a high probability that those attempts will be scheduled to 
> this node.
>  # The index or data file name for a particular shuffle map task is fixed: it 
> is formed from the shuffle id, the map id and the noop reduce id, which is 
> always 0. The root local dir is picked as the fixed file name's non-negative 
> hash code % the number of disks, so this value is also fixed. Even with 12 
> disks in total and only one of them broken, once the broken one is picked, 
> all following attempts of this task will inevitably pick the broken one.
>  
>  
> !image-2019-09-26-16-44-48-554.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk

2019-09-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-29257:
-
Description: 
We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 * 8T local 
disks for storage and shuffle. Sometimes, one or more disks get into a bad state 
during computations. Sometimes this causes job-level failure, sometimes it does 
not.

The following picture shows one failed job in which 4 task attempts were all 
scheduled to the same node and all failed with almost the same exception while 
writing the temporary index file to the same bad disk.

 

There are two reasons for this:
 # As we can see in the figure, the node holding the data offers the best data 
locality for reading from HDFS. With the default spark.locality.wait (3s) in 
effect, there is a high probability that those attempts will be scheduled to 
this node.
 # The index or data file name for a particular shuffle map task is fixed: it is 
formed from the shuffle id, the map id and the noop reduce id, which is always 
0. The root local dir is picked as the fixed file name's non-negative hash code 
% the number of disks, so this value is also fixed. Even with 12 disks in total 
and only one of them broken, once the broken one is picked, all following 
attempts of this task will inevitably pick the broken one.

 

 

!image-2019-09-26-16-44-48-554.png!

  was:
We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 * 8T local 
disks for storage and shuffle. Sometimes, one or more disks get into a bad state 
during computations. Sometimes this causes job-level failure, sometimes it does 
not.

The following picture shows one failed job in which 4 task attempts were all 
scheduled to the same node and all failed with almost the same exception while 
writing the temporary index file to the same bad disk.

 

There are two reasons for this:
 # As we can see in the figure, the node holding the data offers the best data 
locality for reading from HDFS. With the default spark.locality.wait (3s) in 
effect, there is a high probability that those attempts will be scheduled to 
this node.
 # The index or data file name for a particular shuffle map task is fixed: it is 
formed from the shuffle id, the map id and the noop reduce id, which is always 
0. The root local dir is picked as the fixed file name's non-negative hash code 
% the number of disks, so this value is also fixed. Even with 12 disks in total 
and only one of them broken, once the broken one is picked, all following 
attempts of this task will inevitably pick the broken one.

 

 

!image-2019-09-26-15-35-29-342.png!


> All Task attempts scheduled to the same executor inevitably access the same 
> bad disk
> 
>
> Key: SPARK-29257
> URL: https://issues.apache.org/jira/browse/SPARK-29257
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2019-09-26-16-44-48-554.png
>
>
> We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 * 8T 
> local disks for storage and shuffle. Sometimes, one or more disks get into a 
> bad state during computations. Sometimes this causes job-level failure, 
> sometimes it does not.
> The following picture shows one failed job in which 4 task attempts were all 
> scheduled to the same node and all failed with almost the same exception while 
> writing the temporary index file to the same bad disk.
>  
> There are two reasons for this:
>  # As we can see in the figure, the node holding the data offers the best data 
> locality for reading from HDFS. With the default spark.locality.wait (3s) in 
> effect, there is a high probability that those attempts will be scheduled to 
> this node.
>  # The index or data file name for a particular shuffle map task is fixed: it 
> is formed from the shuffle id, the map id and the noop reduce id, which is 
> always 0. The root local dir is picked as the fixed file name's non-negative 
> hash code % the number of disks, so this value is also fixed. Even with 12 
> disks in total and only one of them broken, once the broken one is picked, 
> all following attempts of this task will inevitably pick the broken one.
>  
>  
> !image-2019-09-26-16-44-48-554.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29258) parity between ml.evaluator and mllib.metrics

2019-09-26 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29258:


 Summary: parity between ml.evaluator and mllib.metrics
 Key: SPARK-29258
 URL: https://issues.apache.org/jira/browse/SPARK-29258
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, expose {{BinaryClassificationMetrics.numBins}} in 
{{BinaryClassificationEvaluator}}

2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}}

3, add metric {{explainedVariance}} in {{RegressionEvaluator}}
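
A spark-shell style sketch of how the evaluators might be used once this parity lands (the setters {{setNumBins}} / {{setThroughOrigin}} and the {{"var"}} metric name are taken from the proposal above and are assumptions, not a released API):

{code:scala}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

// Proposed: mirror BinaryClassificationMetrics.numBins on the evaluator.
val binEval = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setNumBins(1000)

// Proposed: mirror RegressionMetrics.throughOrigin and expose explained variance.
val regEval = new RegressionEvaluator()
  .setMetricName("var")
  .setThroughOrigin(true)
{code}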



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk

2019-09-26 Thread Kent Yao (Jira)
Kent Yao created SPARK-29257:


 Summary: All Task attempts scheduled to the same executor 
inevitably access the same bad disk
 Key: SPARK-29257
 URL: https://issues.apache.org/jira/browse/SPARK-29257
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.4, 2.3.4
Reporter: Kent Yao


We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 * 8T local 
disks for storage and shuffle. Sometimes, one or more disks get into a bad state 
during computations. Sometimes this causes job-level failure, sometimes it does 
not.

The following picture shows one failed job in which 4 task attempts were all 
scheduled to the same node and all failed with almost the same exception while 
writing the temporary index file to the same bad disk.

 

There are two reasons for this:
 # As we can see in the figure, the node holding the data offers the best data 
locality for reading from HDFS. With the default spark.locality.wait (3s) in 
effect, there is a high probability that those attempts will be scheduled to 
this node.
 # The index or data file name for a particular shuffle map task is fixed: it is 
formed from the shuffle id, the map id and the noop reduce id, which is always 
0. The root local dir is picked as the fixed file name's non-negative hash code 
% the number of disks, so this value is also fixed. Even with 12 disks in total 
and only one of them broken, once the broken one is picked, all following 
attempts of this task will inevitably pick the broken one (a short sketch of 
this follows below).

 

 

!image-2019-09-26-15-35-29-342.png!
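
A simplified sketch of the second point (the directory selection below paraphrases the description above; it is not the exact DiskBlockManager code, and the object name and constants are only illustrative): because the index file name never changes across attempts, the hash-modulo choice of the root local dir never changes either, so every retry lands on the same disk.

{code:scala}
object FixedDiskSketch {
  // Non-negative hash of the file name, as described above (simplified).
  private def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  def main(args: Array[String]): Unit = {
    val numDisks = 12
    val shuffleId = 3
    val mapId = 42
    // The noop reduce id is always 0, so the name is fully determined by the
    // shuffle id and the map id and is identical for every task attempt.
    val indexFileName = s"shuffle_${shuffleId}_${mapId}_0.index"

    // Every attempt computes the same directory index; if that disk is the
    // broken one, every retry fails in the same way.
    val dirIndex = nonNegativeHash(indexFileName) % numDisks
    (1 to 4).foreach { attempt =>
      println(s"attempt $attempt -> local dir #$dirIndex of $numDisks")
    }
  }
}
{code}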



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


