[jira] [Commented] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628525#comment-16628525 ] Ayush Anubhava commented on SPARK-25452: Hi Hyukjin Kwon, this issue does not seem to be a duplicate. I reviewed the changes and can still reproduce the problem in spark-sql; the fix appears to apply only to beeline.
> Query with where clause is giving unexpected result in case of float column
> ---
>
> Key: SPARK-25452
> URL: https://issues.apache.org/jira/browse/SPARK-25452
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Environment: *Spark 2.3.1*
> *Hadoop 2.7.2*
> Reporter: Ayush Anubhava
> Priority: Major
> Attachments: image-2018-09-26-14-14-47-504.png
>
> *Description*: Query with where clause is giving unexpected result in case of a float column.
>
> {color:#d04437}*Query with filter less than or equal to is giving inappropriate result*{color}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >= 0.0;
> +++--+
> | a | b |
> +++--+
> | 0 | 0.0 |
> | 1 | 1.10023841858 |
> +++--+
> Query with filter less than or equal to is giving inappropriate result:
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <= 1.1;
> ++--+--+
> | a | b |
> ++--+--+
> | 0 | 0.0 |
> ++--+--+
> 1 row selected (0.299 seconds)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
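For context, the behaviour in the report is standard IEEE-754 float-to-double widening: the single-precision value stored for 1.1 in the FLOAT column widens to a double slightly greater than the double literal 1.1, so `b <= 1.1` excludes the row. A Spark-free sketch of the arithmetic (plain Python, using `struct` to emulate a single-precision column; this illustrates the mechanism, not Spark's actual code path):

```python
import struct

# Round-trip 1.1 through IEEE-754 single precision, as storing it in a
# FLOAT column does, then widen it back to double for the comparison.
def as_float32(x: float) -> float:
    return struct.unpack('f', struct.pack('f', x))[0]

stored = as_float32(1.1)
print(stored)           # 1.100000023841858
print(stored <= 1.1)    # False: the widened FLOAT exceeds the double literal
```

This is why the `b >= 0.0` query prints the row with the long decimal expansion while `b <= 1.1` drops it.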
[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
Andrew Crosby created SPARK-25544:
-
Summary: Slow/failed convergence in Spark ML models due to internal predictor scaling
Key: SPARK-25544
URL: https://issues.apache.org/jira/browse/SPARK-25544
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.3.2
Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
Reporter: Andrew Crosby

The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of the regularization rather than by disabling the feature scaling. Mathematically, changing the effective regularization strength and disabling feature scaling should give the same solution, but the two can have very different convergence properties.

The usual justification for scaling features is that it ensures all covariates are O(1), which should improve numerical convergence, but this argument does not account for the regularization term. That causes no issue when standardization is set to true, since every feature then has an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations get very large effective regularization strengths, which produce very large gradients and thus poor convergence in the solver.
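The O(1/sigma_i^2) claim can be made concrete with a tiny numeric sketch (plain Python for brevity, standing in for the Scala internals; the standard deviations below are illustrative, chosen to mirror the 0.01-scale feature in the reporter's test case):

```python
# Illustrative only: the effective per-feature L2 strength that results from
# implementing "no standardization" by rescaling the penalty instead of
# disabling the internal feature scaling.
lam = 1.0                       # regParam
feature_stds = [1.0, 0.5, 0.01]  # assumed per-feature standard deviations
effective = [lam / s ** 2 for s in feature_stds]
print(effective)  # the 0.01-std feature is penalized ~10,000x more strongly
```

A four-orders-of-magnitude spread in effective penalty strengths is exactly the kind of ill-conditioning that stalls l-bfgs.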
*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label")
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")
val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))
val pipelineModel = pipeline.fit(df)
val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}

*Possible fix:*

These convergence issues can be fixed by turning off feature scaling when standardization is set to false, rather than by using an effective regularization strength.
This can be hacked into LinearRegression.scala by simply replacing line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on.
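Since Spark raises no error when the iteration cap is hit, one pragmatic workaround until a fix lands is to check the fitted model's iteration count yourself and treat an exhausted budget as suspect. A hedged sketch of that check (plain Python for brevity; in a real pipeline the two values would come from the model's `getMaxIter` and `summary.totalIterations`):

```python
def likely_hit_iteration_cap(total_iterations: int, max_iter: int) -> bool:
    """Crude convergence check: if the solver used every allowed iteration,
    flag the fit for investigation rather than silently deploying it."""
    return total_iterations >= max_iter

# With the values from the experiment above:
print(likely_hit_iteration_cap(100, 100))  # True  - unpatched model: suspect
print(likely_hit_iteration_cap(4, 100))    # False - patched model: converged
```

This does not distinguish "converged exactly at the cap" from "ran out of budget", but in practice hitting the cap almost always means the latter.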
[jira] [Updated] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-25392: - Description: Steps: 1.Enable spark.scheduler.mode = FAIR 2.Submitted beeline jobs create database JH; use JH; create table one12( id int ); insert into one12 values(12); insert into one12 values(13); Select * from one12; 3.Click on JDBC Incompleted Application ID in Job History Page 4. Go to Job Tab in staged Web UI page 5. Click on run at AccessController.java:0 under Desription column 6 . Click default under Pool Name column of Completed Stages table URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default 7. It throws below error HTTP ERROR 400 Problem accessing /history/application_1536399199015_0006/stages/pool/. Reason: Unknown pool: default Powered by Jetty:// x.y.z But under Yarn resource page it display the summary under Fair Scheduler Pool: default URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default Summary Pool Name Minimum Share Pool Weight Active Stages Running Tasks SchedulingMode default 0 1 0 0 FIFO was: Steps: 1.Enable spark.scheduler.mode = FAIR 2.Submitted beeline jobs create database JH; use JH; create table one12( id int ); insert into one12 values(12); insert into one12 values(13); Select * from one12; 3.Click on JDBC Incompleted Application ID in Job History Page 4. Go to Job Tab in staged Web UI page 5. Click on run at AccessController.java:0 under Desription column 6 . Click default under Pool Name column of Completed Stages table URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default 7. It throws below error HTTP ERROR 400 Problem accessing /history/application_1536399199015_0006/stages/pool/. 
Reason: Unknown pool: default Powered by Jetty:// x.y.z But under Yarn resource page it display the summary under Fair Scheduler Pool: default URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default Summary Pool Name Minimum Share Pool Weight Active Stages Running Tasks SchedulingMode default 0 1 0 0 FIFO Issue Type: Improvement (was: Bug) OK Sandeep, make sure u handle this as Improvement. > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554 ] t oo edited comment on SPARK-16859 at 9/26/18 10:43 AM: bump @shahid was (Author: toopt4): bump > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like job history storage tab in history server is broken for > completed jobs since *1.6.2*. > More specifically it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making > sure it works from _ReplayListenerBus_. > The downside will be that it will still work incorrectly with pre patch job > histories. But then, it doesn't work since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers I'll > try to patch myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25379) Improve ColumnPruning performance
[ https://issues.apache.org/jira/browse/SPARK-25379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25379: --- Assignee: Marco Gaido > Improve ColumnPruning performance > - > > Key: SPARK-25379 > URL: https://issues.apache.org/jira/browse/SPARK-25379 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 2.5.0 > > > The {{--}} operation on {{AttributeSet}} is quite expensive, especially where > many columns are involved. {{ColumnPruning}} heavily relies on that operator > and this affects its running time. There are 2 optimization which are > possible: > - Improve {{--}} performance; > - Replace {{--}} with {{subsetOf}} when possible. > Moreover, when building {{AttributeSet}} s we often do unneeded operations. > This also impacts other rules less significantly. > I'll provide more details about the performance improvement achievable in the > PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25379) Improve ColumnPruning performance
[ https://issues.apache.org/jira/browse/SPARK-25379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25379. - Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22364 [https://github.com/apache/spark/pull/22364] > Improve ColumnPruning performance > - > > Key: SPARK-25379 > URL: https://issues.apache.org/jira/browse/SPARK-25379 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 2.5.0 > > > The {{--}} operation on {{AttributeSet}} is quite expensive, especially where > many columns are involved. {{ColumnPruning}} heavily relies on that operator > and this affects its running time. There are 2 optimization which are > possible: > - Improve {{--}} performance; > - Replace {{--}} with {{subsetOf}} when possible. > Moreover, when building {{AttributeSet}} s we often do unneeded operations. > This also impacts other rules less significantly. > I'll provide more details about the performance improvement achievable in the > PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554 ] t oo edited comment on SPARK-16859 at 9/26/18 10:46 AM: bump was (Author: toopt4): bump [~ashahid] > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like job history storage tab in history server is broken for > completed jobs since *1.6.2*. > More specifically it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making > sure it works from _ReplayListenerBus_. > The downside will be that it will still work incorrectly with pre patch job > histories. But then, it doesn't work since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers I'll > try to patch myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the reatinedTask size
[ https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628561#comment-16628561 ] t oo commented on SPARK-25502: -- related https://jira.apache.org/jira/browse/SPARK-16859 ? > [Spark Job History] Empty Page when page number exceeds the reatinedTask size > -- > > Key: SPARK-25502 > URL: https://issues.apache.org/jira/browse/SPARK-25502 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: shahid >Priority: Minor > Fix For: 2.3.3, 2.4.0 > > > *Steps:* > 1. Spark installed and running properly. > 2. spark.ui.retainedTask=10 ( it is default value ) > 3.Launch Spark shell ./spark-shell --master yarn > 4. Create a spark-shell application with a single job and 50 task > val rdd = sc.parallelize(1 to 50, 50) > rdd.count > 5. Launch Job History Page and go to spark-shell application created above > under Incomplete Task > 6. Right click and got to Job page of the application and from there click > and launch Stage Page > 7. Launch the Stage Id page for the specific Stage Id for the above created > job > 8. Scroll down and check for the task completion Summary > It Displays pagination panel showing *5000 Pages Jump to 1 Show 100 items in > a page* and Go button > 9. Replace 1 with 2333 page number > *Actual Result:* > 2 Pagination Panel displayed > *Expected Result:* > Pagination Panel should not display 5000 pages as retainedTask value is > 10 and it should display 1000 page only because each page holding 100 > tasks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
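The expected page count in the report is plain ceiling division. A sketch (assuming the retained-task setting is Spark's default of 100000; the value printed in the report appears truncated):

```python
# Expected pagination, assuming spark.ui.retainedTasks = 100000 and
# 100 tasks shown per page.
retained_tasks = 100_000
page_size = 100
pages = -(-retained_tasks // page_size)   # ceiling division
print(pages)  # 1000, not the 5000 pages the UI offers
```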
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554 ] t oo edited comment on SPARK-16859 at 9/26/18 10:45 AM: bump [~ashahid] was (Author: toopt4): bump @shahid > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like job history storage tab in history server is broken for > completed jobs since *1.6.2*. > More specifically it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making > sure it works from _ReplayListenerBus_. > The downside will be that it will still work incorrectly with pre patch job > histories. But then, it doesn't work since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers I'll > try to patch myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23401) Improve test cases for all supported types and unsupported types
[ https://issues.apache.org/jira/browse/SPARK-23401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628727#comment-16628727 ] Aleksandr Koriagin commented on SPARK-23401: I will take a look > Improve test cases for all supported types and unsupported types > > > Key: SPARK-23401 > URL: https://issues.apache.org/jira/browse/SPARK-23401 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Looks there are some missing types to test in supported types. > For example, please see > https://github.com/apache/spark/blob/c338c8cf8253c037ecd4f39bbd58ed5a86581b37/python/pyspark/sql/tests.py#L4397-L4401 > We can improve this test coverage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page
[ https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628742#comment-16628742 ] sandeep katta commented on SPARK-25392: --- [~abhishek.akg] as per current design pool details are shown for live UI,I am working on this PR.Can you please update this to Improvement > [Spark Job History]Inconsistent behaviour for pool details in spark web UI > and history server page > --- > > Key: SPARK-25392 > URL: https://issues.apache.org/jira/browse/SPARK-25392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Steps: > 1.Enable spark.scheduler.mode = FAIR > 2.Submitted beeline jobs > create database JH; > use JH; > create table one12( id int ); > insert into one12 values(12); > insert into one12 values(13); > Select * from one12; > 3.Click on JDBC Incompleted Application ID in Job History Page > 4. Go to Job Tab in staged Web UI page > 5. Click on run at AccessController.java:0 under Desription column > 6 . Click default under Pool Name column of Completed Stages table > URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default > 7. It throws below error > HTTP ERROR 400 > Problem accessing /history/application_1536399199015_0006/stages/pool/. > Reason: > Unknown pool: default > Powered by Jetty:// x.y.z > But under > Yarn resource page it display the summary under Fair Scheduler Pool: default > URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default > Summary > Pool Name Minimum Share Pool Weight Active Stages Running Tasks > SchedulingMode > default 0 1 0 0 FIFO -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628744#comment-16628744 ] Wenchen Fan commented on SPARK-25538: - cc [~kiszk] as well > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24440) When use constant as column we may get wrong answer versus impala
[ https://issues.apache.org/jira/browse/SPARK-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628608#comment-16628608 ] Marco Gaido commented on SPARK-24440: - Can you provide a sample repro which can be run in order to debug the issue? > When use constant as column we may get wrong answer versus impala > - > > Key: SPARK-24440 > URL: https://issues.apache.org/jira/browse/SPARK-24440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.3.0 >Reporter: zhoukang >Priority: Major > > For query below: > {code:java} > select `date`, 100 as platform, count(distinct deviceid) as new_user from > tv.clean_new_user where `date`=20180528 group by `date`, platform > {code} > We intended to group by 100 and get distinct deviceid number. > By spark sql,we get: > {code} > +---+---+---+--+ > | date| platform | new_user | > +---+---+---+--+ > | 20180528 | 100 | 521 | > | 20180528 | 100 | 82| > | 20180528 | 100 | 3 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 870 | > | 20180528 | 100 | 3 | > | 20180528 | 100 | 8 | > | 20180528 | 100 | 3 | > | 20180528 | 100 | 2204 | > | 20180528 | 100 | 1123 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 54| > | 20180528 | 100 | 440 | > | 20180528 | 100 | 4 | > | 20180528 | 100 | 478 | > | 20180528 | 100 | 34| > | 20180528 | 100 | 195 | > | 20180528 | 100 | 17| > | 20180528 | 100 | 18| > | 20180528 | 100 | 2 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 84| > | 20180528 | 100 | 1616 | > | 20180528 | 100 | 15| > | 20180528 | 100 | 7 | > | 20180528 | 100 | 479 | > | 20180528 | 100 | 50| > | 20180528 | 100 | 376 | > | 20180528 | 100 | 21| > | 20180528 | 100 | 842 | > | 20180528 | 100 | 444 | > | 20180528 | 100 | 538 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 17| > | 20180528 | 100 | 133 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 415 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 318 | > | 20180528 | 100 | 
5 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 2060 | > | 20180528 | 100 | 1217 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 60| > | 20180528 | 100 | 22| > | 20180528 | 100 | 4 | > +---+---+---+--+ > {code} > Actually sum of the deviceid is below: > {code} > 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as > platform, count(distinct deviceid) as new_user from tv.clean_new_user where > `date`=20180528 group by `date`, platform)t1; > ++--+ > | sum(new_user) | > ++--+ > | 14816 | > ++--+ > 1 row selected (4.934 seconds) > {code} > And the real distinct deviceid value is below: > {code} > 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as > new_user from tv.clean_new_user where `date`=20180528; > +---+---+--+ > | platform | new_user | > +---+---+--+ > | 100 | 14773 | > +---+---+--+ > 1 row selected (2.846 seconds) > {code} > In impala,with the first query we can get result below: > {code} > [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user > from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: > select `date`, 100 as platform, count(distinct deviceid) as new_user from > tv.clean_new_user where `date`=20180528 group by `date`, platform > +--+--+--+ > | date | platform | new_user | > +--+--+--+ > | 20180528 | 100 | 14773| > +--+--+--+ > Fetched 1 row(s) in 1.00s > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
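To spell out the expected semantics the Impala run exhibits: grouping on (`date`, a constant) puts every matching row into a single group, so COUNT(DISTINCT deviceid) should yield one number (14773 in the report), not one row per partial aggregate. A toy sketch of that semantics (plain Python with made-up device ids, not Spark's aggregation code):

```python
# GROUP BY (`date`, 100): the constant cannot split rows, so all rows for a
# given date land in one group and distinct devices are counted once.
rows = [(20180528, "d1"), (20180528, "d2"), (20180528, "d1"), (20180528, "d3")]
groups: dict = {}
for date, device in rows:
    groups.setdefault((date, 100), set()).add(device)
result = {k: len(v) for k, v in groups.items()}
print(result)  # {(20180528, 100): 3} - one group, one distinct count
```

The Spark output in the report instead shows many rows for the same group key, i.e. the distinct aggregation was split across partial groups.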
[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628721#comment-16628721 ] Felix Cheung commented on SPARK-21291: -- The PR did not have bucketBy? > R bucketBy partitionBy API > -- > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.5.0 > > > partitionBy exists but it's for windowspec only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the reatinedTask size
[ https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628588#comment-16628588 ] shahid commented on SPARK-25502: [~toopt4] No. please refer the PR, to see the fix > [Spark Job History] Empty Page when page number exceeds the reatinedTask size > -- > > Key: SPARK-25502 > URL: https://issues.apache.org/jira/browse/SPARK-25502 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: shahid >Priority: Minor > Fix For: 2.3.3, 2.4.0 > > > *Steps:* > 1. Spark installed and running properly. > 2. spark.ui.retainedTask=10 ( it is default value ) > 3.Launch Spark shell ./spark-shell --master yarn > 4. Create a spark-shell application with a single job and 50 task > val rdd = sc.parallelize(1 to 50, 50) > rdd.count > 5. Launch Job History Page and go to spark-shell application created above > under Incomplete Task > 6. Right click and got to Job page of the application and from there click > and launch Stage Page > 7. Launch the Stage Id page for the specific Stage Id for the above created > job > 8. Scroll down and check for the task completion Summary > It Displays pagination panel showing *5000 Pages Jump to 1 Show 100 items in > a page* and Go button > 9. Replace 1 with 2333 page number > *Actual Result:* > 2 Pagination Panel displayed > *Expected Result:* > Pagination Panel should not display 5000 pages as retainedTask value is > 10 and it should display 1000 page only because each page holding 100 > tasks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' or 'filterKeys'
[ https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25541. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.5.0 > CaseInsensitiveMap should be serializable after '-' or 'filterKeys' > --- > > Key: SPARK-25541 > URL: https://issues.apache.org/jira/browse/SPARK-25541 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628852#comment-16628852 ] Eugeniu commented on SPARK-18112: - This issue should be reopened. As already commented by [~Tavis] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L204 is referenced but it is not present in HiveConf since branch 2.0 https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1290 https://github.com/apache/hive/blob/branch-2.0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > 
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields
Steven Bakhtiari created SPARK-25545: Summary: CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields Key: SPARK-25545 URL: https://issues.apache.org/jira/browse/SPARK-25545 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.3.1, 2.3.0 Reporter: Steven Bakhtiari I'm loading a CSV file into a DataFrame using Spark. I have defined a schema and specified one of the fields as non-nullable. When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with missing (null) values for those columns to result in the whole row being dropped. At the moment, the CSV loader correctly drops rows that do not conform to the field type, but the nullable property is seemingly ignored. Example CSV input:
{code:java}
1,2,3
1,,3
,2,3
1,2,abc
{code}
Example Spark job:
{code:java}
val spark = SparkSession
  .builder()
  .appName("csv-test")
  .master("local")
  .getOrCreate()

spark.read
  .format("csv")
  .schema(StructType(
    StructField("col1", IntegerType, nullable = false) ::
    StructField("col2", IntegerType, nullable = false) ::
    StructField("col3", IntegerType, nullable = false) :: Nil))
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .coalesce(1)
  .write
  .format("csv")
  .option("header", false)
  .save("path/to/output")
{code}
The actual output will be:
{code:java}
1,2,3
1,,3
,2,3
{code}
Note that the row containing non-integer values has been dropped, as expected, but rows containing null values persist, despite the nullable property being set to false in the schema definition. My expected output is:
{code:java}
1,2,3
{code}
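The semantics the reporter expects can be sketched outside Spark in plain Scala: treat a row as malformed if any cell of a non-nullable integer field is missing or fails to parse. This illustrates the expected behaviour only; it is not Spark's actual CSV parser.

```scala
object DropMalformedSketch {
  import scala.util.Try

  // Parse one CSV line into integer cells; None marks a missing or unparsable value.
  def parseRow(line: String): Seq[Option[Int]] =
    line.split(",", -1).toSeq.map(s => Try(s.trim.toInt).toOption)

  // With all fields non-nullable, a row survives only if every cell parsed.
  def dropMalformed(lines: Seq[String]): Seq[String] =
    lines.filter(l => parseRow(l).forall(_.isDefined))

  def main(args: Array[String]): Unit = {
    val input = Seq("1,2,3", "1,,3", ",2,3", "1,2,abc")
    dropMalformed(input).foreach(println) // only "1,2,3" survives
  }
}
```

Until the nullability check is honoured by the CSV reader, chaining `.na.drop()` on the loaded DataFrame is a common workaround for dropping rows with nulls.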
[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields
[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628949#comment-16628949 ] Steven Bakhtiari commented on SPARK-25545: -- Somebody on Stack Overflow pointed me to an older ticket that appears to touch on the same issue: SPARK-10848 > CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not > conform to non-nullable schema fields > - > > Key: SPARK-25545 > URL: https://issues.apache.org/jira/browse/SPARK-25545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Steven Bakhtiari >Priority: Minor > Labels: CSV, csv, csvparser > > I'm loading a CSV file into a DataFrame using Spark. I have defined a schema > and specified one of the fields as non-nullable. > When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with > missing (null) values for those columns to result in the whole row being > dropped. At the moment, the CSV loader correctly drops rows that do not > conform to the field type, but the nullable property is seemingly ignored. 
> Example CSV input: > {code:java} > 1,2,3 > 1,,3 > ,2,3 > 1,2,abc > {code} > Example Spark job: > {code:java} > val spark = SparkSession > .builder() > .appName("csv-test") > .master("local") > .getOrCreate() > spark.read > .format("csv") > .schema(StructType( > StructField("col1", IntegerType, nullable = false) :: > StructField("col2", IntegerType, nullable = false) :: > StructField("col3", IntegerType, nullable = false) :: Nil)) > .option("header", false) > .option("mode", "DROPMALFORMED") > .load("path/to/file.csv") > .coalesce(1) > .write > .format("csv") > .option("header", false) > .save("path/to/output") > {code} > The actual output will be: > {code:java} > 1,2,3 > 1,,3 > ,2,3{code} > Note that the row containing non-integer values has been dropped, as > expected, but rows containing null values persist, despite the nullable > property being set to false in the schema definition. > My expected output is: > {code:java} > 1,2,3{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25509) SHS V2 cannot be enabled on Windows because POSIX permissions are not supported
[ https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25509. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.3 Issue resolved by pull request 22520 [https://github.com/apache/spark/pull/22520] > SHS V2 cannot be enabled on Windows because POSIX permissions are not supported > --- > > Key: SPARK-25509 > URL: https://issues.apache.org/jira/browse/SPARK-25509 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1, 2.4.0 >Reporter: Rong Tang >Assignee: Rong Tang >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX > permissions. It fails with the exception java.lang.UnsupportedOperationException: 'posix:permissions' > not supported as initial attribute. > The test case > org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing > space") fails on Windows without this fix. > > PR: https://github.com/apache/spark/pull/22520
[jira] [Assigned] (SPARK-25509) SHS V2 cannot be enabled on Windows because POSIX permissions are not supported
[ https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25509: - Assignee: Rong Tang > SHS V2 cannot be enabled on Windows because POSIX permissions are not supported > --- > > Key: SPARK-25509 > URL: https://issues.apache.org/jira/browse/SPARK-25509 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1, 2.4.0 >Reporter: Rong Tang >Assignee: Rong Tang >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX > permissions. It fails with the exception java.lang.UnsupportedOperationException: 'posix:permissions' > not supported as initial attribute. > The test case > org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing > space") fails on Windows without this fix. > > PR: https://github.com/apache/spark/pull/22520
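The failure mode above can be probed with plain Java NIO (which the disk manager's file handling sits on): whether "posix" attributes are available depends on the underlying file system, so creation-time POSIX attributes must be guarded rather than assumed. A small sketch of such a probe, not the actual fix in PR 22520:

```scala
import java.nio.file.FileSystems

object PosixCheck {
  def main(args: Array[String]): Unit = {
    // NTFS on Windows does not report "posix" here; most Unix file systems do.
    // The "basic" view, by contrast, is guaranteed on every platform.
    val posixSupported =
      FileSystems.getDefault.supportedFileAttributeViews.contains("posix")
    println(s"posix attribute view supported: $posixSupported")
  }
}
```

Code that would pass `PosixFilePermissions.asFileAttribute(...)` to `Files.createFile` can branch on this flag and fall back to setting no initial attributes where POSIX is unavailable.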
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628865#comment-16628865 ] Hyukjin Kwon commented on SPARK-18112: -- Can you post reproducer steps please before we open this? > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message 
was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-25544: -- Description: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The usual justification for scaling features is that it ensures all covariates are O(1), which should improve numerical convergence, but this argument does not account for the regularization term. That causes no issues when standardization is set to true, since all features then have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations (which can occur legitimately, e.g. via one-hot encoding) will have very large effective regularization strengths, and consequently very large gradients and thus poor convergence in the solver. *Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model fails to converge before it hits the maximum iteration limit. 
In this case, it is the interaction between category "2" and the numeric feature that leads to a feature with a small standard deviation. Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction")
  .setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))
val pipelineModel = pipeline.fit(df)
val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
*Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false, rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd =
  if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt)
  else featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence after just 4 iterations instead of hitting the max iterations limit! 
*Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on.
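The scale of the problem in the reported example can be seen with a back-of-envelope calculation in plain Scala, no Spark required. The interaction column takes the values 0.0, 0.0, 0.01 across the three training rows, so its variance is tiny, and by the O(1/sigma_i^2) argument above the effective penalty on that coefficient is huge. The "roughly regParam / sigma^2" figure below is that order-of-magnitude estimate, not an exact Spark internal:

```scala
object EffectiveRegStrength {
  // Population variance of a sequence of doubles.
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  def main(args: Array[String]): Unit = {
    val interaction = Seq(0.0, 0.0, 0.01) // category "2" x numericFeature column
    val sigma2 = variance(interaction)    // ~2.2e-5
    val regParam = 1.0
    // With standardization=false, the internal reweighting makes the effective
    // penalty on this coefficient roughly regParam / sigma^2: tens of thousands
    // of times stronger than the nominal regParam of 1.0.
    println(f"effective strength ~ ${regParam / sigma2}%.3g")
  }
}
```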
[jira] [Resolved] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20937. -- Resolution: Fixed Fix Version/s: 2.4.1 2.5.0 Issue resolved by pull request 22453 [https://github.com/apache/spark/pull/22453] > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Chenxiao Mao >Priority: Trivial > Fix For: 2.5.0, 2.4.1 > > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-20937: Assignee: Chenxiao Mao > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Chenxiao Mao >Priority: Trivial > Fix For: 2.5.0, 2.4.1 > > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
Marcelo Vanzin created SPARK-25546: -- Summary: RDDInfo uses SparkEnv before it may have been initialized Key: SPARK-25546 URL: https://issues.apache.org/jira/browse/SPARK-25546 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 2.4.0 Reporter: Marcelo Vanzin This code: {code} private[spark] object RDDInfo { private val callsiteLongForm = SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM) {code} Has two problems: - it keeps that value across different SparkEnv instances. So e.g. if you have two tests that rely on different values for that config, one of them will break. - it assumes tests always initialize a SparkEnv. e.g. if you run "core/testOnly *.AppStatusListenerSuite", it will fail because {{SparkEnv.get}} returns null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
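The fragile pattern and an obvious alternative can be sketched without Spark; `GlobalEnv` below is a hypothetical stand-in for `SparkEnv`, and `callsite.long` a stand-in config key:

```scala
object EnvPattern {
  // Stand-in for SparkEnv: mutable global state that tests swap in and out.
  object GlobalEnv { var current: Map[String, String] = null }

  // Fragile (like the quoted RDDInfo code): captured once at object
  // initialization, so it NPEs if the env is not yet set, and it never
  // observes a replacement env.
  //   object RDDInfoLike { val callsiteLongForm = GlobalEnv.current("callsite.long") }

  // Safer: read the setting per call, with a default when no env is present.
  def callsiteLongForm: Boolean =
    Option(GlobalEnv.current).flatMap(_.get("callsite.long")).exists(_.toBoolean)

  def main(args: Array[String]): Unit = {
    println(callsiteLongForm) // false: no env yet, but no NPE either
    GlobalEnv.current = Map("callsite.long" -> "true")
    println(callsiteLongForm) // true: picks up the newly installed env
  }
}
```

The per-call read fixes both reported problems: different SparkEnv instances see their own config value, and tests that never initialize an env get a safe default instead of a NullPointerException.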
[jira] [Commented] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629216#comment-16629216 ] Apache Spark commented on SPARK-25546: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22558 > RDDInfo uses SparkEnv before it may have been initialized > - > > Key: SPARK-25546 > URL: https://issues.apache.org/jira/browse/SPARK-25546 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This code: > {code} > private[spark] object RDDInfo { > private val callsiteLongForm = > SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM) > {code} > Has two problems: > - it keeps that value across different SparkEnv instances. So e.g. if you > have two tests that rely on different values for that config, one of them > will break. > - it assumes tests always initialize a SparkEnv. e.g. if you run > "core/testOnly *.AppStatusListenerSuite", it will fail because > {{SparkEnv.get}} returns null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25546: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629218#comment-16629218 ] Apache Spark commented on SPARK-25546: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22558
[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629198#comment-16629198 ] Huaxin Gao commented on SPARK-21291: [~felixcheung] I will submit a PR for bucketBy. bucketBy doesn't work with save. {code:java} assertNotBucketed("save") {code} If bucketBy is set, shall I use saveAsTable instead? > R bucketBy partitionBy API > -- > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.5.0 > > > partitionBy exists but it's for windowspec only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629231#comment-16629231 ] David Spies commented on SPARK-18492: - Ran into this as well. It seems like this is happening because the "Optimized Logical Plan" is significantly larger than the "Parsed Logical Plan". Is there an "optimization" I can turn off that will keep the size down? (Spark v. 2.1.3) {code:java} == Parsed Logical Plan == Aggregate [count(1) AS count#2296L] +- Filter (age_imputed_fac#2247 = age_imputed_0) +- Project [PassengerId#2183L AS PassengerId#2226L, Survived#2184 AS Survived#2227, Pclass#2185 AS Pclass#2228, Sex#2186 AS Sex#2229, Age#2187 AS Age#2230, SibSp#2188L AS SibSp#2231L, Parch#2189L AS Parch#2232L, Ticket#2190 AS Ticket#2233, Fare#2191 AS Fare#2234, Cabin#2192 AS Cabin#2235, Embarked#2193 AS Embarked#2236, firstname_proc#2194 AS firstname_proc#2237, lastname_proc#2195 AS lastname_proc#2238, age_1_male#2196 AS age_1_male#2239, age_2_male#2197 AS age_2_male#2240, age_3_male#2198 AS age_3_male#2241, age_1_female#2199 AS age_1_female#2242, age_2_female#2200 AS age_2_female#2243, age_3_female#2201 AS age_3_female#2244, age_imputed#2202 AS age_imputed#2245, age_imputed_1#2203 AS age_imputed_1#2246, coalesce(CASE WHEN (true = ((age_imputed#2202 >= 0.0) && (age_imputed#2202 < 16.0))) THEN age_imputed_0 END, CASE WHEN (true = ((age_imputed#2202 >= 16.0) && (age_imputed#2202 < 32.0))) THEN age_imputed_1 END, CASE WHEN (true = ((age_imputed#2202 >= 32.0) && (age_imputed#2202 < 48.0))) THEN age_imputed_2 END, CASE WHEN (true = ((age_imputed#2202 >= 48.0) && (age_imputed#2202 < 64.0))) THEN age_imputed_3 END, CASE WHEN (true = ((age_imputed#2202 >= 64.0) && (age_imputed#2202 < 81.0))) THEN age_imputed_4 END, CASE WHEN (true = isnull(age_imputed#2202)) THEN age_imputed_NULL END) AS age_imputed_fac#2247] +- Project [PassengerId#2142L AS PassengerId#2183L, Survived#2143 AS Survived#2184, Pclass#2144 AS Pclass#2185, Sex#2145 AS 
Sex#2186, Age#2146 AS Age#2187, SibSp#2147L AS SibSp#2188L, Parch#2148L AS Parch#2189L, Ticket#2149 AS Ticket#2190, Fare#2150 AS Fare#2191, Cabin#2151 AS Cabin#2192, Embarked#2152 AS Embarked#2193, firstname_proc#2153 AS firstname_proc#2194, lastname_proc#2154 AS lastname_proc#2195, age_1_male#2155 AS age_1_male#2196, age_2_male#2156 AS age_2_male#2197, age_3_male#2157 AS age_3_male#2198, age_1_female#2158 AS age_1_female#2199, age_2_female#2159 AS age_2_female#2200, age_3_female#2160 AS age_3_female#2201, age_imputed#2161 AS age_imputed#2202, coalesce(age_imputed#2161, 0.0) AS age_imputed_1#2203] +- Project [PassengerId#2103L AS PassengerId#2142L, Survived#2104 AS Survived#2143, Pclass#2105 AS Pclass#2144, Sex#2106 AS Sex#2145, Age#2107 AS Age#2146, SibSp#2108L AS SibSp#2147L, Parch#2109L AS Parch#2148L, Ticket#2110 AS Ticket#2149, Fare#2111 AS Fare#2150, Cabin#2112 AS Cabin#2151, Embarked#2113 AS Embarked#2152, firstname_proc#2114 AS firstname_proc#2153, lastname_proc#2115 AS lastname_proc#2154, age_1_male#2116 AS age_1_male#2155, age_2_male#2117 AS age_2_male#2156, age_3_male#2118 AS age_3_male#2157, age_1_female#2119 AS age_1_female#2158, age_2_female#2120 AS age_2_female#2159, age_3_female#2121 AS age_3_female#2160, coalesce(age_1_male#2116, age_2_male#2117, age_3_male#2118, age_1_female#2119, age_2_female#2120, age_3_female#2121, Age#2107) AS age_imputed#2161] +- Project [PassengerId#2076L AS PassengerId#2103L, Survived#2077 AS Survived#2104, Pclass#2078 AS Pclass#2105, Sex#2079 AS Sex#2106, Age#2080 AS Age#2107, SibSp#2081L AS SibSp#2108L, Parch#2082L AS Parch#2109L, Ticket#2083 AS Ticket#2110, Fare#2084 AS Fare#2111, Cabin#2085 AS Cabin#2112, Embarked#2086 AS Embarked#2113, firstname_proc#2087 AS firstname_proc#2114, lastname_proc#2088 AS lastname_proc#2115, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = male)) && (Pclass#2078 = 1))) THEN 39.56 END AS age_1_male#2116, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = male)) && (Pclass#2078 = 2))) 
THEN 21.72 END AS age_2_male#2117, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = male)) && (Pclass#2078 = 3))) THEN 26.84 END AS age_3_male#2118, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = female)) && (Pclass#2078 = 1))) THEN 38.84 END AS age_1_female#2119, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = female)) && (Pclass#2078 = 2))) THEN 27.48 END AS age_2_female#2120, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = female)) && (Pclass#2078 = 3))) THEN 11.16 END AS age_3_female#2121] +- Project [CASE WHEN (true = ((PassengerId#106L >= 1) && (PassengerId#106L <= 900))) THEN PassengerId#106L END AS PassengerId#2076L, CASE WHEN (true = ((Survived#107 >= false) && (Survived#107 <= true))) THEN Survived#107 END AS Survived#2077, CASE WHEN (true = Pclass#108 IN (1,2,3)) THEN Pclass#108 END AS Pclass#2078, CASE WHEN (true
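For the 64 KB question above: there is no single optimization to switch off, but two mitigations are commonly suggested. Treat both as hedged workarounds rather than fixes, and check availability in your Spark version. The first is a standard Spark SQL conf that stops whole-stage codegen from fusing the whole chain of `Project`s into one generated method:

```properties
# Mitigation, not a fix: fall back to interpreted execution per operator so
# no single generated Java method has to cover the entire optimized plan.
spark.sql.codegen.wholeStage  false
```

The second is to break the lineage itself, e.g. with `Dataset.checkpoint()` or by writing the intermediate result out and reading it back, so that neither the optimizer nor codegen ever sees one enormous expression tree.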
[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25546: Assignee: Apache Spark > RDDInfo uses SparkEnv before it may have been initialized > - > > Key: SPARK-25546 > URL: https://issues.apache.org/jira/browse/SPARK-25546 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > This code: > {code} > private[spark] object RDDInfo { > private val callsiteLongForm = > SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM) > {code} > Has two problems: > - it keeps that value across different SparkEnv instances. So e.g. if you > have two tests that rely on different values for that config, one of them > will break. > - it assumes tests always initialize a SparkEnv. e.g. if you run > "core/testOnly *.AppStatusListenerSuite", it will fail because > {{SparkEnv.get}} returns null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2
[ https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25533: -- Assignee: shahid > Inconsistent message for Completed Jobs in the JobUI, when there are failed > jobs, compared to spark2.2 > --- > > Key: SPARK-25533 > URL: https://issues.apache.org/jira/browse/SPARK-25533 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: shahid >Assignee: shahid >Priority: Major > Attachments: Screenshot from 2018-09-26 00-42-00.png, Screenshot from > 2018-09-26 00-46-35.png > > > Test steps: > 1) bin/spark-shell > {code:java} > sc.parallelize(1 to 5, 5).collect() > sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail > Job")}.collect() > {code} > *Output in spark - 2.3.1:* > !Screenshot from 2018-09-26 00-42-00.png! > *Output in spark - 2.2.1:* > !Screenshot from 2018-09-26 00-46-35.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25318. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22325 [https://github.com/apache/spark/pull/22325] > Add exception handling when wrapping the input stream during the the fetch or > stage retry in response to a corrupted block > -- > > Key: SPARK-25318 > URL: https://issues.apache.org/jira/browse/SPARK-25318 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Reza Safi >Assignee: Reza Safi >Priority: Minor > Fix For: 2.4.0 > > > SPARK-4105 provided a solution to block corruption issue by retrying the > fetch or the stage. In the solution there is a step that wraps the input > stream with compression and/or encryption. This step is prone to exceptions, > but in the current code there is no exception handling for this step and this > has caused confusion for the user. 
In fact we have customers who reported an > exception like the following when SPARK-4105 is available to them: > {noformat} > 2018-08-28 22:35:54,361 ERROR [Driver] > org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > tostage failure: Task 452 in stage 209.0 failed 4 times, most recent > failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > 3976 at > org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) > 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) > 3980 at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > 3981 at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > 3982 at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > 3983 at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > 3984 at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) > 3985 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) > 3986 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) > 3987 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) > 3988 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) > 3989 at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 3990 a > {noformat} > In this customer's version of spark, line 328 of > ShuffleBlockFetcherIterator.scala is the line that the following occurs: > {noformat} > input = streamWrapper(blockId, in) > {noformat} > It would be nice to add exception 
handling around this line to avoid > confusions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
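The proposed change is straightforward to sketch outside Spark: perform the wrap step under its own try/except and re-raise with the block's identity attached, so a corrupted block surfaces as a clear fetch-level error instead of a bare codec failure deep in iterator code. Below, gzip stands in for the Snappy wrapper, and `wrap_stream` plus the block id are illustrative names, not Spark API:

```python
import gzip

def wrap_stream(block_id: str, raw: bytes) -> bytes:
    """Stand-in for the `input = streamWrapper(blockId, in)` step."""
    try:
        # The wrapper reads the stream header up front, so a corrupted block
        # fails right here, where we still know which block it was.
        return gzip.decompress(raw)
    except OSError as err:  # gzip raises BadGzipFile, an OSError subclass
        raise IOError(f"failed to wrap stream for block {block_id}") from err
```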
[jira] [Updated] (SPARK-25536) executorSource.METRIC reads the wrong record in Executor.scala, line 444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25536: -- Affects Version/s: 2.3.0 2.3.1 > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: ZhuoerXu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25318: -- Assignee: Reza Safi
[jira] [Commented] (SPARK-25535) Work around bad error checking in commons-crypto
[ https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629174#comment-16629174 ] Apache Spark commented on SPARK-25535: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22557 > Work around bad error checking in commons-crypto > > > Key: SPARK-25535 > URL: https://issues.apache.org/jira/browse/SPARK-25535 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: Marcelo Vanzin >Priority: Major > > The commons-crypto library used for encryption can get confused when certain > errors happen; that can lead to crashes since the Java side thinks the > ciphers are still valid while the native side has already cleaned up the > ciphers. > We can work around that in Spark by doing some error checking at a higher > level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
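The "error checking at a higher level" idea can be sketched generically: wrap the cipher, and once any operation fails, latch a broken flag so the managed side never touches native state that may already have been torn down. `CheckedCipher` below is an illustrative sketch, not the actual Spark or commons-crypto API:

```python
class CheckedCipher:
    """Refuses further use of a cipher after its first failure."""

    def __init__(self, cipher):
        self._cipher = cipher
        self._broken = False

    def update(self, data):
        if self._broken:
            # Fail fast on the managed side instead of calling into native
            # code whose state may already have been cleaned up.
            raise ValueError("cipher is unusable after a previous failure")
        try:
            return self._cipher.update(data)
        except Exception:
            self._broken = True  # never retry a possibly torn-down cipher
            raise
```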
[jira] [Assigned] (SPARK-25535) Work around bad error checking in commons-crypto
[ https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25535: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-25535) Work around bad error checking in commons-crypto
[ https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25535: Assignee: Apache Spark
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629182#comment-16629182 ] Leo Gallucci commented on SPARK-18112: -- To make things worse, Hive is already at version 3. The same goes for Hadoop: the default Spark+Hadoop distribution ships with Hadoop 2.7 while Hadoop is already at 3.1. It is really hard to understand how such a popular open source project as Spark keeps dependencies that are years old, some seven years or more. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at >
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
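For what it is worth, newer Spark releases expose a supported route for talking to a different metastore version via configuration, rather than replacing the built-in Hive 1.2.1 client jars by hand. The property names below are the standard Spark SQL ones; the set of accepted metastore versions depends on the Spark release, so treat the values as an example:

```properties
# Point the isolated metastore client at the deployed Hive version and let
# Spark fetch matching client jars (or supply a classpath instead of "maven").
spark.sql.hive.metastore.version  2.3.3
spark.sql.hive.metastore.jars     maven
```

This keeps the Hive classes used for Spark's internal execution separate from the client used to talk to the metastore, which is exactly the split that hand-swapping jars breaks.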
[jira] [Issue Comment Deleted] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized
[ https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-25546: --- Comment: was deleted (was: User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22558)
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629000#comment-16629000 ] Eugeniu commented on SPARK-18112: - I can only describe my situation. I am using AWS EMR 5.17.0 with Hive, Spark, Zeppelin, Hue installed. In Zeppelin the configuration variable for spark interpretter points to /usr/lib/spark. There I found jars/ folder. In jars folder I have the following hive related libraries. {code} -rw-r--r-- 1 root root 139044 Aug 15 01:06 hive-beeline-1.2.1-spark2-amzn-0.jar -rw-r--r-- 1 root root40850 Aug 15 01:06 hive-cli-1.2.1-spark2-amzn-0.jar -rw-r--r-- 1 root root 11497847 Aug 15 01:06 hive-exec-1.2.1-spark2-amzn-0.jar -rw-r--r-- 1 root root 101113 Aug 15 01:06 hive-jdbc-1.2.1-spark2-amzn-0.jar -rw-r--r-- 1 root root 5472179 Aug 15 01:06 hive-metastore-1.2.1-spark2-amzn-0.jar {code} If I replace them with their 2.3.3 equivalents, e.g. hive-exec-1.2.1-spark2-amzn-0.jar -> hive-exec-2.3.3-amzn-1.jar I get the following error when running SQL query in spark: {code} java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79) at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.zeppelin.spark.SparkSqlInterpreter.interpret(SparkSqlInterpreter.java:116) at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97) at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:498) at org.apache.zeppelin.scheduler.Job.run(Job.java:175) at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at
[jira] [Commented] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2
[ https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629243#comment-16629243 ] Marcelo Vanzin commented on SPARK-25533: This is merged to master. I'll backport it to 2.4 and 2.3 after I fix an unrelated issue that I ran into during testing.
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281 ] Kazuaki Ishizaki commented on SPARK-25538: -- Hi [~Steven Rand], would it be possible to share the schema of this DataFrame? > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
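The correctness property being exercised in the sessions above can be stated independently of Spark: reordering rows or renaming a column must never change the number of distinct rows. A minimal plain-Java sketch of that invariant (hypothetical toy data, not the reporter's dataset), modeling a row as a list of values:

```java
import java.util.*;

public class DistinctInvariant {
    public static void main(String[] args) {
        // Rows modeled as lists of values; List's equals/hashCode give row equality.
        List<List<Object>> rows = new ArrayList<>(Arrays.asList(
                Arrays.asList("a", 1), Arrays.asList("a", 1),
                Arrays.asList("b", 2), Arrays.asList("c", 3)));

        long distinctBefore = new HashSet<>(rows).size();

        // Reordering the rows (analogous to df.sort(...)) must not change the count.
        Collections.shuffle(rows, new Random(42));
        long distinctAfter = new HashSet<>(rows).size();

        System.out.println(distinctBefore);                  // 3 (one duplicate row)
        System.out.println(distinctBefore == distinctAfter); // true: order never matters
    }
}
```

The bug report shows exactly this invariant being violated: the same data yields 115, 116, or 123 distinct rows depending on sorting and renaming.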
[jira] [Commented] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629321#comment-16629321 ] Michal Šenkýř commented on SPARK-17952: --- Implemented nested bean support in pull request. Arrays and lists not supported yet. Will add them later if approved to put code in line with docs. > SparkSession createDataFrame method throws exception for nested JavaBeans > - > > Key: SPARK-17952 > URL: https://issues.apache.org/jira/browse/SPARK-17952 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.3.0 >Reporter: Amit Baghel >Priority: Major > > As per latest spark documentation for Java at > http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, > > {quote} > Nested JavaBeans and List or Array fields are supported though. > {quote} > However nested JavaBean is not working. Please see the below code. > SubCategory class > {code} > public class SubCategory implements Serializable{ > private String id; > private String name; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > } > {code} > Category class > {code} > public class Category implements Serializable{ > private String id; > private SubCategory subCategory; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public SubCategory getSubCategory() { > return subCategory; > } > public void setSubCategory(SubCategory subCategory) { > this.subCategory = subCategory; > } > } > {code} > SparkSample class > {code} > public class SparkSample { > public static void main(String[] args) throws IOException { > > SparkSession spark = SparkSession > .builder() > .appName("SparkSample") > .master("local") > .getOrCreate(); > //SubCategory > SubCategory sub = new SubCategory(); > sub.setId("sc-111"); > 
sub.setName("Sub-1"); > //Category > Category category = new Category(); > category.setId("s-111"); > category.setSubCategory(sub); > //categoryList > List categoryList = new ArrayList(); > categoryList.add(category); >//DF > Dataset dframe = spark.createDataFrame(categoryList, > Category.class); > dframe.show(); > } > } > {code} > Above code throws below error. > {code} > Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d > (of class com.sample.SubCategory) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at 
scala.collection.Iterator$class.toStream(Iterator.scala:1322) > at
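Until nested bean support lands (see the linked pull request), one common workaround is to flatten the nested bean into a single-level bean before calling {{createDataFrame}}. The sketch below is illustrative only: {{FlatCategory}} and {{flatten}} are hypothetical names, and the Spark call itself is omitted so the example stays self-contained:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the Category/SubCategory beans from the report.
class SubCategory implements Serializable {
    private String id, name;
    public String getId() { return id; }      public void setId(String v) { id = v; }
    public String getName() { return name; }  public void setName(String v) { name = v; }
}

class Category implements Serializable {
    private String id;
    private SubCategory subCategory;
    public String getId() { return id; }  public void setId(String v) { id = v; }
    public SubCategory getSubCategory() { return subCategory; }
    public void setSubCategory(SubCategory s) { subCategory = s; }
}

// Hypothetical single-level bean; reflection-based schema inference handles flat beans today.
class FlatCategory implements Serializable {
    private String id, subCategoryId, subCategoryName;
    public String getId() { return id; }  public void setId(String v) { id = v; }
    public String getSubCategoryId() { return subCategoryId; }
    public void setSubCategoryId(String v) { subCategoryId = v; }
    public String getSubCategoryName() { return subCategoryName; }
    public void setSubCategoryName(String v) { subCategoryName = v; }
}

public class FlattenWorkaround {
    // Flatten each nested Category; the result list could then be passed to
    // spark.createDataFrame(flatList, FlatCategory.class) without the MatchError.
    static List<FlatCategory> flatten(List<Category> categories) {
        List<FlatCategory> out = new ArrayList<>();
        for (Category c : categories) {
            FlatCategory f = new FlatCategory();
            f.setId(c.getId());
            f.setSubCategoryId(c.getSubCategory().getId());
            f.setSubCategoryName(c.getSubCategory().getName());
            out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        SubCategory sub = new SubCategory();
        sub.setId("sc-111");
        sub.setName("Sub-1");
        Category cat = new Category();
        cat.setId("s-111");
        cat.setSubCategory(sub);
        List<Category> in = new ArrayList<>();
        in.add(cat);
        FlatCategory f = flatten(in).get(0);
        System.out.println(f.getId() + "," + f.getSubCategoryId() + "," + f.getSubCategoryName());
    }
}
```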
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629320#comment-16629320 ] Mingjie Tang commented on SPARK-25501: -- [~gsomogyi] Thanks for your reply. First, the PR I proposed here is only a starting point for discussion; we can use it or disregard it, either way is fine with me. My main point is that we should move this ticket forward as soon as possible, since this feature is critical for production and for the community. Second, you could write up a design document and an SPIP so the design can be discussed; I would welcome advice from you and others. Finally, thank you for starting on this work; your example is very good. Feel free to build on my PR or implement it yourself, and then we can discuss and move this forward. What do you think? > Kafka delegation token support > -- > > Key: SPARK-25501 > URL: https://issues.apache.org/jira/browse/SPARK-25501 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > > In kafka version 1.1 delegation token support is released. As spark updated > its kafka client to 2.0.0 now it's possible to implement delegation token > support. Please see description: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka
[jira] [Commented] (SPARK-25531) new write APIs for data source v2
[ https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629418#comment-16629418 ] Ryan Blue commented on SPARK-25531: --- [~cloud_fan], what was the intent for this umbrella issue? You described it as progress of "Standardize SQL logical plans" but the current description is "new write APIs" instead. Also, these issues were already tracked under the umbrella SPARK-22386 to improve DSv2, which covers the new logical plans and other support issues like adding interfaces for required clustering and sorting (SPARK-23889). Is your intent to close the other issue because it is too old? > new write APIs for data source v2 > - > > Key: SPARK-25531 > URL: https://issues.apache.org/jira/browse/SPARK-25531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Wenchen Fan >Priority: Major > > The current data source write API heavily depend on {{SaveMode}}, which > doesn't have a clear semantic, especially when writing to tables. > We should design a new set of write API without {{SaveMode}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25547) Pluggable jdbc connection factory
Frank Sauer created SPARK-25547: --- Summary: Pluggable jdbc connection factory Key: SPARK-25547 URL: https://issues.apache.org/jira/browse/SPARK-25547 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Frank Sauer The ability to provide a custom connectionFactoryProvider via JDBCOptions, so that JdbcUtils.createConnectionFactory can produce a custom connection factory, would be very useful. In our case we needed to load-balance connections to an AWS Aurora Postgres cluster by round-robining through the endpoints of the read replicas, since their own load balancing was insufficient. We worked around it by copying most of the spark jdbc package, providing this feature there, and changing the format from jdbc to our new package. However, it would be nice if this were supported out of the box via a new option in JDBCOptions providing the classname for a ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR which I have ready to go.
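The round-robin behaviour described above can be sketched without any Spark or AWS specifics. {{ReadReplicaRoundRobin}} is a hypothetical helper and the JDBC URLs are placeholders; a real {{ConnectionFactoryProvider}} implementation would call {{next()}} to pick the endpoint before opening each connection:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: cycles through reader endpoints so each new connection
// request targets the next replica in turn.
class ReadReplicaRoundRobin {
    private final List<String> endpoints;
    private final AtomicInteger cursor = new AtomicInteger(0);

    ReadReplicaRoundRobin(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    String next() {
        // floorMod keeps the index non-negative even after int overflow.
        int i = Math.floorMod(cursor.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}

public class RoundRobinDemo {
    public static void main(String[] args) {
        ReadReplicaRoundRobin rr = new ReadReplicaRoundRobin(List.of(
                "jdbc:postgresql://replica-1/db",
                "jdbc:postgresql://replica-2/db"));
        System.out.println(rr.next()); // replica-1
        System.out.println(rr.next()); // replica-2
        System.out.println(rr.next()); // wraps back to replica-1
    }
}
```

Using an {{AtomicInteger}} cursor keeps the selection thread-safe, which matters because Spark opens connections from multiple executor threads.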
[jira] [Assigned] (SPARK-25547) Pluggable jdbc connection factory
[ https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25547: Assignee: Apache Spark > Pluggable jdbc connection factory > - > > Key: SPARK-25547 > URL: https://issues.apache.org/jira/browse/SPARK-25547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Sauer >Assignee: Apache Spark >Priority: Major > > The ability to provide a custom connectionFactoryProvider via JDBCOptions so > that JdbcUtils.createConnectionFactory can produce a custom connection > factory would be very useful. In our case we needed to have the ability to > load balance connections to an AWS Aurora Postgres cluster by round-robining > through the endpoints of the read replicas since their own loan balancing was > insufficient. We got away with it by copying most of the spark jdbc package > and provide this feature there and changing the format from jdbc to our new > package. However it would be nice if this were supported out of the box via > a new option in JDBCOptions providing the classname for a > ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR > which I have ready to go. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25547) Pluggable jdbc connection factory
[ https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629425#comment-16629425 ] Apache Spark commented on SPARK-25547: -- User 'fsauer65' has created a pull request for this issue: https://github.com/apache/spark/pull/22560 > Pluggable jdbc connection factory > - > > Key: SPARK-25547 > URL: https://issues.apache.org/jira/browse/SPARK-25547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Sauer >Priority: Major > > The ability to provide a custom connectionFactoryProvider via JDBCOptions so > that JdbcUtils.createConnectionFactory can produce a custom connection > factory would be very useful. In our case we needed to have the ability to > load balance connections to an AWS Aurora Postgres cluster by round-robining > through the endpoints of the read replicas since their own loan balancing was > insufficient. We got away with it by copying most of the spark jdbc package > and provide this feature there and changing the format from jdbc to our new > package. However it would be nice if this were supported out of the box via > a new option in JDBCOptions providing the classname for a > ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR > which I have ready to go. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25547) Pluggable jdbc connection factory
[ https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25547: Assignee: (was: Apache Spark) > Pluggable jdbc connection factory > - > > Key: SPARK-25547 > URL: https://issues.apache.org/jira/browse/SPARK-25547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Sauer >Priority: Major > > The ability to provide a custom connectionFactoryProvider via JDBCOptions so > that JdbcUtils.createConnectionFactory can produce a custom connection > factory would be very useful. In our case we needed to have the ability to > load balance connections to an AWS Aurora Postgres cluster by round-robining > through the endpoints of the read replicas since their own loan balancing was > insufficient. We got away with it by copying most of the spark jdbc package > and provide this feature there and changing the format from jdbc to our new > package. However it would be nice if this were supported out of the box via > a new option in JDBCOptions providing the classname for a > ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR > which I have ready to go. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629420#comment-16629420 ] t oo commented on SPARK-18112: -- Hear, hear! > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)
[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness
[ https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24285: -- Description: *2.5.0-SNAPSHOT* - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640 *2.3.x* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ was: - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ > Flaky test: ContinuousSuite.query without test harness > -- > > Key: SPARK-24285 > URL: https://issues.apache.org/jira/browse/SPARK-24285 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *2.5.0-SNAPSHOT* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640 > *2.3.x* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness
[ https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24285: -- Description: *2.5.0-SNAPSHOT* - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640] {code:java} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) was false{code} *2.3.x* - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/] - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/] was: *2.5.0-SNAPSHOT* - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640 *2.3.x* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ > Flaky test: ContinuousSuite.query without test harness > -- > > Key: SPARK-24285 > URL: https://issues.apache.org/jira/browse/SPARK-24285 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *2.5.0-SNAPSHOT* > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640] > {code:java} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, > scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => > 
org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) > was false{code} > *2.3.x* > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25372) Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-25372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25372. Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22362 [https://github.com/apache/spark/pull/22362] > Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit > -- > > Key: SPARK-25372 > URL: https://issues.apache.org/jira/browse/SPARK-25372 > Project: Spark > Issue Type: Bug > Components: Kubernetes, YARN >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Assignee: Ilan Filonenko >Priority: Major > Fix For: 2.5.0 > > > {{SparkSubmit}} already logs in the user if a keytab is provided, the only > issue is that it uses the existing configs which have "yarn" in their name. > As such, we should use a common name for the principal and keytab configs, > and deprecate the YARN-specific ones. > cc [~vanzin] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629561#comment-16629561 ] Steven Rand commented on SPARK-25538: - [~kiszk], yes, the schema is: {code} scala> spark.read.parquet("hdfs:///data").printSchema root |-- col_0: string (nullable = true) |-- col_1: timestamp (nullable = true) |-- col_2: string (nullable = true) |-- col_3: timestamp (nullable = true) |-- col_4: string (nullable = true) |-- col_5: array (nullable = true) ||-- element: string (containsNull = true) |-- col_6: string (nullable = true) |-- col_7: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_8: array (nullable = true) ||-- element: string (containsNull = true) |-- col_9: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_10: string (nullable = true) |-- col_11: timestamp (nullable = true) |-- col_12: integer (nullable = true) |-- col_13: boolean (nullable = true) |-- col_14: decimal(38,18) (nullable = true) |-- col_15: long (nullable = true) |-- col_16: string (nullable = true) |-- col_17: integer (nullable = true) |-- col_18: array (nullable = true) ||-- element: string (containsNull = true) |-- col_19: string (nullable = true) |-- col_20: string (nullable = true) |-- col_21: array (nullable = true) ||-- element: string (containsNull = true) |-- col_22: string (nullable = true) |-- col_23: array (nullable = true) ||-- element: timestamp (containsNull = true) |-- col_24: string (nullable = true) |-- col_25: string (nullable = true) |-- col_26: string (nullable = true) |-- col_27: array (nullable = true) ||-- element: string (containsNull = true) |-- col_28: array (nullable = true) ||-- element: string (containsNull = true) |-- col_29: array (nullable = true) ||-- element: string (containsNull = true) |-- col_30: array (nullable = true) ||-- element: string (containsNull = true) |-- col_31: decimal(38,18) (nullable = true) |-- col_32: array (nullable = 
true) ||-- element: string (containsNull = true) |-- col_33: string (nullable = true) |-- col_34: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_35: decimal(38,18) (nullable = true) |-- col_36: array (nullable = true) ||-- element: string (containsNull = true) |-- col_37: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_38: decimal(38,18) (nullable = true) |-- col_39: array (nullable = true) ||-- element: string (containsNull = true) |-- col_40: string (nullable = true) |-- col_41: string (nullable = true) |-- col_42: string (nullable = true) |-- col_43: array (nullable = true) ||-- element: string (containsNull = true) |-- col_44: array (nullable = true) ||-- element: string (containsNull = true) |-- col_45: string (nullable = true) |-- col_46: array (nullable = true) ||-- element: string (containsNull = true) |-- col_47: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_48: string (nullable = true) |-- col_49: array (nullable = true) ||-- element: string (containsNull = true) |-- col_50: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_51: array (nullable = true) ||-- element: string (containsNull = true) |-- col_52: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- col_53: string (nullable = true) |-- col_54: decimal(38,18) (nullable = true) |-- col_55: decimal(38,18) (nullable = true) |-- col_56: decimal(38,18) (nullable = true) |-- col_57: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) {code} > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. 
>Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if
[jira] [Assigned] (SPARK-25372) Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-25372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25372: -- Assignee: Ilan Filonenko > Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit > -- > > Key: SPARK-25372 > URL: https://issues.apache.org/jira/browse/SPARK-25372 > Project: Spark > Issue Type: Bug > Components: Kubernetes, YARN >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Assignee: Ilan Filonenko >Priority: Major > Fix For: 2.5.0 > > > {{SparkSubmit}} already logs in the user if a keytab is provided, the only > issue is that it uses the existing configs which have "yarn" in their name. > As such, we should use a common name for the principal and keytab configs, > and deprecate the YARN-specific ones. > cc [~vanzin] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25454. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.4.0 2.3.3 > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consist in a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
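Decimals with negative scales, which trigger the precision loss above, are easy to construct with {{java.math.BigDecimal}} (the class Spark's {{Decimal}} wraps). A small illustration, independent of Spark's result-type rules:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class NegativeScaleDemo {
    public static void main(String[] args) {
        // Unscaled value 1 with scale -8 represents 1 x 10^8, i.e. 100000000 --
        // the kind of value lit(BigDecimal(100e6)) can produce.
        BigDecimal hundredMillion = BigDecimal.valueOf(1, -8);
        System.out.println(hundredMillion.scale());         // -8
        System.out.println(hundredMillion.toPlainString()); // 100000000

        // Java's divide handles the negative scale fine; the Spark bug was in
        // how the *result type* (precision/scale) was derived for division.
        BigDecimal q = BigDecimal.ONE.divide(hundredMillion, new MathContext(38));
        System.out.println(q.toPlainString()); // 0.00000001
    }
}
```

The fix adjusts Spark's division result-type logic, which was inherited from Hive and SQL Server where negative scales cannot occur in the first place.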
[jira] [Assigned] (SPARK-25540) Make HiveContext in PySpark behave as the same as Scala.
[ https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25540: --- Assignee: Takuya Ueshin > Make HiveContext in PySpark behave as the same as Scala. > > > Key: SPARK-25540 > URL: https://issues.apache.org/jira/browse/SPARK-25540 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > In Scala, {{HiveContext}} sets a config {{spark.sql.catalogImplementation}} > of the given {{SparkContext}} and then passes to {{SparkSession.builder}}. > The {{HiveContext}} in PySpark should behave as the same as it in Scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25540) Make HiveContext in PySpark behave as the same as Scala.
[ https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25540. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22552 [https://github.com/apache/spark/pull/22552] > Make HiveContext in PySpark behave as the same as Scala. > > > Key: SPARK-25540 > URL: https://issues.apache.org/jira/browse/SPARK-25540 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > In Scala, {{HiveContext}} sets a config {{spark.sql.catalogImplementation}} > of the given {{SparkContext}} and then passes to {{SparkSession.builder}}. > The {{HiveContext}} in PySpark should behave as the same as it in Scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
[ https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629655#comment-16629655 ] Apache Spark commented on SPARK-25548: -- User 'eatoncys' has created a pull request for this issue: https://github.com/apache/spark/pull/22561 > In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field > with true in the And(partitionOps, nonPartitionOps) to make the partition can > be pruned > - > > Key: SPARK-25548 > URL: https://issues.apache.org/jira/browse/SPARK-25548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: eaton >Priority: Critical > > In the PruneFileSourcePartitions optimizer, the partition files will not be > pruned if we use partition filter and non partition filter together, for > example: > sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned > by(p_d int) stored as parquet ") > sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as > value") > The sql below will scan all the partition files, in which, the partition > **p_d=4** should be pruned. > **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and > key=3)").show** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
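The proposed rewrite is a standard predicate-weakening argument: replacing every non-partition conjunct with {{true}} can only enlarge the set of matching partitions, so pruning with the weakened predicate is safe. A plain-Java sketch with hypothetical names (not Spark's {{Expression}} API), using the example from the description:

```java
public class PartitionPruneSketch {
    // Original filter from the report: (p_d=2 AND key=2) OR (p_d=3 AND key=3)
    static boolean original(int pd, int key) {
        return (pd == 2 && key == 2) || (pd == 3 && key == 3);
    }

    // After replacing the non-partition conjuncts (key=...) with true:
    // (p_d=2 AND true) OR (p_d=3 AND true)  ==  p_d=2 OR p_d=3
    static boolean partitionOnly(int pd) {
        return pd == 2 || pd == 3;
    }

    public static void main(String[] args) {
        // Soundness: whenever the original filter matches a row, the weakened
        // predicate matches that row's partition, so no needed partition is pruned.
        boolean sound = true;
        for (int pd = 2; pd <= 4; pd++) {
            for (int key = 2; key <= 4; key++) {
                if (original(pd, key) && !partitionOnly(pd)) {
                    sound = false;
                }
            }
        }
        System.out.println(sound);            // true
        System.out.println(partitionOnly(4)); // false: partition p_d=4 can be pruned
    }
}
```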
[jira] [Assigned] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
[ https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25548: Assignee: Apache Spark > In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field > with true in the And(partitionOps, nonPartitionOps) to make the partition can > be pruned > - > > Key: SPARK-25548 > URL: https://issues.apache.org/jira/browse/SPARK-25548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: eaton >Assignee: Apache Spark >Priority: Critical > > In the PruneFileSourcePartitions optimizer, the partition files will not be > pruned if we use partition filter and non partition filter together, for > example: > sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned > by(p_d int) stored as parquet ") > sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as > value") > The sql below will scan all the partition files, in which, the partition > **p_d=4** should be pruned. > **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and > key=3)").show** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
[ https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25548: Assignee: (was: Apache Spark) > In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field > with true in the And(partitionOps, nonPartitionOps) to make the partition can > be pruned > - > > Key: SPARK-25548 > URL: https://issues.apache.org/jira/browse/SPARK-25548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: eaton >Priority: Critical > > In the PruneFileSourcePartitions optimizer, the partition files will not be > pruned if we use partition filter and non partition filter together, for > example: > sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned > by(p_d int) stored as parquet ") > sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as > value") > sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as > value") > The sql below will scan all the partition files, in which, the partition > **p_d=4** should be pruned. > **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and > key=3)").show** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25531) new write APIs for data source v2
[ https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629550#comment-16629550 ] Wenchen Fan commented on SPARK-25531: - I want to have a more structured view of the data source v2 project. It's a bad idea to put everything under SPARK-22386, which is so general that it only says it's an improvement. I'm starting to create tickets for the big steps of the data source v2 project, like this one, the API refactoring, and potentially the catalog work, the custom metrics, etc. in the future. For this particular case, the final goal is to design a new write API, for both data sources and end-users, to get rid of SaveMode. "Standardize SQL logical plans" is how to achieve this goal IMO. Note that all of them will be marked as "blocks SPARK-25186 Stabilize Data Source V2 API". > new write APIs for data source v2 > - > > Key: SPARK-25531 > URL: https://issues.apache.org/jira/browse/SPARK-25531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Wenchen Fan >Priority: Major > > The current data source write API heavily depends on {{SaveMode}}, which > doesn't have a clear semantic, especially when writing to tables. > We should design a new set of write APIs without {{SaveMode}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629572#comment-16629572 ] Bryan Cutler commented on SPARK-25351: -- Hi [~pgadige], yes please go ahead with this issue! When creating a DataFrame from Pandas without Arrow, category columns are converted into the type of the category. So in the example above, column "A" becomes a string type. The same should be done when Arrow is enabled, so we end up with the same Spark DataFrame. If you are able to, we also need to see how this affects pandas_udfs too. Thanks! > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Priority: Major > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. 
For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
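The conversion Spark already performs on the non-Arrow path can be sketched in plain Python. A pandas categorical (an Arrow dictionary type) stores the distinct category values plus one integer code per row; casting the column to the category's value type amounts to taking `dictionary[code]` for every row. The helper below is illustrative only; a real fix would do this with pandas/pyarrow before handing the column to Spark:

```python
# Illustrative sketch: expand a dictionary-encoded (categorical) column
# back to a plain column of its value type, which is what the non-Arrow
# createDataFrame path effectively does. `densify` is a made-up helper,
# not a pandas or pyarrow API.

def densify(dictionary, codes, null_code=-1):
    """Expand dictionary-encoded values into a plain dense column."""
    return [None if c == null_code else dictionary[c] for c in codes]

# Column "A" from the report, ["a", "b", "c", "a"], as a category:
dictionary = ["a", "b", "c"]
codes = [0, 1, 2, 0]
print(densify(dictionary, codes))  # ['a', 'b', 'c', 'a']
```

Doing the equivalent densification when Arrow is enabled would make both paths produce the same string-typed Spark column.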
[jira] [Reopened] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-25454: - Assignee: (was: Wenchen Fan) I'm reopening it, since the bug is not fully fixed. But we do have a workaround now: setting {{spark.sql.legacy.literal.pickMinimumPrecision}} to false. > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consists of a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer, where decimals > with negative scales are not allowed. We might also consider enforcing this > in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
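The precision loss can be made concrete with the result-type rule itself. Spark's division rule (roughly as in its DecimalPrecision logic, inherited from Hive/SQL Server; the sketch ignores Spark's cap at 38 digits and any precision-loss adjustment) computes the result precision as `p1 - s1 + s2 + scale`. When the divisor's scale `s2` is negative, that term shrinks the result precision, so the result type can be too narrow for the quotient. The `Decimal(1, -8)` operand below is a hypothetical encoding of a literal like `1E+8`:

```python
# Sketch of the decimal division result-type rule (Hive/SQL Server style,
# as used by Spark's DecimalPrecision; simplified: no 38-digit cap).
def divide_result_type(p1, s1, p2, s2):
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale

# A normal divisor: Decimal(10, 0) / Decimal(10, 0)
print(divide_result_type(10, 0, 10, 0))  # (21, 11): plenty of room

# A divisor with negative scale, e.g. 1E+8 stored as Decimal(1, -8):
# only precision - scale = 2 integral digits remain, yet
# 9999999999 / 1E+8 is roughly 100, which needs three.
print(divide_result_type(10, 0, 1, -8))  # (8, 6)
```

With negative scales disallowed (as in Hive and SQL Server), `s2 >= 0` and this shrinkage cannot occur, which is why the description suggests eventually enforcing that in 3.0.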
[jira] [Updated] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25454: Fix Version/s: (was: 2.3.3) (was: 2.4.0) > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consists of a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer, where decimals > with negative scales are not allowed. We might also consider enforcing this > in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
eaton created SPARK-25548: - Summary: In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned Key: SPARK-25548 URL: https://issues.apache.org/jira/browse/SPARK-25548 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.2 Reporter: eaton In the PruneFileSourcePartitions optimizer, the partition files will not be pruned if we use partition filter and non partition filter together, for example: sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned by(p_d int) stored as parquet ") sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as value") sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as value") sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as value") The sql below will scan all the partition files, in which, the partition **p_d=4** should be pruned. **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and key=3)").show** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554 ] t oo commented on SPARK-16859: -- bump > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like job history storage tab in history server is broken for > completed jobs since *1.6.2*. > More specifically it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making > sure it works from _ReplayListenerBus_. > The downside will be that it will still work incorrectly with pre patch job > histories. But then, it doesn't work since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers I'll > try to patch myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
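The suggested fix, serializing `SparkListenerBlockUpdated` in `JsonProtocol` so `ReplayListenerBus` can replay it, boils down to a symmetric to-JSON / from-JSON pair for the event. A hypothetical sketch follows; the field names and helper functions are invented for illustration and are not Spark's actual event-log schema:

```python
import json

# Hypothetical round-trip for a block-update event, mimicking what a
# JsonProtocol implementation would need: writing the event to the event
# log and reconstructing it on replay. Field names are illustrative.

def block_updated_to_json(block_id, storage_level, mem_size, disk_size):
    return json.dumps({
        "Event": "SparkListenerBlockUpdated",
        "Block ID": block_id,
        "Storage Level": storage_level,
        "Memory Size": mem_size,
        "Disk Size": disk_size,
    })

def block_updated_from_json(payload):
    d = json.loads(payload)
    assert d["Event"] == "SparkListenerBlockUpdated"
    return (d["Block ID"], d["Storage Level"],
            d["Memory Size"], d["Disk Size"])

event = ("rdd_0_1", "MEMORY_AND_DISK", 1024, 0)
assert block_updated_from_json(block_updated_to_json(*event)) == event
```

As the report notes, replayed histories written before such a change would still lack these events, so old logs would remain without storage information either way.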
[jira] [Updated] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' operator
[ https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-25541: --- Summary: CaseInsensitiveMap should be serializable after '-' operator (was: CaseInsensitiveMap should be serializable after '-' or 'filterKeys') > CaseInsensitiveMap should be serializable after '-' operator > > > Key: SPARK-25541 > URL: https://issues.apache.org/jira/browse/SPARK-25541 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
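The bug class behind this ticket can be sketched in Python. In Scala, the `-` operator inherited from `Map` can return a generic wrapper over the original `CaseInsensitiveMap`, which may not be serializable (and may lose case-insensitive lookup); the fix is to override the operator so removal returns a fresh `CaseInsensitiveMap`. The Python analogue below is illustrative only, with `without` standing in for Scala's `-`:

```python
import pickle

# Sketch of the fix: make key removal return the same map class (eagerly
# materialized), so the result stays both case-insensitive and
# serializable. Class and method names are illustrative.

class CaseInsensitiveMap(dict):
    def __init__(self, items=()):
        super().__init__({k.lower(): v for k, v in dict(items).items()})

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def without(self, key):
        # Return the same class, not a lazy view capturing self.
        return CaseInsensitiveMap(
            {k: v for k, v in self.items() if k != key.lower()})

m = CaseInsensitiveMap({"Path": "/tmp/x", "Header": "true"})
m2 = m.without("PATH")
assert isinstance(m2, CaseInsensitiveMap)
assert m2["HEADER"] == "true"
# Round-trips through serialization, unlike a lazy wrapper might:
assert pickle.loads(pickle.dumps(m2))["header"] == "true"
```

The same reasoning applies to `filterKeys` (mentioned in the original summary), which in Scala returns a lazy, non-serializable view.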
[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics
[ https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629702#comment-16629702 ] Liang-Chi Hsieh commented on SPARK-25549: - cc [~cloud_fan] > High level API to collect RDD statistics > > > Key: SPARK-25549 > URL: https://issues.apache.org/jira/browse/SPARK-25549 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have low level API SparkContext.submitMapStage used for collecting > statistics of RDD. However it is too low level and is not so easy to use. We > need a high level API for that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25481. --- Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22490 [https://github.com/apache/spark/pull/22490] > Refactor ColumnarBatchBenchmark to use main method > -- > > Key: SPARK-25481 > URL: https://issues.apache.org/jira/browse/SPARK-25481 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25536: - Assignee: shahid > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: ZhuoerXu >Assignee: shahid >Priority: Major > Fix For: 2.3.3, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629740#comment-16629740 ] Dongjoon Hyun commented on SPARK-25536: --- Issue resolved by pull request 22555 [https://github.com/apache/spark/pull/22555] > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: ZhuoerXu >Assignee: shahid >Priority: Major > Fix For: 2.3.3, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25536. --- Resolution: Fixed > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: ZhuoerXu >Assignee: shahid >Priority: Major > Fix For: 2.3.3, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25536: -- Fix Version/s: 2.4.0 2.3.3 > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: ZhuoerXu >Priority: Major > Fix For: 2.3.3, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25540) Make HiveContext in PySpark behave as the same as Scala.
[ https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25540: - Fix Version/s: (was: 2.4.0) 2.5.0 > Make HiveContext in PySpark behave as the same as Scala. > > > Key: SPARK-25540 > URL: https://issues.apache.org/jira/browse/SPARK-25540 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.5.0 > > > In Scala, {{HiveContext}} sets a config {{spark.sql.catalogImplementation}} > of the given {{SparkContext}} and then passes to {{SparkSession.builder}}. > The {{HiveContext}} in PySpark should behave as the same as it in Scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629749#comment-16629749 ] Hyukjin Kwon commented on SPARK-18112: -- Hive 3 support. See https://github.com/apache/spark/pull/21404 > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This 
message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25468) Highlight current page index in the history server
[ https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25468. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22516 [https://github.com/apache/spark/pull/22516] > Highlight current page index in the history server > -- > > Key: SPARK-25468 > URL: https://issues.apache.org/jira/browse/SPARK-25468 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Dhruve Ashar >Assignee: Adam Wang >Priority: Trivial > Fix For: 2.4.0 > > Attachments: SparkHistoryServer.png > > > Spark History Server Web UI should highlight the current page index selected > for better navigation. Without it being highlighted it is difficult to > identify the current page you are looking at. > > For example: Page 1 should be highlighted as show in SparkHistoryServer.png -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' or 'filterKeys'
[ https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629690#comment-16629690 ] Apache Spark commented on SPARK-25541: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22562 > CaseInsensitiveMap should be serializable after '-' or 'filterKeys' > --- > > Key: SPARK-25541 > URL: https://issues.apache.org/jira/browse/SPARK-25541 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery
[ https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629698#comment-16629698 ] Apache Spark commented on SPARK-24341: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22563 > Codegen compile error from predicate subquery > - > > Key: SPARK-24341 > URL: https://issues.apache.org/jira/browse/SPARK-24341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Juliusz Sompolski >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > Ran on master: > {code} > drop table if exists juleka; > drop table if exists julekb; > create table juleka (a integer, b integer); > create table julekb (na integer, nb integer); > insert into juleka values (1,1); > insert into julekb values (1,1); > select * from juleka where (a, b) not in (select (na, nb) from julekb); > {code} > Results in: > {code} > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: Cannot compare types "int" and > "org.apache.spark.sql.catalyst.InternalRow" > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) > at 
com.google.common.cache.LocalCache.get(LocalCache.java:3932) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) > at > org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:111) > at
[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics
[ https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629700#comment-16629700 ] Liang-Chi Hsieh commented on SPARK-25549: - The design doc is at: https://docs.google.com/document/d/177JYpF8N31Wpg86lmMI2yA5KGfpevDNkvpY7dnwRyDo/edit?usp=sharing > High level API to collect RDD statistics > > > Key: SPARK-25549 > URL: https://issues.apache.org/jira/browse/SPARK-25549 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have low level API SparkContext.submitMapStage used for collecting > statistics of RDD. However it is too low level and is not so easy to use. We > need a high level API for that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25549) High level API to collect RDD statistics
Liang-Chi Hsieh created SPARK-25549: --- Summary: High level API to collect RDD statistics Key: SPARK-25549 URL: https://issues.apache.org/jira/browse/SPARK-25549 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.5.0 Reporter: Liang-Chi Hsieh We have the low-level API SparkContext.submitMapStage for collecting RDD statistics. However, it is too low-level and not easy to use, so we need a high-level API for this.
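As an illustration only (a hypothetical sketch, not the API from the linked design doc; `partition_stats`, `merge_stats`, and `collect_stats` are made-up names), a high-level statistics collector could compute a per-partition summary and merge the results in one call, hiding the map-stage machinery. Plain Python lists stand in for RDD partitions here:

```python
from functools import reduce

def partition_stats(partition):
    """Summarize one partition as (count, min, max)."""
    return (len(partition), min(partition), max(partition))

def merge_stats(a, b):
    """Combine two partition summaries into one."""
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]))

def collect_stats(partitions):
    """High-level entry point: one call, no manual stage handling."""
    return reduce(merge_stats, (partition_stats(p) for p in partitions))
```

For example, `collect_stats([[3, 1], [4, 1, 5]])` returns `(5, 1, 5)`. In a real implementation the per-partition step would run as a map stage and only the small summaries would travel to the driver.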
[jira] [Assigned] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25481: - Assignee: yucai > Refactor ColumnarBatchBenchmark to use main method > -- > > Key: SPARK-25481 > URL: https://issues.apache.org/jira/browse/SPARK-25481 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 2.5.0 >
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629742#comment-16629742 ] Hyukjin Kwon commented on SPARK-18112: -- Hive 3 support is blocked by the Hadoop 3 profile. See https://github.com/apache/spark/pull/21588 and please provide some input at https://issues.apache.org/jira/browse/SPARK-20202 > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > been out for a long time as well, but Spark still only supports reading Hive > metastore data from Hive 1.2.1 and older. Since Hive 2.x fixes many bugs and > improves performance, upgrading to support Hive 2.x is both worthwhile and > urgent. > Failure when loading data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629743#comment-16629743 ] Hyukjin Kwon commented on SPARK-18112: -- Re: https://issues.apache.org/jira/browse/SPARK-18112?focusedCommentId=16629000=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16629000 Did you set {{spark.sql.hive.metastore.version}}?
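For context on the setting Hyukjin asks about: Spark can talk to a different Hive metastore version than the built-in one via configuration. The values below are illustrative only (pick the version and jar source that actually match your metastore deployment):

```
# spark-defaults.conf — illustrative values, adjust for your environment
spark.sql.hive.metastore.version   2.1.1
spark.sql.hive.metastore.jars      maven
```

With `maven`, Spark downloads the matching Hive client jars; alternatively `spark.sql.hive.metastore.jars` can point at a local classpath of Hive jars of that version.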
[jira] [Assigned] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.
[ https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25525: Assignee: Takuya Ueshin > Do not update conf for existing SparkContext in SparkSession.getOrCreate. > - > > Key: SPARK-25525 > URL: https://issues.apache.org/jira/browse/SPARK-25525 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.5.0 > > > In SPARK-20946, we modified {{SparkSession.getOrCreate}} to not update the conf > of an existing {{SparkContext}}, because the {{SparkContext}} is shared by all > sessions. > We should not update it on the PySpark side either.
[jira] [Resolved] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.
[ https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25525. -- Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22545 [https://github.com/apache/spark/pull/22545]
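To illustrate the intended semantics (a standalone sketch, not PySpark source; `FakeContext` and `get_or_create` are made-up names), getOrCreate should reuse a shared, already-created context without overwriting its configuration:

```python
# Standalone sketch of SPARK-25525's intended behavior: a shared,
# already-created context keeps its original conf even when a later
# caller passes different settings.
_active_context = None

class FakeContext:
    def __init__(self, conf):
        self.conf = dict(conf)  # snapshot the conf at creation time

def get_or_create(conf):
    global _active_context
    if _active_context is None:
        _active_context = FakeContext(conf)
    # Deliberately do NOT merge `conf` into the existing context:
    # the context is shared by all sessions (cf. SPARK-20946).
    return _active_context
```

A second caller passing `{"a": "2"}` after the context was created with `{"a": "1"}` gets the same object back with the original conf intact, which is the behavior the fix enforces on the PySpark side.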
[jira] [Comment Edited] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629742#comment-16629742 ] Hyukjin Kwon edited comment on SPARK-18112 at 9/27/18 4:42 AM: --- Hadoop 3 profile. See https://github.com/apache/spark/pull/21588 and please provide some input at https://issues.apache.org/jira/browse/SPARK-20202 was (Author: hyukjin.kwon): Hive 3 support is blocked by Hadoop 3 profile. See https://github.com/apache/spark/pull/21588 and please provide some input at https://issues.apache.org/jira/browse/SPARK-20202
[jira] [Assigned] (SPARK-25468) Highlight current page index in the history server
[ https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25468: - Assignee: Adam Wang > Highlight current page index in the history server > -- > > Key: SPARK-25468 > URL: https://issues.apache.org/jira/browse/SPARK-25468 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Dhruve Ashar >Assignee: Adam Wang >Priority: Trivial > Fix For: 2.4.0 > > Attachments: SparkHistoryServer.png > > > The Spark History Server Web UI should highlight the currently selected page > index for better navigation. Without highlighting, it is difficult to > identify which page you are looking at. > > For example: page 1 should be highlighted, as shown in SparkHistoryServer.png
[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25536: Assignee: (was: Apache Spark) > executorSource.METRIC read wrong record in Executor.scala Line444 > - > > Key: SPARK-25536 > URL: https://issues.apache.org/jira/browse/SPARK-25536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: ZhuoerXu >Priority: Major >
[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25536: Assignee: Apache Spark
[jira] [Commented] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628339#comment-16628339 ] Apache Spark commented on SPARK-25536: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/22555
[jira] [Comment Edited] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444
[ https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628267#comment-16628267 ] shahid edited comment on SPARK-25536 at 9/26/18 7:18 AM: - Thanks. I will raise a PR was (Author: shahid): I will raise a pr
[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25538: Priority: Major (was: Blocker) > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code}
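As a reference model for the expected behavior (plain Python, not Spark, and only to state the invariant the report relies on): a distinct count must not change when rows are reordered or a column is renamed, which is exactly what the 2.4.0-rc1 session above violates.

```python
# Reference model: distinct-count is order- and name-insensitive.
rows = [(0, "a"), (1, "b"), (0, "a")]  # one duplicate row

def distinct_count(rs):
    return len(set(rs))

assert distinct_count(rows) == 2
# Reordering rows must not change the result...
assert distinct_count(sorted(rows)) == 2
# ...and neither should relabeling columns (dict keys stand in for
# column names here, as in the report's withColumnRenamed example).
renamed = [{"newName": a, "col_1": b} for a, b in rows]
assert distinct_count(tuple(sorted(r.items())) for r in renamed) == 2
```

Any engine result that depends on sort order or column names, as in the session above, is therefore a correctness bug rather than a legitimate answer.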
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628369#comment-16628369 ] Marco Gaido commented on SPARK-25538: - Please do not use Blocker and Critical when reporting issues, as they are reserved for committers. That said, I agree this should be a blocker for 2.4.0, as it is a correctness issue. cc [~cloud_fan]
[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25538: Labels: correctness (was: )