[jira] [Commented] (SPARK-25452) Query with where clause is giving unexpected result in case of float column

2018-09-26 Thread Ayush Anubhava (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628525#comment-16628525
 ] 

Ayush Anubhava commented on SPARK-25452:


Hi Hyukjin Kwon

This issue does not seem to be a duplicate.

I saw the changes; I am able to reproduce the same behaviour in spark-sql.

It seems the fix is only for beeline.
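
For reference, here is a minimal spark-shell sketch (my own illustration, not taken 
from the report) of what I believe is going on: the literal 1.1 is not a FLOAT, so 
the stored FLOAT value 1.1f is widened to roughly 1.1000000238 before the comparison 
and therefore fails the <= 1.1 filter; casting the literal to FLOAT compares like 
with like.

{code:java}
// Hypothetical repro, assuming a spark-shell session (spark.implicits._ in scope).
val df = Seq((0, 0.0f), (1, 1.1f)).toDF("a", "b")
df.where("b <= 1.1").show()                  // only the (0, 0.0) row survives
df.where("b <= cast(1.1 as float)").show()   // casting the literal keeps both rows
{code}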

> Query with where clause is giving unexpected result in case of float column
> ---
>
> Key: SPARK-25452
> URL: https://issues.apache.org/jira/browse/SPARK-25452
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: *Spark 2.3.1*
> *Hadoop 2.7.2*
>Reporter: Ayush Anubhava
>Priority: Major
> Attachments: image-2018-09-26-14-14-47-504.png
>
>
> *Description*: Query with where clause is giving unexpected result in case of 
> a float column.
> 
> {color:#d04437}*Query with filter less than or equal to is giving inappropriate 
> result*{color}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0;
> +----+----------------+--+
> | a  | b              |
> +----+----------------+--+
> | 0  | 0.0            |
> | 1  | 1.10023841858  |
> +----+----------------+--+
> Query with filter less than or equal to is giving inappropriate result
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1;
> +----+------+--+
> | a  | b    |
> +----+------+--+
> | 0  | 0.0  |
> +----+------+--+
> 1 row selected (0.299 seconds)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-25544:
-

 Summary: Slow/failed convergence in Spark ML models due to 
internal predictor scaling
 Key: SPARK-25544
 URL: https://issues.apache.org/jira/browse/SPARK-25544
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.2
 Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
Reporter: Andrew Crosby


The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means 
that predictors with small standard deviations will have very large effective 
regularization strengths, and consequently very large gradients and thus 
poor convergence in the solver.
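
To put a rough number on this for the example below (my own back-of-the-envelope 
figures, not from the report): the interaction column for category "2" takes the 
values (0, 0, 0.01), whose sample standard deviation is about 0.0058, so with 
regParam = 1.0 its effective regularization strength is on the order of 30,000.

{code:java}
// Back-of-the-envelope sketch (not from the report): effective regularization
// strength of a feature with standard deviation sigma when standardization is off.
val values = Seq(0.0, 0.0, 0.01)                      // category-2 interaction column below
val mean = values.sum / values.size
val sigma = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / (values.size - 1))
val effectiveRegStrength = 1.0 / (sigma * sigma)      // roughly 3e4 times regParam
{code}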

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
// Assumes a spark-shell session (spark.implicits._ in scope for toDF).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}

val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")
val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false, rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-26 Thread ABHISHEK KUMAR GUPTA (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-25392:
-
Description: 

Steps:
1.Enable spark.scheduler.mode = FAIR
2.Submitted beeline jobs
create database JH;
use JH;
create table one12( id int );
insert into one12 values(12);
insert into one12 values(13);
Select * from one12;
3. Click on the JDBC Incomplete Application ID in the Job History Page
4. Go to the Job tab in the staged Web UI page
5. Click on run at AccessController.java:0 under the Description column
6. Click default under the Pool Name column of the Completed Stages table
URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
7. It throws the below error
HTTP ERROR 400

Problem accessing /history/application_1536399199015_0006/stages/pool/. Reason:

Unknown pool: default

Powered by Jetty:// x.y.z

But under the 
Yarn resource page it displays the summary under Fair Scheduler Pool: default 
URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default

Summary
Pool Name   Minimum Share   Pool Weight Active Stages   Running Tasks   
SchedulingMode
default 0   1   0   0   FIFO





  was:


Steps:
1.Enable spark.scheduler.mode = FAIR
2.Submitted beeline jobs
create database JH;
use JH;
create table one12( id int );
insert into one12 values(12);
insert into one12 values(13);
Select * from one12;
3. Click on the JDBC Incomplete Application ID in the Job History Page
4. Go to the Job tab in the staged Web UI page
5. Click on run at AccessController.java:0 under the Description column
6. Click default under the Pool Name column of the Completed Stages table
URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
7. It throws the below error
HTTP ERROR 400

Problem accessing /history/application_1536399199015_0006/stages/pool/. Reason:

Unknown pool: default

Powered by Jetty:// x.y.z

But under the 
Yarn resource page it displays the summary under Fair Scheduler Pool: default 
URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default

Summary
Pool Name   Minimum Share   Pool Weight Active Stages   Running Tasks   
SchedulingMode
default 0   1   0   0   FIFO





 Issue Type: Improvement  (was: Bug)

OK Sandeep, make sure you handle this as an Improvement.

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on the JDBC Incomplete Application ID in the Job History Page
> 4. Go to the Job tab in the staged Web UI page
> 5. Click on run at AccessController.java:0 under the Description column
> 6. Click default under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws the below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under the 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554
 ] 

t oo edited comment on SPARK-16859 at 9/26/18 10:43 AM:


bump @shahid


was (Author: toopt4):
bump

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25379) Improve ColumnPruning performance

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25379:
---

Assignee: Marco Gaido

> Improve ColumnPruning performance
> -
>
> Key: SPARK-25379
> URL: https://issues.apache.org/jira/browse/SPARK-25379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.5.0
>
>
> The {{--}} operation on {{AttributeSet}} is quite expensive, especially when 
> many columns are involved. {{ColumnPruning}} heavily relies on that operator 
> and this affects its running time. There are 2 possible optimizations:
>  - Improve {{--}} performance;
>  - Replace {{--}} with {{subsetOf}} when possible.
> Moreover, when building {{AttributeSet}}s we often do unneeded operations. 
> This also impacts other rules, though less significantly.
> I'll provide more details about the achievable performance improvement in the 
> PR.
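
As an aside, here is a tiny plain-Scala sketch (using ordinary Sets rather than the 
actual {{AttributeSet}}, and not taken from the PR) of the second optimization, i.e. 
replacing a set difference with a containment check:

{code:java}
// Illustrative only: names and shapes are assumptions, not the Spark patch itself.
val references = Set("a", "b")
val outputSet  = Set("a", "b", "c")

// Expensive form: materialises the whole difference just to test emptiness.
val prunable1 = (references -- outputSet).isEmpty

// Cheaper form: a pure containment check with no intermediate collection.
val prunable2 = references.subsetOf(outputSet)
{code}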



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25379) Improve ColumnPruning performance

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25379.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22364
[https://github.com/apache/spark/pull/22364]

> Improve ColumnPruning performance
> -
>
> Key: SPARK-25379
> URL: https://issues.apache.org/jira/browse/SPARK-25379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.5.0
>
>
> The {{--}} operation on {{AttributeSet}} is quite expensive, especially when 
> many columns are involved. {{ColumnPruning}} heavily relies on that operator 
> and this affects its running time. There are 2 possible optimizations:
>  - Improve {{--}} performance;
>  - Replace {{--}} with {{subsetOf}} when possible.
> Moreover, when building {{AttributeSet}}s we often do unneeded operations. 
> This also impacts other rules, though less significantly.
> I'll provide more details about the achievable performance improvement in the 
> PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554
 ] 

t oo edited comment on SPARK-16859 at 9/26/18 10:46 AM:


bump


was (Author: toopt4):
bump [~ashahid]

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628561#comment-16628561
 ] 

t oo commented on SPARK-25502:
--

related https://jira.apache.org/jira/browse/SPARK-16859 ?

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
>
> *Steps:*
> 1. Spark installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History Page and go to the spark-shell application created above 
> under Incomplete Task
> 6. Right click and go to the Job page of the application and from there click 
> and launch the Stage Page
> 7. Launch the Stage Id page for the specific Stage Id of the above created 
> job
> 8. Scroll down and check the task completion Summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> Two pagination panels are displayed
> *Expected Result:*
> The pagination panel should not display 5000 pages, as the retainedTask value is 
> 10; it should display only 1000 pages because each page holds 100 
> tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554
 ] 

t oo edited comment on SPARK-16859 at 9/26/18 10:45 AM:


bump [~ashahid]


was (Author: toopt4):
bump @shahid

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23401) Improve test cases for all supported types and unsupported types

2018-09-26 Thread Aleksandr Koriagin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628727#comment-16628727
 ] 

Aleksandr Koriagin commented on SPARK-23401:


I will take a look

> Improve test cases for all supported types and unsupported types
> 
>
> Key: SPARK-23401
> URL: https://issues.apache.org/jira/browse/SPARK-23401
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Looks like there are some missing types to test among the supported types. 
> For example, please see 
> https://github.com/apache/spark/blob/c338c8cf8253c037ecd4f39bbd58ed5a86581b37/python/pyspark/sql/tests.py#L4397-L4401
> We can improve this test coverage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-26 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628742#comment-16628742
 ] 

sandeep katta commented on SPARK-25392:
---

[~abhishek.akg] As per the current design, pool details are shown for the live UI; I am 
working on this PR. Can you please update this to Improvement?

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on the JDBC Incomplete Application ID in the Job History Page
> 4. Go to the Job tab in the staged Web UI page
> 5. Click on run at AccessController.java:0 under the Description column
> 6. Click default under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws the below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under the 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628744#comment-16628744
 ] 

Wenchen Fan commented on SPARK-25538:
-

cc [~kiszk] as well

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24440) When use constant as column we may get wrong answer versus impala

2018-09-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628608#comment-16628608
 ] 

Marco Gaido commented on SPARK-24440:
-

Can you provide a sample repro which can be run in order to debug the issue?

> When use constant as column we may get wrong answer versus impala
> -
>
> Key: SPARK-24440
> URL: https://issues.apache.org/jira/browse/SPARK-24440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> For the query below:
> {code:java}
> select `date`, 100 as platform, count(distinct deviceid) as new_user from 
> tv.clean_new_user where `date`=20180528 group by `date`, platform
> {code}
> We intended to group by the constant 100 and get the distinct deviceid count.
> With Spark SQL, we get:
> {code}
> +---+---+---+--+
> |   date| platform  | new_user  |
> +---+---+---+--+
> | 20180528  | 100   | 521   |
> | 20180528  | 100   | 82|
> | 20180528  | 100   | 3 |
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 7 |
> | 20180528  | 100   | 870   |
> | 20180528  | 100   | 3 |
> | 20180528  | 100   | 8 |
> | 20180528  | 100   | 3 |
> | 20180528  | 100   | 2204  |
> | 20180528  | 100   | 1123  |
> | 20180528  | 100   | 1 |
> | 20180528  | 100   | 54|
> | 20180528  | 100   | 440   |
> | 20180528  | 100   | 4 |
> | 20180528  | 100   | 478   |
> | 20180528  | 100   | 34|
> | 20180528  | 100   | 195   |
> | 20180528  | 100   | 17|
> | 20180528  | 100   | 18|
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 84|
> | 20180528  | 100   | 1616  |
> | 20180528  | 100   | 15|
> | 20180528  | 100   | 7 |
> | 20180528  | 100   | 479   |
> | 20180528  | 100   | 50|
> | 20180528  | 100   | 376   |
> | 20180528  | 100   | 21|
> | 20180528  | 100   | 842   |
> | 20180528  | 100   | 444   |
> | 20180528  | 100   | 538   |
> | 20180528  | 100   | 1 |
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 7 |
> | 20180528  | 100   | 17|
> | 20180528  | 100   | 133   |
> | 20180528  | 100   | 7 |
> | 20180528  | 100   | 415   |
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 318   |
> | 20180528  | 100   | 5 |
> | 20180528  | 100   | 1 |
> | 20180528  | 100   | 2060  |
> | 20180528  | 100   | 1217  |
> | 20180528  | 100   | 2 |
> | 20180528  | 100   | 60|
> | 20180528  | 100   | 22|
> | 20180528  | 100   | 4 |
> +---+---+---+--+
> {code}
> Actually, the sum of these new_user counts is below:
> {code}
> 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as 
> platform, count(distinct deviceid) as new_user from tv.clean_new_user where 
> `date`=20180528 group by `date`, platform)t1; 
> ++--+
> | sum(new_user)  |
> ++--+
> | 14816  |
> ++--+
> 1 row selected (4.934 seconds)
> {code}
> And the real distinct deviceid count is below:
> {code}
> 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as 
> new_user from tv.clean_new_user where `date`=20180528;
> +---+---+--+
> | platform  | new_user  |
> +---+---+--+
> | 100   | 14773 |
> +---+---+--+
> 1 row selected (2.846 seconds)
> {code}
> In Impala, with the first query we get the result below:
> {code}
> [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user 
> from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: 
> select `date`, 100 as platform, count(distinct deviceid) as new_user from 
> tv.clean_new_user where `date`=20180528 group by `date`, platform
> +--+--+--+
> | date | platform | new_user |
> +--+--+--+
> | 20180528 | 100  | 14773|
> +--+--+--+
> Fetched 1 row(s) in 1.00s
> {code}
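
Not from the thread, but a hedged cross-check that may help while debugging: since 
platform is a constant, grouping by `date` alone should express the same intent as 
the first query.

{code:java}
// Hedged sketch; assumes the same tv.clean_new_user table is reachable via spark.sql.
spark.sql("""
  select `date`, 100 as platform, count(distinct deviceid) as new_user
  from tv.clean_new_user
  where `date` = 20180528
  group by `date`
""").show()
{code}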



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-26 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628721#comment-16628721
 ] 

Felix Cheung commented on SPARK-21291:
--

The PR did not have bucketBy?




> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.5.0
>
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25502) [Spark Job History] Empty Page when page number exceeds the retainedTask size

2018-09-26 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628588#comment-16628588
 ] 

shahid commented on SPARK-25502:


[~toopt4] No. Please refer to the PR to see the fix.

> [Spark Job History] Empty Page when page number exceeds the retainedTask size 
> --
>
> Key: SPARK-25502
> URL: https://issues.apache.org/jira/browse/SPARK-25502
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
>
> *Steps:*
> 1. Spark installed and running properly.
> 2. spark.ui.retainedTask=10 ( it is the default value )
> 3. Launch Spark shell: ./spark-shell --master yarn
> 4. Create a spark-shell application with a single job and 50 tasks
> val rdd = sc.parallelize(1 to 50, 50)
> rdd.count
> 5. Launch the Job History Page and go to the spark-shell application created above 
> under Incomplete Task
> 6. Right click and go to the Job page of the application and from there click 
> and launch the Stage Page
> 7. Launch the Stage Id page for the specific Stage Id of the above created 
> job
> 8. Scroll down and check the task completion Summary
> It displays a pagination panel showing *5000 Pages Jump to 1 Show 100 items in 
> a page* and a Go button
> 9. Replace 1 with page number 2333
> *Actual Result:*
> Two pagination panels are displayed
> *Expected Result:*
> The pagination panel should not display 5000 pages, as the retainedTask value is 
> 10; it should display only 1000 pages because each page holds 100 
> tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' or 'filterKeys'

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25541.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.5.0

> CaseInsensitiveMap should be serializable after '-' or 'filterKeys'
> ---
>
> Key: SPARK-25541
> URL: https://issues.apache.org/jira/browse/SPARK-25541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means 
that predictors with small standard deviations will have very large effective 
regularization strengths, and consequently very large gradients and thus 
poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength, and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal 

[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Eugeniu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628852#comment-16628852
 ] 

Eugeniu commented on SPARK-18112:
-

This issue should be reopened.

As already commented by [~Tavis], HIVE_STATS_JDBC_TIMEOUT is still referenced at 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L204
but it has not been present in HiveConf since branch 2.0:

https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1290
https://github.com/apache/hive/blob/branch-2.0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
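
For anyone hitting this on Spark 2.2 or later in the meantime, a hedged configuration 
sketch (my own suggestion, not something confirmed in this thread): Spark can talk to 
a newer Hive metastore by pointing its isolated metastore client at separately 
provided jars.

{code:java}
// Hedged sketch: the version string and jar path are placeholders for your environment.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.0")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-2.1.0/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}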


> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and after that Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time, but until now Spark only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means 
that predictors with small standard deviations will have very large effective 
regularization strengths, and consequently very large gradients and thus 
poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization 

[jira] [Created] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-09-26 Thread Steven Bakhtiari (JIRA)
Steven Bakhtiari created SPARK-25545:


 Summary: CSV loading with DROPMALFORMED mode doesn't correctly 
drop rows that do not conform to non-nullable schema fields
 Key: SPARK-25545
 URL: https://issues.apache.org/jira/browse/SPARK-25545
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0
Reporter: Steven Bakhtiari


I'm loading a CSV file into a dataframe using Spark. I have defined a Schema 
and specified one of the fields as non-nullable.

When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with 
missing (null) values for those columns to result in the whole row being 
dropped. At the moment, the CSV loader correctly drops rows that do not conform 
to the field type, but the nullable property is seemingly ignored.

Example CSV input:
{code:java}
1,2,3
1,,3
,2,3
1,2,abc
{code}
Example Spark job:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession
  .builder()
  .appName("csv-test")
  .master("local")
  .getOrCreate()

spark.read
  .format("csv")
  .schema(StructType(
    StructField("col1", IntegerType, nullable = false) ::
      StructField("col2", IntegerType, nullable = false) ::
      StructField("col3", IntegerType, nullable = false) :: Nil))
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .coalesce(1)
  .write
  .format("csv")
  .option("header", false)
  .save("path/to/output")
{code}
The actual output will be:
{code:java}
1,2,3
1,,3
,2,3{code}
Note that the row containing non-integer values has been dropped, as expected, 
but rows containing null values persist, despite the nullable property being 
set to false in the schema definition.

My expected output is:
{code:java}
1,2,3{code}
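
Until this is addressed, a hedged workaround sketch (my own suggestion, not from the 
reporter): since the reader currently ignores nullability in DROPMALFORMED mode, rows 
with nulls can be dropped explicitly after the load.

{code:java}
// Assumes the same SparkSession and schema as in the job above.
val cleaned = spark.read
  .format("csv")
  .schema(StructType(
    StructField("col1", IntegerType, nullable = false) ::
      StructField("col2", IntegerType, nullable = false) ::
      StructField("col3", IntegerType, nullable = false) :: Nil))
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .na.drop()   // drops any row with a null in any column
{code}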



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-09-26 Thread Steven Bakhtiari (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628949#comment-16628949
 ] 

Steven Bakhtiari commented on SPARK-25545:
--

Somebody on Stack Overflow pointed me to this older ticket, which appears to touch on 
the same issue: SPARK-10848

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not 
> conform to non-nullable schema fields
> -
>
> Key: SPARK-25545
> URL: https://issues.apache.org/jira/browse/SPARK-25545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Steven Bakhtiari
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a dataframe using Spark. I have defined a Schema 
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with 
> missing (null) values for those columns to result in the whole row being 
> dropped. At the moment, the CSV loader correctly drops rows that do not 
> conform to the field type, but the nullable property is seemingly ignored.
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
> val spark = SparkSession
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
> spark.read
>   .format("csv")
>   .schema(StructType(
> StructField("col1", IntegerType, nullable = false) ::
>   StructField("col2", IntegerType, nullable = false) ::
>   StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3{code}
> Note that the row containing non-integer values has been dropped, as 
> expected, but rows containing null values persist, despite the nullable 
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25509) SHS V2 cannot be enabled on Windows, because POSIX permissions are not supported.

2018-09-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25509.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3

Issue resolved by pull request 22520
[https://github.com/apache/spark/pull/22520]

> SHS V2 cannot be enabled on Windows, because POSIX permissions are not supported.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception: java.lang.UnsupportedOperationException: 'posix:permissions' 
> not supported as initial attribute.
> The test case fails on Windows without this fix: 
>  org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space")
>  
> PR: https://github.com/apache/spark/pull/22520
>  
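
For context, a minimal Java NIO sketch (my own illustration, not the Spark code) of 
why the attribute is rejected: on a file system without a POSIX view, passing a 
posix:permissions attribute at creation time throws UnsupportedOperationException.

{code:java}
// Illustrative only; the path is a placeholder.
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermissions

val perms = PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rwx------"))
// On NTFS the next call throws:
//   java.lang.UnsupportedOperationException: 'posix:permissions' not supported as initial attribute
Files.createDirectory(Paths.get("C:/tmp/shs-leasing-test"), perms)
{code}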



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25509) SHS V2 cannot be enabled on Windows, because POSIX permissions are not supported.

2018-09-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25509:
-

Assignee: Rong Tang

> SHS V2 cannot be enabled on Windows, because POSIX permissions are not supported.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception: java.lang.UnsupportedOperationException: 'posix:permissions' 
> not supported as initial attribute.
> The test case fails on Windows without this fix: 
>  org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space")
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628865#comment-16628865
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Can you please post reproducer steps before we reopen this?

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and after that Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time, but until now Spark only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations (which can occur 
legitimately, e.g. via one-hot encoding) will have very large effective 
regularization strengths, and consequently very large gradients and thus poor 
convergence in the solver.
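To make the scale of the problem concrete, here is a rough back-of-the-envelope 
illustration (approximate numbers, based on the toy data in the example below): 
the interaction feature for category "2" only takes the values 0, 0 and 0.01, so 
its standard deviation is about 0.006 and the effective regularization strength 
is inflated by roughly 1/sigma^2, i.e. around 30,000 times the nominal regParam.
{code:java}
// Back-of-the-envelope check of the inflation factor for the category "2"
// interaction feature in the toy data below (values 0, 0 and 0.01).
val values = Seq(0.0, 0.0, 0.01)
val mean = values.sum / values.size
val sigma = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / (values.size - 1))
val inflation = 1.0 / (sigma * sigma)  // ~3e4: regParam is effectively amplified ~30,000x
{code}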

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import spark.implicits._

val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryEncoded")
  .setDropLast(false)
val interaction = new Interaction()
  .setInputCols(Array("categoryEncoded", "numericFeature"))
  .setOutputCol("interaction")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryEncoded", "interaction"))
  .setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

// The linear regression model is the 5th stage (index 4) of the fitted pipeline.
val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd =
  if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt)
  else featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the maximum iteration limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. 

[jira] [Resolved] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2018-09-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20937.
--
   Resolution: Fixed
Fix Version/s: 2.4.1
   2.5.0

Issue resolved by pull request 22453
[https://github.com/apache/spark/pull/22453]

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Chenxiao Mao
>Priority: Trivial
> Fix For: 2.5.0, 2.4.1
>
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.
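For reference, a minimal sketch of how the option is typically used (my own 
illustration, assuming an existing SparkSession named spark; the output path is 
made up):
{code:java}
// Write Parquet in the legacy format (e.g. decimals as fixed-length byte arrays)
// so that Impala and older Hive versions can read the files.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.range(10).toDF("id").write.parquet("/tmp/legacy-format-output")
{code}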



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2018-09-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20937:


Assignee: Chenxiao Mao

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Chenxiao Mao
>Priority: Trivial
> Fix For: 2.5.0, 2.4.1
>
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-25546:
--

 Summary: RDDInfo uses SparkEnv before it may have been initialized
 Key: SPARK-25546
 URL: https://issues.apache.org/jira/browse/SPARK-25546
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


This code:

{code}
private[spark] object RDDInfo {
  private val callsiteLongForm = 
SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
{code}

Has two problems:
- it keeps that value across different SparkEnv instances. So e.g. if you have 
two tests that rely on different values for that config, one of them will break.
- it assumes tests always initialize a SparkEnv. e.g. if you run "core/testOnly 
*.AppStatusListenerSuite", it will fail because {{SparkEnv.get}} returns null.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629216#comment-16629216
 ] 

Apache Spark commented on SPARK-25546:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22558

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> Has two problems:
> - it keeps that value across different SparkEnv instances. So e.g. if you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv. e.g. if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25546:


Assignee: (was: Apache Spark)

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> Has two problems:
> - it keeps that value across different SparkEnv instances. So e.g. if you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv. e.g. if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629218#comment-16629218
 ] 

Apache Spark commented on SPARK-25546:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22558

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> Has two problems:
> - it keeps that value across different SparkEnv instances. So e.g. if you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv. e.g. if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-26 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629198#comment-16629198
 ] 

Huaxin Gao commented on SPARK-21291:


[~felixcheung] I will submit a PR for bucketBy. 

bucketBy doesn't work with save.
{code:java}
assertNotBucketed("save")
{code}
If bucketBy is set, shall I use saveAsTable instead? 
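For comparison, a small sketch of the Scala-side behaviour (my own illustration, 
assuming an existing SparkSession named spark): bucketBy is rejected by save() 
via assertNotBucketed, while saveAsTable() is the supported path, so routing 
bucketed writes in SparkR through saveAsTable seems consistent.
{code:java}
val df = spark.range(100).toDF("id")

// df.write.format("parquet").bucketBy(4, "id").save("/tmp/bucketed")   // throws AnalysisException (assertNotBucketed)
df.write.format("parquet").bucketBy(4, "id").saveAsTable("bucketed_tbl") // supported: bucketing recorded in the metastore
{code}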

 

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.5.0
>
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-09-26 Thread David Spies (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629231#comment-16629231
 ] 

David Spies commented on SPARK-18492:
-

Ran into this as well. It seems like this is happening because the "Optimized 
Logical Plan" is significantly larger than the "Parsed Logical Plan". Is there 
an "optimization" I can turn off that will keep the size down?
(Spark v. 2.1.3)


{code:java}
== Parsed Logical Plan ==
Aggregate [count(1) AS count#2296L]
+- Filter (age_imputed_fac#2247 = age_imputed_0)
   +- Project [PassengerId#2183L AS PassengerId#2226L, Survived#2184 AS 
Survived#2227, Pclass#2185 AS Pclass#2228, Sex#2186 AS Sex#2229, Age#2187 AS 
Age#2230, SibSp#2188L AS SibSp#2231L, Parch#2189L AS Parch#2232L, Ticket#2190 
AS Ticket#2233, Fare#2191 AS Fare#2234, Cabin#2192 AS Cabin#2235, Embarked#2193 
AS Embarked#2236, firstname_proc#2194 AS firstname_proc#2237, 
lastname_proc#2195 AS lastname_proc#2238, age_1_male#2196 AS age_1_male#2239, 
age_2_male#2197 AS age_2_male#2240, age_3_male#2198 AS age_3_male#2241, 
age_1_female#2199 AS age_1_female#2242, age_2_female#2200 AS age_2_female#2243, 
age_3_female#2201 AS age_3_female#2244, age_imputed#2202 AS age_imputed#2245, 
age_imputed_1#2203 AS age_imputed_1#2246, coalesce(CASE WHEN (true = 
((age_imputed#2202 >= 0.0) && (age_imputed#2202 < 16.0))) THEN age_imputed_0 
END, CASE WHEN (true = ((age_imputed#2202 >= 16.0) && (age_imputed#2202 < 
32.0))) THEN age_imputed_1 END, CASE WHEN (true = ((age_imputed#2202 >= 32.0) 
&& (age_imputed#2202 < 48.0))) THEN age_imputed_2 END, CASE WHEN (true = 
((age_imputed#2202 >= 48.0) && (age_imputed#2202 < 64.0))) THEN age_imputed_3 
END, CASE WHEN (true = ((age_imputed#2202 >= 64.0) && (age_imputed#2202 < 
81.0))) THEN age_imputed_4 END, CASE WHEN (true = isnull(age_imputed#2202)) 
THEN age_imputed_NULL END) AS age_imputed_fac#2247]
  +- Project [PassengerId#2142L AS PassengerId#2183L, Survived#2143 AS 
Survived#2184, Pclass#2144 AS Pclass#2185, Sex#2145 AS Sex#2186, Age#2146 AS 
Age#2187, SibSp#2147L AS SibSp#2188L, Parch#2148L AS Parch#2189L, Ticket#2149 
AS Ticket#2190, Fare#2150 AS Fare#2191, Cabin#2151 AS Cabin#2192, Embarked#2152 
AS Embarked#2193, firstname_proc#2153 AS firstname_proc#2194, 
lastname_proc#2154 AS lastname_proc#2195, age_1_male#2155 AS age_1_male#2196, 
age_2_male#2156 AS age_2_male#2197, age_3_male#2157 AS age_3_male#2198, 
age_1_female#2158 AS age_1_female#2199, age_2_female#2159 AS age_2_female#2200, 
age_3_female#2160 AS age_3_female#2201, age_imputed#2161 AS age_imputed#2202, 
coalesce(age_imputed#2161, 0.0) AS age_imputed_1#2203]
 +- Project [PassengerId#2103L AS PassengerId#2142L, Survived#2104 AS 
Survived#2143, Pclass#2105 AS Pclass#2144, Sex#2106 AS Sex#2145, Age#2107 AS 
Age#2146, SibSp#2108L AS SibSp#2147L, Parch#2109L AS Parch#2148L, Ticket#2110 
AS Ticket#2149, Fare#2111 AS Fare#2150, Cabin#2112 AS Cabin#2151, Embarked#2113 
AS Embarked#2152, firstname_proc#2114 AS firstname_proc#2153, 
lastname_proc#2115 AS lastname_proc#2154, age_1_male#2116 AS age_1_male#2155, 
age_2_male#2117 AS age_2_male#2156, age_3_male#2118 AS age_3_male#2157, 
age_1_female#2119 AS age_1_female#2158, age_2_female#2120 AS age_2_female#2159, 
age_3_female#2121 AS age_3_female#2160, coalesce(age_1_male#2116, 
age_2_male#2117, age_3_male#2118, age_1_female#2119, age_2_female#2120, 
age_3_female#2121, Age#2107) AS age_imputed#2161]
+- Project [PassengerId#2076L AS PassengerId#2103L, Survived#2077 
AS Survived#2104, Pclass#2078 AS Pclass#2105, Sex#2079 AS Sex#2106, Age#2080 AS 
Age#2107, SibSp#2081L AS SibSp#2108L, Parch#2082L AS Parch#2109L, Ticket#2083 
AS Ticket#2110, Fare#2084 AS Fare#2111, Cabin#2085 AS Cabin#2112, Embarked#2086 
AS Embarked#2113, firstname_proc#2087 AS firstname_proc#2114, 
lastname_proc#2088 AS lastname_proc#2115, CASE WHEN (true = ((isnull(Age#2080) 
&& (Sex#2079 = male)) && (Pclass#2078 = 1))) THEN 39.56 END AS age_1_male#2116, 
CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = male)) && (Pclass#2078 = 
2))) THEN 21.72 END AS age_2_male#2117, CASE WHEN (true = ((isnull(Age#2080) && 
(Sex#2079 = male)) && (Pclass#2078 = 3))) THEN 26.84 END AS age_3_male#2118, 
CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = female)) && (Pclass#2078 = 
1))) THEN 38.84 END AS age_1_female#2119, CASE WHEN (true = ((isnull(Age#2080) 
&& (Sex#2079 = female)) && (Pclass#2078 = 2))) THEN 27.48 END AS 
age_2_female#2120, CASE WHEN (true = ((isnull(Age#2080) && (Sex#2079 = female)) 
&& (Pclass#2078 = 3))) THEN 11.16 END AS age_3_female#2121]
   +- Project [CASE WHEN (true = ((PassengerId#106L >= 1) && 
(PassengerId#106L <= 900))) THEN PassengerId#106L END AS PassengerId#2076L, 
CASE WHEN (true = ((Survived#107 >= false) && (Survived#107 <= true))) THEN 
Survived#107 END AS Survived#2077, CASE WHEN (true = Pclass#108 IN (1,2,3)) 
THEN Pclass#108 END AS Pclass#2078, CASE WHEN (true 

[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25546:


Assignee: Apache Spark

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> Has two problems:
> - it keeps that value across different SparkEnv instances. So e.g. if you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv. e.g. if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25533:
--

Assignee: shahid

> Inconsistent message for Completed Jobs in the  JobUI, when there are failed 
> jobs, compared to spark2.2
> ---
>
> Key: SPARK-25533
> URL: https://issues.apache.org/jira/browse/SPARK-25533
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Attachments: Screenshot from 2018-09-26 00-42-00.png, Screenshot from 
> 2018-09-26 00-46-35.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sc.parallelize(1 to 5, 5).collect()
> sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail 
> Job")}.collect()
> {code}
> *Output in spark - 2.3.1:*
> !Screenshot from 2018-09-26 00-42-00.png!
> *Output in spark - 2.2.1:*
> !Screenshot from 2018-09-26 00-46-35.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25318) Add exception handling when wrapping the input stream during the the fetch or stage retry in response to a corrupted block

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25318.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22325
[https://github.com/apache/spark/pull/22325]

> Add exception handling when wrapping the input stream during the the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-4105 provided a solution to the block corruption issue by retrying the 
> fetch or the stage. That solution includes a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but the current code has no exception handling around it, which has caused 
> confusion for users. In fact, we have customers who reported an exception 
> like the following once SPARK-4105 was available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> tostage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid 
> confusion.
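A minimal sketch of the idea (my own illustration, not the actual patch): wrap 
the stream-wrapping step so that a failure there is flagged the same way as a 
detected corrupt block and the existing retry logic can kick in.
{code:java}
import java.io.{IOException, InputStream}

// `wrap` stands in for streamWrapper(blockId, in) and `markCorrupt` for whatever
// the fetcher already does when it detects a corrupt block; both are assumptions.
def wrapOrFlagCorrupt(blockId: String, in: InputStream,
                      wrap: (String, InputStream) => InputStream,
                      markCorrupt: (String, Throwable) => Unit): InputStream = {
  try {
    wrap(blockId, in)
  } catch {
    case e: IOException =>
      markCorrupt(blockId, e)  // hand the block back to the retry path instead of
      throw e                  // surfacing an opaque codec error to the user
  }
}
{code}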



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25536:
--
Affects Version/s: 2.3.0
   2.3.1

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: ZhuoerXu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the the fetch or stage retry in response to a corrupted block

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25318:
--

Assignee: Reza Safi

> Add exception handling when wrapping the input stream during the the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-4105 provided a solution to the block corruption issue by retrying the 
> fetch or the stage. That solution includes a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but the current code has no exception handling around it, which has caused 
> confusion for users. In fact, we have customers who reported an exception 
> like the following once SPARK-4105 was available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> tostage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25535) Work around bad error checking in commons-crypto

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629174#comment-16629174
 ] 

Apache Spark commented on SPARK-25535:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22557

> Work around bad error checking in commons-crypto
> 
>
> Key: SPARK-25535
> URL: https://issues.apache.org/jira/browse/SPARK-25535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Marcelo Vanzin
>Priority: Major
>
> The commons-crypto library used for encryption can get confused when certain 
> errors happen; that can lead to crashes since the Java side thinks the 
> ciphers are still valid while the native side has already cleaned up the 
> ciphers.
> We can work around that in Spark by doing some error checking at a higher 
> level.
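A rough sketch of what error checking at a higher level could look like (an 
illustration of the idea only, not the actual change): a wrapper that marks the 
stream unusable after the first failure, so the Java side never calls back into 
a cipher whose native state may already have been freed.
{code:java}
import java.io.{IOException, InputStream}

// Guard an underlying (e.g. crypto) stream: after the first error, refuse all
// further use instead of delegating to a possibly already-cleaned-up cipher.
class ErrorGuardedInputStream(underlying: InputStream) extends InputStream {
  private var broken = false

  private def guard[T](op: => T): T = {
    if (broken) throw new IOException("stream is unusable after a previous error")
    try op catch {
      case e: IOException =>
        broken = true
        try underlying.close() catch { case _: IOException => () }
        throw e
    }
  }

  override def read(): Int = guard(underlying.read())
  override def read(b: Array[Byte], off: Int, len: Int): Int = guard(underlying.read(b, off, len))
  override def close(): Unit = if (!broken) underlying.close()
}
{code}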



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25535) Work around bad error checking in commons-crypto

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25535:


Assignee: (was: Apache Spark)

> Work around bad error checking in commons-crypto
> 
>
> Key: SPARK-25535
> URL: https://issues.apache.org/jira/browse/SPARK-25535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Marcelo Vanzin
>Priority: Major
>
> The commons-crypto library used for encryption can get confused when certain 
> errors happen; that can lead to crashes since the Java side thinks the 
> ciphers are still valid while the native side has already cleaned up the 
> ciphers.
> We can work around that in Spark by doing some error checking at a higher 
> level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25535) Work around bad error checking in commons-crypto

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25535:


Assignee: Apache Spark

> Work around bad error checking in commons-crypto
> 
>
> Key: SPARK-25535
> URL: https://issues.apache.org/jira/browse/SPARK-25535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> The commons-crypto library used for encryption can get confused when certain 
> errors happen; that can lead to crashes since the Java side thinks the 
> ciphers are still valid while the native side has already cleaned up the 
> ciphers.
> We can work around that in Spark by doing some error checking at a higher 
> level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Leo Gallucci (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629182#comment-16629182
 ] 

Leo Gallucci commented on SPARK-18112:
--

And to make things worse, Hive is already at version 3. The same goes for 
Hadoop: the default Spark+Hadoop distribution comes with Hadoop 2.7 while Hadoop 
is already at 3.1. It is really hard to understand how such a popular open 
source project as Spark keeps dependencies that are years old, some of them 7 
years old or more.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many 
> bug fixes and performance improvements, it is important and urgent to 
> upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-25546:
---
Comment: was deleted

(was: User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22558)

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> Has two problems:
> - it keeps that value across different SparkEnv instances. So e.g. if you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv. e.g. if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Eugeniu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629000#comment-16629000
 ] 

Eugeniu commented on SPARK-18112:
-

I can only describe my situation. I am using AWS EMR 5.17.0 with Hive, Spark, 
Zeppelin, and Hue installed. In Zeppelin, the configuration variable for the 
Spark interpreter points to /usr/lib/spark. There I found a jars/ folder 
containing the following Hive-related libraries:

{code}
-rw-r--r-- 1 root root   139044 Aug 15 01:06 
hive-beeline-1.2.1-spark2-amzn-0.jar
-rw-r--r-- 1 root root40850 Aug 15 01:06 hive-cli-1.2.1-spark2-amzn-0.jar
-rw-r--r-- 1 root root 11497847 Aug 15 01:06 hive-exec-1.2.1-spark2-amzn-0.jar
-rw-r--r-- 1 root root   101113 Aug 15 01:06 hive-jdbc-1.2.1-spark2-amzn-0.jar
-rw-r--r-- 1 root root  5472179 Aug 15 01:06 
hive-metastore-1.2.1-spark2-amzn-0.jar
{code}

If I replace them with their 2.3.3 equivalents, e.g. 
hive-exec-1.2.1-spark2-amzn-0.jar -> hive-exec-2.3.3-amzn-1.jar, I get the 
following error when running a SQL query in Spark:

{code}
java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
at 
org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at 
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at 
org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.zeppelin.spark.SparkSqlInterpreter.interpret(SparkSqlInterpreter.java:116)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:498)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at 
org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 

[jira] [Commented] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2

2018-09-26 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629243#comment-16629243
 ] 

Marcelo Vanzin commented on SPARK-25533:


This is merged to master. I'll backport it to 2.4 and 2.3 after I fix an 
unrelated issue that I ran into during testing.

> Inconsistent message for Completed Jobs in the  JobUI, when there are failed 
> jobs, compared to spark2.2
> ---
>
> Key: SPARK-25533
> URL: https://issues.apache.org/jira/browse/SPARK-25533
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Attachments: Screenshot from 2018-09-26 00-42-00.png, Screenshot from 
> 2018-09-26 00-46-35.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sc.parallelize(1 to 5, 5).collect()
> sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail 
> Job")}.collect()
> {code}
> *Output in spark - 2.3.1:*
> !Screenshot from 2018-09-26 00-42-00.png!
> *Output in spark - 2.2.1:*
> !Screenshot from 2018-09-26 00-46-35.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Hi [~Steven Rand], would it be possible to share the schema of this DataFrame?


> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans

2018-09-26 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629321#comment-16629321
 ] 

Michal Šenkýř commented on SPARK-17952:
---

I implemented nested bean support in a pull request. Arrays and lists are not 
supported yet; I will add them later, if this approach is approved, to bring the 
code in line with the docs.

> SparkSession createDataFrame method throws exception for nested JavaBeans
> -
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.3.0
>Reporter: Amit Baghel
>Priority: Major
>
> As per the latest Spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  
> {quote}
> Nested JavaBeans and List or Array fields are supported though.
> {quote}
> However, nested JavaBeans are not working. Please see the code below.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<>();
>   categoryList.add(category);
>//DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at 

[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-26 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629320#comment-16629320
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] Thanks for your reply. 

First, the PR I proposed here is meant as a basis for discussion; we can use it 
or disregard it, either way is fine with me. What I want to propose is that we 
move this ticket forward as soon as possible, since this feature is critical 
for production and for the community.

Second, you could write up a document to discuss the design and create an SPIP. 
I can learn from your advice and from others'. That would be useful.

Finally, thanks so much for starting to work on this. Your example is very 
good. You can refer to my PR or do it yourself; either way, we can discuss and 
move this forward as soon as possible. What do you think? I hope to learn from 
you.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Delegation token support was released in Kafka version 1.1. As Spark has 
> updated its Kafka client to 2.0.0, it is now possible to implement delegation 
> token support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25531) new write APIs for data source v2

2018-09-26 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629418#comment-16629418
 ] 

Ryan Blue commented on SPARK-25531:
---

[~cloud_fan], what was the intent for this umbrella issue? You described it as 
tracking the progress of "Standardize SQL logical plans", but the current 
description is "new write APIs" instead. Also, these issues were already tracked under the umbrella 
SPARK-22386 to improve DSv2, which covers the new logical plans and other 
support issues like adding interfaces for required clustering and sorting 
(SPARK-23889).

Is your intent to close the other issue because it is too old?

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API heavily depends on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25547) Pluggable jdbc connection factory

2018-09-26 Thread Frank Sauer (JIRA)
Frank Sauer created SPARK-25547:
---

 Summary: Pluggable jdbc connection factory
 Key: SPARK-25547
 URL: https://issues.apache.org/jira/browse/SPARK-25547
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Frank Sauer


The ability to provide a custom connectionFactoryProvider via JDBCOptions, so 
that JdbcUtils.createConnectionFactory can produce a custom connection factory, 
would be very useful. In our case we needed to load balance connections to an 
AWS Aurora Postgres cluster by round-robining through the endpoints of the read 
replicas, since their own load balancing was insufficient. We got away with it 
by copying most of the Spark jdbc package, adding this feature there, and 
changing the format from jdbc to our new package. However, it would be nice if 
this were supported out of the box via a new option in JDBCOptions providing 
the class name of a ConnectionFactoryProvider. I'm creating this Jira in order 
to submit a PR which I have ready to go.
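To make the idea concrete, a purely hypothetical sketch (the 
ConnectionFactoryProvider trait, its method, and the way Spark would discover it 
are all made up here; this is not an existing Spark API):
{code:java}
import java.sql.{Connection, DriverManager}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical SPI: Spark would instantiate the class named in a new JDBCOptions
// option and use the returned factory instead of its built-in one.
trait ConnectionFactoryProvider {
  def createConnectionFactory(urls: Seq[String]): () => Connection
}

// Example implementation: round-robin over Aurora read-replica endpoints.
class RoundRobinConnectionFactoryProvider extends ConnectionFactoryProvider {
  private val next = new AtomicInteger(0)
  override def createConnectionFactory(urls: Seq[String]): () => Connection =
    () => {
      val url = urls(java.lang.Math.floorMod(next.getAndIncrement(), urls.size))
      DriverManager.getConnection(url)
    }
}
{code}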



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25547) Pluggable jdbc connection factory

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25547:


Assignee: Apache Spark

> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Assignee: Apache Spark
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions, so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory, would be very useful. In our case we needed to load balance 
> connections to an AWS Aurora Postgres cluster by round-robining through the 
> endpoints of the read replicas, since their own load balancing was 
> insufficient. We got away with it by copying most of the Spark jdbc package, 
> adding this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the class name of a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25547) Pluggable jdbc connection factory

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629425#comment-16629425
 ] 

Apache Spark commented on SPARK-25547:
--

User 'fsauer65' has created a pull request for this issue:
https://github.com/apache/spark/pull/22560

> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions, so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory, would be very useful. In our case we needed to load balance 
> connections to an AWS Aurora Postgres cluster by round-robining through the 
> endpoints of the read replicas, since their own load balancing was 
> insufficient. We got away with it by copying most of the Spark jdbc package, 
> adding this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the class name of a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25547) Pluggable jdbc connection factory

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25547:


Assignee: (was: Apache Spark)

> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory would be very useful. In our case we needed to load balance 
> connections to an AWS Aurora Postgres cluster by round-robining through the 
> endpoints of the read replicas, since their own load balancing was 
> insufficient. We got away with it by copying most of the spark jdbc package, 
> providing this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the classname for a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR, 
> which I have ready to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629420#comment-16629420
 ] 

t oo commented on SPARK-18112:
--

hear, hear!

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive2.0 was released in February 2016, and Hive2.0.1 and 
> Hive2.1.0 have also been out for a long time, but till now Spark only 
> supports reading hive metastore data from Hive1.2.1 and older versions. Since 
> Hive2.x has many bug fixes and performance improvements, it's better and 
> urgent to upgrade to support Hive2.x.
> failed to load data from hive2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24285:
--
Description: 
*2.5.0-SNAPSHOT*
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640

*2.3.x*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/

  was:
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/


> Flaky test: ContinuousSuite.query without test harness
> --
>
> Key: SPARK-24285
> URL: https://issues.apache.org/jira/browse/SPARK-24285
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *2.5.0-SNAPSHOT*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640
> *2.3.x*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24285:
--
Description: 
*2.5.0-SNAPSHOT*
 - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640]

{code:java}
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, 
scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => 
org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row])
 was false{code}
*2.3.x*
 - 
[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/]
 - 
[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/]

  was:
*2.5.0-SNAPSHOT*
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640

*2.3.x*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/


> Flaky test: ContinuousSuite.query without test harness
> --
>
> Key: SPARK-24285
> URL: https://issues.apache.org/jira/browse/SPARK-24285
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *2.5.0-SNAPSHOT*
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640]
> {code:java}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, 
> scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => 
> org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row])
>  was false{code}
> *2.3.x*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25372) Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25372.

   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22362
[https://github.com/apache/spark/pull/22362]

> Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit
> --
>
> Key: SPARK-25372
> URL: https://issues.apache.org/jira/browse/SPARK-25372
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, YARN
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 2.5.0
>
>
> {{SparkSubmit}} already logs in the user if a keytab is provided; the only 
> issue is that it uses the existing configs which have "yarn" in their name. 
> As such, we should use a common name for the principal and keytab configs, 
> and deprecate the YARN-specific ones.
> cc [~vanzin]
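
For reference, a minimal sketch of the existing YARN-named settings that this 
ticket proposes to deprecate; the principal and keytab path are placeholders:

{code:scala}
import org.apache.spark.SparkConf

// Today's YARN-specific names (candidates for deprecation in favour of
// engine-agnostic ones); the values below are placeholders.
val conf = new SparkConf()
  .set("spark.yarn.principal", "spark_user@EXAMPLE.COM")
  .set("spark.yarn.keytab", "/etc/security/keytabs/spark_user.keytab")
{code}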



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629561#comment-16629561
 ] 

Steven Rand commented on SPARK-25538:
-

[~kiszk], yes, the schema is:

 
{code}
scala> spark.read.parquet("hdfs:///data").printSchema
root
 |-- col_0: string (nullable = true)
 |-- col_1: timestamp (nullable = true)
 |-- col_2: string (nullable = true)
 |-- col_3: timestamp (nullable = true)
 |-- col_4: string (nullable = true)
 |-- col_5: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_6: string (nullable = true)
 |-- col_7: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_8: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_9: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_10: string (nullable = true)
 |-- col_11: timestamp (nullable = true)
 |-- col_12: integer (nullable = true)
 |-- col_13: boolean (nullable = true)
 |-- col_14: decimal(38,18) (nullable = true)
 |-- col_15: long (nullable = true)
 |-- col_16: string (nullable = true)
 |-- col_17: integer (nullable = true)
 |-- col_18: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_19: string (nullable = true)
 |-- col_20: string (nullable = true)
 |-- col_21: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_22: string (nullable = true)
 |-- col_23: array (nullable = true)
 ||-- element: timestamp (containsNull = true)
 |-- col_24: string (nullable = true)
 |-- col_25: string (nullable = true)
 |-- col_26: string (nullable = true)
 |-- col_27: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_28: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_29: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_30: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_31: decimal(38,18) (nullable = true)
 |-- col_32: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_33: string (nullable = true)
 |-- col_34: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_35: decimal(38,18) (nullable = true)
 |-- col_36: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_37: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_38: decimal(38,18) (nullable = true)
 |-- col_39: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_40: string (nullable = true)
 |-- col_41: string (nullable = true)
 |-- col_42: string (nullable = true)
 |-- col_43: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_44: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_45: string (nullable = true)
 |-- col_46: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_47: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_48: string (nullable = true)
 |-- col_49: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_50: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_51: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_52: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_53: string (nullable = true)
 |-- col_54: decimal(38,18) (nullable = true)
 |-- col_55: decimal(38,18) (nullable = true)
 |-- col_56: decimal(38,18) (nullable = true)
 |-- col_57: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if 

[jira] [Assigned] (SPARK-25372) Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit

2018-09-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25372:
--

Assignee: Ilan Filonenko

> Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit
> --
>
> Key: SPARK-25372
> URL: https://issues.apache.org/jira/browse/SPARK-25372
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, YARN
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 2.5.0
>
>
> {{SparkSubmit}} already logs in the user if a keytab is provided; the only 
> issue is that it uses the existing configs which have "yarn" in their name. 
> As such, we should use a common name for the principal and keytab configs, 
> and deprecate the YARN-specific ones.
> cc [~vanzin]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25454.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.4.0
   2.3.3

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem consists of a precision loss when the second operand of the 
> division is a decimal with a negative scale. It was also present before 2.3, 
> but it was harder to reproduce: you had to do something like 
> {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with 
> SQL constants.
> The problem is that our logic is taken from Hive and SQLServer, where decimals 
> with negative scales are not allowed. We might also consider enforcing this 
> in 3.0 eventually. Meanwhile we can fix the logic for computing the 
> result type for a division.
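
As a hedged illustration (not the original reproduction), the decimal type 
Spark assigns to such a literal, and to a division involving an exponent-style 
constant, can be inspected in spark-shell; the exact precision and scale 
printed depend on the Spark version:

{code:scala}
import org.apache.spark.sql.functions.lit

// Check the declared type of a literal built from a Double-backed BigDecimal.
spark.range(1).select(lit(BigDecimal(100e6)).as("d")).printSchema()

// Check the result type Spark computes for a division by an exponent constant.
spark.sql("SELECT 1.0 / 100e6 AS q").printSchema()
{code}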



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25540) Make HiveContext in PySpark behave the same as in Scala.

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25540:
---

Assignee: Takuya Ueshin

> Make HiveContext in PySpark behave the same as in Scala.
> 
>
> Key: SPARK-25540
> URL: https://issues.apache.org/jira/browse/SPARK-25540
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> In Scala, {{HiveContext}} sets the config {{spark.sql.catalogImplementation}} 
> on the given {{SparkContext}} and then passes it to {{SparkSession.builder}}.
> The {{HiveContext}} in PySpark should behave the same as it does in Scala.
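
For comparison, a small sketch of the Scala-side behaviour being mirrored, 
written against the public builder API rather than the actual HiveContext 
sources:

{code:scala}
import org.apache.spark.sql.SparkSession

// enableHiveSupport() sets spark.sql.catalogImplementation to "hive" before
// the session is built, which is the behaviour PySpark's HiveContext should match.
val spark = SparkSession.builder()
  .appName("hive-context-parity")   // illustrative app name
  .enableHiveSupport()
  .getOrCreate()

println(spark.conf.get("spark.sql.catalogImplementation"))  // prints "hive"
{code}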



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25540) Make HiveContext in PySpark behave the same as in Scala.

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25540.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22552
[https://github.com/apache/spark/pull/22552]

> Make HiveContext in PySpark behave the same as in Scala.
> 
>
> Key: SPARK-25540
> URL: https://issues.apache.org/jira/browse/SPARK-25540
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> In Scala, {{HiveContext}} sets the config {{spark.sql.catalogImplementation}} 
> on the given {{SparkContext}} and then passes it to {{SparkSession.builder}}.
> The {{HiveContext}} in PySpark should behave the same as it does in Scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that the partition can be pruned

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629655#comment-16629655
 ] 

Apache Spark commented on SPARK-25548:
--

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/22561

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) so that the partition 
> can be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The sql below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that the partition can be pruned

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25548:


Assignee: Apache Spark

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) so that the partition 
> can be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The sql below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that the partition can be pruned

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25548:


Assignee: Apache Spark

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) so that the partition 
> can be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The sql below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that the partition can be pruned

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25548:


Assignee: (was: Apache Spark)

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) so that the partition 
> can be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The sql below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25531) new write APIs for data source v2

2018-09-26 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629550#comment-16629550
 ] 

Wenchen Fan commented on SPARK-25531:
-

I want to have a more structured view of the data source v2 project. It's a bad 
idea to put everything under SPARK-22386, which is so general that it only says 
it's an improvement. I'm starting to create tickets for the big steps of the 
data source v2 project, like this one, the API refactoring, and potentially the 
catalog work, the custom metrics, etc. in the future.

For this particular case, the final goal is to design a new write API, for both 
data sources and end users, that gets rid of SaveMode. "Standardize SQL logical 
plans" is how to achieve this goal IMO.

Note that all of them will be marked as "blocks SPARK-25186 Stabilize Data 
Source V2 API".

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API heavily depends on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2018-09-26 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629572#comment-16629572
 ] 

Bryan Cutler commented on SPARK-25351:
--

Hi [~pgadige], yes please go ahead with this issue!  When creating a DataFrame 
from Pandas without Arrow, category columns are converted into the type of the 
category. So in the example above, column "A" becomes a string type. The same 
should be done when Arrow is enabled, so we end up with the same Spark 
DataFrame. If you are able to, we also need to see how this affects pandas_udfs 
too. Thanks!

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-25454:
-
  Assignee: (was: Wenchen Fan)

I'm reopening it, since the bug is not fully fixed. But we do have a workaround 
now: setting {{spark.sql.legacy.literal.pickMinimumPrecision}} to false.
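
A one-line example of applying that workaround in an existing session (the 
config name is taken from the comment above):

{code:scala}
// Per-session workaround; it can also be passed at submit time with
// --conf spark.sql.legacy.literal.pickMinimumPrecision=false
spark.conf.set("spark.sql.legacy.literal.pickMinimumPrecision", "false")
{code}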

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem consists of a precision loss when the second operand of the 
> division is a decimal with a negative scale. It was also present before 2.3, 
> but it was harder to reproduce: you had to do something like 
> {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with 
> SQL constants.
> The problem is that our logic is taken from Hive and SQLServer, where decimals 
> with negative scales are not allowed. We might also consider enforcing this 
> in 3.0 eventually. Meanwhile we can fix the logic for computing the 
> result type for a division.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25454:

Fix Version/s: (was: 2.3.3)
   (was: 2.4.0)

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem consists of a precision loss when the second operand of the 
> division is a decimal with a negative scale. It was also present before 2.3, 
> but it was harder to reproduce: you had to do something like 
> {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with 
> SQL constants.
> The problem is that our logic is taken from Hive and SQLServer, where decimals 
> with negative scales are not allowed. We might also consider enforcing this 
> in 3.0 eventually. Meanwhile we can fix the logic for computing the 
> result type for a division.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so that the partition can be pruned

2018-09-26 Thread eaton (JIRA)
eaton created SPARK-25548:
-

 Summary: In the PruneFileSourcePartitions optimizer, replace the 
nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) so 
that the partition can be pruned
 Key: SPARK-25548
 URL: https://issues.apache.org/jira/browse/SPARK-25548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2
Reporter: eaton


In the PruneFileSourcePartitions optimizer, the partition files will not be 
pruned if we use a partition filter and a non-partition filter together, for 
example:

sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
by(p_d int) stored as parquet ")
 sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
value")
 sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
value")
 sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
value")

The sql below will scan all the partition files, even though the partition 
**p_d=4** should be pruned.
 **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16859) History Server storage information is missing

2018-09-26 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628554#comment-16628554
 ] 

t oo commented on SPARK-16859:
--

bump

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been 
> broken for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands-on Scala experience. So I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' operator

2018-09-26 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-25541:
---
Summary: CaseInsensitiveMap should be serializable after '-' operator  
(was: CaseInsensitiveMap should be serializable after '-' or 'filterKeys')
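
A minimal sketch of the scenario the new summary describes, assuming the 
internal org.apache.spark.sql.catalyst.util.CaseInsensitiveMap class and plain 
Java serialization; it is illustrative, not the actual regression test:

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Build a map, drop a key with '-', then serialize the result; the reported
// problem was that the map returned by '-' was no longer serializable.
val options = CaseInsensitiveMap(Map("Path" -> "/tmp/data", "Header" -> "true")) - "header"
val out = new ObjectOutputStream(new ByteArrayOutputStream())
out.writeObject(options)   // should not throw java.io.NotSerializableException
out.close()
{code}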

> CaseInsensitiveMap should be serializable after '-' operator
> 
>
> Key: SPARK-25541
> URL: https://issues.apache.org/jira/browse/SPARK-25541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics

2018-09-26 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629702#comment-16629702
 ] 

Liang-Chi Hsieh commented on SPARK-25549:
-

cc [~cloud_fan]

 

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage for collecting RDD 
> statistics. However, it is too low level and not easy to use. We need a 
> high-level API for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25481.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22490
[https://github.com/apache/spark/pull/22490]

> Refactor ColumnarBatchBenchmark to use main method
> --
>
> Key: SPARK-25481
> URL: https://issues.apache.org/jira/browse/SPARK-25481
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25536:
-

Assignee: shahid

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: ZhuoerXu
>Assignee: shahid
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629740#comment-16629740
 ] 

Dongjoon Hyun commented on SPARK-25536:
---

Issue resolved by pull request 22555
[https://github.com/apache/spark/pull/22555]

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: ZhuoerXu
>Assignee: shahid
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25536.
---
Resolution: Fixed

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: ZhuoerXu
>Assignee: shahid
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25536:
--
Fix Version/s: 2.4.0
   2.3.3

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: ZhuoerXu
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25540) Make HiveContext in PySpark behave the same as in Scala.

2018-09-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25540:
-
Fix Version/s: (was: 2.4.0)
   2.5.0

> Make HiveContext in PySpark behave the same as in Scala.
> 
>
> Key: SPARK-25540
> URL: https://issues.apache.org/jira/browse/SPARK-25540
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.5.0
>
>
> In Scala, {{HiveContext}} sets the config {{spark.sql.catalogImplementation}} 
> on the given {{SparkContext}} and then passes it to {{SparkSession.builder}}.
> The {{HiveContext}} in PySpark should behave the same as it does in Scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629749#comment-16629749
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Hive 3 support. See https://github.com/apache/spark/pull/21404

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive2.0 was released in February 2016, and Hive2.0.1 and 
> Hive2.1.0 have also been out for a long time, but till now Spark only 
> supports reading hive metastore data from Hive1.2.1 and older versions. Since 
> Hive2.x has many bug fixes and performance improvements, it's better and 
> urgent to upgrade to support Hive2.x.
> failed to load data from hive2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25468) Highlight current page index in the history server

2018-09-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25468.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22516
[https://github.com/apache/spark/pull/22516]

> Highlight current page index in the history server
> --
>
> Key: SPARK-25468
> URL: https://issues.apache.org/jira/browse/SPARK-25468
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Dhruve Ashar
>Assignee: Adam Wang
>Priority: Trivial
> Fix For: 2.4.0
>
> Attachments: SparkHistoryServer.png
>
>
> Spark History Server Web UI should highlight the current page index selected 
> for better navigation. Without it being highlighted it is difficult to 
> identify the current page you are looking at. 
>  
> For example: Page 1 should be highlighted as show in SparkHistoryServer.png 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' or 'filterKeys'

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629690#comment-16629690
 ] 

Apache Spark commented on SPARK-25541:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22562

> CaseInsensitiveMap should be serializable after '-' or 'filterKeys'
> ---
>
> Key: SPARK-25541
> URL: https://issues.apache.org/jira/browse/SPARK-25541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629698#comment-16629698
 ] 

Apache Spark commented on SPARK-24341:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22563

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at 

[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics

2018-09-26 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629700#comment-16629700
 ] 

Liang-Chi Hsieh commented on SPARK-25549:
-

The design doc is at:

https://docs.google.com/document/d/177JYpF8N31Wpg86lmMI2yA5KGfpevDNkvpY7dnwRyDo/edit?usp=sharing

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage for collecting RDD 
> statistics. However, it is too low level and not easy to use. We need a 
> high-level API for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25549) High level API to collect RDD statistics

2018-09-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-25549:
---

 Summary: High level API to collect RDD statistics
 Key: SPARK-25549
 URL: https://issues.apache.org/jira/browse/SPARK-25549
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.5.0
Reporter: Liang-Chi Hsieh


We have the low-level API SparkContext.submitMapStage for collecting RDD 
statistics. However, it is too low level and not easy to use. We need a 
high-level API for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method

2018-09-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25481:
-

Assignee: yucai

> Refactor ColumnarBatchBenchmark to use main method
> --
>
> Key: SPARK-25481
> URL: https://issues.apache.org/jira/browse/SPARK-25481
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629742#comment-16629742
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Hive 3 support is blocked by Hadoop 3 profile. See 
https://github.com/apache/spark/pull/21588 and please provide some input at 
https://issues.apache.org/jira/browse/SPARK-20202

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark can still only read Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many bugs and 
> improves performance, upgrading Spark to support Hive 2.x is both worthwhile 
> and urgent.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629743#comment-16629743
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Re: 
https://issues.apache.org/jira/browse/SPARK-18112?focusedCommentId=16629000=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16629000

Did you set {{spark.sql.hive.metastore.version}}?
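
For anyone hitting this, a minimal sketch of the usual way to point Spark at a 
Hive 2.x metastore follows. The version string "2.1.1" is a placeholder; check 
the Hive Tables section of the Spark SQL docs for the metastore versions your 
Spark release actually supports, and set these options before the first 
Hive-enabled session is created.

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: pin the metastore client version so Spark does not talk to a
// Hive 2.x metastore with its built-in Hive 1.2.1 client.
val spark = SparkSession.builder()
  .appName("hive-2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")  // placeholder version
  // "maven" downloads matching Hive client jars at startup; a JVM-style
  // classpath of Hive jars is the other common choice.
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}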

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark can still only read Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many bugs and 
> improves performance, upgrading Spark to support Hive 2.x is both worthwhile 
> and urgent.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.

2018-09-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25525:


Assignee: Takuya Ueshin

> Do not update conf for existing SparkContext in SparkSession.getOrCreate.
> -
>
> Key: SPARK-25525
> URL: https://issues.apache.org/jira/browse/SPARK-25525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.5.0
>
>
> In SPARK-20946, we modified {{SparkSession.getOrCreate}} to not update conf 
> for existing {{SparkContext}} because {{SparkContext}} is shared by all 
> sessions.
> We should not update it on the PySpark side either.
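
For context, below is a minimal Scala sketch of the JVM-side behavior 
(introduced by SPARK-20946) that this ticket brings to PySpark; the key 
spark.some.option and the local master are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder().master("local[2]").getOrCreate()

// getOrCreate reuses the existing session and its shared SparkContext.
// The option below lands in the session conf only; the SparkContext conf
// shared by all sessions is left untouched.
val second = SparkSession.builder()
  .config("spark.some.option", "value")  // placeholder key/value
  .getOrCreate()

assert(second.sparkContext eq first.sparkContext)
println(second.conf.get("spark.some.option"))                      // value
println(first.sparkContext.getConf.contains("spark.some.option"))  // false
{code}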



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.

2018-09-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25525.
--
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22545
[https://github.com/apache/spark/pull/22545]

> Do not update conf for existing SparkContext in SparkSession.getOrCreate.
> -
>
> Key: SPARK-25525
> URL: https://issues.apache.org/jira/browse/SPARK-25525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.5.0
>
>
> In SPARK-20946, we modified {{SparkSession.getOrCreate}} to not update conf 
> for existing {{SparkContext}} because {{SparkContext}} is shared by all 
> sessions.
> We should not update it on the PySpark side either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629742#comment-16629742
 ] 

Hyukjin Kwon edited comment on SPARK-18112 at 9/27/18 4:42 AM:
---

Hadoop 3 profile. See https://github.com/apache/spark/pull/21588 and please 
provide some input at https://issues.apache.org/jira/browse/SPARK-20202


was (Author: hyukjin.kwon):
Hive 3 support is blocked by Hadoop 3 profile. See 
https://github.com/apache/spark/pull/21588 and please provide some input at 
https://issues.apache.org/jira/browse/SPARK-20202

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but Spark can still only read Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many bugs and 
> improves performance, upgrading Spark to support Hive 2.x is both worthwhile 
> and urgent.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25468) Highlight current page index in the history server

2018-09-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25468:
-

Assignee: Adam Wang

> Highlight current page index in the history server
> --
>
> Key: SPARK-25468
> URL: https://issues.apache.org/jira/browse/SPARK-25468
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Dhruve Ashar
>Assignee: Adam Wang
>Priority: Trivial
> Fix For: 2.4.0
>
> Attachments: SparkHistoryServer.png
>
>
> The Spark History Server Web UI should highlight the currently selected page 
> index for better navigation. Without the highlight it is difficult to tell 
> which page you are looking at.
>  
> For example: Page 1 should be highlighted as shown in SparkHistoryServer.png 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25536:


Assignee: (was: Apache Spark)

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ZhuoerXu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25536:


Assignee: Apache Spark

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ZhuoerXu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628339#comment-16628339
 ] 

Apache Spark commented on SPARK-25536:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22555

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ZhuoerXu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25536) executorSource.METRIC read wrong record in Executor.scala Line444

2018-09-26 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628267#comment-16628267
 ] 

shahid edited comment on SPARK-25536 at 9/26/18 7:18 AM:
-

Thanks. I will raise a pr


was (Author: shahid):
I will raise a pr

> executorSource.METRIC read wrong record in Executor.scala Line444
> -
>
> Key: SPARK-25536
> URL: https://issues.apache.org/jira/browse/SPARK-25536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ZhuoerXu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25538:

Priority: Major  (was: Blocker)

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628369#comment-16628369
 ] 

Marco Gaido commented on SPARK-25538:
-

Please do not use Blocker or Critical when reporting issues, as those 
priorities are reserved for committers. That said, I agree this should be a 
blocker for 2.4.0, as it is a correctness issue. cc [~cloud_fan]

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25538:

Labels: correctness  (was: )

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


