[jira] [Closed] (SPARK-22089) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guichaoxian closed SPARK-22089.
---
Resolution: Fixed

> There is no need for fileStatusCache to invalidateAll  when InMemoryFileIndex 
> refresh 
> --
>
> Key: SPARK-22089
> URL: https://issues.apache.org/jira/browse/SPARK-22089
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guichaoxian
>
> The fileStatusCache is a globally shared cache, while refresh only applies to 
> one table, so we do not need to invalidate every entry in the fileStatusCache.
> {code:java}
>   /** Globally shared (not exclusive to this table) cache for file statuses 
> to speed up listing. */
>   private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
> {code}
> {code:java}
>  override def refresh(): Unit = {
> refresh0()
> fileStatusCache.invalidateAll()
>   }
> {code}
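
A minimal sketch of the proposed change, assuming a hypothetical per-path
invalidation method on FileStatusCache (the Spark 2.1 trait only exposes
invalidateAll(), so `invalidate(path)` below is an assumption, not an existing API):
{code:scala}
// Hypothetical sketch: drop only this table's entries instead of clearing the
// whole globally shared cache. `invalidate(path)` is assumed, not an existing method.
override def refresh(): Unit = {
  refresh0()
  rootPaths.foreach(path => fileStatusCache.invalidate(path))
}
{code}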



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21981:
---

Assignee: Marco Gaido

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516; we should expose an 
> API for it in PySpark.
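
For reference, a minimal Scala usage sketch of the evaluator added in SPARK-14516 
that the PySpark wrapper would mirror (the `predictions` DataFrame and column 
names are illustrative assumptions):
{code:scala}
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// `predictions` is assumed to be the output of a clustering model,
// containing "features" and "prediction" columns.
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setMetricName("silhouette")

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
{code}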



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21981.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
> Fix For: 2.3.0
>
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516; we should expose an 
> API for it in PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22094) processAllAvailable should not block forever when a query is stopped

2017-09-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22094.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> processAllAvailable should not block forever when a query is stopped
> 
>
> Key: SPARK-22094
> URL: https://issues.apache.org/jira/browse/SPARK-22094
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.1, 2.3.0
>
>
> When a query is stopped, `processAllAvailable` may just block forever.
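
A minimal sketch of the scenario, assuming `query` is a running 
org.apache.spark.sql.streaming.StreamingQuery (names are illustrative): if the 
query is stopped from another thread, the waiting call should return rather than 
hang.
{code:scala}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Stop the query from another thread after a short delay.
Future {
  Thread.sleep(1000)
  query.stop()
}

// Should return once the query is stopped instead of blocking forever.
query.processAllAvailable()
{code}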



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21766) DataFrame toPandas() raises ValueError with nullable int columns

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21766:


Assignee: (was: Apache Spark)

> DataFrame toPandas() raises ValueError with nullable int columns
> 
>
> Key: SPARK-21766
> URL: https://issues.apache.org/jira/browse/SPARK-21766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an 
> IntegerType column that has null values, the following exception is thrown:
> {noformat}
> ValueError: Cannot convert non-finite values (NA or inf) to integer
> {noformat}
> This is because the null values first get converted to float NaN during the 
> construction of the Pandas DataFrame in {{from_records}}, and the column is 
> then converted back to an integer, where it fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21766) DataFrame toPandas() raises ValueError with nullable int columns

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21766:


Assignee: Apache Spark

> DataFrame toPandas() raises ValueError with nullable int columns
> 
>
> Key: SPARK-21766
> URL: https://issues.apache.org/jira/browse/SPARK-21766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an 
> IntegerType column that has null values, the following exception is thrown:
> {noformat}
> ValueError: Cannot convert non-finite values (NA or inf) to integer
> {noformat}
> This is because the null values first get converted to float NaN during the 
> construction of the Pandas DataFrame in {{from_records}}, and the column is 
> then converted back to an integer, where it fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Attachment: performance data for NB.png

> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
> Attachments: performance data for NB.png
>
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~16% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Description: 
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
the frequency of each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~16% performance gain with these changes.
[^performance data for NB.png]

  was:
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
the frequency of each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~16% performance gain with these changes.


> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
> Attachments: performance data for NB.png
>
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~16% performance gain with these changes.
> [^performance data for NB.png]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-22096:

Description: 
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
the frequency of each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~16% performance gain with these changes.

  was:
NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
the frequency of each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~20% performance gain with these changes.


> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~16% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175894#comment-16175894
 ] 

Apache Spark commented on SPARK-22096:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19318

> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~20% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22096:


Assignee: (was: Apache Spark)

> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~20% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22096:


Assignee: Apache Spark

> use aggregateByKeyLocally to save one stage in calculating ItemFrequency in 
> NaiveBayes
> --
>
> Key: SPARK-22096
> URL: https://issues.apache.org/jira/browse/SPARK-22096
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Vincent
>Assignee: Apache Spark
>Priority: Minor
>
> NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
> the frequency of each feature/label. We can implement a new function 
> 'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
> sending results to a reducer, saving one stage.
> We tested this on NaiveBayes and saw a ~20% performance gain with these changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22098) Add aggregateByKeyLocally in RDD

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22098:


Assignee: (was: Apache Spark)

> Add aggregateByKeyLocally in RDD
> 
>
> Key: SPARK-22098
> URL: https://issues.apache.org/jira/browse/SPARK-22098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22098) Add aggregateByKeyLocally in RDD

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22098:


Assignee: Apache Spark

> Add aggregateByKeyLocally in RDD
> 
>
> Key: SPARK-22098
> URL: https://issues.apache.org/jira/browse/SPARK-22098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Vincent
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22098) Add aggregateByKeyLocally in RDD

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175892#comment-16175892
 ] 

Apache Spark commented on SPARK-22098:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19317

> Add aggregateByKeyLocally in RDD
> 
>
> Key: SPARK-22098
> URL: https://issues.apache.org/jira/browse/SPARK-22098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Vincent
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22097) Call serializationStream.close after we requested enough memory

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22097:


Assignee: Apache Spark

> Call serializationStream.close after we requested enough memory
> ---
>
> Key: SPARK-22097
> URL: https://issues.apache.org/jira/browse/SPARK-22097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>
> In the current code, we close the `serializationStream` after we have unrolled 
> the block. However, there is a potential problem: the size of the underlying 
> vector or stream may be larger than the memory we requested, so we need to 
> check it again carefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22097) Call serializationStream.close after we requested enough memory

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175887#comment-16175887
 ] 

Apache Spark commented on SPARK-22097:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19316

> Call serializationStream.close after we requested enough memory
> ---
>
> Key: SPARK-22097
> URL: https://issues.apache.org/jira/browse/SPARK-22097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> In the current code, we close the `serializationStream` after we have unrolled 
> the block. However, there is a potential problem: the size of the underlying 
> vector or stream may be larger than the memory we requested, so we need to 
> check it again carefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22097) Call serializationStream.close after we requested enough memory

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22097:


Assignee: (was: Apache Spark)

> Call serializationStream.close after we requested enough memory
> ---
>
> Key: SPARK-22097
> URL: https://issues.apache.org/jira/browse/SPARK-22097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> In the current code, we close the `serializationStream` after we have unrolled 
> the block. However, there is a potential problem: the size of the underlying 
> vector or stream may be larger than the memory we requested, so we need to 
> check it again carefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22098) Add aggregateByKeyLocally in RDD

2017-09-21 Thread Vincent (JIRA)
Vincent created SPARK-22098:
---

 Summary: Add aggregateByKeyLocally in RDD
 Key: SPARK-22098
 URL: https://issues.apache.org/jira/browse/SPARK-22098
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Vincent
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22097) Call serializationStream.close after we requested enough memory

2017-09-21 Thread Xianyang Liu (JIRA)
Xianyang Liu created SPARK-22097:


 Summary: Call serializationStream.close after we requested enough 
memory
 Key: SPARK-22097
 URL: https://issues.apache.org/jira/browse/SPARK-22097
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Xianyang Liu


In the current code, we close the `serializationStream` after we have unrolled 
the block. However, there is a potential problem: the size of the underlying 
vector or stream may be larger than the memory we requested, so we need to 
check it again carefully.
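
A schematic sketch of the ordering being proposed, using entirely hypothetical 
names (the real logic lives in Spark's MemoryStore and is more involved): account 
for any growth of the underlying buffer first, and only close the stream after 
that final memory check.
{code:scala}
// Hypothetical helper; it only illustrates the proposed ordering.
def finishUnroll(serializationStream: java.io.Closeable,
                 bufferSize: Long,
                 reservedMemory: Long,
                 requestMemory: Long => Boolean): Boolean = {
  // The buffer backing the stream may have grown past what was reserved.
  val enoughMemory =
    if (bufferSize > reservedMemory) requestMemory(bufferSize - reservedMemory) else true
  // Close the stream only after the final memory request has been made.
  serializationStream.close()
  enoughMemory
}
{code}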



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22096) use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes

2017-09-21 Thread Vincent (JIRA)
Vincent created SPARK-22096:
---

 Summary: use aggregateByKeyLocally to save one stage in 
calculating ItemFrequency in NaiveBayes
 Key: SPARK-22096
 URL: https://issues.apache.org/jira/browse/SPARK-22096
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Vincent
Priority: Minor


NaiveBayes currently uses aggregateByKey followed by a collect to calculate 
the frequency of each feature/label. We can implement a new function 
'aggregateByKeyLocally' in RDD that merges locally on each mapper before 
sending results to a reducer, saving one stage.
We tested this on NaiveBayes and saw a ~20% performance gain with these changes.
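
A rough sketch of the idea (not the actual patch), analogous to the existing 
reduceByKeyLocally: aggregate within each partition into a local hash map and 
merge the per-partition maps, so no shuffle stage is needed before collecting.
{code:scala}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Sketch only: per-partition aggregation followed by a merge of the small maps,
// instead of aggregateByKey(...).collect(), which costs an extra stage.
def aggregateByKeyLocally[K, V, U](rdd: RDD[(K, V)], zero: => U)(
    seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
  rdd.mapPartitions { iter =>
    val acc = mutable.HashMap.empty[K, U]
    iter.foreach { case (k, v) => acc(k) = seqOp(acc.getOrElse(k, zero), v) }
    Iterator(acc)
  }.reduce { (m1, m2) =>
    m2.foreach { case (k, u) => m1(k) = m1.get(k).map(combOp(_, u)).getOrElse(u) }
    m1
  }.toMap
}
{code}
NaiveBayes could then compute the per-label/feature counts with one such call 
instead of aggregateByKey followed by collect.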



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-09-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175833#comment-16175833
 ] 

zhengruifeng commented on SPARK-13030:
--

I agree that the one-hot encoder should be an estimator, and this may even be 
somewhat of a bug.
I am a heavy user of Spark and MLlib, but I rarely use `ML.OneHotEncoder` and 
have to implement it myself.
Is there any plan? What about deprecating it now and making a breaking change in 
the future?



> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.
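
A minimal sketch of the problem (column names and data are illustrative, and a 
SparkSession named `spark` is assumed): since the current OneHotEncoder is a 
Transformer, the vector size is inferred from each dataset independently, so the 
training and test encodings can end up with different dimensions.
{code:scala}
import org.apache.spark.ml.feature.OneHotEncoder

// Training data has category indices 0, 1, 2; test data only has 0 and 1.
val train = spark.createDataFrame(Seq((0, 0.0), (1, 1.0), (2, 2.0))).toDF("id", "cat")
val test  = spark.createDataFrame(Seq((0, 0.0), (1, 1.0))).toDF("id", "cat")

val encoder = new OneHotEncoder().setInputCol("cat").setOutputCol("catVec")

// Each call infers the number of categories from its own input, so the
// output vectors below have different sizes.
encoder.transform(train).show()
encoder.transform(test).show()
{code}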



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22093) UtilsSuite "resolveURIs with multiple paths" test always cancelled

2017-09-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175794#comment-16175794
 ] 

Hyukjin Kwon commented on SPARK-22093:
--

Would it make sense to just remove that {{assume}}? That {{assume}} does not look 
quite meaningful, as {{Utils.resolveURIs}} seems to also support a single URI as 
the input, and I wonder why we should assume that the input contains {{,}} and 
multiple URIs.
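
For context, a small ScalaTest sketch of the difference being discussed here: a 
failing {{assume}} cancels the test (the CANCELED output above), while a failing 
{{assert}} fails it (the paths below are illustrative).
{code:scala}
import org.scalatest.FunSuite

class AssumeVsAssertSuite extends FunSuite {
  test("assume cancels the test when the precondition does not hold") {
    val paths = Seq("file:/tmp/a")   // only a single path
    assume(paths.length > 1)         // reported as CANCELED, not as a failure
    // the rest of the test body is skipped
  }

  test("assert fails the test when the condition does not hold") {
    val paths = Seq("file:/tmp/a")
    assert(paths.length > 1)         // reported as a plain test failure
  }
}
{code}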

> UtilsSuite "resolveURIs with multiple paths" test always cancelled
> --
>
> Key: SPARK-22093
> URL: https://issues.apache.org/jira/browse/SPARK-22093
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> There's a call to {{assume}} in that test that triggers, causing the 
> test to print an exception and report itself as "cancelled". It only happens 
> in the last unit test (so coverage is fine, I guess), but still, that seems 
> wrong.
> {noformat}
> [info] - resolveURIs with multiple paths !!! CANCELED !!! (15 milliseconds)
> [info]   1 was not greater than 1 (UtilsSuite.scala:491)
> [info]   org.scalatest.exceptions.TestCanceledException:
> [info]   at 
> org.scalatest.Assertions$class.newTestCanceledException(Assertions.scala:531)
> [info]   at 
> org.scalatest.FunSuite.newTestCanceledException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:516)
> [info]   at 
> org.apache.spark.util.UtilsSuite$$anonfun$6.assertResolves$2(UtilsSuite.scala:491)
> [info]   at 
> org.apache.spark.util.UtilsSuite$$anonfun$6.apply$mcV$sp(UtilsSuite.scala:512)
> [info]   at 
> org.apache.spark.util.UtilsSuite$$anonfun$6.apply(UtilsSuite.scala:489)
> [info]   at 
> org.apache.spark.util.UtilsSuite$$anonfun$6.apply(UtilsSuite.scala:489)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22095) java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

2017-09-21 Thread softwarevamp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

softwarevamp closed SPARK-22095.

Resolution: Not A Problem

> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> --
>
> Key: SPARK-22095
> URL: https://issues.apache.org/jira/browse/SPARK-22095
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: softwarevamp
>
> 17/09/22 00:02:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[main,5,main]
> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
> at 
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
> at 
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22095) java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

2017-09-21 Thread softwarevamp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175773#comment-16175773
 ] 

softwarevamp edited comment on SPARK-22095 at 9/22/17 1:29 AM:
---

I am sorry, it was probably my fault:
spark-submit --packages ... --repositories ... *pyspark-shell* py 
--py-files ...
I had an extra 'pyspark-shell' in my command.


was (Author: softwarevamp):
I am sorry, it was probably my fault:
spark-submit --packages ... --repositories ... *pyspark-shell* crawling.py 
--py-files ...
I had an extra 'pyspark-shell' in my command.

> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> --
>
> Key: SPARK-22095
> URL: https://issues.apache.org/jira/browse/SPARK-22095
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: softwarevamp
>
> 17/09/22 00:02:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[main,5,main]
> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
> at 
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
> at 
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-21 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175774#comment-16175774
 ] 

Weichen Xu commented on SPARK-19357:


[~josephkb] I thought about this; the design
`Estimator:: def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], 
parallelism: Int): Seq[M]`
brings another problem:
We want to optimize the memory cost of ML tuning. In the current design, the max 
memory cost while fitting during tuning is parallelism * sizeof(model), but the 
design above returns the full model list, which raises the memory cost to 
numParamMaps * sizeof(model). That can cause OOM when the models are huge.


> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22095) java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

2017-09-21 Thread softwarevamp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175773#comment-16175773
 ] 

softwarevamp commented on SPARK-22095:
--

i am sorry maybe my fault:
spark-submit --packages ... --repositories ... *pyspark-shell* crawling.py 
--py-files ...
i have extra 'pyspark-shell' in my command

> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> --
>
> Key: SPARK-22095
> URL: https://issues.apache.org/jira/browse/SPARK-22095
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: softwarevamp
>
> 17/09/22 00:02:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[main,5,main]
> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
> at 
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
> at 
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22095) java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

2017-09-21 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175770#comment-16175770
 ] 

Jeff Zhang commented on SPARK-22095:


Could you tell how to reproduce this issue ?

> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> --
>
> Key: SPARK-22095
> URL: https://issues.apache.org/jira/browse/SPARK-22095
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: softwarevamp
>
> 17/09/22 00:02:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[main,5,main]
> java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
> at 
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
> at 
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22095) java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

2017-09-21 Thread softwarevamp (JIRA)
softwarevamp created SPARK-22095:


 Summary: java.util.NoSuchElementException: key not found: 
_PYSPARK_DRIVER_CALLBACK_HOST
 Key: SPARK-22095
 URL: https://issues.apache.org/jira/browse/SPARK-22095
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: softwarevamp


17/09/22 00:02:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at 
org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at 
org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at 
org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22094) processAllAvailable should not block forever when a query is stopped

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175687#comment-16175687
 ] 

Apache Spark commented on SPARK-22094:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/19314

> processAllAvailable should not block forever when a query is stopped
> 
>
> Key: SPARK-22094
> URL: https://issues.apache.org/jira/browse/SPARK-22094
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When a query is stopped, `processAllAvailable` may just block forever.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22094) processAllAvailable should not block forever when a query is stopped

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22094:


Assignee: Apache Spark  (was: Shixiong Zhu)

> processAllAvailable should not block forever when a query is stopped
> 
>
> Key: SPARK-22094
> URL: https://issues.apache.org/jira/browse/SPARK-22094
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When a query is stopped, `processAllAvailable` may just block forever.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22094) processAllAvailable should not block forever when a query is stopped

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22094:


Assignee: Shixiong Zhu  (was: Apache Spark)

> processAllAvailable should not block forever when a query is stopped
> 
>
> Key: SPARK-22094
> URL: https://issues.apache.org/jira/browse/SPARK-22094
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When a query is stopped, `processAllAvailable` may just block forever.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22094) processAllAvailable should not block forever when a query is stopped

2017-09-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-22094:


 Summary: processAllAvailable should not block forever when a query 
is stopped
 Key: SPARK-22094
 URL: https://issues.apache.org/jira/browse/SPARK-22094
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When a query is stopped, `processAllAvailable` may just block forever.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175629#comment-16175629
 ] 

Joseph K. Bradley edited comment on SPARK-19357 at 9/21/17 11:14 PM:
-

[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  For some models, 
this could lead to huge speedups (N^2 to N for the case of maxIter for linear 
models).  Any ideas for fixing this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that).

What do you think?


was (Author: josephkb):
[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  Any ideas for fixing 
this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that).

What do you think?

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> 

[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175629#comment-16175629
 ] 

Joseph K. Bradley commented on SPARK-19357:
---

[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  Any ideas for fixing 
this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that).

What do you think?
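
To make the proposal concrete, a rough sketch of what a default implementation of 
such an overload could look like inside Estimator (the thread-pool handling is 
illustrative, not an existing Spark API; subclasses would override it with 
algorithm-specific optimizations):
{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Sketch of a default implementation: fit one model per ParamMap, with at most
// `parallelism` fits running at the same time.
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: Int): Seq[M] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val futures = paramMaps.map(paramMap => Future { fit(dataset, paramMap) })
    futures.map(f => Await.result(f, Duration.Inf)).toSeq
  } finally {
    pool.shutdown()
  }
}
{code}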

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21928) ClassNotFoundException for custom Kryo registrator class during serde in netty threads

2017-09-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21928:
---
Fix Version/s: 2.1.2

> ClassNotFoundException for custom Kryo registrator class during serde in 
> netty threads
> --
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own 
> instance of a KryoSerializer which does not have the defaultClassLoader set 
> on it. For normal task execution, that doesn't cause problems, because the 
> serializer falls back to the current thread's task loader, which is set 
> anyway.
> However, Netty maintains its own thread pool, and those threads don't change 
> their classloader to include the extra user jars needed for the custom Kryo 
> registrator. That only matters when blocks are sent across the network which 
> force serde in the netty thread. That won't happen often, because (a) spark 
> tries to execute tasks where the RDDs are already cached and (b) broadcast 
> blocks generally don't require any serde in the netty threads (that occurs in 
> the task thread that is reading the broadcast value).  However it can come up 
> with remote cache reads, or if fetching a broadcast block forces another 
> block to disk, which requires serialization.
> This doesn't affect the shuffle path, because the serde is never done in the 
> threads created by netty.
> I think a fix for this should be fairly straightforward: we just need to set 
> the classloader on that extra Kryo instance.
>  (original problem description below)
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
> at 
> 

[jira] [Resolved] (SPARK-22053) Implement stream-stream inner join in Append mode

2017-09-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22053.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 19271
[https://github.com/apache/spark/pull/19271]

> Implement stream-stream inner join in Append mode
> -
>
> Key: SPARK-22053
> URL: https://issues.apache.org/jira/browse/SPARK-22053
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 3.0.0
>
>
> Stream-stream inner join is traditionally implemented using a two-way 
> symmetric hash join. At a high level, we want to do the following.
> 1. For each stream, we maintain the past rows as state in State Store. 
> - For each joining key, there can be multiple rows that have been 
> received. 
> - So, we have to effectively maintain a key-to-list-of-values multimap as 
> state for each stream.
> 2. In each batch, for each input row in each stream
> - Look up the other stream's state to see if there are matching rows, and 
> output them if they satisfy the joining condition
> - Add the input row to the corresponding stream's state.
> - If the data has a timestamp/window column with a watermark, then we will 
> use that to calculate the threshold for keys that are required to be buffered 
> for future matches, and drop the rest from the state.
> Cleaning up old, unnecessary state rows depends completely on whether a 
> watermark has been defined and what the join conditions are. We definitely 
> want to support state cleanup for two types of queries that are likely to be 
> common.
> - Queries with time range conditions - E.g. {{SELECT * FROM leftTable, 
> rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 
> MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR}}
> - Queries with windows as the matching key - E.g. {{SELECT * FROM leftTable, 
> rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = 
> window(rightTime, "1 hour")}} (pseudo-SQL)
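For readers following along, a minimal sketch of what such a stream-stream inner 
join could look like with the DataFrame API (illustrative only, not code from the 
pull request; the rate source and the column names are made up for the example):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()
import spark.implicits._

// Two example streaming inputs with watermarks, so old state can eventually be dropped.
val left = spark.readStream.format("rate").load()
  .select($"value".as("leftKey"), $"timestamp".as("leftTime"))
  .withWatermark("leftTime", "1 hour")

val right = spark.readStream.format("rate").load()
  .select($"value".as("rightKey"), $"timestamp".as("rightTime"))
  .withWatermark("rightTime", "2 hours")

// Inner join on the key plus a time-range condition (the first query shape above).
val joined = left.join(right, expr(
  """leftKey = rightKey AND
     leftTime > rightTime - INTERVAL 8 MINUTES AND
     leftTime < rightTime + INTERVAL 1 HOUR"""))

// Stream-stream inner join in Append mode is the case this ticket targets.
joined.writeStream.format("console").outputMode("append").start()
{code}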



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14236) UDAF does not use incomingSchema for update Method

2017-09-21 Thread Guilherme Braccialli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175541#comment-16175541
 ] 

Guilherme Braccialli edited comment on SPARK-14236 at 9/21/17 10:29 PM:


+1 to implement this.

As a workaround, I'm using the code below to make the UDAF code more readable:

{code:java}
  val inputColumns = Map(
  "start" -> TimestampType, 
  "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map{case (name,dataType) 
=> StructField(name,dataType)}.toArray)
  val inputColumnsNameId = inputColumns.zipWithIndex.map{case ((name, 
dataType), position) => (name -> position)}
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")
{code}


PS: I did some tests and identified significant performance overhead if I try to 
resolve field names (by accessing the map inputColumnsNameId) inside the update 
function; that's why I created one val with the respective index for each input 
field. I tested with approximately 1 billion rows.

The same solution applies to bufferSchema.
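For reference, a minimal self-contained sketch of a UDAF built around this 
workaround (the class name, columns, and duration logic are illustrative and not 
from this ticket; it assumes the Spark 2.x UserDefinedAggregateFunction API):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sums the duration in seconds between the "start" and "end" timestamps of each row,
// resolving the input positions once by name instead of hard-coding indices in update().
class SumDurationSeconds extends UserDefinedAggregateFunction {

  // A Seq of pairs keeps the column order explicit and stable.
  private val inputColumns = Seq("start" -> TimestampType, "end" -> TimestampType)
  private val inputIndex = inputColumns.map(_._1).zipWithIndex.toMap
  private val inStart = inputIndex("start")
  private val inEnd = inputIndex("end")

  override def inputSchema: StructType =
    StructType(inputColumns.map { case (name, dt) => StructField(name, dt) })

  override def bufferSchema: StructType = StructType(Seq(StructField("totalSeconds", LongType)))
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(inStart) && !input.isNullAt(inEnd)) {
      val millis = input.getTimestamp(inEnd).getTime - input.getTimestamp(inStart).getTime
      buffer(0) = buffer.getLong(0) + millis / 1000L
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}
{code}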


was (Author: gbraccialli):
+1 to implement this.

As a workaround, I'm using the code below to make the UDAF code more readable:

{code:scala}
  val inputColumns = Map(
  "start" -> TimestampType, 
  "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map{case (name,dataType) 
=> StructField(name,dataType)}.toArray)
  val inputColumnsNameId = inputColumns.zipWithIndex.map{case ((name, 
dataType), position) => (name -> position)}
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")
{code}


PS: I did some tests and identified significant performance overhead if I try to 
resolve field names (by accessing the map inputColumnsNameId) inside the update 
function; that's why I created one val with the respective index for each input 
field. I tested with approximately 1 billion rows.

The same solution applies to bufferSchema.

> UDAF does not use incomingSchema for update Method
> --
>
> Key: SPARK-14236
> URL: https://issues.apache.org/jira/browse/SPARK-14236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Matthias Niehoff
>Priority: Minor
>
> When I specify a schema for the incoming data in an UDAF, the schema will not 
> be applied to the incoming row in the update method. I can only access the 
> fields using their numeric indices and not with their names. The Fields in 
> the row are named input0, input1,...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22060:
--
Target Version/s: 2.3.0

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22060:
-

Assignee: Weichen Xu

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22060:
--
Shepherd: Joseph K. Bradley

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14236) UDAF does not use incomingSchema for update Method

2017-09-21 Thread Guilherme Braccialli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175541#comment-16175541
 ] 

Guilherme Braccialli edited comment on SPARK-14236 at 9/21/17 10:29 PM:


+1 to implement this.

As a workaround, I'm using the code below to make the UDAF code more readable:

{code:scala}
  val inputColumns = Map(
  "start" -> TimestampType, 
  "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map{case (name,dataType) 
=> StructField(name,dataType)}.toArray)
  val inputColumnsNameId = inputColumns.zipWithIndex.map{case ((name, 
dataType), position) => (name -> position)}
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")
{code}


PS: I did some tests and identified significant performance overhead if I try to 
resolve field names (by accessing the map inputColumnsNameId) inside the update 
function; that's why I created one val with the respective index for each input 
field. I tested with approximately 1 billion rows.

The same solution applies to bufferSchema.


was (Author: gbraccialli):
+1 to implement this.

As a workaround, I'm using the code below to make the UDAF code more readable:
  val inputColumns = Map(
  "start" -> TimestampType, 
  "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map{case (name,dataType) 
=> StructField(name,dataType)}.toArray)
  val inputColumnsNameId = inputColumns.zipWithIndex.map{case ((name, 
dataType), position) => (name -> position)}
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")

PS: I did some tests and identified significant performance overhead if I try to 
resolve field names (by accessing the map inputColumnsNameId) inside the update 
function; that's why I created one val with the respective index for each input 
field. I tested with approximately 1 billion rows.

The same solution applies to bufferSchema.

> UDAF does not use incomingSchema for update Method
> --
>
> Key: SPARK-14236
> URL: https://issues.apache.org/jira/browse/SPARK-14236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Matthias Niehoff
>Priority: Minor
>
> When I specify a schema for the incoming data in an UDAF, the schema will not 
> be applied to the incoming row in the update method. I can only access the 
> fields using their numeric indices and not with their names. The Fields in 
> the row are named input0, input1,...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14236) UDAF does not use incomingSchema for update Method

2017-09-21 Thread Guilherme Braccialli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175541#comment-16175541
 ] 

Guilherme Braccialli commented on SPARK-14236:
--

+1 to implement this.

As a workaround, I'm using the code below to make the UDAF code more readable:
  val inputColumns = Map(
  "start" -> TimestampType, 
  "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map{case (name,dataType) 
=> StructField(name,dataType)}.toArray)
  val inputColumnsNameId = inputColumns.zipWithIndex.map{case ((name, 
dataType), position) => (name -> position)}
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")

PS: I did some tests and identified significant performance overhead if I try to 
resolve field names (by accessing the map inputColumnsNameId) inside the update 
function; that's why I created one val with the respective index for each input 
field. I tested with approximately 1 billion rows.

The same solution applies to bufferSchema.

> UDAF does not use incomingSchema for update Method
> --
>
> Key: SPARK-14236
> URL: https://issues.apache.org/jira/browse/SPARK-14236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Matthias Niehoff
>Priority: Minor
>
> When I specify a schema for the incoming data in an UDAF, the schema will not 
> be applied to the incoming row in the update method. I can only access the 
> fields using their numeric indices and not with their names. The Fields in 
> the row are named input0, input1,...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-21 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175479#comment-16175479
 ] 

Eric Vandenberg commented on SPARK-22077:
-

Yes, it worked when I overrode it with "localhost", so it is just a parsing 
issue. However, this is the default hostname in our configuration, so the 
parsing should be more flexible for IPv6 addresses.
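For anyone reproducing this, a small sketch of the underlying parsing behavior 
with plain java.net.URI (which matches the host=null/port=-1 values reported in 
the description; per RFC 2732 an IPv6 literal has to be bracketed for the 
host/port split to be unambiguous):

{code:scala}
import java.net.URI

// Unbracketed IPv6 literal: the colons make the host vs. port split ambiguous.
val bad = new URI("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
println(bad.getHost)   // null
println(bad.getPort)   // -1

// Bracketed form parses cleanly.
val good = new URI("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
println(good.getHost)  // [2401:db00:2111:40a1:face:0:21:0]
println(good.getPort)  // 35243
{code}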

> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> -
>
> Key: SPARK-22077
> URL: https://issues.apache.org/jira/browse/SPARK-22077
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> For example, 
> sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"
> is parsed as:
> host = null
> port = -1
> name = null
> While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.
> This is happening on our production machines and causing spark to not start 
> up.
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
>   at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>   at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
>   at org.apache.spark.executor.Executor.(Executor.scala:121)
>   at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.(SparkContext.scala:507)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21928) ClassNotFoundException for custom Kryo registrator class during serde in netty threads

2017-09-21 Thread John Brock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175452#comment-16175452
 ] 

John Brock commented on SPARK-21928:


Excellent, thanks for looking into this.

> ClassNotFoundException for custom Kryo registrator class during serde in 
> netty threads
> --
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
>
> From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own 
> instance of a KryoSerializer which does not have the defaultClassLoader set 
> on it. For normal task execution, that doesn't cause problems, because the 
> serializer falls back to the current thread's task loader, which is set 
> anyway.
> however, netty maintains its own thread pool, and those threads don't change 
> their classloader to include the extra use jars needed for the custom kryo 
> registrator. That only matters when blocks are sent across the network which 
> force serde in the netty thread. That won't happen often, because (a) spark 
> tries to execute tasks where the RDDs are already cached and (b) broadcast 
> blocks generally don't require any serde in the netty threads (that occurs in 
> the task thread that is reading the broadcast value).  However it can come up 
> with remote cache reads, or if fetching a broadcast block forces another 
> block to disk, which requires serialization.
> This doesn't effect the shuffle path, because the serde is never done in the 
> threads created by netty.
> I think a fix for this should be fairly straight-forward, we just need to set 
> the classloader on that extra kryo instance.
>  (original problem description below)
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> 

[jira] [Updated] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-22083:
-
Description: 
{{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all the 
blocks it intends to evict | 
https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
  However, if there is an exception while dropping blocks, there is no 
{{finally}} block to release all the locks.

If there is only one block being dropped, this isn't a problem (probably).  
Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
{{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
 which cleans up the locks.

I ran into this from the serialization issue in SPARK-21928.  In that, a netty 
thread ends up trying to evict some blocks from memory to disk, and fails.  
When there is only one block that needs to be evicted, and the error occurs, 
there isn't any real problem; I assume that netty thread is dead, but the 
executor threads seem fine.  However, in the cases where two blocks get 
dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
trace from the stuck executor, but I assume it just waits forever on this lock 
that never gets released.

  was:
{{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all the 
blocks it intends to evict | 
https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
  However, if there is an exception while dropping blocks, there is no 
{{finally}} block to release all the locks.

If there is only one block being dropped, this isn't a problem (probably).  
Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
{{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
 which cleans up the locks.

I ran into this from the serialization issue in SPARK-21928.  In that, a netty 
thread ends up trying to evict some blocks from memory to disk, and fails.  
When there is only block that needs to be evicted, and the error occurs, there 
isn't any real problem; I assume that netty thread is dead, but the executor 
threads seem fine.  However, in the cases where two blocks get dropped, one 
task gets completely stuck.  Unfortunately I don't have a stack trace from the 
stuck executor, but I assume it just waits forever on this lock that never gets 
released.


> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.
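For context, a minimal sketch of the try/finally shape the description is asking 
for (illustrative structure with hypothetical helper parameters, not the actual 
MemoryStore code):

{code:scala}
import org.apache.spark.storage.BlockId

// Drop the pre-locked blocks one by one; if any drop fails, release the write
// locks still held on the blocks we never got to, so no task waits forever.
def dropAllOrReleaseLocks(
    blockIds: Seq[BlockId],
    dropOne: BlockId => Unit,        // e.g. delegates to BlockManager.dropFromMemory
    releaseLock: BlockId => Unit): Unit = {
  var remaining = blockIds
  try {
    while (remaining.nonEmpty) {
      dropOne(remaining.head)        // a successful drop takes care of this block's lock
      remaining = remaining.tail
    }
  } finally {
    // Reached with a non-empty list only when dropOne threw part-way through.
    remaining.foreach { id =>
      try releaseLock(id) catch { case _: Throwable => () /* best effort */ }
    }
  }
}
{code}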



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-

[jira] [Commented] (SPARK-21928) ClassNotFoundException for custom Kryo registrator class during serde in netty threads

2017-09-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175410#comment-16175410
 ] 

Imran Rashid commented on SPARK-21928:
--

Thanks [~jbrock], that's great. I think this is fully explained now. I updated 
the title and description so folks know it is not related to ML; hope that is 
OK.

> ClassNotFoundException for custom Kryo registrator class during serde in 
> netty threads
> --
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
>
> From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own 
> instance of a KryoSerializer which does not have the defaultClassLoader set 
> on it. For normal task execution, that doesn't cause problems, because the 
> serializer falls back to the current thread's task loader, which is set 
> anyway.
> However, netty maintains its own thread pool, and those threads don't change 
> their classloader to include the extra user jars needed for the custom Kryo 
> registrator. That only matters when blocks are sent across the network which 
> force serde in the netty thread. That won't happen often, because (a) Spark 
> tries to execute tasks where the RDDs are already cached and (b) broadcast 
> blocks generally don't require any serde in the netty threads (that occurs in 
> the task thread that is reading the broadcast value). However, it can come up 
> with remote cache reads, or if fetching a broadcast block forces another 
> block to disk, which requires serialization.
> This doesn't affect the shuffle path, because the serde is never done in the 
> threads created by netty.
> I think a fix for this should be fairly straightforward; we just need to set 
> the classloader on that extra Kryo instance.
>  (original problem description below)
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> 

[jira] [Updated] (SPARK-21928) ClassNotFoundException for custom Kryo registrator class during serde in netty threads

2017-09-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-21928:
-
Summary: ClassNotFoundException for custom Kryo registrator class during 
serde in netty threads  (was: ML LogisticRegression training occasionally 
produces java.lang.ClassNotFoundException when attempting to load custom Kryo 
registrator class)

> ClassNotFoundException for custom Kryo registrator class during serde in 
> netty threads
> --
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
>
> From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own 
> instance of a KryoSerializer which does not have the defaultClassLoader set 
> on it. For normal task execution, that doesn't cause problems, because the 
> serializer falls back to the current thread's task loader, which is set 
> anyway.
> However, netty maintains its own thread pool, and those threads don't change 
> their classloader to include the extra user jars needed for the custom Kryo 
> registrator. That only matters when blocks are sent across the network which 
> force serde in the netty thread. That won't happen often, because (a) Spark 
> tries to execute tasks where the RDDs are already cached and (b) broadcast 
> blocks generally don't require any serde in the netty threads (that occurs in 
> the task thread that is reading the broadcast value). However, it can come up 
> with remote cache reads, or if fetching a broadcast block forces another 
> block to disk, which requires serialization.
> This doesn't affect the shuffle path, because the serde is never done in the 
> threads created by netty.
> I think a fix for this should be fairly straightforward; we just need to set 
> the classloader on that extra Kryo instance.
>  (original problem description below)
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> 

[jira] [Updated] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-21928:
-
Description: 
From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own instance 
of a KryoSerializer which does not have the defaultClassLoader set on it. For 
normal task execution, that doesn't cause problems, because the serializer 
falls back to the current thread's task loader, which is set anyway.

However, netty maintains its own thread pool, and those threads don't change 
their classloader to include the extra user jars needed for the custom Kryo 
registrator. That only matters when blocks are sent across the network which 
force serde in the netty thread. That won't happen often, because (a) Spark 
tries to execute tasks where the RDDs are already cached and (b) broadcast 
blocks generally don't require any serde in the netty threads (that occurs in 
the task thread that is reading the broadcast value). However, it can come up 
with remote cache reads, or if fetching a broadcast block forces another block 
to disk, which requires serialization.

This doesn't affect the shuffle path, because the serde is never done in the 
threads created by netty.

I think a fix for this should be fairly straightforward; we just need to set 
the classloader on that extra Kryo instance.

 (original problem description below)

I unfortunately can't reliably reproduce this bug; it happens only 
occasionally, when training a logistic regression model with very large 
datasets. The training will often proceed through several {{treeAggregate}} 
calls without any problems, and then suddenly workers will start running into 
this {{java.lang.ClassNotFoundException}}.

After doing some debugging, it seems that whenever this error happens, Spark is 
trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} instance 
instead of the usual {{org.apache.spark.util.MutableURLClassLoader}}. 
{{MutableURLClassLoader}} can see my custom Kryo registrator, but the 
{{AppClassLoader}} instance can't.

When this error does pop up, it's usually accompanied by the task seeming to 
hang, and I need to kill Spark manually.

I'm running a Spark application in cluster mode via spark-submit, and I have a 
custom Kryo registrator. The JAR is built with {{sbt assembly}}.

Exception message:

{noformat}
17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
/10.0.29.65:34332
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
at 
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
at 
org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
at 
org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
at 
org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
at 
org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
at 
org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
at 
org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
at 
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
at 
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
at 
org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
at 
org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
at 
org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at scala.Option.map(Option.scala:146)
at 

[jira] [Commented] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175385#comment-16175385
 ] 

Apache Spark commented on SPARK-21928:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/19313
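
For readers skimming this thread, a sketch of the general idea described in the 
updated issue description (illustrative only; this is not the diff in the linked 
pull request):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// The SerializerManager keeps an extra KryoSerializer for block serde. The idea
// of the fix is to give that instance a default classloader that can see the
// user jars, instead of letting it fall back to the calling (netty) thread's
// context classloader. setDefaultClassLoader is an existing method on
// org.apache.spark.serializer.Serializer.
val conf = new SparkConf()
// Stand-in: the real fix would pass whichever loader actually contains the user jars.
val userJarClassLoader: ClassLoader = Thread.currentThread().getContextClassLoader
val kryoForBlockSerde = new KryoSerializer(conf)
kryoForBlockSerde.setDefaultClassLoader(userJarClassLoader)
{code}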

> ML LogisticRegression training occasionally produces 
> java.lang.ClassNotFoundException when attempting to load custom Kryo 
> registrator class
> ---
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
>
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
> at 
> org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
> at 
> org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
> at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
> at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
> at 

[jira] [Assigned] (SPARK-22071) Improve release build scripts to check correct JAVA version is being used for build

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22071:


Assignee: Apache Spark  (was: holdenk)

> Improve release build scripts to check correct JAVA version is being used for 
> build
> ---
>
> Key: SPARK-22071
> URL: https://issues.apache.org/jira/browse/SPARK-22071
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: Apache Spark
>
> The current release scripts assume the correct JAVA_HOME is set for the 
> release. While this isn't an issue when they are called in Jenkins, it would 
> be reasonable to check this for release managers who may decide to build 
> outside of Jenkins (or as we migrate our release build environment into 
> Docker).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22072) Allow the same shell params to be used for all of the different steps in release-build

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175352#comment-16175352
 ] 

Apache Spark commented on SPARK-22072:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19312

> Allow the same shell params to be used for all of the different steps in 
> release-build
> --
>
> Key: SPARK-22072
> URL: https://issues.apache.org/jira/browse/SPARK-22072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.2, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>
> The Jenkins script currently sets SPARK_VERSION to different values depending 
> on what action is being performed. To simplify the scripts, use 
> SPARK_PACKAGE_VERSION in release-publish.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22071) Improve release build scripts to check correct JAVA version is being used for build

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22071:


Assignee: holdenk  (was: Apache Spark)

> Improve release build scripts to check correct JAVA version is being used for 
> build
> ---
>
> Key: SPARK-22071
> URL: https://issues.apache.org/jira/browse/SPARK-22071
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>
> The current release scripts assume the correct JAVA_HOME is set for the 
> release. While this isn't an issue when they are called in Jenkins, it would 
> be reasonable to check this for release managers who may decide to build 
> outside of Jenkins (or as we migrate our release build environment into 
> Docker).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22071) Improve release build scripts to check correct JAVA version is being used for build

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175354#comment-16175354
 ] 

Apache Spark commented on SPARK-22071:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19312

> Improve release build scripts to check correct JAVA version is being used for 
> build
> ---
>
> Key: SPARK-22071
> URL: https://issues.apache.org/jira/browse/SPARK-22071
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>
> The current release scripts assume the correct JAVA_HOME is set for the 
> release. While this isn't an issue when they are called in Jenkins, it would 
> be reasonable to check this for release managers who may decide to build 
> outside of Jenkins (or as we migrate our release build environment into 
> Docker).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22072) Allow the same shell params to be used for all of the different steps in release-build

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22072:


Assignee: Apache Spark  (was: holdenk)

> Allow the same shell params to be used for all of the different steps in 
> release-build
> --
>
> Key: SPARK-22072
> URL: https://issues.apache.org/jira/browse/SPARK-22072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.2, 2.3.0
>Reporter: holdenk
>Assignee: Apache Spark
>
> The Jenkins script currently sets SPARK_VERSION to different values depending 
> on what action is being performed. To simplify the scripts, use 
> SPARK_PACKAGE_VERSION in release-publish.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22072) Allow the same shell params to be used for all of the different steps in release-build

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22072:


Assignee: holdenk  (was: Apache Spark)

> Allow the same shell params to be used for all of the different steps in 
> release-build
> --
>
> Key: SPARK-22072
> URL: https://issues.apache.org/jira/browse/SPARK-22072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.2, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>
> The Jenkins script currently sets SPARK_VERSION to different values depending 
> on what action is being performed. To simplify the scripts, use 
> SPARK_PACKAGE_VERSION in release-publish.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175309#comment-16175309
 ] 

Apache Spark commented on SPARK-22083:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/19311

> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22083:


Assignee: (was: Apache Spark)

> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22083:


Assignee: Apache Spark

> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22009) Using treeAggregate improve some algs

2017-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22009.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19232
[https://github.com/apache/spark/pull/19232]

> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> I tested on a dataset of about 13M instances and found that using 
> `treeAggregate` gives a speedup in the following algorithms:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 7%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 
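
A small illustration of the {{aggregate}} versus {{treeAggregate}} API being switched to; a toy sum, not the patch itself, and the dataset size and depth are arbitrary.

{code}
// Toy comparison of aggregate vs treeAggregate. With many partitions,
// treeAggregate merges partial results in a tree on the executors instead of
// sending every partition result straight to the driver.
import org.apache.spark.sql.SparkSession

object TreeAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("treeAgg").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1L to 1000000L, numSlices = 200)

    // plain aggregate: all 200 partition results are combined on the driver
    val s1 = rdd.aggregate(0L)(_ + _, _ + _)

    // treeAggregate: partition results are combined in a 2-level tree first
    val s2 = rdd.treeAggregate(0L)(_ + _, _ + _, depth = 2)

    assert(s1 == s2)
    println(s"sum = $s2")
    spark.stop()
  }
}
{code}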



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175303#comment-16175303
 ] 

Jakub Nowacki commented on SPARK-18136:
---

I've tried using the Windows command {{mklink}} to create symbolic links, but it 
seems to resolve the folder in {{%~dp0}} to the Scripts folder 
{{C:\Tools\Anaconda3\Scripts\}}.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that the pip installer for PySpark works on Windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22009) Using treeAggregate improve some algs

2017-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22009:
-

Assignee: zhengruifeng

> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> I tested on a dataset of about 13M instances and found that using 
> `treeAggregate` gives a speedup in the following algorithms:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 7%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22075) GBTs forgot to unpersist datasets cached by Checkpointer

2017-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22075:
-

  Assignee: zhengruifeng
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> GBTs forgot to unpersist datasets cached by Checkpointer
> 
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> {{PeriodicRDDCheckpointer}} will automatically persist the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, the last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> 
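
A possible caller-side workaround, not the fix that went into Spark: after {{fit()}}, unpersist whatever intermediate RDDs training left cached. The sketch assumes the same libsvm sample file used above.

{code}
// Cleans up intermediate RDDs left cached by GBT training. This is a
// workaround sketch for callers, not the change made inside Spark.
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.SparkSession

object UnpersistAfterGbtFit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("gbt-cleanup").getOrCreate()
    val sc = spark.sparkContext
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
    dataset.persist()

    val cachedBefore = sc.getPersistentRDDs.keySet
    val model = new GBTRegressor().fit(dataset)

    // unpersist only the RDDs that appeared during training
    sc.getPersistentRDDs
      .filter { case (id, _) => !cachedBefore.contains(id) }
      .values
      .foreach(_.unpersist(blocking = false))

    println(s"trained ${model.uid}; cached RDDs now: ${sc.getPersistentRDDs.size}")
    spark.stop()
  }
}
{code}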

[jira] [Resolved] (SPARK-22075) GBTs forgot to unpersist datasets cached by Checkpointer

2017-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22075.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19288
[https://github.com/apache/spark/pull/19288]

> GBTs forgot to unpersist datasets cached by Checkpointer
> 
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
> Fix For: 2.3.0
>
>
> {{PeriodicRDDCheckpointer}} will automatically persist the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, the last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> MapPartitionsRDD[307] at mapPartitions 
> at 

[jira] [Resolved] (SPARK-22088) Incorrect scalastyle comment causes wrong styles in stringExpressions

2017-09-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22088.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0

> Incorrect scalastyle comment causes wrong styles in stringExpressions
> -
>
> Key: SPARK-22088
> URL: https://issues.apache.org/jira/browse/SPARK-22088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> There is an incorrect {{scalastyle:on}} comment in `stringExpressions.scala`, 
> which makes the line length limit check ineffective in that file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175276#comment-16175276
 ] 

Jakub Nowacki commented on SPARK-18136:
---

[PR|https://github.com/apache/spark/pull/19310] fixes how {{spark-class2.cmd}} 
looks for the jars directory on Windows. It fails to find the jars and start the 
JVM because the condition for the env variable {{SPARK_JARS_DIR}} looks for 
{{%SPARK_HOME%\RELEASE}}, which is not included in the {{pip/conda}} build. 
Instead, it should look for {{%SPARK_HOME%\jars}}, which it later refers to.

The above fixes the errors while importing {{pyspark}} into Python and creating a 
SparkSession, but there is still an issue calling {{pyspark.cmd}}. Namely, a 
normal command call on the command line, without a path specification, fails with 
{{System cannot find the path specified.}}. This is likely due to the script link 
being resolved to the Scripts folder in Anaconda, e.g. 
{{C:\Tools\Anaconda3\Scripts\pyspark.cmd}}. If the script is run via the full 
path to the PySpark package, e.g. 
{{\Tools\Anaconda3\Lib\site-packages\pyspark\bin\pyspark.cmd}}, it works fine. 
This is likely because {{SPARK_HOME}} is resolved as {{set SPARK_HOME=%~dp0..}}, 
which in the case of the system call resolves (likely) to {{\Tools\Anaconda3\}} 
when it should resolve to {{\Tools\Anaconda3\Lib\site-packages\pyspark\}}. Since 
I don't know CMD scripting that well, I haven't found a solution to this issue 
yet, apart from the workaround, i.e. calling it with the full (direct) path.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that the pip installer for PySpark works on Windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175258#comment-16175258
 ] 

Apache Spark commented on SPARK-18136:
--

User 'jsnowacki' has created a pull request for this issue:
https://github.com/apache/spark/pull/19310

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that the pip installer for PySpark works on Windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18136:


Assignee: (was: Apache Spark)

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that the pip installer for PySpark works on Windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18136:


Assignee: Apache Spark

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Assignee: Apache Spark
>
> Make sure that the pip installer for PySpark works on Windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175173#comment-16175173
 ] 

Apache Spark commented on SPARK-19558:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19309

> Provide a config option to attach QueryExecutionListener to SparkSession
> 
>
> Key: SPARK-19558
> URL: https://issues.apache.org/jira/browse/SPARK-19558
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Salil Surendran
>
> Provide a configuration property (just like spark.extraListeners) to attach a 
> QueryExecutionListener to a SparkSession
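
For context, a sketch of how a {{QueryExecutionListener}} is attached programmatically today via {{spark.listenerManager}}; the configuration key this ticket proposes is not defined here and is not shown.

{code}
// Programmatic registration of a QueryExecutionListener, the manual step the
// proposed config option would replace. Uses the standard Spark 2.x APIs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

object ListenerRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("qel").getOrCreate()

    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"$funcName succeeded in ${durationNs / 1e6} ms")
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        println(s"$funcName failed: ${exception.getMessage}")
    })

    spark.range(10).count() // triggers onSuccess for this action
    spark.stop()
  }
}
{code}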



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19558:


Assignee: (was: Apache Spark)

> Provide a config option to attach QueryExecutionListener to SparkSession
> 
>
> Key: SPARK-19558
> URL: https://issues.apache.org/jira/browse/SPARK-19558
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Salil Surendran
>
> Provide a configuration property (just like spark.extraListeners) to attach a 
> QueryExecutionListener to a SparkSession



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19558:


Assignee: Apache Spark

> Provide a config option to attach QueryExecutionListener to SparkSession
> 
>
> Key: SPARK-19558
> URL: https://issues.apache.org/jira/browse/SPARK-19558
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Salil Surendran
>Assignee: Apache Spark
>
> Provide a configuration property (just like spark.extraListeners) to attach a 
> QueryExecutionListener to a SparkSession



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175158#comment-16175158
 ] 

Timothy Hunter commented on SPARK-21866:


Putting this code under {{org.apache.spark.ml.image}} sounds good to me. Based 
on the initial exploration, it should not be too hard to integrate this in the 
data source framework. I am going to submit this proposal to a vote on the dev 
mailing list.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is 

[jira] [Resolved] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21928.

   Resolution: Fixed
 Assignee: Imran Rashid
Fix Version/s: 2.3.0
   2.2.1

> ML LogisticRegression training occasionally produces 
> java.lang.ClassNotFoundException when attempting to load custom Kryo 
> registrator class
> ---
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
>
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
> at 
> org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
> at 
> org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
> at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
> at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
> at 
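
A minimal sketch of the kind of setup the reporter describes: Kryo serialization with a custom registrator configured through {{spark.kryo.registrator}}. {{MyRecord}} and the registrator class are placeholders, not the reporter's actual code.

{code}
// Minimal example of a custom Kryo registrator wired up via SparkConf.
// MyRecord stands in for the application's own classes; the registration call
// is what fails when the wrong ClassLoader is used to load the registrator.
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, features: Array[Double])

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}

object KryoConfSketch {
  def conf: SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)

  def main(args: Array[String]): Unit =
    println(conf.get("spark.kryo.registrator"))
}
{code}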

[jira] [Resolved] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22061.

Resolution: Won't Fix

> Add pipeline model of SVM
> -
>
> Key: SPARK-22061
> URL: https://issues.apache.org/jira/browse/SPARK-22061
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Jiaming Shu
>
> add pipeline implementation of SVM



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175124#comment-16175124
 ] 

Nick Pentreath commented on SPARK-22061:


Agreed, this already exists. I closed this issue.

> Add pipeline model of SVM
> -
>
> Key: SPARK-22061
> URL: https://issues.apache.org/jira/browse/SPARK-22061
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Jiaming Shu
>
> add pipeline implementation of SVM



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22093) UtilsSuite "resolveURIs with multiple paths" test always cancelled

2017-09-21 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22093:
--

 Summary: UtilsSuite "resolveURIs with multiple paths" test always 
cancelled
 Key: SPARK-22093
 URL: https://issues.apache.org/jira/browse/SPARK-22093
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Trivial


There's a call to {{assume}} in that test that is triggering, and causing the 
test to print an exception and report itself as "cancelled". It only happens in 
the last unit test (so coverage is fine, I guess), but still, that seems wrong.

{noformat}
[info] - resolveURIs with multiple paths !!! CANCELED !!! (15 milliseconds)
[info]   1 was not greater than 1 (UtilsSuite.scala:491)
[info]   org.scalatest.exceptions.TestCanceledException:
[info]   at 
org.scalatest.Assertions$class.newTestCanceledException(Assertions.scala:531)
[info]   at org.scalatest.FunSuite.newTestCanceledException(FunSuite.scala:1560)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:516)
[info]   at 
org.apache.spark.util.UtilsSuite$$anonfun$6.assertResolves$2(UtilsSuite.scala:491)
[info]   at 
org.apache.spark.util.UtilsSuite$$anonfun$6.apply$mcV$sp(UtilsSuite.scala:512)
[info]   at 
org.apache.spark.util.UtilsSuite$$anonfun$6.apply(UtilsSuite.scala:489)
[info]   at 
org.apache.spark.util.UtilsSuite$$anonfun$6.apply(UtilsSuite.scala:489)
{noformat}
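
A tiny ScalaTest sketch of the {{assume}} behaviour behind this report; the suite name and values are made up.

{code}
// A failing assume() cancels the test (TestCanceledException) instead of
// failing it, which is what produces the "CANCELED" output quoted above.
import org.scalatest.FunSuite

class AssumeVsAssertSuite extends FunSuite {
  test("resolves multiple paths") {
    val paths = Seq("/tmp/a")      // stand-in for the multiple-paths precondition
    assume(paths.length > 1)       // false here, so the test is reported as CANCELED
    assert(paths.mkString(",").nonEmpty)
  }
}
{code}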




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-22083:
-
Description: 
{{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all the 
blocks it intends to evict | 
https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
  However, if there is an exception while dropping blocks, there is no 
{{finally}} block to release all the locks.

If there is only one block being dropped, this isn't a problem (probably).  
Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
{{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
 which cleans up the locks.

I ran into this from the serialization issue in SPARK-21928.  In that, a netty 
thread ends up trying to evict some blocks from memory to disk, and fails.  
When there is only one block that needs to be evicted, and the error occurs, there 
isn't any real problem; I assume that netty thread is dead, but the executor 
threads seem fine.  However, in the cases where two blocks get dropped, one 
task gets completely stuck.  Unfortunately I don't have a stack trace from the 
stuck executor, but I assume it just waits forever on this lock that never gets 
released.

  was:
{{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all the 
blocks it intends to evict | 
https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
  However, if there is an exception while dropping blocks, there is no 
{{finally}} block to release all the locks.

If there is only block being dropped, this isn't a problem (probably).  Usually 
the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> dropBlocks 
--> BlockManager.dropFromMemory --> DiskStore.put}}.  And {{DiskStore.put}} 
does do a [{{removeBlock()}} in a {{finally}} 
block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
 which cleans up the locks.

I ran into this from the serialization issue in SPARK-21928.  In that, a netty 
thread ends up trying to evict some blocks from memory to disk, and fails.  
When there is only block that needs to be evicted, and the error occurs, there 
isn't any real problem; I assume that netty thread is dead, but the executor 
threads seem fine.  However, in the cases where two blocks get dropped, one 
task gets completely stuck.  Unfortunately I don't have a stack trace from the 
stuck executor, but I assume it just waits forever on this lock that never gets 
released.


> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To 

[jira] [Commented] (SPARK-22092) Reallocation in OffHeapColumnVector.reserveInternal corrupts array data

2017-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175087#comment-16175087
 ] 

Apache Spark commented on SPARK-22092:
--

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/19308

> Reallocation in OffHeapColumnVector.reserveInternal corrupts array data
> ---
>
> Key: SPARK-22092
> URL: https://issues.apache.org/jira/browse/SPARK-22092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> OffHeapColumnVector.reserveInternal() will only copy already inserted values 
> during reallocation if this.data != null. However, for vectors containing 
> arrays, the data field is unused and always equals null. Hence, reallocating 
> an array vector always causes data loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22092) Reallocation in OffHeapColumnVector.reserveInternal corrupts array data

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22092:


Assignee: Apache Spark

> Reallocation in OffHeapColumnVector.reserveInternal corrupts array data
> ---
>
> Key: SPARK-22092
> URL: https://issues.apache.org/jira/browse/SPARK-22092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>Assignee: Apache Spark
>
> OffHeapColumnVector.reserveInternal() will only copy already inserted values 
> during reallocation if this.data != null. However, for vectors containing 
> arrays, the data field is unused and always equals null. Hence, reallocating 
> an array vector always causes data loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22092) Reallocation in OffHeapColumnVector.reserveInternal corrupts array data

2017-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22092:


Assignee: (was: Apache Spark)

> Reallocation in OffHeapColumnVector.reserveInternal corrupts array data
> ---
>
> Key: SPARK-22092
> URL: https://issues.apache.org/jira/browse/SPARK-22092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> OffHeapColumnVector.reserveInternal() will only copy already inserted values 
> during reallocation if this.data != null. However, for vectors containing 
> arrays, the data field is unused and always equals null. Hence, reallocating 
> an array vector always causes data loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22092) Reallocation in OffHeapColumnVector.reserveInternal corrupts array data

2017-09-21 Thread Ala Luszczak (JIRA)
Ala Luszczak created SPARK-22092:


 Summary: Reallocation in OffHeapColumnVector.reserveInternal 
corrupts array data
 Key: SPARK-22092
 URL: https://issues.apache.org/jira/browse/SPARK-22092
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Ala Luszczak


OffHeapColumnVector.reserveInternal() will only copy already inserted values 
during reallocation if this.data != null. However, for vectors containing 
arrays, the data field is unused and always equals null. Hence, reallocating an 
array vector always causes data loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-21 Thread John Brock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175066#comment-16175066
 ] 

John Brock commented on SPARK-21928:


It does! I see this in the log right before an executor got stuck:
{{17/08/31 19:56:25 INFO MemoryStore: 3 blocks selected for dropping (284.2 MB 
bytes)}}

> ML LogisticRegression training occasionally produces 
> java.lang.ClassNotFoundException when attempting to load custom Kryo 
> registrator class
> ---
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
> at 
> org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
> at 
> org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
> at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
> at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
> at 

[jira] [Closed] (SPARK-22091) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guichaoxian closed SPARK-22091.
---
Resolution: Duplicate

> There is no need for fileStatusCache to invalidateAll  when InMemoryFileIndex 
> refresh 
> --
>
> Key: SPARK-22091
> URL: https://issues.apache.org/jira/browse/SPARK-22091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guichaoxian
>
> the fileStatusCache is a globally shared cache; refresh() only applies to one 
> table, so we do not need to invalidate every entry in the fileStatusCache
> {code:java}
>   /** Globally shared (not exclusive to this table) cache for file statuses 
> to speed up listing. */
>   private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
> {code}
> {code:java}
>  override def refresh(): Unit = {
> refresh0()
> fileStatusCache.invalidateAll()
>   }
> {code}
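
A self-contained toy sketch of the idea: keep the cache shared, but let refresh evict only the refreshed table's root paths. The cache type and its methods are stand-ins, not Spark's FileStatusCache API.

{code}
// Toy model of a globally shared file-status cache where refresh() evicts only
// one table's entries instead of clearing everything. All names are stand-ins.
import scala.collection.concurrent.TrieMap

object SharedListingCache {
  // root path -> listed file names, shared by every table
  private val cache = TrieMap.empty[String, Seq[String]]

  def put(path: String, files: Seq[String]): Unit = cache.put(path, files)
  def get(path: String): Option[Seq[String]] = cache.get(path)

  // what refresh() effectively does today via invalidateAll()
  def invalidateAll(): Unit = cache.clear()

  // what this issue suggests: evict only the refreshed table's root paths
  def invalidate(rootPaths: Seq[String]): Unit = rootPaths.foreach(cache.remove)
}

object RefreshSketch {
  def main(args: Array[String]): Unit = {
    SharedListingCache.put("/warehouse/t1", Seq("part-00000", "part-00001"))
    SharedListingCache.put("/warehouse/t2", Seq("part-00000"))

    SharedListingCache.invalidate(Seq("/warehouse/t1")) // t2's listing survives
    assert(SharedListingCache.get("/warehouse/t2").isDefined)
    assert(SharedListingCache.get("/warehouse/t1").isEmpty)
    println("only t1's entries were evicted")
  }
}
{code}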



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22090) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guichaoxian closed SPARK-22090.
---

> There is no need for fileStatusCache to invalidateAll  when InMemoryFileIndex 
> refresh 
> --
>
> Key: SPARK-22090
> URL: https://issues.apache.org/jira/browse/SPARK-22090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guichaoxian
>
> the fileStatusCache is a globally shared cache; refresh() only applies to one 
> table, so we do not need to invalidate every entry in the fileStatusCache
> {code:java}
>   /** Globally shared (not exclusive to this table) cache for file statuses 
> to speed up listing. */
>   private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
> {code}
> {code:java}
>  override def refresh(): Unit = {
> refresh0()
> fileStatusCache.invalidateAll()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21653) Complement SQL expression document

2017-09-21 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-21653.
-
Resolution: Fixed

> Complement SQL expression document
> --
>
> Key: SPARK-21653
> URL: https://issues.apache.org/jira/browse/SPARK-21653
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We have {{ExpressionDescription}} for SQL expressions. The expression 
> description tells users an expression's usage, arguments, and examples. Users 
> can see how to use those expressions via the {{DESCRIBE}} command in SQL:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED In;
> Function: in
> Class: org.apache.spark.sql.catalyst.expressions.In
> Usage: expr1 in(expr2, expr3, ...) - Returns true if `expr` equals to any 
> valN.
> Extended Usage:
> No example/argument for in.
> {code}
> Not all SQL expressions have a complete description yet. For example, in the 
> above case, there is no example for the function {{in}}. This task is going to 
> complement the expression descriptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22086) Add expression description for CASE WHEN

2017-09-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22086.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19304
[https://github.com/apache/spark/pull/19304]

> Add expression description for CASE WHEN
> 
>
> Key: SPARK-22086
> URL: https://issues.apache.org/jira/browse/SPARK-22086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Among the SQL conditional expressions, only CASE WHEN lacks an expression 
> description. This trivial ticket fills the gap.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22086) Add expression description for CASE WHEN

2017-09-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22086:


Assignee: Liang-Chi Hsieh

> Add expression description for CASE WHEN
> 
>
> Key: SPARK-22086
> URL: https://issues.apache.org/jira/browse/SPARK-22086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Among the SQL conditional expressions, only CASE WHEN lacks an expression 
> description. This trivial ticket fills the gap.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2017-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17997.
-
Resolution: Fixed

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.
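
A rough sketch of the per-bin NDV computation with existing DataFrame operations; bin width and column names are made up, and the proposed aggregate would compute this for all intervals in a single pass.

{code}
// Counts distinct values inside each histogram bin using plain DataFrame ops,
// as a stand-in for the dedicated aggregate function this ticket proposes.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NdvPerBinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ndv-per-bin").getOrCreate()
    import spark.implicits._

    // synthetic column of int values in [0, 100); bins of width 25 correspond
    // to endpoints 0, 25, 50, 75, 100 of an equi-height-style histogram
    val values = spark.range(0, 100000).select((rand(7L) * 100).cast("int").as("v"))

    val binned = values.withColumn("bin", floor($"v" / 25).cast("int"))

    binned.groupBy("bin")
      .agg(countDistinct($"v").as("ndv")) // the per-interval NDV the agg function would return
      .orderBy("bin")
      .show()

    spark.stop()
  }
}
{code}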



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2017-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-17997:
-
  Assignee: Zhenhua Wang

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>
> This is for computing the ndv (number of distinct values) for bins in equi-height histograms. A bin 
> consists of two endpoints, which form an interval of values, and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an aggregate function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22090) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22090.
---
Resolution: Duplicate

> There is no need for fileStatusCache to invalidateAll  when InMemoryFileIndex 
> refresh 
> --
>
> Key: SPARK-22090
> URL: https://issues.apache.org/jira/browse/SPARK-22090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guichaoxian
>
> The fileStatusCache is a globally shared cache, while refresh() applies to only one table, so 
> we do not need to invalidate every entry in fileStatusCache.
> {code:java}
>   /** Globally shared (not exclusive to this table) cache for file statuses 
> to speed up listing. */
>   private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
> {code}
> {code:java}
>  override def refresh(): Unit = {
> refresh0()
> fileStatusCache.invalidateAll()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22091) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)
guichaoxian created SPARK-22091:
---

 Summary: There is no need for fileStatusCache to invalidateAll  
when InMemoryFileIndex refresh 
 Key: SPARK-22091
 URL: https://issues.apache.org/jira/browse/SPARK-22091
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: guichaoxian


The fileStatusCache is a globally shared cache, while refresh() applies to only one table, so 
we do not need to invalidate every entry in fileStatusCache.
{code:java}
  /** Globally shared (not exclusive to this table) cache for file statuses to 
speed up listing. */
  private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
{code}
{code:java}
 override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
  }
{code}
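
For illustration, a minimal sketch of the narrower eviction this ticket asks for. The per-path {{invalidate(path)}} call is hypothetical - the {{FileStatusCache}} trait currently only exposes {{invalidateAll()}} - so an actual fix would first need to add such a method (or scope the cache differently):

{code:java}
// Hypothetical sketch, not current API: evict only this table's root paths
// from the globally shared cache instead of dropping every table's entries.
override def refresh(): Unit = {
  refresh0()
  rootPaths.foreach { path =>
    fileStatusCache.invalidate(path)  // per-path eviction does not exist yet
  }
}
{code}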



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22090) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)
guichaoxian created SPARK-22090:
---

 Summary: There is no need for fileStatusCache to invalidateAll  
when InMemoryFileIndex refresh 
 Key: SPARK-22090
 URL: https://issues.apache.org/jira/browse/SPARK-22090
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: guichaoxian


The fileStatusCache is a globally shared cache, while refresh() applies to only one table, so 
we do not need to invalidate every entry in fileStatusCache.
{code:java}
  /** Globally shared (not exclusive to this table) cache for file statuses to 
speed up listing. */
  private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
{code}
{code:java}
 override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22089) There is no need for fileStatusCache to invalidateAll when InMemoryFileIndex refresh

2017-09-21 Thread guichaoxian (JIRA)
guichaoxian created SPARK-22089:
---

 Summary: There is no need for fileStatusCache to invalidateAll  
when InMemoryFileIndex refresh 
 Key: SPARK-22089
 URL: https://issues.apache.org/jira/browse/SPARK-22089
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: guichaoxian


The fileStatusCache is a globally shared cache, while refresh() applies to only one table, so 
we do not need to invalidate every entry in fileStatusCache.
{code:java}
  /** Globally shared (not exclusive to this table) cache for file statuses to 
speed up listing. */
  private val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
{code}
{code:java}
 override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22034) CrossValidator's training and testing set with different set of labels, resulting in encoder transform error

2017-09-21 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174729#comment-16174729
 ] 

Weichen Xu commented on SPARK-22034:


Do you mean a pipeline including a VectorIndexer stage plus a CrossValidator stage?
Can you post a minimal program that reproduces the bug?

> CrossValidator's training and testing set with different set of labels, 
> resulting in encoder transform error
> 
>
> Key: SPARK-22034
> URL: https://issues.apache.org/jira/browse/SPARK-22034
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: Ubuntu 16.04
> Scala 2.11
> Spark 2.2.0
>Reporter: AnChe Kuo
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Let's say we have a VectorIndexer with maxCategories set to 13, and the training 
> set has a column containing a month label.
> In CrossValidator, the dataframe is split into training and testing sets 
> automatically. It could happen that the training set lacks month 2 
> (by chance, or quite frequently if the labels are unbalanced).
> When the training set is trained within the cross validator, the pipeline 
> is fitted with the training set only, resulting in a partial key map in 
> VectorIndexer. When this pipeline is used to transform the prediction set, 
> VectorIndexer will throw a "key not found" error.
> Making CrossValidator an estimator that can be connected to a whole 
> pipeline is a cool idea, but bugs like this occur and are not expected.
> The solution, I am guessing, would be to check each stage in the pipeline, 
> and when we see an encoder-type stage, fit the stage model with the complete 
> dataset.
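
As a user-side workaround while this is open, here is a hedged sketch (assuming a DataFrame {{fullData}} with a {{features}} vector column and a binary {{label}} column, both hypothetical names): fit the VectorIndexer once on the complete dataset, then cross-validate only the downstream estimator, so every fold shares the same category map.

{code:java}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Fit the indexer on the full dataset so every category (e.g. all 12 months)
// is present in its map, regardless of how the folds are split later.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(13)
val indexed = indexer.fit(fullData).transform(fullData)

// Cross-validate only the estimator; the indexer is no longer re-fitted per fold.
val lr = new LogisticRegression().setFeaturesCol("indexedFeatures")
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
val cvModel = cv.fit(indexed)
{code}

This sidesteps the "key not found" error at the cost of a small information leak (the indexer sees the evaluation folds), which may or may not be acceptable.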



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-09-21 Thread Marco Veluscek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174727#comment-16174727
 ] 

Marco Veluscek commented on SPARK-16845:


Hello,
I have just encountered a similar issue when doing _except_ on two large 
dataframes.
My code fails with an exception on Spark 2.1.0. The same code works on 
Spark 2.2.0, but logs several exceptions.
Since I have to work with 2.1.0 because of company policies, I would like to 
know whether there is a way to fix or work around this issue in 2.1.0.

Here are more details about the problem.
On my company cluster, I am working with Spark version 2.1.0.cloudera1 using 
Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112).

The two dataframes have about 1 million rows and 467 columns.
When I do the _except_ {{dataframe1.except(dataframe2)}} I get the following 
exception:
{code:title=Exception_with_2.1.0}
scheduler.TaskSetManager: Lost task 10.0 in stage 80.0 (TID 4146, 
cdhworker05.itec.lab, executor 4): java.util.concurrent.ExecutionException: 
java.lang.Exception: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
grows beyond 64 KB
{code}

Then the logs show the generated code for the {{SpecificPredicate}} class, which 
is more than 5,000 lines long.

I wrote a small script to reproduce the error:
{code:title=testExcept.scala}
// Meant to be pasted into spark-shell, where `spark` is already defined.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Builds a DataFrame with `rows` rows and `cols` integer columns, all set to 1,
// so that except() has to compare two identical, very wide DataFrames.
def start(rows: Int, cols: Int, spark: SparkSession) = {
  val data = (1 to rows).map(_ => Seq.fill(cols)(1))

  // Columns are simply named "1".."cols".
  val sch = StructType((1 to cols).map(i => StructField(i.toString, IntegerType, true)))

  val rdd = spark.sparkContext.parallelize(data.map(x => Row(x: _*)))
  spark.sqlContext.createDataFrame(rdd, sch)
}

// 1000 rows x 500 columns is enough to push the generated predicate past 64 KB.
val dataframe1 = start(1000, 500, spark)
val dataframe2 = start(1000, 500, spark)

val res = dataframe1.except(dataframe2)

res.count()
{code}

I have also tried a local Spark installation, version 2.2.0, using Scala 
version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
With this Spark version, the code does not fail, but it logs several exceptions, 
all saying the following:
{code:title=Exception_with_2.2.0}
17/09/21 12:42:26 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
grows beyond 64 KB
{code}
Then the same generated code is logged.

In addition, this line is also logged several times:
{code}
17/09/21 12:46:20 WARN SortMergeJoinExec: Codegen disabled for this expression: 
(...
{code}

Since I have to work with Spark 2.1.0, is there a way to work around this 
problem? Maybe by disabling code generation?

Thank you for your help.
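
A hedged sketch of that workaround, not an official fix: {{spark.sql.codegen.wholeStage}} is an existing configuration key, and turning it off avoids some (not all) of the 64 KB codegen failures on 2.1.x, at the cost of slower execution.

{code:java}
// Disable whole-stage code generation for this session before running the wide
// except(); smaller, per-operator code paths are generated instead.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val res = dataframe1.except(dataframe2)  // dataframe1/2 from the script above
res.count()
{code}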


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues

2017-09-21 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174685#comment-16174685
 ] 

Weichen Xu commented on SPARK-19141:


Maybe we need to design a sparse format of AttributeGroup for ML vector columns. We 
don't need to create an Attribute for each vector dimension; the better way, I think, 
is to create them only when needed. But `VectorAssembler` creates an attribute for 
each dimension in every case. The current design looks wasteful.
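
Until something like that lands, a hedged user-side sketch of a workaround: strip the per-slot attributes off the assembled column so the oversized metadata never reaches downstream stages. {{Column.as(alias, metadata)}} is existing API; {{processedData}} and the {{Features}} column refer to the reproduction in the issue description below.

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Re-alias the assembled vector column with empty metadata, dropping the
// per-dimension ML attributes that VectorAssembler generated.
val slimmed = processedData.withColumn(
  "Features", col("Features").as("Features", Metadata.empty))
{code}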


> VectorAssembler metadata causing memory issues
> --
>
> Key: SPARK-19141
> URL: https://issues.apache.org/jira/browse/SPARK-19141
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
> Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, 
> 2.0.0, 2.1.0
>Reporter: Antonia Oprescu
>
> VectorAssembler produces unnecessary metadata that overflows the Java heap in 
> the case of sparse vectors. In the example below, the logical length of the 
> vector is 10^6, but the number of non-zero values is only 2.
> The problem arises when the vector assembler creates metadata (ML attributes) 
> for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. 
> HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it 
> produces:
> {noformat}
> {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}}
> {noformat}
> In this lightweight example, the feature size limit seems to be 1,000,000 
> when run locally, but this scales poorly with more complicated routines. With 
> a larger dataset and a learner (say LogisticRegression), it maxes out 
> anywhere between 10k and 100k hash size even on a decent sized cluster.
> I did some digging, and it seems that the only metadata necessary for 
> downstream learners is the one indicating categorical columns. Thus, I 
> thought of the following possible solutions:
> 1. Compact representation of ml attributes metadata (but this seems to be a 
> bigger change)
> 2. Removal of non-categorical tags from the metadata created by the 
> VectorAssembler
> 3. An option on the existent VectorAssembler to skip unnecessary ml 
> attributes or create another transformer altogether
> I would be happy to take a stab at any of these solutions, but I need some 
> direction from the Spark community.
> {code:title=VABug.scala |borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
> import org.apache.spark.sql.SparkSession
> object VARepro {
>   case class Record(Label: Double, Feat01: Double, Feat02: Array[String])
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .setAppName("Vector assembler bug")
>   .setMaster("local[*]")
> val spark = SparkSession.builder.config(conf).getOrCreate()
> import spark.implicits._
> val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, 
> Array("a9ee"))).toDS()
> val numFeatures = 1000
> val hashingScheme = new 
> HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures)
> val hashedData = hashingScheme.transform(df)
> val vectorAssembler = new 
> VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features")
> val processedData = vectorAssembler.transform(hashedData).select("Label", 
> "Features")
> processedData.show()
>   }
> }
> {code}
> *Stacktrace from the example above:*
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:215)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:195)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:71)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:70)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.copyToArray(SeqViewLike.scala:37)
>   at 
> scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
>   at 
> 

[jira] [Comment Edited] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-09-21 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174660#comment-16174660
 ] 

Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:
-

I have some similar use cases to the ones mentioned in [#comment-15987668] by 
[~simeons] - adding fields to nested _struct_ fields. The application is built so 
that parquet files are created/partitioned outside of Spark and only new 
columns might be added, again mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but can we just use the 
last element of the selected files instead of the first one, given that the 
FileStatus [list is already sorted by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
It is easier to guarantee that only new columns are added over time, and the 
following code change doesn't seem to be a huge deviation from current 
behavior, while saving a tremendous amount of time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:java} // ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
.orElse(filesByType.metadata.lastOption)
.orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226] 


was (Author: nfx):
in one of the use-cases for project in [#comment-15987668] by [~simeons] - 
adding fields to nested _struct_ fields. application is built the way that 
parquet files are created/partitioned outside of Spark and only new columns 
might be added. Again, mostly within couple of nested structs. 

I don't know all potential implications of the idea, but can we just use the 
last element of selected files instead of the first one, as long as the 
FileStatus [list is already sorted by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
 It easier to guarantee that only new columns would be added over the time. And 
the following code change doesn't seem to be huge deviation from current 
behavior, thus tremendously saving time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:java} // ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
.orElse(filesByType.metadata.lastOption)
.orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226] 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-09-21 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174660#comment-16174660
 ] 

Serge Smertin commented on SPARK-18727:
---

In one of the use cases for my project, as in [#comment-15987668] by [~simeons], 
fields are added to nested _struct_ fields. The application is built so that 
parquet files are created/partitioned outside of Spark and only new columns 
might be added, again mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but can we just use the 
last element of the selected files instead of the first one, given that the 
FileStatus [list is already sorted by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
It is easier to guarantee that only new columns are added over time, and the 
following code change doesn't seem to be a huge deviation from current 
behavior, while saving a tremendous amount of time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:java} // ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
.orElse(filesByType.metadata.lastOption)
.orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226] 
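
For context, a brief sketch of the two behaviors being compared (the {{/data/events}} path is hypothetical): the existing {{mergeSchema}} option already yields the union schema but reads a footer from every file, which is what makes it slow on large tables, while the default path picks the schema from a single file.

{code:java}
// Correct but slow: merge schemas across the footers of all files.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/events")

// Default: the schema comes from one summary/data file, currently the first
// one in the lexicographically sorted listing.
val firstOnly = spark.read.parquet("/data/events")
{code}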

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true

2017-09-21 Thread Artem Kupchinskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174642#comment-16174642
 ] 

Artem Kupchinskiy commented on SPARK-21418:
---

There is still a place in FileSourceScanExec.scala where a None.get error could 
theoretically appear.
{code:java}
  val needsUnsafeRowConversion: Boolean = if 
(relation.fileFormat.isInstanceOf[ParquetSource]) {

SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
  } else {
false
  }
{code}

I think it is worth guarding this val initialization as well (falling back to a 
default configuration value in the None case), although I haven't encountered any 
None.get errors so far after applying this patch.
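
A hedged sketch of that guard, mirroring the fragment above rather than a standalone program (the fallback value is a placeholder assumption, not a decided default):

{code:java}
// Fall back to a default instead of calling .get on a possibly empty Option.
val needsUnsafeRowConversion: Boolean =
  if (relation.fileFormat.isInstanceOf[ParquetSource]) {
    SparkSession.getActiveSession
      .map(_.sessionState.conf.parquetVectorizedReaderEnabled)
      .getOrElse(true)  // placeholder: assume the vectorized-reader default
  } else {
    false
  }
{code}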



> NoSuchElementException: None.get in DataSourceScanExec with 
> sun.io.serialization.extendedDebugInfo=true
> ---
>
> Key: SPARK-21418
> URL: https://issues.apache.org/jira/browse/SPARK-21418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> I don't have a minimal reproducible example yet, sorry. I have the following 
> lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
> fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> 
