[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23206:
---
Attachment: (was: ExecutorTab2.png)

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
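To make the arithmetic above concrete, here is a minimal sketch of the wastage
estimate using the same formula (number of executors * (spark.executor.memory - max
JVM used memory)). The executor count and sizes below are illustrative assumptions,
not measurements from the clusters mentioned above.

{code:java}
object UnusedMemoryEstimate {
  // numExecutors * (spark.executor.memory - max JVM used memory), in bytes.
  def unusedBytes(numExecutors: Int, executorMemoryBytes: Long, maxJvmUsedBytes: Long): Long =
    numExecutors * (executorMemoryBytes - maxJvmUsedBytes)

  def main(args: Array[String]): Unit = {
    val gb = 1L << 30
    // Illustrative numbers only: 57 executors, spark.executor.memory=16g, and a
    // peak JVM used memory of ~35% of that (the average percentage quoted above),
    // which lands near the ~591 GB average unused memory per application.
    val unused = unusedBytes(
      numExecutors = 57,
      executorMemoryBytes = 16 * gb,
      maxJvmUsedBytes = (16 * gb * 0.35).toLong)
    println(s"~${unused / gb} GB of executor memory allocated but never used")
  }
}
{code}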



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357844#comment-16357844
 ] 

Lantao Jin edited comment on SPARK-23206 at 2/9/18 7:56 AM:


Yes, we have done similar things:
 !ExecutorsTab2.png! 

[~elu], could we have a talk on Skype or Slack to discuss what to do next and how 
we can collaborate to make this a bigger umbrella ticket?


was (Author: cltlfcjin):
Yes, we have done similar things:
 !ExecutorTab2.png! 

[~elu], could we have a talk on Skype or Slack to discuss what to do next and how 
we can collaborate to make this a bigger umbrella ticket?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23206:
---
Attachment: ExecutorsTab2.png

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-08 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358048#comment-16358048
 ] 

Liang-Chi Hsieh commented on SPARK-23333:
-

Currently I think we don't have an API in Dataset to just fetch an arbitrary row 
back. Is it reasonable to add a {{def any(n: Int): Array[T]}} to Dataset? cc 
[~cloud_fan]
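As a rough illustration of the shape such an API could take, here is a hedged
sketch written as an extension method rather than a change to Dataset itself. The
name {{any}} follows the suggestion above and is not an existing Spark API, and
whether this actually avoids the cost of the underlying sort depends on the
physical plan, so treat it as a sketch only.

{code:java}
import org.apache.spark.sql.Dataset

object DatasetAnyOps {
  // Hypothetical extension method; name and semantics follow the suggestion above.
  implicit class AnyOps[T](val ds: Dataset[T]) extends AnyVal {
    /** Return up to n rows without asking for any particular row order. */
    def any(n: Int): Array[T] = ds.rdd.take(n)
  }
}
{code}

Usage would be something like {{import DatasetAnyOps._; val sample = someDataset.any(1)}}.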

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-23333
> URL: https://issues.apache.org/jira/browse/SPARK-23333
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Major
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> is just as good as taking oldDF.first(). Could we therefore speed things up 
> considerably by grabbing an arbitrary row instead of relying on oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23369) HiveClientSuites fails with unresolved dependency

2018-02-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23369:
---

 Summary: HiveClientSuites fails with unresolved dependency
 Key: SPARK-23369
 URL: https://issues.apache.org/jira/browse/SPARK-23369
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan


I saw it multiple times in PR builders. The error message is:

 
{code:java}
sbt.ForkMain$ForkError: java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-json;1.14: configuration not found in 
com.sun.jersey#jersey-json;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile] at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1270)
 at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:113)
 at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:113)
 at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:112)
 at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:74)
 at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:62)
 at 
org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:51)
 at 
org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
 at 
org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
 at 
org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
 at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
 at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at 
org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at 
org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) at 
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) at 
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) at 
org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
 at org.scalatest.Suite$class.run(Suite.scala:1144) at 
org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
 at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
 at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at 
sbt.ForkMain$Run$2.call(ForkMain.java:296) at 
sbt.ForkMain$Run$2.call(ForkMain.java:286) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11222) Add style checker rules to validate doc tests aren't included in docs

2018-02-08 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358002#comment-16358002
 ] 

Rekha Joshi commented on SPARK-11222:
-

Raised an issue for the doctest style check: https://github.com/PyCQA/pydocstyle/issues/298

> Add style checker rules to validate doc tests aren't included in docs
> -
>
> Key: SPARK-11222
> URL: https://issues.apache.org/jira/browse/SPARK-11222
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: holdenk
>Priority: Trivial
>
> Add style checker test to make sure we have the blank line before starting 
> doctests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23368) OutputOrdering and OutputPartitioning in ProjectExec should reflect the projected columns

2018-02-08 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-23368:
---

 Summary: OutputOrdering and OutputPartitioning in ProjectExec 
should reflect the projected columns
 Key: SPARK-23368
 URL: https://issues.apache.org/jira/browse/SPARK-23368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maryann Xue


After a projection that renames columns, the ProjectExec's outputOrdering and 
outputPartitioning should reflect the projected (renamed) columns as well. For example,
{code:java}
SELECT b1
FROM (
SELECT a a1, b b1
FROM testData2
ORDER BY a
)
ORDER BY a1{code}
The inner query is effectively ordered on a1 as well. If we had a rule to eliminate 
a Sort over an already-sorted result, then together with this fix the ORDER BY in 
the outer query could be optimized out.

 

Similarly, the below query
{code:java}
SELECT *
FROM (
SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
FROM testData2 t1
LEFT JOIN testData2 t2
ON t1.a = t2.a
)
JOIN testData2 t3
ON a1 = t3.a{code}
is equivalent to
{code:java}
SELECT *
FROM testData2 t1
LEFT JOIN testData2 t2
ON t1.a = t2.a
JOIN testData2 t3
ON t1.a = t3.a{code}
, so the unnecessary sorting and hash-partitioning that are optimized out for the 
second query should be eliminated in the first query as well.
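A rough sketch of the substitution this ticket asks for (not the actual Spark
change): rewrite the child's ordering through the projection's aliases, so that an
ordering on {{a}} becomes an ordering on {{a1}} after {{SELECT a AS a1}}. The helper
name and its placement are assumptions for illustration only.

{code:java}
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression, SortOrder}

object ProjectOrderingSketch {
  def remapOrdering(projectList: Seq[NamedExpression],
                    childOrdering: Seq[SortOrder]): Seq[SortOrder] = {
    // Map each aliased child expression to the attribute the projection exposes.
    val aliases: Seq[(Expression, Expression)] =
      projectList.collect { case a @ Alias(child, _) => child -> a.toAttribute }
    childOrdering.map { so =>
      // Replace any sub-expression that matches an aliased expression with the
      // attribute produced by the projection (e.g. `a` becomes `a1`).
      val rewritten = so.child.transform {
        case e if aliases.exists(_._1.semanticEquals(e)) =>
          aliases.find(_._1.semanticEquals(e)).get._2
      }
      so.copy(child = rewritten)
    }
  }
}
{code}

The same idea would apply to outputPartitioning, rewriting the partitioning
expressions through the same alias map.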



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23363) Fix spark-sql bug or improvement

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte reopened SPARK-23363:


> Fix spark-sql bug or improvement
> 
>
> Key: SPARK-23363
> URL: https://issues.apache.org/jira/browse/SPARK-23363
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc

2018-02-08 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357966#comment-16357966
 ] 

Yuming Wang commented on SPARK-23354:
-

Do you mean a custom column type? You can find more details 
[here|https://github.com/apache/spark/pull/18266].
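Assuming "custom column type" refers to the JDBC writer's {{createTableColumnTypes}}
option (an assumption on my part; the linked PR has the authoritative details), a
usage sketch looks like this. The URL, table name, and column list are placeholders.

{code:java}
import org.apache.spark.sql.DataFrame

object JdbcCustomTypesSketch {
  def writeWithCustomTypes(df: DataFrame, url: String): Unit = {
    df.write
      .format("jdbc")
      .option("url", url)                       // e.g. an Oracle JDBC URL
      .option("dbtable", "target_table")        // placeholder table name
      // Override the DDL column types Spark would otherwise derive from the schema,
      // so column lengths can be preserved on table creation.
      .option("createTableColumnTypes", "name VARCHAR(128), comment VARCHAR(1024)")
      .save()
  }
}
{code}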

 

> spark jdbc does not maintain length of data type when I move data from MS sql 
> server to Oracle using spark jdbc
> ---
>
> Key: SPARK-23354
> URL: https://issues.apache.org/jira/browse/SPARK-23354
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Lav Patel
>Priority: Major
>
> Spark JDBC does not maintain the length of a data type when I move data from MS 
> SQL Server to Oracle using Spark JDBC.
>  
> To fix this, I have written code that figures out the column length and performs 
> the conversion.
>  
> I can provide more details and a code sample if the community is interested.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357965#comment-16357965
 ] 

Apache Spark commented on SPARK-23364:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/20557

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23364:


Assignee: (was: Apache Spark)

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23364:


Assignee: Apache Spark

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23367) Include python document style checking

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23367:


Assignee: (was: Apache Spark)

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Minor
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this JIRA is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23367) Include python document style checking

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357958#comment-16357958
 ] 

Apache Spark commented on SPARK-23367:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/20556

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Minor
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this JIRA is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23364:
---
Summary: 'desc table' command in spark-sql add column head display  (was: 
desc table add column head display)

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23367) Include python document style checking

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23367:


Assignee: Apache Spark

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Assignee: Apache Spark
>Priority: Minor
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this JIRA is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte reopened SPARK-23364:


> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357953#comment-16357953
 ] 

Imran Rashid commented on SPARK-23206:
--

+1 on all the ideas discussed here so far.  One thing missing in the design doc 
is memory that lives outside of the JVM.  E.g., Parquet and Netty can both use a 
lot of off-heap memory, so I think it's useful to track that as well.  It seems 
[~cltlfcjin] has looked at that already.  Also, [~tgraves] is interested, I think, 
and had some thoughts here: 
https://issues.apache.org/jira/browse/SPARK-21157?focusedCommentId=16179496=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16179496
(another related JIRA)

The design doc doesn't describe how you get metrics for each stage.  There was a 
proposal on SPARK-9103 for how to do this that I think was pretty good.

I'd be interested in being a part of the discussion if possible as well.
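For the off-heap part, below is a hedged sketch of how an executor could sample
direct/mapped buffer usage with standard JMX beans. The metric names and how this
would be wired into the heartbeat are assumptions, and native allocations made
outside NIO buffer pools (e.g. by native libraries) would still not be covered.

{code:java}
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

object OffHeapSnapshot {
  /** Bytes used by NIO direct and mapped buffer pools, plus JVM non-heap usage. */
  def sample(): Map[String, Long] = {
    val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
    val bufferPools = pools.map(p => s"bufferPool.${p.getName}" -> p.getMemoryUsed).toMap
    bufferPools + ("nonHeapUsed" ->
      ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed)
  }
}
{code}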

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23367) Include python document style checking

2018-02-08 Thread Rekha Joshi (JIRA)
Rekha Joshi created SPARK-23367:
---

 Summary: Include python document style checking
 Key: SPARK-23367
 URL: https://issues.apache.org/jira/browse/SPARK-23367
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.1
Reporter: Rekha Joshi


As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], this 
JIRA is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-08 Thread huangtengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357952#comment-16357952
 ] 

huangtengfei commented on SPARK-23053:
--

The following is a repro case, for clarity:
{code:java}
/** Wrapped rdd partition. */
class WrappedPartition(val partition: Partition) extends Partition {
  def index: Int = partition.index
}

/**
 * An RDD with a particular defined Partition which is WrappedPartition.
 * The compute method will cast the split to WrappedPartition. The cast
 * operation will be used in this test suite.
 */
class WrappedRDD(parent: RDD[Int]) extends RDD[Int](parent) {
  protected def getPartitions: Array[Partition] = {
    parent.partitions.map(p => new WrappedPartition(p))
  }

  def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    parent.compute(split.asInstanceOf[WrappedPartition].partition, context)
  }
}
{code}
{code:java}
/**
 * In this repro, we simulate the scene in concurrent jobs using the same
 * rdd which is marked to do checkpoint:
 * Job one has already finished the spark job, and start the process of doCheckpoint;
 * Job two is submitted, and submitMissingTasks is called.
 * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done,
 * while part calculates from stage.rdd.partitions is called after doCheckpoint is done,
 * we may get a ClassCastException when execute the task because of some rdd will do
 * Partition cast.
 *
 * With this test case, just want to indicate that we should do taskSerialization and
 * part calculate in submitMissingTasks with the same rdd checkpoint status.
 */
repro("SPARK-23053: avoid ClassCastException in concurrent execution with checkpoint") {
  // set checkpointDir.
  val checkpointDir = Utils.createTempDir()
  sc.setCheckpointDir(checkpointDir.toString)

  // Semaphores to control the process sequence for the two threads below.
  val doCheckpointStarted = new Semaphore(0)
  val taskBinaryBytesFinished = new Semaphore(0)
  val checkpointStateUpdated = new Semaphore(0)

  val rdd = new WrappedRDD(sc.makeRDD(1 to 100, 4))
  rdd.checkpoint()

  val checkpointRunnable = new Runnable {
    override def run() = {
      // Simulate what RDD.doCheckpoint() does here.
      rdd.doCheckpointCalled = true
      val checkpointData = rdd.checkpointData.get
      RDDCheckpointData.synchronized {
        if (checkpointData.cpState == CheckpointState.Initialized) {
          checkpointData.cpState = CheckpointState.CheckpointingInProgress
        }
      }

      val newRDD = checkpointData.doCheckpoint()

      // Release doCheckpointStarted after job triggered in checkpoint finished, so
      // that taskBinary serialization can start.
      doCheckpointStarted.release()
      // Wait until taskBinary serialization finished in submitMissingTasksThread.
      taskBinaryBytesFinished.acquire()

      // Update our state and truncate the RDD lineage.
      RDDCheckpointData.synchronized {
        checkpointData.cpRDD = Some(newRDD)
        checkpointData.cpState = CheckpointState.Checkpointed
        rdd.markCheckpointed()
      }
      checkpointStateUpdated.release()
    }
  }

  val submitMissingTasksRunnable = new Runnable {
    override def run() = {
      // Simulate the process of submitMissingTasks.
      // Wait until doCheckpoint job running finished, but checkpoint status not changed.
      doCheckpointStarted.acquire()

      val ser = SparkEnv.get.closureSerializer.newInstance()

      // Simulate task serialization while submitMissingTasks.
      // Task serialized with rdd checkpoint not finished.
      val cleanedFunc = sc.clean(Utils.getIteratorSize _)
      val func = (ctx: TaskContext, it: Iterator[Int]) => cleanedFunc(it)
      val taskBinaryBytes = JavaUtils.bufferToArray(
        ser.serialize((rdd, func): AnyRef))
      // Because partition calculate is in a synchronized block, so in the fixed code
      // partition is calculated here.
      val correctPart = rdd.partitions(0)

      // Release taskBinaryBytesFinished so changing checkpoint status to Checkpointed will
      // be done in checkpointThread.
      taskBinaryBytesFinished.release()
      // Wait until checkpoint status changed to Checkpointed in checkpointThread.
      checkpointStateUpdated.acquire()

      // Now we're done simulating the interleaving that might happen within the scheduler,
      // we'll check to make sure the final state is OK by simulating a couple steps that
      // normally happen on the executor.
      // Part calculated with rdd checkpoint already finished.
      val errPart = rdd.partitions(0)

      // TaskBinary will be deserialized when run task in executor.
      val (taskRdd, taskFunc) = ser.deserialize[(RDD[Int], (TaskContext, Iterator[Int]) => Unit)](
        ByteBuffer.wrap(taskBinaryBytes),
        Thread.currentThread.getContextClassLoader)

      val taskContext = 

[jira] [Commented] (SPARK-23235) Add executor Threaddump to api

2018-02-08 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357949#comment-16357949
 ] 

Imran Rashid commented on SPARK-23235:
--

[~jerryshao] thanks for pointing me at SPARK-23206, looks very relevant.

{quote}
If we expose endpoints in the executor side, how would we leverage them?
{quote}

I wasn't thinking Spark itself would use it at all -- it would just be a source 
of information for other consumers.  E.g., I could imagine making use of it 
myself for debugging.  Sometimes when an executor is stuck it's useful to get 
more than just the thread dump; e.g., maybe you could give more info from the 
BlockManager and the MemoryManager, more detailed shuffle metrics, etc.

I could also imagine it being used by some external tool for a richer debugging 
or profiling experience.  Since we don't want to send too much info through the 
driver, such a tool could potentially poll more from the executors directly.

In any case, these ideas are a ways out there, and I think just the thread dump 
API is a nice addition, so let's leave it there for this ticket :)
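For reference, a minimal sketch of producing a plain-text thread dump with only JDK
APIs, the kind of payload such a REST endpoint could return. The object and method
names are illustrative, not Spark's actual implementation.

{code:java}
import java.lang.management.ManagementFactory

object ThreadDumpSketch {
  def dump(): String = {
    val bean = ManagementFactory.getThreadMXBean
    bean.dumpAllThreads(/* lockedMonitors = */ true, /* lockedSynchronizers = */ true)
      .map(_.toString)   // ThreadInfo.toString includes the state and a truncated stack
      .mkString("\n")
  }
}
{code}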


> Add executor Threaddump to api
> --
>
> Key: SPARK-23235
> URL: https://issues.apache.org/jira/browse/SPARK-23235
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Minor
>  Labels: newbie
>
> It looks like the the thread dump {{/executors/threadDump/?executorId=[id]}} 
> is only available in the UI, not in the rest api at all.  This is especially 
> a pain because that page in the UI has extra formatting which makes it a pain 
> to send the output to somebody else (most likely you click "expand all" and 
> then copy paste that, which is OK, but is formatted weirdly).  We might also 
> just want a "format=raw" option even on the UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357946#comment-16357946
 ] 

guoxiaolongzte commented on SPARK-23364:


I will PR to solve this matter, thank you.

> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-08 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357930#comment-16357930
 ] 

Imran Rashid commented on SPARK-19870:
--

[~eyalfa] I see that warning from every task in stage 23.  Is that doing a take 
or a limit or anything?  Also is that an executor which seems to be stuck 
reading the broadcast variables?  I think we really need executor logs & stack 
traces from an executor which is stuck in this state.

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: cs.executor.log, stack.txt
>
>
> I am running what I believe to be a fairly vanilla Spark job, using the RDD API, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Resolved] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-02-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23186.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20359
[https://github.com/apache/spark/pull/20359]

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> Since some JDBC drivers have class initialization code that calls 
> `DriverManager`, we need to initialize DriverManager first in order to avoid a 
> potential deadlock situation like the following one, or the one in STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
> (Interpreted frame)
>  - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
> frame)
>  - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
> @bci=0 (Compiled frame)
>  - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
> frame)
>  - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
>  java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
> (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
>  java.util.Properties) @bci=22, line=57 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
>  org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
>  org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 
> (Interpreted frame)
>  - 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
> (Interpreted frame)
> Thread 9170: (state = BLOCKED)
>  - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
> (Interpreted frame)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
>  @bci=89, line=46 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=7, line=53 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=1, line=52 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
> (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
>  org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
> {code}
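For illustration, a minimal sketch of the ordering idea described above: touch
DriverManager so its class initialization runs before any driver class is loaded
reflectively. The object and method names are placeholders, and the actual change
in the linked pull request may differ from this sketch.

{code:java}
import java.sql.DriverManager

object DriverRegistrySketch {
  // Referencing DriverManager here forces java.sql.DriverManager's static
  // initializer (which loads the initial drivers) to run before we load any
  // driver class ourselves, so the two initializations cannot deadlock.
  DriverManager.getDrivers()

  def register(className: String): Unit = {
    // Loading (and initializing) the driver class afterwards is now safe.
    Class.forName(className, true, Thread.currentThread().getContextClassLoader)
  }
}
{code}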



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Assigned] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-02-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23186:
---

Assignee: Dongjoon Hyun

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> Since some JDBC drivers have class initialization code that calls 
> `DriverManager`, we need to initialize DriverManager first in order to avoid a 
> potential deadlock situation like the following one, or the one in STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
> (Interpreted frame)
>  - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
> frame)
>  - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
> @bci=0 (Compiled frame)
>  - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
> frame)
>  - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
>  java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
> (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
>  java.util.Properties) @bci=22, line=57 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
>  org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
>  org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 
> (Interpreted frame)
>  - 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
> (Interpreted frame)
> Thread 9170: (state = BLOCKED)
>  - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
> (Interpreted frame)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
>  @bci=89, line=46 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=7, line=53 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=1, line=52 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
> (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
>  org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23364) desc table add column head display

2018-02-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23364.
---
Resolution: Invalid

It's not clear what you are trying to communicate here. This is a common 
problem. Please put this to the mailing list first.

> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix:
>  !2.png! 
> After the fix:
>  !1.png! 






[jira] [Resolved] (SPARK-23363) Fix spark-sql bug or improvement

2018-02-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23363.
---
Resolution: Invalid

> Fix spark-sql bug or improvement
> 
>
> Key: SPARK-23363
> URL: https://issues.apache.org/jira/browse/SPARK-23363
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
>







[jira] [Assigned] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23366:


Assignee: Apache Spark

> Improve hot reading path in ReadAheadInputStream
> 
>
> Key: SPARK-23366
> URL: https://issues.apache.org/jira/browse/SPARK-23366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
>
> ReadAheadInputStream was introduced in 
> [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
> reading spill files from disk.
> However, flame graphs from profiling workloads that regressed after the switch 
> to Spark 2.3 suggest that the hot path for reading small amounts of data (like 
> readInt) is inefficient: it involves taking locks and multiple checks.






[jira] [Assigned] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23366:


Assignee: (was: Apache Spark)

> Improve hot reading path in ReadAheadInputStream
> 
>
> Key: SPARK-23366
> URL: https://issues.apache.org/jira/browse/SPARK-23366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> ReadAheadInputStream was introduced in 
> [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
> reading spill files from disk.
> However, flame graphs from profiling workloads that regressed after the switch 
> to Spark 2.3 suggest that the hot path for reading small amounts of data (like 
> readInt) is inefficient: it involves taking locks and multiple checks.






[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-08 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357890#comment-16357890
 ] 

Juliusz Sompolski commented on SPARK-23310:
---

[~kiszk] I raised SPARK-23366 and submitted 
[https://github.com/apache/spark/pull/20555] against it.

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has a noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream 
> suffers from lock congestion. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> congestion issue. 
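
For reference, the flag proposed above can already be set per application; a 
minimal sketch, assuming a SparkSession-based job (the application name is 
illustrative, only the configuration key comes from this ticket):

{code}
// Minimal sketch: disable the spill read-ahead buffer for one application.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tpcds-q95-no-readahead")  // hypothetical name
  .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
  .getOrCreate()
{code}

The same setting can also be passed on the command line with --conf, like any 
other Spark configuration.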






[jira] [Commented] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357889#comment-16357889
 ] 

Apache Spark commented on SPARK-23366:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/20555

> Improve hot reading path in ReadAheadInputStream
> 
>
> Key: SPARK-23366
> URL: https://issues.apache.org/jira/browse/SPARK-23366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> ReadAheadInputStream was introduced in 
> [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
> reading spill files from disk.
> However, flame graphs from profiling workloads that regressed after the switch 
> to Spark 2.3 suggest that the hot path for reading small amounts of data (like 
> readInt) is inefficient: it involves taking locks and multiple checks.






[jira] [Updated] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23364:
---
Description: 
Before the fix: 
 !2.png! 


After the fix:
 !1.png! 

> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Before the fix: 
>  !2.png! 
> After the fix:
>  !1.png! 






[jira] [Updated] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23364:
---
Attachment: 2.png
1.png

> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>







[jira] [Created] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-08 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-23366:
-

 Summary: Improve hot reading path in ReadAheadInputStream
 Key: SPARK-23366
 URL: https://issues.apache.org/jira/browse/SPARK-23366
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


ReadAheadInputStream was introduced in 
[apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
reading spill files from disk.
However, flame graphs from profiling workloads that regressed after the switch 
to Spark 2.3 suggest that the hot path for reading small amounts of data (like 
readInt) is inefficient: it involves taking locks and multiple checks.
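
To make the concern concrete, here is a deliberately simplified, hypothetical 
sketch (it is not Spark's ReadAheadInputStream, which also maintains an 
asynchronous read-ahead thread) showing the shape of a lock-free hot path: the 
common single-byte read touches only the already-filled buffer, and locking is 
confined to the refill path.

{code}
import java.io.InputStream
import java.nio.ByteBuffer

// Illustrative only: a buffered stream whose hot path avoids the lock entirely.
// Single reader, lazy refill; the real class refills asynchronously.
class BufferedFastPathStream(underlying: InputStream, bufferSize: Int = 1 << 16)
    extends InputStream {

  private val active = ByteBuffer.allocate(bufferSize)
  active.limit(0)                           // start with an empty buffer
  private val lock = new Object

  override def read(): Int = {
    if (active.hasRemaining) {
      active.get() & 0xFF                   // hot path: no lock, no extra checks
    } else {
      lock.synchronized { refillAndRead() } // slow path: refill under the lock
    }
  }

  private def refillAndRead(): Int = {
    active.clear()
    val n = underlying.read(active.array(), 0, active.capacity())
    if (n <= 0) {
      active.limit(0)                       // keep the fast path disabled at EOF
      -1
    } else {
      active.limit(n)
      active.position(0)
      active.get() & 0xFF
    }
  }
}
{code}

The sketch is only meant to illustrate why removing locking and per-byte 
bookkeeping from the common case matters; see the pull request linked from this 
issue for the actual change.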






[jira] [Updated] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23364:
---
Priority: Minor  (was: Major)

> desc table add column head display
> --
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>







[jira] [Created] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-08 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-23365:


 Summary: DynamicAllocation with failure in straggler task can lead 
to a hung spark job
 Key: SPARK-23365
 URL: https://issues.apache.org/jira/browse/SPARK-23365
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.1, 2.1.2, 2.3.0
Reporter: Imran Rashid


Dynamic Allocation can lead to a Spark app getting stuck with 0 executors 
requested when the executors running the last tasks of a taskset fail (e.g. with 
an OOM).

This happens when {{ExecutorAllocationManager}}'s internal target number of 
executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks 
are active or pending in submitted stages, and computes how many executors 
would be needed for them.  And as tasks finish, it will actively decrease that 
count, informing the {{CGSB}} along the way.  (2) When it decides executors are 
inactive for long enough, then it requests that {{CGSB}} kill the executors -- 
this also tells the {{CGSB}} to update its target number of executors: 
https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622

So when there is just one task left, you could have the following sequence of 
events:
(1) the {{EAM}} sets the desired number of executors to 1, and updates the 
{{CGSB}} too
(2) while that final task is still running, the other executors cross the idle 
timeout, and the {{EAM}} requests the {{CGSB}} kill them
(3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
of 0 executors

If the final task completed normally now, everything would be OK; the next 
taskset would get submitted, the {{EAM}} would increase the target number of 
executors and it would update the {{CGSB}}.

But if the executor for that final task failed (e.g. an OOM), then the {{EAM}} 
thinks it [doesn't need to update 
anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
 because its target is already 1, which is all it needs for that final task; 
and the {{CGSB}} doesn't update anything either since its target is 0.
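
A toy sketch of that bookkeeping mismatch, using plain variables rather than the 
real {{ExecutorAllocationManager}} and {{CoarseGrainedSchedulerBackend}} classes:

{code}
// Toy model only: two counters standing in for the EAM's and the CGSB's targets.
var eamTarget  = 1             // (1) one task left, so the EAM wants one executor
var cgsbTarget = 1             //     ...and it has told the scheduler backend so

cgsbTarget = 0                 // (2) idle executors are killed; the kill path also
                               //     lowers the backend's target

val executorsNeeded = 1        // (3) the straggler's executor dies, task unfinished
if (eamTarget != executorsNeeded) {
  cgsbTarget = executorsNeeded // would re-sync the backend, but this never runs
}                              // because the EAM's target already equals 1

// Result: eamTarget == 1 but cgsbTarget == 0, so the cluster manager keeps being
// asked for zero executors and the remaining task can never be scheduled.
println(s"EAM target = $eamTarget, CGSB target = $cgsbTarget")
{code}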

I think you can determine if this is the cause of a stuck app by looking for
{noformat}
yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
{noformat}
in the logs of the ApplicationMaster (at least on yarn).

You can reproduce this with this test app, run with {{--conf 
"spark.dynamicAllocation.minExecutors=1" --conf 
"spark.dynamicAllocation.maxExecutors=5" --conf 
"spark.dynamicAllocation.executorIdleTimeout=5s"}}

{code}
import org.apache.spark.SparkEnv

sc.setLogLevel("INFO")

sc.parallelize(1 to 1, 1).count()

val execs = sc.parallelize(1 to 1000, 1000).map { _ =>
  SparkEnv.get.executorId
}.collect().toSet
val badExec = execs.head
println("will kill exec " + badExec)

new Thread() {
  override def run(): Unit = {
    // wait until the other executors have idled out, then kill the straggler's executor
    Thread.sleep(10000)
    println("about to kill exec " + badExec)
    sc.killExecutor(badExec)
  }
}.start()

sc.parallelize(1 to 5, 5).mapPartitions { itr =>
  val exec = SparkEnv.get.executorId
  if (exec == badExec) {
    // long enough that all the other tasks finish, and the executors cross the idle timeout;
    // meanwhile, something else should kill this executor
    Thread.sleep(20000)
    itr
  } else {
    itr
  }
}.collect()
{code}






[jira] [Created] (SPARK-23364) desc table add column head display

2018-02-08 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-23364:
--

 Summary: desc table add column head display
 Key: SPARK-23364
 URL: https://issues.apache.org/jira/browse/SPARK-23364
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: guoxiaolongzte









[jira] [Created] (SPARK-23363) Fix spark-sql bug or improvement

2018-02-08 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-23363:
--

 Summary: Fix spark-sql bug or improvement
 Key: SPARK-23363
 URL: https://issues.apache.org/jira/browse/SPARK-23363
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.4.0
Reporter: guoxiaolongzte









[jira] [Comment Edited] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357846#comment-16357846
 ] 

Edwina Lu edited comment on SPARK-23206 at 2/9/18 2:30 AM:
---

[~cltlfcjin], thanks for uploading the screenshot – these would be useful 
metrics for us as well, and it would be great to coordinate our efforts in an 
umbrella ticket. Yes, let's talk – both Skype and Slack are good. What is a 
good time/day?


was (Author: elu):
[~cltlfcjin], thanks for uploading the screenshot – these would be useful 
metrics for us as well. Yes, let's talk – both Skype and Slack are good. What 
is a good time/day?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357846#comment-16357846
 ] 

Edwina Lu commented on SPARK-23206:
---

[~cltlfcjin], thanks for uploading the screenshot – these would be useful 
metrics for us as well. Yes, let's talk – both Skype and Slack are good. What 
is a good time/day?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Comment Edited] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357844#comment-16357844
 ] 

Lantao Jin edited comment on SPARK-23206 at 2/9/18 2:28 AM:


Yes, we have done similar things:
 !ExecutorTab2.png! 

[~elu], could we have a talk on Skype or Slack to see what to do next and how 
we can collaborate to make this a bigger umbrella ticket?


was (Author: cltlfcjin):
Yes, we have done similar things:
 !Screen Shot 2018-02-09 at 10.21.19.png! 

[~elu], could we have a talk on Skype or Slack to see what to do next and how 
we can collaborate to make this a bigger umbrella ticket?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23206:
---
Attachment: ExecutorTab2.png

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23206:
---
Attachment: (was: Screen Shot 2018-02-09 at 10.21.19.png)

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorTab2.png, ExecutorsTab.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357844#comment-16357844
 ] 

Lantao Jin commented on SPARK-23206:


Yes, we have done similar things:
 !Screen Shot 2018-02-09 at 10.21.19.png! 

[~elu], could we have a talk on Skype or Slack to see what to do next and how 
we can collaborate to make this a bigger umbrella ticket?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, 
> Screen Shot 2018-02-09 at 10.21.19.png, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Updated] (SPARK-23362) Migrate Kafka microbatch source to v2

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23362:
--
Summary: Migrate Kafka microbatch source to v2  (was: Migrate Kafka 
Microbatch source to v2)

> Migrate Kafka microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>







[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23206:
---
Attachment: Screen Shot 2018-02-09 at 10.21.19.png

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, 
> Screen Shot 2018-02-09 at 10.21.19.png, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Resolved] (SPARK-23013) Migrate MemoryStream to V2

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23013.
---
Resolution: Duplicate

> Migrate MemoryStream to V2
> --
>
> Key: SPARK-23013
> URL: https://issues.apache.org/jira/browse/SPARK-23013
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357841#comment-16357841
 ] 

Edwina Lu commented on SPARK-23206:
---

[~jerryshao], thanks for your help and advice. [~cltlfcjin], I'll contact you 
to discuss and coordinate.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, 
> StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Updated] (SPARK-23098) Migrate Kafka batch source to v2

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23098:
--
Summary: Migrate Kafka batch source to v2  (was: Migrate kafka batch source)

> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Updated] (SPARK-23097) Migrate text socket source to v2

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23097:
--
Summary: Migrate text socket source to v2  (was: Migrate text socket source)

> Migrate text socket source to v2
> 
>
> Key: SPARK-23097
> URL: https://issues.apache.org/jira/browse/SPARK-23097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
>







[jira] [Assigned] (SPARK-23097) Migrate text socket source

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23097:
-

Assignee: Saisai Shao

> Migrate text socket source
> --
>
> Key: SPARK-23097
> URL: https://issues.apache.org/jira/browse/SPARK-23097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
>







[jira] [Updated] (SPARK-23362) Migrate Kafka Microbatch source to v2

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23362:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22911

> Migrate Kafka Microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>







[jira] [Updated] (SPARK-23098) Migrate kafka batch source

2018-02-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23098:
--
Summary: Migrate kafka batch source  (was: Migrate kafka source)

> Migrate kafka batch source
> --
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>







[jira] [Assigned] (SPARK-23362) Migrate Kafka Microbatch source to v2

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23362:


Assignee: Apache Spark  (was: Tathagata Das)

> Migrate Kafka Microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-23362) Migrate Kafka Microbatch source to v2

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357827#comment-16357827
 ] 

Apache Spark commented on SPARK-23362:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20554

> Migrate Kafka Microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>







[jira] [Assigned] (SPARK-23362) Migrate Kafka Microbatch source to v2

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23362:


Assignee: Tathagata Das  (was: Apache Spark)

> Migrate Kafka Microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>







[jira] [Created] (SPARK-23362) Migrate Kafka Microbatch source to v2

2018-02-08 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23362:
-

 Summary: Migrate Kafka Microbatch source to v2
 Key: SPARK-23362
 URL: https://issues.apache.org/jira/browse/SPARK-23362
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357824#comment-16357824
 ] 

Saisai Shao commented on SPARK-23206:
-

[~cltlfcjin] from eBay also plans to do similar things, so I think it would be 
better for you to coordinate to avoid conflicts.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, 
> StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357755#comment-16357755
 ] 

Felix Cheung commented on SPARK-23285:
--

Sounds reasonable to me.



> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22156) Word2Vec: incorrect learning rate update equation when numIterations > 1

2018-02-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22156:
--
Summary: Word2Vec: incorrect learning rate update equation when 
numIterations > 1  (was: incorrect learning rate update equation when 
numIterations > 1)

> Word2Vec: incorrect learning rate update equation when numIterations > 1
> 
>
> Key: SPARK-22156
> URL: https://issues.apache.org/jira/browse/SPARK-22156
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: kento nozawa
>Assignee: kento nozawa
>Priority: Minor
> Fix For: 2.3.0
>
>
> The current learning rate update equation is incorrect when numIterations > 1.
> Original code: 
> https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357665#comment-16357665
 ] 

Apache Spark commented on SPARK-23285:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20553

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-08 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357645#comment-16357645
 ] 

Li Jin commented on SPARK-23360:


cc [~bryanc] as well. I tried to fix this but didn't succeed. I don't 
understand the timestamp area in Spark too well; I think someone who 
understands the area probably has a better chance of fixing this one than me.

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application

2018-02-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357632#comment-16357632
 ] 

Shixiong Zhu commented on SPARK-23351:
--

I believe this should be resolved in 
https://issues.apache.org/jira/browse/SPARK-21696. Could you try Spark 2.2.1?

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
>
> Hi, after leaving my (somewhat high volume) Structured Streaming application 
> running for some time, I get the following exception.  The same exception 
> also repeats when I try to restart the application.  The only way to get the 
> application back running is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application

2018-02-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357629#comment-16357629
 ] 

Shixiong Zhu commented on SPARK-23351:
--

What's your file system? HDFS?

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
>
> Hi, after leaving my (somewhat high volume) Structured Streaming application 
> running for some time, I get the following exception.  The same exception 
> also repeats when I try to restart the application.  The only way to get the 
> application back running is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23361) Driver restart fails if it happens after 7 days from app submission

2018-02-08 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23361:
--

 Summary: Driver restart fails if it happens after 7 days from app 
submission
 Key: SPARK-23361
 URL: https://issues.apache.org/jira/browse/SPARK-23361
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin


If you submit an app that is supposed to run for > 7 days (so using 
\-\-principal / \-\-keytab in cluster mode), and there's a failure that causes 
the driver to restart after 7 days (that being the default token lifetime for 
HDFS), the new driver will fail with an error like the following:

{noformat}
Exception in thread "main" 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (lots of uninteresting token info) can't be found in cache
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
{noformat}

Note: lines may not align with actual Apache code because that's our internal 
build.

This happens because, as part of the app submission, the launcher provides 
delegation tokens to be used by the AM (the driver, in this case), and those 
tokens have already expired by that point in time.
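
For context, a long-running submission of the kind described above would look 
roughly like the sketch below (the class name, jar, principal and keytab path 
are hypothetical). The \-\-principal / \-\-keytab pair is what is meant to let 
tokens be refreshed over the application's lifetime; the failing tokens are the 
ones shipped by the launcher at submission time.

{noformat}
# Hypothetical long-running submission; names and paths are illustrative only.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal analytics@EXAMPLE.COM \
  --keytab /etc/security/keytabs/analytics.keytab \
  --class com.example.LongRunningApp \
  long-running-app.jar
{noformat}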



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-08 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357573#comment-16357573
 ] 

Li Jin commented on SPARK-23360:


Also this works fine in Arrow path.

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-08 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357572#comment-16357572
 ] 

Li Jin commented on SPARK-23360:


cc [~cloud_fan] [~ueshin] [~hyukjin.kwon]

This is a regression from 2.2.1

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-08 Thread Li Jin (JIRA)
Li Jin created SPARK-23360:
--

 Summary: SparkSession.createDataFrame results in incorrect results 
with non-Arrow codepath
 Key: SPARK-23360
 URL: https://issues.apache.org/jira/browse/SPARK-23360
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Li Jin


{code:java}
import datetime
import pandas as pd
import os

dt = [datetime.datetime(2015, 10, 31, 22, 30)]
pdf = pd.DataFrame({'time': dt})

os.environ['TZ'] = 'America/New_York'

df1 = spark.createDataFrame(pdf)
df1.show()

+---+
|   time|
+---+
|2015-10-31 21:30:00|
+---+
{code}
Seems to be related to this line here:

[https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]

It appears to be an issue with "tzlocal()"

Wrong:
{code:java}
from_tz = "America/New_York"
to_tz = "tzlocal()"

s.apply(lambda ts:  
ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
if ts is not pd.NaT else pd.NaT)

0   2015-10-31 21:30:00
Name: time, dtype: datetime64[ns]
{code}
Correct:
{code:java}
from_tz = "America/New_York"
to_tz = "America/New_York"

s.apply(
lambda ts: ts.tz_localize(from_tz, 
ambiguous=False).tz_convert(to_tz).tz_localize(None)
if ts is not pd.NaT else pd.NaT)

0   2015-10-31 22:30:00
Name: time, dtype: datetime64[ns]
{code}
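
For reference, the conversion above can be reproduced outside Spark. The sketch 
below is a minimal stand-alone version of the two `s.apply(...)` calls (here `s` 
is rebuilt as the 'time' column of the same pandas DataFrame), assuming the zone 
resolved by tzlocal() differs from America/New_York, e.g. because the process 
was started under a different TZ:
{code:python}
# Minimal stand-alone sketch of the conversion shown above (assumptions noted in the text).
import datetime

import pandas as pd
from dateutil.tz import tzlocal

pdf = pd.DataFrame({'time': [datetime.datetime(2015, 10, 31, 22, 30)]})
s = pdf['time']

def convert(series, from_tz, to_tz):
    # Interpret naive timestamps as from_tz, convert to to_tz, then drop the tz info.
    return series.apply(
        lambda ts: ts.tz_localize(from_tz, ambiguous=False)
                     .tz_convert(to_tz)
                     .tz_localize(None)
        if ts is not pd.NaT else pd.NaT)

print(convert(s, "America/New_York", "America/New_York"))  # 22:30 -- expected
print(convert(s, "America/New_York", tzlocal()))            # shifted when the local zone differs
{code}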



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23099) Migrate foreach sink

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23099:


Assignee: Apache Spark

> Migrate foreach sink
> 
>
> Key: SPARK-23099
> URL: https://issues.apache.org/jira/browse/SPARK-23099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23099) Migrate foreach sink

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23099:


Assignee: (was: Apache Spark)

> Migrate foreach sink
> 
>
> Key: SPARK-23099
> URL: https://issues.apache.org/jira/browse/SPARK-23099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23099) Migrate foreach sink

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357525#comment-16357525
 ] 

Apache Spark commented on SPARK-23099:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20552

> Migrate foreach sink
> 
>
> Key: SPARK-23099
> URL: https://issues.apache.org/jira/browse/SPARK-23099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357500#comment-16357500
 ] 

Yinan Li edited comment on SPARK-23285 at 2/8/18 8:22 PM:
--

Given the complexity and significant impact of the changes proposed in 
[https://github.com/apache/spark/pull/20460] to the way Spark handles task 
scheduling, task parallelism, and dynamic resource allocation, etc., I'm 
thinking if we should instead introduce a K8s specific configuration property 
for specifying the executor cores that follows the Kubernetes 
[convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu].
 It seems Mesos fine-grained mode does this with 
{{spark.mesos.mesosExecutor.cores}}. We can have something like 
{{spark.kubernetes.executor.cores}} that is only used for specifying the CPU 
core request for the executor pods. Existing configuration properties 
{{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task 
parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} 
only determines the physical CPU cores available to an executor. An executor 
can still run multiple tasks simultaneously if {{spark.executor.cores}} is a 
multiple of {{spark.task.cpus}}. If not set, 
{{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. 
WDYT? 

 

[~felixcheung] [~jerryshao] [~jiangxb1987]


was (Author: liyinan926):
Given the complexity and significant impact of the changes proposed in 
[https://github.com/apache/spark/pull/20460] to the way Spark handles task 
scheduling, task parallelism, and dynamic resource allocation, etc., I'm 
thinking if we should instead introduce a K8s specific configuration property 
for specifying the executor cores that follows the Kubernetes 
[convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu].
 It seems Mesos fine-grained mode does this with 
{{spark.mesos.mesosExecutor.cores}}. We can have something like 
{{spark.kubernetes.executor.cores}} that is only used for specifying the CPU 
core request for the executor pods. Existing configuration properties 
{{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task 
parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} 
only determines the physical CPU cores available to an executor. An executor 
can still run multiple tasks simultaneously if {{spark.executor.cores}} is a 
multiple of {{spark.task.cpus}}. If not set, 
{{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. 
WDYT? 

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-02-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357504#comment-16357504
 ] 

Bryan Cutler edited comment on SPARK-23244 at 2/8/18 8:08 PM:
--

This is the same issue as SPARK-21685, caused by PySpark not differentiating 
between default and explicitly set params when transferring values to Java. I 
think we can close this and SPARK-23234 also.


was (Author: bryanc):
This is same issue as SPARK-21685 caused by pyspark not differentiating from 
defaults and set params when transferring values to Java. I think we can close 
this and SPARK-23234 also.

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like after 
> deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = Bucketizer.load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-02-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357504#comment-16357504
 ] 

Bryan Cutler commented on SPARK-23244:
--

This is the same issue as SPARK-21685, caused by PySpark not differentiating 
between default and explicitly set params when transferring values to Java. I 
think we can close this and SPARK-23234 also.

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like after 
> deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = Bucketizer.load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357500#comment-16357500
 ] 

Yinan Li commented on SPARK-23285:
--

Given the complexity and significant impact of the changes proposed in 
[https://github.com/apache/spark/pull/20460] to the way Spark handles task 
scheduling, task parallelism, and dynamic resource allocation, etc., I'm 
thinking if we should instead introduce a K8s specific configuration property 
for specifying the executor cores that follows the Kubernetes 
[convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu].
 It seems Mesos fine-grained mode does this with 
{{spark.mesos.mesosExecutor.cores}}. We can have something like 
{{spark.kubernetes.executor.cores}} that is only used for specifying the CPU 
core request for the executor pods. Existing configuration properties 
{{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task 
parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} 
only determines the physical CPU cores available to an executor. An executor 
can still run multiple tasks simultaneously if {{spark.executor.cores}} is a 
multiple of {{spark.task.cpus}}. If not set, 
{{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. 
WDYT? 
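
Read literally, the proposal decouples the pod's CPU request from task 
scheduling. A rough sketch of how that could look (the kubernetes property is 
the one proposed above, not an existing setting, and the values are purely 
illustrative):
{noformat}
spark.kubernetes.executor.cores=0.5   # proposed: CPU request for the executor pod (K8s fractional semantics)
spark.executor.cores=2                # unchanged: scheduling slots per executor
spark.task.cpus=1                     # unchanged: each executor still runs 2 tasks concurrently
{noformat}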

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23318) FP-growth: WARN FPGrowth: Input data is not cached

2018-02-08 Thread Arseniy Tashoyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357482#comment-16357482
 ] 

Arseniy Tashoyan commented on SPARK-23318:
--

I want to, but I'm short on time now. Will do.

> FP-growth: WARN FPGrowth: Input data is not cached
> --
>
> Key: SPARK-23318
> URL: https://issues.apache.org/jira/browse/SPARK-23318
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Arseniy Tashoyan
>Priority: Minor
>  Labels: MLLib,, fp-growth
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running FPGrowth.fit() from the _ml_ package, one can see a warning:
> WARN FPGrowth: Input data is not cached.
> This warning occurs even if the dataset of transactions is cached.
> Actually this warning comes from the FPGrowth implementation in the old 
> _mllib_ package. The new FPGrowth performs some transformations on the input 
> dataset of transactions and then passes it to the old FPGrowth - without 
> caching. Hence the warning.
> The problem looks similar to SPARK-18356
>  If you don't mind, I can push a similar fix:
> {code}
> // ml.FPGrowth
> val handlePersistence = dataset.storageLevel == StorageLevel.NONE
> if (handlePersistence) {
>   // cache the data
> }
> // then call mllib.FPGrowth
> // finally unpersist the data
> {code}
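
A minimal PySpark sketch of the scenario described above (column names, data 
and thresholds are illustrative, and an active SparkSession named spark is 
assumed): even when the transactions DataFrame is cached, fit() still triggers 
the warning, because the internally transformed RDD handed to the old mllib 
FPGrowth is the part that remains uncached.
{code:python}
# Scenario sketch only -- not the proposed internal fix.
from pyspark.ml.fpm import FPGrowth

transactions = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "c"]), (2, ["a", "b", "c"])],
    ["id", "items"])
transactions.cache()  # the input is cached, yet "WARN FPGrowth: Input data is not cached" still appears

model = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6).fit(transactions)
model.freqItemsets.show()

transactions.unpersist()
{code}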



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-02-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357473#comment-16357473
 ] 

Marco Gaido commented on SPARK-23244:
-

The change is related because your problem is caused by the Python API 
(wrongly) setting all the values (default and non-default) as real values. So 
the model is persisted with all the default values set as if they were actually 
set by the user. That PR avoids actually setting the default values, so the 
persisted model will treat them all as defaults and the newly loaded model will 
be right.

If you have more questions feel free to ask. And feel free to try my patch on 
your own to check whether your problem is solved, and to provide feedback on it 
if you want.

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like after 
> deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = Bucketizer.load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357441#comment-16357441
 ] 

Kazuaki Ishizaki commented on SPARK-23309:
--

When there is a repro, I am happy to investigate the reason.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql 
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2 I see query times averaging 13 seconds
> On the same nodes I see Spark 2.3 query times averaging 17 seconds
> Note these are times of queries after the initial caching.  so just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between spark 
> 2.3 and spark 2.2, where 2.2 was better by like 20%.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23336) Upgrade snappy-java to 1.1.7.1

2018-02-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23336:
-

Assignee: Yuming Wang

> Upgrade snappy-java to 1.1.7.1
> --
>
> Key: SPARK-23336
> URL: https://issues.apache.org/jira/browse/SPARK-23336
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> We should upgrade the snappy-java version to improve compression performance 
> (5%) and decompression performance (20%).
> Details:
>  
> [https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-114-2017-05-22]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23336) Upgrade snappy-java to 1.1.7.1

2018-02-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23336.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20510
[https://github.com/apache/spark/pull/20510]

> Upgrade snappy-java to 1.1.7.1
> --
>
> Key: SPARK-23336
> URL: https://issues.apache.org/jira/browse/SPARK-23336
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> We should upgrade the snappy-java version to improve compression performance 
> (5%) and decompression performance (20%).
> Details:
>  
> [https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-114-2017-05-22]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23271) Parquet output contains only "_SUCCESS" file after empty DataFrame saving

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357399#comment-16357399
 ] 

Apache Spark commented on SPARK-23271:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/20551

> Parquet output contains only "_SUCCESS" file after empty DataFrame saving 
> --
>
> Key: SPARK-23271
> URL: https://issues.apache.org/jira/browse/SPARK-23271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pavlo Z.
>Priority: Minor
> Attachments: parquet-empty-output.zip
>
>
> A sophisticated case, reproduced only when reading an empty CSV file without 
> a header but with an assigned schema.
> Steps for reproduce (Scala):
> {code:java}
> val anySchema = StructType(StructField("anyName", StringType, nullable = 
> false) :: Nil)
> val inputDF = spark.read.schema(anySchema).csv(inputFolderWithEmptyCSVFile)
> inputDF.write.parquet(outputFolderName)
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema 
> for Parquet. It must be specified manually.;
> val actualDF = spark.read.parquet(outputFolderName)
>  
> {code}
> *Actual:* Only "_SUCCESS" file in output directory
> *Expected*: at least one Parquet file with schema.
> Project for reproduce is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-02-08 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * ~Need to add some user docs~
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-02-08 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * ~Need to add some user docs~
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark ** 

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * ~Need to add some user docs~
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * ~Need to add some user docs~
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-02-08 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Map
 * -*Decimal*-
 * *Binary* - in pyspark ** 

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * ~Need to add some user docs~
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.

Remaining types:

* -*Date*-
* -*Timestamp*-
* *Complex*: Struct, -Array-, Map
* -*Decimal*-


Some things to do before closing this out:

* -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
* Need to add some user docs
* -Make sure Python tests are thorough-
* Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark ** 
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * ~Need to add some user docs~
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357387#comment-16357387
 ] 

Steve Loughran commented on SPARK-23308:


HADOOP-15216 covers S3A handling this failure with backoff, also FNFEs on 
delete inconsistency

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it ignores any kind of 
> RuntimeException or IOException, but some IOExceptions may happen even if the 
> file is not corrupted.
> One example is SocketTimeoutException, which can be retried to fetch the data 
> and does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
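
For context, the option in question is enabled as below (a minimal sketch on an 
active SparkSession named spark; the input path is hypothetical). The point of 
the report is that, once enabled, retryable exceptions such as 
SocketTimeoutException are swallowed along with genuine corruption.
{code:python}
# Minimal sketch: enabling the option discussed above.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.parquet("/path/to/possibly-corrupt-data")  # hypothetical path
df.count()  # files that raise an IOException or RuntimeException are skipped rather than failing the job
{code}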



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357363#comment-16357363
 ] 

Sameer Agarwal commented on SPARK-23309:


Thanks, I'll then go ahead and downgrade the priority for now to unblock RC3. 
Please feel free to -1 the RC if there's a repro.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql 
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2 I see query times averaging 13 seconds
> On the same nodes I see Spark 2.3 query times averaging 17 seconds
> Note these are times of queries after the initial caching.  so just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between spark 
> 2.3 and spark 2.2, where 2.2 was better by like 20%.  
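For reference, a sketch of the measurement loop described above (table and column names are 
the placeholders from the report): the count() forces the cache to be fully materialized 
first, and spark.time prints the wall-clock time of each repeated run.

{code:scala}
// Sketch only: cache the daily slice once, then time only the repeated aggregate query.
val dailycached = spark.sql(
  "select something from table where dt = '20170301' AND something IS NOT NULL")
dailycached.createOrReplaceTempView("dailycached")
spark.catalog.cacheTable("dailycached")
spark.table("dailycached").count()   // materialize the cache before timing

(1 to 5).foreach { _ =>
  spark.time(spark.sql("SELECT COUNT(DISTINCT(something)) FROM dailycached").show())
}
{code}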






[jira] [Updated] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23309:
---
Priority: Major  (was: Blocker)

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are the times of queries after the initial caching, i.e. just 
> running the last line,
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show(),
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in cached-query performance between Spark 2.3 and 
> Spark 2.2, with 2.2 faster by about 20%.






[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357346#comment-16357346
 ] 

Thomas Graves commented on SPARK-23309:
---

Sorry, I haven't had time to make a query/dataset to reproduce that. I'm OK 
with this not being a blocker for 2.3.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are the times of queries after the initial caching, i.e. just 
> running the last line,
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show(),
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in cached-query performance between Spark 2.3 and 
> Spark 2.2, with 2.2 faster by about 20%.






[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357344#comment-16357344
 ] 

Sameer Agarwal commented on SPARK-23309:


[~tgraves] [~smilegator] [~cloud_fan] – any advice here? If we'd like to 
continue blocking the release on this, it'd be good to have a repro.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are the times of queries after the initial caching, i.e. just 
> running the last line,
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show(),
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in cached-query performance between Spark 2.3 and 
> Spark 2.2, with 2.2 faster by about 20%.






[jira] [Reopened] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-02-08 Thread Tomas Nykodym (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Nykodym reopened SPARK-23244:
---

I might be wrong, but I don't think this is a duplicate. There is an overlap 
between the two in that the default values are set as real values, but:
1) I am not sure it is the same context. I looked at the PR addressing 
SPARK-23234 and the changes were in a method unrelated to my problem.
2) I don't see how SPARK-23234 addresses how default values based on the uid are 
set after deserialization of a JavaTransformer.



> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers that are wrappers around Scala objects. It looks like, after 
> deserialization, the default values that were based on the uid do not get 
> properly restored, and values that were never explicitly set end up marked as 
> set to the original object's default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = Bucketizer.load("bucketizer0")
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}






[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-02-08 Thread Sandeep Kumar Choudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357281#comment-16357281
 ] 

Sandeep Kumar Choudhary commented on SPARK-18844:
-

I have submitted the patch. It is now okay to test.

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.
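As an illustration of the duplication involved today, here is a hand-rolled sketch 
(collect-and-sort, no binning or downsampling, so only suitable for small score sets) that 
derives specificity per threshold from (score, label) pairs; the same running counts would 
also give FPR, NPV, and the other metrics listed above.

{code:scala}
import org.apache.spark.rdd.RDD

// Illustrative sketch: specificity (TN / (TN + FP)) at each score threshold, with the
// thresholds being the scores themselves in descending order, mirroring *ByThreshold.
def specificityByThreshold(scoreAndLabels: RDD[(Double, Double)]): Array[(Double, Double)] = {
  val sorted = scoreAndLabels.collect().sortBy(-_._1)   // small-data shortcut
  val totalNeg = sorted.count(_._2 <= 0.5)
  var fp = 0
  sorted.map { case (score, label) =>
    if (label <= 0.5) fp += 1                            // predicted positive, actually negative
    val tn = totalNeg - fp
    (score, if (totalNeg == 0) 0.0 else tn.toDouble / totalNeg)
  }
}
{code}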






[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357266#comment-16357266
 ] 

Steve Loughran commented on SPARK-23308:


bq. Other option would be creating a special exception 
(CorruptedFileException?) that could be thrown by FS implementations and let 
them decide what is a corrupted file or just a transient error.

It's pretty hard to get consistent semantics on "working" FS behaviour, let 
alone failure modes; that's why the Hadoop FS specs and compliance tests have the 
notion of strict failure ("does what HDFS does") and lax ("raises an IOE"). 
AFAIK HDFS raises {{ChecksumException}} on checksum errors; I don't know what 
it does on, say, decryption failure or erasure coding problems, and I don't 
really want to look. You could try to add a parent class here, an "unrecoverable 
IOE", and see about getting it into everything over time.

Common prefixes and the classic year=2018/month=12 partitioning are pretty 
pathological for S3. But as you say, 503 is the standard response, though it 
may be caught in the AWS SDK. Talk to the AWS people.
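A rough sketch of what that parent-class idea could look like on the caller side. The 
exception type here is hypothetical (it does not exist in Hadoop today); {{ChecksumException}} 
does exist and is what HDFS raises on checksum errors.

{code:scala}
import java.io.IOException
import org.apache.hadoop.fs.ChecksumException

// Hypothetical marker type a filesystem could raise for genuinely unrecoverable data errors.
class UnrecoverableIOException(message: String, cause: Throwable)
  extends IOException(message, cause)

// Caller-side sketch: only these are treated as corruption; everything else is retried.
def isCorruption(e: IOException): Boolean = e match {
  case _: UnrecoverableIOException => true
  case _: ChecksumException => true    // HDFS checksum failure
  case _ => false                      // assume transient (timeouts, throttling) and retry
}
{code}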

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, Spark ignores any kind of 
> RuntimeException or IOException, but some IOExceptions can occur even when 
> the file is not corrupted.
> One example is SocketTimeoutException, which can be retried to fetch the 
> data and does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163






[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-02-08 Thread Tomas Nykodym (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357268#comment-16357268
 ] 

Tomas Nykodym commented on SPARK-23244:
---

I might be wrong, but I don't think this is a duplicate. There is an overlap 
between the two in that the default values are set as real values, but:
1) I am not sure it is the same context. I looked at the PR addressing 
SPARK-23234 and the changes were in a method unrelated to my problem.
2) I don't see how SPARK-23234 addresses how default values based on the uid are 
set after deserialization of a JavaTransformer.



> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers that are wrappers around Scala objects. It looks like, after 
> deserialization, the default values that were based on the uid do not get 
> properly restored, and values that were never explicitly set end up marked as 
> set to the original object's default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = Bucketizer.load("bucketizer0")
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}






[jira] [Assigned] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18844:


Assignee: (was: Apache Spark)

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Assigned] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-02-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18844:


Assignee: Apache Spark

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Assignee: Apache Spark
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357256#comment-16357256
 ] 

Apache Spark commented on SPARK-18844:
--

User 'sandecho' has created a pull request for this issue:
https://github.com/apache/spark/pull/20549

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Updated] (SPARK-11334) numRunningTasks can't be less than 0, or it will affect executor allocation

2018-02-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-11334:
---
Fix Version/s: 2.3.0

> numRunningTasks can't be less than 0, or it will affect executor allocation
> ---
>
> Key: SPARK-11334
> URL: https://issues.apache.org/jira/browse/SPARK-11334
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: meiyoula
>Assignee: Sital Kedia
>Priority: Major
> Fix For: 2.3.0
>
>
> With the *Dynamic Allocation* feature, when a task has failed more than 
> *maxFailure* times, all the dependent jobs, stages, and tasks are killed or 
> aborted. In this process, the *SparkListenerTaskEnd* event can arrive after 
> *SparkListenerStageCompleted* and *SparkListenerJobEnd*, like the event log below:
> {code}
> {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":20,"Stage 
> Attempt ID":0,"Stage Name":"run at AccessController.java:-2","Number of 
> Tasks":200}
> {"Event":"SparkListenerJobEnd","Job ID":9,"Completion Time":1444914699829}
> {"Event":"SparkListenerTaskEnd","Stage ID":20,"Stage Attempt ID":0,"Task 
> Type":"ResultTask","Task End Reason":{"Reason":"TaskKilled"},"Task 
> Info":{"Task ID":1955,"Index":88,"Attempt":2,"Launch 
> Time":1444914699763,"Executor 
> ID":"5","Host":"linux-223","Locality":"PROCESS_LOCAL","Speculative":false,"Getting
>  Result Time":0,"Finish Time":1444914699864,"Failed":true,"Accumulables":[]}}
> {code}
> Because of that, *numRunningTasks* in the *ExecutorAllocationManager* class can 
> drop below 0, which affects executor allocation.
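One defensive fix is simply to guard the decrement so a late task-end event can never drive 
the counter negative; a minimal sketch with illustrative names, not the actual 
ExecutorAllocationManager code:

{code:scala}
// Sketch: track running tasks, but never let a late SparkListenerTaskEnd push the count below 0.
class RunningTaskCounter {
  private var numRunningTasks = 0

  def onTaskStart(): Unit = synchronized { numRunningTasks += 1 }

  def onTaskEnd(): Unit = synchronized {
    numRunningTasks = math.max(numRunningTasks - 1, 0)
  }

  def running: Int = synchronized { numRunningTasks }
}
{code}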






[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357107#comment-16357107
 ] 

Marcelo Vanzin commented on SPARK-20327:


We cannot add a feature that requires Hadoop 3.0. If you need to call APIs that 
only exist in Hadoop 3.0, you need to do so via reflection.

There are no plans yet to drop support for Hadoop 2.x.
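For illustration, a hedged sketch of the reflection approach. It assumes the Hadoop 3.x 
{{Resource.setResourceInformation}} / {{ResourceInformation.newInstance}} API and looks it 
up only by name, so the class still compiles and loads against Hadoop 2.x profiles.

{code:scala}
import org.apache.hadoop.yarn.api.records.Resource

// Sketch only: request a custom YARN resource (e.g. "yarn.io/gpu") without a compile-time
// dependency on Hadoop 3.x. Class and method names are resolved reflectively at runtime.
def setCustomResource(resource: Resource, name: String, amount: Long): Unit = {
  try {
    val riClass = Class.forName("org.apache.hadoop.yarn.api.records.ResourceInformation")
    val newInstance = riClass.getMethod("newInstance", classOf[String], classOf[Long])
    val info = newInstance.invoke(null, name, java.lang.Long.valueOf(amount))
    val setter = resource.getClass.getMethod("setResourceInformation", classOf[String], riClass)
    setter.invoke(resource, name, info)
  } catch {
    case e @ (_: ClassNotFoundException | _: NoSuchMethodException) =>
      // Hadoop 2.x on the classpath: custom resource types are simply not available.
      throw new UnsupportedOperationException(
        s"Custom YARN resource '$name' requires Hadoop 3.x", e)
  }
}
{code}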

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.






[jira] [Updated] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20327:
---
Shepherd:   (was: Mark Grover)

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.






[jira] [Resolved] (SPARK-21860) Improve memory reuse for heap memory in `HeapMemoryAllocator`

2018-02-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21860.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19077
[https://github.com/apache/spark/pull/19077]

> Improve memory reuse for heap memory in `HeapMemoryAllocator`
> -
>
> Key: SPARK-21860
> URL: https://issues.apache.org/jira/browse/SPARK-21860
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.0
>
>
> In `HeapMemoryAllocator`, memory is allocated from a pool whose key is the 
> memory size.
> Some sizes, such as 1025 bytes, 1026 bytes, ..., 1032 bytes, can be treated as 
> the same, because memory is allocated in multiples of 8 bytes.
> In this case, we can improve memory reuse.
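A small sketch of the idea, with illustrative names rather than the real 
{{HeapMemoryAllocator}} internals: key the free list on the size rounded up to whole 
8-byte words, so requests for 1025 to 1032 bytes all land in the same bucket and can reuse 
each other's arrays.

{code:scala}
import scala.collection.mutable

// Sketch only: pool long[] buffers keyed by word count instead of exact byte size.
object AlignedBufferPool {
  private val pools = mutable.HashMap.empty[Int, mutable.Stack[Array[Long]]]

  private def words(sizeInBytes: Long): Int = ((sizeInBytes + 7) / 8).toInt

  def allocate(sizeInBytes: Long): Array[Long] = synchronized {
    val key = words(sizeInBytes)
    pools.get(key).collect { case s if s.nonEmpty => s.pop() }
      .getOrElse(new Array[Long](key))          // 1025..1032 bytes all allocate 129 words
  }

  def free(buffer: Array[Long]): Unit = synchronized {
    pools.getOrElseUpdate(buffer.length, mutable.Stack.empty[Array[Long]]).push(buffer)
  }
}
{code}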





