[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043732#comment-16043732 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

For `ParquetFileFormat#readFootersInParallel`, I would suggest that you just set the TaskContext inside "parFiles.flatMap".

{code}
// Capture the context on the task thread.
val taskContext = TaskContext.get
val parFiles = partFiles.par
parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
parFiles.flatMap { currentFile =>
  // Re-install it on the pool thread before doing any work.
  TaskContext.setTaskContext(taskContext)
  ...
}.seq
{code}

In this special case it is safe, since this is a local, one-time thread pool.

> TaskContext should be an InheritableThreadLocal
> -----------------------------------------------
>
>                 Key: SPARK-20952
>                 URL: https://issues.apache.org/jira/browse/SPARK-20952
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Robert Kruszewski
>            Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your
> executor task you lose the handle on the original context set by the
> executor. We should change it to InheritableThreadLocal so we can access it
> inside thread pools on executors.
> See ParquetFileFormat#readFootersInParallel for example of code that uses
> thread pools inside the tasks.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043714#comment-16043714 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

Right, but how do I pass it downstream?
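
One way to read the "pass it downstream" question is sketched below without any Spark dependency. Everything here is an assumption made for illustration: `ctx` is a plain `ThreadLocal[String]` standing in for a per-task context, and `withContext` is a hypothetical helper, not a Spark API. The wrapper installs a value captured on the task thread for the duration of one pooled call and restores the previous value afterwards, so code underneath that reads the thread-local keeps working.

```scala
import java.util.concurrent.{Callable, Executors}

object PassContextDemo {
  // Hypothetical stand-in for a non-inheritable per-task context (not Spark's TaskContext).
  val ctx = new ThreadLocal[String]

  // Hypothetical helper: run `body` with `captured` installed on the executing
  // thread, then restore whatever value that thread had before.
  def withContext[A](captured: String)(body: => A): Callable[A] = new Callable[A] {
    def call(): A = {
      val previous = ctx.get
      ctx.set(captured)
      try {
        body
      } finally {
        ctx.set(previous)
      }
    }
  }

  // Returns what a pool thread sees without the wrapper, with it, and after it.
  def demo(): (String, String, String) = {
    ctx.set("task-1")
    val captured = ctx.get // read once, on the "task" thread
    val pool = Executors.newFixedThreadPool(1)
    val readCtx = new Callable[String] { def call(): String = ctx.get }
    try {
      val bare = pool.submit(readCtx).get()
      val wrapped = pool.submit(withContext(captured) { ctx.get }).get()
      val after = pool.submit(readCtx).get()
      (bare, wrapped, after)
    } finally {
      pool.shutdown()
      ctx.remove()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The restore step in `finally` is what keeps a reused pool thread from carrying the captured value into unrelated work.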
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043712#comment-16043712 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

It doesn't, but things underneath it do. From a consumer's perspective it is weird to have a feature that you can't really use because you can't assert that it behaves consistently. In my case, we have some filesystem features that rely on TaskContext.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043710#comment-16043710 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

Although I don't know what you plan to do, you can save the TaskContext into a local variable like this:

{code}
private[parquet] def readParquetFootersInParallel(
    conf: Configuration,
    partFiles: Seq[FileStatus],
    ignoreCorruptFiles: Boolean): Seq[Footer] = {
  val taskContext = TaskContext.get
  val parFiles = partFiles.par
  parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  parFiles.flatMap { currentFile =>
    try {
      // Use `taskContext` rather than `TaskContext.get`

      // Skips row group information since we only need the schema.
      // ParquetFileReader.readFooter throws RuntimeException, instead of IOException,
      // when it can't read the footer.
      Some(new Footer(currentFile.getPath(),
        ParquetFileReader.readFooter(
          conf, currentFile, SKIP_ROW_GROUPS)))
    } catch { case e: RuntimeException =>
      if (ignoreCorruptFiles) {
        logWarning(s"Skipped the footer in the corrupted file: $currentFile", e)
        None
      } else {
        throw new IOException(s"Could not read footer for file: $currentFile", e)
      }
    }
  }.seq
}
{code}
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043708#comment-16043708 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

Why does it need TaskContext?
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043704#comment-16043704 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

No modifications; it's this code, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L477, which spins up a thread pool to read files per partition. I imagine there are more cases like this, but this is the first one I encountered.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043694#comment-16043694 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

[~robert3005] could you show me your code? Are you modifying "ParquetFileFormat#readFootersInParallel"?
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043669#comment-16043669 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

I am not really attached to the solution. I would be happy to implement anything that the maintainers are happy with, as long as it ensures we can always get the TaskContext anywhere on the task side. For instance, the issue I am facing now is that ParquetFileFormat#readFootersInParallel is not able to access it, which leads to failures.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043655#comment-16043655 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

If TaskContext is not inheritable, we can always find a way to pass it to the code that needs to access it. But if it's inheritable, it's pretty hard to avoid TaskContext pollution (or to avoid using a stale TaskContext: you would always have to set it manually in a task running in a cached thread).

[~joshrosen] listed many tickets caused by localProperties being an InheritableThreadLocal: https://issues.apache.org/jira/browse/SPARK-14686?focusedCommentId=15244478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15244478
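
The pollution concern in the comment above can be illustrated with a small sketch that uses only the JDK, under stated assumptions: `ctx` is a hypothetical `InheritableThreadLocal[String]` standing in for an inheritable TaskContext, and `runTask` simulates one executor task that sets the context, uses a long-lived shared pool, and clears its own copy. None of these names come from Spark.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors}

object StaleContextDemo {
  // Hypothetical stand-in for an inheritable TaskContext.
  val ctx = new InheritableThreadLocal[String]

  // Reads the context as seen by a thread of the given pool.
  private def readOn(pool: ExecutorService): String =
    pool.submit(new Callable[String] { def call(): String = ctx.get }).get()

  // Simulates one task: set the context on a fresh task thread,
  // do some work on the shared pool, then clear the task thread's value.
  private def runTask(name: String, shared: ExecutorService): String = {
    var seen: String = null
    val t = new Thread(new Runnable {
      def run(): Unit = {
        ctx.set(name)
        try {
          seen = readOn(shared)
        } finally {
          ctx.remove()
        }
      }
    })
    t.start()
    t.join()
    seen
  }

  // Returns what the shared pool thread observed during task A and task B.
  def demo(): (String, String) = {
    val shared = Executors.newFixedThreadPool(1) // long-lived pool shared by both tasks
    try {
      val duringA = runTask("task-A", shared) // worker thread is created here, by task A
      val duringB = runTask("task-B", shared) // worker thread is reused
      (duringA, duringB)
    } finally {
      shared.shutdown()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

In this sketch the pool's worker thread is created while task A is running, so it copies task A's value and keeps it. Task B clears and sets only its own thread's value, and its work on the shared pool still observes "task-A" unless the value is set manually inside the pooled call.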
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043623#comment-16043623 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

This is already an issue on the driver side, though (that thread pool is on the driver side, which already has inheritable thread pool). This issue is only so that we have the same behaviour on executors and the driver.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035302#comment-16035302 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

What I'm concerned about is global thread pools, such as https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L128
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035195#comment-16035195 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

2 is already happening on executors, where the Task sets and unsets its TaskContext correctly. Agreed that we should add 1.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035171#comment-16035171 ]

Andrew Ash commented on SPARK-20952:
------------------------------------

For the localProperties on SparkContext, it does 2 things that I can see to improve safety:

- first, it clones the properties for new threads so that changes in the parent thread don't unintentionally affect a child thread: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L330
- second, it clears the properties when they're no longer being used: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L1942

Do we need to do either the defensive cloning or the proactive clearing of taskInfos in executors, as is done in the driver?
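
The two safety measures described above (defensive cloning and proactive clearing) can be sketched with the JDK alone. This is a minimal illustration, assuming a hypothetical `props` thread-local rather than SparkContext's actual field: `childValue` hands each new thread its own copy, and `remove()` clears the value once it is no longer needed.

```scala
import java.util.Properties

object ClonedPropsDemo {
  // Hypothetical inheritable properties; the child gets a copy, not a shared reference.
  val props: InheritableThreadLocal[Properties] = new InheritableThreadLocal[Properties] {
    override protected def childValue(parent: Properties): Properties =
      parent.clone().asInstanceOf[Properties] // defensive (shallow) copy for the child
    override protected def initialValue(): Properties = new Properties()
  }

  // Returns the value of "k" as seen by the child thread and by the parent thread.
  def demo(): (String, String) = {
    props.get.setProperty("k", "v1")
    var childSaw: String = null
    // The parent's value is copied for the child when the Thread object is constructed.
    val child = new Thread(new Runnable {
      def run(): Unit = { childSaw = props.get.getProperty("k") }
    })
    props.get.setProperty("k", "v2") // later change in the parent only
    child.start()
    child.join()
    val parentSaw = props.get.getProperty("k")
    props.remove() // proactive clearing once the value is no longer in use
    (childSaw, parentSaw)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Without the `childValue` override, both threads would hold the same `Properties` object, and the child would observe the parent's later change.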
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034427#comment-16034427 ]

Robert Kruszewski commented on SPARK-20952:
-------------------------------------------

You're right that this needs a bit of clarification. There's a bit more subtlety with respect to the actual thread-local and its use. First of all, this makes the behaviour the same as on the driver, where you have localProperties on SparkContext, which are inheritable. Secondly, I believe the issue you're describing will not arise, since a) executor tasks are uninterruptible and b) the thread pool used to run them is a cachedThreadPool and not a ForkJoinPool; hence a given task thread will not inherit from another task thread. Let me know if I am missing something here, though.
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033915#comment-16033915 ]

Shixiong Zhu commented on SPARK-20952:
--------------------------------------

InheritableThreadLocal only works when creating a new thread. Here you were talking about thread pools.
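
The distinction drawn above, between creating a new thread and reusing a pooled one, can be checked with a short JDK-only sketch. The `ctx` variable is a hypothetical stand-in, not Spark's TaskContext, and the sketch assumes the calling thread has no value set for it beforehand.

```scala
import java.util.concurrent.{Callable, Executors}

object InheritOnCreationDemo {
  // Hypothetical inheritable context.
  val ctx = new InheritableThreadLocal[String]

  // Returns what a brand-new thread sees and what an already-running pool thread sees.
  def demo(): (String, String) = {
    val pool = Executors.newFixedThreadPool(1)
    val readCtx = new Callable[String] { def call(): String = ctx.get }
    try {
      // Force the pool's worker thread to exist before any value is set.
      pool.submit(readCtx).get()

      ctx.set("task-1")

      // A thread created now inherits the current value from its creator.
      var fresh: String = null
      val t = new Thread(new Runnable {
        def run(): Unit = { fresh = ctx.get }
      })
      t.start()
      t.join()

      // The pool thread was created earlier, so nothing is inherited.
      val pooled = pool.submit(readCtx).get()
      (fresh, pooled)
    } finally {
      pool.shutdown()
      ctx.remove()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Here the fresh thread sees "task-1" while the pooled thread sees no value at all, which is why switching to an inheritable thread-local does not by itself cover pools whose threads already exist.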
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033385#comment-16033385 ]

Apache Spark commented on SPARK-20952:
--------------------------------------

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/18176