[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-12-01 Thread Shivaram Venkataraman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230878#comment-14230878 ]

Shivaram Venkataraman commented on SPARK-3963:
--

[~pwendell] This looks pretty useful -- was this postponed from 1.2? I have a 
use case that needs Hadoop file names and was wondering if there is a 
workaround until this is implemented.

 Support getting task-scoped properties from TaskContext
 ---

 Key: SPARK-3963
 URL: https://issues.apache.org/jira/browse/SPARK-3963
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell

 This is a proposal for a minor feature. Given the stabilization of the 
 TaskContext API, it would be nice to have a mechanism for Spark jobs to 
 access properties that are defined at task-level scope by Spark RDDs. 
 I'd like to propose adding a simple properties hash map with some standard 
 Spark properties that users can access. Later it would be nice to support 
 users setting these properties, but for now, to keep things simple for 1.2, 
 I'd prefer that users not be able to set them.

 The main use case is providing the file name from Hadoop RDDs, a very common 
 request. But I'd imagine us using this for other things later on. We could 
 also use it to expose some of the task metrics, such as the input bytes.
 {code}
 val data = sc.textFile("s3n://.../2014/*/*/*.json")
 data.mapPartitions { iter =>
   val tc = TaskContext.get
   // Proposed lookup: the file this task's partition was read from
   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
   val parts = fileName.split("/")
   val (year, month, day) = (parts(3), parts(4), parts(5))
   ...
 }
 {code}
 Internally we'd have a method called setProperty, but this wouldn't be 
 exposed initially. This is structured as a simple (String, String) hash map 
 for ease of porting to Python.
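
 As a rough illustration (hypothetical only -- the class name, the key name, 
 and the ConcurrentHashMap backing below are assumptions for the sketch, not 
 committed API), the shape could be something like:
 {code}
 import java.util.concurrent.ConcurrentHashMap

 // Hypothetical sketch; nothing here is committed API.
 class TaskProperties {
   // A plain (String, String) map so it ports cleanly to Python
   private val properties = new ConcurrentHashMap[String, String]()

   // Public, read-only lookup for user code (null if the key is unset)
   def getProperty(key: String): String = properties.get(key)

   // Internal setter; would be private[spark] and hidden from users initially
   def setProperty(key: String, value: String): Unit = properties.put(key, value)
 }

 object TaskProperties {
   // Example of a standard key that Spark itself would populate
   val HADOOP_FILE_NAME = "spark.task.hadoopFileName"
 }
 {code}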






[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-12-01 Thread Patrick Wendell (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230887#comment-14230887 ]

Patrick Wendell commented on SPARK-3963:


[~shivaram] - I think HadoopRDD has a mapPartitionsWithInputSplit; it's a bit 
ugly, but you may be able to use it if you can get access to the 
underlying HadoopRDD. IIRC the split can give you the file name.
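
A rough sketch of that workaround (hedged: it assumes the file is read via 
sc.hadoopFile so the underlying HadoopRDD stays reachable, and that the 
input format is file-based so the InputSplit is a FileSplit; the variable 
names are illustrative):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// sc.hadoopFile keeps the HadoopRDD reachable (sc.textFile maps it away)
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
  "s3n://.../2014/*/*/*.json")

val linesWithFileNames = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    // For file-based input formats the split is a FileSplit carrying the path
    val fileName = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (fileName, line.toString) }
  }
{code}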




[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-12-01 Thread Shivaram Venkataraman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230900#comment-14230900 ]

Shivaram Venkataraman commented on SPARK-3963:
--

Thanks. I somehow missed `mapPartitionsWithInputSplit` -- that will work for 
now.




[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-10-16 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173451#comment-14173451 ]

Reynold Xin commented on SPARK-3963:


Would it make sense to also support arbitrary data types? And should we 
consider merging this with TaskMetrics?




[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-10-16 Thread Patrick Wendell (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ]

Patrick Wendell commented on SPARK-3963:


In the initial version of this - I don't want to do either of those things.




[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-10-15 Thread Patrick Wendell (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173404#comment-14173404 ]

Patrick Wendell commented on SPARK-3963:


[~rxin] and [~adav] I'd be interested in any feedback on this.
