Hello all, I have a custom RDD for fast loading of data from a non-partitioned source. The partitioning happens in the RDD implementation by pushing data from the source into queues picked up by the current active partitions in worker threads.
This works great multi-threaded on a single host (say, with the master set to "local[x]"), but I'd like to run it distributed. For that, I need to know not only which "slice" my partition is, but also which host (by sequence) it's on, so I can divide up the source by worker (host) and then run the multi-threaded loading within each host. In other words, I need what effectively amounts to a 2-tier slice identifier. I know this is probably unorthodox, but is there some way to get this information in the compute method or from the deserialized Partition objects?
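To make the 2-tier idea concrete, here is a minimal sketch of the identifier I have in mind, in plain Python rather than Spark code. All names (TwoTierSlice, two_tier_slice, source_range) are hypothetical and not part of any Spark API. It assumes that partitions are laid out host by host with the same number of slices on every host, and that the source can be addressed by record offsets; nothing in the sketch enforces that a given partition actually runs on the host it is assigned to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTierSlice:
    """Hypothetical 2-tier identifier (not a Spark type)."""
    host_index: int   # position of the worker host in sequence, 0-based
    local_slice: int  # position of the partition within that host, 0-based


def two_tier_slice(partition_index: int, slices_per_host: int) -> TwoTierSlice:
    """Map a flat partition index to (host, local slice).

    Assumes a host-by-host layout with a fixed number of slices per host.
    """
    if slices_per_host <= 0:
        raise ValueError("slices_per_host must be positive")
    if partition_index < 0:
        raise ValueError("partition_index must be non-negative")
    host_index, local_slice = divmod(partition_index, slices_per_host)
    return TwoTierSlice(host_index, local_slice)


def source_range(slice_id: TwoTierSlice, num_hosts: int,
                 slices_per_host: int, total_records: int) -> tuple:
    """Half-open [start, end) record range for one slice.

    First tier: split the source evenly across hosts.
    Second tier: split each host's share across its local slices.
    """
    host_start = slice_id.host_index * total_records // num_hosts
    host_end = (slice_id.host_index + 1) * total_records // num_hosts
    span = host_end - host_start
    start = host_start + slice_id.local_slice * span // slices_per_host
    end = host_start + (slice_id.local_slice + 1) * span // slices_per_host
    return (start, end)
```

With 2 hosts and 4 slices per host, flat partition 5 would map to host 1, local slice 1, and the eight slices together would cover the source without gaps or overlap. The open question is still how to learn the real host sequence at compute time so that this mapping matches where the partitions actually run.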