(SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2017-10-24 Thread Juan Rodríguez Hortalá
Hi, I've been working on this issue, and I would like to get your feedback on the following approach. The idea is that instead of failing in `TaskSetManager.abortIfCompletelyBlacklisted` when a task cannot be scheduled on any executor but dynamic allocation is enabled, we will register this task
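
A minimal, self-contained sketch of the proposed behavior (the class and method shapes below are simplified stand-ins, not the actual TaskSetManager internals):

    // Simplified model of the proposal: when every current executor is
    // blacklisted for a task, abort only if dynamic allocation is disabled;
    // otherwise keep the task pending so newly allocated executors can run it.
    case class Task(id: Int)

    class SimplifiedTaskSetManager(dynamicAllocationEnabled: Boolean) {
      private var pendingUnschedulable: List[Task] = Nil

      def abortIfCompletelyBlacklisted(task: Task, allExecutorsBlacklisted: Boolean): Unit = {
        if (allExecutorsBlacklisted) {
          if (dynamicAllocationEnabled) {
            // Dynamic allocation may bring up fresh, non-blacklisted executors,
            // so register the task instead of failing the whole task set.
            pendingUnschedulable ::= task
          } else {
            // No new executors will ever appear, so failing fast is correct.
            throw new IllegalStateException(
              s"Task ${task.id} cannot be scheduled on any executor")
          }
        }
      }
    }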

Re: Graceful node decommission mechanism for Spark

2017-10-20 Thread Juan Rodríguez Hortalá
Hi, Are there any comments or suggestions regarding this proposal? Thanks, Juan On Mon, Oct 16, 2017 at 10:27 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi all, > > I have a prototype for "Keep track of nodes which are going to be shut >

Graceful node decommission mechanism for Spark

2017-10-16 Thread Juan Rodríguez Hortalá
Hi all, I have a prototype for "Keep track of nodes which are going to be shut down & avoid scheduling new tasks" ( https://issues.apache.org/jira/browse/SPARK-20628) that I would like to discuss with the community. I added a WIP PR for that in https://github.com/apache/spark/pull/19267. The
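
A rough sketch of the mechanism named in the JIRA title, assuming the design is "track hosts marked for shutdown and consult that set when scheduling" (illustrative only, not the code in PR 19267):

    import scala.collection.mutable

    // Keeps the set of hosts that are going to be shut down; a scheduler
    // would consult isSchedulable before placing new tasks on a host.
    class DecommissionTracker {
      private val decommissioningHosts = mutable.Set.empty[String]

      def markForDecommission(host: String): Unit =
        synchronized { decommissioningHosts += host }

      def isSchedulable(host: String): Boolean =
        synchronized { !decommissioningHosts.contains(host) }
    }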

Re: RDD API patterns

2015-09-19 Thread Juan Rodríguez Hortalá
Hi Sim, I understand that what you propose is defining a trait SparkIterable (and also PairSparkIterable for RDDs of pairs) that encapsulates the methods of RDDs, and then programming against that trait instead of RDD. That is similar to programming against scala.collection.GenSeq to abstract from using
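
A hedged sketch of that idea (the trait name comes from the thread; the method set and the two implementations are illustrative):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Code written against SparkIterable runs unchanged on an RDD or on a
    // local collection, much like GenSeq abstracts over sequential/parallel.
    trait SparkIterable[T] {
      def map[U: ClassTag](f: T => U): SparkIterable[U]
      def filter(p: T => Boolean): SparkIterable[T]
    }

    class RDDIterable[T](rdd: RDD[T]) extends SparkIterable[T] {
      def map[U: ClassTag](f: T => U): SparkIterable[U] = new RDDIterable(rdd.map(f))
      def filter(p: T => Boolean): SparkIterable[T] = new RDDIterable(rdd.filter(p))
    }

    class LocalIterable[T](xs: Seq[T]) extends SparkIterable[T] {
      def map[U: ClassTag](f: T => U): SparkIterable[U] = new LocalIterable(xs.map(f))
      def filter(p: T => Boolean): SparkIterable[T] = new LocalIterable(xs.filter(p))
    }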

JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-16 Thread Juan Rodríguez Hortalá
Hi, Sorry to insist, but does anyone have any thoughts on this? Or could someone at least point me to documentation of DStream.compute() so that I can understand when I should return None for a batch? Thanks, Juan 2015-09-14 20:51 GMT+02:00 Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com>:
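
For reference, a minimal custom InputDStream whose compute() returns None when a batch has no data, assuming this is the situation behind the error (class and field names are illustrative):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    // Replays a fixed sequence of batches, then returns None forever.
    class FixedBatchesDStream(_ssc: StreamingContext, batches: Seq[Seq[Int]])
        extends InputDStream[Int](_ssc) {
      private var i = 0
      override def start(): Unit = {}
      override def stop(): Unit = {}
      override def compute(validTime: Time): Option[RDD[Int]] = {
        if (i < batches.length) {
          val batch = _ssc.sparkContext.parallelize(batches(i))
          i += 1
          Some(batch)
        } else {
          None // the "no RDD for this batch" case the question is about
        }
      }
    }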

Fwd: JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-14 Thread Juan Rodríguez Hortalá
Hi, I sent this message to the user list a few weeks ago with no luck, so I'm forwarding it to the dev list in case someone can lend a hand with this. Thanks a lot in advance. I've developed a ScalaCheck property for testing Spark Streaming transformations. To do that I had to develop a custom
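
A toy illustration of such a property (the real one drives batches through a StreamingContext via a custom InputDStream; here the per-batch transformation is modeled on plain Seqs):

    import org.scalacheck.{Gen, Prop, Properties}

    object StreamingTransformProps extends Properties("per-batch map") {
      // A stream is modeled as a list of batches, each a list of records.
      val batches: Gen[List[List[Int]]] =
        Gen.listOf(Gen.listOf(Gen.choose(0, 100)))

      property("map preserves the number of records per batch") =
        Prop.forAll(batches) { bs =>
          bs.map(_.map(_ * 2).length) == bs.map(_.length)
        }
    }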

Re: taking n rows from an RDD starting from an index

2015-09-02 Thread Juan Rodríguez Hortalá
Hi, Maybe you could use zipWithIndex and filter to skip the first elements. For example, starting from scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10),
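
A complete version of that pattern, taking n rows starting at a given index:

    val rdd = sc.parallelize(100 to 120, 4)
    val start = 5L // index of the first row to keep
    val n = 3L     // number of rows to take
    val slice = rdd.zipWithIndex
      .filter { case (_, idx) => idx >= start && idx < start + n }
      .map(_._1)
    // slice.collect() returns Array(105, 106, 107)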

Re: Compact RDD representation

2015-07-20 Thread Juan Rodríguez Hortalá
will save time on each repetition. I will also do some experiments. About small numbers of repetitions: in which use cases would we lose efficiency? I will also test that. What do I need to do to take this on? Just create a ticket in JIRA? 2015-07-19 21:56 GMT+03:00 Juan Rodríguez Hortalá juan.rodriguez.hort

Re: Compact RDD representation

2015-07-19 Thread Juan Rodríguez Hortalá
Hi, My two cents is that this could be interesting if all RDD and pair RDD operations were lifted to work on grouped RDDs. For example, as suggested, a map on grouped RDDs would be more efficient if the original RDD had lots of duplicate entries, but for RDDs with few repetitions I guess you
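
A hedged sketch of what that lifting could look like, representing the grouped RDD as (value, count) pairs (the helpers named here are illustrative, not a real Spark API):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object Grouped {
      // Compact representation: one (value, count) pair per distinct value.
      def toGrouped[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] =
        rdd.map((_, 1L)).reduceByKey(_ + _)

      // Lifted map: f runs once per distinct value instead of once per element,
      // which wins when there are many duplicates and loses when there are few.
      def mapGrouped[T, U: ClassTag](grouped: RDD[(T, Long)])(f: T => U): RDD[(U, Long)] =
        grouped.map { case (v, n) => (f(v), n) }.reduceByKey(_ + _)
    }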

Re: how to implement my own datasource?

2015-06-25 Thread Juan Rodríguez Hortalá
Hi, You can connect via JDBC as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases. Another option is using HadoopRDD and NewHadoopRDD to connect to databases compatible with Hadoop, like HBase; some examples can be found in chapter 5 of Learning
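
A minimal example of the JDBC route (Spark 1.4+ DataFrame API; the URL, table name, and credentials are placeholders, and a SQLContext named sqlContext is assumed):

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "username")
      .option("password", "password")
      .load()
    jdbcDF.show()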

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Juan Rodríguez Hortalá
Hi, If you want, I would be happy to work on this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted: https://github.com/apache/spark/pull/5367. I'm fluent in Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can
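
For context, the Scala pattern for reading offsets in direct mode, which a Python API would presumably mirror (directStream is assumed to come from KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      // Each RDD produced by the direct stream carries its Kafka offset ranges.
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { o =>
        println(s"topic=${o.topic} partition=${o.partition} " +
          s"from=${o.fromOffset} until=${o.untilOffset}")
      }
    }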

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, You can use the method repartition from DStream (for the Scala API) or JavaDStream (for the Java API): def repartition(numPartitions: Int): DStream[T] (see https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html). It returns a new DStream with an increased or
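
A small usage example (a StreamingContext named ssc is assumed to exist):

    // Spread the received records across 8 partitions before heavy processing.
    val lines = ssc.socketTextStream("localhost", 9999)
    val repartitioned = lines.repartition(8)
    repartitioned.foreachRDD(rdd => println(rdd.partitions.length))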

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
into separate RDDs. On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi Sébastien, I ran into a similar problem some time ago; you can see the discussion on the Spark users mailing list at http://markmail.org/message/fudmem4yy63p62ar#query:+page:1
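
A hedged sketch of the pattern from that discussion, splitting a pair RDD into one RDD per known key with a filter pass per key (caching first so the input is not recomputed for every filter):

    import org.apache.spark.rdd.RDD

    def splitByKey[K, V](rdd: RDD[(K, V)], keys: Seq[K]): Map[K, RDD[(K, V)]] = {
      rdd.cache() // each filter below re-scans the input, so keep it in memory
      keys.map(k => k -> rdd.filter { case (key, _) => key == k }).toMap
    }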