Spark and tree data structures
Hi all, I'd like to use an octree data structure to simplify several computations on a big data set. I've been wondering whether Spark has any built-in options for such structures (the only thing I could find is the DecisionTree), especially if they make use of RDDs. I've also been exploring the possibility of using key-value pairs to simulate a tree structure within an RDD, but this makes the program much harder to understand and limits my options when processing the data. Any advice is very welcome, thanks in advance. Regards, Silvina
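One common way to flatten a tree into key-value pairs, as the message above considers, is to key each node by its path of octant indices from the root. The sketch below shows this encoding with a plain Java Map; the same (key, value) pairs could equally live in a JavaPairRDD. This is an illustrative assumption, not a Spark built-in, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: an octree encoded as key-value pairs, where the key is the
// string of octant indices (0-7) on the path from the root. A plain
// Map stands in for a JavaPairRDD to keep the example self-contained.
public class OctreeAsPairs {

    // Key of the child in the given octant (0-7) of the node at parentKey.
    public static String childKey(String parentKey, int octant) {
        if (octant < 0 || octant > 7) {
            throw new IllegalArgumentException("octant must be in 0-7");
        }
        return parentKey + octant;
    }

    // Key of the parent node, or null for the root (empty key).
    public static String parentKey(String key) {
        return key.isEmpty() ? null : key.substring(0, key.length() - 1);
    }

    // Depth of the node: number of octant steps from the root.
    public static int depth(String key) {
        return key.length();
    }

    public static void main(String[] args) {
        Map<String, Double> tree = new HashMap<>();
        tree.put("", 1.0);                    // root, empty path
        tree.put(childKey("", 3), 0.5);       // octant 3 of the root -> key "3"
        tree.put(childKey("3", 7), 0.25);     // key "37"
        System.out.println(parentKey("37"));  // 3
        System.out.println(depth("37"));      // 2
    }
}
```

With this scheme, parent/child navigation becomes key arithmetic, so it can be expressed with mapToPair and joins instead of pointer chasing; the downside, as noted above, is that the structure of the algorithm becomes less obvious.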
Reading file header in Spark
Hi everyone! I'm really new to Spark and I'm trying to figure out the proper way to do the following: 1. Read a file header (a single line). 2. Build a configuration object from it. 3. Use that object in a function that will be called by map(). I thought about using filter() after textFile(), but I don't want an RDD as the result, since I'm expecting a single object. Any help is much appreciated. Thanks in advance, Silvina
Re: Reading file header in Spark
Thank you! This is what I needed. I've read that first() should work as well. It's a pity that the taken element cannot be removed from the RDD, though. Thanks again! On 16 July 2014 12:09, Sean Owen so...@cloudera.com wrote: You can rdd.take(1) to get just the header line. I think someone mentioned before that this is a good use case for having a tail() method on RDDs too, to skip the header for subsequent processing. But you can ignore it with a filter, or logic in your map method.
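The take(1)-then-filter idea from the reply above can be sketched as follows, using a plain List in place of the RDD so the example is self-contained. With Spark this would be rdd.take(1).get(0) for the header and rdd.filter(...) (or logic in map()) to skip it downstream. The header format and helper names here are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: read a one-line header into configuration, then process the
// remaining lines while filtering the header out. A List stands in for
// the RDD; the header format "config:sep=<char>" is made up.
public class HeaderThenBody {

    // Build the "configuration" (here just a separator) from the header line.
    public static String parseSeparator(String header) {
        return header.substring(header.indexOf("sep=") + 4);
    }

    // Split every non-header line with the separator taken from the header.
    public static List<String[]> parseBody(List<String> lines, String separator) {
        String header = lines.get(0);
        return lines.stream()
                .filter(l -> !l.equals(header)) // skip the header line
                .map(l -> l.split(separator))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("config:sep=,", "a,b", "c,d");
        String separator = parseSeparator(lines.get(0)); // rdd.take(1) analogue
        List<String[]> rows = parseBody(lines, separator);
        System.out.println(rows.size()); // 2
    }
}
```

In real Spark code the configuration object built from the header would be captured by the closure passed to map(), so it must be serializable (or a broadcast variable) to reach the executors.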
Compilation error in Spark 1.0.0
Hi everyone, I am new to Spark and I'm having trouble getting my code to compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following: JavaRDD<Body> bodies = lines.map(l -> { Body b = new Body(); b.parse(l); }); JavaPairRDD<Partition, Iterable<Body>> partitions = bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey(); Partition and Body are defined inside the driver class. Body contains the following definition: protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance) The idea is to reproduce the following schema: The first map results in: body1, body2, ... The mapToPair should output several of these: (partition_i, body1), (partition_i, body2), ... which are gathered by key as follows: (partition_i, (body1, ..., body_n)), (partition_i', (body2, ..., body_n')), ... Thanks in advance. Regards, Silvina
Re: Compilation error in Spark 1.0.0
Right, the compile error is a casting issue telling me I cannot assign a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens in the mapToPair() method. On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote: You forgot the compile error!
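Two things in the quoted code would plausibly produce an Object/Object inference like the one reported above. First, the block lambda { Body b = new Body(); b.parse(l); } never returns a value, so nothing types the result as Body; it needs return b.parse(l);. Second, computePartitions() returns an Iterable of pairs, which matches a flatMapToPair-style operation (one input, many pairs), not mapToPair (one input, one pair). A minimal sketch of both fixes, using plain Java streams in place of the Spark API so it compiles standalone; the Body fields and partition rule here are made up:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of the two fixes: a lambda that returns the parsed object, and
// a flatMap over the Iterable of (partition, body) pairs. Streams stand
// in for RDDs; Map.Entry stands in for Tuple2.
public class LambdaReturnSketch {

    static class Body {
        String data;

        Body parse(String line) {
            this.data = line.trim();
            return this;
        }

        // One body can fall into several partitions, hence a list of pairs.
        // The partitioning rule (length modulo maxDistance) is invented.
        List<Map.Entry<Integer, Body>> computePartitions(int maxDistance) {
            return Arrays.asList(new SimpleEntry<>(data.length() % maxDistance, this));
        }
    }

    static List<Map.Entry<Integer, Body>> partitionAll(List<String> lines, int maxDistance) {
        // Fix 1: the block lambda must return the Body it builds.
        Function<String, Body> parse = l -> { Body b = new Body(); return b.parse(l); };
        return lines.stream()
                .map(parse)
                // Fix 2: flatMap (flatMapToPair in Spark), since each body
                // yields an Iterable of pairs rather than a single pair.
                .flatMap(b -> b.computePartitions(maxDistance).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Body>> pairs =
                partitionAll(Arrays.asList(" alpha ", "bc"), 3);
        System.out.println(pairs.size()); // 2
    }
}
```

In Spark terms the second line of the original snippet would become bodies.flatMapToPair(...).groupByKey(), giving the (partition_i, (body1, ..., body_n)) grouping described in the first message.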