Spark and tree data structures

2014-10-08 Thread Silvina Caíno Lores
Hi all,

I'd like to use an octree data structure to simplify several computations
on a big data set. I've been wondering whether Spark has any built-in
support for such structures (the only thing I could find is the
DecisionTree), especially anything that makes use of RDDs.

I've also been exploring the possibility of using key-value pairs to
simulate a tree's structure within an RDD, but this makes the program a lot
harder to understand and limits my options when processing the data.
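
For concreteness, a rough sketch of the kind of encoding I mean (nothing
here is built into Spark; Point3D, the unit-cube domain, and the fixed
depth are placeholder assumptions, not my real code):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Placeholder point type; the real records would carry more fields.
class Point3D implements java.io.Serializable {
    double x, y, z;
}

// Key each point by its octant path in [0,1)^3, one digit 0-7 per level.
// A key prefix then addresses an interior node of the simulated octree.
static String octantKey(Point3D p, int depth) {
    StringBuilder key = new StringBuilder();
    double x = p.x, y = p.y, z = p.z;
    for (int level = 0; level < depth; level++) {
        int octant = 0;
        x *= 2; y *= 2; z *= 2;
        if (x >= 1) { octant |= 1; x -= 1; }
        if (y >= 1) { octant |= 2; y -= 1; }
        if (z >= 1) { octant |= 4; z -= 1; }
        key.append(octant);
    }
    return key.toString();
}

// points is a JavaRDD<Point3D> built elsewhere. The leaves become groups
// in a pair RDD; there is no real tree structure, which is exactly what
// makes traversals awkward to express.
JavaPairRDD<String, Iterable<Point3D>> leaves =
        points.mapToPair(p -> new Tuple2<String, Point3D>(octantKey(p, 4), p))
              .groupByKey();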

Any advice is very welcome, thanks in advance.

Regards,
Silvina


Reading file header in Spark

2014-07-16 Thread Silvina Caíno Lores
Hi everyone!

I'm really new to Spark and I'm trying to figure out which would be the
proper way to do the following:

1.- Read a file header (a single line)
2.- Build a configuration object from it
3.- Use that object in a function that will be called by map()

I thought about using filter() after textFile(), but I don't want an RDD
as the result, since I'm expecting a single object.

Any help is much appreciated.

Thanks in advance,
Silvina


Re: Reading file header in Spark

2014-07-16 Thread Silvina Caíno Lores
Thank you! This is what I needed; I've read that first() should work as
well. It's a pity that the taken element cannot be removed from the RDD,
though.

Thanks again!


On 16 July 2014 12:09, Sean Owen so...@cloudera.com wrote:

 You can use rdd.take(1) to get just the header line.

 I think someone mentioned before that this is a good use case for
 having a tail method on RDDs too, to skip the header for subsequent
 processing. But you can ignore it with a filter, or logic in your map
 method.
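
In code, that suggestion looks roughly like this (a sketch; sc is the
JavaSparkContext, and ParseConfig / Record.parse are placeholder names,
not a real API):

JavaRDD<String> lines = sc.textFile("data.txt");

// first() (like take(1)) brings the header back to the driver as a String.
String header = lines.first();

// Build the configuration object on the driver; the closure below captures
// it, so it needs to be Serializable to ship with the tasks.
ParseConfig config = ParseConfig.fromHeader(header);   // placeholder API

// Skip the header with a filter, then parse the remaining lines.
JavaRDD<Record> records = lines
        .filter(line -> !line.equals(header))
        .map(line -> Record.parse(line, config));      // placeholder API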


Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Hi everyone,

I am new to Spark and I'm having trouble getting my code to compile. I have
the feeling I might be misunderstanding the functions, so I would be very
glad to get some insight into what could be wrong.

The problematic code is the following:

JavaRDD<Body> bodies = lines.map(l -> { Body b = new Body(); b.parse(l); });

JavaPairRDD<Partition, Iterable<Body>> partitions =
        bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey();

Partition and Body are defined inside the driver class. Body contains the
following definition:

protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance)

The idea is to reproduce the following schema:

The first map results in: body1, body2, ...
The mapToPair should output several pairs like: (partition_i, body1),
(partition_i, body2), ...
These are then gathered by key: (partition_i, (body1, ..., body_n)),
(partition_i', (body2, ..., body_n')), ...

Thanks in advance.
Regards,
Silvina


Re: Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Right, the compile error is a casting issue telling me I cannot assign
a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens
in the mapToPair() method.
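
For what it's worth, a sketch of what should fix it (hedged, I haven't
verified this against the full code): the block lambda needs an explicit
return, and flatMapToPair() matches the Iterable<Tuple2<...>> return type
of computePartitions(), whereas mapToPair() expects a single Tuple2 per
element:

JavaRDD<Body> bodies = lines.map(l -> {
    Body b = new Body();
    b.parse(l);
    return b;   // a block-bodied lambda must return the mapped value
});

// computePartitions() emits many (Partition, Body) pairs per body, so
// flatMapToPair() fits its Iterable<Tuple2<Partition, Body>> return type
// (the 1.x Java API expects an Iterable here).
JavaPairRDD<Partition, Iterable<Body>> partitions =
        bodies.flatMapToPair(b -> b.computePartitions(maxDistance))
              .groupByKey();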




On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote:

 You forgot the compile error!

