Thanks for the response! Well, in retrospect each partition doesn't need to be restricted to a single key. But I cannot have the values associated with a key span multiple partitions, since they all need to be processed together for a key to facilitate cumulative calcs. So provided an individual key has all its values in a single partition, I'm OK.
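For concreteness, here is a minimal plain-Java sketch (no Spark dependency) of the kind of per-key cumulative pass this co-location enables: once all rows for one mine sit together, the "flexed" calculation is just an ordered fold over that mine's rows. The row shape and the adjustment rule below are hypothetical, purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: shows why a mine's rows must be co-located — each year's
// flexed value depends on the previous year's result. The Row fields and
// the adjustment formula are made up for illustration.
public class CumulativeSketch {
    record Row(int mineId, int year, double value) {}

    // Apply an adjustment factor cumulatively, in year order
    // (hypothetical rule: flexed = value * factor + previous flexed).
    static List<Double> flex(List<Row> rowsForOneMine, double factor) {
        rowsForOneMine.sort(Comparator.comparingInt(Row::year));
        List<Double> flexed = new ArrayList<>();
        double prev = 0.0;
        for (Row r : rowsForOneMine) {
            prev = r.value() * factor + prev; // depends on the prior result
            flexed.add(prev);
        }
        return flexed;
    }

    public static void main(String[] args) {
        List<Row> mineA = new ArrayList<>(List.of(
                new Row(1, 1990, 10.0),
                new Row(1, 1991, 20.0)));
        System.out.println(flex(mineA, 1.5)); // [15.0, 45.0]
    }
}
```

If a mine's rows were split across partitions, this fold would need a shuffle or a second pass to stitch the partial results together, which is exactly what keeping each key in one partition avoids.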
Additionally, the values will be written to the database, and from what I have read, doing this at the partition level is the best compromise between 1) writing the calculated values for each key as they are computed (lots of connects/disconnects) and 2) collecting them all at the end and writing them all at once. I am using a groupBy against the filtered RDD to get the grouping I want, but apparently this may not be the most efficient way, and it seems that everything always ends up in a single partition under this scenario.

_______________
*Mike Wright*
Principal Architect, Software Engineering
SNL Financial LC
434-951-7816 *p*
434-244-4466 *f*
540-470-0119 *m*
mwri...@snl.com

On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher <rmarsc...@localytics.com> wrote:

> That seems like it could work, although I don't think `partitionByKey` is
> a thing, at least for RDD. You might be able to merge step #2 and step #3
> into one step by using the `reduceByKey` function signature that takes in a
> Partitioner implementation.
>
> def reduceByKey(partitioner: Partitioner
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html>,
> func: (V, V) ⇒ V): RDD
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>
> [(K, V)]
>
> Merge the values for each key using an associative reduce function. This
> will also perform the merging locally on each mapper before sending results
> to a reducer, similarly to a "combiner" in MapReduce.
>
> The tricky part might be getting the partitioner to know about the number
> of partitions, which I think it needs to know upfront in `abstract def
> numPartitions: Int`. The `HashPartitioner`, for example, takes the number
> as a constructor argument; maybe you could use that with an upper-bound
> size if you don't mind empty partitions. Otherwise you might have to mess
> around to extract the exact number of keys if it's not readily available.
>
> Aside: what is the requirement to have each partition only contain the
> data related to one key?
>
> On Fri, Sep 4, 2015 at 11:06 AM, mmike87 <mwri...@snl.com> wrote:
>
>> Hello, I am new to Apache Spark and this is my company's first Spark
>> project. Essentially, we are calculating models dealing with mining data
>> using Spark.
>>
>> I am holding all the source data in a persisted RDD that we will refresh
>> periodically. When a "scenario" is passed to the Spark job (we're using
>> Job Server), the persisted RDD is filtered to the relevant mines. For
>> example, we may want all mines in Chile and the 1990-2015 data for each.
>>
>> Many of the calculations are cumulative; that is, when we apply
>> user-input "adjustment factors" to a value, we also need the "flexed"
>> value we calculated for that mine previously.
>>
>> To ensure that this works, the idea is to:
>>
>> 1) Filter the superset to the relevant mines (done)
>> 2) Group the subset by the unique identifier for the mine. So, a group
>> may be all the rows for mine "A" for 1990-2015.
>> 3) Ensure that the RDD is partitioned by the mine identifier (an
>> Integer).
>>
>> It's step 3 that is confusing me. I suspect it's very easy ... do I
>> simply use PartitionByKey?
>>
>> We're using Java if that makes any difference.
>>
>> Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
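The `HashPartitioner` behavior Richard describes can be sketched in plain Java. This standalone class mirrors the documented rule (partition = key.hashCode() mod numPartitions, adjusted to be non-negative); it is an illustration of the semantics, not Spark's actual code, and the partition count of 8 is an arbitrary example.

```java
// Sketch: the partition-assignment rule that Spark's HashPartitioner
// documents, reproduced without any Spark dependency. With an upper-bound
// partition count, every distinct key maps to a stable partition index;
// some partitions may simply stay empty.
public class HashPartitionSketch {
    private final int numPartitions;

    public HashPartitionSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // hashCode mod numPartitions, adjusted so the result is never negative
    // (Java's % keeps the sign of the dividend).
    public int getPartition(Object key) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        HashPartitionSketch p = new HashPartitionSketch(8);
        System.out.println(p.getPartition(1001));  // 1001 % 8 = 1
        System.out.println(p.getPartition(-42));   // -42 % 8 = -2, +8 = 6
    }
}
```

In Spark itself, passing `new HashPartitioner(n)` to the `reduceByKey(partitioner, func)` overload would both merge the values and guarantee all rows for a given mine id land in the same partition, which is the co-location the cumulative calcs need; empty partitions from an over-sized `n` are harmless apart from a little scheduling overhead.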