Thanks Ted

Hi, I have been working through some example tutorials for Apache Spark in
an attempt to establish how I would solve the following scenario (see data
examples in the Appendix):

I have 1 billion+ rows that have a key value (i.e. driver ID) and a number
of relevant attributes (product class, date/time) that I need to evaluate
using certain business rules/algorithms. These rules run on grouped data
(i.e. apply the business rules to driver ID 1, then apply the same rules to
driver ID 2, etc.); typical business rules include backward- and
forward-looking checks (see the sample below) within a grouped dataset.
Importantly, I need to process the grouped data (driver IDs 1, 2, 3, 4, …)
concurrently. An example of the business rules:



For each data grouping/set (i.e. driver ID = 1, ordered chronologically by
date):

·         ROW ID 1: the first row in a group is always an ‘INITIATE’

·         ROW ID 2: the product value also occurs elsewhere in the group
(backward or forward looking) = ‘DUPLICATE’

·         ROW ID 3: the product changed within the same product class
(backward looking only), e.g. A -> A1 = ‘SWAP’

·         ROW ID 4: the product has not previously occurred = ‘ADD’

·         ROW IDs 5 and 6: the product value occurs elsewhere in the group
(backward or forward looking) = ‘DUPLICATE’
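The rules above can be sketched as a per-group classification function. This
is a minimal plain-Python sketch, not Spark code, and it bakes in some
assumptions: the rows of one driver arrive already ordered, the "product
class" of a product like A1 is obtained by stripping trailing digits (so A
and A1 share class A), and the duplicate check compares exact product values
both backward and forward within the group:

```python
from collections import Counter

def classify_group(rows):
    """Classify the rows of one driver group, already in processing order.

    rows: list of (row_id, product) tuples.
    Returns a list of (row_id, result) tuples.
    """
    # Assumption: a product's class is the product with trailing digits
    # stripped, so "A1" belongs to class "A".
    def base(product):
        return product.rstrip("0123456789")

    # Counting products over the whole group gives the forward-looking
    # part of the duplicate check for free.
    counts = Counter(product for _, product in rows)
    seen_classes = set()
    out = []
    for i, (row_id, product) in enumerate(rows):
        if i == 0:
            label = "INITIATE"            # first row of a group
        elif counts[product] > 1:
            label = "DUPLICATE"           # same product occurs before/after
        elif base(product) in seen_classes:
            label = "SWAP"                # new product in a seen class
        else:
            label = "ADD"                 # product class never seen before
        seen_classes.add(base(product))
        out.append((row_id, label))
    return out
```

Run against the sample group for driver ID 1, this reproduces the Appendix
results; in Spark the same function would be applied once per driver group.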

 Questions:

1.       Should I use DataFrames to pull the source data? If so, do I do
the group-by and order-by as part of the SQL query?

2.       How do I then split the grouped data (i.e. driver ID key-value
pairs) so the groups are parallelized for concurrent processing (ideally
the number of parallel datasets/groups should run at maximum cluster
capacity)? Do I need to do some sort of mapPartitions step?

3.       Depending on the answers to (1) and (2): how does each grouped
dataset (DataFrame, RDD, or Dataset) perform these rules-based checks (i.e.
backward- and forward-looking checks)? i.e. how is this achieved in Spark?
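For question 2, the overall shape of the job can be modelled in plain
Python (the names run_rules and the dict-based rows are illustrative, not a
Spark API): group the rows by the driver key, order each group, and hand
every group to the rule function independently. In Spark, the grouping step
would typically be an RDD groupByKey (or a DataFrame repartition/window on
driver ID), and the per-group loop below becomes tasks the scheduler runs
concurrently across the cluster:

```python
from collections import defaultdict

def run_rules(rows, rules):
    """Group rows by driver ID, order each group, apply `rules` per group.

    rows:  iterable of dicts with keys row_id, driver_id, product, date
    rules: function taking one ordered group, returning labelled rows

    Roughly mirrors the Spark shape
        rdd.map(lambda r: (r["driver_id"], r))
           .groupByKey()
           .flatMap(lambda kv: rules(sorted(kv[1], key=...)))
    where each (key, group) pair is handled by one task.
    """
    groups = defaultdict(list)
    for r in rows:
        groups[r["driver_id"]].append(r)

    out = []
    for driver_id in sorted(groups):
        # Assumption: ROWID reflects the intended processing order
        # (the sample data is not strictly chronological by date).
        ordered = sorted(groups[driver_id], key=lambda r: r["row_id"])
        out.extend(rules(ordered))
    return out
```

Only the grouping/ordering scaffolding is shown; the `rules` argument is
where a per-group classifier like the one sketched earlier would plug in.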

P.S. I have a solid Java background but am a complete Apache Spark novice,
so your help would be really appreciated.


sparkie



Appendix

Input/Output

ROWID, Driver ID, product class, date, RESULT
1, 1, A,  1/1/16, INITIATE
2, 1, A,  2/2/16, DUPLICATE
3, 1, A1, 3/4/16, SWAP
4, 1, B,  2/5/16, ADD
5, 1, C,  1/1/16, DUPLICATE
6, 1, C,  2/2/16, DUPLICATE
7, 2, A,  2/2/16, INITIATE
8, 2, B,  3/4/16, ADD
9, 2, A,  2/5/16, DUPLICATE

On Wed, Jun 8, 2016 at 1:39 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I think this is the correct forum.
>
> Please describe your case.
>
> On Jun 7, 2016, at 8:33 PM, Francois Le Roux <lerfranc...@gmail.com>
> wrote:
>
> Hi folks, I have been working through the available online Apache Spark
> tutorials and I am stuck with a scenario that I would like to solve in
> Spark. Is this a forum where I can publish a narrative for the problem /
> scenario that I am trying to solve?
>
>
> Any assistance appreciated
>
>
> thanks
>
> frank
>
>