Hi,

I have been working through some example tutorials for Apache Spark, trying to establish how I would solve the following scenario (see the data example in the Appendix).

I have 1 billion+ rows, each with a key value (a driver ID) and a number of relevant attributes (product class, date/time), that I need to evaluate using certain business rules/algorithms. These rules operate on grouped data (i.e. apply the rules to driver ID 1, then apply the same rules to driver ID 2, etc.); a typical rule needs to perform backward- and forward-looking checks within a grouped dataset (see the sample below). Importantly, I need to process the groups (driver IDs 1, 2, 3, 4, ...) concurrently.

An example of the business rules, for one data grouping (i.e. driver ID = 1, ordered chronologically by date):

· the first row is always an 'INITIATE' = ROW ID 1
· the product class value also occurs earlier or later (backward- or forward-looking) = 'DUPLICATE' = ROW ID 2
· the product changed within the same product class, e.g. A -> A1 (backward-looking only) = 'SWAP' = ROW ID 3
· the product has not previously occurred = 'ADD' = ROW ID 4
· the product class value also occurs earlier or later (backward- or forward-looking) = 'DUPLICATE' = ROW ID 5
· the product class value also occurs earlier or later (backward- or forward-looking) = 'DUPLICATE' = ROW ID 6

Questions:

1. Should I use DataFrames to pull the source data? If so, do I do a GROUP BY and ORDER BY as part of the SQL query?
2. How do I then split the grouped data (i.e. the driver ID key/value pairs) so the groups can be processed concurrently (ideally the number of parallel groups should run at maximum cluster capacity)? Do I need to do some sort of mapPartitions?
3. Depending on the answers to (1) and (2): how does each grouped dataset (DataFrame, Dataset or RDD) perform these rule-based checks (i.e. the backward- and forward-looking checks)? How is this achieved in Spark?

p.s. I have a solid Java background but am a complete Apache Spark novice, so your help would be really appreciated.

Appendix: input/output

ROWID, Driver ID, product class, date, RESULT
1, 1, A, 1/1/16, INITIATE
2, 1, A, 2/2/16, DUPLICATE
3, 1, A1, 3/4/16, SWAP
4, 1, B, 2/5/16, ADD
5, 1, C, 1/1/16, DUPLICATE
6, 1, C, 2/2/16, DUPLICATE
7, 2, A, 2/2/16, INITIATE
8, 2, B, 3/4/16, ADD
9, 2, A, 2/5/16, DUPLICATE

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-design-pattern-approaches-tp27109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
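p.p.s. To make the intended per-group logic concrete, here is a plain-Python sketch (not Spark) of how I understand the rules. The function name and the heuristic that a product's class is its name with trailing digits stripped (so 'A1' belongs to class 'A') are my own assumptions, inferred from the example data:

```python
from collections import Counter

def classify_group(rows):
    """Apply the business rules to one driver's rows.

    rows: list of (row_id, product) tuples, already sorted chronologically.
    Returns a list of (row_id, result) tuples.
    """
    # Counting over the whole group gives both backward- and forward-looking
    # visibility: a product with count > 1 occurs somewhere else in the group.
    counts = Counter(product for _, product in rows)
    seen_classes = set()  # backward-looking state only (for SWAP)
    results = []
    for i, (row_id, product) in enumerate(rows):
        # Assumed heuristic: strip trailing digits to get the product class,
        # so 'A1' falls in class 'A' (inferred from the example data).
        product_class = product.rstrip('0123456789')
        if i == 0:
            result = 'INITIATE'    # first row of the group
        elif counts[product] > 1:
            result = 'DUPLICATE'   # same product occurs earlier or later
        elif product != product_class and product_class in seen_classes:
            result = 'SWAP'        # changed product within a class seen before
        else:
            result = 'ADD'         # product has not previously occurred
        seen_classes.add(product_class)
        results.append((row_id, result))
    return results
```

In Spark, I imagine this kind of per-group function could be applied after grouping by driver ID (e.g. `Dataset.groupByKey(...).flatMapGroups(...)` in the Java/Scala API), or that the backward/forward checks could be expressed with window functions (`Window.partitionBy(...).orderBy(...)` with `lag`/`lead`, or a count over the whole partition), but which approach is appropriate is exactly what I am asking.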