In the context of the telecom industry, let's suppose we have several existing
RDDs populated from some tables in Cassandra:
val callPrices: RDD[PriceRow]
val calls: RDD[CallRow]
val offersInCourse: RDD[OfferRow]
where the types are defined as follows:
/** Represents the price per minute for a concrete hour */
case class PriceRow(
  year: Int,
  month: Int,
  day: Int,
  hour: Int,
  basePrice: Float)
/** Call records */
case class CallRow(
  customer: String,
  year: Int,
  month: Int,
  day: Int,
  hour: Int, // [0..23], read as call.hour when matching prices and offers
  minutes: Int)
/** Is there any discount that could be applicable here? */
case class OfferRow(
  offerName: String,
  hour: Int, // [0..23]
  discount: Float) // [0..1]
Assuming we cannot use `flatMap` to combine these three RDDs in the following
way (since RDD is not really 'monadic': `RDD.flatMap` expects a function that
returns a local collection, not another RDD):
/**
 * The final bill at a concrete hour for a call
 * is defined as {{{
 * def billPerHour(minutes: Int, basePrice: Float, discount: Float) =
 *   minutes * basePrice * discount
 * }}}
 */
val bills: RDD[BillRow] = for {
  price <- callPrices
  call <- calls if call.hour == price.hour
  offer <- offersInCourse if offer.hour == price.hour
} yield BillRow(
  call.customer,
  call.hour,
  billPerHour(call.minutes, price.basePrice, offer.discount))
case class BillRow(
  customer: String,
  hour: Int, // matches the Int passed as call.hour
  amount: Float)
What is the best practice for generating a new RDD that joins these
three RDDs and represents the bill for a given customer?
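
For reference, one possible direction (a minimal sketch, not necessarily the best practice being asked about) is to key each RDD by hour with `keyBy` and chain two pair-RDD `join` calls. The sketch makes several assumptions that are not in the post: `CallRow` carries an `hour: Int` field, `BillRow.hour` is an `Int`, the join key is the hour alone (as in the comprehension above, which ignores year, month and day), and the names `BillJoin` and `bills` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, the pair-RDD implicits also need:
// import org.apache.spark.SparkContext._

case class PriceRow(year: Int, month: Int, day: Int, hour: Int, basePrice: Float)
// Assumption: an `hour` field is added so that calls can be matched by hour.
case class CallRow(customer: String, year: Int, month: Int, day: Int, hour: Int, minutes: Int)
case class OfferRow(offerName: String, hour: Int, discount: Float)
// Assumption: the bill keeps the hour as an Int rather than a DateTime.
case class BillRow(customer: String, hour: Int, amount: Float)

object BillJoin {
  /** Bill formula exactly as given in the question. */
  def billPerHour(minutes: Int, basePrice: Float, discount: Float): Float =
    minutes * basePrice * discount

  /** Inner-joins prices, calls and offers on the hour and builds one bill per match. */
  def bills(
      callPrices: RDD[PriceRow],
      calls: RDD[CallRow],
      offersInCourse: RDD[OfferRow]): RDD[BillRow] = {
    // Turn each RDD into a pair RDD keyed by hour.
    val pricesByHour = callPrices.keyBy(_.hour)
    val callsByHour = calls.keyBy(_.hour)
    val offersByHour = offersInCourse.keyBy(_.hour)

    // Each join keeps only the hours present on both sides.
    pricesByHour
      .join(callsByHour)
      .join(offersByHour)
      .map { case (hour, ((price, call), offer)) =>
        BillRow(call.customer, hour,
          billPerHour(call.minutes, price.basePrice, offer.discount))
      }
  }
}
```

Because these are inner joins, a call in an hour with no matching price or offer produces no bill. If prices should also be matched per day, the key could presumably be widened to a tuple such as `(year, month, day, hour)` on the RDDs that carry those fields.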
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Joining-not-pair-RDDs-in-Spark-tp5034.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.