Hi,

In my problem, I need to group a DataFrame, apply some business logic to
each group, and finally emit a new DataFrame based on that. In detail,
there is a device DataFrame (deviceDF) which contains the timestamps at
which each device was turned on (on) or turned off (off).

+---------+------+--------------------+
|device_id|state |   d_ts             |
+---------+------+--------------------+
|1        |off   |2020-09-10 16:14:58 |
|1        |on    |2020-09-19 16:14:58 |
|2        |on    |2020-09-20 16:14:58 |
|2        |off   |2020-10-03 16:14:58 |
|4        |on    |2020-09-20 16:14:58 |
|5        |off   |2020-09-20 16:14:58 |
+---------+------+--------------------+

On the other hand, there is an events DataFrame (eventDF) containing each
event's timestamp and its corresponding device.

+-----+---------+--------------------+
|e_id |device_id|       e_ts         |
+-----+---------+--------------------+
|1    |1        |2020-09-20 16:14:58 |
|2    |2        |2020-10-08 09:19:55 |
|3    |4        |2020-11-01 12:15:37 |
|4    |5        |2020-10-08 01:35:08 |
+-----+---------+--------------------+

Joining the two DataFrames on device_id gives the following:

+---------+-----+--------------------+------+--------------------+
|device_id|e_id |       e_ts         |state |   d_ts             |
+---------+-----+--------------------+------+--------------------+
|1        |1    |2020-09-20 16:14:58 |off   |2020-09-10 16:14:58 |
|1        |1    |2020-09-20 16:14:58 |on    |2020-09-19 16:14:58 |
|2        |2    |2020-10-08 09:19:55 |on    |2020-09-20 16:14:58 |
|2        |2    |2020-10-08 09:19:55 |off   |2020-10-03 16:14:58 |
|4        |3    |2020-11-01 12:15:37 |on    |2020-09-20 16:14:58 |
|5        |4    |2020-10-08 01:35:08 |off   |2020-09-20 16:14:58 |
+---------+-----+--------------------+------+--------------------+

What I finally need is the events that happened while their corresponding
device was on. In the table above, event 1 is valid because it happened on
2020-09-20 16:14:58 and its device had been on since 2020-09-19 16:14:58.
Event 2 is not valid because its device was turned off on
2020-10-03 16:14:58 and never turned on again. Applying the same rule to
the remaining events results in the following table:

+---------+-----+-------------------+
|device_id|e_id |       e_ts        |
+---------+-----+-------------------+
|1        |1    |2020-09-20 16:14:58|
|4        |3    |2020-11-01 12:15:37|
+---------+-----+-------------------+
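
To make the rule concrete, here is a minimal sketch of it in plain Scala
collections. It is not Spark code and not the Column API approach I am
looking for. The case classes and the validEvents helper are names made up
for illustration, and I am assuming that an event whose timestamp equals a
state change counts as happening after that change.

```scala
import java.time.LocalDateTime

// Hypothetical row types mirroring the two tables above.
final case class DeviceState(deviceId: Int, state: String, ts: LocalDateTime)
final case class Event(eId: Int, deviceId: Int, ts: LocalDateTime)

object ValidEvents {
  // Keep an event only if the latest state change of its device
  // at or before the event timestamp is "on".
  def validEvents(events: Seq[Event], states: Seq[DeviceState]): Seq[Event] = {
    val byDevice = states.groupBy(_.deviceId)
    events.filter { e =>
      byDevice.getOrElse(e.deviceId, Seq.empty[DeviceState])
        .filter(s => !s.ts.isAfter(e.ts))                      // changes not after the event
        .reduceOption((a, b) => if (b.ts.isAfter(a.ts)) b else a) // most recent of those
        .exists(_.state == "on")
    }
  }
}
```

With the sample rows above, this keeps events 1 and 3, which matches the
expected table.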

I did the following to group the joined table by device:

val grouped = eventDF
      .join(deviceDF, "device_id")
      .groupBy("device_id")

which results in a RelationalGroupedDataset. Now I need to apply the logic
to each group and emit the resulting DataFrame, but I have not found a way
to do that. I looked at UDAFs, but they did not seem to fit my case. I know
how to solve this using the RDD API, but I want to find a *Column API*
approach. Any help or suggestion will be appreciated.

Thanks
