Hi
I am new to Apache Spark and I created a Spark job that reads data from
a MySQL database, does some processing on it, and then writes the result to
another table.

The odd thing I noticed is that when I use `sparkSession.read.jdbc` and then
`sparkDf.rdd.map`, Spark reads all the data from the table, waits for the
whole iteration to finish, and only then starts to write, with no checkpoint
anywhere. As a result, all the intermediate work is lost on a node/cluster
reboot. For context, the job looks roughly like the sketch below.
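A minimal sketch of the kind of job I mean; the connection details, table
names, and the map step are placeholders, not my real ones:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-etl").getOrCreate()

val jdbcUrl = "jdbc:mysql://dbhost:3306/mydb"
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

// Reads the whole source table as one DataFrame
val sparkDf = spark.read.jdbc(jdbcUrl, "source_table", props)

// Processing step on the underlying RDD (real transformation goes here)
val processed = sparkDf.rdd.map(row => row)

// Only after the full iteration does the write to the target table start
spark.createDataFrame(processed, sparkDf.schema)
  .write
  .mode("append")
  .jdbc(jdbcUrl, "target_table", props)
```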
Instead, it could read from the table in batches, process each batch, and
then write the results, which would be far more fault tolerant. I was
wondering how I can achieve this? A sketch of what I have in mind follows.
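Something like the partitioned read below is what I am hoping for; the
numeric column "id" and the bounds are just assumptions about my schema,
to illustrate the idea:

```scala
// Partitioned JDBC read: Spark issues one query per partition instead of
// pulling the whole table in a single scan.
val partitionedDf = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "source_table")
  .option("user", "user")
  .option("password", "password")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("partitionColumn", "id")   // assumed numeric key column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()

// Each partition could then be processed and written independently
partitionedDf
  .write
  .mode("append")
  .option("batchsize", "10000")      // rows per JDBC round trip on write
  .jdbc(jdbcUrl, "target_table", props)
```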
Many Thanks
