Hi,
I need some thoughts, inputs, or any starting point for achieving the
following scenario.
I submit a job using spark-submit with a certain set of parameters.

It reads data from a source, does some processing on RDDs, generates some
output, and completes.

Then I submit the same job again with the next set of parameters.
It should again read data from a source and do the same processing, but at
the same time it should read the result generated by the previous job, merge
the two, and store the combined results.

This process goes on and on.

So I need to store the RDD (or its output) from the previous job in some
storage so that it is available to the next job.

What are my options?
1. Use checkpoint
Can I checkpoint the final RDD and then load that same RDD in the next job
by pointing it at the checkpoint path? Is checkpointing right for this kind
of situation?
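
Roughly what I had in mind for option 1, just a sketch (paths and the map
step are placeholders). From what I understand, checkpointing is mainly for
truncating lineage inside one application, and Spark 1.6 has no public API
to reload a checkpoint directory from a different job, so I also sketched
saveAsObjectFile as the portable variant:

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")       // hypothetical path

    val result = sc.textFile("hdfs:///input/batch-001")  // hypothetical source
      .map(_.toUpperCase)                                // stand-in for the real processing
    result.checkpoint()                                  // materialized on the first action
    result.count()

    // Portable hand-off that the next job can actually read back:
    result.saveAsObjectFile("hdfs:///results/batch-001")
    // next job: sc.objectFile[String]("hdfs:///results/batch-001")
    sc.stop()
  }
}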

2. Save the output of the previous job into a JSON file and then create a
DataFrame from it in the next job.
Have I got this right? Is this option better than option 1?

3. I have heard a lot about Parquet files, but I don't know how they
integrate with Spark.
Can I use them here as intermediate storage?
Are they available in Spark 1.6?
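
If Parquet support is built into the DataFrame API in 1.6 (which is what the
docs seem to say), I think the usage would look almost the same as the JSON
version, but I'm not sure (sketch only, paths made up):

// assuming the same sqlContext as above
val previous = sqlContext.read.parquet("hdfs:///results/batch-001")
val current  = sqlContext.read.json("hdfs:///input/batch-002")  // hypothetical new source
val merged   = current.unionAll(previous)                       // schemas must match
merged.write.parquet("hdfs:///results/batch-002")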

Any other thoughts or ideas?

Thanks
Sachin
