I have already seen on example where data is generated using spark, no reason to think it's a bad idea as far as I know. You can check the code here, I m not very sure but I think there is something there which generates data for the TPCDS benchmark and you can provide how much data you want in the tables from 1G upwards.
From: Esa Heikkinen [mailto:esa.heikki...@student.tut.fi] Sent: Tuesday, June 20, 2017 7:34 PM To: user@spark.apache.org Subject: Using Spark as a simulator Hi Spark is a data analyzer, but would it be possible to use Spark as a data generator or simulator ? My simulation can be very huge and i think a parallelized simulation using by Spark (cloud) could work. Is that good or bad idea ? Regards Esa Heikkinen DISCLAIMER ========== This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.