My understanding is that Spark SQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API.
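To illustrate, here is a minimal sketch of that idea in Scala, assuming a Spark 1.x setup with the spark-sql module on the classpath; the table names and case classes (Fact, Dim) are hypothetical, not from any real schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical star-schema fragments: a fact table and one dimension table.
case class Fact(userId: Int, itemId: Int, rating: Double)
case class Dim(itemId: Int, category: String)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    val facts = sc.parallelize(Seq(Fact(1, 10, 4.0), Fact(2, 11, 3.5)))
    val dims  = sc.parallelize(Seq(Dim(10, "books"), Dim(11, "music")))

    facts.registerTempTable("facts")
    dims.registerTempTable("dims")

    // The SQL text is compiled down to the same RDD operations one could
    // write by hand; the result is an RDD of Rows.
    val joined = sqlContext.sql(
      """SELECT f.userId, f.rating, d.category
        |FROM facts f JOIN dims d ON f.itemId = d.itemId""".stripMargin)

    joined.collect().foreach(println)
  }
}
```

The joined RDD can then be mapped into MLlib input types (e.g. LabeledPoint) without leaving Spark.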
I need the performance of a SQL database, but I don't care about writing queries in SQL. I create the input to MLlib by doing a massive JOIN query, i.e. I build a single collection by combining many collections. This sort of operation is very inefficient in MongoDB, Cassandra, or HDFS.

I could store my data in a relational database and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark.

On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:

> 1. What data store do you want to store your data in? HDFS, HBase,
> Cassandra, S3 or something else?
> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>
> One option is to process the data in Spark and then store it in the
> relational database of your choice.
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>
>> Hello all,
>>
>> We are considering Spark for our organization. It is obviously a superb
>> platform for processing massive amounts of data... how about retrieving it?
>>
>> We are currently storing our data in a relational database in a star
>> schema. Retrieving our data requires doing many complicated joins across
>> many tables.
>>
>> Can we use Spark as a relational database? Or, if not, can we put Spark
>> on top of a relational database?
>>
>> Note that we don't care about SQL. Accessing our data via standard
>> queries is nice, but we are equally happy (or even happier) to write Scala
>> code.
>>
>> What is important to us is doing relational queries on huge amounts of
>> data. Is Spark good at this?
>>
>> Thank you very much in advance,
>> Peter