My understanding is that Spark SQL allows one to access Spark data as
if it were stored in a relational database.  It compiles SQL queries
into a series of calls to the Spark API.
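
For example, something like this is what I understand it to do (a rough
sketch with placeholder table and column names; the exact API depends
on the Spark release):

import org.apache.spark.sql.SparkSession

object SqlVsApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-vs-api").getOrCreate()
    import spark.implicits._

    // Placeholder data standing in for a real table.
    val users = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    users.createOrReplaceTempView("users")

    // The SQL form...
    val viaSql = spark.sql("SELECT name FROM users WHERE id = 1")
    // ...and the same query expressed directly against the API;
    // both go through the same planner.
    val viaApi = users.filter($"id" === 1).select("name")

    viaSql.show()
    viaApi.show()
  }
}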

I need the performance of a SQL database, but I don't care about doing
queries with SQL.

I create the input to MLlib by doing a massive JOIN query.  So, I am
creating a single collection by combining many collections.  This sort
of operation is very inefficient in Mongo, Cassandra, or HDFS.
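
Concretely, the shape of the job I want to express in Spark is
something like the following (paths and join keys are placeholders,
just a sketch):

import org.apache.spark.sql.SparkSession

object BuildTrainingSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("build-training-set").getOrCreate()

    // Placeholder inputs; any Spark data source would do.
    val events   = spark.read.parquet("/data/events")
    val users    = spark.read.parquet("/data/users")
    val products = spark.read.parquet("/data/products")

    // The "massive JOIN": many collections combined into one wide
    // table that becomes the MLlib training input.
    val training = events
      .join(users, Seq("user_id"))
      .join(products, Seq("product_id"))

    training.write.parquet("/data/training")
  }
}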

I could store my data in a relational database and copy the query
results to Spark for processing.  However, I was hoping to keep
everything in Spark.
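
If I did go that route, I gather Spark's JDBC data source can pull the
query result in directly, roughly like this (the URL, credentials and
query below are placeholders):

import org.apache.spark.sql.SparkSession

object JdbcIntoSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-into-spark").getOrCreate()

    // Run the join on the relational side, read the result as a DataFrame.
    val joined = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
      .option("dbtable", "(SELECT * FROM facts JOIN dims USING (id)) AS q")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    // Keep a copy in Spark-friendly storage for downstream MLlib jobs.
    joined.write.parquet("/data/from_rdbms")
  }
}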

On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com>
wrote:

> 1. What data store do you want to store your data in ? HDFS, HBase,
> Cassandra, S3 or something else?
> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>
> One option is to process the data in Spark and then store it in the
> relational database of your choice.
>
>
>
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>
>> Hello all,
>>
>> We are considering Spark for our organization.  It is obviously a superb
>> platform for processing massive amounts of data... how about retrieving it?
>>
>> We are currently storing our data in a relational database in a star
>> schema.  Retrieving our data requires doing many complicated joins across
>> many tables.
>>
>> Can we use Spark as a relational database?  Or, if not, can we put Spark
>> on top of a relational database?
>>
>> Note that we don't care about SQL.  Accessing our data via standard
>> queries is nice, but we are equally happy (or more happy) to write Scala
>> code.
>>
>> What is important to us is doing relational queries on huge amounts of
>> data.  Is Spark good at this?
>>
>> Thank you very much in advance
>> Peter
>>
>
>
