Any specific reason to choose Spark? It sounds like you have a
Write-Once-Read-Many Times dataset, which is logically partitioned across
customers, sitting in some data store. And essentially you are looking for
a fast way to access it, and most likely you will use the same partition
key for querying the data. This is more of a database/NoSQL kind of use case
than Spark (which is more of a distributed processing engine, I reckon).

On Mon, Mar 6, 2017 at 11:56 AM, Subhash Sriram <subhash.sri...@gmail.com>
wrote:

> Hi Allan,
>
> Where is the data stored right now? If it's in a relational database, and
> you are using Spark with Hadoop, I feel like it would make sense to
> import the data into HDFS, just because it would be faster to access
> the data. You could use Sqoop to do that.
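>
> (Not the Sqoop route, but as a rough illustration of the same idea, Spark's
> own JDBC reader can also do the one-off import. This is only a sketch; the
> connection URL, table name, credentials and column names below are made up,
> and you'd need the JDBC driver on the classpath:)
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("one-off-import")
>   .getOrCreate()
>
> // Read the source table over JDBC (hypothetical URL/table/credentials).
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:postgresql://dbhost:5432/mydb")
>   .option("dbtable", "customer_rows")
>   .option("user", "etl_user")
>   .option("password", "secret")
>   .load()
>
> // Land it in HDFS as Parquet, partitioned by customer for fast lookups later.
> df.write.partitionBy("customer_id").parquet("hdfs:///data/customer_rows")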
>
> In terms of having a long running Spark context, you could look into the
> Spark job server:
>
> https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
>
> It would allow you to cache all the data in memory and then accept queries
> via REST API calls. You would have to refresh your cache as the data
> changes, of course, but it sounds like that wouldn't happen very often.
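>
> To give a feel for it, a job you submit to the job server is just a class
> implementing its SparkJob trait, and the parameters of each REST call arrive
> in a Typesafe Config object. This is only a rough sketch against the API
> described in that README (the object name and the "customerId" parameter are
> made up), so please double-check it against the job server docs:
>
> import com.typesafe.config.Config
> import org.apache.spark.SparkContext
> import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}
>
> object CustomerQueryJob extends SparkJob {
>
>   // Reject requests that don't carry the parameter we need.
>   override def validate(sc: SparkContext, config: Config): SparkJobValidation =
>     if (config.hasPath("customerId")) SparkJobValid
>     else SparkJobInvalid("customerId parameter is required")
>
>   // Runs inside the long-lived context, so cached data survives between calls.
>   override def runJob(sc: SparkContext, config: Config): Any = {
>     val customerId = config.getString("customerId")
>     // ... look the customer up in data cached in the shared context ...
>     customerId
>   }
> }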
>
> In terms of running the queries themselves, I would think you could use
> Spark SQL and the DataFrame/Dataset API, which is built into Spark. You
> will have to think about the best way to partition your data, depending on
> the queries themselves.
>
> Here is a link to the Spark SQL docs:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
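>
> For example, assuming the data is already sitting in HDFS as Parquet and is
> partitioned by a customer id column (all paths and names below are made up),
> the per-customer lookup could look roughly like this:
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("customer-queries")
>   .getOrCreate()
>
> // Load once and cache, so repeated queries hit memory rather than HDFS.
> val rows = spark.read.parquet("hdfs:///data/customer_rows")
> rows.cache()
> rows.createOrReplaceTempView("customer_rows")
>
> // Each incoming request only changes the parameters, not the query shape.
> val result = spark.sql(
>   "SELECT * FROM customer_rows WHERE customer_id = 12345")
> result.show()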
>
> I hope that helps, and I'm sure other folks will have some helpful advice
> as well.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Mar 5, 2017, at 3:49 PM, Allan Richards <allan.richa...@gmail.com>
> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add-ons to Spark, and am looking for some direction as to what
> I should look at / what may be helpful.
>
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.
>
>


-- 
Best Regards,
Ayan Guha
