Any specific reason to choose Spark? It sounds like you have a write-once-read-many-times dataset, logically partitioned by customer, sitting in some data store. Essentially you are looking for a fast way to access it, and most likely you will use the same partition key for querying the data. This is more of a database/NoSQL kind of use case than a Spark one (Spark is more of a distributed processing engine, I reckon).
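To make that access pattern concrete, here is a minimal Scala sketch. The CustomerRecord type, field names, and the in-memory Map are illustrative stand-ins for a real keyed store, not a specific product's API:

// Sketch of the point-lookup pattern a keyed store gives you.
// CustomerRecord and the in-memory Map are illustrative stand-ins
// for a real NoSQL table partitioned by customer id.
case class CustomerRecord(customerId: Long, payload: String)

object KeyedLookupSketch {
  // In a real deployment this would be e.g. an HBase or Cassandra
  // table with customerId as the partition key.
  val store: Map[Long, Seq[CustomerRecord]] =
    Map(42L -> Seq(CustomerRecord(42L, "example row")))

  // One partition-key read returns the 1-5000 rows for a customer;
  // no cluster-wide scan is involved.
  def rowsFor(customerId: Long): Seq[CustomerRecord] =
    store.getOrElse(customerId, Seq.empty)

  def main(args: Array[String]): Unit =
    rowsFor(42L).foreach(println)
}

The point is that a store keyed on the partition key answers each query by touching a single partition, which is what makes sub-second lookups realistic at this scale.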
On Mon, Mar 6, 2017 at 11:56 AM, Subhash Sriram <subhash.sri...@gmail.com> wrote:
> Hi Allan,
>
> Where is the data stored right now? If it's in a relational database, and
> you are using Spark with Hadoop, I feel like it would make sense to import
> the data into HDFS, just because it would be faster to access. You could
> use Sqoop to do that.
>
> In terms of having a long-running Spark context, you could look into the
> Spark job server:
>
> https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
>
> It would allow you to cache all the data in memory and then accept queries
> via REST API calls. You would have to refresh your cache as the data
> changes, of course, but it sounds like that is not very often.
>
> In terms of running the queries themselves, I would think you could use
> Spark SQL and the DataFrame/Dataset API, which is built into Spark. You
> will have to think about the best way to partition your data, depending on
> the queries themselves.
>
> Here is a link to the Spark SQL docs:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> I hope that helps, and I'm sure other folks will have some helpful advice
> as well.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Mar 5, 2017, at 3:49 PM, Allan Richards <allan.richa...@gmail.com> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add-ons to Spark, and am looking for some direction as to
> what I should look at / what may be helpful.
>
> A couple of relevant points:
> - The dataset doesn't change over time.
> - There are a small number of applications (or queries, I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
> - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two),
> so ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something
> else that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.

--
Best Regards,
Ayan Guha
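For concreteness, here is a minimal sketch of the cached Spark SQL approach Subhash describes above. It assumes the data has already been landed in HDFS as Parquet and carries a customer_id column; the path, view name, and column name are illustrative assumptions, not givens:

// Minimal sketch of the cached Spark SQL approach; the Parquet path,
// view name, and customer_id column are assumptions for illustration.
import org.apache.spark.sql.SparkSession

object CustomerQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-queries")
      .getOrCreate()

    // Load once and cache; with ~1 billion rows, check that the cluster
    // has the memory for it (or fall back to MEMORY_AND_DISK).
    spark.read.parquet("hdfs:///data/customers.parquet")
      .createOrReplaceTempView("customers")
    spark.sql("CACHE TABLE customers")

    // Each incoming query reuses the warm cache; only the parameters
    // (here the customer id) change between calls.
    spark.sql("SELECT * FROM customers WHERE customer_id = 12345").show()

    spark.stop()
  }
}

Paired with something like the job server's REST interface, the session (and the cached table) stays warm between requests, so each parameterized query only pays the cost of the filter, not the load.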