Re: Spark Beginner: Correct approach for use case

2017-03-08 Thread Allan Richards
Thanks for the feedback everyone. We've had a look at different SQL-based solutions, and have got good performance out of them, but some of the reports we generate can't be expressed as a single SQL query. This is just an investigation to see if Spark is a viable alternative. I've got another

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Jörn Franke
I agree with the others that a dedicated NoSQL datastore can make sense. You should look at the lambda architecture paradigm. Keep in mind that more memory does not necessarily mean more performance; what matters is having the right data structure for your users' queries. Additionally, if your queries

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread ayan guha
Any specific reason to choose Spark? It sounds like you have a write-once, read-many-times dataset, logically partitioned across customers, sitting in some data store. Essentially you are looking for a fast way to access it, and most likely you will use the same partition key for
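
A minimal sketch of that layout idea, assuming a "customer_id" partition column, Parquet files, and HDFS paths, none of which are stated in the thread:

// Sketch only: write the data once, partitioned by the key you query on,
// so later reads on that key prune partitions instead of scanning everything.
// "customer_id" and the paths below are assumptions, not details from the thread.
import org.apache.spark.sql.SparkSession

object PartitionedLayoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-partitioned-layout")
      .getOrCreate()

    // Write once: lay the data out partitioned by the customer key.
    val events = spark.read.parquet("hdfs:///staging/events")   // hypothetical input
    events.write
      .partitionBy("customer_id")
      .parquet("hdfs:///warehouse/events_by_customer")

    // Read many: a filter on the partition key lets Spark skip whole directories.
    val oneCustomer = spark.read
      .parquet("hdfs:///warehouse/events_by_customer")
      .where("customer_id = 'C12345'")
    oneCustomer.show()

    spark.stop()
  }
}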

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Subhash Sriram
Hi Allan, Where is the data stored right now? If it's in a relational database, and you are using Spark with Hadoop, I feel like it would make sense to import the data into HDFS, just because it would be faster to access the data. You could use Sqoop to do that. In terms of having a
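
Sqoop is the tool suggested in the message; as an alternative, the same import can be done with Spark's built-in JDBC source. A sketch, where the connection URL, table name, bounds, and output path are all placeholders rather than details from the thread:

// Sketch only: pull a table out of a relational database in parallel and land it
// in HDFS as Parquet, so subsequent Spark queries avoid hitting the database.
import org.apache.spark.sql.SparkSession

object JdbcToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-import-sketch")
      .getOrCreate()

    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/reporting")   // placeholder
      .option("dbtable", "transactions")                           // placeholder
      .option("user", "report_user")                               // placeholder
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      // Parallel read: split on a numeric column across 8 partitions.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000000")
      .option("numPartitions", "8")
      .load()

    source.write.parquet("hdfs:///warehouse/transactions")

    spark.stop()
  }
}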

Spark Beginner: Correct approach for use case

2017-03-05 Thread Allan Richards
Hi, I am looking to use Spark to help execute queries against a reasonably large dataset (1 billion rows). I'm a bit lost with all the different libraries and add-ons for Spark, and am looking for some direction as to what I should look at and what may be helpful. A couple of relevant points: - The
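
For orientation, a minimal sketch of the kind of query workload described (ad-hoc reports over a large dataset) using plain Spark SQL; the Parquet path, table name, and columns are illustrative assumptions, since the actual schema is not given in the thread:

// Sketch only: register a large dataset as a temp view and run SQL over it.
// Reports too complex for one SQL statement can mix SQL with further
// DataFrame transformations on the result.
import org.apache.spark.sql.SparkSession

object ReportQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("report-query-sketch")
      .getOrCreate()

    // Expose the dataset to SQL queries.
    spark.read.parquet("hdfs:///warehouse/events").createOrReplaceTempView("events")

    // Example aggregation; columns are hypothetical.
    val perCustomer = spark.sql(
      """SELECT customer_id, count(*) AS event_count
        |FROM events
        |GROUP BY customer_id""".stripMargin)

    // Continue with DataFrame operations on the SQL result.
    perCustomer.orderBy(perCustomer("event_count").desc).show(20)

    spark.stop()
  }
}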