This does look like it provides a good way to allow other processes to access the contents of an RDD from a separate app. Is there any other general-purpose mechanism for serving up RDD data? I understand that the driver app and workers are all app-specific and run in separate executors, but it would be cool if there were some general way to create a server app based on Spark. Perhaps Spark SQL is that general way, and I'll soon find out. Thanks.
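For concreteness, here is a minimal sketch (mine, not from the docs) of what querying the Thrift JDBC server from a plain JVM client might look like. It assumes the server was started with sbin/start-thriftserver.sh, the hive-jdbc driver is on the classpath, and a hypothetical table "features" (item_id, feature, weight) was cached beforehand with "CACHE TABLE features":

// Hedged sketch: querying Spark's Thrift JDBC server from a JVM client.
// The table name, its columns, and the input feature 'color' are all
// hypothetical; adjust to your schema.
import java.sql.DriverManager

object CorrelationClient {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    // Self-join over the cached table: features co-occurring with the
    // input feature, most frequent first.
    val rs = stmt.executeQuery(
      """SELECT b.feature, COUNT(*) AS cooccurrences
        |FROM features a JOIN features b ON a.item_id = b.item_id
        |WHERE a.feature = 'color' AND b.feature <> 'color'
        |GROUP BY b.feature
        |ORDER BY cooccurrences DESC""".stripMargin)
    while (rs.next()) {
      println(rs.getString(1) + "\t" + rs.getLong(2))
    }
    conn.close()
  }
}

Any JDBC-capable client (beeline, a BI tool, another service) can issue the same query against the cached table, which is what makes the Thrift server a general-purpose front end for cached data.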
From: mich...@databricks.com
Date: Mon, 27 Oct 2014 14:35:46 -0700
Subject: Re: Spark to eliminate full-table scan latency
To: ronalday...@live.com
CC: user@spark.apache.org

You can access cached data in Spark through the JDBC server:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server

On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub <ronalday...@live.com> wrote:

We have a table containing 25 features per item id, along with feature weights. A correlation matrix can be constructed for every feature pair based on co-occurrence. If a user inputs a feature, they can find the features correlated with it via a self-join requiring a single full table scan. This results in high latency on big data (10+ seconds) due to the IO involved in the full table scan.

My idea is that, for this feature, the data can be loaded into an RDD, and transformations and actions can be applied to find the correlated features per query. I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not sure about is: is Spark appropriate as a server application? For instance, the driver application would have to load the RDD and then listen for requests and return results, perhaps using a socket? Are there any libraries to facilitate this sort of Spark server app? (A sketch of this pattern appears after this message.)

So I understand how Spark can be used to grab data, run algorithms, and put results back, but is it appropriate as the engine of a server app, and what are the general patterns involved?
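A minimal sketch of the driver-as-server pattern the question describes: a long-running driver that caches an RDD once and answers queries over a plain socket. All names here (FeatureServer, features.tsv, port 9999) are hypothetical, and requests are handled serially for simplicity.

// Minimal sketch, not a production pattern: the driver stays alive,
// caches the data once, and serves lookups over a socket.
import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object FeatureServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FeatureServer"))

    // Load once and cache in executor memory; later queries reuse it,
    // avoiding the full-table disk scan on every request.
    val features = sc.textFile("hdfs:///data/features.tsv")
      .map(_.split("\t"))
      .map(a => (a(0), (a(1), a(2).toDouble))) // (itemId, (feature, weight))
      .cache()
    features.count() // force materialization up front

    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()
      try {
        // One query per connection: a single feature name on one line.
        val query = Source.fromInputStream(socket.getInputStream).getLines().next()
        // Each request runs a Spark job over the cached RDD, not over disk.
        val hits = features.filter(_._2._1 == query).take(100)
        val out = new PrintWriter(socket.getOutputStream, true)
        hits.foreach(h => out.println(h._1 + "\t" + h._2._2))
      } finally {
        socket.close()
      }
    }
  }
}

Projects such as spark-jobserver generalize this pattern behind an HTTP API, but a plain socket or HTTP endpoint in the driver is enough to prototype with.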