RE: Spark to eliminate full-table scan latency

2014-11-19 Thread bchazalet
Yes, you can serve queries over your RDD data and return results to the
user/client, as long as your driver is alive.

For example, I have built a Play! application that acts as the driver
(it creates a SparkContext), loads up data from my database, organizes it, and
subsequently receives and processes user queries over HTTP. As long as my Play!
application is running, my Spark application is kept alive within the
cluster.
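
Roughly, the pattern looks like this (a simplified sketch, not my actual code:
the app name, file path, data layout and controller are placeholders, and it
assumes Play 2.x-style Scala controllers):

    import org.apache.spark.{SparkConf, SparkContext}
    import play.api.mvc.{Action, Controller}

    // Long-lived driver state: the SparkContext (and any cached RDDs) live
    // exactly as long as the Play application does.
    object SparkDriver {
      val sc = new SparkContext(
        new SparkConf().setAppName("query-server"))  // master URL comes from config

      // Load the data once at startup and keep it in memory across requests.
      val features = sc.textFile("hdfs:///path/to/features")  // placeholder path
        .map(_.split(","))
        .map(a => (a(0), a.drop(1).toSeq))
        .cache()
    }

    // Play controller answering user queries against the cached RDD over HTTP
    // (the corresponding entry in conf/routes is omitted here).
    object QueryController extends Controller {
      def lookup(id: String) = Action {
        val rows = SparkDriver.features.filter(_._1 == id).collect()
        Ok(rows.mkString("\n"))
      }
    }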

You can also have a look at the spark-jobserver from Ooyala:
https://github.com/ooyala/spark-jobserver







Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately.  I too would like to use the thrift server to 
provide JDBC-style access to datasets via SparkSQL.  Is this possible?  The 
examples show temp tables created during the lifetime of a SparkContext.  I 
assume I can use SparkSQL to query those tables while the context is active, 
but what happens when the context is stopped?  Presumably I can no longer query 
that table via the thrift server.  Do I need Hive in this scenario?  I don’t want 
to rebuild the Spark distribution unless absolutely necessary.
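
To make the question concrete, this is roughly the pattern I mean (a rough
sketch against the Spark 1.1-style API; the case class, file name, and table
name are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Long, feature: String, weight: Double)

    object TempTableExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("temp-table-example"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

        // Register an RDD as a temp table; it exists only while this context lives.
        val records = sc.textFile("features.csv")  // made-up file name
          .map(_.split(","))
          .map(a => Record(a(0).toLong, a(1), a(2).toDouble))
        records.registerTempTable("features")

        // Queries work here, while the SparkContext/SQLContext are alive...
        sqlContext.sql("SELECT feature, weight FROM features WHERE id = 42").collect()

        // ...but once the context is stopped, "features" is gone. Can the thrift
        // server see a temp table like this at all, or only Hive metastore tables?
        sc.stop()
      }
    }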

From the examples, it looks like SparkSQL is syntactic sugar for manipulating an 
RDD, but if I need external access to this data, I need a separate store outside 
of Spark (Mongo/Cassandra/HDFS/etc.).  Am I correct here?

Thanks,

mn

 On Oct 27, 2014, at 7:43 PM, Ron Ayoub ronalday...@live.com wrote:
 
 This does look like it provides a good way to allow other processes to access 
 the contents of an RDD from a separate app. Is there any other general-purpose 
 mechanism for serving up RDD data? I understand that the driver app and workers 
 are all app-specific and run in separate executors, but it would be cool 
 if there were some general way to create a server app based on Spark. Perhaps 
 Spark SQL is that general way and I'll soon find out. Thanks. 
 
 From: mich...@databricks.com
 Date: Mon, 27 Oct 2014 14:35:46 -0700
 Subject: Re: Spark to eliminate full-table scan latency
 To: ronalday...@live.com
 CC: user@spark.apache.org
 
 You can access cached data in spark through the JDBC server:
 
 http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
 
 On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote:
 We have a table containing 25 features per item id along with feature 
 weights. A correlation matrix can be constructed for every feature pair based 
 on co-occurrence. If a user inputs a feature, they can find the features 
 correlated with it via a self-join, which requires a single full-table scan. 
 This results in high latency for big data (10+ seconds) due to the I/O 
 involved in the full-table scan. My idea is that, for this feature, the data 
 can be loaded into an RDD, and transformations and actions can be applied to 
 find the correlated features per query. 
 
 I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not 
 sure about is whether Spark is appropriate as a server application. For 
 instance, the driver application would have to load the RDD and then listen 
 for requests and return results, perhaps using a socket. Are there any 
 libraries to facilitate this sort of Spark server app? I understand how Spark 
 can be used to grab data, run algorithms, and put results back, but is it 
 appropriate as the engine of a server app, and what are the general patterns 
 involved?



Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
We have a table containing 25 features per item id along with feature weights. 
A correlation matrix can be constructed for every feature pair based on 
co-occurrence. If a user inputs a feature, they can find the features that are 
correlated with it via a self-join, which requires a single full-table scan. 
This results in high latency for big data (10+ seconds) due to the I/O involved 
in the full-table scan. My idea is that, for this feature, the data can be 
loaded into an RDD, and transformations and actions can be applied to find the 
correlated features per query. 

I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not 
sure about is whether Spark is appropriate as a server application. For 
instance, the driver application would have to load the RDD and then listen for 
requests and return results, perhaps using a socket. Are there any libraries to 
facilitate this sort of Spark server app? I understand how Spark can be used to 
grab data, run algorithms, and put results back, but is it appropriate as the 
engine of a server app, and what are the general patterns involved?
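
To make it concrete, I imagine something along these lines (only a rough sketch, 
not working code from our system; the file layout, column order, and names are 
invented):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD operations (pre-1.3 style)
    import org.apache.spark.rdd.RDD

    object CorrelatedFeatures {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("feature-correlation"))

        // One row per (itemId, feature, weight); the file layout is invented.
        val rows: RDD[(Long, String, Double)] = sc.textFile("hdfs:///features")
          .map(_.split(","))
          .map(a => (a(0).toLong, a(1), a(2).toDouble))

        // Group features by item so the "self-join" becomes a per-item pairing,
        // and cache the result so each user query avoids re-reading the table.
        val byItem = rows.map { case (item, feature, _) => (item, feature) }
          .groupByKey()
          .cache()

        // Per query: count how often the requested feature co-occurs with every
        // other feature, instead of running a full-table self-join in the database.
        def correlatedWith(feature: String): Array[(String, Int)] =
          byItem.values
            .filter(_.exists(_ == feature))
            .flatMap(fs => fs.filter(_ != feature).map(f => (f, 1)))
            .reduceByKey(_ + _)
            .sortBy(_._2, ascending = false)
            .take(20)

        correlatedWith("some-feature").foreach(println)
      }
    }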

  

Re: Spark to eliminate full-table scan latency

2014-10-27 Thread Michael Armbrust
You can access cached data in spark through the JDBC server:

http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
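
For example, once the server has been started with sbin/start-thriftserver.sh, 
any JDBC client can cache and query a table. A rough sketch (the host, port, 
table name, and columns are placeholders, and it assumes the Hive JDBC driver 
is on the classpath):

    import java.sql.DriverManager

    object ThriftServerClient {
      def main(args: Array[String]): Unit = {
        // The Thrift JDBC server speaks the HiveServer2 protocol, so the Hive
        // JDBC driver and a hive2:// URL are used; the default port is 10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()

        // Pin the table in Spark's in-memory cache, then query it.
        stmt.execute("CACHE TABLE features")
        val rs = stmt.executeQuery(
          "SELECT feature, weight FROM features WHERE id = 42")
        while (rs.next()) {
          println(rs.getString(1) + "\t" + rs.getDouble(2))
        }
        conn.close()
      }
    }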

On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote:

 We have a table containing 25 features per item id along with feature
 weights. A correlation matrix can be constructed for every feature pair
 based on co-occurrence. If a user inputs a feature, they can find the
 features correlated with it via a self-join, which requires a single
 full-table scan. This results in high latency for big data (10+ seconds) due
 to the I/O involved in the full-table scan. My idea is that, for this
 feature, the data can be loaded into an RDD, and transformations and actions
 can be applied to find the correlated features per query.

 I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm
 not sure about is whether Spark is appropriate as a server application. For
 instance, the driver application would have to load the RDD and then listen
 for requests and return results, perhaps using a socket. Are there any
 libraries to facilitate this sort of Spark server app? I understand how
 Spark can be used to grab data, run algorithms, and put results back, but is
 it appropriate as the engine of a server app, and what are the general
 patterns involved?




RE: Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
This does look like it provides a good way to allow other processes to access the 
contents of an RDD from a separate app. Is there any other general-purpose 
mechanism for serving up RDD data? I understand that the driver app and workers 
are all app-specific and run in separate executors, but it would be cool if there 
were some general way to create a server app based on Spark. Perhaps Spark SQL 
is that general way and I'll soon find out. Thanks. 

From: mich...@databricks.com
Date: Mon, 27 Oct 2014 14:35:46 -0700
Subject: Re: Spark to eliminate full-table scan latency
To: ronalday...@live.com
CC: user@spark.apache.org

You can access cached data in spark through the JDBC server:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server

On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote:



We have a table containing 25 features per item id along with feature weights. 
A correlation matrix can be constructed for every feature pair based on 
co-occurrence. If a user inputs a feature, they can find the features that are 
correlated with it via a self-join, which requires a single full-table scan. 
This results in high latency for big data (10+ seconds) due to the I/O involved 
in the full-table scan. My idea is that, for this feature, the data can be 
loaded into an RDD, and transformations and actions can be applied to find the 
correlated features per query. 

I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not 
sure about is whether Spark is appropriate as a server application. For 
instance, the driver application would have to load the RDD and then listen for 
requests and return results, perhaps using a socket. Are there any libraries to 
facilitate this sort of Spark server app? I understand how Spark can be used to 
grab data, run algorithms, and put results back, but is it appropriate as the 
engine of a server app, and what are the general patterns involved?