Hi, I'd like to submit a possible use case and ask for some guidance on the
overall architecture. 
I have 2 different datasources (a relational PostgreSQL database and a
Cassandra cluster) and I'd like to give users the ability to query data,
'joining' the 2 worlds. 
So, an idea that comes to my mind is: pre-process the data and create 2
DataFrames, 1 for PG and 1 for Cassandra, and register the DataFrames as
tables in Hive. Then enable the Thrift server and connect from an external
application via the Hive JDBC driver. 
In this way, a 3rd-party user can run their own queries against both DBs,
joining as needed. 
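For reference, the mock-up I have in mind looks roughly like this (host
names, table names, credentials, and the Cassandra keyspace below are
placeholders; it assumes the PostgreSQL JDBC driver and the
spark-cassandra-connector package are on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("federated-views")
  .enableHiveSupport()
  .getOrCreate()

// DataFrame 1: PostgreSQL over JDBC
val pgDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://pg-host:5432/mydb") // placeholder host/db
  .option("dbtable", "public.orders")                   // placeholder table
  .option("user", "spark")
  .option("password", "secret")
  .load()

// DataFrame 2: Cassandra via the spark-cassandra-connector
val casDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events")) // placeholder keyspace/table
  .load()

// Register both as session-scoped views
pgDf.createOrReplaceTempView("DB1")
casDf.createOrReplaceTempView("DB2")

// Start the Thrift server inside this application, so it shares the
// same session and therefore sees the temp views registered above
HiveThriftServer2.startWithContext(spark.sqlContext)
```

Starting the Thrift server with `startWithContext` (rather than launching
the standalone `start-thriftserver.sh`) is what lets external JDBC clients
see these in-memory views.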
From mock-up code, this seems to work, but I'm a bit concerned about how
Spark handles such a use case. 
Let's say: 
-> PG DB ->> DATAFRAME 1 ->> registered as Hive table DB1 
-> CASSANDRA DB ->> DATAFRAME 2 ->> registered as Hive table DB2 

What happens when a user submits, via the Thrift server, a query like
'select ... from DB1 JOIN DB2 ON ... WHERE ...'? 
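Concretely, the external client would connect with the ordinary Hive JDBC
driver and issue something like the following (columns and the filter are
made-up examples, matching the placeholder views above):

```sql
SELECT d1.id, d1.amount, d2.event_type
FROM DB1 d1
JOIN DB2 d2 ON d1.id = d2.order_id
WHERE d2.event_type = 'purchase';
```

The question is what Spark does under the hood when planning and executing
this join across the two registered sources.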
Are connections to both DBs kept open, or are they reopened on demand
(i.e., is there a way to set up a 'connection pool'/'connection cache')? 
Do I have to persist (memory + disk) these DataFrames in order not to
overload the databases? 
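On the persistence question, the pattern I'm considering would be something
like this (a sketch, assuming the `pgDf`/`casDf` DataFrames from the setup
above):

```scala
import org.apache.spark.storage.StorageLevel

// Cache each source so Thrift-server queries hit the cached data
// instead of going back to PostgreSQL/Cassandra on every join
pgDf.persist(StorageLevel.MEMORY_AND_DISK)
casDf.persist(StorageLevel.MEMORY_AND_DISK)

// persist() is lazy; force materialization once up front with an action
pgDf.count()
casDf.count()
```

The same effect can be had from a JDBC client with `CACHE TABLE DB1`, but
I'm unsure whether that's the recommended approach here, or whether Spark
pushes filters down to the sources anyway.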
Is Spark's embedded Thrift server robust enough for such use cases? Is
there any production use of this component? 

Thanks to everybody! 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Expose-spark-pre-computed-data-via-thrift-server-tp26568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
