Hi, I'd like to submit a possible use case and get some guidance on the overall architecture. I have two different data sources (a relational PostgreSQL database and a Cassandra cluster), and I'd like to give users the ability to query data by 'joining' the two worlds. The idea that comes to mind is: pre-process the data into two DataFrames, one for PostgreSQL and one for Cassandra, and register both DataFrames as tables in Hive. Then enable the Thrift server and connect from an external application via the Hive JDBC driver. This way, a third-party user can run their own queries against both databases, joining as needed.

From a mock-up, this seems to work, but I'm a bit concerned about how Spark handles such a use case. Let's say:

  PG DB        -> DATAFRAME 1 -> registered as Hive table DB1
  CASSANDRA DB -> DATAFRAME 2 -> registered as Hive table DB2
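For concreteness, a minimal sketch of that setup might look like the following (Spark 2.x API; on 1.x, registerTempTable would be used instead of createOrReplaceTempView). The JDBC URL, credentials, keyspace, and table names are placeholders, and the DataStax spark-cassandra-connector is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FederatedJoin")
  .enableHiveSupport()   // needed so the Thrift server shares the session catalog
  .getOrCreate()

// DataFrame 1: PostgreSQL via the JDBC data source
val pgDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://pg-host:5432/mydb")  // placeholder
  .option("dbtable", "public.orders")                    // placeholder
  .option("user", "spark")
  .option("password", "secret")
  .load()

// DataFrame 2: Cassandra via the DataStax connector (assumed dependency)
val cassDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events")) // placeholders
  .load()

// Register both so that SQL submitted over JDBC can join them
pgDf.createOrReplaceTempView("DB1")
cassDf.createOrReplaceTempView("DB2")
```

With the views registered, the embedded Thrift server can be started with sbin/start-thriftserver.sh and queried from any Hive JDBC client (e.g. beeline with a jdbc:hive2://host:10000 URL), assuming it runs inside the same Spark application so it sees the registered views.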
What happens when a user submits a query like 'SELECT ... FROM DB1 JOIN DB2 ON ... WHERE ...' via the Thrift server? Are the connections to both databases kept open, or are they reopened on demand (i.e., is there a way to set up a connection pool / connection cache)? Do I have to persist (memory + disk) these DataFrames so as not to overload the databases? Is Spark's embedded Thrift server robust enough for such a use case? Is there any production use of this component?

Thanks to everybody!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Expose-spark-pre-computed-data-via-thrift-server-tp26568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
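On the persist question, a self-contained sketch of pinning a registered table in memory + disk, so that repeated Thrift queries read the cached data instead of hitting the backing database each time (the range DataFrame below just stands in for either source DataFrame, and "DB1" is the placeholder view name from above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("CacheSketch")
  .getOrCreate()

// Stand-in for the PostgreSQL or Cassandra DataFrame
val df = spark.range(100).toDF("id")

// Materialize to memory, spilling to disk if it does not fit
df.persist(StorageLevel.MEMORY_AND_DISK)
df.createOrReplaceTempView("DB1")

// Equivalent from SQL, e.g. via beeline against the Thrift server:
spark.sql("CACHE TABLE DB1")
```

Note that caching trades freshness for load: queries then run against a snapshot taken at cache time, so the cache would need to be refreshed (UNCACHE TABLE / re-persist) when the underlying databases change.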
