Hi Vincent,

See my answers inline.
> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <[email protected]> wrote:
>
> Hi,
> I know that Ignite has SQL support but:
> - ODBC driver doesn't seem to provide HTTP(S) support, which is easier to
> integrate on corporate networks with rules, firewalls, proxies

Igor Sapego, what URIs are supported presently?

> - The SQL engine doesn't seem to scale like Spark SQL would. For instance,
> Spark won't generate OOM if the dataset (source or result) doesn't fit in
> memory. From the Ignite side, it's not clear…

OOM is not related to scalability at all; it is a matter of application logic. Ignite's SQL engine scales out along with your cluster. Moreover, Ignite supports indexes, which give your SQL queries O(log N) running time, whereas Spark will do a full scan (O(N)) every time.

However, to benefit from Ignite SQL queries you have to put all the data in memory. Ignite doesn't go to the CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query is executed and won't preload anything from the underlying CacheStore. Automatic preloading works only for key-value operations such as cache.get(key).

> - Spark thrift can manage multi-tenancy: different users can connect to the
> same SQL engine and share cache. In Ignite it's one cache per user, so a big
> waste of RAM.

Everyone can connect to an Ignite cluster and work with the same set of distributed caches, so I'm not sure why you would need to create a cache with the same content for every user. If you need real multi-tenancy support, where cacheA may be accessed only by user group A and cacheB only by group B, take a look at GridGain, which is built on top of Ignite:
https://gridgain.readme.io/docs/multi-tenancy

> What I want to achieve is:
> - use Cassandra for the data store as it provides idempotence (HDFS/Hive
> doesn't), resulting in exactly-once semantics without any duplicates.
> - use Spark SQL thriftserver in multi-tenancy for large-scale ad-hoc analytics
> queries (> TB) from an ODBC driver through HTTP(S)
> - accelerate Cassandra reads when the data modeling of the Cassandra table
> doesn't fit the queries. Queries would be OLAP style: target multiple C*
> partitions, with group-bys or filters on lots of dimensions that aren't
> necessarily in the C* table key.

As mentioned above, Ignite uses Cassandra as a CacheStore; keep this in mind. Before trying to assemble the whole chain, I would recommend connecting the Spark SQL thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by Cassandra. This chain will probably work for you, but I can't give more precise guidance on it.

[1] https://apacheignite-fs.readme.io/docs/ignite-for-spark

—
Denis

> Thanks for your advice
>
> 2016-10-04 6:51 GMT+02:00 Jörn Franke <[email protected]>:
> I am not sure that this will be performant. What do you want to achieve here?
> Fast lookups? Then the Cassandra Ignite store might be the right solution. If
> you want to do more analytic-style queries, you can put the data on HDFS/Hive
> and use the Ignite HDFS cache to cache certain partitions/tables of Hive
> in memory. If you want to run iterative machine-learning algorithms, you can
> go for Spark on top of this. You can then also use the Ignite cache for
> Spark RDDs.
>
> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <[email protected]> wrote:
>
>> Hi, Vincent!
>>
>> Ignite also has SQL support (also scalable); I think it will be much faster
>> to query directly from Ignite than to query through Spark.
>> Also please mind that before executing queries you should load all the
>> needed data into the cache.
>> To load data from Cassandra to Ignite you may use the Cassandra store [1].
>>
>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>
>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski
>> <[email protected]> wrote:
>> Hi,
>> I am evaluating the possibility of using Spark SQL (and its scalability)
>> over an Ignite cache with a Cassandra persistent store to speed up read
>> workloads like OLAP-style analytics.
>> Is there any way to configure the Spark thriftserver to load an external
>> table from Ignite, like we can do with Cassandra?
>> Here is an example of the configuration for Spark backed by Cassandra:
>>
>> CREATE EXTERNAL TABLE MyHiveTable
>>   ( id int, data string )
>>   STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>   TBLPROPERTIES ("cassandra.host" = "x.x.x.x",
>>                  "cassandra.ks.name" = "test",
>>                  "cassandra.cf.name" = "mytable",
>>                  "cassandra.ks.repfactor" = "1",
>>                  "cassandra.ks.strategy" =
>>                      "org.apache.cassandra.locator.SimpleStrategy");
>>
>> --
>> Alexey Kuznetsov
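
[Editor's sketch] The chain discussed in the thread (Spark thrift server on top of Ignite shared RDDs, with the Ignite cache itself optionally backed by Cassandra) can be sketched roughly as below. This is a hedged sketch only: it assumes Spark 1.6 with the ignite-spark module on the classpath; the cache name "personCache", the config path, and the Person class are illustrative placeholders, and the exact IgniteContext signature varies across Ignite releases.

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Placeholder value class for the cache entries (assumption, not from the thread).
case class Person(name: String, age: Int)

object IgniteThriftBridge {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-thrift"))
    val hc = new HiveContext(sc)

    // Ignite shared RDD over the cache "personCache" (which may itself be
    // backed by a Cassandra CacheStore, as discussed in the thread).
    val ic = new IgniteContext(sc, "ignite-config.xml")
    val people = ic.fromCache[Long, Person]("personCache")

    // Run Ignite SQL (which can use Ignite's indexes) and register the
    // result so thrift-server clients can see it.
    val df = people.sql("select name, age from Person where age > ?", 30)
    df.registerTempTable("people_hot")

    // Every JDBC/ODBC session of this server shares the same HiveContext,
    // so the registered table is visible to all connected users.
    HiveThriftServer2.startWithContext(hc)
  }
}
```

Clients would then connect with the usual beeline/ODBC connection string and query people_hot; whether this satisfies the HTTP(S) and multi-tenancy requirements depends on the thrift-server transport configuration.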

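[Editor's sketch] The cache-level wiring behind Alexey's link [1] — an Ignite cache configured for read/write-through against Cassandra — looks roughly like the Spring bean fragment below. The cache name, the referenced bean names, and the persistence-settings bean are placeholders; the full property set is described on the ignite-with-apache-cassandra page linked above.

```xml
<!-- Sketch: an Ignite cache backed by Cassandra via the ignite-cassandra
     module. Bean names are placeholders. Note: read-through applies to
     key-value operations (cache.get), not to SQL queries, which only see
     data already loaded in memory. -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="personCache"/>
  <property name="readThrough" value="true"/>
  <property name="writeThrough" value="true"/>
  <property name="cacheStoreFactory">
    <bean class="org.apache.ignite.cache.store.cassandra.CassandraCacheStoreFactory">
      <property name="dataSourceBean" value="cassandraDataSource"/>
      <property name="persistenceSettingsBean" value="personPersistenceSettings"/>
    </bean>
  </property>
</bean>
```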