Re: spark SQL thriftserver over ignite and cassandra

Denis Magda Wed, 05 Oct 2016 15:12:48 -0700

Vincent,

Please see below


> On Oct 5, 2016, at 4:31 AM, vincent gromakowski 
> <[email protected]> wrote:
> 
> Hi,
> thanks for your explanations. Please find more questions inline.
> 
> Vincent
> 
> 2016-10-05 3:33 GMT+02:00 Denis Magda <[email protected]>:
> Hi Vincent,
> 
> See my answers inline
> 
>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski 
>> <[email protected]> wrote:
>> 
>> Hi,
>> I know that Ignite has SQL support but:
>> - ODBC driver doesn't seem to provide HTTP(S) support, which is easier to 
>> integrate on corporate networks with rules, firewalls, proxies
> 
> Igor Sapego, what URIs are supported presently? 
> 
>> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, 
>> Spark won't generate an OOM if the dataset (source or result) doesn't fit in 
>> memory. From the Ignite side, it's not clear…
> 
> OOM is not a scalability issue at all. It depends on the application’s 
> logic. 
> 
> The Ignite SQL engine scales out along with your cluster. Moreover, 
> Ignite supports indexes, which allow your SQL queries to run in O(log N) 
> time, while in the case of Spark you will face full scans (O(N)) 
> all the time.
> 
> However, to benefit from Ignite SQL queries you have to put all the data 
> in-memory. Ignite doesn’t go to a CacheStore (Cassandra, relational database, 
> MongoDB, etc) while a SQL query is executed and won’t preload anything from 
> an underlying CacheStore. Automatic preloading works for key-value queries 
> like cache.get(key).
> 
> 
> This is an issue because I will potentially have to query terabytes of data. 
> If I use the Spark thriftserver backed by IgniteRDD, does it solve this point, 
> and can I get automatic preloading from C*?

IgniteRDD will load missing key-value pairs from Cassandra because 
IgniteRDD is essentially an IgniteCache and Cassandra is its CacheStore. The only 
thing left to check is whether the Spark thriftserver can work with 
IgniteRDDs. I hope you will be able to figure this out and share your feedback 
with us.
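
To make the difference between key-value access and query execution concrete, 
here is a minimal Python sketch. It is not Ignite code: ReadThroughCache, scan 
and the dict standing in for Cassandra are hypothetical names made up for 
illustration. It only models the behavior described in this thread, assuming a 
cache whose get() falls through to a backing store on a miss and whose queries 
look only at what is already in memory.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ReadThroughCache:
    """Toy model of a cache backed by a persistent store (NOT the Ignite API)."""

    def __init__(self, load_from_store: Callable[[int], Optional[str]]) -> None:
        # Hypothetical loader; in the thread's setup this role is played by Cassandra.
        self._load = load_from_store
        # Entries currently held in memory.
        self._memory: Dict[int, str] = {}

    def get(self, key: int) -> Optional[str]:
        # Key-value access: a miss falls through to the backing store.
        if key not in self._memory:
            value = self._load(key)
            if value is not None:
                self._memory[key] = value
        return self._memory.get(key)

    def scan(self, predicate: Callable[[int, str], bool]) -> List[Tuple[int, str]]:
        # Query-style access: only in-memory entries are visited,
        # the backing store is never consulted.
        return sorted((k, v) for k, v in self._memory.items() if predicate(k, v))


# A plain dict stands in for the persistent store (assumption for illustration).
backing_store = {1: "a", 2: "b", 3: "c"}
cache = ReadThroughCache(backing_store.get)

before = cache.scan(lambda k, v: True)  # nothing has been loaded yet
value = cache.get(2)                    # miss, so the entry is read through
after = cache.scan(lambda k, v: True)   # the query now sees only that entry
```

In this model a query over a cold cache returns nothing until the data has been 
loaded, which is why the data has to be in memory before running SQL queries.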


> 
>> - Spark thrift can manage multi tenancy: different users can connect to the 
>> same SQL engine and share cache. In Ignite it's one cache per user, so a big 
>> waste of RAM.
> 
> Everyone can connect to an Ignite cluster and work with the same set of 
> distributed caches. I’m not sure why you need to create caches with the same 
> content for every user.
> 
> It's a security issue: an Ignite cache doesn't support multiple user accounts 
> per cache. I am thinking of using Spark to authenticate multiple users and 
> then having Spark use a shared account on the Ignite cache.
>  
Ignite provides basic security interfaces and some implementations 
that you can build your secure solution on. This article may be 
useful for your case:
http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/

—
Denis

> 
> If you need real multi-tenancy support, where cacheA can be accessed only 
> by users in group A and cacheB only by users in group B, then you 
> can take a look at GridGain, which is built on top of Ignite:
> https://gridgain.readme.io/docs/multi-tenancy
> 
> 
> 
> OK, but I am evaluating open-source-only solutions (Kylin, Druid, Alluxio...); 
> it's a constraint from my management.
>> 
>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive 
>> doesn't), resulting in exactly-once semantics without any duplicates. 
>> - use the Spark SQL thriftserver in multi-tenancy mode for large-scale ad hoc 
>> analytics queries (> TB) from an ODBC driver through HTTP(S) 
>> - accelerate Cassandra reads when the data model of the Cassandra table 
>> doesn't fit the queries. Queries would be OLAP style: they target multiple C* 
>> partitions, with group-bys or filters on lots of dimensions that aren't 
>> necessarily in the C* table key.
>> 
> 
> As mentioned, Ignite uses Cassandra as a CacheStore. You should keep 
> this in mind. Before trying to assemble the whole chain, I would recommend 
> connecting the Spark SQL thrift server directly to Ignite and working with 
> its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by 
> Cassandra. This chain will probably work for you, but I can’t give more 
> precise guidance on this.
> 
> 
> I will try to make it work and give you feedback.
> 
>  
> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark 
>  
> —
> Denis
> 
>> Thanks for your advice
>> 
>> 
>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <[email protected]>:
>> I am not sure that this will be performant. What do you want to achieve 
>> here? Fast lookups? Then the Cassandra Ignite store might be the right 
>> solution. If you want to do more analytic-style queries, then you can put 
>> the data on HDFS/Hive and use the Ignite HDFS cache to cache certain 
>> partitions/tables in Hive in memory. If you want to run iterative machine 
>> learning algorithms, you can go for Spark on top of this. You can then 
>> also use the Ignite cache for Spark RDDs.
>> 
>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <[email protected]> wrote:
>> 
>>> Hi, Vincent!
>>> 
>>> Ignite also has SQL support (also scalable). I think it will be much faster 
>>> to query Ignite directly than to query through Spark.
>>> Also, please keep in mind that before executing queries you should load all 
>>> the needed data into the cache.
>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>> 
>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra 
>>> 
>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski 
>>> <[email protected]> wrote:
>>> Hi,
>>> I am evaluating the possibility of using Spark SQL (and its scalability) over 
>>> an Ignite cache with a Cassandra persistent store to speed up read workloads 
>>> such as OLAP-style analytics.
>>> Is there any way to configure the Spark thriftserver to load an external table 
>>> in Ignite like we can do with Cassandra?
>>> Here is an example of a config for Spark backed by Cassandra:
>>> 
>>> CREATE EXTERNAL TABLE MyHiveTable 
>>>         ( id int, data string ) 
>>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' 
>>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x", 
>>>           "cassandra.ks.name" = "test" , 
>>>           "cassandra.cf.name" = "mytable" , 
>>>           "cassandra.ks.repfactor" = "1" , 
>>>           "cassandra.ks.strategy" = 
>>>             "org.apache.cassandra.locator.SimpleStrategy" ); 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Alexey Kuznetsov
