Hi, thanks for your explanations. Please find more questions inline.
Vincent
2016-10-05 3:33 GMT+02:00 Denis Magda <[email protected]>:

> Hi Vincent,
>
> See my answers inline.
>
> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <[email protected]> wrote:
>
>> Hi,
>> I know that Ignite has SQL support, but:
>> - the ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate on corporate networks with rules, firewalls, and proxies.
>
> *Igor Sapego*, what URIs are supported presently?
>
>> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, Spark won't generate an OOM if the dataset (source or result) doesn't fit in memory. From the Ignite side, it's not clear…
>
> OOM is not related to the scalability topic at all. This is about the application's logic.
>
> The Ignite SQL engine scales out perfectly along with your cluster. Moreover, Ignite supports indexes, which gives you O(log N) running-time complexity for your SQL queries, while with Spark you will face full scans (O(N)) all the time.
>
> However, to benefit from Ignite SQL queries you have to put all the data in memory. Ignite doesn't go to a CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query is executed and won't preload anything from an underlying CacheStore. Automatic preloading works for key-value queries like cache.get(key).

This is an issue because I will potentially have to query TBs of data. If I use the Spark thrift server backed by an IgniteRDD, does it solve this point, and can I get automatic preloading from C*?

>> - The Spark thrift server can manage multi-tenancy: different users can connect to the same SQL engine and share the cache. In Ignite it's one cache per user, so a big waste of RAM.
>
> Everyone can connect to an Ignite cluster and work with the same set of distributed caches. I'm not sure why you need to create caches with the same content for every user.

It's a security issue: an Ignite cache doesn't provide multiple user accounts per cache.
I am thinking of using Spark to authenticate multiple users and then having Spark use a shared account on the Ignite cache.

> If you need real multi-tenancy support, where cacheA is allowed to be accessed only by users from group A and cacheB only by users from group B, then you can take a look at GridGain, which is built on top of Ignite:
> https://gridgain.readme.io/docs/multi-tenancy

OK, but I am evaluating open-source-only solutions (Kylin, Druid, Alluxio...); it's a constraint from my hierarchy.

>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive doesn't), resulting in exactly-once semantics without any duplicates;
>> - use the Spark SQL thrift server in multi-tenancy for large-scale ad-hoc analytics queries (> TB) from an ODBC driver through HTTP(S);
>> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't fit the queries. Queries would be OLAP style: targeting multiple C* partitions, with group-bys or filters on lots of dimensions that aren't necessarily in the C* table key.
>
> As was mentioned, Ignite uses Cassandra as a CacheStore; you should keep this in mind. Before trying to assemble the whole chain, I would recommend you try connecting the Spark SQL thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by Cassandra. Probably this chain will work for you, but I can't give more precise guidance on it.

I will try to make it work and give you feedback.

> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>
> —
> Denis

Thanks for your advice.

2016-10-04 6:51 GMT+02:00 Jörn Franke <[email protected]>:

>> I am not sure that this will be performant. What do you want to achieve here? Fast lookups? Then the Cassandra Ignite store might be the right solution.
>> If you want to do a more analytic style of queries, then you can put the data on HDFS/Hive and use the Ignite HDFS cache to cache certain partitions/tables of Hive in memory. If you want to go to iterative machine-learning algorithms, you can go for Spark on top of this. You can then also use the Ignite cache for Spark RDDs.
>>
>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <[email protected]> wrote:
>>
>> Hi, Vincent!
>>
>> Ignite also has SQL support (also scalable); I think it will be much faster to query directly from Ignite than to query from Spark.
>> Also please mind that before executing queries you should load all the needed data into the cache.
>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>
>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>
>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <[email protected]> wrote:
>>
>>> Hi,
>>> I am evaluating the possibility of using Spark SQL (and its scalability) over an Ignite cache with a Cassandra persistent store to speed up read workloads such as OLAP-style analytics.
>>> Is there any way to configure the Spark thrift server to load an external table in Ignite, like we can do with Cassandra?
>>> Here is an example of the config for Spark backed by Cassandra:
>>>
>>> CREATE EXTERNAL TABLE MyHiveTable
>>>   ( id int, data string )
>>> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>> TBLPROPERTIES (
>>>   "cassandra.host" = "x.x.x.x",
>>>   "cassandra.ks.name" = "test",
>>>   "cassandra.cf.name" = "mytable",
>>>   "cassandra.ks.repfactor" = "1",
>>>   "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy" );
>>
>> --
>> Alexey Kuznetsov
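For reference, wiring the Cassandra store [1] into an Ignite cache is done through the cache configuration rather than a Hive DDL statement. A minimal sketch of the Spring XML, assuming the bean and class names documented for the ignite-cassandra module; the contact point, cache name, and persistence-settings bean are placeholders, and this fragment is untested:

```xml
<!-- Sketch: an Ignite cache that reads/writes through a Cassandra CacheStore.
     Class names per the ignite-cassandra module docs; values are placeholders. -->
<bean id="cassandraDataSource"
      class="org.apache.ignite.cache.store.cassandra.datasource.DataSource">
  <property name="contactPoints" value="x.x.x.x"/>
</bean>

<bean id="mytableCache"
      class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="mytable"/>
  <!-- read-through makes cache.get(key) fall through to Cassandra on a miss;
       note this preloading does NOT apply to Ignite SQL queries (see Denis's
       remark above). -->
  <property name="readThrough" value="true"/>
  <property name="writeThrough" value="true"/>
  <property name="cacheStoreFactory">
    <bean class="org.apache.ignite.cache.store.cassandra.CassandraCacheStoreFactory">
      <property name="dataSourceBean" value="cassandraDataSource"/>
      <property name="persistenceSettingsBean" value="mytablePersistenceSettings"/>
    </bean>
  </property>
</bean>
```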

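The chain Denis suggests (Spark SQL thrift server over a shared IgniteRDD, with the cache backed by Cassandra) might be sketched roughly as below. This is an untested outline only, against the Ignite 1.x ignite-spark module and Spark 1.x APIs; the config path, cache name, key/value types, and table names are assumptions, and the exact IgniteContext/IgniteRDD signatures vary between versions:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object IgniteThriftSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-olap"))
    val hive = new HiveContext(sc)

    // Starts (or attaches to) an Ignite node on each executor from an XML
    // config; the config would declare the Cassandra-backed cache.
    val ic = new IgniteContext[Int, String](sc, "ignite-cassandra-config.xml")

    // IgniteRDD is a live, shared view over the distributed "mytable" cache.
    val rdd = ic.fromCache("mytable")

    // Run the query on Ignite's SQL engine (indexed, in-memory) and expose
    // the resulting DataFrame to the thrift server as a temp table.
    val df = rdd.sql("select _key, _val from String")
    df.registerTempTable("mytable")
  }
}
```

The caveat from the thread still applies: the SQL query only sees what is already in the cache, so the data would have to be preloaded from C* (e.g. via loadCache) before the thrift server can usefully query terabytes of it.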