Re: spark SQL thriftserver over ignite and cassandra

vincent gromakowski Mon, 17 Oct 2016 11:41:21 -0700

Hi
I mean using HTTPS transport instead of binary (thrift?) transport.
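
To illustrate the difference, here is a minimal sketch of a client connection string that selects HTTP(S) transport instead of binary Thrift. It assumes the HiveServer2-style JDBC URL conventions used by the Spark Thrift server in HTTP mode (server started with hive.server2.transport.mode=http, default HTTP port 10001). The helper name and host name are hypothetical, and an ODBC driver would need equivalent settings in whatever form that driver exposes.

```python
def hive2_http_url(host, port, database="default", http_path="cliservice", ssl=True):
    """Build a HiveServer2-style JDBC URL that selects HTTP transport.

    Hypothetical helper for illustration only; it just formats a string.
    """
    # Session variables (before '?') carry the TLS switch.
    session = ";ssl=true" if ssl else ""
    # Hive configuration variables (after '?') select HTTP transport
    # and the HTTP endpoint path served by the thrift server.
    conf = (
        "hive.server2.transport.mode=http;"
        f"hive.server2.thrift.http.path={http_path}"
    )
    return f"jdbc:hive2://{host}:{port}/{database}{session}?{conf}"


if __name__ == "__main__":
    # Assumed host name; 10001 is the documented default HTTP port.
    print(hive2_http_url("thrift.example.com", 10001))
```

With a URL of this shape, the traffic is ordinary HTTPS and can pass through corporate proxies and firewalls, which is the point of the request.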

2016-10-17 19:10 GMT+02:00 Igor Sapego <isap...@gridgain.com>:


> Hi Vincent,
>
> Can you please explain what you mean by HTTP(S) support for ODBC?
>
> I'm not quite sure I get it.
>
> Best Regards,
> Igor
>
> On Thu, Oct 6, 2016 at 9:59 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Thanks
>>
>> Starting the thriftserver with igniterdd tables doesn't seem very hard.
>> Implementing a security layer over ignite cache may be harder as I need to:
>> - get username from thriftserver
>> - intercept each request and check permissions
>> Maybe spark will also be able to handle permissions...
>>
>> I will keep you informed
>>
>> On 6 Oct 2016 at 00:12, "Denis Magda" <dma...@gridgain.com> wrote:
>>
>>> Vincent,
>>>
>>> Please see below
>>>
>>> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>> Hi
>>> Thanks for your explanations. Please find more questions inline.
>>>
>>> Vincent
>>>
>>> 2016-10-05 3:33 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>>>
>>>> Hi Vincent,
>>>>
>>>> See my answers inline
>>>>
>>>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <
>>>> vincent.gromakow...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>> I know that Ignite has SQL support but:
>>>> - ODBC driver doesn't seem to provide HTTP(S) support, which is easier
>>>> to integrate on corporate networks with rules, firewalls, proxies
>>>>
>>>>
>>>> *Igor Sapego*, what URIs are supported presently?
>>>>
>>>> - The SQL engine doesn't seem to scale like Spark SQL would. For
>>>> instance, Spark won't generate an OOM if the dataset (source or result)
>>>> doesn't fit in memory. On the Ignite side, it's not clear…
>>>>
>>>>
>>>> OOM is not related to the scalability topic at all. This is about the
>>>> application’s logic.
>>>>
>>>> The Ignite SQL engine perfectly scales out along with your cluster.
>>>> Moreover, Ignite supports indexes, which allow you to get O(log N) running
>>>> time complexity for your SQL queries, while with Spark you will face
>>>> full scans (O(N)) all the time.
>>>>
>>>> However, to benefit from Ignite SQL queries you have to put all the
>>>> data in-memory. Ignite doesn’t go to a CacheStore (Cassandra, relational
>>>> database, MongoDB, etc) while a SQL query is executed and won’t preload
>>>> anything from an underlying CacheStore. Automatic preloading works for
>>>> key-value queries like cache.get(key).
>>>>
>>>
>>>
>>> This is an issue because I will potentially have to query terabytes of
>>> data. If I use Spark thriftserver backed by IgniteRDD, does it solve this
>>> point, and can I get automatic preloading from C*?
>>>
>>>
>>> IgniteRDD will load missing tuples (key-value pairs) from Cassandra
>>> because essentially IgniteRDD is an IgniteCache and Cassandra is a
>>> CacheStore. The only thing that is left to check is whether Spark
>>> thriftserver can work with IgniteRDDs. Hope you will be able to figure this
>>> out and share your feedback with us.
>>>
>>>
>>>
>>>> - Spark thrift can manage multi tenancy: different users can connect to
>>>> the same SQL engine and share cache. In Ignite it's one cache per user, so
>>>> a big waste of RAM.
>>>>
>>>>
>>>> Everyone can connect to an Ignite cluster and work with the same set of
>>>> distributed caches. I’m not sure why you need to create caches with the
>>>> same content for every user.
>>>>
>>>
>>> It's a security issue: an Ignite cache doesn't provide multiple user
>>> accounts per cache. I am thinking of using Spark to authenticate multiple
>>> users and then having Spark use a shared account on the Ignite cache.
>>>
>>>
>>> Basically, Ignite provides basic security interfaces and some
>>> implementations which you can rely on when building your secure solution.
>>> This article may be useful for your case:
>>> http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>>>
>>> —
>>> Denis
>>>
>>>
>>>> If you need real multi-tenancy support where cacheA is allowed to be
>>>> accessed by a group of users A only and cacheB by users from group B then
>>>> you can take a look at GridGain which is built on top of Ignite
>>>> https://gridgain.readme.io/docs/multi-tenancy
>>>>
>>>>
>>>>
>>> OK, but I am evaluating open-source-only solutions (Kylin, Druid,
>>> Alluxio...); it's a constraint from my management.
>>>
>>>>
>>>> What I want to achieve is:
>>>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive
>>>> doesn't), resulting in exactly-once semantics without any duplicates.
>>>> - use Spark SQL thriftserver in multi-tenancy for large-scale ad hoc
>>>> analytics queries (> TB) from an ODBC driver through HTTP(S)
>>>> - accelerate Cassandra reads when the data modeling of the Cassandra
>>>> table doesn't fit the queries. Queries would be OLAP style: target multiple
>>>> C* partitions, group-by or filters on lots of dimensions that aren't
>>>> necessarily in the C* table key.
>>>>
>>>>
>>>> As was mentioned, Ignite uses Cassandra as a CacheStore. You should
>>>> keep this in mind. Before trying to assemble the whole chain, I would
>>>> recommend that you try connecting the Spark SQL thrift server directly to
>>>> Ignite and working with its shared RDDs [1]. A shared RDD (basically an
>>>> Ignite cache) can be backed by Cassandra. This chain will probably work
>>>> for you, but I can’t give more precise guidance on this.
>>>>
>>>>
>>> I will try to make it work and give you feedback.
>>>
>>>
>>>
>>>> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>>>>
>>>> —
>>>> Denis
>>>>
>>>> Thanks for your advice
>>>>
>>>>
>>>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>>>>
>>>>> I am not sure that this will be performant. What do you want to
>>>>> achieve here? Fast lookups? Then the Cassandra Ignite store might be the
>>>>> right solution. If you want to do more analytic-style queries, then you
>>>>> can put the data on HDFS/Hive and use the Ignite HDFS cache to cache
>>>>> certain partitions/tables in Hive in memory. If you want to go to
>>>>> iterative machine learning algorithms, you can go for Spark on top of
>>>>> this. You can then also use the Ignite cache for Spark RDDs.
>>>>>
>>>>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznet...@gridgain.com>
>>>>> wrote:
>>>>>
>>>>> Hi, Vincent!
>>>>>
>>>>> Ignite also has SQL support (also scalable). I think it will be much
>>>>> faster to query directly from Ignite than to query from Spark.
>>>>> Also, please keep in mind that before executing queries you should load
>>>>> all needed data into the cache.
>>>>> To load data from Cassandra to Ignite you may use the Cassandra store [1].
>>>>>
>>>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>>>
>>>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <
>>>>> vincent.gromakowsk...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am evaluating the possibility of using Spark SQL (and its
>>>>>> scalability) over an Ignite cache with a Cassandra persistent store to
>>>>>> accelerate read workloads like OLAP-style analytics.
>>>>>> Is there any way to configure Spark thriftserver to load an external
>>>>>> table in Ignite like we can do with Cassandra?
>>>>>> Here is an example of a config for Spark backed by Cassandra:
>>>>>>
>>>>>> CREATE EXTERNAL TABLE MyHiveTable
>>>>>>         ( id int, data string )
>>>>>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>>>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x",
>>>>>>           "cassandra.ks.name" = "test" ,
>>>>>>           "cassandra.cf.name" = "mytable" ,
>>>>>>           "cassandra.ks.repfactor" = "1" ,
>>>>>>           "cassandra.ks.strategy" =
>>>>>>             "org.apache.cassandra.locator.SimpleStrategy" );
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Alexey Kuznetsov
>>>>>
>>>>>
>>>
>
