Re: spark SQL thriftserver over ignite and cassandra

Igor Sapego Tue, 25 Oct 2016 12:14:41 -0700

Vincent,

That's right, our ODBC driver does not support using HTTP(S) as a transport
currently.


Best Regards,
Igor

On Mon, Oct 17, 2016 at 9:40 PM, vincent gromakowski <
[email protected]> wrote:

> Hi
> I mean using HTTPS transport instead of binary (thrift?) transport.
>
> 2016-10-17 19:10 GMT+02:00 Igor Sapego <[email protected]>:
>
>> Hi Vincent,
>>
>> Can you please explain what you mean by HTTP(S) support for ODBC?
>>
>> I'm not quite sure I get it.
>>
>> Best Regards,
>> Igor
>>
>> On Thu, Oct 6, 2016 at 9:59 AM, vincent gromakowski <
>> [email protected]> wrote:
>>
>>> Thanks
>>>
>>> Starting the thriftserver with IgniteRDD tables doesn't seem very hard.
>>> Implementing a security layer over the Ignite cache may be harder, as I
>>> need to:
>>> - get the username from the thriftserver
>>> - intercept each request and check permissions
>>> Maybe Spark will also be able to handle permissions...
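The two steps listed above (obtain the username, then intercept each request and check permissions) could be sketched roughly as follows. This is only a toy model in plain Python, not Ignite or Spark code: `SecuredCacheFacade`, `PermissionDenied`, the `acl` mapping and the sample data are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when a user may not read the requested cache."""


@dataclass
class SecuredCacheFacade:
    # Hypothetical in-memory stand-in for the caches: name -> {key: value}
    caches: dict
    # Hypothetical ACL: user name -> set of cache names the user may read
    acl: dict = field(default_factory=dict)

    def get(self, user, cache_name, key):
        # Intercept every request and check permissions before any read.
        if cache_name not in self.acl.get(user, set()):
            raise PermissionDenied(f"{user} may not read {cache_name}")
        return self.caches[cache_name].get(key)


# The user name is assumed to be supplied by the front end (step 1).
facade = SecuredCacheFacade(
    caches={"sales": {1: "row-1"}},
    acl={"alice": {"sales"}},
)
print(facade.get("alice", "sales", 1))
```

In this sketch a single shared facade sits in front of the data, so every user reads the same cache and only the permission check differs per user.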
>>>
>>> I will keep you informed
>>>
>>> On Oct 6, 2016 at 00:12, "Denis Magda" <[email protected]> wrote:
>>>
>>>> Vincent,
>>>>
>>>> Please see below
>>>>
>>>> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <
>>>> [email protected]> wrote:
>>>>
>>>> Hi
>>>> Thanks for your explanations. Please find more questions inline.
>>>>
>>>> Vincent
>>>>
>>>> 2016-10-05 3:33 GMT+02:00 Denis Magda <[email protected]>:
>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> See my answers inline
>>>>>
>>>>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>> I know that Ignite has SQL support but:
>>>>> - ODBC driver doesn't seem to provide HTTP(S) support, which is easier
>>>>> to integrate on corporate networks with rules, firewalls, proxies
>>>>>
>>>>>
>>>>> *Igor Sapego*, what URIs are supported presently?
>>>>>
>>>>> - The SQL engine doesn't seem to scale like Spark SQL would. For
>>>>> instance, Spark won't generate an OOM if the dataset (source or result)
>>>>> doesn't fit in memory. From the Ignite side, it's not clear…
>>>>>
>>>>>
>>>>> OOM is not related to the scalability topic at all. It is about the
>>>>> application’s logic.
>>>>>
>>>>> The Ignite SQL engine scales out perfectly along with your cluster.
>>>>> Moreover, Ignite supports indexes, which allow you to get O(log N)
>>>>> running time complexity for your SQL queries, while in the case of Spark
>>>>> you will face full scans (O(N)) all the time.
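To make the O(log N) versus O(N) point concrete, here is a small standalone Python sketch. It does not model how Ignite or Spark are implemented: a sorted list with binary search merely stands in for an index, and the function names and sample rows are assumptions for illustration.

```python
import bisect


def full_scan(rows, target):
    """O(N): inspect rows one by one until the key is found."""
    steps = 0
    for key, value in rows:
        steps += 1
        if key == target:
            return value, steps
    return None, steps


def indexed_lookup(sorted_keys, values, target):
    """O(log N): binary search over a sorted key list acting as an index."""
    i = bisect.bisect_left(sorted_keys, target)
    if i < len(sorted_keys) and sorted_keys[i] == target:
        return values[i]
    return None


# Hypothetical table of 100,000 rows keyed by an integer id.
rows = [(k, f"row-{k}") for k in range(100_000)]
sorted_keys = [k for k, _ in rows]
values = [v for _, v in rows]

scan_value, scan_steps = full_scan(rows, 99_999)
index_value = indexed_lookup(sorted_keys, values, 99_999)
```

Both calls return the same row, but the scan touches every row in the worst case, while the binary search needs on the order of log2(N) comparisons.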
>>>>>
>>>>> However, to benefit from Ignite SQL queries you have to put all the
>>>>> data in memory. Ignite doesn’t go to a CacheStore (Cassandra, relational
>>>>> database, MongoDB, etc.) while a SQL query is executed and won’t preload
>>>>> anything from an underlying CacheStore. Automatic preloading works for
>>>>> key-value queries like cache.get(key).
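The difference between the two access paths described above can be illustrated with a toy model. This is plain Python, not the Ignite API: `ReadThroughCache`, its `query` method and the dictionary standing in for the CacheStore are hypothetical and exist only to show the behaviour.

```python
class ReadThroughCache:
    def __init__(self, store):
        self.store = store   # stand-in for the CacheStore (e.g. Cassandra)
        self.memory = {}     # entries currently held in memory

    def get(self, key):
        # Key-value access: on a miss, load the entry from the backing store.
        if key not in self.memory and key in self.store:
            self.memory[key] = self.store[key]
        return self.memory.get(key)

    def query(self, predicate):
        # Query access: only looks at entries that are already in memory.
        return [v for v in self.memory.values() if predicate(v)]


store = {1: 10, 2: 20, 3: 30}
cache = ReadThroughCache(store)

before = cache.query(lambda v: v >= 20)  # nothing has been loaded yet
cache.get(2)                             # read-through loads key 2
after = cache.query(lambda v: v >= 20)   # sees key 2, still misses key 3
```

In this model the query silently returns an incomplete result until the data has been loaded, which is why the data has to be in memory before queries are run.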
>>>>>
>>>>
>>>>
>>>> This is an issue because I will potentially have to query terabytes of
>>>> data. If I use the Spark thriftserver backed by IgniteRDD, does it solve
>>>> this point, and can I get automatic preloading from C*?
>>>>
>>>>
>>>> IgniteRDD will load missing key-value pairs from Cassandra because
>>>> essentially IgniteRDD is an IgniteCache and Cassandra is a CacheStore.
>>>> The only thing that is left to check is whether the Spark thriftserver
>>>> can work with IgniteRDDs. I hope you will be able to figure this out and
>>>> share your feedback with us.
>>>>
>>>>
>>>>
>>>>> - Spark thrift can manage multi tenancy: different users can connect
>>>>> to the same SQL engine and share cache. In Ignite it's one cache per user,
>>>>> so a big waste of RAM.
>>>>>
>>>>>
>>>>> Everyone can connect to an Ignite cluster and work with the same set
>>>>> of distributed caches. I’m not sure why you need to create caches with the
>>>>> same content for every user.
>>>>>
>>>>
>>>> It's a security issue: the Ignite cache doesn't provide multiple user
>>>> accounts per cache. I am thinking of using Spark to authenticate multiple
>>>> users and then having Spark use a shared account on the Ignite cache.
>>>>
>>>>
>>>> Basically, Ignite provides basic security interfaces and some
>>>> implementations which you can rely on when building your secure solution.
>>>> This article can be useful for your case:
>>>> http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>>>>
>>>> —
>>>> Denis
>>>>
>>>>
>>>>> If you need real multi-tenancy support, where cacheA is allowed to be
>>>>> accessed by users from group A only and cacheB by users from group B,
>>>>> then you can take a look at GridGain, which is built on top of Ignite:
>>>>> https://gridgain.readme.io/docs/multi-tenancy
>>>>>
>>>>>
>>>>>
>>>> OK, but I am evaluating open-source-only solutions (Kylin, Druid,
>>>> Alluxio...); it's a constraint from my management.
>>>>
>>>>>
>>>>> What I want to achieve is:
>>>>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive
>>>>> doesn't), resulting in exactly-once semantics without any duplicates.
>>>>> - use the Spark SQL thriftserver in multi-tenancy for large-scale ad hoc
>>>>> analytics queries (> TB) from an ODBC driver through HTTP(S)
>>>>> - accelerate Cassandra reads when the data modeling of the Cassandra
>>>>> table doesn't fit the queries. Queries would be OLAP style: target
>>>>> multiple C* partitions, group-by or filters on lots of dimensions that
>>>>> aren't necessarily in the C* table key.
>>>>>
>>>>>
>>>>> As was mentioned, Ignite uses Cassandra as a CacheStore. You should
>>>>> keep this in mind. Before trying to assemble the whole chain, I would
>>>>> recommend that you try to connect the Spark SQL thrift server directly
>>>>> to Ignite and work with its shared RDDs [1]. A shared RDD (basically an
>>>>> Ignite cache) can be backed by Cassandra. Probably this chain will work
>>>>> for you, but I can’t give more precise guidance on this.
>>>>>
>>>>>
>>>> I will try to make it work and give you feedback.
>>>>
>>>>
>>>>
>>>>> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>>>>>
>>>>> —
>>>>> Denis
>>>>>
>>>>> Thanks for your advice
>>>>>
>>>>>
>>>>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <[email protected]>:
>>>>>
>>>>>> I am not sure that this will be performant. What do you want to
>>>>>> achieve here? Fast lookups? Then the Cassandra Ignite store might be the
>>>>>> right solution. If you want to do more analytic-style queries, then you
>>>>>> can put the data on HDFS/Hive and use the Ignite HDFS cache to cache
>>>>>> certain partitions/tables in Hive in memory. If you want to go to
>>>>>> iterative machine learning algorithms, you can go for Spark on top of
>>>>>> this. You can then also use the Ignite cache for Spark RDDs.
>>>>>>
>>>>>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi, Vincent!
>>>>>>
>>>>>> Ignite also has SQL support (also scalable). I think it will be much
>>>>>> faster to query directly from Ignite than to query from Spark.
>>>>>> Also, please keep in mind that before executing queries you should load
>>>>>> all the needed data into the cache.
>>>>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>>>>>
>>>>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>>>>
>>>>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am evaluating the possibility of using Spark SQL (and its
>>>>>>> scalability) over an Ignite cache with a Cassandra persistent store to
>>>>>>> speed up read workloads such as OLAP-style analytics.
>>>>>>> Is there any way to configure the Spark thriftserver to load an
>>>>>>> external table in Ignite, as we can do with Cassandra?
>>>>>>> Here is an example of a config for Spark backed by Cassandra:
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE MyHiveTable
>>>>>>>         ( id int, data string )
>>>>>>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>>>>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x",
>>>>>>>           "cassandra.ks.name" = "test" ,
>>>>>>>           "cassandra.cf.name" = "mytable" ,
>>>>>>>           "cassandra.ks.repfactor" = "1" ,
>>>>>>>           "cassandra.ks.strategy" =
>>>>>>>             "org.apache.cassandra.locator.SimpleStrategy" );
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alexey Kuznetsov
>>>>>>
>>>>>>
>>>>
>>
>
