Hi, thanks for your explanations. Please find more questions inline.
Vincent
2016-10-05 3:33 GMT+02:00 Denis Magda <[email protected]>:

> Hi Vincent,
>
> See my answers inline.
>
> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <[email protected]> wrote:
>
>> Hi,
>> I know that Ignite has SQL support, but:
>> - the ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate on corporate networks with rules, firewalls, and proxies.
>
> *Igor Sapego*, what URIs are supported presently?
>
>> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, Spark won't generate an OOM if the dataset (source or result) doesn't fit in memory. From the Ignite side, it's not clear…
>
> OOM is not related to the scalability topic at all. This is about the application's logic.
>
> The Ignite SQL engine scales out perfectly along with your cluster. Moreover, Ignite supports indexes, which gives you O(log N) running-time complexity for your SQL queries, while with Spark you will face full scans (O(N)) all the time.
>
> However, to benefit from Ignite SQL queries you have to put all the data in memory. Ignite doesn't go to a CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query is executed and won't preload anything from an underlying CacheStore. Automatic preloading works for key-value queries like cache.get(key).

This is an issue because I will potentially have to query TBs of data. If I use the Spark thrift server backed by an IgniteRDD, does it solve this point, and can I get automatic preloading from C*?

>> - The Spark thrift server can manage multi-tenancy: different users can connect to the same SQL engine and share the cache. In Ignite it's one cache per user, so a big waste of RAM.
>
> Everyone can connect to an Ignite cluster and work with the same set of distributed caches. I'm not sure why you need to create caches with the same content for every user.

It's a security issue: an Ignite cache doesn't provide multiple user accounts per cache.
I am thinking of using Spark to authenticate multiple users and then having Spark use a shared account on the Ignite cache.

> If you need real multi-tenancy support, where cacheA is allowed to be accessed only by users from group A and cacheB only by users from group B, then you can take a look at GridGain, which is built on top of Ignite:
> https://gridgain.readme.io/docs/multi-tenancy

OK, but I am evaluating open-source-only solutions (Kylin, Druid, Alluxio...); it's a constraint from my hierarchy.

>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive doesn't), resulting in exactly-once semantics without any duplicates;
>> - use the Spark SQL thrift server in multi-tenancy for large-scale ad-hoc analytics queries (> TB) from an ODBC driver through HTTP(S);
>> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't fit the queries. Queries would be OLAP style: targeting multiple C* partitions, with group-bys or filters on lots of dimensions that aren't necessarily in the C* table key.
>
> As was mentioned, Ignite uses Cassandra as a CacheStore; you should keep this in mind. Before trying to assemble the whole chain, I would recommend you try connecting the Spark SQL thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by Cassandra. Probably this chain will work for you, but I can't give more precise guidance on it.

I will try to make it work and give you feedback.

> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>
> —
> Denis

Thanks for your advice.

2016-10-04 6:51 GMT+02:00 Jörn Franke <[email protected]>:

>> I am not sure that this will be performant. What do you want to achieve here? Fast lookups? Then the Cassandra Ignite store might be the right solution.
>> If you want to do a more analytic style of queries, then you can put the data on HDFS/Hive and use the Ignite HDFS cache to cache certain partitions/tables of Hive in memory. If you want to go to iterative machine-learning algorithms, you can go for Spark on top of this. You can then also use the Ignite cache for Spark RDDs.
>>
>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <[email protected]> wrote:
>>
>> Hi, Vincent!
>>
>> Ignite also has SQL support (also scalable); I think it will be much faster to query directly from Ignite than to query from Spark.
>> Also please mind that before executing queries you should load all the needed data into the cache.
>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>
>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>
>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <[email protected]> wrote:
>>
>>> Hi,
>>> I am evaluating the possibility of using Spark SQL (and its scalability) over an Ignite cache with a Cassandra persistent store to speed up read workloads such as OLAP-style analytics.
>>> Is there any way to configure the Spark thrift server to load an external table in Ignite, like we can do with Cassandra?
>>> Here is an example of the config for Spark backed by Cassandra:
>>>
>>> CREATE EXTERNAL TABLE MyHiveTable
>>>   ( id int, data string )
>>> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>> TBLPROPERTIES (
>>>   "cassandra.host" = "x.x.x.x",
>>>   "cassandra.ks.name" = "test",
>>>   "cassandra.cf.name" = "mytable",
>>>   "cassandra.ks.repfactor" = "1",
>>>   "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy" );
>>
>> --
>> Alexey Kuznetsov
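For reference, wiring the Cassandra store [1] into an Ignite cache is done through the cache configuration rather than a Hive DDL statement. A minimal sketch of the Spring XML, assuming the bean and class names documented for the ignite-cassandra module; the contact point, cache name, and persistence-settings bean are placeholders, and this fragment is untested:

```xml
<!-- Sketch: an Ignite cache that reads/writes through a Cassandra CacheStore.
     Class names per the ignite-cassandra module docs; values are placeholders. -->
<bean id="cassandraDataSource"
      class="org.apache.ignite.cache.store.cassandra.datasource.DataSource">
  <property name="contactPoints" value="x.x.x.x"/>
</bean>

<bean id="mytableCache"
      class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="mytable"/>
  <!-- read-through makes cache.get(key) fall through to Cassandra on a miss;
       note this preloading does NOT apply to Ignite SQL queries (see Denis's
       remark above). -->
  <property name="readThrough" value="true"/>
  <property name="writeThrough" value="true"/>
  <property name="cacheStoreFactory">
    <bean class="org.apache.ignite.cache.store.cassandra.CassandraCacheStoreFactory">
      <property name="dataSourceBean" value="cassandraDataSource"/>
      <property name="persistenceSettingsBean" value="mytablePersistenceSettings"/>
    </bean>
  </property>
</bean>
```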

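The chain Denis suggests (Spark SQL thrift server over a shared IgniteRDD, with the cache backed by Cassandra) might be sketched roughly as below. This is an untested outline only, against the Ignite 1.x ignite-spark module and Spark 1.x APIs; the config path, cache name, key/value types, and table names are assumptions, and the exact IgniteContext/IgniteRDD signatures vary between versions:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object IgniteThriftSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-olap"))
    val hive = new HiveContext(sc)

    // Starts (or attaches to) an Ignite node on each executor from an XML
    // config; the config would declare the Cassandra-backed cache.
    val ic = new IgniteContext[Int, String](sc, "ignite-cassandra-config.xml")

    // IgniteRDD is a live, shared view over the distributed "mytable" cache.
    val rdd = ic.fromCache("mytable")

    // Run the query on Ignite's SQL engine (indexed, in-memory) and expose
    // the resulting DataFrame to the thrift server as a temp table.
    val df = rdd.sql("select _key, _val from String")
    df.registerTempTable("mytable")
  }
}
```

The caveat from the thread still applies: the SQL query only sees what is already in the cache, so the data would have to be preloaded from C* (e.g. via loadCache) before the thrift server can usefully query terabytes of it.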