Re: Is it possible to have a column which can hold any data type (for inserting as json)

2017-01-31 Thread Benjamin Roth
You should post the whole CQL query you are trying to execute! Why don't you
use a native JSON type for your JSON data?

2017-02-01 7:51 GMT+01:00 Rajeswari Menon :

> Hi,
>
>
>
> I have a json data as shown below.
>
>
>
> {
>
> "address":"127.0.0.1",
>
> "datatype":"DOUBLE",
>
> "name":"Longitude",
>
>  "attributes":{
>
> "Id":"1"
>
> },
>
> "category":"REAL",
>
> "value":1.390692,
>
> "timestamp":1485923271718,
>
> "quality":"GOOD"
>
> }
>
>
>
> To store the above json to Cassandra, I defined a table as shown below
>
>
>
> create table data
> (
>   id int primary key,
>   address text,
>   datatype text,
>   name text,
>   attributes map < text, text >,
>   category text,
>   value text,
>   "timestamp" timestamp,
>   quality text
> );
>
>
>
> When I try to insert the data as JSON I got the error : *Error decoding
> JSON value for value: Expected a UTF-8 string, but got a Double: 1.390692*.
> The message is clear that a double value cannot be inserted to text column.
> The real issue is that the value can be of any data type, so the schema
> cannot be predefined. Is there a way to create a column which can hold
> value of any data type. (I don’t want to hold the entire json as string. My
> preferred way is to define a schema.)
>
>
>
> Regards,
>
> Rajeswari
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Is it possible to have a column which can hold any data type (for inserting as json)

2017-01-31 Thread Rajeswari Menon
Hi,

I have JSON data as shown below.

{
"address":"127.0.0.1",
"datatype":"DOUBLE",
"name":"Longitude",
 "attributes":{
"Id":"1"
},
"category":"REAL",
"value":1.390692,
"timestamp":1485923271718,
"quality":"GOOD"
}

To store the above JSON in Cassandra, I defined the table shown below.

create table data
(
  id int primary key,
  address text,
  datatype text,
  name text,
  attributes map < text, text >,
  category text,
  value text,
  "timestamp" timestamp,
  quality text
);

When I try to insert the data as JSON, I get the error: Error decoding JSON
value for value: Expected a UTF-8 string, but got a Double: 1.390692. The
message is clear: a double value cannot be inserted into a text column. The
real issue is that the value can be of any data type, so the schema cannot be
predefined. Is there a way to create a column that can hold a value of any data
type? (I don't want to store the entire JSON as a string; I would prefer to
define a schema.)

Regards,
Rajeswari
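
A possible workaround for the error described above, sketched in Python with
only the standard library: keep the value column as text, encode every incoming
value as a string before building the INSERT ... JSON payload, and use the
existing datatype column to decode it on read. The function names and the
datatype labels other than DOUBLE are assumptions made for illustration; they
do not come from the thread.

```python
import json


def normalize_for_text_column(doc, field="value"):
    """Return a copy of doc in which doc[field] is always a string."""
    out = dict(doc)
    if field in out and not isinstance(out[field], str):
        # json.dumps keeps numbers and booleans in a form that can be parsed back.
        out[field] = json.dumps(out[field])
    return out


def decode_value(text, datatype):
    """Turn the stored text back into a Python value using the datatype column."""
    # Only "DOUBLE" appears in the thread; the other labels are hypothetical.
    if datatype == "DOUBLE":
        return float(text)
    if datatype in ("INT", "LONG"):
        return int(text)
    if datatype == "BOOLEAN":
        return text == "true"
    return text


reading = {
    "address": "127.0.0.1",
    "datatype": "DOUBLE",
    "name": "Longitude",
    "attributes": {"Id": "1"},
    "category": "REAL",
    "value": 1.390692,
    "timestamp": 1485923271718,
    "quality": "GOOD",
}

row = normalize_for_text_column(reading)
# String that could be bound to a statement such as: INSERT INTO data JSON ?
payload = json.dumps(row)
```

The table in the question also has an id primary key that the sample JSON does
not carry, so a real payload would still need one before it could be inserted.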


Re: Global TTL vs Insert TTL

2017-01-31 Thread Cogumelos Maravilha
Hi Alain,

Thanks for your response and the links.

I've also checked "Time series data model and tombstones".

Is it safe to use TWCS in C* 3.9?

Thanks in advance.


On 31-01-2017 11:27, Alain RODRIGUEZ wrote:
>
> Is there a overhead using line by line option or wasted disk space?
>
>  There is a very recent topic about that in the mailing list, look for
> "Time series data model and tombstones". I believe DuyHai answer your
> question there with more details :).
>
> *tl;dr:*
>
> Yes, if you know the TTL in advance, and it is fixed, you might want
> to go with the table option instead of adding the TTL in each insert.
> Also you might want consider using TWCS compaction strategy.
>
> Here are some blogposts my coworkers recently wrote about TWCS, it
> might be useful:
>
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
> http://thelastpickle.com/blog/2017/01/10/twcs-part2.html
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> 
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2017-01-31 10:43 GMT+01:00 Cogumelos Maravilha:
>
> Hi I'm just wondering what option is fastest:
>
> Global: create table xxx (... AND default_time_to_live = XXX; and
> UPDATE xxx USING TTL XXX;
>
> Line by line:
>
> INSERT INTO xxx (... USING TTL xxx;
>
> Is there a overhead using line by line option or wasted disk space?
>
> Thanks in advance.
>
>



Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread Justin Cameron
+1

On Tue, 31 Jan 2017 at 10:04 Jonathan Haddad  wrote:

> With regards to having DCs for specific workloads, it would be nice to
> have per DC indexes.  See
> https://issues.apache.org/jira/browse/CASSANDRA-12663.
>
> On Tue, Jan 31, 2017 at 9:52 AM Justin Cameron 
> wrote:
>
> Lucene/Elassandra and Spark serve different purposes.
>
> Lucene & Elassandra are designed for real-time queries that have
> predicates on columns not in the Cassandra primary key (i.e. searches). For
> example if you have a "person" table with person_id as the primary key but
> you want to allow users of your app to search for users by their last name.
>
> Spark is designed for batch and/or streaming analytical workloads (it can
> also do other things, but these are it's primary uses). For example you
> might want to know how many people of different age groups use your
> application.
>
> Ideally you should separate these workloads from each other and from your
> operational workload (standard C* queries) into their own Cassandra
> datacenters, as they each have very different performance impacts &
> requirements.
>
> On Tue, 31 Jan 2017 at 00:57 vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> You can also have a look at https://github.com/strapdata/elassandra
>
>
> 2017-01-31 9:50 GMT+01:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
> The problem with adhoc queries on casssandra (with spark or not) is the
> partition model of cassandra that needs to be respected to avoid full scan
> queries (the link you mentioned explains all of them). With FiloDB, which
> works on cassandra, you can pushdown predicates of the partition key and
> segment key in an arbitrary order resulting in less full scan
> queries. Another advantage is the computed columns that can also prune
> partitions or segments so reduce the reads based on a subpart of the key
> (like a timerange of 2 hours or 10 min).
> Anyway it's not magic and my personal analysis doesn't target filodb as a
> fully adhoc query solution but it's largely better than pure cassandra. You
> can easily have pushdown predicates on any combination of 1 to 3-5 columns
> depending on the dataset compared to pure cassandra where you need to
> provide a first key value to pushdown the second key predicate, then the
> third key...
>
> 2017-01-31 8:56 GMT+01:00 Yu, John :
>
> Thanks. I thought you have given up Lucene for Spark, but it seems your
> Lucene still works.
>
>
>
> Spark also has a Cassandra connector, and my questions were more towards
> that.
>
> From
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
> it seems there’re limitations on how much one can select the data to
> support ad hoc queries. It seems mostly limited to clustering columns.
> Maybe in other cases, it would result in full scan, but that’s going to be
> very slow.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
> *Sent:* Monday, January 30, 2017 10:20 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi,
>
> *Are you using the DataStax connector as well? *
>
> Yes, we used it to query on lucene index.
>
>
>
> *Does it support querying against any column well (not just clustering
> columns)?*
>
> Yes it does. We used lucene particularly for this purpose.
>
> ( You can use :
>
> 1.
> https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>
> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>
> for more details)
>
>
>
> *I’m wondering how it could build the index around them “on-the-fly”*
>
> You can build indexes at run time, but it takes time(took a lot of time on
> our cluster. Plus, CPU utilization went through the roof)
>
>
>
> *did you use Spark for the full set of data or just partial*
>
> We weren't allowed to install spark ( tech decision)
>
> Some tech discussions going around for the bulk job ecosystem.
>
>
>
> Hence as a work around, we used a faster scan utility.
>
> For all the adhoc purposes/scripts, you could do a full scan.
>
>
>
> I hope it helps.
>
>
>
> Regards
>
>
>
>
>
> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>
> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* 

Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread Jonathan Haddad
With regards to having DCs for specific workloads, it would be nice to have
per DC indexes.  See https://issues.apache.org/jira/browse/CASSANDRA-12663.


On Tue, Jan 31, 2017 at 9:52 AM Justin Cameron 
wrote:

> Lucene/Elassandra and Spark serve different purposes.
>
> Lucene & Elassandra are designed for real-time queries that have
> predicates on columns not in the Cassandra primary key (i.e. searches). For
> example if you have a "person" table with person_id as the primary key but
> you want to allow users of your app to search for users by their last name.
>
> Spark is designed for batch and/or streaming analytical workloads (it can
> also do other things, but these are it's primary uses). For example you
> might want to know how many people of different age groups use your
> application.
>
> Ideally you should separate these workloads from each other and from your
> operational workload (standard C* queries) into their own Cassandra
> datacenters, as they each have very different performance impacts &
> requirements.
>
> On Tue, 31 Jan 2017 at 00:57 vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> You can also have a look at https://github.com/strapdata/elassandra
>
>
> 2017-01-31 9:50 GMT+01:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
> The problem with adhoc queries on casssandra (with spark or not) is the
> partition model of cassandra that needs to be respected to avoid full scan
> queries (the link you mentioned explains all of them). With FiloDB, which
> works on cassandra, you can pushdown predicates of the partition key and
> segment key in an arbitrary order resulting in less full scan
> queries. Another advantage is the computed columns that can also prune
> partitions or segments so reduce the reads based on a subpart of the key
> (like a timerange of 2 hours or 10 min).
> Anyway it's not magic and my personal analysis doesn't target filodb as a
> fully adhoc query solution but it's largely better than pure cassandra. You
> can easily have pushdown predicates on any combination of 1 to 3-5 columns
> depending on the dataset compared to pure cassandra where you need to
> provide a first key value to pushdown the second key predicate, then the
> third key...
>
> 2017-01-31 8:56 GMT+01:00 Yu, John :
>
> Thanks. I thought you have given up Lucene for Spark, but it seems your
> Lucene still works.
>
>
>
> Spark also has a Cassandra connector, and my questions were more towards
> that.
>
> From
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
> it seems there’re limitations on how much one can select the data to
> support ad hoc queries. It seems mostly limited to clustering columns.
> Maybe in other cases, it would result in full scan, but that’s going to be
> very slow.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
> *Sent:* Monday, January 30, 2017 10:20 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi,
>
> *Are you using the DataStax connector as well? *
>
> Yes, we used it to query on lucene index.
>
>
>
> *Does it support querying against any column well (not just clustering
> columns)?*
>
> Yes it does. We used lucene particularly for this purpose.
>
> ( You can use :
>
> 1.
> https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>
> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>
> for more details)
>
>
>
> *I’m wondering how it could build the index around them “on-the-fly”*
>
> You can build indexes at run time, but it takes time(took a lot of time on
> our cluster. Plus, CPU utilization went through the roof)
>
>
>
> *did you use Spark for the full set of data or just partial*
>
> We weren't allowed to install spark ( tech decision)
>
> Some tech discussions going around for the bulk job ecosystem.
>
>
>
> Hence as a work around, we used a faster scan utility.
>
> For all the adhoc purposes/scripts, you could do a full scan.
>
>
>
> I hope it helps.
>
>
>
> Regards
>
>
>
>
>
> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>
> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
> ]
> *Sent:* 

Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread Justin Cameron
Lucene/Elassandra and Spark serve different purposes.

Lucene & Elassandra are designed for real-time queries that have predicates
on columns not in the Cassandra primary key (i.e. searches). For example, you
might have a "person" table with person_id as the primary key but want to
allow users of your app to search for users by their last name.

Spark is designed for batch and/or streaming analytical workloads (it can
also do other things, but these are its primary uses). For example, you
might want to know how many people of different age groups use your
application.

Ideally you should separate these workloads from each other and from your
operational workload (standard C* queries) into their own Cassandra
datacenters, as they each have very different performance impacts &
requirements.

On Tue, 31 Jan 2017 at 00:57 vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> You can also have a look at https://github.com/strapdata/elassandra
>
>
> 2017-01-31 9:50 GMT+01:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
> The problem with adhoc queries on casssandra (with spark or not) is the
> partition model of cassandra that needs to be respected to avoid full scan
> queries (the link you mentioned explains all of them). With FiloDB, which
> works on cassandra, you can pushdown predicates of the partition key and
> segment key in an arbitrary order resulting in less full scan
> queries. Another advantage is the computed columns that can also prune
> partitions or segments so reduce the reads based on a subpart of the key
> (like a timerange of 2 hours or 10 min).
> Anyway it's not magic and my personal analysis doesn't target filodb as a
> fully adhoc query solution but it's largely better than pure cassandra. You
> can easily have pushdown predicates on any combination of 1 to 3-5 columns
> depending on the dataset compared to pure cassandra where you need to
> provide a first key value to pushdown the second key predicate, then the
> third key...
>
> 2017-01-31 8:56 GMT+01:00 Yu, John :
>
> Thanks. I thought you have given up Lucene for Spark, but it seems your
> Lucene still works.
>
>
>
> Spark also has a Cassandra connector, and my questions were more towards
> that.
>
> From
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
> it seems there’re limitations on how much one can select the data to
> support ad hoc queries. It seems mostly limited to clustering columns.
> Maybe in other cases, it would result in full scan, but that’s going to be
> very slow.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
> *Sent:* Monday, January 30, 2017 10:20 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi,
>
> *Are you using the DataStax connector as well? *
>
> Yes, we used it to query on lucene index.
>
>
>
> *Does it support querying against any column well (not just clustering
> columns)?*
>
> Yes it does. We used lucene particularly for this purpose.
>
> ( You can use :
>
> 1.
> https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>
> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>
> for more details)
>
>
>
> *I’m wondering how it could build the index around them “on-the-fly”*
>
> You can build indexes at run time, but it takes time(took a lot of time on
> our cluster. Plus, CPU utilization went through the roof)
>
>
>
> *did you use Spark for the full set of data or just partial*
>
> We weren't allowed to install spark ( tech decision)
>
> Some tech discussions going around for the bulk job ecosystem.
>
>
>
> Hence as a work around, we used a faster scan utility.
>
> For all the adhoc purposes/scripts, you could do a full scan.
>
>
>
> I hope it helps.
>
>
>
> Regards
>
>
>
>
>
> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>
> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
> ]
> *Sent:* Friday, January 27, 2017 12:15 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi
>
> We used lucene stratio plugin with C*3.0.3
>
>
>
> Helped to solve a lot of some read patterns. Served well for prefix.
>

Re: Global TTL vs Insert TTL

2017-01-31 Thread Alain RODRIGUEZ
> Is there a overhead using line by line option or wasted disk space?

There is a very recent topic about that on the mailing list; look for "Time
series data model and tombstones". I believe DuyHai answers your question
there in more detail :).

*tl;dr:*

Yes, if you know the TTL in advance, and it is fixed, you might want to go
with the table option instead of adding the TTL in each insert. Also, you
might want to consider using the TWCS compaction strategy.
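
To make the two options concrete, here is a small Python model (not Cassandra
code) of how the TTL for a write is chosen: a TTL given with the write takes
precedence, otherwise the table's default_time_to_live applies. The function
names are illustrative assumptions, and the model deliberately ignores
version-specific details.

```python
from typing import Optional

DAY = 86400  # seconds


def effective_ttl(table_default_ttl: int, write_ttl: Optional[int] = None) -> int:
    """TTL in seconds applied to a write; 0 means the data never expires."""
    # A TTL supplied with the write (USING TTL) wins over the table default.
    return table_default_ttl if write_ttl is None else write_ttl


def expiry_time(write_time_s: int, ttl_s: int) -> Optional[int]:
    """Second at which the written data expires, or None if it never does."""
    return None if ttl_s == 0 else write_time_s + ttl_s


# Table option: default_time_to_live = DAY, writes carry no TTL.
table_option = expiry_time(1000, effective_ttl(DAY))

# Line by line: no table default, every write says USING TTL DAY.
per_write = expiry_time(1000, effective_ttl(0, DAY))
```

With a fixed TTL both paths give the same expiry in this model; the table
option simply removes the need to repeat the TTL on every statement.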

Here are some blog posts my coworkers recently wrote about TWCS; they might be
useful:

http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
http://thelastpickle.com/blog/2017/01/10/twcs-part2.html

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2017-01-31 10:43 GMT+01:00 Cogumelos Maravilha :

> Hi I'm just wondering what option is fastest:
>
> Global: create table xxx (... AND default_time_to_live = XXX; and
> UPDATE xxx USING TTL XXX;
>
> Line by line:
> INSERT INTO xxx (... USING TTL xxx;
>
> Is there a overhead using line by line option or wasted disk space?
>
> Thanks in advance.
>
>


Global TTL vs Insert TTL

2017-01-31 Thread Cogumelos Maravilha
Hi, I'm just wondering which option is fastest:

Global: create table xxx (... AND default_time_to_live = XXX; and
UPDATE xxx USING TTL XXX;

Line by line:

INSERT INTO xxx (... USING TTL xxx;

Is there any overhead or wasted disk space when using the line-by-line option?

Thanks in advance.



Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread vincent gromakowski
You can also have a look at https://github.com/strapdata/elassandra


2017-01-31 9:50 GMT+01:00 vincent gromakowski :

> The problem with adhoc queries on casssandra (with spark or not) is the
> partition model of cassandra that needs to be respected to avoid full scan
> queries (the link you mentioned explains all of them). With FiloDB, which
> works on cassandra, you can pushdown predicates of the partition key and
> segment key in an arbitrary order resulting in less full scan
> queries. Another advantage is the computed columns that can also prune
> partitions or segments so reduce the reads based on a subpart of the key
> (like a timerange of 2 hours or 10 min).
> Anyway it's not magic and my personal analysis doesn't target filodb as a
> fully adhoc query solution but it's largely better than pure cassandra. You
> can easily have pushdown predicates on any combination of 1 to 3-5 columns
> depending on the dataset compared to pure cassandra where you need to
> provide a first key value to pushdown the second key predicate, then the
> third key...
>
> 2017-01-31 8:56 GMT+01:00 Yu, John :
>
>> Thanks. I thought you have given up Lucene for Spark, but it seems your
>> Lucene still works.
>>
>>
>>
>> Spark also has a Cassandra connector, and my questions were more towards
>> that.
>>
>> From
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
>> it seems there’re limitations on how much one can select the data to
>> support ad hoc queries. It seems mostly limited to clustering columns.
>> Maybe in other cases, it would result in full scan, but that’s going to be
>> very slow.
>>
>>
>>
>> Regards,
>>
>> John
>>
>>
>>
>> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
>> *Sent:* Monday, January 30, 2017 10:20 PM
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Hi,
>>
>> *Are you using the DataStax connector as well? *
>>
>> Yes, we used it to query on lucene index.
>>
>>
>>
>> *Does it support querying against any column well (not just clustering
>> columns)?*
>>
>> Yes it does. We used lucene particularly for this purpose.
>>
>> ( You can use :
>>
>> 1.
>> https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>>
>> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>>
>> for more details)
>>
>>
>>
>> *I’m wondering how it could build the index around them “on-the-fly”*
>>
>> You can build indexes at run time, but it takes time(took a lot of time
>> on our cluster. Plus, CPU utilization went through the roof)
>>
>>
>>
>> *did you use Spark for the full set of data or just partial*
>>
>> We weren't allowed to install spark ( tech decision)
>>
>> Some tech discussions going around for the bulk job ecosystem.
>>
>>
>>
>> Hence as a work around, we used a faster scan utility.
>>
>> For all the adhoc purposes/scripts, you could do a full scan.
>>
>>
>>
>> I hope it helps.
>>
>>
>>
>> Regards
>>
>>
>>
>>
>>
>> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>>
>> A follow up question is: did you use Spark for the full set of data or
>> just partial? In our case, I feel we need all the data to support ad hoc
>> queries (with multiple conditional filters).
>>
>>
>>
>> Thanks,
>>
>> John
>>
>>
>>
>> *From:* Yu, John [mailto:john...@sandc.com]
>> *Sent:* Monday, January 30, 2017 12:04 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Thanks for the input! Are you using the DataStax connector as well? Does
>> it support querying against any column well (not just clustering columns)?
>> I’m wondering how it could build the index around them “on-the-fly”.
>>
>>
>>
>> Regards,
>>
>> John
>>
>>
>>
>> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
>> ]
>> *Sent:* Friday, January 27, 2017 12:15 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Hi
>>
>> We used lucene stratio plugin with C*3.0.3
>>
>>
>>
>> Helped to solve a lot of some read patterns. Served well for prefix.
>>
>> But created problems as repairs failed repeatedly.
>>
>> We might have used it sub optimally, not sure.
>>
>>
>>
>> Later, we had to do away with it, and tried to serve most of the read
>> patterns with materialised views. (currently C*3.0.9)
>>
>>
>>
>> Currently, for adhoc querries, we use spark or full scan.
>>
>>
>>
>> Regards,
>>
>>
>>
>> On Fri, Jan 27, 2017 at 1:03 PM, Yu, John  wrote:
>>
>> Thanks a lot. Mind sharing a couple of points where you feel it’s better
>> than the alternatives.
>>
>>
>>
>> Regards,
>>
>> John
>>
>>
>>
>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
>> *Sent:* Thursday, January 26, 2017 2:33 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* [External] Re: Cassandra ad hoc search options
>>

Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread vincent gromakowski
The problem with ad hoc queries on Cassandra (with Spark or not) is that
Cassandra's partition model needs to be respected to avoid full scan
queries (the link you mentioned explains all of them). With FiloDB, which
works on Cassandra, you can push down predicates on the partition key and
segment key in an arbitrary order, resulting in fewer full scan
queries. Another advantage is the computed columns, which can also prune
partitions or segments and so reduce the reads based on a subpart of the key
(like a time range of 2 hours or 10 min).
Anyway, it's not magic, and my personal analysis doesn't target FiloDB as a
fully ad hoc query solution, but it's considerably better than pure Cassandra.
You can easily have pushdown predicates on any combination of 1 to 3-5 columns,
depending on the dataset, compared to pure Cassandra, where you need to
provide a first key value to push down the second key predicate, then the
third key...
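
The "first key, then second key, then third key" constraint of pure Cassandra
can be sketched as a small check in Python. This is a simplified model that
only considers equality predicates (no ranges, IN, secondary indexes or
ALLOW FILTERING), and the table layout and column names are hypothetical.

```python
def can_push_down(partition_key, clustering_key, restricted):
    """Return True if equality predicates on `restricted` columns can be
    served from the primary key alone, without scanning or filtering."""
    restricted = set(restricted)

    # Every partition key column must be restricted to locate the partition.
    if not set(partition_key) <= restricted:
        return False

    # Clustering columns must be restricted in order, with no gaps.
    gap_seen = False
    for column in clustering_key:
        if column in restricted:
            if gap_seen:
                return False
        else:
            gap_seen = True

    # Predicates on columns outside the primary key cannot be pushed down.
    key_columns = set(partition_key) | set(clustering_key)
    return restricted <= key_columns


# Hypothetical table: PRIMARY KEY ((sensor_id), day, hour)
PARTITION_KEY = ("sensor_id",)
CLUSTERING_KEY = ("day", "hour")

ok_prefix = can_push_down(PARTITION_KEY, CLUSTERING_KEY, {"sensor_id", "day"})
skips_day = can_push_down(PARTITION_KEY, CLUSTERING_KEY, {"sensor_id", "hour"})
no_partition = can_push_down(PARTITION_KEY, CLUSTERING_KEY, {"day"})
```

In this model, restricting sensor_id and day is fine, while skipping day or
omitting sensor_id is not, which is the ordering limitation the message above
contrasts with FiloDB.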

2017-01-31 8:56 GMT+01:00 Yu, John :

> Thanks. I thought you have given up Lucene for Spark, but it seems your
> Lucene still works.
>
>
>
> Spark also has a Cassandra connector, and my questions were more towards
> that.
>
> From
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
> it seems there’re limitations on how much one can select the data to
> support ad hoc queries. It seems mostly limited to clustering columns.
> Maybe in other cases, it would result in full scan, but that’s going to be
> very slow.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
> *Sent:* Monday, January 30, 2017 10:20 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi,
>
> *Are you using the DataStax connector as well? *
>
> Yes, we used it to query on lucene index.
>
>
>
> *Does it support querying against any column well (not just clustering
> columns)?*
>
> Yes it does. We used lucene particularly for this purpose.
>
> ( You can use :
>
> 1.
> https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>
> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>
> for more details)
>
>
>
> *I’m wondering how it could build the index around them “on-the-fly”*
>
> You can build indexes at run time, but it takes time(took a lot of time on
> our cluster. Plus, CPU utilization went through the roof)
>
>
>
> *did you use Spark for the full set of data or just partial*
>
> We weren't allowed to install spark ( tech decision)
>
> Some tech discussions going around for the bulk job ecosystem.
>
>
>
> Hence as a work around, we used a faster scan utility.
>
> For all the adhoc purposes/scripts, you could do a full scan.
>
>
>
> I hope it helps.
>
>
>
> Regards
>
>
>
>
>
> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>
> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
> ]
> *Sent:* Friday, January 27, 2017 12:15 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi
>
> We used lucene stratio plugin with C*3.0.3
>
>
>
> Helped to solve a lot of some read patterns. Served well for prefix.
>
> But created problems as repairs failed repeatedly.
>
> We might have used it sub optimally, not sure.
>
>
>
> Later, we had to do away with it, and tried to serve most of the read
> patterns with materialised views. (currently C*3.0.9)
>
>
>
> Currently, for adhoc querries, we use spark or full scan.
>
>
>
> Regards,
>
>
>
> On Fri, Jan 27, 2017 at 1:03 PM, Yu, John  wrote:
>
> Thanks a lot. Mind sharing a couple of points where you feel it’s better
> than the alternatives.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Thursday, January 26, 2017 2:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* [External] Re: Cassandra ad hoc search options
>
>
>
> > With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS?
>
>
>
> Your best options are Spark w/ the DataStax connector or Presto.
> Cassandra isn't built for ad-hoc queries so you need to use other tools to
> make it work.
>
>
>
> On Thu, Jan 26, 2017 at 2:22 PM Yu, John  wrote:
>
> Hi All,
>
>
>
> Hope