Re: Performance issues with "many" CQL columns

Jack Krupansky Sun, 14 Feb 2016 17:22:19 -0800

What does your query actually look like today?

Is your non-EQ on timestamp selecting a single row a few rows or many rows
(dozens, hundreds, thousands)?



-- Jack Krupansky

On Sun, Feb 14, 2016 at 7:40 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

> Thanks again.
>
> One clarification about "reading in a single SELECT": in my point 2, I
> mentioned the need to read a variable subset of columns every time, usually
> in the range of ~5 out of 30. I can't find a way to do that in a single
> SELECT unless I use the IN operator (which I can't, as explained).
>
> Is there any other method you were thinking of, or your "reading in a
> single SELECT" is just applicable when I need to read the whole set of
> columns (which is never my case, unfortunately)?
>
> Thanks
>
>
> On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> You can definitely read all of columns in a single SELECT. And the
>> n-INSERTS can be batched and will insert fewer cells in the storage engine
>> than the previous approach.
>>
>> -- Jack Krupansky
>>
>> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com>
>> wrote:
>>
>>> Thank you for your reply.
>>>
>>> Your advice is definitely sound, although it still seems suboptimal to
>>> me because:
>>>
>>> 1) It requires N INSERT queries from the application code (where N is
>>> the number of columns)
>>>
>>> 2) It requires N SELECT queries from my application code (where N is the
>>> number of columns I need to read at any given time, which is determined at
>>> runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
>>> 2, 3, ...)) because I am already using a non-EQ relation on the timestamp
>>> key and Cassandra restricts me to only one non-EQ relation.
>>>
>>> In summary, I can (and will) adapt my code to use a similar approach
>>> despite everything, but the goal of my message was mainly to understand why
>>> the jira issues I linked above are not full of dozens of "+1" comments.
>>>
>>> To me this really feels like a terrible performance issue that should be
>>> fixed by default (or in the very worst case clearly documented), even after
>>> understanding the motivation for reading all the columns in the CQL row.
>>>
>>> Thanks
>>>
>>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>>
>>>> You could add the column number as an additional clustering key. And
>>>> then you can actually use COMPACT STORAGE for even more efficient storage
>>>> and access (assuming there is only  a single non-PK data column, the blob
>>>> value.) You can then access (read or write) an individual column/blob or a
>>>> slice of them.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I've just painfully discovered a "little" detail in Cassandra:
>>>>> Cassandra touches all columns on a CQL select (related issues
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>>>>
>>>>> My data model is fairly simple: I have a bunch of "sensors" reporting
>>>>> a blob of data (~10-100KB) periodically. When reading, 99% of the times 
>>>>> I'm
>>>>> interested in a subportion of that blob of data across an arbitrary period
>>>>> of time. What I do is simply splitting those blobs of data in about 30
>>>>> logical units and write them in a CQL table such as:
>>>>>
>>>>> create table data (
>>>>> id bigint,
>>>>> ts bigint,
>>>>> column1 blob,
>>>>> column2 blob,
>>>>> column3 blob,
>>>>> ...
>>>>> column29 blob,
>>>>> column30 blob
>>>>> primary key (id, ts)
>>>>>
>>>>> id is a combination of the sensor id and a time bucket, in order to
>>>>> not get the row too wide. Essentially, I thought this was a very legit 
>>>>> data
>>>>> model that helps me keep my application code very simple (because I can
>>>>> work on a single table, I can write a split sensor blob in a single CQL
>>>>> query and I can read a subset of the columns very efficiently with one
>>>>> single CQL query).
>>>>>
>>>>> What I didn't realize is that Cassandra seems to always process all
>>>>> the columns of the CQL row, regardless of the fact that my query asks just
>>>>> one column, and this has dramatic effect on the performance of my reads.
>>>>>
>>>>> I wrote a simple isolated test case where I test how long it takes to
>>>>> read one *single* column in a CQL table composed of several columns (at
>>>>> each iteration I add and populate 10 new columns), each filled with 1MB
>>>>> blobs:
>>>>>
>>>>> 10 columns: 209 ms
>>>>> 20 columns: 339 ms
>>>>> 30 columns: 510 ms
>>>>> 40 columns: 670 ms
>>>>> 50 columns: 884 ms
>>>>> 60 columns: 1056 ms
>>>>> 70 columns: 1527 ms
>>>>> 80 columns: 1503 ms
>>>>> 90 columns: 1600 ms
>>>>> 100 columns: 1792 ms
>>>>>
>>>>> In other words, even if the result set returned is exactly the same
>>>>> across all these iteration, the response time increases linearly with the
>>>>> size of the other columns, and this is really causing a lot of problems in
>>>>> my application.
>>>>>
>>>>> By reading the JIRA issues, it seems like this is considered a very
>>>>> minor optimization not worth the effort of fixing, so I'm asking: is my 
>>>>> use
>>>>> case really so anomalous that the horrible performance that I'm
>>>>> experiencing are to be considered "expected" and need to be fixed with 
>>>>> some
>>>>> painful column family splitting and messy application code?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Performance issues with "many" CQL columns

Reply via email to