What does your query actually look like today? Is your non-EQ on timestamp selecting a single row a few rows or many rows (dozens, hundreds, thousands)?
-- Jack Krupansky On Sun, Feb 14, 2016 at 7:40 PM, Gianluca Borello <gianl...@sysdig.com> wrote: > Thanks again. > > One clarification about "reading in a single SELECT": in my point 2, I > mentioned the need to read a variable subset of columns every time, usually > in the range of ~5 out of 30. I can't find a way to do that in a single > SELECT unless I use the IN operator (which I can't, as explained). > > Is there any other method you were thinking of, or your "reading in a > single SELECT" is just applicable when I need to read the whole set of > columns (which is never my case, unfortunately)? > > Thanks > > > On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <jack.krupan...@gmail.com> > wrote: > >> You can definitely read all of columns in a single SELECT. And the >> n-INSERTS can be batched and will insert fewer cells in the storage engine >> than the previous approach. >> >> -- Jack Krupansky >> >> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com> >> wrote: >> >>> Thank you for your reply. >>> >>> Your advice is definitely sound, although it still seems suboptimal to >>> me because: >>> >>> 1) It requires N INSERT queries from the application code (where N is >>> the number of columns) >>> >>> 2) It requires N SELECT queries from my application code (where N is the >>> number of columns I need to read at any given time, which is determined at >>> runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1, >>> 2, 3, ...)) because I am already using a non-EQ relation on the timestamp >>> key and Cassandra restricts me to only one non-EQ relation. >>> >>> In summary, I can (and will) adapt my code to use a similar approach >>> despite everything, but the goal of my message was mainly to understand why >>> the jira issues I linked above are not full of dozens of "+1" comments. >>> >>> To me this really feels like a terrible performance issue that should be >>> fixed by default (or in the very worst case clearly documented), even after >>> understanding the motivation for reading all the columns in the CQL row. >>> >>> Thanks >>> >>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky < >>> jack.krupan...@gmail.com> wrote: >>> >>>> You could add the column number as an additional clustering key. And >>>> then you can actually use COMPACT STORAGE for even more efficient storage >>>> and access (assuming there is only a single non-PK data column, the blob >>>> value.) You can then access (read or write) an individual column/blob or a >>>> slice of them. >>>> >>>> -- Jack Krupansky >>>> >>>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com> >>>> wrote: >>>> >>>>> Hi >>>>> >>>>> I've just painfully discovered a "little" detail in Cassandra: >>>>> Cassandra touches all columns on a CQL select (related issues >>>>> https://issues.apache.org/jira/browse/CASSANDRA-6586, >>>>> https://issues.apache.org/jira/browse/CASSANDRA-6588, >>>>> https://issues.apache.org/jira/browse/CASSANDRA-7085). >>>>> >>>>> My data model is fairly simple: I have a bunch of "sensors" reporting >>>>> a blob of data (~10-100KB) periodically. When reading, 99% of the times >>>>> I'm >>>>> interested in a subportion of that blob of data across an arbitrary period >>>>> of time. What I do is simply splitting those blobs of data in about 30 >>>>> logical units and write them in a CQL table such as: >>>>> >>>>> create table data ( >>>>> id bigint, >>>>> ts bigint, >>>>> column1 blob, >>>>> column2 blob, >>>>> column3 blob, >>>>> ... >>>>> column29 blob, >>>>> column30 blob >>>>> primary key (id, ts) >>>>> >>>>> id is a combination of the sensor id and a time bucket, in order to >>>>> not get the row too wide. Essentially, I thought this was a very legit >>>>> data >>>>> model that helps me keep my application code very simple (because I can >>>>> work on a single table, I can write a split sensor blob in a single CQL >>>>> query and I can read a subset of the columns very efficiently with one >>>>> single CQL query). >>>>> >>>>> What I didn't realize is that Cassandra seems to always process all >>>>> the columns of the CQL row, regardless of the fact that my query asks just >>>>> one column, and this has dramatic effect on the performance of my reads. >>>>> >>>>> I wrote a simple isolated test case where I test how long it takes to >>>>> read one *single* column in a CQL table composed of several columns (at >>>>> each iteration I add and populate 10 new columns), each filled with 1MB >>>>> blobs: >>>>> >>>>> 10 columns: 209 ms >>>>> 20 columns: 339 ms >>>>> 30 columns: 510 ms >>>>> 40 columns: 670 ms >>>>> 50 columns: 884 ms >>>>> 60 columns: 1056 ms >>>>> 70 columns: 1527 ms >>>>> 80 columns: 1503 ms >>>>> 90 columns: 1600 ms >>>>> 100 columns: 1792 ms >>>>> >>>>> In other words, even if the result set returned is exactly the same >>>>> across all these iteration, the response time increases linearly with the >>>>> size of the other columns, and this is really causing a lot of problems in >>>>> my application. >>>>> >>>>> By reading the JIRA issues, it seems like this is considered a very >>>>> minor optimization not worth the effort of fixing, so I'm asking: is my >>>>> use >>>>> case really so anomalous that the horrible performance that I'm >>>>> experiencing are to be considered "expected" and need to be fixed with >>>>> some >>>>> painful column family splitting and messy application code? >>>>> >>>>> Thanks >>>>> >>>> >>>> >>> >> >