Re: SELECT some_column vs SELECT *

2015-11-24 Thread Jon Haddad
If it's sparsely populated you'll get the same benefit from the schema 
definition.  You don't pay for fields you don't use.

> On Nov 24, 2015, at 12:17 PM, Jack Krupansky  wrote:
> 
> Are all or ost of the 1000+ columns populated for a given row? If they are 
> sparse you can replace them with a single map collection column which would 
> only occupy the entries that are populated.
> 
> -- Jack Krupansky
> 
> On Tue, Nov 24, 2015 at 11:04 AM, Jack Krupansky  > wrote:
> As always, your queries should drive your data model. Unless you really need 
> 1000+ columns for most queries, you should consider separate tables for the 
> subsets of the columns that need to be returned for a given query.
> 
> The new 3.0 Materialized View feature can be used to easily create subsets of 
> a base table, although that begs the question of whether you ever really need 
> all 1000+ columns in the same table.
> 
> -- Jack Krupansky
> 
> On Tue, Nov 24, 2015 at 10:45 AM, Kai Wang  > wrote:
> Hi all,
> 
> If I have the following table:
> CREATE TABLE t (
>   pk int,
>   ck int,
>   c1 int,
>   c2 int,
>   ...
>   PRIMARY KEY (pk, ck)
> )
> 
> There are lots of non-clustering columns (1000+). From time to time I need to 
> do a query like this:
> 
> SELECT c1 FROM t WHERE pk = abc AND ck > xyz;
> 
> How efficient is this query compared to SELECT * ...? Apparently SELECT c1 
> would save a lot of network bandwidth since only c1 needs to be transferred 
> on the wire. But I am more interested in the impact on disk IO. If I 
> understand C* storage engine correctly, one CQL row is clustered together on 
> disk. That means c1 from different rows are stored apart. In the case of 
> SELECT c1, does C* do multiple seeks to only lift c1 of each row from disk or 
> lift the whole row into memory and return c1 from there?
> 
> From comments on https://issues.apache.org/jira/browse/CASSANDRA-5762 
>  it seems C* lifts the 
> whole row as of 1.2.7. Is this still the case on 2.1.*?
> 
> Thanks.
> 
> 



Re: SELECT some_column vs SELECT *

2015-11-24 Thread Jack Krupansky
Are all or ost of the 1000+ columns populated for a given row? If they are
sparse you can replace them with a single map collection column which would
only occupy the entries that are populated.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 11:04 AM, Jack Krupansky 
wrote:

> As always, your queries should drive your data model. Unless you really
> need 1000+ columns for most queries, you should consider separate tables
> for the subsets of the columns that need to be returned for a given query.
>
> The new 3.0 Materialized View feature can be used to easily create subsets
> of a base table, although that begs the question of whether you ever really
> need all 1000+ columns in the same table.
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 10:45 AM, Kai Wang  wrote:
>
>> Hi all,
>>
>> If I have the following table:
>> CREATE TABLE t (
>>   pk int,
>>   ck int,
>>   c1 int,
>>   c2 int,
>>   ...
>>   PRIMARY KEY (pk, ck)
>> )
>>
>> There are lots of non-clustering columns (1000+). From time to time I
>> need to do a query like this:
>>
>> SELECT c1 FROM t WHERE pk = abc AND ck > xyz;
>>
>> How efficient is this query compared to SELECT * ...? Apparently SELECT
>> c1 would save a lot of network bandwidth since only c1 needs to be
>> transferred on the wire. But I am more interested in the impact on disk IO.
>> If I understand C* storage engine correctly, one CQL row is clustered
>> together on disk. That means c1 from different rows are stored apart. In
>> the case of SELECT c1, does C* do multiple seeks to only lift c1 of each
>> row from disk or lift the whole row into memory and return c1 from there?
>>
>> From comments on https://issues.apache.org/jira/browse/CASSANDRA-5762 it
>> seems C* lifts the whole row as of 1.2.7. Is this still the case on 2.1.*?
>>
>> Thanks.
>>
>
>


SELECT some_column vs SELECT *

2015-11-24 Thread Kai Wang
Hi all,

If I have the following table:
CREATE TABLE t (
  pk int,
  ck int,
  c1 int,
  c2 int,
  ...
  PRIMARY KEY (pk, ck)
)

There are lots of non-clustering columns (1000+). From time to time I need
to do a query like this:

SELECT c1 FROM t WHERE pk = abc AND ck > xyz;

How efficient is this query compared to SELECT * ...? Apparently SELECT c1
would save a lot of network bandwidth since only c1 needs to be transferred
on the wire. But I am more interested in the impact on disk IO. If I
understand C* storage engine correctly, one CQL row is clustered together
on disk. That means c1 from different rows are stored apart. In the case of
SELECT c1, does C* do multiple seeks to only lift c1 of each row from disk
or lift the whole row into memory and return c1 from there?

>From comments on https://issues.apache.org/jira/browse/CASSANDRA-5762 it
seems C* lifts the whole row as of 1.2.7. Is this still the case on 2.1.*?

Thanks.


Re: SELECT some_column vs SELECT *

2015-11-24 Thread Jack Krupansky
As always, your queries should drive your data model. Unless you really
need 1000+ columns for most queries, you should consider separate tables
for the subsets of the columns that need to be returned for a given query.

The new 3.0 Materialized View feature can be used to easily create subsets
of a base table, although that begs the question of whether you ever really
need all 1000+ columns in the same table.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 10:45 AM, Kai Wang  wrote:

> Hi all,
>
> If I have the following table:
> CREATE TABLE t (
>   pk int,
>   ck int,
>   c1 int,
>   c2 int,
>   ...
>   PRIMARY KEY (pk, ck)
> )
>
> There are lots of non-clustering columns (1000+). From time to time I need
> to do a query like this:
>
> SELECT c1 FROM t WHERE pk = abc AND ck > xyz;
>
> How efficient is this query compared to SELECT * ...? Apparently SELECT c1
> would save a lot of network bandwidth since only c1 needs to be transferred
> on the wire. But I am more interested in the impact on disk IO. If I
> understand C* storage engine correctly, one CQL row is clustered together
> on disk. That means c1 from different rows are stored apart. In the case of
> SELECT c1, does C* do multiple seeks to only lift c1 of each row from disk
> or lift the whole row into memory and return c1 from there?
>
> From comments on https://issues.apache.org/jira/browse/CASSANDRA-5762 it
> seems C* lifts the whole row as of 1.2.7. Is this still the case on 2.1.*?
>
> Thanks.
>