Re: Efficient Paging Option in Wide Rows

2016-04-24 Thread Clint Martin
I tend to agree with Carlos. Having multiple row keys and parallelizing
your queries will tend to result in faster responses. Keeping partitions
relatively small will also help your cluster manage your data more
efficiently, again resulting in better performance.

One thing I would recommend is to denormalise your tables. Rather than
having an index table, just store a copy of your data. That way, instead
of reading a bunch of index entries and then having to read each record
from the main table, you can read all the data you are after at once.
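
For example, just a sketch with made-up table and column names: instead of
an index row per lookup value pointing at main-table keys, store the data
keyed the same way:

    create table data_by_lookup (
        lookup_value text,
        bucket int,
        rowkey bigint,
        payload text,  -- the same columns you keep in the main table
        primary key ((lookup_value, bucket), rowkey)
    );

One query per bucket then returns the actual records directly, with no
second round of reads against the main table.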

This trades disk space for performance, so you will need to weigh the
benefit in speed against the cost of the additional storage.
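
For example (made-up sizes): if an index value maps to ~50,000 row keys and
each denormalised record is, say, ~1KB, that is roughly 50MB of extra data
per index value, multiplied by your replication factor, in exchange for
saving ~50,000 point reads against the main table.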

Clint


Re: Efficient Paging Option in Wide Rows

2016-04-24 Thread Carlos Alonso
Hi Anuj,

That's a very good question and I'd like to hear from anyone who can give
a detailed answer, but in the meantime I'll try to give my two cents.

First of all, I think I'd rather split the values across different
partition keys, for two reasons:
1.- If you're sure you're accessing all the data at the same time, you'll
be able to parallelize the queries by hitting more nodes of your cluster,
rather than creating a hotspot on the owner(s) of the data.
2.- It is recommended good practice to keep partitions small. Check whether
your partitions fit within the guidelines by applying the formula from this
video:
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
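
(From memory, the sizing formula in that course is roughly

    Nv = Nr x (Nc - Npk - Ns) + Ns

where Nv is the number of cells in the partition, Nr the number of rows,
Nc the number of columns, Npk the number of primary key columns and Ns the
number of static columns, with a rule of thumb of staying under ~100,000
cells and ~100MB per partition. Please verify against the video.)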

Cheers!

Carlos Alonso | Software Engineer | @calonso 

On 23 April 2016 at 20:25, Anuj Wadehra  wrote:

> Hi,
>
> Can anyone take this question?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
>
> On Sat, 23 Apr, 2016 at 2:30 PM, Anuj Wadehra
>  wrote:
> I think I complicated the question, so let me put it crisply:
>
> We have a table defined with a clustering key/column. We have 50,000
> different clustering key values.
>
> If we want to fetch all 50,000 rows, which query option would be faster,
> and why?
>
> 1. Given a single primary key/partition key with 50,000 clustering keys,
> we will page through the single partition 500 records at a time. Thus, we
> will do 50,000/500 = 100 db hits, but all against the same partition key.
>
> 2. Given 100 different primary keys, with each primary key having just
> 500 clustering key columns. Here also we will need 100 db hits, but
> against different partitions.
>
>
> Basically, I want to understand whether any optimizations built into
> CQL/Cassandra make paging through a single partition more efficient than
> querying data from different partitions.
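>
> To make the two options concrete, here is roughly what I mean (table and
> column names made up for illustration):
>
>     -- Option 1: one wide partition, paged by the driver 500 rows at a time
>     select col from index_table where key = 'v1';  -- 100 pages of 500
>
>     -- Option 2: 100 narrow partitions, one query each, possibly in parallel
>     select col from index_table where key = 'v1:bucket1';
>     -- ... through ...
>     select col from index_table where key = 'v1:bucket100';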
>
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
>
> On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra
>  wrote:
> Hi,
>
> I have a wide row index table so that I can fetch all row keys
> corresponding to a column value.
>
> A row of index_table will look like:
>
> ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
> ..
> ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
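>
> In CQL terms that is something like the following sketch (actual names
> and types will differ):
>
>     create table index_table (
>         key text,     -- index column value + ':' + bucket number
>         col bigint,   -- a row key of the main table
>         primary key (key, col)
>     );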
>
> We will have buckets to avoid hotspots. Row keys of the main table are
> random numbers, and we will never do a column slice like:
>
> select * from index_table where key = xxx and
> col > rowkey1 and col < rowkey10
>
> Also, we will ALWAYS fetch all data for a given value of the index
> column. Thus all buckets have to be read.
>
> Each index column value can map to thousands to millions of row keys in
> the main table.
>
> Based on our use case, there are two design choices in front of me:
>
> 1. Have a large number of buckets/rows per index column value, with less
> data (around a few thousand row keys) in each row.
>
> Thus, every time we want to fetch all row keys for an index column value,
> we will query more rows, and for each row we will have to page through the
> data 500 records at a time.
>
> 2. Have fewer buckets/rows for an index column value.
>
> Every time we want to fetch all row keys for an index column value, we
> will query a smaller number of wider rows and then page through each wide
> row, reading 500 columns at a time.
>
>
> Which approach is more efficient?
>
> Approach 1: more rows, with less data in each row,
>
> OR
>
> Approach 2: fewer rows, with more data in each row?
>
>
> Either way, we are fetching only 500 records at a time in a query. Even
> in approach 2 (wider rows), each query reads only 500 records.
>
>
> Thanks
> Anuj
>


Re: Publishing from cassandra

2016-04-24 Thread Laing, Michael
You could take a look at, or follow:
https://issues.apache.org/jira/browse/CASSANDRA-8844
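
That ticket is the proposed change data capture (CDC) feature. If/when it
lands, I'd expect enabling it to be a per-table option, something like
(hypothetical until it actually ships):

    alter table mykeyspace.mytable with cdc = true;

and you would then consume the CDC log and push the changes on to kdb.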



Publishing from cassandra

2016-04-24 Thread Alexander Orr
Hi,

I'm wondering if someone could help me. I'd like to use cassandra to store
data and publish it downstream to another database (kdb, if anyone is
interested). Essentially I'd like to be able to run a function or operation
on cassandra from an upstream process that would insert into a table and
publish the data downstream.

I can't see anything in the docs, but I'm relatively new to cassandra.
Assuming there's not something simple already in place, what would be the
best way to implement this kind of mechanism? I have some java that will
allow me to talk to the db I want to, but I'm not sure of the best way to
integrate it with cassandra.

UDFs seem to have potential, but I don't think it's possible to use
external libraries/classes within UDFs. All I can think of at the minute is
either having a process that controls cassandra, publishing to it and to
the downstream system directly, or cloning the git repo and seeing if I can
hack in some extra functionality.

Any suggestions welcome.

Thanks

Alex


Re: Changing snitch from PropertyFile to Gossip

2016-04-24 Thread Carlos Rolo
As long as the topology doesn't change, yes. Repair once you finish.
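
Roughly, per node (a sketch; the dc/rack values below are placeholders and
must match your existing topology from cassandra-topology.properties):

    # cassandra.yaml:
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties:
    dc=DC1
    rack=RAC1

Then restart that node and move on to the next.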
On 24/04/2016 13:23, "AJ"  wrote:

> Is it possible to do this without downtime, i.e. run in mixed mode while
> doing a rolling upgrade?


Changing snitch from PropertyFile to Gossip

2016-04-24 Thread AJ
Is it possible to do this without downtime, i.e. run in mixed mode while
doing a rolling upgrade?