Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Greg Traub
Cassandra users,

I have a 4-node Cassandra cluster set up.  All nodes are in a single rack
and data center.  I have a loader program which loads 40 million
rows into a table in a keyspace with a replication factor of 3.
Immediately after inserting the rows (after the loader program finishes),
if I SELECT count(*) from the table, the result is less than 40 million.
If I run our dumper program to retrieve all rows, it is less than 40
million.  However, if I wait roughly 20 minutes, the count eventually
reaches 40 million rows and the dumper program returns all 40 million.

If I do the same thing in a keyspace where the replication factor is 1, I
don't have any "stabilization" time and the 40 million rows are immediately
available.

I've modified the loading and dumping programs to use both the Thrift Java
driver and the CQL Java driver and neither seems to make a difference.

I'm very new to Cassandra, and my questions are: what may be causing this
delay in all rows being available, and how might I lessen or eliminate this
delay?

Thanks,
Greg


Re: Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Greg Traub
Vidur,

Forgive me if I'm getting this wrong as I'm exceptionally new to Cassandra.

If by consistency you mean the USING CONSISTENCY clause, then I'm not
specifying it, which, per the CQL documentation, means a default of ONE.


Re: Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Vidur Malik
Ah, I thought you may have been using a higher consistency, which would
explain your error since the data may not have been replicated across all 3
nodes when you made the query.
Anyway, it seems to be happening because of replication. What version of
Cassandra are you using? There may be an issue filed in their JIRA.


-- 

Vidur Malik


Re: Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Vidur Malik
What is your query consistency?


-- 

Vidur Malik


Re: Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Bryan Cheng
Your experience, then, is expected (although a 20-minute delay seems
excessive and is a sign you may be overloading your cluster, which is
plausible with an unthrottled bulk load like that).

When you insert with consistency ONE on RF > 1, your query returns as soon
as one replica confirms the write. The coordinator still sends the write to
the other replicas responsible for that row, but it does not wait for their
responses. If your nodes are overloaded, they may not accept the write at
all; such failures may result in hinted handoff being used, or in the write
simply being dropped.

At the end of your load, you likely have nodes missing writes. Look for
dropped MUTATION messages in your nodetool tpstats. For operations that
cannot tolerate this, you need to write and read with a higher consistency
level.
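
As a rough illustration of why a higher consistency level helps: a read is
only guaranteed to see an acknowledged write when the number of replicas that
acknowledged the write plus the number of replicas consulted by the read
exceeds the replication factor. The sketch below is plain Python arithmetic,
not driver code, and the function names are made up for the example.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a QUORUM for the given replication factor."""
    return replication_factor // 2 + 1


def overlap_guaranteed(write_replicas: int, read_replicas: int,
                       replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return write_replicas + read_replicas > replication_factor


RF = 3  # replication factor from the original question
combinations = [
    ("write ONE, read ONE", 1, 1),
    ("write QUORUM, read QUORUM", quorum(RF), quorum(RF)),
    ("write ALL, read ONE", RF, 1),
]
for label, w, r in combinations:
    print(f"{label}: overlap guaranteed = {overlap_guaranteed(w, r, RF)}")
```

With RF = 3, ONE/ONE gives 1 + 1 = 2, which does not exceed 3, so a read can
land on a replica that has not received the write yet. QUORUM/QUORUM gives
2 + 2 = 4, so the read set and the write set always share at least one replica.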

Consistency is achieved over time via hinted handoff, read repair, and
other mechanisms (assuming you're not running a repair in between). Your
cluster will gradually return to consistency, *provided your nodes do not
suffer any downtime or exceed the hint window in terms of unavailability*.


