Re: Bootstrapping data from Cassandra 2.2.5 datacenter to 3.0.8 datacenter fails because of streaming errors

2016-10-10 Thread Utkarsh Sengar
As Jonathan said, you need to upgrade Cassandra in place and then run
"nodetool upgradesstables".
DataStax has an excellent resource on upgrading Cassandra:
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgdCassandra.html,
specifically
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html

Before you start, take a snapshot with "nodetool snapshot" so that you have
something to restore from. We upgraded from 1.x to 2.x, the upgrade went
south, and we had to restore from a snapshot.
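
To make the order concrete, here is a rough sketch of the per-node steps as
I understand them. It only builds the list of commands and does not run
anything. The host and the snapshot tag are made-up values, and the
stop/install/start steps are placeholders because they depend on how
Cassandra is deployed (in your case, Mesos).

```python
from typing import List


def upgrade_steps(node: str, snapshot_tag: str) -> List[str]:
    """Return the ordered steps for upgrading one node in place.

    `node` and `snapshot_tag` are illustrative values supplied by the caller.
    Entries in angle brackets are placeholders, not real commands.
    """
    return [
        # Take a snapshot first so there is something to restore from.
        f"nodetool -h {node} snapshot -t {snapshot_tag}",
        # Flush memtables and stop accepting writes before shutting down.
        f"nodetool -h {node} drain",
        "<stop Cassandra on this node>",
        "<install the new Cassandra version and merge the config>",
        "<start Cassandra on this node>",
        # Rewrite the SSTables in the new on-disk format.
        f"nodetool -h {node} upgradesstables",
    ]


if __name__ == "__main__":
    for step in upgrade_steps("10.0.0.1", "pre-upgrade"):
        print(step)
```

Repeat this one node at a time across the datacenter instead of tearing the
datacenter down.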

Thanks,
-Utkarsh


On Mon, Oct 10, 2016 at 4:46 PM, Jonathan Haddad  wrote:

> You can't stream between major versions. Don't tear down your first data
> center, upgrade it instead.
> On Mon, Oct 10, 2016 at 4:35 PM Abhishek Verma  wrote:
>
>> Hi Cassandra users,
>>
>> We are trying to upgrade our Cassandra version from 2.2.5 to 3.0.8
>> (running on Mesos, but that's beside the point). We have two datacenters,
>> so in order to preserve our data, we are trying to upgrade one datacenter
>> at a time.
>>
>> Initially both DCs (dc1 and dc2) are running 2.2.5. The idea is to tear
>> down dc1 completely (delete all the data in it), bring it up with 3.0.8,
>> let data replicate from dc2 to dc1, and then tear down dc2, bring it up
>> with 3.0.8 and replicate data from dc1.
>>
>> I am able to reproduce the problem on bare metal clusters running on 3
>> nodes. I am using Oracle's server-jre-8u74-linux-x64 JRE.
>>
>> *Node A*: Downloaded 2.2.5-bin.tar.gz, changed the seeds to include its
>> own IP address, changed listen_address and rpc_address to its own IP and
>> changed endpoint_snitch to GossipingPropertyFileSnitch. I
>> changed conf/cassandra-rackdc.properties to
>> dc=dc2
>> rack=rack2
>> This node started up fine and is UN in nodetool status in dc2.
>>
>> I used CQL shell to create a table and insert 3 rows:
>> verma@x:~/apache-cassandra-2.2.5$ bin/cqlsh $HOSTNAME
>> Connected to Test Cluster at x:9042.
>> [cqlsh 5.0.1 | Cassandra 2.2.5 | CQL spec 3.3.1 | Native protocol v4]
>> Use HELP for help.
>> cqlsh> desc tmp
>>
>> CREATE KEYSPACE tmp WITH replication = {'class':
>> 'NetworkTopologyStrategy', 'dc1': '1', 'dc2': '1'}  AND durable_writes =
>> true;
>>
>> CREATE TABLE tmp.map (
>> key text PRIMARY KEY,
>> value text
>> )...;
>> cqlsh> select * from tmp.map;
>>
>>  key | value
>> -----+-------
>>   k1 |    v1
>>   k3 |    v3
>>   k2 |    v2
>>
>>
>> *Node B:* Downloaded 3.0.8-bin.tar.gz, changed the seeds to include
>> itself and node A, changed listen_address and rpc_address to its own IP,
>> changed endpoint_snitch to GossipingPropertyFileSnitch. I did not change
>> conf/cassandra-rackdc.properties and its contents are
>> dc=dc1
>> rack=rack1
>>
>> In the logs, I see:
>> INFO  [main] 2016-10-10 22:42:42,850 MessagingService.java:557 - Starting
>> Messaging Service on /10.164.32.29:7000 (eth0)
>> INFO  [main] 2016-10-10 22:42:42,864 StorageService.java:784 - This node
>> will not auto bootstrap because it is configured to be a seed node.
>>
>> So I start a third node:
>> *Node C:* Downloaded 3.0.8-bin.tar.gz, changed the seeds to include node
>> A and node B, changed listen_address and rpc_address to its own IP, changed
>> endpoint_snitch to GossipingPropertyFileSnitch. I did not change
>> conf/cassandra-rackdc.properties.
>> Now, nodetool status shows:
>>
>> verma@xxx:~/apache-cassandra-3.0.8$ bin/nodetool status
>> Datacenter: dc1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address  Load       Tokens  Owns (effective)  Host ID                               Rack
>> UJ           87.81 KB   256     ?                 9064832d-ed5c-4c42-ad5a-f754b52b670c  rack1
>> UN           107.72 KB  256     100.0%            28b1043f-115b-46a5-b6b6-8609829cde76  rack1
>> Datacenter: dc2
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address  Load       Tokens  Owns (effective)  Host ID                               Rack
>> UN           73.2 KB    256     100.0%            09cc542c-2299-45a5-a4d1-159c239ded37  rack2
>>
>> Nodetool describe cluster shows:
>> verma@xxx:~/apache-cassandra-3.0.8$ bin/nodetool describecluster
>> Cluster Information:
>> Name: Test Cluster
>> Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
>> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>> Schema versions:
>> c2a2bb4f-7d31-3fb8-a216-00b41a643650: [, ]
>>
>> 9770e3c5-3135-32e2-b761-65a0f6d8824e: []
>>
>> Note that there are two schema versions and they don't match.
>>
>> I see the following in the system.log:
>>
>> INFO  [InternalResponseStage:1] 2016-10-10 22:48:36,055
>> ColumnFamilyStore.java:390 - Initializing system_auth.roles
>> INFO  [main] 2016-10-10 22:48:36,316 StorageService.java:1149 - JOINING:
>> waiting for schema information to complete
>> INFO  [main] 2016-10-10 22:48:36,316 StorageService.java:1149 - JOINING:
>> schema complete, ready to bootstrap
>> INFO  [main] 2016-10-10 22:48:36,316 StorageService.java:1149 - JOINING:

Re: Motivation for a DHT ring

2016-06-30 Thread Utkarsh Sengar
Along with fault tolerance and reliability, it also gives a faster lookup
mechanism across the nodes in a cluster.
Amazon's Dynamo paper might be a better read to understand the reasoning
behind a DHT-based system:
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
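
To illustrate the lookup part, here is a toy hash ring. It is only a sketch
of the general consistent-hashing idea, not Cassandra's actual partitioner
code: the class name, the node names, the use of md5, and the number of
virtual nodes are all my own choices for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next token clockwise."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 8) -> None:
        # Each node gets several tokens so that load spreads more evenly.
        ring: List[Tuple[int, str]] = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._ring = ring
        self._tokens = [token for token, _ in ring]

    @staticmethod
    def _token(value: str) -> int:
        # md5 is an illustrative hash choice for this sketch.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def owner(self, key: str) -> str:
        # Binary search for the first token >= hash(key), wrapping around.
        idx = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    for key in ("k1", "k2", "k3"):
        print(key, "->", ring.owner(key))
```

Any node can compute the owner of a key locally from the token list, with no
central directory to ask, and removing a node only moves the keys that node
owned.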

On Wed, Jun 29, 2016 at 11:48 PM, Jens Rantil  wrote:

> Some reasons I can come up with:
> - it would be hard to have tunable read/consistencies/replicas when
> interfacing with a file system.
> - data locality support would require strong coupling to the distributed
> file system interface (if at all possible given that certain sstables
> should live on the same data node).
> - operator complexity both administering a distributed file system as well
> as a Cassandra cluster. This was a personal reason why I chose Cassandra
> instead of HBase for a project.
>
> Cheers,
> Jens
>
> On Wed, 29 June 2016 at 13:01, jean paul wrote:
>
>>
>>
>> 2016-06-28 22:29 GMT+01:00 jean paul :
>>
>>> Hi all,
>>>
>>> Please, What is the motivation for choosing a DHT ring in cassandra? Why
>>> not use a normal parallel or distributed file system that supports
>>> replication?
>>>
>>> Thank you so much for clarification.
>>>
>>> Kind regards.
>>>
>>
>> --
>
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>



-- 
Thanks,
-Utkarsh


Re: Cassandra vs Elasticsearch.

2014-05-03 Thread Utkarsh Sengar
I have also written a prototype ES-Cassandra river:
https://github.com/eBay/cassandra-river
I have never tested it in production, and it might need improvements.

Thanks,
-Utkarsh


On Sat, May 3, 2014 at 1:37 PM, Tim Dunphy  wrote:

> I'd like to try your ElasticSearch / Cassandra driver as well. Could you
> post a link? Is it on GitHub or similar?
>
> Thanks
> Tim
>
> Sent from my iPhone
>
> On May 3, 2014, at 4:06 PM, prabhat  wrote:
>
> Great idea. I can test it.
>
> Prabhat Kumar Singh
>
>
>
> On Sun, May 4, 2014 at 12:32 AM, Elias Ross  wrote:
>
>> I've come up with a driver so that Elasticsearch can store its index
>> data in Cassandra. I'm not sure how well it performs, as I haven't
>> really put it through any big data sets. But you then get the
>> advantage of data durability in Cassandra and the search capability of
>> Elasticsearch.
>>
>> It's very experimental, but I'm going to open source it sometime, I'd
>> like to know if some people could really test it with some big data.
>>
>
>




Re: Cassandra 2.0 new features ?

2013-06-12 Thread Utkarsh Sengar
Jonathan Ellis covered it briefly in this presentation (slide 51):
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20

They are:

   1. Eager retries
   2. Improved compaction
   3. Triggers
   4. CAS (Compare-and-set)
   5. More-efficient repair


Thanks,
-Utkarsh


On Wed, Jun 12, 2013 at 3:45 PM, Emalayan Vairavanathan <
svemala...@yahoo.com> wrote:

> Hi All,
>
> Can anyone tell me about the new features that are going to come in
> Cassandra 2.0 ?
>
> Thank you
> Emalayan
>





Re: How to find total number of rows in Cassandra database?

2013-04-21 Thread Utkarsh Sengar
The difference between cqlsh and the CLI is nicely documented by the
DataStax folks here: http://www.datastax.com/support-forums/topic/cli-vs-cql

Thanks,
-Utkarsh


On Sun, Apr 21, 2013 at 1:39 PM, Techy Teck  wrote:

> Yeah it helps a lot. I always have this doubt with me. What is the
> difference between CLI and CQL?
>
>
>
> On Sun, Apr 21, 2013 at 1:30 PM, Utkarsh Sengar wrote:
>
>> Using cqlsh you can do:
>>
>> SELECT COUNT(*) FROM columnfamily LIMIT 5000;
>>
>> Does that help?
>>
>> Read more: http://www.datastax.com/docs/1.0/references/cql/SELECT
>>
>> Thanks,
>> -Utkarsh
>>
>>
>>
>> On Sun, Apr 21, 2013 at 1:04 PM, Techy Teck wrote:
>>
>>> I have inserted 1000 rows in Cassandra database. Now I am trying to find
>>> out how many rows have been inserted in Cassandra database using the CLI
>>> mode.
>>>
>>>
>>> In rdbms, I can do this sql-
>>>
>>> *   SELECT count(*) from TABLE;*
>>>
>>> And this will give me total count for that table;
>>>
>>> How to do the same thing in Cassandra database?
>>>
>>> I am running Cassandra 1.2.3
>>>
>>
>>
>>
>> --
>> Thanks,
>> -Utkarsh
>>
>
>




Re: How to find total number of rows in Cassandra database?

2013-04-21 Thread Utkarsh Sengar
Using cqlsh you can do:

SELECT COUNT(*) FROM columnfamily LIMIT 5000;

Does that help?

Read more: http://www.datastax.com/docs/1.0/references/cql/SELECT

Thanks,
-Utkarsh



On Sun, Apr 21, 2013 at 1:04 PM, Techy Teck  wrote:

> I have inserted 1000 rows in Cassandra database. Now I am trying to find
> out how many rows have been inserted in Cassandra database using the CLI
> mode.
>
>
> In rdbms, I can do this sql-
>
> *   SELECT count(*) from TABLE;*
>
> And this will give me total count for that table;
>
> How to do the same thing in Cassandra database?
>
> I am running Cassandra 1.2.3
>





Reading data in bulk from Cassandra for indexing in Elasticsearch

2013-03-28 Thread Utkarsh Sengar
Hello,

I am trying to implement an indexer for a column family in Cassandra
(a cluster of 4 nodes) using Elasticsearch. I am writing a river plugin
which retrieves data from Cassandra and pushes it to Elasticsearch. It is
triggered once a day (configurable based on the requirement).

Total keys: ~50M

So for reading the whole column family (random partitioner), I am going
ahead with the approach from the PaginateGetRangeSlices.java example:

*Approach 1:*
1. Get chunks of 10,000 keys and their columns (around 15 columns/key). The
chunk size is configurable, but when I increase it to more than 15,000, I
get a thrift frame size error from Cassandra. To fix it, I will need to
increase the frame size via cassandra.yaml.
2. Then send 15,000 read records to Elasticsearch.
3. It is single-threaded for now. It will be hard to make this
multithreaded because I will need to track the range of keys which has
already been read and share the start key value with every thread. Think
of the PaginateGetRangeSlices.java example, but multithreaded.

I have implemented this approach, but it's not that fast: it takes about
6 hours to complete.
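
For reference, the paging loop in Approach 1 looks roughly like the sketch
below. It is written in Python for brevity and is not the Hector/Thrift
code itself: fetch_page is a hypothetical stand-in for the range slice
call, and make_fake_fetch is an in-memory substitute for Cassandra so the
loop can be run on its own. The one real detail it models is that the start
key of a range slice is inclusive, so every page after the first repeats
the last row of the previous page and that row has to be dropped.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[str, Dict[str, str]]
FetchPage = Callable[[Optional[str], int], List[Row]]


def paginate(fetch_page: FetchPage, page_size: int) -> Iterator[List[Row]]:
    """Yield pages of rows, using the last key of a page as the next start."""
    start_key: Optional[str] = None
    first = True
    while True:
        # Later pages ask for one extra row because the inclusive start key
        # returns the last row of the previous page again.
        rows = fetch_page(start_key, page_size if first else page_size + 1)
        if not first:
            rows = rows[1:]
        if not rows:
            return
        yield rows
        if len(rows) < page_size:
            return  # short page: nothing left to read
        start_key = rows[-1][0]
        first = False


def make_fake_fetch(keys: Iterable[str]) -> FetchPage:
    """In-memory stand-in for the range slice call (hypothetical helper)."""
    ordered = sorted(keys)

    def fetch_page(start_key: Optional[str], count: int) -> List[Row]:
        start = 0 if start_key is None else ordered.index(start_key)
        return [(k, {"col": k.upper()}) for k in ordered[start:start + count]]

    return fetch_page


if __name__ == "__main__":
    fetch = make_fake_fetch(f"key{i:03d}" for i in range(25))
    for page in paginate(fetch, 10):
        print(len(page), page[0][0], "..", page[-1][0])
```

Because each page depends on the last key of the previous one, the loop is
inherently sequential, which is why step 3 is awkward to parallelize.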

*Approach 2:*
1. Get all the keys using the same query as above, but retrieve only the
keys.
2. Divide the keys into x groups, where x is the number of threads I spawn.
Every thread will do an individual GET for each of its keys and insert the
result into Elasticsearch. This will considerably increase the hits to
Cassandra, but sounds more efficient.
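
A rough sketch of Approach 2, again in Python for brevity. fetch_row and
index_row are hypothetical callables standing in for the per-key GET
against Cassandra and the insert into Elasticsearch; they are not real
client APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Columns = Dict[str, str]


def split_keys(keys: Sequence[str], num_workers: int) -> List[Sequence[str]]:
    """Split keys into num_workers contiguous chunks of near-equal size."""
    base, extra = divmod(len(keys), num_workers)
    chunks: List[Sequence[str]] = []
    start = 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(keys[start:start + size])
        start += size
    return chunks


def index_in_parallel(
    keys: Sequence[str],
    num_workers: int,
    fetch_row: Callable[[str], Columns],
    index_row: Callable[[str, Columns], None],
) -> int:
    """Give each thread one chunk; return how many keys were processed."""

    def work(chunk: Sequence[str]) -> int:
        for key in chunk:
            # One GET per key, then one insert per key.
            index_row(key, fetch_row(key))
        return len(chunk)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(work, split_keys(keys, num_workers)))


if __name__ == "__main__":
    store = {f"k{i}": {"value": f"v{i}"} for i in range(10)}
    indexed: Dict[str, Columns] = {}
    done = index_in_parallel(list(store), 3, store.__getitem__,
                             indexed.__setitem__)
    print(done, "keys indexed")
```

The threads share no paging state, so nothing needs to be coordinated
between them, at the cost of one round trip to Cassandra per key.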


*So my questions are:*
1. What is the suggested strategy to read bulk data from Cassandra? Which
read pattern is better: one big get range slice with 10,000 keys-columns,
or multiple small GETs, one for every key?

2. How about reading more values at once, say 50,000 keys-columns, by
increasing the thrift frame size from 16MB to something greater like 54MB?
How will it impact Cassandra's performance in general?

I would appreciate your input about any other strategies you use to move
bulk data from Cassandra.

-- 
Thanks,
-Utkarsh