Marcelo,

Here is a link to the preview of the python fast copy program:
https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

It will copy a table from one cluster to another with some transformation; the source and destination can be the same cluster. It has 3 main throttles to experiment with:

1. fetch_size: size of source pages in rows
2. worker_count: number of worker subprocesses
3. concurrency: number of async callback chains per worker subprocess

It is easy to overrun Cassandra and the python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10.

Additionally there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. retry is useful if you are getting timeouts. For the others, YMMV.

To use it you need to define the SELECT and UPDATE CQL statements as well as the 'map_fields' method.

The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster, and the driver sets up connection pools to the nodes in the cluster. Anyway, there are a lot of processes, threads, and callbacks going at once, so it is fun to watch.

On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit - each row about 6kb.

ml

On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <[email protected]> wrote:

> OK Marcelo, I'll work on it today. -ml
>
> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <[email protected]> wrote:
>
>> Hi Michael,
>>
>> For sure I would be interested in this program!
>>
>> I am new both to python and to CQL. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work well.
>>
>> I open sourced my version here:
>> https://github.com/s1mbi0se/cql_record_processor
>>
>> Just in case it's useful for anything.
>>
>> However, I saw the CQL driver has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful.
>>
>> My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSDs and 64 GB RAM; I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed, if that would help.
>>
>> Regards,
>> Marcelo.
>>
>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <[email protected]>:
>>
>>> Hi Marcelo,
>>>
>>> I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this?
>>>
>>> With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable-size rows averaging 5kb.
>>>
>>> This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver.
>>>
>>> ml
>>>
>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <[email protected]> wrote:
>>>
>>>> Hi Jens,
>>>>
>>>> Thanks for trying to help.
>>>>
>>>> Indeed, I know I can't do it using just CQL. But what would you use to migrate the data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but with no success. I only have two nodes and less than 200 GB in this cluster; any simple way to extract the data quickly would be good enough for me.
>>>>
>>>> Best regards,
>>>> Marcelo.
>>>>
>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <[email protected]>:
>>>>
>>>>> Hi Marcelo,
>>>>>
>>>>> Looks like you can't do this without migrating your data manually:
>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>
>>>>> Cheers,
>>>>> Jens
>>>>>
>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have some CQL column families in a 2-node Cassandra 2.0.8 cluster.
>>>>>>
>>>>>> I realized I created my column family with the wrong partition key. Instead of:
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>     name varchar,
>>>>>>     value varchar,
>>>>>>     entity_id uuid,
>>>>>>     PRIMARY KEY ((name, value), entity_id))
>>>>>> WITH caching='all';
>>>>>>
>>>>>> I used:
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>     name varchar,
>>>>>>     value varchar,
>>>>>>     entity_id uuid,
>>>>>>     PRIMARY KEY (name, value, entity_id))
>>>>>> WITH caching='all';
>>>>>>
>>>>>> Now I need to migrate the data from the second CF to the first one. I am using DataStax Community Edition.
>>>>>>
>>>>>> What would be the best way to convert the data from one CF to the other?
>>>>>>
>>>>>> Best regards,
>>>>>> Marcelo.
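The token-range split the top message describes can be sketched in pure python. This is a minimal illustration, assuming the Murmur3 partitioner's full range of -2^63 to 2^63-1; the actual gist may divide the ranges differently:

```python
# Sketch: split the full Murmur3 token range evenly among worker
# subprocesses; each worker then pages through its own slice
# quasi-independently.
MIN_TOKEN = -(2 ** 63)      # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1     # Murmur3Partitioner maximum token

def token_ranges(worker_count):
    """Yield (start, end) token bounds, one contiguous slice per worker."""
    span = (MAX_TOKEN - MIN_TOKEN) // worker_count
    for i in range(worker_count):
        start = MIN_TOKEN + i * span
        # The last worker absorbs any rounding remainder.
        end = MAX_TOKEN if i == worker_count - 1 else start + span - 1
        yield (start, end)
```

Each worker would then page through its slice with a range query along the lines of `SELECT ... WHERE token(pk) >= ? AND token(pk) <= ?`.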
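The retry, dc_aware, and token_aware switches mentioned in the top message correspond to policy classes in the DataStax python driver. A minimal connection-wiring sketch, with contact points and the local DC name as placeholders (the gist's actual switch handling may differ):

```python
# Sketch: mapping the copier's policy switches onto cassandra-driver
# policies. "dc1" and the contact points are placeholder values.
from cassandra.cluster import Cluster
from cassandra.policies import (
    RoundRobinPolicy,
    DCAwareRoundRobinPolicy,
    TokenAwarePolicy,
    DowngradingConsistencyRetryPolicy,
)

def make_cluster(contact_points, dc_aware=False, token_aware=False, retry=False):
    lb = DCAwareRoundRobinPolicy(local_dc="dc1") if dc_aware else RoundRobinPolicy()
    if token_aware:
        # Route each request to a replica for its partition key.
        lb = TokenAwarePolicy(lb)
    kwargs = {"load_balancing_policy": lb}
    if retry:
        # On timeout, retry at a lower consistency level instead of failing -
        # this is the "retry (downgrade consistency)" switch.
        kwargs["default_retry_policy"] = DowngradingConsistencyRetryPolicy()
    return Cluster(contact_points, **kwargs)
```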
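For the migration at the bottom of the thread, both tables share the same columns and only the partition key layout differs, so the 'map_fields' hook could be a plain identity mapping. The statement strings and the map_fields signature below are illustrative assumptions, not the gist's actual API:

```python
# Sketch of the per-table pieces the copier needs: a source SELECT over a
# token range, a destination INSERT, and a map_fields hook that turns each
# source row into destination bind values.
SELECT_CQL = (
    "SELECT name, value, entity_id FROM entitylookup "
    "WHERE token(name) >= ? AND token(name) <= ?"
)
INSERT_CQL = "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)"

def map_fields(row):
    """Map a source row (as a dict) to destination bind values.

    Here the columns are identical in both tables, so this is an
    identity mapping; a real transformation would go in this hook.
    """
    return (row["name"], row["value"], row["entity_id"])
```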
