Adding to Stefan's comment: there is the ScyllaDB "scylla-migrator", which uses the Spark connector from DataStax and theoretically can work with any Cassandra-compliant DB, so it should not be limited to Cassandra-to-Scylla migrations.
https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
https://github.com/scylladb/scylla-migrator

On Thu, Mar 14, 2019 at 3:04 PM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

> The possibility of a highly available way to do this brings more
> challenges. I would weigh the cost of a complex solution against the
> possibility of a maintenance window where you stop your app to move the
> data, then restart.
>
> For the straight copy of the data, I am currently enamored with DataStax's
> dsbulk utility for unloading and loading larger amounts of data. I don't
> have extensive experience yet, but it has been fast enough in my
> experiments, and that is without doing much tuning for speed. From a host
> not in the cluster, I was able to extract 3.5 million rows in about 11
> seconds. I inserted them into a differently partitioned table in about 26
> seconds. Very small data rows, but it was impressive for not doing much to
> try to speed it up further. (In some other tests, it was about ¼ the time
> of a simple COPY statement from cqlsh.)
>
> If I were designing something for a "can't take an outage" scenario, I
> would start with:
>
> - Writing the data to the old and new tables on all inserts
> - On reads, reading from the new table first; if the row is not there,
>   reading from the old table (this could introduce some latency, but would
>   be available; you could also do asynchronous reads on both tables and
>   choose the latest)
> - Doing this until the data has been copied from old to new (with dsbulk,
>   custom code, or Spark)
> - Dropping the double writes and conditional reads
>
> Sean
>
> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
> *Sent:* Wednesday, March 13, 2019 6:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
> Hi Leena,
>
> as already suggested in my previous email, you could use Apache Spark and
> the Cassandra Spark connector (1). I have checked TTLs and I believe you
> should especially read this section (2) about TTLs. It seems like that is
> what you need: TTLs per row. The workflow would be to read from your source
> table, make transformations per row (via some mapping), and then save the
> result to the new table.
>
> This would import it "all", but while records are still being saved into
> the original table, I am not sure how to cover "the gap": once you make the
> switch to the new table, you would miss records which were created in the
> first table after you did the loading. You could maybe leverage Spark
> streaming (the Cassandra connector supports that too) so you would make
> this transformation on the fly for new records as well.
>
> (1) https://github.com/datastax/spark-cassandra-connector
> (2) https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row
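For illustration, here is a minimal sketch of the per-row TTL copy that section (2) above documents, using the connector's Scala API. It is a sketch under assumptions, not code from the thread: the keyspace, tables, and columns (my_keyspace, table1, table2, id, created, payload) are hypothetical placeholders.

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
    import org.apache.spark.{SparkConf, SparkContext}

    object CopyWithTtl {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("copy-with-ttl")
          .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical contact point
        val sc = new SparkContext(conf)

        // Read each row together with the remaining TTL of one of its regular
        // columns (columns written in the same INSERT share the same TTL).
        val rows = sc.cassandraTable("my_keyspace", "table1")
          .select("id", "created", "payload", "payload".ttl as "payload_ttl")

        // Write to the new table, re-applying the TTL row by row. Rows whose
        // columns carry no TTL come back with a null payload_ttl and would
        // need filtering or a default value before saving.
        rows.saveToCassandra("my_keyspace", "table2",
          SomeColumns("id", "created", "payload"),
          writeConf = WriteConf(ttl = TTLOption.perRow("payload_ttl")))
      }
    }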
On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande <lghatpa...@hotmail.com> wrote:

> Understood, a 2nd table would be a better approach. So what would be the
> best way to copy 70M rows from the current table to the 2nd table, with the
> TTL on each record preserved from the first table?
>
> ------------------------------
>
> *From:* Durity, Sean R <sean_r_dur...@homedepot.com>
> *Sent:* Wednesday, March 13, 2019 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
> Correct, there is no current flag. I think there SHOULD be one.
>
> *From:* Dieudonné Madishon NGAYA <dmng...@gmail.com>
> *Sent:* Tuesday, March 12, 2019 7:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
> Hi Sean, there is no flag in cassandra.yaml to disallow ALLOW FILTERING;
> the only place you can address this is your data model. Don't ask Cassandra
> to query all the data in a table; the ideal query targets a single
> partition.
>
> On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic
> <stefan.mikloso...@instaclustr.com> wrote:
>
> Hi Sean,
>
> for sure, the best approach would be to create another table which would
> serve just that specific query.
>
> How do I set the flag for disallowing ALLOW FILTERING in cassandra.yaml? I
> read the docs and there seems to be nothing about that.
>
> Regards
>
> On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <sean_r_dur...@homedepot.com>
> wrote:
>
> If there are 2 access patterns, I would consider having 2 tables. The
> first one with the ID, which you say is the majority use case. Then have a
> second table that uses a time-bucket approach as others have suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look up the rest in the primary table as
> needed.
>
> If you use ALLOW FILTERING, you are setting yourself up for failure to
> scale. I tell my developers, "if you use allow filtering, you are doing it
> wrong." In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
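To make Sean's time-bucket suggestion concrete, here is one possible shape for the second table as a cqlsh sketch. The table and column names (events_by_bucket, bucket, created, id) and the day-sized bucket are illustrative assumptions; the bucket grain should be chosen so a partition stays under roughly 100 MB:

cqlsh> CREATE TABLE my_keyspace.events_by_bucket (
           bucket date,          -- day the row was created (the time bucket)
           created timestamp,
           id uuid,
           PRIMARY KEY ((bucket), created, id)
       );
cqlsh> SELECT id FROM my_keyspace.events_by_bucket
        WHERE bucket = '2019-03-12'
          AND created >= '2019-03-12 00:00:00'
          AND created < '2019-03-12 12:00:00';

A date-range read then becomes one single-partition query per bucket, with no ALLOW FILTERING involved.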
> *From:* Leena Ghatpande <lghatpa...@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
> Our data model cannot be as you recommended below, because the majority of
> reads need to select the data by the partition key (id) only, not by date.
>
> You could remodel your data in such way that you would make primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
> By adding the date as a clustering column, yes, the idea was to use ALLOW
> FILTERING on the date and pull the records. We understand that this is not
> recommended, but we have been doing it on another existing large table and
> have not run into any issues so far. But we want to understand if there is
> a better approach to this?
>
> Thanks
>
> ------------------------------
>
> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
> The query which does not work should be like this; I made a mistake there:
>
> cqlsh> SELECT * from my_keyspace.my_table where number > 2;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic
> <stefan.mikloso...@instaclustr.com> wrote:
>
> Hi Leena,
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
> If you want to select by some time range here, I am wondering how making
> the datetime a clustering column would help you. You still have to provide
> the partition key, right?
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you get the idea)
>
> If you make my_date a clustering column, you cannot do the query below,
> because you still have to specify the partition key fully and only then the
> clustering key (optionally), where you can further order and do ranges. You
> cannot run a query without specifying the partition key. Well, you can use
> ALLOW FILTERING, but you do not want to do that at all in your situation,
> as it would scan everything.
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>  id                                   | number
> --------------------------------------+--------
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and number > 2;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and number > 2;
>
>  id                                   | number
> --------------------------------------+--------
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
> You could remodel your data in such a way that you would make the primary
> key like this:
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
> I would prefer the second one: if you expect a lot of data per day, with
> the first key it would all end up on the same set of replicas, because the
> hash of the partition key would be the same for the whole day, so I think
> you would end up with hotspots. You want to have your data spread more
> evenly, so the second one seems better to me.
>
> You can also investigate how to do this with a materialized view, but I am
> not sure about the performance there.
>
> If you want to copy data, you can do this e.g. with the Cassandra Spark
> connector; you would just read the table and, as you read it, write to the
> other one. That is IMHO the fastest and least error-prone approach. You can
> do that on live production data and just make a "switch" afterwards. I am
> not sure about TTLs, but that should be transparent while copying.
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpa...@hotmail.com>
> wrote:
>
> We have a table with over 70M rows with a partition key that is unique.
> We have a created datetime stamp on each record, and we have a need to
> select all rows created for a date range. A secondary index is not an
> option, as it is high cardinality and could slow performance by doing a
> full scan on 70M rows.
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this in real time without impacting our customers.
> COPY is not an option, as we have TTLs on each row of table1 that need to
> be applied to table2 as well.
>
> So what would be the best approach:
>
> 1. To be able to select data using a date range without impacting
> performance. This operation will be needed only on an ad hoc basis and it
> won't be frequent.
> 2. The best way to migrate a large volume of data with TTLs from one table
> to another within the same cluster.
>
> Any other suggestions will also be greatly appreciated.
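On the "real time without impacting our customers" requirement, here is a minimal sketch of the dual-write / fallback-read pattern Sean outlined near the top of the thread, written in Scala against the DataStax Java driver 4.x. It is illustrative only: the schema (ks.old_table and ks.new_table with columns id, created, payload) is a hypothetical stand-in, and prepared statements, TTLs on the writes, and error handling are omitted.

    import java.time.Instant
    import java.util.UUID
    import com.datastax.oss.driver.api.core.CqlSession
    import com.datastax.oss.driver.api.core.cql.Row

    object DualWriteMigration {
      // Contact points and credentials are assumed to come from application.conf.
      val session: CqlSession = CqlSession.builder().build()

      // Phase 1: while the backfill runs, every insert goes to both tables.
      def insert(id: UUID, created: Instant, payload: String): Unit = {
        session.execute(
          "INSERT INTO ks.new_table (id, created, payload) VALUES (?, ?, ?)",
          id, created, payload)
        session.execute(
          "INSERT INTO ks.old_table (id, created, payload) VALUES (?, ?, ?)",
          id, created, payload)
      }

      // Reads try the new table first and fall back to the old one, so results
      // stay correct before, during, and after the backfill.
      def read(id: UUID): Option[Row] = {
        Option(session.execute("SELECT * FROM ks.new_table WHERE id = ?", id).one())
          .orElse(Option(session.execute("SELECT * FROM ks.old_table WHERE id = ?", id).one()))
      }
    }

Once the backfill (dsbulk, Spark, or custom code) has caught up, the old-table writes and the fallback reads can be dropped.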