Adding to Stefan's comment: there is the ScyllaDB "scylla-migrator", which uses the Spark connector from DataStax and theoretically can work with any Cassandra-compliant DB, so it should not be limited to Cassandra-to-Scylla migrations.
https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
https://github.com/scylladb/scylla-migrator

On Thu, Mar 14, 2019 at 3:04 PM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

> The possibility of a highly available way to do this brings more
> challenges. I would weigh the cost of a complex solution against the
> possibility of a maintenance window where you stop your app to move the
> data, then restart.
>
> For the straight copy of the data, I am currently enamored with DataStax's
> dsbulk utility for unloading and loading larger amounts of data. I don't
> have extensive experience yet, but it has been fast enough in my
> experiments, and that is without doing much tuning for speed. From a host
> not in the cluster, I was able to extract 3.5 million rows in about 11
> seconds. I inserted them into a differently partitioned table in about 26
> seconds. Very small data rows, but it was impressive for not doing much to
> try to speed it up further. (In some other tests, it was about ¼ the time
> of a simple COPY statement from cqlsh.)
>
> If I were designing something for a "can't take an outage" scenario, I
> would start with:
>
> - Writing the data to the old and new tables on all inserts
> - On reads, reading from the new table first; if the row is not there,
>   reading from the old table (this could introduce some latency, but would
>   be available; you could also do asynchronous reads on both tables and
>   choose the latest)
> - Doing this until the data has been copied from old to new (with dsbulk,
>   custom code, or Spark)
> - Dropping the double writes and conditional reads
>
> Sean
>
> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
> *Sent:* Wednesday, March 13, 2019 6:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
> Hi Leena,
>
> as already suggested in my previous email, you could use Apache Spark and
> the Cassandra Spark connector (1). I have checked TTLs and I believe you
> should especially read this section (2) about TTLs. It seems like that is
> what you need: TTLs per row. The workflow would be to read from your source
> table, make transformations per row (via some mapping), and then save the
> result to the new table.
>
> This would import it "all", but while records are still being saved into
> the original table, I am not sure how to cover "the gap": once you make the
> switch to the new table, you would miss records which were created in the
> first table after you did the loading. You could maybe leverage Spark
> streaming (the Cassandra connector supports that too) so you would make
> this transformation on the fly for new records as well.
>
> (1) https://github.com/datastax/spark-cassandra-connector
> (2) https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row
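For illustration, here is a minimal sketch of the per-row TTL copy that section (2) above documents, using the connector's Scala API. It is a sketch under assumptions, not code from the thread: the keyspace, tables, and columns (my_keyspace, table1, table2, id, created, payload) are hypothetical placeholders.

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
    import org.apache.spark.{SparkConf, SparkContext}

    object CopyWithTtl {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("copy-with-ttl")
          .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical contact point
        val sc = new SparkContext(conf)

        // Read each row together with the remaining TTL of one of its regular
        // columns (columns written in the same INSERT share the same TTL).
        val rows = sc.cassandraTable("my_keyspace", "table1")
          .select("id", "created", "payload", "payload".ttl as "payload_ttl")

        // Write to the new table, re-applying the TTL row by row. Rows whose
        // columns carry no TTL come back with a null payload_ttl and would
        // need filtering or a default value before saving.
        rows.saveToCassandra("my_keyspace", "table2",
          SomeColumns("id", "created", "payload"),
          writeConf = WriteConf(ttl = TTLOption.perRow("payload_ttl")))
      }
    }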
On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande <lghatpa...@hotmail.com> wrote:

> Understood, a 2nd table would be a better approach. So what would be the
> best way to copy 70M rows from the current table to the 2nd table, with the
> TTL on each record preserved from the first table?
>
> ------------------------------
>
> *From:* Durity, Sean R <sean_r_dur...@homedepot.com>
> *Sent:* Wednesday, March 13, 2019 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
> Correct, there is no current flag. I think there SHOULD be one.
>
> *From:* Dieudonné Madishon NGAYA <dmng...@gmail.com>
> *Sent:* Tuesday, March 12, 2019 7:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
> Hi Sean, there is no flag in cassandra.yaml to disallow ALLOW FILTERING;
> the only place you can address this is your data model. Don't ask Cassandra
> to query all the data in a table; the ideal query targets a single
> partition.
>
> On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic
> <stefan.mikloso...@instaclustr.com> wrote:
>
> Hi Sean,
>
> for sure, the best approach would be to create another table which would
> serve just that specific query.
>
> How do I set the flag for disallowing ALLOW FILTERING in cassandra.yaml? I
> read the docs and there seems to be nothing about that.
>
> Regards
>
> On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <sean_r_dur...@homedepot.com>
> wrote:
>
> If there are 2 access patterns, I would consider having 2 tables. The
> first one with the ID, which you say is the majority use case. Then have a
> second table that uses a time-bucket approach as others have suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look up the rest in the primary table as
> needed.
>
> If you use ALLOW FILTERING, you are setting yourself up for failure to
> scale. I tell my developers, "if you use allow filtering, you are doing it
> wrong." In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
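To make Sean's time-bucket suggestion concrete, here is one possible shape for the second table as a cqlsh sketch. The table and column names (events_by_bucket, bucket, created, id) and the day-sized bucket are illustrative assumptions; the bucket grain should be chosen so a partition stays under roughly 100 MB:

cqlsh> CREATE TABLE my_keyspace.events_by_bucket (
           bucket date,          -- day the row was created (the time bucket)
           created timestamp,
           id uuid,
           PRIMARY KEY ((bucket), created, id)
       );
cqlsh> SELECT id FROM my_keyspace.events_by_bucket
        WHERE bucket = '2019-03-12'
          AND created >= '2019-03-12 00:00:00'
          AND created < '2019-03-12 12:00:00';

A date-range read then becomes one single-partition query per bucket, with no ALLOW FILTERING involved.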
> *From:* Leena Ghatpande <lghatpa...@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
> Our data model cannot be as you recommended below, because the majority of
> reads need to select the data by the partition key (id) only, not by date.
>
> You could remodel your data in such way that you would make primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
> By adding the date as a clustering column, yes, the idea was to use ALLOW
> FILTERING on the date and pull the records. We understand that this is not
> recommended, but we have been doing it on another existing large table and
> have not run into any issues so far. But we want to understand if there is
> a better approach to this?
>
> Thanks
>
> ------------------------------
>
> *From:* Stefan Miklosovic <stefan.mikloso...@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
> The query which does not work should be like this; I made a mistake there:
>
> cqlsh> SELECT * from my_keyspace.my_table where number > 2;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic
> <stefan.mikloso...@instaclustr.com> wrote:
>
> Hi Leena,
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
> If you want to select by some time range here, I am wondering how making
> the datetime a clustering column would help you. You still have to provide
> the partition key, right?
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you get the idea)
>
> If you make my_date a clustering column, you cannot do the query below,
> because you still have to specify the partition key fully and only then the
> clustering key (optionally), where you can further order and do ranges. You
> cannot run a query without specifying the partition key. Well, you can use
> ALLOW FILTERING, but you do not want to do that at all in your situation,
> as it would scan everything.
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>  id                                   | number
> --------------------------------------+--------
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and number > 2;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and number > 2;
>
>  id                                   | number
> --------------------------------------+--------
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
> You could remodel your data in such a way that you would make the primary
> key like this:
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
> I would prefer the second one: if you expect a lot of data per day, with
> the first key it would all end up on the same set of replicas, because the
> hash of the partition key would be the same for the whole day, so I think
> you would end up with hotspots. You want to have your data spread more
> evenly, so the second one seems better to me.
>
> You can also investigate how to do this with a materialized view, but I am
> not sure about the performance there.
>
> If you want to copy data, you can do this e.g. with the Cassandra Spark
> connector; you would just read the table and, as you read it, write to the
> other one. That is IMHO the fastest and least error-prone approach. You can
> do that on live production data and just make a "switch" afterwards. I am
> not sure about TTLs, but that should be transparent while copying.
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lghatpa...@hotmail.com>
> wrote:
>
> We have a table with over 70M rows with a partition key that is unique.
> We have a created datetime stamp on each record, and we have a need to
> select all rows created for a date range. A secondary index is not an
> option, as it is high cardinality and could slow performance by doing a
> full scan on 70M rows.
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this in real time without impacting our customers.
> COPY is not an option, as we have TTLs on each row of table1 that need to
> be applied to table2 as well.
>
> So what would be the best approach:
>
> 1. To be able to select data using a date range without impacting
> performance. This operation will be needed only on an ad hoc basis and it
> won't be frequent.
> 2. The best way to migrate a large volume of data with TTLs from one table
> to another within the same cluster.
>
> Any other suggestions will also be greatly appreciated.
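On the "real time without impacting our customers" requirement, here is a minimal sketch of the dual-write / fallback-read pattern Sean outlined near the top of the thread, written in Scala against the DataStax Java driver 4.x. It is illustrative only: the schema (ks.old_table and ks.new_table with columns id, created, payload) is a hypothetical stand-in, and prepared statements, TTLs on the writes, and error handling are omitted.

    import java.time.Instant
    import java.util.UUID
    import com.datastax.oss.driver.api.core.CqlSession
    import com.datastax.oss.driver.api.core.cql.Row

    object DualWriteMigration {
      // Contact points and credentials are assumed to come from application.conf.
      val session: CqlSession = CqlSession.builder().build()

      // Phase 1: while the backfill runs, every insert goes to both tables.
      def insert(id: UUID, created: Instant, payload: String): Unit = {
        session.execute(
          "INSERT INTO ks.new_table (id, created, payload) VALUES (?, ?, ?)",
          id, created, payload)
        session.execute(
          "INSERT INTO ks.old_table (id, created, payload) VALUES (?, ?, ?)",
          id, created, payload)
      }

      // Reads try the new table first and fall back to the old one, so results
      // stay correct before, during, and after the backfill.
      def read(id: UUID): Option[Row] = {
        Option(session.execute("SELECT * FROM ks.new_table WHERE id = ?", id).one())
          .orElse(Option(session.execute("SELECT * FROM ks.old_table WHERE id = ?", id).one()))
      }
    }

Once the backfill (dsbulk, Spark, or custom code) has caught up, the old-table writes and the fallback reads can be dropped.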