Hi Bogdan,

First - I am probably Captain Obvious here - but you will lose some data in
the process. As *interaction_timestamp* is no longer part of the primary
key, you will upsert the same row multiple times, and the last write wins.
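
A rough sketch of that semantic, as a plain Python analogy (not the real
storage path; the sample values are taken from your mail below):

    # Upserts against the new, smaller primary key behave like writes
    # into a map: every write with the same (user_id, message_id,
    # interaction_state) replaces the previous value, so only the last
    # write survives - and because COPY FROM is parallel, "last" is not
    # necessarily the last line of the CSV.
    new_table = {}
    rows = [
        ("123", "ed6c69a0-0add-11b2-8080-808080808080", "DISMISSED",
         "2020-01-03 17:50:59+0000"),
        ("123", "ed6c69a0-0add-11b2-8080-808080808080", "DISMISSED",
         "2020-01-10 00:05:41+0000"),
    ]
    for user_id, message_id, state, ts in rows:
        new_table[(user_id, message_id, state)] = ts
    print(len(new_table))  # 1 - the two CSV rows collapsed into one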

The CSV rows are loaded in parallel, with multiple threads and multiple
chunks, so the order is indeed *NOT guaranteed*. I had to go back to older
documentation to find the proper description: "COPY FROM loads rows from a
CSV file in a parallel non-deterministic order."
https://docs.datastax.com/en/cql-oss/3.1/cql/cql_reference/copy_r.html

You can still play with CHUNKSIZE, but this is not what you want.
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshCopy.html

The solution here would be to work on the dump itself, I am afraid: rows
come out of the first table ordered, so pre-process the CSV and keep only
the latest *interaction_timestamp* for each (user_id, message_id,
interaction_state) before importing.
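
If it helps, here is a minimal pre-processing sketch (plain Python with
the csv module; the file names are made up and the timestamp format is
assumed from your sample):

    import csv
    from datetime import datetime

    # Keep, for each (user_id, message_id, interaction_state), only the
    # row carrying the greatest interaction_timestamp.
    latest = {}
    with open("initial_table_dump.csv", newline="") as f:
        for user_id, message_id, state, ts in csv.reader(f):
            key = (user_id, message_id, state)
            parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S%z")
            if key not in latest or parsed > latest[key][0]:
                latest[key] = (parsed, [user_id, message_id, state, ts])

    with open("initial_table_dump_deduped.csv", "w", newline="") as out:
        csv.writer(out).writerows(row for _, row in latest.values())

Then COPY FROM the deduplicated file: each primary key appears exactly
once, so the load order no longer matters.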

Cedrick

On Tue, Feb 11, 2020 at 1:17 PM Bogdan Gherca <bogdan.ghe...@gmail.com>
wrote:

> Hey Cassandra folks,
>
> I'm trying to change the schema of an existing table by creating a new one
> and migrating the data.
>
> The initial table schema looks like this:
>
> CREATE TABLE IF NOT EXISTS initial_table (
>     user_id                 text,
>     message_id              timeuuid,
>     interaction_state       text,
>     interaction_timestamp   timestamp,
>     PRIMARY KEY ((user_id), message_id, interaction_state, interaction_timestamp)
> );
>
> We're trying to remove interaction timestamp from the PK - same schema but
> with *PRIMARY KEY ((user_id), message_id, interaction_state)*
>
> When importing the .csv dump obtained from *initial_table*, the
> timestamp column seems to be written in a weird way. Multiple rows from the
> old schema need to be merged into a single entry of the new schema. In most
> cases the last entry gets copied over to the new table, while for others a
> random one gets copied. Check out the csv sample below and the COPY FROM
> result.
>
> *initial_table_dump.csv*
>
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-10 00:05:41+0000
>
> copy new_table(user_id, message_id, interaction_state,
> interaction_timestamp) from '~/initial_table_dump.csv';
>
> *Result:*
>
>  user_id | message_id                           | interaction_state | interaction_timestamp
> ---------+--------------------------------------+-------------------+--------------------------
>      123 | ed6c69a0-0add-11b2-8080-808080808080 |         DISMISSED | 2020-01-03 17:50:59+0000
>
> Notice that the first row from the csv gets written into the new table in
> this case - here there are only two rows, but with more of them it seems a
> random one gets copied over, not necessarily the first or last. When
> updating the interaction_timestamp column value as below, it seems to copy
> the latest entry to the new table.
>
> *initial_table_dump_2.csv*
>
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
> 123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-05 00:05:41+0000
>
> *- perform the same copy from operation -*
>
> *Result:*
>
>  user_id | message_id                           | interaction_state | interaction_timestamp
> ---------+--------------------------------------+-------------------+--------------------------
>      123 | ed6c69a0-0add-11b2-8080-808080808080 |         DISMISSED | 2020-01-05 00:05:41+0000
>
> Could someone help me understand why this might happen? Does COPY FROM
> follow the order of the csv when doing the import, or are there no order
> guarantees?
>
> I'm using the below cqlsh and Cassandra versions:
> *[cqlsh 5.0.1 | Cassandra 2.2.15 | CQL spec 3.3.1 | Native protocol v4]*
>
> Thanks,
> Bogdan
>


-- 

Cédrick Lunven
*EMEA Developer Advocate Manager*

🎓 Free Trainings: *DataStax Academy <https://academy.datastax.com/>*
❓ Ask us your questions: *DataStax Community <https://community.datastax.com/index.html>*
🔬 Test our new products: *DataStax Labs <https://downloads.datastax.com/#labs>*
