Re: COPY command with where condition

2020-01-20 Thread Alex Ott
I think you can avoid the timeout if you specify a token condition inside the
WHERE clause, like:

-query "SELECT * FROM probe_sensors WHERE token(...) > :start and
token(...) <= :end AND localisation_id = 208812 ALLOW FILTERING"

replace ... with the comma-separated list of your partition key columns
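Alex's token trick generalizes: with Murmur3Partitioner the ring spans tokens -2^63 through 2^63 - 1, so it can be cut into subranges and unloaded one slice at a time. A minimal Python sketch of the idea (the table and column names come from this thread's example; the slice count is an arbitrary assumption):

```python
# Split the full Murmur3 token range into subranges and emit one
# dsbulk query per subrange. Token values are 64-bit signed integers.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_subranges(n):
    """Yield (start, end] token bounds covering the full ring in n slices."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    start = MIN_TOKEN
    for i in range(n):
        end = MAX_TOKEN if i == n - 1 else start + step
        yield start, end
        start = end

def build_queries(n, pk_cols="id"):
    # pk_cols is a placeholder for your real partition key column list
    return [
        f"SELECT * FROM probe_sensors "
        f"WHERE token({pk_cols}) > {s} AND token({pk_cols}) <= {e} "
        f"AND localisation_id = 208812 ALLOW FILTERING"
        for s, e in token_subranges(n)
    ]

for q in build_queries(4):
    print(q)
```

Each generated query can then be passed to dsbulk's -query option; more slices mean smaller, less timeout-prone reads.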


On Fri, Jan 17, 2020 at 7:47 PM Chris Splinter 
wrote:

> Do you know your partition keys?
>
> One option could be to enumerate that list of partition keys in separate
> commands to make each individual operation less expensive for the cluster.
>
> For example:
> Say your partition key column is called id and the ids in your database
> are [1,2,3]
>
> You could do
> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
> FROM probe_sensors WHERE id = 1 AND localisation_id = 208812" -url
> /home/dump
> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
> FROM probe_sensors WHERE id = 2 AND localisation_id = 208812" -url
> /home/dump
> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
> FROM probe_sensors WHERE id = 3 AND localisation_id = 208812" -url
> /home/dump
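Since the three commands differ only in the id value, they can be generated instead of typed by hand. A sketch using the example's names (keyspace, table, ids, and dump path all mirror the commands above; shlex.quote is used only to keep the query shell-safe):

```python
import shlex

def unload_commands(ids, keyspace="dev_keyspace", url="/home/dump"):
    """Build one dsbulk unload command per partition key value."""
    cmds = []
    for pk in ids:
        query = (f"SELECT * FROM probe_sensors "
                 f"WHERE id = {pk} AND localisation_id = 208812")
        cmds.append(
            f"./dsbulk unload --dsbulk.schema.keyspace {shlex.quote(keyspace)} "
            f"-query {shlex.quote(query)} -url {url}"
        )
    return cmds

for cmd in unload_commands([1, 2, 3]):
    print(cmd)
```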
>
>
> Does that option work for you?
>
>
>
> On Fri, Jan 17, 2020 at 12:17 PM adrien ruffie 
> wrote:
>
>> I don't know yet for the production environment, but in the development
>> environment the table contains more than 10,000,000 rows.
>> We only need a subset of this table, not the whole thing ...
>> --
>> *From:* Chris Splinter 
>> *Sent:* Friday, January 17, 2020 17:40
>> *To:* adrien ruffie 
>> *Cc:* user@cassandra.apache.org ; Erick
>> Ramirez 
>> *Subject:* Re: COPY command with where condition
>>
>> What you are seeing there is a standard read timeout. How many rows do
>> you expect back from that query?
>>
>> On Fri, Jan 17, 2020 at 9:50 AM adrien ruffie 
>> wrote:
>>
>> Thank you very much,
>>
>> So I run this request, for example:
>>
>> ./dsbulk unload --dsbulk.schema.keyspace 'dev_keyspace' -query "SELECT *
>> FROM probe_sensors WHERE localisation_id = 208812 ALLOW FILTERING" -url
>> /home/dump
>>
>>
>> But I get the following error
>> com.datastax.dsbulk.executor.api.exception.BulkExecutionException:
>> Statement execution failed: SELECT * FROM crt_sensors WHERE site_id =
>> 208812 ALLOW FILTERING (Cassandra timeout during read query at consistency
>> LOCAL_ONE (1 responses were required but only 0 replica responded))
>>
>> I configured my driver with the following driver.conf, but nothing works
>> correctly. Do you know what the problem is?
>>
>> datastax-java-driver {
>>   basic {
>>     contact-points = ["data1com:9042", "data2.com:9042"]
>>     request {
>>       timeout = "200"
>>       consistency = "LOCAL_ONE"
>>     }
>>   }
>>   advanced {
>>     auth-provider {
>>       class = PlainTextAuthProvider
>>       username = "superuser"
>>       password = "mypass"
>>     }
>>   }
>> }
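One thing worth noting about the config above: in the driver's HOCON format a bare number like "200" is read as 200 milliseconds, which is far too short for a large unload. A sketch of the same request block with an explicit unit and a more generous (arbitrarily chosen) value:

```
datastax-java-driver {
  basic {
    request {
      # "200" was parsed as 200 ms; use an explicit unit and a larger value
      timeout = "5 minutes"
      consistency = "LOCAL_ONE"
    }
  }
}
```

Note that the error shown is a server-side read timeout, so the server's own read_request_timeout limit still applies; raising the client timeout alone may not be enough, which is why the token-range and per-partition approaches in this thread help.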
>> --
>> *From:* Chris Splinter 
>> *Sent:* Friday, January 17, 2020 16:17
>> *To:* user@cassandra.apache.org 
>> *Cc:* Erick Ramirez 
>> *Subject:* Re: COPY command with where condition
>>
>> DSBulk has an option that lets you specify the query (including a WHERE
>> clause).
>>
>> See Example 19 in this blog post for details:
>> https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
>>
>> On Fri, Jan 17, 2020 at 7:34 AM Jean Tremblay <
>> jean.tremb...@zen-innovations.com> wrote:
>>
>> Did you think about using a Materialised View to generate what you want
>> to keep, and then use DSBulk to extract the data?
>>
>> On 17 Jan 2020, at 14:30 , adrien ruffie 
>> wrote:
>>
>> Sorry, I'm coming back with a quick question about the bulk loader ...
>>
>> https://www.datastax.com/blog/2018/05/introducing-datastax-bulk-loader
>>
>> I read this : "Operations such as converting strings to lowercase,
>> arithmetic on input columns, or filtering out rows based on some criteria,
>> are not supported. "
>>
>> So it's still not possible to use a WHERE clause with DSBulk, right?
>>
>> I don't really know how to do this in a way that avoids exporting the
>> entirety of the business data already stored, which we don't need ...
>>
>>
>>
>> --
>> *From:* adrien ruffie 
>> *Sent:* Friday, January 17, 2020 11:39
>> *To:* Erick Ramirez ; user@cassandra.apache.org <
>> user@cassandra.apache.org>
>> *Subject:* RE: COPY command with where condition
>>
>> Thanks a lot!
>> That's good news about DSBulk! I will take a look at this solution.
>>
>> best regards,
>> Adrian
>> --
>> *De :* Erick Ramirez 
>> *Envoyé :* vendredi 17 janvier 2020 10:02
>> *À :* user@cassandra.apache.org 
>> *Objet :* Re: COPY command with where condition
>>
>> The COPY command doesn't support filtering and it doesn't perform well
>> for large tables.
>>
>> Have you considered the DSBulk tool from DataStax? Previously, it only
>> worked 

Re: [EXTERNAL] Re: COPY command with where condition

2020-01-20 Thread Jean Carlo
Hello

Nobody has mentioned it yet, but you can also use the Spark Cassandra
connector, preferably when your data set is so big that a simple copy to
CSV cannot handle it.

Regards

Jean Carlo

"The best way to predict the future is to invent it" Alan Kay


On Fri, Jan 17, 2020 at 8:11 PM Durity, Sean R 
wrote:

> sstablekeys (in the tools directory?) can extract the actual keys from
> your sstables. You have to run it on each node and then combine and de-dupe
> the final results, but I have used this technique with a query generator to
> extract data more efficiently.
>
>
>
>
>
> Sean Durity
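The combine-and-de-dupe step Sean describes can be sketched as follows, assuming you have copied each node's sstablekeys output to local files (the file names and the query template are hypothetical, modeled on this thread's example table):

```python
def merge_keys(key_files):
    """Combine per-node sstablekeys output files and drop duplicate keys."""
    seen = set()
    for path in key_files:
        with open(path) as f:
            for line in f:
                key = line.strip()
                if key and key not in seen:
                    seen.add(key)
                    yield key

def queries_for_keys(keys):
    # hypothetical query template for the thread's example table
    return [f"SELECT * FROM probe_sensors WHERE id = {k}" for k in keys]
```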