[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-11-05 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991279#comment-14991279
 ] 

Stefania edited comment on CASSANDRA-9304 at 11/5/15 9:11 AM:
--

Thank you for your input. 

Regarding version support for Windows, fine for 2.2+ but for completeness I'll 
point out that the only obstacle left in 2.1 is the name of the file (_cqlsh_ 
-> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't 
need to deal with the low-level, platform-specific details. Queues can also be 
safely used from the callback threads, which was not the case for pipes.
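As an aside, here is a minimal sketch (not from the patch; the callback name and values are made up) of the pattern: a {{multiprocessing.Queue}} can be fed from several callback threads without extra locking, which a duplex pipe endpoint cannot.

```python
import threading
from multiprocessing import Queue

# A Queue shared by the worker's callback threads and drained by the parent.
results = Queue()

def on_page_received(page_id, rows):
    # Imagine the driver invoking this from one of its event-loop threads.
    results.put((page_id, rows))

# Simulate four concurrent callbacks, each delivering one page of rows.
threads = [
    threading.Thread(target=on_page_received, args=(i, ['row'] * 3))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The parent drains the queue; arrival order is unspecified, so sort by page id.
pages = sorted(results.get() for _ in range(4))
print([p[0] for p in pages])  # [0, 1, 2, 3]
```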

Regarding the problem with the driver, -I haven't tested in 2.2 but I don't 
think it matters which version since- I verified the problem applies to 2.2 as 
well. Yesterday I was using the latest cassandra-test driver version; today I 
used 2.7.2. The column type is the same, {{cassandra.cqltypes.BytesType}}, and 
the method called from {{recv_result_rows()}} is the same, {{>}}, but 
{{cls.deserialize}} in {{from_binary}} is a lambda for the case that works and 
the default implementation {{CassandraType.deserialize}} for the case that 
does not work. I don't know where the lambda comes from, but I noticed there is 
a cython deserialize for {{BytesType}} in deserializers.pyx. I don't know how 
cython works, but if this is picked up in the normal case then the problem is 
again with the way multiprocessing imports modules. 
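To illustrate that last point with a toy (the module source below is made up, not driver code): a spawn-started child effectively re-executes its modules from source, so anything patched onto a class after import in the parent is absent in the child.

```python
import types

# Toy illustration: simulate what a spawn-started worker sees. The parent
# patches a class after "importing" its module; the child re-executes the
# module source, producing a fresh class without the patch.
source = "class BytesType(object):\n    pass\n"

parent_mod = types.ModuleType('fake_cqltypes')
exec(source, parent_mod.__dict__)
parent_mod.BytesType.deserialize = staticmethod(bytearray)  # runtime patch

child_mod = types.ModuleType('fake_cqltypes')
exec(source, child_mod.__dict__)  # what re-importing under spawn amounts to

print(hasattr(parent_mod.BytesType, 'deserialize'))  # True
print(hasattr(child_mod.BytesType, 'deserialize'))   # False
```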

The problem can be solved by adding a deserialize implementation to BytesType, 
like it's done for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)
 
+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+
 
 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}
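For context, the behavioral difference that one-line patch makes can be sketched with plain functions (illustrative only, not the driver's actual code paths):

```python
raw = b'\xff\xfe\x00\x01'  # an arbitrary blob value, as received off the wire

def default_deserialize(byts, protocol_version):
    # mirrors the default CassandraType.deserialize: returns the input unchanged
    return byts

def patched_deserialize(byts, protocol_version):
    # mirrors the patched BytesType.deserialize: wraps the bytes in a bytearray
    return bytearray(byts)

# A bytearray signals "binary" to the formatter (hex-encode it), whereas the
# raw value may be treated as text and fail to decode -- the Unicode error
# observed on Windows.
print(type(default_deserialize(raw, 4)).__name__)   # bytes
print(type(patched_deserialize(raw, 4)).__name__)   # bytearray
```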

If this is not enough and you want to debug some more [~aholmber], you can use 
the 2.1 patch attached. I'm still working on the 2.2 merge. You need to 
generate a table with a blob; I used cassandra-stress. Then run {{COPY 
 TO 'anyfile';}} from cqlsh and this should result in a Unicode 
decode error on Windows because the blob is received as a string. If you prefer 
me to test things for you, that works too.





[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-10-13 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955515#comment-14955515
 ] 

Tyler Hobbs edited comment on CASSANDRA-9304 at 10/13/15 8:04 PM:
--

The new code comments are very nice, thank you for putting those in.  The rest 
of the changes look pretty good to me as well.

bq.  I removed StringIO.StringIO, I believe io.StringIO is equivalent and 
preferred to StringIO.StringIO? If so, # noqa doesn't seem to be required, the 
first E402 is at line 117.

It looks like my version of flake8 was old and didn't handle {{try/except}} 
style imports well.  Ignore that comment :)

However, there are a couple of other legit flake8 warnings:

{noformat}
bin/cqlsh|2216| W806 local variable 'ks' is assigned to but never used
bin/cqlsh|2217| W806 local variable 'cf' is assigned to but never used
{noformat}

I also tested this out with a 1M-row insert from stress and was surprised to 
see that I got some timeouts on multiple ranges.  These were 
{{OperationTimedOut}} errors, so it's not immediately clear where the hangup 
is.  I did notice that the current error handling code loses the original 
exception class (which can be useful), so I suggest changing {{err_callback()}} 
from:

{code}
self.pipe.send((token_range, Exception(err.message)))
{code}

to

{code}
self.pipe.send((token_range, Exception(err.__class__.__name__ + " - " + err.message)))
{code}

To avoid the timeouts, I experimented with lowering the page size from 5k to 
1k.  This did resolve the timeouts for me, and also smoothed the throughput.  I 
suggest that we lower the page size (by doing {{session.default_fetch_size = 
N}}) to 1k just to lower the impact on nodes.  (EDIT: maybe a page size option 
for {{COPY TO}} would be a good idea?)

Additionally, we probably want to add some basic timeout recovery.  The 
{{err_callback()}} could perform exponential backoff for a limited number of 
attempts if an {{OperationTimedOut}} is thrown.

To handle coordinator-level timeouts, we could subclass 
{{cassandra.policies.RetryPolicy}} with an {{on_read_timeout()}} that performs 
exponential backoff for a limited number of attempts.  You can pass an instance 
of this to the Cluster constructor: {{Cluster(..., retry_policy=foo)}}.
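Setting driver specifics aside, both the {{err_callback()}} retry and the retry-policy idea boil down to bounded exponential backoff. A sketch of that core logic (all names here are illustrative, not driver API):

```python
import time

def with_backoff(attempt_fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry attempt_fn with exponential backoff: 0.1s, 0.2s, 0.4s, ...

    attempt_fn should raise on timeout; the last exception is re-raised once
    max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return attempt_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Tiny demo: fail twice, then succeed; record the delays instead of sleeping.
delays = []
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('OperationTimedOut (simulated)')
    return 'page'

result = with_backoff(flaky, sleep=delays.append)
print(result, delays)  # page [0.1, 0.2]
```

In the real thing the exception filter would match {{OperationTimedOut}} specifically, and a {{RetryPolicy}} subclass would express the coordinator-side half of this via {{on_read_timeout()}}.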

Sorry for the additional work, I just don't want to end up with a {{COPY TO}} 
that goes fast enough to hit timeouts without any sort of recourse for users.  
That's something that we already have a bit of a problem with for {{COPY FROM}}.



[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-07-20 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633984#comment-14633984
 ] 

David Kua edited comment on CASSANDRA-9304 at 7/20/15 8:44 PM:
---

https://github.com/dkua/cassandra/tree/9304

The branch above contains my improvements to COPY TO, which basically amount to 
figuring out the token ranges from the token ring, starting some subprocesses, 
giving each subprocess a subset of the ranges, and having them perform the 
queries asynchronously and pass each formatted page back to the parent process 
to write to the CSV file.

The resulting CSV is unordered, so changes to the dtests were needed; see 
here: https://github.com/dkua/cassandra-dtest/tree/bulk_export They have also 
been submitted to the dtest repo on GitHub as a PR.

-

A small benchmark was done on a table of 10M rows inside a Vagrant box with 
8 cores. The table was created using the following command: 
`tools/bin/cassandra-stress write n=10M -rate threads=50`.

The original single-process version took about 30 minutes to export the table. 
The multi-process version takes about 7 minutes, and [~brianmhess]'s 
cassandra-unloader takes a little over 2 minutes.



 COPY TO improvements
 

 Key: CASSANDRA-9304
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: David Kua
Priority: Minor
  Labels: cqlsh
 Fix For: 2.1.x


 COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious 
 improvement could be to parallelize reading and writing (write one page of 
 data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

