[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991279#comment-14991279 ] Stefania edited comment on CASSANDRA-9304 at 11/5/15 9:11 AM:
--

Thank you for your input.

Regarding version support for Windows, fine for 2.2+, but for completeness I'll point out that the only obstacle left in 2.1 is the name of the file (_cqlsh_ -> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't need to deal with the low-level, platform-specific details. Queues can also be safely used from the callback threads, which was not the case for pipes.

Regarding the problem with the driver, -I haven't tested in 2.2 but I don't think it matters which version since- I verified the problem applies to 2.2 as well. Yesterday I was using the latest cassandra-test driver version; today I used 2.7.2. The column type is the same, {{cassandra.cqltypes.BytesType}}, and the method called from {{recv_result_rows()}} is the same, but {{cls.serialize}} in {{from_binary}} is a lambda in the case that works and the default implementation {{CassandraType.deserialize}} in the case that does not work. I don't know where the lambda comes from, but I noticed there is a cython deserialize for {{BytesType}} in deserializers.pyx. I don't know how cython works, but if this is picked up in the normal case then the problem is again with the way multiprocessing imports modules.
The problem can be solved by adding a deserialize implementation to BytesType, like it's done for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)
 
+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+
 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}

If this is not enough and you want to debug some more [~aholmber], you can use the 2.1 patch attached; I'm still working on the 2.2 merge. You need to generate a table with a blob; I used cassandra-stress. Then run {{COPY TO 'anyfile';}} from cqlsh, and this should result in a Unicode decode error on Windows because the blob is received as a string. If you prefer me to test things for you, that works too.
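To see why returning {{bytearray}} from {{deserialize}} sidesteps the Unicode decode error, here is a minimal stand-alone sketch. {{format_blob}} is a hypothetical stand-in for cqlsh's hex rendering of blob columns, not the driver's actual code:

```python
# Hex-formatting works on a bytearray, while treating the same raw bytes
# as UTF-8 text (what happens when the default deserialize path hands
# COPY TO a plain string) raises the UnicodeDecodeError described above.
import binascii

def format_blob(value):
    # cqlsh-style blob rendering: 0x-prefixed hex of the raw bytes
    return '0x' + binascii.hexlify(bytes(value)).decode('ascii')

raw = b'\xff\xfe\x00\x01'            # arbitrary blob bytes off the wire

print(format_blob(bytearray(raw)))   # -> 0xfffe0001

try:
    raw.decode('utf-8')              # blob mistakenly handled as text
except UnicodeDecodeError as exc:
    print('UnicodeDecodeError:', exc.reason)
```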
[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955515#comment-14955515 ] Tyler Hobbs edited comment on CASSANDRA-9304 at 10/13/15 8:04 PM:
--

The new code comments are very nice, thank you for putting those in. The rest of the changes look pretty good to me as well.

bq. I removed StringIO.StringIO, I believe io.StringIO is equivalent and preferred to StringIO.StringIO? If so, # noqa doesn't seem to be required, the first E402 is at line 117.

It looks like my version of flake8 was old and didn't handle {{try/except}} style imports well. Ignore that comment :) However, there are a couple of other legit flake8 warnings:

{noformat}
bin/cqlsh|2216| W806 local variable 'ks' is assigned to but never used
bin/cqlsh|2217| W806 local variable 'cf' is assigned to but never used
{noformat}

I also tested this out with a 1m row insert from stress and was surprised to see that I got some timeouts on multiple ranges. These were {{OperationTimedOut}} errors, so it's not immediately clear where the hangup is. I did notice that the current error handling code loses the original exception class (which can be useful), so I suggest changing {{err_callback()}} from:

{code}
self.pipe.send((token_range, Exception(err.message)))
{code}

to

{code}
self.pipe.send((token_range, Exception(err.__class__.__name__ + " - " + err.message)))
{code}

To avoid the timeouts, I experimented with lowering the page size from 5k to 1k. This did resolve the timeouts for me, and also smoothed the throughput. I suggest that we lower the page size (by doing {{session.default_fetch_size = N}}) to 1k just to lower the impact on nodes. (EDIT: maybe a page size option for {{COPY TO}} would be a good idea?) Additionally, we probably want to add some basic timeout recovery. The {{err_callback()}} could perform exponential backoff for a limited number of attempts if an {{OperationTimedOut}} is thrown.
To handle coordinator-level timeouts, we could subclass {{cassandra.policies.RetryPolicy}} with an {{on_read_timeout()}} that performs exponential backoff for a limited number of attempts. You can pass an instance of this to the Cluster constructor: {{Cluster(..., retry_policy=foo)}}.

Sorry for the additional work, I just don't want to end up with a {{COPY TO}} that goes fast enough to hit timeouts without any sort of recourse for users. That's something we already have a bit of a problem with for {{COPY FROM}}.
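As a sketch of the client-side half of that suggestion, the backoff bookkeeping around a failed range fetch could look roughly like this. All names here are illustrative (they are not from the patch), and a {{RetryPolicy}} subclass capping {{on_read_timeout()}} retries would complement it on the coordinator side:

```python
# Illustrative sketch: retry a failed token-range fetch a bounded number
# of times, doubling the delay each attempt, and preserve the original
# exception's class name in the error passed back to the parent process.
import time

def backoff_delays(max_attempts=5, base=0.5, cap=30.0):
    """Delays in seconds: base, 2*base, 4*base, ..., capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts)]

def fetch_with_retry(fetch, token_range, max_attempts=5, base=0.5):
    """Call fetch(token_range), sleeping with exponential backoff on error."""
    last_err = None
    for delay in backoff_delays(max_attempts, base):
        try:
            return fetch(token_range)
        except Exception as err:     # e.g. OperationTimedOut
            last_err = err
            time.sleep(delay)
    # Keep the original exception class visible, as suggested above
    raise Exception('%s - %s' % (last_err.__class__.__name__, last_err))
```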
[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984 ] David Kua edited comment on CASSANDRA-9304 at 7/20/15 8:44 PM:
---

https://github.com/dkua/cassandra/tree/9304

The branch above contains my improvements to COPY TO, which basically amount to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and having them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file. The resulting CSV is unordered, so changes to the dtests needed to be made; see here: https://github.com/dkua/cassandra-dtest/tree/bulk_export They have also been submitted to the dtest repo on GitHub as a PR.

A small benchmark was done on a table of 10M rows inside a Vagrant box with 8 cores. The table was created using the command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single-proc version took about 30 minutes to export the table; the multi-proc version takes about 7 minutes. [~brianmhess]'s cassandra-unloader takes a little over 2 minutes.

> COPY TO improvements
>
> Key: CASSANDRA-9304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: David Kua
> Priority: Minor
> Labels: cqlsh
> Fix For: 2.1.x
>
> COPY FROM has gotten a lot of love. COPY TO not so much. One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
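The first step of the approach described in the comment above, deriving per-worker token ranges from the ring, can be sketched as follows. This is a simplified stand-in, not the branch's code: the real implementation reads the ring from driver metadata, and the helper names here are illustrative:

```python
# Illustrative sketch: turn a token ring into (start, end] ranges and
# deal them out round-robin to worker subprocesses.
def token_ranges(ring):
    """Pair each token with its predecessor; i = 0 pairs the largest
    token with the smallest, which covers the wrap-around range."""
    ring = sorted(ring)
    return [(ring[i - 1], ring[i]) for i in range(len(ring))]

def split_among(ranges, num_workers):
    """Give each worker every num_workers-th range."""
    return [ranges[i::num_workers] for i in range(num_workers)]

ranges = token_ranges([10, 0, 5])     # [(10, 0), (0, 5), (5, 10)]
print(split_among(ranges, 2))         # [[(10, 0), (5, 10)], [(0, 5)]]
```

Each worker can then page through its ranges asynchronously and push formatted pages onto a shared queue for the parent to write out, as the comment describes.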