[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857930#comment-13857930 ]

Lars Bohl commented on CASSANDRA-6268:

I built from trunk today and restarted the cluster using the generated build/dist folder in place of the 2.0.3 tarball. There were still over 5 map tasks in hadoop, the same number as before. Maybe some setting in cassandra.yaml needs to change, or some setting in cascading-cassandra source tap?

Poor performance of Hadoop if any DC is using VNodes

Key: CASSANDRA-6268
URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
Project: Cassandra
Issue Type: Improvement
Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
Fix For: 1.2.13, 2.0.4
Attachments: 6268-src-1.2.txt, 6268-src-2.0.txt, 6268-thrift-1.2.txt, 6268-thrift-2.0.txt

Some customers are complaining about huge number of splits in Hadoop caused by VNodes. Disabling vnodes only in Hadoop DC does not fix it. Splits are generated from the results of describe_ring, which returns a huge number of ranges anyways, and doesn't take into account that there will be huge number of consecutive ranges residing on the nodes we'd like the M/R job to be run.

The proposed fix:
1. allows for specifying the DC(s) the Hadoop job should be run in (in DSE - defaults to all Hadoop DCs)
2. merges consecutive ranges before generating Hadoop splits, so we don't have artificial range splitting caused by vnodes in the other DCs

For non-DSE users this feature is turned off by default and doesn't change the old behaviour.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
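For readers following the thread, the second point of the proposed fix (merging consecutive ranges) can be illustrated with a small sketch. This is not the code from the attached patches. `TokenRange`, `merge_consecutive` and the node names are hypothetical, tokens are plain integers, and ring wrap-around is ignored. The idea shown is only that adjacent ranges collapse into one once replicas outside the chosen DC are disregarded.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class TokenRange(NamedTuple):
    """Hypothetical stand-in for one entry of a describe_ring-style result."""
    start: int
    end: int
    endpoints: FrozenSet[str]


def merge_consecutive(ranges: Iterable[TokenRange],
                      dc_nodes: FrozenSet[str]) -> List[TokenRange]:
    """Merge adjacent ranges that have the same replicas inside the chosen DC.

    Assumes `ranges` is already sorted in ring order and does not wrap around.
    """
    merged: List[TokenRange] = []
    for r in ranges:
        # Ignore replicas that live in other DCs; they are what fragments the ring.
        local = r.endpoints & dc_nodes
        prev = merged[-1] if merged else None
        if prev is not None and prev.end == r.start and prev.endpoints == local:
            # Same local replicas and contiguous tokens: extend the previous range.
            merged[-1] = prev._replace(end=r.end)
        else:
            merged.append(TokenRange(r.start, r.end, local))
    return merged
```

In this toy model, four ranges that differ only in their replicas from a vnode-enabled DC shrink to two once the job is restricted to the Hadoop DC's nodes, which is the effect the description is after.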
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840857#comment-13840857 ]

Jeremiah Jordan commented on CASSANDRA-6268:

[~jbellis] we good to commit this now?
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840883#comment-13840883 ]

Jonathan Ellis commented on CASSANDRA-6268:

Unless I'm missing something the 2.0 patch does not bump the Thrift version to 19.39.0.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840899#comment-13840899 ]

Jeremiah Jordan commented on CASSANDRA-6268:

[~pkolaczk] Looks like interface/thrift/gen-java/org/apache/cassandra/thrift/Constants.java is missing from the thrift-2.0 patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821351#comment-13821351 ]

Piotr Kołaczkowski commented on CASSANDRA-6268:

Updated to 19.36.2 and 19.39.0
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819043#comment-13819043 ]

Jonathan Ellis commented on CASSANDRA-6268:

So I'm back to bumping to 19.36.2 and 19.39.0 as the best option.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819049#comment-13819049 ]

Jeremiah Jordan commented on CASSANDRA-6268:

bq. So I'm back to bumping to 19.36.2 and 19.39.0 as the best option.

+1 to those. I think those are the least likely to cause issues. And if we change something for 2.1, we should bump the version by 3 or something, to leave some extra numbers for 2.0.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817668#comment-13817668 ]

Jonathan Ellis commented on CASSANDRA-6268:

So, hmm. What do we do about the Thrift version? We should bump 1.2 version to 19.37, but we've already used 19.37 in 2.0.something. At least I assume we have because it's currently 19.38.

We could
# freeze thrift at 19.36.1 and 19.38.0, respectively
# bump the patch version even though this is not a bugfix
# see if the 19.37 release was actually public, and if not, go ahead and reuse that
# just commit to 2.0 and leave 1.2 alone
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817675#comment-13817675 ]

Jonathan Ellis commented on CASSANDRA-6268:

Sorry, Piotr; can you rebuild the Thrift patches w/ that change?
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817673#comment-13817673 ]

Jonathan Ellis commented on CASSANDRA-6268:

Looks like we went with bump patch in 1.2, minor in 2.0 for CASSANDRA-6202.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817702#comment-13817702 ]

Brandon Williams commented on CASSANDRA-6268:

Aleksey pointed out we can use the nuclear option and just bump them both above the highest one: 1.2 to 39, and 2.0 to 41 (to avoid having to use this option again.)
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817781#comment-13817781 ]

Jonathan Ellis commented on CASSANDRA-6268:

Isn't it confusing to have 1.2.x a higher version than 2.0.y?
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817789#comment-13817789 ]

Jeremiah Jordan commented on CASSANDRA-6268:

So why not just 19.36.2 and 19.39.0?
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817811#comment-13817811 ]

Brandon Williams commented on CASSANDRA-6268:

bq. Isn't it confusing to have 1.2.x a higher version than 2.0.y?

What? 1.2 would be 39, and 2.0 would be 41. 1.2 is unlikely to ever get a new feature that 2.0 wouldn't, so that's fairly safe.

bq. So why not just 19.36.2 and 19.39.0

Because technically, api-wise, this isn't a bugfix. We bent the rule slightly on CASSANDRA-6202 to avoid this conflict, but I like Aleksey's idea better.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817824#comment-13817824 ]

Jonathan Ellis commented on CASSANDRA-6268:

1.2.12 (39) would be higher than 2.0.0 (37), which is semantically incorrect since it implies that 1.2.12 supports a superset of what 2.0.0 does.
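The ordering objection can be checked mechanically. The sketch below is only an illustration, not project code. It assumes Thrift API versions compare component by component as major.minor.patch, and it takes the 2.0 line's versions (19.37 for 2.0.0, 19.38.0 current) from the comments in this thread. The helper names `parse` and `implies_superset` are hypothetical.

```python
from typing import List, Tuple

# Thrift API versions already used on the 2.0 line, as mentioned in this thread.
SHIPPED_2_0 = ["19.37.0", "19.38.0"]


def parse(version: str) -> Tuple[int, ...]:
    """Turn '19.36.2' into (19, 36, 2); missing components count as 0."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def implies_superset(candidate_1_2: str, shipped_2_0: List[str]) -> bool:
    """True if a 1.2 candidate would sort above some version already used by 2.0."""
    return any(parse(candidate_1_2) > parse(v) for v in shipped_2_0)


# "Nuclear option": 1.2 at 19.39 sorts above 2.0.0's 19.37.
nuclear_conflicts = implies_superset("19.39", SHIPPED_2_0)
# Patch bump: 19.36.2 stays below everything shipped on the 2.0 line.
patch_bump_conflicts = implies_superset("19.36.2", SHIPPED_2_0)
```

Under these assumptions the 19.39 candidate for 1.2 conflicts with the 2.0 line while 19.36.2 does not, which matches the objection raised above.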
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817836#comment-13817836 ]

Brandon Williams commented on CASSANDRA-6268:

Ah, I see. Well, shit.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815758#comment-13815758 ]

Piotr Kołaczkowski commented on CASSANDRA-6268:

Ok. I also generate a separate diff for thrift changes.

Attachments: 6268-2.txt, 6268-cassandra-2.0.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815908#comment-13815908 ]

Piotr Kołaczkowski commented on CASSANDRA-6268:

Ok, done - separate thrift / source patches attached. I double checked the versions of thrift were right: 0.7.0 for 1.2 branch and 0.9.1 for 2.0 branch.
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814832#comment-13814832 ]

Piotr Kołaczkowski commented on CASSANDRA-6268:

There was a bug in the patch that caused describe_local_ring to be swapped with describe_ring. Attached fixed patch.

Attachments: 6268-2.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815270#comment-13815270 ]

Jonathan Ellis commented on CASSANDRA-6268:

Should I be worried that the 2.0 patch is half the size of the other?

Attachments: 6268-2.txt, 6268-cassandra-2.0.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815340#comment-13815340 ]

Piotr Kołaczkowski commented on CASSANDRA-6268:

Hmm, most of it are changes generated by thrift. But let me double check that. Maybe I messed up something.

Attachments: 6268-2.txt, 6268-cassandra-2.0.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815352#comment-13815352 ] Jeremiah Jordan commented on CASSANDRA-6268: Double-check that you are using the right version of Thrift for each one. A bunch of the stuff in the original patch looks like formatting changes. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268-2.txt, 6268-cassandra-2.0.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813224#comment-13813224 ] Jeremiah Jordan commented on CASSANDRA-6268: Something else here: getLocation in the record readers doesn't take the DC into account: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java#L187 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hadoop/cql3/CqlPagingRecordReader.java#L209 So even after we get the splits right, we need to make sure that we connect to the local node. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811914#comment-13811914 ] Piotr Kołaczkowski commented on CASSANDRA-6268: [~alexliu68] Can you provide steps to reproduce? I did test it with another DC configured with VNodes and didn't observe what you're saying. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811788#comment-13811788 ] Alex Liu commented on CASSANDRA-6268: We may need to merge the ranges; there are still many small ranges returned by describe_local_ring if another DC is configured with vnodes. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268.txt
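The range merging discussed in this comment can be illustrated with a minimal sketch. It is not the code from the attached patches: `RingRange` and `RangeMerger` are hypothetical names, tokens are plain `long` values, and the input is assumed to be sorted by start token with the wrap-around range ignored. Two neighbouring ranges are merged only when they touch and are served by the same set of endpoints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one ring entry: the range (start, end] and its replicas. */
final class RingRange {
    final long start;
    final long end;
    final Set<String> endpoints;

    RingRange(long start, long end, Set<String> endpoints) {
        this.start = start;
        this.end = end;
        this.endpoints = endpoints;
    }

    static RingRange of(long start, long end, String... endpoints) {
        return new RingRange(start, end, new HashSet<String>(Arrays.asList(endpoints)));
    }
}

final class RangeMerger {
    /**
     * Merges neighbouring ranges that touch (previous end == next start) and
     * share the same endpoint set. The input must be sorted by start token;
     * the wrap-around range is not treated specially in this sketch.
     */
    static List<RingRange> mergeConsecutive(List<RingRange> sorted) {
        List<RingRange> merged = new ArrayList<RingRange>();
        for (RingRange next : sorted) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0) {
                RingRange last = merged.get(lastIndex);
                if (last.end == next.start && last.endpoints.equals(next.endpoints)) {
                    // Extend the previous range instead of emitting a new one.
                    merged.set(lastIndex, new RingRange(last.start, next.end, last.endpoints));
                    continue;
                }
            }
            merged.add(next);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<RingRange> ring = Arrays.asList(
                RingRange.of(0, 10, "a"), RingRange.of(10, 20, "a"),
                RingRange.of(20, 30, "b"), RingRange.of(30, 40, "b"));
        System.out.println(ring.size() + " ranges merged into " + mergeConsecutive(ring).size());
    }
}
```

Comparing whole endpoint sets is the conservative choice here; a variant that first drops the replicas outside the target DC would merge more aggressively.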
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810571#comment-13810571 ] Jonathan Ellis commented on CASSANDRA-6268: LGTM. Please provide a patch against 2.0 as well. (Note that the Thrift version changes from 0.7 to 0.9.1.) Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809224#comment-13809224 ] Piotr Kołaczkowski commented on CASSANDRA-6268: Attached a new patch. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 6268.txt
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808368#comment-13808368 ] Jonathan Ellis commented on CASSANDRA-6268: --- We closed CASSANDRA-6124 in favor of adding LOCAL_ONE; I can't think of a case where you'd want to span more than one DC but less than all. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch Some customers are complaining about huge number of splits in Hadoop caused by VNodes. Disabling vnodes only in Hadoop DC does not fix it, because splits are generated from the results of describe_ring, which returns a huge number of ranges. The proposed fix: - allows for specifying the DCs the Hadoop job should be run - merges the consecutive ranges before generating Hadoop splits, so we don't have artificial range splitting caused by vnodes in the other DCs -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808381#comment-13808381 ] Piotr Kołaczkowski commented on CASSANDRA-6268: I know, but LOCAL_ONE does not fix the problem. Actually, the DC restriction is there so that we don't go to the DCs with vnodes enabled, as their ranges are split. If I could get the name of the local DC from the place where splits are generated, I could do it without this setting. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808383#comment-13808383 ] Jeremy Hanna commented on CASSANDRA-6268: The problem is that describe_ring will still include ranges from all datacenters. So even when isolating the queries to a single non-vnode DC, you can still get crazy split sizes based on the ranges from the other datacenters. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
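To make the effect described in this comment concrete, here is a small, self-contained illustration of how tokens from a vnode DC splinter the ring that a non-vnode DC sees. It is only a sketch under stated assumptions: `RingSplinterDemo` is a hypothetical name, the tokens are synthetic and evenly spaced (real vnode tokens are chosen randomly), and the ring is modelled as one range per distinct token.

```java
import java.util.TreeSet;

final class RingSplinterDemo {
    /** Adds nodes * tokensPerNode evenly spaced synthetic tokens, shifted by offset. */
    static void addTokens(TreeSet<Long> ring, int nodes, int tokensPerNode, long offset) {
        int count = nodes * tokensPerNode;
        long step = 1_000_000L / count;
        for (int i = 0; i < count; i++) {
            ring.add(offset + i * step);
        }
    }

    /** Every distinct token closes exactly one range, so ranges == tokens. */
    static int rangeCount(TreeSet<Long> ring) {
        return ring.size();
    }

    public static void main(String[] args) {
        TreeSet<Long> ring = new TreeSet<Long>();
        addTokens(ring, 3, 1, 0L);      // Hadoop DC: 3 nodes, single token each
        System.out.println("Hadoop DC only: " + rangeCount(ring) + " ranges");
        addTokens(ring, 3, 256, 1L);    // other DC: 3 nodes, 256 vnodes each
        System.out.println("With vnode DC:  " + rangeCount(ring) + " ranges");
    }
}
```

In this synthetic setup the three single-token nodes alone give three ranges, while adding the vnode DC raises the count by 768 even though the Hadoop DC itself did not change.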
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808396#comment-13808396 ] Jonathan Ellis commented on CASSANDRA-6268: So wouldn't the right fix be to address that root cause? Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808412#comment-13808412 ] Piotr Kołaczkowski commented on CASSANDRA-6268: Right, but where to get the DC name from? In order to merge ranges, I need to know in which DC. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808433#comment-13808433 ] Brandon Williams commented on CASSANDRA-6268: We could make something like describe_local_ring, which would only take into account the node's local DC. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808437#comment-13808437 ] Jonathan Ellis commented on CASSANDRA-6268: Right, that's where I was going. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808439#comment-13808439 ] Piotr Kołaczkowski commented on CASSANDRA-6268: I'm fine with it. I can create another patch. One more detail: now, with LOCAL_ONE, is it guaranteed that a Hadoop job won't go outside the current DC? Or is using LOCAL_ONE only a user setting? Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808449#comment-13808449 ] Jonathan Ellis commented on CASSANDRA-6268: bq. with LOCAL_ONE is it guaranteed that Hadoop job won't go out of current DC Unless you override CL to something else, that's what it means. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808497#comment-13808497 ] Alex Liu commented on CASSANDRA-6268: bq. Are there any valid use cases where we do like to run M/R job in more than one DC (e.g. all)? describe_local_ring would disable that totally. The local DC has at least one replica, so MR in the local DC is sufficient for data locality. describe_local_ring should return a virtual ring of the local DC using those replicas. Poor performance of Hadoop if any DC is using VNodes Key: CASSANDRA-6268 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Piotr Kołaczkowski Assignee: Piotr Kołaczkowski Attachments: 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
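The "virtual ring of the local DC" idea in this comment can be sketched as a simple filtering step. This is an illustration only, not the describe_local_ring implementation: `LocalRingView`, the string range labels, and the endpoint-to-DC map are all hypothetical, and how a real node learns each endpoint's DC (normally via its snitch) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LocalRingView {
    /**
     * For every range, keeps only the replicas located in localDc.
     * Range order is preserved and the input collections are not modified.
     */
    static Map<String, List<String>> localReplicas(Map<String, List<String>> rangeToEndpoints,
                                                   Map<String, String> endpointToDc,
                                                   String localDc) {
        Map<String, List<String>> local = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> entry : rangeToEndpoints.entrySet()) {
            List<String> kept = new ArrayList<String>();
            for (String endpoint : entry.getValue()) {
                if (localDc.equals(endpointToDc.get(endpoint))) {
                    kept.add(endpoint);
                }
            }
            local.put(entry.getKey(), kept);
        }
        return local;
    }
}
```

Once each range lists only local replicas, neighbouring ranges that end up with identical replica sets become candidates for merging, which is what brings the number of splits back down.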