[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630977#comment-14630977 ]

Sylvain Lebresne commented on CASSANDRA-6492:

bq. I'm just worried about not being able to meet user expectations when we first expose a page size in bytes.

I understand, and it's a valid concern. But I don't know, I'm just not a fan of hard-coded magic constants. Even if we hide that byte target from view, we might still be really off on our stats and miss it, which can still have user-visible consequences, so I'm not sure this ultimately helps users' comprehension of what is going on. The other aspect is that if we do that (just have a default mode), users for whom the default doesn't work are still stuck with providing the page size in number of rows, which still requires them to guess-estimate their average row size, which is annoying to do when we can probably do a pretty good job of guess-estimating server-side automatically.

But I totally agree we should be very clear initially that this is a very soft target. And maybe we can experiment a bit to get a better sense of how bad that estimate will be in practice. That is, we can try different schemas and workloads (even actively try to game the estimate), and if it proves very easy to get an estimate that is very off, then I can agree that exposing the size is probably not a good idea (though if that's the case, it will also be worth asking ourselves if even a default is going to help more than it hurts). If, however, it's quite hard to get an estimate that is far from reality, then we'll still warn users that it's not precise, but that's probably good enough in practice.
Have server pick query page size by default
---
Key: CASSANDRA-6492
URL: https://issues.apache.org/jira/browse/CASSANDRA-6492
Project: Cassandra
Issue Type: New Feature
Components: API
Reporter: Jonathan Ellis
Assignee: Benjamin Lerer
Priority: Minor
Labels: client-impacting

We're almost always going to do a better job picking a page size based on sstable stats, than users will guesstimating.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629674#comment-14629674 ]

Sylvain Lebresne commented on CASSANDRA-6492:

bq. With aggregates, we can pretty safely ignore user-specified page sizes

I'm not sure it's that simple. The page size provided by the user _is_ used for the internal aggregation, and at least for the {{count}} method, this has been the case for a while (we can debate whether that was the best idea, but that debate won't change the backward compatibility problem) and some people _could_ be relying on this. We can, I suppose, assume that we will never make a worse choice of page size than users and thus that this won't ever have any visible impact for users (either by making aggregates slower due to a page size that is too small, or by OOMing the server due to a page size that is too large), but that's a slightly dangerous assumption imo. Or we could also decide that picking the page size ourselves will be better than the status quo often enough that it's worth the risk of breaking a few users.

But then we touch another problem: our internal stats _only_ give us an estimate of the average size of rows. To pick a page size from that, you have to decide what is a reasonable size target for a page. We can certainly do experiments to find out what a good default is, but the right choice ultimately depends on factors like hardware, typical workload on the cluster, etc. Which brings me back to Piotr's comment above: I'm not sure we should treat this issue as picking a page size out of thin air, but rather as recognizing that in most cases the proper way to pick the page size is in bytes rather than in number of rows, and that we should provide that option.

In summary, I'm not completely convinced it's wise to provide this for aggregates with a hardcoded target page size in bytes and no way to override it.
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629882#comment-14629882 ]

Tyler Hobbs commented on CASSANDRA-6492:

While I do think that automatically picking the page size based on internal metrics will be safer/better for the vast majority of users, I'll agree that there has to be _some_ way for the user to override that in case it's drastically wrong. I suppose that means we shouldn't switch aggregates to an automatic page size until we can provide that option.

I also agree that ultimately, using a page size based on bytes instead of rows makes sense, but it will require many more internal changes. Perhaps a good first step is to add support for automatic page size selection (by passing -1 for the page size, or something similar), which we can internally implement using a row-based page size. Later, we can convert the internal page size to be byte-based. If that proves to be safe and effective, we can take the last step of providing a bytes-based page size to users.
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629905#comment-14629905 ]

Sylvain Lebresne commented on CASSANDRA-6492:

bq. but it will require many more internal changes

To be clear, I wasn't suggesting we change all the internal paging to use bytes. Just that we add the option to the native protocol so we can pass the page size either in number of rows, or as a target size in bytes, and that in the latter case we'd use the internal metrics to translate that target into a number of rows. We'd obviously be upfront that when the page size is passed in bytes, it's just a rough target rather than a guaranteed size. We can obviously later change all the internals to make that latter option more precise, but that's much more long term.
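To make the proposal above concrete, here is a minimal sketch of a page size that can be expressed either in rows or as a byte target, with the byte target translated into rows from an average row size. The names `PageUnit`, `PageSize` and `effective_rows` are hypothetical and do not correspond to anything in the native protocol or the Cassandra code base; the average row size is assumed to come from server-side table statistics.

```python
from dataclasses import dataclass
from enum import Enum


class PageUnit(Enum):
    """Unit in which the client expresses the page size (hypothetical)."""
    ROWS = "rows"
    BYTES = "bytes"


@dataclass(frozen=True)
class PageSize:
    value: int
    unit: PageUnit


def effective_rows(requested: PageSize, avg_row_bytes: int) -> int:
    """Return the row-count page size the server would use internally.

    A row-based request is used as-is. A byte-based request is only a
    rough target: it is divided by the estimated average row size.
    """
    if requested.value <= 0:
        raise ValueError("page size must be positive")
    if requested.unit is PageUnit.ROWS:
        return requested.value
    # Guard against a zero estimate and never return an empty page.
    return max(1, requested.value // max(1, avg_row_bytes))
```

With an estimated average row of 2 KiB, a 1 MiB byte target becomes 512 rows, while a request for 500 rows is passed through unchanged.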
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629912#comment-14629912 ]

Tyler Hobbs commented on CASSANDRA-6492:

Ah, I see. The main advantage of using a bytes-based page size is that it handles highly variable row sizes more safely and optimally. If we translate a byte-based page size into a row-based one using internal metrics, we lose most of those advantages.
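For contrast, a truly byte-based pager would close a page once the bytes actually read reach the target, instead of counting rows. The following is only a sketch of that idea over a list of row sizes; `pages_by_bytes` is a hypothetical helper and not how Cassandra's internal paging is implemented.

```python
from typing import Iterable, Iterator, List


def pages_by_bytes(row_sizes: Iterable[int], target_bytes: int) -> Iterator[List[int]]:
    """Group rows into pages, closing a page once it reaches target_bytes.

    Each page overshoots the target by at most one row, no matter how
    much the row sizes vary.
    """
    page: List[int] = []
    page_bytes = 0
    for size in row_sizes:
        page.append(size)
        page_bytes += size
        if page_bytes >= target_bytes:
            yield page
            page, page_bytes = [], 0
    if page:
        # Emit the final, possibly short, page.
        yield page
```

An abnormally large row ends up alone on its own page rather than inflating a page that also holds a fixed number of other rows.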
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629963#comment-14629963 ]

Tyler Hobbs commented on CASSANDRA-6492:

bq. I don't understand, how is that different from your "Perhaps a good first step is to add support for automatic page size selection"? What did you have in mind for that? Because the only idea I had to do that from the internal metrics would be to use the metrics to get an estimated average row size, pick some presumably hard-coded byte-size target for a page, and compute the actual page size in rows from that.

Sorry, I should have been more clear. That _is_ what I envisioned for automatic page size selection. It's not optimal there (due to highly variable row sizes), but it's basically the server making a best-effort attempt, and we haven't really made any sort of contract with the user.

However, I don't think it's as good of an idea if we expose that as a page size in bytes option to the user. If the user requests a page size of 1MB but we end up reading 50MB due to abnormally large rows, that seems like bad behavior. Maybe if we present it as only a very soft target for now, that's okay, but I'm mostly worried about not matching user expectations. With internal paging for aggregates, there are no user expectations (aside from not OOMing), so it doesn't matter as much if we're off from our target.

bq. Or to put it another way, having the server pick a default is not the problem we're trying to fix. The problem we're trying to fix is that to pick a proper page size, you currently have to guess-estimate the average size of your rows, but we can do a better guess-estimation server side and that's what we should provide here.

I think we're trying to solve both. For aggregates, users may not even be aware that the page size is affecting how the aggregate is handled internally, and that's especially problematic for cqlsh, where the default page size is 100.

bq. I think we're in agreement that the no-guess-estimate solution is a lot more involved.

Yes.

bq. And one of the bonuses of directly modifying the protocol to allow a page size target in bytes (rather than only providing a default mode with a hard-coded target server side) is that once we do implement the more involved change-the-internals solution, we'll have no additional user-visible change to make; things will just get auto-magically better and safer.

That does sound like a nice property, I'm just worried about not being able to meet user expectations when we first expose a page size in bytes.
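The 1MB versus 50MB concern can be reproduced with simple arithmetic. The sketch below assumes the byte target is translated into a row count using a table-wide average row size, and then measures how many bytes a page actually reads when the queried rows are much larger than that average. All function names and the sample numbers are illustrative assumptions.

```python
from typing import Sequence


def rows_for_target(target_bytes: int, avg_row_bytes: int) -> int:
    """Row-count page size derived from a byte target and an average row size."""
    return max(1, target_bytes // avg_row_bytes)


def bytes_read(row_sizes: Sequence[int], page_rows: int) -> int:
    """Bytes actually read by the first page of page_rows rows."""
    return sum(row_sizes[:page_rows])


def overshoot_factor(row_sizes: Sequence[int], target_bytes: int, avg_row_bytes: int) -> float:
    """How many times larger than the target the first page turns out to be."""
    page_rows = rows_for_target(target_bytes, avg_row_bytes)
    return bytes_read(row_sizes, page_rows) / target_bytes


# Table-wide average is 1 KiB, but the rows being read are 50 KiB each.
one_mib = 1_048_576
large_rows = [51_200] * 2000
print(overshoot_factor(large_rows, one_mib, 1024))  # 50.0
```

The same calculation can be run against different row-size distributions to get a feel for how far off the soft target can be.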
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629939#comment-14629939 ]

Sylvain Lebresne commented on CASSANDRA-6492:

bq. If we translate a byte-based page size into a row-based one using internal metrics, we lose most of those advantages.

I don't understand, how is that different from your "Perhaps a good first step is to add support for automatic page size selection"? What did you have in mind for that? Because the only idea I had to do that from the internal metrics would be to use the metrics to get an estimated average row size, pick some presumably hard-coded byte-size target for a page, and compute the actual page size in rows from that. In which case, I'm saying that instead of hard-coding that target, and since we'll need a modification to the protocol anyway, let's allow the user to provide that target. It's more flexible than just having the options of a page size in rows or some default.

Or to put it another way, having the server pick a default is not the problem we're trying to fix. The problem we're trying to fix is that to pick a proper page size, you currently have to guess-estimate the average size of your rows, but we can do a better guess-estimation server side and that's what we should provide here. Of course it's still imperfect, but I think we're in agreement that the no-guess-estimate solution is a lot more involved. And one of the bonuses of directly modifying the protocol to allow a page size target in bytes (rather than only providing a default mode with a hard-coded target server side) is that once we do implement the more involved change-the-internals solution, we'll have no additional user-visible change to make; things will just get auto-magically better and safer.
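The estimation described in this comment (estimated average row size, a byte-size target, a page size in rows computed from the two) might look roughly like the sketch below. The function name, the clamping bounds and the fallback of 100 rows are assumptions chosen for illustration; none of them come from the ticket or from Cassandra itself.

```python
import math


def rows_per_page(target_page_bytes: int, avg_row_bytes: float,
                  min_rows: int = 1, max_rows: int = 5000,
                  fallback_rows: int = 100) -> int:
    """Translate a soft byte target into a row-count page size.

    avg_row_bytes is assumed to be an estimate taken from table
    statistics; the result is only as good as that estimate.
    """
    if target_page_bytes <= 0:
        raise ValueError("target_page_bytes must be positive")
    if avg_row_bytes <= 0:
        # No usable statistics (for example an empty table).
        return fallback_rows
    estimate = math.floor(target_page_bytes / avg_row_bytes)
    # Clamp so that tiny or huge averages cannot produce absurd pages.
    return max(min_rows, min(max_rows, estimate))
```

Clamping is one way to limit the damage when the average is badly off; it does not make the byte target any more than a soft target.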
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628099#comment-14628099 ]

Tyler Hobbs commented on CASSANDRA-6492:

As discussed in CASSANDRA-9802, this definitely makes sense for the internal page size that we use for computing aggregates. However, automatically selecting the page size for non-aggregate queries is a different matter, and I don't think we should tie the two together.
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628213#comment-14628213 ]

Sylvain Lebresne commented on CASSANDRA-6492:

bq. However, automatically selecting the page size for non-aggregate queries is a different matter

How is that fundamentally different? The first reason for the page size is to make sure you never load too much data at once and OOM the server. For both aggregate and non-aggregate queries, that's going to depend on the size of your data in a similar way.

Don't get me wrong though: we will probably have to separate the 2 concepts at some point, when we add {{GROUP BY}}. But let's not get ahead of ourselves. I think that as a first step, using the data stats to pick a default page size for both kinds of queries is going to be an improvement over the status quo, and so I think we should start there.
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628304#comment-14628304 ]

Tyler Hobbs commented on CASSANDRA-6492:

The difference is mostly in the amount of work and driver-level support. With aggregates, we can pretty safely ignore user-specified page sizes because it only affects Cassandra internally and doesn't change how the results are returned. For non-aggregate queries, we probably need to make some protocol changes, such as using a page size of -1 to indicate that the server should select the page size. I haven't thought through exactly how drivers would need to handle this in a backwards-compatible-friendly way, but it seems like the issue is more complex than for aggregates alone.

Splitting out the work for aggregates could let us commit that simpler part much sooner, and suboptimal aggregate page sizes are what motivated us to get this done in the first place.
[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default
[ https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018768#comment-14018768 ]

Piotr Kołaczkowski commented on CASSANDRA-6492:

Would be nice to offer a way for the user to pick page size in bytes.

Have server pick query page size by default
---
Key: CASSANDRA-6492
URL: https://issues.apache.org/jira/browse/CASSANDRA-6492
Project: Cassandra
Issue Type: New Feature
Components: API
Reporter: Jonathan Ellis
Assignee: Sylvain Lebresne
Priority: Minor

We're almost always going to do a better job picking a page size based on sstable stats, than users will guesstimating.

--
This message was sent by Atlassian JIRA (v6.2#6252)