[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-17 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630977#comment-14630977
 ] 

Sylvain Lebresne commented on CASSANDRA-6492:
-

bq.  I'm just worried about not being able to meet user expectations when we 
first expose a page size in bytes.

I understand, and it's a valid concern. But I don't know, I'm just not a fan of 
hard-coded magic constants. Even if we hide that bytes target from view, we 
might still be really off on our stats and miss it, which can still have 
user-visible consequences, and so I'm not sure this ultimately helps users' 
comprehension of what is going on.

The other aspect is that if we do that (just have a default mode), users for 
whom the default doesn't work are still stuck with providing the page size in 
number of rows, which still requires them to guess-estimate their average row 
size, which is annoying to do when we can probably do a pretty good job of 
guess-estimating server-side automatically.

But I totally agree we should be very clear initially that this is a very soft 
target. And maybe we can experiment a bit to get a better sense of how bad 
that estimate will be in practice. That is, we can try different schemas and 
workloads (even actively try to game the estimate), and if it proves very 
easy to get an estimate that is far off, then I can agree that exposing the 
size is probably not a good idea (though if that's the case, it will also be 
worth asking ourselves if even a default is going to help more than it hurts). 
If, however, it's quite hard to get an estimate that is far from reality, then 
we'll still warn users that it's not precise, but that's probably good enough 
in practice.
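
The kind of experiment described above could be sketched roughly as follows. 
This is only an illustration in Python, not Cassandra code: the function name 
worst_overshoot, the synthetic workloads and the 100,000-byte target are all 
assumptions made up for the example. It derives a page length in rows from 
the average row size and reports the largest page actually produced, as a 
multiple of the target.

```python
from statistics import fmean
from typing import Dict, List


def worst_overshoot(row_sizes: List[int], target_bytes: int) -> float:
    """Largest page size, as a multiple of target_bytes, when the page
    length in rows is derived from the average row size (hypothetical helper)."""
    rows_per_page = max(1, int(target_bytes // fmean(row_sizes)))
    worst = max(
        sum(row_sizes[i:i + rows_per_page])
        for i in range(0, len(row_sizes), rows_per_page)
    )
    return worst / target_bytes


# Three synthetic workloads with the same 1,000-byte average row size.
workloads: Dict[str, List[int]] = {
    "uniform": [1_000] * 1_000,
    "large rows interleaved": [100, 100, 100, 3_700] * 250,
    "large rows clustered": [100] * 750 + [3_700] * 250,
}
for name, sizes in workloads.items():
    print(f"{name}: {worst_overshoot(sizes, 100_000):.2f}x target")
```

With these made-up inputs the first two workloads hit the target exactly, 
while the clustered one produces a page 3.70 times the target, even though 
all three have the same average row size.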

 Have server pick query page size by default
 ---

         Key: CASSANDRA-6492
         URL: https://issues.apache.org/jira/browse/CASSANDRA-6492
     Project: Cassandra
  Issue Type: New Feature
  Components: API
    Reporter: Jonathan Ellis
    Assignee: Benjamin Lerer
    Priority: Minor
      Labels: client-impacting

 We're almost always going to do a better job picking a page size based on 
 sstable stats, than users will guesstimating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629674#comment-14629674
 ] 

Sylvain Lebresne commented on CASSANDRA-6492:
-

bq. With aggregates, we can pretty safely ignore user-specified page sizes

I'm not sure it's that simple. The page size provided by the user _is_ used for 
the internal aggregation, and at least for the {{count}} method, this has been 
the case for a while (we can debate whether that was the best idea but that 
debate won't change the backward compatibility problem) and some people _could_ 
be relying on this. We can, I suppose, assume that we will never make a worse 
choice of page size than users and thus that this won't ever have any visible 
impact for users (either by making aggregates slower due to a page size too 
small, or by OOMing the server due to a page size too large), but that's a 
slightly dangerous assumption imo.

Or we could also decide that picking the page size ourselves will be better 
than the status quo often enough that it's worth the risk of breaking a few 
users. But then we touch another problem: our internal stats _only_ give us an 
estimate of the average row size. To pick a page size from that, you have to 
decide what is a reasonable size target for a page. We can certainly do 
experiments to find out what a good default is, but what the right choice is 
ultimately depends on factors like hardware, typical workload on the cluster, 
etc.

Which brings me back to Piotr's comment above: I'm not sure we should treat 
this issue as picking a page size out of thin air, but rather as recognizing 
that in most cases the proper way to pick the page size is in bytes rather than 
in number of rows, and that we should provide that option.

In summary, I'm not completely convinced it's wise to provide this for 
aggregates with a hardcoded target page size in bytes and no way to override it.




[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629882#comment-14629882
 ] 

Tyler Hobbs commented on CASSANDRA-6492:


While I do think that automatically picking the page size based on internal 
metrics will be safer/better for the vast majority of users, I'll agree that 
there has to be _some_ way for the user to override that in case it's 
drastically wrong.  I suppose that means we shouldn't switch aggregates to an 
automatic page size until we can provide that option.

I also agree that ultimately, using a page size based on bytes instead of rows 
makes sense, but it will require many more internal changes.  Perhaps a good 
first step is to add support for automatic page size selection (by passing -1 
for the page size, or something similar), which we can internally implement 
using a row-based page size.  Later, we can convert the internal page size to 
be byte-based. If that proves to be safe and effective, we can take the last 
step of providing a bytes-based page size to users.
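
As a rough illustration of that first step, here is a minimal Python sketch, 
not Cassandra code. The sentinel name AUTO_PAGE_SIZE, the function 
resolve_page_size and the 64 KiB target are assumptions invented for the 
example; the thread only suggests "-1 or something similar" and does not fix 
a target.

```python
AUTO_PAGE_SIZE = -1  # hypothetical sentinel meaning "let the server choose"
DEFAULT_TARGET_BYTES = 64 * 1024  # assumed target, purely illustrative


def resolve_page_size(requested_rows: int, avg_row_bytes: float) -> int:
    """Return the row-based page size the server would use internally.

    avg_row_bytes stands in for the server's estimated average row size.
    """
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    if requested_rows == AUTO_PAGE_SIZE:
        # Automatic mode: derive a row count from the estimated row size,
        # never going below one row so that paging still makes progress.
        return max(1, int(DEFAULT_TARGET_BYTES // avg_row_bytes))
    if requested_rows <= 0:
        raise ValueError("page size must be positive or AUTO_PAGE_SIZE")
    # Explicit mode: the user's row count is honoured unchanged.
    return requested_rows


print(resolve_page_size(AUTO_PAGE_SIZE, 512.0))  # 128
print(resolve_page_size(100, 512.0))             # 100
```

Keeping the explicit row count untouched is what preserves the override that 
users need when the automatic choice is drastically wrong.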



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629905#comment-14629905
 ] 

Sylvain Lebresne commented on CASSANDRA-6492:
-

bq. but it will require many more internal changes

To be clear, I wasn't suggesting we change all the internal paging to use 
bytes. Just that we add the option to the native protocol so we can pass the 
page size either in number of rows, or as a target size in bytes, and that in 
the latter case we'd use the internal metrics to translate that target into a 
number of rows. We'd obviously be upfront that when the page size is passed 
in bytes, it's just a rough target rather than a guaranteed size.

We can obviously later change all the internals to make that latter option more 
precise, but that's much more long term.
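
The translation itself is small. A minimal Python sketch, assuming the server 
can supply an estimated average row size (the function name and parameters 
are hypothetical and this is not Cassandra code):

```python
def rows_for_byte_target(target_bytes: int, avg_row_bytes: float) -> int:
    """Translate a soft page-size target in bytes into a row count.

    avg_row_bytes stands in for the server's estimated average row size
    (hypothetical input; how the real estimate is obtained is not shown).
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    # Always return at least one row so that paging makes progress.
    return max(1, int(target_bytes // avg_row_bytes))


print(rows_for_byte_target(1_000_000, 250.0))  # 4000
print(rows_for_byte_target(100, 5_000.0))      # 1
```

The second call shows why the target is only soft: a single row larger than 
the target still has to be returned, so the page exceeds it.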



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629912#comment-14629912
 ] 

Tyler Hobbs commented on CASSANDRA-6492:


Ah, I see.  The main advantage of using a bytes-based page size is that it 
handles highly variable row sizes more safely and optimally.  If we translate a 
byte-based page size into a row-based one using internal metrics, we lose most 
of those advantages.
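
To make the difference concrete, here is a small Python sketch contrasting 
the two approaches. It is an illustration only: the helper names and the 
synthetic row sizes are assumptions, and neither function reflects how 
Cassandra pages internally.

```python
from typing import Iterable, Iterator, List


def pages_by_rows(row_sizes: List[int], rows_per_page: int) -> Iterator[List[int]]:
    """Split rows into pages holding a fixed number of rows."""
    for start in range(0, len(row_sizes), rows_per_page):
        yield row_sizes[start:start + rows_per_page]


def pages_by_bytes(row_sizes: Iterable[int], target_bytes: int) -> Iterator[List[int]]:
    """Close a page as soon as its cumulative size reaches target_bytes."""
    page: List[int] = []
    used = 0
    for size in row_sizes:
        page.append(size)
        used += size
        if used >= target_bytes:
            yield page
            page, used = [], 0
    if page:
        yield page


# 90 small rows followed by 10 very large ones: the average is 1,090 bytes.
rows = [100] * 90 + [10_000] * 10
target = 10_900  # equals 10 rows at the average size
rows_per_page = max(1, target // (sum(rows) // len(rows)))

largest_row_page = max(sum(p) for p in pages_by_rows(rows, rows_per_page))
largest_byte_page = max(sum(p) for p in pages_by_bytes(rows, target))
print(rows_per_page, largest_row_page, largest_byte_page)  # 10 100000 20000
```

With this input the row count derived from the average yields a page of 
100,000 bytes, roughly nine times the target, while the byte-bounded pager 
never exceeds the target by more than the size of one row.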



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629963#comment-14629963
 ] 

Tyler Hobbs commented on CASSANDRA-6492:


bq. I don't understand, how is that different from your "Perhaps a good first 
step is to add support for automatic page size selection"? What did you have in 
mind for that? Because the only idea I had to do that from the internal metrics 
would be to use the metrics to get an estimated average row size, pick some 
presumably hard-coded bytes size target for a page, and compute the actual page 
size in rows from that.

Sorry, I should have been more clear.  That _is_ what I envisioned for 
automatic page size selection.  It's not optimal there (due to highly variable 
row sizes), but it's basically the server making a best-effort attempt, and we 
haven't really made any sort of contract with the user. However, I don't think 
it's as good an idea if we expose that as a page-size-in-bytes option to 
the user.  If the user requests a page size of 1MB but we end up reading 50MB 
due to abnormally large rows, that seems like bad behavior.  Maybe if we 
present it as only a very soft target for now, that's okay, but I'm mostly 
worried about not matching user expectations.  With internal paging for 
aggregates, there are no user expectations (aside from not OOMing), so it 
doesn't matter as much if we're off from our target.

bq. Or to put it another way, having the server pick a default is not the 
problem we're trying to fix. The problem we're trying to fix is that to pick a 
proper page size, you currently have to guess-estimate the average size of your 
rows, but we can do a better guess-estimation server side and that's what we 
should provide here.

I think we're trying to solve both.  For aggregates, users may not even be 
aware that the page size is affecting how the aggregate is handled internally, 
and that's especially problematic for cqlsh, where the default page size is 100.

bq. I think we're in agreement that the no-guess-estimate solution is a lot 
more involved.

Yes.

bq. And one of the bonuses of directly modifying the protocol to allow a page 
size target in bytes (rather than only providing a default mode with hard-coded 
target server side) is that once we do implement the more involved 
change-the-internals solution, we'll have no additional user-visible change to 
make; things will just get auto-magically better and safer.

That does sound like a nice property, I'm just worried about not being able to 
meet user expectations when we first expose a page size in bytes.



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-16 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629939#comment-14629939
 ] 

Sylvain Lebresne commented on CASSANDRA-6492:
-

bq. If we translate a byte-based page size into a row-based one using internal 
metrics, we lose most of those advantages.

I don't understand, how is that different from your "Perhaps a good first step 
is to add support for automatic page size selection"? What did you have in mind 
for that? Because the only idea I had to do that from the internal metrics 
would be to use the metrics to get an estimated average row size, pick some 
presumably hard-coded bytes size target for a page, and compute the actual page 
size in rows from that. In which case, I'm saying that instead of hard-coding 
that target, and since we'll need a modification to the protocol anyway, let's 
allow the user to provide that target. It's more flexible than just having the 
options of a page size in rows or some default.

Or to put it another way, having the server pick a default is not the problem 
we're trying to fix. The problem we're trying to fix is that to pick a proper 
page size, you currently have to guess-estimate the average size of your rows, 
but we can do a better guess-estimation server side and that's what we should 
provide here. Of course it's still imperfect, but I think we're in agreement 
that the no-guess-estimate solution is a lot more involved.

And one of the bonuses of directly modifying the protocol to allow a page size 
target in bytes (rather than only providing a default mode with hard-coded 
target server side) is that once we do implement the more involved 
change-the-internals solution, we'll have no additional user-visible change to 
make; things will just get auto-magically better and safer.



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-15 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628099#comment-14628099
 ] 

Tyler Hobbs commented on CASSANDRA-6492:


As discussed in CASSANDRA-9802, this definitely makes sense for the internal 
page size that we use for computing aggregates.  However, automatically 
selecting the page size for non-aggregate queries is a different matter, and I 
don't think we should tie the two together.



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-15 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628213#comment-14628213
 ] 

Sylvain Lebresne commented on CASSANDRA-6492:
-

bq. However, automatically selecting the page size for non-aggregate queries is 
a different matter

How is that fundamentally different? The first reason for the page size is to 
make sure you never load too much data at once and OOM the server. For both 
aggregate and non-aggregate queries, that's going to depend on the size of your 
data in a similar way.

Don't get me wrong though: we will probably have to separate the 2 concepts at 
some point, when we add {{GROUP BY}}. But let's not get ahead of ourselves. I 
think that as a first step, using the data stats to pick a default page size 
for both kinds of queries is going to be an improvement over the status quo, 
and so I think we should start there.



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2015-07-15 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628304#comment-14628304
 ] 

Tyler Hobbs commented on CASSANDRA-6492:


The difference is mostly in the amount of work and driver-level support.

With aggregates, we can pretty safely ignore user-specified page sizes because 
it only affects Cassandra internally and doesn't change how the results are 
returned.

For non-aggregate queries, we probably need to make some protocol changes, such 
as using a page size of -1 to indicate that the server should select the page 
size.  I haven't thought through exactly how drivers would need to handle this 
in a backwards-compatible-friendly way, but it seems like the issue is more 
complex than for aggregates alone.  Splitting out the work for aggregates could 
let us commit that simpler part much sooner, and suboptimal aggregate page 
sizes are what motivated us to get this done in the first place.



[jira] [Commented] (CASSANDRA-6492) Have server pick query page size by default

2014-06-05 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018768#comment-14018768
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6492:
---

It would be nice to offer a way for the user to pick the page size in bytes.

 Have server pick query page size by default
 ---

 Key: CASSANDRA-6492
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6492
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Jonathan Ellis
Assignee: Sylvain Lebresne
Priority: Minor

 We're almost always going to do a better job picking a page size based on 
 sstable stats, than users will guesstimating.



--
This message was sent by Atlassian JIRA
(v6.2#6252)