Re: [DISCUSS] Streaming API in CouchDB 4.0

2020-04-28 Thread Adam Kocoloski
Hi Ilya, 

Initial reaction — there’s a lot to like here. It seems like a pragmatic step 
forward for the current API that handles the corner case of large responses 
while maintaining compatibility for the large majority of API requests that 
don’t exceed this limit.

I think the `limit` parameter might be overloaded, though. If we add this 
pagination capability then `limit` could be used to refer to the size of a 
“page” or to the size of the overall result set. In the current proposal 
there’s no easy way to say “give me 5,000 results in blocks of 100”. Maybe it 
makes sense to add `page_size` as a new parameter and keep `limit` as the total 
number of results returned (i.e., once the `limit` is reached the final 
response will not have a “next” bookmark).
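
For example (a hypothetical request, just to illustrate the split): `GET 
/mydb/_all_docs?limit=5000&page_size=100` would return 100 rows per page with a 
“next” bookmark on each page, and the page that delivers the 5,000th row would 
omit “next” because the overall `limit` has been reached.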

If we do introduce a `page_size` I would not impose a maximum on `limit`. It 
should be possible to use bookmarks to page through an entire database.

To your other questions:

> - `bookmark` vs `token`?

Prefer `bookmark`, since that’s already used for _find and _search.

> - should we prohibit setting other fields when bookmark is set?

As a general rule, yes. I can see an exception around allowing users to say 
whether they want to retrieve the next page of results using the same FDB read 
version as the previous request, and how the server should proceed if that read 
version is now too old. But that’s for later.

> - `previous`/`next`/`first` as href vs token value itself (i.e. `{"previous": 
> "983uiwfjkdsdf", "next": "12343tyekf3", "first": "iekjhfwo034"}`)

I think I would avoid including the scheme and server. I could see including 
the full path instead of just the bookmark, particularly as you’re including 
the “first” bit (which is a nice touch).

A couple of other comments:

> Latter on we would introduce API versioning and deal with `{db}/_changes` and 
> `_all_docs` endpoints. 

I think you mean `_all_dbs` instead of `_all_docs` here, and I agree that’s 
fine to defer.

> - don't use delayed responses when `bookmark` field is provided
> - don't use delayed responses when `limit` query key is specified and when it 
> is below the max limit
> - return 400 when limit query key is specified and it is greater than the max 
> limit
> - return 400 when we stream rows (in case when `limit` query key wasn't 
> specified) and reach max limit

If we add the `page_size` parameter, does this become:

- don't use delayed responses when `bookmark` field is provided
- don't use delayed responses when `page_size` query key is specified and when 
it is below the max
- return 400 when `page_size` query key is specified and it is greater than the 
max

I feel like the discussion of using delayed response or not is just a 
performance optimization, assuming we’re able to identify the right status code 
before streaming. So it could be omitted.

Choosing a default `page_size` requires some thought. I think we want to have a 
default, even though it is a behavior change. At the same time, maybe we can 
minimize the number of cases where pagination needs to show up with something 
like the following (a rough sketch of this selection logic follows the list):

- user supplies `limit` less than max page size -> set `page_size` = `limit`
- user omits `limit` or sets one that exceeds max page size -> set `page_size` = 
~100
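
To make that concrete, here is a minimal sketch of the selection logic (Python, 
purely illustrative — `max_page_size` and `default_page_size` are assumed names, 
not actual configuration keys):

```
# Illustrative only: pick an effective page size from a user-supplied `limit`,
# assuming a configured maximum page size and a small default.
def effective_page_size(limit=None, max_page_size=5000, default_page_size=100):
    if limit is not None and limit <= max_page_size:
        # A small overall limit fits in a single page, so pagination never shows up.
        return limit
    # No limit, or a limit larger than one page: paginate with the default size.
    return default_page_size
```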

Regards, Adam

> On Apr 28, 2020, at 7:56 AM, Ilya Khlopotov  wrote:
> 
> Hello, 
> 
> I would like to introduce second proposal.
> 
> 1) Add new optional query field called `bookmark` (or `token`) to following 
> endpoints
>  - {db}/_all_docs
>  - {db}/_all_docs/queries
>  - _dbs_info
>  - {db}/_design/{ddoc}/_view/{view}
>  - {db}/_design/{ddoc}/_view/{view}/queries
> 2) Add following additional fields into response:
>   ```
>   "first": {
>       "href": "https://myserver.com/myddb/_all_docs?limit=50=true"
>   },
>   "previous": {
>       "href": "https://myserver.com/myddb/_all_docs?bookmark=983uiwfjkdsdf"
>   },
>   "next": {
>       "href": "https://myserver.com/myddb/_all_docs?bookmark=12343tyekf3"
>   },
>   ```
> 3) Implement per-endpoint configurable max limits 
>   ```
>   [request_limits]
>  _all_docs = 5000
>  _all_docs/queries = 5000
>  _all_dbs = 5000
>  _dbs_info = 5000
>  _view = 2500
>  _view/queries = 2500
>  _find = 2500
>  ```
> 4) Implement following semantics:
>   - The bookmark would be opaque token and would include information needed 
> to ensure proper pagination without the need to repeat initial parameters of 
> the request. In fact we might prohibit setting additional parameters when 
> bookmark query field is specified.
>   - don't use delayed responses when `bookmark` field is provided
>   - don't use delayed responses when `limit` query key is specified and when 
> it is below the max limit
>   - return 400 when limit query key is specified and it is greater than the 
> max limit
>   - return 400 when we stream rows (in case when `limit` query key wasn't 
> specified) and reach max limit
>   - the `previous`/`next`/`first` keys are optional and we omit them 

Re: [DISCUSS] Streaming API in CouchDB 4.0

2020-04-28 Thread Nick Vatamaniuc
The `{"href": "https://myserver.com/myddb/_all_docs?limit=50=true"}`
might be tricky if requests have to go through a few reverse proxies
before reaching CouchDB. CouchDB might not know its own "external"
domain, so to speak. I have used X-Forwarded-For for this exact
pattern before, and there is even an RFC for it now
(https://tools.ietf.org/html/rfc7239), but it can be rather fragile.
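
For reference, RFC 7239 defines a single `Forwarded` header that can carry the 
original client, host, and protocol; a request reaching CouchDB through a proxy 
might carry something like this (hypothetical values):

```
Forwarded: for=203.0.113.7;host=couchdb.example.com;proto=https
```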

On `bookmark` vs `token`, I like bookmark, cursor, or iterator/iterate
better than token, I think. Token is used in authentication /
authorization, so it might get confusing.

On the `previous`/`next`/`first` question, I wonder whether `previous` makes
sense if we already know the direction they are iterating toward based on
the initial request. Or do you propose that the bookmark doesn't
encode the direction anymore?

Thinking more about this and what's included in the bookmark, it
doesn't even need to include the limit. A user could change their
limit from request to request even when using a bookmark. If we don't
include the direction of iteration, we are down to just the start_key
that actually keeps changing in the bookmark, and it would be kinda
silly to b64 encode the start key by itself in an opaque blob? So,
going with that, we could just have a matching parameter like
inclusive_start=false (to match inclusive_end=) and let users pick off
the key from the result and use that as the start_key with
inclusive_start=false so they continue iterating from where they
stopped.
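
A rough sketch of what client-side paging could look like under that scheme 
(Python; `inclusive_start` is the proposed parameter above, not an existing 
option, and the host/database names are placeholders):

```
import json
import requests

def iterate_all_docs(base_url, page_size=100):
    # Hypothetical paging loop: fetch a page, then feed the last emitted key
    # back as start_key together with the proposed inclusive_start=false flag.
    params = {"limit": page_size}
    while True:
        resp = requests.get(base_url + "/_all_docs", params=params)
        resp.raise_for_status()
        rows = resp.json()["rows"]
        yield from rows
        if len(rows) < page_size:
            break  # short page: nothing left to read
        params = {
            "limit": page_size,
            "start_key": json.dumps(rows[-1]["key"]),  # keys are JSON-encoded
            "inclusive_start": "false",  # proposed flag from this thread
        }

# Usage: for row in iterate_all_docs("http://127.0.0.1:5984/mydb"): ...
```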

Also I think _dbs_info emits an array as well. We just recently added
a GET method for it to match _list_dbs.
On Tue, Apr 28, 2020 at 12:38 PM Paul Davis  wrote:
>
> Seems reasonable to me. I'd agree that setting query string parameters
> with a bookmark should be rejected. I was also going to suggest
> eliding the href member. In the examples I've seen those are usually
> structured as something like:
>
> "links": {
> "previous": "/path/and/qs=foo",
> "next": "/path/and/qs=bar"
> }
>
> On Tue, Apr 28, 2020 at 6:56 AM Ilya Khlopotov  wrote:
> >
> > Hello,
> >
> > I would like to introduce second proposal.
> >
> > 1) Add new optional query field called `bookmark` (or `token`) to following 
> > endpoints
> >   - {db}/_all_docs
> >   - {db}/_all_docs/queries
> >   - _dbs_info
> >   - {db}/_design/{ddoc}/_view/{view}
> >   - {db}/_design/{ddoc}/_view/{view}/queries
> > 2) Add following additional fields into response:
> >```
> >    "first": {
> >        "href": "https://myserver.com/myddb/_all_docs?limit=50=true"
> >    },
> >    "previous": {
> >        "href": "https://myserver.com/myddb/_all_docs?bookmark=983uiwfjkdsdf"
> >    },
> >    "next": {
> >        "href": "https://myserver.com/myddb/_all_docs?bookmark=12343tyekf3"
> >    },
> >    ```
> > 3) Implement per-endpoint configurable max limits
> >```
> >[request_limits]
> >   _all_docs = 5000
> >   _all_docs/queries = 5000
> >   _all_dbs = 5000
> >   _dbs_info = 5000
> >   _view = 2500
> >   _view/queries = 2500
> >   _find = 2500
> >   ```
> > 4) Implement following semantics:
> >- The bookmark would be opaque token and would include information 
> > needed to ensure proper pagination without the need to repeat initial 
> > parameters of the request. In fact we might prohibit setting additional 
> > parameters when bookmark query field is specified.
> >- don't use delayed responses when `bookmark` field is provided
> >- don't use delayed responses when `limit` query key is specified and 
> > when it is below the max limit
> >- return 400 when limit query key is specified and it is greater than 
> > the max limit
> >- return 400 when we stream rows (in case when `limit` query key wasn't 
> > specified) and reach max limit
> >- the `previous`/`next`/`first` keys are optional and we omit them for 
> > the cases they don't make sense
> >
> > Latter on we would introduce API versioning and deal with `{db}/_changes` 
> > and `_all_docs` endpoints.
> >
> > Questions:
> > - `bookmark` vs `token`?
> > - should we prohibit setting other fields when bookmark is set?
> > - `previous`/`next`/`first` as href vs token value itself (i.e. 
> > `{"previous": "983uiwfjkdsdf", "next": "12343tyekf3", "first": 
> > "iekjhfwo034"}`)
> >
> > Best regards,
> > iilyak
> >
> > On 2020/04/22 20:18:57, Ilya Khlopotov  wrote:
> > > Hello everyone,
> > >
> > > Based on the discussions on the thread I would like to propose a number 
> > > of first steps:
> > > 1) introduce new endpoints
> > >   - {db}/_all_docs/page
> > >   - {db}/_all_docs/queries/page
> > >   - _all_dbs/page
> > >   - _dbs_info/page
> > >   - {db}/_design/{ddoc}/_view/{view}/page
> > >   - {db}/_design/{ddoc}/_view/{view}/queries/page
> > >   - {db}/_find/page
> > >
> > > These new endpoints would act as follows:
> > > - don't use delayed responses
> > > - return object with following structure
> > >   ```
> > >   {
> > >  "total": Total,
> > >  "bookmark": base64 encoded opaque value,
> > >  

Re: [DISCUSS] Streaming API in CouchDB 4.0

2020-04-28 Thread Paul Davis
Seems reasonable to me. I'd agree that setting query string parameters
with a bookmark should be rejected. I was also going to suggest
eliding the href member. In the examples I've seen those are usually
structured as something like:

"links": {
"previous": "/path/and/qs=foo",
"next": "/path/and/qs=bar"
}

On Tue, Apr 28, 2020 at 6:56 AM Ilya Khlopotov  wrote:
>
> Hello,
>
> I would like to introduce second proposal.
>
> 1) Add new optional query field called `bookmark` (or `token`) to following 
> endpoints
>   - {db}/_all_docs
>   - {db}/_all_docs/queries
>   - _dbs_info
>   - {db}/_design/{ddoc}/_view/{view}
>   - {db}/_design/{ddoc}/_view/{view}/queries
> 2) Add following additional fields into response:
>```
>    "first": {
>        "href": "https://myserver.com/myddb/_all_docs?limit=50=true"
>    },
>    "previous": {
>        "href": "https://myserver.com/myddb/_all_docs?bookmark=983uiwfjkdsdf"
>    },
>    "next": {
>        "href": "https://myserver.com/myddb/_all_docs?bookmark=12343tyekf3"
>    },
>    ```
> 3) Implement per-endpoint configurable max limits
>```
>[request_limits]
>   _all_docs = 5000
>   _all_docs/queries = 5000
>   _all_dbs = 5000
>   _dbs_info = 5000
>   _view = 2500
>   _view/queries = 2500
>   _find = 2500
>   ```
> 4) Implement following semantics:
>- The bookmark would be opaque token and would include information needed 
> to ensure proper pagination without the need to repeat initial parameters of 
> the request. In fact we might prohibit setting additional parameters when 
> bookmark query field is specified.
>- don't use delayed responses when `bookmark` field is provided
>- don't use delayed responses when `limit` query key is specified and when 
> it is below the max limit
>- return 400 when limit query key is specified and it is greater than the 
> max limit
>- return 400 when we stream rows (in case when `limit` query key wasn't 
> specified) and reach max limit
>- the `previous`/`next`/`first` keys are optional and we omit them for the 
> cases they don't make sense
>
> Latter on we would introduce API versioning and deal with `{db}/_changes` and 
> `_all_docs` endpoints.
>
> Questions:
> - `bookmark` vs `token`?
> - should we prohibit setting other fields when bookmark is set?
> - `previous`/`next`/`first` as href vs token value itself (i.e. `{"previous": 
> "983uiwfjkdsdf", "next": "12343tyekf3", "first": "iekjhfwo034"}`)
>
> Best regards,
> iilyak
>
> On 2020/04/22 20:18:57, Ilya Khlopotov  wrote:
> > Hello everyone,
> >
> > Based on the discussions on the thread I would like to propose a number of 
> > first steps:
> > 1) introduce new endpoints
> >   - {db}/_all_docs/page
> >   - {db}/_all_docs/queries/page
> >   - _all_dbs/page
> >   - _dbs_info/page
> >   - {db}/_design/{ddoc}/_view/{view}/page
> >   - {db}/_design/{ddoc}/_view/{view}/queries/page
> >   - {db}/_find/page
> >
> > These new endpoints would act as follows:
> > - don't use delayed responses
> > - return object with following structure
> >   ```
> >   {
> >  "total": Total,
> >  "bookmark": base64 encoded opaque value,
> >  "completed": true | false,
> >  "update_seq": when available,
> >  "page": current page number,
> >  "items": [
> >  ]
> >   }
> >   ```
> > - the bookmark would include following data (base64 or protobuff???):
> >   - direction
> >   - page
> >   - descending
> >   - endkey
> >   - endkey_docid
> >   - inclusive_end
> >   - startkey
> >   - startkey_docid
> >   - last_key
> >   - update_seq
> >   - timestamp
> >   ```
> >
> > 2) Implement per-endpoint configurable max limits
> > ```
> > _all_docs = 5000
> > _all_docs/queries = 5000
> > _all_dbs = 5000
> > _dbs_info = 5000
> > _view = 2500
> > _view/queries = 2500
> > _find = 2500
> > ```
> >
> > Latter (after few years) CouchDB would deprecate and remove old endpoints.
> >
> > Best regards,
> > iilyak
> >
> > On 2020/02/19 22:39:45, Nick Vatamaniuc  wrote:
> > > Hello everyone,
> > >
> > > I'd like to discuss the shape and behavior of streaming APIs for CouchDB 
> > > 4.x
> > >
> > > By "streaming APIs" I mean APIs which stream data in rows as it gets
> > > read from the database. These are the endpoints I was thinking of:
> > >
> > >  _all_docs, _all_dbs, _dbs_info  and query results
> > >
> > > I want to focus on what happens when FoundationDB transactions
> > > time-out after 5 seconds. Currently, all those APIs except _changes[1]
> > > feeds, will crash or freeze. The reason is because the
> > > transaction_too_old error at the end of 5 seconds is retry-able by
> > > default, so the request handlers run again and end up shoving the
> > > whole request down the socket again, headers and all, which is
> > > obviously broken and not what we want.
> > >
> > > There are a few alternatives discussed in the couchdb-dev channel. I'll
> > > present some behaviors but feel free to add more. Some ideas might
> > > have been discounted on the IRC 

Re: [DISCUSS] Streaming API in CouchDB 4.0

2020-04-28 Thread Ilya Khlopotov
Hello, 

I would like to introduce a second proposal.

1) Add a new optional query field called `bookmark` (or `token`) to the following 
endpoints:
  - {db}/_all_docs
  - {db}/_all_docs/queries
  - _dbs_info
  - {db}/_design/{ddoc}/_view/{view}
  - {db}/_design/{ddoc}/_view/{view}/queries
2) Add the following additional fields to the response:
   ```
   "first": {
       "href": "https://myserver.com/myddb/_all_docs?limit=50=true"
   },
   "previous": {
       "href": "https://myserver.com/myddb/_all_docs?bookmark=983uiwfjkdsdf"
   },
   "next": {
       "href": "https://myserver.com/myddb/_all_docs?bookmark=12343tyekf3"
   },
   ```
3) Implement per-endpoint configurable max limits 
   ```
   [request_limits]
  _all_docs = 5000
  _all_docs/queries = 5000
  _all_dbs = 5000
  _dbs_info = 5000
  _view = 2500
  _view/queries = 2500
  _find = 2500
  ```
4) Implement the following semantics:
   - The bookmark would be an opaque token that includes the information needed 
to ensure proper pagination without the need to repeat the initial parameters of 
the request (a minimal sketch of such a token follows this list). In fact we 
might prohibit setting additional parameters when the bookmark query field is 
specified.
   - don't use delayed responses when `bookmark` field is provided
   - don't use delayed responses when `limit` query key is specified and when 
it is below the max limit
   - return 400 when limit query key is specified and it is greater than the 
max limit
   - return 400 when we stream rows (in case when `limit` query key wasn't 
specified) and reach max limit
   - the `previous`/`next`/`first` keys are optional and we omit them in the 
cases where they don't make sense
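
To make the "opaque token" idea concrete, here is a minimal sketch of how such a 
bookmark might be produced and consumed (Python, illustrative only — the real 
encoding, field set, signing and expiry are all open questions):

```
import base64
import json

def encode_bookmark(fields):
    # fields might include direction, last_key, update_seq, timestamp, ...
    raw = json.dumps(fields, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_bookmark(token):
    # A request carrying ?bookmark=<token> is decoded back into the original
    # continuation state without the client restating its parameters.
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw)

token = encode_bookmark({"last_key": "doc-0100", "descending": False, "page": 2})
assert decode_bookmark(token)["last_key"] == "doc-0100"
```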

Later on we would introduce API versioning and deal with the `{db}/_changes` and 
`_all_docs` endpoints.
  
Questions:
- `bookmark` vs `token`?
- should we prohibit setting other fields when bookmark is set?
- `previous`/`next`/`first` as href vs token value itself (i.e. `{"previous": 
"983uiwfjkdsdf", "next": "12343tyekf3", "first": "iekjhfwo034"}`)

Best regards,
iilyak

On 2020/04/22 20:18:57, Ilya Khlopotov  wrote: 
> Hello everyone,
> 
> Based on the discussions on the thread I would like to propose a number of 
> first steps:
> 1) introduce new endpoints
>   - {db}/_all_docs/page
>   - {db}/_all_docs/queries/page
>   - _all_dbs/page
>   - _dbs_info/page
>   - {db}/_design/{ddoc}/_view/{view}/page
>   - {db}/_design/{ddoc}/_view/{view}/queries/page
>   - {db}/_find/page
> 
> These new endpoints would act as follows:
> - don't use delayed responses
> - return object with following structure
>   ```
>   {
>  "total": Total,
>  "bookmark": base64 encoded opaque value,
>  "completed": true | false,
>  "update_seq": when available,
>  "page": current page number,
>  "items": [
>  ]
>   }
>   ```
> - the bookmark would include following data (base64 or protobuff???):
>   - direction
>   - page
>   - descending
>   - endkey
>   - endkey_docid
>   - inclusive_end
>   - startkey
>   - startkey_docid
>   - last_key
>   - update_seq
>   - timestamp
>   ```
> 
> 2) Implement per-endpoint configurable max limits
> ```
> _all_docs = 5000
> _all_docs/queries = 5000
> _all_dbs = 5000
> _dbs_info = 5000
> _view = 2500
> _view/queries = 2500
> _find = 2500
> ```
> 
> Latter (after few years) CouchDB would deprecate and remove old endpoints.
> 
> Best regards,
> iilyak
> 
> On 2020/02/19 22:39:45, Nick Vatamaniuc  wrote: 
> > Hello everyone,
> > 
> > I'd like to discuss the shape and behavior of streaming APIs for CouchDB 4.x
> > 
> > By "streaming APIs" I mean APIs which stream data in rows as it gets
> > read from the database. These are the endpoints I was thinking of:
> > 
> >  _all_docs, _all_dbs, _dbs_info  and query results
> > 
> > I want to focus on what happens when FoundationDB transactions
> > time-out after 5 seconds. Currently, all those APIs except _changes[1]
> > feeds, will crash or freeze. The reason is because the
> > transaction_too_old error at the end of 5 seconds is retry-able by
> > default, so the request handlers run again and end up shoving the
> > whole request down the socket again, headers and all, which is
> > obviously broken and not what we want.
> > 
> > There are a few alternatives discussed in the couchdb-dev channel. I'll
> > present some behaviors, but feel free to add more. Some ideas might
> > have been discounted in the IRC discussion already, but I'll present
> > them anyway in case it sparks further conversation:
> > 
> > A) Do what _changes[1] feeds do. Start a new transaction and continue
> > streaming the data from the next key after the last one emitted in the
> > previous transaction. Document the API behavior change that it may
> > present a view of the data that is not a point-in-time[4] snapshot of the
> > DB.
> > 
> >  - Keeps the API shape the same as CouchDB <4.0. Client libraries
> > don't have to change to continue using these CouchDB 4.0 endpoints
> >  - This is the easiest to implement since it would re-use the
> > implementation for _changes feed (an