[prometheus-developers] Would tooling for PromQL formatting/manipulation be useful and where should it live?

2022-10-04 Thread Rob Skillington
Hey Prometheus team,

I've noticed requests for tooling around reformatting, manipulating, and
generally refactoring sets of queries and rule definitions (where there is
a large number of defined queries). Use cases include things like "I want
to duplicate a set of alerts to target different environments with
different label combinations and conditions".

I opened a PR to add some basic commands, given I had seen this earlier PR
mention that there was an intention for the PromQL AST pretty-print
formatting to be usable from promtool:
https://github.com/prometheus/prometheus/pull/10544
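
To sketch roughly what I mean by manipulating queries programmatically
(illustrative only, not the promtool code itself; the example query is made
up and the availability of the Pretty method in any given release is an
assumption on my part):

```
package main

import (
	"fmt"
	"log"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// Parse a PromQL expression into its AST.
	expr, err := parser.ParseExpr(`sum(rate(http_requests_total{job="api"}[5m])) by (status)`)
	if err != nil {
		log.Fatal(err)
	}
	// Canonical single-line form.
	fmt.Println(expr.String())
	// Multi-line indented form (assumes the AST prettifier referenced above).
	fmt.Println(expr.Pretty(0))
}
```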

I now realize it may have been better to raise the question of if/where
this should live here before opening the PR. What would be the reception to
housing these commands in promtool, and if not there, where do people think
a good recommended place would be for these to live?

PR in question:
https://github.com/prometheus/prometheus/pull/11411

Best,
Rob



[prometheus-developers] Re: Prometheus Alert-Generator Compliance testing for M3

2022-03-06 Thread Rob Skillington
Hey Ganesh,

I have replied on behalf of Chronosphere for the corresponding alert
generator compliance test.

For M3 and M3DB: the project by itself doesn't offer any alert generation
capabilities and thus cannot itself produce a positive or negative result.
It would need to be combined with a rule manager from one of the other
independent Prometheus remote storage open source projects to provide such
capabilities.

As such, the result for alert generation capability is N/A.

Thanks for reaching out, and thanks to the contributors behind the alert
generation compliance program.

Rob

On Wed, Mar 2, 2022 at 3:37 AM Ganesh Vernekar  wrote:

> Hello M3DB team,
>
> I hope this email finds you well.
>
> As a part of the Prometheus conformance program, we had floated a doc on
> "Prometheus Alert-Generator Compliance" many months ago and had finalized
> the specification for that (see here).
>
> The test suite to test the specification is now ready and instructions on
> how to run the test suite are present here.
>
> If you wish M3 to be compliant with Prometheus Alert-Generator, please
> test your software with the above test suite and report back the results by
> replying to this thread (results being the entire log output of the test
> run).
>
> If you do need any additional help from the test suite (for example
> setting custom headers to some requests), please let me know and I will add
> those abilities to the test suite.
>
> Going forward, we would like to automate this process. We ask you to add
> your test-suite config template by opening a PR against
> prometheus/compliance and create test-m3.yaml in the alert_generator
> directory. See test-.*.yaml files here for example. Please also add
> instructions on how to set up as comments in the same file.
>
> If you face issues in running the test suite or understanding any error
> messages, I am happy to answer your queries.
>
> We plan to publish the results during the third week of May 2022 (during
> KubeCon EU).
>
> Thanks,
> Ganesh (codesome)
> Prometheus team
>



[prometheus-developers] Re: Prometheus Alert-Generator Compliance testing for Chronosphere

2022-03-06 Thread Rob Skillington
Hey Ganesh,

Thanks for the heads up, and for all the work that has gone into this from
the various contributors; you will hear from us in a bit.

While I have you, I'm looking for a review to update the OpenMetrics
conformance instructions to mention the scrape validator tool which Chao
authored last year:
https://github.com/prometheus/compliance/pull/77/files

Best,
Rob

On Wed, Mar 2, 2022 at 3:44 AM Ganesh Vernekar  wrote:

> Hello Chronosphere team,
>
> I hope this email finds you well.
>
> As a part of the Prometheus conformance program, we had floated a doc on
> "Prometheus Alert-Generator Compliance" many months ago and had finalized
> the specification for that (see here).
>
> The test suite to test the specification is now ready and instructions on
> how to run the test suite are present here.
>
> If you wish Chronosphere to be compliant with Prometheus Alert-Generator,
> please test your software with the above test suite and report back the
> results by replying to this thread (results being the entire log output of
> the test run).
>
> If you do need any additional help from the test suite (for example
> setting custom headers to some requests), please let me know and I will add
> those abilities to the test suite.
>
> Going forward, we would like to automate this process. We ask you to add
> your test-suite config template by opening a PR against
> prometheus/compliance and create test-chronosphere.yaml in the
> alert_generator directory. See test-.*.yaml files here for example. Please
> also add instructions on how to set up as comments in the same file.
>
> If you face issues in running the test suite or understanding any error
> messages, I am happy to answer your queries.
>
> We plan to publish the results during the third week of May 2022 (during
> KubeCon EU).
>
> Thanks,
> Ganesh (codesome)
> Prometheus team
>



Re: [prometheus-developers] Requirements / Best Practices to use Prometheus Metrics for Serverless environments

2021-11-27 Thread Rob Skillington
Here's the documentation for using M3 Coordinator (with or without M3
Aggregator) with a backend that has a Prometheus Remote Write receiver:
https://m3db.io/docs/how_to/any_remote_storage/

Would be more than happy to do a call some time on this topic. The more
we've looked at this, the more it looks primarily like a client library
issue, well before you consider the backend/receiver aspect (there are
options out there and they are fairly mechanical to overcome, versus the
client library concerns, which have a lot of ergonomic and practical
issues, especially in a serverless environment where you may need to wait
for publishing before finishing your request). Perhaps an async process is
an option: publish a message to a local serverless message queue like SQS,
and have a reader consume that and use another client library to push the
data out. That would be more type safe and probably less lossy than
logging, reading the logs, and then publishing, but it would need good
client library support for both the serverless producers and the
readers/pushers.
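
To make the async idea concrete, here's a minimal sketch of the producer
side (purely illustrative: the queue URL, the sample encoding, and the
function names are hypothetical; a separate long-running reader would drain
the queue and push the data out with a normal metrics client):

```
package main

import (
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// sample is a hypothetical wire format for a single metric observation.
type sample struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
	Value  float64           `json:"value"`
	TSMs   int64             `json:"ts_ms"`
}

// enqueueSample publishes one encoded sample to an SQS queue so the
// serverless function doesn't block on a metrics backend.
func enqueueSample(queueURL string, s sample) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	svc := sqs.New(session.Must(session.NewSession()))
	_, err = svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(string(body)),
	})
	return err
}

func main() {
	err := enqueueSample("https://sqs.us-east-1.amazonaws.com/123456789012/metrics-queue", sample{
		Name:   "requests_total",
		Labels: map[string]string{"handler": "checkout"},
		Value:  1,
		TSMs:   1638000000000,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```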

Rob

On Sat, Nov 27, 2021 at 1:41 AM Rob Skillington  wrote:

> FWIW we have been experimenting with users pushing OpenMetrics protobuf
> payloads quite successfully, but only sophisticated exporters that can
> guarantee no collisions of time series and generate their own monotonic
> counters, etc are using this at this time.
>
> If you're looking for a solution that also involves aggregation support,
> M3 Coordinator (either standalone or combined with M3 Aggregator) supports
> Remote Write as a backend (and is thus compatible with Thanos, Cortex and
> of course Prometheus itself too due to the PRW receiver).
>
> M3 Coordinator however does not have any nice support to publish to it
> from a serverless environment (since the primary protocol it supports is
> Prometheus Remote Write which has no metrics clients, etc I would assume).
>
> Rob
>
>
> On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka 
> wrote:
>
>> Hi All,
>>
>> I would love to resurrect this thread. I think we are missing a good
>> push-gateway like a product that would ideally live in Prometheus
>> (repo/binary or can be recommended by us) and convert events to metrics in
>> a cheap way. Because this is what it is when we talk about short-living
>> containers and serverless functions. What's the latest Rob? I would be
>> interested in some call for this if that is still on the table. (:
>>
>> I think we have some new options on the table like supporting Otel
>> metrics as such potential high-cardinal event push, given there are more
>> and more clients for that API. Potentially Otel collector can work as such
>> "push gateway" proxy, but at this point, it's extremely generic, so we
>> might want to consider something more focused/efficient/easier to maintain.
>> Let's see (: The other problem is that Otel metrics is yet another
>> protocol. Users might want to use push gateway API, remote write or
>> logs/traces as per @Tobias Schmidt  idea
>>
>> Another service "loggateway" (or otherwise named) would then stream the
>>> logs, aggregate them and either expose them on the common /metrics endpoint
>>> or push them with remote write right away to a Prometheus instance hosted
>>> somewhere (like Grafana Cloud)."
>>
>>
>> Kind Regards,
>> Bartek Płotka (@bwplotka)
>>
>>
>> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington 
>> wrote:
>>
>>> With respect to OpenMetrics push, we had something very similar at
>>> $prevco that pushed something that looked very similar to the protobuf
>>> payload of OpenMetrics (but was Thrift snapshot of an aggregated set of
>>> metrics from in process) that was used by short running tasks (for Jenkins,
>>> Flink jobs, etc).
>>>
>>> I definitely agree it’s not ideal and ideally the platform provider can
>>> supply a collection point (there is something for Jenkins, a plug-in that
>>> can do this, but custom metrics is very hard / nigh impossible to make work
>>> with it, and this is a non-cloud provider environment that’s actually
>>> possible to make work, just no one has made it seamless).
>>>
>>> I agree with Richi that something that could push to a Prometheus Agent
>>> like target that supports OpenMetrics push could be a good middle ground
>>> with the right support / guidelines:
>>> - A way to specify multiple Prometheus Agent targets and quickly
>>> failover from one to another if within $X ms one is not responding (you
>>> could imagine a 5ms budget for each and max 3 are tried, introducing at
>>> worst 15ms overhead when all are down in 3 loc

Re: [prometheus-developers] Requirements / Best Practices to use Prometheus Metrics for Serverless environments

2021-11-27 Thread Rob Skillington
FWIW we have been experimenting with users pushing OpenMetrics protobuf
payloads quite successfully, but only sophisticated exporters that can
guarantee no collisions of time series and generate their own monotonic
counters, etc are using this at this time.

If you're looking for a solution that also involves aggregation support, M3
Coordinator (either standalone or combined with M3 Aggregator) supports
Remote Write as a backend (and is thus compatible with Thanos, Cortex and
of course Prometheus itself too due to the PRW receiver).

M3 Coordinator, however, does not have any nice support for publishing to
it from a serverless environment (since the primary protocol it supports is
Prometheus Remote Write, which has no metrics clients, etc., I would assume).

Rob


On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka 
wrote:

> Hi All,
>
> I would love to resurrect this thread. I think we are missing a good
> push-gateway like a product that would ideally live in Prometheus
> (repo/binary or can be recommended by us) and convert events to metrics in
> a cheap way. Because this is what it is when we talk about short-living
> containers and serverless functions. What's the latest Rob? I would be
> interested in some call for this if that is still on the table. (:
>
> I think we have some new options on the table like supporting Otel
> metrics as such potential high-cardinal event push, given there are more
> and more clients for that API. Potentially Otel collector can work as such
> "push gateway" proxy, but at this point, it's extremely generic, so we
> might want to consider something more focused/efficient/easier to maintain.
> Let's see (: The other problem is that Otel metrics is yet another
> protocol. Users might want to use push gateway API, remote write or
> logs/traces as per @Tobias Schmidt  idea
>
> Another service "loggateway" (or otherwise named) would then stream the
>> logs, aggregate them and either expose them on the common /metrics endpoint
>> or push them with remote write right away to a Prometheus instance hosted
>> somewhere (like Grafana Cloud)."
>
>
> Kind Regards,
> Bartek Płotka (@bwplotka)
>
>
> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington 
> wrote:
>
>> With respect to OpenMetrics push, we had something very similar at
>> $prevco that pushed something that looked very similar to the protobuf
>> payload of OpenMetrics (but was Thrift snapshot of an aggregated set of
>> metrics from in process) that was used by short running tasks (for Jenkins,
>> Flink jobs, etc).
>>
>> I definitely agree it’s not ideal and ideally the platform provider can
>> supply a collection point (there is something for Jenkins, a plug-in that
>> can do this, but custom metrics is very hard / nigh impossible to make work
>> with it, and this is a non-cloud provider environment that’s actually
>> possible to make work, just no one has made it seamless).
>>
>> I agree with Richi that something that could push to a Prometheus Agent
>> like target that supports OpenMetrics push could be a good middle ground
>> with the right support / guidelines:
>> - A way to specify multiple Prometheus Agent targets and quickly failover
>> from one to another if within $X ms one is not responding (you could
>> imagine a 5ms budget for each and max 3 are tried, introducing at worst
>> 15ms overhead when all are down in 3 local availability zones, but in
>> general this is a disaster case)
>> - Deduplication ability so that a retried push is not double counted,
>> this might mean timestamping the metrics… (so if written twice only first
>> record kept, etc)
>>
>> I think it should similar to the Push Gateway be generally a last resort
>> kind of option and have clear limitations so that pull still remains the
>> clear choice for anything but these environments.
>>
>> Is there any interest discussing this on a call some time?
>>
>> Rob
>>
>> On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein 
>> wrote:
>>
>>> On 22.06.21 11:26, Tobias Schmidt wrote:
>>> >
>>> > Last night I was wondering if there are any other common interfaces
>>> > available in serverless environments and noticed that all products by
>>> AWS
>>> > (Lambda) and GCP (Functions, Run) at least provide the option to
>>> handle log
>>> > streams, sometimes even log files on disk. I'm currently thinking about
>>> > experimenting with an approach where containers log metrics to stdout /
>>> > some file, get picked up by the serverless runtime and written to some
>>> log
>>> > stream. Another service "log

Re: [prometheus-developers] Evolving remote APIs

2021-11-27 Thread Rob Skillington
There's a now out-of-date but working proof-of-concept PR from August last
year that added TYPE, HELP and UNIT to the WAL and also to Prometheus
Remote Write payloads (on a per-TimeSeries basis, alongside the samples):
https://github.com/prometheus/prometheus/pull/7771

Once it's added to the WAL, there's no reason it can't be put into both (A)
any new Remote API and (B) an extension of the existing Remote Write API v1
as a minor release (e.g. Remote Write 1.1).

There was a 20%-30% increase in network traffic when sending it with every
single remote write request (on every series):
https://github.com/prometheus/prometheus/pull/7771#issuecomment-675956119

We arrived at another solution over the course of discussion on the PR,
which would be to "send type and unit every time (since so negligible) but
help only every 5 minutes", with perhaps some way to tweak this behavior
via config or some other means.
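
As a rough illustration of that idea (not code from the PR; the names and
the bookkeeping are hypothetical), the per-series handling could look
something like this:

```
package main

import (
	"fmt"
	"time"
)

// seriesMetadata is a hypothetical per-series record: TYPE and UNIT are
// tiny and always sent, HELP is verbose and only resent periodically.
type seriesMetadata struct {
	Type, Unit, Help string
	lastHelpSent     time.Time
}

// fieldsToSend returns the metadata to attach to the next remote write
// payload for this series.
func (m *seriesMetadata) fieldsToSend(now time.Time, helpEvery time.Duration) (typ, unit, help string) {
	typ, unit = m.Type, m.Unit // cheap, always included
	if now.Sub(m.lastHelpSent) >= helpEvery {
		help = m.Help // verbose, resent only every helpEvery
		m.lastHelpSent = now
	}
	return typ, unit, help
}

func main() {
	m := &seriesMetadata{Type: "counter", Help: "Total HTTP requests."}
	// First send includes HELP; sends within the next 5 minutes omit it.
	typ, unit, help := m.fieldsToSend(time.Now(), 5*time.Minute)
	fmt.Println(typ, unit, help)
}
```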

Rob


On Fri, Nov 26, 2021 at 1:23 AM 'Fabian Reinartz' via Prometheus Developers
 wrote:

>
> As maintainer of Prometheus server, in general, I am worried that
>> getting a wal that'd be more "able" than the actual Prometheus TSDB
>> would weaken the Prometheus server use case in favor of SaaS platforms.
>>
>> It does not sound great for the users who rely on Prometheus
>> alone, which I think will continue to represent a large part of our
>> community in the future.
>>
>
> Where do you see the downside for these users? It doesn't seem that a
> structured remote-write API would take anything away from
> users using the Prometheus server with local storage.
>
>
>> Additionally, the Query Engine should take advantage of those new
>> properties as well: until we do not support that in Prometheus TSDB,
>> it's harder to take advantage of the OpenMetrics types in the language.
>>
>
> True, though I don't understand why this is an argument against the
> remote-write protocol supporting the instrumentation data model.
>
> Tailing whatever structure TSDB currently supports, which will probably be
> a moving target for some time, seems like it would cause unnecessary
> change frequency to the API or require waiting a few years before making
> any changes at all.
> Or is the goal to not give service offerings access to more structure than
> Prometheus itself can make use of?
>
>
> I should say that I'm primarily speaking from technical curiosity here.
> Our own offering doesn't need such fundamental changes, though
> they would make some things a bit simpler of course.
>
>
> On another front, from an efficiency standpoint, don't we want to batch
>> samples from exact same ts in many cases (e.g network partition)?
>>
>
> Could you elaborate with an example?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-developers+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/CAG97UEmPzxB5Sr1rOpjYOrRURBuU%3DFP4YsWM-0mk6o9XZt4xBQ%40mail.gmail.com
> 
> .
>



Re: [prometheus-developers] Moving the PromQL editor to prometheus/prometheus

2021-08-10 Thread Rob Skillington
You could also follow the Kubernetes model, where subdirectories of the
repository are mirrored to a second repository (either by CI or some other
infrastructure) and the code is tagged there.

That way you still have a monorepo of all the code and can make single
changes across layers, but releasing and other versioning aspects are
handled in a separate repo (and potentially issue handling, etc. too).

This is how the k8s client is released separately even though the code
lives in the main k8s central repo alongside k8s API server, kubelet, etc.

Rob

On Tue, Aug 10, 2021 at 8:17 AM Augustin Husson 
wrote:

> From my point of view, to have a different tag wasn't because I didn't
> want to wait for a Prometheus release.
>
> In fact, currently these repositories are for the moment quite in a
> maintenance mode. It just follows the changes of PromQL basically. So it's
> quite fine to wait for the Prometheus release to unleash any bugfixes /
> features.
>
> On my side, my concern regarding following the tag version of Prometheus
> is more we will release the npm package quite often with no changes. That's
> something weird to release a library with no changes.
>
> It is still interesting to create UI module to be able to share code
> between Thanos and Prometheus (I have made a proposal in this sense here
> ,
> which I think can be improved), but in that particular case, I think the
> changes will appear quite often and it will be one npm package that would
> contain all Prometheus module. ( a bit like angular is doing for example).
> So in that particular case, it makes sense to follow the tag of Prometheus.
>
> In this perspective, I could imagine that the PromQL editor is actually a
> Prometheus module, but then it will be a different npm package. I could
> live with that, as long as it won't be the only UI module.
>
> Another idea would be to release the npm package during the release
> process of Prometheus, but the version won't follow the tag, it will follow
> what is written in the npm package. So if the version didn't change between
> 2 Prometheus versions, then it won't release the npm package.
> Like that we don't have extra git tag, we don't release any extra version
> with no changes.
> WDYT about this last proposition ?
>
> Le mar. 10 août 2021 à 13:29, Julien Pivotto  a
> écrit :
>
>> Hello,
>>
>> I like the idea to combine them in one repository.
>>
>> I would rather see if we can use it "unversioned" inside
>> prometheus/prometheus and release it together with the Prometheus
>> releases for the outside world.
>>
>> My concerns are:
>>
>> - It would add an extra burden to release management if we add extra
>> steps or
>>   more packages
>> - I expect that some people actually build Prometheus from the tags and
>>   adding extra tags could break quite a few workloads. I do not think
>>   that building tags is a xkcd 1172 case https://xkcd.com/1172/
>>
>> Additionally, there has been interests in the past to have even more
>> UI modules available, e.g. for thanos.
>>
>> I know that it would be quite inconvenient to wait for a Prometheus
>> release to publish bugfixes for these, but:
>> 1) we release Prometheus quite often
>> 2) we should still try to minimize the code *not used* by Prometheus
>>   itself, so that bugfixes will more likely hit Prometheus as well.
>>
>> Regards,
>>
>> On 10 Aug 13:16, Julius Volz wrote:
>> > I like the idea. I want to make sure that having multiple tag formats
>> for
>> > differently-versioned subprojects (Prometheus itself and one or multiple
>> > npm packages) doesn't cause any problems I don't foresee. It would be
>> great
>> > if people more familiar with the current Prometheus CI / build system
>> could
>> > give an opinion on that. CC-ing Julien as I think he has a decent
>> overview
>> > over that part, and he is also the default Prometheus server repo
>> > maintainer.
>> >
>> > On Tue, Aug 10, 2021 at 12:36 PM Augustin Husson <
>> husson.augus...@gmail.com>
>> > wrote:
>> >
>> > > Hello fellow Prometheus developers :),
>> > >
>> > > As you probably know, in Prometheus, you have since a couple month a
>> great
>> > > PromQL editor (with autocomplete, linter, highlight feature) which is
>> for
>> > > the moment maintained in two separate repositories:
>> > >
>> > >- prometheus-community/codemirror-promql
>> > > that
>> > >contains all the autocomplete / linter / highlight logic.
>> > >- promlabs/lezer-promql 
>> > >that contains the PromQL grammar (web version)
>> > >
>> > > When a new feature enriched PromQL, the PR on Prometheus' side is
>> usually
>> > > modifying the backend and the documentation. But it doesn't change the
>> > > PromQL editor since it's in two different repositories.
>> > > It's usually Julius or/and me that are putting back this feature,
>> 

Re: [prometheus-developers] Requirements / Best Practices to use Prometheus Metrics for Serverless environments

2021-06-24 Thread Rob Skillington
With respect to OpenMetrics push, we had something very similar at $prevco:
it pushed a payload that looked very much like the protobuf payload of
OpenMetrics (but was a Thrift snapshot of an aggregated set of in-process
metrics) and was used by short-running tasks (for Jenkins, Flink jobs,
etc.).

I definitely agree it's not ideal, and ideally the platform provider can
supply a collection point (there is something for Jenkins, a plug-in that
can do this, but custom metrics are very hard / nigh impossible to make
work with it, and that is a non-cloud-provider environment where this is
actually possible to make work; no one has just made it seamless).

I agree with Richi that something that could push to a Prometheus
Agent-like target that supports OpenMetrics push could be a good middle
ground with the right support / guidelines:
- A way to specify multiple Prometheus Agent targets and quickly fail over
from one to another if one is not responding within $X ms (you could
imagine a 5ms budget for each and a max of 3 tried, introducing at worst
15ms overhead when all are down in 3 local availability zones, but in
general this is a disaster case; see the sketch after this list)
- Deduplication ability so that a retried push is not double counted, which
might mean timestamping the metrics… (so if written twice, only the first
record is kept, etc.)
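
A minimal sketch of the failover budget from the first point (purely
illustrative; the target URLs, payload, and names are hypothetical):

```
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// pushWithFailover tries each agent target in turn with a small per-attempt
// timeout, so that even with every local target down the added latency
// stays bounded (e.g. 3 targets x 5ms = 15ms worst case).
func pushWithFailover(ctx context.Context, targets []string, payload []byte, perAttempt time.Duration) error {
	var lastErr error
	for _, target := range targets {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, target, bytes.NewReader(payload))
		if err != nil {
			cancel()
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		cancel()
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		if resp.StatusCode/100 == 2 {
			return nil // first healthy target wins
		}
		lastErr = fmt.Errorf("push to %s: unexpected status %d", target, resp.StatusCode)
	}
	return fmt.Errorf("all targets failed, last error: %v", lastErr)
}

func main() {
	targets := []string{
		"http://agent-az1.internal:9201/push",
		"http://agent-az2.internal:9201/push",
		"http://agent-az3.internal:9201/push",
	}
	_ = pushWithFailover(context.Background(), targets, []byte("openmetrics payload"), 5*time.Millisecond)
}
```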

I think that, similar to the Push Gateway, it should generally be a
last-resort kind of option and have clear limitations, so that pull still
remains the clear choice for anything but these environments.

Is there any interest in discussing this on a call some time?

Rob

On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein 
wrote:

> On 22.06.21 11:26, Tobias Schmidt wrote:
> >
> > Last night I was wondering if there are any other common interfaces
> > available in serverless environments and noticed that all products by AWS
> > (Lambda) and GCP (Functions, Run) at least provide the option to handle
> log
> > streams, sometimes even log files on disk. I'm currently thinking about
> > experimenting with an approach where containers log metrics to stdout /
> > some file, get picked up by the serverless runtime and written to some
> log
> > stream. Another service "loggateway" (or otherwise named) would then
> stream
> > the logs, aggregate them and either expose them on the common /metrics
> > endpoint or push them with remote write right away to a Prometheus
> instance
> > hosted somewhere (like Grafana Cloud).
>
> Perhaps I'm missing something, but isn't that
> https://github.com/google/mtail ?
>
> --
> Björn Rabenstein
> [PGP-ID] 0x851C3DA17D748D03
> [email] bjo...@rabenste.in
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-developers+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/20210624210908.GB11559%40jahnn
> .
>



Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-19 Thread Rob Skillington
If anyone wants to do some further testing on their own datasets, it would
definitely be interesting to see what range they are in.

I'll start addressing the latest round of comments and tie up the tests.

On Wed, Aug 19, 2020 at 4:53 AM Brian Brazil <
brian.bra...@robustperception.io> wrote:

> On Wed, 19 Aug 2020 at 09:47, Rob Skillington  wrote:
>
>> To add a bit more detail to that example, I was actually using a
>> fairly tuned
>> remote write queue config that sent large batches since the batch send
>> deadline
>> was set to 1 minute longer with a max samples per send of 5,000. Here's
>> that
>> config:
>> ```
>> remote_write:
>>   - url: http://localhost:3030/remote/write
>> remote_timeout: 30s
>> queue_config:
>>   capacity: 1
>>   max_shards: 10
>>   min_shards: 3
>>   max_samples_per_send: 5000
>>   batch_send_deadline: 1m
>>   min_backoff: 50ms
>>   max_backoff: 1s
>> ```
>>
>> Using the default config we get worse utilization for both before/after
>> numbers
>> but the delta/difference is less:
>> - steady state ~177kb/sec without this change
>> - steady state ~210kb/sec with this change
>> - roughly 20% increase
>>
>
> I think 20% is okay all things considered.
>
> Brian
>
>
>>
>> Using config:
>> ```
>> remote_write:
>>   - url: http://localhost:3030/remote/write
>> remote_timeout: 30s
>> ```
>>
>> Implicitly the values for this config is:
>> - min shards 1
>> - max shards 1000
>> - max samples per send 100
>> - capacity 500
>> - batch send deadline 5s
>> - min backoff 30ms
>> - max backoff 100ms
>>
>> On Wed, Aug 19, 2020 at 4:26 AM Brian Brazil <
>> brian.bra...@robustperception.io> wrote:
>>
>>> On Wed, 19 Aug 2020 at 09:20, Rob Skillington 
>>> wrote:
>>>
>>>> Here's the results from testing:
>>>> - node_exporter exporting 309 metrics each by turning on a lot of
>>>> optional
>>>>   collectors, all have help set, very few have unit set
>>>> - running 8 on the host at 1s scrape interval, each with unique
>>>> instance label
>>>> - steady state ~137kb/sec without this change
>>>> - steady state ~172kb/sec with this change
>>>> - roughly 30% increase
>>>>
>>>> Graph here:
>>>>
>>>> https://github.com/prometheus/prometheus/pull/7771#issuecomment-675923976
>>>>
>>>> How do we want to proceed? This could be fairly close to the higher end
>>>> of
>>>> the spectrum in terms of expected increase given the node_exporter
>>>> metrics
>>>> density and fairly verbose metadata.
>>>>
>>>> Even having said that however 30% is a fairly big increase and
>>>> relatively large
>>>> egress cost to have to swallow without any way to back out of this
>>>> behavior.
>>>>
>>>> What do folks think of next steps?
>>>>
>>>
>>> It is on the high end, however this is going to be among the worst cases
>>> as there's not going to be a lot of per-metric cardinality from the node
>>> exporter. I bet if you greatly increased the number of targets (and reduced
>>> the scrape interval to compensate) it'd be more reasonable. I think this is
>>> just about okay.
>>>
>>> Brian
>>>
>>>
>>>>
>>>>
>>>> On Tue, Aug 11, 2020 at 11:55 AM Rob Skillington 
>>>> wrote:
>>>>
>>>>> Agreed - I'll see what I can do in getting some numbers for a workload
>>>>> collecting cAdvisor metrics, it seems to have a significant amount of
>>>>> HELP set:
>>>>>
>>>>> https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics
>>>>>
>>>>>
>>>>> On Tue, Aug 11, 2020 at 6:15 AM Brian Brazil <
>>>>> brian.bra...@robustperception.io> wrote:
>>>>>
>>>>>> On Tue, 11 Aug 2020 at 11:07, Julien Pivotto <
>>>>>> roidelapl...@prometheus.io> wrote:
>>>>>>
>>>>>>> On 11 Aug 11:05, Brian Brazil wrote:
>>>>>>>
>>>>>>>
>>>>>>> > On Tue, 11 Aug 2020 at 04:09, Callum Styan 
>>

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-19 Thread Rob Skillington
To add a bit more detail to that example, I was actually using a fairly
tuned remote write queue config that sent large batches, since the batch
send deadline was set longer, to 1 minute, with a max samples per send of
5,000. Here's that config:
```
remote_write:
  - url: http://localhost:3030/remote/write
remote_timeout: 30s
queue_config:
  capacity: 1
  max_shards: 10
  min_shards: 3
  max_samples_per_send: 5000
  batch_send_deadline: 1m
  min_backoff: 50ms
  max_backoff: 1s
```

Using the default config we get worse utilization for both before/after
numbers, but the delta/difference is less:
- steady state ~177kb/sec without this change
- steady state ~210kb/sec with this change
- roughly 20% increase

Using config:
```
remote_write:
  - url: http://localhost:3030/remote/write
remote_timeout: 30s
```

Implicitly, the values for this config are:
- min shards 1
- max shards 1000
- max samples per send 100
- capacity 500
- batch send deadline 5s
- min backoff 30ms
- max backoff 100ms

On Wed, Aug 19, 2020 at 4:26 AM Brian Brazil <
brian.bra...@robustperception.io> wrote:

> On Wed, 19 Aug 2020 at 09:20, Rob Skillington  wrote:
>
>> Here's the results from testing:
>> - node_exporter exporting 309 metrics each by turning on a lot of
>> optional
>>   collectors, all have help set, very few have unit set
>> - running 8 on the host at 1s scrape interval, each with unique instance
>> label
>> - steady state ~137kb/sec without this change
>> - steady state ~172kb/sec with this change
>> - roughly 30% increase
>>
>> Graph here:
>> https://github.com/prometheus/prometheus/pull/7771#issuecomment-675923976
>>
>> How do we want to proceed? This could be fairly close to the higher end of
>> the spectrum in terms of expected increase given the node_exporter
>> metrics
>> density and fairly verbose metadata.
>>
>> Even having said that however 30% is a fairly big increase and relatively
>> large
>> egress cost to have to swallow without any way to back out of this
>> behavior.
>>
>> What do folks think of next steps?
>>
>
> It is on the high end, however this is going to be among the worst cases
> as there's not going to be a lot of per-metric cardinality from the node
> exporter. I bet if you greatly increased the number of targets (and reduced
> the scrape interval to compensate) it'd be more reasonable. I think this is
> just about okay.
>
> Brian
>
>
>>
>>
>> On Tue, Aug 11, 2020 at 11:55 AM Rob Skillington 
>> wrote:
>>
>>> Agreed - I'll see what I can do in getting some numbers for a workload
>>> collecting cAdvisor metrics, it seems to have a significant amount of
>>> HELP set:
>>>
>>> https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics
>>>
>>>
>>> On Tue, Aug 11, 2020 at 6:15 AM Brian Brazil <
>>> brian.bra...@robustperception.io> wrote:
>>>
>>>> On Tue, 11 Aug 2020 at 11:07, Julien Pivotto <
>>>> roidelapl...@prometheus.io> wrote:
>>>>
>>>>> On 11 Aug 11:05, Brian Brazil wrote:
>>>>> > On Tue, 11 Aug 2020 at 04:09, Callum Styan 
>>>>> wrote:
>>>>> >
>>>>> > > I'm hesitant to add anything that significantly increases the
>>>>> network
>>>>> > > bandwidth usage or remote write while at the same time not giving
>>>>> users a
>>>>> > > way to tune the usage to their needs.
>>>>> > >
>>>>> > > I agree with Brian that we don't want the protocol itself to become
>>>>> > > stateful by introducing something like negotiation. I'd also
>>>>> prefer not to
>>>>> > > introduce multiple ways to do things, though I'm hoping we can
>>>>> find a way
>>>>> > > to accommodate your use case while not ballooning average users
>>>>> network
>>>>> > > egress bill.
>>>>> > >
>>>>> > > I am fine with forcing the consuming end to be somewhat stateful
>>>>> like in
>>>>> > > the case of Josh's PR where all metadata is sent periodically and
>>>>> must be
>>>>> > > stored by the remote storage system.
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> > > Overall I'd like to see some numbers regarding current network
>>>>> bandwidth
>

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-19 Thread Rob Skillington
Here's the results from testing:
- node_exporter exporting 309 metrics each by turning on a lot of optional
  collectors, all have help set, very few have unit set
- running 8 on the host at 1s scrape interval, each with unique instance
label
- steady state ~137kb/sec without this change
- steady state ~172kb/sec with this change
- roughly 30% increase

Graph here:
https://github.com/prometheus/prometheus/pull/7771#issuecomment-675923976

How do we want to proceed? This could be fairly close to the higher end of
the spectrum in terms of expected increase given the node_exporter metrics
density and fairly verbose metadata.

Even having said that, however, 30% is a fairly big increase and a
relatively large egress cost to have to swallow without any way to back out
of this behavior.

What do folks think of next steps?


On Tue, Aug 11, 2020 at 11:55 AM Rob Skillington 
wrote:

> Agreed - I'll see what I can do in getting some numbers for a workload
> collecting cAdvisor metrics, it seems to have a significant amount of HELP
> set:
>
> https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics
>
>
> On Tue, Aug 11, 2020 at 6:15 AM Brian Brazil <
> brian.bra...@robustperception.io> wrote:
>
>> On Tue, 11 Aug 2020 at 11:07, Julien Pivotto 
>> wrote:
>>
>>> On 11 Aug 11:05, Brian Brazil wrote:
>>> > On Tue, 11 Aug 2020 at 04:09, Callum Styan 
>>> wrote:
>>> >
>>> > > I'm hesitant to add anything that significantly increases the network
>>> > > bandwidth usage or remote write while at the same time not giving
>>> users a
>>> > > way to tune the usage to their needs.
>>> > >
>>> > > I agree with Brian that we don't want the protocol itself to become
>>> > > stateful by introducing something like negotiation. I'd also prefer
>>> not to
>>> > > introduce multiple ways to do things, though I'm hoping we can find
>>> a way
>>> > > to accommodate your use case while not ballooning average users
>>> network
>>> > > egress bill.
>>> > >
>>> > > I am fine with forcing the consuming end to be somewhat stateful
>>> like in
>>> > > the case of Josh's PR where all metadata is sent periodically and
>>> must be
>>> > > stored by the remote storage system.
>>> > >
>>> >
>>> >
>>> >
>>> > > Overall I'd like to see some numbers regarding current network
>>> bandwidth
>>> > > of remote write, remote write with metadata via Josh's PR, and
>>> remote write
>>> > > with sending metadata for every series in a remote write payload.
>>> > >
>>> >
>>> > I agree, I noticed that in Rob's PR and had the same thought.
>>>
>>> Remote bandwidth are likely to affect only people using remote write.
>>>
>>> Getting a view on the on-disk size of the WAL would be great too, as
>>> that will affect everyone.
>>>
>>
>> I'm not worried about that, it's only really on series creation so won't
>> be noticed unless you have really high levels of churn.
>>
>> Brian
>>
>>
>>>
>>> >
>>> > Brian
>>> >
>>> >
>>> > >
>>> > > Rob, I'll review your PR tomorrow but it looks like Julien and Brian
>>> may
>>> > > already have that covered.
>>> > >
>>> > > On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington 
>>> > > wrote:
>>> > >
>>> > >> Update: The PR now sends the fields over remote write from the WAL
>>> and
>>> > >> metadata
>>> > >> is also updated in the WAL when any field changes.
>>> > >>
>>> > >> Now opened the PR against the primary repo:
>>> > >> https://github.com/prometheus/prometheus/pull/7771
>>> > >>
>>> > >> I have tested this end-to-end with a modified M3 branch:
>>> > >> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
>>> > >> > {... "msg":"received
>>> > >> series","labels":"{__name__="prometheus_rule_group_...
>>> > >> >
>>> iterations_total",instance="localhost:9090",job="prometheus01",role=...
>>> > >> > "remote"}","type":"counter","unit":"",&quo

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-11 Thread Rob Skillington
Agreed - I'll see what I can do about getting some numbers for a workload
collecting cAdvisor metrics; it seems to have a significant amount of HELP
set:
https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics


On Tue, Aug 11, 2020 at 6:15 AM Brian Brazil <
brian.bra...@robustperception.io> wrote:

> On Tue, 11 Aug 2020 at 11:07, Julien Pivotto 
> wrote:
>
>> On 11 Aug 11:05, Brian Brazil wrote:
>> > On Tue, 11 Aug 2020 at 04:09, Callum Styan 
>> wrote:
>> >
>> > > I'm hesitant to add anything that significantly increases the network
>> > > bandwidth usage or remote write while at the same time not giving
>> users a
>> > > way to tune the usage to their needs.
>> > >
>> > > I agree with Brian that we don't want the protocol itself to become
>> > > stateful by introducing something like negotiation. I'd also prefer
>> not to
>> > > introduce multiple ways to do things, though I'm hoping we can find a
>> way
>> > > to accommodate your use case while not ballooning average users
>> network
>> > > egress bill.
>> > >
>> > > I am fine with forcing the consuming end to be somewhat stateful like
>> in
>> > > the case of Josh's PR where all metadata is sent periodically and
>> must be
>> > > stored by the remote storage system.
>> > >
>> >
>> >
>> >
>> > > Overall I'd like to see some numbers regarding current network
>> bandwidth
>> > > of remote write, remote write with metadata via Josh's PR, and remote
>> write
>> > > with sending metadata for every series in a remote write payload.
>> > >
>> >
>> > I agree, I noticed that in Rob's PR and had the same thought.
>>
>> Remote bandwidth are likely to affect only people using remote write.
>>
>> Getting a view on the on-disk size of the WAL would be great too, as
>> that will affect everyone.
>>
>
> I'm not worried about that, it's only really on series creation so won't
> be noticed unless you have really high levels of churn.
>
> Brian
>
>
>>
>> >
>> > Brian
>> >
>> >
>> > >
>> > > Rob, I'll review your PR tomorrow but it looks like Julien and Brian
>> may
>> > > already have that covered.
>> > >
>> > > On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington 
>> > > wrote:
>> > >
>> > >> Update: The PR now sends the fields over remote write from the WAL
>> and
>> > >> metadata
>> > >> is also updated in the WAL when any field changes.
>> > >>
>> > >> Now opened the PR against the primary repo:
>> > >> https://github.com/prometheus/prometheus/pull/7771
>> > >>
>> > >> I have tested this end-to-end with a modified M3 branch:
>> > >> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
>> > >> > {... "msg":"received
>> > >> series","labels":"{__name__="prometheus_rule_group_...
>> > >> >
>> iterations_total",instance="localhost:9090",job="prometheus01",role=...
>> > >> > "remote"}","type":"counter","unit":"","help":"The total number of
>> > >> scheduled...
>> > >> > rule group evaluations, whether executed or missed."}
>> > >>
>> > >> Tests still haven't been updated. Please any feedback on the
>> approach /
>> > >> data structures would be greatly appreciated.
>> > >>
>> > >> Would be good to know what others thoughts are on next steps.
>> > >>
>> > >> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington > >
>> > >> wrote:
>> > >>
>> > >>> Here's a draft PR that builds that propagates metadata to the WAL
>> and
>> > >>> the WAL
>> > >>> reader can read it back:
>> > >>> https://github.com/robskillington/prometheus/pull/1/files
>> > >>>
>> > >>> Would like a little bit of feedback before on the datatypes and
>> > >>> structure going
>> > >>> further if folks are open to that.
>> > >>>
>> > >>> There's a few things not happening:
>> > >>> - Remote wr

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-09 Thread Rob Skillington
Update: The PR now sends the fields over remote write from the WAL and
metadata
is also updated in the WAL when any field changes.

Now opened the PR against the primary repo:
https://github.com/prometheus/prometheus/pull/7771

I have tested this end-to-end with a modified M3 branch:
https://github.com/m3db/m3/compare/r/test-prometheus-metadata
> {... "msg":"received
series","labels":"{__name__="prometheus_rule_group_...
> iterations_total",instance="localhost:9090",job="prometheus01",role=...
> "remote"}","type":"counter","unit":"","help":"The total number of
scheduled...
> rule group evaluations, whether executed or missed."}

Tests still haven't been updated. Any feedback on the approach / data
structures would be greatly appreciated.

Would be good to know what others' thoughts are on next steps.

On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington  wrote:

> Here's a draft PR that builds that propagates metadata to the WAL and the
> WAL
> reader can read it back:
> https://github.com/robskillington/prometheus/pull/1/files
>
> Would like a little bit of feedback before on the datatypes and structure
> going
> further if folks are open to that.
>
> There's a few things not happening:
> - Remote write queue manager does not use or send these extra fields yet.
> - Head does not reset the "metadata" slice (not sure where "series" slice
> is
>   reset in the head for pending series writes to WAL, want to do in same
> place).
> - Metadata is not re-written on change yet.
> - Tests.
>
>
> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington 
> wrote:
>
>> Sounds good, I've updated the proposal with details on places in which
>> changes
>> are required given the new approach:
>>
>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>>
>>
>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <
>> brian.bra...@robustperception.io> wrote:
>>
>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington 
>>> wrote:
>>>
>>>> True - I mean this could also be a blacklist by config perhaps, so if
>>>> you
>>>> really don't want to have increased egress you can optionally turn off
>>>> sending
>>>> the TYPE, HELP, UNIT or send them at different frequencies via config.
>>>> We could
>>>> package some sensible defaults so folks don't need to update their
>>>> config.
>>>>
>>>> The main intention is to enable these added features and make it
>>>> possible for
>>>> various consumers to be able to adjust some of these parameters if
>>>> required
>>>> since backends can be so different in their implementation. For M3 I
>>>> would be
>>>> totally fine with the extra egress that should be mitigated fairly
>>>> considerably
>>>> by Snappy and the fact that HELP is common across certain metric
>>>> families and
>>>> receiving it every single Remote Write request.
>>>>
>>>
>>> That's really a micro-optimisation. If you are that worried about
>>> bandwidth you'd run a sidecar specific to your remote backend that was
>>> stateful and far more efficient overall. Sending the full label names and
>>> values on every request is going to be far more than the overhead of
>>> metadata on top of that, so I don't see a need as it stands for any of this
>>> to be configurable.
>>>
>>> Brian
>>>
>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <
>>>> brian.bra...@robustperception.io> wrote:
>>>>
>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington 
>>>>> wrote:
>>>>>
>>>>>> Hey Björn,
>>>>>>
>>>>>>
>>>>>> Thanks for the detailed response. I've had a few back and forths on
>>>>>> this with
>>>>>> Brian and Chris over IRC and CNCF Slack now too.
>>>>>>
>>>>>> I agree that fundamentally it seems naive to idealistically model
>>>>>> this around
>>>>>> per metric name. It needs to be per series given what may happen
>>>>>> w.r.t.
>>>>>> collision across targets, etc.
>>>>>>
>>>>>> Perhaps we can separate these discussions apart into two
>>>>>> considerations:
>>>>>>
>

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-06 Thread Rob Skillington
Hey Björn,


Thanks for the detailed response. I've had a few back and forths on this
with
Brian and Chris over IRC and CNCF Slack now too.

I agree that fundamentally it seems naive to idealistically model this
around
per metric name. It needs to be per series given what may happen w.r.t.
collision across targets, etc.

Perhaps we can separate these discussions into two considerations:

1) Modeling of the data such that it is kept around for transmission
(primarily we're focused on the WAL here).

2) Transmission (which, as you allude to, has many areas for improvement).

For (1) - it seems like this needs to be done per time series. Thankfully,
we have already modeled per-series data to be stored just once in a single
WAL file. I will write up my proposal here, but it will amount to
essentially encoding the HELP, UNIT and TYPE to the WAL per series, similar
to how labels for a series are encoded once per series in the WAL. Since
this optimization is in place, there's already a huge dampening effect on
how expensive it is to write out data about a series (e.g. labels). We can
always go and collect a sample WAL file and measure how much extra size
this would add with/without HELP, UNIT and TYPE, but it seems like it won't
fundamentally change the order of magnitude of "information about a time
series" storage size vs "datapoints about a time series" storage size.

One extra change would be re-encoding the series into the WAL if the HELP
changed for that series, just so that when HELP does change it is up to
date from the view of whoever is reading the WAL (i.e. the Remote Write
loop). Since this entry needs to be loaded into memory for Remote Write
today anyway, with string interning as suggested by Chris it won't
algorithmically change the memory profile of a Prometheus with Remote Write
enabled. There will be some overhead that at most would likely be similar
to the label data, but we aren't altering data structures (so we won't
change the big-O magnitude of memory being used); we're adding fields to
existing data structures, and string interning should actually make it much
less onerous since there is a large duplicative effect with HELP among time
series.
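
As a rough illustration of the string-interning point (hypothetical names,
not any particular existing implementation):

```
package main

import (
	"fmt"
	"sync"
)

// interner keeps one canonical copy of each distinct string, so thousands
// of series sharing the same HELP text cost roughly one string per metric
// family rather than one per series.
type interner struct {
	mu   sync.Mutex
	pool map[string]string
}

func newInterner() *interner {
	return &interner{pool: make(map[string]string)}
}

func (i *interner) intern(s string) string {
	i.mu.Lock()
	defer i.mu.Unlock()
	if canonical, ok := i.pool[s]; ok {
		return canonical // reuse the existing allocation
	}
	i.pool[s] = s
	return s
}

func main() {
	in := newInterner()
	a := in.intern("The total number of scheduled rule group evaluations.")
	b := in.intern("The total number of scheduled rule group evaluations.")
	// Contents are equal either way; interning makes both share one backing
	// allocation instead of storing the HELP text once per series.
	fmt.Println(a == b)
}
```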

For (2) - we now have TYPE, HELP and UNIT all available for transmission if
we wanted to send them with every single datapoint. While I think we should
definitely examine HPACK-like compression features as you mentioned, Björn,
I think we should separate that kind of work into a Milestone 2 where it
can be considered. For the time being, it's very plausible we could do some
negotiation with the receiving Remote Write endpoint by sending a "GET" to
the remote write endpoint and seeing if it responds with a "capabilities +
preferences" response: the endpoint could specify that it would like to
receive metadata all the time on every single request (and let Snappy take
care of keeping the size from ballooning too much), or that it would like
TYPE on every single datapoint, and HELP and UNIT every DESIRED_SECONDS or
so. To enable a "send HELP every 10 minutes" feature, we would have to add
a "last sent" timestamp to the data structure that holds the LABELS, TYPE,
HELP and UNIT for each series, to know when to resend to that backend, but
that seems entirely plausible and would not use more than 4 extra bytes.

These thoughts are based on the discussion I've had and the thoughts on
this
thread. What's the feedback on this before I go ahead and re-iterate the
design
to more closely map to what I'm suggesting here?

Best,
Rob

On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein  wrote:

> On 03.08.20 03:04, Rob Skillington wrote:
> > Ok - I have a proposal which could be broken up into two pieces, first
> > delivering TYPE per datapoint, the second consistently and reliably HELP
> and
> > UNIT once per unique metric name:
> >
> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo
> > /edit#heading=h.bik9uwphqy3g
>
> Thanks for the doc. I have commented on it, but while doing so, I felt
> the urge to comment more generally, which would not fit well into the
> margin of a Google doc. My thoughts are also a bit out of scope of
> Rob's design doc and more about the general topic of remote write and
> the equally general topic of metadata (about which we have an ongoing
> discussion among the Prometheus developers).
>
> Disclaimer: I don't know the remote-write protocol very well. My hope
> here is that my somewhat distant perspective is of some value as it
> allows to take a step back. However, I might just miss crucial details
> that completely invalidate my thoughts. We'll see...
>
> I do care a lot about metadata, though. (And ironically, the reason
> why I declared remote write "somebody else's problem" is that I've
> always disl

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-08-03 Thread Rob Skillington
Ok - I have a proposal which could be broken up into two pieces: the first
delivering TYPE per datapoint, the second consistently and reliably
delivering HELP and UNIT once per unique metric name:
https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g

Would love to get some feedback on it. Thanks for the consideration. Is
there anyone in particular I should reach out to ask for feedback from
directly?

Best,
Rob


On Tue, Jul 21, 2020 at 5:55 PM Rob Skillington  wrote:

> Also want to point out that with just TYPE you can do things such as know
> it's a histogram type and then suggest using "sum(rate(...)) by (le)" with
> a one click button in a UI which again is significantly harder without that
> information.
>
> The reason it becomes important though is some systems (i.e. StackDriver)
> require this schema/metric information the first time you record a sample.
> So you really want the very basics of it the first time you receive that
> sample (i.e. at least TYPE):
>
> Defines a metric type and its schema. Once a metric descriptor is created,
>> deleting or altering it stops data collection and makes the metric type's
>> existing data unusable.
>> The following are specific rules for service defined Monitoring metric
>> descriptors:
>> type, metricKind, valueType and description fields are all required. The
>> unit field must be specified if the valueType is any of DOUBLE, INT64,
>> DISTRIBUTION.
>> Maximum of default 500 metric descriptors per service is allowed.
>> Maximum of default 10 labels per metric descriptor is allowed.
>
>
> https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors
>
> Just an example, but other systems and definitely systems that want to do
> processing of metrics on the way in would prefer at very least things like
> TYPE and maybe ideally UNIT too are specified.
>
>
> On Tue, Jul 21, 2020 at 5:49 PM Rob Skillington 
> wrote:
>
>> Hey Chris,
>>
>> Apologies on the delay to your response.
>>
>> Yes I think that even just TYPE would be a great first step. I am working
>> on a very small one pager that outlines perhaps how we get from here to
>> that future you talk about.
>>
>> In terms of downstream processing, just having the TYPE on every single
>> sample would be a huge step forward as it enables the ability to do
>> stateless processing of the metric (i.e. downsampling and working out
>> whether counter resets need to be detected during downsampling of this
>> single individual sample).
>>
>> Also you can imagine this enables the ability to suggest certain
>> functions that can be applied, i.e. auto-suggest rate(...) should be
>> applied without needing to analyze or use best effort heuristics of the
>> actual values of a time series.
>>
>> Completely agreed that solving this for UNIT and HELP is more difficult
>> and that information would likely be much nicer to be sent/stored per
>> metric name rather than per time-series sample.
>>
>> I'll send out the Google doc for some comments shortly.
>>
>> Transactional approach is interesting, it could be difficult given that
>> this information can flap (i.e. start with some value for HELP/UNIT but a
>> different target of the same application has a different value) and hence
>> that means ordering is important and dealing with transactional order could
>> be a hard problem. I agree that making this deterministic if possible would
>> be great. Maybe it could be something like a token that is sent alongside
>> the first remote write payload, and if that continuation token that the
>> receiver sees means it missed some part of the stream then it can go and do
>> a full sync and from there on in receive updates/additions in a
>> transactional way from the stream over remote write. Just a random thought
>> though and requires more exploration / different solutions being listed to
>> weigh up pros/cons/complexity/etc.
>>
>> Best,
>> Rob
>>
>>
>>
>> On Thu, Jul 16, 2020 at 4:39 PM Chris Marchbanks 
>> wrote:
>>
>>> Hi Rob,
>>>
>>> I would also like metadata to become stateless, and view 6815
>>> <https://github.com/prometheus/prometheus/pull/6815> only as a first
>>> step, and the start of an output format. Currently, there is a work in
>>> progress design doc, and another topic for an upcoming dev summit, for
>>> allowing use cases where metadata needs to be in the same request as the
>>> samples.
>>>
>>> Generally, I (and some others I have talked to) don't want to sen

Re: [prometheus-developers] Re: Remote Write Metadata propagation

2020-07-21 Thread Rob Skillington
Also want to point out that with just TYPE you can do things such as know
it's a histogram and then suggest using "sum(rate(...)) by (le)" with a
one-click button in a UI, which again is significantly harder without that
information.
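
For example, the kind of one-click suggestion this enables could be driven by
logic roughly like the following (an illustrative Go sketch with made-up
names, not code from Prometheus or any particular UI):

package main

import "fmt"

// suggestQuery is an illustrative helper (not part of Prometheus or any UI):
// given a metric's TYPE, return a sensible starting PromQL expression.
func suggestQuery(name, metricType string) string {
    switch metricType {
    case "counter":
        // Counters are almost always rated before being graphed.
        return fmt.Sprintf("rate(%s[5m])", name)
    case "histogram":
        // Histogram buckets are aggregated by le, as mentioned above.
        return fmt.Sprintf("sum(rate(%s_bucket[5m])) by (le)", name)
    case "summary":
        // Average observed value from the _sum and _count series.
        return fmt.Sprintf("rate(%s_sum[5m]) / rate(%s_count[5m])", name, name)
    default:
        // gauge / untyped: without TYPE we can only guess from the values.
        return name
    }
}

func main() {
    fmt.Println(suggestQuery("http_request_duration_seconds", "histogram"))
    fmt.Println(suggestQuery("http_requests_total", "counter"))
}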

The reason it becomes important, though, is that some systems (e.g.
StackDriver) require this schema/metric information the first time you
record a sample. So you really want the very basics of it the first time
you receive that sample (i.e. at least TYPE):

Defines a metric type and its schema. Once a metric descriptor is created,
> deleting or altering it stops data collection and makes the metric type's
> existing data unusable.
> The following are specific rules for service defined Monitoring metric
> descriptors:
> type, metricKind, valueType and description fields are all required. The
> unit field must be specified if the valueType is any of DOUBLE, INT64,
> DISTRIBUTION.
> Maximum of default 500 metric descriptors per service is allowed.
> Maximum of default 10 labels per metric descriptor is allowed.

https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors

Just an example, but other systems, and definitely systems that want to
process metrics on the way in, would prefer that at the very least TYPE,
and ideally UNIT too, are specified.
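
As a rough illustration of why having at least TYPE on the very first sample
matters for such systems: a stateless receiver could derive a
StackDriver-style descriptor directly from per-sample metadata, along these
lines (a simplified sketch with assumed field names and mapping, not the
actual Cloud Monitoring API):

package main

import "fmt"

// descriptor is an illustrative stand-in for a StackDriver-style metric
// descriptor; the field names are assumptions, not the real Cloud Monitoring
// API types.
type descriptor struct {
    Type        string // e.g. "custom.googleapis.com/http_requests_total"
    MetricKind  string // "GAUGE" or "CUMULATIVE"
    ValueType   string // "DOUBLE" or "DISTRIBUTION"
    Unit        string
    Description string
}

// descriptorFor derives a descriptor from metadata carried with the sample
// itself, which is only possible if at least TYPE arrives on the first write.
// The mapping below is deliberately simplified for illustration.
func descriptorFor(name, promType, unit, help string) descriptor {
    d := descriptor{
        Type:        "custom.googleapis.com/" + name,
        Unit:        unit,
        Description: help,
    }
    switch promType {
    case "counter":
        d.MetricKind, d.ValueType = "CUMULATIVE", "DOUBLE"
    case "histogram":
        d.MetricKind, d.ValueType = "CUMULATIVE", "DISTRIBUTION"
    default: // gauge, summary, untyped
        d.MetricKind, d.ValueType = "GAUGE", "DOUBLE"
    }
    return d
}

func main() {
    fmt.Printf("%+v\n", descriptorFor("http_requests_total", "counter", "1", "Total HTTP requests."))
}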


On Tue, Jul 21, 2020 at 5:49 PM Rob Skillington  wrote:

> Hey Chris,
>
> Apologies on the delay to your response.
>
> Yes, I think that even just TYPE would be a great first step. I am working
> on a very small one-pager that outlines how we might get from here to the
> future you talk about.
>
> In terms of downstream processing, just having the TYPE on every single
> sample would be a huge step forward, as it enables stateless processing of
> the metric (e.g. downsampling, and working out whether counter resets need
> to be detected when downsampling this single individual sample).
>
> Also you can imagine this enables suggesting certain functions to apply,
> e.g. auto-suggesting that rate(...) should be applied, without needing to
> analyze or use best-effort heuristics on the actual values of a time
> series.
>
> Completely agreed that solving this for UNIT and HELP is more difficult
> and that information would likely be much nicer to be sent/stored per
> metric name rather than per time-series sample.
>
> I'll send out the Google doc for some comments shortly.
>
> The transactional approach is interesting, but it could be difficult given
> that this information can flap (i.e. one target starts with some value for
> HELP/UNIT while a different target of the same application reports a
> different value), which means ordering is important and dealing with
> transactional order could be a hard problem. I agree that making this
> deterministic, if possible, would be great. Maybe it could be something
> like a token that is sent alongside the first remote write payload; if the
> continuation token the receiver sees indicates it missed some part of the
> stream, it can do a full sync and from then on receive updates/additions
> transactionally over the remote write stream. Just a random thought though,
> and it requires more exploration / different solutions being listed to
> weigh up pros/cons/complexity/etc.
>
> Best,
> Rob
>
>
>
> On Thu, Jul 16, 2020 at 4:39 PM Chris Marchbanks 
> wrote:
>
>> Hi Rob,
>>
>> I would also like metadata to become stateless, and view 6815
>> <https://github.com/prometheus/prometheus/pull/6815> only as a first
>> step, and the start of an output format. Currently, there is a work in
>> progress design doc, and another topic for an upcoming dev summit, for
>> allowing use cases where metadata needs to be in the same request as the
>> samples.
>>
>> Generally, I (and some others I have talked to) don't want to send all
>> the metadata with every sample as that is very repetitive, specifically for
>> histograms and metrics with many series. Instead, I would like remote write
>> requests to become transaction based, at which point all the metadata from
>> that scrape/transaction can be added to the metadata field introduced to
>> the proto in 6815 <https://github.com/prometheus/prometheus/pull/6815>
>> and then each sample can be linked to a metadata entry without as much
>> duplication. That is very broad strokes, and I am sure it will be refined
>> or changed completely with more usage.
>>
>> That said, TYPE and UNIT are much smaller than metric name and help text,
>> and I would support adding those to a linked metadata entry before remote
>> write becomes transactional. Would that satisfy your use c
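
The stateless downsampling mentioned in the quoted reply above could look
roughly like this minimal Go sketch (made-up types, not M3 or Prometheus
code): a per-sample TYPE lets the downsampler pick counter-reset-aware
accumulation versus last-value handling, with no global metadata lookup.

package main

import "fmt"

// sample is an illustrative datapoint that carries its metric TYPE inline.
type sample struct {
    Value float64
    Type  string // "counter", "gauge", ...
}

// downsample collapses an in-order window of samples for a single series into
// one value. Because every sample carries TYPE, no lookup against a shared
// metadata store is needed: counters get reset-aware delta accumulation,
// everything else just keeps the last observed value.
func downsample(window []sample) float64 {
    if len(window) == 0 {
        return 0
    }
    if window[0].Type != "counter" {
        return window[len(window)-1].Value
    }
    total := 0.0
    prev := window[0].Value
    for _, s := range window[1:] {
        if s.Value < prev {
            // Counter reset: the counter restarted, so the new value is the
            // whole increase since the reset.
            total += s.Value
        } else {
            total += s.Value - prev
        }
        prev = s.Value
    }
    return total
}

func main() {
    window := []sample{{10, "counter"}, {15, "counter"}, {3, "counter"}, {7, "counter"}}
    fmt.Println(downsample(window)) // 5 + 3 + 4 = 12
}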

[prometheus-developers] Re: Remote Write Metadata propagation

2020-07-16 Thread Rob Skillington
Typo: "community request" should be: "community contribution that
duplicates some of PR 6815"

On Thu, Jul 16, 2020 at 3:27 PM Rob Skillington  wrote:

> Firstly: Thanks a lot for sharing the dev summit notes, they are greatly
> appreciated. Also thank you for a great PromCon!
>
> In regards to the Prometheus remote write metadata propagation consensus,
> are there any plans/projects/collaborations to work on a protocol that
> might help others in the ecosystem offer the same benefits to Prometheus
> ecosystem projects that operate on a per-write-request basis (i.e.
> stateless processing of a write request)?
>
> I understand https://github.com/prometheus/prometheus/pull/6815 unblocks
> feature development on top of Prometheus for users with specific
> architectures; however, it is a non-starter for a lot of other projects,
> especially for third-party exporters to systems that end users do not own
> (e.g. for a remote write endpoint that targets StackDriver, the community
> is unable to change the implementation of StackDriver itself to cache
> metrics metadata or otherwise make it statefully available at ingestion
> time).
>
> Obviously I have a vested interest: as a remote write target, M3 has
> several stateless components before TSDB ingestion, and flowing the entire
> set of metadata to a distributed set of DB nodes, each owning a different
> slice of the metrics space, has implications for M3 itself of course too
> (i.e. it is non-trivial to map metric name -> DB node without some messy
> stateful cache sitting somewhere in the architecture, which adds
> operational burden for end users).
>
> I suppose what I'm asking is: are maintainers open to a community request
> that duplicates some of https://github.com/prometheus/prometheus/pull/6815 but
> sends just metric TYPE and UNIT per datapoint (which would need to be
> captured by the WAL if the feature is enabled) to a backend, so that each
> write can be processed correctly without needing to sync a global set of
> metadata to the backend?
>
> And if not, what are the plans here and how can we collaborate to make
> this data useful to other consumers in the Prometheus ecosystem.
>
> Best intentions,
> Rob
>
>
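
For reference, the per-datapoint TYPE and UNIT being asked for here could be
carried in a payload shaped roughly like the following. This is a
hypothetical Go sketch for discussion only; it is not the actual remote write
protobuf nor any agreed-upon format.

package main

import "fmt"

// These types sketch what a write payload could carry if TYPE and UNIT
// travelled with every series; they are hypothetical and for discussion only.
type label struct {
    Name, Value string
}

type samplePoint struct {
    TimestampMs int64
    Value       float64
}

type timeSeries struct {
    Labels  []label
    Samples []samplePoint
    // Per-series metadata: small enough to repeat on every write, yet enough
    // for a stateless receiver to handle the request without a metadata sync.
    Type string // "counter", "gauge", "histogram", "summary", "untyped"
    Unit string // e.g. "seconds", "bytes"
}

type writeRequest struct {
    Timeseries []timeSeries
}

func main() {
    req := writeRequest{Timeseries: []timeSeries{{
        Labels:  []label{{Name: "__name__", Value: "http_request_duration_seconds_bucket"}, {Name: "le", Value: "0.5"}},
        Samples: []samplePoint{{TimestampMs: 1595376000000, Value: 42}},
        Type:    "histogram",
        Unit:    "seconds",
    }}}
    fmt.Printf("%+v\n", req)
}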
