[jira] [Commented] (RATIS-2389) Implementing Opentelemetry Tracing in Apache Ratis
[ https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057187#comment-18057187 ]
Tsz-wo Sze commented on RATIS-2389:
---
bq. ... the client might trigger retries ...
We should track using [ClientInvocationId|https://github.com/apache/ratis/blob/master/ratis-common/src/main/java/org/apache/ratis/protocol/ClientInvocationId.java]. The retries will have a different ID.
> Implementing Opentelemetry Tracing in Apache Ratis
> --
>
> Key: RATIS-2389
> URL: https://issues.apache.org/jira/browse/RATIS-2389
> Project: Ratis
> Issue Type: New Feature
> Components: client, server
>Affects Versions: 3.3.0
>Reporter: Tak-Lon (Stephen) Wu
>Assignee: Tak-Lon (Stephen) Wu
>Priority: Minor
> Attachments: PoC-result-collected-spans.png,
> PoC-result-span-detail.png, PoC-result.png
>
>
> This proposal outlines the addition of OpenTelemetry support to Ratis. By
> instrumenting the full client-side request path, we can empower users and
> maintainers with the granular data necessary for both long-term performance
> optimization and proactive daily monitoring.
> * 1-pager proposal:
> [https://docs.google.com/document/d/1UKGVqOzkAXqUAJxOz1RHq6fIiO3xqV57eIqi-f9qdE4/edit?tab=t.0#heading=h.5a3u31wlm0n]
> * PoC: [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Subtasks
> * Define the Metadata Field: Modify RaftRpcMessage.proto to include an
> optional SpanContext field.
> * Add TraceUtil: Land the utility class in ratis-common based on the code
> you see in HBase.
> * Create the client span: Introduce the span supplier and CLIENT span hook.
> * Instrument GRPC on the Server: Start with the GRPC module as it is the
> most common transport. Instrument the onNext methods (or within the caller)
> to start/stop spans.
> * Come up with the user guide as part of the release.
>
> Preliminary results were captured by running filestore example.
> Reference
> 1. HBase Tracing with Opentelemetry, HBASE-22120
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (RATIS-2389) Implementing Opentelemetry Tracing in Apache Ratis
[ https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057143#comment-18057143 ]
Xinyu Tan commented on RATIS-2389:
--
[~taklwu]
> I am still getting up to speed on Ratis, but I want to clarify the scope of
> the retries you mentioned. Are we referring to sendRequestWithRetry (used by
> AsyncImpl#sendReadOnlyUnordered) or the operations governed by RetryPolicies?
> If so, should we address this as a follow-up task if time permits.
Yes, I was referring to the fact that the client might trigger retries for
various reasons, which could lead to an unexpected increase in latency. It
would be best to add detection for this within the Client span.
I have no further questions regarding the other parts. Thanks for your patient
replies! Looking forward to this feature!
[jira] [Commented] (RATIS-2389) Implementing Opentelemetry Tracing in Apache Ratis
[
https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056750#comment-18056750
]
Tak-Lon (Stephen) Wu commented on RATIS-2389:
-
Hi [~tanxinyu], thanks for your review and comments. Your observation is
correct; let me respond inline below.
{quote}As a possible follow-up improvement, it might be worth considering
modeling retries on the client side as child spans, and capturing more detailed
critical paths in the server-side read and write workflows
{quote}
I am still getting up to speed on Ratis, but I want to clarify the scope of the
retries you mentioned. Are we referring to {{sendRequestWithRetry}} (used by
{{{}AsyncImpl#sendReadOnlyUnordered{}}}) or the operations governed by
RetryPolicies? If so, should we address this as a follow-up task, time
permitting?
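To make the "retries as child spans" idea concrete, here is a minimal sketch in plain Java. It is not Ratis or OpenTelemetry code: {{DemoSpan}} and {{callWithRetry}} are hypothetical stand-ins invented for illustration, and the real retry path may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class RetrySpanSketch {
  /** Hypothetical stand-in for a span; this is not an OpenTelemetry type. */
  static final class DemoSpan {
    final String name;
    final List<DemoSpan> children = new ArrayList<>();
    String status = "UNSET";

    DemoSpan(String name) {
      this.name = name;
    }

    /** Creates a child span and attaches it to this span. */
    DemoSpan child(String childName) {
      final DemoSpan c = new DemoSpan(childName);
      children.add(c);
      return c;
    }
  }

  /** Runs the call up to maxAttempts times; each attempt becomes one child span of parent. */
  static <T> T callWithRetry(DemoSpan parent, int maxAttempts, Supplier<T> call) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final DemoSpan span = parent.child("attempt-" + attempt);
      try {
        final T result = call.get();
        span.status = "OK";
        parent.status = "OK";
        return result;
      } catch (RuntimeException e) {
        // A failed attempt stays visible in the trace as an ERROR child.
        span.status = "ERROR";
        last = e;
      }
    }
    parent.status = "ERROR";
    throw last;
  }

  public static void main(String[] args) {
    final DemoSpan parent = new DemoSpan("AsyncImpl::send");
    final int[] calls = {0};
    // Fails twice, then succeeds on the third attempt.
    final String reply = callWithRetry(parent, 5, () -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("simulated failure");
      }
      return "reply";
    });
    System.out.println(reply + " after " + parent.children.size() + " attempts");
  }
}
```

With this shape, the number of children under the operation span shows directly how many attempts a slow request needed.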
Regarding server-side tracing, my current thinking is to implement these as
*INTERNAL* span kinds (perhaps also as a follow-up). This would provide
much-needed visibility into Log Appenders & Replication, Snapshots, internal
state transitions, etc.
Ultimately, we will rely on user and community feedback to identify further
checkpoints or improvements needed for the tracing pipeline as it evolves.
Thanks again for your suggestion.
{quote}Regarding trace data delivery and configuration, it may be helpful if
the documentation
{quote}
The sub-task RATIS-2396 should partially cover the documentation request, and
this is a good reminder for me to include some information about how
OpenTelemetry traces are collected and sent.
To answer your concern briefly: the collected trace data is sent
asynchronously, and the sample ratio, batch size, and target host could be
configurable.
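As a rough illustration of those knobs, the sketch below lists the corresponding system properties of the OpenTelemetry Java SDK autoconfigure module. Whether Ratis will rely on autoconfigure or expose its own settings is an assumption here, not something decided in this issue; the class name and all values are examples only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TraceExportConfigSketch {
  /**
   * Property names come from the OpenTelemetry Java SDK autoconfigure module.
   * The values are illustrative, not recommendations for Ratis.
   */
  static Map<String, String> exampleProperties() {
    final Map<String, String> p = new LinkedHashMap<>();
    // Sample roughly 10% of new traces, honoring the parent's sampling decision.
    p.put("otel.traces.sampler", "parentbased_traceidratio");
    p.put("otel.traces.sampler.arg", "0.1");
    // Batch span processor: spans are exported asynchronously in batches.
    p.put("otel.bsp.schedule.delay", "5000");       // milliseconds between exports
    p.put("otel.bsp.max.export.batch.size", "512"); // spans per export batch
    // Target host of the OTLP collector (for example, a Jaeger OTLP endpoint).
    p.put("otel.exporter.otlp.endpoint", "http://localhost:4317");
    return p;
  }

  public static void main(String[] args) {
    // Must run before the SDK is initialized for the settings to take effect.
    exampleProperties().forEach(System::setProperty);
    exampleProperties().keySet()
        .forEach(k -> System.out.println(k + "=" + System.getProperty(k)));
  }
}
```

The same settings can also be passed as -D flags or as the equivalent OTEL_* environment variables, so no code change is needed to tune them.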
{quote}would it be possible to visualize the tracing results using Jaeger and
include some screenshots or links (for example, showing the read and write
paths) in the PR or related documentation?
{quote}
If you look at the PoC [1], we traced AsyncImpl#send, which supports
`sendReadOnly`, `sendReadAfterWrite`, etc. The PoC result was captured by
running the FileStore client, and [^PoC-result-span-detail.png], attached when
the JIRA was created, was visualized in the Jaeger UI (as you requested). E.g.,
`RW/org.apache.ratis.protocol.RaftClientRequest/sendRequestAsync` is a "Write"
request, where `RW` is the `TypeCase` in the code for `Write`.
{code:java}
CompletableFuture<RaftClientReply> send(
    RaftClientRequest.Type type, Message message, RaftPeerId server) {
  // Describes the span for this operation: its name and the request type.
  final Supplier<Span> spanSupplier = new OperationSpanBuilder(server)
      .setOperationName("AsyncImpl::send")
      .setOperationType(type);
  // Run the ordered async send under the span supplied above.
  return TraceUtils.tracedFuture(
      () -> client.getOrderedAsync().send(type, message, server),
      spanSupplier);
}
{code}
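For readers unfamiliar with the pattern, a helper like `tracedFuture` usually starts a span, invokes the asynchronous action, and ends the span when the returned future completes. The following is a minimal sketch of that idea in plain Java, not the PoC's actual implementation: `DemoSpan` is a hypothetical stand-in for an OpenTelemetry span, and the method signature is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class TracedFutureSketch {
  /** Hypothetical stand-in for a span; records only what this sketch needs. */
  static final class DemoSpan {
    final String name;
    volatile boolean ended;
    volatile Throwable error;

    DemoSpan(String name) {
      this.name = name;
    }

    void end() {
      ended = true;
    }
  }

  /** Starts a span, runs the async action, and ends the span when the future completes. */
  static <T> CompletableFuture<T> tracedFuture(
      Supplier<CompletableFuture<T>> action, Supplier<DemoSpan> spanSupplier) {
    final DemoSpan span = spanSupplier.get();
    final CompletableFuture<T> future;
    try {
      future = action.get();
    } catch (RuntimeException e) {
      // The action failed before returning a future: record the error and end the span now.
      span.error = e;
      span.end();
      throw e;
    }
    // The span covers the whole asynchronous call, including a failed completion.
    return future.whenComplete((reply, t) -> {
      if (t != null) {
        span.error = t;
      }
      span.end();
    });
  }

  public static void main(String[] args) {
    final DemoSpan span = new DemoSpan("AsyncImpl::send");
    final CompletableFuture<String> f =
        tracedFuture(() -> CompletableFuture.completedFuture("reply"), () -> span);
    System.out.println(f.join() + " ended=" + span.ended); // prints "reply ended=true"
  }
}
```

Ending the span in `whenComplete` rather than right after the call returns is what lets the span duration reflect the full request latency.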
I will put those pictures back into the documentation after this comment.
Reference
1. [https://github.com/taklwu/ratis/tree/opentelemetry0129]
[jira] [Commented] (RATIS-2389) Implementing Opentelemetry Tracing in Apache Ratis
[ https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056669#comment-18056669 ]
Xinyu Tan commented on RATIS-2389:
--
[~taklwu] Thank you for the proposal; it looks very promising overall.
At the moment, the tracing seems to focus on two high-level spans (client and
server). As a possible follow-up improvement, it might be worth considering
modeling retries on the client side as child spans, and capturing more detailed
critical paths in the server-side read and write workflows. This could help
provide deeper insights into potential read/write bottlenecks and guide future
optimizations.
Regarding trace data delivery and configuration, it may be helpful if the
documentation could further clarify how these aspects are defined and
configured, such as whether trace data is sent synchronously or asynchronously,
the batching strategy and batch size, the target host, and whether RPC
compression can be enabled.
For the current POC, before moving the PR into a formal review state, would it
be possible to visualize the tracing results using Jaeger and include some
screenshots or links (for example, showing the read and write paths) in the PR
or related documentation? This could help reviewers gain a clearer and more
concrete understanding of the POC’s behavior and effectiveness.
[jira] [Commented] (RATIS-2389) Implementing Opentelemetry Tracing in Apache Ratis
[ https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056303#comment-18056303 ]
Tsz-wo Sze commented on RATIS-2389:
---
[~taklwu], thanks a lot for working on the OpenTelemetry instrumentation. The
proposal sounds great! Assigning this to you.
