[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes Restart of all Tasks with Regional Failover Strategy

[ https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-30120:

Description:

Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be restarted. For example, a single task fails because of Statefun endpoint unavailability, or one of our Kubernetes TaskManager pods goes down. Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

{code:java}
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 5650 tasks should be restarted to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
{code}

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot. Is there a way to tell Flink to restart only the failed task? Or is there a specific reason why the region failover strategy decides to restart all tasks? If not, we'd really appreciate a way to enable individual task failovers.

Thanks in advance!
Stephan

Key: FLINK-30120
URL: https://issues.apache.org/jira/browse/FLINK-30120
Project: Flink
Issue Type: Improvement
Components: Stateful Functions
Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
Reporter: Stephan Weinwurm
Priority: Major

-- This message was sent by Atlassian Jira (v8.20.10#820010)
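The behaviour asked about above is governed by Flink's failover-strategy setting. A minimal flink-conf.yaml sketch (the key names come from Flink's configuration reference; the values are illustrative, not taken from this deployment):

{code}
# 'region' (the default) restarts only the pipelined region that contains
# the failed task; 'full' restarts the whole job on any task failure.
jobmanager.execution.failover-strategy: region

# A restart strategy is also needed for any automatic recovery, e.g.:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
{code}

Note that if every operator in the job graph is connected through all-to-all (keyed) edges, the whole job forms a single pipelined region, in which case `region` restarts all tasks anyway. That would be consistent with the 5650-task restart shown in the log above.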
[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes Restart of all Tasks with Regional Failover Strategy

[ https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-30120:

Summary: Flink Statefun Task Failure causes Restart of all Tasks with Regional Failover Strategy (was: Flink Statefun Task Failure causes restart of all tasks with regional failover strategy)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks with regional failover strategy

[ https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-30120:

Summary: Flink Statefun Task Failure causes restart of all tasks with regional failover strategy (was: Flink Statefun Task Failure causes restart of all tasks)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks

Stephan Weinwurm created FLINK-30120:

Summary: Flink Statefun Task Failure causes restart of all tasks
Key: FLINK-30120
URL: https://issues.apache.org/jira/browse/FLINK-30120
Project: Flink
Issue Type: Improvement
Components: Stateful Functions
Affects Versions: statefun-3.1.1, statefun-3.2.0, statefun-3.0.0
Reporter: Stephan Weinwurm

Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be restarted. For example, a single task fails because of Statefun endpoint unavailability, or one of our Kubernetes TaskManager pods goes down. Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

{code:java}
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 5650 tasks should be restarted to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
{code}

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot. Is there a way to tell Flink to restart only the failed task? Or is there a specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607301#comment-17607301 ] Stephan Weinwurm commented on FLINK-28747:

[~groot] / [~trohrmann] Quick ping. I don't have enough knowledge of what the right fix is here. Would one of you mind taking a look and getting this fixed? Thank you in advance!

Key: FLINK-28747
URL: https://issues.apache.org/jira/browse/FLINK-28747
Project: Flink
Issue Type: Bug
Components: Stateful Functions
Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
Reporter: Stephan Weinwurm
Priority: Major

Hi all,

We've suddenly started to see the following exception in our HTTP statefun function endpoints:

{code}
Traceback (most recent call last):
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__
    await span_processor.execute()
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute
    raise e
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute
    await self.app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
  File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler
    result = await handler.handle_async(request_body)
  File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async
    msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id,
  File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__
    raise ValueError("target_id can not be missing")
{code}

Interestingly, this started happening in three separate Flink deployments at the very same time. The only thing the three deployments have in common is that they consume the same Kafka topics. No deployments had happened when the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.

We were also able to extract the request that Flink sends to the HTTP statefun endpoint:

{code}
{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': '-redicated-'}}]}}
{code}

As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.

This is our module.yaml from one of the Flink deployments:

{code}
version: "3.0"
module:
  meta:
    type: remote
  spec:
    endpoints:
      - endpoint:
          meta:
            kind: io.statefun.endpoints.v1/http
          spec:
            functions: com.x.dummy/dummy
            urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
            timeouts:
              call: 2 min
              read: 2 min
              write: 2 min
            maxNumBatchRequests: 100
    ingresses:
      - ingress:
          meta:
            type: io.statefun.kafka/ingress
            id: com.x/ingress
          spec:
            address: x-kafka-0.x.ue1.x.net:9092
            consumerGroupId: x-worker-dummy
            topics:
              - topic: v2_post_events
                valueType: type.googleapis.com/v2_event.Event
                targets:
                  - com.x.dummy/dummy
            startupPosition:
              type: group-offsets
              autoOffsetResetPosition: earliest
{code}
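The `ValueError` in the traceback above is raised by a validation in the Python SDK's `Message` constructor (`statefun/messages.py`). A minimal, hypothetical reconstruction of that check (this is not the SDK's actual code) illustrates why a proto3 default value trips it: proto3 serializes an unset string field as the empty string, so a missing `id` on the wire reaches the constructor as `""` and fails a truthiness check.

```python
class Message:
    """Simplified stand-in for statefun.messages.Message (illustrative only)."""

    def __init__(self, target_typename: str, target_id: str, payload: bytes = b""):
        if not target_typename:
            raise ValueError("target_typename can not be missing")
        # An unset proto3 string field deserializes to "", which is falsy,
        # so a request without invocation.target.id ends up here.
        if not target_id:
            raise ValueError("target_id can not be missing")
        self.target_typename = target_typename
        self.target_id = target_id
        self.payload = payload
```

Under this reading, the empty `target.id` in the extracted request above would be indistinguishable from a missing one by the time it reaches the constructor.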
[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581940#comment-17581940 ] Stephan Weinwurm commented on FLINK-28747:

[~groot], what's the correct fix here? Allow `None` values for `target_id`, or should we set a fixed / random value for `target_id`? I don't know what the implications of either solution are. If the fix is trivial, I'm happy to do it; I'm just not very familiar with the code base.
[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575006#comment-17575006 ] Stephan Weinwurm commented on FLINK-28747:

Yes, you're right, that seems like protobuf behaviour. What's the correct fix here in the python sdk? Remove [the check|https://github.com/apache/flink-statefun/blob/master/statefun-sdk-python/statefun/messages.py#L41] and/or set `target_id` to a random or fixed value? Thanks again for looking into this!
[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574331#comment-17574331 ] Stephan Weinwurm commented on FLINK-28747: -- Hey [~trohrmann], thank you very much for taking a look! Yes, you are right, I was able to confirm that we do have empty keys in the Kafka topic (this shouldn't happen but that's a different story). The {{ToFunction}} message that I've pasted in the description does not contain {{target.id}}. Is this expected? If the key in Kafka was indeed an empty string, shouldn't Flink send an empty string to the function as well? I would expect {{'id': ''}} to be in the {{target}} object. In this case a check for an empty string would still cause the error. > "target_id can not be missing" in HTTP statefun request > --- > > Key: FLINK-28747 > URL: https://issues.apache.org/jira/browse/FLINK-28747 > Project: Flink > Issue Type: Bug > Components: Stateful Functions >Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1 >Reporter: Stephan Weinwurm >Priority: Major > > Hi all, > We've suddenly started to see the following exception in our HTTP statefun > functions endpoints: > {code}Traceback (most recent call last): > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", > line 403, in run_asgi > result = await app(self.scope, self.receive, self.send) > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", > line 78, in __call__ > return await self.app(scope, receive, send) > File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line > 37, in __call__ > await span_processor.execute() > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 61, in execute > raise e > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 57, in execute > await self.app(self.scope, self.receive, self.send) > File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", > 
line 124, in __call__ > await self.middleware_stack(scope, receive, send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 184, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 162, in __call__ > await self.app(scope, receive, _send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 75, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 64, in __call__ > await self.app(scope, receive, sender) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 680, in __call__ > await route.handle(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 275, in handle > await self.app(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 65, in app > response = await func(request) > File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", > line 25, in statefun_handler > result = await handler.handle_async(request_body) > File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", > line 262, in handle_async > msg = Message(target_typename=sdk_address.typename, > target_id=sdk_address.id, > File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line > 42, in __init__ > raise ValueError("target_id can not be missing"){code} > Interestingly, this has started to happen in three separate Flink deployments > at the very same time. The only thing in common between the three deployments > is that they consume the same Kafka topics. > No deployments have happened when the issue started happening which was on > July 28th 3:05PM. We have since been continuously seeing the error. 
> We were also able to extract the request that Flink sends to the HTTP > statefun endpoint: > {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': > 'dummy'}, 'invocations': [{'argument': {'typename': > 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': > '-redicated-'}}]}} > {code} > As you can see, no `id` field is present in the `invocation.target` object or > the `target_id` was an empty string. > > This is our module.yaml from one of the Flink deployments: > > {code} > version: "3.0" > module: > meta: > type: remote > spec: > endpoints: > - endpoint: > meta: > kind: io.statefun.endpoints.v1/http > spec: > functions: com.x.dummy/dummy > urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] > timeouts: > call: 2 min > read: 2 min > write: 2 min >
[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573961#comment-17573961 ] Stephan Weinwurm commented on FLINK-28747: -- I've tested this issue in the following versions and they all behave the same: * 3.2.0 * 3.1.1 * 3.0.0 > "target_id can not be missing" in HTTP statefun request > --- > > Key: FLINK-28747 > URL: https://issues.apache.org/jira/browse/FLINK-28747 > Project: Flink > Issue Type: Bug > Components: Stateful Functions >Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1 >Reporter: Stephan Weinwurm >Priority: Major > > Hi all, > We've suddenly started to see the following exception in our HTTP statefun > functions endpoints: > {code}Traceback (most recent call last): > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", > line 403, in run_asgi > result = await app(self.scope, self.receive, self.send) > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", > line 78, in __call__ > return await self.app(scope, receive, send) > File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line > 37, in __call__ > await span_processor.execute() > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 61, in execute > raise e > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 57, in execute > await self.app(self.scope, self.receive, self.send) > File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", > line 124, in __call__ > await self.middleware_stack(scope, receive, send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 184, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 162, in __call__ > await self.app(scope, receive, _send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 75, in __call__ > 
raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 64, in __call__ > await self.app(scope, receive, sender) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 680, in __call__ > await route.handle(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 275, in handle > await self.app(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 65, in app > response = await func(request) > File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", > line 25, in statefun_handler > result = await handler.handle_async(request_body) > File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", > line 262, in handle_async > msg = Message(target_typename=sdk_address.typename, > target_id=sdk_address.id, > File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line > 42, in __init__ > raise ValueError("target_id can not be missing"){code} > Interestingly, this has started to happen in three separate Flink deployments > at the very same time. The only thing in common between the three deployments > is that they consume the same Kafka topics. > No deployments have happened when the issue started happening which was on > July 28th 3:05PM. We have since been continuously seeing the error. > We were also able to extract the request that Flink sends to the HTTP > statefun endpoint: > {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': > 'dummy'}, 'invocations': [{'argument': {'typename': > 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': > '-redicated-'}}]}} > {code} > As you can see, no `id` field is present in the `invocation.target` object or > the `target_id` was an empty string. 
> > This is our module.yaml from one of the Flink deployments: > > {code} > version: "3.0" > module: > meta: > type: remote > spec: > endpoints: > - endpoint: > meta: > kind: io.statefun.endpoints.v1/http > spec: > functions: com.x.dummy/dummy > urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] > timeouts: > call: 2 min > read: 2 min > write: 2 min > maxNumBatchRequests: 100 > ingresses: > - ingress: > meta: > type: io.statefun.kafka/ingress > id: com.x/ingress > spec: > address: x-kafka-0.x.ue1.x.net:9092 > consumerGroupId: x-worker-dummy > topics: > - topic: v2_post_events > valueType: type.googleapis.com/v2_event.Event > targets: > - com.x.dummy/dummy > startupPosition: > type: group-offsets > autoOffsetResetPosition: earliest > {code} > > Can you please help us investigate as this is
[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-28747: - Affects Version/s: statefun-3.0.0 > "target_id can not be missing" in HTTP statefun request > --- > > Key: FLINK-28747 > URL: https://issues.apache.org/jira/browse/FLINK-28747 > Project: Flink > Issue Type: Bug > Components: Stateful Functions >Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1 >Reporter: Stephan Weinwurm >Priority: Major > > Hi all, > We've suddenly started to see the following exception in our HTTP statefun > functions endpoints: > {code}Traceback (most recent call last): > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", > line 403, in run_asgi > result = await app(self.scope, self.receive, self.send) > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", > line 78, in __call__ > return await self.app(scope, receive, send) > File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line > 37, in __call__ > await span_processor.execute() > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 61, in execute > raise e > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 57, in execute > await self.app(self.scope, self.receive, self.send) > File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", > line 124, in __call__ > await self.middleware_stack(scope, receive, send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 184, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 162, in __call__ > await self.app(scope, receive, _send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 75, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 
64, in __call__ > await self.app(scope, receive, sender) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 680, in __call__ > await route.handle(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 275, in handle > await self.app(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 65, in app > response = await func(request) > File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", > line 25, in statefun_handler > result = await handler.handle_async(request_body) > File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", > line 262, in handle_async > msg = Message(target_typename=sdk_address.typename, > target_id=sdk_address.id, > File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line > 42, in __init__ > raise ValueError("target_id can not be missing"){code} > Interestingly, this has started to happen in three separate Flink deployments > at the very same time. The only thing in common between the three deployments > is that they consume the same Kafka topics. > No deployments have happened when the issue started happening which was on > July 28th 3:05PM. We have since been continuously seeing the error. > We were also able to extract the request that Flink sends to the HTTP > statefun endpoint: > {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': > 'dummy'}, 'invocations': [{'argument': {'typename': > 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': > '-redicated-'}}]}} > {code} > As you can see, no `id` field is present in the `invocation.target` object or > the `target_id` was an empty string. 
> > This is our module.yaml from one of the Flink deployments: > > {code} > version: "3.0" > module: > meta: > type: remote > spec: > endpoints: > - endpoint: > meta: > kind: io.statefun.endpoints.v1/http > spec: > functions: com.x.dummy/dummy > urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] > timeouts: > call: 2 min > read: 2 min > write: 2 min > maxNumBatchRequests: 100 > ingresses: > - ingress: > meta: > type: io.statefun.kafka/ingress > id: com.x/ingress > spec: > address: x-kafka-0.x.ue1.x.net:9092 > consumerGroupId: x-worker-dummy > topics: > - topic: v2_post_events > valueType: type.googleapis.com/v2_event.Event > targets: > - com.x.dummy/dummy > startupPosition: > type: group-offsets > autoOffsetResetPosition: earliest > {code} > > Can you please help us investigate as this is critically impacting our prod > setup? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-28747: - Affects Version/s: statefun-3.1.1 > "target_id can not be missing" in HTTP statefun request > --- > > Key: FLINK-28747 > URL: https://issues.apache.org/jira/browse/FLINK-28747 > Project: Flink > Issue Type: Bug > Components: Stateful Functions >Affects Versions: statefun-3.2.0, statefun-3.1.1 >Reporter: Stephan Weinwurm >Priority: Major > > Hi all, > We've suddenly started to see the following exception in our HTTP statefun > functions endpoints: > {code}Traceback (most recent call last): > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", > line 403, in run_asgi > result = await app(self.scope, self.receive, self.send) > File > "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", > line 78, in __call__ > return await self.app(scope, receive, send) > File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line > 37, in __call__ > await span_processor.execute() > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 61, in execute > raise e > File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line > 57, in execute > await self.app(self.scope, self.receive, self.send) > File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", > line 124, in __call__ > await self.middleware_stack(scope, receive, send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 184, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line > 162, in __call__ > await self.app(scope, receive, _send) > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 75, in __call__ > raise exc > File > "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", > line 64, in __call__ > 
await self.app(scope, receive, sender) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 680, in __call__ > await route.handle(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 275, in handle > await self.app(scope, receive, send) > File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line > 65, in app > response = await func(request) > File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", > line 25, in statefun_handler > result = await handler.handle_async(request_body) > File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", > line 262, in handle_async > msg = Message(target_typename=sdk_address.typename, > target_id=sdk_address.id, > File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line > 42, in __init__ > raise ValueError("target_id can not be missing"){code} > Interestingly, this has started to happen in three separate Flink deployments > at the very same time. The only thing in common between the three deployments > is that they consume the same Kafka topics. > No deployments have happened when the issue started happening which was on > July 28th 3:05PM. We have since been continuously seeing the error. > We were also able to extract the request that Flink sends to the HTTP > statefun endpoint: > {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': > 'dummy'}, 'invocations': [{'argument': {'typename': > 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': > '-redicated-'}}]}} > {code} > As you can see, no `id` field is present in the `invocation.target` object or > the `target_id` was an empty string. 
> > This is our module.yaml from one of the Flink deployments: > > {code} > version: "3.0" > module: > meta: > type: remote > spec: > endpoints: > - endpoint: > meta: > kind: io.statefun.endpoints.v1/http > spec: > functions: com.x.dummy/dummy > urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] > timeouts: > call: 2 min > read: 2 min > write: 2 min > maxNumBatchRequests: 100 > ingresses: > - ingress: > meta: > type: io.statefun.kafka/ingress > id: com.x/ingress > spec: > address: x-kafka-0.x.ue1.x.net:9092 > consumerGroupId: x-worker-dummy > topics: > - topic: v2_post_events > valueType: type.googleapis.com/v2_event.Event > targets: > - com.x.dummy/dummy > startupPosition: > type: group-offsets > autoOffsetResetPosition: earliest > {code} > > Can you please help us investigate as this is critically impacting our prod > setup? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
[ https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Weinwurm updated FLINK-28747: - Description: Hi all, We've suddenly started to see the following exception in our HTTP statefun functions endpoints: {code}Traceback (most recent call last): File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi result = await app(self.scope, self.receive, self.send) File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__ return await self.app(scope, receive, send) File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__ await span_processor.execute() File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute raise e File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute await self.app(self.scope, self.receive, self.send) File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__ await self.middleware_stack(scope, receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__ raise exc File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__ await self.app(scope, receive, _send) File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__ raise exc File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__ await self.app(scope, receive, sender) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__ await route.handle(scope, receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle await self.app(scope, receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app response = await 
func(request) File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler result = await handler.handle_async(request_body) File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id, File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__ raise ValueError("target_id can not be missing"){code} Interestingly, this has started to happen in three separate Flink deployments at the very same time. The only thing in common between the three deployments is that they consume the same Kafka topics. No deployments have happened when the issue started happening which was on July 28th 3:05PM. We have since been continuously seeing the error. We were also able to extract the request that Flink sends to the HTTP statefun endpoint: {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': '-redicated-'}}]}} {code} As you can see, no `id` field is present in the `invocation.target` object or the `target_id` was an empty string. 
This is our module.yaml from one of the Flink deployments: {code} version: "3.0" module: meta: type: remote spec: endpoints: - endpoint: meta: kind: io.statefun.endpoints.v1/http spec: functions: com.x.dummy/dummy urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] timeouts: call: 2 min read: 2 min write: 2 min maxNumBatchRequests: 100 ingresses: - ingress: meta: type: io.statefun.kafka/ingress id: com.x/ingress spec: address: x-kafka-0.x.ue1.x.net:9092 consumerGroupId: x-worker-dummy topics: - topic: v2_post_events valueType: type.googleapis.com/v2_event.Event targets: - com.x.dummy/dummy startupPosition: type: group-offsets autoOffsetResetPosition: earliest {code} Can you please help us investigate as this is critically impacting our prod setup? was: Hi all, We've suddenly started to see the following exception in our HTTP statefun functions endpoints: ``` Traceback (most recent call last): File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi result = await app(self.scope, self.receive, self.send) File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__ return await self.app(scope, receive, send) File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__ await span_processor.execute() File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute raise e File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py",
[jira] [Created] (FLINK-28747) "target_id can not be missing" in HTTP statefun request
Stephan Weinwurm created FLINK-28747: Summary: "target_id can not be missing" in HTTP statefun request Key: FLINK-28747 URL: https://issues.apache.org/jira/browse/FLINK-28747 Project: Flink Issue Type: Bug Components: Stateful Functions Affects Versions: statefun-3.2.0 Reporter: Stephan Weinwurm Hi all, We've suddenly started to see the following exception in our HTTP statefun functions endpoints: ``` Traceback (most recent call last): File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi result = await app(self.scope, self.receive, self.send) File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__ return await self.app(scope, receive, send) File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__ await span_processor.execute() File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute raise e File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute await self.app(self.scope, self.receive, self.send) File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__ await self.middleware_stack(scope, receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__ raise exc File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__ await self.app(scope, receive, _send) File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__ raise exc File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__ await self.app(scope, receive, sender) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__ await route.handle(scope, receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle await self.app(scope, 
receive, send) File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app response = await func(request) File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler result = await handler.handle_async(request_body) File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id, File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__ raise ValueError("target_id can not be missing") ``` Interestingly, this has started to happen in three separate Flink deployments at the very same time. The only thing in common between the three deployments is that they consume the same Kafka topics. No deployments have happened when the issue started happening which was on July 28th 3:05PM. We have since been continuously seeing the error. We were also able to extract the request that Flink sends to the HTTP statefun endpoint: ``` {'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': '-redicated-'}}]}} ``` As you can see, no `id` field is present in the `invocation.target` object or the `target_id` was an empty string. 
This is our module.yaml from one of the Flink deployments: ``` version: "3.0" module: meta: type: remote spec: endpoints: - endpoint: meta: kind: io.statefun.endpoints.v1/http spec: functions: com.x.dummy/dummy urlPathTemplate: [http://x-worker-dummy.x-functions:9090/statefun] timeouts: call: 2 min read: 2 min write: 2 min maxNumBatchRequests: 100 ingresses: - ingress: meta: type: io.statefun.kafka/ingress id: com.x/ingress spec: address: x-kafka-0.x.ue1.x.net:9092 consumerGroupId: x-worker-dummy topics: - topic: v2_post_events valueType: type.googleapis.com/v2_event.Event targets: - com.x.dummy/dummy startupPosition: type: group-offsets autoOffsetResetPosition: earliest ``` Can you please help us investigate as this is critically impacting our prod setup? -- This message was sent by Atlassian Jira (v8.20.10#820010)