[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes Restart of all Tasks with Regional Failover Strategy

2022-11-23 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Description: 
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 
{code:java}
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. {code}
Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

If not, we'd really appreciate a way to enable individual task failovers.

Thanks in advance!
Stephan
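
For reference, the failover strategy discussed here is selected via {{jobmanager.execution.failover-strategy}} in the Flink configuration, which in current Flink releases only accepts {{full}} and {{region}}; there is no per-task mode. A Statefun job's keyed, all-to-all exchanges presumably place all tasks in a single pipelined region, which would explain why all 5650 tasks are restarted. A minimal flink-conf.yaml sketch (the restart-strategy values are illustrative assumptions, not taken from this deployment):

{code}
# Failover strategy: 'region' restarts the pipelined region containing the
# failed task, 'full' restarts the entire job. If the whole job forms one
# region, 'region' effectively behaves like 'full'.
jobmanager.execution.failover-strategy: region

# Illustrative restart-strategy settings (assumed values).
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
{code}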

  was:
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 
{code:java}
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. {code}
 

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan


> Flink Statefun Task Failure causes Restart of all Tasks with Regional 
> Failover Strategy
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - Calculating tasks to restart to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - 5650 tasks should be restarted to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11. {code}
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> If not, we'd really appreciate a way to enable individual task failovers.
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks

2022-11-21 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Description: 
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 
{code:java}
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. {code}
 

 

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan

  was:
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 

{code:java}
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 5650 tasks should be restarted to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11. {code}

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan


> Flink Statefun Task Failure causes restart of all tasks
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - Calculating tasks to restart to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - 5650 tasks should be restarted to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11. {code}
>  
>  
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks

2022-11-21 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Description: 
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 
{code:java}
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. {code}
 

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan

  was:
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 
{code:java}
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. {code}
 

 

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan


> Flink Statefun Task Failure causes restart of all tasks
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - Calculating tasks to restart to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - 5650 tasks should be restarted to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11. {code}
>  
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes Restart of all Tasks with Regional Failover Strategy

2022-11-21 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Summary: Flink Statefun Task Failure causes Restart of all Tasks with 
Regional Failover Strategy  (was: Flink Statefun Task Failure causes restart of 
all tasks with regional failover strategy)

> Flink Statefun Task Failure causes Restart of all Tasks with Regional 
> Failover Strategy
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - Calculating tasks to restart to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - 5650 tasks should be restarted to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11. {code}
>  
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks with regional failover strategy

2022-11-21 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Summary: Flink Statefun Task Failure causes restart of all tasks with 
regional failover strategy  (was: Flink Statefun Task Failure causes restart of 
all tasks)

> Flink Statefun Task Failure causes restart of all tasks with regional 
> failover strategy
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - Calculating tasks to restart to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
>  [] - 5650 tasks should be restarted to recover the failed task 
> 31284d56d1e2112b0f20099ee448a6a9_11. {code}
>  
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks

2022-11-21 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-30120:
-
Description: 
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:

 

{code:java}
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 5650 tasks should be restarted to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11. {code}

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks?

Thanks in advance!
Stephan

  was:
Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the `region` failover strategy requires all tasks to be restarted, so we see this in the logs:

```
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. 
```

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks? 

Thanks in advance!
Stephan


> Flink Statefun Task Failure causes restart of all tasks
> ---
>
> Key: FLINK-30120
> URL: https://issues.apache.org/jira/browse/FLINK-30120
> Project: Flink
>  Issue Type: Improvement
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hey all,
> We've noticed that a single task failure causes all of the Statefun tasks to 
> be restarted.
> For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
> Flink then determines that the _region_ failover strategy requires all tasks to be restarted, so we see this in the logs:
>  
> {code:java}
> Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11.
> Nov 17 10:20:30 org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 5650 tasks should be restarted to recover the failed task 31284d56d1e2112b0f20099ee448a6a9_11. {code}
> Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.
> Is there a way to tell Flink to only restart the failed task? Or is there a 
> specific reason why the region failover strategy decides to restart all tasks?
> Thanks in advance!
> Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30120) Flink Statefun Task Failure causes restart of all tasks

2022-11-21 Thread Stephan Weinwurm (Jira)
Stephan Weinwurm created FLINK-30120:


 Summary: Flink Statefun Task Failure causes restart of all tasks
 Key: FLINK-30120
 URL: https://issues.apache.org/jira/browse/FLINK-30120
 Project: Flink
  Issue Type: Improvement
  Components: Stateful Functions
Affects Versions: statefun-3.1.1, statefun-3.2.0, statefun-3.0.0
Reporter: Stephan Weinwurm


Hey all,

We've noticed that a single task failure causes all of the Statefun tasks to be 
restarted.

For example, a single task fails because of some Statefun endpoint unavailability, or because one of our Kubernetes TaskManager pods goes down. 
Flink then determines that the `region` failover strategy requires all tasks to be restarted, so we see this in the logs:

```
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - Calculating tasks to restart to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11.
Nov 17 10:20:30 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
 [] - 5650 tasks should be restarted to recover the failed task 
31284d56d1e2112b0f20099ee448a6a9_11. 
```

Our tasks are all fully independent, so I would like only the failed task to be restarted or moved to a different TaskManager slot.

Is there a way to tell Flink to only restart the failed task? Or is there a 
specific reason why the region failover strategy decides to restart all tasks? 

Thanks in advance!
Stephan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-09-20 Thread Stephan Weinwurm (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607301#comment-17607301
 ] 

Stephan Weinwurm commented on FLINK-28747:
--

[~groot] / [~trohrmann], quick ping. I don't have enough knowledge of what the right fix is here. Would one of you mind taking a look and getting this fixed? Thank you in advance!

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: type.googleapis.com/v2_event.Event
>                 targets:
>                   - com.x.dummy/dummy
>             startupPosition:
>               type: group-offsets
>               autoOffsetResetPosition: earliest

[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-19 Thread Stephan Weinwurm (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581940#comment-17581940
 ] 

Stephan Weinwurm commented on FLINK-28747:
--

[~groot], what's the correct fix here? Allow `None` values for `target_id` or 
should we set a fixed / random value for `target_id`? I don't know what the 
implications are of either solution.

If the fix is trivial, I'm happy to do it - I'm just not very familiar with the 
code base.
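
For illustration, a relaxed guard could distinguish a truly absent id from a legal empty string. The sketch below is hypothetical, modeled on the names in the traceback above rather than the actual statefun-sdk-python source:

{code:python}
# Hypothetical sketch of a relaxed guard (class and parameter names are taken
# from the traceback above, not from the real SDK source).
class Message:
    def __init__(self, target_typename: str, target_id: str, value=None):
        if not target_typename:
            raise ValueError("target_typename can not be missing")
        if target_id is None:
            # Reject only a genuinely absent id; accept "" because an empty
            # string is a valid proto3 default (e.g. an empty Kafka key).
            raise ValueError("target_id can not be missing")
        self.target_typename = target_typename
        self.target_id = target_id
        self.value = value
{code}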

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: type.googleapis.com/v2_event.Event
> 

[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-03 Thread Stephan Weinwurm (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575006#comment-17575006
 ] 

Stephan Weinwurm commented on FLINK-28747:
--

Yes, you're right, that seems like protobuf behaviour. 

What's the correct fix here in the Python SDK? Remove [the check|https://github.com/apache/flink-statefun/blob/master/statefun-sdk-python/statefun/messages.py#L41] and/or set `target_id` to a random or fixed value?

Thanks again for looking into this!
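
Until the SDK-side fix is settled, a caller-side workaround seems possible. A sketch, assuming {{handler}} is the request-reply handler and {{request_body}} the raw bytes from the traceback above (whether the runtime accepts an empty response payload is an assumption, not verified):

{code:python}
import logging

# Workaround sketch: catch the ValueError raised for a missing/empty target_id
# around handle_async and acknowledge with an empty payload instead of failing.
async def safe_handle(handler, request_body: bytes) -> bytes:
    try:
        return await handler.handle_async(request_body)
    except ValueError as e:
        if "target_id" in str(e):
            # Likely an invocation addressed by an empty Kafka key; drop it.
            logging.warning("Dropping invocation without target_id: %s", e)
            return b""  # assumed to decode as an empty FromFunction response
        raise
{code}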

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: 

[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-02 Thread Stephan Weinwurm (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574331#comment-17574331
 ] 

Stephan Weinwurm commented on FLINK-28747:
--

Hey [~trohrmann],

thank you very much for taking a look! Yes, you are right, I was able to 
confirm that we do have empty keys in the Kafka topic (this shouldn't happen 
but that's a different story).

 The {{ToFunction}} message that I've pasted in the description does not 
contain {{target.id}}. Is this expected? If the key in Kafka was indeed an 
empty string, shouldn't Flink send an empty string to the function as well? I 
would expect {{'id': ''}} to be in the {{target}} object. In this case a check 
for an empty string would still cause the error.
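
For what it's worth, this matches proto3 semantics, which can be demonstrated with the stock {{StringValue}} wrapper standing in for the statefun {{Address}} message:

{code:python}
# proto3 omits fields set to their default value from the wire format, so a
# target id of "" is encoded exactly like a missing id; the receiver always
# reads back "" and cannot tell the two cases apart.
from google.protobuf.wrappers_pb2 import StringValue

msg = StringValue(value="")
assert msg.SerializeToString() == b""  # nothing encoded for the default ""

decoded = StringValue()
decoded.ParseFromString(b"")
assert decoded.value == ""  # the reader observes "", never None
{code}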

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
> 

[jira] [Commented] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-01 Thread Stephan Weinwurm (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573961#comment-17573961
 ] 

Stephan Weinwurm commented on FLINK-28747:
--

I've tested this issue in the following versions and they all behave the same:
 * 3.2.0
 * 3.1.1
 * 3.0.0

 

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: type.googleapis.com/v2_event.Event
>                 targets:
>                   - com.x.dummy/dummy
>             startupPosition:
>               type: group-offsets
>               autoOffsetResetPosition: earliest
> {code}
>  
> Can you please help us investigate as this is 

[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-01 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-28747:
-
Affects Version/s: statefun-3.0.0

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.0.0, statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: type.googleapis.com/v2_event.Event
>                 targets:
>                   - com.x.dummy/dummy
>             startupPosition:
>               type: group-offsets
>               autoOffsetResetPosition: earliest
> {code}
>  
> Can you please help us investigate as this is critically impacting our prod 
> setup?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-08-01 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-28747:
-
Affects Version/s: statefun-3.1.1

> "target_id can not be missing" in HTTP statefun request
> ---
>
> Key: FLINK-28747
> URL: https://issues.apache.org/jira/browse/FLINK-28747
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-3.2.0, statefun-3.1.1
>Reporter: Stephan Weinwurm
>Priority: Major
>
> Hi all,
> We've suddenly started to see the following exception in our HTTP statefun 
> functions endpoints:
> {code}Traceback (most recent call last):
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", 
> line 403, in run_asgi
> result = await app(self.scope, self.receive, self.send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", 
> line 78, in __call__
> return await self.app(scope, receive, send)
>   File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 
> 37, in __call__
> await span_processor.execute()
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 61, in execute
> raise e
>   File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 
> 57, in execute
> await self.app(self.scope, self.receive, self.send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", 
> line 124, in __call__
> await self.middleware_stack(scope, receive, send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 184, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 
> 162, in __call__
> await self.app(scope, receive, _send)
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 75, in __call__
> raise exc
>   File 
> "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", 
> line 64, in __call__
> await self.app(scope, receive, sender)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 680, in __call__
> await route.handle(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 275, in handle
> await self.app(scope, receive, send)
>   File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 
> 65, in app
> response = await func(request)
>   File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", 
> line 25, in statefun_handler
> result = await handler.handle_async(request_body)
>   File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", 
> line 262, in handle_async
> msg = Message(target_typename=sdk_address.typename, 
> target_id=sdk_address.id,
>   File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 
> 42, in __init__
> raise ValueError("target_id can not be missing"){code}
> Interestingly, this has started to happen in three separate Flink deployments 
> at the very same time. The only thing in common between the three deployments 
> is that they consume the same Kafka topics.
> No deployments had happened around the time the issue started, which was on July 28th at 3:05 PM. We have been seeing the error continuously since then.
> We were also able to extract the request that Flink sends to the HTTP 
> statefun endpoint:
> {code}{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 
> 'dummy'}, 'invocations': [{'argument': {'typename': 
> 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': 
> '-redacted-'}}]}}
> {code}
> As you can see, either no `id` field is present in the `invocation.target` object, or the `target_id` was an empty string.
>  
> This is our module.yaml from one of the Flink deployments:
>  
> {code}
> version: "3.0"
> module:
>   meta:
>     type: remote
>   spec:
>     endpoints:
>       - endpoint:
>           meta:
>             kind: io.statefun.endpoints.v1/http
>           spec:
>             functions: com.x.dummy/dummy
>             urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
>             timeouts:
>               call: 2 min
>               read: 2 min
>               write: 2 min
>             maxNumBatchRequests: 100
>     ingresses:
>       - ingress:
>           meta:
>             type: io.statefun.kafka/ingress
>             id: com.x/ingress
>           spec:
>             address: x-kafka-0.x.ue1.x.net:9092
>             consumerGroupId: x-worker-dummy
>             topics:
>               - topic: v2_post_events
>                 valueType: type.googleapis.com/v2_event.Event
>                 targets:
>                   - com.x.dummy/dummy
>             startupPosition:
>               type: group-offsets
>               autoOffsetResetPosition: earliest
> {code}
>  
> Can you please help us investigate as this is critically impacting our prod 
> setup?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-07-29 Thread Stephan Weinwurm (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Weinwurm updated FLINK-28747:
-
Description: 
Hi all,

We've suddenly started to see the following exception in our HTTP statefun function endpoints:

{code}
Traceback (most recent call last):
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__
    await span_processor.execute()
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute
    raise e
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute
    await self.app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
  File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler
    result = await handler.handle_async(request_body)
  File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async
    msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id,
  File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__
    raise ValueError("target_id can not be missing")
{code}

Interestingly, this has started to happen in three separate Flink deployments 
at the very same time. The only thing in common between the three deployments 
is that they consume the same Kafka topics.

No deployments had happened when the issue started, which was on July 28th at 
3:05 PM; we have been seeing the error continuously since.

We were also able to extract the request that Flink sends to the HTTP statefun 
endpoint:



{code}
{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 'has_value': True, 'value': '-redacted-'}}]}}
{code}
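
To confirm what the endpoint actually receives, the captured body can be decoded with the SDK's request/reply protobuf. Here is a short sketch (assuming the generated module is `statefun.request_reply_pb2`, which the v3 handler uses internally; `captured_request.bin` is a hypothetical dump of the HTTP body):

{code}
# Decode a captured request body and inspect the target address directly.
from statefun.request_reply_pb2 import ToFunction

with open("captured_request.bin", "rb") as f:  # hypothetical capture file
    request_body = f.read()

to_fn = ToFunction()
to_fn.ParseFromString(request_body)

# proto3 string fields default to '' when absent, so a missing id prints ''.
target = to_fn.invocation.target
print(target.namespace, target.type, repr(target.id))
{code}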

As you can see, either no `id` field is present in the `invocation.target` 
object, or the `target_id` was an empty string.
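
Either way, the SDK rejects the batch before our handler code ever runs. For illustration, a minimal sketch of the failing guard (a hypothetical stand-in for `Message.__init__` in `statefun/messages.py`, not the SDK's actual source):

{code}
def build_message(target_typename: str, target_id: str) -> dict:
    # Mirrors the guard the traceback points at: an empty string and a
    # missing id are both falsy, so either one raises the same error.
    if not target_id:
        raise ValueError("target_id can not be missing")
    return {"typename": target_typename, "id": target_id}

build_message("com.x.dummy/dummy", "")  # ValueError: target_id can not be missing
{code}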

 

This is our module.yaml from one of the Flink deployments:

 
{code}
version: "3.0"
module:
  meta:
    type: remote
  spec:
    endpoints:
      - endpoint:
          meta:
            kind: io.statefun.endpoints.v1/http
          spec:
            functions: com.x.dummy/dummy
            urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
            timeouts:
              call: 2 min
              read: 2 min
              write: 2 min
            maxNumBatchRequests: 100
    ingresses:
      - ingress:
          meta:
            type: io.statefun.kafka/ingress
            id: com.x/ingress
          spec:
            address: x-kafka-0.x.ue1.x.net:9092
            consumerGroupId: x-worker-dummy
            topics:
              - topic: v2_post_events
                valueType: type.googleapis.com/v2_event.Event
                targets:
                  - com.x.dummy/dummy
            startupPosition:
              type: group-offsets
              autoOffsetResetPosition: earliest
{code}
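
Since the only shared piece between the affected deployments is the Kafka topic, our working theory (an assumption on our side: the routable Kafka ingress derives the target function id from the record key) is that records with a null or empty key started landing on `v2_post_events` at that time. A quick scan for such records, sketched with kafka-python (any client would do):

{code}
# Probe v2_post_events for records whose key is null/empty; under the
# assumption above these would map to an empty target_id.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "v2_post_events",
    bootstrap_servers="x-kafka-0.x.ue1.x.net:9092",
    auto_offset_reset="earliest",
    enable_auto_commit=False,   # read-only probe, leave offsets untouched
    consumer_timeout_ms=10000,  # stop once no new records arrive for 10s
)
for record in consumer:
    if not record.key:  # None or b"" -- either would yield a missing id
        print(f"suspect record: partition={record.partition} offset={record.offset}")
{code}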

 

Can you please help us investigate? This is critically impacting our 
production setup.

  was:
Hi all,

We've suddenly started to see the following exception in our HTTP statefun function endpoints:

```

Traceback (most recent call last):
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__
    await span_processor.execute()
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute
    raise e
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", 

[jira] [Created] (FLINK-28747) "target_id can not be missing" in HTTP statefun request

2022-07-29 Thread Stephan Weinwurm (Jira)
Stephan Weinwurm created FLINK-28747:


 Summary: "target_id can not be missing" in HTTP statefun request
 Key: FLINK-28747
 URL: https://issues.apache.org/jira/browse/FLINK-28747
 Project: Flink
  Issue Type: Bug
  Components: Stateful Functions
Affects Versions: statefun-3.2.0
Reporter: Stephan Weinwurm


Hi all,

We've suddenly started to see the following exception in our HTTP statefun function endpoints:

```

Traceback (most recent call last):
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/src/worker/baseplate_asgi/asgi/baseplate_asgi_middleware.py", line 37, in __call__
    await span_processor.execute()
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 61, in execute
    raise e
  File "/src/worker/baseplate_asgi/asgi/asgi_http_span_processor.py", line 57, in execute
    await self.app(self.scope, self.receive, self.send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "/src/.venv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "/src/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
  File "/src/worker/baseplate_statefun/server/asgi/make_statefun_handler.py", line 25, in statefun_handler
    result = await handler.handle_async(request_body)
  File "/src/.venv/lib/python3.9/site-packages/statefun/request_reply_v3.py", line 262, in handle_async
    msg = Message(target_typename=sdk_address.typename, target_id=sdk_address.id,
  File "/src/.venv/lib/python3.9/site-packages/statefun/messages.py", line 42, in __init__
    raise ValueError("target_id can not be missing")

```

Interestingly, this has started to happen in three separate Flink deployments 
at the very same time. The only thing in common between the three deployments 
is that they consume the same Kafka topics.

No deployments had happened when the issue started, which was on July 28th at 
3:05 PM; we have been seeing the error continuously since.

We were also able to extract the request that Flink sends to the HTTP statefun 
endpoint:


```
{'invocation': {'target': {'namespace': 'com.x.dummy', 'type': 'dummy'}, 
'invocations': [{'argument': {'typename': 'type.googleapis.com/v2_event.Event', 
'has_value': True, 'value': '-redacted-'}}]}}
```

As you can see, either no `id` field is present in the `invocation.target` 
object, or the `target_id` was an empty string.

 

This is our module.yaml from one of the Flink deployments:

 
```
version: "3.0"
module:
  meta:
    type: remote
  spec:
    endpoints:
      - endpoint:
          meta:
            kind: io.statefun.endpoints.v1/http
          spec:
            functions: com.x.dummy/dummy
            urlPathTemplate: http://x-worker-dummy.x-functions:9090/statefun
            timeouts:
              call: 2 min
              read: 2 min
              write: 2 min
            maxNumBatchRequests: 100
    ingresses:
      - ingress:
          meta:
            type: io.statefun.kafka/ingress
            id: com.x/ingress
          spec:
            address: x-kafka-0.x.ue1.x.net:9092
            consumerGroupId: x-worker-dummy
            topics:
              - topic: v2_post_events
                valueType: type.googleapis.com/v2_event.Event
                targets:
                  - com.x.dummy/dummy
            startupPosition:
              type: group-offsets
              autoOffsetResetPosition: earliest
```

 

Can you please help us investigate? This is critically impacting our 
production setup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)