[jira] [Created] (FLINK-20469) Enable TaskManager start and terminate in MiniCluster

2020-12-03 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20469:
---

 Summary: Enable TaskManager start and terminate in MiniCluster
 Key: FLINK-20469
 URL: https://issues.apache.org/jira/browse/FLINK-20469
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.12.0


Currently, we expose startTaskManager/terminateTaskManager only in the internal 
TestingMiniCluster. Nonetheless, these methods are useful for implementing IT 
cases similar to E2E tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20468) Enable leadership control in MiniCluster to test JM failover

2020-12-03 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20468:
---

 Summary: Enable leadership control in MiniCluster to test JM 
failover
 Key: FLINK-20468
 URL: https://issues.apache.org/jira/browse/FLINK-20468
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.12.0








[jira] [Created] (FLINK-20290) Duplicated output in FileSource continuous ITCase with TM failover

2020-11-23 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20290:
---

 Summary: Duplicated output in FileSource continuous ITCase with TM 
failover
 Key: FLINK-20290
 URL: https://issues.apache.org/jira/browse/FLINK-20290
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin


If FileSourceTextLinesITCase::testContinuousTextFileSource is extended with TM 
restarts (the TM is failed with TestingMiniCluster::terminateTaskExecutor, see 
testContinuousTextFileSourceWithTaskManagerFailover in 
[branch|https://github.com/azagrebin/flink/tree/FLINK-20118-it]), then I 
sometimes observe duplicated lines in the output after running the test suite 
5-10 times in the IDE:
{code:java}
Test 
testContinuousTextFileSourceWithTaskManagerFailover(org.apache.flink.connector.file.src.FileSourceTextLinesITCase)
 failed with:
java.lang.AssertionError: 
Expected: ["And by opposing end them?--To die,--to sleep,--", "And enterprises 
of great pith and moment,", "And lose the name of action.--Soft you now!", "And 
makes us rather bear those ills we have", "And thus the native hue of 
resolution", "Be all my sins remember'd.", "But that the dread of something 
after death,--", "Devoutly to be wish'd. To die,--to sleep;--", "For in that 
sleep of death what dreams may come,", "For who would bear the whips and scorns 
of time,", "Is sicklied o'er with the pale cast of thought;", "Must give us 
pause: there's the respect", "No more; and by a sleep to say we end", "No 
traveller returns,--puzzles the will,", "Or to take arms against a sea of 
troubles,", "Than fly to others that we know not of?", "That flesh is heir 
to,--'tis a consummation", "That makes calamity of so long life;", "That 
patient merit of the unworthy takes,", "The fair Ophelia!--Nymph, in thy 
orisons", "The heartache, and the thousand natural shocks", "The insolence of 
office, and the spurns", "The oppressor's wrong, the proud man's contumely,", 
"The pangs of despis'd love, the law's delay,", "The slings and arrows of 
outrageous fortune", "The undiscover'd country, from whose bourn", "Thus 
conscience does make cowards of us all;", "To be, or not to be,--that is the 
question:--", "To grunt and sweat under a weary life,", "To sleep! perchance to 
dream:--ay, there's the rub;", "When he himself might his quietus make", "When 
we have shuffled off this mortal coil,", "Whether 'tis nobler in the mind to 
suffer", "With a bare bodkin? who would these fardels bear,", "With this 
regard, their currents turn awry,"]
 but: was ["And by opposing end them?--To die,--to sleep,--", "And 
enterprises of great pith and moment,", "And lose the name of action.--Soft you 
now!", "And makes us rather bear those ills we have", "And thus the native hue 
of resolution", "Be all my sins remember'd.", "But that the dread of something 
after death,--", "Devoutly to be wish'd. To die,--to sleep;--", "Devoutly to be 
wish'd. To die,--to sleep;--", "For in that sleep of death what dreams may 
come,", "For who would bear the whips and scorns of time,", "Is sicklied o'er 
with the pale cast of thought;", "Must give us pause: there's the respect", "No 
more; and by a sleep to say we end", "No more; and by a sleep to say we end", 
"No traveller returns,--puzzles the will,", "Or to take arms against a sea of 
troubles,", "Than fly to others that we know not of?", "That flesh is heir 
to,--'tis a consummation", "That flesh is heir to,--'tis a consummation", "That 
makes calamity of so long life;", "The fair Ophelia!--Nymph, in thy orisons", 
"The heartache, and the thousand natural shocks", "The heartache, and the 
thousand natural shocks", "The slings and arrows of outrageous fortune", "The 
undiscover'd country, from whose bourn", "Thus conscience does make cowards of 
us all;", "To be, or not to be,--that is the question:--", "To grunt and sweat 
under a weary life,", "To sleep! perchance to dream:--ay, there's the rub;", 
"To sleep! perchance to dream:--ay, there's the rub;", "When we have shuffled 
off this mortal coil,", "Whether 'tis nobler in the mind to suffer", "With a 
bare bodkin? who would these fardels bear,", "With this regard, their currents 
turn awry,"]
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at 
org.apache.flink.connector.file.src.FileSourceTextLinesITCase.verifyResult(FileSourceTextLinesITCase.java:198)
at 
org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testContinuousTextFileSource(FileSourceTextLinesITCase.java:151)
at 
org.apache.flink.connector.file.src.FileSourceTextLinesITCase.testContinuousTextFileSourceWithTaskManagerFailover(FileSourceTextLinesITCase.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{code}

[jira] [Created] (FLINK-20261) Uncaught exception in ExecutorNotifier due to split assignment broken by failed task

2020-11-20 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20261:
---

 Summary: Uncaught exception in ExecutorNotifier due to split 
assignment broken by failed task
 Key: FLINK-20261
 URL: https://issues.apache.org/jira/browse/FLINK-20261
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin


While trying to extend FileSourceTextLinesITCase::testContinuousTextFileSource 
with recovery test after TM failure (TestingMiniCluster::terminateTaskExecutor, 
[branch|https://github.com/azagrebin/flink/tree/FLINK-20118-it]), I encountered 
the following case:
* SourceCoordinatorContext::assignSplits schedules async assignment (all reader 
tasks alive)
* call TestingMiniCluster::terminateTaskExecutor while doing writeFile in a 
loop of testContinuousTextFileSource
* causes graceful TaskExecutor::onStop shutdown
* causes TM/RM disconnect and failing slot allocations in JM by RM
* eventually causes SourceCoordinatorContext::unregisterSourceReader
* actual assignment starts (SourceCoordinatorContext::assignSplits: 
callInCoordinatorThread)
* registeredReaders.containsKey(subtaskId) check fails with 
IllegalArgumentException which is uncaught in single thread executor
* forces ThreadPool to recreate the single thread
* calls CoordinatorExecutorThreadFactory::newThread
* fails expected condition of single thread creation with IllegalStateException 
which is uncaught
* calls FatalExitExceptionHandler and exits JVM abruptly
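
The thread recreation step in the list above is standard java.util.concurrent 
behaviour and can be reproduced without Flink. The following is a minimal 
sketch under stated assumptions: the class, the factory and the thread name 
are hypothetical and only illustrate that a task failing inside execute() 
kills the worker thread, so the pool asks its ThreadFactory for a second 
thread. It is not the actual CoordinatorExecutorThreadFactory code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo class; not part of Flink. */
final class SingleThreadRecreationDemo {

    /** Counts how many times the pool asks for a worker thread. */
    static final class CountingThreadFactory implements ThreadFactory {
        final AtomicInteger created = new AtomicInteger();

        @Override
        public Thread newThread(Runnable r) {
            created.incrementAndGet();
            Thread t = new Thread(r, "demo-worker");
            // Ignore the failure here so that the demo output stays readable.
            t.setUncaughtExceptionHandler((thread, error) -> { });
            return t;
        }
    }

    /** Returns the number of threads the factory had to create. */
    static int threadsCreatedAfterUncaughtFailure() {
        CountingThreadFactory factory = new CountingThreadFactory();
        ExecutorService executor = Executors.newSingleThreadExecutor(factory);
        try {
            // execute() (unlike submit()) lets the exception escape the worker thread.
            executor.execute(() -> {
                throw new IllegalArgumentException("reader is not registered");
            });
            // The follow-up task can only run on a replacement thread.
            CountDownLatch done = new CountDownLatch(1);
            executor.execute(done::countDown);
            try {
                if (!done.await(10, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("follow-up task did not run");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            return factory.created.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads created: " + threadsCreatedAfterUncaughtFailure());
    }
}
```

A factory that insists on creating exactly one thread would therefore be 
called a second time after the first uncaught failure, which matches the 
sequence observed above.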





[jira] [Created] (FLINK-20171) Improve error message for Flink process memory configuration

2020-11-16 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20171:
---

 Summary: Improve error message for Flink process memory 
configuration
 Key: FLINK-20171
 URL: https://issues.apache.org/jira/browse/FLINK-20171
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.12.0, 1.11.0


Currently, all configuration failures result in an 
IllegalConfigurationException from JobManagerProcessUtils or 
TaskExecutorProcessUtils. The exception messages do not mention the process 
type (JM or TM); it only becomes clear from the stack trace.

We can wrap the main configuration calls 
(TaskExecutorProcessUtils::processSpecFromConfig and 
JobManagerProcessUtils::processSpecFromConfigWithNewOptionToInterpretLegacyHeap) 
in an extra try/catch which wraps the IllegalConfigurationException into 
another one that states the type of the process (JM or TM).
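
A minimal sketch of the proposed wrapping, assuming simplified stand-ins: the 
nested IllegalConfigurationException only mimics Flink's exception of the same 
name, and withProcessType and validateTotalMemoryMb are hypothetical helpers 
invented for this illustration, not existing Flink methods.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; the names below are not Flink APIs. */
final class ProcessConfigErrorDemo {

    /** Local stand-in for Flink's IllegalConfigurationException. */
    static final class IllegalConfigurationException extends RuntimeException {
        IllegalConfigurationException(String message) {
            super(message);
        }

        IllegalConfigurationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs a configuration step and prefixes any failure with the process type. */
    static <T> T withProcessType(String processType, Supplier<T> configStep) {
        try {
            return configStep.get();
        } catch (IllegalConfigurationException e) {
            // Keep the original exception as the cause so that no detail is lost.
            throw new IllegalConfigurationException(
                    processType + " memory configuration failed: " + e.getMessage(), e);
        }
    }

    /** Hypothetical validation standing in for a processSpecFromConfig call. */
    static long validateTotalMemoryMb(long totalMb) {
        if (totalMb <= 0) {
            throw new IllegalConfigurationException(
                    "Total memory must be positive, got " + totalMb + " MB");
        }
        return totalMb;
    }

    public static void main(String[] args) {
        try {
            withProcessType("TaskExecutor", () -> validateTotalMemoryMb(-1));
        } catch (IllegalConfigurationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With such a wrapper the message itself names the failing process, and the 
original exception stays available as the cause.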





[jira] [Created] (FLINK-20078) Factor out an ExecutionGraph factory method for DefaultExecutionTopology

2020-11-10 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-20078:
---

 Summary: Factor out an ExecutionGraph factory method for 
DefaultExecutionTopology
 Key: FLINK-20078
 URL: https://issues.apache.org/jira/browse/FLINK-20078
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin


Based on [this PR 
discussion|https://github.com/apache/flink/pull/13958#discussion_r519676104].





[jira] [Created] (FLINK-19954) Move execution deployment tracking logic from legacy EG code to SchedulerNG

2020-11-03 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19954:
---

 Summary: Move execution deployment tracking logic from legacy EG 
code to SchedulerNG
 Key: FLINK-19954
 URL: https://issues.apache.org/jira/browse/FLINK-19954
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin


FLINK-17075 introduced the execution state reconciliation between TM and JM. 
The reconciliation requires tracking of the execution deployment state. The 
tracking logic was added to the legacy code of EG state handling, which is 
partially inactive as discussed in FLINK-19927. The active state handling logic 
resides in the new SchedulerNG, currently DefaultScheduler.

We could reconsider how the execution tracking for reconciliation is integrated 
with the scheduling. I think the tracking logic could be moved from 
Execution#deploy and EG#notifyExecutionChange to either 
SchedulerNG#updateTaskExecutionState or DefaultScheduler#deployTaskSafe. The 
latter currently looks more natural to me. ExecutionVertexOperations.deploy 
could return a submission future to mark the deployment completion in 
ExecutionDeploymentTracker, and Execution#getTerminalFuture could be used to 
stop the tracking. This would also be easier to unit test.





[jira] [Created] (FLINK-19923) Remove BulkSlotProvider, its implementation and tests

2020-11-02 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19923:
---

 Summary: Remove BulkSlotProvider, its implementation and tests
 Key: FLINK-19923
 URL: https://issues.apache.org/jira/browse/FLINK-19923
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.12.0


BulkSlotProvider is no longer used. It was introduced for OneSlotAllocator, 
which has been removed and replaced by SlotSharingExecutionSlotAllocator for 
the PipelinedRegionSchedulingStrategy.





[jira] [Created] (FLINK-19918) RocksIncrementalCheckpointRescalingTest.testScalingDown fails on Windows

2020-11-02 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19918:
---

 Summary: RocksIncrementalCheckpointRescalingTest.testScalingDown 
fails on Windows
 Key: FLINK-19918
 URL: https://issues.apache.org/jira/browse/FLINK-19918
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin


{code:java}
java.lang.NullPointerException
    at org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.close(AbstractStreamOperatorTestHarness.java:656)
    at org.apache.flink.contrib.streaming.state.RocksIncrementalCheckpointRescalingTest.closeHarness(RocksIncrementalCheckpointRescalingTest.java:357)
    at org.apache.flink.contrib.streaming.state.RocksIncrementalCheckpointRescalingTest.testScalingDown(RocksIncrementalCheckpointRescalingTest.java:276)
{code}





[jira] [Created] (FLINK-19917) RocksDBInitTest.testTempLibFolderDeletedOnFail fails on Windows

2020-11-02 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19917:
---

 Summary:  RocksDBInitTest.testTempLibFolderDeletedOnFail fails on 
Windows
 Key: FLINK-19917
 URL: https://issues.apache.org/jira/browse/FLINK-19917
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin


{code:java}
java.lang.AssertionError: 
Expected :0
Actual   :2{code}





[jira] [Created] (FLINK-19860) Consider skipping restart and traverse regions which are already being restarted in RestartPipelinedRegionFailoverStrategy

2020-10-28 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19860:
---

 Summary: Consider skipping restart and traverse regions which are 
already being restarted in RestartPipelinedRegionFailoverStrategy
 Key: FLINK-19860
 URL: https://issues.apache.org/jira/browse/FLINK-19860
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin


Original 
[discussion|https://github.com/apache/flink/pull/13749#pullrequestreview-516385846].
 Follow-up for FLINK-19712.





[jira] [Created] (FLINK-19832) Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator

2020-10-27 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19832:
---

 Summary: Improve handling of immediately failed physical slot in 
SlotSharingExecutionSlotAllocator
 Key: FLINK-19832
 URL: https://issues.apache.org/jira/browse/FLINK-19832
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.12.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


Improve handling of immediately failed physical slot in 
SlotSharingExecutionSlotAllocator

If the physical slot future fails immediately for a new SharedSlot in 
SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot but we continue to 
add logical slots to this SharedSlot, the logical slot eventually also fails 
and gets removed from {{the SharedSlot}}, which then gets released (state 
RELEASED). The subsequent additions of logical slots in the loop of 
{{allocateLogicalSlotsFromSharedSlots}} will fail the scheduling on the 
ALLOCATED state check because the SharedSlot is already RELEASED.

The subsequent bulk timeout check will also not find the SharedSlot and fail 
with an NPE.

Hence, a SharedSlot with an immediately failed physical slot future should not 
be kept in the SlotSharingExecutionSlotAllocator, and the logical slot requests 
depending on it can be returned as failed immediately. The bulk timeout check 
does not need to be started because if some physical (and its logical) slot 
requests fail, the whole bulk will be canceled by the scheduler.

If the last assumption does not hold for the future scheduling, this bulk 
failure might need an additional explicit cancellation of pending requests. We 
expect to refactor it for the declarative scheduling anyway.





[jira] [Created] (FLINK-19142) Investigate slot hijacking from preceding pipelined regions after failover

2020-09-04 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-19142:
---

 Summary: Investigate slot hijacking from preceding pipelined 
regions after failover
 Key: FLINK-19142
 URL: https://issues.apache.org/jira/browse/FLINK-19142
 Project: Flink
  Issue Type: Improvement
Reporter: Andrey Zagrebin


The ticket originates from [this PR 
discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].

The previous AllocationIDs are used by PreviousAllocationSlotSelectionStrategy 
to schedule subtasks into the slot where they were previously executed before a 
failover. If the previous slot (AllocationID) is not available, we do not want 
subtasks to take previous slots (AllocationIDs) of other subtasks.

The MergingSharedSlotProfileRetriever gets all previous AllocationIDs of the 
bulk from SlotSharingExecutionSlotAllocator but only from the current bulk. The 
previous AllocationIDs of other bulks stay unknown. Therefore, the current bulk 
can potentially hijack the previous slots from the preceding bulks. On the 
other hand the previous AllocationIDs of other tasks should be taken if the 
other tasks are not going to run at the same time, e.g. not enough resources 
after failover or other bulks are done.

One way to do it may be to give MergingSharedSlotProfileRetriever all previous 
AllocationIDs of the bulks which are going to run at the same time.





[jira] [Created] (FLINK-18957) Implement bulk fulfil-ability timeout tracking for shared slots

2020-08-14 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18957:
---

 Summary: Implement bulk fulfil-ability timeout tracking for shared 
slots
 Key: FLINK-18957
 URL: https://issues.apache.org/jira/browse/FLINK-18957
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.12.0


Track fulfil-ability of required physical slots for all SharedSlot(s) (no 
matter whether they are created at this bulk or not) with timeout. This ensures 
we will not wait indefinitely if the required slots for this bulk cannot be 
fully fulfilled at the same time.
 # Create a LogicalSlotRequestBulk to track all physical requests and logical 
slot requests (logical slot requests only which belong to the bulk)
 # Mark physical slot request fulfilled in LogicalSlotRequestBulk, once its 
future is done
 # If any physical slot request fails then clear the LogicalSlotRequestBulk to 
stop the fulfil-ability check
 # Schedule a fulfil-ability check in LogicalSlotRequestBulkChecker for the 
LogicalSlotRequestBulk
 # In case of timeout:
 ## cancel/fail the logical slot futures of the bulk in SharedSlot(s)
 ## remove





[jira] [Created] (FLINK-18751) Implement SlotSharingExecutionSlotAllocator

2020-07-29 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18751:
---

 Summary: Implement SlotSharingExecutionSlotAllocator
 Key: FLINK-18751
 URL: https://issues.apache.org/jira/browse/FLINK-18751
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


SlotSharingExecutionSlotAllocator maintains a SharedSlot for each 
ExecutionSlotSharingGroup. SlotSharingExecutionSlotAllocator allocates physical 
slots for SharedSlot(s) and then allocates logical slots from it for scheduled 
tasks. In this way, the slot sharing hints can be respected in the 
ExecutionSlotAllocator. And we no longer need to rely on the SlotProvider to do 
the slot sharing matching. Co-location constraints will be respected since 
co-located subtasks will be in the same ExecutionSlotSharingGroup.

The physical slot will be lazily allocated for a SharedSlot, upon any hosted 
subtask asking for the SharedSlot. Each subsequent sharing subtask allocates a 
logical slot from the SharedSlot. The SharedSlot/physical slot can be released 
only if all the requested logical slots are released or canceled.
h4. Slot Allocation Process

When SlotSharingExecutionSlotAllocator receives a set of tasks to allocate 
slots for, it should do the following:
 # Map the tasks to ExecutionSlotSharingGroup(s)
 # Check which ExecutionSlotSharingGroup(s) _already have_ SharedSlot(s)
 # For all involved ExecutionSlotSharingGroup(s) _which do not have a 
SharedSlot_ yet:
 ## Create a SlotProfile future by MergingSharedSlotProfileRetriever
 ## Allocate a physical slot from the PhysicalSlotProvider
 ## Create a SharedSlot based on the returned physical slot future
 ## If the physical slot future fails, remove the SharedSlot
 # Allocate logical slot futures for the tasks from all corresponding 
SharedSlot(s)
 # Generate SlotExecutionVertexAssignment(s) based on the logical slot 
futures and return the results
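
The allocation process above can be sketched roughly as follows. This is an 
illustration under stated assumptions, not the intended implementation: groups, 
tasks and slots are plain strings, the physical slot provider is a hypothetical 
function, the SlotProfile step is left out, and everything is assumed to run in 
a single thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical sketch; the types are simplified stand-ins, not Flink classes. */
final class SharedSlotAllocationDemo {

    /** One shared slot per slot sharing group, backed by a physical slot future. */
    static final class SharedSlot {
        final CompletableFuture<String> physicalSlot;

        SharedSlot(CompletableFuture<String> physicalSlot) {
            this.physicalSlot = physicalSlot;
        }

        /** A logical slot becomes available once the physical slot is available. */
        CompletableFuture<String> allocateLogicalSlot(String task) {
            return physicalSlot.thenApply(slot -> task + "@" + slot);
        }
    }

    private final Map<String, SharedSlot> sharedSlots = new HashMap<>();
    private final Function<String, CompletableFuture<String>> physicalSlotProvider;

    SharedSlotAllocationDemo(Function<String, CompletableFuture<String>> physicalSlotProvider) {
        this.physicalSlotProvider = physicalSlotProvider;
    }

    /** Reuses the shared slot of the group or lazily requests a physical slot for it. */
    SharedSlot getOrAllocateSharedSlot(String group) {
        SharedSlot existing = sharedSlots.get(group);
        if (existing != null) {
            return existing;
        }
        CompletableFuture<String> physicalSlot = physicalSlotProvider.apply(group);
        SharedSlot created = new SharedSlot(physicalSlot);
        sharedSlots.put(group, created);
        // If the physical slot future fails, the shared slot is removed again.
        physicalSlot.whenComplete((slot, failure) -> {
            if (failure != null) {
                sharedSlots.remove(group, created);
            }
        });
        return created;
    }

    /** Maps every task to its group and hands out one logical slot future per task. */
    Map<String, CompletableFuture<String>> allocateSlotsFor(Map<String, String> taskToGroup) {
        Map<String, CompletableFuture<String>> assignments = new LinkedHashMap<>();
        taskToGroup.forEach((task, group) ->
                assignments.put(task, getOrAllocateSharedSlot(group).allocateLogicalSlot(task)));
        return assignments;
    }

    boolean hasSharedSlot(String group) {
        return sharedSlots.containsKey(group);
    }

    public static void main(String[] args) {
        SharedSlotAllocationDemo allocator = new SharedSlotAllocationDemo(
                group -> CompletableFuture.completedFuture("slot-of-" + group));
        Map<String, String> tasks = new LinkedHashMap<>();
        tasks.put("map-0", "group-0");
        tasks.put("sink-0", "group-0");
        allocator.allocateSlotsFor(tasks)
                .forEach((task, slot) -> System.out.println(task + " -> " + slot.join()));
    }
}
```

Tasks of the same group share one physical slot request, which is the point of 
keeping a SharedSlot per ExecutionSlotSharingGroup.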





[jira] [Created] (FLINK-18739) Implement MergingSharedSlotProfileRetriever

2020-07-28 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18739:
---

 Summary: Implement MergingSharedSlotProfileRetriever
 Key: FLINK-18739
 URL: https://issues.apache.org/jira/browse/FLINK-18739
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


Input location preferences will be considered for each SharedSlot when 
allocating a physical slot for it. Input location preferences of a SharedSlot 
will be the merge of input location preferences of all the tasks to run in this 
SharedSlot.

Inter-ExecutionSlotSharingGroup input location preferences can be respected in 
this way for ExecutionSlotSharingGroups belonging to different bulks. If 
ExecutionSlotSharingGroups belong to the same bulk, the input location 
preferences are ignored because of possible cyclic dependencies. Later, we can 
optimise this case when the declarative resource management for reactive mode 
is ready.

Intra-ExecutionSlotSharingGroup input location preferences will also be 
respected when creating ExecutionSlotSharingGroup(s) in 
DefaultSlotSharingStrategy.
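
A minimal sketch of merging the input location preferences of all tasks of one 
SharedSlot. It assumes that the merge is a plain union and models tasks and 
locations as strings; the class and method names are hypothetical and do not 
exist in Flink.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the actual MergingSharedSlotProfileRetriever. */
final class MergedLocationPreferenceDemo {

    /** Unions the preferred locations of all tasks which run in one shared slot. */
    static Set<String> mergePreferredLocations(
            Collection<String> tasksInSharedSlot,
            Map<String, List<String>> preferredLocationsByTask) {
        Set<String> merged = new LinkedHashSet<>();
        for (String task : tasksInSharedSlot) {
            // Tasks without known input locations contribute nothing.
            merged.addAll(preferredLocationsByTask.getOrDefault(
                    task, Collections.<String>emptyList()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> preferences = new HashMap<>();
        preferences.put("map-0", Arrays.asList("tm-1", "tm-2"));
        preferences.put("sink-0", Arrays.asList("tm-2", "tm-3"));
        System.out.println(
                mergePreferredLocations(Arrays.asList("map-0", "sink-0"), preferences));
    }
}
```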





[jira] [Created] (FLINK-18709) Implement PhysicalSlotProvider

2020-07-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18709:
---

 Summary: Implement PhysicalSlotProvider
 Key: FLINK-18709
 URL: https://issues.apache.org/jira/browse/FLINK-18709
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


PhysicalSlotProviderImpl tries to allocate a physical slot from the available 
idle cached slots in SlotPool. If it is not possible, it requests a new slot 
from the SlotPool.
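
The described behaviour can be sketched as follows, assuming a heavily 
simplified slot pool: SimpleSlotPool and its methods are hypothetical stand-ins 
for illustration and do not match the real SlotPool interface.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; not the actual PhysicalSlotProviderImpl. */
final class PhysicalSlotProviderDemo {

    /** Simplified stand-in for a slot pool with cached idle slots. */
    static final class SimpleSlotPool {
        private final Deque<String> idleSlots = new ArrayDeque<>();
        private int requestedNewSlots;

        void addIdleSlot(String slotId) {
            idleSlots.add(slotId);
        }

        Optional<String> pollIdleSlot() {
            return Optional.ofNullable(idleSlots.poll());
        }

        CompletableFuture<String> requestNewSlot() {
            requestedNewSlots++;
            // A real pool would complete this future later; the demo completes it at once.
            return CompletableFuture.completedFuture("new-slot-" + requestedNewSlots);
        }

        int requestedNewSlots() {
            return requestedNewSlots;
        }
    }

    /** Prefers a cached idle slot and requests a new slot only if none is available. */
    static CompletableFuture<String> allocatePhysicalSlot(SimpleSlotPool pool) {
        Optional<String> idle = pool.pollIdleSlot();
        if (idle.isPresent()) {
            return CompletableFuture.completedFuture(idle.get());
        }
        return pool.requestNewSlot();
    }

    public static void main(String[] args) {
        SimpleSlotPool pool = new SimpleSlotPool();
        pool.addIdleSlot("idle-slot");
        System.out.println(allocatePhysicalSlot(pool).join());
        System.out.println(allocatePhysicalSlot(pool).join());
    }
}
```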





[jira] [Created] (FLINK-18689) Deterministic Slot Sharing

2020-07-23 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18689:
---

 Summary: Deterministic Slot Sharing
 Key: FLINK-18689
 URL: https://issues.apache.org/jira/browse/FLINK-18689
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin


[Design 
doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw]





[jira] [Created] (FLINK-18690) Implement LocalInputPreferredSlotSharingStrategy

2020-07-23 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18690:
---

 Summary: Implement LocalInputPreferredSlotSharingStrategy
 Key: FLINK-18690
 URL: https://issues.apache.org/jira/browse/FLINK-18690
 Project: Flink
  Issue Type: Sub-task
Reporter: Andrey Zagrebin
Assignee: Zhu Zhu


Implement ExecutionSlotSharingGroup, SlotSharingStrategy and default 
LocalInputPreferredSlotSharingStrategy.

The default strategy would be LocalInputPreferredSlotSharingStrategy. It will 
try to reduce remote data exchanges. Subtasks, which are connected and belong 
to the same SlotSharingGroup, tend to be put in the same 
ExecutionSlotSharingGroup.

See [design 
doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw/edit#heading=h.t4vfmm4atqoy]





[jira] [Created] (FLINK-18646) Managed memory released check can block RPC thread

2020-07-20 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18646:
---

 Summary: Managed memory released check can block RPC thread
 Key: FLINK-18646
 URL: https://issues.apache.org/jira/browse/FLINK-18646
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin


UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait for 
the GC of all allocated/released managed memory. If there are a lot of segments 
to GC, the check can take a long time to finish. If slot freeing happens in the 
RPC thread, the GC waiting can block it and the TM risks missing its heartbeat.





[jira] [Created] (FLINK-18467) Document what can be reconfigured for state with TTL between job restarts

2020-07-02 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18467:
---

 Summary: Document what can be reconfigured for state with TTL 
between job restarts
 Key: FLINK-18467
 URL: https://issues.apache.org/jira/browse/FLINK-18467
 Project: Flink
  Issue Type: Task
  Components: Runtime / State Backends
Affects Versions: 1.8.4
Reporter: Andrey Zagrebin


Changing whether the state has TTL or not is not easy because it requires 
migration. However, changing how the expiration timestamp is treated is 
possible, e.g. the value of the TTL or when to update/remove it.





[jira] [Created] (FLINK-18454) Add a code contribution section about how to look for what to contribute

2020-06-30 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18454:
---

 Summary: Add a code contribution section about how to look for 
what to contribute
 Key: FLINK-18454
 URL: https://issues.apache.org/jira/browse/FLINK-18454
 Project: Flink
  Issue Type: Task
  Components: Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


This section gives general advice on browsing open Jira issues and starter 
tasks.





[jira] [Created] (FLINK-18309) Recommend avoiding uppercase to emphasise statements in doc style

2020-06-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18309:
---

 Summary: Recommend avoiding uppercase to emphasise statements in 
doc style
 Key: FLINK-18309
 URL: https://issues.apache.org/jira/browse/FLINK-18309
 Project: Flink
  Issue Type: Improvement
  Components: Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


Some contributions tend to use uppercase in user docs to highlight and/or 
emphasise statements. For example: "you MUST use the latest version". This 
style may appear somewhat aggressive to users.

Therefore, I suggest adding a recommendation not to use uppercase in user docs. 
We could highlight such statements as note paragraphs or with a less 'shouting' 
style, e.g. italics, to draw the user's attention.





[jira] [Created] (FLINK-18308) KafkaProducerTestBase->Kafka011ProducerExactlyOnceITCase. testExactlyOnceCustomOperator hangs in Azure

2020-06-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18308:
---

 Summary: KafkaProducerTestBase->Kafka011ProducerExactlyOnceITCase. 
testExactlyOnceCustomOperator hangs in Azure
 Key: FLINK-18308
 URL: https://issues.apache.org/jira/browse/FLINK-18308
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3267=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=03dca39c-73e8-5aaf-601d-328ae5c35f20]

For the last 3.5 hours, the test log ends with about 5000 entries like this:
{code:java}
2020-06-11T11:04:54.3299945Z 11:04:54,328 [FailingIdentityMapper Status 
Printer] INFO  
org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper [] 
- > Failing mapper  0: count=690, 
totalCount=1000{code}
The problem was observed not on master but in [this 
PR|https://github.com/apache/flink/pull/12596]. The PR is a simple refactoring 
of the fatal error handling in the TM, so it looks unrelated. Another run of 
this PR in my Azure CI 
[passes|https://dev.azure.com/azagrebin/azagrebin/_build/results?buildId=214=results].





[jira] [Created] (FLINK-18250) Enrich OOM error messages with more details in ClusterEntrypoint

2020-06-11 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-18250:
---

 Summary: Enrich OOM error messages with more details in 
ClusterEntrypoint
 Key: FLINK-18250
 URL: https://issues.apache.org/jira/browse/FLINK-18250
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0


Similar to what we added for TM in [https://github.com/apache/flink/pull/11408], 
we should add more information and hints about out-of-memory failures to JM.





[jira] [Created] (FLINK-17811) Update docker hub Flink page

2020-05-19 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17811:
---

 Summary: Update docker hub Flink page
 Key: FLINK-17811
 URL: https://issues.apache.org/jira/browse/FLINK-17811
 Project: Flink
  Issue Type: Task
  Components: Deployment / Docker
Reporter: Andrey Zagrebin


In FLINK-17161, we refactored the Flink docker images docs. We should also 
update and possibly link the related Flink docs about docker integration in 
[docker hub Flink image 
description|https://hub.docker.com/_/flink?tab=description].





[jira] [Created] (FLINK-17740) Remove flink-container/kubernetes

2020-05-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17740:
---

 Summary: Remove flink-container/kubernetes
 Key: FLINK-17740
 URL: https://issues.apache.org/jira/browse/FLINK-17740
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker, Deployment / Kubernetes
Reporter: Andrey Zagrebin
Assignee: Chesnay Schepler
 Fix For: 1.11.0


FLINK-17161 added Kubernetes integration examples for Job Cluster.
FLINK-17656 copies job service yaml from flink-container/kubernetes to e2e 
Kubernetes Job Cluster test making them independent.
Therefore, we do not need flink-container/kubernetes and it can be removed.





[jira] [Created] (FLINK-17652) Legacy JM heap options should fallback to new JVM_HEAP_MEMORY in standalone

2020-05-13 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17652:
---

 Summary: Legacy JM heap options should fallback to new 
JVM_HEAP_MEMORY in standalone
 Key: FLINK-17652
 URL: https://issues.apache.org/jira/browse/FLINK-17652
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0


FLINK-16742 states that the legacy JM heap options should fall back to 
JobManagerOptions.JVM_HEAP_MEMORY in standalone scripts. However, 
BashJavaUtils#getJmResourceParams has been implemented to fall back to 
JobManagerOptions.TOTAL_FLINK_MEMORY. This should be fixed.





[jira] [Created] (FLINK-17546) Consider setting the number of TM CPU cores to the actual number of cores

2020-05-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17546:
---

 Summary: Consider setting the number of TM CPU cores to the actual 
number of cores
 Key: FLINK-17546
 URL: https://issues.apache.org/jira/browse/FLINK-17546
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Andrey Zagrebin


So far we do not use the CPU cores resource in TaskExecutorResourceSpec. It was 
added in preparation for dynamic slot/resource allocation (FLINK-14187). It is 
not fully clear how Flink or users would define the number of cores. We could 
consider setting the number of TM CPU cores to the actual number of cores by 
default, e.g. obtained from the OS in standalone mode or from the container 
configuration.
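
A minimal sketch of such a default for the standalone case, assuming the JVM's own view of the host is good enough. Runtime#availableProcessors is standard Java; the class and method names (CpuCoresDefault, resolveCpuCores) are hypothetical and not part of Flink.

```java
import java.util.OptionalDouble;

public class CpuCoresDefault {

    /**
     * Returns the explicitly configured core count if there is one,
     * otherwise the number of processors the JVM reports for this host.
     * (Hypothetical helper, not a Flink API.)
     */
    static double resolveCpuCores(OptionalDouble configured) {
        return configured.isPresent()
                ? configured.getAsDouble()
                : Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // No explicit setting: fall back to what the JVM sees.
        System.out.println(resolveCpuCores(OptionalDouble.empty()));
        // Explicit setting wins, fractional values are allowed.
        System.out.println(resolveCpuCores(OptionalDouble.of(2.5)));
    }
}
```

In a container the reported value depends on how the JVM interprets the container limits, so a container deployment may prefer to read the value from its own configuration instead.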





[jira] [Created] (FLINK-17465) Update Chinese user documentation for job manager memory model

2020-04-29 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17465:
---

 Summary: Update Chinese user documentation for job manager memory 
model
 Key: FLINK-17465
 URL: https://issues.apache.org/jira/browse/FLINK-17465
 Project: Flink
  Issue Type: Sub-task
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


This is a follow-up for FLINK-16946.





[jira] [Created] (FLINK-17344) RecordWriterTest.testIdleTime possibly deadlocks on Travis

2020-04-23 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17344:
---

 Summary: RecordWriterTest.testIdleTime possibly deadlocks on Travis
 Key: FLINK-17344
 URL: https://issues.apache.org/jira/browse/FLINK-17344
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


https://travis-ci.org/github/apache/flink/jobs/678193214
The test was introduced in 
[FLINK-16864|https://jira.apache.org/jira/browse/FLINK-16864].
It may be an instability as it passed 2 times (core and core-scala) and failed 
in core-hadoop:
https://travis-ci.org/github/apache/flink/builds/678193199





[jira] [Created] (FLINK-17167) Extend entry point script and docs with history server mode

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17167:
---

 Summary: Extend entry point script and docs with history server 
mode
 Key: FLINK-17167
 URL: https://issues.apache.org/jira/browse/FLINK-17167
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker
Reporter: Andrey Zagrebin
Assignee: Sebastian J.
 Fix For: 1.11.0








[jira] [Created] (FLINK-17166) Modify the log4j-console.properties to also output logs into the files for WebUI

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17166:
---

 Summary: Modify the log4j-console.properties to also output logs 
into the files for WebUI
 Key: FLINK-17166
 URL: https://issues.apache.org/jira/browse/FLINK-17166
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17165) Remove flink-container/docker

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17165:
---

 Summary: Remove flink-container/docker
 Key: FLINK-17165
 URL: https://issues.apache.org/jira/browse/FLINK-17165
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17164) Extend entry point script and docs with job cluster mode and user job artefacts

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17164:
---

 Summary: Extend entry point script and docs with job cluster mode 
and user job artefacts
 Key: FLINK-17164
 URL: https://issues.apache.org/jira/browse/FLINK-17164
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17163) Remove flink-contrib/docker-flink

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17163:
---

 Summary: Remove flink-contrib/docker-flink
 Key: FLINK-17163
 URL: https://issues.apache.org/jira/browse/FLINK-17163
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17162) Document examples of how to extend the official docker hub image

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17162:
---

 Summary: Document examples of how to extend the official docker 
hub image
 Key: FLINK-17162
 URL: https://issues.apache.org/jira/browse/FLINK-17162
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker, Documentation
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17161) Document the official docker hub image and examples of how to run

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17161:
---

 Summary: Document the official docker hub image and examples of 
how to run
 Key: FLINK-17161
 URL: https://issues.apache.org/jira/browse/FLINK-17161
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Docker, Documentation
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-17160) FLIP-111: Docker image unification

2020-04-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17160:
---

 Summary: FLIP-111: Docker image unification
 Key: FLINK-17160
 URL: https://issues.apache.org/jira/browse/FLINK-17160
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Docker, Dockerfiles
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0


This is an umbrella issue for 
[FLIP-111|https://cwiki.apache.org/confluence/display/FLINK/FLIP-111%3A+Docker+image+unification].





[jira] [Created] (FLINK-17048) Add memory related JVM args to Mesos JM startup scripts

2020-04-08 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-17048:
---

 Summary: Add memory related JVM args to Mesos JM startup scripts
 Key: FLINK-17048
 URL: https://issues.apache.org/jira/browse/FLINK-17048
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin


It looks like we never respected the memory configuration in the Mesos JM 
startup scripts:
mesos-appmaster.sh
mesos-appmaster-job.sh

Now we have a chance to adopt FLIP-116 here as well, similar to what we are 
doing with the standalone scripts in FLINK-16742.





[jira] [Created] (FLINK-16946) Update user documentation for job manager memory model

2020-04-02 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16946:
---

 Summary: Update user documentation for job manager memory model
 Key: FLINK-16946
 URL: https://issues.apache.org/jira/browse/FLINK-16946
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Runtime / Configuration, Runtime / 
Coordination
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-16754) Consider refactoring of ProcessMemoryUtilsTestBase to avoid inheritance

2020-03-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16754:
---

 Summary: Consider refactoring of ProcessMemoryUtilsTestBase to 
avoid inheritance
 Key: FLINK-16754
 URL: https://issues.apache.org/jira/browse/FLINK-16754
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0


After FLINK-16615 we have a class structure of memory utils with isolated 
responsibilities, mostly through composition. We should consider refactoring 
the tests as well to get tests that target individual abstractions, with 
better isolation and without implicit test inheritance contracts.





[jira] [Created] (FLINK-16742) Extend and use BashJavaUtils to start JM JVM process and pass JVM memory args

2020-03-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16742:
---

 Summary: Extend and use BashJavaUtils to start JM JVM process and 
pass JVM memory args
 Key: FLINK-16742
 URL: https://issues.apache.org/jira/browse/FLINK-16742
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Scripts, Runtime / Configuration, Runtime / 
Coordination
Reporter: Andrey Zagrebin


Currently, the legacy option `_jobmanager.heap.size_` (or 
`_jobmanager.heap.mb_`) is used in the JM standalone bash scripts to pass the 
JVM heap size arg and start the JM JVM process.

BashJavaUtils should be extended to get JVM memory arg list string from Flink 
configuration. BashJavaUtils can use 
JobManagerProcessUtils#processSpecFromConfig to obtain JobManagerProcessSpec. 
JobManagerProcessSpec can be passed to 
ProcessMemoryUtils#generateJvmParametersStr to get JVM memory arg list string.
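
A minimal sketch of the last step, turning a process memory spec into a JVM memory arg list string. ProcessSpec and toJvmArgs are hypothetical stand-ins for JobManagerProcessSpec and ProcessMemoryUtils#generateJvmParametersStr, and the choice of flags is an assumption. Only the JVM flags themselves are standard.

```java
public class JvmMemoryArgs {

    /** Hypothetical stand-in for a process memory spec; all sizes are in bytes. */
    static final class ProcessSpec {
        final long heapBytes;
        final long directBytes;
        final long metaspaceBytes;

        ProcessSpec(long heapBytes, long directBytes, long metaspaceBytes) {
            this.heapBytes = heapBytes;
            this.directBytes = directBytes;
            this.metaspaceBytes = metaspaceBytes;
        }
    }

    /** Builds the space-separated JVM memory arguments for the given spec. */
    static String toJvmArgs(ProcessSpec spec) {
        return "-Xmx" + spec.heapBytes
                + " -Xms" + spec.heapBytes
                + " -XX:MaxDirectMemorySize=" + spec.directBytes
                + " -XX:MaxMetaspaceSize=" + spec.metaspaceBytes;
    }

    public static void main(String[] args) {
        // 1 GiB heap, 128 MiB direct memory, 256 MiB metaspace.
        ProcessSpec spec = new ProcessSpec(1L << 30, 128L << 20, 256L << 20);
        // A bash script could capture this line and append it to the java command.
        System.out.println(toJvmArgs(spec));
    }
}
```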





[jira] [Created] (FLINK-16746) Deprecate/remove legacy memory options for JM and expose the new ones

2020-03-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16746:
---

 Summary: Deprecate/remove legacy memory options for JM and expose 
the new ones
 Key: FLINK-16746
 URL: https://issues.apache.org/jira/browse/FLINK-16746
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


Deprecate legacy heap options: `_jobmanager.heap.size_` (and update 
`_jobmanager.heap.mb_`)

Remove container cut-off options: `_containerized.heap-cutoff-ratio_` and 
`_containerized.heap-cutoff-min_`





[jira] [Created] (FLINK-16745) Use JobManagerProcessUtils to start JM container and pass JVM memory args

2020-03-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16745:
---

 Summary: Use JobManagerProcessUtils to start JM container and pass 
JVM memory args
 Key: FLINK-16745
 URL: https://issues.apache.org/jira/browse/FLINK-16745
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes, Deployment / Mesos, Deployment / 
YARN, Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


JobManagerProcessUtils#processSpecFromConfig should be used to get 
JobManagerProcessSpec. Then JobManagerProcessSpec can be passed to 
ProcessMemoryUtils#generateJvmParametersStr to get JVM memory arg list string.

The configuration should be fixed to fall back to 
JobManagerOptions.TOTAL_PROCESS_MEMORY if a legacy option is set 
(JobManagerProcessUtils#getConfigurationWithLegacyHeapSizeMappedToNewConfigOption) 
before passing it to JobManagerProcessUtils#processSpecFromConfig.

Then the JVM memory arg list can be used to start the JM container in 
Yarn/Mesos/Kubernetes active RMs instead of using the existing legacy heap 
options: `_jobmanager.heap.size_` (or `_jobmanager.heap.mb_`).
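
The legacy fallback described above can be sketched over a plain string map. This is an illustration only: LegacyHeapFallback and mapLegacyHeapTo are hypothetical names, the key string used for TOTAL_PROCESS_MEMORY is an assumption, and so is the treatment of `jobmanager.heap.mb` as a plain number of megabytes.

```java
import java.util.HashMap;
import java.util.Map;

public class LegacyHeapFallback {
    // Legacy keys quoted in the issue text.
    static final String LEGACY_HEAP_SIZE = "jobmanager.heap.size";
    static final String LEGACY_HEAP_MB = "jobmanager.heap.mb";

    /**
     * Returns a copy of the config in which a legacy heap size, if set,
     * is also visible under newKey. An explicitly set new option wins.
     */
    static Map<String, String> mapLegacyHeapTo(String newKey, Map<String, String> config) {
        Map<String, String> result = new HashMap<>(config);
        if (result.containsKey(newKey)) {
            return result;
        }
        String legacy = config.get(LEGACY_HEAP_SIZE);
        if (legacy == null && config.containsKey(LEGACY_HEAP_MB)) {
            // Assumption: the "mb" variant holds a plain number of megabytes.
            legacy = config.get(LEGACY_HEAP_MB) + "m";
        }
        if (legacy != null) {
            result.put(newKey, legacy);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put(LEGACY_HEAP_SIZE, "1024m");
        // Assumed key string for JobManagerOptions.TOTAL_PROCESS_MEMORY.
        System.out.println(mapLegacyHeapTo("jobmanager.memory.process.size", config));
    }
}
```

Passing the target key as a parameter keeps the same helper usable when the fallback target differs between deployments.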





[jira] [Created] (FLINK-16686) [State TTL] Make user class loader available in native RocksDB compaction thread

2020-03-19 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16686:
---

 Summary: [State TTL] Make user class loader available in native 
RocksDB compaction thread
 Key: FLINK-16686
 URL: https://issues.apache.org/jira/browse/FLINK-16686
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.8.0
Reporter: Andrey Zagrebin


The issue was initially reported 
[here|https://stackoverflow.com/questions/60745711/flink-kryo-serializer-because-chill-serializer-couldnt-be-found].

The problem is that the Java code of the Flink compaction filter is called from 
RocksDB native C++ code, in the context of the native compaction thread. 
RocksDB has utilities to create a Java thread context for the Flink Java 
callback. Presumably, the context class loader of that Java thread is not set 
at all, so any code that queries and uses it fails with a NullPointerException.

The reported job uses a list state with TTL. The compaction filter has to 
deserialise list elements to check their expiration. The deserialiser relies on 
Kryo, which queries the thread context class loader. This is expected to be the 
user class loader of the task but turns out to be null.

We should investigate how to pass the user class loader to the compaction 
thread of the list state with TTL.
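
One common pattern for this, sketched here under the assumption that the callback can be wrapped on the Java side, is to set the context class loader for the duration of the call and restore it afterwards. Thread#setContextClassLoader is standard Java; ContextClassLoaderScope and callWithContextClassLoader are hypothetical names, and whether this is applicable to the thread that RocksDB attaches is exactly what needs investigating.

```java
import java.util.function.Supplier;

public class ContextClassLoaderScope {

    /**
     * Runs the action with the given loader installed as the context class
     * loader of the current thread, then restores the previous loader.
     */
    static <T> T callWithContextClassLoader(ClassLoader loader, Supplier<T> action) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return action.get();
        } finally {
            // Restore even if the action throws.
            thread.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the user class loader of the task.
        ClassLoader userLoader = new ClassLoader() {};
        boolean seenInside = callWithContextClassLoader(
                userLoader,
                () -> Thread.currentThread().getContextClassLoader() == userLoader);
        System.out.println("user loader visible inside callback: " + seenInside);
    }
}
```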





[jira] [Created] (FLINK-16615) Introduce data structures and utilities to calculate Job Manager memory components

2020-03-16 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16615:
---

 Summary: Introduce data structures and utilities to calculate Job 
Manager memory components
 Key: FLINK-16615
 URL: https://issues.apache.org/jira/browse/FLINK-16615
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-16614) FLIP-116 Unified Memory Configuration for Job Manager

2020-03-16 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16614:
---

 Summary: FLIP-116 Unified Memory Configuration for Job Manager
 Key: FLINK-16614
 URL: https://issues.apache.org/jira/browse/FLINK-16614
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.11.0


This is the umbrella issue of [FLIP-116: Unified Memory Configuration for Job 
Managers|https://cwiki.apache.org/confluence/display/FLINK/FLIP+116%3A+Unified+Memory+Configuration+for+Job+Managers].





[jira] [Created] (FLINK-16406) Increase default value for JVM Metaspace to minimise its OutOfMemoryError

2020-03-03 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16406:
---

 Summary: Increase default value for JVM Metaspace to minimise its 
OutOfMemoryError
 Key: FLINK-16406
 URL: https://issues.apache.org/jira/browse/FLINK-16406
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Task
Affects Versions: 1.10.0
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.1, 1.11.0


With FLIP-49 ([FLINK-13980|https://issues.apache.org/jira/browse/FLINK-13980]), 
we introduced a limit for JVM Metaspace 
('taskmanager.memory.jvm-metaspace.size') when TM JVM process is started. It 
caused '_OutOfMemoryError: Metaspace_' for some use cases after upgrading to 
the latest 1.10 version. In some cases, a real class loading leak has been 
discovered, like in 
[FLINK-16142|https://issues.apache.org/jira/browse/FLINK-16142]. Some users had 
to increase the default value to accommodate their use cases (mostly from 
96Mb to 256Mb).

While this limit was introduced to properly plan Flink resources, especially 
for container environments, and to detect class loading leaks, the user 
experience should be as smooth as possible. One way is to provide good 
documentation for this change 
([FLINK-16278|https://issues.apache.org/jira/browse/FLINK-16278]).

Another question is the sanity of the default value. It is still arguable what 
the default value should be (currently 96Mb). In general, the size depends on 
the use case (job user code, how many jobs are deployed in the cluster etc).

This issue tackles the problem by first increasing the default to 256Mb. We 
also want to poll which Metaspace setting resolved the _OutOfMemoryError_. 
If you encountered this problem and there was no class loading leak, please 
report here your Metaspace size and any relevant specifics of your job.





[jira] [Created] (FLINK-16198) FileUtilsTest fails on Mac OS

2020-02-21 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-16198:
---

 Summary: FileUtilsTest fails on Mac OS
 Key: FLINK-16198
 URL: https://issues.apache.org/jira/browse/FLINK-16198
 Project: Flink
  Issue Type: Bug
  Components: FileSystems, Tests
Reporter: Andrey Zagrebin


The following tests fail if run on Mac OS (IDE/maven).

FileUtilsTest.testCompressionOnRelativePath:

{code:java}
java.nio.file.NoSuchFileException: ../../../../../var/folders/67/v4yp_42d21j6_n8k1h556h0cgn/T/junit6496651678375117676/compressDir/rootDir
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
    at java.nio.file.Files.createDirectory(Files.java:674)
    at org.apache.flink.util.FileUtilsTest.verifyDirectoryCompression(FileUtilsTest.java:440)
    at org.apache.flink.util.FileUtilsTest.testCompressionOnRelativePath(FileUtilsTest.java:261)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

FileUtilsTest.testDeleteDirectoryConcurrently:

{code:java}
java.nio.file.FileSystemException: /var/folders/67/v4yp_42d21j6_n8k1h556h0cgn/T/junit7558825557740784886/junit3566161583262218465/ab1fa0bde8b22cad58b717508c7a7300/121fdf5f7b057183843ed2e1298f9b66/6598025f390d3084d69c98b36e542fe2/8db7cd9c063396a19a86f5b63ce53f66: Invalid argument
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
    at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
    at java.nio.file.Files.deleteIfExists(Files.java:1165)
    at org.apache.flink.util.FileUtils.deleteFileOrDirectoryInternal(FileUtils.java:324)
    at org.apache.flink.util.FileUtils.guardIfWindows(FileUtils.java:391)
    at org.apache.flink.util.FileUtils.deleteFileOrDirectory(FileUtils.java:258)
    at org.apache.flink.util.FileUtils.cleanDirectoryInternal(FileUtils.java:376)
    at org.apache.flink.util.FileUtils.deleteDirectoryInternal(FileUtils.java:335)
    at org.apache.flink.util.FileUtils.deleteFileOrDirectoryInternal(FileUtils.java:320)
    at org.apache.flink.util.FileUtils.guardIfWindows(FileUtils.java:391)
    at org.apache.flink.util.FileUtils.deleteFileOrDirectory(FileUtils.java:258)
    at org.apache.flink.util.FileUtils.cleanDirectoryInternal(FileUtils.java:376)
    at org.apache.flink.util.FileUtils.deleteDirectoryInternal(FileUtils.java:335)
at 

[jira] [Created] (FLINK-15991) Create Chinese documentation for FLIP-49 TM memory model

2020-02-11 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15991:
---

 Summary: Create Chinese documentation for FLIP-49 TM memory model
 Key: FLINK-15991
 URL: https://issues.apache.org/jira/browse/FLINK-15991
 Project: Flink
  Issue Type: Task
  Components: Documentation
Reporter: Andrey Zagrebin
Assignee: Xintong Song
 Fix For: 1.10.0, 1.11.0


Chinese translation of FLINK-15143





[jira] [Created] (FLINK-15989) Rewrap OutOfMemoryError in allocateUnpooledOffHeap with better message

2020-02-11 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15989:
---

 Summary: Rewrap OutOfMemoryError in allocateUnpooledOffHeap with 
better message
 Key: FLINK-15989
 URL: https://issues.apache.org/jira/browse/FLINK-15989
 Project: Flink
  Issue Type: Improvement
Reporter: Andrey Zagrebin
 Fix For: 1.10.1, 1.11.0


Currently, if Flink allocates direct memory in 
MemorySegmentFactory#allocateUnpooledOffHeapMemory and the direct memory limit 
is exceeded for any reason (e.g. user code over-allocated direct memory), 
ByteBuffer#allocateDirect throws a generic "OutOfMemoryError: Direct buffer 
memory". We can catch it and rethrow it with a message that explains the likely 
cause and points to the option taskmanager.memory.task.off-heap.size, which can 
be increased as a possible solution.
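
A minimal sketch of the rewrapping, not the actual Flink code: DirectMemoryAllocation is a hypothetical class, the message wording is illustrative, and the allocator is passed in as a function only so that the failure path can be exercised without exhausting memory.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

public class DirectMemoryAllocation {

    /**
     * Allocates a buffer through the given allocator and rewraps any
     * OutOfMemoryError with a message that names the option to increase.
     */
    static ByteBuffer allocateDirect(int size, IntFunction<ByteBuffer> allocator) {
        try {
            return allocator.apply(size);
        } catch (OutOfMemoryError e) {
            OutOfMemoryError wrapped = new OutOfMemoryError(
                    "Could not allocate " + size + " bytes of direct memory: "
                            + e.getMessage()
                            + ". The direct memory limit may have been exceeded, "
                            + "e.g. by user code. Consider increasing "
                            + "'taskmanager.memory.task.off-heap.size'.");
            // Keep the original error for debugging.
            wrapped.initCause(e);
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        // Normal path: delegates to the standard allocator.
        ByteBuffer buffer = allocateDirect(1024, ByteBuffer::allocateDirect);
        System.out.println("allocated " + buffer.capacity() + " bytes");
    }
}
```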





[jira] [Created] (FLINK-15946) Task manager Kubernetes pods take long time to terminate

2020-02-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15946:
---

 Summary: Task manager Kubernetes pods take long time to terminate
 Key: FLINK-15946
 URL: https://issues.apache.org/jira/browse/FLINK-15946
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes, Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Andrey Zagrebin


The problem is initially described in this [ML 
thread|https://mail-archives.apache.org/mod_mbox/flink-user/202002.mbox/browser].

We should investigate whether the TM pod killing/shutdown is delayed by 
reconnecting to the terminated JM and, if so, why.

cc [~fly_in_gis]





[jira] [Created] (FLINK-15942) Improve logging of infinite resource profile

2020-02-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15942:
---

 Summary: Improve logging of infinite resource profile
 Key: FLINK-15942
 URL: https://issues.apache.org/jira/browse/FLINK-15942
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Task
Affects Versions: 1.10.0
Reporter: Andrey Zagrebin


Setting task memory and CPU to infinity in FLINK-15763 spoiled the logs:
{code:java}
00:23:49,442 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl 
- Free slot TaskSlot(index:0, state:ACTIVE, resource profile: 
ResourceProfile{cpuCores=44942328371557892500.,
 taskHeapMemory=2097152.000tb (2305843009213693951 bytes), 
taskOffHeapMemory=2097152.000tb (2305843009213693951 bytes), 
managedMemory=20.000mb (20971520 bytes), networkMemory=16.000mb (16777216 
bytes)}, allocationId: 349dacfbf1ac4d0b44a2d11e1976d264, jobId: 
689a0cf24b40f16b6f45157f78754c46).
{code}
We should treat infinity as a special case and print it accordingly.
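
A minimal sketch of such special-casing for memory sizes. MemorySizeFormat and INFINITE_BYTES are hypothetical; the sentinel is simply the byte value that appears in the log line above, which equals Long.MAX_VALUE >> 2, and the constant actually used in Flink may be defined differently.

```java
import java.util.Locale;

public class MemorySizeFormat {

    /** Assumed sentinel for "unlimited", read off the log excerpt (2^61 - 1). */
    static final long INFINITE_BYTES = Long.MAX_VALUE >> 2;

    /** Formats a size in megabytes, printing the sentinel as "infinite". */
    static String format(long bytes) {
        if (bytes >= INFINITE_BYTES) {
            return "infinite";
        }
        double megabytes = bytes / (1024.0 * 1024.0);
        return String.format(Locale.ROOT, "%.3fmb (%d bytes)", megabytes, bytes);
    }

    public static void main(String[] args) {
        System.out.println("managedMemory=" + format(20971520L));
        System.out.println("taskHeapMemory=" + format(2305843009213693951L));
    }
}
```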





[jira] [Created] (FLINK-15774) Consider adding an explicit network memory size config option

2020-01-27 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15774:
---

 Summary: Consider adding an explicit network memory size config 
option
 Key: FLINK-15774
 URL: https://issues.apache.org/jira/browse/FLINK-15774
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Task
Reporter: Andrey Zagrebin


See [PR 
discussion|https://github.com/apache/flink/pull/10946#discussion_r371169251]





[jira] [Created] (FLINK-15763) Set necessary resource configuration options to defaults for local execution ignoring FLIP-49

2020-01-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15763:
---

 Summary: Set necessary resource configuration options to defaults 
for local execution ignoring FLIP-49
 Key: FLINK-15763
 URL: https://issues.apache.org/jira/browse/FLINK-15763
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0








[jira] [Created] (FLINK-15758) Investigate potential out-of-memory problems due to managed unsafe memory allocation

2020-01-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15758:
---

 Summary: Investigate potential out-of-memory problems due to 
managed unsafe memory allocation
 Key: FLINK-15758
 URL: https://issues.apache.org/jira/browse/FLINK-15758
 Project: Flink
  Issue Type: Task
  Components: Runtime / Task
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-15741) Fix TTL docs after enabling RocksDB compaction filter by default

2020-01-23 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15741:
---

 Summary: Fix TTL docs after enabling RocksDB compaction filter by 
default
 Key: FLINK-15741
 URL: https://issues.apache.org/jira/browse/FLINK-15741
 Project: Flink
  Issue Type: Task
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


The RocksDB compaction filter is always enabled after 
[FLINK-15506|https://issues.apache.org/jira/browse/FLINK-15506], and the option 
to disable it is deprecated. The docs should no longer refer to enabling or 
disabling it.





[jira] [Created] (FLINK-15621) State TTL: Remove deprecated option and method to disable TTL compaction filter

2020-01-16 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15621:
---

 Summary: State TTL: Remove deprecated option and method to disable 
TTL compaction filter
 Key: FLINK-15621
 URL: https://issues.apache.org/jira/browse/FLINK-15621
 Project: Flink
  Issue Type: Task
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


Follow-up for FLINK-15506.
 * Remove RocksDBOptions#TTL_COMPACT_FILTER_ENABLED
 * Remove 
RocksDBStateBackend#enableTtlCompactionFilter/isTtlCompactionFilterEnabled/disableTtlCompactionFilter,
 also in python API
 * Clean up code and tests that use this flag, also in python API
 * Check for any related code in 
[frocksdb|https://github.com/dataArtisans/frocksdb]





[jira] [Created] (FLINK-15620) State TTL: Remove deprecated enable default background cleanup

2020-01-16 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15620:
---

 Summary: State TTL: Remove deprecated enable default background 
cleanup
 Key: FLINK-15620
 URL: https://issues.apache.org/jira/browse/FLINK-15620
 Project: Flink
  Issue Type: Task
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin
 Fix For: 1.11.0








[jira] [Created] (FLINK-15606) Deprecate enable default background cleanup of state with TTL

2020-01-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15606:
---

 Summary: Deprecate enable default background cleanup of state with 
TTL
 Key: FLINK-15606
 URL: https://issues.apache.org/jira/browse/FLINK-15606
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


Follow-up for FLINK-14898.

Enabling TTL without any background cleanup does not make much sense. So we 
can keep background cleanup always enabled; only the cleanup settings can be 
tweaked for particular backends.





[jira] [Created] (FLINK-15605) Remove deprecated in 1.9 StateTtlConfig.TimeCharacteristic

2020-01-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15605:
---

 Summary: Remove deprecated in 1.9 StateTtlConfig.TimeCharacteristic
 Key: FLINK-15605
 URL: https://issues.apache.org/jira/browse/FLINK-15605
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0








[jira] [Created] (FLINK-15597) Relax sanity check of JVM memory overhead to be within its min/max

2020-01-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15597:
---

 Summary: Relax sanity check of JVM memory overhead to be within 
its min/max
 Key: FLINK-15597
 URL: https://issues.apache.org/jira/browse/FLINK-15597
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Xintong Song
 Fix For: 1.10.0


When the explicitly configured process and Flink memory sizes are verified 
against the JVM Metaspace and overhead, the JVM overhead does not have to match 
the exact fraction.
It only needs to be within its min/max range, similar to the network/shuffle 
memory check after FLINK-15300.
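
A minimal sketch of the relaxed check, assuming the overhead is whatever remains of the process size after Flink memory and Metaspace. OverheadRangeCheck and its method are hypothetical names and all sizes are plain byte counts.

```java
public class OverheadRangeCheck {

    /**
     * Derives the JVM overhead implied by the explicit sizes and accepts it
     * if it lies within [minOverhead, maxOverhead], without comparing it to
     * the configured fraction.
     */
    static long checkedOverhead(
            long totalProcess, long totalFlink, long metaspace,
            long minOverhead, long maxOverhead) {
        long overhead = totalProcess - totalFlink - metaspace;
        if (overhead < minOverhead || overhead > maxOverhead) {
            throw new IllegalArgumentException(
                    "Derived JVM overhead (" + overhead + " bytes) is outside of ["
                            + minOverhead + ", " + maxOverhead + "]");
        }
        return overhead;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1000 MB process, 700 MB Flink memory, 100 MB Metaspace leave 200 MB
        // of overhead, which is accepted for a [192 MB, 1024 MB] range.
        System.out.println(checkedOverhead(1000 * mb, 700 * mb, 100 * mb, 192 * mb, 1024 * mb));
    }
}
```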





[jira] [Created] (FLINK-15519) Preserve logs from BashJavaUtils and make them part of TM logs

2020-01-08 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15519:
---

 Summary: Preserve logs from BashJavaUtils and make them part of TM 
logs
 Key: FLINK-15519
 URL: https://issues.apache.org/jira/browse/FLINK-15519
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Affects Versions: 1.10.0
Reporter: Andrey Zagrebin
 Fix For: 1.10.0


In FLINK-13983 we introduced the BashJavaUtils utility, which taskmanager.sh 
calls before starting the TM to calculate the memory configuration for the TM 
JVM process.

Ideally, the BashJavaUtils logs should be preserved and made part of the TM 
logs. Currently, logging for BashJavaUtils is configured from the class path 
and can differ from the TM logging. Moreover, the TM logging can overwrite the 
BashJavaUtils logs even if we align their logging configurations (e.g. 
log4j.appender.file.append=false in the default log4j.properties for Flink).





[jira] [Created] (FLINK-15517) Use back 'network' in 'shuffle' memory config option names

2020-01-08 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15517:
---

 Summary: Use back 'network' in 'shuffle' memory config option names
 Key: FLINK-15517
 URL: https://issues.apache.org/jira/browse/FLINK-15517
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Task
Affects Versions: 1.10.0
Reporter: Andrey Zagrebin
 Fix For: 1.10.0


The issue is based on the discussion outcome on [Dev 
ML|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html].

As there is no strong consensus about using the new 'shuffle' naming for memory 
config options, we decided to fall back to the existing naming and use 
'network' instead of 'shuffle'. We are still keeping the new naming convention 
with the prefix 'taskmanager.memory.*'.

Basically, we just need to rename 'taskmanager.memory.shuffle.*' to 
'taskmanager.memory.network.*' as 'shuffle' naming has never been released.





[jira] [Created] (FLINK-15300) Shuffle memory fraction sanity check does not account for its min/max limit

2019-12-17 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15300:
---

 Summary: Shuffle memory fraction sanity check does not account for 
its min/max limit
 Key: FLINK-15300
 URL: https://issues.apache.org/jira/browse/FLINK-15300
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Configuration
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


If the configuration results in the shuffle memory size being set to its min or 
max instead of the fraction-derived value, the starting TM parses the generated 
dynamic properties and fails in the sanity check 
(TaskExecutorResourceUtils#sanityCheckShuffleMemory), because the check expects 
the exact fraction even when the min/max value was applied.

For example, start the TM with the following Flink config:
{code:java}
taskmanager.memory.total-flink.size: 350m
taskmanager.memory.framework.heap.size: 16m
taskmanager.memory.shuffle.fraction: 0.1{code}
It will result in the following extra program args:
{code:java}
taskmanager.memory.shuffle.max: 67108864b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.managed.size: 146800642b
taskmanager.cpu.cores: 1.0
taskmanager.memory.task.heap.size: 2097150b
taskmanager.memory.task.off-heap.size: 0b
taskmanager.memory.shuffle.min: 67108864b{code}
Here the size derived from the fraction was less than the shuffle memory min 
size (64mb), so it was set to the min value: 64mb.



 

While the TM starts, TaskExecutorResourceUtils#sanityCheckShuffleMemory throws 
the following exception:
{code:java}
org.apache.flink.configuration.IllegalConfigurationException: Derived Shuffle Memory size(64 Mb (67108864 bytes)) does not match configured Shuffle Memory fraction (0.1000149011612).
 at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.sanityCheckShuffleMemory(TaskExecutorResourceUtils.java:552)
 at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithExplicitTaskAndManagedMemory(TaskExecutorResourceUtils.java:183)
 at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:135)
{code}
This can be fixed by checking whether the fraction to assert is within the 
min/max range.
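
A minimal sketch of such a relaxed check, assuming the derived size is the 
fraction of some base size clamped to the min/max range. The class and method 
names are hypothetical and this is not the actual TaskExecutorResourceUtils 
code:

```java
/** Hypothetical helper, not part of Flink: sanity check that respects min/max. */
final class ShuffleMemorySanityCheck {
    private ShuffleMemorySanityCheck() {}

    /**
     * Returns true if derivedBytes equals the fraction of baseBytes after
     * clamping it to [minBytes, maxBytes]. An exact fraction match is therefore
     * only required when the fraction result lies inside the range.
     */
    static boolean isConsistent(
            long derivedBytes,
            long baseBytes,
            double fraction,
            long minBytes,
            long maxBytes) {
        long byFraction = (long) (baseBytes * fraction);
        long expected = Math.max(minBytes, Math.min(maxBytes, byFraction));
        return derivedBytes == expected;
    }
}
```

With the configuration above (min and max both 64mb), a derived size of 
67108864 bytes passes this check even though it is not 0.1 of the base size.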





[jira] [Created] (FLINK-15198) Remove deprecated mesos.resourcemanager.tasks.mem in 1.11

2019-12-11 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15198:
---

 Summary: Remove deprecated mesos.resourcemanager.tasks.mem in 1.11
 Key: FLINK-15198
 URL: https://issues.apache.org/jira/browse/FLINK-15198
 Project: Flink
  Issue Type: Task
  Components: Deployment / Mesos
Affects Versions: 1.11.0
Reporter: Andrey Zagrebin
 Fix For: 1.11.0


In FLINK-15082, we deprecated 'mesos.resourcemanager.tasks.mem' in favour of 
the new unified option 'taskmanager.memory.total-process.size' from FLIP-49. We 
should remove it now in 1.11.





[jira] [Created] (FLINK-15094) Warning about using private constructor of java.nio.DirectByteBuffer in Java 11

2019-12-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-15094:
---

 Summary: Warning about using private constructor of 
java.nio.DirectByteBuffer in Java 11
 Key: FLINK-15094
 URL: https://issues.apache.org/jira/browse/FLINK-15094
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: Andrey Zagrebin


The unsafe off-heap memory allocation in 
[FLINK-13985|https://jira.apache.org/jira/browse/FLINK-13985] was implemented 
by reflectively calling a private constructor of java.nio.DirectByteBuffer. 
This causes undesirable warnings in Java 11:


{code:java}
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.core.memory.MemoryUtils 
(file:/C:/Development/repos/flink/flink-core/target/classes/) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.flink.core.memory.MemoryUtils
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release{code}





[jira] [Created] (FLINK-14901) Throw Error in MemoryUtils if there is problem with using system classes over reflection

2019-11-21 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14901:
---

 Summary: Throw Error in MemoryUtils if there is problem with using 
system classes over reflection
 Key: FLINK-14901
 URL: https://issues.apache.org/jira/browse/FLINK-14901
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


If something goes wrong when using system classes via reflection in MemoryUtils 
or JavaGcCleanerWrapper, Flink cannot really recover from it. We should throw 
an Error in this case, which should stay unhandled.





[jira] [Created] (FLINK-14898) Enable background cleanup of state with TTL by default

2019-11-21 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14898:
---

 Summary: Enable background cleanup of state with TTL by default
 Key: FLINK-14898
 URL: https://issues.apache.org/jira/browse/FLINK-14898
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


So far, we have been conservative about enabling background cleanup strategies 
for state with TTL. In general, if state is configured to have TTL, most users 
would expect the background cleanup to kick in. As there have been no reported 
issues since the release of the backend-specific cleanups, and they should not 
affect any state without TTL, this issue suggests enabling background cleanup 
by default for the backends.





[jira] [Created] (FLINK-14637) Introduce framework off heap memory config

2019-11-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14637:
---

 Summary: Introduce framework off heap memory config
 Key: FLINK-14637
 URL: https://issues.apache.org/jira/browse/FLINK-14637
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: Andrey Zagrebin
 Fix For: 1.10.0


At the moment, after 
[-FLINK-13982-|https://issues.apache.org/jira/browse/FLINK-13982], we do not 
account for ad hoc direct memory allocations made by the Flink framework 
(except network buffers) or by libraries used in Flink. In general, we expect 
these allocations to stay under a certain reasonably low limit, but we need 
some margin for them so that the JVM direct memory limit is not exactly equal 
to the size of the network buffers and allocations do not fail. We can address 
this by introducing a framework off-heap memory config option.





[jira] [Created] (FLINK-14633) Account for netty direct allocations in direct memory limit (Queryable state)

2019-11-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14633:
---

 Summary: Account for netty direct allocations in direct memory 
limit (Queryable state)
 Key: FLINK-14633
 URL: https://issues.apache.org/jira/browse/FLINK-14633
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Queryable State, Runtime / Task
Reporter: Andrey Zagrebin
 Fix For: 1.10.0


At the moment after 
[-FLINK-13982-|https://issues.apache.org/jira/browse/FLINK-13982], when we 
calculate JVM direct memory limit, we do not account for direct allocations 
from netty arenas in org.apache.flink.queryablestate.network.NettyBufferPool.





[jira] [Created] (FLINK-14631) Account for netty direct allocations in direct memory limit

2019-11-06 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14631:
---

 Summary: Account for netty direct allocations in direct memory 
limit
 Key: FLINK-14631
 URL: https://issues.apache.org/jira/browse/FLINK-14631
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: Andrey Zagrebin
 Fix For: 1.10.0


At the moment, after FLINK-13982, when we calculate the JVM direct memory 
limit, we only account for memory segment network buffers but not for direct 
allocations from netty arenas in 
org.apache.flink.runtime.io.network.netty.NettyBufferPool. We should include 
netty arenas in the shuffle memory calculations.





[jira] [Created] (FLINK-14522) Adjust GC Cleaner for unsafe memory and Java 11

2019-10-24 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14522:
---

 Summary: Adjust GC Cleaner for unsafe memory and Java 11 
 Key: FLINK-14522
 URL: https://issues.apache.org/jira/browse/FLINK-14522
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


sun.misc.Cleaner is not available in Java 11.
It was moved to jdk.internal.ref.Cleaner in the java.base module.
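
One possible way to handle both locations is to probe for the class by name at 
runtime. This is only a sketch: the helper class below is hypothetical, it only 
checks which class can be loaded, and it does not show how the cleaner would 
then be invoked:

```java
import java.util.Optional;

/** Hypothetical helper, not part of Flink: finds the Cleaner class of the running JVM. */
final class CleanerClassLookup {
    // Candidate class names, in the order they are tried.
    private static final String[] CANDIDATES = {
        "sun.misc.Cleaner",         // Java 8
        "jdk.internal.ref.Cleaner"  // newer Java versions (java.base module)
    };

    private CleanerClassLookup() {}

    /** Returns the first candidate class name that can be loaded, if any. */
    static Optional<String> findCleanerClassName() {
        for (String name : CANDIDATES) {
            try {
                // Load without initializing the class; we only probe for existence.
                Class.forName(name, false, CleanerClassLookup.class.getClassLoader());
                return Optional.of(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not available on this JVM, try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```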





[jira] [Created] (FLINK-14400) Shrink the scope of MemoryManager from TaskExecutor to slot

2019-10-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14400:
---

 Summary: Shrink the scope of MemoryManager from TaskExecutor to 
slot 
 Key: FLINK-14400
 URL: https://issues.apache.org/jira/browse/FLINK-14400
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


MemoryManager currently manages the memory bookkeeping for all slots/tasks 
inside one TaskExecutor. For better abstraction and isolation of slots, we can 
shrink its scope and make it per slot. The memory limits are now fixed per slot 
at the moment of slot creation. All operators sharing the slot will get their 
fixed fractional limits. In the future, we might make it possible for operators 
to over-allocate beyond their fraction limit if there is some free memory 
available in the slot, but it should be possible to reclaim the over-allocated 
memory at any time if another operator decides to claim its fair share within 
its limit.





[jira] [Created] (FLINK-14399) Add memory chunk reservation API to MemoryManager

2019-10-15 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-14399:
---

 Summary: Add memory chunk reservation API to MemoryManager
 Key: FLINK-14399
 URL: https://issues.apache.org/jira/browse/FLINK-14399
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


MemoryManager allocates paged segments from the provided memory pools of 
different types (on-/off-heap). Additionally, it can manage reservation and 
release of arbitrarily sized chunks of memory from the same memory pools, 
respecting their overall limit. How the memory is allocated, used and freed is 
then up to the memory user. MemoryManager is just a bookkeeping and 
limit-checking component.
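
The bookkeeping idea can be sketched as follows. This is a simplified, 
hypothetical class with a single budget, not the actual MemoryManager API; 
the real component would also distinguish memory types and owners:

```java
/** Hypothetical sketch, not the Flink MemoryManager: pure bookkeeping of a memory budget. */
final class MemoryBudget {
    private final long totalBytes;
    private long reservedBytes;

    MemoryBudget(long totalBytes) {
        if (totalBytes < 0) {
            throw new IllegalArgumentException("totalBytes must not be negative");
        }
        this.totalBytes = totalBytes;
    }

    /** Reserves the chunk if it fits into the remaining budget; no memory is allocated here. */
    synchronized boolean tryReserve(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must not be negative");
        }
        if (bytes > totalBytes - reservedBytes) {
            return false;
        }
        reservedBytes += bytes;
        return true;
    }

    /** Returns a previously reserved chunk to the budget. */
    synchronized void release(long bytes) {
        if (bytes < 0 || bytes > reservedBytes) {
            throw new IllegalStateException("cannot release " + bytes + " bytes");
        }
        reservedBytes -= bytes;
    }

    synchronized long availableBytes() {
        return totalBytes - reservedBytes;
    }
}
```

The caller decides how to actually allocate and free the memory it reserved; 
the budget only enforces the overall limit.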





[jira] [Created] (FLINK-13963) Consolidate Hadoop file systems usage and Hadoop integration docs

2019-09-04 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13963:
---

 Summary: Consolidate Hadoop file systems usage and Hadoop 
integration docs
 Key: FLINK-13963
 URL: https://issues.apache.org/jira/browse/FLINK-13963
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / FileSystem, Connectors / Hadoop 
Compatibility, Documentation, FileSystems
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0


We have Hadoop-related docs in several places at the moment:
 * *dev/batch/connectors.md* (Hadoop FS implementations and setup)
 * *dev/batch/hadoop_compatibility.md* (the claim that Flink always has Hadoop 
types out of the box is no longer valid, as we do not build and provide Flink 
with Hadoop by default)
 * *ops/filesystems/index.md* (plugins, Hadoop FS implementations and setup 
revisited)
 * *ops/deployment/hadoop.md* (Hadoop classpath)
 * *ops/config.md* (deprecated way to provide Hadoop configuration in Flink 
conf)

We could consolidate all these pieces of docs into a consistent structure to 
help users to navigate through the docs to well-defined spots depending on 
which feature they are trying to use.

The places in docs which should contain the information about Hadoop:
 * *dev/batch/hadoop_compatibility.md* (only Dataset API specific stuff about 
integration with Hadoop)
 * *ops/filesystems/index.md* (Flink FS plugins and Hadoop FS implementations)
 * *ops/deployment/hadoop.md* (Hadoop configuration and classpath)

How to set up Hadoop itself should be described only in 
*ops/deployment/hadoop.md*. All other places dealing with Hadoop/HDFS should 
contain only their own related content and just reference 'how to configure 
Hadoop' there. Likewise, all chapters about writing to file systems (batch 
connectors and streaming file sinks) should just reference the file systems 
docs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13927) Add note about hadoop dependencies for local debug

2019-08-30 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13927:
---

 Summary: Add note about hadoop dependencies for local debug
 Key: FLINK-13927
 URL: https://issues.apache.org/jira/browse/FLINK-13927
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / FileSystem, Documentation, FileSystems
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


Currently, if a user tries to run a job locally (e.g. from an IDE) and uses a 
Hadoop fs, it will not work if the Hadoop dependencies are not on the class 
path, which is the case for the example from the quick start.

We can add a hint about adding provided hadoop dependencies to:
[https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html]

and cross reference with:
[https://ci.apache.org/projects/flink/flink-docs-master/ops/filesystems/index.html#hadoop-configuration]





[jira] [Created] (FLINK-13820) Breaking long function argument lists and chained method calls

2019-08-22 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13820:
---

 Summary: Breaking long function argument lists and chained method 
calls
 Key: FLINK-13820
 URL: https://issues.apache.org/jira/browse/FLINK-13820
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


Break the lines of statements that are too long (the line length limit is yet 
to be fully defined) to improve code readability in case of
 * Long function argument lists (declaration or call): void func(type1 arg1, 
type2 arg2, ...)
 * Long sequence of chained calls: 
list.stream().map(...).reduce(...).collect(...)...

Rules:
 * Break the list of arguments/calls if the line exceeds limit or earlier if 
you believe that the breaking would improve the code readability
 * If you break the line then each argument/call should have a separate line, 
including the first one
 * Each new line argument/call should have one extra indentation relative to 
the line of the parent function name or called entity
 * The opening parenthesis always stays on the line of the parent function name
 * The possible thrown exception list is never broken and stays on the same 
last line
 * The dot of a chained call is always on the line of that chained call, 
preceding the call at the beginning

Examples of breaking:
 * Function arguments

{code:java}
public void func(
    int arg1,
    int arg2,
    ...) throws E1, E2, E3 {
    
}{code}

 * Chained method calls:

{code:java}
values
    .stream()
    .map(...)
    .collect(...);{code}





[jira] [Created] (FLINK-13819) Introduce RpcEndpoint State

2019-08-22 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13819:
---

 Summary: Introduce RpcEndpoint State
 Key: FLINK-13819
 URL: https://issues.apache.org/jira/browse/FLINK-13819
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.10.0, 1.9.1


To better reflect the lifecycle of RpcEndpoint, we suggest introducing its 
state:
 * created
 * started
 * stopping

We can use the state, e.g., to decide how to react to API calls if it is 
already known that the RpcEndpoint is terminating, as required e.g. for 
FLINK-13769.
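
A rough sketch of how such a state could be tracked and queried. The class and 
its methods are hypothetical and only illustrate the three states listed above; 
the actual RpcEndpoint design may differ:

```java
/** Hypothetical sketch, not the Flink RpcEndpoint: tracks created/started/stopping. */
final class EndpointLifecycle {
    enum State { CREATED, STARTED, STOPPING }

    private State state = State.CREATED;

    /** An endpoint can only be started once, from the CREATED state. */
    synchronized void start() {
        if (state != State.CREATED) {
            throw new IllegalStateException("Cannot start endpoint in state " + state);
        }
        state = State.STARTED;
    }

    /** Marks the endpoint as terminating; later API calls can be rejected early. */
    synchronized void stop() {
        state = State.STOPPING;
    }

    /** API calls are only served while the endpoint is started. */
    synchronized boolean acceptsCalls() {
        return state == State.STARTED;
    }

    synchronized State getState() {
        return state;
    }
}
```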





[jira] [Created] (FLINK-13812) Code style for the usage of Java Optional

2019-08-21 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13812:
---

 Summary: Code style for the usage of Java Optional
 Key: FLINK-13812
 URL: https://issues.apache.org/jira/browse/FLINK-13812
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin








[jira] [Created] (FLINK-13804) Collections initial capacity

2019-08-20 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13804:
---

 Summary: Collections initial capacity
 Key: FLINK-13804
 URL: https://issues.apache.org/jira/browse/FLINK-13804
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Project Website
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin


The code style conclusion to add to the website:
 
Set the initial capacity only if there is a good proven reason to do it. 
Otherwise do not clutter the code with it.





[jira] [Created] (FLINK-13802) Flink code style guide

2019-08-20 Thread Andrey Zagrebin (Jira)
Andrey Zagrebin created FLINK-13802:
---

 Summary: Flink code style guide
 Key: FLINK-13802
 URL: https://issues.apache.org/jira/browse/FLINK-13802
 Project: Flink
  Issue Type: Task
  Components: Documentation, Project Website
Reporter: Andrey Zagrebin


This is an umbrella issue to introduce and improve the Flink code style guide.





[jira] [Created] (FLINK-13642) Refactor ShuffleMaster to optionally provide preferred TM location for produced partitions

2019-08-07 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13642:
---

 Summary: Refactor ShuffleMaster to optionally provide preferred TM 
location for produced partitions
 Key: FLINK-13642
 URL: https://issues.apache.org/jira/browse/FLINK-13642
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13641) Consider removing UnknownInputChannel in NettyShuffleEnvironment

2019-08-07 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13641:
---

 Summary: Consider removing UnknownInputChannel in 
NettyShuffleEnvironment
 Key: FLINK-13641
 URL: https://issues.apache.org/jira/browse/FLINK-13641
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: Andrey Zagrebin


UnknownInputChannel is basically a placeholder used while the producer is 
unknown. It contains some partition parameters to use later for the creation 
of the known channel.

If NettyShuffleEnvironment#updatePartitionInfo provides enough information to 
add the known channel, one could consider removing UnknownInputChannel and 
refactoring the indexed channel array into a dynamic list.

Initially suggested in 
[https://github.com/apache/flink/pull/8362#discussion_r290308989]





[jira] [Created] (FLINK-13640) Consider TaskDeploymentDescriptorFactory to be Execution specific

2019-08-07 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13640:
---

 Summary: Consider TaskDeploymentDescriptorFactory to be Execution 
specific
 Key: FLINK-13640
 URL: https://issues.apache.org/jira/browse/FLINK-13640
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin


Suggested in [https://github.com/apache/flink/pull/8362#discussion_r290297974]





[jira] [Created] (FLINK-13639) Consider refactoring of IntermediateResultPartitionID to consist of IntermediateDataSetID and partitionIndex

2019-08-07 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13639:
---

 Summary: Consider refactoring of IntermediateResultPartitionID to 
consist of IntermediateDataSetID and partitionIndex
 Key: FLINK-13639
 URL: https://issues.apache.org/jira/browse/FLINK-13639
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin








[jira] [Created] (FLINK-13638) Refactor RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption to return result action

2019-08-07 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13638:
---

 Summary: Refactor 
RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption to return 
result action
 Key: FLINK-13638
 URL: https://issues.apache.org/jira/browse/FLINK-13638
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.9.0, 1.10.0
Reporter: Andrey Zagrebin


RemoteChannelStateChecker#isProducerConsumerReadyOrAbortConsumption either 
triggers some action (fail or cancel) or returns a decision (trigger new 
partition check or not). It would be more symmetric if this class did not 
trigger any action but only returned a decision about what to do:

 
{code:java}
enum Action {
  FAIL(Throwable cause),
  CANCEL(String msg),
  TRIGGER_PARTITION_CHECK, NOOP
}
{code}
 

Then the caller would be responsible for performing the action. That way this 
class would only need access to {{responseHandle.getProducerExecutionState()}} 
and not {{responseHandle}} itself.
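
Since Java enum constants cannot carry a per-call cause or message, the 
decision could be modelled as a small value class. The following is only a 
sketch with hypothetical names, not code from the PR:

```java
import java.util.Objects;

/** Hypothetical sketch: a decision returned to the caller instead of a triggered action. */
final class PartitionCheckDecision {
    enum Kind { FAIL, CANCEL, TRIGGER_PARTITION_CHECK, NOOP }

    private final Kind kind;
    private final Throwable cause; // set only for FAIL
    private final String message;  // set only for CANCEL

    private PartitionCheckDecision(Kind kind, Throwable cause, String message) {
        this.kind = kind;
        this.cause = cause;
        this.message = message;
    }

    static PartitionCheckDecision fail(Throwable cause) {
        return new PartitionCheckDecision(Kind.FAIL, Objects.requireNonNull(cause), null);
    }

    static PartitionCheckDecision cancel(String message) {
        return new PartitionCheckDecision(Kind.CANCEL, null, Objects.requireNonNull(message));
    }

    static PartitionCheckDecision triggerPartitionCheck() {
        return new PartitionCheckDecision(Kind.TRIGGER_PARTITION_CHECK, null, null);
    }

    static PartitionCheckDecision noop() {
        return new PartitionCheckDecision(Kind.NOOP, null, null);
    }

    Kind getKind() {
        return kind;
    }

    Throwable getCause() {
        return cause;
    }

    String getMessage() {
        return message;
    }
}
```

The caller would switch on getKind() and perform the fail, cancel or partition 
check itself.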

From PR discussion 
[https://github.com/apache/flink/pull/8463#discussion_r288290783]





[jira] [Created] (FLINK-13581) BatchFineGrainedRecoveryITCase failed on Travis

2019-08-05 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13581:
---

 Summary: BatchFineGrainedRecoveryITCase failed on Travis
 Key: FLINK-13581
 URL: https://issues.apache.org/jira/browse/FLINK-13581
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin
 Fix For: 1.9.0


[https://travis-ci.com/flink-ci/flink/jobs/221567908]





[jira] [Created] (FLINK-13435) Remove ShuffleDescriptor.ReleaseType and make release semantics fixed per partition type

2019-07-26 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13435:
---

 Summary: Remove ShuffleDescriptor.ReleaseType and make release 
semantics fixed per partition type
 Key: FLINK-13435
 URL: https://issues.apache.org/jira/browse/FLINK-13435
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Runtime / Network
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin
 Fix For: 1.9.0, 1.10.0


In the long term we do not need auto-release semantics for blocking 
(persistent) partitions. We expect them always to be released externally by the 
JM and assume they can be consumed multiple times.

The pipelined partitions always have only one consumer and one consumption 
attempt. Afterwards they can always be released automatically.

ShuffleDescriptor.ReleaseType was introduced to make release semantics more 
flexible but it is not needed in the long term.

FORCE_PARTITION_RELEASE_ON_CONSUMPTION was introduced as a safety net to be 
able to fall back to the 1.8 behaviour without the partition tracker and the JM 
taking care of blocking partition release. We can make this option specific to 
NettyShuffleEnvironment, which was the only existing shuffle service before. If 
it is activated, then the blocking partition is also auto-released on a 
consumption attempt as it was before. The fine-grained recovery will just not 
find the partition after the job restart in this case and will restart the 
producer.





[jira] [Created] (FLINK-13408) Schedule StandaloneResourceManager.setFailUnfulfillableRequest whenever the leadership is acquired

2019-07-24 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13408:
---

 Summary: Schedule 
StandaloneResourceManager.setFailUnfulfillableRequest whenever the leadership 
is acquired
 Key: FLINK-13408
 URL: https://issues.apache.org/jira/browse/FLINK-13408
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin
 Fix For: 1.9.0, 1.10.0


We introduced _StandaloneResourceManager.setFailUnfulfillableRequest_ to give 
task executors some time to register their available slots before slot 
requests are checked for whether they can be fulfilled. 
_setFailUnfulfillableRequest_ is currently scheduled only once, when the RM is 
initialised, but the task executors will register themselves every time this RM 
gets the leadership. Hence, _setFailUnfulfillableRequest_ should be scheduled 
after each leader election.





[jira] [Created] (FLINK-13407) Fix thread visibility of checked SlotManager.failUnfulfillableRequest in StandaloneResourceManagerTest

2019-07-24 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13407:
---

 Summary: Fix thread visibility of checked 
SlotManager.failUnfulfillableRequest in StandaloneResourceManagerTest
 Key: FLINK-13407
 URL: https://issues.apache.org/jira/browse/FLINK-13407
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin
 Fix For: 1.9.0, 1.10.0


One of the potential reasons for the test instability described in FLINK-13242 
is that the test checks _failUnfulfillableRequest_ directly in its own thread 
but it is set in the main RPC thread of _StandaloneResourceManager#initialize_. 
This can lead to a problem with the visibility of 
_SlotManager.failUnfulfillableRequest_ in 
_StandaloneResourceManagerTest#testStartupPeriod_.

The idea for the fix is to read _SlotManager.failUnfulfillableRequest_ in the 
main thread of _StandaloneResourceManager_, get the result as a future and 
check it in _StandaloneResourceManagerTest.assertHappensUntil_.
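
The pattern can be illustrated with plain JDK classes. In this sketch a 
single-threaded executor stands in for the main RPC thread; the class and 
method names are hypothetical and do not come from the Flink code base:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: a field that is only written and read in one "main" thread. */
final class MainThreadFlag {
    // Plain (non-volatile) field; safe because only the main thread touches it.
    private boolean failUnfulfillableRequest;

    // Stand-in for the main RPC thread of the endpoint.
    private final ExecutorService mainThread =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread thread = new Thread(runnable, "main-thread");
                thread.setDaemon(true);
                return thread;
            });

    /** Sets the flag in the main thread. */
    CompletableFuture<Void> setInMainThread(boolean value) {
        return CompletableFuture.runAsync(() -> failUnfulfillableRequest = value, mainThread);
    }

    /** Reads the flag in the main thread and hands the value over via the future. */
    CompletableFuture<Boolean> readInMainThread() {
        return CompletableFuture.supplyAsync(() -> failUnfulfillableRequest, mainThread);
    }

    void shutdown() {
        mainThread.shutdownNow();
    }
}
```

The test thread then only looks at the future result, so the value is handed 
over with proper visibility guarantees instead of being read directly.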





[jira] [Created] (FLINK-13371) Release partitions in JM if producer gets restarted

2019-07-22 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13371:
---

 Summary: Release partitions in JM if producer gets restarted
 Key: FLINK-13371
 URL: https://issues.apache.org/jira/browse/FLINK-13371
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Runtime / Network
Affects Versions: 1.9.0
Reporter: Andrey Zagrebin


As discussed in FLINK-13245, there can be a case where the producer does not 
even detect any consumption attempt if the consumer fails before the connection 
is established. It means we cannot fully rely on the shuffle service for the 
release on consumption in case of consumer failure. When the producer restarts, 
it will leak partitions from the previous attempt. Previously we had an 
explicit release call for this case in Execution.cancel/suspend. Basically, the 
JM has to explicitly release all partitions produced by the previous task 
execution attempt in case of producer restart, including `released on 
consumption` partitions. For this change, we might need to track all partitions 
in PartitionTrackerImpl.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13169) IT test for fine-grained recovery (task executor failures)

2019-07-09 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13169:
---

 Summary: IT test for fine-grained recovery (task executor failures)
 Key: FLINK-13169
 URL: https://issues.apache.org/jira/browse/FLINK-13169
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.9.0


The BatchFineGrainedRecoveryITCase can be extended with an additional test 
failure strategy which abruptly shuts down the task executor. This leads to the 
loss of all previously completed and the in-progress mapper result partitions. 
The fail-over strategy should subsequently restart the current in-progress 
mapper and all previous mappers because the previous result is unavailable. 
When the source is recomputed, all mappers have to be restarted again to 
recalculate their lost results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13019) IT test for fine-grained recovery

2019-06-27 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-13019:
---

 Summary: IT test for fine-grained recovery
 Key: FLINK-13019
 URL: https://issues.apache.org/jira/browse/FLINK-13019
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.9.0








[jira] [Created] (FLINK-12960) Move ResultPartitionDeploymentDescriptor#releasedOnConsumption to PartitionDescriptor#releasedOnConsumption

2019-06-24 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-12960:
---

 Summary: Move 
ResultPartitionDeploymentDescriptor#releasedOnConsumption to 
PartitionDescriptor#releasedOnConsumption
 Key: FLINK-12960
 URL: https://issues.apache.org/jira/browse/FLINK-12960
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.9.0


ResultPartitionDeploymentDescriptor#releasedOnConsumption expresses how the 
shuffle user intends to use the partition. If it is not 
supported by the shuffle service for a certain type of partition, 
ShuffleMaster#registerPartitionWithProducer and 
ShuffleEnvironment#createResultPartitionWriters should throw an exception. 
ShuffleMaster#registerPartitionWithProducer takes PartitionDescriptor. 
ResultPartitionDeploymentDescriptor#releasedOnConsumption should be part of 
PartitionDescriptor so that not only ShuffleEnvironment but also ShuffleMaster 
is aware of releasedOnConsumption.





[jira] [Created] (FLINK-12890) Add partition lifecycle related Shuffle API

2019-06-18 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-12890:
---

 Summary: Add partition lifecycle related Shuffle API
 Key: FLINK-12890
 URL: https://issues.apache.org/jira/browse/FLINK-12890
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Andrey Zagrebin
Assignee: Andrey Zagrebin
 Fix For: 1.9.0


At the moment we have ShuffleEnvironment.releasePartitions, which is used to 
release the locally occupied resources of a partition. The JM can also use it 
by calling TaskExecutorGateway.releasePartitions.

To support lifecycle management of partitions (FLINK-12069, relevant mostly for 
batch and blocking partitions), we need to extend the Shuffle API:
 * ShuffleDescriptor.hasLocalResources() indicates that this partition occupies 
local resources on the TM and requires the TM to be running to consume the 
produced data (e.g. true for the default NettyShuffleEnvironment and false for 
externally stored partitions). If a partition needs external lifecycle 
management and is not released after the first consumption is done 
(ResultPartitionDeploymentDescriptor.isReleasedOnConsumption()), then the RM/JM 
should keep the TMs which produce these partitions running as long as the 
partition still needs to be consumed. The connection to these TMs should also 
be kept in order to issue the RPC call TaskExecutorGateway.releasePartitions 
once the partition is not needed any more.

 * ShuffleMaster.removePartitionExternally(): the JM should call this whenever 
the partition does not need to be consumed any more. This call releases 
partition resources possibly occupied externally, outside of the TM, and should 
not depend on ShuffleDescriptor.hasLocalResources.
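
The release flow described by the two points above could look roughly as 
follows. This is only a sketch under assumptions: the interfaces are simplified 
stand-ins that reuse the proposed method names, while partitionId(), 
onPartitionNoLongerNeeded() and demo() are hypothetical helpers added for 
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartitionLifecycleSketch {

    /** Simplified stand-in for the proposed ShuffleDescriptor extension. */
    interface ShuffleDescriptor {
        String partitionId(); // hypothetical accessor, for illustration only
        boolean hasLocalResources();
    }

    /** Simplified stand-in for the proposed ShuffleMaster extension. */
    interface ShuffleMaster {
        void removePartitionExternally(ShuffleDescriptor descriptor);
    }

    /** Simplified stand-in for the RPC gateway to the producing TM. */
    interface TaskExecutorGateway {
        void releasePartitions(List<String> partitionIds);
    }

    /** JM-side handling once a partition does not need to be consumed any more. */
    static void onPartitionNoLongerNeeded(
            ShuffleDescriptor descriptor,
            ShuffleMaster shuffleMaster,
            TaskExecutorGateway producerTaskExecutor) {
        // Local resources exist only on the producing TM, so the RPC is conditional.
        if (descriptor.hasLocalResources()) {
            producerTaskExecutor.releasePartitions(
                    Collections.singletonList(descriptor.partitionId()));
        }
        // The external release does not depend on hasLocalResources().
        shuffleMaster.removePartitionExternally(descriptor);
    }

    /** Runs the flow against recording stubs and returns the calls in order. */
    static List<String> demo(final boolean hasLocalResources) {
        final List<String> calls = new ArrayList<>();
        ShuffleDescriptor descriptor = new ShuffleDescriptor() {
            @Override public String partitionId() { return "p1"; }
            @Override public boolean hasLocalResources() { return hasLocalResources; }
        };
        ShuffleMaster master = d -> calls.add("external:" + d.partitionId());
        TaskExecutorGateway taskExecutor = ids -> calls.add("tm:" + ids);
        onPartitionNoLongerNeeded(descriptor, master, taskExecutor);
        return calls;
    }
}
```

With these stubs, demo(true) records the TM release followed by the external 
removal, and demo(false) records only the external removal.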





[jira] [Created] (FLINK-12873) Create a separate maven module for Shuffle API

2019-06-17 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-12873:
---

 Summary: Create a separate maven module for Shuffle API
 Key: FLINK-12873
 URL: https://issues.apache.org/jira/browse/FLINK-12873
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Network
Reporter: Andrey Zagrebin


At the moment, the shuffle service API is part of the flink-runtime maven 
module. Implementers of other shuffle services will have to depend on the fat 
flink-runtime dependency. We should consider factoring out the shuffle API 
interfaces into a separate maven module which depends only on flink-core. Later 
we can consider the same for the custom high availability services.

The final structure could be e.g. (open to discussion):
 * flink-runtime (already includes default shuffle and high availability 
implementations)
 * flink-runtime-extensions
 ** flink-runtime-extensions-core
 ** flink-shuffle-extensions-api
 ** flink-high-availability-extensions-api





[jira] [Created] (FLINK-12731) Load shuffle service implementations from plugin manager

2019-06-04 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-12731:
---

 Summary: Load shuffle service implementations from plugin manager
 Key: FLINK-12731
 URL: https://issues.apache.org/jira/browse/FLINK-12731
 Project: Flink
  Issue Type: Sub-task
Reporter: Andrey Zagrebin


The simplest way to load a shuffle service is from the class path with the 
default class loader. Additionally, shuffle service implementations can be 
loaded as plugins with their own class loaders using PluginManager.
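
The difference between the two loading paths can be illustrated with plain JDK 
facilities. This sketch does not use Flink's PluginManager: 
ShuffleServiceFactory is a hypothetical service interface, and a 
URLClassLoader stands in for a plugin class loader (it delegates parent-first, 
so it gives weaker isolation than a real plugin mechanism would).

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ShuffleServiceDiscoverySketch {

    /** Hypothetical service interface, not part of Flink. */
    public interface ShuffleServiceFactory {
        String name();
    }

    /** Collects all implementations registered under META-INF/services for the given loader. */
    static List<ShuffleServiceFactory> discover(ClassLoader classLoader) {
        List<ShuffleServiceFactory> found = new ArrayList<>();
        for (ShuffleServiceFactory factory
                : ServiceLoader.load(ShuffleServiceFactory.class, classLoader)) {
            found.add(factory);
        }
        return found;
    }

    /** Path 1: implementations visible to the default (application) class loader. */
    static List<ShuffleServiceFactory> discoverFromClassPath() {
        return discover(ShuffleServiceDiscoverySketch.class.getClassLoader());
    }

    /** Path 2: implementations packaged in separate jars with their own class loader. */
    static List<ShuffleServiceFactory> discoverFromPluginJars(URL[] pluginJars) {
        // The loader is intentionally left open: it has to live as long as the plugin is used.
        ClassLoader pluginLoader = new URLClassLoader(
                pluginJars, ShuffleServiceDiscoverySketch.class.getClassLoader());
        return discover(pluginLoader);
    }
}
```

Both methods return an empty list when no implementation is registered for 
the interface.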





[jira] [Created] (FLINK-12706) Introduce ShuffleManager interface and its configuration

2019-06-03 Thread Andrey Zagrebin (JIRA)
Andrey Zagrebin created FLINK-12706:
---

 Summary: Introduce ShuffleManager interface and its configuration
 Key: FLINK-12706
 URL: https://issues.apache.org/jira/browse/FLINK-12706
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Configuration, Runtime / Coordination, Runtime 
/ Network
Reporter: Andrey Zagrebin
 Fix For: 1.9.0


ShuffleManager should be a factory for ShuffleMaster in the JM and 
ShuffleService in the TM. The default implementation is already available as 
the former NetworkEnvironment. To make it pluggable, we need to provide service 
loading for the ShuffleManager implementation class configured in the Flink 
configuration.
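
A rough sketch of such a factory and its loading by configured class name. 
Everything here is an assumption made for illustration: the interfaces are 
empty stand-ins, the configuration is a plain Map rather than Flink's 
Configuration, and the key "shuffle-manager.class" is hypothetical.

```java
import java.util.Map;

public class ShuffleManagerLoadingSketch {

    interface ShuffleMaster {}   // stand-in for the JM-side component
    interface ShuffleService {}  // stand-in for the TM-side component

    /** Factory for both sides of the shuffle, as proposed in this issue. */
    interface ShuffleManager {
        ShuffleMaster createShuffleMaster();
        ShuffleService createShuffleService();
    }

    /** Stand-in for the default implementation. */
    public static class DefaultShuffleManager implements ShuffleManager {
        @Override public ShuffleMaster createShuffleMaster() { return new ShuffleMaster() {}; }
        @Override public ShuffleService createShuffleService() { return new ShuffleService() {}; }
    }

    /** Hypothetical configuration key, not an actual Flink option. */
    static final String SHUFFLE_MANAGER_CLASS_KEY = "shuffle-manager.class";

    /** Instantiates the configured ShuffleManager, falling back to the default one. */
    static ShuffleManager load(Map<String, String> config) {
        String className = config.getOrDefault(
                SHUFFLE_MANAGER_CLASS_KEY, DefaultShuffleManager.class.getName());
        try {
            // asSubclass fails with ClassCastException if the class is not a ShuffleManager.
            Class<? extends ShuffleManager> clazz =
                    Class.forName(className).asSubclass(ShuffleManager.class);
            // Requires an accessible no-argument constructor.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot load ShuffleManager implementation: " + className, e);
        }
    }
}
```

Calling load with an empty map yields the default manager, while a class name 
that cannot be resolved results in an IllegalArgumentException.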




