flink-io FileNotFoundException

2019-03-11 Thread Alexander Smirnov
Hi everybody,

I am using Flink 1.4.2, and periodically my job goes down with the following
exception in the logs. Relaunching the job does not help; only restarting the
whole cluster does.

Is there a JIRA issue for this? Will upgrading to 1.5 help?

java.io.FileNotFoundException: /tmp/flink-io-20a15b29-1838-4de0-b383-165b1c49655c/c41ceb40a4eca0d0b739e0d1e2db45b9cc22d160ef25769993084a90e6a79b78.0.buffer (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
    at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
    at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.<init>(StreamTwoInputProcessor.java:147)
    at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.init(TwoInputStreamTask.java:79)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)

Thank you,
Alex


Re: Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

2019-02-25 Thread Alexander Smirnov
Hi all,

I am getting a similar exception while upgrading from Flink 1.4 to 1.6:

```
06 Feb 2019 14:37:34,080 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
    ... 9 more
Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2
```
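
The InvalidClassException at the bottom is plain Java serialization at work:
the JobGraph bytes stored in ZooKeeper were written with a different version
of CheckpointCoordinatorConfiguration than the one now on the classpath, which
is expected when jumping from Flink 1.4 to 1.6. A minimal sketch of the
mechanism, using a hypothetical class of our own rather than a Flink one:

```
import java.io.Serializable;

// Hypothetical example class, not a Flink API. Without an explicit
// serialVersionUID, the JVM derives one from the class structure, so any
// change to the class makes previously serialized bytes unreadable and
// deserialization fails with java.io.InvalidClassException.
public class JobMetadata implements Serializable {
    private static final long serialVersionUID = 1L;
    private String jobName;
}
```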

Is it safe to clean the ZooKeeper state as suggested in the logs? What kind
of information would I be losing?

Thank you, Alexander

On Fri, Nov 16, 2018 at 7:46 PM Olga Luganska  wrote:

> Hi, Miki
>
> Thank you for reply!
>
> I have deleted zookeeper data and was able to restart cluster.
>
> Olga
>
> Sent from my iPhone
>
> On Nov 16, 2018, at 4:38 AM, miki haiat  wrote:
>
> I "solved" this issue by cleaning the zookeeper information and start the
> cluster again all the the checkpoint and job graph data will be erased and
> basacly you will start a new cluster...
>
> It's happened to me allot on a 1.5.x
> On a 1.6 things are running perfect .
> I'm not sure way this error is back again on 1.6.1 ?
>
>
> On Fri, 16 Nov 2018, 0:42 Olga Luganska wrote:
>> Hello,
>>
>> I am running a Flink 1.6.1 standalone HA cluster. Today I am unable to
>> start the cluster because of a "Fatal error in cluster entrypoint".
>> (I used to see this error when running Flink 1.5; after upgrading to
>> 1.6.1, which had a fix for this bug, everything worked well for a while.)
>>
>> Question: what exactly needs to be done to clean "state handle store"?
>>
>> 2018-11-15 15:09:53,181 DEBUG
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor  - Fencing
>> token not set: Ignoring message LocalFencedMessage(null,
>> org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the
>> fencing token is null.
>>
>> 2018-11-15 15:09:53,182 ERROR
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
>> occurred in the cluster entrypoint.
>>
>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could
>> not retrieve submitted JobGraph from state handle under
>> /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state
>> handle is broken. Try cleaning the state handle store.
>>
>> at
>> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>
>> at
>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>>
>> at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>
>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>>
>> at
>> 

Re: FileNotFoundException on starting the job

2018-11-02 Thread Alexander Smirnov
My guess is that the tmp directory got cleaned on your host and Flink
couldn't restore state from it upon startup.

Take a look at
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#configuring-temporary-io-directories
I think it is relevant.
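
If a tmp cleaner is the culprit, pointing Flink's spilling directories at a
path that survives cleanup should help. A sketch of the relevant
flink-conf.yaml entry (the path is only an example; in Flink 1.4 the key is
taskmanager.tmp.dirs, renamed io.tmp.dirs in later releases):

taskmanager.tmp.dirs: /data/flink/tmp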

On Thu, Nov 1, 2018 at 8:51 PM Dmitry Minaev  wrote:

> Hi everyone,
>
> I'm having an issue when restarting a job in Flink. I'm doing a simple
> stop with savepoint and then start from the savepoint. Savepoints are
> stored in a separate folder, there is no configuration for "/tmp" folder in
> my setup. There is only 1 task manager and parallelism is 1.
>
> I'm getting FileNotFoundException:
>
> 31 Oct 2018 23:40:35,837 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> filter-business-metrics -> Sink: data_feed (1/1)
> (51ce53532932c33805291dc188d2f99e) switched from DEPLOYING to RUNNING.
> 31 Oct 2018 23:40:35,837 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> agents-working-on-interactions (1/1) (72a916158d07f2353fb270848d95ba2f)
> switched from DEPLOYING to RUNNING.
> 31 Oct 2018 23:40:35,929 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph -
> interaction-details (1/1) (c004e64e90c0dbd3bc007459bc3d7420) switched from
> RUNNING to FAILED.
> java.io.FileNotFoundException: /tmp/flink-io-7bfd6603-c115-463d-bcfc-b97e31be5a37/f7ce787242e6afd91c3cbeccc2f74bc4a7dd0e6e600ff83e51bc5be9a95750f9.0.buffer (No such file or directory)
> at java.io.RandomAccessFile.open0(Native Method)
> at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
> at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
> at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
> at org.apache.flink.streaming.runtime.io.StreamInputProcessor.<init>(StreamInputProcessor.java:129)
> at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:56)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> at java.lang.Thread.run(Thread.java:748)
>
> I've checked the logs and there are no errors prior to that. The job was
> stopped with no issues, and on restart it started normally, moving multiple
> operators to the RUNNING state. But for several other operators it throws
> this FileNotFoundException.
>
> Any help is appreciated.
>
> -- Regards, Dmitry
> --
>
> --
> Dmitry
>


Re: Kafka connector error: This server does not host this topic-partition

2018-10-23 Thread Alexander Smirnov
Thanks Dominik, hope it will be resolved soon

On Tue, Oct 23, 2018 at 4:47 PM Dominik Wosiński  wrote:

> Hey Alexander,
> It seems that this issue occurs when the broker is down and the partition
> is electing a new leader, AFAIK. There is one JIRA issue I have found, not
> sure if that's what you are looking for:
> https://issues.apache.org/jira/browse/KAFKA-6221
>
> This issue is connected with Kafka itself rather than Flink.
>
> Best Regards,
> Dom.
>
> On Tue, 23 Oct 2018 at 15:04, Alexander Smirnov wrote:
>
>> Hi,
>>
>> I stumbled upon an exception in the "Exceptions" tab which I could not
>> explain. Do you know what could cause it? Unfortunately I don't know how to
>> reproduce it. Do you know if there is a respective JIRA issue for it?
>>
>> Here's the exception's stack trace:
>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception:
>> Failed to send data to Kafka: This server does not host this
>> topic-partition.
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
>> at
>> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
>> at
>> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
>> at
>> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
>> at
>> com.five9.distribution.FilterBusinessMetricsFunction.flatMap(FilterBusinessMetricsFunction.java:162)
>> at
>> com.five9.distribution.FilterBusinessMetricsFunction.flatMap(FilterBusinessMetricsFunction.java:31)
>> at
>> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
>> at
>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
>> at
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by:
>> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This
>> server does not host this topic-partition.
>>
>> Thank you,
>> Alex
>>
>
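
Since the error is transient and on the Kafka side (a partition leader
moving), one possible mitigation, suggested by the linked KAFKA-6221
discussion rather than verified in this thread, is to give the producer
client enough retries to ride out a leader election. A sketch with
placeholder broker and topic names:

```
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

public class RetryingProducerExample {

    public static FlinkKafkaProducer011<String> create() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Let the Kafka client retry sends that fail while the partition
        // leader moves, instead of failing the Flink job immediately.
        props.setProperty("retries", "10");
        props.setProperty("retry.backoff.ms", "500");
        return new FlinkKafkaProducer011<>("my-topic", new SimpleStringSchema(), props);
    }
}
```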


Kafka connector error: This server does not host this topic-partition

2018-10-23 Thread Alexander Smirnov
Hi,

I stumbled upon an exception in the "Exceptions" tab which I could not
explain. Do you know what could cause it? Unfortunately I don't know how to
reproduce it. Do you know if there is a respective JIRA issue for it?

Here's the exception's stack trace:

org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception: Failed
to send data to Kafka: This server does not host this topic-partition.
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
at
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
at
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
at
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at
com.five9.distribution.FilterBusinessMetricsFunction.flatMap(FilterBusinessMetricsFunction.java:162)
at
com.five9.distribution.FilterBusinessMetricsFunction.flatMap(FilterBusinessMetricsFunction.java:31)
at
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownTopicOrPartitionException:
This server does not host this topic-partition.

Thank you,
Alex


Re: Initializing mapstate hangs

2018-10-22 Thread Alexander Smirnov
I think that's because you declared it as a transient field.

Move the declaration inside the "open" function to resolve it.
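
A minimal sketch of that fix (Payload is the type from the snippet quoted
below): a transient field's initializer runs only where the function object
is constructed; after the function is serialized and shipped to a task
manager, the transient field is null again, so build the descriptor inside
open():

```
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.windowing.RichWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyWindowFunction extends RichWindowFunction<Payload, Payload, Tuple, TimeWindow> {

    private transient MapState<String, String> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Created here, on the task manager, after the runtime context exists.
        MapStateDescriptor<String, String> productsDescriptor =
                new MapStateDescriptor<>("mapState", String.class, String.class);
        state = getRuntimeContext().getMapState(productsDescriptor);
    }

    @Override
    public void apply(Tuple key, TimeWindow window, Iterable<Payload> input,
            Collector<Payload> out) throws Exception {
        // use state here
    }
}
```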

On Mon, Oct 22, 2018 at 3:48 PM Ahmad Hassan  wrote:

> 2018-10-22 13:46:31,944 INFO  org.apache.flink.runtime.taskmanager.Task
>   - Window(SlidingProcessingTimeWindows(18, 18),
> TimeTrigger, MetricWindowFunction) -> Map -> Sink: Unnamed (1/1)
> (5677190a0d292df3ad8f3521519cd980) switched from RUNNING to FAILED.
>
> java.lang.NullPointerException: The state properties must not be null
>
> at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:75)
>
> at
> org.apache.flink.streaming.api.operators.StreamingRuntimeContext.checkPreconditionsAndGetKeyedStateStore(StreamingRuntimeContext.java:174)
>
> at
> org.apache.flink.streaming.api.operators.StreamingRuntimeContext.getMapState(StreamingRuntimeContext.java:168)
>
> at
> com.sap.hybris.conversion.flink.processors.chain.MetricWindowFunction.open(MetricWindowFunction.java:62)
>
> at
> org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
>
> at
> org.apache.flink.api.java.operators.translation.WrappingFunction.open(WrappingFunction.java:45)
>
> at
> org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
>
> at
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
>
> at
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.open(WindowOperator.java:219)
>
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
>
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
>
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> On Sat, 20 Oct 2018 at 11:29, vino yang  wrote:
>
>> Hi Ahmad,
>>
>> Can you try to dump thread info from the Task Manager's JVM instance?
>>
>> Thanks, vino.
>>
>> On Sat, 20 Oct 2018 at 16:24, Ahmad Hassan wrote:
>>
>>> Flink 1.6.0. ValueState initialises successfully but MapState hangs.
>>>
>>> Regards
>>>
>>> On 20 Oct 2018, at 02:55, vino yang  wrote:
>>>
>>> Hi Ahmad,
>>>
>>> Which version of Flink do you use?
>>>
>>> Thanks, vino.
>>>
>>> On Fri, 19 Oct 2018 at 23:32, Ahmad Hassan wrote:
>>>
 Hi,

 Initializing MapState hangs in the window function. However, if I use
 ValueState then it is initialized successfully. I am using RocksDB to
 store the state.

public class MyWindowFunction extends RichWindowFunction<Payload, Payload, Tuple, TimeWindow>
{
    private transient MapStateDescriptor<String, String> productsDescriptor
            = new MapStateDescriptor<>("mapState", String.class, String.class);

    @Override
    public void apply(Tuple key, TimeWindow window, final Iterable<Payload> input,
            final Collector<Payload> out)
    {
        // do something
    }

    @Override
    public void open(Configuration parameters) throws Exception
    {
        System.out.println("## open init window state ");
        MapState<String, String> state =
                this.getRuntimeContext().getMapState(productsDescriptor); // <<< program hangs here
        System.out.println("## open window state " + state);
    }
}

 Thanks for the help.

>>>


Re: ArrayIndexOutOfBoundsException

2018-09-25 Thread Alexander Smirnov
Appreciate your help, Stefan! 
On Tue, 25 Sep 2018 at 18:19, Stefan Richter 
wrote:

> You only need to update the Flink jars; the job requires no update. I
> think you also cannot start from this checkpoint/savepoint after the
> upgrade because it seems to be corrupted by the bug. You need to use an
> older point to restart.
>
> Best,
> Stefan
>
>
> Am 25.09.2018 um 16:53 schrieb Alexander Smirnov <
> alexander.smirn...@gmail.com>:
>
> Thanks Stefan.
>
> Is it only the Flink runtime that should be updated, or should the job be
> recompiled too?
> Is there a workaround to start the job without upgrading Flink?
>
> Alex
>
> On Tue, Sep 25, 2018 at 5:48 PM Stefan Richter <
> s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> this problem looks like https://issues.apache.org/jira/browse/FLINK-8836,
>> which also matches your Flink version. I suggest updating to 1.4.3 or
>> higher to avoid the issue in the future.
>>
>> Best,
>> Stefan
>>
>>
>> Am 25.09.2018 um 16:37 schrieb Alexander Smirnov <
>> alexander.smirn...@gmail.com>:
>>
>> I'm getting an exception when starting a job from a savepoint. Why could
>> that happen?
>>
>> Flink 1.4.2
>>
>>
>> java.lang.IllegalStateException: Could not initialize operator state
>> backend.
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:301)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
>> at java.util.ArrayList.elementData(ArrayList.java:418)
>> at java.util.ArrayList.get(ArrayList.java:431)
>> at
>> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
>> at
>> com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
>> at
>> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
>> at
>> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:249)
>> at
>> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:136)
>> at
>> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
>> at
>> org.apache.flink.runtime.state.DefaultOperatorStateBackend.deserializeStateValues(DefaultOperatorStateBackend.java:584)
>> at
>> org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:399)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorStateBackend(StreamTask.java:733)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:299)
>> ... 6 more
>>
>>
>>
>


Re: ArrayIndexOutOfBoundsException

2018-09-25 Thread Alexander Smirnov
Thanks Stefan.

Is it only the Flink runtime that should be updated, or should the job be
recompiled too?
Is there a workaround to start the job without upgrading Flink?

Alex

On Tue, Sep 25, 2018 at 5:48 PM Stefan Richter 
wrote:

> Hi,
>
> this problem looks like https://issues.apache.org/jira/browse/FLINK-8836,
> which also matches your Flink version. I suggest updating to 1.4.3 or
> higher to avoid the issue in the future.
>
> Best,
> Stefan
>
>
> Am 25.09.2018 um 16:37 schrieb Alexander Smirnov <
> alexander.smirn...@gmail.com>:
>
> I'm getting an exception when starting a job from a savepoint. Why could
> that happen?
>
> Flink 1.4.2
>
>
> java.lang.IllegalStateException: Could not initialize operator state
> backend.
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:301)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
> at java.util.ArrayList.elementData(ArrayList.java:418)
> at java.util.ArrayList.get(ArrayList.java:431)
> at
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
> at
> com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
> at
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:249)
> at
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:136)
> at
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
> at
> org.apache.flink.runtime.state.DefaultOperatorStateBackend.deserializeStateValues(DefaultOperatorStateBackend.java:584)
> at
> org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:399)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorStateBackend(StreamTask.java:733)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:299)
> ... 6 more
>
>
>


ArrayIndexOutOfBoundsException

2018-09-25 Thread Alexander Smirnov
I'm getting an exception when starting a job from a savepoint. Why could
that happen?

Flink 1.4.2


java.lang.IllegalStateException: Could not initialize operator state
backend.
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:301)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
at java.util.ArrayList.elementData(ArrayList.java:418)
at java.util.ArrayList.get(ArrayList.java:431)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:249)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:136)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
at
org.apache.flink.runtime.state.DefaultOperatorStateBackend.deserializeStateValues(DefaultOperatorStateBackend.java:584)
at
org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:399)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorStateBackend(StreamTask.java:733)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:299)
... 6 more


Re: Kryo exception

2018-08-29 Thread Alexander Smirnov
Thanks Hequn!
On Thu, 30 Aug 2018 at 04:49, Hequn Cheng  wrote:

> Hi Alex,
>
> It seems to be a bug. There is a discussion here:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kryo-Exception-td20324.html
>
> Best, Hequn
>
> On Wed, Aug 29, 2018 at 10:19 PM Alexander Smirnov <
> alexander.smirn...@gmail.com> wrote:
>
>> Hi,
>>
>> A job fell into a restart loop with the following exception. Is it a
>> known issue?
>> What could cause it?
>>
>> Flink 1.4.2
>>
>> 16 Aug 2018 13:43:00,835 INFO org.apache.flink.runtime.taskmanager.Task -
>> Source: Custom Source -> (Filter -> Timestamps/Watermarks -> Map, Filter ->
>> Timestamps/Watermarks -> Flat Map) (1/1) (4306448a2c99603e3e19304357158d12)
>> switched from RUNNING to FAILED.
>> java.lang.IllegalStateException: Could not initialize operator state
>> backend.
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:301)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: com.esotericsoftware.kryo.KryoException:
>> java.lang.IndexOutOfBoundsException: Index: 54, Size: 1
>> Serialization trace:
>> topic
>> (org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition)
>> at
>> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
>> at
>> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:249)
>> at
>> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:136)
>> at
>> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
>> at
>> org.apache.flink.runtime.state.DefaultOperatorStateBackend.deserializeStateValues(DefaultOperatorStateBackend.java:584)
>> at
>> org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:399)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorStateBackend(StreamTask.java:733)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:299)
>> ... 6 more
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 54, Size: 1
>> at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>> at java.util.ArrayList.get(ArrayList.java:429)
>> at
>> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
>> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
>> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:728)
>> at
>> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
>> ... 15 more
>>
>


Kryo exception

2018-08-29 Thread Alexander Smirnov
Hi,

A job fell into a restart loop with the following exception. Is it a
known issue?
What could cause it?

Flink 1.4.2

16 Aug 2018 13:43:00,835 INFO org.apache.flink.runtime.taskmanager.Task -
Source: Custom Source -> (Filter -> Timestamps/Watermarks -> Map, Filter ->
Timestamps/Watermarks -> Flat Map) (1/1) (4306448a2c99603e3e19304357158d12)
switched from RUNNING to FAILED.
java.lang.IllegalStateException: Could not initialize operator state
backend.
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:301)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index: 54, Size: 1
Serialization trace:
topic
(org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition)
at
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:249)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:136)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
at
org.apache.flink.runtime.state.DefaultOperatorStateBackend.deserializeStateValues(DefaultOperatorStateBackend.java:584)
at
org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:399)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorStateBackend(StreamTask.java:733)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:299)
... 6 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 54, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:728)
at
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
... 15 more


How do I investigate checkpoints failures

2018-08-21 Thread Alexander Smirnov
Hello,

I have a cluster with multiple jobs running on it. One of the jobs has
checkpoints constantly failing
[image: image.png]

How do I investigate it?

Thank you,
Alex


High CPU usage

2018-08-17 Thread Alexander Smirnov
Hello,

I noticed CPU utilization went high and took a thread dump on the task
manager node. Why would the RocksDBMapState.entries() / seek0 call consume CPU?

It is Flink 1.4.2

"Co-Flat Map (3/4)" #16129 prio=5 os_prio=0 tid=0x7fefac029000
nid=0x338f runnable [0x7feed2002000]
   java.lang.Thread.State: RUNNABLE
at org.rocksdb.RocksIterator.seek0(Native Method)
at
org.rocksdb.AbstractRocksIterator.seek(AbstractRocksIterator.java:58)
at
org.apache.flink.contrib.streaming.state.RocksDBMapState$RocksDBMapIterator.loadCache(RocksDBMapState.java:489)
at
org.apache.flink.contrib.streaming.state.RocksDBMapState$RocksDBMapIterator.hasNext(RocksDBMapState.java:433)
at
org.apache.flink.contrib.streaming.state.RocksDBMapState.entries(RocksDBMapState.java:147)
at
org.apache.flink.runtime.state.UserFacingMapState.entries(UserFacingMapState.java:77)

Thank you,
Alex


Re: Flink log and out files

2018-08-01 Thread Alexander Smirnov
thanks guys,

So, is it a correct statement that if my job doesn't write anything to stdout,
the "*.out" file should be empty?

For some reason it contains the same info as the "log" file, and much more.

For the "log" files, I can control rotation via log4j configuration, but
how do I setup rotation for "out" files?
Or, how do I disable them at all?

I'm using 1.4.2

Thank you,
Alex

On Wed, Aug 1, 2018 at 7:00 PM Andrey Zagrebin 
wrote:

> Hi Alexander,
>
> there is also a doc link where log configuration  is described:
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/logging.html
> You can modify log configuration in conf directory according to logging
> framework docs.
>
> Cheers,
> Andrey
>
>
> On 1 Aug 2018, at 17:30, vino yang  wrote:
>
> Hi Alexander:
>
> .log and .out are different. Usually, the .log file stores the log
> information output by the logging framework. Flink uses slf4j as the log
> interface and supports log4j and logback configurations. The .out file
> stores the STDOUT output, which is usually produced when you call APIs
> such as the print() sink.
>
> Thanks, vino.
>
> 2018-08-01 23:19 GMT+08:00 Alexander Smirnov wrote:
>
>> Hi,
>>
>> could you please explain the difference between *.log and *.out files in
>> Flink?
>> What information is supposed to be in each of them?
>> Is "log" a subset of "out"?
>> How do I setup rotation with gzipping?
>>
>> Thank you,
>> Alex
>>
>
>
>


Flink log and out files

2018-08-01 Thread Alexander Smirnov
Hi,

could you please explain the difference between *.log and *.out files in
Flink?
What information is supposed to be in each of them?
Is "log" a subset of "out"?
How do I setup rotation with gzipping?

Thank you,
Alex


Is Flink using even-odd versioning system

2018-07-10 Thread Alexander Smirnov
to denote development and stable releases?


Re: Consolidated log for a job?

2018-05-14 Thread Alexander Smirnov
Hi Alexey,

I know that Kibana (https://en.wikipedia.org/wiki/Kibana) can show logs from
different servers on one screen. Maybe this is what you are looking for.

Alex

On Mon, May 14, 2018 at 5:17 PM NEKRASSOV, ALEXEI  wrote:

> Is there a way to see logs from multiple Task Managers **all in one place**
> (for a running or a completed job)? Or do I need to check the logs on each
> Task Manager individually?
>
>
>
> Thanks,
> Alex Nekrassov
>
>
>


Re: This server is not the leader for that topic-partition

2018-05-07 Thread Alexander Smirnov
Thank you, Piotr.

On Mon, May 7, 2018 at 2:59 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> Regardless if that will fix the problem or not, please consider upgrading
> to Kafka 0.11.0.2 or 1.0.1. Kafka 0.11.0 release was quite messy and it
> might be that the bug you have hit was fixed in 0.11.0.2.
>
> As a side note, as far as we know our FlinkKafkaProducer011 works fine
> with Kafka 1.0.x.
>
> Piotrek
>
> On 7 May 2018, at 12:12, Alexander Smirnov <alexander.smirn...@gmail.com>
> wrote:
>
> Hi Piotr, using 0.11.0 Kafka version
>
> On Sat, May 5, 2018 at 10:19 AM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> FlinkKafka011Producer uses Kafka 0.11.0.2.
>>
>> However, I'm not sure whether bumping the KafkaProducer version or
>> upgrading Kafka solves this issue. What Kafka version are you using?
>>
>> Piotrek
>>
>>
>> On 4 May 2018, at 17:55, Alexander Smirnov <alexander.smirn...@gmail.com>
>> wrote:
>>
>> Thanks for quick turnaround Stefan, Piotr
>>
>> This is a rarely reproducible issue and I will keep an eye on it.
>>
>> Searching on Stack Overflow I found
>> https://stackoverflow.com/questions/43378664/kafka-leader-election-causes-kafka-streams-crash
>>
>> They say the problem is fixed in version 0.10.2.1 of the Kafka producer, so
>> I wonder which version is used in the FlinkKafkaProducer integration. For
>> earlier versions it is proposed to use this configuration:
>>
>> final Properties props = new Properties();
>> ...
>> props.put(ProducerConfig.RETRIES_CONFIG, 10);
>> props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, Integer.toString(Integer.MAX_VALUE));
>> props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 2);
>>
>>
>>
>>
>> On Fri, May 4, 2018 at 4:58 PM Piotr Nowojski <pi...@data-artisans.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I think Stefan is right. Quick google search points to this:
>>> https://stackoverflow.com/questions/47767169/kafka-this-server-is-not-the-leader-for-that-topic-partition
>>>
>>> Please let us know if changing your configuration will solve the problem!
>>>
>>> Piotrek
>>>
>>> On 4 May 2018, at 15:53, Stefan Richter <s.rich...@data-artisans.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I think in general this means that your producer client does not connect
>>> to the correct Broker (the leader) but to a broker that is just a follower
>>> and the follower can not execute that request. However, I am not sure what
>>> causes this in the context of the FlinkKafkaProducer. Maybe Piotr (in CC)
>>> has an idea?
>>>
>>> Best,
>>> Stefan
>>>
>>> Am 04.05.2018 um 15:45 schrieb Alexander Smirnov <
>>> alexander.smirn...@gmail.com>:
>>>
>>> Hi,
>>>
>>> what could cause the following exception?
>>>
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception:
>>> Failed to send data to Kafka: This server is not the leader for that
>>> topic-partition.
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
>>> at
>>> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
>>> at
>>> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
>>> at
>>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
>>> at
>>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
>>> at
>>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
>>> at
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
>>> at
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
>>> at
>>> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
>>> at
>>> com.alexander.smirnov.FilterBMFunction.flatMap(FilterBMFunction.java:162)
>>>
>>>
>>> Thank you,
>>> Alex
>>>
>>>
>>>
>>>
>>
>


Re: This server is not the leader for that topic-partition

2018-05-07 Thread Alexander Smirnov
Hi Piotr, using 0.11.0 Kafka version

On Sat, May 5, 2018 at 10:19 AM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> FlinkKafka011Producer uses Kafka 0.11.0.2.
>
> However, I'm not sure whether bumping the KafkaProducer version or
> upgrading Kafka solves this issue. What Kafka version are you using?
>
> Piotrek
>
>
> On 4 May 2018, at 17:55, Alexander Smirnov <alexander.smirn...@gmail.com>
> wrote:
>
> Thanks for quick turnaround Stefan, Piotr
>
> This is a rarely reproducible issue and I will keep an eye on it.
>
> Searching on Stack Overflow I found
> https://stackoverflow.com/questions/43378664/kafka-leader-election-causes-kafka-streams-crash
>
> They say the problem is fixed in version 0.10.2.1 of the Kafka producer, so
> I wonder which version is used in the FlinkKafkaProducer integration. For
> earlier versions it is proposed to use this configuration:
>
> final Properties props = new Properties();
> ...
> props.put(ProducerConfig.RETRIES_CONFIG, 10);
> props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, Integer.toString(Integer.MAX_VALUE));
> props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 2);
>
>
>
>
> On Fri, May 4, 2018 at 4:58 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi,
>>
>> I think Stefan is right. Quick google search points to this:
>> https://stackoverflow.com/questions/47767169/kafka-this-server-is-not-the-leader-for-that-topic-partition
>>
>> Please let us know if changing your configuration will solve the problem!
>>
>> Piotrek
>>
>> On 4 May 2018, at 15:53, Stefan Richter <s.rich...@data-artisans.com>
>> wrote:
>>
>> Hi,
>>
>> I think in general this means that your producer client does not connect
>> to the correct Broker (the leader) but to a broker that is just a follower
>> and the follower can not execute that request. However, I am not sure what
>> causes this in the context of the FlinkKafkaProducer. Maybe Piotr (in CC)
>> has an idea?
>>
>> Best,
>> Stefan
>>
>> Am 04.05.2018 um 15:45 schrieb Alexander Smirnov <
>> alexander.smirn...@gmail.com>:
>>
>> Hi,
>>
>> what could cause the following exception?
>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception:
>> Failed to send data to Kafka: This server is not the leader for that
>> topic-partition.
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
>> at
>> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
>> at
>> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
>> at
>> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
>> at
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
>> at
>> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
>> at
>> com.alexander.smirnov.FilterBMFunction.flatMap(FilterBMFunction.java:162)
>>
>>
>> Thank you,
>> Alex
>>
>>
>>
>>
>


Re: This server is not the leader for that topic-partition

2018-05-04 Thread Alexander Smirnov
Thanks for quick turnaround Stefan, Piotr

This is a rarely reproducible issue and I will keep an eye on it.

Searching on Stack Overflow I found
https://stackoverflow.com/questions/43378664/kafka-leader-election-causes-kafka-streams-crash

They say the problem is fixed in version 0.10.2.1 of the Kafka producer, so I
wonder which version is used in the FlinkKafkaProducer integration. For
earlier versions it is proposed to use this configuration:

final Properties props = new Properties();
...
props.put(ProducerConfig.RETRIES_CONFIG, 10);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, Integer.toString(Integer.MAX_VALUE));
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 2);




On Fri, May 4, 2018 at 4:58 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> I think Stefan is right. Quick google search points to this:
> https://stackoverflow.com/questions/47767169/kafka-this-server-is-not-the-leader-for-that-topic-partition
>
> Please let us know if changing your configuration will solve the problem!
>
> Piotrek
>
> On 4 May 2018, at 15:53, Stefan Richter <s.rich...@data-artisans.com>
> wrote:
>
> Hi,
>
> I think in general this means that your producer client does not connect
> to the correct Broker (the leader) but to a broker that is just a follower
> and the follower can not execute that request. However, I am not sure what
> causes this in the context of the FlinkKafkaProducer. Maybe Piotr (in CC)
> has an idea?
>
> Best,
> Stefan
>
> Am 04.05.2018 um 15:45 schrieb Alexander Smirnov <
> alexander.smirn...@gmail.com>:
>
> Hi,
>
> what could cause the following exception?
>
> org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception: Failed
> to send data to Kafka: This server is not the leader for that
> topic-partition.
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
> at
> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
> at
> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
> at
> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
> at
> com.alexander.smirnov.FilterBMFunction.flatMap(FilterBMFunction.java:162)
>
>
> Thank you,
> Alex
>
>
>
>


This server is not the leader for that topic-partition

2018-05-04 Thread Alexander Smirnov
Hi,

what could cause the following exception?

org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception: Failed
to send data to Kafka: This server is not the leader for that
topic-partition.
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.checkErroneous(FlinkKafkaProducer011.java:999)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:614)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
at
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
at
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
at
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at com.alexander.smirnov.FilterBMFunction.flatMap(FilterBMFunction.java:162)


Thank you,
Alex


Can't send kafka message with timestamp

2018-04-26 Thread Alexander Smirnov
Hi,


I'm creating a Kafka producer with timestamps enabled, following the
instructions at
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html#kafka-producer


Optional customPartitioner = Optional.empty();

FlinkKafkaProducer011 result = new FlinkKafkaProducer011<>(
        defaultTopic, serializationSchema, properties, customPartitioner);

result.setWriteTimestampToKafka(true);



but I am getting an exception:


java.lang.IllegalArgumentException: Invalid timestamp: -1. Timestamp should always be non-negative or null.
    at org.apache.kafka.clients.producer.ProducerRecord.<init>(ProducerRecord.java:70)
    at org.apache.kafka.clients.producer.ProducerRecord.<init>(ProducerRecord.java:93)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:642)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.invoke(FlinkKafkaProducer011.java:93)
    at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:219)
    at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:830)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:808)
    at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)




Is there anything else I need to configure to embed timestamp
information into the resulting Kafka message?


Thank you,

Alex
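
For what it's worth, the -1 in the trace is the record's internal timestamp,
i.e. the elements reaching the sink never had an event timestamp assigned, so
setWriteTimestampToKafka(true) has nothing valid to write. A sketch of one way
to satisfy it (the Event type and its eventTimeMillis field are hypothetical):

```
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;

public class TimestampedStreamExample {

    // Hypothetical record type standing in for whatever the job sends.
    public static class Event {
        public long eventTimeMillis;
    }

    public static DataStream<Event> withTimestamps(StreamExecutionEnvironment env,
                                                   DataStream<Event> events) {
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // After this operator every record carries a non-negative timestamp,
        // which the Kafka producer can then copy into the Kafka record.
        return events.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Event>() {
            @Override
            public long extractAscendingTimestamp(Event e) {
                return e.eventTimeMillis;
            }
        });
    }
}
```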


Re: Checkpointing barriers

2018-04-24 Thread Alexander Smirnov
OK, I got it. Barrier n is an indicator of the n-th checkpoint.

My first impression was that barriers carry offset information, but that
was wrong.

Thanks for unblocking ;-)

Alex


Checkpointing barriers

2018-04-23 Thread Alexander Smirnov
Hi,

I'm reading documentation about checkpointing:
https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
It describes a case when an operator receives data from all its incoming
streams along with barriers. There's also an illustration on that page for
the case.

One thing confuses me, though.

Each data stream has its own source, which emits its own sequence of barriers.
This is very implementation specific, and the docs say that for incoming
Kafka streams the connector uses offset information to produce a barrier.
But the illustration and the explanation have "barrier n" in both of the
incoming streams.

"As soon as the operator receives snapshot barrier n from an incoming
stream, it cannot process any further records from that stream until it has
received the barrier n from the other inputs as well"

Can you please help me understand why "barrier n" would appear in
different streams?
I'd expect "barrier n" in stream1 and "barrier m" in stream2.

Thank you,
Alex


Re: data enrichment with SQL use case

2018-04-23 Thread Alexander Smirnov
Hi Fabian,

please share the workarounds, that must be helpful for my case as well

Thank you,
Alex

On Mon, Apr 23, 2018 at 2:14 PM Fabian Hueske  wrote:

> Hi Miki,
>
> Sorry for the late response.
> There are basically two ways to implement an enrichment join as in your
> use case.
>
> 1) Keep the meta data in the database and implement a job that reads the
> stream from Kafka and queries the database in an ASyncIO operator for every
> stream record. This should be the easier implementation but it will send
> one query to the DB for each streamed record.
> 2) Replicate the meta data into Flink state and join the streamed records
> with the state. This solution is more complex because you need to propagate
> updates of the meta data (if there are any) into the Flink state. At the
> moment, Flink lacks a few features for a good implementation of this
> approach, but there are some workarounds that help in certain cases.
>
> Note that Flink's SQL support does not add advantages for either of
> these approaches. You should use the DataStream API (and possibly
> ProcessFunctions).
>
> I'd go for the first approach if one query per record is feasible.
> Let me know if you need to tackle the second approach and I can give some
> details on the workarounds I mentioned.
>
> Best, Fabian
>
> 2018-04-16 20:38 GMT+02:00 Ken Krugler :
>
>> Hi Miki,
>>
>> I haven’t tried mixing AsyncFunctions with SQL queries.
>>
>> Normally I’d create a regular DataStream workflow that first reads from
>> Kafka, then has an AsyncFunction to read from the SQL database.
>>
>> If there are often duplicate keys in the Kafka-based stream, you could
>> keyBy(key) before the AsyncFunction, and then cache the result of the SQL
>> query.
>>
>> — Ken
>>
>> On Apr 16, 2018, at 11:19 AM, miki haiat  wrote:
>>
>> Hi, thanks for the reply. I will try to break your reply into the flow
>> execution order.
>>
>> The first data stream will use AsyncIO and select the table; the second
>> stream will be Kafka, and then I can join the streams and map them?
>>
>> If that is the case, then I will select the table only once on load?
>> How can I make sure that my stream table is "fresh"?
>>
>> I'm thinking to myself, is there a way to use the Flink state backend
>> (RocksDB) and create a read/write-through
>> mechanism?
>>
>> Thanks
>>
>> miki
>>
>>
>>
>> On Mon, Apr 16, 2018 at 2:45 AM, Ken Krugler > > wrote:
>>
>>> If the SQL data is all (or mostly all) needed to join against the data
>>> from Kafka, then I might try a regular join.
>>>
>>> Otherwise it sounds like you want to use an AsyncFunction to do ad hoc
>>> queries (in parallel) against your SQL DB.
>>>
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html
>>>
>>> — Ken
>>>
>>>
>>> On Apr 15, 2018, at 12:15 PM, miki haiat  wrote:
>>>
>>> Hi,
>>>
>>> I have a case of meta data enrichment, and I'm wondering if my approach
>>> is correct.
>>>
>>> 1. Input stream from Kafka.
>>> 2. Metadata in MS SQL.
>>> 3. Map to a new POJO.
>>>
>>> I need to extract a key from the Kafka stream and use it to select
>>> some values from the SQL table.
>>>
>>> So I thought to use the Table/SQL API in order to select the table
>>> metadata, then convert the Kafka stream to a table and join the data by
>>> the stream key.
>>>
>>> At the end I need to map the joined data to a new POJO and send it to
>>> Elasticsearch.
>>>
>>> Any suggestions or different ways to solve this use case?
>>>
>>> thanks,
>>> Miki
>>>
>>>
>>>
>>>
>>> --
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>>>
>>>
>>
>> 
>> http://about.me/kkrugler
>> +1 530-210-6378
>>
>>
>
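
To make Fabian's second approach a bit more concrete, here is a rough sketch
of a state-replication join (plain Strings stand in for the real record and
metadata types; records arriving before their metadata are simply dropped):

```
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<String, String, Tuple2<String, String>> {

    private transient ValueState<String> metaState;

    @Override
    public void open(Configuration parameters) {
        metaState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("meta", String.class));
    }

    @Override
    public void flatMap1(String record, Collector<Tuple2<String, String>> out)
            throws Exception {
        String meta = metaState.value();
        if (meta != null) {
            out.collect(Tuple2.of(record, meta)); // the enriched record
        }
    }

    @Override
    public void flatMap2(String meta, Collector<Tuple2<String, String>> out)
            throws Exception {
        metaState.update(meta); // remember the latest metadata for this key
    }
}
```

It would be wired in with something like
records.keyBy(...).connect(metadata.keyBy(...)).flatMap(new EnrichmentFunction()),
so both streams are partitioned by the join key.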


Re: Tracking deserialization errors

2018-04-23 Thread Alexander Smirnov
That's absolutely no problem Tzu-Li. Either of them would work. Thank you!

On Thu, Apr 19, 2018 at 4:56 PM Tzu-Li (Gordon) Tai 
wrote:

> @Alexander
> Sorry about that, that would be my mistake. I’ll close FLINK-9204 as a
> duplicate and leave my thoughts on FLINK-9155. Thanks for pointing out!
>
>
> On 19 April 2018 at 2:00:51 AM, Elias Levy (fearsome.lucid...@gmail.com)
> wrote:
>
> Either proposal would work. In the latter case, at a minimum we'd need a
> way to identify the source within the metric.  The basic error metric would
> then allow us to go into the logs to determine the cause of the error, as
> we already record the message causing trouble in the log.
>
>
> On Mon, Apr 16, 2018 at 4:42 AM, Fabian Hueske  wrote:
>
>> Thanks for starting the discussion Elias.
>>
>> I see two ways to address this issue.
>>
>> 1) Add an interface that a deserialization schema can implement to
>> register metrics. Each source would need to check for the interface and
>> call it to setup metrics.
>> 2) Check for null returns in the source functions and increment a
>> respective counter.
>>
>> In both cases, we need to touch the source connectors.
>>
>> I see that information such as topic name, partition, and offset is
>> important debugging information. However, I don't think that metrics
>> would be a good way to capture it.
>> In that case, log files might be a better approach.
>>
>> I'm not sure to what extent the source functions (Kafka, Kinesis) support
>> such error tracking.
>> Adding Gordon to the thread who knows the internals of the connectors.
>>
>>
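
To make option 1) concrete, here is a purely hypothetical sketch; no such
interface exists in Flink at the time of writing, and the names are made up:

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

// a source would check for this interface and call registerMetrics()
// before emitting the first record
public interface MetricAwareDeserializationSchema<T> extends DeserializationSchema<T> {
    void registerMetrics(MetricGroup group);
}

// a schema could then count its own failures while still returning null
public abstract class CountingDeserializationSchema<T>
        implements MetricAwareDeserializationSchema<T> {

    protected transient Counter deserializeErrors;

    @Override
    public void registerMetrics(MetricGroup group) {
        deserializeErrors = group.counter("deserializeErrors");
    }
}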


Multi threaded operators?

2018-04-23 Thread Alexander Smirnov
Hi,

I have a co-flatmap function which reads data from external DB on specific
events.
The API for the DB layer is homegrown and it uses multiple threads to speed
up reading.

Can it cause any problems if I use the multithreaded API in the flatMap1
function? Is it allowed in Flink?

Or maybe I should employ a better approach for that, such as async I/O?

Thank you,
Alex


Re: How to add new Kafka topic to consumer

2018-04-11 Thread Alexander Smirnov
This feature has been implemented in 1.4.0; take a look at:

https://issues.apache.org/jira/browse/FLINK-4022
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html#kafka-consumers-topic-and-partition-discovery
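
For reference, a sketch of how the two pieces fit together in code (the
values are examples; note the subscription pattern must cover every topic
you want picked up, so foo4 below matches, but bar1 would need a broader
pattern or its own consumer):

import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092");
props.setProperty("group.id", "my-group");
// off by default; enables periodic discovery of new partitions and topics
props.setProperty("flink.partition-discovery.interval-millis", "10000");

// subscribes to foo1, foo2, foo3 and picks up foo4 once it is created
FlinkKafkaConsumer011<String> consumer = new FlinkKafkaConsumer011<>(
        Pattern.compile("foo[0-9]+"), new SimpleStringSchema(), props);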


On Wed, Apr 11, 2018 at 3:33 PM chandresh pancholi <
chandreshpancholi...@gmail.com> wrote:

> Hi,
>
> Is there a way to add Kafka topics dynamically to a stream?
> For example, there were Kafka topics named foo1, foo2, foo3, and the task
> manager will start consuming events from all 3 topics. Now I create
> another two topics, foo4 & bar1, in Kafka.
>
> How will FlinkKafkaConsumer read events from the foo4 & bar1 topics?
>
> Thanks
>
> --
> Chandresh Pancholi
> Senior Software Engineer
> Flipkart.com
> Email-id:chandresh.panch...@flipkart.com
> Contact:08951803660
>


Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Alexander Smirnov
I've seen a similar problem, but it was not the heap size, it was Metaspace.
It was caused by a job restarting in a loop. It looks like for each restart,
Flink loads new instances of the classes, and very soon it runs out of
Metaspace.

I've created a JIRA issue for this problem, but got no response from the
development team on it: https://issues.apache.org/jira/browse/FLINK-9132
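
In the meantime, a possible stop-gap is to cap the Metaspace explicitly via
flink-conf.yaml, so the failure at least surfaces as a clean
OutOfMemoryError: Metaspace instead of unbounded native memory growth (the
size below is only an example):

env.java.opts: -XX:MaxMetaspaceSize=512m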


On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:

> thanks a lot, I will try it
>
> On 2018-04-09 00:06:02, "TechnoMage" wrote:
>
> I have seen this when my task manager ran out of RAM.  Increase the heap
> size.
>
> flink-conf.yaml:
> taskmanager.heap.mb
> jobmanager.heap.mb
>
> Michael
>
> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>
> 
> Hi all, recently I found a problem: it runs well at start, but after a
> long run the exception above is displayed. How can I resolve it?
>
>
>
>
>
>
>
>
>


Re: Tracking deserialization errors

2018-04-08 Thread Alexander Smirnov
I have the same question. In the case of a Kafka source, it would be good to
know the topic name and offset of the corrupted message for further
investigation. It looks like the only option is to write such messages into
a log file.
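
For the Kafka connector specifically, the KeyedDeserializationSchema variant
does receive the topic, partition, and offset, so a thin base class can at
least log them before returning null. A minimal sketch (parse() is a
placeholder for the real deserialization logic):

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class LoggingSchema<T> implements KeyedDeserializationSchema<T> {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingSchema.class);

    private final Class<T> clazz;

    protected LoggingSchema(Class<T> clazz) {
        this.clazz = clazz;
    }

    @Override
    public T deserialize(byte[] messageKey, byte[] message,
                         String topic, int partition, long offset) {
        try {
            return parse(message);
        } catch (Exception e) {
            LOG.error("Corrupt record at {}-{}, offset {}", topic, partition, offset, e);
            return null;  // null drops the record so the job keeps running
        }
    }

    // the actual deserialization, supplied by the concrete schema
    protected abstract T parse(byte[] message) throws Exception;

    @Override
    public boolean isEndOfStream(T next) {
        return false;
    }

    @Override
    public TypeInformation<T> getProducedType() {
        return TypeInformation.of(clazz);
    }
}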

On Fri, Apr 6, 2018 at 9:12 PM Elias Levy 
wrote:

> I was wondering how are folks tracking deserialization errors.
> The AbstractDeserializationSchema interface provides no mechanism for the
> deserializer to instantiate a metric counter, and "deserialize" must return
> a null instead of raising an exception in case of error if you want your
> job to continue functioning during a deserialization error.  But that means
> such errors are invisible.
>
> Thoughts?
>


Re: Restart strategy defined in flink-conf.yaml is ignored

2018-04-06 Thread Alexander Smirnov
Thanks Piotr,

I've created a JIRA issue to track it:
https://issues.apache.org/jira/browse/FLINK-9143

Alex
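
Until that is resolved, a workaround is to set the strategy explicitly in
the job itself; the code-level setting takes precedence over the fixed-delay
default that enabling checkpointing installs:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
// explicitly disable restarts for this job
env.setRestartStrategy(RestartStrategies.noRestart());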


On Thu, Apr 5, 2018 at 11:28 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> Thanks for the details! I can confirm this behaviour. The flink-conf.yaml
> restart-strategy value is being completely ignored (regardless of its
> value) when the user enables checkpointing:
>
> env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
>
> I suspect this is a bug, but I have to confirm it.
>
> Thanks, Piotrek
>
> On 5 Apr 2018, at 12:40, Alexander Smirnov <alexander.smirn...@gmail.com>
> wrote:
>
> jobmanager.log:
>
> *2018-04-05 22:37:28,348 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: restart-strategy, none*
> 2018-04-05 22:37:28,353 INFO  org.apache.flink.core.fs.FileSystem
>  - Hadoop is not in the classpath/dependencies. The
> extended set of supported File Systems via Hadoop is not available.
> 2018-04-05 22:37:28,506 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Starting
> JobManager without high-availability
> 2018-04-05 22:37:28,510 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Starting
> JobManager on localhost:6123 with execution mode CLUSTER
> 2018-04-05 22:37:28,517 INFO
> org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot
> create Hadoop Security Module because Hadoop cannot be found in the
> Classpath.
> 2018-04-05 22:37:28,546 INFO
> org.apache.flink.runtime.security.SecurityUtils   - Cannot
> install HadoopSecurityContext because Hadoop cannot be found in the
> Classpath.
> 2018-04-05 22:37:28,591 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Trying to
> start actor system at localhost:6123
> 2018-04-05 22:37:28,981 INFO  akka.event.slf4j.Slf4jLogger
>   - Slf4jLogger started
> 2018-04-05 22:37:29,027 INFO  akka.remote.Remoting
>   - Starting remoting
> 2018-04-05 22:37:29,129 INFO  akka.remote.Remoting
>   - Remoting started; listening on addresses :[
> akka.tcp://flink@localhost:6123]
> 2018-04-05 22:37:29,135 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Actor
> system started at akka.tcp://flink@localhost:6123
> 2018-04-05 22:37:29,148 INFO
> org.apache.flink.runtime.metrics.MetricRegistryImpl   - No metrics
> reporter configured, no metrics will be exposed/reported.
> 2018-04-05 22:37:29,152 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Starting
> JobManager web frontend
> 2018-04-05 22:37:29,161 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils   - Determined
> location of JobManager log file:
> /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
> 2018-04-05 22:37:29,161 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils   - Determined
> location of JobManager stdout file:
> /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
> 2018-04-05 22:37:29,162 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using
> directory
> /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37
> for the web interface files
> 2018-04-05 22:37:29,162 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Created
> directory
> /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913
> for web frontend JAR file uploads.
> 2018-04-05 22:37:29,447 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web
> frontend listening at localhost:8081
> 2018-04-05 22:37:29,447 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Starting
> JobManager actor
> 2018-04-05 22:37:29,452 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Created BLOB server storage directory
> /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
> 2018-04-05 22:37:29,453 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Started BLOB server at 0.0.0.0:60697 - max concurrent
> requests: 50 - max backlog: 1000
> 2018-04-05 22:37:29,533 INFO
> org.apache.flink.runtime.jobmanager.MemoryArchivist   - Started
> memory archivist akka://flink/user/archive
> 2018-04-05 22:37:29,533 INFO
> org.apache.flink.runtime.jobmanager.JobManager- Starting
> JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
> 2018-04-05 22:37:29,544 INFO
> org.apache.flink.runtime.jobmanager.JobManager- JobManager
> akka.tcp://flink@localhost:6123/user/jobmanager was

Re: Restart strategy defined in flink-conf.yaml is ignored

2018-04-05 Thread Alexander Smirnov
org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files
2018-04-05 22:38:29,639 INFO
org.apache.flink.runtime.jobmanager.JobManager- Submitting
job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
*2018-04-05 22:38:29,643 INFO
org.apache.flink.runtime.jobmanager.JobManager- Using
restart strategy
FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647,
delayBetweenRestartAttempts=1) for 43ecfe9cb258b7f624aad9868d306edb.*
2018-04-05 22:38:29,656 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
recovers via failover strategy: full graph restart



On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <
alexander.smirn...@gmail.com> wrote:

> Hi Piotr,
>
> I'm using Flink 1.4.2
>
> it's a standard flink distribution downloaded and unpacked.
>
> added the following lines to conf/flink-conf.yaml:
> restart-strategy: none
> state.backend: rocksdb
> state.backend.fs.checkpointdir:
> file:///tmp/nfsrecovery/flink-checkpoints-metadata
> state.backend.rocksdb.checkpointdir:
> file:///tmp/nfsrecovery/flink-checkpoints-rocksdb
>
> created new java project as described at
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/java_api_quickstart.html
>
> here's the code:
>
> public class FailedJob
> {
> static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);
>
> public static void main( String[] args ) throws Exception
> {
> final StreamExecutionEnvironment env =
> StreamExecutionEnvironment.getExecutionEnvironment();
>
>
> env.enableCheckpointing(5000,
> CheckpointingMode.EXACTLY_ONCE);
>
> DataStream<String> stream =
> env.fromCollection(Arrays.asList("test"));
>
> stream.map(new MapFunction<String, String>(){
> @Override
> public String map(String obj) {
> throw new NullPointerException("NPE");
> }
> });
>
> env.execute("Failed job");
> }
> }
>
> attaching screenshots, please let me know if more info is needed
>
> Alex
>
>
>
>
> On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi,
>>
>> Can you provide more details, like post your configuration/log
>> files/screen shots from web UI and Flink version being used?
>>
>> Piotrek
>>
>> > On 5 Apr 2018, at 06:07, Alexander Smirnov <
>> alexander.smirn...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I've defined restart strategy in flink-conf.yaml as none. WebUI / Job
>> Manager section confirms that.
>> > But it looks like this setting is disregarded.
>> >
>> > When I go into job's configuration in the WebUI, in the Execution
>> Configuration section I can see:
>> > Max. number of execution retries  Restart with fixed delay
>> (1 ms). #2147483647 restart attempts.
>> >
>> > Do you think it is a bug?
>> >
>> > Alex
>>
>>


Re: Kafka exceptions in Flink log file

2018-04-05 Thread Alexander Smirnov
Hi Timo,

it is the latest released version - 1.4.2

This only happens when a job falls into a restart loop and stays in it for
20 minutes or so.
It looks like for each restart, Flink loads the classes anew, but the
previously loaded classes are not garbage collected for some reason (still
referenced?).

Very soon, the JVM runs out of Metaspace and this error occurs.

Thank you,
Alex



On Tue, Apr 3, 2018 at 4:24 PM Timo Walther <twal...@apache.org> wrote:

> Hi Alex,
>
> which version of Flink are you running? There were some class loading
> issues with Kafka recently. I would try it with the newest Flink
> version. Otherwise ClassNotFoundException usually indicates that
> something is wrong with your dependencies. Maybe you can share your
> pom.xml with us.
>
> Regards,
> Timo
>
> On 02.04.18 at 13:32, Alexander Smirnov wrote:
> > I see a lot of messages in flink log like below. What's the cause?
> >
> >
> > 02 Apr 2018 04:09:13,554 ERROR
> > org.apache.kafka.clients.producer.internals.Sender - Uncaught error in
> > kafka producer I/O thread:
> > org.apache.kafka.common.KafkaException: Error registering mbean
> >
> kafka.producer:type=producer-node-metrics,client-id=producer-1,node-id=node-1
> > at
> >
> org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
> > at
> >
> org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
> > at
> > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
> > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
> > at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
> > at
> >
> org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:811)
> > at
> >
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
> > at
> > org.apache.kafka.common.network.Selector.poll(Selector.java:326)
> > at
> > org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
> > at
> > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
> > at
> > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: javax.management.InstanceAlreadyExistsException:
> >
> kafka.producer:type=producer-node-metrics,client-id=producer-1,node-id=node-1
> > at
> > com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
> > at
> >
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
> > at
> >
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
> > at
> >
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
> > at
> >
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
> > at
> >
> com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
> > at
> >
> org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
> > ... 11 more
> > 02 Apr 2018 04:09:13,673 ERROR
> > org.apache.kafka.common.utils.KafkaThread - Uncaught exception in
> > kafka-producer-network-thread | producer-3:
> > java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1
> > at
> >
> org.apache.kafka.clients.NetworkClient.processDisconnection(NetworkClient.java:583)
> > at
> >
> org.apache.kafka.clients.NetworkClient.handleDisconnections(NetworkClient.java:705)
> > at
> > org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:443)
> > at
> > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
> > at
> > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.lang.ClassNotFoundException:
> > org.apache.kafka.clients.NetworkClient$1
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > at
> >
> org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:128)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > ... 6 more
> >
> >
> > Thank you,
> > Alex
>
>
>


Restart strategy defined in flink-conf.yaml is ignored

2018-04-05 Thread Alexander Smirnov
Hello,

I've defined restart strategy in flink-conf.yaml as none. WebUI / Job
Manager section confirms that.
But it looks like this setting is disregarded.

When I go into job's configuration in the WebUI, in the Execution
Configuration section I can see:
Max. number of execution retries  Restart with fixed delay
(1 ms). #2147483647 restart attempts.

Do you think it is a bug?

Alex


Kafka exceptions in Flink log file

2018-04-02 Thread Alexander Smirnov
I see a lot of messages in flink log like below. What's the cause?


02 Apr 2018 04:09:13,554 ERROR
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in
kafka producer I/O thread:
org.apache.kafka.common.KafkaException: Error registering mbean
kafka.producer:type=producer-node-metrics,client-id=producer-1,node-id=node-1
at
org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
at
org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
at
org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
at
org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:811)
at
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.management.InstanceAlreadyExistsException:
kafka.producer:type=producer-node-metrics,client-id=producer-1,node-id=node-1
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at
com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at
org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
... 11 more
02 Apr 2018 04:09:13,673 ERROR org.apache.kafka.common.utils.KafkaThread -
Uncaught exception in kafka-producer-network-thread | producer-3:
java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1
at
org.apache.kafka.clients.NetworkClient.processDisconnection(NetworkClient.java:583)
at
org.apache.kafka.clients.NetworkClient.handleDisconnections(NetworkClient.java:705)
at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:443)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException:
org.apache.kafka.clients.NetworkClient$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:128)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more


Thank you,
Alex


Re: Master and Slave files

2018-03-28 Thread Alexander Smirnov
Thanks for confirming, Nico 

On Wed, Mar 28, 2018 at 3:57 PM Nico Kruber <n...@data-artisans.com> wrote:

> If you refer to the files under the conf folder, these are only used by
> the standalone cluster startup scripts, i.e. bin/start-cluster.sh and
> bin/stop-cluster.sh
>
>
> Nico
>
> On 28/03/18 12:27, Alexander Smirnov wrote:
> > Hi,
> >
> > Are the files only needed at the cluster startup stage?
> > Are they only used by the bash scripts?
> >
> > Alex
>
> --
> Nico Kruber | Software Engineer
> data Artisans
>
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>
>


Master and Slave files

2018-03-28 Thread Alexander Smirnov
Hi,

Are the files only needed at the cluster startup stage?
Are they only used by the bash scripts?

Alex


Re: Standalone cluster instability

2018-03-26 Thread Alexander Smirnov
Hi Piotr,

I didn't find anything special in the logs before the failure.
Here are the logs, please take a look:

https://drive.google.com/drive/folders/1zlUDMpbO9xZjjJzf28lUX-bkn_x7QV59?usp=sharing

The configuration is:

3 task managers:
qafdsflinkw011.scl
qafdsflinkw012.scl
qafdsflinkw013.scl - lost connection

3 job  managers:
qafdsflinkm011.scl - the leader
qafdsflinkm012.scl
qafdsflinkm013.scl

3 zookeepers:
qafdsflinkzk011.scl
qafdsflinkzk012.scl
qafdsflinkzk013.scl

Thank you,
Alex



On Wed, Mar 21, 2018 at 6:23 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> Does the issue really happen after 48 hours?
> Is there some indication of a failure in TaskManager log?
>
> If you will be still unable to solve the problem, please provide full
> TaskManager and JobManager logs.
>
> Piotrek
>
> On 21 Mar 2018, at 16:00, Alexander Smirnov <alexander.smirn...@gmail.com>
> wrote:
>
> One more question - I see a lot of lines like the following in the logs
>
> [2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560]
> irrecoverably failed. Quarantining address. (akka.remote.Remoting)
> [2018-03-21 00:34:15,208] WARN Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:41068] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the
> UID, gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,235] WARN Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:40677] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the
> UID, gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,256] WARN Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:40382] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the
> UID, gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,256] WARN Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:44744] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the
> UID, gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,266] WARN Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:42413] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the
> UID, gating instead for 5000 ms. (akka.remote.Remoting)
>
>
> The host is available, but I don't understand where the port number comes
> from. The Task Manager uses another port (which is printed in the logs on
> startup). Could you please help me understand why this happens?
>
> Thank you,
> Alex
>
>
> On Wed, Mar 21, 2018 at 4:19 PM Alexander Smirnov <
> alexander.smirn...@gmail.com> wrote:
>
>> Hello,
>>
>> I've assembled a standalone cluster of 3 task managers and 3 job
>> managers (and 3 ZK) following the instructions at
>>
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html
>>  and
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/jobmanager_high_availability.html
>>
>> It works ok, but randomly, task managers becomes unavailable. JobManager
>> has exception like below in logs:
>>
>>
>> [2018-03-19 00:33:10,211] WARN Association with remote system [akka.tcp://
>> fl...@qafdsflinkw811.nn.five9lab.com:42413] has failed, address is now
>> gated for [5000] ms. Reason: [Association failed with [akka.tcp://
>> fl...@qafdsflinkw811.nn.five9lab.com:42413]] Caused by: [Connection
>> refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:42413]
>> (akka.remote.ReliableDeliverySupervisor)
>> [2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://
>> fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560]
>> irrecoverably failed. Quarantining address. (akka.remote.Remoting)
>> java.util.concurrent.TimeoutException: Remote system has been silent for
>> too long. (more than 48.0 hours)
>> at
>> akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:375)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> at
>> akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> at
>> scala.co

Apache Zookeeper vs Flink Zookeeper

2018-03-21 Thread Alexander Smirnov
Hi,

For standalone cluster configuration, is it possible to use vanilla Apache
Zookeeper?

I saw there's a wrapper around it in Flink -  FlinkZooKeeperQuorumPeer. Is
it mandatory to use it?

Thank you,
Alex


Re: Submitting a job via command line

2017-10-13 Thread Alexander Smirnov
Thank you so much, it helped!
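
For anyone who finds this thread later: the stack trace below names the
relevant setting; raising it in flink-conf.yaml looks like this (the value
is only an example, the default is 60 s):

akka.client.timeout: 600 s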

From: Piotr Nowojski <pi...@data-artisans.com>
Date: Thursday, October 12, 2017 at 6:00 PM
To: Alexander Smirnov <asmir...@five9.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Submitting a job via command line

Have you tried this
http://mail-archives.apache.org/mod_mbox/flink-user/201705.mbox/%3ccagr9p8bxhljseexwzvxlk+drotyp1yxjy4n4_qgerdzxz8u...@mail.gmail.com%3E
?

Piotrek

On 12 Oct 2017, at 16:30, Alexander Smirnov <asmir...@five9.com> wrote:

Hello All,

I got the following error while attempting to execute a job via command line:

[root@flink01 bin]# ./flink run -c com.five9.stream.PrecomputeJob 
/vagrant/flink-precompute-1.0-SNAPSHOT.jar -Xmx2048m -Xms2048m
Cluster configuration: Standalone cluster with JobManager at
flink01.pb.lx-draskin5.five9.com/10.11.132.110:6123
Using address flink01.pb.lx-draskin5.five9.com:6123 to connect to JobManager.
JobManager web interface address http://flink01.pb.lx-draskin5.five9.com:8081
Starting execution of program
Submitting job with JobID: 222a9d44d2069ab3cc41866c8f3a. Waiting for job 
completion.
Connected to JobManager at
Actor[akka.tcp://fl...@flink01.pb.lx-draskin5.five9.com:6123/user/jobmanager#-1899708478]
with leader session id ----.


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
at 
org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
at 
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:73)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1499)
at com.five9.stream.PrecomputeJob.execute(PrecomputeJob.java:137)
at 
com.five9.stream.PrecomputeJob.configureAndExecute(PrecomputeJob.java:78)
at com.five9.stream.PrecomputeJob.main(PrecomputeJob.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381)
at 
org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
at 
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't 
retrieve the JobExecutionResult from the JobManager.
at 
org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:309)
at 
org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:396)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:467)
... 25 more
Caused by: 
org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job 
submission to the JobManager ti

Submitting a job via command line

2017-10-12 Thread Alexander Smirnov
Hello All,

I got the following error while attempting to execute a job via command line:

[root@flink01 bin]# ./flink run -c com.five9.stream.PrecomputeJob 
/vagrant/flink-precompute-1.0-SNAPSHOT.jar -Xmx2048m -Xms2048m
Cluster configuration: Standalone cluster with JobManager at 
flink01.pb.lx-draskin5.five9.com/10.11.132.110:6123
Using address flink01.pb.lx-draskin5.five9.com:6123 to connect to JobManager.
JobManager web interface address 
http://flink01.pb.lx-draskin5.five9.com:8081
Starting execution of program
Submitting job with JobID: 222a9d44d2069ab3cc41866c8f3a. Waiting for job 
completion.
Connected to JobManager at 
Actor[akka.tcp://fl...@flink01.pb.lx-draskin5.five9.com:6123/user/jobmanager#-1899708478]
 with leader session id ----.


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
at 
org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
at 
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:73)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1499)
at com.five9.stream.PrecomputeJob.execute(PrecomputeJob.java:137)
at 
com.five9.stream.PrecomputeJob.configureAndExecute(PrecomputeJob.java:78)
at com.five9.stream.PrecomputeJob.main(PrecomputeJob.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381)
at 
org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
at 
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1130)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't 
retrieve the JobExecutionResult from the JobManager.
at 
org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:309)
at 
org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:396)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:467)
... 25 more
Caused by: 
org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job 
submission to the JobManager timed out. You may increase 'akka.client.timeout' 
in case the JobManager needs more time to configure and confirm the job 
submission.
at 
org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119)
at 
org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:251)
at 
org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:89)
at 
org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at 

Re: Time zones problem

2017-08-15 Thread Alexander Smirnov
Hi Biplob,

Yes, a Unix timestamp is what I'm using now.

But the problem is that a time window like '1 day' is defined using
different start-end timestamps for users in different time zones

Let me try to draw it

|1--2-3-4---|

1 and 3 - time frames for European users
2 and 4 - time frames for American users

Now we need to calculate some metrics, like the sum or max of call durations,
using a time window of '1 day'.
Evidently they are different for different time zones.

Should I create an aggregation window for each and every time zone to
cover that?
For each possible reporting time interval? Is that how this can be achieved
with Flink?

Thank you,
Alex
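
One building block worth noting: tumbling windows accept an offset that
shifts their alignment, so a '1 day' window can start at local midnight for
a given zone, with one such windowed aggregation per supervisor time zone.
A sketch (the key and field names are placeholders, and the offset is fixed,
so daylight saving changes still need separate handling):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// durations: a DataStream of per-call results (placeholder)
// day windows whose boundaries fall on midnight US Pacific (UTC-8)
durations
    .keyBy(d -> d.agentId)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .max("durationMillis");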






On 8/15/17, 1:09 AM, "Biplob Biswas"  wrote:

>Regarding timezones, you should probably convert your time to the unix
>timestamp which will be consistent all over the world, and then you can
>create your window based on this timestamp.
>
>
>
>
>
>--
>View this message in context:
>http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Time-z
>ones-problem-tp14907p14911.html
>Sent from the Apache Flink User Mailing List archive. mailing list
>archive at Nabble.com.
>







Time zones problem

2017-08-14 Thread Alexander Smirnov
Hello everybody,

I’m exploring Flink options to build a statistics engine for a call center
solution. There is one thing I’m not sure how to implement.

Currently I have the following jobs in the architecture.

Job #1 – matches start and end events and calculates durations. For example,
having call-started and call-ended events, it is possible to calculate how
long the call was.
Job #2 – aggregates the precomputed values using multiple time windows, like
1 hour, 1 day, 1 week (can we define time windows at run time, by the way?).
It also joins a control stream to take care of subscription requests.

Supervisors, who consume the statistics, are spread around the world and can 
work in different time zones.
So, having a time window defined as 1 day results in different data for
European and American supervisors at exactly the same moment.

I’m not sure how to achieve that with Flink.
What are the best practices for working with different time zones? Any hint
is appreciated.

Thanks,
Alex





Multi-tenant, deploying flink cluster

2016-05-13 Thread Alexander Smirnov
Hi,

The source data, read from MQ, contains a tenant ID.
Is there a way to route messages from a particular tenant to a particular
Flink node? Is that something that can be configured?

Thank you,
Alex
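
There is no built-in way to pin a tenant to a specific node, but keyBy gives
the closest behaviour: all events with the same tenant id are routed to the
same parallel subtask. A sketch, with a hypothetical TenantEvent type:

import org.apache.flink.streaming.api.datastream.KeyedStream;

// events is a DataStream<TenantEvent> (hypothetical type and field)
// hash-partitions the stream by tenant across the parallel subtasks
KeyedStream<TenantEvent, String> byTenant = events.keyBy(e -> e.tenantId);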


Re: convert Json to Tuple

2016-04-25 Thread Alexander Smirnov
Thanks Timur!

I should have mentioned, I need it for Java
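
In Java, Jackson's tree model gets there in a few lines. A sketch assuming a
flat payload such as {"name":"a","count":1} (the field names are
illustrative); it could be called from a MapFunction right after the
RMQSource:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.java.tuple.Tuple2;

public class JsonToTuple {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Tuple2<String, Integer> toTuple(String json) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        return new Tuple2<>(node.get("name").asText(), node.get("count").asInt());
    }
}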

On Mon, Apr 25, 2016 at 10:13 PM Timur Fayruzov <timur.fairu...@gmail.com>
wrote:

> Normally, Json4s or Jackson with the Scala plugin work well for JSON to
> Scala data structure conversions. However, I would not expect them to
> support a special case for tuples, since JSON key-value fields would
> normally convert to case classes and JSON arrays are converted to, well,
> arrays. That being said, StackOverflow to the rescue:
> http://stackoverflow.com/questions/31909308/is-it-possible-to-parse-json-array-into-a-tuple-with-json4s
>
> I didn't try the approach myself, though.
>
> On Mon, Apr 25, 2016 at 6:50 PM, Alexander Smirnov <
> alexander.smirn...@gmail.com> wrote:
>
>> Hello everybody!
>>
>> my RMQSource function receives strings with JSON in them.
>> Because many operations in Flink rely on Tuples, I think it is a good
>> idea to convert the JSON to a Tuple.
>>
>> I believe this task has been solved already :)
>> what's the common approach for this conversion?
>>
>> Thank you,
>> Alex
>>
>
>


convert Json to Tuple

2016-04-25 Thread Alexander Smirnov
Hello everybody!

my RMQSource function receives strings with JSON in them.
Because many operations in Flink rely on Tuples, I think it is a good idea
to convert the JSON to a Tuple.

I believe this task has been solved already :)
what's the common approach for this conversion?

Thank you,
Alex


Re: Flink program without a line of code

2016-04-25 Thread Alexander Smirnov
thank you so much for the responses, guys!

On Sat, Apr 23, 2016 at 12:09 AM Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> Hi Alexander,
> since I was looking for something similar some days ago, here is what I
> know about this topic:
> during the Stratosphere project there were Meteor and Sopremo allowing
> that [1], but they were dismissed in favour of a Pig integration, which I
> am not sure was ever completed. You could give the Piglet project [2] a
> try, which allows using Pig with Spark and Flink, but I don't know how
> well it works (the Flink integration is also very recent and not
> documented anywhere).
>
> Best,
> Flavio
>
> [1] http://stratosphere.eu/assets/papers/Sopremo_Meteor%20BigData.pdf
> [2] https://github.com/ksattler/piglet
> On 23 Apr 2016 07:48, "Aljoscha Krettek" <aljos...@apache.org> wrote:
>
> Hi,
> I think if the Table API/SQL API evolves enough it should be able to
> supply a Flink program as just an SQL query with source/sink definitions.
> Hopefully, in the future. :-)
>
> Cheers,
> Aljoscha
>
> On Fri, 22 Apr 2016 at 23:10 Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Alex,
>>
>> welcome to the Flink community!
>> Right now, there is no way to specify a Flink program without writing
>> code (Java, Scala, Python(beta)).
>>
>> In principle it is possible to put such functionality on top of the
>> DataStream or DataSet APIs.
>> This has been done before for other programming APIs (Flink's own
>> libraries Table API, Gelly, FlinkML, and externals Apache Beam / Google
>> DataFlow, Mahout, Cascading, ...). However, all of these are again
>> programming APIs, some specialized for certain use-cases.
>>
>> Specifying Flink programs by config files (or graphically) would require
>> a data model, a DataStream/DataSet program generator and probably a code
>> generation component.
>>
>> Best, Fabian
>>
>> 2016-04-22 18:41 GMT+02:00 Alexander Smirnov <
>> alexander.smirn...@gmail.com>:
>>
>>> Hi guys!
>>>
>>> I’m new to Flink, and actually to this mailing list as well :) this is
>>> my first message.
>>> I’m still reading the documentation and I would say Flink is an amazing
>>> system!! Thanks everybody who participated in the development!
>>>
>>> The information I didn’t find in the documentation is whether it is
>>> possible to describe a data (stream) transformation without any code
>>> (Java/Scala).
>>> I mean, is it possible to describe data source functions, all of the
>>> operators, connections between them, and sinks in a plain-text
>>> configuration file and then feed it to Flink?
>>> In this case it would be possible to change the data flow without
>>> recompilation/redeployment.
>>>
>>> Is there similar functionality in Flink? Maybe some third-party plugin?
>>>
>>> Thank you,
>>> Alex
>>
>>
>>


Flink program without a line of code

2016-04-22 Thread Alexander Smirnov
Hi guys!

I’m new to Flink, and actually to this mailing list as well :) this is my first 
message.
I’m still reading the documentation and I would say Flink is an amazing 
system!! Thanks everybody who participated in the development!

The information I didn’t find in the documentation is whether it is possible
to describe a data (stream) transformation without any code (Java/Scala).
I mean, is it possible to describe data source functions, all of the
operators, connections between them, and sinks in a plain-text configuration
file and then feed it to Flink?
In this case it would be possible to change the data flow without
recompilation/redeployment.

Is there similar functionality in Flink? Maybe some third-party plugin?

Thank you,
Alex