It would be great to know if this only occurs in setups where Netty in involved (more than one TaskManager and, and at least one shuffle/rebalance) or also in one-taskmanager setups (which have local channels only).
Stephan On Tue, Oct 4, 2016 at 11:49 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Hi Tarandeep, > > it would be great if you could compile a small example data set with which > you're able to reproduce your problem. We could then try to debug it. It > would also be interesting to know whether Flavio's bug solves your problem > or not. > > Cheers, > Till > > On Mon, Oct 3, 2016 at 10:26 PM, Flavio Pompermaier <pomperma...@okkam.it> > wrote: > >> I think you're running into the same exception I face sometimes..I've >> opened a jira for it [1]. Could you please try to apply that patch and see >> if things get better? >> >> https://issues.apache.org/jira/plugins/servlet/mobile#issue/FLINK-4719 >> >> Best, >> Flavio >> >> On 3 Oct 2016 22:09, "Tarandeep Singh" <tarand...@gmail.com> wrote: >> >>> Now, when I ran it again (with lower task slots per machine) I got a >>> different error- >>> >>> org.apache.flink.client.program.ProgramInvocationException: The program >>> execution failed: Job execution failed. >>> at org.apache.flink.client.program.Client.runBlocking(Client.ja >>> va:381) >>> at org.apache.flink.client.program.Client.runBlocking(Client.ja >>> va:355) >>> at org.apache.flink.client.program.Client.runBlocking(Client.ja >>> va:315) >>> at org.apache.flink.client.program.ContextEnvironment.execute(C >>> ontextEnvironment.java:60) >>> at org.apache.flink.api.java.ExecutionEnvironment.execute(Execu >>> tionEnvironment.java:855) >>> .... >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce >>> ssorImpl.java:62) >>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe >>> thodAccessorImpl.java:43) >>> at java.lang.reflect.Method.invoke(Method.java:498) >>> at org.apache.flink.client.program.PackagedProgram.callMainMeth >>> od(PackagedProgram.java:505) >>> at org.apache.flink.client.program.PackagedProgram.invokeIntera >>> ctiveModeForExecution(PackagedProgram.java:403) >>> at org.apache.flink.client.program.Client.runBlocking(Client.ja >>> va:248) >>> at org.apache.flink.client.CliFrontend.executeProgramBlocking(C >>> liFrontend.java:866) >>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333) >>> at org.apache.flink.client.CliFrontend.parseParameters(CliFront >>> end.java:1189) >>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1239) >>> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job >>> execution failed. >>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$hand >>> leMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:714) >>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$hand >>> leMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660) >>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$hand >>> leMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:660) >>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.lifte >>> dTree1$1(Future.scala:24) >>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(F >>> uture.scala:24) >>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) >>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask. >>> exec(AbstractDispatcher.scala:401) >>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.j >>> ava:260) >>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExec >>> All(ForkJoinPool.java:1253) >>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(For >>> kJoinPool.java:1346) >>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPoo >>> l.java:1979) >>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinW >>> orkerThread.java:107) >>> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find >>> class: javaec40-d994-yteBuffer >>> at com.esotericsoftware.kryo.util.DefaultClassResolver.readName >>> (DefaultClassResolver.java:138) >>> at com.esotericsoftware.kryo.util.DefaultClassResolver.readClas >>> s(DefaultClassResolver.java:115) >>> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641) >>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752) >>> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSeriali >>> zer.deserialize(KryoSerializer.java:228) >>> at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.d >>> eserialize(PojoSerializer.java:431) >>> at org.apache.flink.runtime.plugable.NonReusingDeserializationD >>> elegate.read(NonReusingDeserializationDelegate.java:55) >>> at org.apache.flink.runtime.io.network.api.serialization.Spilli >>> ngAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingA >>> daptiveSpanningRecordDeserializer.java:124) >>> at org.apache.flink.runtime.io.network.api.reader.AbstractRecor >>> dReader.getNextRecord(AbstractRecordReader.java:65) >>> at org.apache.flink.runtime.io.network.api.reader.MutableRecord >>> Reader.next(MutableRecordReader.java:34) >>> at org.apache.flink.runtime.operators.util.ReaderIterator.next( >>> ReaderIterator.java:73) >>> at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMap >>> Driver.java:101) >>> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.j >>> ava:480) >>> at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTas >>> k.java:345) >>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) >>> at java.lang.Thread.run(Thread.java:745) >>> Caused by: java.lang.ClassNotFoundException: javaec40-d994-yteBuffer >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381) >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >>> at java.lang.Class.forName0(Native Method) >>> at java.lang.Class.forName(Class.java:348) >>> at com.esotericsoftware.kryo.util.DefaultClassResolver.readName >>> (DefaultClassResolver.java:136) >>> ... 15 more >>> >>> >>> -Tarandeep >>> >>> On Mon, Oct 3, 2016 at 12:49 PM, Tarandeep Singh <tarand...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> I am using flink-1.0.0 and running ETL (batch) jobs on it for quite >>>> some time (few months) without any problem. Starting this morning, I have >>>> been getting errors like these- >>>> >>>> "Received an event in channel 3 while still having data from a record. >>>> This indicates broken serialization logic. If you are using custom >>>> serialization code (Writable or Value types), check their serialization >>>> routines. In the case of Kryo, check the respective Kryo serializer." >>>> >>>> My datasets are in Avro format. The only thing that changed today is - >>>> I moved to smaller cluster. When I first ran the ETL jobs, they failed with >>>> this error- >>>> >>>> "Insufficient number of network buffers: required 20, but only 10 >>>> available. The total number of network buffers is currently set to 20000. >>>> You can increase this number by setting the configuration key >>>> 'taskmanager.network.numberOfBuffers'" >>>> >>>> I increased number of buffers to 30k. Also decreased number of slots >>>> per machine as I noticed load per machine was too high. After that, when I >>>> restart the jobs, I am getting the above error. >>>> >>>> Can someone please help me debug it? >>>> >>>> Thank you, >>>> Tarandeep >>>> >>> >>> >