I have a four data node Apex cluster using Hadoop distribution CDH-5.6.0 and Apex version incubator-apex-core-3.3.0-incubating. I have an application that streams a file from Amazon S3. The file is a tar.gz containing a single file of newline-delimited JSON records. If the file is small (maybe a thousand records) there are no issues. If I try a 5 GB file (about 1,000,000 records), the stream works for a while and I am able to process maybe 200,000 records (the count varies every run, sometimes more, sometimes fewer), and then exceptions such as the following are thrown:
- com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.amazonaws.internal.StaticCredentialsProvider
- java.lang.IllegalStateException: Deploy request failed: [OperatorDeployInfo
- WARN com.datatorrent.netlet.OptimizedEventLoop: Exception on unattached SelectionKey sun.nio.ch.SelectionKeyImpl@29369cb9 java.io.IOException: Broken pipe

Once these exceptions come up, the KryoExceptions continue to recur and no more data is processed. Is there something that needs to be done in the Apex operators to handle processing large files (streams)?
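For reference, here is a minimal sketch of the kind of operator involved (illustrative names, not my exact code, assuming the Java AWS SDK v1 and the standard Apex operator lifecycle). The KryoException above suggests the platform is trying to serialize the AWS client held by the operator, so the sketch shows the usual pattern of marking such a field transient and rebuilding it in setup():

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.common.util.BaseOperator;

public class S3LineInputOperator extends BaseOperator
{
  // Simple serializable configuration; checkpointed normally by Kryo.
  private String accessKey;
  private String secretKey;

  // Marking the client transient keeps Kryo from trying to serialize the
  // AWS SDK object graph (which includes StaticCredentialsProvider, a
  // class with no no-arg constructor) during checkpointing or redeploy.
  private transient AmazonS3 s3Client;

  @Override
  public void setup(OperatorContext context)
  {
    // Rebuild the non-serializable client on every (re)deploy instead of
    // letting the platform persist and restore it.
    s3Client = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
  }
}

If the client is not transient, the failure would only surface when a checkpoint or operator redeploy actually happens, which could explain why the small file succeeds and the 5 GB file fails partway through.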
