Spark task hangs infinitely when accessing S3

2015-09-10 Thread Mario Pastorelli
Dear community, I am facing a problem accessing data on S3 from Spark. My current configuration is the following: Spark 1.4.1, Hadoop 2.7.1, hadoop-aws-2.7.1, Mesos 0.22.1. I am accessing the data using the s3a protocol, but the job just hangs. The job runs through the whole data set but
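The thread does not include the poster's code, but a minimal sketch of how s3a reads are typically wired up in Spark 1.4 with hadoop-aws 2.7 looks like the following. The bucket, path, and credential sources are placeholders, and `fs.s3a.connection.maximum` is one commonly tuned knob: s3a keeps an HTTP connection pool, and an exhausted pool can look like a hang.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: bucket name, path, and credential handling are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("s3a-read"))

val hc = sc.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
// An exhausted s3a connection pool can present as an infinite hang,
// so raising this limit is a common first experiment.
hc.set("fs.s3a.connection.maximum", "100")

val data = sc.textFile("s3a://my-bucket/path/to/data/*")
println(data.count())
```

This requires a running Spark cluster plus the hadoop-aws and aws-java-sdk jars on the classpath, so it is a sketch rather than a standalone program.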

Spark Streaming dealing with broken files without dying

2015-08-10 Thread Mario Pastorelli
Hey Sparkers, I would like to use Spark Streaming in production to observe a directory and process files that are put inside it. The problem is that some of those files can be broken, leading to an IOException from the input reader. This should be fine for the framework, I think: the exception should
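The thread asks how to keep a streaming job alive when some input files are corrupt. When the failure happens at the user-level parsing stage (rather than deep inside the input format), a common pattern is to wrap each record in `scala.util.Try` and drop the failures. A minimal, Spark-free sketch of that pattern (`parseLine` and the record shape are made up for illustration):

```scala
import scala.util.Try

// Parse a line into (key, value); broken lines yield None instead of throwing.
def parseLine(line: String): Option[(String, Int)] =
  Try {
    val Array(k, v) = line.split(",")
    (k, v.trim.toInt)
  }.toOption

// In a Spark Streaming job the same function would be applied as
// stream.flatMap(parseLine), so one bad record cannot kill the batch.
val lines = Seq("a,1", "garbage", "b, 2")
val parsed = lines.flatMap(parseLine)
```

Note this only guards user code; an IOException thrown inside the input reader itself happens before any user function runs, which is exactly the gap the thread is about.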

Use logback instead of log4j in a Spark job

2015-06-28 Thread Mario Pastorelli
Hey sparkers, I'm trying to use Logback for logging from my Spark jobs but I noticed that if I submit the job with spark-submit then the log4j implementation of slf4j is loaded instead of logback. Consequently, any call to org.slf4j.LoggerFactory.getLogger will return a log4j logger instead of a
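A common first step toward Logback is excluding the log4j binding from the Spark dependency in the build, so that logback-classic is the only slf4j implementation on the job's classpath. A hypothetical build.sbt excerpt (artifact versions are illustrative):

```scala
// build.sbt excerpt (sketch): keep slf4j's log4j binding out of the assembly
// so logback-classic is the only slf4j backend the job ships.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.4.1" % "provided")
    .exclude("org.slf4j", "slf4j-log4j12"),
  "ch.qos.logback" % "logback-classic" % "1.1.3"
)
```

As the thread observes, this alone may not be enough: spark-submit still places Spark's own jars (which include the log4j binding) ahead of the job assembly on the classpath.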

Load slf4j from the job assembly instead of from the Spark jar

2015-06-20 Thread Mario Pastorelli
Hi everyone, I'm trying to use the logstash-logback-encoder https://github.com/logstash/logstash-logback-encoder in my spark jobs but I'm having some problems with the Spark classloader. The logstash-logback-encoder uses a special version of the slf4j BasicMarker
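Spark exposes experimental switches that ask the driver and executors to prefer classes from the user jar over Spark's own copies, which is one way to make the assembly's slf4j classes win. Whether these settings exist and behave well depends on the Spark version, so treat this as an experiment, not a fix:

```scala
import org.apache.spark.SparkConf

// Experimental settings: prefer the user jar's classes (e.g. its slf4j and
// logstash-logback-encoder) over the copies bundled with Spark.
// Can introduce other linkage errors, so test carefully.
val conf = new SparkConf()
  .setAppName("user-classpath-first")
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
```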

Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
We are experiencing some problems with the groupBy operation when it is used to group together data that will be written to the same file. The operation we want to perform is the following: given some data with a timestamp, we want to sort it by timestamp, group it by hour and write one file per
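The OOM here comes from groupBy materializing every value of a group in memory at once. If the data arrives sorted by timestamp, the groups can instead be consumed lazily from an iterator, holding only one hour's records at a time. A Spark-free sketch of the idea (`hourOf`, `chunkByHour`, and the record shape are made up for illustration):

```scala
// Records are (timestampMillis, payload), already sorted by timestamp.
def hourOf(tsMillis: Long): Long = tsMillis / 3600000L

// Walk a sorted iterator and emit (hour, records) chunks one at a time,
// never holding more than one hour's worth of records in memory.
def chunkByHour(sorted: Iterator[(Long, String)]): Iterator[(Long, Seq[(Long, String)])] = {
  val buffered = sorted.buffered
  new Iterator[(Long, Seq[(Long, String)])] {
    def hasNext: Boolean = buffered.hasNext
    def next(): (Long, Seq[(Long, String)]) = {
      val hour = hourOf(buffered.head._1)
      val chunk = scala.collection.mutable.ArrayBuffer.empty[(Long, String)]
      while (buffered.hasNext && hourOf(buffered.head._1) == hour)
        chunk += buffered.next()
      (hour, chunk.toSeq)
    }
  }
}

val records = Iterator((100L, "a"), (200L, "b"), (3600500L, "c"))
val chunks = chunkByHour(records).toList
```

Inside Spark, the same streaming consumption would happen inside mapPartitions over partitions that are already sorted, which is what the reply below this thread suggests.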

Re: Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
repartitionAndSortWithinPartitions to *partition* by hour and then sort by time. Then you encounter your data for an hour, in order, in an Iterator with mapPartitions. On Mon, Mar 30, 2015 at 10:06 AM, Mario Pastorelli mario.pastore...@teralytics.ch wrote: we are experiencing some problems with the groupBy
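The suggested approach can be sketched as follows; `HourPartitioner`, the record shape, and the output path are illustrative rather than from the thread:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Illustrative: route each record to a partition by the hour of its timestamp.
class HourPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case ts: Long => ((ts / 3600000L) % numPartitions).toInt
  }
}

def writeByHour(events: RDD[(Long, String)]): Unit =
  events
    .repartitionAndSortWithinPartitions(new HourPartitioner(24))
    .mapPartitions { it =>
      // Records arrive sorted by timestamp within each partition: stream them
      // out one hour at a time instead of materializing whole groups.
      it.map { case (ts, payload) => s"$ts\t$payload" } // write, in practice
    }
    .saveAsTextFile("hdfs:///out/by-hour") // placeholder path
```

Unlike groupBy, this never builds a per-hour collection in memory; the sort happens in Spark's shuffle machinery, which can spill to disk.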

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-12 Thread Mario Pastorelli
-simple) Regards, -- Flávio R. Santos, Chaordic | Platform, www.chaordic.com.br, +55 48 3232.3200 On Thu, Dec 11, 2014 at 12:32 PM, Mario Pastorelli mario.pastore...@teralytics.ch wrote: Thanks akhil for the answer. I am

Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
is my sbt script but I don't understand why. Thanks, Mario Pastorelli

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
sc.addJar("/home/akhld/.ivy2/cache/org.apache.kafka/kafka_2.10/jars/kafka_2.10-0.8.0.jar") val ssc = new StreamingContext(sc, Seconds(10)) Thanks Best Regards On Thu, Dec 11, 2014 at 6:22 PM, Mario Pastorelli mario.pastore...@teralytics.ch wrote: Hi

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
bundled inside it. Thanks Best Regards On Thu, Dec 11, 2014 at 7:10 PM, Mario Pastorelli mario.pastore...@teralytics.ch wrote: In this way it works but it's not portable and the idea of having a fat jar is to avoid exactly this. Is there any system
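The thread's underlying goal is to ship the Kafka consumer classes inside the job's fat jar instead of pointing sc.addJar at a local ivy cache. With sbt-assembly that is roughly the following (versions and merge rules are illustrative, not from the thread):

```scala
// build.sbt excerpt (sketch): bundle spark-streaming-kafka into the assembly
// while keeping Spark itself "provided", so it is not duplicated in the jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0" // into the fat jar
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", _ @ _*) => MergeStrategy.discard
  case _                            => MergeStrategy.first
}
```

The resulting assembly is submitted with spark-submit and is portable across machines, which is the property the poster says sc.addJar with a hardcoded path lacks.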