compression behaviour inconsistency between 1.3 and 1.4

2015-07-15 Thread Marcin Cylke
Hi, I've observed inconsistent behaviour in .saveAsTextFile. Up until version 1.3 it was possible to save RDDs as snappy-compressed files with the invocation rdd.saveAsTextFile(targetFile), but after upgrading to 1.4 this no longer works. I need to specify a codec for that: rdd.saveAsText...
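For reference, a minimal sketch of the explicit-codec call that 1.4 appears to require; the target path and the choice of Hadoop's SnappyCodec are assumptions, not taken from the thread:

    import org.apache.hadoop.io.compress.SnappyCodec
    import org.apache.spark.{SparkConf, SparkContext}

    object CompressedSave {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("compressed-save"))
        val rdd = sc.parallelize(Seq("a", "b", "c"))
        // Pass the codec class explicitly instead of relying on the
        // Hadoop output-compression settings being picked up implicitly.
        rdd.saveAsTextFile("/tmp/target-snappy", classOf[SnappyCodec])
        sc.stop()
      }
    }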

Re: Using Neo4j with Apache Spark

2015-03-12 Thread Marcin Cylke
On Thu, 12 Mar 2015 00:48:12 -0700 d34th4ck3r wrote: > I'm trying to use Neo4j with Apache Spark Streaming, but I am finding serializability to be an issue. Basically, I want Apache Spark to parse and bundle my data in real time. After the data has been bundled, it should be stored in the...
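The usual way around closure-serialization problems with a database client is to construct the client on the executors, inside foreachPartition, rather than on the driver. A sketch, where Neo4jClient is a hypothetical stand-in for whatever REST/Bolt wrapper the application actually uses:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical, non-serializable Neo4j client (placeholder, not a real API).
    class Neo4jClient(uri: String) {
      def createNode(props: Map[String, String]): Unit = ()
      def close(): Unit = ()
    }

    object Neo4jSink {
      def save(events: DStream[Map[String, String]]): Unit = {
        events.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // Built on the executor, once per partition, so the driver
            // never needs to serialize the client into the closure.
            val client = new Neo4jClient("http://neo4j-host:7474")
            try partition.foreach(client.createNode)
            finally client.close()
          }
        }
      }
    }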

Re: skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
On Wed, 11 Mar 2015 11:19:56 +0100 Marcin Cylke wrote: > Hi, I'm trying to do a join of two datasets: 800GB with ~50MB. The job finishes if I set spark.yarn.executor.memoryOverhead to 2048MB. If it is around 1000MB, it fails with "executor lost" errors. My spark s...
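A sketch of the setting the thread reports as working, applied through SparkConf (value in megabytes; the exact threshold will depend on the job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Extra off-heap headroom per YARN container, so executors are not
    // killed for exceeding their memory limit during the skewed join.
    val conf = new SparkConf()
      .setAppName("skewed-join")
      .set("spark.yarn.executor.memoryOverhead", "2048") // MB per executor
    val sc = new SparkContext(conf)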

skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
Hi, I'm trying to do a join of two datasets: 800GB with ~50MB. My code looks like this: private def parseClickEventLine(line: String, jsonFormatBC: Broadcast[LazyJsonFormat]): ClickEvent = { val json = line.parseJson.asJsObject; val eventJson = if (json.fields.contains("recommendationId...
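With a ~50MB side, one standard mitigation for skew is to avoid the shuffle entirely: broadcast the small dataset and do the outer join per partition. A sketch with placeholder key/value types (the real ClickEvent types are not shown in the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Map-side left outer join: the 800GB side is never shuffled; each
    // partition looks keys up in the broadcast copy of the small side.
    def broadcastLeftOuterJoin(
        sc: SparkContext,
        big: RDD[(String, String)],
        small: Map[String, String]): RDD[(String, (String, Option[String]))] = {
      val smallBC = sc.broadcast(small)
      big.mapPartitions { iter =>
        iter.map { case (k, v) => (k, (v, smallBC.value.get(k))) }
      }
    }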

spark 1.2 slower than 1.0 in unit tests

2015-02-18 Thread Marcin Cylke
Hi, we're using Spark in our app's unit tests. The tests start a Spark context with "local[*]", and test time is now 178 seconds on Spark 1.2 instead of 41 seconds on 1.0. We are using the Spark version from Cloudera CDH (1.2.0-cdh5.3.1). Could you give some hints as to what could cause that, and where to search...
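One cheap experiment when bisecting this kind of regression: Spark 1.2 changed the default shuffle implementation from "hash" to "sort", and the web UI also costs a little startup time per suite. A test-friendly configuration to try (the shuffle manager being the culprit is an assumption, not confirmed by the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[2]")                // bounded parallelism for small test data
      .setAppName("unit-tests")
      .set("spark.shuffle.manager", "hash") // pre-1.2 default, for comparison
      .set("spark.ui.enabled", "false")     // skip the web UI in tests
    val sc = new SparkContext(conf)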

Re: problem with hdfs access in spark job

2014-05-18 Thread Marcin Cylke
On Thu, 15 May 2014 09:44:35 -0700 Marcelo Vanzin wrote: > These are actually not worrisome; that's just the HDFS client doing its own thing to support HA. It probably picked the "wrong" NN to try first, and got the "NN in standby" exception, which it logs. Then it tries the other NN and th...

problem with hdfs access in spark job

2014-05-14 Thread Marcin Cylke
Hi, I'm running Spark 0.9.1 on a Hadoop cluster (cdh4.2.1) with YARN. I have a job that performs a few transformations on a given file and joins that file with another. The job itself finishes successfully; however, some tasks fail and then succeed after a rerun. During the development...

Re: 'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
On Tue, 22 Apr 2014 12:28:15 +0200 Marcin Cylke wrote: > Hi, I have a Spark job that reads files from HDFS, does some pretty basic transformations, then writes them to some other location on HDFS. I'm running this job with spark-0.9.1-rc3, on Hadoop YARN with...

'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
Hi, I have a Spark job that reads files from HDFS, does some pretty basic transformations, then writes them to some other location on HDFS. I'm running this job with spark-0.9.1-rc3, on Hadoop YARN with Kerberos security enabled. One of my approaches to fixing this issue was changing SparkConf, s...
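"Filesystem closed" typically means one task closed the JVM-wide cached FileSystem instance while another task was still using it. A sketch of a SparkConf-level workaround, assuming the spark.hadoop.* passthrough into the Hadoop configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each caller its own FileSystem instance instead of the shared
    // cached one, at the cost of extra connections to HDFS.
    val conf = new SparkConf()
      .setAppName("hdfs-job")
      .set("spark.hadoop.fs.hdfs.impl.disable.cache", "true")
    val sc = new SparkContext(conf)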