In my experience with streaming, I was getting tens of thousands of empty files created in HDFS. This was crushing my system's performance when my batch jobs ran over the data sets. There is a lot of overhead in opening and closing empty files.
I think creating empty files or keeping empty partitions around is probably a bug, however I never filed a bug report. Please file one, and please copy me on the Jira.

There is also a related performance issue. I use repartition() to ensure CSV files have a max number of rows (it's a product requirement, to make the CSV files more user friendly). In my experience, if I do not repartition, a partition with a single row of data causes a separate part-* file to be created. I wound up with a large number of very small files.

I have always wondered how to configure partitions to get better performance. I would think we are better off with a few very large partitions in most cases, i.e. keep more stuff in memory with less overhead. I was really hoping Spark would handle this for me automatically.

Andy

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tuesday, April 5, 2016 at 3:49 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

> Thanks Andy.
>
> Do we know if this is a known bug, or simply a feature, in that on the face of it Spark cannot save RDD output to a text file?
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 5 April 2016 at 23:35, Andy Davidson <a...@santacruzintegration.com> wrote:
>> Hi Mich
>>
>> Yup, I was surprised to find empty files. It's easy to work around. Note I should probably use coalesce() and not repartition().
>>
>> In general I found I almost always need to repartition. I was getting thousands of empty partitions. It was really slowing my system down.
>> private static void save(JavaDStream<String> json, String outputURIBase) {
>>     /*
>>      * Using saveAsTextFiles() will cause lots of empty directories to be
>>      * created:
>>      *
>>      *   DStream<String> data = json.dstream();
>>      *   data.saveAsTextFiles(outputURI, null);
>>      */
>>
>>     json.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>>         private static final long serialVersionUID = 1L;
>>
>>         @Override
>>         public void call(JavaRDD<String> rdd, Time time) throws Exception {
>>             Long count = rdd.count();
>>             // if (!rdd.isEmpty()) {
>>             if (count > 0) {
>>                 rdd = repartition(rdd, count.intValue());
>>                 long milliSeconds = time.milliseconds();
>>                 String date = Utils.convertMillisecondsToDateStr(milliSeconds);
>>                 String dirPath = outputURIBase
>>                         + File.separator + date
>>                         + File.separator + "tweet-" + time.milliseconds();
>>                 rdd.saveAsTextFile(dirPath);
>>             }
>>         }
>>
>>         final int maxNumRowsPerFile = 200;
>>
>>         JavaRDD<String> repartition(JavaRDD<String> rdd, int count) {
>>             long numPartitions = count / maxNumRowsPerFile + 1;
>>             Long tmp = numPartitions;
>>             rdd = rdd.repartition(tmp.intValue());
>>             return rdd;
>>         }
>>     });
>> }
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Tuesday, April 5, 2016 at 3:06 PM
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS
>>
>>> Spark 1.6.1
>>>
>>> The following creates empty files. It prints lines OK with println.
>>>
>>>     val result = lines.filter(_.contains("ASE 15"))
>>>       .flatMap(line => line.split("\n,"))
>>>       .map(word => (word, 1))
>>>       .reduceByKey(_ + _)
>>>     result.saveAsTextFiles("/tmp/rdd_stuff")
>>>
>>> I am getting zero-length files:
>>>
>>>     drwxr-xr-x - hduser supergroup 0 2016-04-05 23:19 /tmp/rdd_stuff-1459894755000
>>>     drwxr-xr-x - hduser supergroup 0 2016-04-05 23:20 /tmp/rdd_stuff-1459894810000
>>>
>>> Any ideas?
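
[Editor's note: the partition-count arithmetic in the repartition() helper above can be sanity-checked outside Spark. The following is a hypothetical standalone sketch (the class and method names are illustrative, not from the original code). Note that count / maxNumRowsPerFile + 1 allocates one extra partition when count is an exact multiple of the limit; a ceiling division such as (count + max - 1) / max would avoid that.]

```java
// Hypothetical standalone check of the partition-count arithmetic used in
// the repartition() helper above: numPartitions = count / maxRowsPerFile + 1.
public class PartitionMath {
    static final int MAX_ROWS_PER_FILE = 200;

    // Same integer arithmetic as the helper in the email above.
    static int numPartitions(long count) {
        return (int) (count / MAX_ROWS_PER_FILE + 1);
    }

    public static void main(String[] args) {
        System.out.println(numPartitions(0));    // 1 (an empty batch still gets one partition)
        System.out.println(numPartitions(199));  // 1
        System.out.println(numPartitions(200));  // 2 (exact multiple gets an extra partition)
        System.out.println(numPartitions(399));  // 2
        System.out.println(numPartitions(400));  // 3
    }
}
```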
>>>
>>> Thanks,
>>>
>>> Dr Mich Talebzadeh
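
[Editor's note: to summarize the thread's answer — in a streaming job, saveAsTextFiles() writes one output directory per batch interval whether or not that batch produced any records, which is where the zero-length rdd_stuff-* directories come from. Andy's workaround is to drop down to foreachRDD, skip empty batches, and build a per-batch path. A minimal non-Spark sketch of that guard and path construction follows; the class and method names are hypothetical, and the yyyy-MM-dd date format is an assumption about what Utils.convertMillisecondsToDateStr produces.]

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch (no Spark dependency) of the guard used inside
// foreachRDD above: skip empty micro-batches and derive the per-batch
// output directory from the batch time.
public class BatchPathDemo {
    // saveAsTextFiles() writes a directory for every batch interval, even
    // when the batch RDD is empty; guarding on the count avoids that.
    static boolean shouldWrite(long recordCount) {
        return recordCount > 0;
    }

    // Stand-in for Utils.convertMillisecondsToDateStr plus the path
    // concatenation from the email; UTC and yyyy-MM-dd are assumptions.
    static String dirPath(String outputURIBase, long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        String date = fmt.format(new Date(millis));
        return outputURIBase + "/" + date + "/tweet-" + millis;
    }

    public static void main(String[] args) {
        System.out.println(shouldWrite(0));  // false: empty batch, nothing written
        // Batch time taken from the directory listing in the original email.
        System.out.println(dirPath("/tmp/tweets", 1459894755000L));
        // -> /tmp/tweets/2016-04-05/tweet-1459894755000
    }
}
```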