Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate when the data sets are larger than available memory?
My jobs stall, with GC/heap issues all over the place.

..via mobile

> On Oct 6, 2015, at 4:44 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> I have not used LZO compressed files from Spark, so I am not sure why it
> stalls without caching.
>
> In general, if you are going to make just one pass over the data, there is
> not much benefit in caching it. The data gets read only after the first
> action is called anyway. If you are calling just a map operation and then a
> save operation, I don't see how caching would help.
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 3:32 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
>
> One.
>
> I read in LZO compressed files from HDFS, perform a map operation, cache
> the results of this map operation, and call saveAsHadoopFile to write LZO
> back to HDFS.
>
> Without the cache, the job will stall.
>
> mn
>
>> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>
>> Is there any specific reason for caching the RDD? How many passes do you
>> make over the dataset?
>>
>> Mohammed
>>
>> -----Original Message-----
>> From: Matt Narrell [mailto:matt.narr...@gmail.com]
>> Sent: Saturday, October 3, 2015 9:50 PM
>> To: Mohammed Guller
>> Cc: davidkl; user@spark.apache.org
>> Subject: Re: laziness in textFile reading from HDFS?
>>
>> Is there any more information or best practices here? I have the exact
>> same issues when reading large data sets from HDFS (larger than available
>> RAM), and I cannot run without setting the RDD persistence level to
>> MEMORY_AND_DISK_SER and using nearly all the cluster resources.
>>
>> Should I repartition this RDD to be equal to the number of cores?
>>
>> I notice that the job duration on the YARN UI is about 30 minutes longer
>> than on the Spark UI. When the job initially starts, there are no tasks
>> shown in the Spark UI..?
>> All I'm doing is reading records from HDFS text files with sc.textFile
>> and rewriting them back to HDFS grouped by a timestamp.
>>
>> Thanks,
>> mn
>>
>>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>>
>>> 1) It is not required to have the same amount of memory as data.
>>> 2) By default, the number of partitions is equal to the number of HDFS
>>> blocks.
>>> 3) Yes, the read operation is lazy.
>>> 4) It is okay to have more partitions than cores.
>>>
>>> Mohammed
>>>
>>> -----Original Message-----
>>> From: davidkl [mailto:davidkl...@hotmail.com]
>>> Sent: Monday, September 28, 2015 1:40 AM
>>> To: user@spark.apache.org
>>> Subject: laziness in textFile reading from HDFS?
>>>
>>> Hello,
>>>
>>> I need to process a significant amount of data every day, about 4TB. It
>>> will be processed in batches of about 140GB. The cluster this will be
>>> running on doesn't have enough memory to hold the dataset at once, so I
>>> am trying to understand how this works internally.
>>>
>>> When using textFile to read an HDFS folder (containing multiple files),
>>> I understand that the number of partitions created is equal to the
>>> number of HDFS blocks, correct? Are those created in a lazy way? I mean,
>>> if the number of blocks/partitions is larger than the number of
>>> cores/threads the Spark driver was launched with (N), are N partitions
>>> created initially and then the rest when required? Or are all those
>>> partitions created up front?
>>>
>>> I want to avoid reading the whole data into memory just to spill it out
>>> to disk if there is not enough memory.
>>>
>>> Thanks!
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
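To make answer 2) above concrete: for a splittable input, sc.textFile creates roughly one partition per HDFS block, so the partition count can be estimated from input size and block size. A minimal plain-Python sketch of that arithmetic (not Spark code; the 128 MB block size is an assumption based on the common CDH default for dfs.blocksize, and your cluster may differ):

```python
import math

def estimated_partitions(total_bytes, block_bytes=128 * 1024 * 1024):
    """Rough partition count for a splittable HDFS input:
    one partition per HDFS block. Spark may create more if a
    larger minPartitions is passed to sc.textFile."""
    return math.ceil(total_bytes / block_bytes)

# A 140 GB daily batch with 128 MB blocks:
batch_bytes = 140 * 1024 ** 3
print(estimated_partitions(batch_bytes))  # 1120 partitions
```

Having ~1120 partitions on far fewer cores is fine (answer 4): tasks run in waves, and only the partitions currently being processed need to fit in memory. One caveat relevant to the LZO case above: plain .lzo files are not splittable unless they have been indexed, in which case each file is read as a single partition regardless of block count.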
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
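The caching point made in the thread (one pass over the data gains nothing from caching; multiple passes avoid recomputation) can be illustrated without Spark at all. A hypothetical plain-Python sketch, using a generator to stand in for a lazy RDD and a counter to stand in for the cost of re-reading the source from HDFS:

```python
reads = 0

def records():
    """Stands in for a lazy RDD: every full pass re-reads the source."""
    global reads
    for x in range(5):
        reads += 1
        yield x

# One pass, no cache: data is read exactly once, when the "action"
# (sum) runs -- caching first would not have saved anything.
total = sum(v * 2 for v in records())
assert reads == 5

# A second pass without a cache re-reads the source from scratch.
total_again = sum(v * 2 for v in records())
assert reads == 10

# "Caching" = materializing once, then reusing for every later pass
# (like rdd.cache() followed by a first action in Spark).
reads = 0
cached = list(records())
first_pass, second_pass = sum(cached), max(cached)
assert reads == 5  # the second pass hit the cache, not the source
```

This also matches Mohammed's question upthread: in a map-then-save pipeline there is only one pass, so in principle the cache should be pure overhead; the fact that the job stalls *without* it suggests the problem lies elsewhere (e.g. input splits or memory pressure) rather than in caching semantics.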