Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate when the data sets are larger than available memory?
My jobs stall, with GC/heap issues all over the place.

..via mobile

> On Oct 6, 2015, at 4:44 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> I have not used LZO compressed files from Spark, so I am not sure why it
> stalls without caching.
>
> In general, if you are going to make just one pass over the data, there is
> not much benefit in caching it. The data gets read only after the first
> action is called anyway. If you are calling just a map operation and then a
> save operation, I don't see how caching would help.
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 3:32 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
>
> One.
>
> I read in LZO compressed files from HDFS, perform a map operation, cache
> the results of this map operation, and call saveAsHadoopFile to write LZO
> back to HDFS.
>
> Without the cache, the job will stall.
>
> mn
>
>> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>
>> Is there any specific reason for caching the RDD? How many passes do you
>> make over the dataset?
>>
>> Mohammed
>>
>> -----Original Message-----
>> From: Matt Narrell [mailto:matt.narr...@gmail.com]
>> Sent: Saturday, October 3, 2015 9:50 PM
>> To: Mohammed Guller
>> Cc: davidkl; user@spark.apache.org
>> Subject: Re: laziness in textFile reading from HDFS?
>>
>> Is there any more information or best practices here? I have the exact
>> same issues when reading large data sets from HDFS (larger than available
>> RAM), and I cannot run without setting the RDD persistence level to
>> MEMORY_AND_DISK_SER and using nearly all the cluster resources.
>>
>> Should I repartition this RDD to be equal to the number of cores?
>>
>> I notice that the job duration on the YARN UI is about 30 minutes longer
>> than on the Spark UI. When the job initially starts, there are no tasks
>> shown in the Spark UI..?
>> All I'm doing is reading records from HDFS text files with sc.textFile
>> and rewriting them back to HDFS grouped by a timestamp.
>>
>> Thanks,
>> mn
>>
>>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>>
>>> 1) It is not required to have the same amount of memory as data.
>>> 2) By default, the number of partitions is equal to the number of HDFS
>>> blocks.
>>> 3) Yes, the read operation is lazy.
>>> 4) It is okay to have more partitions than cores.
>>>
>>> Mohammed
>>>
>>> -----Original Message-----
>>> From: davidkl [mailto:davidkl...@hotmail.com]
>>> Sent: Monday, September 28, 2015 1:40 AM
>>> To: user@spark.apache.org
>>> Subject: laziness in textFile reading from HDFS?
>>>
>>> Hello,
>>>
>>> I need to process a significant amount of data every day, about 4TB. It
>>> will be processed in batches of about 140GB. The cluster this will be
>>> running on doesn't have enough memory to hold the dataset at once, so I
>>> am trying to understand how this works internally.
>>>
>>> When using textFile to read an HDFS folder (containing multiple files),
>>> I understand that the number of partitions created is equal to the
>>> number of HDFS blocks, correct? Are those created in a lazy way? I mean,
>>> if the number of blocks/partitions is larger than the number of
>>> cores/threads the Spark driver was launched with (N), are N partitions
>>> created initially and then the rest when required? Or are all those
>>> partitions created up front?
>>>
>>> I want to avoid reading the whole data into memory just to spill it out
>>> to disk if there is not enough memory.
>>>
>>> Thanks!
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
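To make answer 2) above concrete: for a splittable input, sc.textFile creates roughly one partition per HDFS block, so the partition count can be estimated from input size and block size. A minimal plain-Python sketch of that arithmetic (not Spark code; the 128 MB block size is an assumption based on the common CDH default for dfs.blocksize, and your cluster may differ):

```python
import math

def estimated_partitions(total_bytes, block_bytes=128 * 1024 * 1024):
    """Rough partition count for a splittable HDFS input:
    one partition per HDFS block. Spark may create more if a
    larger minPartitions is passed to sc.textFile."""
    return math.ceil(total_bytes / block_bytes)

# A 140 GB daily batch with 128 MB blocks:
batch_bytes = 140 * 1024 ** 3
print(estimated_partitions(batch_bytes))  # 1120 partitions
```

Having ~1120 partitions on far fewer cores is fine (answer 4): tasks run in waves, and only the partitions currently being processed need to fit in memory. One caveat relevant to the LZO case above: plain .lzo files are not splittable unless they have been indexed, in which case each file is read as a single partition regardless of block count.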
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
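The caching point made in the thread (one pass over the data gains nothing from caching; multiple passes avoid recomputation) can be illustrated without Spark at all. A hypothetical plain-Python sketch, using a generator to stand in for a lazy RDD and a counter to stand in for the cost of re-reading the source from HDFS:

```python
reads = 0

def records():
    """Stands in for a lazy RDD: every full pass re-reads the source."""
    global reads
    for x in range(5):
        reads += 1
        yield x

# One pass, no cache: data is read exactly once, when the "action"
# (sum) runs -- caching first would not have saved anything.
total = sum(v * 2 for v in records())
assert reads == 5

# A second pass without a cache re-reads the source from scratch.
total_again = sum(v * 2 for v in records())
assert reads == 10

# "Caching" = materializing once, then reusing for every later pass
# (like rdd.cache() followed by a first action in Spark).
reads = 0
cached = list(records())
first_pass, second_pass = sum(cached), max(cached)
assert reads == 5  # the second pass hit the cache, not the source
```

This also matches Mohammed's question upthread: in a map-then-save pipeline there is only one pass, so in principle the cache should be pure overhead; the fact that the job stalls *without* it suggests the problem lies elsewhere (e.g. input splits or memory pressure) rather than in caching semantics.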