Hi Kramer,
        Some options:
        1. Store the data in Cassandra with TTL = 24 hours. When you read the full 
table, you get only the latest 24 hours of data.
        2. Store the data in Hive as ORC files and use a timestamp field to filter 
out the old data.
        3. Try windowing in Spark or Flink (I have not used either).
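
For option 2, the 24-hour cutoff can be written as a SQL predicate on the timestamp field. Below is a minimal sketch in Python, not a tested recipe: the helper name recent_window_predicate and the column name event_time are made up for illustration, and it assumes the field is stored as a timestamp.

```python
from datetime import datetime, timedelta


def recent_window_predicate(now, hours=24, column="event_time"):
    """Build a SQL predicate that keeps rows from the last `hours` hours.

    `column` is a hypothetical timestamp column; adjust it to the real schema.
    """
    cutoff = now - timedelta(hours=hours)
    # Format the cutoff as 'YYYY-MM-DD HH:MM:SS' and cast it to a timestamp.
    return "{} >= CAST('{}' AS TIMESTAMP)".format(
        column, cutoff.strftime("%Y-%m-%d %H:%M:%S")
    )


predicate = recent_window_predicate(datetime(2016, 4, 11, 16, 18))
print(predicate)  # event_time >= CAST('2016-04-10 16:18:00' AS TIMESTAMP)
```

The resulting string can go into the WHERE clause of a query, or be passed to DataFrame.filter, which accepts a SQL expression string.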


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


-----Original Message-----
From: kramer2...@126.com [mailto:kramer2...@126.com] 
Sent: Monday, April 11, 2016 16.18
To: user@spark.apache.org
Subject: Why Spark having OutOfMemory Exception?

I use Spark to do a very simple calculation, described below 
(pseudo code):


Every 5 minutes:

    df = read_hdf()  # Read from HDFS to get a data frame every 5 minutes

    my_dict[timestamp] = df  # Put the data frame into a dict

    delete_old_dataframe(my_dict)  # Delete data frames older than 24 hours

    big_df = merge(my_dict)  # Merge the data frames of the most recent 24 hours

To explain:

New files come in every 5 minutes, but I need to generate a report on the 
most recent 24 hours of data. 
Keeping a 24-hour window means I need to delete the oldest data frame every time 
I add a new one.
So I maintain a dict (my_dict in the code above) that maps
timestamp: dataframe. Every time I put a data frame into the dict, I go 
through the dict and delete the old data frames whose timestamps are 24 hours 
or more in the past.
After deleting and inserting, I merge the data frames in the dict into one big 
data frame and run SQL on it to get my report.
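
The add-then-evict step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name add_and_evict is hypothetical, and plain lists stand in for the data frames so the snippet runs without Spark.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def add_and_evict(frames, timestamp, frame, window=WINDOW):
    """Store `frame` under `timestamp`, then drop entries older than `window`."""
    frames[timestamp] = frame
    cutoff = timestamp - window
    # Collect the stale keys first: a dict must not change size while iterated.
    for old in [ts for ts in frames if ts <= cutoff]:
        del frames[old]
    return frames


# Stand-in data: plain lists play the role of the data frames here.
frames = {}
start = datetime(2016, 4, 11, 0, 0)
for i in range(300):  # 300 batches, one every 5 minutes (25 hours in total)
    add_and_evict(frames, start + timedelta(minutes=5 * i), [i])

print(len(frames))  # 288 batches = 24 hours of 5-minute intervals
```

In the real job the dict values would be data frames, and the merge step would be a union over the values that remain after eviction.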

*
I want to know if there is anything wrong with this model, because the job 
becomes very slow after running for a while and then hits OutOfMemory. I know 
that I have enough memory. Also, the files are very small for test purposes, so 
there should not be a memory problem.

I am wondering if there is a lineage issue, but I am not sure. 

*



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


