Hi Kramer,
Some options:
1. Store the data in Cassandra with a TTL of 24 hours. Reading the full
table then returns only the latest 24 hours of data.
2. Store the data in Hive as ORC files and use a timestamp field to filter
out the old data.
3. Try windowing in Spark or Flink (I have not used either).
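As a rough illustration of option 2: every record keeps its timestamp, and the 24-hour cutoff is applied at read time instead of deleting anything by hand. The sketch below uses plain Python lists as a stand-in for a Hive table; `Record` and `last_24_hours` are names made up for the example, and in Hive or Spark SQL the same filter would presumably be a WHERE clause on the timestamp column.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    """Hypothetical row: an event time plus a payload."""
    ts: datetime
    value: int


def last_24_hours(records, now):
    """Return only the records whose timestamp falls in the 24 hours before `now`."""
    cutoff = now - timedelta(hours=24)
    return [r for r in records if cutoff < r.ts <= now]


now = datetime(2016, 4, 11, 16, 0)
rows = [
    Record(now - timedelta(hours=30), 1),    # older than 24 hours, filtered out
    Record(now - timedelta(hours=23), 2),
    Record(now - timedelta(minutes=5), 3),
]
recent = last_24_hours(rows, now)
```

Because nothing is deleted, the report query decides what counts as "recent", and old data can be cleaned up separately.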
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Monday, April 11, 2016 16.18
To: [email protected]
Subject: Why Spark having OutOfMemory Exception?
I use Spark to do a very simple calculation. In pseudocode it looks like this:

every 5 minutes:
    df = read_hdf()                  # Read HDFS to get a dataframe
    my_dict[timestamp] = df          # Put the dataframe into a dict
    delete_old_dataframe(my_dict)    # Delete dataframes older than 24 hours
    big_df = merge(my_dict)          # Merge the dataframes from the last 24 hours
To explain:
New files come in every 5 minutes, but I need to generate a report on the
most recent 24 hours of data.
The 24-hour window means I need to delete the oldest dataframe every time
I put a new one in.
So I maintain a dict (my_dict in the code above) that maps
timestamp to dataframe. Every time I put a dataframe into the dict, I go
through the dict and delete the dataframes whose timestamp is more than
24 hours old.
After the delete and insert, I merge the dataframes in the dict into one big
dataframe and run SQL on it to get my report.
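The bookkeeping described above can be sketched in plain Python as follows. This is only a model of the dict handling, not of Spark itself: `RollingWindow` is a hypothetical name, plain lists stand in for dataframes, and the sketch assumes timestamps arrive in increasing order.

```python
from collections import OrderedDict
from datetime import datetime, timedelta


class RollingWindow:
    """Keep one batch per timestamp and drop batches older than `span`."""

    def __init__(self, span=timedelta(hours=24)):
        self.span = span
        self.batches = OrderedDict()    # timestamp -> batch, oldest first

    def add(self, ts, batch):
        """Insert a new batch, then evict everything that has expired."""
        self.batches[ts] = batch
        cutoff = ts - self.span
        # Timestamps arrive in order, so expired batches sit at the front.
        while self.batches and next(iter(self.batches)) <= cutoff:
            self.batches.popitem(last=False)

    def merged(self):
        """Concatenate the batches still in the window (stand-in for a dataframe union)."""
        return [row for batch in self.batches.values() for row in batch]


window = RollingWindow()
start = datetime(2016, 4, 11, 0, 0)
for i in range(300):                    # 300 batches, 5 minutes apart
    window.add(start + timedelta(minutes=5 * i), [i])
```

With one batch every 5 minutes, the window settles at 288 entries (24 hours / 5 minutes), so the dict itself stays bounded in this model.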
Is there anything wrong with this model? It becomes very slow after
running for a while and then hits OutOfMemory. I know that I have enough
memory, and the files are very small for testing purposes, so memory should
not be a problem.
I am wondering whether there is a lineage issue, but I am not sure.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]