I submit my code to a Spark standalone cluster and find that the memory usage
of the executor process keeps growing, which eventually causes the program to crash.
I modified the code and submitted it several times, and found that the 4 lines
below may be causing the issue:
dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'], rank.alias('rank')).filter("rank<=2")
It looks a little complicated, but it is just a window function applied to a
dataframe. I use HiveContext because SQLContext does not support window
functions yet. Without these 4 lines, my code can run all night. Adding them
causes the memory leak, and the program crashes within a few hours.
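To make the intent of those 4 lines concrete, here is a minimal plain-Python sketch (no Spark needed) of what they are meant to compute: sum the bits per router and interface, dense-rank the interfaces within each router by summed bits in descending order, and keep ranks 1 and 2. The function name top_interfaces and the sample rows are made up for illustration. This is not the code from ForAsk01.py, and it says nothing about where the memory goes.

```python
from collections import defaultdict

def top_interfaces(rows, max_rank=2):
    """Sum bits per (router, interface), then keep the rows whose dense rank
    within each router (by summed bits, descending) is <= max_rank."""
    # Counterpart of groupBy(['router','interface']).agg(sum('bits'))
    totals = defaultdict(int)
    for router, interface, bits in rows:
        totals[(router, interface)] += bits

    # Counterpart of Window.partitionBy('router')
    by_router = defaultdict(list)
    for (router, interface), bits in totals.items():
        by_router[router].append((interface, bits))

    result = []
    for router in sorted(by_router):
        # dense_rank: ties share a rank, the next distinct value gets rank + 1
        distinct = sorted({bits for _, bits in by_router[router]}, reverse=True)
        rank_of = {bits: i + 1 for i, bits in enumerate(distinct)}
        ordered = sorted(by_router[router], key=lambda p: (-p[1], p[0]))
        for interface, bits in ordered:
            # Counterpart of filter("rank<=2")
            if rank_of[bits] <= max_rank:
                result.append((router, interface, bits, rank_of[bits]))
    return result

# Made-up sample data: (router, interface, bits)
sample = [
    ("r1", "eth0", 100), ("r1", "eth0", 50),
    ("r1", "eth1", 150), ("r1", "eth2", 30), ("r1", "eth3", 10),
    ("r2", "eth0", 5),
]
for row in top_interfaces(sample):
    print(row)
```

Because dense_rank gives tied values the same rank, a router can return more than two rows when several interfaces have the same summed bits.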
I have provided the whole code (50 lines) here: ForAsk01.py
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
Please advise me whether this is a bug.
Here is the submit command as well:
nohup ./bin/spark-submit \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2" \
./ForAsk.py 1>a.log 2>b.log &
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.