Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to a specific bug.

Can you send the correct link?

Thanks 

> On May 12, 2016, at 6:50 PM, "kramer2...@126.com"  wrote:
> 
> It seems we hit the same issue.
> 
> There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.
> 
> Here is the link about the bug in 1.5.1:
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> 
> At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]"
> <[hidden email]> wrote:
> I read from a port with Spark Streaming. The incoming data consists of key 
> and value pairs. I call foreachRDD on each window, create a Dataset from the 
> window, and run some SQL queries on it. On the result I only call show, to 
> see the content. It works well, but the memory usage keeps increasing, and 
> when it reaches the maximum nothing works anymore. When I give it more 
> memory, the program runs somewhat longer, but the problem persists. Because 
> I run a program that writes to the port, I can control exactly how much data 
> Spark has to process. The problem is the same whether I write one key/value 
> pair every millisecond or only one every second. 
> 
> When I don't create a Dataset in the foreachRDD and only count the elements 
> in the RDD, everything works fine. I also use groupBy/agg functions in the 
> queries. 
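
For context, a minimal PySpark sketch of the pattern described above: a socket
source feeding key/value pairs, a windowed stream, and a per-batch DataFrame
with a groupBy/agg query. The host, port, schema, and batch sizes are
illustrative assumptions; this is not Simon's original code.

    # Sketch only: stream "key,value" lines from a socket, build a DataFrame
    # per window inside foreachRDD, and run a groupBy/agg query on it.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="foreachRDD-dataframe-sketch")
    ssc = StreamingContext(sc, 5)                    # 5-second batches (assumed)
    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical host/port

    def process(time, rdd):
        if rdd.isEmpty():
            return
        sqlContext = SQLContext.getOrCreate(rdd.context)
        rows = rdd.map(lambda l: l.split(",")) \
                  .map(lambda kv: Row(key=kv[0], value=int(kv[1])))
        df = sqlContext.createDataFrame(rows)
        # groupBy/agg query; show() only prints the result to stdout
        df.groupBy("key").agg({"value": "sum"}).show()

    lines.window(30, 10).foreachRDD(process)         # windowed stream, as reported
    ssc.start()
    ssc.awaitTermination()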


Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to make a correction again. It may still be a memory leak,
because in the end the memory usage goes up again...

Eventually, the streaming program crashed.








Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
After 8 hours the memory usage becomes stable. The top command shows it at
about 75%, which means about 12 GB of memory.


But it still does not make sense, because my workload is very small.


I use Spark to run a calculation on one CSV file every 20 seconds. The size of
the CSV file is 1.3 MB.


So Spark is using almost 10,000 times more memory than my workload. Does that
mean I need to prepare 1 TB of RAM if the workload is 100 MB?
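
A hedged aside on checking where that memory actually sits: one way is to find
the executor JVM on the worker node and watch its resident memory directly.
The class name below is the standalone-mode executor backend; the PID is a
placeholder.

    # find the executor JVM process on the worker node
    jps -lm | grep CoarseGrainedExecutorBackend
    # watch its resident memory (RES column); replace <pid> with the PID above
    top -p <pid>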






Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu
Which Spark release are you using?

I assume the executor crashed due to an OOME (OutOfMemoryError).

Did you have a chance to capture a jmap dump of the executor before it crashed?
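
A hedged example of what such a capture could look like, assuming a JDK with
jmap available on the worker node; the PID is a placeholder:

    # histogram of live objects in the executor heap
    jmap -histo:live <executor-pid> > histo.txt
    # or a full heap dump for offline analysis (e.g. in Eclipse MAT)
    jmap -dump:live,format=b,file=executor.hprof <executor-pid>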

Have you tried giving more memory to the executor?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com wrote:

> I submit my code to a Spark standalone cluster and find that the memory usage
> of the executor process keeps growing, which eventually causes the program to
> crash.
>
> I modified the code and submitted it several times, and found that the four
> lines below may be causing the issue:
>
> dataframe = dataframe.groupBy(['router', 'interface']) \
>     .agg(func.sum('bits').alias('bits'))
> windowSpec = Window.partitionBy(dataframe['router']) \
>     .orderBy(dataframe['bits'].desc())
> rank = func.dense_rank().over(windowSpec)
> ret = dataframe.select(dataframe['router'], dataframe['interface'],
>                        dataframe['bits'], rank.alias('rank')) \
>     .filter("rank <= 2")
>
> It looks a little complicated, but it is just a window function applied to a
> DataFrame. I use the HiveContext because the SQLContext does not support
> window functions yet. Without these four lines my code can run all night;
> adding them causes the memory leak and the program crashes in a few hours.
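
For reference, a self-contained sketch of the same dense_rank-over-window
pattern, with the imports the four lines above rely on. The sample data is
made up and this is not the original ForAsk01.py; it only shows the pattern
end to end.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as func
    from pyspark.sql.window import Window

    sc = SparkContext(appName="window-rank-sketch")
    sqlContext = HiveContext(sc)  # HiveContext is needed for window functions here

    # made-up sample data with the same columns as the snippet above
    dataframe = sqlContext.createDataFrame(
        [("r1", "eth0", 100), ("r1", "eth1", 300), ("r1", "eth2", 200),
         ("r2", "eth0", 50), ("r2", "eth1", 80)],
        ["router", "interface", "bits"])

    dataframe = dataframe.groupBy(['router', 'interface']) \
        .agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']) \
        .orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'], dataframe['interface'],
                           dataframe['bits'], rank.alias('rank')) \
        .filter("rank <= 2")
    ret.show()  # top 2 interfaces by bits for each router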
>
> I have provided the whole code (50 lines) here: ForAsk01.py
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
> Please advise me if it is a bug.
>
> Also, here is the submit command:
>
> nohup ./bin/spark-submit  \
> --master spark://ES01:7077 \
> --executor-memory 4G \
> --num-executors 1 \
> --total-executor-cores 1 \
> --conf "spark.storage.memoryFraction=0.2"  \
> ./ForAsk.py 1>a.log 2>b.log &
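
One configuration note, offered tentatively: the submit command sets
spark.storage.memoryFraction, but Spark 1.6 moved to unified memory
management, where that legacy setting only takes effect with
spark.memory.useLegacyMode=true. On 1.6 the corresponding knobs are
spark.memory.fraction and spark.memory.storageFraction; the values below are
only illustrative.

    --conf "spark.memory.fraction=0.6" \
    --conf "spark.memory.storageFraction=0.5" \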