The memory parameters are: --executor-memory 8G --driver-memory 4G. Please note that the data size is very small; the total size of the data is less than 10 MB.
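In case it helps, I submit the job roughly like this (the script name here is just a placeholder for my application):

    spark-submit --executor-memory 8G --driver-memory 4G my_app.py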
As for jmap, that is a little hard for me because I am not a Java developer. I will look up jmap first, thanks.

Regards,
Mingwei

At 2016-04-20 11:03:20, "Ted Yu" <yuzhih...@gmail.com> wrote:
> Can you tell us the memory parameters you used?
>
> If you can capture jmap before the GC limit was exceeded, that would give us more of a clue.
>
> Thanks
>
>> On Apr 19, 2016, at 7:40 PM, "kramer2...@126.com" <kramer2...@126.com> wrote:
>>
>> Hi All
>>
>> I use Spark to do some calculations. The situation is:
>> 1. New files come into a folder periodically.
>> 2. I turn each new file into a data frame and union it into the previous data frame.
>>
>> The code is like below:
>>
>>     # Imports for the names used below; sc and sqlContext come from the
>>     # PySpark shell / application setup (InsecureClient is from the
>>     # hdfs / HdfsCLI package)
>>     from hdfs import InsecureClient
>>     from pyspark.sql import Row
>>     import pyspark.sql.functions as func
>>
>>     # Get the file list in the HDFS directory
>>     client = InsecureClient('http://10.79.148.184:50070')
>>     file_list = client.list('/test')
>>
>>     df_total = None
>>     counter = 0
>>     for file in file_list:
>>         counter += 1
>>
>>         # turn each file (CSV format) into a data frame
>>         lines = sc.textFile("/test/%s" % file)
>>         parts = lines.map(lambda l: l.split(","))
>>         rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
>>                                        protocol=p[7], bit=int(p[10])))
>>         df = sqlContext.createDataFrame(rows)
>>
>>         # do some transformation on the data frame
>>         df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
>>
>>         # add the current data frame to the previous data frame set
>>         if df_total is None:
>>             df_total = df_protocol
>>         else:
>>             df_total = df_total.unionAll(df_protocol)
>>
>>         # cache df_total, and checkpoint every 5 files
>>         df_total.cache()
>>         if counter % 5 == 0:
>>             df_total.rdd.checkpoint()
>>
>>         # show the df_total information
>>         df_total.show()
>>
>> I know that as time goes on, df_total could get big. But actually, before that time comes, the code above already raises an exception.
>>
>> At around 30 iterations of the loop, the code throws a "GC overhead limit exceeded" exception. The files are very small, so even after 300 iterations the total data size would only be a few MB. I do not know why it throws a GC error.
>>
>> The exception detail is below:
>>
>>     16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
>>     java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>         at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>         at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>         at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>         at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>         at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>         at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>         at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>         at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>         at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>         at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>>     Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>         at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>         at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>         at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>         at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>         at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>         at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>         at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>         at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>         at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>         at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
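A likely pattern behind this, for what it is worth: df_total.rdd.checkpoint() only marks a derived RDD for checkpointing and is never followed by an action, so nothing is actually written, and the unionAll lineage (and the plan the driver carries for it) keeps growing every iteration even though the data itself stays tiny. A minimal sketch of one way to actually truncate the lineage, reusing the names from the code above (unverified against this exact setup, and it assumes sc.setCheckpointDir(...) has been called):

    # Hypothetical replacement for the "counter % 5 == 0" branch in the loop above
    if counter % 5 == 0:
        rdd = df_total.rdd        # RDD of Rows behind the current plan
        rdd.checkpoint()          # mark for checkpointing (lazy)
        rdd.count()               # an action forces the checkpoint to materialize
        # rebuild the DataFrame from the checkpointed RDD, discarding the
        # accumulated unionAll plan
        df_total = sqlContext.createDataFrame(rdd, df_total.schema)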