Joseph K. Bradley created SPARK-6120:
----------------------------------------
Summary: DecisionTree.save uses too much Java heap space for default spark shell settings
Key: SPARK-6120
URL: https://issues.apache.org/jira/browse/SPARK-6120
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space:

{code}
scala> model.save(sc, "myModelPath")

[Stage 12:> (0 + 8) / 8]15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
java.lang.OutOfMemoryError: Java heap space
	at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
	at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
	at parquet.column.values.plain.PlainValuesWriter.<init>(PlainValuesWriter.java:45)
	at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:102)
	at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:471)
	at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:111)
	at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
	at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
	at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
	at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
	at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
	at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
	at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
	at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:620)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}

When saving using JSON format instead of Parquet, this works. It seems to be caused by Parquet requiring a lot of metadata to describe the schema. I'm labeling this a bug since it should succeed with the default spark-shell settings.

Potential fixes are:
* increasing spark-shell default heap space settings (This is probably too hard to agree on currently.)
* not using Parquet for storage (This would be good for small examples but probably worse for large models, where Parquet would be more efficient than other formats.)
* compressing the schema (The various values in the DecisionTree model could be flattened into a single Seq of Double. This may be the best option for now.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
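To make the "compressing the schema" option above more concrete, here is a rough sketch of what flattening tree nodes into a single sequence of doubles could look like. It is plain Python for illustration only, not the MLlib implementation: the `Node` layout, its field names, the `WIDTH` constant, and the `flatten`/`unflatten` helpers are all hypothetical, and the real model carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical, simplified node layout (assumption; not the real MLlib node).
@dataclass
class Node:
    node_id: int
    prediction: float
    impurity: float
    is_leaf: bool
    feature: Optional[int] = None      # split feature index (internal nodes only)
    threshold: Optional[float] = None  # split threshold (internal nodes only)
    left_id: Optional[int] = None      # left child id (internal nodes only)
    right_id: Optional[int] = None     # right child id (internal nodes only)


WIDTH = 8  # number of doubles written per node


def flatten(nodes: List[Node]) -> List[float]:
    """Encode every node as WIDTH consecutive doubles in one flat list."""
    out: List[float] = []
    for n in nodes:
        if n.is_leaf:
            # Leaves have no split; pad with zeros. The is_leaf flag tells
            # the decoder to ignore these slots.
            split = [0.0, 0.0, 0.0, 0.0]
        else:
            split = [float(n.feature), float(n.threshold),
                     float(n.left_id), float(n.right_id)]
        out.extend([float(n.node_id), float(n.prediction), float(n.impurity),
                    1.0 if n.is_leaf else 0.0] + split)
    return out


def unflatten(values: List[float]) -> List[Node]:
    """Decode the flat list produced by flatten() back into nodes."""
    if len(values) % WIDTH != 0:
        raise ValueError("length must be a multiple of %d" % WIDTH)
    nodes: List[Node] = []
    for i in range(0, len(values), WIDTH):
        node_id, pred, imp, leaf, feat, thr, left, right = values[i:i + WIDTH]
        if leaf == 1.0:
            nodes.append(Node(int(node_id), pred, imp, True))
        else:
            nodes.append(Node(int(node_id), pred, imp, False,
                              int(feat), thr, int(left), int(right)))
    return nodes


# Tiny three-node tree: one split and two leaves.
tree = [
    Node(1, 0.0, 0.5, False, feature=2, threshold=1.5, left_id=2, right_id=3),
    Node(2, 0.0, 0.0, True),
    Node(3, 1.0, 0.0, True),
]
flat = flatten(tree)
restored = unflatten(flat)
```

With a layout like this, the stored schema would be a single sequence of doubles rather than a nested record with many typed fields. Whether that actually reduces Parquet's memory use enough for the default spark-shell heap would still need to be checked.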