[ https://issues.apache.org/jira/browse/SPARK-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley reassigned SPARK-6120:
----------------------------------------

    Assignee: Joseph K. Bradley

> DecisionTree.save uses too much Java heap space for default spark shell settings
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-6120
>                 URL: https://issues.apache.org/jira/browse/SPARK-6120
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space:
> {code}
> scala> model.save(sc, "myModelPath")
> [Stage 12:>                                                         (0 + 8) / 8]15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
> java.lang.OutOfMemoryError: Java heap space
>     at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
>     at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
>     at parquet.column.values.plain.PlainValuesWriter.<init>(PlainValuesWriter.java:45)
>     at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:102)
>     at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:471)
>     at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:111)
>     at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
>     at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
>     at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
>     at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
>     at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
>     at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
>     at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
>     at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
>     at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>     at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>     at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:620)
>     at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
>     at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> When saving using JSON format instead of Parquet, this works. It seems to be caused by Parquet requiring a lot of metadata to describe the schema.
> I'm labeling this a bug since it should succeed with the default spark-shell settings. Potential fixes are:
> * increasing spark-shell default heap space settings (This is probably too hard to agree on currently.)
> * not using Parquet for storage (This would be good for small examples but probably worse for large models, where Parquet would be more efficient than other formats.)
> * compressing the schema (The various values in the DecisionTree model could be flattened into a single Seq of Double. This may be the best option for now.)
> Notes:
> * This happens in both pyspark and Scala shells.
> * Increasing driver memory to 1g (from the default of 512m) makes this succeed.
> * Running other examples such as NaiveBayes with the default of 512m works.
> * This is a bit strange in that the actual size of the saved model on disk is small (86K on disk for me).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
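
For reference, the "compressing the schema" option could look roughly like the sketch below. It is plain Python with no Spark dependency, and everything in it is hypothetical: the `Node` layout, the six-values-per-node encoding, and the `flatten` / `unflatten` helpers are illustrative assumptions, not MLlib's actual classes or on-disk format. The idea is only that each node becomes a fixed-width run of doubles, so the saved data has one column of a single type instead of one column per field.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


# Hypothetical node layout for illustration only; not MLlib's Node class.
@dataclass
class Node:
    node_id: int
    prediction: float
    feature: int = -1        # -1 marks a leaf (no split)
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# Assumed encoding: id, prediction, feature, threshold, left id, right id.
FIELDS_PER_NODE = 6


def flatten(root: Node) -> List[float]:
    """Encode the tree as one flat list of doubles, root first (pre-order)."""
    out: List[float] = []
    stack = [root]
    while stack:
        n = stack.pop()
        left_id = n.left.node_id if n.left else -1
        right_id = n.right.node_id if n.right else -1
        out.extend([float(n.node_id), n.prediction, float(n.feature),
                    n.threshold, float(left_id), float(right_id)])
        # Push right before left so the left child is emitted first.
        if n.right:
            stack.append(n.right)
        if n.left:
            stack.append(n.left)
    return out


def unflatten(values: List[float]) -> Node:
    """Rebuild the tree from the flat list produced by flatten()."""
    if not values or len(values) % FIELDS_PER_NODE != 0:
        raise ValueError("length must be a positive multiple of %d" % FIELDS_PER_NODE)
    nodes: Dict[int, Node] = {}
    links: Dict[int, Tuple[int, int]] = {}
    for i in range(0, len(values), FIELDS_PER_NODE):
        nid, pred, feat, thr, left_id, right_id = values[i:i + FIELDS_PER_NODE]
        nodes[int(nid)] = Node(int(nid), pred, int(feat), thr)
        links[int(nid)] = (int(left_id), int(right_id))
    # Second pass: reconnect children by id.
    for nid, (left_id, right_id) in links.items():
        if left_id >= 0:
            nodes[nid].left = nodes[left_id]
        if right_id >= 0:
            nodes[nid].right = nodes[right_id]
    return nodes[int(values[0])]  # the root was written first
```

A three-node tree (one split, two leaves) flattens to 18 doubles under this encoding, and `unflatten(flatten(tree))` gives back an equal tree. Whether a single-column layout actually avoids the heap problem in the trace above is not something this sketch demonstrates.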