[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604926#comment-16604926 ]

Shirish Tatikonda commented on SPARK-19809:
-------------------------------------------

Thank you [~dongjoon]

> NullPointerException on zero-size ORC file
> ------------------------------------------
>
>                 Key: SPARK-19809
>                 URL: https://issues.apache.org/jira/browse/SPARK-19809
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>            Reporter: Michał Dawid
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 2.3.0
>
>         Attachments: image-2018-02-26-20-29-49-410.png, spark.sql.hive.convertMetastoreOrc.txt
>
>
> When reading from a Hive ORC table, if there are some zero-byte files we get a NullPointerException:
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
> 	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> 	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
> 	at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
> 	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
> 	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
> 	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> 	at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
> 	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
> 	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
> 	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
> 	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
> 	at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
> 	at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
> 	at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
> 	at
> {code}
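The attachments list above includes {{spark.sql.hive.convertMetastoreOrc.txt}}, which points at a possible mitigation: have Spark read the metastore ORC table through its native data-source path instead of Hive's {{OrcInputFormat}}, whose split generation is what throws here. A minimal sketch for a 2.x spark-shell follows; whether this avoids the NPE on a given version is an assumption to verify, and {{my_orc_table}} is a hypothetical table name.

{code:scala}
// Hedged mitigation sketch: route reads of metastore ORC tables through
// Spark's native ORC data source rather than Hive's OrcInputFormat split
// generation. "my_orc_table" is a hypothetical table name; verify the
// behavior of this flag on your specific Spark version.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT * FROM my_orc_table").show()
{code}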
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598227#comment-16598227 ]

Shirish Tatikonda commented on SPARK-19809:
-------------------------------------------

[~dongjoon] I am encountering the same problem even with Spark version 2.3.1.

{code:java}
[local:~] spark-shell
2018-08-30 21:07:25 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1535688452266).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
2018-08-30 21:07:44 WARN ObjectStore: - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2018-08-30 21:07:44 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2018-08-30 21:07:45 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
++
||
++
++

// in a different terminal, I did "touch /tmp/empty_orc/zero.orc"

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at
{code}
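Until a fixed release is in place, a workaround consistent with the reproduction above is to remove zero-length files from the table directory before querying, so Hive's {{OrcInputFormat}} never sees them during split generation. A minimal spark-shell sketch, assuming the {{/tmp/empty_orc}} location from the session above:

{code:scala}
// Workaround sketch: delete zero-byte files from the table location.
// Assumes the /tmp/empty_orc location from the reproduction above.
import org.apache.hadoop.fs.{FileSystem, Path}

val tableDir = new Path("/tmp/empty_orc")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(tableDir)
  .filter(st => st.isFile && st.getLen == 0)   // zero-length files only
  .foreach(st => fs.delete(st.getPath, false)) // non-recursive delete
{code}

After the cleanup, {{sql("select * from empty_orc").show}} should return an empty result instead of throwing.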
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240020#comment-15240020 ]

Shirish Tatikonda commented on SPARK-13434:
-------------------------------------------

[~josephkb] Just curious if there is any plan set forth for implementing local training for deep trees. This is by far the most important optimization -- in my opinion, critical for applying RF to real-world problems.

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on github (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the RandomForest training on largish datasets on machines with 64G memory and the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}} and have a heap profile from a single machine by running {{jmap -histo:live <pid>}}. I took a sample every 5 seconds and at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128
>   16:        268887       34431568
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:           110           2640  org.apache.spark.mllib.regression.LabeledPoint
>   19:           110           2640  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864
>   21:           100           2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:           100           2400  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864
>   25:         17023       14380352
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}
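A note on the heap profile above: while local training for deep trees is not implemented, the existing knobs on {{org.apache.spark.mllib.tree.configuration.Strategy}} do bound peak memory during distributed training. A hedged sketch for spark-shell follows; the input path and all parameter values are illustrative assumptions, not recommendations:

{code:scala}
// Hedged sketch of memory-bounding knobs for mllib's RandomForest.
// The input path and parameter values below are illustrative assumptions.
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt") // hypothetical path

val strategy = Strategy.defaultStrategy("Classification")
strategy.numClasses = 2
strategy.maxDepth = 10        // deep trees multiply Node/Predict instances
strategy.maxBins = 32         // fewer bins shrink DTStatsAggregator arrays
strategy.maxMemoryInMB = 512  // caps how many nodes are split per group

val model = RandomForest.trainClassifier(
  data, strategy, numTrees = 100, featureSubsetStrategy = "sqrt", seed = 42)
{code}

In particular, {{maxMemoryInMB}} trades extra passes over the data for a smaller peak aggregate buffer, which directly targets the {{DTStatsAggregator}} entries dominating the histogram above.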