This is a bug in DataFrame caching. As a workaround, you can avoid caching or turn off columnar compression. It is fixed in Spark 1.5.1.
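For example, a minimal sketch of the compression workaround looks like this. The config key spark.sql.inMemoryColumnarStorage.compressed is the standard Spark SQL setting for in-memory columnar compression; the input path and table name below are just placeholders for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cache-workaround"))
val sqlContext = new SQLContext(sc)

// Disable columnar compression for cached DataFrames/tables, so the
// DictionaryEncoding compress() path in the stack trace below is skipped.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")

// Placeholder input path and table name.
val df = sqlContext.read.json("/path/to/input.json")
df.registerTempTable("events")
sqlContext.cacheTable("events")  // cached without compression

sqlContext.sql("SELECT COUNT(*) FROM events").show()

Alternatively, simply drop the cache()/persist() calls on the DataFrame until you can move to 1.5.1.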
On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> I don't believe I have it on 1.5.1. Are you able to test the data locally
> to confirm, or is it too large?
>
> From: "Zhang, Jingyu" <jingyu.zh...@news.com.au>
> Date: Friday, October 30, 2015 at 7:31 PM
> To: Silvio Fiorito <silvio.fior...@granturing.com>
> Cc: Ted Yu <yuzhih...@gmail.com>, user <user@spark.apache.org>
> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>
> Thanks Silvio and Ted,
>
> Can you please let me know how to fix this intermittent issue? Should I
> wait for EMR to upgrade to Spark 1.5.1, or change my code from DataFrames
> to plain Spark map-reduce?
>
> Regards,
>
> Jingyu
>
> On 31 October 2015 at 09:40, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
>> It's something due to the columnar compression. I've seen similar
>> intermittent issues when caching DataFrames. "sportingpulse.com" is a
>> value in one of the columns of the DF.
>> ------------------------------
>> From: Ted Yu <yuzhih...@gmail.com>
>> Sent: 10/30/2015 6:33 PM
>> To: Zhang, Jingyu <jingyu.zh...@news.com.au>
>> Cc: user <user@spark.apache.org>
>> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>>
>> I searched for sportingpulse in the *.scala and *.java files under the
>> 1.5 branch. There was no hit.
>>
>> The mvn dependency tree doesn't show sportingpulse either.
>>
>> Is it possible this is specific to EMR?
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
>>
>>> There is no problem in Spark SQL 1.5.1, but the error "key not found:
>>> sportingpulse.com" shows up when I use 1.5.0.
>>>
>>> I have to use 1.5.0 because that is the version AWS EMR supports. Can
>>> anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
>>>
>>> Thanks.
>>>
>>> Caused by: java.util.NoSuchElementException: key not found: sportingpulse.com
>>>
>>>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>>>   at scala.collection.AbstractMap.default(Map.scala:58)
>>>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>>>   at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>>>   at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>>>   at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>>>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>>>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>>>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>>>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>>>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>>>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>   at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)