Cached data not showing up in Storage tab
When I cache a variable, the data never shows up in the Storage tab; the tab is always blank. I have tried it in Zeppelin as well as spark-shell:

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell. However, I have several running applications in production that do show the cached data. I am using Scala and Spark 2.2.1 on EMR. Are there any workarounds to see the data in cache?

The problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
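When the Storage tab stays blank, one way to check caching independently of the UI is to ask the Dataset itself. A minimal sketch, assuming a running SparkSession named `spark` (the truncated `s3:// /classCount` path is kept from the post above and would need to be filled in):

```scala
// Sketch: confirming a DataFrame is actually persisted without the Storage tab.
import org.apache.spark.storage.StorageLevel

val classCount = spark.read.parquet("s3:// /classCount")
classCount.persist(StorageLevel.MEMORY_AND_DISK)
classCount.count()                     // action to force materialization

// Dataset.storageLevel (available since Spark 2.1) reports the level this
// Dataset is persisted with; StorageLevel.NONE would mean it is not cached.
println(classCount.storageLevel)

// For named tables, the catalog can be queried directly:
classCount.createOrReplaceTempView("class_count")
spark.catalog.cacheTable("class_count")
println(spark.catalog.isCached("class_count"))
```

If `storageLevel` reports a real level but the Storage tab is still empty, the issue is likely with the UI/history-server event data rather than with caching itself.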
Re: Spark seems to think that a particular broadcast variable is large in size
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri wrote:
>
> I did try that mechanism before, but the data never shows up in the
> Storage tab; the tab is always blank. I have tried it in Zeppelin as
> well as spark-shell:
>
> scala> val classCount = spark.read.parquet("s3:// /classCount")
> scala> classCount.persist
> scala> classCount.count
>
> Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
> However, I have several running applications in production that do
> show the cached data. I am using Scala and Spark 2.2.1 on EMR. Are
> there any workarounds to see the data in cache?
>
> On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek wrote:
> >
> > In your program, persist the smaller table and use count to force it
> > to materialize. Then in the Spark UI go to the Storage tab. The size
> > of your table as Spark sees it should be displayed there. Out of
> > curiosity, what version / language of Spark are you using?
> >
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri wrote:
> >>
> >> I am trying to do a broadcast join on two tables. The size of the
> >> smaller table will vary based upon the parameters, but the size of
> >> the larger table is close to 2 TB. What I have noticed is that if I
> >> don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> >> operations do a SortMergeJoin instead of a broadcast join. But the
> >> smaller table shouldn't be anywhere near that big. I wrote the
> >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> >> did some operations on the smaller table so the shuffle size appears
> >> on the Spark History Server, and the size in memory seemed to be
> >> 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> >> smaller table it takes a long time to broadcast, leading me to think
> >> that the table might not just be 150 MB in size. What would be a
> >> good way to figure out the actual size that Spark is seeing when
> >> deciding whether it crosses spark.sql.autoBroadcastJoinThreshold?
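One way to see the number the planner actually compares against the threshold is to read the size estimate off the optimized logical plan. A sketch, assuming a SparkSession named `spark` and a hypothetical path for the small table; note that on Spark 2.2 the `stats` call takes the session's SQLConf, while on 2.3+ it is parameterless:

```scala
// Sketch: inspecting the optimizer's size estimate for a DataFrame on Spark 2.2.
// Path and variable names are hypothetical placeholders.
val smallDf = spark.read.parquet("s3://bucket/small-table")

// sizeInBytes is the estimate the planner compares with
// spark.sql.autoBroadcastJoinThreshold when choosing the join strategy.
val estimated = smallDf.queryExecution
  .optimizedPlan
  .stats(spark.sessionState.conf)   // Spark 2.2 API; no argument on 2.3+
  .sizeInBytes

println(s"Planner size estimate: $estimated bytes")
```

This estimate is based on file sizes and plan statistics, not on the cached in-memory size, which is why it can diverge sharply from what the Storage tab or the on-disk footprint suggests.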
Re: Spark seems to think that a particular broadcast variable is large in size
I did try that mechanism before, but the data never shows up in the Storage tab; the tab is always blank. I have tried it in Zeppelin as well as spark-shell:

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell. However, I have several running applications in production that do show the cached data. I am using Scala and Spark 2.2.1 on EMR. Are there any workarounds to see the data in cache?

On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek wrote:
>
> In your program, persist the smaller table and use count to force it to
> materialize. Then in the Spark UI go to the Storage tab. The size of
> your table as Spark sees it should be displayed there. Out of
> curiosity, what version / language of Spark are you using?
>
> On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri wrote:
>>
>> I am trying to do a broadcast join on two tables. The size of the
>> smaller table will vary based upon the parameters, but the size of the
>> larger table is close to 2 TB. What I have noticed is that if I don't
>> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
>> operations do a SortMergeJoin instead of a broadcast join. But the
>> smaller table shouldn't be anywhere near that big. I wrote the smaller
>> table to an S3 folder and it took only 12.6 MB of space. I did some
>> operations on the smaller table so the shuffle size appears on the
>> Spark History Server, and the size in memory seemed to be 150 MB,
>> nowhere near 10G. Also, if I force a broadcast join on the smaller
>> table it takes a long time to broadcast, leading me to think that the
>> table might not just be 150 MB in size. What would be a good way to
>> figure out the actual size that Spark is seeing when deciding whether
>> it crosses spark.sql.autoBroadcastJoinThreshold?
Spark seems to think that a particular broadcast variable is large in size
I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters, but the size of the larger table is close to 2 TB. What I have noticed is that if I don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these operations do a SortMergeJoin instead of a broadcast join. But the smaller table shouldn't be anywhere near that big. I wrote the smaller table to an S3 folder and it took only 12.6 MB of space. I did some operations on the smaller table so the shuffle size appears on the Spark History Server, and the size in memory seemed to be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the smaller table it takes a long time to broadcast, leading me to think that the table might not just be 150 MB in size. What would be a good way to figure out the actual size that Spark is seeing when deciding whether it crosses spark.sql.autoBroadcastJoinThreshold?
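Rather than raising spark.sql.autoBroadcastJoinThreshold globally, the small side of one specific join can be broadcast with an explicit hint. A sketch with hypothetical table and column names:

```scala
// Sketch: forcing a broadcast of the small side for a single join instead of
// changing the global threshold. largeDf, smallDf, and "id" are placeholders.
import org.apache.spark.sql.functions.broadcast

val joined = largeDf.join(broadcast(smallDf), Seq("id"))

// The physical plan should now show BroadcastHashJoin instead of SortMergeJoin.
joined.explain()
```

The hint only changes the strategy for this join; if the broadcast itself is slow, that still points to the small table's materialized size being much larger than its compressed on-disk footprint (columnar Parquet compresses far better than deserialized JVM rows).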
Re: java.lang.UnsupportedOperationException: No Encoder found for Set[String]
We are using Spark 2.2.0. Is it possible to bring the ExpressionEncoder from 2.3.0 and related classes into my code base and use them? The changes in ExpressionEncoder between 2.2.0 and 2.3.0 look small, but there might be many other classes underneath that have changed.

On Thu, Aug 16, 2018 at 5:23 AM, Manu Zhang wrote:
> Hi,
>
> It's added since Spark 2.3.0.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala#L180
>
> Regards,
> Manu Zhang
>
> On Thu, Aug 16, 2018 at 9:59 AM V0lleyBallJunki3 wrote:
>>
>> Hello,
>> I am using Spark 2.2.2 with Scala 2.11.8. I wrote a short program:
>>
>> val spark = SparkSession.builder().master("local[4]").getOrCreate()
>>
>> case class TestCC(i: Int, ss: Set[String])
>>
>> import spark.implicits._
>> import spark.sqlContext.implicits._
>>
>> val testCCDS = Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("xx", "XYZ"))).toDS()
>>
>> I get:
>> java.lang.UnsupportedOperationException: No Encoder found for Set[String]
>> - field (class: "scala.collection.immutable.Set", name: "ss")
>> - root class: "TestCC"
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:632)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:455)
>>   at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:455)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:626)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:614)
>>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>>
>> To the best of my knowledge, implicit support for Set was added in
>> Spark 2.2. Am I missing something?
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
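For readers hitting the same error on Spark 2.2 (where the built-in Set encoder from 2.3.0 is not available), two common workarounds are to model the field as Seq[String], or to fall back to a kryo-based encoder. A sketch, not the thread participants' own fix; the case-class names are illustrative:

```scala
// Sketch of two workarounds for "No Encoder found for Set[String]" on Spark 2.2.
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

// Workaround 1: use Seq[String] in the schema and convert at the boundary.
// The Dataset keeps a real, queryable array column.
case class TestSeq(i: Int, ss: Seq[String])
val ds1 = Seq(TestSeq(1, Set("SS", "Salil").toSeq),
              TestSeq(2, Set("xx", "XYZ").toSeq)).toDS()

// Workaround 2: keep Set[String] but serialize whole rows with kryo.
// The Dataset's schema becomes a single opaque binary column, so the
// fields are no longer individually queryable with SQL.
case class TestSet(i: Int, ss: Set[String])
implicit val setEnc: Encoder[TestSet] = Encoders.kryo[TestSet]
val ds2 = Seq(TestSet(1, Set("a")), TestSet(2, Set("b"))).toDS()
```

Workaround 1 is usually preferable when the downstream code can tolerate duplicates being possible in principle, since it preserves a usable columnar schema; upgrading to Spark 2.3+ removes the need for either.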