Cached data not showing up in Storage tab

2018-10-16 Thread Venkat Dabri
When I cache a DataFrame, the data never shows up in the
Storage tab. The Storage tab is always blank. I have tried it in
Zeppelin as well as in spark-shell.

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
However, I have several running applications in production that do
show the data in the cache. I am using Scala and Spark 2.2.1 on EMR.
Are there any workarounds to see the data in the cache?

The problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
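
One workaround sketch that avoids the UI entirely, assuming Spark 2.1+ so that Dataset.storageLevel is available, and assuming the classCount DataFrame from the snippet above has already been persisted and counted; getRDDStorageInfo is a developer API, so treat its output as indicative only:

// Sketch: verify caching programmatically when the Storage tab stays blank.
// Assumes `classCount` was persisted and materialized with count, as above.

// Dataset.storageLevel (Spark 2.1+) reports the level the DataFrame was persisted
// with; a non-NONE level means the persist call was registered.
println(classCount.storageLevel)

// SparkContext.getRDDStorageInfo is a developer API listing RDDs with cached blocks.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, ${info.memSize} bytes in memory")
}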




Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
The same problem is mentioned here :
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri  wrote:
>
> I did try that mechanism before, but the data never shows up in the
> Storage tab. The Storage tab is always blank. I have tried it in
> Zeppelin as well as in spark-shell.
>
> scala> val classCount = spark.read.parquet("s3:// /classCount")
> scala> classCount.persist
> scala> classCount.count
>
> Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
> However, I have several running applications in production that do
> show the data in the cache. I am using Scala and Spark 2.2.1 on EMR.
> Are there any workarounds to see the data in the cache?
> On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek  wrote:
> >
> > In your program, persist the smaller table and use count to force it to
> > materialize. Then, in the Spark UI, go to the Storage tab. The size of your
> > table as Spark sees it should be displayed there. Out of curiosity, what
> > version/language of Spark are you using?
> >
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri  wrote:
> >>
> >> I am trying to do a broadcast join on two tables. The size of the
> >> smaller table will vary based upon the parameters, but the size of the
> >> larger table is close to 2TB. What I have noticed is that if I don't
> >> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
> >> operations do a SortMergeJoin instead of a broadcast join. But the
> >> size of the smaller table shouldn't be this big at all. I wrote the
> >> smaller table to an S3 folder and it took only 12.6 MB of space. I
> >> did some operations on the smaller table so that the shuffle size
> >> appears on the Spark History Server, and the size in memory seemed to
> >> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> >> smaller table, it takes a long time to broadcast, leading me to think
> >> that the table might not be just 150 MB in size. What would be a good
> >> way to figure out the actual size that Spark sees when deciding
> >> whether it crosses spark.sql.autoBroadcastJoinThreshold?
> >>




Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
I did try that mechanism before, but the data never shows up in the
Storage tab. The Storage tab is always blank. I have tried it in
Zeppelin as well as in spark-shell.

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
However, I have several running applications in production that do
show the data in the cache. I am using Scala and Spark 2.2.1 on EMR.
Are there any workarounds to see the data in the cache?
On Mon, Oct 15, 2018 at 2:53 PM Dillon Dukek  wrote:
>
> In your program, persist the smaller table and use count to force it to
> materialize. Then, in the Spark UI, go to the Storage tab. The size of your
> table as Spark sees it should be displayed there. Out of curiosity, what
> version/language of Spark are you using?
>
> On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri  wrote:
>>
>> I am trying to do a broadcast join on two tables. The size of the
>> smaller table will vary based upon the parameters, but the size of the
>> larger table is close to 2TB. What I have noticed is that if I don't
>> set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
>> operations do a SortMergeJoin instead of a broadcast join. But the
>> size of the smaller table shouldn't be this big at all. I wrote the
>> smaller table to an S3 folder and it took only 12.6 MB of space. I
>> did some operations on the smaller table so that the shuffle size
>> appears on the Spark History Server, and the size in memory seemed to
>> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
>> smaller table, it takes a long time to broadcast, leading me to think
>> that the table might not be just 150 MB in size. What would be a good
>> way to figure out the actual size that Spark sees when deciding
>> whether it crosses spark.sql.autoBroadcastJoinThreshold?
>>




Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Venkat Dabri
I am trying to do a broadcast join on two tables. The size of the
smaller table will vary based upon the parameters, but the size of the
larger table is close to 2TB. What I have noticed is that if I don't
set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
operations do a SortMergeJoin instead of a broadcast join. But the
size of the smaller table shouldn't be this big at all. I wrote the
smaller table to an S3 folder and it took only 12.6 MB of space. I
did some operations on the smaller table so that the shuffle size
appears on the Spark History Server, and the size in memory seemed to
be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
smaller table, it takes a long time to broadcast, leading me to think
that the table might not be just 150 MB in size. What would be a good
way to figure out the actual size that Spark sees when deciding
whether it crosses spark.sql.autoBroadcastJoinThreshold?
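
One sketch for checking the number the planner actually compares against the threshold, assuming Spark 2.3+ for the no-argument stats accessor (on 2.2.x the stats method takes a conf argument, so that call needs adjusting); the S3 paths and the join column are placeholders, not the real ones:

// Sketch: inspect the optimizer's size estimate for the smaller table and,
// if it looks sane, force the broadcast explicitly instead of raising the threshold.
import org.apache.spark.sql.functions.broadcast

val smallDf = spark.read.parquet("s3://bucket/small-table")   // placeholder path
val largeDf = spark.read.parquet("s3://bucket/large-table")   // placeholder path

// Size estimate the planner uses for the broadcast decision (Spark 2.3+ form).
println(smallDf.queryExecution.optimizedPlan.stats.sizeInBytes)

// The threshold it is compared against.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Explicit broadcast hint, independent of the threshold ("id" is a placeholder key).
val joined = largeDf.join(broadcast(smallDf), Seq("id"))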




Re: java.lang.UnsupportedOperationException: No Encoder found for Set[String]

2018-08-16 Thread Venkat Dabri
We are using Spark 2.2.0. Is it possible to bring the
ExpressionEncoder from 2.3.0, and the related classes, into my code
base and use them? I see the changes in ExpressionEncoder between
2.3.0 and 2.2.0 are not large, but there may be many other classes
underneath that have changed.
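
For anyone stuck on 2.2 in the meantime, two stop-gap sketches (neither gives the 2.3 behaviour; TestCCSeq and the val names are invented for the sketch, and TestCC mirrors the example quoted below):

// Sketch of two stop-gap options on Spark 2.2, where the built-in Set encoder
// is not yet available.
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

case class TestCC(i: Int, ss: Set[String])

// Option 1: kryo-encode the whole case class. The Dataset becomes a single
// binary column, so per-field columns and predicate pushdown are lost.
val kryoDS = spark.createDataset(
  Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("xx", "XYZ")))
)(Encoders.kryo[TestCC])

// Option 2: keep a columnar schema by storing a Seq and converting at the edges.
case class TestCCSeq(i: Int, ss: Seq[String])
val seqDS = Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("xx", "XYZ")))
  .map(t => TestCCSeq(t.i, t.ss.toSeq))
  .toDS()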

On Thu, Aug 16, 2018 at 5:23 AM, Manu Zhang  wrote:
> Hi,
>
> It was added in Spark 2.3.0:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala#L180
>
> Regards,
> Manu Zhang
>
> On Thu, Aug 16, 2018 at 9:59 AM V0lleyBallJunki3 
> wrote:
>>
>> Hello,
>>   I am using Spark 2.2.2 with Scala 2.11.8. I wrote a short program
>>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder().master("local[4]").getOrCreate()
>>
>> case class TestCC(i: Int, ss: Set[String])
>>
>> import spark.implicits._
>> import spark.sqlContext.implicits._
>>
>> val testCCDS = Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("xx", "XYZ"))).toDS()
>>
>>
>> I get:
>> java.lang.UnsupportedOperationException: No Encoder found for Set[String]
>> - field (class: "scala.collection.immutable.Set", name: "ss")
>> - root class: "TestCC"
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:632)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:455)
>>   at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:455)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:626)
>>   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:614)
>>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>>
>> To the best of my knowledge, implicit support for Set has been added in
>> Spark 2.2. Am I missing something?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>
