Hi,
I have a question about using the R API for Spark (SparkR). We load some data 
from Oracle (through JDBC) and register it as a temporary table in Spark.
I see a lot of shuffling, even though our joins are between large and small 
tables, so I probably need to broadcast the small tables.
Normally, automatic broadcasting happens for tables up to 10 MB (the default 
value of spark.sql.autoBroadcastJoinThreshold), but Spark can only tell whether 
a table is small enough to broadcast from its statistics. These are statistics 
known to the Hive metastore, so I assume that for a temporary table (registered 
from external files or, in this case, an Oracle table) there will not be any 
statistics.
Is there any way to compute the statistics for a temporary table so that Spark 
will know whether it needs to auto-broadcast?


Thanks!
