It works fine, thanks for the help Michael. Liancheng also told me a trick: use a subquery with LIMIT n. It works in the latest 1.2.0.
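For anyone following along, the LIMIT trick looks roughly like this (a sketch only — "fact", "dim", and the column names are made up; the idea is that wrapping the small side in a LIMIT subquery gives the planner a size bound it can compare against spark.sql.autoBroadcastJoinThreshold):

```scala
// Hypothetical registered tables "fact" and "dim".
// Wrapping the small side in a LIMIT subquery gives the planner a row
// bound, so it can estimate its size and choose a broadcast join.
val joined = sqlContext.sql("""
  SELECT f.*, d.label
  FROM fact f
  JOIN (SELECT * FROM dim LIMIT 100) d
    ON f.dim_id = d.id
""")
```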
BTW, it looks like the broadcast optimization isn't applied if I do a left join instead of an inner join. Is that true? How can I make it work for left joins?

Cheers,
Jianshi

On Thu, Oct 9, 2014 at 3:10 AM, Michael Armbrust <mich...@databricks.com> wrote:

> Thanks for the input. We purposefully made sure that the config option
> did not make it into a release, as it is not something that we are willing
> to support long term. That said, we'll try to make this easier in the
> future, either through hints or better support for statistics.
>
> In this particular case you can get what you want by registering the
> tables as external tables and setting a flag. Here's a helper function to
> do what you need:
>
> /**
>  * Sugar for creating a Hive external table from a parquet path.
>  */
> def createParquetTable(name: String, file: String): Unit = {
>   import org.apache.spark.sql.hive.HiveMetastoreTypes
>
>   val rdd = parquetFile(file)
>   val schema = rdd.schema.fields.map(f =>
>     s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>   val ddl = s"""
>     |CREATE EXTERNAL TABLE $name (
>     |  $schema
>     |)
>     |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
>     |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
>     |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
>     |LOCATION '$file'""".stripMargin
>   sql(ddl)
>   setConf("spark.sql.hive.convertMetastoreParquet", "true")
> }
>
> You'll also need to run this to populate the statistics:
>
> ANALYZE TABLE tableName COMPUTE STATISTICS noscan;
>
> On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
>> Ok, currently there's cost-based optimization; however, Parquet statistics
>> are not implemented...
>>
>> What's a good way to join a big fact table with several tiny
>> dimension tables in Spark SQL (1.1)?
>>
>> I wish we could give a user hint for the join.
>>
>> Jianshi
>>
>> On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>
>>> It looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
>>> merged into master?
>>>
>>> I cannot find spark.sql.hints.broadcastTables in the latest master, but
>>> it's in the following patch:
>>>
>>> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>>>
>>> Jianshi
>>>
>>> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>
>>>> Yes, it looks like it can only be controlled by the
>>>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird
>>>> to me.
>>>>
>>>> How am I supposed to know the exact size of a table in bytes? Letting me
>>>> specify the preferred join algorithm would be better, I think.
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Have you looked at SPARK-1800?
>>>>>
>>>>> e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>
>>>>>> I cannot find it in the documentation. And I have a dozen dimension
>>>>>> tables to (left) join...
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
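[Editor's note: for readers skimming the archive, Michael's helper would be used roughly as follows. This is a sketch under stated assumptions — the table names and paths are hypothetical, and it assumes a HiveContext is in scope providing parquetFile, sql, and setConf as in the helper above.]

```scala
// Hypothetical paths and table names; assumes createParquetTable (defined
// in Michael's reply above) is available on a HiveContext.
createParquetTable("dim_users", "hdfs:///warehouse/dim_users.parquet")
createParquetTable("fact_events", "hdfs:///warehouse/fact_events.parquet")

// Populate table-level statistics so the planner can see that dim_users
// is small ("noscan" estimates from file sizes without reading the data).
sql("ANALYZE TABLE dim_users COMPUTE STATISTICS noscan")

// With statistics in place, the join against the small table should be
// planned as a broadcast join when its estimated size is under
// spark.sql.autoBroadcastJoinThreshold.
val result = sql("""
  SELECT e.*, u.name
  FROM fact_events e
  JOIN dim_users u ON e.user_id = u.id
""")
```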