Thanks Michael. My bad regarding Hive table primary keys.
I have one big 140 GB HDFS file with an external Hive table defined on it. The
table is not partitioned. When I read the external Hive table using
sqlContext.sql, how does Spark decide the number of partitions to create for
that DataFrame?
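For a non-partitioned table backed by a single HDFS file, the number of input partitions roughly follows the number of HDFS block splits. A back-of-the-envelope sketch in plain Python (assuming the default 128 MB HDFS block size; check dfs.blocksize on your cluster, as older clusters often use 64 MB):

```python
import math

# Rough estimate of the input splits (hence DataFrame partitions)
# Spark would create when scanning one large HDFS file.
# Assumption: 128 MB HDFS block size (dfs.blocksize default).
file_size_gb = 140
block_size_mb = 128

num_splits = math.ceil(file_size_gb * 1024 / block_size_mb)
print(num_splits)  # 1120 splits for a 140 GB file
```

The actual count can differ slightly depending on the input format and on settings such as the minimum split size, but this gives the right order of magnitude.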
There is no such thing as primary keys in the Hive metastore, but Spark SQL
does support partitioned hive tables:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
DataFrameWriter also has a partitionBy method.
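For illustration, a minimal sketch of partitionBy on write (the output path and the "year" column are hypothetical placeholders; assumes a sqlContext is already available):

```python
# Sketch only: write a DataFrame out partitioned by a column.
# Each distinct value of "year" becomes its own output subdirectory,
# so later reads that filter on "year" can prune partitions.
df = sqlContext.sql("select * from hivetable1")
(df.write
   .partitionBy("year")
   .mode("overwrite")
   .parquet("/tmp/hivetable1_partitioned"))
```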
On Thu, Aug 20, 2015 at 7:29 AM,
Hi
I have a question regarding data frame partition. I read a hive table from
spark and following spark api converts it as DF.
test_df = sqlContext.sql("select * from hivetable1")
How does Spark decide the partitioning of test_df? Is there a way to partition
test_df based on some column while reading?
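As far as I know, sqlContext.sql offers no way to choose the partitioning at read time; the usual approach is to repartition after reading. A sketch, assuming a Spark version whose repartition accepts columns (the column name "id" is hypothetical):

```python
# Sketch: hash-partition the DataFrame on a column after reading.
test_df = sqlContext.sql("select * from hivetable1")

# 200 partitions, co-locating rows with the same (hypothetical) "id".
# Calling repartition with only columns instead falls back to the
# spark.sql.shuffle.partitions setting for the partition count.
repartitioned = test_df.repartition(200, "id")
print(repartitioned.rdd.getNumPartitions())
```

Note that this triggers a full shuffle of the 140 GB of data, so it only pays off if downstream operations (joins, aggregations on that column) reuse the partitioning.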