[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhangxiongfei updated SPARK-6921:
---------------------------------
    Comment: was deleted

(was: I think the root cause may be the following:

1) When "SQLContext.parquetFile()" is invoked, an instance of the case class "ParquetRelation2" is created:

  def parquetFile(paths: String*): DataFrame =
    baseRelationToDataFrame(parquet.ParquetRelation2(paths, Map.empty)(this))

From this point on, the field "val sqlContext: SQLContext" of the "ParquetRelation2" instance is no longer the same instance as the original one instantiated in the Spark Shell.

2) Trying to set the Hadoop configuration with

  sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)

does NOT change the "val sqlContext: SQLContext" field of the "ParquetRelation2" instance; it only changes the original sqlContext.

3) When the current DataFrame is saved as Parquet files, "saveAsParquetFile()" uses the cloned "sqlContext" held by "ParquetRelation2" to write the DataFrame, so the "fs.local.block.size" configuration is still the default 32 MB.)

> Spark SQL API "saveAsParquetFile" will output tachyon file with different
> block size
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-6921
>                 URL: https://issues.apache.org/jira/browse/SPARK-6921
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: zhangxiongfei
>            Priority: Blocker
>
> I ran the code below in the Spark Shell to access Parquet files in Tachyon.
> 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
> val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, set "fs.local.block.size" to 256 MB to make sure that the block
> size of the output files in Tachyon is 256 MB:
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> 3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
> ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the code ran successfully, the output Parquet files were stored in
> Tachyon, but these files have different block sizes. Below is the information on
> those files in the path
> "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
>
> File Name             Size       Block Size  In-Memory  Pin  Creation Time
> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
>
> It seems that the API saveAsParquetFile does not distribute/broadcast the
> Hadoop configuration to the executors the way other APIs such as
> saveAsTextFile do. The configuration "fs.local.block.size" only takes effect
> on the Driver.
> If I set that configuration before loading the Parquet files, the problem goes away.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
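The workaround mentioned in the last sentence of the report can be sketched as follows. This is a minimal Spark 1.3 sketch, not code from the report: it reuses the same Tachyon paths and the `sc`/`sqlContext` names from the shell session, simply reordered so that "fs.local.block.size" is set on the driver's Hadoop configuration before the first parquetFile() call. It requires a running Spark deployment with Tachyon, so it is illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BlockSizeWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("block-size-workaround"))
    val sqlContext = new SQLContext(sc)

    // 1) Set the desired block size (256 MB) BEFORE loading any Parquet files,
    //    so the value is already in place when the relation is first built.
    sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456L)

    // 2) Only then load the Parquet files from Tachyon ...
    val ta3 = sqlContext.parquetFile(
      "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")

    // 3) ... and write them back; per the report, the output files
    //    should now be created with 256 MB blocks instead of 32 MB.
    ta3.saveAsParquetFile(
      "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
  }
}
```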