Works OK for me

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]

scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")

scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
res2: Long = 3651

scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
scala> ff.take(5)
res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,])

Now in Zeppelin create an external table and read it
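
Something along these lines (a minimal sketch: the table name ll_18740868 is illustrative, the column list follows the inferred C0..C8 schema above, and it assumes sqlContext is Hive-enabled):

sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS ll_18740868 (
    C0 string, C1 string, C2 string, C3 string, C4 string,
    C5 string, C6 string, C7 string, C8 string)
  STORED AS PARQUET
  LOCATION '/user/hduser/ll_18740868.parquet'
""")

sqlContext.sql("SELECT * FROM ll_18740868 LIMIT 5").show()

The same DDL and query can also go straight into a %sql paragraph in Zeppelin without the sqlContext.sql wrapper.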

[image: Inline images 2]
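
If the external table ever comes back with NULLs instead of data, one quick check (a minimal sketch, reusing the path and the illustrative table name above) is to compare the schema Spark wrote into the Parquet files with what the metastore holds for the table, since a column name or type mismatch often shows up as NULLs rather than an error:

// schema embedded in the Parquet files by Spark
sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").printSchema()

// schema the metastore has for the external table
sqlContext.sql("DESCRIBE ll_18740868").show()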


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:

> I continued to debug:
>
> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet, range: 0-3118, partition values: [empty row]
>
> vs. the OK one:
>
> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet, range: 0-6050, partition values: [2016-07-24-18,30206]
>
> I attached 2 files.
>
>
>
>
>
>
> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi everyone,
>
> For further investigation, I attached the file that I converted from CSV to Parquet.
>
> Spark Code
>
> I loaded the CSV file:
>
> val df = spark.sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").option("header", "true").option("inferSchema", "true").load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>
> Then I created a Parquet file:
>
> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>
> It’s OK in Spark-Shell
>
> scala> df.take(5)
> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2], [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke - Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>
> When I create a table in STS:
>
> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID int, TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int, FULL_PARENT String, LEVEL_ID int) STORED AS PARQUET LOCATION '/Users/giaosudau/Documents/Topics.parquet';
>
> But all the results come back NULL:
>
> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>
>
>
> I think it's really a bug, right?
>
> Regards,
> Chanh
>
>
> <Topics.parquet>
>
>
> <Topics.xls - Sheet 1.csv>
>
>
>
>
>
> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi everyone,
>
> I have a problem when I create an external table in Spark Thrift Server (STS)
> and query the data.
>
> Scenario:
> Spark 2.0
> Alluxio 1.2.0
> Zeppelin 0.7.0
> STS start script:
>
> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar --total-executor-cores 35 spark-internal --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.metadb.dir=/user/hive/metadb --conf spark.sql.shuffle.partitions=20
>
> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC
>
> Then I create a table in STS with:
>
> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, topic_name_en String, parent_id int, full_parent String, level_id int) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>
> To compare STS with Spark, I create a temp table named topics:
>
> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>
> Then I run queries on both and compare.
> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
>
>
> As you can see, the results are different.
> Is that a bug, or did I do something wrong?
>
> Regards,
> Chanh
>
>
>
>
>
