Actually, Hive SQL is largely a superset of Spark SQL, so data types may not be the issue here.

If, after creating the DataFrame, I explicitly create the table as a Hive
Parquet table through Spark, Hive sees it, and you can see it with its data
in the Spark thrift server (you are basically using the Hive thrift server
under the bonnet).
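
For reference, a minimal sketch of that first path (Spark 1.6 style; the
database and table names are placeholders, and it assumes a Hive-enabled
Spark build):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")

// saveAsTable writes the Parquet files AND registers the table in the
// Hive metastore, which is why Hive and the thrift server see the data.
df.write.mode("overwrite").format("parquet").saveAsTable("test.ll_18740868")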

If instead I let Spark write plain Parquet files with

df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")

then Hive does not seem to see the data when an external Hive table is
created on top of them!
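
One thing worth ruling out before calling it a bug (the DDL below is only an
illustration, run through the same HiveContext as above): Hive's Parquet
SerDe resolves columns by name rather than by position by default, so if the
external table's column names do not match the field names stored inside the
Parquet files, every column comes back NULL even though the row count looks
right.

// Illustrative only: these DDL names must equal the field names inside
// the files (here the C0..C8 that Spark inferred from a headerless CSV).
hiveContext.sql("""
  CREATE EXTERNAL TABLE ll_18740868_ext (
    C0 STRING, C1 STRING, C2 STRING, C3 STRING, C4 STRING,
    C5 STRING, C6 STRING, C7 STRING, C8 STRING)
  STORED AS PARQUET
  LOCATION '/user/hduser/ll_18740868.parquet'
""")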

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 July 2016 at 11:52, Chanh Le <giaosu...@gmail.com> wrote:

> I agree with you. Maybe there is some change to a data type in Spark that
> Hive does not yet support, or they are not compatible, which is why it
> shows NULL.
>
>
> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> I think it is still a Hive problem because Spark thrift server is
> basically a Hive thrift server.
>
> An acid test would be to log in to the Hive CLI or the Hive thrift server
> (you are actually using the Hive thrift server on port 10000 when you use
> the Spark thrift server) and see whether the data is there.
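>
> Something like this from the shell (the URL and table name are only
> illustrative):
>
> beeline -u jdbc:hive2://localhost:10000
> 0: jdbc:hive2://localhost:10000> select * from topic limit 10;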
>
> When you use Spark it should work.
>
> I still believe it is a bug in Hive.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
>
>
> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi Mich,
>> Thanks for the support. Here are some of my thoughts.
>>
>> BTW can you log in to thrift server and do select * from <TABLE> limit 10
>>
>> Do you see the rows?
>>
>>
>> Yes, I can see the rows, but all the field values are NULL.
>>
>> Works OK for me
>>
>>
>> You only tested the number of rows. In my case I checked as well: it shows
>> 117 rows, but the problem is that the data is NULL in all fields.
>>
>>
>> As I see it, the issue is that a Hive table created as external on Parquet
>> files somehow does not see the data. Rows are all NULLs.
>>
>> I don't think this is specific to the thrift server. Just log in to Hive
>> and see whether you can read the data from your table topic created as
>> external.
>>
>> I noticed the same issue.
>>
>>
>> I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.
>>
>>
>> And the point is: with the same Parquet file (which I converted from CSV
>> to Parquet), why can it be read in Spark but not in STS?
>>
>> One more thing: with the same file and the same method of creating the
>> table in STS, it works fine in Spark 1.6.1.
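>>
>> One experiment I can try in the STS session (a guess to isolate the
>> reader, not a fix): turn off the native Parquet conversion and the
>> vectorized reader, which changed around Spark 2.0, and see whether the
>> NULLs go away:
>>
>> 0: jdbc:hive2://localhost:10000> SET spark.sql.hive.convertMetastoreParquet=false;
>> 0: jdbc:hive2://localhost:10000> SET spark.sql.parquet.enableVectorizedReader=false;
>> 0: jdbc:hive2://localhost:10000> select * from topic limit 10;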
>>
>>
>> Regards,
>> Chanh
>>
>>
>>
>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> BTW can you log in to thrift server and do select * from <TABLE> limit 10
>>
>> Do you see the rows?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Works OK for me
>>
>> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>
>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>>
>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet").count
>> res2: Long = 3651
>>
>> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>>
>> scala> ff.take(5)
>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,])
>>
>> Now in Zeppelin I create an external table on it and read it:
>>
>> <image.png>
>>
>>
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
>> I continued to debug:
>>
>> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet, range: 0-3118, partition values: [empty row]
>>
>> vs the OK one:
>>
>> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-00000-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet, range: 0-6050, partition values: [2016-07-24-18,30206]
>>
>> I attached 2 files.
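>>
>> To compare what Spark itself sees in the two footers (paths taken from the
>> logs above), I can print both schemas; a field name or type that only the
>> broken file has should stand out:
>>
>> val bad  = spark.read.parquet("/Users/giaosudau/Documents/Topics.parquet")
>> val good = spark.read.parquet("/Users/giaosudau/data_example/FACT_ADMIN_HOURLY")
>> bad.printSchema()   // the file STS returns NULLs for
>> good.printSchema()  // the file that reads fine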
>>
>>
>>
>>
>>
>>
>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> For further investigation I attached the file that I converted from CSV to Parquet.
>>
>> Spark code:
>>
>> I load from the CSV file:
>>
>> val df = spark.sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").option("header", "true").option("inferSchema", "true").load("/Users/giaosudau/Downloads/Topics.xls - Sheet 1.csv")
>>
>> I write it out as Parquet:
>>
>> df.write.mode("overwrite").parquet("/Users/giaosudau/Documents/Topics.parquet")
>>
>> It's OK in spark-shell:
>>
>> scala> df.take(5)
>> res22: Array[org.apache.spark.sql.Row] = Array([124,Nghệ thuật & Giải trí,Arts & Entertainment,0,124,1], [53,Scandal,Scandal,124,124,53,2], [54,Showbiz - World,Showbiz-World,124,124,54,2], [52,Âm nhạc,Entertainment-Music,124,124,52,2], [47,Bar - Karaoke - Massage,Bar-Karaoke-Massage-Prostitution,124,124,47,2])
>>
>> When I create the table in STS:
>>
>> 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE topic (TOPIC_ID int, TOPIC_NAME_VN String, TOPIC_NAME_EN String, PARENT_ID int, FULL_PARENT String, LEVEL_ID int) STORED AS PARQUET LOCATION '/Users/giaosudau/Documents/Topics.parquet';
>>
>> But I get all the results as NULL:
>>
>> <Screen Shot 2016-07-29 at 9.42.26 AM.png>
>>
>>
>>
>> I think it's really a bug, right?
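>>
>> Before settling on that, one more check that is independent of both Spark
>> and Hive (assuming parquet-tools is on the PATH; the part file name is the
>> one from my debug log):
>>
>> parquet-tools schema /Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet
>> parquet-tools head -n 5 /Users/giaosudau/Documents/Topics.parquet/part-r-00000-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet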
>>
>> Regards,
>> Chanh
>>
>>
>> <Topics.parquet>
>>
>>
>> <Topics.xls - Sheet 1.csv>
>>
>>
>>
>>
>>
>> On Jul 28, 2016, at 4:25 PM, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I have a problem when I create an external table in the Spark Thrift
>> Server (STS) and query the data.
>>
>> Scenario:
>> Spark 2.0
>> Alluxio 1.2.0
>> Zeppelin 0.7.0
>>
>> STS start script:
>>
>> /home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar --total-executor-cores 35 spark-internal --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.metadb.dir=/user/hive/metadb --conf spark.sql.shuffle.partitions=20
>>
>> I have a file stored in Alluxio at alluxio://master2:19998/etl_info/TOPIC.
>>
>> Then I create a table in STS with:
>>
>> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, topic_name_en String, parent_id int, full_parent String, level_id int) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>>
>> To compare STS with Spark, I create a temp table named topics:
>>
>> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>>
>> Then I query and compare; the screenshot below shows the two results.
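>>
>> The two queries behind the screenshot are along these lines (reconstructed,
>> since the actual text is in the image):
>>
>> 0: jdbc:hive2://localhost:10000> select * from topic limit 10;
>>
>> and in Spark:
>>
>> spark.sql("select * from topics limit 10").show()
>>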
>> <Screen Shot 2016-07-28 at 4.18.59 PM.png>
>>
>>
>> As you can see, the results are different.
>> Is that a bug, or did I do something wrong?
>>
>> Regards,
>> Chanh
>>
>>
>>
>>
>>
>>
>>
>>
>
>
