Hello,

Here is my problem:


A) I have data in csv with some boolean columns;

unfortunately, the values in these columns are 't' or 'f' (single letter);

this is an artifact (from Redshift) that I cannot control.


B) I need to create a spark dataframe from this data;

for that, I create a Hive DB and a temp Hive table

and then SELECT * from it, like this:


    sql_str = """SELECT * FROM {db}.{s}_{t} """.format(

                 db=hive_db_name, s=schema, t=table)

    df = sql_cxt.sql(sql_str)


This works, I can print df and it gives me all my columns with correct data
types.

But:


C) If I create the table like this:


    CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{schema}_{table}({cols})


    ROW FORMAT DELIMITED


    FIELDS TERMINATED BY '|t'


    STORED AS TEXTFILE

    LOCATION ...


, this converts all my 't' and 'f' to Nulls.



So:


D) I found out about LazySimpleSerDe that presumably must do what I mean
(convert t and f to true and false on the fly). From ```
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties```
(quote):


    """

    hive.lazysimple.extended_boolean_literal

    Default Value: false

    Added in: Hive 0.14 with HIVE-3635

    LazySimpleSerDe uses this property to determine

    if it treats 'T', 't', 'F', 'f', '1', and '0' as extended,

    legal boolean literals, in addition to 'TRUE' and 'FALSE'.

    The default is false, which means only 'TRUE' and 'FALSE'

    are treated as legal boolean literals.

    """


According to this (or at least so I think), I now create a table in Hive DB
like this:





    create_table_sql = """

        CREATE EXTERNAL TABLE IF NOT EXISTS
{db_name}.{schema}_{table}({cols})

        ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

        WITH SERDEPROPERTIES ("separatorChar" = "\|")

        STORED AS TEXTFILE

        LOCATION '{loc}'

        TBLPROPERTIES ('hive.lazysimple.extended_boolean_literal'='true')

        """.format(db_name=hive_db_name,

                   schema=schema,

                   table=table,

                   cols=",\n".join(cols),

                   loc=location)


    return sql_cxt.sql(create_table_sql)





This does create a table,

I can again see all the columns with proper data types,

the df.count() is correct, but df.head(3)

gives me all values for my boolean columns == Null.


(:___


____________________________________________

I tried for hours different variants for my CREATE TABLE...


* with or without SERDEPROPERTIES,

* with or without TBLPROPERTIES,

* with "FIELDS TERMINATED BY..." or without,


etc.


All give me either


- Null in place of 't' and 'f', or

- an empty df (nothing from df.head(5)), or

- a syntax error, or

- some 100 pages of Java exceptions.


The real problem is, I would say, that there is no single example of CREATE
TABLE with LazySimpleSerDe

that does the job that is described in the docs.


I would really, really appreciate your help or any ideas. I pulled out
almost all my hair.


Thank you in advance!

Reply via email to