[jira] [Created] (AIRFLOW-2441) Fix bugs in HiveCliHook.load_df

Kengo Seki (JIRA) Tue, 08 May 2018 18:08:37 -0700

Kengo Seki created AIRFLOW-2441:
-----------------------------------

             Summary: Fix bugs in HiveCliHook.load_df
                 Key: AIRFLOW-2441
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2441
             Project: Apache Airflow
          Issue Type: Bug
          Components: hive_hooks, hooks
            Reporter: Kengo Seki
            Assignee: Kengo Seki



{{HiveCliHook.load_df}} has several bugs and currently does not work.

1. Executing it fails as follows:

{code}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

In [3]: from airflow.hooks.hive_hooks import HiveCliHook

In [4]: hook = HiveCliHook()
[2018-05-08 06:38:19,211] {base_hook.py:85} INFO - Using connection to: localhost

In [5]: hook.load_df(df, "t")

(snip)

TypeError: "delimiter" must be string, not unicode
{code}

To solve this, the "delimiter" parameter should be encoded using the "encoding" 
parameter. The latter is declared but currently unused.
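
The encoding step could be sketched as below. This is only an illustration, not the actual patch: {{normalize_delimiter}} is a hypothetical helper name, and the sketch assumes the error comes from Python 2's csv writer, which accepts only a byte string as delimiter. On Python 3 the value is returned unchanged.

```python
import sys


def normalize_delimiter(delimiter, encoding="utf8"):
    """Hypothetical helper: return a delimiter the csv writer accepts.

    On Python 2, str is a byte string, so a unicode delimiter is
    encoded with the given encoding. On Python 3 the delimiter is
    already a text str and is returned as-is.
    """
    if sys.version_info[0] == 2 and not isinstance(delimiter, str):
        return delimiter.encode(encoding)
    return delimiter


# The result can then be passed as the "sep" argument of DataFrame.to_csv.
sep = normalize_delimiter(u",", "utf8")
```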

2. For a small dataset, it loads an empty file into Hive:

{code}
In [1]: import pandas as pd
   ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
   ...: from airflow.hooks.hive_hooks import HiveCliHook
   ...: hook = HiveCliHook()
   ...: hook.load_df(df, "t")
   ...:

(snip)

[2018-05-08 20:46:48,883] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:46:49,448] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
{code}

{code}
hive> SELECT count(*) FROM t;

(snip)

OK
0
Time taken: 4.962 seconds, Fetched: 1 row(s)
{code}

This is because the file contents are still in the write buffer when the LOAD 
DATA statement is executed. The buffer should be flushed first, just as 
{{HiveCliHook.run_cli}} does.
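
The buffering effect can be reproduced without Hive. The sketch below is a standalone illustration using only the standard library; {{write_and_measure}} is a hypothetical name and not part of Airflow. It writes a small payload to a {{NamedTemporaryFile}} and compares the size on disk before and after {{flush()}}.

```python
import os
import tempfile


def write_and_measure(payload):
    """Return the on-disk file size before and after flushing.

    A small payload normally stays in the file object's buffer, so a
    reader that opens the file by name before flush() may see less
    data than was written (often nothing at all).
    """
    with tempfile.NamedTemporaryFile(mode="wb") as f:
        f.write(payload)
        size_before = os.path.getsize(f.name)  # buffered bytes are not counted
        f.flush()                              # push the buffer to the file
        size_after = os.path.getsize(f.name)
    return size_before, size_after


before, after = write_and_measure(b"foo\nbar\nbaz\n")
print(before, after)
```

Any external reader of the file, such as a LOAD DATA statement, should therefore run only after the {{flush()}} call.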

3. Even with fixes for #1 and #2, unexpected data is loaded into Hive:

{code}
In [1]: import pandas as pd
   ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
   ...: from airflow.hooks.hive_hooks import HiveCliHook
   ...: hook = HiveCliHook()
   ...: hook.load_df(df, "t")
   ...:

(snip)

[2018-05-08 20:57:17,467] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:57:18,163] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=21, rawDataSize=0]
{code}

{code}
hive> SELECT * FROM t;
OK

0
1
2
Time taken: 2.317 seconds, Fetched: 4 row(s)
{code}

This is because {{pandas.DataFrame.to_csv}} writes the row index to the file 
by default.
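
The difference can be seen directly from the {{to_csv}} output. The sketch below is only an illustration and not the actual patch. Passing {{index=False}} addresses the row index described above; passing {{header=False}} as well is an assumption on my side, made because a plain text load would otherwise treat the header line as a data row.

```python
import pandas as pd

df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

# Default call: a header line plus the row index as the first field.
default_lines = df.to_csv().splitlines()

# Only the column values, one row per line.
clean_lines = df.to_csv(index=False, header=False).splitlines()

print(default_lines)  # [',c', '0,foo', '1,bar', '2,baz']
print(clean_lines)    # ['foo', 'bar', 'baz']
```

In the default output the first field of each line is empty, 0, 1 and 2, which matches the rows returned by the SELECT above.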



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
