Kengo Seki created AIRFLOW-2441:
-----------------------------------

             Summary: Fix bugs in HiveCliHook.load_df
                 Key: AIRFLOW-2441
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2441
             Project: Apache Airflow
          Issue Type: Bug
          Components: hive_hooks, hooks
            Reporter: Kengo Seki
            Assignee: Kengo Seki

{{HiveCliHook.load_df}} has several bugs and currently does not work.

1. Executing it fails as follows:

{code}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

In [3]: from airflow.hooks.hive_hooks import HiveCliHook

In [4]: hook = HiveCliHook()
[2018-05-08 06:38:19,211] {base_hook.py:85} INFO - Using connection to: localhost

In [5]: hook.load_df(df, "t")

(snip)

TypeError: "delimiter" must be string, not unicode
{code}

To fix this, the "delimiter" parameter should be encoded using the "encoding" parameter, which is currently declared but unused.

2. For a small dataset, it loads an empty file into Hive:

{code}
In [1]: import pandas as pd
   ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
   ...: from airflow.hooks.hive_hooks import HiveCliHook
   ...: hook = HiveCliHook()
   ...: hook.load_df(df, "t")
   ...:

(snip)

[2018-05-08 20:46:48,883] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:46:49,448] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
{code}

{code}
hive> SELECT count(*) FROM t;
(snip)
OK
0
Time taken: 4.962 seconds, Fetched: 1 row(s)
{code}

This is because the file contents are still in the write buffer when the LOAD DATA statement is executed. The buffer should be flushed first, just as {{HiveCliHook.run_cli}} does.

3. Even with fixes for #1 and #2, unexpected data is loaded into Hive:

{code}
In [1]: import pandas as pd
   ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
   ...: from airflow.hooks.hive_hooks import HiveCliHook
   ...: hook = HiveCliHook()
   ...: hook.load_df(df, "t")
   ...:

(snip)

[2018-05-08 20:57:17,467] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:57:18,163] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=21, rawDataSize=0]
{code}

{code}
hive> SELECT * FROM t;
OK
0
1
2
Time taken: 2.317 seconds, Fetched: 4 row(s)
{code}

This is because {{pandas.DataFrame.to_csv}} writes the row index to the file by default.

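The fixes for #2 and #3 can be illustrated with a minimal standalone sketch that involves neither Airflow nor Hive. The helper name {{dump_df_for_load}} and the file name {{t.csv}} are hypothetical and are not part of {{HiveCliHook}}. The sketch assumes Python 3, where a {{str}} delimiter is accepted, so the encoding problem from #1 does not arise and is not shown. Passing {{header=False}} is also an assumption: {{to_csv}} writes a header line by default in addition to the row index, and a data file for LOAD DATA presumably should contain neither.

```python
import os
import tempfile

import pandas as pd


def dump_df_for_load(df, f, delimiter=","):
    """Write df to the open text file f as plain delimited rows.

    Hypothetical helper for illustration only; not the HiveCliHook code.
    """
    # index=False drops the row index (bug 3). header=False also drops the
    # column-name line, which to_csv would otherwise write by default.
    df.to_csv(f, sep=delimiter, index=False, header=False)
    # Flush Python's write buffer so that anything reading the file by
    # name while it is still open sees the data (bug 2).
    f.flush()


df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "t.csv")
    with open(path, "w", newline="") as f:
        dump_df_for_load(df, f)
        # The writer is still open here, as it would be while a LOAD DATA
        # statement runs against the temporary file.
        size_while_open = os.path.getsize(path)
        with open(path, newline="") as reader:
            content_while_open = reader.read()

print(size_while_open)
print(content_while_open.splitlines())
```

Without the {{f.flush()}} call, a small frame like this one may still be sitting in the buffer when the file is read by name, which matches the {{totalSize=0}} seen in #2.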