Bug in Hive Split function (Tested on Hive 0.9 and 0.11)

John Omernik Wed, 09 Oct 2013 07:46:30 -0700

Hello all, I think I have outlined a bug in the hive split function:

Summary: When calling split on a string of data, it only returns all
array items if the last array item has a value. For example, if I have
a string of text delimited by tab with 7 columns, and the first four are
filled but the last three are blank, split will only return a 4-position
array. If any number of "middle" columns are empty, but the last item
still has a value, then it returns the proper number of columns. This
was tested in Hive 0.9 and Hive 0.11.


Data:
(Note: \t represents a tab character, \x09. The line endings should be \n
(UNIX style); I'm not sure what email will do to them.) Basically my data
is 7 lines, each holding the first 7 letters of the alphabet separated by
tabs. On some lines I've left out certain letters, but kept the number of
tabs exactly the same.

input.txt
a\tb\tc\td\te\tf\tg
a\tb\tc\td\te\t\tg
a\tb\t\td\t\tf\tg
\t\t\td\te\tf\tg
a\tb\tc\td\t\t\t
a\t\t\t\te\tf\tg
a\t\t\td\t\t\tg
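
Since email may mangle the tabs and line endings, here is a minimal sketch
for regenerating the file with real tab characters and UNIX newlines. The
names ROWS, render_input and write_input are hypothetical helpers, not part
of the original test; the rows themselves are copied from the listing above.

```python
# Rows copied from the listing above; "\t" here is a real tab character.
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def render_input():
    """Return the file contents, one row per line, each ending in \\n."""
    return "".join(row + "\n" for row in ROWS)


def write_input(path):
    """Write the rows to path, forcing UNIX line endings on any platform."""
    with open(path, "w", newline="\n") as handle:
        handle.write(render_input())
```

Calling write_input("/tmp/input.txt") should produce a file at the path that
the LOAD DATA statement reads from.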

I then created a table with one column from that data:


DROP TABLE tmp_jo_tab_test;

CREATE TABLE tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;


OK, just to validate, I created a Python counting script:


#!/usr/bin/python
# Print the number of tab-separated fields on each line of stdin.

import sys

for line in sys.stdin:
    line = line.rstrip("\n")  # drop only the newline, keep trailing tabs
    out = line.split("\t")
    print(len(out))


The output there is:

$ cat input.txt | ./cnt_tabs.py
7
7
7
7
7
7
7


Based on that information, split on tab should return 7 for each line as
well:


hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"

7
7
7
7
4
7
7


However, it does not. The line where only the first four letters are
filled in (and blanks are passed in for the last three) returns only 4
splits, where there should technically be 7: 4 for the letters included,
and 3 blanks.


a\tb\tc\td\t\t\t
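
The symptom can be modelled in a few lines of Python. This is only a sketch
of the observed behavior, not Hive's actual implementation: it assumes the
trailing empty fields are being discarded after the split. That is how
Java's single-argument String.split is documented to behave, but this
message does not confirm that as the cause. The function names are
hypothetical.

```python
# The seven test rows from input.txt, with real tab characters.
LINES = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def split_keep_all(line, sep="\t"):
    """Expected behavior: every field is returned, empty or not."""
    return line.split(sep)


def split_drop_trailing_empty(line, sep="\t"):
    """Assumed model of the bug: empty fields at the end are discarded."""
    parts = line.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    for line in LINES:
        expected = len(split_keep_all(line))
        modelled = len(split_drop_trailing_empty(line))
        print(expected, modelled)
```

Under this assumption only the fifth row differs (7 expected, 4 modelled),
which matches the counts Hive returned above. Empty fields in the middle of
a row are unaffected.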
