I opened a JIRA on this: https://issues.apache.org/jira/browse/HIVE-5506
On Wed, Oct 9, 2013 at 9:44 AM, John Omernik <j...@omernik.com> wrote:

> Hello all, I think I have outlined a bug in the Hive split function:
>
> Summary: When calling split on a string of data, it will only return all
> array items if the last array item has a value. For example, if I have
> a string of text delimited by tab with 7 columns, and the first four are
> filled, but the last three are blank, split will only return a 4 position
> array. If any number of "middle" columns are empty, but the last item
> still has a value, then it will return the proper number of columns. This
> was tested in Hive 0.9 and Hive 0.11.
>
> Data:
> (Note: \t represents a tab char, \x09. The line endings should be \n (UNIX
> style); not sure what email will do to them.) Basically my data is 7 lines
> of data with the first 7 letters separated by tab. On some lines I've left
> out certain letters, but kept the number of tabs exactly the same.
>
> input.txt
> a\tb\tc\td\te\tf\tg
> a\tb\tc\td\te\t\tg
> a\tb\t\td\t\tf\tg
> \t\t\td\te\tf\tg
> a\tb\tc\td\t\t\t
> a\t\t\t\te\tf\tg
> a\t\t\td\t\t\tg
>
> I then created a table with one column from that data:
>
> DROP TABLE tmp_jo_tab_test;
> CREATE table tmp_jo_tab_test (message_line STRING)
> STORED AS TEXTFILE;
>
> LOAD DATA LOCAL INPATH '/tmp/input.txt'
> OVERWRITE INTO TABLE tmp_jo_tab_test;
>
> OK, just to validate, I created a Python counting script:
>
> #!/usr/bin/python
>
> import sys
>
> for line in sys.stdin:
>     line = line[0:-1]  # strip the trailing newline
>     out = line.split("\t")
>     print(len(out))
>
> The output there is:
>
> $ cat input.txt | ./cnt_tabs.py
> 7
> 7
> 7
> 7
> 7
> 7
> 7
>
> Based on that information, split on tab should return 7 for each line
> as well:
>
> hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
>
> 7
> 7
> 7
> 7
> 4
> 7
> 7
>
> However, it does not. It would appear that the line where only the first
> four letters are filled in (and blanks are passed in for the last three)
> only returns 4 splits, where there should technically be 7: 4 for the
> letters included and 3 blanks.
>
> a\tb\tc\td\t\t\t
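
For anyone comparing the two counts above: Java's String.split(regex), called without a limit argument, discards trailing empty strings, whereas Python's str.split(sep) keeps them. The sketch below is only an illustration, written on the assumption that Hive's split follows the Java behavior; the thread does not confirm this. The helper name java_style_split is made up for this example and is not a Hive or Java API. It runs both rules over the seven rows of input.txt from the message.

```python
# Rows of input.txt as given in the message (real tab characters).
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def java_style_split(line, sep="\t"):
    """Hypothetical helper: mimic Java's String.split(regex) with the
    default limit of 0 for a plain, non-empty separator string."""
    # Java returns the whole input as a single element when nothing matches.
    if sep not in line:
        return [line]
    parts = line.split(sep)
    # Java then drops every trailing empty string.
    while parts and parts[-1] == "":
        parts.pop()
    return parts


# Python keeps trailing empty fields, so every row has 7 columns.
python_counts = [len(row.split("\t")) for row in ROWS]

# The Java-style rule loses the three trailing blanks on the fifth row.
java_counts = [len(java_style_split(row)) for row in ROWS]

print(python_counts)
print(java_counts)
```

Under that assumption, the fifth row is the only one that ends in empty fields, so it is the only one whose count drops, which matches the 4 reported by the hive query. In plain Java, passing a negative limit, as in split(regex, -1), keeps the trailing empty strings.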