----------------------------------------------------------------------------------------------------
-- bash
----------------------------------------------------------------------------------------------------
mkdir t
cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33
hdfs dfs -put t /tmp
----------------------------------------------------------------------------------------------------
-- hive
----------------------------------------------------------------------------------------------------
create external table t
(
RowID int
,stringColumn string
)
row format delimited
fields terminated by '|'
location '/tmp/t'
;
select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as
count from t;
1 44,85 2
2 56,37,83,68,43 5
3 33,48,42,18,23,80,31,86,48,42 24
4 77,26,95,53,11,99,74,82,7,55 17
5 48,78,39,62,16,44,43,63 8
6 35,97,99,19,22,50,29,84,82,25 11
7 80,43,82,94,81,58,70,8,70,6 26
8 66,44,66,4,80,72,81,63,51,24 18
9 39,64,29,14,9,42,66,56,33 9
Extracting the first 100 (10 in my example) tokens can be done with
regexp_extract or regexp_replace
hive> select regexp_extract
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);
1,2,3,4,5,6,7,8,9,10
hive> select regexp_replace
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');
1,2,3,4,5,6,7,8,9,10
From: Mahender Sarangam [mailto:[email protected]]
Sent: Thursday, June 09, 2016 7:13 PM
To: [email protected]
Subject: Get 100 items in Comma Separated strings from Hive Column.
Hi,
We have hive table which has a single column with more than 1000 comma
separated string items. Is there a way to retrieve only 100 string items from
that Column. Also we need to capture number of comma separated string items. We
are looking for more of "substring_index" functionality, since we are using
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a
way to achieve the same functionality with "regexp_extract" and I also see
there is UDF available not sure whether this helps us achieving same
functionality.
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java
Scenario : Table1 (Source Table)
RowID stringColumn
1 1,2,3,4...10000
2 2,4,5,8,4
3 10,11,98,100
Now i Would like to show table result structure like below
Row ID 100String count
1 1,2,3...100 10000
2 2,4,5,8,4 5