I suspect that your ultimate goal with this is likely solved by extracting just
the bits of the original string that you want, ie the underlined section below,
and then doing some regex over the result:
I like listening to rock music
So:
SET PATTERNS=(rock music)|(hip hop)|(dub step);
from (
from (
from dual
select 'I like listening to rock music' as str
) a
select
a.str
,regexp_replace(regexp_extract(str,'${hiveconf:PATTERNS}'),' ','') as genre
,regexp_replace(str,'${hiveconf:PATTERNS}','#genre') as str_minus_genre
) b
select
b.str as original
,b.str_minus_genre
,b.genre as extracted_genre
,regexp_replace(b.str_minus_genre,'#genre',b.genre) as cleaned_str;
-----Original Message-----
From: Viral Parikh [mailto:[email protected]]
Sent: 12 February 2015 09:32
To: [email protected]
Cc: Viral Parikh
Subject: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for
multiplestrings
Thank you Matthew!
In Solution 2, is the highlighted portion your typo - so 'hip music' would
remain 'hip music'. Did you hipmusic (as unigram?)
However your second comment is correct - "however it presumes that for each
ngram you don't want to keep any spaces", but I don't want that to happen.
So for example if I have something like -
I like listening to rock music.
I don't want this to be - Ilikelisteningtorockmusic since with the IN condition
it selected this statement and its replacing all spaces with no space as per
the regexp_replace function.
Correct me if I am understanding your solution 2 wrong?
Thanks,
Viral
From: Matthew Dixon
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
Reply-To:
"[email protected]<mailto:[email protected]><mailto:[email protected]%3cmailto:[email protected]%3e>"
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
Date: Friday, February 6, 2015 at 3:18 AM
To:
"[email protected]<mailto:[email protected]><mailto:[email protected]%3cmailto:[email protected]%3e>"
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
Cc: Viral Parikh
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
Subject: RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings
Below 2 solutions.
Solution1 uses lookahead and lookbehind but works with bi-grams only. It also
doesn't enforce the pairs you're asking for, so for instance hip music would
become hipmusic.
Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams
and enforces the actual patterns you want to change (so 'hip music' would
remain 'hip music', however it presumes that for each ngram you don't want to
keep any spaces.
from (
from dual
select explode(array('hip hop','rock music')) as txt
) sub
select
txt
,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1
,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2
Using a text editor like sublime or n++ you could construct the monster regex
(for solution1) or the IN set (for solution2) to capture all the patterns you
need.
From: Pradeep Gollakota [mailto:[email protected]]
Sent: 03 February 2015 21:56
To:
[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>
Cc: Viral Parikh
Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings
I don't think this is doable using the out of the box regexp_replace() UDF.
That way I would do it, is using a file to create a mapping between a regexp
and it's replacement and write a custom UDF that loads this file and applies
all regular expressions on the input.
Hope this helps.
On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
wrote:
Hi Everyone,
I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock
music" in my data and replace them with "hiphop" and "rockmusic" - basically
replace them without white space. I have used the regexp_replace function in
hive. Below is my query and it works great for above 2 examples.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music',
'rockmusic') as ntext1
from vp_nlp_protext_males
;
But I have 100 such bigrams/ngrams and want to be able to do replace
efficiently where I just remove the whitespace. I can pattern match the phrase
- hip hop and rock music but in the replace I want to simply trim the white
spaces. Below is what I tried. I also tried using trim with regexp_replace but
it wants the third argument in the regexp_replace function.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from vp_nlp_protext_males
;
--
IMPORTANT NOTICE
The sender does not guarantee that this message, including any attachment, is
secure or virus free. Also, it is confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient, do
not disclose or copy it or its contents. Please telephone or email the sender
and delete the message entirely from your system. No binding obligations or
payment commitments are to be derived from the contents of this email unless
and until a clear written agreement containing all the necessary terms and
conditions is properly executed.
Jagex Limited is a company registered in England & Wales with company number
03982706 and a registered office at St John's Innovation Centre, Cowley Road,
Cambridge, CB4 0WS, UK.
--
IMPORTANT NOTICE
The sender does not guarantee that this message, including any attachment, is
secure or virus free. Also, it is confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient, do
not disclose or copy it or its contents. Please telephone or email the sender
and delete the message entirely from your system. No binding obligations or
payment commitments are to be derived from the contents of this email unless
and until a clear written agreement containing all the necessary terms and
conditions is properly executed.
Jagex Limited is a company registered in England & Wales with company number
03982706 and a registered office at St John's Innovation Centre, Cowley Road,
Cambridge, CB4 0WS, UK.