RE: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings

Matthew Dixon Tue, 17 Feb 2015 05:17:44 -0800

I suspect that your ultimate goal with this is likely solved by extracting just 
the bits of the original string that you want, ie the underlined section below, 
and then doing some regex over the result:




                I like listening to rock music



So:



SET PATTERNS=(rock music)|(hip hop)|(dub step);



from (

from (

  from dual

  select 'I like listening to rock music' as str

) a

select

  a.str

  ,regexp_replace(regexp_extract(str,'${hiveconf:PATTERNS}'),' ','') as genre

  ,regexp_replace(str,'${hiveconf:PATTERNS}','#genre') as str_minus_genre

) b

select

 b.str as original

,b.str_minus_genre

,b.genre as extracted_genre

,regexp_replace(b.str_minus_genre,'#genre',b.genre) as cleaned_str;











-----Original Message-----
From: Viral Parikh [mailto:[email protected]]
Sent: 12 February 2015 09:32
To: [email protected]
Cc: Viral Parikh
Subject: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for 
multiplestrings



Thank you Matthew!



In Solution 2, is the highlighted portion your typo - so 'hip music' would 
remain 'hip music'. Did you hipmusic (as unigram?)



However your second comment is correct - "however it presumes that for each 
ngram you don't want to keep any spaces", but I don't want that to happen.



So for example if I have something like -



I like listening to rock music.



I don't want this to be - Ilikelisteningtorockmusic since with the IN condition 
it selected this statement and its replacing all spaces with no space as per 
the regexp_replace function.



Correct me if I am understanding your solution 2 wrong?





Thanks,

Viral

From: Matthew Dixon 
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>

Reply-To: 
"[email protected]<mailto:[email protected]><mailto:[email protected]%3cmailto:[email protected]%3e>"
 
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>

Date: Friday, February 6, 2015 at 3:18 AM

To: 
"[email protected]<mailto:[email protected]><mailto:[email protected]%3cmailto:[email protected]%3e>"
 
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>

Cc: Viral Parikh 
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>

Subject: RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



Below 2 solutions.



Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also 
doesn't enforce the pairs you're asking for, so for instance hip music would 
become hipmusic.



Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams 
and enforces the actual patterns you want to change (so 'hip music' would 
remain 'hip music', however it presumes that for each ngram you don't want to 
keep any spaces.



from (

from dual

select explode(array('hip hop','rock music')) as txt

) sub

select

txt

,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1 
,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2





Using a text editor like sublime or n++ you could construct the monster regex 
(for solution1) or the IN set (for solution2) to capture all the patterns you 
need.





From: Pradeep Gollakota [mailto:[email protected]]

Sent: 03 February 2015 21:56

To: 
[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>

Cc: Viral Parikh

Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



I don't think this is doable using the out of the box regexp_replace() UDF. 
That way I would do it, is using a file to create a mapping between a regexp 
and it's replacement and write a custom UDF that loads this file and applies 
all regular expressions on the input.



Hope this helps.



On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh 
<[email protected]<mailto:[email protected]<mailto:[email protected]%3cmailto:[email protected]>>>
 wrote:



Hi Everyone,



I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock 
music" in my data and replace them with "hiphop" and "rockmusic" - basically 
replace them without white space. I have used the regexp_replace function in 
hive. Below is my query and it works great for above 2 examples.



drop table vp_hiphop;



create table vp_hiphop as



select userid, ntext,



       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 
'rockmusic') as ntext1



from  vp_nlp_protext_males



;



But I have 100 such bigrams/ngrams and want to be able to do replace 
efficiently where I just remove the whitespace. I can pattern match the phrase 
- hip hop and rock music but in the replace I want to simply trim the white 
spaces. Below is what I tried. I also tried using trim with regexp_replace but 
it wants the third argument in the regexp_replace function.



drop table vp_hiphop;



create table vp_hiphop as



select  userid, ntext,



        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1



from  vp_nlp_protext_males



;









--



IMPORTANT NOTICE



The sender does not guarantee that this message, including any attachment, is 
secure or virus free. Also, it is confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, do 
not disclose or copy it or its contents. Please telephone or email the sender 
and delete the message entirely from your system. No binding obligations or 
payment commitments are to be derived from the contents of this email unless 
and until a clear written agreement containing all the necessary terms and 
conditions is properly executed.



Jagex Limited is a company registered in England & Wales with company number 
03982706 and a registered office at St John's Innovation Centre, Cowley Road, 
Cambridge, CB4 0WS, UK.






-- 

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is 
secure or virus free. Also, it is confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, do 
not disclose or copy it or its contents. Please telephone or email the sender 
and delete the message entirely from your system. No binding obligations or 
payment commitments are to be derived from the contents of this email unless 
and until a clear written agreement containing all the necessary terms and 
conditions is properly executed.

Jagex Limited is a company registered in England & Wales with company number 
03982706 and a registered office at St John's Innovation Centre, Cowley Road, 
Cambridge, CB4 0WS, UK.

RE: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings

Reply via email to