Hi Viral,

Unless you are strictly required to change the text itself to achieve your objective, you may wish to explore combinations of ngrams() and context_ngrams() to uniquely identify the patterns you are interested in and move them to a new table for further processing.

If it has to be done by replacement only, it is better to do it at the file level on Unix for faster and cleaner results.

Regards,
Devopam

On Wed, Feb 4, 2015 at 3:25 AM, Pradeep Gollakota <[email protected]> wrote:

> I don't think this is doable using the out-of-the-box regexp_replace()
> UDF. The way I would do it is to use a file to create a mapping between a
> regexp and its replacement, and write a custom UDF that loads this file and
> applies all the regular expressions to the input.
>
> Hope this helps.
>
> On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <[email protected]>
> wrote:
>
>> Hi Everyone,
>>
>> I am using Hive 0.13. I want to find multiple tokens like "hip hop" and
>> "rock music" in my data and replace them with "hiphop" and "rockmusic" -
>> basically replace them without the white space. I have used the
>> regexp_replace function in Hive. Below is my query, and it works great
>> for the above 2 examples.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>   regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'),
>>                  'rock music', 'rockmusic') as ntext1
>> from vp_nlp_protext_males;
>>
>> But I have 100 such bigrams/ngrams and want to be able to do the replace
>> efficiently, where I just remove the whitespace. I can pattern match the
>> phrases - hip hop and rock music - but in the replacement I want to simply
>> trim the white spaces. Below is what I tried. I also tried using trim with
>> regexp_replace, but it wants the third argument in the regexp_replace
>> function.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>   regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
>> from vp_nlp_protext_males;
>>
>

--
Devopam Mittra
Life and Relations are not binary
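
For reference, the phrase-list idea discussed in this thread can be sketched outside Hive. The snippet below is only a minimal Python illustration of the logic, not a Hive UDF and not necessarily what either reply had in mind: it builds one regular expression from a list of phrases and removes the whitespace inside each match. The function names (load_phrases, build_pattern, join_phrases) and the phrase list are hypothetical; in practice the phrases would presumably be read from a file, one per line.

```python
import re


def load_phrases(lines):
    """Return the non-empty, stripped phrases from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def build_pattern(phrases):
    """Compile one alternation that matches any phrase as whole words."""
    # Longest phrases first, so a longer phrase wins over its own prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    # Escape each word and allow any run of whitespace between the words.
    alternatives = [
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


def join_phrases(text, pattern):
    """Remove the whitespace inside every matched phrase."""
    return pattern.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


# Hypothetical phrase list (assumption); a real one could come from a file,
# e.g. load_phrases(open("phrases.txt")).
PHRASES = load_phrases(["hip hop", "rock music", ""])
PATTERN = build_pattern(PHRASES)

print(join_phrases("i like hip hop and rock  music", PATTERN))
```

Matching is case-sensitive here, and the word-boundary anchors assume each phrase starts and ends with a word character. The same logic could be ported to a custom UDF or run over the raw files before loading them, depending on which of the suggestions above fits better.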
