Hi Venkat

You don’t necessarily need the three views if your goal is to join them.
You can achieve the same result using a single view and an aggregated query.
Please test the following code and let me know whether it works for you or 
whether you would like a different solution.

Dudu


create external table t
(
    c1              string
   ,ts              string
   ,c3              string
   ,log_rec_level   string
   ,tid             string
   ,att             string
   ,val             string
   ,val_num         string
)
    row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    with serdeproperties 
('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
    stored as textfile
    location '/tmp/t'
;
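
The \\s* after 'TID:' makes the space optional, so both 'TID:101...' (no 
space) and 'TID: 101...' (with a space) are matched by the same pattern. A 
quick sanity check (just a sketch, using one TID value from your sample log):

select      regexp_extract ('TID:101114719017567668' ,'TID:\\s*(\\d+)',1)
           ,regexp_extract ('TID: 101114719017567668','TID:\\s*(\\d+)',1)
;

Both expressions should return 101114719017567668.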

create view v
as
select      c1
           ,cast (ts as timestamp)      as ts
           ,c3                          as c3
           ,log_rec_level               as log_rec_level
           ,cast (tid as bigint)        as tid
           ,att                         as att
           ,val                         as val
           ,cast (val_num as bigint)    as val_num

from        t
;


select      tid

           ,min (case when att like '%ProcessingHandler Message%'   then ts         end)  as ts_ProcessingHandler_Message
           ,min (case when att = 'Request received in writer'       then ts         end)  as ts_Request_received_in_writer
           ,min (case when att = 'Total time'                       then ts         end)  as ts_Total_time

           -- backticks because 'timestamp' is a reserved word in recent Hive versions
           ,min (case when att like '%ProcessingHandler Message%'   then val_num    end)  as `timestamp`
           ,min (case when att = 'Total time'                       then val_num    end)  as Total_time

from        v

group by    tid
;
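
On the five sample rows from your log (a sketch, assuming they are the only 
rows under /tmp/t), the conditional aggregation pivots the events into a 
single row per TID, so the expected output is roughly:

tid                            101114719017567668
ts_ProcessingHandler_Message   2016-06-29 12:10:39.142
ts_Request_received_in_writer  2016-06-29 12:10:39.163
ts_Total_time                  2016-06-29 12:10:39.273
timestamp                      1467216639090
Total_time                     10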




From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, July 01, 2016 9:20 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Dudu,

Thanks for your continued support.  I need one more quick bit of help.  I 
have one more log file, as shown below.

STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090
STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: 101114719017567668 cluster1: Processed request
STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Request received in writer
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Processed request
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Total time: 10 ms

I need to create 3 views for 3 requirements.
1) Create a view to get the timestamp, TID number and cluster1 for the lines 
"ProcessingHandler Message timestamp".  Note that for these lines there is no 
space between TID: and the TID number.

2) Create a view to get the timestamp and TID for the lines "Request received 
in writer".  There is a space between TID: and the TID number.

3) Create a view to get the timestamp and TID for the lines "Total time:".  
There is a space between TID: and the TID number.

How do I create the base table and the views?  I am planning to join these 3 
views based on TID.  Are there any special considerations I should take?

Regards,
Venkat




On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:
Dudu, Thanks for the clarification.  Looks like I have an issue with my Hive 
installation.  I tried it in a different cluster and it works.

Thanks again.


On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
This is tested, working code.
If you’re using https://regex101.com, first replace each backslash pair (\\) 
with a single backslash (\) and also use the ‘g’ modifier in order to find all 
of the matches.

The regular expression is -
(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)

I’ll send you a screen shot in private, since you don’t want to expose the data.

Dudu


From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, June 24, 2016 9:33 PM

To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Looks like the regex pattern is not working.  I tested the pattern on 
https://regex101.com/ and it does not find any match.

Any suggestions?

On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu.  This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
Hi

Here is the code (without the log data).

I’ve created some of the views using different text-processing techniques.
The rest of the views can be created in similar ways.


Dudu



bash
----------------------------------------------------------------------------------------------------

hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621


hive
----------------------------------------------------------------------------------------------------

/*
    External table log

    Defines all of the common columns, plus the optional column 'tid' (which
    appears in most log records), plus the rest of the log record ('txt')

*/

drop table if exists log;

create external table log
(
    c1              string
   ,ts              string
   ,ts_frac         string
   ,log_rec_level   string
   ,c4              string
   ,c5              string
   ,tid             string
   ,txt             string
)
    partitioned by (dt date)
    row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    with serdeproperties 
('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
    stored as textfile
    location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';
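
-- each additional day gets its own partition pointing at that day's directory,
-- e.g. (a sketch for the next day's load):
alter table log add partition (dt=date '2016-06-22') location '/tmp/log/20160622';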

select * from log;
----------------------------------------------------------------------------------------------------

/*
    View log_v

    Base view for all other views

*/

drop view if exists log_v;

create view log_v
as
select      c1
           ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
           ,log_rec_level
           ,c4
           ,c5
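           -- 'tid' is captured as 'TID: <number>'; substr skips the 4-character 'TID:' prefix and ltrim drops the space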
           ,cast (ltrim(substr (tid,5)) as bigint)          as tid
           ,txt

from        log
;

select * from log_v;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select      c1
           ,ts
           ,log_rec_level
           ,c4
           ,c5
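           -- 'Reaping path: ' is 14 characters long, so the path itself starts at position 15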
           ,substr (txt,15)     as reaping_path

from        log_v

where       txt like 'Reaping path: %'
;

select * from log_v_reaping_path;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select      c1
           ,ts
           ,log_rec_level
           ,c4
           ,c5
           ,tid

           ,      ltrim (kv [' Key']      )                 as key
           ,cast (ltrim (kv [' size']     ) as bigint )     as size
           ,      ltrim (kv [' topic']    )                 as topic
           ,cast (ltrim (kv [' partition']) as int    )     as partition
           ,cast (ltrim (kv [' offset']   ) as bigint )     as offset

from       (select      c1
                       ,ts
                       ,log_rec_level
                       ,c4
                       ,c5
                       ,tid
                       ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':')   as kv

            from        log_v

            where       txt like 'Published to Kafka. %'
            )
            as t
;

select * from log_v_published_to_kafka;
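
-- str_to_map splits the text that follows the first '.' into key/value pairs,
-- first on ',' and then on ':'. The keys keep their leading space, which is
-- why the view looks up ' Key' (quoted with a leading space) and applies ltrim.
-- A standalone sketch with made-up values:
select str_to_map (' Key: k1, size: 10, topic: t1',',',':');
-- returns {" Key":" k1"," size":" 10"," topic":" t1"}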

----------------------------------------------------------------------------------------------------

drop view if exists log_v_get_request;

create view log_v_get_request
as
select      c1
           ,ts
           ,log_rec_level
           ,c4
           ,c5
           ,tid
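           -- 'GET request received for path ' is 30 characters long, so the path starts at position 31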
           ,substr (txt,31) as path

from        log_v

where       txt like 'GET request received for path %'
;

select * from log_v_get_request;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_unlock_request;

create view log_v_unlock_request
as
select      c1
           ,ts
           ,log_rec_level
           ,c4
           ,c5
           ,tid
           ,regexp_extract (txt,'rowkey (\\S+)',1)      as rowkey
           ,regexp_extract (txt,'lock id (\\S+)',1)     as lock_id

from        log_v

where       txt like 'Unlock request for schema DU %'
;


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, June 21, 2016 2:26 PM
To: user@hive.apache.org
Subject: RE: RegexSerDe with Filters

Hi

I would suggest creating a single external table with daily partitions and 
multiple views, each with the appropriate filtering.
If you send me a log sample (~100 rows), I’ll send you an example.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Tuesday, June 21, 2016 1:51 AM
To: user@hive.apache.org
Subject: RegexSerDe with Filters

Hello Hive Experts,

I use Flume to ingest application-specific logs from Syslog to HDFS.  
Currently, I grep the HDFS directory for specific patterns (for multiple types 
of requests) and then create reports.  However, generating weekly and monthly 
reports this way is not scalable.

I would like to create multiple external tables on the daily HDFS directory, 
partitioned by date, with RegexSerDe, and then create separate Parquet tables 
for every kind of request.

The question is - how do I create multiple (about 20) RegexSerDe tables on 
the same data, applying filters?  This would be just like the 20 grep commands 
I am running today.

Example:  hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
          hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
          hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'

I would like to create tables that do this kind of job and then write the 
output to Parquet tables.

Please let me know how this can be done.  Thank you!

Regards,
Arun



