Re: Hive question, summing second-level domain names

Loren Siebert Mon, 23 May 2011 13:12:24 -0700

Use regexp_extract() and group on that.
        http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions


For example, to get class C IP block from an IP address in a column called 
‘host':

    regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc

   


On May 23, 2011, at 1:03 PM, Adam Phelps wrote:

> (As an FYI I'm relatively new to Hive and have no previous SQL experience, so 
> have been struggling a bit with the Language manual which seems to assume 
> previous SQL experience)
> 
> Suppose I have a table, within which there is a column which contains domain 
> names (ie such as hadoop.apache.org).  I want to perform a count of all 
> second-level domains, ie hadoop.apache.org and hive.apache.org would count in 
> the same bucket.
> 
> Now I could count things for a particular second-level domain like this:
> 
> SELECT
>  year, month, day, hour, COUNT(1) as count
> FROM
>  domainlog
> WHERE
>  year = 2011 AND
>  month = 05 AND
>  day = 15 AND
>  (
>    domain RLIKE ".*[.]apache[.]org"
>  )
> GROUP BY
>  year, month, day, hour
> 
> however I'm not seeing a way to sum up all second-level domains rather than a 
> particular one.  I basically want to group everything using a regular 
> expression along the lines of ".*[.][^.]*[.][^.]*" and then output lines with 
> a count for the common portion.  Any pointers in the correct direction would 
> be welcome.
> 
> Thanks
> - Adam

Re: Hive question, summing second-level domain names

Reply via email to