Use regexp_extract() and group on that. http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions
For example, to get class C IP block from an IP address in a column called ‘host': regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc On May 23, 2011, at 1:03 PM, Adam Phelps wrote: > (As an FYI I'm relatively new to Hive and have no previous SQL experience, so > have been struggling a bit with the Language manual which seems to assume > previous SQL experience) > > Suppose I have a table, within which there is a column which contains domain > names (ie such as hadoop.apache.org). I want to perform a count of all > second-level domains, ie hadoop.apache.org and hive.apache.org would count in > the same bucket. > > Now I could count things for a particular second-level domain like this: > > SELECT > year, month, day, hour, COUNT(1) as count > FROM > domainlog > WHERE > year = 2011 AND > month = 05 AND > day = 15 AND > ( > domain RLIKE ".*[.]apache[.]org" > ) > GROUP BY > year, month, day, hour > > however I'm not seeing a way to sum up all second-level domains rather than a > particular one. I basically want to group everything using a regular > expression along the lines of ".*[.][^.]*[.][^.]*" and then output lines with > a count for the common portion. Any pointers in the correct direction would > be welcome. > > Thanks > - Adam