Hive Parsing

Public Network Services Wed, 15 Jun 2011 16:49:28 -0700

Hi...

I am trying to figure out how Hive parses an input file into a table,
to use it as a model for implementing a similar parser. Having had a
look at the source code of the org.apache.hadoop.hive.ql.parse
package, I am not sure whether this is the (only) place to search for
the answer.


For example, to parse an Apache weblog, I have found this HQL example:

CREATE TABLE apachelog(host STRING, identity STRING,
        user STRING,  time STRING, request STRING, status STRING,
        size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;

whereas for CSV the row format would be something like

ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
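
To make the question concrete, this is roughly what I picture the two row
formats doing, as a minimal Java sketch. It is not taken from the Hive
source. The names RowParsingSketch, parseRegexRow and splitDelimitedRow are
made up for illustration. It assumes the regex is applied to each line with
java.util.regex (one capture group per column) and that '44' stands for
character code 44, a comma. Quoting, escaping and type conversion are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; these are not Hive classes.
class RowParsingSketch {
    // The "input.regex" value from the HQL above, written as a Java string literal.
    static final Pattern APACHE_LOG = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") "
        + "(-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?");

    // Regex style: each capture group becomes one column value.
    // Returns null when the whole line does not match the pattern.
    static List<String> parseRegexRow(String line) {
        Matcher m = APACHE_LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i)); // null for an optional group that did not take part
        }
        return columns;
    }

    // Delimited style: split one row on a single separator character.
    static List<String> splitDelimitedRow(String row, char separator) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == separator) {
                fields.add(row.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(row.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String logLine = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(parseRegexRow(logLine));
        System.out.println(splitDelimitedRow("a,b,,d", (char) 44));
    }
}
```

If Hive works along these lines, I would expect one class per row format
behind a common interface, which is what the questions below are about.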

So, my questions are:
- Does Hive have a "conventional" parser library that uses a separate
class (e.g., regexParser, CSVParser) to implement the above commands?
- Does it embed any 3rd-party code (like the Apache Commons CSV
library) to do its parsing?
- Or does it work in a different way?
