Hi...
I am trying to figure out how Hive parses an input file into table
rows, so that I can use it as a model for implementing a similar
parser. I have looked at the source code of the
org.apache.hadoop.hive.ql.parse package, but I am not sure whether
that is the right (or the only) place to look for the answer.
For example, to load an Apache web log, I found this HQL example:
CREATE TABLE apachelog(host STRING, identity STRING,
user STRING, time STRING, request STRING, status STRING,
size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
whereas for CSV the row format would be something like
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
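To make concrete what I mean by a "separate parser class", here is a rough sketch of how I would naively implement the two row formats myself with plain java.util.regex. The class and method names (RowParsers, parseWithRegex, parseDelimited) are made up for illustration and are not taken from the Hive source. I am also assuming that '44' is meant as the ASCII code for a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT Hive code: one method per row format.
class RowParsers {

    // Same pattern as the "input.regex" property in the HQL example.
    static final Pattern APACHE_LOG = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\")"
        + " (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?");

    // Each capturing group becomes one column; returns null when the
    // line does not match the pattern as a whole.
    static List<String> parseWithRegex(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i)); // null if an optional group did not match
        }
        return columns;
    }

    // Splits one line on a single delimiter character, keeping empty fields.
    // No quoting or escaping is handled here.
    static List<String> parseDelimited(char delimiter, String line) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        String logLine = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08\"";
        System.out.println(parseWithRegex(APACHE_LOG, logLine));
        System.out.println(parseDelimited((char) 44, "a,b,,d"));
    }
}
```

The first call should yield nine columns (host through agent) and the second four fields, one of them empty. Whether Hive does anything like this internally is exactly what I am unsure about.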
So, my questions are:
- Does Hive have a "conventional" parser library that uses a separate
class (e.g., regexParser, CSVParser) to implement each of the above
row formats?
- Does it embed any 3rd-party code (like the Apache Commons CSV
library) to do its parsing?
- Or does it work in a different way?