It looks like Hive has allowed any Unicode character in column names since
0.13, as per the Hive documentation at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
Column names can be surrounded by backticks (`).

But ORC probably allows only alphanumerics, periods and underscores.
Could someone please confirm that? If so, is there a plan to support other
characters in ORC?

I am using TypeDescription::fromString to generate a schema, and I tried
a field name with '-' (hyphen) in it. I got an exception with this
trace:
java.lang.IllegalArgumentException: Missing required char ':' at 'struct<before^-after:string>'
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:259)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:286)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:338)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:359)

Looking at the code at
https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/TypeDescription.java#L241
it seems that only alphanumerics, periods and underscores are supported in
column names.
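
To make the suspected behaviour concrete, here is a minimal standalone sketch
of a name scanner that accepts only letters, digits, '.' and '_'. It is not
the ORC source: the class FieldNameSketch and its methods are hypothetical
names made up for illustration, and the character rule is only my reading of
the code linked above. It shows why scanning would stop at the hyphen, just
before the ':' the parser expects.

```java
public class FieldNameSketch {
    // Assumption: the rule described above (letters, digits, '.', '_').
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '.' || c == '_';
    }

    // Index of the first character that cannot belong to a field name.
    static int endOfName(String source, int start) {
        int pos = start;
        while (pos < source.length() && isNameChar(source.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Insert '^' at the stop position, similar to the message in the trace.
    static String markStop(String source, int pos) {
        return source.substring(0, pos) + "^" + source.substring(pos);
    }

    public static void main(String[] args) {
        String schema = "struct<before-after:string>";
        int start = schema.indexOf('<') + 1;   // first character of the field name
        int stop = endOfName(schema, start);   // scanning stops at '-'
        if (schema.charAt(stop) != ':') {
            System.out.println("Missing required char ':' at '"
                + markStop(schema, stop) + "'");
        }
    }
}
```

With the schema string from my test, the scan stops at the hyphen, so the
character found where ':' is required is '-'.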

In Hive, however, I could create a table with a column name containing '-'
(hyphen) when it is surrounded by backticks:
hive> create table table_with_hyphen ( `hyphen-inbetween` string) stored as ORC;
OK
Time taken: 0.075 seconds

The schema for this ORC file, obtained via a call
to org.apache.orc.Reader::getSchema()::getJson(), came out like this:

{"category": "struct", "id": 0, "max": 1, "fields": [
  "_col0": {"category": "string", "id": 1, "max": 1}]}

Thanks,
Manoj
