ORC itself can handle any UTF-8 characters in the column names, but the type name parser made too many assumptions about valid characters in the field names. I've created a new jira https://issues.apache.org/jira/browse/ORC-104 to address the problem.
.. Owen

On Fri, Sep 23, 2016 at 12:34 PM, Manoj Narayanan <[email protected]> wrote:

> Looks like Hive has allowed any Unicode character since 0.13, as per the
> Hive documentation at
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable.
> Column names can be surrounded by backticks (`).
>
> But ORC probably allows only alphanumerics, period and underscore.
> Could someone please confirm that? If so, is there a plan to support other
> characters in ORC?
>
> I am using TypeDescription::fromString to generate a schema and I tried
> using a field name with '-' (hyphen) in it. I got an exception with this
> trace.
>
> java.lang.IllegalArgumentException: Missing required char ':' at
> 'struct<before^-after:string>'
>
>   at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:259)
>   at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:286)
>   at org.apache.orc.TypeDescription.parseType(TypeDescription.java:338)
>   at org.apache.orc.TypeDescription.fromString(TypeDescription.java:359)
>
> Looking at the code at
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/TypeDescription.java#L241,
> it seems only alphanumerics, periods and underscores are supported in
> column names.
>
> In Hive, however, I could create a table with columns containing '-'
> (hyphen) when they are surrounded by backticks.
>
> hive> create table table_with_hyphen ( `hyphen-inbetween` string) stored
> as ORC;
> OK
> Time taken: 0.075 seconds
>
> The schema for this ORC file came out like this, via a call
> to org.apache.orc.Reader::getSchema()::getJson():
>
> {"category": "struct", "id": 0, "max": 1, "fields": [
> "_col0": {"category": "string", "id": 1, "max": 1}]}
>
> Thanks,
> Manoj
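
For anyone reading this thread later, here is a minimal, self-contained sketch of why the hyphen trips the parser. It is not the ORC source: the class `FieldNameParseSketch` and its helpers `parseName`, `requireChar` and `firstFieldName` are hypothetical names written for illustration, assuming a name scanner that stops at the first character that is not a letter, digit, period or underscore and then demands a ':'.

```java
public class FieldNameParseSketch {

    // Hypothetical helper: scans a field name starting at 'start' and returns
    // the index just past it. Only letters, digits, '.' and '_' are accepted,
    // which is the restriction described in the thread.
    static int parseName(String source, int start) {
        int pos = start;
        while (pos < source.length()) {
            char ch = source.charAt(pos);
            if (!Character.isLetterOrDigit(ch) && ch != '.' && ch != '_') {
                break;
            }
            pos++;
        }
        if (pos == start) {
            throw new IllegalArgumentException("Missing name at position " + start);
        }
        return pos;
    }

    // Hypothetical helper: fails unless 'required' is the character at 'pos'.
    // The message marks the failing position with '^'.
    static void requireChar(String source, int pos, char required) {
        if (pos >= source.length() || source.charAt(pos) != required) {
            throw new IllegalArgumentException(
                "Missing required char '" + required + "' at '"
                    + source.substring(0, pos) + "^" + source.substring(pos) + "'");
        }
    }

    // Returns the first field name of a "struct<name:type,...>" string.
    static String firstFieldName(String source) {
        String prefix = "struct<";
        if (!source.startsWith(prefix)) {
            throw new IllegalArgumentException("Expected a struct type: " + source);
        }
        int start = prefix.length();
        int end = parseName(source, start);
        // The scan stopped at the first unsupported character; a ':' must follow.
        requireChar(source, end, ':');
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        // An underscore is accepted as part of the name.
        System.out.println(firstFieldName("struct<before_after:string>"));
        try {
            // The hyphen ends the name scan early, so the ':' check fails.
            firstFieldName("struct<before-after:string>");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the name scan stops at the hyphen, so the parser looks for ':' right after "before" and reports the error at that position, which is the same shape as the message in the trace above.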
