[ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392040#comment-16392040
 ] 

Aihua Xu edited comment on HIVE-14792 at 3/8/18 11:08 PM:
----------------------------------------------------------

From checking the patch: we are calling getTableMetadata(), which puts 
"columns.types" and the other properties into the table properties. Setting 
this to false means that metadata will not be there. Do you think it would be 
a good idea to get just the Avro schema from the SerDe info rather than 
everything? 


was (Author: aihuaxu):
By checking the patch, since we are calling getTableMetadata() which puts 
"columns.types" and the other properties in table properties. Setting this to 
false will not have such metadata in there. Do you think it's a good idea to 
just get the serde lib info rather than everything? 

> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-14792
>                 URL: https://issues.apache.org/jira/browse/HIVE-14792
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1, 2.1.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Aihua Xu
>            Priority: Major
>              Labels: TODOC2.2, TODOC2.4
>             Fix For: 3.0.0, 2.4.0, 2.2.1
>
>         Attachments: HIVE-14792.1.patch, HIVE-14792.3.patch, 
> HIVE-14792.4.patch
>
>
> Avro tables that use "external" schema files stored on HDFS can cause 
> excessive calls to {{FileSystem::open()}}, especially for queries that spawn 
> large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties)
>     throws SerDeException {
> // ...
>     if (hasExternalSchema(properties)
>         || columnNameProperty == null || columnNameProperty.isEmpty()
>         || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>       schema = determineSchemaOrReturnErrorSchema(configuration, properties);
>     } else {
>       // Get column names and sort order
>       columnNames = Arrays.asList(columnNameProperty.split(","));
>       columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>       schema = getSchemaFromCols(properties, columnNames, columnTypes,
>           columnCommentProperty);
>       properties.setProperty(
>           AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
>           schema.toString());
>     }
> // ...
> }
> {code}
> For tables using {{avro.schema.url}}, the schema file is read remotely every 
> time the SerDe is initialized (i.e. at least once per mapper). For queries 
> with thousands of mappers, this leads to a stampede on the handful (3?) of 
> datanodes that host the schema file. In the best case, this causes 
> slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part 
> of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{columns.types}} stored in the Hive 
> metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the 
> size limit on table parameters. In my limited experience, an Avro schema 
> file is typically between 0.5 and 3 MB. Bumping the maximum table-parameter 
> size isn't a great solution.
> If the schema file referenced by {{avro.schema.url}} were read during query 
> planning, and made available as part of the table properties (but not 
> serialized into the metastore), the downstream logic would remain largely 
> intact. I have a patch that does this.
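
The proposal in the description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: the class name, both helper methods, and the `remoteReader` function (standing in for the HDFS read) are hypothetical, and it assumes that a schema supplied under `avro.schema.literal` takes precedence over `avro.schema.url` when the SerDe initializes.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: resolve the external schema once at planning time.
class PlanTimeSchemaSketch {
    static final String SCHEMA_URL = "avro.schema.url";
    static final String SCHEMA_LITERAL = "avro.schema.literal";

    // Planning side: read the schema file once and inline its text into a
    // per-job copy of the table properties. The input properties (the ones
    // that would be persisted in the metastore) are left unmodified.
    static Properties resolveAtPlanTime(Properties tableProps,
                                        Function<String, String> remoteReader) {
        Properties jobProps = new Properties();
        jobProps.putAll(tableProps);
        String url = tableProps.getProperty(SCHEMA_URL);
        if (url != null && tableProps.getProperty(SCHEMA_LITERAL) == null) {
            jobProps.setProperty(SCHEMA_LITERAL, remoteReader.apply(url));
        }
        return jobProps;
    }

    // Mapper side: prefer the inlined literal and fall back to a remote read
    // only when no literal was shipped with the job.
    static String schemaForMapper(Properties jobProps,
                                  Function<String, String> remoteReader) {
        String literal = jobProps.getProperty(SCHEMA_LITERAL);
        if (literal != null) {
            return literal;
        }
        return remoteReader.apply(jobProps.getProperty(SCHEMA_URL));
    }

    public static void main(String[] args) {
        AtomicInteger remoteReads = new AtomicInteger();
        // Stand-in for opening the schema file on HDFS; counts each call.
        Function<String, String> reader = url -> {
            remoteReads.incrementAndGet();
            return "{\"type\":\"record\",\"name\":\"r\",\"fields\":[]}";
        };

        Properties tableProps = new Properties();
        tableProps.setProperty(SCHEMA_URL, "hdfs:///schemas/example.avsc");

        Properties jobProps = resolveAtPlanTime(tableProps, reader);
        for (int mapper = 0; mapper < 1000; mapper++) {
            schemaForMapper(jobProps, reader);
        }
        System.out.println("remote reads: " + remoteReads.get());
    }
}
```

In this sketch the simulated reader is called once during planning and never by the 1000 simulated mappers, while the original table properties still carry only the URL.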



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
