[ https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yulei Li reassigned HADOOP-13619:
---------------------------------

    Assignee: Yulei Li

> missing data intermittently when reading avro file in Spark from Swift storage
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-13619
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/swift
>    Affects Versions: 2.6.0
>         Environment: Linux EL6
>            Reporter: Steve Yang
>            Assignee: Yulei Li
>            Priority: Blocker
>
> library used: org.apache.hadoop:hadoop-openstack:2.6.0
> We are loading avro files from an Oracle Storage Service server (i.e., a Swift
> server) into a Spark DataFrame object through the Spark Data Source API. For
> example:
> return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);
> The number of records returned is less than the actual record count in the
> avro file when reading the file from the Storage Service server using the
> OpenStack Swift API.
> If we run a SQL query on top of the returned data frame, such as
> "select count(*) as C1 from <temp table>", the record count is smaller than
> when reading the same avro file from the local file system.
> For a large avro file (awclassic.avro, 105M) the count is always wrong (42451
> records vs. 60855). From the log file we can see that the read of the file is
> split into 4:
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432
> For a smaller avro file (wine.avro, 19M) the count is sometimes correct
> (57076 records) and sometimes wrong (26999 records). Running the same Spark SQL
> query 10 times back-to-back produces the following record counts:
> run 1: 26999
> run 2: 26999
> run 3: 57076
> run 4: 57056
> run 5: 57076
> run 6: 26999
> run 7: 57076
> run 8: 57076
> run 9: 57076
> run 10: 57076
> For this wine.avro test case there are two splits:
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split:
> swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
> I will attach a zip file containing the smaller avro file in question and the
> debug log sections for reading wine.avro: one from a successful read (C4.ok)
> and one from a read with missing records (C5.miss).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
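
A quick sanity check on the "Input split" entries quoted in the report: each entry ends in `:<start>+<length>`, so it is possible to verify whether the splits tile the file without gaps or overlaps. The sketch below is illustrative only and is not part of Hadoop or Spark; the class and method names (`SplitCoverage`, `parse`, `contiguousLength`) are hypothetical. If the ranges are contiguous from offset 0, that suggests (but does not prove) that the missing records are lost while reading within a split rather than when the split boundaries are computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Hadoop/Spark API: checks that logged input
// splits cover a file as contiguous byte ranges starting at offset 0.
public class SplitCoverage {
    // Matches the trailing ":<start>+<length>" of a split descriptor.
    private static final Pattern SPLIT = Pattern.compile(":(\\d+)\\+(\\d+)$");

    /** Returns {start, length} parsed from one split descriptor. */
    public static long[] parse(String split) {
        Matcher m = SPLIT.matcher(split.trim());
        if (!m.find()) {
            throw new IllegalArgumentException("Not a split descriptor: " + split);
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    /** Returns the total bytes covered if the splits are contiguous from 0, else -1. */
    public static long contiguousLength(List<String> splits) {
        List<long[]> ranges = new ArrayList<>();
        for (String s : splits) {
            ranges.add(parse(s));
        }
        // Order by start offset; the log lists splits in arbitrary order.
        ranges.sort((long[] a, long[] b) -> Long.compare(a[0], b[0]));
        long expectedStart = 0;
        for (long[] r : ranges) {
            if (r[0] != expectedStart) {
                return -1; // gap or overlap between adjacent splits
            }
            expectedStart += r[1];
        }
        return expectedStart;
    }

    public static void main(String[] args) {
        String base = "swift://qaTestData.oracleswift/testAvro/";
        List<String> awclassic = Arrays.asList(
            base + "awclassic.avro:100663296+10044747",
            base + "awclassic.avro:33554432+33554432",
            base + "awclassic.avro:0+33554432",
            base + "awclassic.avro:67108864+33554432");
        List<String> wine = Arrays.asList(
            base + "wine.avro:9965269+9965270",
            base + "wine.avro:0+9965269");
        System.out.println("awclassic.avro bytes covered: " + contiguousLength(awclassic));
        System.out.println("wine.avro bytes covered: " + contiguousLength(wine));
    }
}
```

For the splits quoted above, both files come out contiguous: 110708043 bytes for awclassic.avro and 19930539 bytes for wine.avro, which matches the stated sizes of roughly 105M and 19M.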