[jira] [Updated] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

2016-09-21 Thread Yulei Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yulei Li updated HADOOP-13619:
--
Assignee: (was: Yulei Li)

> missing data intermittently when reading avro file in Spark from Swift storage
> --
>
> Key: HADOOP-13619
> URL: https://issues.apache.org/jira/browse/HADOOP-13619
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/swift
> Affects Versions: 2.6.0
> Environment: Linux EL6
> Reporter: Steve Yang
> Priority: Blocker
>
> library used: org.apache.hadoop:hadoop-openstack:2.6.0
> We are loading Avro files from an Oracle Storage Service server (i.e., a Swift
> server) into a Spark DataFrame object through the Spark Data Source API. For
> example:
> return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);
> The number of records returned is less than the actual record count in the
> Avro file when the file is read from the Storage Service server using the
> OpenStack Swift API.
> If we run a SQL query on top of the returned data frame, such as
> "select count(*) as C1 from ", the record count is smaller than when reading
> the same Avro file from the local file system.
> For a large Avro file (awclassic.avro, 105M) the count is always wrong (42451
> records vs. 60855). From the log file we can see that the read of the file is
> split into 4:
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432
> For a smaller Avro file (wine.avro, 19M) the count is sometimes correct
> (57076 records) and sometimes wrong (26999 records). Running the same Spark
> SQL query 10 times back-to-back produces the following record counts:
> run 1: 26999
> run 2: 26999
> run 3: 57076
> run 4: 57056
> run 5: 57076
> run 6: 26999
> run 7: 57076
> run 8: 57076
> run 9: 57076
> run 10: 57076
> For this wine.avro test case there are two splits:
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
> I will attach a zip file containing the smaller Avro file in question and the
> debug log sections from reading wine.avro: one with a successful read
> (C4.ok) and one with missing records (C5.miss).
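
The "Input split" lines quoted in the description end in an offset+length pair, so they can be checked mechanically. Below is a minimal sketch in plain Java that parses those suffixes and verifies that the splits tile the file from byte 0 with no gap or overlap. The class and method names (SplitCoverage, parse, contiguousLength) are hypothetical helpers written for this illustration; they are not part of Hadoop, Spark, or the hadoop-openstack module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Hadoop or Spark class.
class SplitCoverage {
    // Matches the trailing ":<offset>+<length>" of an "Input split" log line.
    private static final Pattern SPLIT = Pattern.compile(":(\\d+)\\+(\\d+)\\s*$");

    /** Parses split lines into {offset, length} pairs sorted by offset. */
    static List<long[]> parse(List<String> lines) {
        List<long[]> splits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = SPLIT.matcher(line);
            if (m.find()) {
                splits.add(new long[] {
                    Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
            }
        }
        splits.sort(Comparator.comparingLong((long[] s) -> s[0]));
        return splits;
    }

    /** Returns the total bytes covered, or -1 if the splits leave a gap or overlap. */
    static long contiguousLength(List<String> lines) {
        long expectedOffset = 0;
        for (long[] s : parse(lines)) {
            if (s[0] != expectedOffset) {
                return -1;
            }
            expectedOffset += s[1];
        }
        return expectedOffset;
    }

    public static void main(String[] args) {
        // Split suffixes copied from the log lines in the issue description.
        List<String> awclassic = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432");
        List<String> wine = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270",
            "swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269");

        System.out.println(contiguousLength(awclassic)); // prints 110708043
        System.out.println(contiguousLength(wine));      // prints 19930539
    }
}
```

Both sets of splits are contiguous and add up to sizes consistent with the reported 105M and 19M files. That suggests the split planning covers the whole object in each case, so the missing records would more likely come from how an individual byte range is read than from how the file is divided. This is an inference from the logged offsets only; the logs do not show which split returned too few records.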



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

2016-09-19 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13619:

Summary: missing data intermittently when reading avro file in Spark from 
Swift storage  (was: missing data intermittently when reading avro file)



