[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502
 
 
   I've managed to narrow down the issue to the data that is coming off the 
Kinesis stream.
   
   When I replace the data from the stream with some test data, everything works. 
Starting from the following code:
   
   ```
if (!rdd.isEmpty()) {
  val json = rdd.map(record => new String(record))
  val dataFrame = spark.read.json(json)
  dataFrame.printSchema();
  dataFrame.show();

  val hudiTableName = "order"
  val hudiTablePath = path + hudiTableName

  val hudiOptions = Map[String, String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
    HoodieWriteConfig.TABLE_NAME -> hudiTableName,
    DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")

  // Write data into the Hudi dataset
  dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
}
   ```
   
   I replaced
   
   ```
val dataFrame = spark.read.json(json)
   ```
   
   with
   
   ```
   val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
   ```
   
   and the `select * from table` query worked, as did the nested query `select id, 
bar.id, bar.name from table`.
   
   So at this stage it looks like there's an issue with the data and how it's 
coming off the Kinesis stream.
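   
   For reference, the `Foo`/`Bar` case classes assumed by that test snippet (their 
definitions weren't posted; this is a sketch inferred from the 
`select id, bar.id, bar.name` query):
   
   ```
   // Assumed test-data types matching the nested query above
   case class Bar(id: Int, name: String)
   case class Foo(id: Int, bar: Bar)
   
   import spark.implicits._ // needed for .toDF() on an RDD of case classes
   ```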
   
   Update:
   I've pasted the output of `dataFrame.show()` for the RDD coming off the stream 
here:
   ```
   
+--------+-------+--------------------+--------+--------------------+-------------+
|clientId|eventId|      eventTimestamp|      id|               order|  typeOfEvent|
+--------+-------+--------------------+--------+--------------------+-------------+
|     369| 115423|2020-02-12T15:54:...|34551840|[External, [[Aust...|order-created|
+--------+-------+--------------------+--------+--------------------+-------------+
   ```
   Taking the data off the stream but ignoring the nested column 'order', the 
`select * from table` query works:
   ```
   val jsonFrame = spark.read.json(json)
   val dataFrame = jsonFrame.select("clientid", "eventId", "id", "typeOfEvent")
   ```
   
   Other than being nested, is there something about the order column that 
would make this happen?
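   
   In case it helps isolate the culprit, one thing to try (a sketch; `order.*` 
expands the struct's top-level fields into columns):
   
   ```
   // Flatten the nested `order` struct one level; whichever flattened field still
   // breaks the query can then be inspected on its own.
   val flattened = jsonFrame.select("clientId", "eventId", "id", "typeOfEvent", "order.*")
   flattened.printSchema()
   flattened.show()
   ```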
   




[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585549211
 
 
   > sg. In that case, I would like to re-ask my initial questions.. How have we 
validated the framework? Are there real tests now? How/when are they run? :)
   
   In my opinion, we will trigger it with 
[azure-pipelines](https://github.com/apachehudi-ci/incubator-hudi/blob/master/azure-pipelines.yml) 
after it is merged into the master branch.




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #187

2020-02-12 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.28 KB...]
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to 
support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585528405
 
 
   sg. In that case, I would like to re-ask my initial questions.. How have we 
validated the framework? Are there real tests now? How/when are they run? :) 




[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585522237
 
 
   > Or is it this PR?
   
   Yes




[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585519006
 
 
   > @yanghua merged the pom PR, can you rebase this so we can merge this PR ? 
I think we have gone through it multiple times, let's have the first version 
committed so we can come up with more action items around this
   
   OK, will rebase with the master branch.




[jira] [Updated] (HUDI-607) Hive sync fails to register tables partitioned by Date Type column

2020-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-607:

Labels: pull-request-available  (was: )

> Hive sync fails to register tables partitioned by Date Type column
> --
>
> Key: HUDI-607
> URL: https://issues.apache.org/jira/browse/HUDI-607
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
>

[GitHub] [incubator-hudi] umehrot2 opened a new pull request #1330: [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by Date type columns

2020-02-12 Thread GitBox
umehrot2 opened a new pull request #1330: [HUDI-607] Fix to allow 
creation/syncing of Hive tables partitioned by Date type columns
URL: https://github.com/apache/incubator-hudi/pull/1330
 
 
   ## What is the purpose of the pull request
   
   The issue is well described in 
https://issues.apache.org/jira/browse/HUDI-607. It fixes an issue where Hive 
sync fails to create the table when a Date type column is the partition column 
for the table.
   
   ## Brief change log
   
   - Make changes in DataSourceUtils to check whether the field whose value is 
being returned is of the Avro logical Date type. If yes, it is returned as the 
actual date string and not the epoch day value. This lets Hudi use the actual 
date string as the value everywhere in its metadata.
   - Unit tests
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   - Added **DataSourceUtilsTest** class
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] bwu2 edited a comment on issue #1328: Hudi upsert hangs

2020-02-12 Thread GitBox
bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585512613
 
 
   @vinothchandar Thanks for taking the time to reply!
   
   Let me describe the simplest example of this problem on a tiny COW data set: 
Create a data frame with 4m rows and one column, with values 1, 2, 3, ..., 4m in 
that column. Bulk insert that into Hudi (using the one column as the 
`recordkey`). This takes ~1 minute to run and the data size is about 30MB. Now 
upsert the same data frame into the existing table. This takes >2 hours to run.
   
   Alternatively, if we upsert a new data frame into the existing table with 
values 401...8m (still 4m rows upserted), this takes ~1 minute to run.
   
   To answer your other queries: 
   * almost all of the time is spent in the `HoodieSparkSqlWriter` job (and 
within that job, the `count at HoodieSparkSqlWriter.scala` stage); the 
`HoodieBloomIndex` jobs run quickly.
   * it seems highly unlikely to be a resource constraint issue with such a 
small example.
   
   Shall I raise a JIRA for this? Or is this the expected behavior for such a 
workload?




[jira] [Assigned] (HUDI-607) Hive sync fails to register tables partitioned by Date Type column

2020-02-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra reassigned HUDI-607:
--

Assignee: Udit Mehrotra

> Hive sync fails to register tables partitioned by Date Type column
> --
>
> Key: HUDI-607
> URL: https://issues.apache.org/jira/browse/HUDI-607
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>

[jira] [Created] (HUDI-607) Hive sync fails to register tables partitioned by Date Type column

2020-02-12 Thread Udit Mehrotra (Jira)
Udit Mehrotra created HUDI-607:
--

 Summary: Hive sync fails to register tables partitioned by Date 
Type column
 Key: HUDI-607
 URL: https://issues.apache.org/jira/browse/HUDI-607
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Hive Integration
Reporter: Udit Mehrotra


h2. Issue Description

As part of the Spark to Avro conversion, Spark's *Date* type is represented as 
the corresponding *Date Logical Type* in Avro, which is backed underneath by a 
physical *Integer* type. For this reason, when the Avro records are formed from 
Spark rows, a date is converted to the corresponding *epoch day* and stored as 
an *Integer* value in the parquet files.

However, this becomes a problem when a *Date Type* column is chosen as the 
partition column. In this case, Hudi's partition column 
*_hoodie_partition_path* also gets the corresponding *epoch day integer* value 
when the partition field is read from the Avro record, and as a result syncing 
partitions of the Hudi table issues a command like the following, where the date 
is an integer:
{noformat}
ALTER TABLE uditme_hudi.uditme_hudi_events_cow_feb05_00 ADD IF NOT EXISTS   
PARTITION (event_date='17897') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17897'
   PARTITION (event_date='17898') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17898'
   PARTITION (event_date='17899') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17899'
   PARTITION (event_date='17900') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17900'{noformat}
Hive is not able to make sense of partition field values like *17897*, because 
it cannot convert such a string to the corresponding date; it expects the actual 
date to be represented in string form.

So, we need to make sure that Hudi's partition field gets the actual date value 
in string form instead of the integer. This change makes sure that when a 
field's value is retrieved from the Avro record, we check whether it is of the 
*Date Logical Type* and, if so, return the actual date value instead of the 
epoch day. After this change, the command issued to sync partitions looks like:
{noformat}
ALTER TABLE `uditme_hudi`.`uditme_hudi_events_cow_feb05_01` ADD IF NOT EXISTS   
PARTITION (`event_date`='2019-01-01') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-01'
   PARTITION (`event_date`='2019-01-02') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-02'
   PARTITION (`event_date`='2019-01-03') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-03'
   PARTITION (`event_date`='2019-01-04') LOCATION 
's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-04'{noformat}
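
A sketch of the kind of conversion involved (illustrative only, not the actual 
patch; the method name here is made up):
{code:scala}
import java.time.LocalDate
import org.apache.avro.Schema

// If the Avro field carries the `date` logical type, turn the epoch-day Integer
// back into an ISO date string before it is used as the partition value.
def partitionValue(fieldSchema: Schema, rawValue: Any): String =
  if (fieldSchema.getLogicalType != null && fieldSchema.getLogicalType.getName == "date")
    LocalDate.ofEpochDay(rawValue.asInstanceOf[Int].toLong).toString // e.g. 17897 -> "2019-01-01"
  else
    String.valueOf(rawValue)
{code}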
h2. Stack Trace
{noformat}
20/01/13 23:28:04 INFO HoodieHiveClient: Last commit time synced is not known, 
listing all partitions in 
s3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar,FS
 :com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@1f0c8e1f
20/01/13 23:28:08 INFO HiveSyncTool: Storage partitions scan complete. Found 31
20/01/13 23:28:08 INFO HiveSyncTool: New Partitions [18206, 18207, 18208, 
18209, 18210, 18211, 18212, 18213, 18214, 18215, 18216, 18217, 18218, 18219, 
18220, 18221, 18222, 18223, 18224, 18225, 18226, 18227, 18228, 18229, 18230, 
18231, 18232, 18233, 18234, 18235, 18236]
20/01/13 23:28:08 INFO HoodieHiveClient: Adding partitions 31 to table 
fact_hourly_search_term_conversions_hudi_mor_hudi_jar
20/01/13 23:28:08 INFO HoodieHiveClient: Executing SQL ALTER TABLE 
default.fact_hourly_search_term_conversions_hudi_mor_hudi_jar ADD IF NOT EXISTS 
  PARTITION (dim_date='18206') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18206'
   PARTITION (dim_date='18207') LOCATION $
s3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18207'
   PARTITION (dim_date='18208') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18208'
   PARTITION (dim_date='18209') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_$
n_read_aws_hudi_jar/18209'   PARTITION (dim_date='18210') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18210'
   PARTITION (dim_date='18211') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18211'
   PARTITION (dim_date='18212') L$
CATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18212'
   PARTITION (dim_date='18213') LOCATION 
's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18213'
   PARTITION 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
vinothchandar commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585504515
 
 
   cc @bhasudha as well 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-12 Thread GitBox
vinothchandar commented on issue #1329: [SUPPORT] Presto cannot query 
non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-585504364
 
 
   @bhasudha can you please help out here




[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1328: Hudi upsert hangs

2020-02-12 Thread GitBox
vinothchandar edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585503340
 
 
   Reposting my response here.. 
   
   There seem to be a lot of common concerns here.. 
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide is a useful 
resource that can hopefully help here..
   
   Few high level thoughts:
   - It would be good to lay out whether most of the time is spent on the 
indexing stages (the ones tagged with HoodieBloomIndex) or on the actual 
writing.. 
   - Hudi does keep the input in memory to compute the stats it needs to size 
files. So if you don't provide sufficient executor/RDD storage memory, it will 
spill and can cause slowdowns.. (covered in the tuning guide; we have seen this 
happen with users often)
   - On the workload pattern itself, BloomIndex range pruning can be turned off 
via https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges if 
the key ranges are random anyway.. Generally speaking, unless we have RFC-8 
(record level indexing), cases of randomly writing/upserting the majority of the 
rows in a table may incur bloom index overhead, since the bloom filters/ranges 
are not at all useful in pruning out files. We have an interim solution coming 
out in the next release: falling back to a plain old join to implement the 
indexing. 
   - In terms of MOR vs COW, MOR will help only if you have lots of updates and 
the bottleneck is on the writing.. 
   - If listing is an issue, please turn on the following so the table is listed 
once and we re-use the filesystem metadata: hoodie.embed.timeline.server=true
   
   I would appreciate a JIRA, so that I can break each of these into sub-tasks 
and tackle/resolve them independently..
   
   I am personally focusing on performance now and want to make it a lot faster 
in the 0.6.0 release. So all this help would be deeply appreciated.
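   
   For reference, a sketch of how the two knobs mentioned above could be passed 
through the Spark datasource writer (`df`, `hudiOptions`, and the path are 
placeholders, and the config key names are my reading of the configuration page 
linked above):
   
   ```
   df.write.format("org.apache.hudi")
     .options(hudiOptions)                                  // usual table/key options
     .option("hoodie.bloom.index.prune.by.ranges", "false") // skip range pruning when keys are effectively random
     .option("hoodie.embed.timeline.server", "true")        // list the table once and re-use filesystem metadata
     .mode("Append")
     .save("s3://bucket/path/to/table")                     // placeholder path
   ```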



[GitHub] [incubator-hudi] vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to 
support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585487447
 
 
   Or is it this PR? 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to 
support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585487391
 
 
   @n3nash @yanghua once ready, please open a PR against master.. Love to do 
one final review before merging.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1326: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0

2020-02-12 Thread GitBox
lamber-ken commented on issue #1326: 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :0
URL: https://github.com/apache/incubator-hudi/issues/1326#issuecomment-585482228
 
 
   @matthewLiem welcome :)
   
   **Recording some notes:**
   
   Change `lit(123456)` to `lit(123456L)`
   
   ```
   val updateDF = inputDF.withColumn("run_detail_id", lit(123456))
   ```
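   
   That is, roughly (a sketch of the corrected line; the Long literal matches 
the Long type of the column already in the table):
   
   ```
   // Use a Long literal so the new column's type matches the existing Long column
   val updateDF = inputDF.withColumn("run_detail_id", lit(123456L))
   ```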
   
   Reproduce steps:
   ```
   import org.apache.spark.sql.functions._
   import org.apache.spark.sql.SparkSession
   
   val inputDataPath = "file:///tmp/test/ttt/*"
   val hudiTableName = "hudi_identity"
   val hudiTablePath = "file:///tmp/test/nnn"
   
   val hudiOptions = Map[String,String](
     "hoodie.datasource.write.recordkey.field" -> "auth_id",
     "hoodie.table.name" -> hudiTableName,
     "hoodie.datasource.write.precombine.field" -> "last_mod_time")
   
   // create
   val inputDF = spark.read.format("parquet").load(inputDataPath)
   inputDF.write.format("org.apache.hudi").options(hudiOptions).mode("Overwrite").save(hudiTablePath)
   
   // update
   val inputDF = spark.read.format("parquet").load(inputDataPath)
   val updateDF = inputDF.withColumn("run_detail_id", lit(123456))
   updateDF.write.format("org.apache.hudi").options(hudiOptions).mode("Append").save(hudiTablePath)
   ```
   
   
[data.parquet.zip](https://github.com/apache/incubator-hudi/files/4195852/data.parquet.zip)
   




[GitHub] [incubator-hudi] bwu2 edited a comment on issue #1328: Hudi upsert hangs

2020-02-12 Thread GitBox
bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585471970
 
 
   An update on this: the job did eventually complete, but it took 2.4 hours 
(other similar jobs ran for more than 9 hours before we had to kill them). This 
doesn't seem reasonable for such a small table. 
   
   Also, I changed the bulk insert parallelism from 1 to 2 (and doubled the 
data size) and the problem still existed.




[GitHub] [incubator-hudi] matthewLiem closed issue #1326: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0

2020-02-12 Thread GitBox
matthewLiem closed issue #1326: 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :0
URL: https://github.com/apache/incubator-hudi/issues/1326
 
 
   




[GitHub] [incubator-hudi] matthewLiem commented on issue #1326: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0

2020-02-12 Thread GitBox
matthewLiem commented on issue #1326: 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :0
URL: https://github.com/apache/incubator-hudi/issues/1326#issuecomment-585446206
 
 
   Thanks for the help @lamber-ken. The issue was due to a mismatch in data 
types between the Hudi table and the DF we're looking to UPSERT. Casting 
properly and ensuring the schema types matched across both resolved the issue. 




[GitHub] [incubator-hudi] n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-12 Thread GitBox
n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585444365
 
 
   @yanghua merged the pom PR, can you rebase this so we can merge this PR ? I 
think we have gone through it multiple times, let's have the first version 
committed so we can come up with more action items around this 




[incubator-hudi] branch master updated (d2c872e -> 63b4216)

2020-02-12 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from d2c872e  [HUDI-605] Avoid calculating the size of schema redundantly 
(#1317)
 add 63b4216  CLI - add option to print additional commit metadata

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/cli/commands/CommitsCommand.java   | 66 +-
 1 file changed, 63 insertions(+), 3 deletions(-)



[GitHub] [incubator-hudi] n3nash merged pull request #1318: [HUDI-571] CLI - add option to print additional commit metadata

2020-02-12 Thread GitBox
n3nash merged pull request #1318: [HUDI-571] CLI - add option to print 
additional commit metadata
URL: https://github.com/apache/incubator-hudi/pull/1318
 
 
   




[GitHub] [incubator-hudi] popart opened a new issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-12 Thread GitBox
popart opened a new issue #1329: [SUPPORT] Presto cannot query non-partitioned 
table
URL: https://github.com/apache/incubator-hudi/issues/1329
 
 
   **Describe the problem you faced**
   
   I made a non-partitioned Hudi table using Spark. I was able to query it with 
Spark and Hive, but when I tried querying it with Presto, I received the error 
`Could not find partitionDepth in partition metafile`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Use an emr-5.28.0 cluster
   2. Run spark shell: 
   ```
   spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --deploy-mode client
   ```
   3. Run spark code:
   ```
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.hive._
   import org.apache.hudi.keygen.NonpartitionedKeyGenerator
   
   val inputPath = "s3://path/to/a/parquet/file"
   val tableName = "my_test_table"
   val basePath = "s3://test-bucket/my_test_table"
   
   val inputDf = spark.read.parquet(inputPath)
   
   val hudiOptions = Map[String,String](
     RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
     PRECOMBINE_FIELD_OPT_KEY -> "update_time",
     TABLE_NAME -> tableName,
     KEYGENERATOR_CLASS_OPT_KEY -> classOf[NonpartitionedKeyGenerator].getCanonicalName, // needed for non-partitioned table
     HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[NonPartitionedExtractor].getCanonicalName, // needed for non-partitioned table
     OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
     HIVE_SYNC_ENABLED_OPT_KEY -> "true",
     HIVE_TABLE_OPT_KEY -> tableName,
     TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
     "hoodie.bulkinsert.shuffle.parallelism" -> "10")
   
   inputDf.write.format("org.apache.hudi").
     options(hudiOptions).
     mode(Overwrite).
     save(basePath)
   ```
   4. Querying the table in Spark or Hive both work
   5. Querying the table in Presto fails
   ```
   [hadoop@ip-172-31-128-118 ~]$ presto-cli --catalog hive --schema default
   presto:default> select count(*) from my_test_table;
   
   Query 20200211_185123_00018_pruwt, FAILED, 1 node
   Splits: 17 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20200211_185123_00018_pruwt failed: Could not find partitionDepth in 
partition metafile
   com.facebook.presto.spi.PrestoException: Could not find partitionDepth in 
partition metafile
 at 
com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:200)
 at 
com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
 at 
com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
 at 
com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
 at 
io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: Could not find 
partitionDepth in partition metafile
 at 
org.apache.hudi.common.model.HoodiePartitionMetadata.getPartitionDepth(HoodiePartitionMetadata.java:75)
 at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:209)
 at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.groupFileStatus(HoodieParquetInputFormat.java:158)
 at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:69)
 at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
 at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:371)
 at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:264)
 at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:96)
 at 
com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:193)
 ... 7 more
   ```
   
   **Expected behavior**
   
   Presto should return a count of all the rows. Other Presto queries should 
succeed.
   
   **Environment Description**
   
   * EMR version: emr-5.28.0
   
   * Hudi version : 0.5.1-incubating
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Presto version: 0.277
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   Included in 

[jira] [Commented] (HUDI-407) Implement a join-based index

2020-02-12 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035721#comment-17035721
 ] 

Vinoth Chandar commented on HUDI-407:
-

[~shivnarayan] are you working on this? if not, would love to take a stab at 
this 

> Implement a join-based index
> 
>
> Key: HUDI-407
> URL: https://issues.apache.org/jira/browse/HUDI-407
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie, Performance
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>






[jira] [Updated] (HUDI-554) Restructure code/packages to move more code back into hudi-writer-common

2020-02-12 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-554:

Fix Version/s: 0.6.0

> Restructure code/packages  to move more code back into hudi-writer-common
> -
>
> Key: HUDI-554
> URL: https://issues.apache.org/jira/browse/HUDI-554
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>






[jira] [Commented] (HUDI-554) Restructure code/packages to move more code back into hudi-writer-common

2020-02-12 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035718#comment-17035718
 ] 

Vinoth Chandar commented on HUDI-554:
-

[~yanghua] sg.. a branch seems good when we actually break it up and iterate as 
you mention... For this JIRA, let me just target master, since this probably 
needs to happen anyway. 

I am trying to get something up in the next few days.

> Restructure code/packages  to move more code back into hudi-writer-common
> -
>
> Key: HUDI-554
> URL: https://issues.apache.org/jira/browse/HUDI-554
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bwu2 opened a new issue #1328: Hudi upsert hangs

2020-02-12 Thread GitBox
bwu2 opened a new issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328
 
 
   **Describe the problem you faced**
   When we upsert data into Hudi, we're finding that the job just hangs in some 
cases. Specifically, we have an ETL pipeline where we re-ingest a lot of data 
(i.e. we upsert data that already exists in the Hudi table). When the 
proportion of data that is not new is very high, the Hudi spark job seems to 
hang before writing out the updated table.
   
   Note that this currently affects 2 of the 80 tables in our ETL pipeline and 
the rest run fine. 
   
   **To Reproduce**
   See gist at: https://gist.github.com/bwu2/89f98e0926374f71c80e4b2fa5089f18
   
   The code there creates a Hudi table with 4m rows. It then upserts another 4m 
rows, 3.5m of which are the same as the original 4m.
   
   Note that bulk parallelism of the initial load is deliberately set to 1 to 
ensure we avoid lots of small files.
   
   Running this code on an EMR cluster (either interactively in a PySpark shell 
or spark-submit) causes the upsert job never to finish, being stuck somewhere 
in the Spark job with description (from the Spark history server):
   `count at HoodieSparkSqlWriter.scala:255` (after the stage `mapToPair at 
HoodieWriteClient.java:492` and before/during the stage `count at 
HoodieSparkSqlWriter.scala:255`).
   
   For a table this small, cores/memory/executors/instance type shouldn't 
matter, but we have varied these too with no success.
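   
   For illustration only, a minimal sketch of this bulk-insert-then-upsert pattern (Spark Java API; the field names, table name and parallelism settings below are assumptions, not the linked gist's actual code):
   
   ```
   // Hypothetical repro sketch; requires the hudi-spark-bundle on the classpath.
   import org.apache.hudi.DataSourceWriteOptions;
   import org.apache.hudi.config.HoodieWriteConfig;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class UpsertHangRepro {
   
     static void writeHudi(Dataset<Row> df, String path, String operation, SaveMode mode) {
       df.write().format("org.apache.hudi")
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "ts")
           .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), operation)
           .option("hoodie.bulkinsert.shuffle.parallelism", "1") // keep the initial load in few files
           .option(HoodieWriteConfig.TABLE_NAME, "upsert_hang_repro")
           .mode(mode)
           .save(path);
     }
   
     static void run(Dataset<Row> initial4m, Dataset<Row> overlapping4m, String path) {
       // 1) initial load of 4m rows
       writeHudi(initial4m, path, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL(), SaveMode.Overwrite);
       // 2) upsert 4m rows, ~3.5m of which carry record keys already present in the table
       writeHudi(overlapping4m, path, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL(), SaveMode.Append);
     }
   }
   ```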
   
   **Expected behavior**
   Expected the upsert job to succeed and the total number of rows in the table 
to be 4.5m.
   
   **Environment Description**

   * EMR version: 5.29.0

   * Hudi version : tested on 0.5.0, 0.5.1 and latest build off master
   
   * Spark version : 2.4.4
   
   * Hive version : N/A
   
   * Hadoop version : 2.8.5 (Amazon)
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on issue #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
satishkotha commented on issue #1320: [HUDI-571] Add min/max headers on 
archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#issuecomment-585398180
 
 
   @n3nash thanks. looks like this may need some more discussion. Take a look 
at my reply and let me know what you think.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378489085
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
* new enums at the end.
*/
   public enum HeaderMetadataType {
-INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, 
MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   I don't have a strong opinion either way. These do seem generic enough that 
we could use them for non-archived blocks too.
   
   Also, nested enums could become cumbersome, especially since we persist 
ordinals in metadata. I can test to see if it works.
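   
   To make the ordinal concern concrete, a small generic Java illustration (not Hudi code) of why new constants must be appended at the end when ordinals are persisted:
   
   ```
   // Generic illustration only: persisted ordinals silently remap if a constant is
   // inserted anywhere other than at the end of the enum.
   enum HeaderV1 { INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE }
   
   // A later version that inserts MIN_INSTANT_TIME in the middle instead of appending it.
   enum HeaderV2 { INSTANT_TIME, TARGET_INSTANT_TIME, MIN_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE }
   
   public class OrdinalDemo {
     public static void main(String[] args) {
       int persisted = HeaderV1.COMMAND_BLOCK_TYPE.ordinal();   // 3, as written by the old code
       System.out.println(HeaderV2.values()[persisted]);        // prints SCHEMA -- wrong header type
       // Appending MIN_INSTANT_TIME/MAX_INSTANT_TIME at the end keeps existing ordinals stable.
     }
   }
   ```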
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378485704
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   I've included the reasoning for including the header block above. Let me 
know. The file is closed after archiving all instants that qualify, so I don't 
think file growth is an issue. Correct me if I'm reading this wrong. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378484755
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
   //read the avro blocks
   while (reader.hasNext()) {
 HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-// TODO If we can store additional metadata in datablock, we can 
skip parsing records
-// (such as startTime, endTime of records in the block)
+if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   No. In the current implementation, the first block tracks the range for the 
entire file. In some cases there are a lot of archived files, and it's much 
faster to skip an entire file when looking at older ranges. 
   
   The overhead of storing this metadata on every block seemed high. By default, 
we group 10 records into one block, which translates to roughly 10KB. A min/max 
header on every block adds about 40 bytes, i.e. roughly 40 / 10240 ≈ 0.4% 
overhead, which seemed high to me. Let me know if you think we can ignore that 
overhead; I can move this to per block
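   
   For illustration, a self-contained sketch (stand-in types, not the PR code) of the file-level skip this enables when only the first block of an archived file carries the min/max range:
   
   ```
   // Stand-in sketch; the real code uses the Hudi log reader and HoodieAvroDataBlock.
   import java.util.Iterator;
   import java.util.Map;
   
   class ArchiveScanSketch {
     interface Block { Map<String, String> header(); }
     interface TimeRangeFilter { boolean isOutOfRange(String minInstant, String maxInstant); }
   
     static void scan(Iterator<Block> reader, TimeRangeFilter filter) {
       while (reader.hasNext()) {
         Block blk = reader.next();
         String min = blk.header().get("MIN_INSTANT_TIME");
         String max = blk.header().get("MAX_INSTANT_TIME");
         if (min != null && max != null && filter.isOutOfRange(min, max)) {
           break; // the first block describes the whole file, so later blocks cannot match
         }
         // ...otherwise parse the records in this block...
       }
     }
   }
   ```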


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1309: [HUDI-592] Remove duplicated dependencies in the pom file of test suite module

2020-02-12 Thread GitBox
n3nash commented on issue #1309: [HUDI-592] Remove duplicated dependencies in 
the pom file of test suite module
URL: https://github.com/apache/incubator-hudi/pull/1309#issuecomment-585392157
 
 
   Retriggered the build and it passes. I think there is still some flakiness 
in the build; let's investigate that separately.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (044759a -> 5f22849)

2020-02-12 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 044759a  [HUDI-503] Add hudi test suite documentation into the README 
file of the test suite module (#1191)
 add 5f22849  [HUDI-592] Remove duplicated dependencies in the pom file of 
test suite module

No new revisions were added by this update.

Summary of changes:
 hudi-test-suite/pom.xml | 36 
 1 file changed, 36 deletions(-)



[GitHub] [incubator-hudi] n3nash merged pull request #1309: [HUDI-592] Remove duplicated dependencies in the pom file of test suite module

2020-02-12 Thread GitBox
n3nash merged pull request #1309: [HUDI-592] Remove duplicated dependencies in 
the pom file of test suite module
URL: https://github.com/apache/incubator-hudi/pull/1309
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378480109
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
 
 Review comment:
   ok, will do


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on issue #1318: [HUDI-571] CLI - add option to print additional commit metadata

2020-02-12 Thread GitBox
satishkotha commented on issue #1318: [HUDI-571] CLI - add option to print 
additional commit metadata
URL: https://github.com/apache/incubator-hudi/pull/1318#issuecomment-585385524
 
 
   @n3nash addressed comments. Please take a look.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1312: [HUDI-571] Add 
"compactions show archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#discussion_r378475212
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -95,51 +101,9 @@ public String compactionsAll(
   throws IOException {
 HoodieTableMetaClient client = checkAndGetMetaClient();
 HoodieActiveTimeline activeTimeline = client.getActiveTimeline();
-HoodieTimeline timeline = activeTimeline.getCommitsAndCompactionTimeline();
-HoodieTimeline commitTimeline = 
activeTimeline.getCommitTimeline().filterCompletedInstants();
-Set committed = 
commitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toSet());
-
-List instants = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
-List rows = new ArrayList<>();
-for (HoodieInstant instant : instants) {
-  HoodieCompactionPlan compactionPlan = null;
-  if (!HoodieTimeline.COMPACTION_ACTION.equals(instant.getAction())) {
-try {
-  // This could be a completed compaction. Assume a compaction request 
file is present but skip if fails
-  compactionPlan = AvroUtils.deserializeCompactionPlan(
-  activeTimeline.readCompactionPlanAsBytes(
-  
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-} catch (HoodieIOException ioe) {
-  // SKIP
-}
-  } else {
-compactionPlan = 
AvroUtils.deserializeCompactionPlan(activeTimeline.readCompactionPlanAsBytes(
-
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-  }
-
-  if (null != compactionPlan) {
-State state = instant.getState();
-if (committed.contains(instant.getTimestamp())) {
-  state = State.COMPLETED;
-}
-if (includeExtraMetadata) {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size(),
-  compactionPlan.getExtraMetadata().toString()});
-} else {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size()});
-}
-  }
-}
-
-Map> fieldNameToConverterMap = new 
HashMap<>();
-TableHeader header = new TableHeader().addTableHeaderField("Compaction 
Instant Time").addTableHeaderField("State")
-.addTableHeaderField("Total FileIds to be Compacted");
-if (includeExtraMetadata) {
-  header = header.addTableHeaderField("Extra Metadata");
-}
-return HoodiePrintHelper.print(header, fieldNameToConverterMap, 
sortByField, descending, limit, headerOnly, rows);
+return printAllCompactions(activeTimeline,
+compactionPlanReader(this::readCompactionPlanForActiveTimeline, 
activeTimeline),
 
 Review comment:
   @nbalajee printAllCompactions only calls compactionPlanReader for commits 
and compactions. As part of the refactor, timeline.getCommitsAndCompactionTimeline 
has been moved into printAllCompactions (to reuse it between the active and 
archived timelines). I just verified this works as expected even in the presence 
of cleans. 
   Let me know if you think there is a better way to organize this.
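   
   Roughly, the reuse pattern looks like the following sketch (stand-in names, not the actual CompactionCommand code): the same print routine is driven by whichever compaction-plan reader is passed in.
   
   ```
   // Stand-in sketch of the shared print path for active and archived timelines.
   import java.util.List;
   import java.util.function.Function;
   import java.util.stream.Collectors;
   
   class CompactionPrintSketch {
     static <I, P> List<String> printAllCompactions(List<I> commitsAndCompactions,
                                                    Function<I, P> planReader) {
       // The caller supplies either an active-timeline or an archived-timeline reader,
       // so the rendering logic is written once and shared by both CLI commands.
       return commitsAndCompactions.stream()
           .map(instant -> instant + " -> " + planReader.apply(instant))
           .collect(Collectors.toList());
     }
   }
   ```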


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1318: [HUDI-571] CLI - add option to print additional commit metadata

2020-02-12 Thread GitBox
satishkotha commented on a change in pull request #1318: [HUDI-571] CLI - add 
option to print additional commit metadata
URL: https://github.com/apache/incubator-hudi/pull/1318#discussion_r378471663
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
 ##
 @@ -100,8 +100,58 @@ private String printCommits(HoodieDefaultTimeline 
timeline,
 return HoodiePrintHelper.print(header, fieldNameToConverterMap, 
sortByField, descending, limit, headerOnly, rows);
   }
 
+  private String printCommitsWithMetadata(HoodieDefaultTimeline timeline,
+  final Integer limit, final String sortByField,
+  final boolean descending,
+  final boolean headerOnly) throws IOException {
+final List rows = new ArrayList<>();
+
+final List commits = 
timeline.getCommitsTimeline().filterCompletedInstants()
+.getInstants().collect(Collectors.toList());
+// timeline can be read from multiple files. So sort is needed instead of 
reversing the collection
+Collections.sort(commits, HoodieInstant.COMPARATOR.reversed());
+
+for (int i = 0; i < commits.size(); i++) {
+  final HoodieInstant commit = commits.get(i);
+  final HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(
+  timeline.getInstantDetails(commit).get(),
+  HoodieCommitMetadata.class);
+
+  for (Map.Entry> partitionWriteStat :
+  commitMetadata.getPartitionToWriteStats().entrySet()) {
+for (HoodieWriteStat hoodieWriteStat : partitionWriteStat.getValue()) {
+  rows.add(new Comparable[]{ commit.getAction(), 
commit.getTimestamp(), hoodieWriteStat.getPartitionPath(),
+  hoodieWriteStat.getFileId(), 
hoodieWriteStat.getPrevCommit(), hoodieWriteStat.getNumWrites(),
+  hoodieWriteStat.getNumInserts(), 
hoodieWriteStat.getNumDeletes(),
+  hoodieWriteStat.getNumUpdateWrites(), 
hoodieWriteStat.getTotalWriteErrors(),
+  hoodieWriteStat.getTotalLogBlocks(), 
hoodieWriteStat.getTotalCorruptLogBlock(),
+  hoodieWriteStat.getTotalRollbackBlocks(), 
hoodieWriteStat.getTotalLogRecords(),
+  hoodieWriteStat.getTotalUpdatedRecordsCompacted(), 
hoodieWriteStat.getTotalWriteBytes()
+  });
+}
+  }
+}
+
+final Map> fieldNameToConverterMap = new 
HashMap<>();
+fieldNameToConverterMap.put("Total Bytes Written", entry -> {
+      return NumericUtils.humanReadableByteCount((Double.valueOf(entry.toString())));
+});
+
+TableHeader header = new 
TableHeader().addTableHeaderField("action").addTableHeaderField("instant")
+
.addTableHeaderField("partition").addTableHeaderField("file_id").addTableHeaderField("prev_instant")
+
.addTableHeaderField("num_writes").addTableHeaderField("num_inserts").addTableHeaderField("num_deletes")
 
 Review comment:
   Good point. Converted to camelCase to be consistent.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502
 
 
   i've managed to narrow down the issue to the data that is coming off the 
kinesis stream.
   
   when i replace the data from the stream with some test data as follows
   
   with the following code:
   
   ```
if (!rdd.isEmpty()){
   val json = rdd.map(record=>new String(record))
   val dataFrame = spark.read.json(json)
   dataFrame.printSchema();
   dataFrame.show();
   
   
   val hudiTableName = "order"
   val hudiTablePath = path + hudiTableName
   
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   // Write data into the Hudi dataset
 
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   ```
   
   i replaced
   
   ```
val dataFrame = spark.read.json(json)
   ```
   
   with
   
   ```
   val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
   ```
   
   and the `select * from table` worked as well as nested query `select id, 
bar.id, bar.name from table`
   
   So at this stage it's looking like there's an issue with the data and how 
it's coming off the kinesis stream
   
   Update:
   I've pasted the `dataFrame.show()` output for the rdd off the stream here:
   ```
   
   +--------+-------+--------------------+--------+--------------------+-------------+
   |clientId|eventId|      eventTimestamp|      id|               order|  typeOfEvent|
   +--------+-------+--------------------+--------+--------------------+-------------+
   |     369| 115423|2020-02-12T15:54:...|34551840|[External, [[Aust...|order-created|
   +--------+-------+--------------------+--------+--------------------+-------------+
   ```
   Taking the data off the stream but ignoring the nested column 'order' and 
the `select * from table` query works.
   ```
   val jsonFrame = spark.read.json(json)
   val dataFrame = jsonFrame.select("clientid", "eventId", "id", "typeOfEvent")
   ```
   
   Other than being nested, is there something about the order column that 
would make this happen?
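   
   One way to dig into that (sketched here with the Spark Java API; the helper below is hypothetical) is to inspect what `spark.read.json` inferred for the nested `order` column before writing to Hudi:
   
   ```
   // Hypothetical inspection helper; the snippets above are Scala, this uses the Java API.
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   class InspectOrderColumn {
     static void inspect(SparkSession spark, Dataset<String> json) {
       Dataset<Row> df = spark.read().json(json);
       df.select("order").printSchema();       // shows whether 'order' was inferred as a struct, array, etc.
       df.selectExpr("order.*").printSchema(); // flattens one level if 'order' is a struct
     }
   }
   ```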
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hmatu commented on a change in pull request #1242: [HUDI-544] Adjust the read and write path of archive

2020-02-12 Thread GitBox
hmatu commented on a change in pull request #1242: [HUDI-544] Adjust the read 
and write path of archive
URL: https://github.com/apache/incubator-hudi/pull/1242#discussion_r378446313
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 ##
 @@ -138,9 +139,11 @@ public String showCommits(
   throws IOException {
 
 System.out.println("===> Showing only " + limit + " archived 
commits <===");
-String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+String basePath = metaClient.getBasePath();
+Path archivePath = new Path(metaClient.getArchivePath() + 
"/.commits_.archive*");
 FileStatus[] fsStatuses =
-FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(new Path(basePath + 
"/.hoodie/.commits_.archive*"));
+FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(archivePath);
 
 Review comment:
   @n3nash, this PR aims to "Adjust the read and write path of archive". IMO, 
it is unnecessary to create a new PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1216: [HUDI-525] lack of insert info in delta_commit inflight

2020-02-12 Thread GitBox
n3nash commented on issue #1216: [HUDI-525] lack of insert info in delta_commit 
inflight
URL: https://github.com/apache/incubator-hudi/pull/1216#issuecomment-585358877
 
 
   @liujianhuiouc on second thought, do you have a definite use case for 
knowing the number of inserts during a failed write? Since there are no 
fileIds, this information is actually useless to Hudi. This intermediate 
commit metadata is meant for internal use by Hudi and not by external 
consumers, so I don't see the value in this. WDYT?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1324: Presto - select * from table does not work

2020-02-12 Thread GitBox
adamjoneill commented on issue #1324: Presto - select * from table does not work
URL: https://github.com/apache/incubator-hudi/issues/1324#issuecomment-585357996
 
 
   I'll close this; I believe it to be related to 
https://github.com/apache/incubator-hudi/issues/1325


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill closed issue #1324: Presto - select * from table does not work

2020-02-12 Thread GitBox
adamjoneill closed issue #1324: Presto - select * from table does not work
URL: https://github.com/apache/incubator-hudi/issues/1324
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1242: [HUDI-544] Adjust the read and write path of archive

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1242: [HUDI-544] Adjust the read 
and write path of archive
URL: https://github.com/apache/incubator-hudi/pull/1242#discussion_r378441023
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 ##
 @@ -138,9 +139,11 @@ public String showCommits(
   throws IOException {
 
 System.out.println("===> Showing only " + limit + " archived 
commits <===");
-String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+String basePath = metaClient.getBasePath();
+Path archivePath = new Path(metaClient.getArchivePath() + 
"/.commits_.archive*");
 FileStatus[] fsStatuses =
-FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(new Path(basePath + 
"/.hoodie/.commits_.archive*"));
+FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(archivePath);
 
 Review comment:
   @hmatu I think this is just a cleanup of the code to remove the hard-coded 
"archived" and provide it through the archive folder name. This is fine IMO 
since it does not break any backwards compatibility. @hddong can you confirm?
   What you are referring to is also correct, but that should be another PR, to 
allow reading a path pattern for `show archived commits` just like the others 
have.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1318: [HUDI-571] CLI - add option to print additional commit metadata

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1318: [HUDI-571] CLI - add option 
to print additional commit metadata
URL: https://github.com/apache/incubator-hudi/pull/1318#discussion_r378435736
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
 ##
 @@ -100,8 +100,58 @@ private String printCommits(HoodieDefaultTimeline 
timeline,
 return HoodiePrintHelper.print(header, fieldNameToConverterMap, 
sortByField, descending, limit, headerOnly, rows);
   }
 
+  private String printCommitsWithMetadata(HoodieDefaultTimeline timeline,
+  final Integer limit, final String sortByField,
+  final boolean descending,
+  final boolean headerOnly) throws IOException {
+final List rows = new ArrayList<>();
+
+final List commits = 
timeline.getCommitsTimeline().filterCompletedInstants()
+.getInstants().collect(Collectors.toList());
+// timeline can be read from multiple files. So sort is needed instead of 
reversing the collection
+Collections.sort(commits, HoodieInstant.COMPARATOR.reversed());
+
+for (int i = 0; i < commits.size(); i++) {
+  final HoodieInstant commit = commits.get(i);
+  final HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(
+  timeline.getInstantDetails(commit).get(),
+  HoodieCommitMetadata.class);
+
+  for (Map.Entry> partitionWriteStat :
+  commitMetadata.getPartitionToWriteStats().entrySet()) {
+for (HoodieWriteStat hoodieWriteStat : partitionWriteStat.getValue()) {
+  rows.add(new Comparable[]{ commit.getAction(), 
commit.getTimestamp(), hoodieWriteStat.getPartitionPath(),
+  hoodieWriteStat.getFileId(), 
hoodieWriteStat.getPrevCommit(), hoodieWriteStat.getNumWrites(),
+  hoodieWriteStat.getNumInserts(), 
hoodieWriteStat.getNumDeletes(),
+  hoodieWriteStat.getNumUpdateWrites(), 
hoodieWriteStat.getTotalWriteErrors(),
+  hoodieWriteStat.getTotalLogBlocks(), 
hoodieWriteStat.getTotalCorruptLogBlock(),
+  hoodieWriteStat.getTotalRollbackBlocks(), 
hoodieWriteStat.getTotalLogRecords(),
+  hoodieWriteStat.getTotalUpdatedRecordsCompacted(), 
hoodieWriteStat.getTotalWriteBytes()
+  });
+}
+  }
+}
+
+final Map> fieldNameToConverterMap = new 
HashMap<>();
+fieldNameToConverterMap.put("Total Bytes Written", entry -> {
+      return NumericUtils.humanReadableByteCount((Double.valueOf(entry.toString())));
+});
+
+TableHeader header = new 
TableHeader().addTableHeaderField("action").addTableHeaderField("instant")
+
.addTableHeaderField("partition").addTableHeaderField("file_id").addTableHeaderField("prev_instant")
+
.addTableHeaderField("num_writes").addTableHeaderField("num_inserts").addTableHeaderField("num_deletes")
 
 Review comment:
   Do we add names in snake_case in other header commands ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nbalajee commented on a change in pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-12 Thread GitBox
nbalajee commented on a change in pull request #1312: [HUDI-571] Add 
"compactions show archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#discussion_r378433086
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -95,51 +101,9 @@ public String compactionsAll(
   throws IOException {
 HoodieTableMetaClient client = checkAndGetMetaClient();
 HoodieActiveTimeline activeTimeline = client.getActiveTimeline();
-HoodieTimeline timeline = activeTimeline.getCommitsAndCompactionTimeline();
-HoodieTimeline commitTimeline = 
activeTimeline.getCommitTimeline().filterCompletedInstants();
-Set committed = 
commitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toSet());
-
-List instants = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
-List rows = new ArrayList<>();
-for (HoodieInstant instant : instants) {
-  HoodieCompactionPlan compactionPlan = null;
-  if (!HoodieTimeline.COMPACTION_ACTION.equals(instant.getAction())) {
-try {
-  // This could be a completed compaction. Assume a compaction request 
file is present but skip if fails
-  compactionPlan = AvroUtils.deserializeCompactionPlan(
-  activeTimeline.readCompactionPlanAsBytes(
-  
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-} catch (HoodieIOException ioe) {
-  // SKIP
-}
-  } else {
-compactionPlan = 
AvroUtils.deserializeCompactionPlan(activeTimeline.readCompactionPlanAsBytes(
-
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-  }
-
-  if (null != compactionPlan) {
-State state = instant.getState();
-if (committed.contains(instant.getTimestamp())) {
-  state = State.COMPLETED;
-}
-if (includeExtraMetadata) {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size(),
-  compactionPlan.getExtraMetadata().toString()});
-} else {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size()});
-}
-  }
-}
-
-Map> fieldNameToConverterMap = new 
HashMap<>();
-TableHeader header = new TableHeader().addTableHeaderField("Compaction 
Instant Time").addTableHeaderField("State")
-.addTableHeaderField("Total FileIds to be Compacted");
-if (includeExtraMetadata) {
-  header = header.addTableHeaderField("Extra Metadata");
-}
-return HoodiePrintHelper.print(header, fieldNameToConverterMap, 
sortByField, descending, limit, headerOnly, rows);
+return printAllCompactions(activeTimeline,
+compactionPlanReader(this::readCompactionPlanForActiveTimeline, 
activeTimeline),
 
 Review comment:
   Pass timeline containing commit + compaction actions only, instead of 
activeTimeline (which may have other actions)?
   
   HoodieTimeline timeline = activeTimeline.getCommitsAndCompactionTimeline();
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378431282
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
* new enums at the end.
*/
   public enum HeaderMetadataType {
-INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, 
MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   How about we don't add the MIN_INSTANT_TIME & MAX_INSTANT_TIME to the 
HeaderMetadataType but create an enum like
   ```
   public enum HeaderMetadataType {
     INSTANT_TIME,
     TARGET_INSTANT_TIME,
     SCHEMA,
     COMMAND_BLOCK_TYPE;

     public enum ArchivedLogHeaderMetadataType {
       MIN_INSTANT_TIME;
     }
   }
   ```
   It looks like we are just overloading `HeaderMetadataType` with information 
that pertains only to archived files, which is going to cause confusion when 
reading actual data blocks. WDYT?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378427516
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   Move the writing of the header to this part: basically, augment the same 
DataBlock that has the archived records with the metadata information you want 
to push here. We already write the schema, so just add more entries (like 
above) to the headers here. Then you will be able to read each block and 
filter based on whether the block should be considered or not. This is more 
generic than adding an extra empty log block to track min/max over the entire 
file (which is hard since the file keeps growing anyway). 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378425612
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
   //read the avro blocks
   while (reader.hasNext()) {
 HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-// TODO If we can store additional metadata in datablock, we can 
skip parsing records
-// (such as startTime, endTime of records in the block)
+if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   We only know that the data is out of range from the block, and not from the 
whole file, right? So this should be `continue` instead of `break`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378424618
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
 
 Review comment:
   It is unclear in what order HoodieInstant.COMPARATOR sorts the instants; 
can we put a comment here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502
 
 
   i've managed to narrow down the issue to the data that is coming off the 
kinesis stream.
   
   when i replace the data from the stream with some test data as follows
   
   with the following code:
   
   ```
if (!rdd.isEmpty()){
   val json = rdd.map(record=>new String(record))
   val dataFrame = spark.read.json(json)
   dataFrame.printSchema();
   dataFrame.show();
   }
   
   val hudiTableName = "order"
   val hudiTablePath = path + hudiTableName
   
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   // Write data into the Hudi dataset
 
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   ```
   
   i replaced
   
   ```
val dataFrame = spark.read.json(json)
   ```
   
   with
   
   ```
   val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
   ```
   
   and the `select * from table` worked as well as nested query `select id, 
bar.id, bar.name from table`
   
   So at this stage it's looking like there's an issue with the data and how 
it's coming off the kinesis stream


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502
 
 
   i've managed to narrow down the issue to the data that is coming off the 
kinesis stream.
   
   when i replace the data from the stream with some test data as follows
   
   with the following code:
   
   ```
if (!rdd.isEmpty()){
   val json = rdd.map(record=>new String(record))
   val dataFrame = spark.read.json(json)
   dataFrame.printSchema();
   dataFrame.show();
   
   
   val hudiTableName = "order"
   val hudiTablePath = path + hudiTableName
   
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   // Write data into the Hudi dataset
 
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   ```
   
   i replaced
   
   ```
val dataFrame = spark.read.json(json)
   ```
   
   with
   
   ```
   val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
   ```
   
   and the `select * from table` worked as well as nested query `select id, 
bar.id, bar.name from table`
   
   So at this stage it's looking like there's an issue with the data and how 
it's coming off the kinesis stream


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502
 
 
   i've managed to narrow down the issue to the data that is coming off the 
kinesis stream.
   
   when i replace the data from the stream with some test data as follows
   
   with the following code:
   
   ```
   if (!rdd.isEmpty()){
   val json = rdd.map(record=>new String(record))
   val dataFrame = spark.read.json(json)
   dataFrame.printSchema();
   dataFrame.show();
   }
   
   val hudiTableName = "order"
   val hudiTablePath = path + hudiTableName
   
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   // Write data into the Hudi dataset
   
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   ```
   
   i replaced
   
   ```
   val dataFrame = spark.read.json(json)
   ```
   
   with
   
   ```
   val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
   ```
   
   and the `select * from table` worked as well as nested query `select id, 
bar.id, bar.name from table`
   
   So at this stage it's looking like there's an issue with the data and how 
it's coming off the kinesis stream


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf merged pull request #1317: [HUDI-605] Avoid calculating the size of schema redundantly

2020-02-12 Thread GitBox
leesf merged pull request #1317: [HUDI-605] Avoid calculating the size of 
schema redundantly
URL: https://github.com/apache/incubator-hudi/pull/1317
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (e5a69ed -> d2c872e)

2020-02-12 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from e5a69ed  [HUDI-478] Fix too many files with unapproved license when 
execute build_local_docker_images (#1323)
 add d2c872e  [HUDI-605] Avoid calculating the size of schema redundantly 
(#1317)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/common/util/HoodieRecordSizeEstimator.java | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)



[GitHub] [incubator-hudi] lamber-ken commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
lamber-ken commented on issue #1325: presto - querying nested object in parquet 
file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585147638
 
 
   I guess you used `release-0.5.1` or master branch.
   
   `release-0.5.0` use `com.databricks.spark.avro.SchemaConverters`.
   `release-0.5.1` use `org.apache.spark.sql.avro.SchemaConverters`.
   
   Source Code: 
   
https://github.com/apache/incubator-hudi/blob/release-0.5.0/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
   
   
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585128793
 
 
   thanks @lamber-ken 
   
   I've updated my deploy step to remove all references to 
org.apache.spark:spark-avro_2.11:2.4.4
   ```
   aws emr add-steps --cluster-id j-xx --steps 
Type=spark,Name=ScalaStream,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--jars,\'/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-streaming-kinesis-asl-assembly.jar\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,ScalaStream,\
   s3://./simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   and i receive the following error, where no hudi file is created for a 
streaming record
   
   ```
   20/02/12 09:56:45 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
ip-10-10-10-212.ap-southeast-2.compute.internal:33453 in memory (size: 7.9 KB, 
free: 2.6 GB)
   Exception in thread "streaming-job-executor-0" 
java.lang.NoClassDefFoundError: org/apache/spark/sql/avro/SchemaConverters$
at 
org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:80)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at ScalaStream$.handleOrderCreated(stream.scala:39)
at ScalaStream$$anonfun$main$1.apply(stream.scala:110)
at ScalaStream$$anonfun$main$1.apply(stream.scala:82)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at 

[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-12 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585128793
 
 
   thanks @lamber-ken 
   
   I've updated my deploy step to remove all references to 
org.apache.spark:spark-avro_2.11:2.4.4
   ```
   aws emr add-steps --cluster-id j-xx --steps 
Type=spark,Name=ScalaStream,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--jars,\'/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-streaming-kinesis-asl-assembly.jar\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,ScalaStream,\
   s3://./simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   and i receive the following error, where no hudi file is created for a 
streaming record
   
   ```
   20/02/12 09:56:45 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
ip-10-10-10-212.ap-southeast-2.compute.internal:33453 in memory (size: 7.9 KB, 
free: 2.6 GB)
   Exception in thread "streaming-job-executor-0" 
java.lang.NoClassDefFoundError: org/apache/spark/sql/avro/SchemaConverters$
at 
org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:80)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at ScalaStream$.handleOrderCreated(stream.scala:39)
at ScalaStream$$anonfun$main$1.apply(stream.scala:110)
at ScalaStream$$anonfun$main$1.apply(stream.scala:82)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at 

[GitHub] [incubator-hudi] OpenOpened commented on a change in pull request #1200: [HUDI-514] A schema provider to get metadata through Jdbc

2020-02-12 Thread GitBox
OpenOpened commented on a change in pull request #1200: [HUDI-514] A schema 
provider to get metadata through Jdbc
URL: https://github.com/apache/incubator-hudi/pull/1200#discussion_r378134572
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
 ##
 @@ -235,4 +248,57 @@ public static TypedProperties readConfig(InputStream in) 
throws IOException {
 defaults.load(in);
 return defaults;
   }
+
+  /***
+   * call spark function get the schema through jdbc.
+   * The code logic implementation refers to spark 2.4.x and spark 3.x.
+   * @param options
+   * @return
+   * @throws Exception
+   */
+  public static Schema getJDBCSchema(Map<String, String> options) throws Exception {
+    scala.collection.immutable.Map<String, String> ioptions = toScalaImmutableMap(options);
+JDBCOptions jdbcOptions = new JDBCOptions(ioptions);
+Connection conn = JdbcUtils.createConnectionFactory(jdbcOptions).apply();
+String url = jdbcOptions.url();
+String table = jdbcOptions.tableOrQuery();
+JdbcOptionsInWrite jdbcOptionsInWrite = new JdbcOptionsInWrite(ioptions);
+boolean tableExists = JdbcUtils.tableExists(conn, jdbcOptionsInWrite);
+
+if (tableExists) {
+  JdbcDialect dialect = JdbcDialects.get(url);
+  try (PreparedStatement statement = 
conn.prepareStatement(dialect.getSchemaQuery(table))) {
+statement.setQueryTimeout(Integer.parseInt(options.get("timeout")));
+try (ResultSet rs = statement.executeQuery()) {
+  StructType structType;
+  if (Boolean.parseBoolean(ioptions.get("nullable").get())) {
+structType = JdbcUtils.getSchema(rs, dialect, true);
+  } else {
+structType = JdbcUtils.getSchema(rs, dialect, false);
+  }
+  return AvroConversionUtils.convertStructTypeToAvroSchema(structType, 
table, "hoodie." + table);
+}
+  }
+} else {
+  throw new HoodieException(String.format("%s table does not exists!", 
table));
+}
+  }
+
+  /**
+   * Replace java map with scala immutable map.
+   * refers: 
https://stackoverflow.com/questions/11903167/convert-java-util-hashmap-to-scala-collection-immutable-map-in-java/11903737#11903737
 
 Review comment:
   @vinothchandar @leesf I reimplemented the relevant logic using the second 
approach.
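   
   For reference, a sketch of the kind of Java-to-Scala immutable map conversion the linked StackOverflow answer describes (Scala 2.11/2.12 APIs; not necessarily the exact code in this PR):
   
   ```
   // Sketch only; assumes scala-library 2.11/2.12 on the classpath.
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   import scala.Tuple2;
   import scala.collection.JavaConverters;
   
   final class ScalaMapConversionSketch {
     @SuppressWarnings("unchecked")
     static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> javaMap) {
       List<Tuple2<K, V>> tuples = javaMap.entrySet().stream()
           .map(e -> new Tuple2<>(e.getKey(), e.getValue()))
           .collect(Collectors.toList());
       // Map$.MODULE$.apply takes Scala varargs, which surface in Java as a Seq of tuples.
       return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(
           JavaConverters.asScalaBufferConverter(tuples).asScala().toSeq());
     }
   }
   ```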


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services