[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041647#comment-17041647
 ] 

Vinoth Chandar commented on HUDI-625:
-

Following is the test code used to profile these object sizes 

 
{code:java}
@Test
public void test() throws Exception {
  // Size estimator for in-memory object footprints.
  SizeEstimator estimator = new DefaultSizeEstimator();
  Schema longSchema = Schema.createRecord("hudi_schema", "", "", false,
      Arrays.asList(new Schema.Field("id", Schema.create(Schema.Type.LONG), null, null)));

  // Build 100 "thin" records (a single long field each) with current/new
  // locations set, mirroring what ExternalSpillableMap holds during upsert.
  List<HoodieRecord> records = IntStream.range(0, 100)
      .mapToObj(k -> {
        GenericRecord gr = new GenericData.Record(longSchema);
        gr.put("id", (long) k);
        OverwriteWithLatestAvroPayload payload = new OverwriteWithLatestAvroPayload(gr, k);
        HoodieRecord record = new HoodieRecord<>(new HoodieKey(k + "", "default"), payload);
        record.unseal();
        record.setCurrentLocation(new HoodieRecordLocation("20200402101048", UUID.randomUUID().toString()));
        record.setNewLocation(new HoodieRecordLocation("2019375493949", UUID.randomUUID().toString()));
        return record;
      }).collect(Collectors.toList());

  DiskBasedMap<String, HoodieRecord> diskBasedMap = new DiskBasedMap<>("/tmp/diskmap");

  long writeStartMs = System.currentTimeMillis();
  for (HoodieRecord record : records) {
    diskBasedMap.put(record.getRecordKey(), record);
  }
  System.err.println(">>> write took : " + (System.currentTimeMillis() - writeStartMs));

  long readStartMs = System.currentTimeMillis();
  for (HoodieRecord record : records) {
    diskBasedMap.get(record.getRecordKey());
  }
  System.err.println(">>> read took : " + (System.currentTimeMillis() - readStartMs));

  // Keep the JVM alive to attach a profiler and inspect retained sizes.
  //Thread.sleep(Long.MAX_VALUE);
} {code}
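
For reference, the per-record number that ExternalSpillableMap logs as "Estimated Payload size" can be sampled directly with the size-calculation utility in hudi-common (a hedged sketch; it assumes ObjectSizeCalculator.getObjectSize keeps the signature of the Twitter utility it derives from, and it reuses the records list from the test above):

{code:java}
// Estimate one record's in-memory footprint, the same kind of number the
// spillable map logs when deciding how many entries fit in memory.
long perRecordBytes = ObjectSizeCalculator.getObjectSize(records.get(0));
System.err.println(">>> estimated record size (bytes): " + perRecordBytes);
{code}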

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041644#comment-17041644
 ] 

lamber-ken commented on HUDI-625:
-

Thinking of several solutions we can try :)
 * Use a Spliterator, instead of a single thread doing the IO / serialize /
deserialize work
 * Improve kryo's performance
 * Support multi-get (multithreaded); need to consider thread safety there.
RocksDB supports multiGet (a rough sketch of the idea follows below)
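
A minimal sketch of the multi-get idea in plain java.util.concurrent terms (an illustration only, not DiskBasedMap's actual API: lookupAll is a hypothetical helper, and it is only safe if the underlying map's get() tolerates concurrent readers, per the thread-safety caveat above):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiGetSketch {

  // Hypothetical multi-get: fan the keys out over a fixed pool and collect
  // values back in input-key order. map.get() must be safe for concurrent readers.
  public static <V> List<V> lookupAll(Map<String, V> map, List<String> keys, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<V>> futures = new ArrayList<>(keys.size());
      for (String key : keys) {
        futures.add(pool.submit(() -> map.get(key)));
      }
      List<V> values = new ArrayList<>(keys.size());
      for (Future<V> future : futures) {
        values.add(future.get());
      }
      return values;
    } finally {
      pool.shutdown();
    }
  }
}
{code}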

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stacktrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
> -  locked java.util.jar.JarFile@1fc27ed4
> at 

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-625:

Description: 
[https://github.com/apache/incubator-hudi/issues/1328]

 

 So what's going on here is that each entry (single data field) is estimated to 
be around 500-750 bytes in memory and things spill a lot... 
{code:java}
20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 for 
3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 partitionPath=default}, 
currentLocation='HoodieRecordLocation {instantTime=20200220225748, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
newLocation='HoodieRecordLocation {instantTime=20200220225921, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
 
h2. Reproduce steps

 
{code:java}
export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
--executor-memory 6G \
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}
 
{code:java}
val HUDI_FORMAT = "org.apache.hudi"
val TABLE_NAME = "hoodie.table.name"
val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
val UPSERT_OPERATION_OPT_VAL = "upsert"
val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
val config = Map(
"table_name" -> "example_table",
"target" -> "file:///tmp/example_table/",
"primary_key" ->  "id",
"sort_key" -> "id"
)
val readPath = config("target") + "/*"
val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
val df1 = spark.read.json(jsonRDD)

println(s"${df1.count()} records in source 1")

df1.write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
  option(BULK_INSERT_PARALLELISM, 1).
  mode("Overwrite").
  
save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
 records in Hudi table")

// Runs very slow
df1.limit(300).write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(UPSERT_PARALLELISM, 20).
  mode("Append").
  save(config("target"))

// Runs very slow
df1.write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(UPSERT_PARALLELISM, 20).
  mode("Append").
  
save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
 records in Hudi table")
{code}
 

 

 
h2. *Analysis*
h3. *Upsert (400 entries)*
{code:java}
WARN HoodieMergeHandle: 
Number of entries in MemoryBasedMap => 150875 
Total size in bytes of MemoryBasedMap => 83886580 
Number of entries in DiskBasedMap => 3849125 
Size of file spilled to disk => 1443046132
{code}
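Back-of-the-envelope on these numbers: 83886580 / 150875 ≈ 556 bytes per in-memory entry (consistent with the 500-750 byte estimate above), and 1443046132 / 3849125 ≈ 375 bytes per entry spilled to disk. The in-memory budget therefore covers under 4% of the ~4M entries, so nearly every get() goes to disk.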
h3. Hang stacktrace (DiskBasedMap#get)

 
{code:java}
"pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
at java.util.zip.ZipFile.getEntry(Native Method)
at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
-  locked java.util.jar.JarFile@1fc27ed4
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
-  locked java.lang.Object@28f65251
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
-  locked 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@a353dff
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
-  locked com.esotericsoftware.reflectasm.AccessClassLoader@2c7122e2
at 
com.esotericsoftware.reflectasm.AccessClassLoader.loadClass(AccessClassLoader.java:92)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-625:

Attachment: image-2020-02-21-15-35-56-637.png

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> This is not too far from reality 
> !image-2020-02-20-23-34-27-466.png|width=952,height=58!
> !image-2020-02-20-23-34-24-155.png|width=975,height=19!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-625:

Description: 
[https://github.com/apache/incubator-hudi/issues/1328]

 

 So what's going on here is that each entry (single data field) is estimated to 
be around 500-750 bytes in memory and things spill a lot... 
{code:java}
20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 for 
3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 partitionPath=default}, 
currentLocation='HoodieRecordLocation {instantTime=20200220225748, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
newLocation='HoodieRecordLocation {instantTime=20200220225921, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
 
h2. Reproduce steps

 
{code:java}
export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
--executor-memory 6G \
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}
 
{code:java}
val HUDI_FORMAT = "org.apache.hudi"
val TABLE_NAME = "hoodie.table.name"
val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
val UPSERT_OPERATION_OPT_VAL = "upsert"
val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
val config = Map(
"table_name" -> "example_table",
"target" -> "file:///tmp/example_table/",
"primary_key" ->  "id",
"sort_key" -> "id"
)
val readPath = config("target") + "/*"
val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
val df1 = spark.read.json(jsonRDD)

println(s"${df1.count()} records in source 1")

df1.write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
  option(BULK_INSERT_PARALLELISM, 1).
  mode("Overwrite").
  
save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
 records in Hudi table")

// Runs very slow
df1.limit(300).write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(UPSERT_PARALLELISM, 20).
  mode("Append").
  save(config("target"))

// Runs very slow
df1.write.format(HUDI_FORMAT).
  option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
  option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
  option(TABLE_NAME, config("table_name")).
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(UPSERT_PARALLELISM, 20).
  mode("Append").
  
save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
 records in Hudi table")
{code}
 

 

 
h2. *Analysis*
h3. *Upsert (400 entries)*
{code:java}
WARN HoodieMergeHandle: 
Number of entries in MemoryBasedMap => 150875 
Total size in bytes of MemoryBasedMap => 83886580 
Number of entries in DiskBasedMap => 3849125 
Size of file spilled to disk => 1443046132
{code}
h3. Hang stacktrace (DiskBasedMap#get)

 
{code:java}
"pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
at java.util.zip.ZipFile.getEntry(Native Method)
at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
-  locked java.util.jar.JarFile@1fc27ed4
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
-  locked java.lang.Object@28f65251
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
-  locked 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@a353dff
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
-  locked com.esotericsoftware.reflectasm.AccessClassLoader@2c7122e2
at 
com.esotericsoftware.reflectasm.AccessClassLoader.loadClass(AccessClassLoader.java:92)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Summary: Address performance concerns on DiskBasedMap.get() during upsert 
of thin records  (was: Address performance concerns on DiskBasedMap.get() 
during upsert of small workload )

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> This is not too far from reality 
> !image-2020-02-20-23-34-27-466.png|width=952,height=58!
> !image-2020-02-20-23-34-24-155.png|width=975,height=19!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of small workload

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Description: 
[https://github.com/apache/incubator-hudi/issues/1328]

 

 So what's going on here is that each entry (single data field) is estimated to 
be around 500-750 bytes in memory and things spill a lot... 
{code:java}
20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 for 
3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 partitionPath=default}, 
currentLocation='HoodieRecordLocation {instantTime=20200220225748, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
newLocation='HoodieRecordLocation {instantTime=20200220225921, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
 

This is not too far from reality 

!image-2020-02-20-23-34-27-466.png|width=952,height=58!

!image-2020-02-20-23-34-24-155.png|width=975,height=19!

 

 

  was:
[https://github.com/apache/incubator-hudi/issues/1328]

 

 So what's going on here is that each entry (single data field) is estimated to 
be around 500-750 bytes in memory and things spill a lot... 
{code:java}
20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 for 
3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 partitionPath=default}, 
currentLocation='HoodieRecordLocation {instantTime=20200220225748, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
newLocation='HoodieRecordLocation {instantTime=20200220225921, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}


> Address performance concerns on DiskBasedMap.get() during upsert of small 
> workload 
> ---
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> This is not too far from reality 
> !image-2020-02-20-23-34-27-466.png|width=952,height=58!
> !image-2020-02-20-23-34-24-155.png|width=975,height=19!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of small workload

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Attachment: image-2020-02-20-23-34-24-155.png

> Address performance concerns on DiskBasedMap.get() during upsert of small 
> workload 
> ---
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of small workload

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Attachment: image-2020-02-20-23-34-27-466.png

> Address performance concerns on DiskBasedMap.get() during upsert of small 
> workload 
> ---
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of small workload

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Description: 
[https://github.com/apache/incubator-hudi/issues/1328]

 

 So what's going on here is that each entry (single data field) is estimated to 
be around 500-750 bytes in memory and things spill a lot... 
{code:java}
20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 for 
3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 partitionPath=default}, 
currentLocation='HoodieRecordLocation {instantTime=20200220225748, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
newLocation='HoodieRecordLocation {instantTime=20200220225921, 
fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}

  was:
[https://github.com/apache/incubator-hudi/issues/1328]

 

 


> Address performance concerns on DiskBasedMap.get() during upsert of small 
> workload 
> ---
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041584#comment-17041584
 ] 

vinoyang commented on HUDI-624:
---

Done via master branch: 8f6035de4a0486e996647e1246334123aed0c9d6

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-624:
--
Status: Closed  (was: Patch Available)

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua merged pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
yanghua merged pull request #1344: [HUDI-624]: Split some of the code from PR 
for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-624]: Split some of the code from PR for HUDI-479 (#1344)

2020-02-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 078d482  [HUDI-624]: Split some of the code from PR for HUDI-479 
(#1344)
078d482 is described below

commit 078d4825d909b2c469398f31c97d2290687321a8
Author: Suneel Marthi 
AuthorDate: Fri Feb 21 01:22:21 2020 -0500

[HUDI-624]: Split some of the code from PR for HUDI-479 (#1344)
---
 .../hudi/cli/commands/HoodieLogFileCommand.java| 11 +++
 .../org/apache/hudi/cli/commands/SparkMain.java| 10 +++---
 .../java/org/apache/hudi/cli/utils/SparkUtil.java  |  5 ++-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  4 +--
 .../org/apache/hudi/func/LazyIterableIterator.java |  2 +-
 .../hudi/index/bloom/BloomIndexFileInfo.java   |  9 +++---
 .../org/apache/hudi/io/HoodieAppendHandle.java |  4 +--
 .../org/apache/hudi/io/HoodieCommitArchiveLog.java |  8 ++---
 .../strategy/BoundedIOCompactionStrategy.java  |  5 ++-
 .../io/compact/strategy/CompactionStrategy.java|  5 ++-
 .../apache/hudi/metrics/JmxMetricsReporter.java|  4 +--
 .../org/apache/hudi/table/RollbackExecutor.java|  6 ++--
 .../org/apache/hudi/TestCompactionAdminClient.java |  1 -
 .../apache/hudi/config/TestHoodieWriteConfig.java  |  4 +--
 .../hudi/index/bloom/TestHoodieBloomIndex.java | 22 ++---
 .../index/bloom/TestHoodieGlobalBloomIndex.java| 10 +++---
 .../org/apache/hudi/common/model/HoodieKey.java|  7 ++---
 .../org/apache/hudi/common/model/HoodieRecord.java |  8 ++---
 .../hudi/common/model/HoodieRecordLocation.java|  7 ++---
 .../hudi/common/util/BufferedRandomAccessFile.java |  6 +---
 .../java/org/apache/hudi/common/util/FSUtils.java  |  3 +-
 .../hudi/common/util/ObjectSizeCalculator.java | 36 +++---
 .../log/TestHoodieLogFormatAppendFailure.java  |  4 +--
 .../table/string/TestHoodieActiveTimeline.java |  1 -
 .../table/view/TestHoodieTableFileSystemView.java  |  5 ++-
 .../hudi/common/util/TestCompactionUtils.java  | 21 ++---
 .../org/apache/hudi/hive/SchemaDifference.java | 28 +
 .../java/org/apache/hudi/hive/util/SchemaUtil.java |  8 ++---
 .../org/apache/hudi/hive/TestHiveSyncTool.java |  3 +-
 .../test/java/org/apache/hudi/hive/TestUtil.java   | 15 -
 .../org/apache/hudi/utilities/UtilHelpers.java |  9 +++---
 31 files changed, 130 insertions(+), 141 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
index 8a50309..2bb87e0 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
@@ -38,8 +38,6 @@ import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieMemoryConfig;
 import org.apache.hudi.hive.util.SchemaUtil;
 
-import com.google.common.base.Preconditions;
-import com.google.common.collect.Maps;
 import com.fasterxml.jackson.databind.ObjectMapper;
 
 import org.apache.avro.Schema;
@@ -59,6 +57,7 @@ import java.util.Arrays;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
+import java.util.Objects;
 import java.util.stream.Collectors;
 
 import scala.Tuple2;
@@ -85,14 +84,14 @@ public class HoodieLogFileCommand implements CommandMarker {
 List<String> logFilePaths = Arrays.stream(fs.globStatus(new Path(logFilePathPattern)))
 .map(status -> status.getPath().toString()).collect(Collectors.toList());
 Map<String, List<Tuple3<HoodieLogBlockType, Tuple2<Map<HeaderMetadataType, String>, Map<HeaderMetadataType, String>>, Integer>>> commitCountAndMetadata =
-Maps.newHashMap();
+new HashMap<>();
 int numCorruptBlocks = 0;
 int dummyInstantTimeCount = 0;
 
 for (String logFilePath : logFilePaths) {
   FileStatus[] fsStatus = fs.listStatus(new Path(logFilePath));
   Schema writerSchema = new AvroSchemaConverter()
-  .convert(Preconditions.checkNotNull(SchemaUtil.readSchemaFromLogFile(fs, new Path(logFilePath))));
+  .convert(Objects.requireNonNull(SchemaUtil.readSchemaFromLogFile(fs, new Path(logFilePath))));
   Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(fsStatus[0].getPath()), writerSchema);
 
   // read the avro blocks
@@ -181,7 +180,7 @@ public class HoodieLogFileCommand implements CommandMarker {
 AvroSchemaConverter converter = new AvroSchemaConverter();
 // get schema from last log file
 Schema readerSchema =
-converter.convert(Preconditions.checkNotNull(SchemaUtil.readSchemaFromLogFile(fs, new Path(logFilePaths.get(logFilePaths.size() - 1)))));
+converter.convert(Objects.requireNonNull(SchemaUtil.readSchemaFromLogFile(fs, new Path(logFilePaths.get(logFilePaths.size() - 1)))));
 
 List allRecords = new 

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #195

2020-02-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.29 KB...]
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
smarthi commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382368430
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/FSUtils.java
 ##
 @@ -47,12 +47,13 @@
 import java.util.Arrays;
 import java.util.LinkedList;
 import java.util.List;
+import java.util.Objects;
 import java.util.Map.Entry;
-import java.util.UUID;
 import java.util.function.Function;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import java.util.stream.Stream;
+import java.util.UUID;
 
 Review comment:
   ok - so where should UUID be then?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382369350
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/FSUtils.java
 ##
 @@ -47,12 +47,13 @@
 import java.util.Arrays;
 import java.util.LinkedList;
 import java.util.List;
+import java.util.Objects;
 import java.util.Map.Entry;
-import java.util.UUID;
 import java.util.function.Function;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import java.util.stream.Stream;
+import java.util.UUID;
 
 Review comment:
   The original place before changing?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
smarthi commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382368430
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/FSUtils.java
 ##
 @@ -47,12 +47,13 @@
 import java.util.Arrays;
 import java.util.LinkedList;
 import java.util.List;
+import java.util.Objects;
 import java.util.Map.Entry;
-import java.util.UUID;
 import java.util.function.Function;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import java.util.stream.Stream;
+import java.util.UUID;
 
 Review comment:
  ok - so where should UUID be then?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382363030
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/FSUtils.java
 ##
 @@ -47,12 +47,13 @@
 import java.util.Arrays;
 import java.util.LinkedList;
 import java.util.List;
+import java.util.Objects;
 import java.util.Map.Entry;
-import java.util.UUID;
 import java.util.function.Function;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import java.util.stream.Stream;
+import java.util.UUID;
 
 Review comment:
  It seems the old position is correct? For import ordering, the class name
takes priority over the package name.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382364978
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/SchemaDifference.java
 ##
 @@ -74,6 +67,17 @@ public boolean isEmpty() {
 return deleteColumns.isEmpty() && updateColumnTypes.isEmpty() && 
addColumnTypes.isEmpty();
   }
 
+  @Override
+  public String toString() {
+return new StringJoiner(", ", SchemaDifference.class.getSimpleName() + 
"[", "]")
 
 Review comment:
  It seems that we have changed the generated string pattern? Did you check
whether the `SchemaDifference#toString` method is used in any key judgment
logic?
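
For reference, the output pattern the quoted StringJoiner produces (a quick standalone illustration; the element values are invented):

{code:java}
import java.util.StringJoiner;

public class JoinerDemo {
  public static void main(String[] args) {
    // Mirrors the quoted toString(): prefix "SchemaDifference[", suffix "]",
    // elements joined by ", ".
    StringJoiner joiner = new StringJoiner(", ", "SchemaDifference[", "]");
    joiner.add("deleteColumns=[a]").add("updateColumnTypes={b=int}");
    // Prints: SchemaDifference[deleteColumns=[a], updateColumnTypes={b=int}]
    System.out.println(joiner);
  }
}
{code}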


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
yanghua commented on a change in pull request #1344: [HUDI-624]: Split some of 
the code from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344#discussion_r382364125
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java
 ##
 @@ -16,24 +16,23 @@
 
 package org.apache.hudi.common.util;
 
-import com.google.common.base.Preconditions;
-import com.google.common.cache.CacheBuilder;
-import com.google.common.cache.CacheLoader;
-import com.google.common.cache.LoadingCache;
-import com.google.common.collect.Sets;
-
 import java.lang.management.ManagementFactory;
 import java.lang.management.MemoryPoolMXBean;
 import java.lang.reflect.Array;
 import java.lang.reflect.Field;
 import java.lang.reflect.Modifier;
 import java.util.ArrayDeque;
 import java.util.Arrays;
+import java.util.Collections;
 import java.util.Deque;
+import java.util.IdentityHashMap;
 import java.util.LinkedList;
 import java.util.List;
+import java.util.Map;
+import java.util.Objects;
 import java.util.Set;
 
+
 
 Review comment:
   redundant empty line?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-623) Remove UpgradePayloadFromUberToApache

2020-02-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041440#comment-17041440
 ] 

vinoyang commented on HUDI-623:
---

OK, let's wait for one more release cycle.

> Remove UpgradePayloadFromUberToApache
> -
>
> Key: HUDI-623
> URL: https://issues.apache.org/jira/browse/HUDI-623
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Trivial
> Fix For: 0.5.2
>
>
> {{UpgradePayloadFromUberToApache}} used to covert the package names from the 
> pattern {{com.uber.hoodie}} to {{org.apache.hudi}}. It's a one-shot work. 
> Since we have done this work. IMO, we can remove this class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-20 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-589452198
 
 
   @bwu2 Got it.. I think the root issue is that the map is spilling more than
needed. I am trying to understand why.. Will update the JIRA as I uncover
stuff. If it's easy, we can target a fix in the next 0.5.2 release itself.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bwu2 commented on issue #1328: Hudi upsert hangs

2020-02-20 Thread GitBox
bwu2 commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-589446887
 
 
   Thanks for your replies!
   
@lamber-ken I will try again with that setting. Does increasing the memory
available by setting `option("hoodie.memory.merge.max.size", "200485760")`
work better than increasing executor memory or `hoodie.memory.merge.fraction`?
Will it not result in an OOME?
   
   @vinothchandar Yes, we are re-running against real datasets in the next 
couple of days. I will report back. But do note that the original problem 
actually arose on a real dataset not a degenerate or artificial workload: it 
was not a particularly wide table (with 6 or 7 fields), with about 4m rows bulk 
inserted and then another 4m upserted (of which most were the same rows).
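
For concreteness, here is how the merge-memory budget can be raised on a single write (a hedged sketch: the option keys come from Hudi's HoodieMemoryConfig, while the input path, table settings, and the 2 GB value are illustrative assumptions, not recommendations from this thread):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MergeMemoryExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-merge-memory")
        .master("local[2]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    Dataset<Row> df = spark.read().json("/tmp/input.json"); // placeholder input

    // Give the merge-phase spillable map a fixed 2 GB budget (illustrative value)
    // instead of relying on the default fraction of executor memory.
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "example_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "id")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.memory.merge.max.size", String.valueOf(2L * 1024 * 1024 * 1024))
        .mode(SaveMode.Append)
        .save("file:///tmp/example_table/");

    spark.stop();
  }
}
{code}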
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-627) Publish coverage to codecov.io

2020-02-20 Thread Ramachandran M S (Jira)
Ramachandran M S created HUDI-627:
-

 Summary: Publish coverage to codecov.io
 Key: HUDI-627
 URL: https://issues.apache.org/jira/browse/HUDI-627
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: Ramachandran M S


* Publish the coverage to codecov.io on every build
 * Fix code coverage to pickup cross module testing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-627) Publish coverage to codecov.io

2020-02-20 Thread Ramachandran M S (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramachandran M S reassigned HUDI-627:
-

Assignee: Ramachandran M S

> Publish coverage to codecov.io
> --
>
> Key: HUDI-627
> URL: https://issues.apache.org/jira/browse/HUDI-627
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Ramachandran M S
>Assignee: Ramachandran M S
>Priority: Major
>
> * Publish the coverage to codecov.io on every build
>  * Fix code coverage to pickup cross module testing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-618) Improve unit test coverage for org.apache.hudi.common.table.view. PriorityBasedFileSystemView

2020-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-618:

Labels: pull-request-available  (was: )

> Improve unit test coverage for org.apache.hudi.common.table.view. 
> PriorityBasedFileSystemView
> -
>
> Key: HUDI-618
> URL: https://issues.apache.org/jira/browse/HUDI-618
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Ramachandran M S
>Assignee: Ramachandran M S
>Priority: Major
>  Labels: pull-request-available
>
> Add unit tests for all methods



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ramachandranms opened a new pull request #1345: [HUDI-618] Adding unit tests for PriorityBasedFileSystemView

2020-02-20 Thread GitBox
ramachandranms opened a new pull request #1345: [HUDI-618] Adding unit tests 
for PriorityBasedFileSystemView
URL: https://github.com/apache/incubator-hudi/pull/1345
 
 
   ## What is the purpose of the pull request
   
   - This PR is to address the JIRA ticket - 
[HUDI-618](https://issues.apache.org/jira/browse/HUDI-618)
   - Added unit tests for `org.apache.hudi.common.table.view. 
PriorityBasedFileSystemView`
   - Refactored DRY code in 
`org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView`
   
   ## Brief change log
   
   - Added unit tests to `org.apache.hudi.common.table.view. 
PriorityBasedFileSystemView` for improving code coverage
   
   ## Verify this pull request
   
   - Checked code coverage to ensure all methods and branches are covered
   - Run all tests to ensure refactoring didn't break anything
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-20 Thread GitBox
satishkotha commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r382195220
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/TempTableUtil.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.utils;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class TempTableUtil {
+  private static final Logger LOG = LogManager.getLogger(TempTableUtil.class);
+
+  private JavaSparkContext jsc;
+  private SQLContext sqlContext;
+
+  public TempTableUtil(String appName) {
+try {
+  SparkConf sparkConf = new SparkConf().setAppName(appName)
+  .set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer").setMaster("local[8]");
+  jsc = new JavaSparkContext(sparkConf);
+  jsc.setLogLevel("ERROR");
+
+  sqlContext = new SQLContext(jsc);
+} catch (Throwable ex) {
+  // log full stack trace and rethrow. Without this it's difficult to debug 
failures, if any
+  LOG.error("unable to initialize spark context ", ex);
+  throw new HoodieException(ex);
+}
+  }
+
+  public void write(String tableName, List<String> headers, 
List<List<Comparable>> rows) {
+try {
+  if (headers.isEmpty() || rows.isEmpty()) {
+return;
+  }
+
+  if (rows.stream().filter(row -> row.size() != headers.size()).count() > 
0) {
+throw new HoodieException("Invalid row, does not match headers " + 
headers.size() + " " + rows.size());
+  }
+
+  // replace all whitespaces in headers to make it easy to write sql 
queries
+  List<String> headersNoSpaces = headers.stream().map(title -> 
title.replaceAll("\\s+",""))
+  .collect(Collectors.toList());
+
+  // generate schema for table
+  StructType structType = new StructType();
+  for (int i = 0; i < headersNoSpaces.size(); i++) {
+// try guessing data type from column data.
+DataType headerDataType = getDataType(rows.get(0).get(i));
+structType = 
structType.add(DataTypes.createStructField(headersNoSpaces.get(i), 
headerDataType, true));
+  }
+  List<Row> records = rows.stream().map(row -> 
RowFactory.create(row.toArray(new Comparable[row.size()])))
+  .collect(Collectors.toList());
+  Dataset<Row> dataset = this.sqlContext.createDataFrame(records, 
structType);
+  dataset.createOrReplaceTempView(tableName);
+  System.out.println("Wrote table view: " + tableName);
+} catch (Throwable ex) {
+  // log full stack trace and rethrow. Without this it's difficult to debug 
failures, if any
+  LOG.error("unable to write ", ex);
+  throw new HoodieException(ex);
+}
+  }
+
+  public void runQuery(String sqlText) {
+try {
+  this.sqlContext.sql(sqlText).show(Integer.MAX_VALUE, false);
+} catch (Throwable ex) {
+  // log full stack trace and rethrow. Without this it's difficult to debug 
failures, if any
+  LOG.error("unable to read ", ex);
+  throw new HoodieException(ex);
+}
+  }
+
+  public void deleteTable(String tableName) {
+try {
+  sqlContext.sql("DROP TABLE IF EXISTS " + tableName);
+} catch (Throwable ex) {
+  // log full stack trace and rethrow. Without this it's difficult to debug 
failures, if any
+  LOG.error("unable to delete table ", ex);
+  throw new HoodieException(ex);
+}
+  }
+
+  private DataType getDataType(Comparable comparable) {
 
 Review comment:
   Not sure what you mean. This is dynamically inferring the schema of tables 
output by the CLI to make it easy to filter. If you have suggestions on how to 
improve this, please let me know.
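
   A minimal usage sketch of the TempTableUtil quoted above; the table name, column names, and row values are illustrative only, and TempTableUtil itself comes from this PR:

{code:java}
import java.util.Arrays;
import java.util.List;

public class TempTableExample {
  public static void main(String[] args) {
    // Write CLI output rows into a Spark temp view, query it, then drop it.
    TempTableUtil util = new TempTableUtil("hudi-cli-example");
    List<String> headers = Arrays.asList("CommitTime", "CommitType");
    List<List<Comparable>> rows = Arrays.asList(
        Arrays.<Comparable>asList("20190323220154", "commit"),
        Arrays.<Comparable>asList("20190323224004", "commit"));
    util.write("archived_commits", headers, rows);  // infers the schema from the first row
    util.runQuery("SELECT CommitTime FROM archived_commits WHERE CommitType = 'commit'");
    util.deleteTable("archived_commits");
  }
}
{code}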

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-20 Thread GitBox
satishkotha commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r382194382
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/TempTableUtil.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.utils;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class TempTableUtil {
+  private static final Logger LOG = LogManager.getLogger(TempTableUtil.class);
+
+  private JavaSparkContext jsc;
 
 Review comment:
   This is just another utility, similar to how we set up the spark context in tests. 
Created an abstraction. Let me know if you have any more concrete suggestions.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-20 Thread GitBox
satishkotha commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r382193841
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
 ##
 @@ -57,11 +60,38 @@ public static String print(String[] header, String[][] 
rows) {
*/
  public static String print(TableHeader rowHeader, Map<String, Function<Object, String>> fieldNameToConverterMap,
  String sortByField, boolean isDescending, Integer limit, boolean 
headerOnly, List<Comparable[]> rows) {
+return print(rowHeader, fieldNameToConverterMap, sortByField, 
isDescending, limit, headerOnly, rows, "");
+  }
+
+  /**
+   * Serialize Table to printable string and also export a temporary view to 
easily write sql queries.
+   *
+   * Ideally, exporting the view should live outside PrintHelper, but all commands 
use this, so this is an easy
+   * way to add support for all commands.
+   *
+   * @param rowHeader Row Header
+   * @param fieldNameToConverterMap Field Specific Converters
+   * @param sortByField Sorting field
+   * @param isDescending Order
+   * @param limit Limit
+   * @param headerOnly Headers only
+   * @param rows List of rows
+   * @param tempTableName table name to export
+   * @return Serialized form for printing
+   */
+  public static String print(TableHeader rowHeader, Map<String, Function<Object, String>> fieldNameToConverterMap,
+  String sortByField, boolean isDescending, Integer limit, boolean 
headerOnly, List<Comparable[]> rows,
+  String tempTableName) {
 
 if (headerOnly) {
   return HoodiePrintHelper.print(rowHeader);
 }
 
+if (!Strings.isNullOrEmpty(tempTableName)) {
 
 Review comment:
   Done
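
   For context, a minimal call-site sketch of the new overload quoted above; the sort field, limit, and temp table name are illustrative, and rowHeader, fieldNameToConverterMap, and rows are assumed to exist as in the surrounding class:

{code:java}
// Passing a non-empty table name also exports the rows as a Spark temp view.
String out = HoodiePrintHelper.print(rowHeader, fieldNameToConverterMap,
    "CommitTime", true /* isDescending */, 10 /* limit */, false /* headerOnly */,
    rows, "commits_view");  // "commits_view" is a hypothetical view name
System.out.println(out);
{code}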


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-20 Thread GitBox
satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#issuecomment-589252115
 
 
   > Please first create a JIRA for the PR.
   
   @smarthi My bad. Added. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-20 Thread GitBox
satishkotha commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r382193779
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
 ##
 @@ -18,13 +18,16 @@
 
 package org.apache.hudi.cli;
 
+import com.google.common.base.Strings;
 
 Review comment:
   GTK. thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-626) Hudi CLI add export to table option

2020-02-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-626:

Description: 
The CLI shell is very restrictive and it is sometimes hard to filter specific rows. 
Adding the ability to export results of a CLI command into a temporary table, so CLI 
users can write HiveQL queries to look for any specific information.
(This idea has been brought up by multiple folks on the team, thanks everyone 
for the great suggestions)

  was:
Hudi CLI has a 'show archived commits' command which is not very helpful

 
{code:java}
->show archived commits
===> Showing only 10 archived commits <===
    
    | CommitTime    | CommitType|
    |===|
    | 2019033304| commit    |
    | 20190323220154| commit    |
    | 20190323220154| commit    |
    | 20190323224004| commit    |
    | 20190323224013| commit    |
    | 20190323224229| commit    |
    | 20190323224229| commit    |
    | 20190323232849| commit    |
    | 20190323233109| commit    |
    | 20190323233109| commit    |
 {code}
Modify it or introduce a new command to make it easier to debug

 


> Hudi CLI add export to table option
> ---
>
> Key: HUDI-626
> URL: https://issues.apache.org/jira/browse/HUDI-626
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> The CLI shell is very restrictive and it is sometimes hard to filter specific 
> rows. Adding the ability to export results of a CLI command into a temporary table, 
> so CLI users can write HiveQL queries to look for any specific information.
> (This idea has been brought up by multiple folks on the team, thanks everyone 
> for the great suggestions)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-626) Hudi CLI add export to table option

2020-02-20 Thread satish (Jira)
satish created HUDI-626:
---

 Summary: Hudi CLI add export to table option
 Key: HUDI-626
 URL: https://issues.apache.org/jira/browse/HUDI-626
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: CLI
Reporter: satish
Assignee: satish
 Fix For: 0.5.2


Hudi CLI has a 'show archived commits' command which is not very helpful

 
{code:java}
->show archived commits
===> Showing only 10 archived commits <===
    
    | CommitTime    | CommitType|
    |===|
    | 2019033304| commit    |
    | 20190323220154| commit    |
    | 20190323220154| commit    |
    | 20190323224004| commit    |
    | 20190323224013| commit    |
    | 20190323224229| commit    |
    | 20190323224229| commit    |
    | 20190323232849| commit    |
    | 20190323233109| commit    |
    | 20190323233109| commit    |
 {code}
Modify it or introduce a new command to make it easier to debug

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #954: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: table not found

2020-02-20 Thread GitBox
vinothchandar commented on issue #954:  
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: 
 table not found
URL: https://github.com/apache/incubator-hudi/issues/954#issuecomment-589242784
 
 
   @umehrot2 For some of the misconfigs, we could add it to the troubleshooting 
guide that @pratyakshsharma is putting together.. This will reduce our support 
cost significantly ..
   
   
   >> One issue that is relevant is that schema evolution does not work against 
Glue catalog and I will create a JIRA for that.
   +1. thanks @umehrot2 for being so awesome 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-623) Remove UpgradePayloadFromUberToApache

2020-02-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041219#comment-17041219
 ] 

Vinoth Chandar commented on HUDI-623:
-

Might be good to leave this around for a few more releases, in case someone is 
still on the Uber code? 

 

Even some jobs at Uber may be on it.. cc [~nishith29] [~vbalaji]

> Remove UpgradePayloadFromUberToApache
> -
>
> Key: HUDI-623
> URL: https://issues.apache.org/jira/browse/HUDI-623
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Trivial
> Fix For: 0.5.2
>
>
> {{UpgradePayloadFromUberToApache}} was used to convert the package names from the 
> pattern {{com.uber.hoodie}} to {{org.apache.hudi}}. It was a one-shot piece of work. 
> Since we have done this work, IMO we can remove this class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-573) Rolling stats written twice onto commit metadata

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-573.
---
Resolution: Fixed

> Rolling stats written twice onto commit metadata
> 
>
> Key: HUDI-573
> URL: https://issues.apache.org/jira/browse/HUDI-573
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To reproduce, simply write a commit and observe the file locally 
> {code:java}
>   "extraMetadataMap" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 107012887\n  },\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\" : {\n   
>  \"fileId\" : \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123081042\n  }\n},\n\"date-1\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\",\n\"inserts\" : 63043,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 36528903\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\",\n
> \"inserts\" : 21632,\n\"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 12820469\n  },\n  
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\",\n\"inserts\" : 214319,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 123119553\n  }\n}\n  },\n  \"actionType\" : \"commit\"\n}",
> "schema" : 
> "{\"type\":\"record\",\"name\":\"hoodie_benchmark_record\",\"namespace\":\"hoodie.hoodie_benchmark\",\"fields\":[{\"name\":\"key\",\"type\":[\"string\",\"null\"]},{\"name\":\"partition\",\"type\":[\"string\",\"null\"]},{\"name\":\"ts\",\"type\":[\"long\",\"null\"]},{\"name\":\"textField\",\"type\":[\"string\",\"null\"]},{\"name\":\"decimalField\",\"type\":[\"float\",\"null\"]},{\"name\":\"longField\",\"type\":[\"long\",\"null\"]},{\"name\":\"arrayField\",\"type\":[{\"type\":\"array\",\"items\":[\"int\",\"null\"]},\"null\"]},{\"name\":\"mapField\",\"type\":[{\"type\":\"map\",\"values\":[\"int\",\"null\"]},\"null\"]}]}"
>   },
>   "extraMetadata" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> 

[jira] [Updated] (HUDI-573) Rolling stats written twice onto commit metadata

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-573:

Status: Open  (was: New)

> Rolling stats written twice onto commit metadata
> 
>
> Key: HUDI-573
> URL: https://issues.apache.org/jira/browse/HUDI-573
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To reproduce, simply write a commit and observe the file locally 
> {code:java}
>   "extraMetadataMap" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 107012887\n  },\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\" : {\n   
>  \"fileId\" : \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123081042\n  }\n},\n\"date-1\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\",\n\"inserts\" : 63043,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 36528903\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\",\n
> \"inserts\" : 21632,\n\"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 12820469\n  },\n  
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\",\n\"inserts\" : 214319,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 123119553\n  }\n}\n  },\n  \"actionType\" : \"commit\"\n}",
> "schema" : 
> "{\"type\":\"record\",\"name\":\"hoodie_benchmark_record\",\"namespace\":\"hoodie.hoodie_benchmark\",\"fields\":[{\"name\":\"key\",\"type\":[\"string\",\"null\"]},{\"name\":\"partition\",\"type\":[\"string\",\"null\"]},{\"name\":\"ts\",\"type\":[\"long\",\"null\"]},{\"name\":\"textField\",\"type\":[\"string\",\"null\"]},{\"name\":\"decimalField\",\"type\":[\"float\",\"null\"]},{\"name\":\"longField\",\"type\":[\"long\",\"null\"]},{\"name\":\"arrayField\",\"type\":[{\"type\":\"array\",\"items\":[\"int\",\"null\"]},\"null\"]},{\"name\":\"mapField\",\"type\":[{\"type\":\"map\",\"values\":[\"int\",\"null\"]},\"null\"]}]}"
>   },
>   "extraMetadata" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> 

[jira] [Updated] (HUDI-573) Rolling stats written twice onto commit metadata

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-573:

Fix Version/s: (was: 0.6.0)
   0.5.2

> Rolling stats written twice onto commit metadata
> 
>
> Key: HUDI-573
> URL: https://issues.apache.org/jira/browse/HUDI-573
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To reproduce, simply write a commit and observe the file locally 
> {code:java}
>   "extraMetadataMap" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 107012887\n  },\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\" : {\n   
>  \"fileId\" : \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123081042\n  }\n},\n\"date-1\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\",\n\"inserts\" : 63043,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 36528903\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\",\n
> \"inserts\" : 21632,\n\"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 12820469\n  },\n  
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\",\n\"inserts\" : 214319,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 123119553\n  }\n}\n  },\n  \"actionType\" : \"commit\"\n}",
> "schema" : 
> "{\"type\":\"record\",\"name\":\"hoodie_benchmark_record\",\"namespace\":\"hoodie.hoodie_benchmark\",\"fields\":[{\"name\":\"key\",\"type\":[\"string\",\"null\"]},{\"name\":\"partition\",\"type\":[\"string\",\"null\"]},{\"name\":\"ts\",\"type\":[\"long\",\"null\"]},{\"name\":\"textField\",\"type\":[\"string\",\"null\"]},{\"name\":\"decimalField\",\"type\":[\"float\",\"null\"]},{\"name\":\"longField\",\"type\":[\"long\",\"null\"]},{\"name\":\"arrayField\",\"type\":[{\"type\":\"array\",\"items\":[\"int\",\"null\"]},\"null\"]},{\"name\":\"mapField\",\"type\":[{\"type\":\"map\",\"values\":[\"int\",\"null\"]},\"null\"]}]}"
>   },
>   "extraMetadata" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-20 Thread GitBox
bvaradar commented on a change in pull request #1150: [HUDI-288]: Add support 
for ingesting multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r382175651
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TableConfig.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+import java.util.Objects;
+
+/*
+Represents an object with all the topic-level overrides for a multi-table delta 
streamer execution
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
 
 Review comment:
   @pratyakshsharma : Sounds fair. Let's proceed with having TableConfig 
contain both the Source and Sink config, but rename the class to reflect it. The 
name TableConfig seems misleading to me. I will go through the change once you 
address the other comments.
   
   Regarding your comment about having the same source, I think you meant source 
type (Kafka, DFS, ...). As the current TableConfig takes care of one pair of 
Source<->Sink configs (with configs in separate folders) anyway, there is no 
need to force that restriction. Let me know if I am missing something?
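
   A hedged sketch of the rename being discussed; the class and field names below are hypothetical illustrations, not the merged code:

{code:java}
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;

// Hypothetical: a name reflecting that one object pairs source-side and
// sink-side settings for a single ingestion stream.
@JsonIgnoreProperties(ignoreUnknown = true)
class TableExecutionConfig {
  @JsonProperty("source")
  private SourceConfig source;  // e.g. source type (Kafka, DFS) and topic/path

  @JsonProperty("sink")
  private SinkConfig sink;      // e.g. target base path and table name
}

class SourceConfig { String type; String topicOrPath; }              // hypothetical
class SinkConfig { String targetBasePath; String targetTableName; }  // hypothetical
{code}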


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-573) Rolling stats written twice onto commit metadata

2020-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-573:

Labels: pull-request-available  (was: )

> Rolling stats written twice onto commit metadata
> 
>
> Key: HUDI-573
> URL: https://issues.apache.org/jira/browse/HUDI-573
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> To reproduce, simply write a commit and observe the file locally 
> {code:java}
>   "extraMetadataMap" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 107012887\n  },\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\" : {\n   
>  \"fileId\" : \"f8924f32-09e6-4049-8830-9fb623d6c1e9-1\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123081042\n  }\n},\n\"date-1\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-0\",\n\"inserts\" : 63043,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 36528903\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-3\",\n
> \"inserts\" : 21632,\n\"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 12820469\n  },\n  
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-2\",\n\"inserts\" : 214319,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 123119553\n  }\n}\n  },\n  \"actionType\" : \"commit\"\n}",
> "schema" : 
> "{\"type\":\"record\",\"name\":\"hoodie_benchmark_record\",\"namespace\":\"hoodie.hoodie_benchmark\",\"fields\":[{\"name\":\"key\",\"type\":[\"string\",\"null\"]},{\"name\":\"partition\",\"type\":[\"string\",\"null\"]},{\"name\":\"ts\",\"type\":[\"long\",\"null\"]},{\"name\":\"textField\",\"type\":[\"string\",\"null\"]},{\"name\":\"decimalField\",\"type\":[\"float\",\"null\"]},{\"name\":\"longField\",\"type\":[\"long\",\"null\"]},{\"name\":\"arrayField\",\"type\":[{\"type\":\"array\",\"items\":[\"int\",\"null\"]},\"null\"]},{\"name\":\"mapField\",\"type\":[{\"type\":\"map\",\"values\":[\"int\",\"null\"]},\"null\"]}]}"
>   },
>   "extraMetadata" : {
> "ROLLING_STAT" : "{\n  \"partitionToRollingStats\" : {\n\"date-0\" : 
> {\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\" : {\n\"fileId\" : 
> \"b2404182-a20b-48f3-9386-1fd46e233aa6-1\",\n\"inserts\" : 86046,\n   
>  \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n\"totalInputWriteBytesOnDisk\" : 
> 49691893\n  },\n  \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\" : {\n
> \"fileId\" : \"b2404182-a20b-48f3-9386-1fd46e233aa6-0\",\n
> \"inserts\" : 214350,\n\"upserts\" : 0,\n\"deletes\" : 0,\n   
>  \"totalInputWriteBytesToDisk\" : 0,\n
> \"totalInputWriteBytesOnDisk\" : 123138517\n  }\n},\n\"date-2\" : 
> {\n  \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\" : {\n\"fileId\" : 
> \"f8924f32-09e6-4049-8830-9fb623d6c1e9-2\",\n\"inserts\" : 186262,\n  
>   \"upserts\" : 0,\n\"deletes\" : 0,\n
> \"totalInputWriteBytesToDisk\" : 0,\n

[incubator-hudi] branch master updated: Refactoring getter to avoid double extrametadata in json representation

2020-02-20 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 185ff64  Refactoring getter to avoid double extrametadata in json 
representation
185ff64 is described below

commit 185ff646ad6979722a3f1c4b34d87c7f98bd87e4
Author: Nishith Agarwal 
AuthorDate: Thu Jan 23 23:34:22 2020 -0800

Refactoring getter to avoid double extrametadata in json representation
---
 .../org/apache/hudi/common/model/HoodieCommitMetadata.java   | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
index f16ef2f..3097052 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
@@ -47,7 +47,7 @@ public class HoodieCommitMetadata implements Serializable {
   protected Map<String, List<HoodieWriteStat>> partitionToWriteStats;
   protected Boolean compacted;
 
-  private Map<String, String> extraMetadataMap;
+  private Map<String, String> extraMetadata;
 
   // for ser/deser
   public HoodieCommitMetadata() {
@@ -55,7 +55,7 @@ public class HoodieCommitMetadata implements Serializable {
   }
 
   public HoodieCommitMetadata(boolean compacted) {
-extraMetadataMap = new HashMap<>();
+extraMetadata = new HashMap<>();
 partitionToWriteStats = new HashMap<>();
 this.compacted = compacted;
   }
@@ -68,7 +68,7 @@ public class HoodieCommitMetadata implements Serializable {
   }
 
   public void addMetadata(String metaKey, String value) {
-extraMetadataMap.put(metaKey, value);
+extraMetadata.put(metaKey, value);
   }
 
   public List<HoodieWriteStat> getWriteStats(String partitionPath) {
@@ -76,7 +76,7 @@ public class HoodieCommitMetadata implements Serializable {
   }
 
   public Map<String, String> getExtraMetadata() {
-return extraMetadataMap;
+return extraMetadata;
   }
 
   public Map<String, List<HoodieWriteStat>> getPartitionToWriteStats() {
@@ -84,7 +84,7 @@ public class HoodieCommitMetadata implements Serializable {
   }
 
   public String getMetadata(String metaKey) {
-return extraMetadataMap.get(metaKey);
+return extraMetadata.get(metaKey);
   }
 
   public Boolean getCompacted() {
@@ -343,6 +343,6 @@ public class HoodieCommitMetadata implements Serializable {
   @Override
   public String toString() {
 return "HoodieCommitMetadata{partitionToWriteStats=" + 
partitionToWriteStats + ", compacted=" + compacted
-+ ", extraMetadataMap=" + extraMetadataMap + '}';
++ ", extraMetadata=" + extraMetadata + '}';
   }
 }



[GitHub] [incubator-hudi] bvaradar merged pull request #1278: [HUDI-573] Refactoring getter to avoid double extrametadata in json representation of HoodieCommitMetadata

2020-02-20 Thread GitBox
bvaradar merged pull request #1278: [HUDI-573] Refactoring getter to avoid 
double extrametadata in json representation of HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1278
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1278: [HUDI-573] Refactoring getter to avoid double extrametadata in json representation of HoodieCommitMetadata

2020-02-20 Thread GitBox
bvaradar commented on a change in pull request #1278: [HUDI-573] Refactoring 
getter to avoid double extrametadata in json representation of 
HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1278#discussion_r382159855
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -47,15 +47,15 @@
   protected Map<String, List<HoodieWriteStat>> partitionToWriteStats;
   protected Boolean compacted;
 
-  private Map<String, String> extraMetadataMap;
+  private Map<String, String> extraMetadata;
 
 Review comment:
   I was asking about any issues around storing extra-metadata in commit files 
in the active timeline. For example, as we store them in json format in the active 
timeline, I wanted to check if DeltaStreamer can read checkpoints without any 
problem from the json file. But, having thought about it, this should be fine. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-20 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-589195840
 
 
   https://issues.apache.org/jira/browse/HUDI-625 filed this to look into this 
scenario.. 
   
   @bwu2 In the meantime, could you run your benchmark against a real dataset 
with more fields, so the number of files is spread out and record sizes are larger? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-53) Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90

2020-02-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041138#comment-17041138
 ] 

Vinoth Chandar commented on HUDI-53:


Can we use this for the indexing work? If you have a new one, please close this. 

> Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90
> ---
>
> Key: HUDI-53
> URL: https://issues.apache.org/jira/browse/HUDI-53
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>
> [https://github.com/uber/hudi/issues/90] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-145) Limit the amount of partitions considered for GlobalBloomIndex

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-145:
---

Assignee: (was: Vinoth Chandar)

> Limit the amount of partitions considered for GlobalBloomIndex
> --
>
> Key: HUDI-145
> URL: https://issues.apache.org/jira/browse/HUDI-145
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie
>Reporter: Vinoth Chandar
>Priority: Major
>
> Currently, the global bloom index will check inputs against files in all 
> partitions. In a lot of cases, the user may clearly know the range of partitions 
> actually impacted by updates (e.g. an upstream system drops updates 
> older than a year, ...). In such a scenario, it may make sense to support 
> an option for global bloom to control how many partitions you want to match 
> against, to gain performance. 
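
A hedged sketch of what such partition pruning could look like; the helper and the maxPartitions knob are hypothetical, not actual Hudi code or config:

{code:java}
import java.util.Comparator;
import java.util.List;

class PartitionPruner {
  // Hypothetical: cap how many (most recent) partitions a global index scans.
  static List<String> prune(List<String> allPartitions, int maxPartitions) {
    allPartitions.sort(Comparator.reverseOrder());  // newest date-style partition first
    return allPartitions.subList(0, Math.min(maxPartitions, allPartitions.size()));
  }
}
{code}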



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-539) RO Path filter does not pick up hadoop configs from the spark context

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-539:

Summary: RO Path filter does not pick up hadoop configs from the spark 
context  (was: No FileSystem for scheme: abfss)

> RO Path filter does not pick up hadoop configs from the spark context
> -
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that {{HoodieROTablePathFilter}} tries 
> to get a FileSystem for the path, passing in a blank Hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> Stacktrace
> java.io.IOException: No FileSystem for scheme: abfss
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
> at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349)
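
A hedged sketch of one direction a fix could take (illustrative only, not the committed patch): have the filter pick up a populated Hadoop configuration, e.g. from the active Spark session, instead of constructing an empty one:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

class PathFilterConfSketch {
  // Illustrative only: a populated Configuration carries the fs.azure/abfss
  // settings, so getFileSystem can resolve the abfss:// scheme.
  static FileSystem resolve(String pathStr) throws java.io.IOException {
    Configuration conf = SparkSession.active().sparkContext().hadoopConfiguration();
    return new Path(pathStr).getFileSystem(conf);
  }
}
{code}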



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-53) Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-53:
--

Assignee: sivabalan narayanan  (was: Vinoth Chandar)

> Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90
> ---
>
> Key: HUDI-53
> URL: https://issues.apache.org/jira/browse/HUDI-53
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>
> [https://github.com/uber/hudi/issues/90] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-295) Do one-time cleanup of Hudi git history

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-295:
---

Assignee: (was: Vinoth Chandar)

> Do one-time cleanup of Hudi git history
> ---
>
> Key: HUDI-295
> URL: https://issues.apache.org/jira/browse/HUDI-295
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: Vinoth Chandar
>Priority: Major
>
> https://lists.apache.org/thread.html/dc6eb516e248088dac1a2b5c9690383dfe2eb3912f76bbe9dd763c2b@



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-295) Do one-time cleanup of Hudi git history

2020-02-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-295:

Status: New  (was: Open)

> Do one-time cleanup of Hudi git history
> ---
>
> Key: HUDI-295
> URL: https://issues.apache.org/jira/browse/HUDI-295
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> https://lists.apache.org/thread.html/dc6eb516e248088dac1a2b5c9690383dfe2eb3912f76bbe9dd763c2b@



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of small workload

2020-02-20 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-625:
---

 Summary: Address performance concerns on DiskBasedMap.get() during 
upsert of small workload 
 Key: HUDI-625
 URL: https://issues.apache.org/jira/browse/HUDI-625
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Performance, Writer Core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0


[https://github.com/apache/incubator-hudi/issues/1328]

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-20 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-589152895
 
 
   @lamber-ken is right.. I am looking into why the DiskBasedMap is so slow 
(there was a recent change.. wondering if it's a regression..) Will raise a 
JIRA nonetheless..  
   
   
   So, a bit more explanation.. The big difference is that all 4M entries go to one 
file and it's a degenerate workload (i.e. a single-field record) where the 
metadata-to-data overhead is large.. We have a spilling mechanism to handle a large 
number of keys merging into a single file (like the spill map you will see in 
spark shuffle) and that seems to be performing poorly.. 
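
   For readers new to the spilling mechanism described above, a minimal generic sketch of the pattern follows; it is an illustration only, not Hudi's actual ExternalSpillableMap/DiskBasedMap implementation:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative spill-over map: keep entries in memory until an estimated
// size budget is exceeded, then route overflow to a disk-backed map.
class SpillableMap<K, V> {
  private final Map<K, V> inMemory = new HashMap<>();
  private final Map<K, V> onDisk;         // stand-in for a file-backed map
  private final long maxInMemoryBytes;
  private final long perEntryEstimate;    // estimated bytes per entry

  SpillableMap(long maxInMemoryBytes, long perEntryEstimate, Map<K, V> onDisk) {
    this.maxInMemoryBytes = maxInMemoryBytes;
    this.perEntryEstimate = perEntryEstimate;
    this.onDisk = onDisk;
  }

  void put(K key, V value) {
    if ((long) inMemory.size() * perEntryEstimate < maxInMemoryBytes) {
      inMemory.put(key, value);           // fits within the memory budget
    } else {
      onDisk.put(key, value);             // spill: the slow path seen in this issue
    }
  }

  V get(K key) {
    V v = inMemory.get(key);
    return v != null ? v : onDisk.get(key);  // disk reads dominate once most entries spill
  }
}
{code}

   With a large per-entry size estimate and a small memory budget, nearly all of a 4M-entry single-field workload lands on the slow disk path, which matches the behavior reported here.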


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-624:
---
Status: In Progress  (was: Open)

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-624:
---
Status: Patch Available  (was: In Progress)

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-624:
---
Status: Open  (was: New)

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi opened a new pull request #1344: [HUDI-624]: Split some of the code from PR for HUDI-479

2020-02-20 Thread GitBox
smarthi opened a new pull request #1344: [HUDI-624]: Split some of the code 
from PR for HUDI-479
URL: https://github.com/apache/incubator-hudi/pull/1344
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Reducing the code size from PR# 1159
   ## Brief change log
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-624:

Labels: patch pull-request-available  (was: patch)

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.5.2
>
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
> making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-624) Split some of the code from PR for HUDI-479

2020-02-20 Thread Suneel Marthi (Jira)
Suneel Marthi created HUDI-624:
--

 Summary: Split some of the code from PR for HUDI-479 
 Key: HUDI-624
 URL: https://issues.apache.org/jira/browse/HUDI-624
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Code Cleanup
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.5.2


This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479, 
making it easier for review.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1328: Hudi upsert hangs

2020-02-20 Thread GitBox
lamber-ken commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-588971355
 
 
   Hi @bwu2, add the option `option("hoodie.memory.merge.max.size", 
"200485760")` when upserting, then let's try again : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services