[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-08-05 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-669431943 Thank you so much @bvaradar for your help.

[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-08-02 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-667675696 Does bulk-insert do any deduplication?
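For context, a minimal Scala sketch of a Hudi bulk_insert write where deduplication is requested explicitly rather than assumed. The config keys are standard Hudi datasource options; the table name, record key, precombine field, and S3 paths are hypothetical placeholders, not taken from this issue.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch only: a batch bulk_insert into Hudi with deduplication asked for
// explicitly. Table name, fields, and S3 paths below are hypothetical.
val spark = SparkSession.builder().appName("hudi-bulk-insert-sketch").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/raw/orders/")

df.write.format("hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.partitionpath.field", "created_date")
  // As I understand the docs, insert/bulk_insert do not combine or drop
  // duplicate keys by default; these two options request that behaviour,
  // while upsert always merges by record key + precombine field.
  .option("hoodie.combine.before.insert", "true")
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
  .option("hoodie.bulkinsert.shuffle.parallelism", "16")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/hudi/orders/")
```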

[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-29 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-665432999 Hi bvaradar, how are you? I hope you are doing fine! I have a new case, which is a little more important to me; the problem is almost the same. I adopted the strategy to first batch…

[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-25 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663927931 Hi again. When I changed the insert option to upsert, the performance got worse. 1 master node m5.xlarge (4 vCPU, 16 GB RAM), 1 core node r5.xlarge (4 vCPU, 32 GB RAM), 4…
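For readers following along: the change being described boils down to flipping the Hudi write-operation option. A hedged, spark-shell style sketch of that switch, together with the parallelism knob that usually goes with it; the values are illustrative, not taken from this cluster.

```scala
// Sketch of the option flip discussed above; names and values are illustrative.
// Upsert has to tag incoming records against the index before merging, which
// plain insert skips, so some slowdown on the same hardware is expected.
val hudiWriteOptions = Map(
  "hoodie.datasource.write.operation" -> "upsert",   // previously "insert"
  // The default shuffle parallelism in Hudi releases of this era was sized for
  // large clusters (around 1500, if memory serves); on a handful of 4-vCPU
  // nodes a much smaller value is usually more appropriate.
  "hoodie.upsert.shuffle.parallelism" -> "16"
)
```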

[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-25 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663906344 Hi bvaradar, thank you for your answer. I tried increasing spark.yarn.executor.memoryOverhead to 2 GB with the foreachBatch option inside writeStream and it worked. 4 nodes with 4…
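For anyone landing here later, a hedged Scala sketch of the shape of job being described: a structured streaming read whose foreachBatch writes each micro-batch to Hudi as an upsert, submitted with a larger YARN memory overhead. Only the foreachBatch pattern and the 2 GB memoryOverhead setting come from the comment above; the source schema, paths, table name, and checkpoint location are hypothetical placeholders.

```scala
// Submitted with something like:
//   spark-submit --conf spark.yarn.executor.memoryOverhead=2048 ...   (2 GB, as above)
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hudi-streaming-sketch").getOrCreate()

// Hypothetical source schema and path; the real job in this issue reads elsewhere.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("updated_at", TimestampType),
  StructField("created_date", StringType)))

val source = spark.readStream.schema(schema).parquet("s3://my-bucket/raw/orders/")

// Each micro-batch is written with the batch datasource, the usual foreachBatch pattern.
def writeBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.partitionpath.field", "created_date")
    .option("hoodie.upsert.shuffle.parallelism", "16")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/orders/")
}

source.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
  .awaitTermination()
```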

[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-24 Thread GitBox
rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663806475 I tried resizing the cluster with 3 more nodes, so in total (4 nodes) after resizing I had 4 cores and 16 GB of RAM in each node, and it didn't make any difference; the job keeps very…