Re: About github issue 639
Yes, fixed

On Thu, May 9, 2019 at 6:13 AM Vinoth Chandar wrote:
> Images don't render on the mailing list. :(
> Seems like the issue if fixed now?
Re: About github issue 639
Images don't render on the mailing list. :(
Seems like the issue is fixed now?

On Tue, May 7, 2019 at 10:15 PM Jun Zhu wrote:
> Hi,
> I run the new code pull from master branch, and compare with another
> stream job which run hudi 0.4.5 on maven. Both running per 10 minutes.
> The roll-back worked.
> Top is 0.4.5, bottom is 0.4.6
> [image: Screen Shot 2019-05-08 at 1.06.17 PM.png]
> [image: Screen Shot 2019-05-08 at 1.06.54 PM.png]
> And about log, I rewrite the log.trace to log.error to avoid log explode
> with trace.
> And There is nothing in variable:
>
>> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
>> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors
>> spark log
>> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Global error :
>> .spark log
>> 19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
>> 19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors
>
> Thanks,
> Jun
Re: About github issue 639
Hi,
I ran the new code pulled from the master branch and compared it with another
streaming job that runs Hudi 0.4.5 from Maven. Both run every 10 minutes.
The rollback worked.
Top is 0.4.5, bottom is 0.4.6.
[image: Screen Shot 2019-05-08 at 1.06.17 PM.png]
[image: Screen Shot 2019-05-08 at 1.06.54 PM.png]
As for the log, I changed log.trace to log.error to avoid the log exploding
with trace enabled.
There is nothing in the variable:

> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors
> spark log
> 19/05/07 06:30:36 ERROR HoodieSparkSQLWriter: Global error :
> .spark log
> 19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: insert failed with 1 errors :
> 19/05/07 07:10:40 ERROR HoodieSparkSQLWriter: Printing out the top 100 errors

Thanks,
Jun

On Sat, May 4, 2019 at 11:31 AM Vinoth Chandar wrote:
> No worries. This just landed on master, you can give it a shot. You ll also
> end up picking up interval tree based filtering for global index, which
> will speed things along a lot. Fyi
>
> Have a good holiday!
>
> Thanks
> Vinoth
Re: About github issue 639
No worries. This just landed on master, you can give it a shot. You'll also
end up picking up interval-tree-based filtering for the global index, which
will speed things along a lot. FYI.

Have a good holiday!

Thanks
Vinoth

On Fri, May 3, 2019 at 7:19 PM Jun Zhu wrote:
> Hi team,
> i will try that, thank you so much, sorry for late reply, just have a
> holiday in china😅.
> Thanks
> Jun
Re: About github issue 639
Hi team,
I will try that, thank you so much. Sorry for the late reply, we just had a
holiday in China 😅.
Thanks
Jun

On Wed, May 1, 2019 at 7:08 PM Vinoth Chandar wrote:
> Hi Jun,
>
> I was able to track that the HoodieSparkSQLWriter (common path for
> streaming sink and batch datasource) ends up calling
> DataSourceUtils.createHoodieClient, which creates the client as follows
>
> return new HoodieWriteClient<>(jssc, writeConfig);
>
> There is a third parameter that denotes whether the writer needs to
> rollback inflights. For e.g, DeltaStreamer invokes
>
> HoodieWriteClient client = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>
> While I trace down why we had this difference, could you try changing this
> one line here, and add third "true" argument and give it a shot.
>
> https://github.com/apache/incubator-hudi/blob/b34a204a527a156406908686e54484a0c3d8a3d7/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java#L148
>
> Thanks
> Vinoth
Re: About github issue 639
Hi Jun,

I was able to track that the HoodieSparkSQLWriter (the common path for the
streaming sink and the batch datasource) ends up calling
DataSourceUtils.createHoodieClient, which creates the client as follows:

    return new HoodieWriteClient<>(jssc, writeConfig);

There is a third parameter that denotes whether the writer needs to
roll back inflights. For example, DeltaStreamer invokes

    HoodieWriteClient client = new HoodieWriteClient<>(jssc, hoodieCfg, true);

While I trace down why we had this difference, could you try changing this
one line here, adding a third "true" argument, and give it a shot?

https://github.com/apache/incubator-hudi/blob/b34a204a527a156406908686e54484a0c3d8a3d7/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java#L148

Thanks
Vinoth

On Tue, Apr 30, 2019 at 11:16 PM [email protected] wrote:
> Hi Jun,
> You had mentioned that you are seeing the log message
> "insert failed with 1 errors"
> Did you see any exception stack traces before this message. You can also
> take a look at spark UI to see if stdout/stderr of failed tasks (if
> present).
> Also, it looks like if you also enable "trace" level logging, you would
> see exceptions getting logged at the end. So, enabling "trace" level
> logging is another way to debug what is happening.
> '''log.error(s"$operation failed with ${errorCount} errors :");
> if (log.isTraceEnabled) {
> log.trace("Printing out the top 100 errors") ...
> '''
> Balaji.V
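A minimal sketch of the difference between the two constructor calls discussed above, written against a hypothetical stand-in class rather than the real HoodieWriteClient. The class `SketchWriteClient`, its `pendingInflights` method, and the completed instant name `20190423123339.deltacommit` are assumptions made for illustration, and "rollback" is modelled simply as dropping instants whose file name ends in `.inflight`. It only shows why a client built without the flag would keep finding leftover inflights, while one built with `true` would start from a clean timeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT the real HoodieWriteClient.
class InflightRollbackSketch {

    static class SketchWriteClient {
        // Instant file names as they might appear in the .hoodie folder.
        private final List<String> timeline;

        // Two-argument analogue: leftover inflights are left untouched.
        SketchWriteClient(List<String> instants) {
            this(instants, false);
        }

        // Three-argument analogue: the flag asks for inflights to be rolled back.
        SketchWriteClient(List<String> instants, boolean rollbackInFlight) {
            this.timeline = new ArrayList<>(instants);
            if (rollbackInFlight) {
                // Assumption: rollback is modelled as removing inflight instants.
                this.timeline.removeIf(name -> name.endsWith(".inflight"));
            }
        }

        // Instants that are still inflight after the client was created.
        List<String> pendingInflights() {
            List<String> pending = new ArrayList<>();
            for (String name : timeline) {
                if (name.endsWith(".inflight")) {
                    pending.add(name);
                }
            }
            return pending;
        }
    }

    public static void main(String[] args) {
        List<String> instants = Arrays.asList(
                "20190423122339.deltacommit.inflight", // instant name taken from the thread
                "20190423123339.deltacommit");         // hypothetical completed instant
        // Without the flag the inflight instant is still pending.
        System.out.println(new SketchWriteClient(instants).pendingInflights());
        // With the flag set, nothing is left pending.
        System.out.println(new SketchWriteClient(instants, true).pendingInflights());
    }
}
```

In the real code base the equivalent change is the one-line edit in DataSourceUtils.createHoodieClient that the message asks for; where exactly the rollback runs inside Hudi is not shown in this thread.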
Re: About github issue 639
Hi Jun,
You had mentioned that you are seeing the log message
"insert failed with 1 errors"
Did you see any exception stack traces before this message? You can also take a
look at the Spark UI to check the stdout/stderr of failed tasks (if present).
Also, it looks like if you enable "trace" level logging, you would see
exceptions getting logged at the end. So, enabling "trace" level logging is
another way to debug what is happening.
'''log.error(s"$operation failed with ${errorCount} errors :");
if (log.isTraceEnabled) {
log.trace("Printing out the top 100 errors") ...
'''
Balaji.V
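A small sketch of the guard pattern in the snippet above, to show why only the summary line appears unless trace logging is on. It does not use log4j or any Hudi class: `TraceGuardSketch`, `report`, and the sample error text are hypothetical, log levels are modelled as plain string prefixes, and the per-error lines are an assumption, since the original snippet elides what follows the "top 100" message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of level-guarded logging; not Hudi or log4j code.
class TraceGuardSketch {

    // Returns the lines that would be logged for a failed operation.
    static List<String> report(String operation, List<String> errors, boolean traceEnabled) {
        List<String> lines = new ArrayList<>();
        // The summary is always written at error level.
        lines.add("ERROR " + operation + " failed with " + errors.size() + " errors :");
        if (traceEnabled) {
            // Details are only produced when trace logging is switched on.
            lines.add("TRACE Printing out the top 100 errors");
            int limit = Math.min(100, errors.size());
            for (String error : errors.subList(0, limit)) {
                lines.add("TRACE " + error); // assumption: one line per error
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> errors = Arrays.asList("hypothetical write error");
        // Trace disabled: only the summary line is emitted.
        report("insert", errors, false).forEach(System.out::println);
        // Trace enabled: the summary is followed by the error details.
        report("insert", errors, true).forEach(System.out::println);
    }
}
```

With trace disabled the sketch emits just the "insert failed with 1 errors :" line, which matches what was reported from the streaming job.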
On Tuesday, April 30, 2019, 8:17:57 AM PDT, Vinoth Chandar
wrote:
Hi Jun,
Basically you are saying streaming path leaves some inflights behind.. let
me see if I can reproduce it. If you have a simple test case, please share
Thanks
Vinoth
On Tue, Apr 30, 2019 at 1:04 AM Jun Zhu wrote:
> Hi Vinoth,
> In spark streaming log I find "2019-04-30 03:26:11 ERROR
> HoodieSparkSQLWriter:182 - insert failed with 1 errors :"(no continue error
> logs) , during which commit end with inflight and not cleaned.
> Just for feedback, we can dedup data correctly in batch way. Should add
> more logic for exception handling if using spark stream I think.
> Regards,
> Jun
>
>
> On Tue, Apr 30, 2019 at 2:46 AM Vinoth Chandar wrote:
>
> > Another option to try would be setting the
> > spark.sql.hive.convertMetastoreParquet=false, if you are querying via the
> > Hive table registered by Hudi.
> >
> > On Sat, Apr 27, 2019 at 7:02 PM Jun Zhu
> > wrote:
> >
> > > Thanks for explanation vinoth, code was same list in
> > > https://github.com/apache/incubator-hudi/issues/639, with setting table
> > > format to `.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,
> > > DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)`.
> > > And the result data was stored on aws s3.
> > > I will try more on
> > > `spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > > classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> > > classOf[org.apache.hadoop.fs.PathFilter]);` from the phenomenon, the
> > > config did not take effects maybe.
> > >
> > > On Sat, Apr 27, 2019 at 12:09 AM Vinoth Chandar
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > >>The duplicates was found in inflight commit parquet files.
> Wondering
> > if
> > > > this was expected?
> > > > Spark shell should not even be reading in-flight parquet files. Can
> you
> > > > double check if the spark access is properly configured?
> > > > http://hudi.apache.org/querying_data.html#spark
> > > >
> > > > Inflight should be rolled back at the start of the next commit/delta
> > > > commit.. Not sure why there are so many inflight delta commits.
> > > > If you can give a reproducible case, happy to debug it more..
> > > >
> > > > Only complete instants are archived.. So yes, inflight is not
> > archived..
> > > >
> > > > Hope that helps
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Fri, Apr 26, 2019 at 2:09 AM Jun Zhu
> > > > wrote:
> > > >
> > > > > Hi Vinoth,
> > > > > Some continue question about this thread.
> > > > > Here is what I found after running a few days:
> > > > > in .hoodie folder, due to retain policy maybe, there is an
> obviously
> > > > > line(list in the end of email). Before it the cleaned commit was
> > > > archived,
> > > > > find duplication when query inflight commit correspond partition by
> > > > > spark-shell. After the line, all behave normal, global dedup works.
> > > > > The duplicates was found in inflight commit parquet files.
> Wondering
> > if
> > > > > this was expected?
> > > > > Q:
> > > > > 1. The inflight commit should be turned to roll back status in
> next
> > > > > writes. Is it normal that so many inflight commit did not make it?
> Or
> > > > can I
> > > > > config a retain policy to turn inflight to roll_back in another
> way?
> > > > > 2. Did commit retain policy do not archive inflight commit?
> > > > >
> > > > > 2019-04-23 20:23:47Â Â Â Â 378 20190423122339.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 20:43:53Â Â Â Â 378 20190423124343.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 22:14:04Â Â Â Â 378 20190423141354.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 22:44:09Â Â Â Â 378 20190423144400.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 22:54:18Â Â Â Â 378 20190423145408.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 23:04:09Â Â Â Â 378 20190423150400.deltacommit.inflight
> > > > >
> > > > > 2019-04-23 23:24:30Â Â Â Â 378 20190423152421.deltacommit.inflight
> > > > >
> > > > > *2019-04-23 23:44:34Â Â Â Â 378
> 20190423154424.deltacommit.inflight*
> > > > >
> > > > > *2019-04-24 00:15:46Â Â Â 2991 20190423161431.clean*
> > > > >
> > > > > 2019-04-24 00:15:21Â Â 870536 20190423161431.deltacommit
> > > > >
> > > > > 2019-04-24 00:25:19Â Â Â 2991 20190423162424.clean
> > > > >
> > > > > 2019-04-24 00:25:09Â Â 875825 20190423162424.deltacommit
> > > > >
Re: About github issue 639
Hi Jun,
Basically, you are saying the streaming path leaves some inflights behind. Let
me see if I can reproduce it. If you have a simple test case, please share it.
Thanks
Vinoth
On Tue, Apr 30, 2019 at 1:04 AM Jun Zhu wrote:
Re: About github issue 639
Hi Vinoth,
In the Spark streaming log I find "2019-04-30 03:26:11 ERROR
HoodieSparkSQLWriter:182 - insert failed with 1 errors :" (with no further
error logs). When this happens, the commit is left inflight and is not cleaned up.
As feedback: deduplication works correctly in batch mode. I think the Spark
streaming path needs more exception-handling logic.
Regards,
Jun
On Tue, Apr 30, 2019 at 2:46 AM Vinoth Chandar wrote:
Re: About github issue 639
Another option to try would be setting
spark.sql.hive.convertMetastoreParquet=false, if you are querying via the
Hive table registered by Hudi.
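A minimal sketch of setting this when the session is built, assuming Spark 2.x with Hive support available. The application name is a placeholder. With the flag off, Spark reads the Hive table through the Hive SerDe and input format rather than its built-in Parquet reader.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-hive-query") // hypothetical name
  // Do not convert metastore Parquet tables to Spark's native Parquet reader.
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
```

The same key can also be passed on the command line as `--conf spark.sql.hive.convertMetastoreParquet=false`.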
On Sat, Apr 27, 2019 at 7:02 PM Jun Zhu wrote:
Re: About github issue 639
Thanks for the explanation, Vinoth. The code is the same as listed in
https://github.com/apache/incubator-hudi/issues/639, with the table
format set to `.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,
DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)`.
The result data is stored on AWS S3.
I will experiment more with
`spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);` since, from what I observe, the
config may not have taken effect.
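One way to check whether the setting is actually present is to read the key back from the same Hadoop configuration. This is only a sketch for spark-shell, assuming `spark` is the shell's session and the Hudi bundle is on the classpath. It confirms the key is set on the driver, not that every read path honours it.

```scala
import org.apache.hadoop.fs.PathFilter

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Register the read-optimized path filter, as in the snippet above.
hadoopConf.setClass("mapreduce.input.pathFilter.class",
  classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
  classOf[PathFilter])

// Read the key back; null here would mean the filter was never registered.
println(hadoopConf.get("mapreduce.input.pathFilter.class"))
```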
On Sat, Apr 27, 2019 at 12:09 AM Vinoth Chandar wrote:
Re: About github issue 639
Hi,
>>The duplicates was found in inflight commit parquet files. Wondering if
this was expected?
Spark shell should not even be reading in-flight parquet files. Can you
double check if the spark access is properly configured?
http://hudi.apache.org/querying_data.html#spark
Inflight commits should be rolled back at the start of the next commit/delta
commit. Not sure why there are so many inflight delta commits.
If you can give a reproducible case, happy to debug it more.
Only complete instants are archived. So yes, inflight instants are not archived.
Hope that helps
Thanks
Vinoth
On Fri, Apr 26, 2019 at 2:09 AM Jun Zhu wrote:
Re: About github issue 639
Hi Vinoth,
A follow-up question on this thread.
Here is what I found after running for a few days:
In the .hoodie folder, possibly due to the retention policy, there is an obvious
dividing line (marked in the listing at the end of this email). Before it, the
cleaned commits were archived, and I find duplicates when I query, via
spark-shell, the partitions that the inflight commits correspond to. After the
line, everything behaves normally and global dedup works.
The duplicates were found in the parquet files of inflight commits. Is this
expected?
Q:
1. An inflight commit should be turned to rolled-back status by the next
write. Is it normal that so many inflight commits did not make it? Or can I
configure a retention policy to roll back inflight commits some other way?
2. Does the commit retention policy not archive inflight commits?
2019-04-23 20:23:47 378 20190423122339.deltacommit.inflight
2019-04-23 20:43:53 378 20190423124343.deltacommit.inflight
2019-04-23 22:14:04 378 20190423141354.deltacommit.inflight
2019-04-23 22:44:09 378 20190423144400.deltacommit.inflight
2019-04-23 22:54:18 378 20190423145408.deltacommit.inflight
2019-04-23 23:04:09 378 20190423150400.deltacommit.inflight
2019-04-23 23:24:30 378 20190423152421.deltacommit.inflight
*2019-04-23 23:44:34 378 20190423154424.deltacommit.inflight*
*2019-04-24 00:15:46 2991 20190423161431.clean*
2019-04-24 00:15:21 870536 20190423161431.deltacommit
2019-04-24 00:25:19 2991 20190423162424.clean
2019-04-24 00:25:09 875825 20190423162424.deltacommit
2019-04-24 00:35:26 2991 20190423163429.clean
2019-04-24 00:35:18 881925 20190423163429.deltacommit
2019-04-24 00:46:14 2991 20190423164428.clean
2019-04-24 00:45:44 888025 20190423164428.deltacommit
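As a rough illustration of how a listing like this can be scanned for leftover inflight instants, here is a small Scala sketch that can be pasted into a Scala REPL or spark-shell. It is not Hudi code: the `Instant` case class and both helper names are hypothetical, and it assumes each line has the form "date time size file-name" with file names shaped like `<timestamp>.<action>[.inflight]`.

```scala
// Hypothetical model of one timeline entry; not a Hudi class.
final case class Instant(timestamp: String, action: String, inflight: Boolean)

// "20190423154424.deltacommit.inflight" -> inflight delta commit
// "20190423161431.clean"                -> completed clean
def parseInstant(fileName: String): Option[Instant] =
  fileName.split('.') match {
    case Array(ts, action)             => Some(Instant(ts, action, inflight = false))
    case Array(ts, action, "inflight") => Some(Instant(ts, action, inflight = true))
    case _                             => None
  }

// The file name is the last whitespace-separated field of each listing line.
def leftoverInflights(listing: Seq[String]): Seq[Instant] =
  listing
    .flatMap(line => line.trim.split("\\s+").lastOption)
    .flatMap(parseInstant)
    .filter(_.inflight)

val listing = Seq(
  "2019-04-23 23:44:34 378 20190423154424.deltacommit.inflight",
  "2019-04-24 00:15:46 2991 20190423161431.clean",
  "2019-04-24 00:15:21 870536 20190423161431.deltacommit"
)

leftoverInflights(listing).foreach(i => println(s"${i.timestamp} ${i.action}"))
```

Run against the three sample lines, only the 20190423154424 delta commit is reported, which matches the last inflight entry above the dividing line.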
Thanks,
Jun
On 2019/04/18 14:29:23, Vinoth Chandar wrote:
> Hi Jun,>
>
> Responses below.>
>
> >>1. Some file inflight may never reach commit?>
> yes. the next attempt at writing will first issue a rollback to clean up>
> such partial/leftover files first, before it begins the new commit.>
>
> >>2. In occasion which inflight and parquet file generated by inflight
still>
> exists, the global dedup will not dedup based on such kind file?>
> even if not rolled back, we check for the inflight parquet files against>
> the committed timeline, which it wont be a part of. So should be safe.>
>
>
> >>3. In occasion which inflight and parquet file generated by inflight
still>
> exists, the correct query result will be decided by read config(I>
> mean mapreduce.input.pathFilter.class>
> in sparksql)>
> yes. the filtering should work as well. its the same technique used by>
> writer.>
>
>
> >>4. Is there any way we can use>
>
> >>
>
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",>
> > classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],>
> > classOf[org.apache.hadoop.fs.PathFilter]);>
>
> in spark thrift server when start it?>
>
> I am not familiar with the Spark thrift server myself. Any pointers where
I>
> can learn more?>
> Two suggestions :>
> - You can check if you can add this to the Hadoop configuration xml
files>
> and see if it gets picked up by Spark?>
> - Alternatively, you can set the spark config mentioned here>
> http://hudi.apache.org/querying_data.html#spark-rt-view (works for ro
view>
> also), which should be doable I am assuming at this thrift server>
>
>
> Thanks>
> Vinoth>
>
>
> On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu wrote:
>
> > Hi,
> > Link: https://github.com/apache/incubator-hudi/issues/639
> > Sorry , failed open https://lists.apache.org/[email protected]
> > .
> > I have some follow up questions for issue 639:
> >
> > > So, the sequence of events is . We write parquet files and then upon
> > > successful writing of all attempted parquet files, we actually make the
> > > commit as completed. (i.e not inflight anymore). So this is normal. This is
> > > done to prevent queries from reading partially written parquet files..
> >
> > Does that mean:
> > 1. Some file inflight may never reach commit?
> > 2. In occasion which inflight and parquet file generated by inflight still
> > exists, the global dedup will not dedup based on such kind file?
> > 3. In occasion which inflight and parquet file generated by inflight still
> > exists, the correct query result will be decided by read config(I
> > mean mapreduce.input.pathFilter.class
> > in sparksql)
> > 4. Is there any way we can use
> >
> > > spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > > classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> > > classOf[org.apache.hadoop.fs.PathFilter]);
> >
> > in spark thrift server when start it?
> >
> > Best,
> > --
> > *Jun Zhu*
> > Sr. Engineer I, Data
> > +86 18565739171
Re: About github issue 639
Hi Jun,
Responses below.
>>1. Some file inflight may never reach commit?
Yes. The next attempt at writing will first issue a rollback to clean up
such partial/leftover files, before it begins the new commit.
>>2. In occasion which inflight and parquet file generated by inflight still
exists, the global dedup will not dedup based on such kind file?
Even if they are not rolled back, we check the inflight parquet files against
the committed timeline, which they won't be a part of. So it should be safe.
>>3. In occasion which inflight and parquet file generated by inflight still
exists, the correct query result will be decided by read config(I
mean mapreduce.input.pathFilter.class
in sparksql)
Yes. The filtering should work as well. It's the same technique used by the
writer.
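To make the idea in answers 2 and 3 concrete, here is a minimal sketch of filtering files against the completed timeline. It is a simplified model and not Hudi's actual implementation: the class CommittedFileFilter, its methods, and the <fileId>_<commitTime>.parquet naming are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified model for illustration only; these are not Hudi classes.
class CommittedFileFilter {

    // Assumed naming for this sketch: <fileId>_<commitTime>.parquet
    static String commitTimeOf(String fileName) {
        String base = fileName.substring(0, fileName.lastIndexOf(".parquet"));
        return base.substring(base.lastIndexOf('_') + 1);
    }

    // Keep only files whose commit time appears on the completed timeline.
    // Files written by an inflight (or failed) commit are skipped.
    static List<String> visibleFiles(List<String> files, Set<String> completedCommits) {
        return files.stream()
                .filter(f -> completedCommits.contains(commitTimeOf(f)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> completed =
                new HashSet<>(Arrays.asList("20190417000000", "20190417001000"));
        List<String> files = Arrays.asList(
                "f1_20190417000000.parquet",
                "f1_20190417001000.parquet",
                "f2_20190417002000.parquet"); // written by a commit that is still inflight
        System.out.println(visibleFiles(files, completed));
    }
}
```

In this model the third file stays on disk until a rollback removes it, but neither a reader nor the dedup check would see it, because its commit time never made it onto the completed timeline.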
>>4. Is there any way we can use
> spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> classOf[org.apache.hadoop.fs.PathFilter]);
in spark thrift server when start it?
I am not familiar with the Spark thrift server myself. Any pointers where I
can learn more?
Two suggestions:
- You can try adding this to the Hadoop configuration XML files and see if
it gets picked up by Spark.
- Alternatively, you can set the Spark config mentioned here
http://hudi.apache.org/querying_data.html#spark-rt-view (works for the ro view
also), which I assume should be doable on the thrift server.
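
For the first suggestion, the entry might look like the fragment below, placed in one of the Hadoop configuration XML files that the thrift server loads. This is an untested sketch: the property name and class are taken from the snippet in question 4, and whether the Spark thrift server actually picks it up is an assumption that needs checking.

```xml
<!-- Assumption: goes inside the <configuration> element of a Hadoop config XML file -->
<property>
  <name>mapreduce.input.pathFilter.class</name>
  <value>com.uber.hoodie.hadoop.HoodieROTablePathFilter</value>
</property>
```

The filter class also has to be on the thrift server's classpath for this to have any chance of working.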
Thanks
Vinoth
On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu wrote:
> Hi,
> Link: https://github.com/apache/incubator-hudi/issues/639
> Sorry , failed open https://lists.apache.org/[email protected]
> .
> I have some follow up questions for issue 639:
>
> > So, the sequence of events is . We write parquet files and then upon
> > successful writing of all attempted parquet files, we actually make the
> > commit as completed. (i.e not inflight anymore). So this is normal. This is
> > done to prevent queries from reading partially written parquet files..
>
> Does that mean:
> 1. Some file inflight may never reach commit?
> 2. In occasion which inflight and parquet file generated by inflight still
> exists, the global dedup will not dedup based on such kind file?
> 3. In occasion which inflight and parquet file generated by inflight still
> exists, the correct query result will be decided by read config(I
> mean mapreduce.input.pathFilter.class
> in sparksql)
> 4. Is there any way we can use
>
> > spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> > classOf[org.apache.hadoop.fs.PathFilter]);
>
> in spark thrift server when start it?
>
> Best,
> --
> *Jun Zhu*
> Sr. Engineer I, Data
> +86 18565739171
> Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
>
