[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-14 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324424618
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
 Review comment:
   Looks good.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324034722
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
-Across all sources, Hudi adds the much needed ability to atomically publish 
new data to consumers via notion of commits, shielding them from partial 
ingestion failures
+在所有源中,Hudi增加了通过提交概念以原子方式向消费者发布新数据所需的能力,使其免受部分摄取失败的影响。
 
+## 近实时分析
 
-## Near Real-time Analytics
+通常,实时[datamarts](https://en.wikipedia.org/wiki/Data_mart)由专业数据存储分析提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
+这对于较小规模([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))的数据量来说绝对是完美的,这需要在亚秒级响应查询,例如系统监控或交互式实时分析。
+但是,由于Hadoop上的数据太陈旧了,通常这些系统会被滥用于较少的交互式查询,这导致利用率不足和硬件/许可证成本的浪费。
 
-Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are 
powered by specialized analytical stores such as [Druid](http://druid.io/) or 
[Memsql](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/) .
 
 Review comment:
  How about this translation: “专业(实时)数据分析存储”?




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324036810
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
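
A minimal spark-shell sketch of the upsert-based ingestion described above, applying change records (for example, pulled from the MySQL bin log or a Sqoop incremental import) to a Hudi table on DFS. The DataFrame, table name, paths, and columns (`id`, `ts`, `ds`) are illustrative assumptions, not taken from the page.

```scala
// spark-shell sketch: upsert incremental changes into a Hudi table on DFS.
// `changes` is assumed to hold rows captured from the MySQL bin log or a
// Sqoop incremental import, with columns id, ts and ds (assumed layout).
val changes = spark.read.parquet("/staging/mysql/orders_increment")

changes.write.format("org.apache.hudi").
  option("hoodie.table.name", "orders").
  option("hoodie.datasource.write.operation", "upsert").        // merge into existing records instead of a full bulk reload
  option("hoodie.datasource.write.recordkey.field", "id").      // key used to locate the records to update
  option("hoodie.datasource.write.precombine.field", "ts").     // newest change wins if a key repeats within the batch
  option("hoodie.datasource.write.partitionpath.field", "ds").  // partition column
  mode("append").
  save("/data/hudi/orders")
```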
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
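
The minimum-file-size behaviour mentioned above is driven by Hudi's file-sizing write configs. A hedged sketch, assuming a `click_events` table and illustrative byte values:

```scala
// spark-shell sketch: file-sizing knobs that keep HDFS files from staying tiny.
// Table, columns and byte values are illustrative assumptions.
val events = spark.read.json("/staging/kafka/click_events")

events.write.format("org.apache.hudi").
  option("hoodie.table.name", "click_events").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.datasource.write.recordkey.field", "event_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "ds").
  option("hoodie.parquet.small.file.limit", "104857600").  // files under ~100MB get "topped up" by later writes
  option("hoodie.parquet.max.file.size", "125829120").     // cap (~120MB) on how large a base file may grow
  mode("append").
  save("/data/hudi/click_events")
```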
 
-Across all sources, Hudi adds the much needed ability to atomically publish 
new data to consumers via notion of commits, shielding them from partial 
ingestion failures
+在所有源中,Hudi增加了通过提交概念以原子方式向消费者发布新数据所需的能力,使其免受部分摄取失败的影响。
 
+## 近实时分析
 
-## Near Real-time Analytics
+通常,实时[datamarts](https://en.wikipedia.org/wiki/Data_mart)由专业数据存储分析提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
+这对于较小规模([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))的数据量来说绝对是完美的,这需要在亚秒级响应查询,例如系统监控或交互式实时分析。
+但是,由于Hadoop上的数据太陈旧了,通常这些系统会被滥用于较少的交互式查询,这导致利用率不足和硬件/许可证成本的浪费。
 
-Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are 
powered by specialized analytical stores such as [Druid](http://druid.io/) or 
[Memsql](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/) .
-This is absolutely perfect for lower scale ([relative to Hadoop installations 
like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)) data, 
that needs sub-second query responses such as system monitoring or interactive 
real-time analysis.
-But, typically these systems end up getting abused for less interactive 
queries also since data on Hadoop is intolerably stale. This leads to under 
utilization & wasteful hardware/license costs.
+另一方面,Hadoop上的交互式SQL解决方案(如Presto和SparkSQL)表现出色,在__几秒钟内完成查询__。
+通过将__数据新鲜度提高到几分钟__,Hudi可以提供一个更有效的替代方案,并支持存储在DFS中的__数量级更大的数据集__的实时分析。
+此外,Hudi没有外部依赖(如专用于实时分析的HBase集群),因此可以在更新的分析上实现更快的分析,而不会增加操作开销。
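
A minimal sketch of the Spark SQL side of this, assuming the `orders` table written in the earlier sketch (a similar query can be issued from Presto once the table is synced to the Hive metastore; the exact load path or glob depends on the Hudi version and partition layout):

```scala
// spark-shell sketch: query the Hudi table minutes after the ingestion job
// committed, without copying data into a separate real-time store.
// Older Hudi releases may need a partition glob such as "/data/hudi/orders/*/*".
val orders = spark.read.format("org.apache.hudi").load("/data/hudi/orders")
orders.createOrReplaceTempView("orders_rt")

spark.sql("""
  SELECT ds, COUNT(*) AS order_cnt, SUM(amount) AS revenue
  FROM orders_rt
  GROUP BY ds
  ORDER BY ds DESC
""").show()
```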
 
 
-On the other hand, 

[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324034394
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
-Across all sources, Hudi adds the much needed ability to atomically publish 
new data to consumers via notion of commits, shielding them from partial 
ingestion failures
+在所有源中,Hudi增加了通过提交概念以原子方式向消费者发布新数据所需的能力,使其免受部分摄取失败的影响。
 
+## 近实时分析
 
-## Near Real-time Analytics
+通常,实时[datamarts](https://en.wikipedia.org/wiki/Data_mart)由专业数据存储分析提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
 
 Review comment:
   “datamarts” can be translated to “数据集市”?




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324035571
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
-Across all sources, Hudi adds the much needed ability to atomically publish 
new data to consumers via notion of commits, shielding them from partial 
ingestion failures
+在所有源中,Hudi增加了通过提交概念以原子方式向消费者发布新数据所需的能力,使其免受部分摄取失败的影响。
 
+## 近实时分析
 
-## Near Real-time Analytics
+通常,实时[datamarts](https://en.wikipedia.org/wiki/Data_mart)由专业数据存储分析提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
+这对于较小规模([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))的数据量来说绝对是完美的,这需要在亚秒级响应查询,例如系统监控或交互式实时分析。
+但是,由于Hadoop上的数据太陈旧了,通常这些系统会被滥用于较少的交互式查询,这导致利用率不足和硬件/许可证成本的浪费。
 
 Review comment:
   “较少的交互式查询” => “非交互式查询”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324032642
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
 Review comment:
   “这以整体的方式解决...” feels a bit awkward, yet I’m not sure how to improve it.  See 
if you can make this part better.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r324035236
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 
这以整体的方式解决[Hadoop中的一个老问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop群集造成严重损害。
 
-Across all sources, Hudi adds the much needed ability to atomically publish 
new data to consumers via notion of commits, shielding them from partial 
ingestion failures
+在所有源中,Hudi增加了通过提交概念以原子方式向消费者发布新数据所需的能力,使其免受部分摄取失败的影响。
 
+## 近实时分析
 
-## Near Real-time Analytics
+通常,实时[datamarts](https://en.wikipedia.org/wiki/Data_mart)由专业数据存储分析提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
+这对于较小规模([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter))的数据量来说绝对是完美的,这需要在亚秒级响应查询,例如系统监控或交互式实时分析。
 
 Review comment:
   “[相比于这样安装Hadoop]” should be after “完美的”?
   “这需要” => “这种情况需要”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323891191
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 Review comment:
   “可能不是全部” => “如果不是全部”
   
   For “using a medley of ingestion tools”, perhaps this translation reads better:
   “使用多个不同的摄取工具”
   
   




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323895356
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 Review comment:
   => “这比批量合并任务及复杂的手工合并工作流更快/更有效率”




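The comparison discussed above is between incremental upserts and bulk merge jobs; the same commits also let downstream jobs read only newly written records instead of rescanning the whole table. A minimal sketch of such an incremental read, again assuming the Hudi Spark datasource (older releases spell the query-type option as hoodie.datasource.view.type), with the instant value and paths chosen only for illustration:

    // Minimal sketch: pull only records committed after a given instant.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()

    // Commit instant recorded from the previous run (illustrative value).
    val sinceInstant = "20190912000000"

    val delta = spark.read.format("org.apache.hudi").
      option("hoodie.datasource.query.type", "incremental").
      option("hoodie.datasource.read.begin.instanttime", sinceInstant).
      load("/data/hudi/orders")

    delta.createOrReplaceTempView("orders_delta")
    spark.sql("SELECT ds, COUNT(*) AS cnt FROM orders_delta GROUP BY ds").show()
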
[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323887264
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 Review comment:
   “下面展示一些...,示例说明了...” => “以下是一些,说明了...”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323895886
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)更快/更有效率或[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
 
 
-For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
-It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__ 如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
 
 Review comment:
   Add “,” after “全量加载不可行”?




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323888340
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
 
 Review comment:
   “,” => “、” for list items within the pair of brackets.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #884: [HUDI-240] Translate Use Cases page

2019-09-12 Thread GitBox
yihua commented on a change in pull request #884: [HUDI-240] Translate Use 
Cases page
URL: https://github.com/apache/incubator-hudi/pull/884#discussion_r323895316
 
 

 ##
 File path: docs/use_cases.cn.md
 ##
 @@ -4,73 +4,65 @@ keywords: hudi, data ingestion, etl, real time, use cases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
 toc: false
-summary: "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
+summary: "下面展示一些使用Hudi的示例,示例说明了加快处理速度和提高效率的好处"
 
 ---
 
 
 
-## Near Real-Time Ingestion
+## 近实时摄取
 
-Ingesting data from external sources like (event logs, databases, external 
sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) 
is a well known problem.
-In most (if not all) Hadoop deployments, it is unfortunately solved in a 
piecemeal fashion, using a medley of ingestion tools,
-even though this data is arguably the most valuable for the entire 
organization.
+将外部源(如事件日志,数据库,外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(可能不是全部)Hadoop部署中都使用零散的方式解决,即使用混合摄取工具。
 
 
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed 
costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or 
[Sqoop Incremental 
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
 and apply them to an
-equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk 
merge 
job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
-or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+对于RDBMS摄取,Hudi提供__通过更新插入更快地加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL 
BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
 
 Review comment:
   “通过更新插入更快地加载” => “通过更新插入达到的更快加载”


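Finally, the paragraph quoted earlier about enforcing a minimum file size on HDFS maps to a pair of write-time configs. Below is a minimal sketch of an event-stream write with explicit file sizing; the keys are taken from the Hudi configuration docs, while the dataset, fields and sizes are illustrative assumptions:

    // Minimal sketch: insert click-stream events while letting Hudi manage file sizes.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-file-sizing-sketch").getOrCreate()
    val events = spark.read.json("/tmp/staging/click_events")

    events.write.format("org.apache.hudi").
      option("hoodie.table.name", "click_events").
      option("hoodie.datasource.write.operation", "insert").
      option("hoodie.datasource.write.recordkey.field", "event_id").
      option("hoodie.datasource.write.precombine.field", "event_ts").
      option("hoodie.datasource.write.partitionpath.field", "dt").
      option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString).    // target file size
      option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString). // files below this are topped up on later writes
      mode(SaveMode.Append).
      save("/data/hudi/click_events")

This sizing behaviour is what keeps high-volume streams such as click streams from flooding the NameNode with small files.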