[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r315945457

File path: .travis.yml

@@ -164,6 +164,10 @@ matrix:
   script:
     - $MVN test -pl 'web-console'
+- name: "docs"
+  install: cd website && npm install
+  script: cd website && npm run lint

Review comment:
From the travis log, it looks like the `cd website` done in the `install` step stays in effect during the `script` step. Some alternatives are to use pushd/popd or a subshell.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org
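A minimal sketch of the two scoping alternatives the review suggests; the `website` directory is created here only so the sketch is self-contained, and the `npm` commands are replaced by `echo` placeholders:

```shell
#!/usr/bin/env bash
# Demonstrates why a bare `cd` in one CI step leaks into the next,
# and two ways to scope it.
set -euo pipefail
mkdir -p website
start="$PWD"

# Alternative 1: a subshell -- the cd is confined to the parentheses,
# so the outer shell's working directory is untouched afterwards.
(cd website && echo "install step runs in $PWD")
[ "$PWD" = "$start" ] && echo "after subshell: back in starting directory"

# Alternative 2: pushd/popd -- save the directory, work, then restore it.
pushd website > /dev/null
echo "script step runs in $PWD"
popd > /dev/null
[ "$PWD" = "$start" ] && echo "after popd: back in starting directory"
```

Either form keeps each Travis step's working directory independent of the others, which is the behavior the `.travis.yml` snippet above implicitly assumes.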
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314429121

File path: docs/dependencies/metadata-storage.md

@@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
 ### Rule Table

Review comment:
The title case style here is not consistent with the changes you made above.
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314434911

File path: docs/ingestion/index.md

@@ -0,0 +1,769 @@
+---
+id: index
+title: "Ingestion"
+---
+
+## Overview
+
+All data in Druid is organized into _segments_, which are data files that generally have up to a few million rows each.
+Loading data in Druid is called _ingestion_ or _indexing_ and consists of reading data from a source system and creating
+segments based on that data.
+
+In most ingestion methods, the work of loading data is done by Druid MiddleManager processes. One exception is
+Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager
+processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored
+in [deep storage](../dependencies/deep-storage.md), they will be loaded by Historical processes. For more details on
+how this works under the hood, see the [Storage design](../design/architecture.md#storage-design) section of Druid's design
+documentation.
+
+## How to use this documentation
+
+This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
+configurations that are common to all [ingestion methods](#ingestion-methods).
+
+The **individual pages for each ingestion method** provide additional information about concepts and configurations
+that are unique to each ingestion method.
+
+We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
+ingestion method or methods that you have chosen.
+
+## Ingestion methods
+
+The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
+about how each method works, as well as configuration properties specific to that method, check out its documentation
+page.
+
+### Streaming
+
+The most recommended, and most popular, method of streaming ingestion is the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. The Kinesis
+indexing service also works well if you prefer Kinesis.
+
+This table compares the major available options:
+
+| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) | [Tranquility](tranquility.md) |
+|---|---|---|---|
+| **Supervisor type** | `kafka` | `kinesis` | N/A |
+| **How it works** | Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis. | Tranquility, a library that ships separately from Druid, is used to push data into Druid. |
+| **Can ingest late data?** | Yes | Yes | No (late data is dropped based on the `windowPeriod` config) |
+| **Exactly-once guarantees?** | Yes | Yes | No |
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
+table compares and contrasts the three batch ingestion options.
+
+In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
+an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion is the right choice,
+especially due to its support for custom partitioning options and reading binary data formats.
+
+This table compares the major available options:
+
+| **Method** | [Native batch (simple)](native-batch.html#simple-task) | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) |
+|---|---|---|---|
+| **Task type** | `index` | `index_parallel` | `index_hadoop` |
+| **Automatically parallel?** | No. Each task is single-threaded. | Yes, if firehose is splittable. See [firehose documentation](native-batch.md#firehoses) for details. | Yes, always. |
+| **Can append or overwrite?** | Yes, both. | Yes, both. | Overwrite only. |
+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md
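To make the task types in the quoted table concrete, here is a hedged sketch of a minimal native batch `index_parallel` spec of that era. The datasource name, dimensions, and local paths are hypothetical placeholders, and the submission step is commented out because it assumes a running Overlord on `localhost:8090`:

```shell
#!/usr/bin/env bash
# Sketch of a native batch (index_parallel) task spec matching the table
# above. Datasource, columns, and paths are hypothetical placeholders.
set -euo pipefail
cat > /tmp/index-task.json <<'EOF'
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "example_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timestamp", "format": "iso"},
          "dimensionsSpec": {"dimensions": ["page", "user"]}
        }
      },
      "metricsSpec": [{"type": "count", "name": "count"}],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {"type": "local", "baseDir": "/tmp/data", "filter": "*.json"}
    },
    "tuningConfig": {"type": "index_parallel", "forceGuaranteedRollup": false}
  }
}
EOF
# Sanity-check that the spec is well-formed JSON before submitting it.
python3 -m json.tool < /tmp/index-task.json > /dev/null && echo "spec parses"

# Submitting it (assumes a running Overlord; endpoint per the Druid task API):
# curl -X POST -H 'Content-Type: application/json' \
#   -d @/tmp/index-task.json http://localhost:8090/druid/indexer/v1/task
```

Note how `forceGuaranteedRollup` in the `tuningConfig` corresponds to the "Rollup modes" row of the table: with the parallel task, perfect rollup is only available when that flag is set.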
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314431778

File path: docs/ingestion/index.md

@@ -0,0 +1,769 @@
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
+table compares and contrasts the three batch ingestion options.

Review comment:
This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314436852

File path: docs/ingestion/index.md
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314433247

File path: docs/ingestion/index.md
ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314433990

File path: docs/ingestion/index.md