[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-20 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r315945457
 
 

 ##
 File path: .travis.yml
 ##
 @@ -164,6 +164,10 @@ matrix:
   script:
 - $MVN test -pl 'web-console'
 
+- name: "docs"
+  install: cd website && npm install
+  script: cd website && npm run lint
 
 Review comment:
  From the Travis log, it looks like the `cd website` done in the `install` step stays in effect during the `script` step. Some alternatives are to use `pushd`/`popd` or a subshell.
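To illustrate the subshell alternative (the `pushd`/`popd` variant is analogous; the `website` directory name comes from the diff above, and the actual `npm` commands are elided):

```shell
# The cd inside the parentheses runs in a subshell, so it does not
# leak into the following Travis step.
mkdir -p website
start_dir=$(pwd)

(cd website && echo "install step runs in: $(pwd)")   # e.g. npm install here

# The script step still starts from the original directory.
[ "$(pwd)" = "$start_dir" ] && echo "cwd unchanged"
```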


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314429121
 
 

 ##
 File path: docs/dependencies/metadata-storage.md
 ##
 @@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
 ### Rule Table
 
 Review comment:
  The title case style here is not consistent with the changes you made above.





[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314434911
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -0,0 +1,769 @@
+---
+id: index
+title: "Ingestion"
+---
+
+
+
+## Overview
+
+All data in Druid is organized into _segments_, which are data files that generally have up to a few million rows each.
+Loading data in Druid is called _ingestion_ or _indexing_ and consists of reading data from a source system and creating
+segments based on that data.
+
+In most ingestion methods, the work of loading data is done by Druid MiddleManager processes. One exception is
+Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager
+processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored
+in [deep storage](../dependencies/deep-storage.md), they will be loaded by Historical processes. For more details on
+how this works under the hood, see the [Storage design](../design/architecture.md#storage-design) section of Druid's design
+documentation.
+
+## How to use this documentation
+
+This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
+configurations that are common to all [ingestion methods](#ingestion-methods).
+
+The **individual pages for each ingestion method** provide additional information about concepts and configurations
+that are unique to each ingestion method.
+
+We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
+ingestion method or methods that you have chosen.
+
+## Ingestion methods
+
+The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
+about how each method works, as well as configuration properties specific to that method, check out its documentation
+page.
+
+### Streaming
+
+The most recommended, and most popular, method of streaming ingestion is the
+[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. The Kinesis
+indexing service also works well if you prefer Kinesis.
+
+This table compares the major available options:
+
+| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) | [Tranquility](tranquility.md) |
+|---|---|---|---|
+| **Supervisor type** | `kafka` | `kinesis` | N/A |
+| **How it works** | Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis. | Tranquility, a library that ships separately from Druid, is used to push data into Druid. |
+| **Can ingest late data?** | Yes | Yes | No (late data is dropped based on the `windowPeriod` config) |
+| **Exactly-once guarantees?** | Yes | Yes | No |
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
+table compares and contrasts the three batch ingestion options.
+
+In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
+an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion is the right choice,
+especially due to its support for custom partitioning options and reading binary data formats.
+
+This table compares the major available options:
+
+| **Method** | [Native batch (simple)](native-batch.html#simple-task) | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) |
+|---|---|---|---|
+| **Task type** | `index` | `index_parallel` | `index_hadoop` |
+| **Automatically parallel?** | No. Each task is single-threaded. | Yes, if firehose is splittable. See [firehose documentation](native-batch.md#firehoses) for details. | Yes, always. |
+| **Can append or overwrite?** | Yes, both. | Yes, both. | Overwrite only. |
+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the

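As a rough illustration of the batch comparison quoted above (a real ingestion spec carries many more fields; the JSON here is deliberately skeletal), each method is selected by the task `type`:

```shell
# Illustrative only: the "type" field is what selects among the three
# batch ingestion methods compared in the table; all other spec fields
# are omitted here.
for task_type in index index_parallel index_hadoop; do
  printf '{"type": "%s"}\n' "$task_type"
done
```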
[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314431778
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -0,0 +1,769 @@
+
+### Batch
+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
+table compares and contrasts the three batch ingestion options.
 
 Review comment:
  This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?





[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314436852
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -0,0 +1,769 @@

[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314433247
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -0,0 +1,769 @@

[GitHub] [incubator-druid] ccaominh commented on a change in pull request #8311: Docusaurus build framework + ingestion doc refresh.

2019-08-15 Thread GitBox
ccaominh commented on a change in pull request #8311: Docusaurus build 
framework + ingestion doc refresh.
URL: https://github.com/apache/incubator-druid/pull/8311#discussion_r314433990
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -0,0 +1,769 @@