[1/2] carbondata-site git commit: fix some typo issues

chenliang613 Thu, 27 Jul 2017 18:14:43 -0700

Repository: carbondata-site
Updated Branches:
  refs/heads/asf-site b41bbc68b -> 650ce2993



http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/650ce299/src/site/markdown/installation-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/installation-guide.md 
b/src/site/markdown/installation-guide.md
new file mode 100644
index 0000000..a0fc690
--- /dev/null
+++ b/src/site/markdown/installation-guide.md
@@ -0,0 +1,190 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+# Installation Guide
+This tutorial guides you through the installation and configuration of 
CarbonData in the following two modes :
+
+* [Installing and Configuring CarbonData on Standalone Spark 
Cluster](#installing-and-configuring-carbondata-on-standalone-spark-cluster)
+* [Installing and Configuring CarbonData on "Spark on YARN" 
Cluster](#installing-and-configuring-carbondata-on-spark-on-yarn-cluster)
+
+followed by :
+
+* [Query Execution using CarbonData Thrift 
Server](#query-execution-using-carbondata-thrift-server)
+
+## Installing and Configuring CarbonData on Standalone Spark Cluster
+
+### Prerequisites
+
+   - Hadoop HDFS and Yarn should be installed and running.
+
+   - Spark should be installed and running on all the cluster nodes.
+
+   - CarbonData user should have permission to access HDFS.
+
+
+### Procedure
+
+1. [Build the 
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md) 
project and get the assembly jar from 
`./assembly/target/scala-2.1x/carbondata_xxx.jar`. 
+
+2. Copy `./assembly/target/scala-2.1x/carbondata_xxx.jar` to 
`$SPARK_HOME/carbonlib` folder.
+
+     **NOTE**: Create the carbonlib folder if it does not exist inside 
`$SPARK_HOME` path.
+
+3. Add the carbonlib folder path in the Spark classpath. (Edit 
`$SPARK_HOME/conf/spark-env.sh` file and modify the value of `SPARK_CLASSPATH` 
by appending `$SPARK_HOME/carbonlib/*` to the existing value)
+
+4. Copy the `./conf/carbon.properties.template` file from CarbonData 
repository to `$SPARK_HOME/conf/` folder and rename the file to 
`carbon.properties`.
+
+5. Repeat Step 2 to Step 4 on all the nodes of the cluster.
+    
+6. In Spark node[master], configure the properties mentioned in the following 
table in `$SPARK_HOME/conf/spark-defaults.conf` file.
+
+| Property | Value | Description |
+|---------------------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+| spark.driver.extraJavaOptions | `-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
+| spark.executor.extraJavaOptions | `-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties` | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. **NOTE**: You can enter multiple values separated by space. |
+
+7. Add the following properties in `$SPARK_HOME/conf/carbon.properties` file:
+
+| Property             | Required | Description | Example | Remark  |
+|----------------------|----------|----------------------------------------------------------------------------------------|-------------------------------------|---------|
+| carbon.storelocation | NO       | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
+
+
+8. Verify the installation. For example:
+
+```
+./spark-shell --master spark://HOSTNAME:PORT --total-executor-cores 2 \
+  --executor-memory 2G
+```
+
+**NOTE**: Make sure you have permissions for CarbonData JARs and files through 
which driver and executor will start.
+
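Steps 2 to 4 above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `install_carbondata` and its three arguments are hypothetical, and you pass in the real path of the assembly jar you built, since its exact name depends on your build.

```shell
# Sketch of Steps 2-4 for a single node (hypothetical helper, not part of CarbonData).
# $1: path to the built assembly jar
# $2: path to conf/carbon.properties.template in the CarbonData repository
# $3: the Spark home directory of this node
install_carbondata() {
  local jar="$1" template="$2" spark_home="$3"
  local env_file="$spark_home/conf/spark-env.sh"
  # Written literally into spark-env.sh, so the variables stay unexpanded here.
  local line='export SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_HOME/carbonlib/*"'

  # Step 2: create carbonlib if it does not exist and copy the jar into it.
  mkdir -p "$spark_home/carbonlib" "$spark_home/conf"
  cp "$jar" "$spark_home/carbonlib/"

  # Step 3: append carbonlib to SPARK_CLASSPATH, but only if not already present.
  grep -qxF "$line" "$env_file" 2>/dev/null || echo "$line" >> "$env_file"

  # Step 4: copy the template under its final name.
  cp "$template" "$spark_home/conf/carbon.properties"
}
```

Run it once on each node with your own paths, for example `install_carbondata <assembly jar path> <carbondata repo>/conf/carbon.properties.template "$SPARK_HOME"`. The `grep` guard keeps the `SPARK_CLASSPATH` line from being appended twice if the function is re-run.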
+To get started with CarbonData : [Quick Start](quick-start-guide.md), [DDL 
Operations on CarbonData](ddl-operation-on-carbondata.md)
+
+## Installing and Configuring CarbonData on "Spark on YARN" Cluster
+
+   This section provides the procedure to install CarbonData on "Spark on 
YARN" cluster.
+
+### Prerequisites
+   * Hadoop HDFS and Yarn should be installed and running.
+   * Spark should be installed and running in all the clients.
+   * CarbonData user should have permission to access HDFS.
+
+### Procedure
+
+   The following steps are only for Driver Nodes. (Driver nodes are the ones that start the Spark context.)
+
+1. [Build the 
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md) 
project and get the assembly jar from 
`./assembly/target/scala-2.1x/carbondata_xxx.jar` and copy to 
`$SPARK_HOME/carbonlib` folder.
+
+    **NOTE**: Create the carbonlib folder if it does not exist inside `$SPARK_HOME` path.
+
+2. Copy the `./conf/carbon.properties.template` file from CarbonData 
repository to `$SPARK_HOME/conf/` folder and rename the file to 
`carbon.properties`.
+
+3. Create a `tar.gz` file of the carbonlib folder and move it inside the carbonlib folder.
+
+```
+cd $SPARK_HOME
+tar -zcvf carbondata.tar.gz carbonlib/
+mv carbondata.tar.gz carbonlib/
+```
+
+4. Configure the properties mentioned in the following table in 
`$SPARK_HOME/conf/spark-defaults.conf` file.
+
+| Property | Description | Value |
+|---------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
+| spark.master | Set this value to run Spark on a YARN cluster. | Set `yarn-client` to run Spark on YARN in client mode. |
+| spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory of each executor. |`$SPARK_HOME/conf/carbon.properties` |
+| spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working directory of each executor. |`$SPARK_HOME/carbonlib/carbondata.tar.gz` |
+| spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. **NOTE**: You can enter multiple values separated by space. |`-Dcarbon.properties.filepath=carbon.properties` |
+| spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. **NOTE**: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append its values to the parameter below, spark.driver.extraClassPath. |`carbondata.tar.gz/carbonlib/*` |
+| spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. **NOTE**: If SPARK_CLASSPATH is defined in spark-env.sh, then comment it out and append its value to this parameter. |`$SPARK_HOME/carbonlib/*` |
+| spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |`-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties` |
+
+
+5. Add the following properties in `$SPARK_HOME/conf/carbon.properties`:
+
+| Property | Required | Description | Example | Remark |
+|----------------------|----------|----------------------------------------------------------------------------------------|-------------------------------------|---------------|
+| carbon.storelocation | NO | Location where CarbonData will create the store and write the data in its own format. | hdfs://HOSTNAME:PORT/Opt/CarbonStore | Propose to set HDFS directory |
+
+6. Verify the installation.
+
+```
+./bin/spark-shell --master yarn-client --driver-memory 1g \
+  --executor-cores 2 --executor-memory 2G
+```
+  **NOTE**: Make sure you have permissions for CarbonData JARs and files 
through which driver and executor will start.
+
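The Step 4 table can be written into `spark-defaults.conf` with a short shell function. This is a sketch, not an official script: the function name `write_yarn_defaults` is hypothetical, the values are copied from the table, and two assumptions are made. First, the Spark home path is expanded by the shell when the file is written, on the assumption that Spark reads `spark-defaults.conf` literally. Second, the JVM options are written without spaces around `=`, on the assumption that the options string is split on whitespace.

```shell
# Append the "Spark on YARN" properties from Step 4 to spark-defaults.conf.
# $1: the Spark home directory of the driver node (hypothetical helper).
write_yarn_defaults() {
  local spark_home="$1"
  # Unquoted heredoc: $spark_home is expanded now, the * wildcards stay literal.
  cat >> "$spark_home/conf/spark-defaults.conf" <<EOF
spark.master                     yarn-client
spark.yarn.dist.files            $spark_home/conf/carbon.properties
spark.yarn.dist.archives         $spark_home/carbonlib/carbondata.tar.gz
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=carbon.properties
spark.executor.extraClassPath    carbondata.tar.gz/carbonlib/*
spark.driver.extraClassPath      $spark_home/carbonlib/*
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=$spark_home/conf/carbon.properties
EOF
}
```

Calling `write_yarn_defaults "$SPARK_HOME"` appends to the file, so review existing entries for the same properties first to avoid duplicates.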
+  Getting started with CarbonData : [Quick Start](quick-start-guide.md), [DDL 
Operations on CarbonData](ddl-operation-on-carbondata.md)
+
+## Query Execution Using CarbonData Thrift Server
+
+### Starting CarbonData Thrift Server.
+
+   a. cd `$SPARK_HOME`
+
+   b. Run the following command to start the CarbonData thrift server.
+
+```
+./bin/spark-submit \
+  --conf spark.sql.hive.thriftServer.singleSession=true \
+  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
+  $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
+```
+
+| Parameter | Description | Example |
+|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
+| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the 
`$SPARK_HOME/carbonlib/` folder. | 
carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar |
+| carbon_store_path | This is a parameter to the CarbonThriftServer class. It is an HDFS path where CarbonData files will be kept. It is strongly recommended to use the same value as the carbon.storelocation parameter in carbon.properties. | `hdfs://<host_name>:port/user/hive/warehouse/carbon.store` |
+
+**Examples**
+   
+   * Start with default memory and executors.
+
+```
+./bin/spark-submit \
+  --conf spark.sql.hive.thriftServer.singleSession=true \
+  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
+  $SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
+  hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+```
+   
+   * Start with Fixed executors and resources.
+
+```
+./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true \
+  --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
+  --num-executors 3 --driver-memory 20g --executor-memory 250g \
+  --executor-cores 32 \
+  /srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
+  hdfs://<host_name>:port/user/hive/warehouse/carbon.store
+```
+  
+### Connecting to CarbonData Thrift Server Using Beeline.
+
+```
+     cd $SPARK_HOME
+     ./bin/beeline -u jdbc:hive2://<thriftserver_host>:port
+
+     # Example
+     ./bin/beeline -u jdbc:hive2://10.10.10.10:10000
+```
+

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/650ce299/src/site/markdown/quick-start-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/quick-start-guide.md 
b/src/site/markdown/quick-start-guide.md
new file mode 100644
index 0000000..1c490ac
--- /dev/null
+++ b/src/site/markdown/quick-start-guide.md
@@ -0,0 +1,163 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+# Quick Start
+This tutorial provides a quick introduction to using CarbonData.
+
+##  Prerequisites
+* [Installation and building 
CarbonData](https://github.com/apache/carbondata/blob/master/build).
+* Create a sample.csv file using the following commands. The CSV file is 
required for loading data into CarbonData.
+
+  ```
+  cd carbondata
+  cat > sample.csv << EOF
+  id,name,city,age
+  1,david,shenzhen,31
+  2,eason,shenzhen,27
+  3,jarry,wuhan,35
+  EOF
+  ```
+
+## Interactive Analysis with Spark Shell Version 2.1
+
+Apache Spark Shell provides a simple way to learn the API, as well as a 
powerful tool to analyze data interactively. Please visit [Apache Spark 
Documentation](http://spark.apache.org/docs/latest/) for more details on Spark 
shell.
+
+#### Basics
+
+Start Spark shell by running the following command in the Spark directory:
+
+```
+./bin/spark-shell --jars <carbondata assembly jar path>
+```
+**NOTE**: Assembly jar will be available after [building 
CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md) 
and can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
+
+In this shell, SparkSession is readily available as `spark` and Spark context 
is readily available as `sc`.
+
+In order to create a CarbonSession we will have to configure it explicitly in 
the following manner :
+
+* Import the following :
+
+```
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.CarbonSession._
+```
+
+* Create a CarbonSession :
+
+```
+val carbon = SparkSession.builder().config(sc.getConf)
+             .getOrCreateCarbonSession("<hdfs store path>")
+```
+**NOTE**: By default, the metastore location points to `../carbon.metastore`. You can provide your own metastore location to CarbonSession, for example `SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<hdfs store path>", "<local metastore path>")`
+
+#### Executing Queries
+
+###### Creating a Table
+
+```
+scala>carbon.sql("""CREATE TABLE
+                        IF NOT EXISTS test_table(
+                                  id string,
+                                  name string,
+                                  city string,
+                                  age Int)
+                       STORED BY 'carbondata'""")
+```
+
+###### Loading Data to a Table
+
+```
+scala>carbon.sql("""LOAD DATA INPATH 'sample.csv file path'
+                  INTO TABLE test_table""")
+```
+**NOTE**: Please provide the real file path of `sample.csv` for the above 
script.
+
+###### Query Data from a Table
+
+```
+scala>carbon.sql("SELECT * FROM test_table").show()
+
+scala>carbon.sql("""SELECT city, avg(age), sum(age)
+                  FROM test_table
+                  GROUP BY city""").show()
+```
+
+## Interactive Analysis with Spark Shell Version 1.6
+
+#### Basics
+
+Start Spark shell by running the following command in the Spark directory:
+
+```
+./bin/spark-shell --jars <carbondata assembly jar path>
+```
+**NOTE**: Assembly jar will be available after [building CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md) and can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
+
+**NOTE**: In this shell, SparkContext is readily available as `sc`.
+
+* In order to execute the Queries we need to import CarbonContext:
+
+```
+import org.apache.spark.sql.CarbonContext
+```
+
+* Create an instance of CarbonContext in the following manner :
+
+```
+val cc = new CarbonContext(sc, "<hdfs store path>")
+```
+**NOTE**: If running on a local machine without HDFS, configure the local machine's store path instead of the HDFS store path.
+
+#### Executing Queries
+
+###### Creating a Table
+
+```
+scala>cc.sql("""CREATE TABLE
+              IF NOT EXISTS test_table (
+                         id string,
+                         name string,
+                         city string,
+                         age Int)
+              STORED BY 'carbondata'""")
+```
+To see the table created :
+
+```
+scala>cc.sql("SHOW TABLES").show()
+```
+
+###### Loading Data to a Table
+
+```
+scala>cc.sql("""LOAD DATA INPATH 'sample.csv file path'
+              INTO TABLE test_table""")
+```
+**NOTE**: Please provide the real file path of `sample.csv` for the above 
script.
+
+###### Query Data from a Table
+
+```
+scala>cc.sql("SELECT * FROM test_table").show()
+scala>cc.sql("""SELECT city, avg(age), sum(age)
+              FROM test_table
+              GROUP BY city""").show()
+```

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/650ce299/src/site/markdown/supported-data-types-in-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/supported-data-types-in-carbondata.md 
b/src/site/markdown/supported-data-types-in-carbondata.md
new file mode 100644
index 0000000..561248c
--- /dev/null
+++ b/src/site/markdown/supported-data-types-in-carbondata.md
@@ -0,0 +1,42 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+#  Data Types
+
+#### CarbonData supports the following data types:
+
+  * Numeric Types
+    * SMALLINT
+    * INT/INTEGER
+    * BIGINT
+    * DOUBLE
+    * DECIMAL
+
+  * Date/Time Types
+    * TIMESTAMP
+    * DATE
+
+  * String Types
+    * STRING
+    * CHAR
+    * VARCHAR
+
+  * Complex Types
+    * arrays: ARRAY``<data_type>``
+    * structs: STRUCT``<col_name : data_type COMMENT col_comment, ...>``

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/650ce299/src/site/markdown/troubleshooting.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/troubleshooting.md 
b/src/site/markdown/troubleshooting.md
new file mode 100644
index 0000000..5464997
--- /dev/null
+++ b/src/site/markdown/troubleshooting.md
@@ -0,0 +1,246 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+# Troubleshooting
+This tutorial is designed to provide troubleshooting for end users and 
developers
+who are building, deploying, and using CarbonData.
+
+## Failed to load thrift libraries
+
+  **Symptom**
+
+  Thrift throws the following exception :
+
+  ```
+  thrift: error while loading shared libraries:
+  libthriftc.so.0: cannot open shared object file: No such file or directory
+  ```
+
+  **Possible Cause**
+
+  The complete path to the directory containing the libraries is not 
configured correctly.
+
+  **Procedure**
+
+  Follow the Apache thrift docs at 
[https://thrift.apache.org/docs/install](https://thrift.apache.org/docs/install)
 to install thrift correctly.
+
+## Failed to launch the Spark Shell
+
+  **Symptom**
+
+  The shell displays the following error :
+
+  ```
+  org.apache.spark.sql.CarbonContext$$anon$$apache$spark$sql$catalyst$analysis
+  $OverrideCatalog$_setter_$org$apache$spark$sql$catalyst$analysis
+  $OverrideCatalog$$overrides_$e
+  ```
+
+  **Possible Cause**
+
+  The Spark Version and the selected Spark Profile do not match.
+
+  **Procedure**
+
+  1. Ensure your spark version and selected profile for spark are correct.
+
+  2. Use the following command :
+
+```
+mvn -Pspark-2.1 -Dspark.version={yourSparkVersion} clean package
+```
+Note :  Refrain from using "mvn clean package" without specifying the profile.
+
+## Failed to execute load query on cluster.
+
+  **Symptom**
+
+  Load query failed with the following exception:
+
+  ```
+  Dictionary file is locked for updation.
+  ```
+
+  **Possible Cause**
+
+  The carbon.properties file is not identical in all the nodes of the cluster.
+
+  **Procedure**
+
+  Follow the steps to ensure the carbon.properties file is consistent across 
all the nodes:
+
+  1. Copy the carbon.properties file from the master node to all the other nodes in the cluster.
+     For example, you can use scp to copy this file to all the nodes.
+
+  2. For the changes to take effect, restart the Spark cluster.
+
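The copy step in the procedure above can be sketched as a loop over the cluster hosts. This is a minimal illustration under stated assumptions: the function name `sync_carbon_properties`, the hosts file (one hostname per line) and the `COPY_CMD` override are all hypothetical, `$SPARK_HOME` is assumed to be the same path on every node, and passwordless ssh is assumed to be set up.

```shell
# Copy carbon.properties from this (master) node to each host in a list.
# $1: file with one hostname per line (hypothetical input).
# COPY_CMD defaults to scp; set COPY_CMD=echo for a dry run.
sync_carbon_properties() {
  local hosts_file="$1"
  local src="$SPARK_HOME/conf/carbon.properties"
  local copy="${COPY_CMD:-scp}"
  local host
  # Read the host list on file descriptor 3 so the copy command
  # cannot consume the remaining lines from standard input.
  while IFS= read -r host <&3 || [ -n "$host" ]; do
    [ -z "$host" ] && continue            # skip blank lines
    $copy "$src" "$host:$src"
  done 3< "$hosts_file"
}
```

Running it first with `COPY_CMD=echo` prints each source and destination pair without copying anything, which is a cheap way to check the host list before restarting the Spark cluster.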
+## Failed to execute insert query on cluster.
+
+  **Symptom**
+
+  Insert query failed with the following exception:
+
+  ```
+  Dictionary file is locked for updation.
+  ```
+
+  **Possible Cause**
+
+  The carbon.properties file is not identical in all the nodes of the cluster.
+
+  **Procedure**
+
+  Follow the steps to ensure the carbon.properties file is consistent across 
all the nodes:
+
+  1. Copy the carbon.properties file from the master node to all the other 
nodes in the cluster.
+       For example, you can use scp to copy this file to all the nodes.
+
+  2. For the changes to take effect, restart the Spark cluster.
+
+## Failed to connect to hiveuser with thrift
+
+  **Symptom**
+
+  We get the following exception :
+
+  ```
+  Cannot connect to hiveuser.
+  ```
+
+  **Possible Cause**
+
+  The external process does not have permission to access.
+
+  **Procedure**
+
+  Ensure that the hiveuser account in MySQL allows access from external processes.
+
+## Failed to read the metastore db during table creation.
+
+  **Symptom**
+
+  We get the following exception on trying to connect :
+
+  ```
+  Cannot read the metastore db
+  ```
+
+  **Possible Cause**
+
+  The metastore db is dysfunctional.
+
+  **Procedure**
+
+  Remove the metastore db from the carbon.metastore in the Spark Directory.
+
+## Failed to load data on the cluster
+
+  **Symptom**
+
+  Data loading fails with the following exception :
+
+   ```
+   Data Load failure exception
+   ```
+
+  **Possible Cause**
+
+  The following issues can cause the failure :
+
+  1. The core-site.xml, hive-site.xml, yarn-site.xml and carbon.properties files are not consistent across all nodes of the cluster.
+
+  2. Path to hdfs ddl is not configured correctly in the carbon.properties.
+
+  **Procedure**
+
+   Follow the steps to ensure the following configuration files are consistent 
across all the nodes:
+
+   1. Copy the core-site.xml, hive-site.xml, yarn-site.xml and carbon.properties files from the master node to all the other nodes in the cluster.
+      For example, you can use scp to copy these files to all the nodes.
+
+      Note : Set the path to hdfs ddl in carbon.properties in the master node.
+
+   2. For the changes to take effect, restart the Spark cluster.
+
+
+
+## Failed to insert data on the cluster
+
+  **Symptom**
+
+  Insertion fails with the following exception :
+
+   ```
+   Data Load failure exception
+   ```
+
+  **Possible Cause**
+
+  The following issues can cause the failure :
+
+  1. The core-site.xml, hive-site.xml, yarn-site.xml and carbon.properties files are not consistent across all nodes of the cluster.
+
+  2. Path to hdfs ddl is not configured correctly in the carbon.properties.
+
+  **Procedure**
+
+   Follow the steps to ensure the following configuration files are consistent 
across all the nodes:
+
+   1. Copy the core-site.xml, hive-site.xml, yarn-site.xml and carbon.properties files from the master node to all the other nodes in the cluster.
+      For example, you can use scp to copy these files to all the nodes.
+
+      Note : Set the path to hdfs ddl in carbon.properties in the master node.
+
+   2. For the changes to take effect, restart the Spark cluster.
+
+## Failed to execute Concurrent Operations (Load, Insert, Update) on table by multiple workers.
+
+  **Symptom**
+
+  Execution fails with the following exception :
+
+   ```
+   Table is locked for updation.
+   ```
+
+  **Possible Cause**
+
+  Concurrency not supported.
+
+  **Procedure**
+
+  Worker must wait for the query execution to complete and the table to 
release the lock for another query execution to succeed.
+
+## Failed to create a table with a single numeric column.
+
+  **Symptom**
+
+  Execution fails with the following exception :
+
+   ```
+   Table creation fails.
+   ```
+
+  **Possible Cause**
+
+  Behaviour not supported.
+
+  **Procedure**
+
+  A single column that can be considered as dimension is mandatory for table 
creation.

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/650ce299/src/site/markdown/useful-tips-on-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/useful-tips-on-carbondata.md 
b/src/site/markdown/useful-tips-on-carbondata.md
new file mode 100644
index 0000000..6c73b5e
--- /dev/null
+++ b/src/site/markdown/useful-tips-on-carbondata.md
@@ -0,0 +1,236 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+# Useful Tips
+This tutorial guides you to create CarbonData Tables and optimize performance.
+The following sections will elaborate on the above topics :
+
+* [Suggestions to create CarbonData 
Table](#suggestions-to-create-carbondata-table)
+* [Configuration for Optimizing Data Loading performance for Massive 
Data](#configuration-for-optimizing-data-loading-performance-for-massive-data)
+* [Optimizing Mass Data 
Loading](#configurations-for-optimizing-carbondata-performance)
+
+
+## Suggestions to Create CarbonData Table
+
+Recently, CarbonData was used to analyze performance in the telecommunications field.
+The results of the analysis for table creation with data sizes ranging from
+10 thousand to 10 billion rows and 100 to 300 columns are summarized below.
+
+The following table describes some of the columns from the table used.
+
+
+**Table Column Description**
+
+| Column Name | Data Type     | Cardinality | Attribution |
+|-------------|---------------|-------------|-------------|
+| msisdn      | String        | 30 million  | Dimension   |
+| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
+| HOST        | String        | 1 million   | Dimension   |
+| Dime_1      | String        | 1 Thousand  | Dimension   |
+| counter_1   | Numeric(20,0) | NA          | Measure     |
+| ...         | ...           | NA          | Measure     |
+| counter_100 | Numeric(20,0) | NA          | Measure     |
+
+More than 50 test cases were run with CarbonData. Based on these, we have the following suggestions to enhance query performance :
+
+
+
+* **Put the frequently-used column filter in the beginning**
+
+  For example, if the MSISDN filter is used in most of the queries, then MSISDN should be put in the first column.
+The create table command can be modified as suggested below :
+
+```
+  create table carbondata_table(
+  msisdn String,
+  ...
+  )STORED BY 'org.apache.carbondata.format'
+  TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..',
+  'DICTIONARY_INCLUDE'='...');
+
+  Example:
+  create table carbondata_table(
+    msisdn String,
+    BEGIN_TIME bigint
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN',
+    'DICTIONARY_INCLUDE'='BEGIN_TIME');
+
+```
+
+  Now the query with MSISDN in the filter will be more efficient.
+
+
+* **Put the frequently-used columns in the order of low to high cardinality**
+
+  If the table in the specified query has multiple columns which are 
frequently used to filter the results, it is suggested to put
+  the columns in the order of cardinality low to high. This ordering of 
frequently used columns improves the compression ratio and
+  enhances the performance of queries with filter on these columns.
+
+  For example if MSISDN, HOST and Dime_1 are frequently-used columns, then the 
column order of table is suggested as
+  Dime_1>HOST>MSISDN as Dime_1 has the lowest cardinality.
+  The create table command can be modified as suggested below :
+
+```
+  create table carbondata_table(
+  Dime_1 String,
+  HOST String,
+  MSISDN String,
+  ...
+  )STORED BY 'org.apache.carbondata.format'
+  TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..',
+  'DICTIONARY_INCLUDE'='Dime_1..');
+
+  Example:
+  create table carbondata_table(
+    Dime_1 String,
+    HOST String,
+    MSISDN String
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST',
+    'DICTIONARY_INCLUDE'='Dime_1');
+
+
+```
+
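To decide the column order, it helps to estimate the cardinality of each candidate column first. The following is a rough sketch, assuming a simple CSV sample with a header row and no quoted commas; the function name `column_cardinalities` is hypothetical, and counts taken from a sample only approximate the cardinality of the full data set.

```shell
# Print "<distinct values> <column name>" for every column of a CSV file,
# sorted from low to high cardinality (hypothetical helper).
# $1: CSV file with a header row and no quoted or embedded commas.
column_cardinalities() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
    { for (i = 1; i <= n; i++) if (!seen[i, $i]++) count[i]++ }
    END { for (i = 1; i <= n; i++) print count[i] + 0, name[i] }
  ' "$1" | sort -n
}
```

The output order is a starting point for the column order in the create table command: columns near the top are candidates for the leading dimension positions, and columns near the bottom are candidates for `DICTIONARY_EXCLUDE`.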
+
+* **Put the Dimension type columns in order of low to high cardinality**
+
+  If the columns used to filter are not frequently used, then it is suggested 
to order all the columns of dimension type in order of low to high cardinality.
+The create table command can be modified as below :
+
+```
+  create table carbondata_table(
+    Dime_1 String,
+    BEGIN_TIME bigint,
+    END_TIME bigint,
+    HOST String,
+    MSISDN String
+    ...
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
+    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME...');
+```
+
+
+* **For measure type columns that do not require high accuracy, replace Numeric(20,0) data type with Double data type**
+
+  For columns of measure type, not requiring high accuracy, it is suggested to 
replace Numeric data type with Double to enhance
+query performance. The create table command can be modified as below :
+
+```
+  create table carbondata_table(
+    Dime_1 String,
+    BEGIN_TIME bigint,
+    END_TIME bigint,
+    HOST String,
+    MSISDN String,
+    counter_1 double,
+    counter_2 double,
+    ...
+    counter_100 double
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
+    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME...');
+```
+  In a test case, performance analysis showed a reduction in query execution time from 15 to 3 seconds, improving performance by about 5 times.
+
+
+* **Columns with incremental values should be placed at the end of dimensions**
+
+  Consider a scenario where data is loaded each day and begin_time increases with each load. In this case, it is
+  suggested to put begin_time at the end of the dimensions.
+
+  Incremental values make efficient use of the min/max index. The create table command can be modified as below:
+
+```
+  create table carbondata_table(
+    Dime_1 String,
+    HOST String,
+    MSISDN String,
+    counter_1 double,
+    counter_2 double,
+    BEGIN_TIME bigint,
+    END_TIME bigint,
+    ...
+    counter_100 double
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST...',
+    'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME...');
+```
+
+
+* **Avoid adding high cardinality columns to dictionary**
+
+  If the system has a low memory configuration, it is suggested to exclude high cardinality columns from the dictionary to
+  enhance load performance. Creating a dictionary for high cardinality columns at load time degrades load performance due to
+  excessive memory usage.
+
+  By default, CarbonData determines the cardinality at the first data load and allows dictionary creation only if the cardinality is less than
+  1 million.
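+
+  As a hedged illustration, the high cardinality columns can also be excluded explicitly at table creation instead of relying on the cardinality check at the first load. The sketch below reuses the example table from the earlier items and assumes that MSISDN and HOST are the high cardinality columns:
+
+```
+  -- assumption: MSISDN and HOST are high cardinality, so no dictionary is built for them
+  create table carbondata_table(
+    Dime_1 String,
+    HOST String,
+    MSISDN String
+    )STORED BY 'org.apache.carbondata.format'
+    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST');
+```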
+
+
+
+## Configuration for Optimizing Data Loading performance for Massive Data
+
+
+ CarbonData supports loading large volumes of data. During loading, sorting the data consumes a lot of memory and disk I/O,
+ which can sometimes result in an "Out Of Memory" exception.
+ If memory is limited, you may prefer a slower data load to a failed one.
+ You can tune the following properties in the carbon.properties file to get better performance.
+
+| Parameter | Default Value | Description/Tuning |
+|-----------|-------------|--------|
+|carbon.number.of.cores.while.loading|Default: 2. The value should be >= 2.|Specifies the number of cores used for data processing during data loading in CarbonData.|
+|carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold for writing a local file in the sort step when loading data.|
+|carbon.sort.file.write.buffer.size|Default: 50000.|DataOutputStream buffer.|
+|carbon.number.of.cores.block.sort|Default: 7|If you have plenty of memory and CPUs, you can increase this value.|
+|carbon.merge.sort.reader.thread|Default: 3|Specifies the number of cores used for temp file merging during data loading in CarbonData.|
+|carbon.merge.sort.prefetch|Default: true|You may want to set this value to false if you do not have enough memory.|
+
+
+For example, suppose 10 million records are to be loaded into a CarbonData table on a machine with only 16 cores and 64 GB of memory.
+With the default configuration, the load always fails in the sort step. Modify carbon.properties as suggested below:
+
+
+```
+carbon.number.of.cores.block.sort=1
+carbon.merge.sort.reader.thread=1
+carbon.sort.size=5000
+carbon.sort.file.write.buffer.size=5000
+carbon.merge.sort.prefetch=false
+```
+
+## Configurations for Optimizing CarbonData Performance
+
+Recently we ran performance POCs on CarbonData for the finance and telecommunication fields. They involved detailed queries and aggregation
+scenarios. From these POCs, some of the configurations that impact performance were identified and are tabulated below:
+
+| Parameter | Location | Used For  | Description | Tuning |
+|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During data loading, local temp files are used to sort the data. This number specifies the minimum number of intermediate files after which the merge sort is initiated. | Increasing this parameter improves load performance. For example, when we increased the value from 20 to 100, data load performance rose from 35 MB/s to more than 50 MB/s. Higher values of this parameter consume more memory during the load. |
+| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If more CPUs are available, you can increase this value, which will improve performance. For example, if we increase the value from 2 to 4, the CSV reading performance can roughly double. |
+| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and the number of compacted segments to be merged in stage 2. | Each CarbonData load creates one segment. If every load is small, many small files are generated over time, which hurts query performance. Configuring this parameter merges the small segments into one big segment, which sorts the data and improves performance. For example, in one telecommunication scenario, performance improved about 2 times after minor compaction. |
+| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of tasks started for a Spark shuffle. | The value can be 1 to 2 times the number of executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
+| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provided 4 CPU cores and 15 GB of memory for each executor, which gave good performance. For these two values, more is not always better; they need to be configured properly when resources are limited. For example, in the bank scenario each node has enough CPU (32 cores) but relatively little memory (64 GB), so we cannot give each executor more CPU with less memory. With 4 cores and 12 GB for each executor, GC sometimes occurs during the query and degrades query time severely, from 3 seconds to more than 15 seconds. In this scenario, increase the memory or decrease the CPU cores. |
+| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records returned from the block scan. | In limit scenarios this parameter is very important. For example, if your query limit is 1000 but this value is set to 3000, the scan returns 3000 records while Spark takes only 1000 rows, so the remaining 2000 are wasted. In one finance test case, after we set it to 100, performance in the limit 1000 scenario increased about 2 times compared with setting this value to 12000. |
+| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use YARN local directories for multi-table load disk load balance | If this is set to true, CarbonData will use YARN local directories for multi-table load disk load balance, which will improve data load performance. |
+| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local directories during data load for disk load balance, which will improve data load performance. Enable this property when you encounter a disk hotspot problem during data loading. |
+
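+As a sketch only, the carbon.properties values quoted in the table above could be written as follows. The numbers come from the POC scenarios described in the table; they are starting points to test against your own workload, not general recommendations:
+
+```
+# values taken from the examples in the table above; tune for your own cluster
+carbon.sort.intermediate.files.limit=100
+carbon.number.of.cores.while.loading=4
+carbon.detail.batch.size=100
+```
+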
+Note: If your CarbonData instance is used only for queries, you may specify the property 'spark.speculation=true' in the conf directory of Spark.
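+
+A minimal sketch of spark/conf/spark-defaults.conf for a query-only instance, assuming the values from the scenarios in the table above (4 cores and 15 GB per executor, 32 shuffle partitions) suit your cluster:
+
+```
+# assumption: values copied from the POC scenarios above, not universal defaults
+spark.sql.shuffle.partitions   32
+spark.executor.cores           4
+spark.executor.memory          15g
+spark.speculation              true
+```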
