Author: greid Date: Sat Jan 30 18:09:43 2016 New Revision: 1727741 URL: http://svn.apache.org/viewvc?rev=1727741&view=rev Log: Add information about permissions issues when running the bulk loader.
Modified: phoenix/site/publish/bulk_dataload.html phoenix/site/source/src/site/markdown/bulk_dataload.md Modified: phoenix/site/publish/bulk_dataload.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/bulk_dataload.html?rev=1727741&r1=1727740&r2=1727741&view=diff ============================================================================== --- phoenix/site/publish/bulk_dataload.html (original) +++ phoenix/site/publish/bulk_dataload.html Sat Jan 30 18:09:43 2016 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 2016-01-23 + Generated by Apache Maven Doxia at 2016-01-30 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> @@ -217,17 +217,17 @@ <p>For higher-throughput loading distributed over the cluster, the MapReduce loader can be used. This loader first converts all data into HFiles, and then provides the created HFiles to HBase after the HFile creation is complete. </p> <p>The MapReduce loader is launched using the <tt>hadoop</tt> command with the Phoenix client jar, as follows:</p> <div class="source"> - <pre>hadoop jar phoenix-3.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>When using Phoenix 4.0 and above, there is a known HBase issue (&#8220;Notice to Mapreduce users of HBase 0.96.1 and above&#8221;, <a class="externalLink" href="https://hbase.apache.org/book.html">https://hbase.apache.org/book.html</a>), so you should use the following command:</p> <div class="source"> - <pre>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar 
phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>OR</p> <div class="source"> - <pre>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>The input file must be present on HDFS (not the local filesystem where the command is being run). </p> @@ -285,31 +285,46 @@ <div class="section"> <h3 id="Notes_on_the_MapReduce_importer">Notes on the MapReduce importer</h3> <p>The current MR-based bulk loader will run one MR job to load your data table and one MR per index table to populate your indexes. Use the -it option to only load one of your index tables.</p> - </div> -</div> -<div class="section"> - <h2 id="Loading_array_data">Loading array data</h2> - <p>Both the PSQL loader and MapReduce loader support loading array values with the <tt>-a</tt> flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:</p> - <div class="source"> - <pre>1,2:3:4 + <div class="section"> + <h4 id="Permissions_issues_when_uploading_HFiles">Permissions issues when uploading HFiles</h4> + <p>There can be permissions issues in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. 
If this is not the case, the uploading of HFiles will hang for a very long time before finally failing.</p> + <p>There are two main workarounds for this issue: running the bulk load process as the <tt>hbase</tt> user, or creating the output files as readable by all users.</p> + <p>The first option can be done by simply starting the hadoop command with <tt>sudo -u hbase</tt>, i.e. </p> + <div class="source"> + <pre>sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv +</pre> + </div> + <p>Creating the output files as readable by all can be done by setting the <tt>fs.permissions.umask-mode</tt> configuration setting to &#8220;000&#8221;. This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows:</p> + <div class="source"> + <pre>hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv +</pre> + </div> + </div> + <div class="section"> + <h4 id="Loading_array_data">Loading array data</h4> + <p>Both the PSQL loader and MapReduce loader support loading array values with the <tt>-a</tt> flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:</p> + <div class="source"> + <pre>1,2:3:4 2,3:4:5 </pre> - </div> - <p>To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter <tt>-a ':'</tt>.</p> -</div> -<div class="section"> - <h2 id="A_note_on_separator_characters">A note on separator characters</h2> - <p>The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can tricky to supply on the command line. 
A common mistake is trying to supply a tab as the separator by typing the following</p> - <div class="source"> - <pre>-d '\t' + </div> + <p>To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter <tt>-a ':'</tt>.</p> + </div> + <div class="section"> + <h4 id="A_note_on_separator_characters">A note on separator characters</h4> + <p>The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following:</p> + <div class="source"> + <pre>-d '\t' </pre> + </div> + <p>This will not work, as the shell will supply this value as two characters (a backslash and a &#8220;t&#8221;) to Phoenix.</p> + <p>Two ways in which you can supply a special character such as a tab on the command line are as follows:</p> + <ol style="list-style-type: decimal"> + <li> <p>By preceding the string representation of a tab with a dollar sign:</p> <p>-d $'\t'</p></li> + <li> <p>By entering the separator as Ctrl+v, and then pressing the tab key:</p> <p>-d '^v<tab>'</p></li> + </ol> + </div> </div> - <p>This will not work, as the shell will supply this value as two characters (a backslash and a &#8220;t&#8221;) to Phoenix.</p> - <p>Two ways in which you can supply a special character such as a tab on the command line are as follows:</p> - <ol style="list-style-type: decimal"> - <li> <p>By preceding the string representation of a tab with a dollar sign:</p> <p>-d $'\t'</p></li> - <li> <p>By entering the separator as Ctrl+v, and then pressing the tab key:</p> <p>-d '^v<tab>'</p></li> - </ol> </div> </div> </div> Modified: phoenix/site/source/src/site/markdown/bulk_dataload.md URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/bulk_dataload.md?rev=1727741&r1=1727740&r2=1727741&view=diff 
============================================================================== --- phoenix/site/source/src/site/markdown/bulk_dataload.md (original) +++ phoenix/site/source/src/site/markdown/bulk_dataload.md Sat Jan 30 18:09:43 2016 @@ -45,15 +45,15 @@ For higher-throughput loading distribute The MapReduce loader is launched using the `hadoop` command with the Phoenix client jar, as follows: - hadoop jar phoenix-3.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv When using Phoenix 4.0 and above, there is a known HBase issue ("Notice to Mapreduce users of HBase 0.96.1 and above", https://hbase.apache.org/book.html), so you should use the following command: - HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv OR - HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv The input file must be present on HDFS (not the local filesystem where the command is being run). @@ -76,7 +76,21 @@ The following parameters can be used wit ### Notes on the MapReduce importer The current MR-based bulk loader will run one MR job to load your data table and one MR per index table to populate your indexes. 
Use the -it option to only load one of your index tables. -## Loading array data +#### Permissions issues when uploading HFiles + +There can be issues due to file permissions in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. If this is not the case, the uploading of HFiles will hang for a very long time before finally failing. + +There are two main workarounds for this issue: running the bulk load process as the `hbase` user, or creating the output files as readable by all users. + +The first option can be done by simply starting the hadoop command with `sudo -u hbase`, i.e. + + sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + +Creating the output files as readable by all can be done by setting the `fs.permissions.umask-mode` configuration setting to "000". This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows: + + hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv + +#### Loading array data Both the PSQL loader and MapReduce loader support loading array values with the `-a` flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers: @@ -85,7 +99,7 @@ Both the PSQL loade To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter `-a ':'`. 
-## A note on separator characters +#### A note on separator characters The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following
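As a sanity check of the separator advice added to both files above, the following bash sketch shows why `-d '\t'` fails while `-d $'\t'` works. The commented bulk-load invocation at the end is hypothetical and just combines flags from the examples in this commit; the jar name, table, and input path are placeholders.

```shell
#!/usr/bin/env bash
# '\t' inside single quotes reaches the program as two characters:
# a backslash followed by the letter "t".
plain='\t'

# $'\t' uses bash ANSI-C quoting and expands to a single real tab (0x09).
tab=$'\t'

echo "plain separator length: ${#plain}"   # 2
echo "tab separator length:   ${#tab}"     # 1

# Hypothetical invocation with a tab separator (placeholders as in the docs):
# sudo -u hbase hadoop jar phoenix-<version>-client.jar \
#     org.apache.phoenix.mapreduce.CsvBulkLoadTool \
#     -Dfs.permissions.umask-mode=000 \
#     -d $'\t' --table EXAMPLE --input /data/example.tsv
```

The same distinction explains the Ctrl+v workaround: pressing Ctrl+v then Tab inserts a literal tab character into the command line, which is exactly what `$'\t'` produces.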