Author: greid Date: Sat Jan 30 18:09:43 2016 New Revision: 1727741 URL: http://svn.apache.org/viewvc?rev=1727741&view=rev Log: Add information about permissions issues when running the bulk loader.
Modified: phoenix/site/publish/bulk_dataload.html phoenix/site/source/src/site/markdown/bulk_dataload.md Modified: phoenix/site/publish/bulk_dataload.html URL: http://svn.apache.org/viewvc/phoenix/site/publish/bulk_dataload.html?rev=1727741&r1=1727740&r2=1727741&view=diff ============================================================================== --- phoenix/site/publish/bulk_dataload.html (original) +++ phoenix/site/publish/bulk_dataload.html Sat Jan 30 18:09:43 2016 @@ -1,7 +1,7 @@ <!DOCTYPE html> <!-- - Generated by Apache Maven Doxia at 2016-01-23 + Generated by Apache Maven Doxia at 2016-01-30 Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin) --> <html xml:lang="en" lang="en"> @@ -217,17 +217,17 @@ <p>For higher-throughput loading distributed over the cluster, the MapReduce loader can be used. This loader first converts all data into HFiles, and then provides the created HFiles to HBase after the HFile creation is complete. </p> <p>The MapReduce loader is launched using the <tt>hadoop</tt> command with the Phoenix client jar, as follows:</p> <div class="source"> - <pre>hadoop jar phoenix-3.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>When using Phoenix 4.0 and above, there is a known HBase issue (&#8220;Notice to Mapreduce users of HBase 0.96.1 and above&#8221;, <a class="externalLink" href="https://hbase.apache.org/book.html">https://hbase.apache.org/book.html</a>), so you should use the following command:</p> <div class="source"> - <pre>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar 
phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>OR</p> <div class="source"> - <pre>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + <pre>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv </pre> </div> <p>The input file must be present on HDFS (not the local filesystem where the command is being run). </p> @@ -285,31 +285,46 @@ <div class="section"> <h3 id="Notes_on_the_MapReduce_importer">Notes on the MapReduce importer</h3> <p>The current MR-based bulk loader will run one MR job to load your data table and one MR per index table to populate your indexes. Use the -it option to only load one of your index tables.</p> - </div> -</div> -<div class="section"> - <h2 id="Loading_array_data">Loading array data</h2> - <p>Both the PSQL loader and MapReduce loader support loading array values with the <tt>-a</tt> flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:</p> - <div class="source"> - <pre>1,2:3:4 + <div class="section"> + <h4 id="Permissions_issues_when_uploading_HFiles">Permissions issues when uploading HFiles</h4> + <p>There can be permissions issues in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. 
If this is not the case, the uploading of HFiles will hang for a very long time before finally failing.</p> + <p>There are two main workarounds for this issue: running the bulk load process as the <tt>hbase</tt> user, or creating the output files as readable by all users.</p> + <p>The first option can be done by simply starting the hadoop command with <tt>sudo -u hbase</tt>, i.e. </p> + <div class="source"> + <pre>sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv +</pre> + </div> + <p>Creating the output files as readable by all can be done by setting the <tt>fs.permissions.umask-mode</tt> configuration setting to &#8220;000&#8221;. This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows:</p> + <div class="source"> + <pre>hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv +</pre> + </div> + </div> + <div class="section"> + <h4 id="Loading_array_data">Loading array data</h4> + <p>Both the PSQL loader and MapReduce loader support loading array values with the <tt>-a</tt> flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers:</p> + <div class="source"> + <pre>1,2:3:4 2,3:4:5 </pre> - </div> - <p>To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter <tt>-a ':'</tt>.</p> -</div> -<div class="section"> - <h2 id="A_note_on_separator_characters">A note on separator characters</h2> - <p>The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can tricky to supply on the command line. 
A common mistake is trying to supply a tab as the separator by typing the following</p> - <div class="source"> - <pre>-d '\t' + </div> + <p>To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter <tt>-a ':'</tt>.</p> + </div> + <div class="section"> + <h4 id="A_note_on_separator_characters">A note on separator characters</h4> + <p>The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following:</p> + <div class="source"> + <pre>-d '\t' </pre> + </div> + <p>This will not work, as the shell will supply this value as two characters (a backslash and a &#8220;t&#8221;) to Phoenix.</p> + <p>Two ways in which you can supply a special character such as a tab on the command line are as follows:</p> + <ol style="list-style-type: decimal"> + <li> <p>By preceding the string representation of a tab with a dollar sign:</p> <p>-d $'\t'</p></li> + <li> <p>By entering the separator as Ctrl+v, and then pressing the tab key:</p> <p>-d '^v<tab>'</p></li> + </ol> + </div> </div> - <p>This will not work, as the shell will supply this value as two characters (a backslash and a &#8220;t&#8221;) to Phoenix.</p> - <p>Two ways in which you can supply a special character such as a tab on the command line are as follows:</p> - <ol style="list-style-type: decimal"> - <li> <p>By preceding the string representation of a tab with a dollar sign:</p> <p>-d $'\t'</p></li> - <li> <p>By entering the separator as Ctrl+v, and then pressing the tab key:</p> <p>-d '^v<tab>'</p></li> - </ol> </div> </div> </div> Modified: phoenix/site/source/src/site/markdown/bulk_dataload.md URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/bulk_dataload.md?rev=1727741&r1=1727740&r2=1727741&view=diff 
============================================================================== --- phoenix/site/source/src/site/markdown/bulk_dataload.md (original) +++ phoenix/site/source/src/site/markdown/bulk_dataload.md Sat Jan 30 18:09:43 2016 @@ -45,15 +45,15 @@ For higher-throughput loading distribute The MapReduce loader is launched using the `hadoop` command with the Phoenix client jar, as follows: - hadoop jar phoenix-3.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv When using Phoenix 4.0 and above, there is a known HBase issue ("Notice to Mapreduce users of HBase 0.96.1 and above", https://hbase.apache.org/book.html), so you should use the following command: - HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv OR - HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0-incubating-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv The input file must be present on HDFS (not the local filesystem where the command is being run). @@ -76,7 +76,21 @@ The following parameters can be used wit ### Notes on the MapReduce importer The current MR-based bulk loader will run one MR job to load your data table and one MR per index table to populate your indexes. 
Use the -it option to only load one of your index tables. -## Loading array data +#### Permissions issues when uploading HFiles + +There can be issues due to file permissions in the final stage of a bulk load, when the created HFiles are handed over to HBase. HBase needs to be able to move the created HFiles, which means that it needs to have write access to the directories where the files have been written. If this is not the case, the uploading of HFiles will hang for a very long time before finally failing. + +There are two main workarounds for this issue: running the bulk load process as the `hbase` user, or creating the output files as readable by all users. + +The first option can be done by simply starting the hadoop command with `sudo -u hbase`, i.e. + + sudo -u hbase hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv + +Creating the output files as readable by all can be done by setting the `fs.permissions.umask-mode` configuration setting to "000". This can be set in the hadoop configuration on the machine being used to submit the job, or can be set for the job only during submission on the command line as follows: + + hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table EXAMPLE --input /data/example.csv + +#### Loading array data Both the PSQL loader and MapReduce loader support loading array values with the `-a` flag. Arrays in a CSV file are represented by a field that uses a different delimiter than the main CSV delimiter. For example, the following file would represent an id field and an array of integers: @@ -85,7 +99,7 @@ Both the PSQL loade To load this file, the default delimiter (comma) would be used, and the array delimiter (colon) would be supplied with the parameter `-a ':'`. 
-## A note on separator characters +#### A note on separator characters The default separator character for both loaders is a comma (,). A common separator for input files is the tab character, which can be tricky to supply on the command line. A common mistake is trying to supply a tab as the separator by typing the following
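As a sanity check of the separator advice added to both files above, the following bash sketch shows why `-d '\t'` fails while `-d $'\t'` works. The commented bulk-load invocation at the end is hypothetical and just combines flags from the examples in this commit; the jar name, table, and input path are placeholders.

```shell
#!/usr/bin/env bash
# '\t' inside single quotes reaches the program as two characters:
# a backslash followed by the letter "t".
plain='\t'

# $'\t' uses bash ANSI-C quoting and expands to a single real tab (0x09).
tab=$'\t'

echo "plain separator length: ${#plain}"   # 2
echo "tab separator length:   ${#tab}"     # 1

# Hypothetical invocation with a tab separator (placeholders as in the docs):
# sudo -u hbase hadoop jar phoenix-<version>-client.jar \
#     org.apache.phoenix.mapreduce.CsvBulkLoadTool \
#     -Dfs.permissions.umask-mode=000 \
#     -d $'\t' --table EXAMPLE --input /data/example.tsv
```

The same distinction explains the Ctrl+v workaround: pressing Ctrl+v then Tab inserts a literal tab character into the command line, which is exactly what `$'\t'` produces.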