This is an automated email from the ASF dual-hosted git repository.

mwalch pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-examples.git


The following commit(s) were added to refs/heads/master by this push:
     new 26efc49  More updates to MapReduce (#32)
26efc49 is described below

commit 26efc4950978d1575d92f04d0c38042334f17ee0
Author: Mike Walch <mwa...@apache.org>
AuthorDate: Tue Jan 15 10:08:51 2019 -0500

    More updates to MapReduce (#32)
    
    * WordCount now supports using HDFS path for client props
    * Updated docs and fixed arguments to MapReduce job
---
 README.md                                          |  16 +--
 docs/compactionStrategy.md                         |   8 +-
 docs/dirlist.md                                    |  18 ++--
 docs/isolation.md                                  |   4 +-
 docs/mapred.md                                     | 114 ---------------------
 docs/sample.md                                     |   8 +-
 docs/uniquecols.md                                 |  23 +++++
 docs/wordcount.md                                  |  72 +++++++++++++
 .../examples/mapreduce/TokenFileWordCount.java     | 107 -------------------
 .../accumulo/examples/mapreduce/WordCount.java     |  12 ++-
 .../accumulo/examples/mapreduce/MapReduceIT.java   |   2 +-
 11 files changed, 134 insertions(+), 250 deletions(-)

diff --git a/README.md b/README.md
index 3a8ff8f..77c91bc 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ Follow the steps below to run the Accumulo examples:
 
 1. Clone this repository
 
-      git clone https://github.com/apache/accumulo-examples.git
+        git clone https://github.com/apache/accumulo-examples.git
 
 2. Follow [Accumulo's quickstart][quickstart] to install and run an Accumulo instance.
   Accumulo has an [accumulo-client.properties] in `conf/` that must be configured as
@@ -34,13 +34,13 @@ Follow the steps below to run the Accumulo examples:
   are set in your shell, you may be able skip this step. Make sure `ACCUMULO_CLIENT_PROPS` is
    set to the location of your [accumulo-client.properties].
 
-      cp conf/env.sh.example conf/env.sh
-      vim conf/env.sh
+        cp conf/env.sh.example conf/env.sh
+        vim conf/env.sh
 
 3. Build the examples repo and copy the examples jar to Accumulo's `lib/ext` directory:
 
-      ./bin/build
-      cp target/accumulo-examples.jar /path/to/accumulo/lib/ext/
+        ./bin/build
+        cp target/accumulo-examples.jar /path/to/accumulo/lib/ext/
 
 4. Each Accumulo example has its own documentation and instructions for running the example which
    are linked to below.
@@ -76,7 +76,6 @@ Each example below highlights a feature of Apache Accumulo.
 | [filter] | Using the AgeOffFilter to remove records more than 30 seconds old. |
 | [helloworld] | Inserting records both inside map/reduce jobs and outside. And reading records between two rows. |
 | [isolation] | Using the isolated scanner to ensure partial changes are not seen. |
-| [mapred] | Using MapReduce to read from and write to Accumulo tables. |
 | [maxmutation] | Limiting mutation size to avoid running out of memory. |
 | [regex] | Using MapReduce and Accumulo to find data using regular expressions. |
 | [reservations] | Using conditional mutations to implement simple reservation system. |
@@ -86,7 +85,9 @@ Each example below highlights a feature of Apache Accumulo.
 | [shard] | Using the intersecting iterator with a term index partitioned by document. |
 | [tabletofile] | Using MapReduce to read a table and write one of its columns to a file in HDFS. |
 | [terasort] | Generating random data and sorting it using Accumulo. |
+| [uniquecols] | Using MapReduce to count unique columns in Accumulo. |
 | [visibility] | Using visibilities (or combinations of authorizations). Also shows user permissions. |
+| [wordcount] | Using MapReduce and Accumulo to do a word count on text files. |
 
 ## Release Testing
 
@@ -112,7 +113,6 @@ This repository can be used to test Accumulo release candidates.  See
 [filter]: docs/filter.md
 [helloworld]: docs/helloworld.md
 [isolation]: docs/isolation.md
-[mapred]: docs/mapred.md
 [maxmutation]: docs/maxmutation.md
 [regex]: docs/regex.md
 [reservations]: docs/reservations.md
@@ -122,6 +122,8 @@ This repository can be used to test Accumulo release candidates.  See
 [shard]: docs/shard.md
 [tabletofile]: docs/tabletofile.md
 [terasort]: docs/terasort.md
+[uniquecols]: docs/uniquecols.md
 [visibility]: docs/visibility.md
+[wordcount]: docs/wordcount.md
 [ti]: https://travis-ci.org/apache/accumulo-examples.svg?branch=master
 [tl]: https://travis-ci.org/apache/accumulo-examples
diff --git a/docs/compactionStrategy.md b/docs/compactionStrategy.md
index a7c96d5..6b5bebc 100644
--- a/docs/compactionStrategy.md
+++ b/docs/compactionStrategy.md
@@ -44,13 +44,13 @@ The commands below will configure the TwoTierCompactionStrategy to use gz compre
 
 Generate some data and files in order to test the strategy:
 
-    $ ./bin/runex client.SequentialBatchWriter -c ./examples.conf -t test1 --start 0 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
+    $ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
     $ accumulo shell -u root -p secret -e "flush -t test1"
-    $ ./bin/runex client.SequentialBatchWriter -c ./examples.conf -t test1 --start 0 --num 11000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
+    $ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 11000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
     $ accumulo shell -u root -p secret -e "flush -t test1"
-    $ ./bin/runex client.SequentialBatchWriter -c ./examples.conf -t test1 --start 0 --num 12000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
+    $ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 12000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
     $ accumulo shell -u root -p secret -e "flush -t test1"
-    $ ./bin/runex client.SequentialBatchWriter -c ./examples.conf -t test1 --start 0 --num 13000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
+    $ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 13000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
     $ accumulo shell -u root -p secret -e "flush -t test1"
 
 View the tserver log in <accumulo_home>/logs for the compaction and find the name of the <rfile> that was compacted for your table. Print info about this file using the PrintInfo tool:
diff --git a/docs/dirlist.md b/docs/dirlist.md
index 3602d40..2b653cf 100644
--- a/docs/dirlist.md
+++ b/docs/dirlist.md
@@ -31,7 +31,7 @@ This example shows how to use Accumulo to store a file system history. It has th
 
 To begin, ingest some data with Ingest.java.
 
-    $ ./bin/runex dirlist.Ingest -c ./examples.conf --vis exampleVis --chunkSize 100000 /local/username/workspace
+    $ ./bin/runex dirlist.Ingest --vis exampleVis --chunkSize 100000 /local/username/workspace
 
 This may take some time if there are large files in the /local/username/workspace directory. If you use 0 instead of 100000 on the command line, the ingest will run much faster, but it will not put any file data into Accumulo (the dataTable will be empty).
 Note that running this example will create tables dirTable, indexTable, and dataTable in Accumulo that you should delete when you have completed the example.
@@ -43,26 +43,26 @@ To browse the data ingested, use Viewer.java. Be sure to give the "username" use
 
 then run the Viewer:
 
-    $ ./bin/runex dirlist.Viewer -c ./examples.conf -t dirTable --dataTable dataTable --auths exampleVis --path /local/username/workspace
+    $ ./bin/runex dirlist.Viewer -t dirTable --dataTable dataTable --auths exampleVis --path /local/username/workspace
 
 To list the contents of specific directories, use QueryUtil.java.
 
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t dirTable --auths exampleVis --path /local/username
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t dirTable --auths exampleVis --path /local/username/workspace
+    $ ./bin/runex dirlist.QueryUtil -t dirTable --auths exampleVis --path /local/username
+    $ ./bin/runex dirlist.QueryUtil -t dirTable --auths exampleVis --path /local/username/workspace
 
 To perform searches on file or directory names, also use QueryUtil.java. Search terms must contain no more than one wild card and cannot contain "/".
 *Note* these queries run on the _indexTable_ table instead of the dirTable table.
 
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t indexTable --auths exampleVis --path filename --search
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t indexTable --auths exampleVis --path 'filename*' --search
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t indexTable --auths exampleVis --path '*jar' --search
-    $ ./bin/runex dirlist.QueryUtil -c ./examples.conf -t indexTable --auths exampleVis --path 'filename*jar' --search
+    $ ./bin/runex dirlist.QueryUtil -t indexTable --auths exampleVis --path filename --search
+    $ ./bin/runex dirlist.QueryUtil -t indexTable --auths exampleVis --path 'filename*' --search
+    $ ./bin/runex dirlist.QueryUtil -t indexTable --auths exampleVis --path '*jar' --search
+    $ ./bin/runex dirlist.QueryUtil -t indexTable --auths exampleVis --path 'filename*jar' --search
 
 To count the number of direct children (directories and files) and descendants (children and children's descendants, directories and files), run the FileCount over the dirTable table.
 The results are written back to the same table. FileCount reads from and writes to Accumulo. This requires scan authorizations for the read and a visibility for the data written.
 In this example, the authorizations and visibility are set to the same value, exampleVis. See the [visibility example][vis] for more information on visibility and authorizations.
 
-    $ ./bin/runex dirlist.FileCount -c ./examples.conf -t dirTable --auths exampleVis
+    $ ./bin/runex dirlist.FileCount -t dirTable --auths exampleVis
 
 ## Directory Table
 
diff --git a/docs/isolation.md b/docs/isolation.md
index d6dc5ac..a848af9 100644
--- a/docs/isolation.md
+++ b/docs/isolation.md
@@ -30,7 +30,7 @@ reading the row at the same time a mutation is changing the row.
 Below, Interference Test is run without isolation enabled for 5000 iterations
 and it reports problems.
 
-    $ ./bin/runex isolation.InterferenceTest -c ./examples.conf -t isotest --iterations 5000
+    $ ./bin/runex isolation.InterferenceTest -t isotest --iterations 5000
     ERROR Columns in row 053 had multiple values [53, 4553]
     ERROR Columns in row 061 had multiple values [561, 61]
     ERROR Columns in row 070 had multiple values [570, 1070]
@@ -43,7 +43,7 @@ and it reports problems.
 Below, Interference Test is run with isolation enabled for 5000 iterations and
 it reports no problems.
 
-    $ ./bin/runex isolation.InterferenceTest -c ./examples.conf -t isotest --iterations 5000 --isolated
+    $ ./bin/runex isolation.InterferenceTest -t isotest --iterations 5000 --isolated
     finished
 
 
diff --git a/docs/mapred.md b/docs/mapred.md
deleted file mode 100644
index d370792..0000000
--- a/docs/mapred.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!--
-Licensed to the Apache Software Foundation (ASF) under one or more
-contributor license agreements.  See the NOTICE file distributed with
-this work for additional information regarding copyright ownership.
-The ASF licenses this file to You under the Apache License, Version 2.0
-(the "License"); you may not use this file except in compliance with
-the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-# Apache Accumulo MapReduce Example
-
-## WordCount Example
-
-The WordCount example ([WordCount.java]) uses MapReduce and Accumulo to compute
-word counts for a set of documents. This is accomplished using a map-only MapReduce
-job and a Accumulo table with combiners.
-
-
-To run this example, create a directory in HDFS containing text files. You can
-use the Accumulo README for data:
-
-    $ hdfs dfs -mkdir /wc
-    $ hdfs dfs -copyFromLocal /path/to/accumulo/README.md /wc/README.md
-
-Verify that the file was created:
-
-    $ hdfs dfs -ls /wc
-
-After creating the table, run the WordCount MapReduce job with your HDFS input directory:
-
-    $ ./bin/runmr mapreduce.WordCount -i /wc
-
-[WordCount.java] creates an Accumulo table (named with a SummingCombiner iterator
-attached to it. It runs a map-only M/R job that reads the specified HDFS directory containing text files and
-writes word counts to Accumulo table.
-
-After the MapReduce job completes, query the Accumulo table to see word counts.
-
-    $ accumulo shell
-    username@instance> table wordCount
-    username@instance wordCount> scan -b the
-    the count:20080906 []    75
-    their count:20080906 []    2
-    them count:20080906 []    1
-    then count:20080906 []    1
-    ...
-
-Another example to look at is
-org.apache.accumulo.examples.mapreduce.UniqueColumns. This example
-computes the unique set of columns in a table and shows how a map reduce job
-can directly read a tables files from HDFS.
-
-One more example available is
-org.apache.accumulo.examples.mapreduce.TokenFileWordCount.
-The TokenFileWordCount example works exactly the same as the WordCount example
-explained above except that it uses a token file rather than giving the
-password directly to the map-reduce job (this avoids having the password
-displayed in the job's configuration which is world-readable).
-
-To create a token file, use the create-token utility
-
-  $ accumulo create-token
-
-It defaults to creating a PasswordToken, but you can specify the token class
-with -tc (requires the fully qualified class name). Based on the token class,
-it will prompt you for each property required to create the token.
-
-The last value it prompts for is a local filename to save to. If this file
-exists, it will append the new token to the end. Multiple tokens can exist in
-a file, but only the first one for each user will be recognized.
-
-Rather than waiting for the prompts, you can specify some options when calling
-create-token, for example
-
-  $ accumulo create-token -u root -p secret -f root.pw
-
-would create a token file containing a PasswordToken for
-user 'root' with password 'secret' and saved to 'root.pw'
-
-This local file needs to be uploaded to hdfs to be used with the
-map-reduce job. For example, if the file were 'root.pw' in the local directory:
-
-  $ hadoop fs -put root.pw root.pw
-
-This would put 'root.pw' in the user's home directory in hdfs.
-
-Because the basic WordCount example uses Opts to parse its arguments
-(which extends ClientOnRequiredTable), you can use a token file with
-the basic WordCount example by calling the same command as explained above
-except replacing the password with the token file (rather than -p, use -tf).
-
-  $ ./bin/runmr mapreduce.WordCount --input /user/username/wc -t wordCount -u username -tf tokenfile
-
-In the above examples, username was 'root' and tokenfile was 'root.pw'
-
-However, if you don't want to use the Opts class to parse arguments,
-the TokenFileWordCount is an example of using the token file manually.
-
-  $ ./bin/runmr mapreduce.TokenFileWordCount instance zookeepers username tokenfile /user/username/wc wordCount
-
-The results should be the same as the WordCount example except that the
-authentication token was not stored in the configuration. It was instead
-stored in a file that the map-reduce job pulled into the distributed cache.
-(If you ran either of these on the same table right after the
-WordCount example, then the resulting counts should just double.)
-
-[WordCount.java]: ../src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java
diff --git a/docs/sample.md b/docs/sample.md
index 1f6cae5..4c58c3a 100644
--- a/docs/sample.md
+++ b/docs/sample.md
@@ -88,7 +88,7 @@ failure and fixiing the problem with a compaction.
 The example above is replicated in a java program using the Accumulo API.
 Below is the program name and the command to run it.
 
-    ./bin/runex sample.SampleExample -c ./examples.conf
+    ./bin/runex sample.SampleExample
 
 The commands below look under the hood to give some insight into how this
 feature works.  The commands determine what files the sampex table is using.
@@ -166,13 +166,13 @@ shard table based on the column qualifier.
 After enabling sampling, the command below counts the number of documents in
 the sample containing the words `import` and `int`.     
 
-    $ ./bin/runex shard.Query --sample -c ./examples.conf -t shard import int | fgrep '.java' | wc
+    $ ./bin/runex shard.Query --sample -t shard import int | fgrep '.java' | wc
          11      11    1246
 
 The command below counts the total number of documents containing the words
 `import` and `int`.
 
-    $ ./bin/runex shard.Query -c ./examples.conf -t shard import int | fgrep '.java' | wc
+    $ ./bin/runex shard.Query -t shard import int | fgrep '.java' | wc
        1085    1085  118175
 
 The counts 11 out of 1085 total are around what would be expected for a modulus
@@ -188,4 +188,4 @@ To experiment with this iterator, use the following command.  The
 `--sampleCutoff` option below will cause the query to return nothing if based
 on the sample it appears a query would return more than 1000 documents.
 
-    $ ./bin/runex shard.Query --sampleCutoff 1000 -c ./examples.conf -t shard import int | fgrep '.java' | wc
+    $ ./bin/runex shard.Query --sampleCutoff 1000 -t shard import int | fgrep '.java' | wc
diff --git a/docs/uniquecols.md b/docs/uniquecols.md
new file mode 100644
index 0000000..46b6a30
--- /dev/null
+++ b/docs/uniquecols.md
@@ -0,0 +1,23 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Apache Accumulo Unique Columns example
+
+The UniqueColumns example ([UniqueColumns.java]) computes the unique set
+of columns in a table and shows how a MapReduce job can directly read a
+table's files from HDFS.
+
+[UniqueColumns.java]: ../src/main/java/org/apache/accumulo/examples/mapreduce/UniqueColumns.java
diff --git a/docs/wordcount.md b/docs/wordcount.md
new file mode 100644
index 0000000..601f1de
--- /dev/null
+++ b/docs/wordcount.md
@@ -0,0 +1,72 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Apache Accumulo Word Count example
+
+The WordCount example ([WordCount.java]) uses MapReduce and Accumulo to compute
+word counts for a set of documents. This is accomplished using a map-only MapReduce
+job and an Accumulo table with combiners.
+
+To run this example, create a directory in HDFS containing text files. You can
+use the Accumulo README for data:
+
+    $ hdfs dfs -mkdir /wc
+    $ hdfs dfs -copyFromLocal /path/to/accumulo/README.md /wc/README.md
+
+Verify that the file was created:
+
+    $ hdfs dfs -ls /wc
+
+Run the WordCount MapReduce job with your HDFS input directory:
+
+    $ ./bin/runmr mapreduce.WordCount -i /wc
+
+[WordCount.java] creates an Accumulo table (named `wordCount` by default) with a
+SummingCombiner iterator attached to it. It runs a map-only MapReduce job that reads
+the specified HDFS directory containing text files and writes word counts to the
+Accumulo table.
+
+After the MapReduce job completes, query the Accumulo table to see word counts.
+
+    $ accumulo shell
+    username@instance> table wordCount
+    username@instance wordCount> scan -b the
+    the count:20080906 []    75
+    their count:20080906 []    2
+    them count:20080906 []    1
+    then count:20080906 []    1
+    ...
+
+When the WordCount MapReduce job was run above, the client properties were serialized
+into the MapReduce configuration.  This is insecure if the properties contain sensitive
+information like passwords. A more secure option is to store accumulo-client.properties
+in HDFS and run the job with the `-d` option.  This will configure the MapReduce job
+to obtain the client properties from HDFS:
+
+    $ hdfs dfs -copyFromLocal ./conf/accumulo-client.properties /user/myuser/
+    $ ./bin/runmr mapreduce.WordCount -i /wc -t wordCount2 -d /user/myuser/accumulo-client.properties
+
+After the MapReduce job completes, query the `wordCount2` table. The results should
+be the same as before:
+
+    $ accumulo shell
+    username@instance> table wordCount2
+    username@instance wordCount2> scan -b the
+    the count:20080906 []    75
+    their count:20080906 []    2
+    ...
+
+
+[WordCount.java]: ../src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java
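
For context on the map-only job described in wordcount.md above: the mapper emits one Mutation per word with the value "1", and the SummingCombiner attached to the table adds those values together at scan and compaction time. A minimal sketch of that mapper, adapted from the MapClass used by the word count examples in this repository (the class name below is illustrative):

    import java.io.IOException;

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits one Mutation per word; the table's SummingCombiner aggregates the "1" values.
    public class WordCountMapper extends Mapper<LongWritable,Text,Text,Mutation> {
      @Override
      public void map(LongWritable key, Text value, Context output)
          throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          Mutation mutation = new Mutation(new Text(word));
          mutation.put(new Text("count"), new Text("20080906"), new Value("1".getBytes()));
          output.write(null, mutation);
        }
      }
    }
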
diff --git a/src/main/java/org/apache/accumulo/examples/mapreduce/TokenFileWordCount.java b/src/main/java/org/apache/accumulo/examples/mapreduce/TokenFileWordCount.java
deleted file mode 100644
index 010989c..0000000
--- a/src/main/java/org/apache/accumulo/examples/mapreduce/TokenFileWordCount.java
+++ /dev/null
@@ -1,107 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.accumulo.examples.mapreduce;
-
-import java.io.IOException;
-
-import org.apache.accumulo.core.client.ClientConfiguration;
-import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
-import org.apache.accumulo.core.data.Mutation;
-import org.apache.accumulo.core.data.Value;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.conf.Configured;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.mapreduce.Job;
-import org.apache.hadoop.mapreduce.Mapper;
-import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.util.ToolRunner;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-/**
- * A simple map reduce job that inserts word counts into accumulo. See the README for instructions
- * on how to run this. This version does not use the ClientOpts class to parse arguments as an
- * example of using AccumuloInputFormat and AccumuloOutputFormat directly. See README.mapred for
- * more details.
- *
- */
-public class TokenFileWordCount extends Configured implements Tool {
-
-  private static final Logger log = LoggerFactory.getLogger(TokenFileWordCount.class);
-
-  public static class MapClass extends Mapper<LongWritable,Text,Text,Mutation> {
-    @Override
-    public void map(LongWritable key, Text value, Context output) throws IOException {
-      String[] words = value.toString().split("\\s+");
-
-      for (String word : words) {
-
-        Mutation mutation = new Mutation(new Text(word));
-        mutation.put(new Text("count"), new Text("20080906"), new Value("1".getBytes()));
-
-        try {
-          output.write(null, mutation);
-        } catch (InterruptedException e) {
-          log.error("Could not write to Context.", e);
-        }
-      }
-    }
-  }
-
-  @Override
-  public int run(String[] args) throws Exception {
-
-    String instance = args[0];
-    String zookeepers = args[1];
-    String user = args[2];
-    String tokenFile = args[3];
-    String input = args[4];
-    String tableName = args[5];
-
-    Job job = Job.getInstance(getConf());
-    job.setJobName(TokenFileWordCount.class.getName());
-    job.setJarByClass(this.getClass());
-
-    job.setInputFormatClass(TextInputFormat.class);
-    TextInputFormat.setInputPaths(job, input);
-
-    job.setMapperClass(MapClass.class);
-
-    job.setNumReduceTasks(0);
-
-    job.setOutputFormatClass(AccumuloOutputFormat.class);
-    job.setOutputKeyClass(Text.class);
-    job.setOutputValueClass(Mutation.class);
-
-    // AccumuloInputFormat not used here, but it uses the same functions.
-    AccumuloOutputFormat.setZooKeeperInstance(job,
-        ClientConfiguration.loadDefault().withInstance(instance).withZkHosts(zookeepers));
-    AccumuloOutputFormat.setConnectorInfo(job, user, tokenFile);
-    AccumuloOutputFormat.setCreateTables(job, true);
-    AccumuloOutputFormat.setDefaultTableName(job, tableName);
-
-    job.waitForCompletion(true);
-    return job.isSuccessful() ? 0 : 1;
-  }
-
-  public static void main(String[] args) throws Exception {
-    int res = ToolRunner.run(new Configuration(), new TokenFileWordCount(), args);
-    System.exit(res);
-  }
-}
diff --git a/src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java b/src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java
index 5bc4c70..1864fe3 100644
--- a/src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java
+++ b/src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java
@@ -51,6 +51,9 @@ public class WordCount {
     String tableName = "wordCount";
    @Parameter(names = {"-i", "--input"}, required = true, description = "HDFS input directory")
     String inputDirectory;
+    @Parameter(names = {"-d", "--dfsPath"},
+        description = "HDFS Path where accumulo-client.properties exists")
+    String hdfsPath;
   }
 
  public static class MapClass extends Mapper<LongWritable,Text,Text,Mutation> {
@@ -101,8 +104,13 @@ public class WordCount {
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(Mutation.class);
 
-    AccumuloOutputFormat.configure().clientProperties(opts.getClientProperties())
-        .defaultTable(opts.tableName).store(job);
+    if (opts.hdfsPath != null) {
+      AccumuloOutputFormat.configure().clientPropertiesPath(opts.hdfsPath)
+          .defaultTable(opts.tableName).store(job);
+    } else {
+      AccumuloOutputFormat.configure().clientProperties(opts.getClientProperties())
+          .defaultTable(opts.tableName).store(job);
+    }
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }
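
In effect, the hunk above makes the output configuration conditional on whether an HDFS path was supplied via `-d`/`--dfsPath`. Condensed from the diff (a sketch of the new logic, not a drop-in file):

    // Prefer client properties stored in HDFS when a path was given; otherwise
    // serialize the local client properties into the job configuration as before.
    if (opts.hdfsPath != null) {
      AccumuloOutputFormat.configure().clientPropertiesPath(opts.hdfsPath)
          .defaultTable(opts.tableName).store(job);
    } else {
      AccumuloOutputFormat.configure().clientProperties(opts.getClientProperties())
          .defaultTable(opts.tableName).store(job);
    }
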
diff --git a/src/test/java/org/apache/accumulo/examples/mapreduce/MapReduceIT.java b/src/test/java/org/apache/accumulo/examples/mapreduce/MapReduceIT.java
index a5c83c0..d66aa0b 100644
--- a/src/test/java/org/apache/accumulo/examples/mapreduce/MapReduceIT.java
+++ b/src/test/java/org/apache/accumulo/examples/mapreduce/MapReduceIT.java
@@ -63,7 +63,7 @@ public class MapReduceIT extends ConfigurableMacBase {
 
   @Test
   public void test() throws Exception {
-    String confFile = System.getProperty("user.dir") + "/target/examples.conf";
+    String confFile = System.getProperty("user.dir") + "/target/accumulo-client.properties";
     String instance = getClientInfo().getInstanceName();
     String keepers = getClientInfo().getZooKeepers();
    ExamplesIT.writeClientPropsFile(confFile, instance, keepers, "root", ROOT_PASSWORD);
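
The helper call above writes a client properties file for the test's mini cluster. A hypothetical sketch of what such a helper might write, using the standard Accumulo client property keys (the real ExamplesIT.writeClientPropsFile lives in this repo and may differ):

    import java.io.Writer;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    // Hypothetical stand-in for ExamplesIT.writeClientPropsFile; the keys below are
    // the standard Accumulo client property names, but the real helper may differ.
    public class ClientPropsWriter {
      public static void writeClientProps(String path, String instance, String keepers,
          String user, String password) throws Exception {
        Properties props = new Properties();
        props.setProperty("instance.name", instance);
        props.setProperty("instance.zookeepers", keepers);
        props.setProperty("auth.type", "password");
        props.setProperty("auth.principal", user);
        props.setProperty("auth.token", password);
        try (Writer w = Files.newBufferedWriter(Paths.get(path))) {
          props.store(w, "client properties for MapReduceIT");
        }
      }
    }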
