[GitHub] keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating docs with Connector builder

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating 
docs with Connector builder
URL: https://github.com/apache/accumulo-website/pull/56#discussion_r167737967
 
 

 ##
 File path: _docs-2-0/development/client-properties.md
 ##
 @@ -0,0 +1,39 @@
+---
+title: Client Properties
+category: development
+order: 9
+---
+
+
+
+Below are properties set in `accumulo-client.properties` that configure 
Accumulo clients:
+
+| Property | Default value | Description |
 
 Review comment:
   This could be updated with a `since` tag column




[GitHub] keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating docs with Connector builder

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating 
docs with Connector builder
URL: https://github.com/apache/accumulo-website/pull/56#discussion_r167737649
 
 

 ##
 File path: _docs-2-0/administration/in-depth-install.md
 ##
 @@ -294,27 +294,11 @@ will expect the KeyStore in the same location.
 
 ### Client Configuration
 
-In version 1.6.0, Accumulo included a new type of configuration file known as 
a client
-configuration file. One problem with the traditional "site.xml" file that is 
prevalent
-through Hadoop is that it is a single file used by both clients and servers. 
This makes
-it very difficult to protect secrets that are only meant for the server 
processes while
-allowing the clients to connect to the servers.
+Accumulo clients are configured in a different way than Accumulo servers. 
Clients are
+configured when [an Accumulo Connector is created][client-conn] using Java 
builder methods
+or an `accumulo-client.properties` file containing [client 
properties][client-props].
 
-The client configuration file is a subset of the information stored in 
accumulo-site.xml
-meant only for consumption by clients of Accumulo. By default, Accumulo checks 
a number
-of locations for a client configuration by default:
-
-* `/path/to/accumulo/conf/client.conf`
 
 Review comment:
   If using the new API, will it still look in these locations?




[GitHub] keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating docs with Connector builder

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating 
docs with Connector builder
URL: https://github.com/apache/accumulo-website/pull/56#discussion_r167738254
 
 

 ##
 File path: _docs-2-0/getting-started/clients.md
 ##
 @@ -16,92 +16,79 @@ If you are using Maven to create Accumulo client code, add 
the following to your
 
 ```
 
-## Running Client Code
-
-There are multiple ways to run Java code that use Accumulo. Below is a list
-of the different ways to execute client code.
-
-* build and execute an uber jar
-* add `accumulo classpath` to your Java classpath
-* use the `accumulo` command
-* use the `accumulo-util hadoop-jar` command
-
-### Build and execute an uber jar
-
-If you have included `accumulo-core` as dependency in your pom, you can build 
an uber jar
-using the Maven assembly or shade plugin and use it to run Accumulo client 
code. When building
-an uber jar, you should set the versions of any Hadoop dependencies in your 
pom to match the
-version running on your cluster.
-
-### Add 'accumulo classpath' to your Java classpath
-
-To run Accumulo client code using the `java` command, use the `accumulo 
classpath` command 
-to include all of Accumulo's dependencies on your classpath:
-
-java -classpath /path/to/my.jar:/path/to/dep.jar:$(accumulo classpath) 
com.my.Main arg1 arg2
-
-If you would like to review which jars are included, the `accumulo classpath` 
command can
-output a more human readable format using the `-d` option which enables 
debugging:
-
-accumulo classpath -d
-
-### Use the accumulo command
-
-Another option for running your code is to use the Accumulo script which can 
execute a
-main class (if it exists on its classpath):
-
-accumulo com.foo.Client arg1 arg2
-
-While the Accumulo script will add all of Accumulo's dependencies to the 
classpath, you
-will need to add any jars that your create or depend on beyond what Accumulo 
already
-depends on. This can be accomplished by either adding the jars to the 
`lib/ext` directory
-of your Accumulo installation or by adding jars to the CLASSPATH variable 
before calling
-the accumulo command.
-
-export CLASSPATH=/path/to/my.jar:/path/to/dep.jar; accumulo com.foo.Client 
arg1 arg2
-
-### Use the 'accumulo-util hadoop-jar' command
-
-If you are writing map reduce job that accesses Accumulo, then you can use
-`accumulo-util hadoop-jar` to run those jobs. See the [MapReduce 
example][mapred-example]
-for more information.
-
 ## Connecting
 
-All clients must first identify the Accumulo instance to which they will be
-communicating. Code to do this is as follows:
+Before writing Accumulo client code, you will need the following information.
+
+ * Accumulo instance name
+ * Zookeeper connection string
+ * Accumulo username & password
+
+The [Connector] object is the main entry point for Accumulo clients. It can be 
created using one
+of the following methods:
+
+1. Using the `accumulo-client.properties` file (a template can be found in the 
`conf/` directory
+   of the tarball distribution):
+```java
+Connector conn = Connector.builder()
+    .usingProperties("/path/to/accumulo-client.properties").build();
+```
+1. Using the builder methods of [Connector]:
+```java
+Connector conn = Connector.builder().forInstance("myinstance", "zookeeper1,zookeeper2")
+    .usingPasswordCredentials("myuser", "mypassword").build();
+```
+1. Using a Java Properties object:
+```java
+Properties props = new Properties();
+props.put("instance.name", "myinstance");
+props.put("instance.zookeepers", "zookeeper1,zookeeper2");
+props.put("auth.method", "password");
+props.put("auth.username", "myuser");
+props.put("auth.password", "mypassword");
+Connector conn = Connector.builder().usingProperties(props).build();
+```
+
+If an `accumulo-client.properties` file or a Java Properties object is used to 
create a [Connector], the following
+[client properties][client-props] must be set:
+
+* [instance.name]
+* [instance.zookeepers]
+* [auth.method]
+* [auth.username]
+* [auth.password]
+
+## Authentication
+
+When creating a [Connector], the user must be authenticated using one of the 
following
+implementations of [AuthenticationToken]:
+
+1. [PasswordToken] is the most commonly used implementation.
+1. [CredentialProviderToken] leverages the Hadoop CredentialProviders (new in 
Hadoop 2.6).
+   For example, the [CredentialProviderToken] can be used in conjunction with 
a Java KeyStore to
+   avoid storing passwords in cleartext. When stored in HDFS, a single 
KeyStore can be used across
+   an entire instance. Be aware that KeyStores stored on the local filesystem 
must be made available
+   to all nodes in the Accumulo cluster.
+1. [KerberosToken] can be provided to use the authentication provided by 
Kerberos. Using Kerberos
+   requires external setup and additional configuration, but provides a 

[GitHub] keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating docs with Connector builder

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #56: ACCUMULO-4784 Updating 
docs with Connector builder
URL: https://github.com/apache/accumulo-website/pull/56#discussion_r167737718
 
 

 ##
 File path: _docs-2-0/administration/kerberos.md
 ##
 @@ -342,16 +342,14 @@ Valid starting   Expires  Service 
principal
 
  Configuration
 
-The second thing clients need to do is to set up their client configuration 
file. By
-default, this file is stored in `~/.accumulo/config` or 
`/path/to/accumulo/client.conf`.
-Accumulo utilities also allow you to provide your own copy of this file in any 
location
-using the `--config-file` command line option.
+The second thing clients need to do is to to configure kerberos when an 
Accumulo Connector is
 
 Review comment:
   `to to configure kerberos`




[jira] [Resolved] (ACCUMULO-4709) Add size sanity checks to Mutations

2018-02-12 Thread Keith Turner (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner resolved ACCUMULO-4709.

Resolution: Fixed

> Add size sanity checks to Mutations
> ---
>
> Key: ACCUMULO-4709
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4709
> Project: Accumulo
>  Issue Type: Improvement
>Reporter: Keith Turner
>Assignee: Gergely Hajós
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Based on ACCUMULO-4708, it may be good to add size sanity checks to 
> Accumulo's Mutation data type.  The first step would be to determine how 
> Mutation handles the following situations currently.
>  * Create a mutation and put lots of small entries where total size exceeds 
> 2GB
>  * Create a mutation and add a single entry where the total of all fields 
> exceeds 2GB, but no individual field exceeds 2GB
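
As a rough illustration only, a minimal sketch of exercising the first scenario above (hypothetical value sizes and loop count; needs a multi-GB heap to actually reach the limit). With the sanity check merged in PR #373, shown later in this thread, the `put()` that pushes the estimated size past 2GB would be expected to throw `IllegalArgumentException`:
{code:java}
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

// Sketch of scenario 1: many small entries whose combined size passes 2GB.
public class MutationSizeScenario {
  public static void main(String[] args) {
    Mutation m = new Mutation("row");
    byte[] value = new byte[1 << 20]; // 1MB per value
    try {
      for (int i = 0; i < 3000; i++) { // ~3GB total if it were allowed
        m.put("cf", "cq" + i, new Value(value));
      }
    } catch (IllegalArgumentException e) {
      // expected once the estimated mutation size exceeds the 2GB limit
    }
  }
}
{code}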





[GitHub] keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create builder for Connector

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create 
builder for Connector
URL: https://github.com/apache/accumulo/pull/361#discussion_r167734131
 
 

 ##
 File path: core/src/main/java/org/apache/accumulo/core/client/Connector.java
 ##
 @@ -237,4 +284,272 @@ public abstract BatchDeleter createBatchDeleter(String 
tableName, Authorizations
* @since 1.7.0
*/
   public abstract ReplicationOperations replicationOperations();
+
+  /**
+   * Builds ConnectionInfo after all options have been specified
+   *
+   * @since 2.0.0
+   */
+  public interface ConnInfoFactory {
+
+/**
+ * Builds ConnectionInfo after all options have been specified
+ *
+ * @return ConnectionInfo
+ */
+ConnectionInfo info();
+  }
+
+  /**
+   * Builds Connector
+   *
+   * @since 2.0.0
+   */
+  public interface ConnectorFactory extends ConnInfoFactory {
+
+/**
+ * Builds Connector after all options have been specified
+ *
+ * @return Connector
+ */
+Connector build() throws AccumuloException, AccumuloSecurityException;
+
+  }
+
+  /**
+   * Builder method for setting Accumulo instance and zookeepers
+   *
+   * @since 2.0.0
+   */
+  public interface InstanceArgs {
+AuthenticationArgs forInstance(String instanceName, String zookeepers);
+  }
+
+  /**
+   * Builder methods for creating Connector using properties
+   *
+   * @since 2.0.0
+   */
+  public interface PropertyOptions extends InstanceArgs {
+
+/**
+ * Build using properties file. An example properties file can be found at 
conf/accumulo-client.properties in the Accumulo tarball distribution.
+ *
+ * @param propertiesFile
+ *  Path to properties file
+ * @return this builder
+ */
+ConnectorFactory usingProperties(String propertiesFile);
+
+/**
+ * Build using Java properties object. A list of available properties can 
be found in the documentation on the project website 
(http://accumulo.apache.org)
+ * under 'Development' -> 'Client Properties'
+ *
+ * @param properties
+ *  Properties object
+ * @return this builder
+ */
+ConnectorFactory usingProperties(Properties properties);
+  }
+
+  public interface ConnectionInfoOptions extends PropertyOptions {
+
+/**
+ * Build using connection information
+ *
+ * @param connectionInfo
+ *  ConnectionInfo object
+ * @return this builder
+ */
+ConnectorFactory usingConnectionInfo(ConnectionInfo connectionInfo);
+  }
+
+  /**
+   * Build methods for authentication
+   *
+   * @since 2.0.0
+   */
+  public interface AuthenticationArgs {
+
+/**
+ * Build using password-based credentials
+ *
+ * @param username
+ *  User name
+ * @param password
+ *  Password
+ * @return this builder
+ */
+ConnectionOptions usingPasswordCredentials(String username, CharSequence 
password);
+
+/**
+ * Build using Kerberos credentials
+ *
+ * @param principal
+ *  Principal
+ * @param keyTabFile
+ *  Path to keytab file
+ * @return this builder
+ */
+ConnectionOptions usingKerberosCredentials(String principal, String 
keyTabFile);
+
+/**
+ * Build using credentials from a CredentialProvider
+ *
+ * @param username
+ *  Accumulo user name
+ * @param name
+ *  Alias to extract Accumulo user password from CredentialProvider
+ * @param providerUrls
+ *  Comma separated list of URLs defining CredentialProvider(s)
+ * @return this builder
+ */
+ConnectionOptions usingCredentialProvider(String username, String name, 
String providerUrls);
 
 Review comment:
   Could be `usingProvider()`




[GitHub] keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create builder for Connector

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create 
builder for Connector
URL: https://github.com/apache/accumulo/pull/361#discussion_r167733987
 
 

 ##
 File path: core/src/main/java/org/apache/accumulo/core/client/Connector.java
 ##
 @@ -237,4 +284,272 @@ public abstract BatchDeleter createBatchDeleter(String 
tableName, Authorizations
* @since 1.7.0
*/
   public abstract ReplicationOperations replicationOperations();
+
+  /**
+   * Builds ConnectionInfo after all options have been specified
+   *
+   * @since 2.0.0
+   */
+  public interface ConnInfoFactory {
+
+/**
+ * Builds ConnectionInfo after all options have been specified
+ *
+ * @return ConnectionInfo
+ */
+ConnectionInfo info();
+  }
+
+  /**
+   * Builds Connector
+   *
+   * @since 2.0.0
+   */
+  public interface ConnectorFactory extends ConnInfoFactory {
+
+/**
+ * Builds Connector after all options have been specified
+ *
+ * @return Connector
+ */
+Connector build() throws AccumuloException, AccumuloSecurityException;
+
+  }
+
+  /**
+   * Builder method for setting Accumulo instance and zookeepers
+   *
+   * @since 2.0.0
+   */
+  public interface InstanceArgs {
+AuthenticationArgs forInstance(String instanceName, String zookeepers);
+  }
+
+  /**
+   * Builder methods for creating Connector using properties
+   *
+   * @since 2.0.0
+   */
+  public interface PropertyOptions extends InstanceArgs {
+
+/**
+ * Build using properties file. An example properties file can be found at 
conf/accumulo-client.properties in the Accumulo tarball distribution.
+ *
+ * @param propertiesFile
+ *  Path to properties file
+ * @return this builder
+ */
+ConnectorFactory usingProperties(String propertiesFile);
+
+/**
+ * Build using Java properties object. A list of available properties can 
be found in the documentation on the project website 
(http://accumulo.apache.org)
+ * under 'Development' -> 'Client Properties'
+ *
+ * @param properties
+ *  Properties object
+ * @return this builder
+ */
+ConnectorFactory usingProperties(Properties properties);
+  }
+
+  public interface ConnectionInfoOptions extends PropertyOptions {
+
+/**
+ * Build using connection information
+ *
+ * @param connectionInfo
+ *  ConnectionInfo object
+ * @return this builder
+ */
+ConnectorFactory usingConnectionInfo(ConnectionInfo connectionInfo);
+  }
+
+  /**
+   * Build methods for authentication
+   *
+   * @since 2.0.0
+   */
+  public interface AuthenticationArgs {
+
+/**
+ * Build using password-based credentials
+ *
+ * @param username
+ *  User name
+ * @param password
+ *  Password
+ * @return this builder
+ */
+ConnectionOptions usingPasswordCredentials(String username, CharSequence 
password);
+
+/**
+ * Build using Kerberos credentials
+ *
+ * @param principal
+ *  Principal
+ * @param keyTabFile
+ *  Path to keytab file
+ * @return this builder
+ */
+ConnectionOptions usingKerberosCredentials(String principal, String 
keyTabFile);
 
 Review comment:
   This could be `usingKerberos()`




[GitHub] keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create builder for Connector

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create 
builder for Connector
URL: https://github.com/apache/accumulo/pull/361#discussion_r167733942
 
 

 ##
 File path: core/src/main/java/org/apache/accumulo/core/client/Connector.java
 ##
 @@ -237,4 +284,272 @@ public abstract BatchDeleter createBatchDeleter(String 
tableName, Authorizations
* @since 1.7.0
*/
   public abstract ReplicationOperations replicationOperations();
+
+  /**
+   * Builds ConnectionInfo after all options have been specified
+   *
+   * @since 2.0.0
+   */
+  public interface ConnInfoFactory {
+
+/**
+ * Builds ConnectionInfo after all options have been specified
+ *
+ * @return ConnectionInfo
+ */
+ConnectionInfo info();
+  }
+
+  /**
+   * Builds Connector
+   *
+   * @since 2.0.0
+   */
+  public interface ConnectorFactory extends ConnInfoFactory {
+
+/**
+ * Builds Connector after all options have been specified
+ *
+ * @return Connector
+ */
+Connector build() throws AccumuloException, AccumuloSecurityException;
+
+  }
+
+  /**
+   * Builder method for setting Accumulo instance and zookeepers
+   *
+   * @since 2.0.0
+   */
+  public interface InstanceArgs {
+AuthenticationArgs forInstance(String instanceName, String zookeepers);
+  }
+
+  /**
+   * Builder methods for creating Connector using properties
+   *
+   * @since 2.0.0
+   */
+  public interface PropertyOptions extends InstanceArgs {
+
+/**
+ * Build using properties file. An example properties file can be found at 
conf/accumulo-client.properties in the Accumulo tarball distribution.
+ *
+ * @param propertiesFile
+ *  Path to properties file
+ * @return this builder
+ */
+ConnectorFactory usingProperties(String propertiesFile);
+
+/**
+ * Build using Java properties object. A list of available properties can 
be found in the documentation on the project website 
(http://accumulo.apache.org)
+ * under 'Development' -> 'Client Properties'
+ *
+ * @param properties
+ *  Properties object
+ * @return this builder
+ */
+ConnectorFactory usingProperties(Properties properties);
+  }
+
+  public interface ConnectionInfoOptions extends PropertyOptions {
+
+/**
+ * Build using connection information
+ *
+ * @param connectionInfo
+ *  ConnectionInfo object
+ * @return this builder
+ */
+ConnectorFactory usingConnectionInfo(ConnectionInfo connectionInfo);
+  }
+
+  /**
+   * Build methods for authentication
+   *
+   * @since 2.0.0
+   */
+  public interface AuthenticationArgs {
+
+/**
+ * Build using password-based credentials
+ *
+ * @param username
+ *  User name
+ * @param password
+ *  Password
+ * @return this builder
+ */
+ConnectionOptions usingPasswordCredentials(String username, CharSequence 
password);
 
 Review comment:
   When I tried using the API I found this to be a very long method name.  
Could be `usingPassword()`
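
   To illustrate, the fluent chain would read like this if the shorter names suggested in this thread (`usingPassword()`, and likewise `usingKerberos()`, `usingToken()`, `usingProvider()`) were adopted. These names are only proposals, not the API in the current patch:

```java
// Hypothetical: usingPassword() is the shorter name proposed in this review;
// the current patch exposes usingPasswordCredentials(). The rest of the chain
// matches the builder added in this PR.
Connector conn = Connector.builder()
    .forInstance("myinstance", "zookeeper1,zookeeper2")
    .usingPassword("myuser", "mypassword")
    .build();
```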




[GitHub] keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create builder for Connector

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #361: ACCUMULO-4784 - Create 
builder for Connector
URL: https://github.com/apache/accumulo/pull/361#discussion_r167734092
 
 

 ##
 File path: core/src/main/java/org/apache/accumulo/core/client/Connector.java
 ##
 @@ -237,4 +284,272 @@ public abstract BatchDeleter createBatchDeleter(String 
tableName, Authorizations
* @since 1.7.0
*/
   public abstract ReplicationOperations replicationOperations();
+
+  /**
+   * Builds ConnectionInfo after all options have been specified
+   *
+   * @since 2.0.0
+   */
+  public interface ConnInfoFactory {
+
+/**
+ * Builds ConnectionInfo after all options have been specified
+ *
+ * @return ConnectionInfo
+ */
+ConnectionInfo info();
+  }
+
+  /**
+   * Builds Connector
+   *
+   * @since 2.0.0
+   */
+  public interface ConnectorFactory extends ConnInfoFactory {
+
+/**
+ * Builds Connector after all options have been specified
+ *
+ * @return Connector
+ */
+Connector build() throws AccumuloException, AccumuloSecurityException;
+
+  }
+
+  /**
+   * Builder method for setting Accumulo instance and zookeepers
+   *
+   * @since 2.0.0
+   */
+  public interface InstanceArgs {
+AuthenticationArgs forInstance(String instanceName, String zookeepers);
+  }
+
+  /**
+   * Builder methods for creating Connector using properties
+   *
+   * @since 2.0.0
+   */
+  public interface PropertyOptions extends InstanceArgs {
+
+/**
+ * Build using properties file. An example properties file can be found at 
conf/accumulo-client.properties in the Accumulo tarball distribution.
+ *
+ * @param propertiesFile
+ *  Path to properties file
+ * @return this builder
+ */
+ConnectorFactory usingProperties(String propertiesFile);
+
+/**
+ * Build using Java properties object. A list of available properties can 
be found in the documentation on the project website 
(http://accumulo.apache.org)
+ * under 'Development' -> 'Client Properties'
+ *
+ * @param properties
+ *  Properties object
+ * @return this builder
+ */
+ConnectorFactory usingProperties(Properties properties);
+  }
+
+  public interface ConnectionInfoOptions extends PropertyOptions {
+
+/**
+ * Build using connection information
+ *
+ * @param connectionInfo
+ *  ConnectionInfo object
+ * @return this builder
+ */
+ConnectorFactory usingConnectionInfo(ConnectionInfo connectionInfo);
+  }
+
+  /**
+   * Build methods for authentication
+   *
+   * @since 2.0.0
+   */
+  public interface AuthenticationArgs {
+
+/**
+ * Build using password-based credentials
+ *
+ * @param username
+ *  User name
+ * @param password
+ *  Password
+ * @return this builder
+ */
+ConnectionOptions usingPasswordCredentials(String username, CharSequence 
password);
+
+/**
+ * Build using Kerberos credentials
+ *
+ * @param principal
+ *  Principal
+ * @param keyTabFile
+ *  Path to keytab file
+ * @return this builder
+ */
+ConnectionOptions usingKerberosCredentials(String principal, String 
keyTabFile);
+
+/**
+ * Build using credentials from a CredentialProvider
+ *
+ * @param username
+ *  Accumulo user name
+ * @param name
+ *  Alias to extract Accumulo user password from CredentialProvider
+ * @param providerUrls
+ *  Comma separated list of URLs defining CredentialProvider(s)
+ * @return this builder
+ */
+ConnectionOptions usingCredentialProvider(String username, String name, 
String providerUrls);
+
+/**
+ * Build using specified credentials
+ *
+ * @param principal
+ *  Principal/username
+ * @param token
+ *  Authentication token
+ * @return this builder
+ */
+ConnectionOptions usingCredentials(String principal, AuthenticationToken 
token);
 
 Review comment:
   Could be `usingToken()`




Accumulo-Pull-Requests - Build # 1039 - Still unstable

2018-02-12 Thread Apache Jenkins Server
The Apache Jenkins build system has built Accumulo-Pull-Requests (build #1039)

Status: Still unstable

Check console output at 
https://builds.apache.org/job/Accumulo-Pull-Requests/1039/ to view the results.

[jira] [Updated] (ACCUMULO-4809) Session manager clean up can happen when lock held.

2018-02-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ACCUMULO-4809:
-
Labels: pull-request-available  (was: )

> Session manager clean up can happen when lock held.
> ---
>
> Key: ACCUMULO-4809
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4809
> Project: Accumulo
>  Issue Type: Bug
>Affects Versions: 1.7.3, 1.8.1
>Reporter: Keith Turner
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.4, 1.9.0, 2.0.0
>
>
> While working on [PR #382|https://github.com/apache/accumulo/pull/382] for 
> ACCUMULO-4782 I noticed a significant concurrency bug.  Before #382 there was 
> a single lock for the session manager. The session manager will clean up idle 
> sessions.  This clean up should happen outside the session manager lock, 
> because all tserver read/write operations use the session manager so it should 
> be responsive.
> The bug is the following.
>  * Both getActiveScansPerTable() and getActiveScans() lock the session 
> manager and then lock idleSessions.  See [SessionManager line 
> 233|https://github.com/apache/accumulo/blob/rel/1.7.3/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L233]
>  
>  * The sweep() method locks idleSessions and does cleanup while this lock is 
> held. [See SessionManager 
> 200|https://github.com/apache/accumulo/blob/rel/1.7.3/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L200]
>  
> Therefore it is possible for getActiveScansPerTable() or getActiveScans() to 
> lock the session manager and then block trying to lock idleSessions while 
> cleanup is happening in sweep().  This will block all access to the session 
> manager while cleanup happens.
> The changes in #382 will fix this for 1.9.0 and 2.0.0.  However I am not sure 
> about backporting #382 to 1.7.  A more targeted fix could be made for 1.7 or 
> #382 could be backported.
>  
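
A minimal sketch of the lock-scoping pattern the issue argues for (the `Session` type, field names, and `sweep()` below are hypothetical stand-ins, not the actual SessionManager code): hand idle sessions off while the lock is held, then run the slow cleanup after releasing it so other session manager callers are not blocked.
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IdleSessionSweeper {

  interface Session {
    void cleanup();
  }

  private final Object lock = new Object();
  private final Set<Session> idleSessions = new HashSet<>();

  void sweep() {
    List<Session> toCleanup;
    synchronized (lock) {
      // only a cheap hand-off happens while the lock is held
      toCleanup = new ArrayList<>(idleSessions);
      idleSessions.clear();
    }
    // the potentially slow cleanup runs after the lock is released, so callers
    // like getActiveScans() that need the lock are not blocked behind it
    for (Session s : toCleanup) {
      s.cleanup();
    }
  }
}
{code}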





[GitHub] keith-turner opened a new pull request #383: ACCUMULO-4809 Avoid blocking during session cleanup

2018-02-12 Thread GitBox
keith-turner opened a new pull request #383: ACCUMULO-4809 Avoid blocking 
during session cleanup
URL: https://github.com/apache/accumulo/pull/383
 
 
   




Accumulo-Master - Build # 2256 - Fixed

2018-02-12 Thread Apache Jenkins Server
The Apache Jenkins build system has built Accumulo-Master (build #2256)

Status: Fixed

Check console output at https://builds.apache.org/job/Accumulo-Master/2256/ to 
view the results.

[GitHub] keith-turner closed pull request #373: ACCUMULO-4709 sanity check in Mutation

2018-02-12 Thread GitBox
keith-turner closed pull request #373: ACCUMULO-4709 sanity check in Mutation
URL: https://github.com/apache/accumulo/pull/373
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/core/src/main/java/org/apache/accumulo/core/data/Mutation.java 
b/core/src/main/java/org/apache/accumulo/core/data/Mutation.java
index ebc72f5e22..5e1a7ba869 100644
--- a/core/src/main/java/org/apache/accumulo/core/data/Mutation.java
+++ b/core/src/main/java/org/apache/accumulo/core/data/Mutation.java
@@ -37,6 +37,9 @@
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.io.WritableUtils;
 
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+
 /**
  * Mutation represents an action that manipulates a row in a table. A mutation 
holds a list of column/value pairs that represent an atomic set of modifications
  * to make to a row.
@@ -66,6 +69,13 @@
*/
   static final int VALUE_SIZE_COPY_CUTOFF = 1 << 15;
 
+  /**
+   * Maximum size of a mutation (2GB).
+   */
+  static final long MAX_MUTATION_SIZE = (1L << 31);
+
+  static final long SERIALIZATION_OVERHEAD = 5;
+
   /**
* Formats available for serializing Mutations. The formats are described in 
a separate document.
*/
@@ -79,6 +89,10 @@
   private int entries;
   private List values;
 
+  // tracks estimated size of row.length + largeValues.length
+  @VisibleForTesting
+  long estRowAndLargeValSize = 0;
+
   private UnsynchronizedBuffer.Writer buffer;
 
   private List updates;
@@ -169,6 +183,7 @@ public Mutation(byte[] row, int start, int length, int 
initialBufferSize) {
 this.row = new byte[length];
 System.arraycopy(row, start, this.row, 0, length);
 buffer = new UnsynchronizedBuffer.Writer(initialBufferSize);
+estRowAndLargeValSize = length + SERIALIZATION_OVERHEAD;
   }
 
   /**
@@ -306,6 +321,9 @@ private void put(byte[] cf, int cfLength, byte[] cq, int 
cqLength, byte[] cv, bo
 if (buffer == null) {
   throw new IllegalStateException("Can not add to mutation after 
serializing it");
 }
+long estimatedSizeAfterPut = estRowAndLargeValSize + buffer.size() + 
cfLength + cqLength + cv.length + (hasts ? 8 : 0) + valLength + 2
++ 4 * SERIALIZATION_OVERHEAD;
+Preconditions.checkArgument(estimatedSizeAfterPut < MAX_MUTATION_SIZE && 
estimatedSizeAfterPut >= 0, "Maximum mutation size must be less than 2GB ");
 put(cf, cfLength);
 put(cq, cqLength);
 put(cv);
@@ -325,6 +343,7 @@ private void put(byte[] cf, int cfLength, byte[] cq, int 
cqLength, byte[] cv, bo
   System.arraycopy(val, 0, copy, 0, valLength);
   values.add(copy);
   put(-1 * values.size());
+  estRowAndLargeValSize += valLength + SERIALIZATION_OVERHEAD;
 }
 
 entries++;
diff --git 
a/core/src/main/java/org/apache/accumulo/core/util/UnsynchronizedBuffer.java 
b/core/src/main/java/org/apache/accumulo/core/util/UnsynchronizedBuffer.java
index f58737babe..89671d1037 100644
--- a/core/src/main/java/org/apache/accumulo/core/util/UnsynchronizedBuffer.java
+++ b/core/src/main/java/org/apache/accumulo/core/util/UnsynchronizedBuffer.java
@@ -162,6 +162,10 @@ public void writeVLong(long i) {
 data[offset++] = (byte) ((i & mask) >> shiftbits);
   }
 }
+
+public int size() {
+  return offset;
+}
   }
 
   /**
diff --git a/core/src/test/java/org/apache/accumulo/core/data/MutationTest.java 
b/core/src/test/java/org/apache/accumulo/core/data/MutationTest.java
index 7d247f5f4b..7f4e2fda75 100644
--- a/core/src/test/java/org/apache/accumulo/core/data/MutationTest.java
+++ b/core/src/test/java/org/apache/accumulo/core/data/MutationTest.java
@@ -680,4 +680,12 @@ public void testPutAfterEquals() {
   fail("Calling Mutation#equals then Mutation#put should not result in an 
IllegalStateException.");
 }
   }
+
+  @Test(expected = IllegalArgumentException.class)
+  public void testSanityCheck() {
+Mutation m = new Mutation("too big mutation");
+m.put("cf", "cq1", "v");
+m.estRowAndLargeValSize += (Long.MAX_VALUE / 2);
+m.put("cf", "cq2", "v");
+  }
 }


 




[GitHub] keith-turner commented on issue #373: ACCUMULO-4709 sanity check in Mutation

2018-02-12 Thread GitBox
keith-turner commented on issue #373: ACCUMULO-4709 sanity check in Mutation
URL: https://github.com/apache/accumulo/pull/373#issuecomment-365075619
 
 
   @ghajos thanks for the contribution.  Would you like to be added to the 
Accumulo [people page](http://accumulo.apache.org/people/)?  If so let me know 
what info you would like or make a PR to the [website 
repo](https://github.com/apache/accumulo-website).




[GitHub] ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167680196
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/log/DfsLogger.java
 ##
 @@ -414,7 +413,7 @@ public static DFSLoggerInputStreams 
readHeaderAndReturnStream(VolumeManager fs,
 
   }
 } catch (EOFException e) {
-  log.warn("Got EOFException trying to read WAL header information, 
assuming the rest of the file (" + path + ") has no data.");
+  log.warn("Got EOFException trying to read WAL header information, 
assuming the rest of the file has no data.");
 
 Review comment:
   I'm not a fan of both logging *and* throwing. I think it's generally better 
to pick one or the other: handle an exception (by logging in this case) or pass 
it up. Dropping the path makes it even more obvious that this might be a 
redundant message... since my first instinct was to suggest adding `, e` to the 
`log.warn` method call, but that's probably not needed if the 
`LogHeaderIncompleteException` is keeping the stack trace for logging later.
   
   It's not a big deal... but something to think about if there's an obvious 
cleanup here.




[GitHub] ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167682325
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/log/LogSorter.java
 ##
 @@ -113,44 +113,46 @@ public void sort(String name, Path srcPath, String 
destPath) {
 // the following call does not throw an exception if the file/dir does 
not exist
 fs.deleteRecursively(new Path(destPath));
 
-DFSLoggerInputStreams inputStreams;
-try {
-  inputStreams = DfsLogger.readHeaderAndReturnStream(fs, srcPath, 
conf);
-} catch (LogHeaderIncompleteException e) {
-  log.warn("Could not read header from write-ahead log " + srcPath + 
". Not sorting.");
-  // Creating a 'finished' marker will cause recovery to proceed 
normally and the
-  // empty file will be correctly ignored downstream.
-  fs.mkdirs(new Path(destPath));
-  writeBuffer(destPath, Collections.> 
emptyList(), part++);
-  fs.create(SortedLogState.getFinishedMarkerPath(destPath)).close();
-  return;
-}
+try (final FSDataInputStream fsinput = fs.open(srcPath)) {
+  DFSLoggerInputStreams inputStreams;
+  try {
+inputStreams = DfsLogger.readHeaderAndReturnStream(fsinput, conf);
 
 Review comment:
   I'm not exactly sure what kind of additional resources the crypto stream 
might use, but it seems to me that since `inputStreams` is not `Closeable`, it 
may be possible for the `fsinput` to be closed, but the crypto stream in 
`inputStreams` to still be holding resources. Would need to dig further to be 
sure, but it would be nice if `DFSLoggerInputStreams` was `Closeable` and this 
was created as a second resource, in the `try-with-resources` block on line 116.
   
   Same comment applies to other uses of `DFSLoggerInputStreams`.
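
   A rough sketch of the shape being suggested, assuming `DFSLoggerInputStreams` were made `Closeable` (it is not in the current code), so both resources are released by the same try-with-resources block:

```java
// Hypothetical: assumes DFSLoggerInputStreams implements Closeable; closing it
// would then also release any crypto stream it wraps.
try (FSDataInputStream fsinput = fs.open(srcPath);
    DFSLoggerInputStreams inputStreams = DfsLogger.readHeaderAndReturnStream(fsinput, conf)) {
  // sort the WAL using inputStreams, as the surrounding code does today
}
```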




[GitHub] ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
ctubbsii commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167678913
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
 
 Review comment:
   That might be configurable. Would have to investigate the options in newer 
Eclipse versions. It doesn't matter to me... if it's configurable, I'll defer 
to others as to what is most readable.




[GitHub] keith-turner commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
keith-turner commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167660400
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
+  log.debug("Skipping unwanted data in WAL");
+  Span span = Trace.start("Consume WAL prefix");
+  span.data("file", p.toString());
+  try {
+// We want to read all records in the WAL up to the "begin" offset 
contained in the Status message,
+// building a Set of tids from DEFINE_TABLET events which correspond 
to table ids for future mutations
+tids = consumeWalPrefix(target, input, p, status, sizeLimit);
+  } catch (IOException e) {
+log.warn("Unexpected error consuming file.");
+return status;
+  } finally {
+span.stop();
+  }
+
+  log.debug("Sending batches of data to peer tserver");
 
 Review comment:
   Would be nice to open an issue for follow up work.




[GitHub] adamjshook commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
adamjshook commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167651645
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
+  log.debug("Skipping unwanted data in WAL");
+  Span span = Trace.start("Consume WAL prefix");
+  span.data("file", p.toString());
+  try {
+// We want to read all records in the WAL up to the "begin" offset 
contained in the Status message,
+// building a Set of tids from DEFINE_TABLET events which correspond 
to table ids for future mutations
+tids = consumeWalPrefix(target, input, p, status, sizeLimit);
+  } catch (IOException e) {
+log.warn("Unexpected error consuming file.");
+return status;
+  } finally {
+span.stop();
+  }
+
+  log.debug("Sending batches of data to peer tserver");
 
 Review comment:
   Yeah, it's a little nasty.  These are all indentation changes with the 
addition of the `try-catch`.




[GitHub] adamjshook commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
adamjshook commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167651285
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
 
 Review comment:
   The formatter in the build process reformats it so they are on the same line.




Accumulo-Pull-Requests - Build # 1038 - Failure

2018-02-12 Thread Apache Jenkins Server
The Apache Jenkins build system has built Accumulo-Pull-Requests (build #1038)

Status: Failure

Check console output at 
https://builds.apache.org/job/Accumulo-Pull-Requests/1038/ to view the results.

[GitHub] joshelser commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
joshelser commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167645744
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
+  log.debug("Skipping unwanted data in WAL");
+  Span span = Trace.start("Consume WAL prefix");
+  span.data("file", p.toString());
+  try {
+// We want to read all records in the WAL up to the "begin" offset 
contained in the Status message,
+// building a Set of tids from DEFINE_TABLET events which correspond 
to table ids for future mutations
+tids = consumeWalPrefix(target, input, p, status, sizeLimit);
+  } catch (IOException e) {
+log.warn("Unexpected error consuming file.");
+return status;
+  } finally {
+span.stop();
+  }
+
+  log.debug("Sending batches of data to peer tserver");
 
 Review comment:
   Yuck, this should really be split up into its own method. I know you just 
copy-pasted, but if you want to come back in here later and clean up some more 
;)




[GitHub] joshelser commented on a change in pull request #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
joshelser commented on a change in pull request #369: [ACCUMULO-4787] Close 
input stream in AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#discussion_r167645248
 
 

 ##
 File path: 
server/tserver/src/main/java/org/apache/accumulo/tserver/replication/AccumuloReplicaSystem.java
 ##
 @@ -370,11 +368,112 @@ protected Status replicateLogs(ClientContext 
peerContext, final HostAndPort peer
 
 log.debug("Replication WAL to peer tserver");
 final Set tids;
-final DataInputStream input;
-Span span = Trace.start("Read WAL header");
-span.data("file", p.toString());
-try {
-  input = getWalStream(p);
+try (final FSDataInputStream fsinput = fs.open(p); final DataInputStream 
input = getWalStream(p, fsinput)) {
 
 Review comment:
   nit: would prefer to see these split up onto two lines. Not sure what others 
think :)
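
   i.e. the same statement as quoted above, just wrapped so each resource sits on its own line (the project formatter may rejoin them, as noted elsewhere in this thread):

```java
try (final FSDataInputStream fsinput = fs.open(p);
    final DataInputStream input = getWalStream(p, fsinput)) {
  // ...
}
```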




[GitHub] adamjshook commented on issue #369: [ACCUMULO-4787] Close input stream in AccumuloReplicaSystem

2018-02-12 Thread GitBox
adamjshook commented on issue #369: [ACCUMULO-4787] Close input stream in 
AccumuloReplicaSystem
URL: https://github.com/apache/accumulo/pull/369#issuecomment-365013146
 
 
   @ctubbsii @keith-turner Undid the code relocations and applied Keith's 
suggestion to refactor out the `FSDataInputStream`.  Back to you!




[jira] [Updated] (ACCUMULO-4813) Accepting mapping file for bulk import

2018-02-12 Thread Keith Turner (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner updated ACCUMULO-4813:
---
Description: 
During bulk import, inspecting files to determine where they go is expensive 
and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
spread the work of inspecting files to random tablet servers.  Because this 
internal process takes time and consumes resources on the cluster, users want 
control over it.  The best way to give this control may be to externalize it by 
allowing bulk imports to have a mapping file.  This mapping file would specify 
the ranges where files should be loaded.  If Accumulo provided API to help 
produce this file, then that work could be done in Map Reduce or Spark.  This 
would give users all the control they want over when and where this computation 
is done.  This would naturally fit in the process used to create the bulk 
files. 

To make bulk import fast this mapping file should have the following properties.
 * Key in file is a range
 * Value in file is a list of files
 * Ranges are non overlapping
 * File is sorted by range/key
 * Has a mapping for every non-empty file in the bulk import directory.

If Accumulo provides APIs to do the following operations, then producing the 
file could be written as a map/reduce job.
 * For a given rfile produce a list of row ranges where the file should be 
loaded.  These row ranges would be based on tablets.
 * Merge row range,list of file pairs
 * Serialize row range,list of files pairs

With a mapping file, the bulk import algorithm could be written as follows.  
This could all be executed in the master with no need to run inspection task on 
random tablet servers.
 * Sanity check file
 ** Ensure in sorted order
 ** Ensure ranges are non-overlapping
 ** Ensure each file in directory has at least one entry in file
 ** Ensure all splits in the file exist in the table.
 * Since file is sorted can do a merged read of file and metadata table, 
looping over the following operations for each tablet until all files are 
loaded.
 ** Read the loaded files for the tablet
 ** Read the files to load for the range
 ** For any files not loaded, send an async load message to the tablet server

The above algorithm can just keep scanning the metadata table and sending async 
load messages until the bulk import is complete.  Since the load messages are 
async, the bulk load of a large number of files could potentially be very 
fast.

The bulk load operation can easily handle the case of tablets splitting during 
the operation by matching a single range in the file to multiple tablets.  
However attempting to handle merges would be a lot more tricky.  It would 
probably be simplest to fail the operation if a merge is detected.  The nice 
thing is that this can be done in a very clean way.   Once the bulk import 
operation has the table lock, merges can not happen.  So after getting the 
table lock the bulk import operation can ensure all splits in the file exist in 
the table. The operation can abort if the condition is not met before doing any 
work.  If this condition is not met, it indicates a merge happened between 
generating the mapping file and doing the bulk import.

Hopefully the mapping file plus the algorithm that sends async load messages 
can dramatically speed up bulk import operations.  This may lessen the need for 
other things like prioritizing bulk import.  To measure this, it would be very 
useful to create a bulk import performance test that can create many files with 
very little data and measure the time it takes to load them.

  was:
During bulk import, inspecting files to determine where they go is expensive 
and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
spread the work of inspecting files to random tablet servers.  Because this 
internal process takes time and consumes resources on the cluster, users want 
control over it.  The best way to give this control may be to externalize it by 
allowing bulk imports to have a mapping file.  This mapping file would specify 
the ranges where files should be loaded.  If Accumulo provided API to help 
produce this file, then that work could be done in Map Reduce or Spark.  This 
would give users all the control they want over when and where this computation 
is done.  This would naturally fit in the process used to create the bulk 
files. 

To make bulk import fast this mapping file should have the following properties.
 * Key in file is a range
 * Value in file is a list of files
 * Ranges are non overlapping
 * File is sorted by range/key
 * Has a mapping for every non-empty file in the bulk import directory.

If Accumulo provides APIs to do the following operation, then producing the 
file could written as a map/reduce job.
 * For a given file produce a list of ranges
 * Merge range,list of file pairs
 * Serialize 

[jira] [Commented] (ACCUMULO-4813) Accepting mapping file for bulk import

2018-02-12 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361089#comment-16361089
 ] 

Keith Turner commented on ACCUMULO-4813:


Conceptually the contents of this special mapping file would something like.
{code:java}
  SortedMap
{code}
The file would need a special file extension, not sure what it would be.  Maybe 
.lm for load mapping?
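
As a rough illustration only (the `Range` and `Path` types here are assumptions based on the description above, not a committed API), the in-memory shape of such a load mapping might look like:
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the conceptual load mapping: non-overlapping row
// ranges, sorted, each mapped to the bulk files that should be loaded there.
public class LoadMappingSketch {
  public static void main(String[] args) {
    SortedMap<Range, List<Path>> loadMapping = new TreeMap<>();
    loadMapping.put(new Range("a", "f"),
        Arrays.asList(new Path("/bulk/f1.rf"), new Path("/bulk/f2.rf")));
    loadMapping.put(new Range("g", "m"),
        Arrays.asList(new Path("/bulk/f2.rf")));
    // such a map would be serialized to a file (perhaps the ".lm" extension
    // suggested above) that bulk import merge-reads against the metadata table
  }
}
{code}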

> Accepting mapping file for bulk import
> --
>
> Key: ACCUMULO-4813
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4813
> Project: Accumulo
>  Issue Type: Sub-task
>Reporter: Keith Turner
>Priority: Major
> Fix For: 2.0.0
>
>
> During bulk import, inspecting files to determine where they go is expensive 
> and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
> spread the work of inspecting files to random tablet servers.  Because this 
> internal process takes time and consumes resources on the cluster, users want 
> control over it.  The best way to give this control may be to externalize it 
> by allowing bulk imports to have a mapping file.  This mapping file would 
> specify the ranges where files should be loaded.  If Accumulo provided API to 
> help produce this file, then that work could be done in Map Reduce or Spark.  
> This would give users all the control they want over when and where this 
> computation is done.  This would naturally fit in the process used to create 
> the bulk files. 
> To make bulk import fast this mapping file should have the following 
> properties.
>  * Key in file is a range
>  * Value in file is a list of files
>  * Ranges are non overlapping
>  * File is sorted by range/key
>  * Has a mapping for every non-empty file in the bulk import directory.
> If Accumulo provides APIs to do the following operation, then producing the 
> file could written as a map/reduce job.
>  * For a given file produce a list of ranges
>  * Merge range,list of file pairs
>  * Serialize range,list of files pairs
> With a mapping file, the bulk import algorithm could be written as follows.  
> This could all be executed in the master with no need to run inspection task 
> on random tablet servers.
>  * Sanity check file
>  ** Ensure in sorted order
>  ** Ensure ranges are non-overlapping
>  ** Ensure each file in directory has at least one entry in file
>  ** Ensure all splits in the file exist in the table.
>  * Since file is sorted can do a merged read of file and metadata table, 
> looping over the following operations for each tablet until all files are 
> loaded.
>  ** Read the loaded files for the tablet
>  ** Read the files to load for the range
>  ** For any files not loaded, send an async load message to the tablet server
> The above algorithm can just keep scanning the metadata table and sending 
> async load messages until the bulk import is complete.  Since the load 
> messages are async, the bulk load could of a large number of files could 
> potentially be very fast.
> The bulk load operation can easily handle the case of tablets splitting 
> during the operation by matching a single range in the file to multiple 
> tablets.  However, attempting to handle merges would be a lot trickier.  It 
> would probably be simplest to fail the operation if a merge is detected.  The 
> nice thing is that this can be done in a very clean way.  Once the bulk 
> import operation has the table lock, merges cannot happen.  So after getting 
> the table lock, the bulk import operation can ensure all splits in the file 
> exist in the table.  The operation can abort if the condition is not met 
> before doing any work.  If this condition is not met, it indicates a merge 
> happened between generating the mapping file and doing the bulk import.
> Hopefully the mapping file plus the algorithm that sends async load messages 
> can dramatically speed up bulk import operations.  This may lessen the need 
> for other things like prioritizing bulk import.  To measure this, it would be 
> very useful to create a bulk import performance test that can create many 
> files with very little data and measure the time it takes to load them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ACCUMULO-4813) Accepting mapping file for bulk import

2018-02-12 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361089#comment-16361089
 ] 

Keith Turner edited comment on ACCUMULO-4813 at 2/12/18 5:06 PM:
-

Conceptually the contents of this special mapping file would be something like.
{code:java}
  SortedMap
{code}
The file would need a special file extension, not sure what it would be.  Maybe 
.lm for load mapping?


was (Author: kturner):
Conceptually the contents of this special mapping file would something like.
{code:java}
  SortedMap
{code}
The file would need a special file extension, not sure what it would be.  Maybe 
.lm for load mapping?

> Accepting mapping file for bulk import
> --
>
> Key: ACCUMULO-4813
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4813
> Project: Accumulo
>  Issue Type: Sub-task
>Reporter: Keith Turner
>Priority: Major
> Fix For: 2.0.0
>
>
> During bulk import, inspecting files to determine where they go is expensive 
> and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
> spread the work of inspecting files to random tablet servers.  Because this 
> internal process takes time and consumes resources on the cluster, users want 
> control over it.  The best way to give this control may be to externalize it 
> by allowing bulk imports to have a mapping file.  This mapping file would 
> specify the ranges where files should be loaded.  If Accumulo provided an API 
> to help produce this file, then that work could be done in MapReduce or Spark.  
> This would give users all the control they want over when and where this 
> computation is done.  This would naturally fit in the process used to create 
> the bulk files. 
> To make bulk import fast, this mapping file should have the following 
> properties:
>  * Key in file is a range
>  * Value in file is a list of files
>  * Ranges are non overlapping
>  * File is sorted by range/key
>  * Has a mapping for every non-empty file in the bulk import directory.
> If Accumulo provides APIs to do the following operations, then producing the 
> file could be written as a map/reduce job.
>  * For a given file, produce a list of ranges
>  * Merge (range, list of files) pairs
>  * Serialize (range, list of files) pairs
> With a mapping file, the bulk import algorithm could be written as follows.  
> This could all be executed in the master with no need to run inspection tasks 
> on random tablet servers.
>  * Sanity check the mapping file
>  ** Ensure it is in sorted order
>  ** Ensure ranges are non-overlapping
>  ** Ensure each file in the bulk directory has at least one entry in the mapping file
>  ** Ensure all splits in the mapping file exist in the table.
>  * Since the file is sorted, the master can do a merged read of the file and 
> the metadata table, looping over the following operations for each tablet 
> until all files are loaded.
>  ** Read the loaded files for the tablet
>  ** Read the files to load for the range
>  ** For any files not loaded, send an async load message to the tablet server
> The above algorithm can just keep scanning the metadata table and sending 
> async load messages until the bulk import is complete.  Since the load 
> messages are async, the bulk load of a large number of files could 
> potentially be very fast.
> The bulk load operation can easily handle the case of tablets splitting 
> during the operation by matching a single range in the file to multiple 
> tablets.  However, attempting to handle merges would be a lot trickier.  It 
> would probably be simplest to fail the operation if a merge is detected.  The 
> nice thing is that this can be done in a very clean way.  Once the bulk 
> import operation has the table lock, merges cannot happen.  So after getting 
> the table lock, the bulk import operation can ensure all splits in the file 
> exist in the table.  The operation can abort if the condition is not met 
> before doing any work.  If this condition is not met, it indicates a merge 
> happened between generating the mapping file and doing the bulk import.
> Hopefully the mapping file plus the algorithm that sends async load messages 
> can dramatically speed up bulk import operations.  This may lessen the need 
> for other things like prioritizing bulk import.  To measure this, it would be 
> very useful to create a bulk import performance test that can create many 
> files with very little data and measure the time it takes to load them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ACCUMULO-4697) tablet server could kick off too many idle compactions

2018-02-12 Thread Mark Owens (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Owens reassigned ACCUMULO-4697:


Assignee: Mark Owens

> tablet server could kick off too many idle compactions
> --
>
> Key: ACCUMULO-4697
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4697
> Project: Accumulo
>  Issue Type: Bug
>  Components: tserver
>Reporter: Adam Fuchs
>Assignee: Mark Owens
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Tablet.initiateMajorCompaction() always returns false, but we check the 
> return value when accounting for the number of idle major compactions started 
> in TabletServer.MajorCompactor.run() and continue to kick off idle 
> compactions on every tablet.
> DefaultCompactionStrategy doesn't do anything special for idle compactions, 
> so this will only be an issue with a custom CompactionStrategy.
> My guess would be the best thing to do would be to get rid of idle 
> compactions and get rid of the return value on initiateMajorCompaction().
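
For illustration, a hypothetical, self-contained sketch of the accounting flaw 
described above (this is not the actual TabletServer.MajorCompactor source and the 
names are made up): because the method always returns false, the counter never 
increments and the limit on idle compactions never takes effect.
{code:java}
import java.util.List;

class IdleCompactionAccountingSketch {
  // Stand-in for Tablet; per the report the real initiateMajorCompaction()
  // always returns false.
  interface FakeTablet {
    boolean initiateMajorCompaction();
  }

  static void run(List<FakeTablet> onlineTablets, int maxIdleCompactions) {
    int startedIdle = 0;
    for (FakeTablet tablet : onlineTablets) {
      if (startedIdle >= maxIdleCompactions)
        break;                                // limit is never reached
      if (tablet.initiateMajorCompaction()) { // always false per the report
        startedIdle++;                        // so this never executes ...
      }                                       // ... and every tablet is visited
    }
  }
}
{code}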



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ACCUMULO-4813) Accepting mapping file for bulk import

2018-02-12 Thread Keith Turner (JIRA)
Keith Turner created ACCUMULO-4813:
--

 Summary: Accepting mapping file for bulk import
 Key: ACCUMULO-4813
 URL: https://issues.apache.org/jira/browse/ACCUMULO-4813
 Project: Accumulo
  Issue Type: Sub-task
Reporter: Keith Turner
 Fix For: 2.0.0


During bulk import, inspecting files to determine where they go is expensive 
and slow.  In order to spread the cost, Accumulo has an internal mechanism to 
spread the work of inspecting files to random tablet servers.  Because this 
internal process takes time and consumes resources on the cluster, users want 
control over it.  The best way to give this control may be to externalize it by 
allowing bulk imports to have a mapping file.  This mapping file would specify 
the ranges where files should be loaded.  If Accumulo provided an API to help 
produce this file, then that work could be done in MapReduce or Spark.  This 
would give users all the control they want over when and where this computation 
is done.  This would naturally fit in the process used to create the bulk 
files. 

To make bulk import fast, this mapping file should have the following properties:
 * Key in file is a range
 * Value in file is a list of files
 * Ranges are non overlapping
 * File is sorted by range/key
 * Has a mapping for every non-empty file in the bulk import directory.

If Accumulo provides APIs to do the following operations, then producing the 
file could be written as a map/reduce job (see the sketch after this list).
 * For a given file, produce a list of ranges
 * Merge (range, list of files) pairs
 * Serialize (range, list of files) pairs
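
A rough sketch of what such helper APIs might look like; the interface and method 
names below are hypothetical and only illustrate the shape an API could take so the 
mapping could be produced in a MapReduce or Spark job:
{code:java}
import java.util.List;
import java.util.SortedMap;

import org.apache.accumulo.core.data.Range;

// Hypothetical helper API sketched from the three operations listed above;
// none of these names exist in Accumulo.
public interface LoadMappingHelper {

  // For a given bulk file, produce the list of ranges it has data in.
  List<Range> computeRanges(String rfilePath);

  // Merge two partial range -> files mappings, keeping ranges sorted and
  // non-overlapping.
  SortedMap<Range,List<String>> merge(SortedMap<Range,List<String>> a,
      SortedMap<Range,List<String>> b);

  // Serialize the final mapping (e.g. to the proposed ".lm" file) in the
  // bulk import directory.
  void serialize(SortedMap<Range,List<String>> mapping, String outputPath);
}
{code}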

With a mapping file, the bulk import algorithm could be written as follows.  
This could all be executed in the master with no need to run inspection tasks 
on random tablet servers (a rough sketch of the load loop follows the steps below).
 * Sanity check the mapping file
 ** Ensure it is in sorted order
 ** Ensure ranges are non-overlapping
 ** Ensure each file in the bulk directory has at least one entry in the mapping file
 ** Ensure all splits in the mapping file exist in the table.
 * Since the file is sorted, the master can do a merged read of the file and the 
metadata table, looping over the following operations for each tablet until all 
files are loaded.
 ** Read the loaded files for the tablet
 ** Read the files to load for the range
 ** For any files not loaded, send an async load message to the tablet server
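
A rough sketch of that load loop, assuming the mapping is available as a SortedMap 
and that the metadata scan and the async load RPC are supplied by the caller; all 
names here are placeholders, not Accumulo APIs:
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

import org.apache.accumulo.core.data.Range;

class LoadLoopSketch {
  // Keep doing merged passes over the mapping and the metadata view, sending
  // async load messages for files not yet loaded, until everything is loaded.
  static void driveLoads(SortedMap<Range,List<String>> loadMapping,
      Supplier<Map<Range,Set<String>>> scanLoadedFilesFromMetadata,
      BiConsumer<Range,String> sendAsyncLoadMessage) {
    boolean allLoaded = false;
    while (!allLoaded) {
      allLoaded = true;
      // Re-scan the metadata each pass to pick up files loaded since last time.
      Map<Range,Set<String>> loadedByTablet = scanLoadedFilesFromMetadata.get();
      for (Map.Entry<Range,List<String>> entry : loadMapping.entrySet()) {
        Set<String> loaded =
            loadedByTablet.getOrDefault(entry.getKey(), Collections.emptySet());
        for (String file : entry.getValue()) {
          if (!loaded.contains(file)) {
            allLoaded = false;
            sendAsyncLoadMessage.accept(entry.getKey(), file);
          }
        }
      }
    }
  }
}
{code}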

The above algorithm can just keep scanning the metadata table and sending async 
load messages until the bulk import is complete.  Since the load messages are 
async, the bulk load of a large number of files could potentially be very 
fast.

The bulk load operation can easily handle the case of tablets splitting during 
the operation by matching a single range in the file to multiple tablets.  
However, attempting to handle merges would be a lot trickier.  It would 
probably be simplest to fail the operation if a merge is detected.  The nice 
thing is that this can be done in a very clean way.  Once the bulk import 
operation has the table lock, merges cannot happen.  So after getting the 
table lock, the bulk import operation can ensure all splits in the file exist 
in the table.  The operation can abort if the condition is not met before 
doing any work.  If this condition is not met, it indicates a merge happened 
between generating the mapping file and doing the bulk import.
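
A minimal sketch of that split check, assuming the mapping's range boundaries 
correspond to table splits; listSplits is an existing public API call, while the 
class and mapping types are placeholders:
{code:java}
import java.util.List;
import java.util.SortedMap;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.io.Text;

class SplitCheckSketch {
  // Abort the bulk import if any end row used in the mapping is no longer a
  // split in the table, which would indicate a merge happened after the
  // mapping file was generated.
  static void ensureSplitsExist(Connector conn, String table,
      SortedMap<Range,List<String>> loadMapping) throws Exception {
    TreeSet<Text> splits = new TreeSet<>(conn.tableOperations().listSplits(table));
    for (Range range : loadMapping.keySet()) {
      if (range.getEndKey() == null)
        continue; // last tablet, end row is +infinity
      Text endRow = range.getEndKey().getRow();
      if (!splits.contains(endRow))
        throw new IllegalStateException(
            "Split " + endRow + " missing; a merge may have occurred");
    }
  }
}
{code}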

Hopefully the mapping file plus the algorithm that sends async load messages 
can dramatically speed up bulk import operations.  This may lessen the need for 
other things like prioritizing bulk import.  To measure this, it would be very 
useful to create a bulk import performance test that can create many files with 
very little data and measure the time it takes to load them.
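
A minimal timing sketch for such a test, assuming the many small bulk files have 
already been written to bulkDir and that failureDir is an empty directory; it simply 
times the existing importDirectory call, and the class and parameter names are 
placeholders:
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.Connector;

class BulkImportTimingSketch {
  // Returns how long the bulk import of the pre-generated files took, in ms.
  static long timeImportMillis(Connector conn, String table, String bulkDir,
      String failureDir) throws Exception {
    long start = System.nanoTime();
    conn.tableOperations().importDirectory(table, bulkDir, failureDir, false);
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }
}
{code}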



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)