[GitHub] [accumulo] keith-turner commented on a diff in pull request #3180: Enable users to provide per-volume Hadoop Filesystem overrides

2023-02-05 Thread via GitHub


keith-turner commented on code in PR #3180:
URL: https://github.com/apache/accumulo/pull/3180#discussion_r1096872490


##
server/base/src/main/java/org/apache/accumulo/server/fs/VolumeManagerImpl.java:
##
@@ -353,6 +362,38 @@ public short getDefaultReplication(Path path) {
     return getFileSystemByPath(path).getDefaultReplication(path);
   }
 
+  /**
+   * The Hadoop Configuration object does not currently allow for duplicate properties to be set
+   * in a single Configuration for different FileSystem URIs. Here we will look for properties in
+   * the Accumulo configuration of the form:
+   *
+   * general.custom.volume-uri.hdfs-property
+   *
+   * We will use these properties to return a new Configuration object that can be used with the
+   * FileSystem URI.
+   *
+   * @param conf AccumuloConfiguration object
+   * @param hadoopConf Hadoop Configuration object
+   * @param filesystemURI Volume Filesystem URI
+   * @return Hadoop Configuration with custom overrides for this FileSystem
+   */
+  private static Configuration getVolumeManagerConfiguration(AccumuloConfiguration conf,
+      final Configuration hadoopConf, final String filesystemURI) {
+    final Configuration volumeConfig = new Configuration(hadoopConf);
+    final Map<String,String> customProps =
+        conf.getAllPropertiesWithPrefixStripped(Property.GENERAL_ARBITRARY_PROP_PREFIX);
+    customProps.forEach((key, value) -> {
+      if (key.startsWith(filesystemURI)) {
+        String property = key.substring(filesystemURI.length() + 1);
+        log.debug("Overriding property {} to {} for volume {}", property, value, filesystemURI);

Review Comment:
   Really like the logging.  Does this logging only happen once per volume on a server?  If it's only once, maybe it could be promoted to info.
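   To illustrate the mechanism the diff above implements, here is a hypothetical standalone sketch of the override-extraction step: properties whose keys (already stripped of the `general.custom.` prefix) begin with a volume URI are matched, and the Hadoop property name is recovered from the remainder of the key. Class and method names here are illustrative, not the PR's API.

```java
import java.util.HashMap;
import java.util.Map;

public class VolumeOverrides {

  /**
   * Collect per-volume Hadoop overrides from a flat map of custom properties.
   * A key of the form "<volume-uri>.<hdfs-property>" matches when it starts
   * with volumeUri; the remainder (past the '.' separator) is the property name.
   */
  public static Map<String,String> overridesFor(Map<String,String> customProps,
      String volumeUri) {
    Map<String,String> overrides = new HashMap<>();
    customProps.forEach((key, value) -> {
      if (key.startsWith(volumeUri)) {
        // skip the URI and the '.' separator to recover the Hadoop property name
        String property = key.substring(volumeUri.length() + 1);
        overrides.put(property, value);
      }
    });
    return overrides;
  }
}
```

   Because each volume gets its own copied Configuration, an override for one volume URI never leaks into another volume's FileSystem.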



##
server/base/src/test/java/org/apache/accumulo/server/fs/VolumeManagerImplTest.java:
##
@@ -104,4 +111,88 @@ public Scope getChooserScope() {
       assertThrows(RuntimeException.class, () -> vm.choose(chooserEnv, volumes));
     }
   }
+
+  @Test
+  public void testConfigurationOverrides() throws Exception {
+
+    final String vol1 = "file://127.0.0.1/vol1/";
+    final String vol2 = "file://localhost/vol2/";
+    final String vol3 = "hdfs://127.0.0.1/accumulo";
+    final String vol4 = "hdfs://localhost/accumulo";
+
+    ConfigurationCopy conf = new ConfigurationCopy();
+    conf.set(Property.INSTANCE_VOLUMES, String.join(",", vol1, vol2, vol3, vol4));
+    conf.set(Property.GENERAL_VOLUME_CHOOSER, Property.GENERAL_VOLUME_CHOOSER.getDefaultValue());
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol1 + "."
+        + HdfsClientConfigKeys.HedgedRead.THREADPOOL_SIZE_KEY, "10");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol1 + "."

Review Comment:
   Could set a third override on one volume that no others set, to ensure there is no interference in that case.  For example, could set a third blocksize prop for only vol1.



##
server/base/src/test/java/org/apache/accumulo/server/fs/VolumeManagerImplTest.java:
##
@@ -104,4 +111,88 @@ public Scope getChooserScope() {
       assertThrows(RuntimeException.class, () -> vm.choose(chooserEnv, volumes));
     }
   }
+
+  @Test
+  public void testConfigurationOverrides() throws Exception {
+
+    final String vol1 = "file://127.0.0.1/vol1/";
+    final String vol2 = "file://localhost/vol2/";
+    final String vol3 = "hdfs://127.0.0.1/accumulo";
+    final String vol4 = "hdfs://localhost/accumulo";
+
+    ConfigurationCopy conf = new ConfigurationCopy();
+    conf.set(Property.INSTANCE_VOLUMES, String.join(",", vol1, vol2, vol3, vol4));
+    conf.set(Property.GENERAL_VOLUME_CHOOSER, Property.GENERAL_VOLUME_CHOOSER.getDefaultValue());
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol1 + "."
+        + HdfsClientConfigKeys.HedgedRead.THREADPOOL_SIZE_KEY, "10");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol1 + "."
+        + HdfsClientConfigKeys.DFS_CLIENT_CACHE_DROP_BEHIND_READS, "true");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol2 + "."
+        + HdfsClientConfigKeys.HedgedRead.THREADPOOL_SIZE_KEY, "20");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol2 + "."
+        + HdfsClientConfigKeys.DFS_CLIENT_CACHE_DROP_BEHIND_READS, "false");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol3 + "."
+        + HdfsClientConfigKeys.HedgedRead.THREADPOOL_SIZE_KEY, "30");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol3 + "."
+        + HdfsClientConfigKeys.DFS_CLIENT_CACHE_DROP_BEHIND_READS, "TRUE");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol4 + "."
+        + HdfsClientConfigKeys.HedgedRead.THREADPOOL_SIZE_KEY, "40");
+    conf.set(Property.GENERAL_ARBITRARY_PROP_PREFIX.getKey() + vol4 + "."
+        + HdfsClientConfigKeys.DFS_CLIENT_CACHE_DROP_BEHIND_READS, "FALSE");
+
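   In accumulo.properties form, the overrides this test constructs would look something like the following (the `general.custom.` prefix is what `GENERAL_ARBITRARY_PROP_PREFIX` resolves to; volume URIs and values are the test's, and the exact key spellings are assumed from the Hadoop client config constants):

```properties
general.custom.file://127.0.0.1/vol1/.dfs.client.hedged.read.threadpool.size=10
general.custom.file://127.0.0.1/vol1/.dfs.client.cache.drop.behind.reads=true
general.custom.hdfs://localhost/accumulo.dfs.client.hedged.read.threadpool.size=40
general.custom.hdfs://localhost/accumulo.dfs.client.cache.drop.behind.reads=FALSE
```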

[GitHub] [accumulo] keith-turner commented on pull request #3143: Support scanning offline tables using scan servers

2023-02-05 Thread via GitHub


keith-turner commented on PR #3143:
URL: https://github.com/apache/accumulo/pull/3143#issuecomment-1418339001

   After merging main, which contained the fixes for #3168, I spun up [this 
test](https://gist.github.com/keith-turner/f2159111b025e600a6e0abbaba1d92f3) 
again and it's been running for 72hrs w/o issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (ACCUMULO-1068) Improve Javadoc for Instance Operations

2023-02-05 Thread Christopher Tubbs (Jira)


 [ https://issues.apache.org/jira/browse/ACCUMULO-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs resolved ACCUMULO-1068.
-
Resolution: Duplicate

Duplicated by https://github.com/apache/accumulo/pull/3188

> Improve Javadoc for Instance Operations
> ---
>
> Key: ACCUMULO-1068
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1068
> Project: Accumulo
>  Issue Type: Improvement
>  Components: client
>Reporter: Jim Klucar
>Priority: Minor
>  Labels: newbie
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The difference between InstanceOperations.getSiteConfiguration and
> getSystemConfiguration could be highlighted more clearly in the Javadoc, as
> could how they relate to getProperty and setProperty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [accumulo] cshannon commented on issue #608: Garbage Collector removed referenced files

2023-02-05 Thread via GitHub


cshannon commented on issue #608:
URL: https://github.com/apache/accumulo/issues/608#issuecomment-1418236284

   I did some more testing this past Friday/Saturday on this and still couldn't get any errors to show up. I did, however, see in the logs the iterator actually detect inconsistencies in the metadata table and fix itself (just like I said in my previous comment). I talked to @keith-turner about this a bit and he suggested I try testing against 1.10 (and not just 2.1 and main). I also realized I was mostly testing using the mini accumulo cluster, which does not use HDFS. So today I spent some time re-running my tests against 1.10 using Uno (so it's a real cluster with HDFS), and I still didn't get any errors; everything worked as it should.
   
   One idea I had that could possibly help the problem would be to require garbage collection to run more than once and compare results before actually removing files. For example, GC could run the scan to get the file candidates with references multiple times in the same GC run, then compare results and take a superset, or fail if inconsistent.  Another option is to require more than one GC run before actually deleting. Something like: when GC runs and comes up with the files it is about to delete, we just mark them (in metadata, probably) as to-be-deleted, and a subsequent run does the actual delete if it detects the file as marked in a previous run, so we know we had multiple runs where we thought we should delete. We could even make it configurable to require X positive hits to do a deletion, or have a time delay, etc. 
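   The "require X positive hits" idea could be sketched roughly as below: a file is only released for deletion after it has shown up as a candidate in N consecutive GC runs, and a file that stops being a candidate (a reference reappeared) loses its accumulated hits. This is a hypothetical illustration, not Accumulo's GC code; all names are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeferredGc {

  private final int requiredHits;
  // hit counts for files seen as delete candidates in consecutive runs
  private final Map<String,Integer> hits = new HashMap<>();

  public DeferredGc(int requiredHits) {
    this.requiredHits = requiredHits;
  }

  /** One GC run: takes this run's candidates, returns files safe to delete. */
  public Set<String> run(Set<String> candidates) {
    // forget files that stopped being candidates (a reference reappeared)
    hits.keySet().retainAll(candidates);
    Set<String> toDelete = new HashSet<>();
    for (String file : candidates) {
      int seen = hits.merge(file, 1, Integer::sum);
      if (seen >= requiredHits) {
        toDelete.add(file);
        hits.remove(file); // deleted, stop tracking
      }
    }
    return toDelete;
  }
}
```

   With `requiredHits = 2`, the first run only marks candidates and the second run deletes the ones that persisted, which matches the mark-then-delete flow described above.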
   
   Running the scan more than once to detect the file references would only actually help if the problem was transient and non-deterministic, isolated to a single scan, and wouldn't just happen in a future scan, which is hard to say because we don't know the actual problem. There's also a bit of a chicken-and-egg problem: if we are writing the GC result back to metadata but metadata scans are the issue and are inconsistent, that could itself be a problem. Also, this delayed-deletion behavior can sort of already be accomplished by using HDFS trash, from which files can be recovered, and there is already #3140 to help make the HDFS trash option better; but having it built in to GC could also be nice if it would help prevent early deletion in the first place.

