Re: FileSystem API (was: Slack call notes)
> some abstraction, wherever possible. Though, I'll admit, we're nowhere
> close to where we'd want to be to be fully decoupled from Hadoop.
>
> I've also been looking a lot at our VolumeManager code lately, to try
> to improve it a bit, and to create better abstractions for Volumes,
> that could aid future work in this area.
>
> But, I haven't directly been working on new FileSystem API
> abstraction... just trying to lay some groundwork for that possibility
> in future.
>
> It'd be nice to get to a point where we have a Hadoop-specific
> implementation isolated to a jar that can be swapped out at runtime
> for other file system implementations, as needed. I see that as a
> somewhat long-way off.
>
> On Wed, Mar 25, 2020 at 2:08 PM wrote:
> >
> > I couldn't make the call today, but am curious if anyone has
> > previously brought up creating a FileSystem API for Accumulo so that
> > we could use implementations other than Hadoop. I realize that Hadoop
> > provides implementations for things other than HDFS but that doesn't
> > necessarily mean that all filesystem implementations are covered.
> >
> > -----Original Message-----
> > From: Christopher
> > Sent: Wednesday, March 25, 2020 1:45 PM
> > To: accumulo-dev
> > Subject: Slack call notes
> >
> > Several committers/contributors in the community joined a call in
> > Slack on Wednesday, at 1130-1230, New York (Eastern) time. Here are my
> > notes of the call. Please feel free to add to them.
> >
> > I shared the overall philosophy and backstory to some of the script
> > improvements in 2.x to help guide current/future work on the scripts.
> >
> > * bin/accumulo is inspired by old jpackage.org standards which are
> >   still in use in RPM macros for Java packaging in Fedora/RHEL/etc.
> >   The key idea is that scripts are simple... set up environment
> >   (class path, etc.), locate java, and exec a single process with the
> >   provided args.
> > * bin/accumulo-service is inspired by old SysVInit scripts for
> >   start/stop/restart/status of a single service
> > * behavior of bin/accumulo and bin/accumulo-service can be
> >   manipulated through launch environment
> > * bin/accumulo-cluster uses bin/accumulo-service, and is provided as
> >   a simple, out-of-the-box cluster management tool
> > * bin/accumulo-cluster and bin/accumulo-service are replaceable; they
> >   are useful for out-of-the-box, but one would expect them to be
> >   unnecessary if using systemd, or a vendor-provided cluster
> >   management system
> > * we discussed possibly moving bin/accumulo-cluster and
> >   bin/accumulo-service to contrib/ in the tarball, or some subdir of
> >   bin/, but it was suggested to not make too many disruptive changes
> >   there
> > * we discussed the possibility of adding a config file for
> >   bin/accumulo-cluster (also mentioned on
> >   https://github.com/apache/accumulo/pull/1568)
> > * we discussed the need to document the intent/purpose/scope of the
> >   scripts in comments inside the scripts themselves
> > * Ed Coleman asked if it'd be good to document a systemd example; I
> >   suggested it might make for a good blog post (perhaps by the person
> >   who wrote the systemd unit files for Fluo Muchos)
> >
> > Keith Turner discussed his development efforts with regard to
> > enabling more controls over compactions.
> >
> > * one main idea was to keep configuration/API for data separate from
> >   that for execution
> > * data is a concern for application owners, whereas execution
> >   involves system admins (resource contention, etc.)
> > * he will submit a PR for review when ready
> > * he also suggested another call to go over the PR
> >
> > Billie Rinaldi discussed better support for Azure Data Lake Storage
> > Gen2 (ADLSv2).
> >
> > * maintaining a fork for experimenting, and working on reliably
> >   testing issues involving WALs
> > * did not recommend using ADLSv2 with WALs, but that we should still
> >   support it
> > * might need to implement a custom log closer to better support it
> >
> > Mike Miller brought up the idea of eliminating more static internal
> > state.
> >
> > * ServerConfigurationFactory might be improved in this regard, with
> >   some additional ZK cleanup
> > * Other ZK cleanup might help elsewhere (such as ZooCache)
> > * I suggested tablet location cache might also benefit from being
> >   bound to an AccumuloClient lifecycle (or a dedicated opaque object
> >   that could be shared across AccumuloClient instances with its own
> >   user-managed lifecycle)
> >
> > Please add anything I might have missed (or got wrong) in response to
> > this post.
Re: FileSystem API (was: Slack call notes)
I'm not familiar with it, but the website says it can replace HDFS.
There appears to be an "HDFS Gateway"
(https://github.com/minio/minio/blob/master/docs/gateway/hdfs.md) that
might be useful. At a glance, it looks like no abstraction is needed in
Accumulo code for it... you just run the gateway and Accumulo believes
it is using HDFS, but it is really using MinIO instead. There also might
be a Hadoop FileSystem implementation that would let us use it directly
without a Gateway, but I didn't have any luck with a quick search for
one. In either case, there shouldn't need to be any changes to Accumulo
itself.

If changes to Accumulo do become necessary (or desired), I'd be
interested in collaborating on that part. If it's just a matter of
trying it with the Gateway or existing Hadoop FileSystem implementation,
I'd also be interested in testing any step-by-step HOWTO guides somebody
might want to write as a blog post.

On Fri, Apr 24, 2020 at 11:20 AM Mike Miller wrote:
>
> I have no experience with MinIO but would be interested in learning more
> and collaborating.
Re: FileSystem API (was: Slack call notes)
I have no experience with MinIO but would be interested in learning more
and collaborating.

On Fri, Apr 24, 2020 at 10:57 AM Michael Wall wrote:

> Resurrecting this thread on the File System API. I have been thinking
> about giving Minio [1] a try for both WALs and RFiles. Seems to me like
> step one is to abstract internal interfaces for both targeted against 2.1?
> Couple of questions
>
> 1 - Anyone have experience with minio?
> 2 - Anyone interested in collaborating? Thinking anything from providing
> input to helping to test once we get a prototype to actually doing some
> development.
>
> Thanks, hope everyone is staying safe and healthy.
>
> [1] - https://min.io/
Re: FileSystem API (was: Slack call notes)
Resurrecting this thread on the File System API. I have been thinking
about giving MinIO [1] a try for both WALs and RFiles. Seems to me like
step one is to abstract internal interfaces for both, targeted against
2.1? A couple of questions:

1 - Does anyone have experience with MinIO?
2 - Is anyone interested in collaborating? I'm thinking anything from
providing input, to helping to test once we get a prototype, to actually
doing some development.

Thanks, hope everyone is staying safe and healthy.

[1] - https://min.io/

On Wed, Mar 25, 2020 at 6:08 PM Christopher wrote:

> Only 705 across 280 files, if you exclude Text, though :)
>
> grep -rP 'org[.]apache[.]hadoop(?![.]io[.]Text)' --include='*.java' *
> | grep -v test/ | wc -l
Re: FileSystem API (was: Slack call notes)
Only 705 across 280 files, if you exclude Text, though :)

grep -rP 'org[.]apache[.]hadoop(?![.]io[.]Text)' --include='*.java' * | grep -v test/ | wc -l

On Wed, Mar 25, 2020 at 3:34 PM Mike Miller wrote:
>
> I think we have come a long way removing any external types from the API,
> for reasons other than de-coupling from Hadoop. While we don't have many
> dependencies on the other components of Hadoop, we are still very tightly
> coupled to HDFS.
> For example, some quick grep'ing of the code shows:
> "grep -r "import org.apache.hadoop" --include=*.java * | wc -l"
> 1734
> Without tests it is slightly more feasible...
> grep -r "import org.apache.hadoop" --include=*.java * | grep -v "test" | wc -l
> 858
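The grep -P expression in the message above relies on a negative lookahead to skip org.apache.hadoop.io.Text. As a minimal sketch of what that pattern accepts and rejects, the same expression can be tried in Java; the class name HadoopReferenceFilter and the sample lines are made up for illustration and are not part of Accumulo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Accumulo.
final class HadoopReferenceFilter {
  // Same expression as the grep -P command: match "org.apache.hadoop"
  // unless it is immediately followed by ".io.Text" (negative lookahead).
  static final Pattern HADOOP_NOT_TEXT =
      Pattern.compile("org[.]apache[.]hadoop(?![.]io[.]Text)");

  // True if the line contains a Hadoop reference other than io.Text.
  static boolean matches(String line) {
    return HADOOP_NOT_TEXT.matcher(line).find();
  }

  // Counts matching lines, like piping the grep output through "wc -l".
  static long count(List<String> lines) {
    return lines.stream().filter(HadoopReferenceFilter::matches).count();
  }

  public static void main(String[] args) {
    // Made-up sample input, not taken from the Accumulo source tree.
    List<String> sample = Arrays.asList(
        "import org.apache.hadoop.fs.Path;",
        "import org.apache.hadoop.io.Text;",
        "import org.apache.hadoop.conf.Configuration;",
        "import java.util.List;");
    System.out.println(count(sample)); // prints 2
  }
}
```

Note that the lookahead also skips any name that merely starts with io.Text, so the count is an approximation in the same way the grep is.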
Re: FileSystem API (was: Slack call notes)
I think we have come a long way removing any external types from the API,
for reasons other than de-coupling from Hadoop. While we don't have many
dependencies on the other components of Hadoop, we are still very tightly
coupled to HDFS.

For example, some quick grep'ing of the code shows:

grep -r "import org.apache.hadoop" --include=*.java * | wc -l
1734

Without tests it is slightly more feasible...

grep -r "import org.apache.hadoop" --include=*.java * | grep -v "test" | wc -l
858

On Wed, Mar 25, 2020 at 3:19 PM David Mollitor wrote:

> Hello,
>
> I too have been thinking about this for a pet project. There is already
> Apache Commons VFS that, with some investment, could probably serve all
> these requirements.
Re: FileSystem API (was: Slack call notes)
Hello,

I too have been thinking about this for a pet project. There is already Apache Commons VFS that, with some investment, could probably serve all these requirements.

On Wed, Mar 25, 2020, 3:16 PM Christopher wrote:

> (Forking this thread, as it's a distinct topic)
>
> I've thought about it. The idea has driven me to try to reduce our use
> of Hadoop-specific code, and to isolate Hadoop-specific stuff behind
> some abstraction, wherever possible. Though, I'll admit, we're nowhere
> close to where we'd want to be to be fully decoupled from Hadoop.
>
> I've also been looking a lot at our VolumeManager code lately, to try
> to improve it a bit, and to create better abstractions for Volumes,
> that could aid future work in this area.
>
> But, I haven't directly been working on a new FileSystem API
> abstraction... just trying to lay some groundwork for that possibility
> in future.
>
> It'd be nice to get to a point where we have a Hadoop-specific
> implementation isolated to a jar that can be swapped out at runtime
> for other file system implementations, as needed. I see that as a
> somewhat long way off.
Re: Slack call notes
Replied in a new thread.

On Wed, Mar 25, 2020 at 2:08 PM wrote:

> I couldn't make the call today, but am curious if anyone has previously
> brought up creating a FileSystem API for Accumulo so that we could use
> implementations other than Hadoop. I realize that Hadoop provides
> implementations for things other than HDFS but that doesn't necessarily mean
> that all filesystem implementations are covered.
FileSystem API (was: Slack call notes)
(Forking this thread, as it's a distinct topic)

I've thought about it. The idea has driven me to try to reduce our use of Hadoop-specific code, and to isolate Hadoop-specific stuff behind some abstraction, wherever possible. Though, I'll admit, we're nowhere close to where we'd want to be to be fully decoupled from Hadoop.

I've also been looking a lot at our VolumeManager code lately, to try to improve it a bit, and to create better abstractions for Volumes, that could aid future work in this area.

But, I haven't directly been working on a new FileSystem API abstraction... just trying to lay some groundwork for that possibility in future.

It'd be nice to get to a point where we have a Hadoop-specific implementation isolated to a jar that can be swapped out at runtime for other file system implementations, as needed. I see that as a somewhat long way off.

On Wed, Mar 25, 2020 at 2:08 PM wrote:

> I couldn't make the call today, but am curious if anyone has previously
> brought up creating a FileSystem API for Accumulo so that we could use
> implementations other than Hadoop. I realize that Hadoop provides
> implementations for things other than HDFS but that doesn't necessarily mean
> that all filesystem implementations are covered.
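To make the idea of isolating a Hadoop-specific implementation behind a swappable abstraction more concrete, here is a minimal Java sketch. It is not Accumulo's VolumeManager API: every name in it (VolumeSketch, Volume, LocalVolume, register, forUri) is hypothetical, and the local-disk implementation built on java.nio.file is only a stand-in for where a Hadoop-backed implementation, shipped in its own jar, might plug in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from Accumulo or Hadoop.
class VolumeSketch {

  // Minimal file-access contract that callers would depend on instead of Hadoop types.
  interface Volume {
    OutputStream create(URI file) throws IOException;
    InputStream open(URI file) throws IOException;
    boolean exists(URI file) throws IOException;
    boolean delete(URI file) throws IOException;
  }

  // Stand-in implementation backed by the local file system (assumes file: URIs).
  static final class LocalVolume implements Volume {
    private static Path toPath(URI file) {
      return Paths.get(file);
    }

    @Override
    public OutputStream create(URI file) throws IOException {
      return Files.newOutputStream(toPath(file));
    }

    @Override
    public InputStream open(URI file) throws IOException {
      return Files.newInputStream(toPath(file));
    }

    @Override
    public boolean exists(URI file) {
      return Files.exists(toPath(file));
    }

    @Override
    public boolean delete(URI file) throws IOException {
      return Files.deleteIfExists(toPath(file));
    }
  }

  // Implementations are looked up by URI scheme, so callers never name a concrete class.
  private static final Map<String, Volume> VOLUMES = new ConcurrentHashMap<>();

  static void register(String scheme, Volume volume) {
    VOLUMES.put(scheme, volume);
  }

  static Volume forUri(URI file) {
    String scheme = file.getScheme();
    Volume volume = scheme == null ? null : VOLUMES.get(scheme);
    if (volume == null) {
      throw new IllegalArgumentException("No volume registered for scheme: " + scheme);
    }
    return volume;
  }

  public static void main(String[] args) throws IOException {
    register("file", new LocalVolume());

    URI file = Files.createTempFile("volume-sketch", ".txt").toUri();
    Volume volume = forUri(file);

    try (OutputStream out = volume.create(file)) {
      out.write("hello".getBytes(StandardCharsets.UTF_8));
    }
    try (InputStream in = volume.open(file)) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
    volume.delete(file);
  }
}
```

The explicit register call keeps the sketch in one file. A real design might discover implementations on the class path instead (for example with java.util.ServiceLoader), which is closer to the "swap the jar at runtime" goal but needs packaging that a single-file example cannot show.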
RE: Slack call notes
I couldn't make the call today, but am curious if anyone has previously brought up creating a FileSystem API for Accumulo so that we could use implementations other than Hadoop. I realize that Hadoop provides implementations for things other than HDFS but that doesn't necessarily mean that all filesystem implementations are covered.

-----Original Message-----
From: Christopher
Sent: Wednesday, March 25, 2020 1:45 PM
To: accumulo-dev
Subject: Slack call notes
Slack call notes
Several committers/contributors in the community joined a call in Slack on Wednesday, at 1130-1230, New York (Eastern) time. Here are my notes of the call. Please feel free to add to them.

I shared the overall philosophy and backstory to some of the script improvements in 2.x to help guide current/future work on the scripts.

* bin/accumulo is inspired by old jpackage.org standards which are still in use in RPM macros for Java packaging in Fedora/RHEL/etc. The key idea is that scripts are simple... set up environment (class path, etc.), locate java, and exec a single process with the provided args.
* bin/accumulo-service is inspired by old SysVInit scripts for start/stop/restart/status of a single service
* behavior of bin/accumulo and bin/accumulo-service can be manipulated through launch environment
* bin/accumulo-cluster uses bin/accumulo-service, and is provided as a simple, out-of-the-box cluster management tool
* bin/accumulo-cluster and bin/accumulo-service are replaceable; they are useful out of the box, but one would expect them to be unnecessary if using systemd, or a vendor-provided cluster management system
* we discussed possibly moving bin/accumulo-cluster and bin/accumulo-service to contrib/ in the tarball, or some subdir of bin/, but it was suggested to not make too many disruptive changes there
* we discussed the possibility of adding a config file for bin/accumulo-cluster (also mentioned on https://github.com/apache/accumulo/pull/1568)
* we discussed the need to document the intent/purpose/scope of the scripts in comments inside the scripts themselves
* Ed Coleman asked if it'd be good to document a systemd example; I suggested it might make for a good blog post (perhaps by the person who wrote the systemd unit files for Fluo Muchos)

Keith Turner discussed his development efforts with regard to enabling more controls over compactions.

* one main idea was to keep configuration/API for data separate from that for execution
* data is of concern to application owners, whereas execution involves system admins (resource contention, etc.)
* he will submit a PR for review when ready
* he also suggested another call to go over the PR

Billie Rinaldi discussed better support for Azure Data Lake Storage Gen2 (ADLSv2).

* maintaining a fork for experimenting, and working on reliably testing issues involving WALs
* did not recommend using ADLSv2 with WALs, but that we should still support it
* might need to implement a custom log closer to better support it

Mike Miller brought up the idea of eliminating more static internal state.

* ServerConfigurationFactory might be improved in this regard, with some additional ZK cleanup
* Other ZK cleanup might help elsewhere (such as ZooCache)
* I suggested tablet location cache might also benefit from being bound to an AccumuloClient lifecycle (or a dedicated opaque object that could be shared across AccumuloClient instances with its own user-managed lifecycle)

Please add anything I might have missed (or got wrong) in response to this post.
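The last suggestion, binding a cache to a client lifecycle or to a shared object with its own user-managed lifecycle, might look roughly like the Java sketch below. This is an illustration only: LifecycleSketch, LocationCache and Client are hypothetical names, not Accumulo classes, and the lookup function is a placeholder for whatever would really resolve a tablet location.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; these classes are not part of Accumulo.
class LifecycleSketch {

  // Cache with an explicit lifecycle, replacing a static, process-wide map.
  static final class LocationCache implements AutoCloseable {
    private final Map<String, String> locations = new ConcurrentHashMap<>();
    private volatile boolean closed;

    String locate(String tablet, Function<String, String> lookup) {
      if (closed) {
        throw new IllegalStateException("cache is closed");
      }
      return locations.computeIfAbsent(tablet, lookup);
    }

    int size() {
      return locations.size();
    }

    @Override
    public void close() {
      closed = true;
      locations.clear();
    }
  }

  // A client either owns its cache (closed with the client) or borrows a shared one.
  static final class Client implements AutoCloseable {
    private final LocationCache cache;
    private final boolean ownsCache;

    Client() {
      this(new LocationCache(), true);
    }

    Client(LocationCache shared) {
      this(shared, false);
    }

    private Client(LocationCache cache, boolean ownsCache) {
      this.cache = cache;
      this.ownsCache = ownsCache;
    }

    String locate(String tablet) {
      // Placeholder lookup; a real client would ask the system where the tablet lives.
      return cache.locate(tablet, t -> "server-for-" + t);
    }

    @Override
    public void close() {
      if (ownsCache) {
        cache.close();
      }
    }
  }

  public static void main(String[] args) {
    // The shared cache outlives both clients and is closed last by try-with-resources.
    try (LocationCache shared = new LocationCache();
        Client a = new Client(shared);
        Client b = new Client(shared)) {
      a.locate("tablet1");
      b.locate("tablet1");
      System.out.println("cached entries: " + shared.size());
    }
  }
}
```

Because nothing here is static, two clients in the same JVM can use separate caches, and tests can create and discard state without leaking it between runs.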