Re: FileSystem API (was: Slack call notes)

Michael Wall Wed, 29 Apr 2020 06:26:25 -0700

Yeah, I agree the docs are not clear.  I didn't actually run it myself,
just took a peek at the source at
https://github.com/minio/minio/blob/master/cmd/gateway/hdfs/gateway-hdfs.go
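
For anyone following along, here is a rough sketch of what "putting
interfaces around the HDFS interactions" could look like. Everything in it
is hypothetical: TabletFileStore and InMemoryFileStore are made-up names,
not existing Accumulo or Hadoop classes, and a real WAL abstraction would
also need a durable flush/sync operation that this sketch leaves out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage interface (an assumption for illustration only;
// it is not part of Accumulo or Hadoop).
interface TabletFileStore {
  OutputStream create(String path) throws IOException;

  InputStream open(String path) throws IOException;

  boolean exists(String path) throws IOException;

  boolean delete(String path) throws IOException;
}

// Toy in-memory implementation, standing in for an HDFS-backed or
// object-store-backed one.
class InMemoryFileStore implements TabletFileStore {
  private final Map<String, byte[]> files = new ConcurrentHashMap<>();

  @Override
  public OutputStream create(String path) {
    // Bytes are buffered and only become visible once the stream is closed.
    return new ByteArrayOutputStream() {
      @Override
      public void close() throws IOException {
        super.close();
        files.put(path, toByteArray());
      }
    };
  }

  @Override
  public InputStream open(String path) throws IOException {
    byte[] data = files.get(path);
    if (data == null) {
      throw new FileNotFoundException(path);
    }
    return new ByteArrayInputStream(data);
  }

  @Override
  public boolean exists(String path) {
    return files.containsKey(path);
  }

  @Override
  public boolean delete(String path) {
    return files.remove(path) != null;
  }
}
```

The only point is that callers would depend on the small interface, so a
different implementation could be swapped in behind it.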


On Tue, Apr 28, 2020 at 9:43 PM Christopher <ctubb...@apache.org> wrote:

> The page seemed to describe an S3 gateway, as well as a separate HDFS
> gateway, but I only took a superficial reading of the docs, and did
> not look at the code at all.
>
> On Tue, Apr 28, 2020 at 4:19 PM Michael Wall <mjw...@gmail.com> wrote:
> >
> > That HDFS gateway appears to be an S3 layer on top of HDFS, not an HDFS
> > layer on top of S3/Minio.  It allows you to write code to use Minio and
> > pull existing data from HDFS as you migrate it into Minio.  As far as I
> > can tell, it would not work without changes to Accumulo.
> >
> > In the next week or so I'll look at actually putting interfaces around
> > the HDFS interactions for RFiles and WALs as a first step.  I will report
> > back with my findings and hopefully some code.
> >
> > Thanks
> >
> > Mike
> >
> > On Fri, Apr 24, 2020 at 10:32 PM Christopher <ctubb...@apache.org> wrote:
> >
> > > I'm not familiar with it, but the website says it can replace HDFS.
> > > There appears to be an "HDFS Gateway"
> > > (https://github.com/minio/minio/blob/master/docs/gateway/hdfs.md) that
> > > might be useful. At a glance, it looks like no abstraction in Accumulo
> > > code is needed for it... you just run the gateway and
> > > Accumulo believes it is using HDFS, but it is really using MinIO
> > > instead.
> > >
> > > There also might be a Hadoop FileSystem implementation that uses it
> > > directly, without a Gateway, but I didn't have any luck with a quick
> > > search for one.
> > >
> > > In either case, there shouldn't need to be any changes to Accumulo
> > > itself.
> > >
> > > If changes to Accumulo do become necessary (or desired), I'd be
> > > interested in collaborating on that part. If it's just a matter of
> > > trying it with the Gateway or existing Hadoop FileSystem
> > > implementation, I'd also be interested in testing any step-by-step
> > > HOWTO guides somebody might want to write as a blog post.
> > >
> > > > On Fri, Apr 24, 2020 at 11:20 AM Mike Miller <mmil...@apache.org> wrote:
> > > > >
> > > > > I have no experience with MinIO but would be interested in learning
> > > > > more and collaborating.
> > > >
> > > > > On Fri, Apr 24, 2020 at 10:57 AM Michael Wall <mjw...@apache.org> wrote:
> > > > >
> > > > > > Resurrecting this thread on the File System API.  I have been
> > > > > > thinking about giving Minio [1] a try for both WALs and RFiles.
> > > > > > Seems to me like step one is to abstract internal interfaces for
> > > > > > both, targeted against 2.1?  A couple of questions:
> > > > > >
> > > > > > 1 - Anyone have experience with minio?
> > > > > > 2 - Anyone interested in collaborating?  Thinking anything from
> > > > > > providing input to helping to test once we get a prototype to
> > > > > > actually doing some development.
> > > > >
> > > > > Thanks, hope everyone is staying safe and healthy.
> > > > >
> > > > > [1] - https://min.io/
> > > > >
> > > > > > On Wed, Mar 25, 2020 at 6:08 PM Christopher <ctubb...@apache.org> wrote:
> > > > >
> > > > > > Only 705 across 280 files, if you exclude Text, though :)
> > > > > >
> > > > > > > grep -rP 'org[.]apache[.]hadoop(?![.]io[.]Text)' --include='*.java' * | grep -v test/ | wc -l
> > > > > >
> > > > > > > On Wed, Mar 25, 2020 at 3:34 PM Mike Miller <mmil...@apache.org> wrote:
> > > > > > >
> > > > > > > > I think we have come a long way removing any external types from the
> > > > > > > > API, for reasons other than de-coupling from Hadoop.  While we don't
> > > > > > > > have many dependencies on the other components of Hadoop, we are
> > > > > > > > still very tightly coupled to HDFS.
> > > > > > > > For example, some quick grep'ing of the code shows:
> > > > > > > > grep -r "import org.apache.hadoop" --include='*.java' * | wc -l
> > > > > > > > 1734
> > > > > > > > Without tests it is slightly more feasible...
> > > > > > > > grep -r "import org.apache.hadoop" --include='*.java' * | grep -v "test" | wc -l
> > > > > > > > 858
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, Mar 25, 2020 at 3:19 PM David Mollitor <dam6...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > > I too have been thinking about this for a pet project.  There is
> > > > > > > > > already Apache Commons VFS that, with some investment, could
> > > > > > > > > probably serve all these requirements.
> > > > > > > >
> > > > > > > > > On Wed, Mar 25, 2020, 3:16 PM Christopher <ctubb...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > (Forking this thread, as it's a distinct topic)
> > > > > > > > >
> > > > > > > > > > I've thought about it. The idea has driven me to try to reduce our
> > > > > > > > > > use of Hadoop-specific code, and to isolate Hadoop-specific stuff
> > > > > > > > > > behind some abstraction, wherever possible. Though, I'll admit,
> > > > > > > > > > we're nowhere close to where we'd want to be to be fully decoupled
> > > > > > > > > > from Hadoop.
> > > > > > > > > >
> > > > > > > > > > I've also been looking a lot at our VolumeManager code lately, to
> > > > > > > > > > try to improve it a bit, and to create better abstractions for
> > > > > > > > > > Volumes, that could aid future work in this area.
> > > > > > > > > >
> > > > > > > > > > But, I haven't directly been working on new FileSystem API
> > > > > > > > > > abstraction... just trying to lay some groundwork for that
> > > > > > > > > > possibility in future.
> > > > > > > > > >
> > > > > > > > > > It'd be nice to get to a point where we have a Hadoop-specific
> > > > > > > > > > implementation isolated to a jar that can be swapped out at runtime
> > > > > > > > > > for other file system implementations, as needed. I see that as
> > > > > > > > > > somewhat of a long way off.
> > > > > > > > >
> > > > > > > > > > On Wed, Mar 25, 2020 at 2:08 PM <dlmar...@comcast.net> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >   I couldn't make the call today, but am curious if
> anyone
> > > has
> > > > > > > > > previously brought up creating a FileSystem API for
> Accumulo so
> > > > > that
> > > > > > we
> > > > > > > > > could use implementations other than Hadoop. I realize that
> > > Hadoop
> > > > > > > > provides
> > > > > > > > > implementations for things other than HDFS but that doesn't
> > > > > > necessarily
> > > > > > > > > mean that all filesystem implementations are covered.
> > > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Christopher <ctubb...@apache.org>
> > > > > > > > > > Sent: Wednesday, March 25, 2020 1:45 PM
> > > > > > > > > > To: accumulo-dev <dev@accumulo.apache.org>
> > > > > > > > > > Subject: Slack call notes
> > > > > > > > > >
> > > > > > > > > > > Several committers/contributors in the community joined a call in
> > > > > > > > > > > Slack on Wednesday, at 1130-1230, New York (Eastern) time. Here
> > > > > > > > > > > are my notes of the call. Please feel free to add to them.
> > > > > > > > > >
> > > > > > > > > > > I shared the overall philosophy and backstory to some of the
> > > > > > > > > > > script improvements in 2.x to help guide current/future work on
> > > > > > > > > > > the scripts.
> > > > > > > > > >
> > > > > > > > > > > * bin/accumulo is inspired by old jpackage.org standards which are
> > > > > > > > > > >   still in use in RPM macros for Java packaging in Fedora/RHEL/etc.
> > > > > > > > > > >   The key idea is that scripts are simple... set up environment
> > > > > > > > > > >   (class path, etc.), locate java, and exec a single process with
> > > > > > > > > > >   the provided args.
> > > > > > > > > > > * bin/accumulo-service is inspired by old SysVInit scripts for
> > > > > > > > > > >   start/stop/restart/status of a single service
> > > > > > > > > > > * behavior of bin/accumulo and bin/accumulo-service can be
> > > > > > > > > > >   manipulated through launch environment
> > > > > > > > > > > * bin/accumulo-cluster uses bin/accumulo-service, and is provided
> > > > > > > > > > >   as a simple, out-of-the-box cluster management tool
> > > > > > > > > > > * bin/accumulo-cluster and bin/accumulo-service are replaceable;
> > > > > > > > > > >   they are useful for out-of-the-box use, but one would expect
> > > > > > > > > > >   them to be unnecessary if using systemd, or a vendor-provided
> > > > > > > > > > >   cluster management system
> > > > > > > > > > > * we discussed possibly moving bin/accumulo-cluster and
> > > > > > > > > > >   bin/accumulo-service to contrib/ in the tarball, or some subdir
> > > > > > > > > > >   of bin/, but it was suggested to not make too many disruptive
> > > > > > > > > > >   changes there
> > > > > > > > > > > * we discussed the possibility of adding a config file for
> > > > > > > > > > >   bin/accumulo-cluster (also mentioned on
> > > > > > > > > > >   https://github.com/apache/accumulo/pull/1568)
> > > > > > > > > > > * we discussed the need to document the intent/purpose/scope of
> > > > > > > > > > >   the scripts in comments inside the scripts themselves
> > > > > > > > > > > * Ed Coleman asked if it'd be good to document a systemd example;
> > > > > > > > > > >   I suggested it might make for a good blog post (perhaps by the
> > > > > > > > > > >   person who wrote the systemd unit files for Fluo Muchos)
> > > > > > > > > >
> > > > > > > > > > > Keith Turner discussed his development efforts with regard to
> > > > > > > > > > > enabling more controls over compactions.
> > > > > > > > > > >
> > > > > > > > > > > * one main idea was to keep configuration/API for data separate
> > > > > > > > > > >   from that for execution
> > > > > > > > > > > * data concerns application owners, whereas execution involves
> > > > > > > > > > >   system admins (resource contention, etc.)
> > > > > > > > > > > * he will submit a PR for review when ready
> > > > > > > > > > > * he also suggested another call to go over the PR
> > > > > > > > > >
> > > > > > > > > > > Billie Rinaldi discussed better support for Azure Data Lake
> > > > > > > > > > > Storage Gen2 (ADLSv2).
> > > > > > > > > > >
> > > > > > > > > > > * maintaining a fork for experimenting, and working on reliably
> > > > > > > > > > >   testing issues involving WALs
> > > > > > > > > > > * did not recommend using ADLSv2 with WALs, but that we should
> > > > > > > > > > >   still support it
> > > > > > > > > > > * might need to implement a custom log closer to better support it
> > > > > it
> > > > > > > > > >
> > > > > > > > > > > Mike Miller brought up the idea of eliminating more static
> > > > > > > > > > > internal state.
> > > > > > > > > > >
> > > > > > > > > > > * ServerConfigurationFactory might be improved in this regard,
> > > > > > > > > > >   with some additional ZK cleanup
> > > > > > > > > > > * Other ZK cleanup might help elsewhere (such as ZooCache)
> > > > > > > > > > > * I suggested tablet location cache might also benefit from being
> > > > > > > > > > >   bound to an AccumuloClient lifecycle (or a dedicated opaque
> > > > > > > > > > >   object that could be shared across AccumuloClient instances
> > > > > > > > > > >   with its own user-managed lifecycle)
> > > > > > > > > >
> > > > > > > > > > > Please add anything I might have missed (or got wrong) in
> > > > > > > > > > > response to this post.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
>
