>
> I'm somewhat against such a configuration. This being a server-side
> configuration results in Kudu deployments in different environments having
> different sets of available types, which seems very difficult for
> downstream users to deal with.


Yeah I agree. I am not super into the idea.

Even though "least common denominator" kind
> of sucks, it's also not a bad policy for software that aims to be part of a
> pretty diverse ecosystem.


I think because Kudu is generally the "bottom" layer, it would be best to
build new features/types from the bottom up where possible, as opposed to
always playing catch-up with the ecosystem. That said, I think that's only
true when there is interest or demand for the feature or data type, and it
doesn't look like that demand exists in this case.

> I think without clear user demand for >38 digits it's just not worth the
> complexity.


Agreed. Not much response here so we should drop this for now.

> That's a good point. However, I'm guessing that users are more likely to
> intuitively know that "9 digits is enough" more easily than they will know
> that "64 bits is enough". In my experience people underestimate the range
> of 64-bit integers and might choose INT128 if available even if they have
> no need for anywhere near that range.


That makes sense. Instead of supporting INT128 for larger ranges, if there
is demand for more digits we could add support for decimal precisions 39 to
77 with an internal INT256 (or VarInt) representation.
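
As a rough sketch (not Kudu code; the class and method names below are just
illustrative), this is the arithmetic behind those precision tiers: the
largest decimal precision a signed N-bit integer can always hold, which is
where the familiar 9/18/38 cut-offs come from and how they would extend to
wider integers.

  import java.math.BigInteger;

  // Largest precision p such that every p-digit decimal value
  // (at most 10^p - 1) fits in a signed N-bit integer.
  public class DecimalPrecisionSketch {
    static int maxFullPrecision(int bits) {
      BigInteger signedMax =
          BigInteger.ONE.shiftLeft(bits - 1).subtract(BigInteger.ONE);
      int p = 0;
      while (BigInteger.TEN.pow(p + 1).subtract(BigInteger.ONE)
                 .compareTo(signedMax) <= 0) {
        p++;
      }
      return p;
    }

    public static void main(String[] args) {
      System.out.println("INT32:  " + maxFullPrecision(32));   // 9
      System.out.println("INT64:  " + maxFullPrecision(64));   // 18
      System.out.println("INT128: " + maxFullPrecision(128));  // 38
      // Todd's point above: INT64 already reaches ~9.2 * 10^18.
      System.out.println(BigInteger.ONE.shiftLeft(63).subtract(BigInteger.ONE));
    }
  }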


On Mon, Nov 20, 2017 at 6:51 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke <ghe...@cloudera.com> wrote:
>
> > Thank you for the feedback. Below are some responses.
> >
> > Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> > > Presto, etc? What type would we map to in Java?
> >
> >
> > In Java we would map to a BigInteger. There isn't a perfectly natural
> > mapping for SQL that I know of. It has been mentioned in the past that we
> > could have server side flags to disable/enable the ability to create
> > columns of certain types to prevent users from creating tables that are
> not
> > readable by certain integrations. This problem exists today with the
> BINARY
> > column type.
> >
>
> I'm somewhat against such a configuration. This being a server-side
> configuration results in Kudu deployments in different environments having
> different sets of available types, which seems very difficult for
> downstream users to deal with. Even though "least common denominator" kind
> of sucks, it's also not a bad policy for software that aims to be part of a
> pretty diverse ecosystem.
>
>
>
> >
> > > Why not just _not_ expose it and only expose decimal.
> >
> >
> > Technically decimal only supports 38 nines, whereas INT128 can support
> > slightly larger numbers. There may also be more overhead dealing with a
> > decimal type, though I am not positive about that.
> >
>
> I think without clear user demand for >38 digits it's just not worth the
> complexity.
>
>
> >
> > Encoders: like Dan mentioned, it seems like we might not be able to do a
> > > very efficient job of encoding these very large integers. Stuff like
> > > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > > values. So, I'm a little afraid that we'll end up only with PLAIN and
> > > people will be upset with the storage overhead and performance.
> >
> >
> >  Aren't we going to need efficient encodings in order to make decimal
> work
> > > well, anyway?
> >
> >
> > We will need to ensure performant encoding exists for INT128 to make
> > decimals with precisions >= 18 work well anyway. We should likely have
> > parity
> > with the other integer types to reduce any confusion about differing
> > precisions having different encoding considerations. Although Presto
> > documents that precisions >= 18 are slower than the others. We could do
> > something similar and follow on with improvements.
> >
> > In the current int128 internal patch I know that the RLE doesn't work for
> > int128. I don't have a lot of background on Kudu's encoding details, so
> > investigating encodings further is one of my next steps.
> >
>
> That's a good point. However, I'm guessing that users are more likely to
> intuitively know that "9 digits is enough" more easily than they will know
> that "64 bits is enough". In my experience people underestimate the range
> of 64-bit integers and might choose INT128 if available even if they have
> no need for anywhere near that range.
>
> -Todd
>
>
> >
> > On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <danburk...@apache.org>
> > wrote:
> >
> > > Aren't we going to need efficient encodings in order to make decimal
> work
> > > well, anyway?
> > >
> > > - Dan
> > >
> > > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <t...@cloudera.com>
> wrote:
> > >
> > >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <danburk...@apache.org>
> > >> wrote:
> > >>
> > >> > I think it would be useful.  As far as I've seen the main costs in
> > >> > carrying data types are in writing performant encoders, and updating
> > >> > integrations to work with them.  I'm guessing with 128 bit integers
> > >> there
> > >> > would be some integrations that can't or won't support it, which
> might
> > >> be a
> > >> > cause for confusion.  Overall, though, I think the upsides of
> > efficiency
> > >> > and decreased storage space are compelling.   Do you have a sense
> yet
> > of
> > >> > what encodings are going to be supported down the road (will we get
> to
> > >> full
> > >> > parity with 32/64)?
> > >> >
> > >>
> > >> Yea, my concerns are:
> > >>
> > >> 1) Integrations: do we have a compatible SQL type to map this to in
> > Spark
> > >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems
> > like
> > >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So,
> if
> > >> we're going to map it the same as decimal anyway, why not just _not_
> > >> expose
> > >> it and only expose decimal? If someone wants to store a 128-bit hash
> as
> > a
> > >> DECIMAL(39) they are free to, of course. Postgres's built-in int types
> > >> only
> > >> go up to 64-bit (bigint).
> > >>
> > >> In addition to the choice of DECIMAL, for things like fixed-length
> > binary
> > >> maybe we are better off later adding a fixed-length BINARY type, like
> > >> BINARY(16) which could be used for storing large hashes? There is
> > >> precedent
> > >> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
> > >>
> > >>
> > >> 2) Encoders: like Dan mentioned, it seems like we might not be able to
> > do
> > >> a
> > >> very efficient job of encoding these very large integers. Stuff like
> > >> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > >> values. So, I'm a little afraid that we'll end up only with PLAIN and
> > >> people will be upset with the storage overhead and performance.
> > >>
> > >> -Todd
> > >>
> > >> >
> > >> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <ghe...@cloudera.com>
> > >> wrote:
> > >> >
> > >> >> Hi all,
> > >> >>
> > >> >> As a part of adding DECIMAL support to Kudu it was necessary to add
> > >> >> internal support for 128 bit integers. Taking that one step further
> > and
> > >> >> supporting public columns and APIs for 128 bit integers would not
> be
> > >> too
> > >> >> much additional work. However, I wanted to gauge the interest from
> > the
> > >> >> community.
> > >> >>
> > >> >> My initial thoughts are that having an INT128 column type could be
> > >> useful
> > >> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar
> > >> types
> > >> >> of data.
> > >> >>
> > >> >> Is there any interest in or uses for an INT128 column type? Is anyone
> > >> >> currently using a STRING or BINARY column for 128 bit data?
> > >> >>
> > >> >> Thank you,
> > >> >> Grant
> > >> >> --
> > >> >> Grant Henke
> > >> >> Software Engineer | Cloudera
> > >> >> gr...@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >>
> > >> >
> > >> >
> > >>
> > >>
> > >> --
> > >> Todd Lipcon
> > >> Software Engineer, Cloudera
> > >>
> > >
> > >
> >
> >
> > --
> > Grant Henke
> > Software Engineer | Cloudera
> > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
