It seems like we don't have enough input on this topic to make a decision right now. I placed the JIRA ARROW-352 in the 0.9.0 milestone, but we really should try to get this done soon so that downstream users are not blocked on using Arrow to send around interval data.
- Wes

On Fri, Oct 20, 2017 at 12:34 AM, Li Jin <ice.xell...@gmail.com> wrote:
> +1 on this one.
>
> My reason is that this makes timestamp/interval calculation faster, i.e.,
> "timestamp + interval < timestamp" should be faster without dealing with
> two components in the interval. That said, I am not quite sure about the
> rationale behind the two-component representation, which seems to be what
> is used in Spark:
>
> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
>
> I am interested in hearing the reasoning behind the two components.
>
> On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I opened this patch over 2 months ago to add some additional metadata
>> for intervals:
>>
>> https://github.com/apache/arrow/pull/920
>>
>> Java supports a two-component DAY_TIME interval type as a combo of
>> days and milliseconds:
>>
>> https://github.com/apache/arrow/blob/402baa4ec391b61dd37c770ae7978d51b9b550fa/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L106
>>
>> I propose that we change the interval representation to be a number of
>> elapsed units of time from a particular point in time. The unit
>> choices would be the same as our units for timestamps, so an interval
>> can be viewed as a delta between two timestamps of some resolution
>> (seconds through nanoseconds) [1].
>>
>> As context, a number of systems I have worked with deal in absolute
>> time deltas. In pandas, for example, the difference of timestamps
>> (datetime64 values) is a timedelta:
>>
>> In [1]: import pandas as pd
>>
>> In [2]: dr1 = pd.date_range('1/1/2000', periods=5)
>>
>> In [3]: dr2 = pd.date_range('1/2/2000', periods=5)
>>
>> In [4]: dr1 - dr2
>> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1 days',
>>                 '-1 days'], dtype='timedelta64[ns]', freq=None)
>>
>> In [5]: (dr1 - dr2).values
>> Out[5]:
>> array([-86400000000000, -86400000000000, -86400000000000, -86400000000000,
>>        -86400000000000], dtype='timedelta64[ns]')
>>
>> We need to be able to represent this data coherently (up to nanosecond
>> resolution) with the Arrow metadata, and we will also at some point
>> need to perform analytics directly on this data type.
>>
>> An alternative proposal to changing the DAY_TIME interval
>> representation is to add another kind of interval type, so instead of
>> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of this,
>> of course, is the extra implementation complexity. DAY_TIME with the
>> current Java representation also seems to me to be a subset of what
>> you can represent with TIMEDELTA.
>>
>> It would be great to make a decision about this so we can get this
>> metadata finalized in the 0.8.0 release.
>>
>> Thanks
>> Wes
>>
>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L135
>>
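
For anyone following the thread, here is a rough Python sketch of the two representations under discussion. It is only an illustration, not Arrow or Spark code: the names `DayTimeInterval`, `day_time_to_timedelta_ns`, `timedelta_ns_to_day_time`, and `shifted_before` are made up for this example, and it assumes every day is exactly 24 hours long. It shows a (days, milliseconds) pair collapsing into a single signed nanosecond count, and the "timestamp + interval < timestamp" check reducing to one integer add and one compare.

```python
from typing import NamedTuple

MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


class DayTimeInterval(NamedTuple):
    """Hypothetical stand-in for a two-component DAY_TIME value."""
    days: int
    millis: int


def day_time_to_timedelta_ns(iv: DayTimeInterval) -> int:
    """Collapse (days, millis) into one signed nanosecond count.

    Assumption: every day is exactly 24 hours long.
    """
    return (iv.days * MILLIS_PER_DAY + iv.millis) * NANOS_PER_MILLI


def timedelta_ns_to_day_time(delta_ns: int) -> DayTimeInterval:
    """Split a nanosecond count back into (days, millis).

    Raises ValueError when sub-millisecond precision would be lost,
    since the two-component form only resolves milliseconds.
    """
    millis_total, rem = divmod(delta_ns, NANOS_PER_MILLI)
    if rem:
        raise ValueError("sub-millisecond precision cannot be stored as DAY_TIME")
    days, millis = divmod(millis_total, MILLIS_PER_DAY)
    return DayTimeInterval(days, millis)


def shifted_before(ts_ns: int, delta_ns: int, other_ts_ns: int) -> bool:
    """Evaluate `timestamp + interval < timestamp` with one integer add."""
    return ts_ns + delta_ns < other_ts_ns


# Same value as the '-1 days' entries in the pandas session quoted above.
minus_one_day = day_time_to_timedelta_ns(DayTimeInterval(days=-1, millis=0))
```

Under these assumptions, every (days, millis) pair maps to a nanosecond count, while a count such as 1 ns has no (days, millis) equivalent, which is one way to read the "subset" remark in the quoted message.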