Re: Specifying attribute slot for storing/reading statistics

2019-09-12 Thread Esteban Zimanyi
>
> So these are 4 different data types (or classes of data types) that you
> introduce in your extension? Or is that just a conceptual view and it's
> stored in some other way (e.g. normalized in some way)?
>

At the SQL level these 4 durations are not distinguishable. For example for
a tfloat (temporal float) we can have

select tfloat '1@2000-01-01' -- Instant duration
select tfloat '{1@2000-01-01 , 2@2000-01-02 , 1@2000-01-03}' -- Instant set
duration
select tfloat '[1@2000-01-01, 2@2000-01-02 , 1@2000-01-03)' -- Sequence
duration, left-inclusive and right-exclusive bound,
select tfloat {'[1@2000-01-01, 2@2000-01-02 , 1@2000-01-03], '[1@2000-01-04,
1@2000-01-05]}  ' -- Sequence set duration

Nevertheless it is possible to restrict a column to a specific duration
with a typymod specifier as in

create table test ( ..., measure tfloat(Instant) -- only Instant durations
accepted, ...)

At the C level these 4 durations are distinguished and implement in
something equivalent to a template abstract class Temporal with four
subclasses TemporalInst, TemporalI, TemporalSeq, and TemporalS. Indeed the
algorithms for manipulating these 4 durations are completely different.
They are called template classes since they keep the Oid of the base type
(float for tfloat or geometry for tgeompoint) in the same way array or
ranges do.

For more information please refer to the manual at github
https://github.com/ULB-CoDE-WIT/MobilityDB/


> I don't think we're strongly against changing the code to allow this, as
> long as it does not break existing extensions/code (unnecessarily).
>
> >If you want I can prepare a PR in order to understand the implications of
> >these changes. Please let me know.
> >
>
> I think having an actual patch to look at would be helpful.
>

I am preparing a first patch for the files selfuncs.h and selfunc.c and
thus for instant duration selectivity. It basically
1) Moves some prototypes of the static functions from the .c to the .h file
so that the functions are exported.
2) Passes the operator from the top level functions to the inner functions
such as mcv_selectivity or ineq_histogram_selectivity.

This allows me to call the functions twice, once for the value component
and another for the time component, e.g. as follows.

else if (cachedOp == CONTAINED_OP || cachedOp == OVERLAPS_OP)
{
/* Enable the addition of the selectivity of the value and time
 * dimensions since either may be missing */
int selec_value = 1.0, selec_time = 1.0;

/* Selectivity for the value dimension */
if (MOBDB_FLAGS_GET_X(box->flags))
{
operator = oper_oid(LT_OP, valuetypid, valuetypid);
selec_value = scalarineqsel(root, operator, false, false
, vardata,
Float8GetDatum(box->xmin), valuetypid);
operator = oper_oid(GT_OP, valuetypid, valuetypid);
selec_value += scalarineqsel(root, operator, true, false
, vardata,
Float8GetDatum(box->xmax), valuetypid);
selec_value = 1 - selec_value;
}
/* Selectivity for the time dimension */
if (MOBDB_FLAGS_GET_T(box->flags))
{
operator = oper_oid(LT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
selec_time = scalarineqsel(root, operator, false, false
, vardata,
TimestampTzGetDatum(box->tmin), TIMESTAMPTZOID);
operator = oper_oid(GT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
selec_time += scalarineqsel(root, operator, true, false
, vardata,
TimestampTzGetDatum(box->tmax), TIMESTAMPTZOID);
selec_time = 1 - selec_time;
}
selec = selec_value * selec_time;
}

Regards

Esteban


Re: Specifying attribute slot for storing/reading statistics

2019-09-10 Thread Tomas Vondra

Hi,

Please don't top-post. If you're not responding to parts of the e-mail,
then don't quote it.

On Fri, Sep 06, 2019 at 12:50:33PM +0200, Esteban Zimanyi wrote:

Dear Tom

Many thanks for your quick reply. Indeed both solutions you proposed can be
combined together in order to solve all the problems. However changes in
the code are needed. Let me now elaborate on the solution concerning the
combination of stakind/staop first and I will elaborate on adding a new
kind identifier after.

In order to understand the setting, let me explain a little more about the
different kinds of temporal types. As explained in my previous email these
are types whose values are composed of elements v@t where v is a
PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There
are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v@t. These are used for example to represent
car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1@t1, , vn@tn} where the values between
the points are unknown. These are used for example to represent checkins in
FourSquare or RFID readings
* Sequence: A sequence of values [v1@t1, , vn@tn] where the values
between two successive instants vi@ti vj@tj are (linearly) interpolated.
These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal
gap between them. These are used to represent for example GPS tracks where
the signal was lost during a time period.



So these are 4 different data types (or classes of data types) that you
introduce in your extension? Or is that just a conceptual view and it's
stored in some other way (e.g. normalized in some way)?


To compute the selectivity of temporal types we assume that time and space
dimensions are independent and thus we can reuse all existing analyze and
selectivity infrastructure in PostgreSQL/PostGIS. For the various durations
this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for
the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c
independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the
gaps, and thus use the functions in rangetypes_typanalyze.c,
rangetypes_selfuncs.c independently for the value and time dimensions



OK.


However, this requires that the analyze and selectivity functions in all
the above files satisfy the following
* Set the staop when computing statistics. For example in
rangetypes_typanalyze.c the staop is set for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for
STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid.
For example, from the 17 times this function is called in selfuncs.c only
two are passed with an operator. This also requires to pass the operator as
an additional parameter to several functions. For example, the operator
should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For
example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and
several others in the file selfuncs.c should be exported.

That would solve all the problems excepted for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will
always be Float8LessOperator, independently of whether we are computing
lengths of value ranges or of tstzranges. This could be solved by using a
different stakind for the value and time dimensions.



I don't think we're strongly against changing the code to allow this, as 
long as it does not break existing extensions/code (unnecessarily).



If you want I can prepare a PR in order to understand the implications of
these changes. Please let me know.



I think having an actual patch to look at would be helpful.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Specifying attribute slot for storing/reading statistics

2019-09-06 Thread Esteban Zimanyi
Dear Tom

Many thanks for your quick reply. Indeed both solutions you proposed can be
combined together in order to solve all the problems. However changes in
the code are needed. Let me now elaborate on the solution concerning the
combination of stakind/staop first and I will elaborate on adding a new
kind identifier after.

In order to understand the setting, let me explain a little more about the
different kinds of temporal types. As explained in my previous email these
are types whose values are composed of elements v@t where v is a
PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There
are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v@t. These are used for example to represent
car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1@t1, , vn@tn} where the values between
the points are unknown. These are used for example to represent checkins in
FourSquare or RFID readings
* Sequence: A sequence of values [v1@t1, , vn@tn] where the values
between two successive instants vi@ti vj@tj are (linearly) interpolated.
These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal
gap between them. These are used to represent for example GPS tracks where
the signal was lost during a time period.

To compute the selectivity of temporal types we assume that time and space
dimensions are independent and thus we can reuse all existing analyze and
selectivity infrastructure in PostgreSQL/PostGIS. For the various durations
this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for
the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c
independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the
gaps, and thus use the functions in rangetypes_typanalyze.c,
rangetypes_selfuncs.c independently for the value and time dimensions

However, this requires that the analyze and selectivity functions in all
the above files satisfy the following
* Set the staop when computing statistics. For example in
rangetypes_typanalyze.c the staop is set for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for
STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid.
For example, from the 17 times this function is called in selfuncs.c only
two are passed with an operator. This also requires to pass the operator as
an additional parameter to several functions. For example, the operator
should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For
example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and
several others in the file selfuncs.c should be exported.

That would solve all the problems excepted for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will
always be Float8LessOperator, independently of whether we are computing
lengths of value ranges or of tstzranges. This could be solved by using a
different stakind for the value and time dimensions.

If you want I can prepare a PR in order to understand the implications of
these changes. Please let me know.

Esteban


On Thu, Sep 5, 2019 at 5:11 PM Tom Lane  wrote:

> Esteban Zimanyi  writes:
> > We are developing the analyze/selectivity functions for those types. Our
> > approach is to use the standard PostgreSQL/PostGIS functions for the
> value
> > and the time dimensions where the slots starting from 0 will be used for
> > the value dimension, and the slots starting from 2 will be used for the
> > time dimension. For example, for tfloat we use range_typanalyze and
> related
> > functions for
> > * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> > and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the
> value
> > dimension
> > *  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> > and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> > tstzranges) of the time dimension
>
> IMO this is fundamentally wrong, or at least contrary to the design
> of pg_statistic.  It is not supposed to matter which "slot" a given
> statistic type is actually stored in; rather, readers are supposed to
> search for the desired statistic type using the stakindN, staopN and
> (if relevant) stacollN fields.
>
> In this case it seems like it'd be reasonable to rely on the staop
> fields to distinguish between the value and time dimensions, since
> (IIUC) they're of different types.
>
> Another idea is to invent your own slot kind identifiers instead of
> using built-in ones.  I'm not sure that there's any point in using
> the built-in kind values, since (a) none of the core selectivity code
> is likely to get called on your data and (b) even if it were, it'd
> likely do the wrong thing.  See the 

Re: Specifying attribute slot for storing/reading statistics

2019-09-05 Thread Tom Lane
Esteban Zimanyi  writes:
> We are developing the analyze/selectivity functions for those types. Our
> approach is to use the standard PostgreSQL/PostGIS functions for the value
> and the time dimensions where the slots starting from 0 will be used for
> the value dimension, and the slots starting from 2 will be used for the
> time dimension. For example, for tfloat we use range_typanalyze and related
> functions for
> * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value
> dimension
> *  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> tstzranges) of the time dimension

IMO this is fundamentally wrong, or at least contrary to the design
of pg_statistic.  It is not supposed to matter which "slot" a given
statistic type is actually stored in; rather, readers are supposed to
search for the desired statistic type using the stakindN, staopN and
(if relevant) stacollN fields.

In this case it seems like it'd be reasonable to rely on the staop
fields to distinguish between the value and time dimensions, since
(IIUC) they're of different types.

Another idea is to invent your own slot kind identifiers instead of
using built-in ones.  I'm not sure that there's any point in using
the built-in kind values, since (a) none of the core selectivity code
is likely to get called on your data and (b) even if it were, it'd
likely do the wrong thing.  See the comments in pg_statistic.h,
starting about line 150, about assignment of non-built-in slot kinds.

> Is there any chance that the API for accessing the typanalyze and
> selectivity functions will be enhanced in a future release ?

Well, maybe you could convince us that the stakind/staop scheme for
identifying statistics is inadequate so we need another identification
field (corresponding to a component of the column being described,
perhaps).  I'd be strongly against assigning any semantic meaning
to the slot numbers, though.  That's likely to break code that's
written according to existing conventions.

regards, tom lane




Specifying attribute slot for storing/reading statistics

2019-09-05 Thread Esteban Zimanyi
Dear all

We are developing MobilityDB, an open source PostgreSQL/PostGIS extension
that provides temporal and spatio-temporal types. The source code, manuals,
and related publications are available at the address
https://github.com/ULB-CoDE-WIT/MobilityDB/


In MobilityDB temporal types are types derived from PostgreSQL/PostGIS
types to which a time dimension is added. MobilityDB provides the following
temporal types: tbool (temporal boolean), tint (temporal int), tfloat
(temporal float), text (temporal text), tgeompoint (temporal geometric
points) and tgeogpoint (temporal geographic points). For example, we can
define a tfloat and a tgeompoint as follows

SELECT tfloat '[1.5@2000-01-01, 2.5@2000-01-02, 1.5@2000-01-03]';
SELECT tgeompoint '[Point(0 0)@2000-01-01 08:00, Point(1 0)@2000-01-02
08:05, Point(1 1)@2000-01-03 08:10]';

We are developing the analyze/selectivity functions for those types. Our
approach is to use the standard PostgreSQL/PostGIS functions for the value
and the time dimensions where the slots starting from 0 will be used for
the value dimension, and the slots starting from 2 will be used for the
time dimension. For example, for tfloat we use range_typanalyze and related
functions for
* collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the value
dimension
*  collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
tstzranges) of the time dimension

However, we end up copying several PostgreSQL functions to which we only
add an additional parameter stating the slot number from which the specific
statistic kind should be found (either 0 or 2)

bool
get_attstatsslot_mobdb(AttStatsSlot *sslot, HeapTuple statstuple,
int reqkind, Oid reqop, int flags, int startslot)
{
[...]
for (i = startslot; i < STATISTIC_NUM_SLOTS; i++)
{
if ((>stakind1)[i] == reqkind &&
(reqop == InvalidOid || (>staop1)[i] == reqop))
break;
}
[...]
}

double
var_eq_const_mobdb(VariableStatData *vardata, Oid operator, Datum constval,
bool negate, int startslot)
{
[...]
}
Selectivity
scalarineqsel_mobdb(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
VariableStatData *vardata, Datum constval, Oid consttype,
int startslot)
{
[...]
}

static Selectivity
mcv_selectivity_mobdb(VariableStatData *vardata, FmgrInfo *opproc,
Datum constval, Oid atttype, bool varonleft,
double *sumcommonp, int startslot)
{
[...]
}
static double
ineq_histogram_selectivity_mobdb(PlannerInfo *root, VariableStatData *
vardata,
FmgrInfo *opproc, bool isgt, bool iseq, Datum constval,
Oid consttype, int startslot)
{
[...]
}

in addition to copying other functions needed by the above functions since
they are not exported (defined as static)

static bool
get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
Oid sortop, Datum *min, Datum *max)

static bool
get_actual_variable_endpoint(Relation heapRel,
Relation indexRel, ScanDirection indexscandir,
ScanKey scankeys, int16 typLen,
bool typByVal, MemoryContext outercontext,
Datum *endpointDatum)

[...]

Is there a better way to do this ?

Is there any chance that the API for accessing the typanalyze and
selectivity functions will be enhanced in a future release ?

Regards

Esteban

-- 

Prof. Esteban Zimanyi
Department of Computer & Decision Engineering  (CoDE) CP 165/15
Universite Libre de Bruxelles
Avenue F. D. Roosevelt 50
B-1050 Brussels, Belgium
fax: + 32.2.650.47.13
tel: + 32.2.650.31.85
e-mail: ezima...@ulb.ac.be
Internet: http://code.ulb.ac.be/