Re: genomic locus

2017-12-25 Thread Teodor Sigaev



I think I can wrangle this type into GiST just by tweaking consistent(), 
union(), and picksplit(), if I manage to express my needs in C without breaking 
too many things. My first attempt segfaulted.
Actually, consistent() can determ actual query data type by strategy number. See 
examples in ltree, intarray



If all goes to plan, I will end up with an index tree partitioned by contig at 
the top level and geometrically down from there. That will be as close as I can 
get to an array of config-specific indices, without having to store data in 
separate tables.


What do you think of that?


I have some doubt that you can distinguish root page, but it's possible to 
distinguish leaf pages, intarray and tsearch do that.


Reading your plan, I found an idea for GIN: key for GIN is a pair of (contig, 
one genome position). So, any search for interset operation with be actually a 
range search from (contig, start) to (contig, end)





I have a low-level technical question. Because I can’t anticipate the maximum 
length of contig names (and do not want to waste space), I have made the new 
locus type a varlena, like this:


#include "utils/varlena.h"

typedef struct LOCUS
{
   int32 l_len_; /* varlena header (do not touch directly!) */
   int32 start;
   int32 end;
   char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

sizeof? or strlen ?



That flexible array member messes with me every time I need to copy it while 
deriving a new locus object from an existing one (or from a pair). What I ended 
up doing is this:


   LOCUS  *l = PG_GETARG_LOCUS_P(0);
   LOCUS  *new_locus;
   char   *contig;
   int    size;
   new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
   contig = pstrdup(l->contig); // need this to determine the length of contig 
l->contig should be null-terminated for pstrdup, but if so, you don't need to 
pstrdup() it - you could use l->contig directly below. BTW, LOCUS_SIZE should 
add 1 byte for '\0' character in this case.




name at runtime
   size = LOCUS_SIZE(contig);
   SET_VARSIZE(new_locus, size);
   strcpy(new_locus->contig, contig);

Is there a more direct way to clone a varlena structure (possibly assigning an 
differently-sized contig to it)? One that is also memory-safe?

Store length of contig in LOCUS struct.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: genomic locus

2017-12-22 Thread Gene Selkov

> On Dec 22, 2017, at 1:53 AM, Teodor Sigaev <teo...@sigaev.ru> wrote:
> 
> Hmm, would you try to implement separate type for querying? Similar to 
> tsquery, lquery (for ltree), jsquery etc.

That sounds like a good idea if I want to make an app that will only be 
accessed through a purpose-built front end. Now I’m messing with lots of data, 
making numerous small one-off experiments. Since all of that stuff consumes my 
brain power and keystrokes, I want to minimize all the ancillary stuff or at 
least make it invisible. But I’ll probably go ahead and add a separate 
query-friendly type  that if nothing else helps.

I think I can wrangle this type into GiST just by tweaking consistent(), 
union(), and picksplit(), if I manage to express my needs in C without breaking 
too many things. My first attempt segfaulted.

Here’s my plan of attack. I think that by setting a union of inconsistent loci 
(those on different contigs) to [0, MAX_INT], I will expose such union to a 
huge penalty, even without having to do anything with the current penalty(). No 
query will go there.

The required change to consistent() is obvious: contigs do not match — go away.

The situation with picksplit() may be more tricky; I can’t imagine all possible 
consequences until I’ve seen how it works. Currently, with simple intervals, it 
sorts them by center points and splits the sorted list in half. I have already 
changed the internal sort function, picksplit_item_cmp(), to make sure the data 
are sorted on the contig first, then by center point (or any other geometric 
feature). I am thinking about splitting the list at the first inconsistent 
contig, sending the consistent first part with its well-defined geometry to the 
left page, and filling the right page with whatever remains. If the right page 
has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should 
be again picked for splitting at the next iteration (if I understand the 
algorithm correctly).

If all goes to plan, I will end up with an index tree partitioned by contig at 
the top level and geometrically down from there. That will be as close as I can 
get to an array of config-specific indices, without having to store data in 
separate tables.

What do you think of that?



I have a low-level technical question. Because I can’t anticipate the maximum 
length of contig names (and do not want to waste space), I have made the new 
locus type a varlena, like this:

#include "utils/varlena.h"

typedef struct LOCUS
{
  int32 l_len_; /* varlena header (do not touch directly!) */
  int32 start;
  int32 end;
  char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

That flexible array member messes with me every time I need to copy it while 
deriving a new locus object from an existing one (or from a pair). What I ended 
up doing is this:

  LOCUS  *l = PG_GETARG_LOCUS_P(0);
  LOCUS  *new_locus;
  char   *contig;
  intsize;

  new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
  contig = pstrdup(l->contig); // need this to determine the length of contig 
name at runtime
  size = LOCUS_SIZE(contig);
  SET_VARSIZE(new_locus, size);
  strcpy(new_locus->contig, contig);

Is there a more direct way to clone a varlena structure (possibly assigning an 
differently-sized contig to it)? One that is also memory-safe?


Thank you,

—Gene


> Gene Selkov wrote:
>>> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmh...@gmail.com 
>>> <mailto:robertmh...@gmail.com> <mailto:robertmh...@gmail.com 
>>> <mailto:robertmh...@gmail.com>>> wrote:
>>> 
>>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selko...@gmail.com 
>>> <mailto:selko...@gmail.com> <mailto:selko...@gmail.com 
>>> <mailto:selko...@gmail.com>>> wrote:
>>>> I need a data type to represent genomic positions, which will consist of a
>>>> string and a pair of integers with interval logic and access methods. Sort
>>>> of like my seg type, but more straightforward.
>>> 
>>> Have you thought about just using a composite type?
>> Yes, I have. That is sort of what I have been doing; a composite type 
>> certainly gets the job done but I don’t feel it reduces query complexity, at 
>> least from the user’s point of view. Maybe I don’t know enough.
>> Here’s an example of how I imagine a composite genomic locus (conventionally 
>> represented as text ‘:’ integer ‘-‘ integer):
>> CREATE TYPE locus AS (contig text, coord int4range);
>> CREATE TABLE test_locus (
>>   pos locus,
>>   ref text,
>>   alt text,
>>   id text
>> );
>> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
>> \copy test_locus from test_locus.tab
>> Where t

Re: genomic locus

2017-12-21 Thread Teodor Sigaev
Hmm, would you try to implement separate type for querying? Similar to tsquery, 
lquery (for ltree), jsquery etc.


Gene Selkov wrote:


On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmh...@gmail.com 
<mailto:robertmh...@gmail.com>> wrote:


On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selko...@gmail.com 
<mailto:selko...@gmail.com>> wrote:

I need a data type to represent genomic positions, which will consist of a
string and a pair of integers with interval logic and access methods. Sort
of like my seg type, but more straightforward.


Have you thought about just using a composite type?


Yes, I have. That is sort of what I have been doing; a composite type certainly 
gets the job done but I don’t feel it reduces query complexity, at least from 
the user’s point of view. Maybe I don’t know enough.


Here’s an example of how I imagine a composite genomic locus (conventionally 
represented as text ‘:’ integer ‘-‘ integer):


CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
   pos locus,
   ref text,
   alt text,
   id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab

Where test_locus.tab has stuff like:

(chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
(chr3,"[178916939,178916948]")AGGATCHP2_PIK3CA_2
(chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
(chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
(chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
(chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
(chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2
(chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2
(chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2
(chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2

When the table is loaded, I can pull the subset shown above with this query:

SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
'[178916937, 178916948]’;

               pos               |    ref    | alt |      id
+---+-+---
  (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
  (chr3,"[178916939,178916949)") | AGGAT |     | CHP2_PIK3CA_2
  . . . .

So far so good. It gets the job done. However, it is only a small step towards a 
fully encapsulated, monolithic type I want it to be. The above query It is 
marginally better than its atomic-type equivalent:


SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
least(stop, 178916948);

  contig |   start   |   stop    |    ref    | alt |      id
+---+---+---+-+---
  chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
  chr3   | 178916939 | 178916948 | AGGAT |     | CHP2_PIK3CA_2
  . . . .

and it requires addition syntax transformations steps to go from conventional 
locus representation 'chr3:178916937-178916940' to composite 
'(chr3,"[178916937,178916940]”)’ and back.


Of course, the relative benefits of partial encapsulation I achieve by bundling 
text with int4range accumulate, compared to (text, int4, int4), as queries grow 
more complex. But because the elements of a composite type still require a 
separate query term for each of them (unless there is some magic I am not aware 
of), the complexity of a typical query I need to run exceeds my feeble 
sight-reading capacity. I want things that are conceptually simple to be 
expressed in simple terms, if possible.


Like so:

CREATE EXTENSION locus;
CREATE TABLE test_locus (
   pos locus,
   ref text,
   alt text,
   id text
);
\copy test_locus from data/oncomine.hotspot.tab

SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
            pos            |    ref    | alt |      id
--+---+-+---
  chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
  chr3:178916939-178916948 | AGGAT |     | CHP2_PIK3CA_2
  chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
  chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
  chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
  chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
  chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
  chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
  chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
  chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
(10 rows)

I have encountered some pesky geometry / indexing problems while building this 
extension (https://github.com/selkovjr/locus), but I hope I can solve them at 
least at the level afforded by the composite type, while keeping the clean 
interface of a monolithic type. I understand I could probably achieve the same 
cleanliness by defining functions and operators over the complex type, but by 
the time I’m done with that, will I have coded abo

Re: genomic locus

2017-12-21 Thread Teodor Sigaev

Hi!

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:1-2’? Intuitively, it 
should be NULL, however, I am not sure the method allows for that; it was 
developed for objects living in the same metric space. I have mechanistically 
reproduce the indexing methods of seg, but the resulting index is broken. All 
queries against an indexed table return a null result.
You can use fixed-size signature for string part (contig?) of locus, it will 
help to find that string part by equlity or intersection of need it. Signature 
of inner key should be bitwise-ORed of all child keys. Looking on query example, 
seems, you need to build signature tree and, if signature is the same of two 
key, take into account start-end range.



Problem 2. While the intersection (overlap, ) of any two loci produces 
obvious results, non-intersection does not. When I query for all loci not 
overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 
1. I don’t want the query to return anything from other contigs, because it is 
obvious that features on different contigs do not overlap. I may be able to fix 
that by making separate functions for non-overlaps and adding a constraint to 
them, but that seems like a kludge.

Seems like you have two different non-overlap operation...


Problem 3 (alternative to 1). I realize that any clustering can help build an 
efficient index, no matter how bizarre. So I could, for example, ignore the 
contigs altogether and build a single index tree, using only position 
co-ordinates and pretending that all positions are on the same contig; the 
question then is whether and how such lossy index will affect the ordering of 
query results. Can I use a separate function for ordering? I have yet to make an 
experiment. Not that this would be equivalent to indexing the attributes of a 
composite type separately (if I understood it correctly).


Ordering and GiST aren't very close things :) You can try to have a look to 
KNN-search feature of GiST. https://www.pgcon.org/2010/schedule/events/227.en.html







--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: genomic locus

2017-12-21 Thread Gene Selkov
Hi Oleg,

Great to hear from you. I wondered how many of the old-timers were still around.

> Why not use composite type ? For simple interval approach it's worked for us
> (see attached hdate.sql).

I have just begun looking at your hdate example; I see potentially useful stuff 
in it, but the first thing that I noticed is hat it is not fully equivalent to 
my problem. It looks like you only need to match intervals, while I need to 
match intervals and something else — ideally, in a single operation. I 
attempted to explain that in my reply to Craig Ringer.

> If you need to specify distribution
> function,

Not in this case; there is no uncertainty associated with the loci; where there 
is uncertainty is in the existence of a feature called at a locus: is it real 
or is it a technogenic artifact? But that is a different problem for a later 
day.

> than it may be
> worth to see orion project http://orion.cs.purdue.edu/index.html
> 6 years ago we was thinking about implementation special UNCERTAINTY data type
> (http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but 
> never
> started :( It'd be nice if you start this very interesting for science 
> project.

I love uncertainty, and I’ve always wished I could make it computable. I also 
wish folks around me had the same appreciation for it. My job is to say yes or 
no where the data suggest maybe, or maybe not. Needless to say, I feel a bit 
exercised.

I am reading the info you provided with keen interest.

> btw, now you can use range data type, check
> https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf 
> 

Great stuff, I was not aware of it. I saw it in early development but did not 
know it made it to the core. I tried it (and will go and update a few kludgy 
apps where I had to use bad surrogates). It is not directly applicable to 
genomic loci because it will require additional constraints for intelligent 
matching. I want to go for compete encapsulation of constraints.

Part of the reason for such a perverse desire is that I use the database as a 
calculator — that is, I load some data in a one-off experiment and I literally 
type everything in psql while I muddle through. There is a limit on how much I 
can type and not screw things up beyond comprehension, so I want the query 
language to be as easy and interactive as possible. Having to drag along a set 
of additional constraints is not quite interactive and is error-prone.


Regards,

—Gene

Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 18, 2017, at 6:59 AM, Craig Ringer  wrote:
> 
> If you think it'd make logical sense to extend seg with a string descriptor 
> of some sort and could come up with a name/use case that's not quite so 
> narrowly focused as genetics alone, then I could see adding it as a secondary 
> type in the same extension.
> 
> But it's more likely that the best course would be to extract the seg 
> extension from core, rename it, hack it as desired, and build it as an 
> extension maintained out-of-tree. 

That is exactly what I’ve been doing for a few days now, and the process is 
testing my sanity.

Attaching a string to an interval seems like an easy enough undertaking, and I 
have got it to work at the UI level, at least. The queries I wanted to be able 
to make against it run without problems and produce the desired results. Here 
is my first attempt: https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making and there are issues 
around indexing that I am not sure how to solve.

The main problem is that attaching a tag to an interval makes it incommensurate 
with intervals having a different tag. That makes them hard to index with an 
access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:1-2’? Intuitively, 
it should be NULL, however, I am not sure the method allows for that; it was 
developed for objects living in the same metric space. I have mechanistically 
reproduce the indexing methods of seg, but the resulting index is broken. All 
queries against an indexed table return a null result.

Problem 2. While the intersection (overlap, ) of any two loci produces 
obvious results, non-intersection does not. When I query for all loci not 
overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 
1. I don’t want the query to return anything from other contigs, because it is 
obvious that features on different contigs do not overlap. I may be able to fix 
that by making separate functions for non-overlaps and adding a constraint to 
them, but that seems like a kludge.

Problem 3 (alternative to 1). I realize that any clustering can help build an 
efficient index, no matter how bizarre. So I could, for example, ignore the 
contigs altogether and build a single index tree, using only position 
co-ordinates and pretending that all positions are on the same contig; the 
question then is whether and how such lossy index will affect the ordering of 
query results. Can I use a separate function for ordering? I have yet to make 
an experiment. Not that this would be equivalent to indexing the attributes of 
a composite type separately (if I understood it correctly).

An alternative to neglecting the contig element might be to use it as a second 
dimension. Expressed that way, a union of several loci might consist of a set 
of contig names attached to the bounding interval. Not sure whether that makes 
any sense; in the first approximation, I imagine something equivalent to 
storing each contig’s data in a separate table with a separate index, except 
derived from a single actual database table, but I have no clue for how to go 
about doing that.


Thanks,

—Gene




Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 18, 2017, at 5:00 AM, PostgreSQL - Hans-Jürgen Schönig 
>  wrote:
> 
> maybe this one is also helpful: 
> https://wiki.postgresql.org/images/1/1b/Postbis_pgcon_eu_2012.pdf 
> 
> it seems they have put a lot of work into that.

Yes, that is a curious piece of code and I will give it a test, although it is 
not related to the problem at hand. It is about representing sequences, while I 
am struggling to create a convenient representation of loci.

I liked the slide show you linked. It touched on a pertinent subject (although 
again, not related to loci). Just last week, I attempted to help a relative who 
works on a microbiome project to sort through the guidelines about the encoding 
of altitude, elevation, and depth thought up by somebody at EBI without much 
regard for their usability. We haven’t had much success in sorting that out 
because the guidelines were obviously written for other people to follow; their 
practical application resulted in total confusion.

For example, is this definition redundant or contradictory:

"The altitude of the sample is the vertical distance between Earth's surface 
above sea level and the sampled position in the air”

Is that AGL or AMSL? Why not use a dictionary definition? Because of stuff like 
that, I treat all products by biologists with suspicion, until proven workable 
— especially stuff developed by consortia of multiple institutions.

Anyway, thanks for the link.

—Gene

Re: genomic locus

2017-12-21 Thread Gene Selkov

> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 
> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selko...@gmail.com> wrote:
>> I need a data type to represent genomic positions, which will consist of a
>> string and a pair of integers with interval logic and access methods. Sort
>> of like my seg type, but more straightforward.
> 
> Have you thought about just using a composite type?

Yes, I have. That is sort of what I have been doing; a composite type certainly 
gets the job done but I don’t feel it reduces query complexity, at least from 
the user’s point of view. Maybe I don’t know enough.

Here’s an example of how I imagine a composite genomic locus (conventionally 
represented as text ‘:’ integer ‘-‘ integer):

CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab

Where test_locus.tab has stuff like:

(chr3,"[178916937,178916940]")  GAA CHP2_PIK3CA_2
(chr3,"[178916939,178916948]")  AGGAT   CHP2_PIK3CA_2
(chr3,"[178916940,178916941]")  G   A   CHP2_PIK3CA_2
(chr3,"[178916943,178916944]")  A   G   CHP2_PIK3CA_2
(chr3,"[178916943,178916946]")  AAG CHP2_PIK3CA_2
(chr3,"[178916943,178916952]")  AAGATCCTC   CHP2_PIK3CA_2
(chr3,"[178916944,178916945]")  A   G   CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G   C   CHP2_PIK3CA_2
(chr3,"[178916945,178916946]")  G   T   CHP2_PIK3CA_2
(chr3,"[178916945,178916948]")  GAT CHP2_PIK3CA_2

When the table is loaded, I can pull the subset shown above with this query:

SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
'[178916937, 178916948]’;
  pos   |ref| alt |  id
+---+-+---
 (chr3,"[178916937,178916941)") | GAA   | | CHP2_PIK3CA_2
 (chr3,"[178916939,178916949)") | AGGAT | | CHP2_PIK3CA_2
 . . . .

So far so good. It gets the job done. However, it is only a small step towards 
a fully encapsulated, monolithic type I want it to be. The above query It is 
marginally better than its atomic-type equivalent:

SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
least(stop, 178916948);
 contig |   start   |   stop|ref| alt |  id
+---+---+---+-+---
 chr3   | 178916937 | 178916940 | GAA   | | CHP2_PIK3CA_2
 chr3   | 178916939 | 178916948 | AGGAT | | CHP2_PIK3CA_2
 . . . .

and it requires addition syntax transformations steps to go from conventional 
locus representation 'chr3:178916937-178916940' to composite 
'(chr3,"[178916937,178916940]”)’ and back.

Of course, the relative benefits of partial encapsulation I achieve by bundling 
text with int4range accumulate, compared to (text, int4, int4), as queries grow 
more complex. But because the elements of a composite type still require a 
separate query term for each of them (unless there is some magic I am not aware 
of), the complexity of a typical query I need to run exceeds my feeble 
sight-reading capacity. I want things that are conceptually simple to be 
expressed in simple terms, if possible.

Like so:

CREATE EXTENSION locus;
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
\copy test_locus from data/oncomine.hotspot.tab

SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
   pos|ref| alt |  id
--+---+-+---
 chr3:178916937-178916940 | GAA   | | CHP2_PIK3CA_2
 chr3:178916939-178916948 | AGGAT | | CHP2_PIK3CA_2
 chr3:178916940-178916941 | G | A   | CHP2_PIK3CA_2
 chr3:178916943-178916944 | A | G   | CHP2_PIK3CA_2
 chr3:178916943-178916946 | AAG   | | CHP2_PIK3CA_2
 chr3:178916943-178916952 | AAGATCCTC | | CHP2_PIK3CA_2
 chr3:178916944-178916945 | A | G   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G | C   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G | T   | CHP2_PIK3CA_2
 chr3:178916945-178916948 | GAT   | | CHP2_PIK3CA_2
(10 rows)

I have encountered some pesky geometry / indexing problems while building this 
extension (https://github.com/selkovjr/locus 
<https://github.com/selkovjr/locus>), but I hope I can solve them at least at 
the level afforded by the composite type, while keeping the clean interface of 
a monolithic type. I understand I could probably achieve the same cleanliness 
by defining functions and operators over the complex type, but by the time I’m 
done with that, will I have coded about the same amount of stuff as required to 
build an extended type?


Regards,

—Gene






Re: genomic locus

2017-12-21 Thread Gene Selkov
Nice work, Andrew!

I wish I knew about it earlier.

> On Dec 16, 2017, at 8:53 AM, Andrew Dunstan  
> wrote:
> 
> I keep most of the out-of-tree extensions I maintain green by building
> and testing them in a buildfarm member. That way I become aware pretty
> quickly if any API change has broken them, as happened just the other
> day in fact. To do this requires writing a small perl paqckage. There
> are three examples in the buildfarm client sources at
> https://github.com/PGBuildFarm/client-code/tree/master/PGBuild/Modules
> and one of these is included in the buildfarm client releases.



Re: genomic locus

2017-12-19 Thread Andrey Zhidenkov
Uncertain type is a great idea. I needed type like this to store
information about premiere date or people birth date in movie database.
Finally I made two columns:  and . Precocious could me
'year', 'month' and so on and if precocious was 'year' the date should
always had to be like '-01-01'. I have just read your presentation from
2011 and have a question - is it possible to compare two uncertain dates
like 'January 2017' and 'year 2017'?

On 19 Dec. 2017 23:22, "Oleg Bartunov"  wrote:

On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov  wrote:
> Greetings everyone,

Привет !

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql). If you need to specify distribution
function, than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data
type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but
never
started :( It'd be nice if you start this very interesting for science
project.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf



>
> I noticed somebody took a good care of seg while I was away for the last
20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it
to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that
I
> have not maintained since 2009 and I now I am unable to fix some of them,
so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of
principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene


Re: genomic locus

2017-12-19 Thread Oleg Bartunov
On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov  wrote:
> Greetings everyone,

Привет !

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql). If you need to specify distribution
function, than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but never
started :( It'd be nice if you start this very interesting for science project.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf



>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene


hdate.sql
Description: Binary data


Re: genomic locus

2017-12-18 Thread Craig Ringer
On 16 December 2017 at 03:49, Gene Selkov  wrote:

> Greetings everyone,
>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.
>
> I noticed somebody took a good care of seg while I was away for the last
> 20 years, and I am extremely grateful for that. I have been using it. In
> the meantime, things have changed and now I am almost clueless about how
> you deal with contributed modules and what steps I should take once I get
> it to work. Also, is it a good idea to clone and fix seg for this purpose,
> or is there a more appropriate template? Or maybe just building it from
> scratch will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that
> I have not maintained since 2009 and I now I am unable to fix some of them,
> so I wonder how much of old knowledge is still applicable. In other words,
> is what I see in new code just a change of macros or the change of
> principles?
>

Welcome back!

I think your chances of modifying 'seg' are slim, since you'd surely have
to change the on-disk format and that'd break existing users. Plus probably
change the API.

If you think it'd make logical sense to extend seg with a string descriptor
of some sort and could come up with a name/use case that's not quite so
narrowly focused as genetics alone, then I could see adding it as a
secondary type in the same extension.

But it's more likely that the best course would be to extract the seg
extension from core, rename it, hack it as desired, and build it as an
extension maintained out-of-tree.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: genomic locus

2017-12-18 Thread PostgreSQL - Hans-Jürgen Schönig
hello …

maybe this one is also helpful: 
https://wiki.postgresql.org/images/1/1b/Postbis_pgcon_eu_2012.pdf
it seems they have put a lot of work into that.

regards,

hans


> On 15 Dec 2017, at 20:49, Gene Selkov  wrote:
> 
> Greetings everyone,
> 
> I need a data type to represent genomic positions, which will consist of a 
> string and a pair of integers with interval logic and access methods. Sort of 
> like my seg type, but more straightforward.
> 
> I noticed somebody took a good care of seg while I was away for the last 20 
> years, and I am extremely grateful for that. I have been using it. In the 
> meantime, things have changed and now I am almost clueless about how you deal 
> with contributed modules and what steps I should take once I get it to work. 
> Also, is it a good idea to clone and fix seg for this purpose, or is there a 
> more appropriate template? Or maybe just building it from scratch will be a 
> better idea?
> 
> I have seen a lot of bit rot in other extensions (never contributed) that I 
> have not maintained since 2009 and I now I am unable to fix some of them, so 
> I wonder how much of old knowledge is still applicable. In other words, is 
> what I see in new code just a change of macros or the change of principles?
> 
> Thanks,
> 
> --Gene

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de , 
http://www.cybertec.at 






Re: genomic locus

2017-12-17 Thread Robert Haas
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov  wrote:
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.
>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Have you thought about just using a composite type?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: genomic locus

2017-12-16 Thread Andrew Dunstan


On 12/15/2017 05:50 PM, Michael Paquier wrote:
>
>> I have seen a lot of bit rot in other extensions (never contributed) that I
>> have not maintained since 2009 and I now I am unable to fix some of them, so
>> I wonder how much of old knowledge is still applicable. In other words, is
>> what I see in new code just a change of macros or the change of principles?
> APIs in Postgres are usually stable. You should be able to update your
> own extensions. If you want to discuss about a couple of things in
> particular, don't hesitate!


I keep most of the out-of-tree extensions I maintain green by building
and testing them in a buildfarm member. That way I become aware pretty
quickly if any API change has broken them, as happened just the other
day in fact. To do this requires writing a small perl paqckage. There
are three examples in the buildfarm client sources at
https://github.com/PGBuildFarm/client-code/tree/master/PGBuild/Modules
and one of these is included in the buildfarm client releases.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: genomic locus

2017-12-15 Thread Michael Paquier
On Sat, Dec 16, 2017 at 4:49 AM, Gene Selkov  wrote:
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Wah. You are the author of the seg module and your name is listed in a
set of commits from 2000:
https://www.postgresql.org/docs/devel/static/seg.html
seg is now shaped as what is called an extension (see
https://www.postgresql.org/docs/9.6/static/external-extensions.html),
which is made roughly a library and a set of SQL commands that the
server can load automatically after enabling them with the command
CREATE EXTENSION. If you wish to fix seg in some way, you could always
patch them. But I am not sure what you are trying to fix, so more
details would be welcome.

> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

APIs in Postgres are usually stable. You should be able to update your
own extensions. If you want to discuss about a couple of things in
particular, don't hesitate!
-- 
Michael



genomic locus

2017-12-15 Thread Gene Selkov
Greetings everyone,

I need a data type to represent genomic positions, which will consist of a
string and a pair of integers with interval logic and access methods. Sort
of like my seg type, but more straightforward.

I noticed somebody took a good care of seg while I was away for the last 20
years, and I am extremely grateful for that. I have been using it. In the
meantime, things have changed and now I am almost clueless about how you
deal with contributed modules and what steps I should take once I get it to
work. Also, is it a good idea to clone and fix seg for this purpose, or is
there a more appropriate template? Or maybe just building it from scratch
will be a better idea?

I have seen a lot of bit rot in other extensions (never contributed) that I
have not maintained since 2009 and I now I am unable to fix some of them,
so I wonder how much of old knowledge is still applicable. In other words,
is what I see in new code just a change of macros or the change of
principles?

Thanks,

--Gene