Re: [HACKERS] pgbench throttling latency limit

2014-09-13 Thread Fabien COELHO



[about logging...]


Here is an attempt at updating the log features, including the aggregate 
and sampling stuff, with skipped transactions under throttling.


I moved the logging stuff into a function which is called when a 
transaction is skipped or finished.
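For illustration only, here is a minimal, self-contained sketch of the skip 
decision described above. The names are hypothetical and this is not the 
patch code; the real logic lives in the throttling and logging paths of 
pgbench.c in the diff below.

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: under throttling, a transaction whose scheduled
 * start time is already more than the latency limit in the past is never
 * started; it is only logged and counted as skipped.
 */
static bool
txn_should_be_skipped(int64_t now_us, int64_t txn_scheduled_us,
					  int64_t latency_limit_us)
{
	return latency_limit_us > 0 &&
		(now_us - txn_scheduled_us) > latency_limit_us;
}

A skipped transaction then goes through the same logging function as a 
finished one, just flagged as skipped (or, per the format question below, 
with "-" in the latency column).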


From a log file format perspective, I think that "-" would be better than 
"skipped".

--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 087e0d3..72f00d8 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -295,6 +311,8 @@ typedef struct
 	double		max_lag;
 	double		sum_lag;		/* sum(lag) */
 	double		sum2_lag;		/* sum(lag*lag) */
+
+	int			skipped;		/* skipped transactions under --rate --limit */
 } AggVals;
 
 static Command **sql_files[MAX_FILES];	/* SQL script files */
@@ -372,6 +390,10 @@ usage(void)
 		   -f, --file=FILENAME  read transaction script from FILENAME\n
 		 -j, --jobs=NUM   number of threads (default: 1)\n
 		 -l, --log  write transaction times to log file\n
+		 -L, --limit=NUM  count transactions lasting more than NUM ms.\n
+		  under throttling (--rate), transactions behind schedule\n
+		  more than NUM ms are skipped, and those finishing more\n
+		  than NUM ms after their scheduled start are counted.\n
 		 -M, --protocol=simple|extended|prepared\n
 		  protocol for submitting queries (default: simple)\n
 		 -n, --no-vacuum  do not run VACUUM before tests\n
@@ -991,13 +1013,13 @@ void
 agg_vals_init(AggVals *aggs, instr_time start)
 {
 	/* basic counters */
-	aggs->cnt = 0;				/* number of transactions */
+	aggs->cnt = 0;				/* number of transactions (some were possibly skipped) */
 	aggs->sum_latency = 0;		/* SUM(latency) */
 	aggs->sum2_latency = 0;		/* SUM(latency*latency) */
 
 	/* min and max transaction duration */
-	aggs->min_latency = 0;
-	aggs->max_latency = 0;
+	aggs->min_latency = -1.0;
+	aggs->max_latency = -1.0;
 
 	/* schedule lag counters */
 	aggs->sum_lag = 0;
@@ -1005,10 +1027,173 @@ agg_vals_init(AggVals *aggs, instr_time start)
 	aggs->min_lag = 0;
 	aggs->max_lag = 0;
 
+	/* skipped transactions under --rate && --limit */
+	aggs->skipped = 0;
+
 	/* start of the current interval */
 	aggs->start_time = INSTR_TIME_GET_DOUBLE(start);
 }
 
+/* generate log
+ */
+static void doLog(TState *thread, CState *st, FILE *logfile, AggVals *agg, bool skipped)
+{
+	/*
+	 * write the log entry if this row belongs to the random sample,
+	 * or no sampling rate was given which means log everything.
+	 */
+	if (sample_rate == 0.0 ||
+		pg_erand48(thread->random_state) <= sample_rate)
+	{
+		double		lag;
+		double		latency;
+		instr_time	now;
+
+		INSTR_TIME_SET_CURRENT(now);
+
+		latency = skipped ? -1.0 : (double) (INSTR_TIME_GET_MICROSEC(now) - st->txn_scheduled);
+		lag = (double) (INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled);
+
+		/* should we aggregate the results or not? */
+		if (agg_interval > 0)
+		{
+			/*
+			 * are we still in the same interval? if yes, accumulate
+			 * the values (print them otherwise)
+			 */
+			if (agg->start_time + agg_interval >= INSTR_TIME_GET_DOUBLE(now))
+			{
+agg->cnt += 1;
+if (skipped)
+	agg->skipped += 1;
+else
+{
+	/* there is no latency to record if the transaction was skipped */
+	agg->sum_latency += latency;
+	agg->sum2_latency += latency * latency;
+
+	/* first non-skipped in this aggregation interval */
+	if ((agg->min_latency == -1.0) || (latency < agg->min_latency))
+		agg->min_latency = latency;
+
+	if ((agg->max_latency == -1.0) || (latency > agg->max_latency))
+		agg->max_latency = latency;
+}
+
+/* 

Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread Fabien COELHO



Attached is an updated version of the patch.


Ok.

I notice that you decided against adding tags around function and type 
names.



It's really not about the IEEE changing something, but about someone
changing the Wikipedia page. The way I linked it makes sure it always
displays the same version of the page.


I understood that. I think that the likelihood of someone removing the 
section about rounding from the IEEE standard is very low, and the current 
php link is pretty ugly from a REST perspective and would prevent seeing 
possibly improved versions of the page. Well, no big deal.



Of course a general rule how to link to WP would be nice ...


I'm afraid that the current implicit rule is more or less "no links"; at 
least there are very few of them outside the glossary, and when I submitted 
docs with them they were removed before committing.


However I do support the idea that these links are useful references and 
should be put where appropriate, even if it means that there must be some 
updates from time to time.


--
Fabien.




Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Alexander Korotkov
On Tue, Aug 19, 2014 at 2:02 PM, David Rowley dgrowle...@gmail.com wrote:

 Here's a few notes from reading over the code:

 * pathkeys.c

   EquivalenceMember *member = (EquivalenceMember *)
  lfirst(list_head(key->pk_eclass->ec_members));

 You can use linitial() instead of lfirst(list_head()). The same thing
 occurs in costsize.c


Fixed.


 * pathkeys.c

 The following fragment:

 n = pathkeys_common(root->query_pathkeys, pathkeys);

 if (n != 0)
 {
 /* It's useful ... or at least the first N keys are */
  return n;
 }

 return 0; /* path ordering not useful */
 }

 Could just read:

 /* return the number of path keys in common, or 0 if there are none */
  return pathkeys_common(root->query_pathkeys, pathkeys);


Fixed.


 * execnodes.h

 In struct SortState, some new fields don't have a comment.


Fixed.


 I've also thrown a few different workloads at the patch and I'm very
 impressed with most of the results. Especially when LIMIT is used, however
 I've found a regression case which I thought I should highlight, but for
 now I can't quite see what could be done to fix it.

 create table a (x int not null, y int not null);
 insert into a select x.x,y.y from generate_series(1,100) x(x) cross
 join generate_series(1,10) y(y);

 Patched:
 explain analyze select x,y from a where x+0=1 order by x,y limit 10;
  QUERY PLAN

 
  Limit  (cost=92.42..163.21 rows=10 width=8) (actual
 time=6239.426..6239.429 rows=10 loops=1)
->  Partial sort  (cost=92.42..354064.37 rows=5 width=8) (actual
 time=6239.406..6239.407 rows=10 loops=1)
  Sort Key: x, y
  Presorted Key: x
  Sort Method: quicksort  Memory: 25kB
  ->  Index Scan using a_x_idx on a  (cost=0.44..353939.13
 rows=5 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 990
  Planning time: 0.212 ms
  Execution time: 6239.505 ms
 (10 rows)


 Time: 6241.220 ms

 Unpatched:
 explain analyze select x,y from a where x+0=1 order by x,y limit 10;
  QUERY PLAN

 
  Limit  (cost=195328.26..195328.28 rows=10 width=8) (actual
 time=3077.759..3077.761 rows=10 loops=1)
->  Sort  (cost=195328.26..195453.26 rows=5 width=8) (actual
 time=3077.757..3077.757 rows=10 loops=1)
  Sort Key: x, y
  Sort Method: quicksort  Memory: 25kB
  ->  Seq Scan on a  (cost=0.00..194247.77 rows=5 width=8)
 (actual time=0.018..3077.705 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 990
  Planning time: 0.510 ms
  Execution time: 3077.837 ms
 (9 rows)


 Time: 3080.201 ms

 As you can see, the patched version performs an index scan in order to get
 the partially sorted results, but it does end up quite a bit slower than
 the seqscan/sort that the unpatched master performs. I'm not quite sure how
 realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
 something like x+y = 1 was performed too. After a bit more analysis on
 this, I see that if I change the 50k estimate to 10 in the debugger, the
 num_groups is properly estimated at 1 and it then performs the seq scan
 instead. So it looks like the costings of the patch are not to blame here.
 (The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT)


Yes, the error comes from the assumed 50k row estimate. I've checked a
similar example where the estimate is fine.

create table b as (select x.x,y.y,x.x z from generate_series(1,100)
x(x) cross join generate_series(1,10) y(y));
create index b_x_idx on b(x);
analyze b;

There is a column z which is neither in the index nor in the ORDER BY
clause. If we replace x+0=1 with z=1, the optimizer doesn't decide to use
the partial sort.

explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──
 Limit  (cost=179056.59..179056.61 rows=10 width=12) (actual
time=1072.498..1072.500 rows=10 loops=1)
   ->  Sort  (cost=179056.59..179056.63 rows=18 width=12) (actual
time=1072.495..1072.495 rows=10 loops=1)
 Sort Key: x, y
 Sort Method: quicksort  Memory: 25kB
 ->  Seq Scan on b  (cost=0.00..179056.21 rows=18 width=12) (actual
time=0.020..1072.454 rows=10 loops=1)
   Filter: (z = 1)
   Rows Removed by Filter: 990
 Planning time: 0.501 ms
 Execution time: 1072.555 ms
(9 rows)

If we even force the optimizer to use the partial sort, then the cost
estimation is fine.

set enable_seqscan = off;
explain 

Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread David G Johnston
Fabien COELHO-3 wrote
 Of course a general rule how to link to WP would be nice ...
 
 I'm afraid that the current implicit rule is more or less no links, at 
 least there are very few of them but in the glossary, and when I submitted 
 docs with them they were removed before committing.

Ideally if external links were to be allowed PostgreSQL.org would maintain a
URL shortening service and those shortened links would be in the
documentation so that any necessary pointers could be changed immediately
and without source updates.

David J.








Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-13 Thread Fabien COELHO



Of course a general rule how to link to WP would be nice ...


I'm afraid that the current implicit rule is more or less no links, at
least there are very few of them but in the glossary, and when I submitted
docs with them they were removed before committing.


Ideally if external links were to be allowed PostgreSQL.org would maintain a
URL shortening service and those shortened links would be in the
documentation so that any necessary pointers could be changed immediately
and without source updates.


Why not. This would mean maintaining the redirection table somewhere. That 
would also be a place to check whether some links have become dangling 
pointers. However, there are also implications for the infrastructure needed 
to host these redirections. Maybe one page on wiki.postgresql.org could be 
used for this purpose.


--
Fabien.




Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED

2014-09-13 Thread Fabrízio de Royes Mello
On Tue, Aug 26, 2014 at 1:42 AM, Fabrízio de Royes Mello 
fabriziome...@gmail.com wrote:


 On Fri, Aug 22, 2014 at 5:23 PM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:
 
  Fabrízio de Royes Mello wrote:
   On Fri, Aug 22, 2014 at 4:45 PM, Alvaro Herrera 
alvhe...@2ndquadrant.com
   wrote:
 
I pointed out, in the email just before pushing the patch, that
perhaps
we should pass down the new relpersistence flag into
finish_heap_swap,
and from there we could pass it down to reindex_index() which is
where
it would be needed.  I'm not sure it's worth the trouble, but I
think we
can still ask Fabrizio to rework that part.
 
   I can rework this part if it's a real concern.
 
  I guess we can make a better assessment by seeing what it would take.
  I'm afraid it will turn out to be really ugly.
 
  Now, there's some desire to have unlogged indexes on logged tables; I
  guess if we have that, then eventually there will also be a desire to
  swap individual indexes from logged to unlogged.  Perhaps whatever fix
  we come up with here would pave the road for that future feature.
 

 Álvaro,

 I did a refactoring to pass down the relpersistence to finish_heap_swap
and then change the catalog inside the reindex_index instead of in the
rewrite table phase.

 That way we can get rid of the function ATChangeIndexesPersistence in
src/backend/commands/tablecmds.c.

 Thoughts?


Patch rebased and added to commitfest [1].

Regards,

[1] https://commitfest.postgresql.org/action/commitfest_view?id=24

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ee10594..173f412 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1973,7 +1973,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3099,7 +3099,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 heapRelation;
@@ -3160,6 +3160,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 			indexInfo->ii_ExclusionStrats = NULL;
 		}
 
+		/*
+		 * Check if need to set the new relpersistence
+		 */
+		if (iRel->rd_rel->relpersistence != relpersistence)
+			iRel->rd_rel->relpersistence = relpersistence;
+
 		/* We'll build a new physical relation for the index */
 		RelationSetNewRelfilenode(iRel, InvalidTransactionId,
   InvalidMultiXactId);
@@ -3358,11 +3364,19 @@ reindex_relation(Oid relid, int flags)
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
 			char		relpersistence = rel->rd_rel->relpersistence;
 
 			if (is_pg_class)
 RelationSetIndexList(rel, doneIndexes, InvalidOid);
 
-			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+			/* Check for flags to force UNLOGGED or PERMANENT persistence */
+			if (flags & REINDEX_REL_FORCE_UNLOGGED)
+				relpersistence = RELPERSISTENCE_UNLOGGED;
+			else if (flags & REINDEX_REL_FORCE_PERMANENT)
+				relpersistence = RELPERSISTENCE_PERMANENT;
+
+			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+						  relpersistence);
 
 			CommandCounterIncrement();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ff80b09..f285026 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -588,7 +588,8 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
 	 */
 	finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
 	 swap_toast_by_content, false, true,
-	 frozenXid, cutoffMulti);
+	 frozenXid, cutoffMulti,
+	 OldHeap->rd_rel->relpersistence);
 }
 
 
@@ -1474,7 +1475,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
  bool check_constraints,
  bool is_internal,
  TransactionId frozenXid,
- MultiXactId cutoffMulti)
+ MultiXactId cutoffMulti,
+ char newrelpersistence)
 {
 	ObjectAddress object;
 	Oid			mapped_tables[4];
@@ -1518,6 +1520,12 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
 	if (check_constraints)
 		reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+
+	if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+		reindex_flags |= 

Re: [HACKERS] formatting.c

2014-09-13 Thread Bruce Momjian
On Fri, Sep 12, 2014 at 11:48:03PM -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Is everyone OK with me renaming some variables, structures, and macros? 
  It will make back-patching harder, but will allow us to properly maintain
  that file.
 
 The back-patching problem could be addressed by back-patching the
 renaming.  If it's purely a mechanical thing, there should be minimal risk
 no?  I would claim that messed-up back-patches would have non-negligible
 risk too, so you can't say that not back-patching is clearly safer.

Uh, yes, it is mechanical and could be backpatched as well.

 Now, whether the renaming actually makes things any clearer is something
 I reserve judgment on.

What the code does now is call almost everything a "number", including
the format values and the binary and string representations of the number, etc.
You can take a look at reverted commit
f68dc5d86b9f287f80f4417f5a24d876eb13771d to see an example of the
renaming; that is probably only half the job.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Andres Freund
On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
 On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com wrote:
  That's not entirely true. CRC-32C beats pretty much everything with the same
  length quality-wise and has both hardware implementations and highly
  optimized software versions.
 
 For better or for worse CRC is biased by detecting all single bit
 errors, the detection capability of larger errors is slightly
 diminished. The quality of the other algorithms I mentioned is also
 very good, while producing uniformly varying output.

There's also much more literature about the various CRCs in comparison
to some of these hash algorithms. Pretty much everything tests how well
they're suited for hashtables, but that's not really what we need
(although it might not hurt *at all* to have something faster there...).

I do think we need to think about the types of errors we really have to
detect. It's not all that clear that the typical guarantees/tests for
either CRCs or checksums (smhasher, whatever) are very
representative...

 CRC has exactly
 one hardware implementation in general purpose CPU's and Intel has a
 patent on the techniques they used to implement it. The fact that AMD
 hasn't yet implemented this instruction shows that this patent is
 non-trivial to work around.

I think AMD has implemented SSE4.2 with Bulldozer. It's still only recent
x86 though. So I think there are good reasons for moving away from it.

How one could get patents on exposing hardware CRC implementations -
hard to find a computing device without one - as an instruction is beyond
me...

I think it's pretty clear by now that we should move to lz4 for a couple
of things - and lz4 bundles xxhash with it. So that's one argument for it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
 On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com wrote:
 That's not entirely true. CRC-32C beats pretty much everything with the same
 length quality-wise and has both hardware implementations and highly
 optimized software versions.

 For better or for worse CRC is biased by detecting all single bit
 errors, the detection capability of larger errors is slightly
 diminished. The quality of the other algorithms I mentioned is also
 very good, while producing uniformly varying output.

 There's also much more literature about the various CRCs in comparison
 to some of these hash allgorithms.

Indeed.  CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly?  No.  I'm unimpressed by evidence-free claims that
something else is also very good.

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines.  You could
certainly make an argument that that's not the type of threat we face
for PG data.  However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.

regards, tom lane




[HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Rohit Goyal
Hi All,

I want to work on the code that handles the intermediate dataset of SELECT
and UPDATE queries.

For example:

Rohit's salary has been updated 4 times, so it has 4 different versions of
the salary.

I want to select the salary of a person named Rohit. Now suppose that, in
the intermediate result, I found 4 different versions of the data. I want to
know which portion of the code I need to look at in order to work on all 4
versions in the dataset. :)

Thanks in advance!!

Regards,
Rohit Goyal





Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Atri Sharma
On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal rhtgyl...@gmail.com wrote:


 Hi All,

 I want to work on the code of intermediate dataset of select and update
 query.

 For example.

 Rohit's salary has been updated 4 times, so it has 4 different version of
 salary.

 I want to select  salary of person named Rohit. Now suppose , in
 intermediate result, I found 4 different versions of the data. I want to
 know the code portion which i need to look for working on all 4 versions in
 dataset. :)

 Thanks in advance!!



Not sure what you are looking for, but each UPDATE is an insert of a new
tuple with the new values, with the old tuple marked as deleted.

There is no need to track the versions of any changes in the data set. Each
update operation leaves only one visible tuple. If the transaction commits,
the inserted tuple becomes visible and the old row is marked deleted. If the
transaction rolls back, only the old tuple remains visible.
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread David G Johnston
Atri Sharma wrote
 On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal <rhtgyl.87@> wrote:
 

 Hi All,

 I want to work on the code of intermediate dataset of select and update
 query.

 For example.

 Rohit's salary has been updated 4 times, so it has 4 different version of
 salary.

 I want to select  salary of person named Rohit. Now suppose , in
 intermediate result, I found 4 different versions of the data. I want to
 know the code portion which i need to look for working on all 4 versions
 in
 dataset. :)

 Thanks in advance!!



 Not sure what you are looking for, but each update is an insert of a new
 tuple with the new values and marking the old tuple as deleted.
 
 There is no need for tracking the versions of any changes in data set.
 Each
 update operation leaves only one visible tuple. If the transaction
 commits,
 inserted tuple becomes visible and old row is marked deleted. If the
 transaction rollbacks, only the old tuple shall remain visible.
 -- 
 Regards,
 
 Atri
 *l'apprenant*

Or rather, even if you want to be able to reference the older versions of
that record, there is nothing in PostgreSQL to facilitate that. You have to
manually create and manage the data so that you know during what time period
a given record is valid.

David J.








Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Atri Sharma
On Sat, Sep 13, 2014 at 11:52 PM, David G Johnston 
david.g.johns...@gmail.com wrote:

 Atri Sharma wrote
  On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal lt;



 Or rather even if you want to be able to reference the older versions of
 that record there is nothing in PostgreSQL to facilitate that. You have to
 manually create and manage the data so that you know during what time
 period
 a given record is valid.

 David J.





Sometimes I do miss 'time travel' we used to have :)

-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread k...@rice.edu
On Sat, Sep 13, 2014 at 12:55:33PM -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
  On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com wrote:
  That's not entirely true. CRC-32C beats pretty much everything with the 
  same
  length quality-wise and has both hardware implementations and highly
  optimized software versions.
 
  For better or for worse CRC is biased by detecting all single bit
  errors, the detection capability of larger errors is slightly
  diminished. The quality of the other algorithms I mentioned is also
  very good, while producing uniformly varying output.
 
  There's also much more literature about the various CRCs in comparison
  to some of these hash allgorithms.
 
 Indeed.  CRCs have well-understood properties for error detection.
 Have any of these new algorithms been analyzed even a hundredth as
 thoroughly?  No.  I'm unimpressed by evidence-free claims that
 something else is also very good.
 
 Now, CRCs are designed for detecting the sorts of short burst errors
 that are (or were, back in the day) common on phone lines.  You could
 certainly make an argument that that's not the type of threat we face
 for PG data.  However, I've not seen anyone actually make such an
 argument, let alone demonstrate that some other algorithm would be better.
 To start with, you'd need to explain precisely what other error pattern
 is more important to defend against, and why.
 
   regards, tom lane
 

Here is a blog on the development of xxhash:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html

Regards,
Ken




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.  Make the strings different near the end,
 but not at the beginning.  Then have the program either do strcoll()
 on every pair (O(n^2)) or, with a #define, memcmp() followed by
 strcoll() on every pair.  It should be easy to see whether the
 memcmp() adds 1% or 25% or 100%.

I get where you're coming from now (I think). It seems like you're
interested in getting a sense of what it would cost to do an
opportunistic memcmp() in a universe where memory latency isn't by far
the dominant cost (which it clearly is, as evidenced by my most recent
benchmark where a variant of Heikki's highly unsympathetic SQL test
case showed a ~1% regression). You've described a case with totally
predictable access patterns, perfectly suited to prefetching, and with
no other memory access bottlenecks. As I've said, this seems
reasonable (at least with those caveats in mind).

The answer to your high level question appears to be: the
implementation (the totally opportunistic memcmp() == 0 thing)
benefits from the fact that we're always bottlenecked on memory, and
to a fairly appreciable degree. I highly doubt that this is something
that can fail to work out with real SQL queries, but anything that
furthers our understanding of the optimization is a good thing. Of
course, within tuplesort we're not really getting the totally
opportunistic memcmp()s for free - rather, we're using a resource
that we would not otherwise be able to use at all.

This graph illustrates the historic trends of CPU and memory performance:

http://www.cs.virginia.edu/stream/stream_logo.gif

I find this imbalance quite remarkable - no wonder researchers are
finding ways to make the best of the situation. To make matters worse,
the per-core trends for memory bandwidth are now actually *negative
growth*. We may actually be going backwards, if we assume that that's
the bottleneck, and that we cannot parallelize things.

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds

duration of comparisons with useless memcmp()s: 17.720501 seconds


It looks like about a 30% increase in run time when we always have
useless memcmps().

You can change the constants around easily - let's consider 64 KiB
strings now (by changing STRING_SIZE to 65536). In order to make the
program not take too long, I also reduce the number of strings
(N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as
output:


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds

duration of comparisons with useless memcmp()s: 14.542997 seconds


It looks like the overhead here is surprisingly consistent with the
first run - again, about a 30% increase in run time.

As for 1MiB strings (this time, with an N_STRINGS of 300):


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds

duration of comparisons with useless memcmp()s: 35.332728 seconds


So, at this scale, the overhead gets quite a bit worse, but the case
also becomes quite a bit less representative (if that's even
possible). I suspect that the call stack's stupidly large size may be
a problem here, but I didn't try and fix that.

Does this answer your question? Are you intent on extrapolating across
different platforms based on this test program, rather than looking at
real SQL queries? While a certain amount of that makes sense, I think
we should focus on queries that have some chance of being seen in real
production PostgreSQL instances. Failing that, actual SQL queries.
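For anyone who does not want to open the attachment, here is a condensed, 
self-contained sketch of the kind of timing loop involved. It is 
illustrative only: the constants are arbitrary, there is no setlocale() 
call (so it runs in the C locale), and the attached program remains the 
authoritative version.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define STRING_SIZE 32
#define N_STRINGS   1000

static char strings[N_STRINGS][STRING_SIZE + 1];

static double
now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int
main(void)
{
	volatile int sink = 0;
	double		start,
				base,
				with_memcmp;
	int			i,
				j;

	/* strings identical except in the last two bytes */
	for (i = 0; i < N_STRINGS; i++)
	{
		memset(strings[i], 'a', STRING_SIZE);
		snprintf(strings[i] + STRING_SIZE - 2, 3, "%02d", i % 100);
	}

	/* baseline: strcoll() only */
	start = now_seconds();
	for (i = 0; i < N_STRINGS; i++)
		for (j = 0; j < N_STRINGS; j++)
			sink += strcoll(strings[i], strings[j]);
	base = now_seconds() - start;

	/* "useless" memcmp() first, then the authoritative strcoll() */
	start = now_seconds();
	for (i = 0; i < N_STRINGS; i++)
		for (j = 0; j < N_STRINGS; j++)
		{
			if (memcmp(strings[i], strings[j], STRING_SIZE) == 0)
				sink++;
			sink += strcoll(strings[i], strings[j]);
		}
	with_memcmp = now_seconds() - start;

	printf("strcoll only: %f s, memcmp + strcoll: %f s (%d)\n",
		   base, with_memcmp, sink);
	return 0;
}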

-- 
Peter Geoghegan
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE		32
#define N_STRINGS		30000
#define LAST_N_DIFFER	2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec < 0) \
		{ \
			(x).tv_usec += 1000000; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 1000000.0)

#define Min(x, y)		((x) < (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII 

Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-13 Thread Gavin Flower

On 14/09/14 06:35, Atri Sharma wrote:



On Sat, Sep 13, 2014 at 11:52 PM, David G Johnston 
david.g.johns...@gmail.com wrote:


Atri Sharma wrote
 On Sat, Sep 13, 2014 at 11:06 PM, Rohit Goyal lt;



Or rather even if you want to be able to reference the older
versions of
that record there is nothing in PostgreSQL to facilitate that. You
have to
manually create and manage the data so that you know during what
time period
a given record is valid.

David J.





Sometimes I do miss 'time travel' we used to have :)

Regards,

Atri
--
Regards,
Atri
/l'apprenant/
That is only because the Guild of Time Travellers was formed, and we are 
very selective in whom we allow to join.  It was a massive undertaking 
to purge the knowledge of effective time travel from the general 
populace (H. G. Wells had to be expelled with a partial brain wipe)!   :-)


On a more serious note:
I did design and implement a system to allow what the original poster 
was after; it involved 2 tables for each logical table, and used both an 
EFFECTIVE_DATE & an AS_AT_DATE.  This allowed insurance quotes to be 
valid for a given period of time, even if the insurance rates were set to 
change after the quote was given (but before the quote expired).  This was 
about 15 years ago. It was amusing that my wife joined that team 10 
years after I left, and found 2 of my original colleagues still there!



Cheers,
Gavin


Re: [HACKERS] A mechanism securing web applications in DBMS

2014-09-13 Thread Stephen Frost
Zhaomo,

* Zhaomo Yang (zhy...@cs.ucsd.edu) wrote:
  Have you considered just using a regular, but unlogged, table?  That
  would also avoid any risk that the application manages to drop or shadow
  the temp table somehow with a fake table that changes who is currently
  authenticated, and avoids having to figure out how to deal with the temp
  table vanishing due to the connections going away.
 
 So then all the currently logged in users will be stored in the same
 table, which means we also need to make sure that the correct row in
 that table is used when the row-level security policy refers to the
 current application-level user.

Yes- but that's pretty trivially done, given that you've stipulated that
a single DB connection must be used from authentication until
de-authentication.  All that is needed is an additional column in the
auth table which is populated with a pseudo-random value which is
guaranteed to be unique and constant for the duration of the
authenticated time- and the database backend PID is perfect for that.
The auth function can call the pg_backend_pid() function directly and
then the policies can include a 'pid = pg_backend_pid()' as part of the
join to the auth table.  The auth function can also complain loudly if
an entry in the pid table is found with the current PID during auth (and
similar- the de-auth function can complain if an entry with the current
PID is *not* found).  This would eliminate the need for the on-connect
triggers, I believe (though those are interesting for other reasons..).

 Let me send you a copy of our paper in a separate email which is a
 thorough description of the mechanism (including background, threat
 model, how it works, etc), which should give you an better idea on
 every aspect of the mechanism. Please do not distribute it because it
 has been accepted for publication. Note that the implementation we
 show in the paper is just a prototype (we made the changes so that we
 could implement it quickly). Our goal always is to integrate our
 mechanism into open source DBMS's like PG and MySQL cleanly.

It'd be very interesting to see this done with the unlogged table,
security definer functions, and the row-level policies patch which we're
working on.  I'd further suggest that the application also use multiple
roles which are set noinherit and 'set role' based on the operation
which it's currently being used for- this would add another level of
protection.  Using stored procedures (for more than just the auth and
de-auth functions as suggested here) can also be a good idea.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 6:40 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be
 checked only for objects, not arrays. Also, should JsonbDeepContains do the
 same fast check when it deals with nested objects?

Attached patch implements something similar to what you describe here,
fixing your example.

I haven't added the optimization to JsonbDeepContains(). I think that
if anything, we should remove the optimization entirely, which is what
I've done -- an rhs "is it contained within?" value is hardly ever
going to be an object that has more pairs than the object we're
checking it is contained within. It's almost certainly going to have
far fewer pairs. Apart from only applying to objects, that
optimization just isn't an effective way of eliminating jsonb values
from consideration quickly. I'd rather not bother at all, rather than
having a complicated comment about why the optimization applies to
objects and not arrays.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c
new file mode 100644
index 2d071b2..6fcdbad
*** a/src/backend/utils/adt/jsonb_op.c
--- b/src/backend/utils/adt/jsonb_op.c
*** jsonb_contains(PG_FUNCTION_ARGS)
*** 117,124 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
! 		JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
--- 117,123 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
*** jsonb_contained(PG_FUNCTION_ARGS)
*** 137,144 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
! 		JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);
--- 136,142 
  	JsonbIterator *it1,
  			   *it2;
  
! 	if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
  		PG_RETURN_BOOL(false);
  
  	it1 = JsonbIteratorInit(&val->root);



Re: [HACKERS] run xmllint during build (was Re: need xmllint on borka)

2014-09-13 Thread Peter Eisentraut
On 8/21/14 4:28 PM, Fabien COELHO wrote:
 I'm fine with $(missing) or whatever else that provides a clear message.

I've committed the $(missing) use separately, and rebased this patch on
top of that.

From cb2787b2473186239eb23e7320654513df8e8fb7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut pete...@gmx.net
Date: Sat, 13 Sep 2014 20:42:16 -0400
Subject: [PATCH v2] doc: Check DocBook XML validity during the build

Building the documentation with XSLT does not check the DTD, like a
DSSSL build would.  One can often get away with having invalid XML, but
the stylesheets might then create incorrect output, as they are not
designed to handle that.  Therefore, check the validity of the XML
against the DTD, using xmllint, during the build.

Add xmllint detection to configure, and add some documentation.

xmllint comes with libxml2, which is already in use, but it might be in
a separate package, such as libxml2-utils on Debian.
---
 configure  | 43 +++
 configure.in   |  1 +
 doc/src/sgml/Makefile  |  9 -
 doc/src/sgml/docguide.sgml | 19 +--
 src/Makefile.global.in |  1 +
 5 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index 4918f95..873bcff 100755
--- a/configure
+++ b/configure
@@ -630,6 +630,7 @@ vpath_build
 PROVE
 OSX
 XSLTPROC
+XMLLINT
 DBTOEPUB
 COLLATEINDEX
 DOCBOOKSTYLE
@@ -14449,6 +14450,48 @@ fi
   test -n "$DBTOEPUB" && break
 done
 
+for ac_prog in xmllint
+do
+  # Extract the first word of "$ac_prog", so it can be a program name with args.
+set dummy $ac_prog; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_prog_XMLLINT+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -n "$XMLLINT"; then
+  ac_cv_prog_XMLLINT="$XMLLINT" # Let the user override the test.
+else
+as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+for ac_exec_ext in '' $ac_executable_extensions; do
+  if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ac_cv_prog_XMLLINT="$ac_prog"
+$as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
+break 2
+  fi
+done
+  done
+IFS=$as_save_IFS
+
+fi
+fi
+XMLLINT=$ac_cv_prog_XMLLINT
+if test -n "$XMLLINT"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $XMLLINT" >&5
+$as_echo "$XMLLINT" >&6; }
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+fi
+
+
+  test -n "$XMLLINT" && break
+done
+
 for ac_prog in xsltproc
 do
   # Extract the first word of $ac_prog, so it can be a program name with args.
diff --git a/configure.in b/configure.in
index 1392277..e3b17ea 100644
--- a/configure.in
+++ b/configure.in
@@ -1859,6 +1859,7 @@ PGAC_CHECK_DOCBOOK(4.2)
 PGAC_PATH_DOCBOOK_STYLESHEETS
 PGAC_PATH_COLLATEINDEX
 AC_CHECK_PROGS(DBTOEPUB, dbtoepub)
+AC_CHECK_PROGS(XMLLINT, xmllint)
 AC_CHECK_PROGS(XSLTPROC, xsltproc)
 AC_CHECK_PROGS(OSX, [osx sgml2xml sx])
 
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 1d42be8..8bdd26c 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -44,6 +44,10 @@ ifndef OSX
 OSX = $(missing) osx
 endif
 
+ifndef XMLLINT
+XMLLINT = $(missing) xmllint
+endif
+
 ifndef XSLTPROC
 XSLTPROC = $(missing) xsltproc
 endif
@@ -78,6 +82,7 @@ override SPFLAGS += -wall -wno-unused-param -wno-empty -wfully-tagged
 man distprep-man: man-stamp
 
 man-stamp: stylesheet-man.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_MAN_FLAGS) $^
 	touch $@
 
@@ -254,11 +259,13 @@ endif
 xslthtml: xslthtml-stamp
 
 xslthtml-stamp: stylesheet.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
 	cp $(srcdir)/stylesheet.css html/
 	touch $@
 
 htmlhelp: stylesheet-hh.xsl postgres.xml
+	$(XMLLINT) --noout --valid postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $^
 
 %-A4.fo.tmp: stylesheet-fo.xsl %.xml
@@ -268,7 +275,6 @@ htmlhelp: stylesheet-hh.xsl postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) --stringparam paper.type USletter -o $@ $^
 
 FOP = fop
-XMLLINT = xmllint
 
 # reformat FO output so that locations of errors are easier to find
 %.fo: %.fo.tmp
@@ -281,6 +287,7 @@ XMLLINT = xmllint
 
 epub: postgres.epub
 postgres.epub: postgres.xml
+	$(XMLLINT) --noout --valid $<
 	$(DBTOEPUB) $<
 
 
diff --git a/doc/src/sgml/docguide.sgml b/doc/src/sgml/docguide.sgml
index 943ab5a..614a76c 100644
--- a/doc/src/sgml/docguide.sgml
+++ b/doc/src/sgml/docguide.sgml
@@ -149,6 +149,20 @@ <title>Tool Sets</title>
 </varlistentry>
 
 <varlistentry>
+ <term><ulink url="http://xmlsoft.org/">Libxml2</ulink> for <command>xmllint</command></term>
+ <listitem>
+  <para>
+   This library and the <command>xmllint</command> tool it contains are
+   used for processing XML.  Many developers will already
+   have <application>Libxml2</application> installed, 

Re: [HACKERS] Audit of logout

2014-09-13 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 Yes, it was failing only on windows.  Today when I further tried to
 look into the issue, I found that if I rebuild plpgsql, it didn't occurred.
 So the conclusion is that it occurred due to build mismatch, however I
 am not sure why a rebuild of plpgsql is required for this patch.
 Sorry for the noise.

plpgsql contains references to PGC_SUSET and PGC_USERSET, whose values are
changed by this patch, so failure is unsurprising if you don't rebuild it.
(I saw failures in the regression tests even without Windows until I
updated plpgsql.)

Committed with minor cosmetic adjustments; mostly, rewriting some
comments.

regards, tom lane




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread k...@rice.edu
On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
 On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Andres Freund and...@2ndquadrant.com writes:
   On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
   On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com
  wrote:
   That's not entirely true. CRC-32C beats pretty much everything with
  the same
   length quality-wise and has both hardware implementations and highly
   optimized software versions.
 
   For better or for worse CRC is biased by detecting all single bit
   errors, the detection capability of larger errors is slightly
   diminished. The quality of the other algorithms I mentioned is also
   very good, while producing uniformly varying output.
 
   There's also much more literature about the various CRCs in comparison
   to some of these hash allgorithms.
 
  Indeed.  CRCs have well-understood properties for error detection.
  Have any of these new algorithms been analyzed even a hundredth as
  thoroughly?  No.  I'm unimpressed by evidence-free claims that
  something else is also very good.
 
  Now, CRCs are designed for detecting the sorts of short burst errors
  that are (or were, back in the day) common on phone lines.  You could
  certainly make an argument that that's not the type of threat we face
  for PG data.  However, I've not seen anyone actually make such an
  argument, let alone demonstrate that some other algorithm would be better.
  To start with, you'd need to explain precisely what other error pattern
  is more important to defend against, and why.
 
  regards, tom lane
 
 
 Mysql went this way as well, changing the CRC polynomial in 5.6.
 
 What we are looking for here is uniqueness thus better error detection. Not
 avalanche effect, nor cryptographically secure, nor bit distribution.
 As far as I'm aware CRC32C is unbeaten collision wise and time proven.
 
 I couldn't find tests with xxhash and crc32 on the same hardware so I spent
 some time putting together a benchmark (see attachment, to run it just
 start run.sh)
 
 I included a crc32 implementation using SSE4.2 instructions (which works on
 pretty much any Intel processor built after 2008 and AMD built after 2012),
 a portable Slice-By-8 software implementation and xxhash since it's the
 fastest software 32bit hash I know of.
 
 Here're the results running the test program on my i5-4200M
 
 crc sb8: 90444623
 elapsed: 0.513688s
 speed: 1.485220 GB/s
 
 crc hw: 90444623
 elapsed: 0.048327s
 speed: 15.786877 GB/s
 
 xxhash: 7f4a8d5
 elapsed: 0.182100s
 speed: 4.189663 GB/s
 
 The hardware version is insanely fast and works on the majority of Postgres
 setups and the fallback software implementations is 2.8x slower than the
 fastest 32bit hash around.
 
 Hopefully it'll be useful in the discussion.

Thank you for running this sample benchmark. It definitely shows that the
hardware version of the CRC is very fast, unfortunately it is really only
available on x64 Intel/AMD processors which leaves all the rest lacking.
For current 64-bit hardware, it might be instructive to also try using
the XXH64 version and just take one half of the hash. It should come in
at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
CRC. Also, while I understand that CRC has a very venerable history and
is well studied for transmission type errors, I have been unable to find
any research on its applicability to validating file/block writes to a
disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
both the 32-bit and 64-bit versions are its equal. Since there seems to
be a lack of research on disk based error detection versus CRC polynomials,
it seems likely that any of the proposed hash functions are on an equal
footing in this regard. As Andres commented up-thread, xxhash comes along
for free with lz4.
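For concreteness, here is a minimal sketch of the hardware CRC-32C path 
being discussed. It is illustrative only: it assumes an SSE4.2-capable x86 
CPU and compilation with -msse4.2, and production code would need a runtime 
CPUID check plus a software fallback such as the slice-by-8 implementation 
mentioned above.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>			/* SSE4.2 _mm_crc32_* intrinsics */

static uint32_t
crc32c_hw(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t	crc = 0xFFFFFFFF;

	/* consume 8 bytes per instruction, then finish byte by byte */
	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, p, 8);
		crc = _mm_crc32_u64(crc, chunk);
		p += 8;
		len -= 8;
	}
	while (len-- > 0)
		crc = _mm_crc32_u8((uint32_t) crc, *p++);

	return (uint32_t) crc ^ 0xFFFFFFFF;
}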

Regards,
Ken




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Arthur Silva
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu k...@rice.edu wrote:

 On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
  On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
   Andres Freund and...@2ndquadrant.com writes:
On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com
   wrote:
That's not entirely true. CRC-32C beats pretty much everything with
   the same
length quality-wise and has both hardware implementations and
 highly
optimized software versions.
  
For better or for worse CRC is biased by detecting all single bit
errors, the detection capability of larger errors is slightly
diminished. The quality of the other algorithms I mentioned is also
very good, while producing uniformly varying output.
  
There's also much more literature about the various CRCs in
 comparison
to some of these hash allgorithms.
  
   Indeed.  CRCs have well-understood properties for error detection.
   Have any of these new algorithms been analyzed even a hundredth as
   thoroughly?  No.  I'm unimpressed by evidence-free claims that
   something else is also very good.
  
   Now, CRCs are designed for detecting the sorts of short burst errors
   that are (or were, back in the day) common on phone lines.  You could
   certainly make an argument that that's not the type of threat we face
   for PG data.  However, I've not seen anyone actually make such an
   argument, let alone demonstrate that some other algorithm would be
 better.
   To start with, you'd need to explain precisely what other error pattern
   is more important to defend against, and why.
  
   regards, tom lane
  
 
  Mysql went this way as well, changing the CRC polynomial in 5.6.
 
  What we are looking for here is uniqueness thus better error detection.
 Not
  avalanche effect, nor cryptographically secure, nor bit distribution.
  As far as I'm aware CRC32C is unbeaten collision wise and time proven.
 
  I couldn't find tests with xxhash and crc32 on the same hardware so I
 spent
  some time putting together a benchmark (see attachment, to run it just
  start run.sh)
 
  I included a crc32 implementation using ssr4.2 instructions (which works
 on
  pretty much any Intel processor built after 2008 and AMD built after
 2012),
  a portable Slice-By-8 software implementation and xxhash since it's the
  fastest software 32bit hash I know of.
 
  Here're the results running the test program on my i5-4200M
 
  crc sb8: 90444623
  elapsed: 0.513688s
  speed: 1.485220 GB/s
 
  crc hw: 90444623
  elapsed: 0.048327s
  speed: 15.786877 GB/s
 
  xxhash: 7f4a8d5
  elapsed: 0.182100s
  speed: 4.189663 GB/s
 
  The hardware version is insanely and works on the majority of Postgres
  setups and the fallback software implementations is 2.8x slower than the
  fastest 32bit hash around.
 
  Hopefully it'll be useful in the discussion.

 Thank you for running this sample benchmark. It definitely shows that the
 hardware version of the CRC is very fast, unfortunately it is really only
 available on x64 Intel/AMD processors which leaves all the rest lacking.
 For current 64-bit hardware, it might be instructive to also try using
 the XXH64 version and just take one half of the hash. It should come in
 at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
 CRC. Also, while I understand that CRC has a very venerable history and
 is well studied for transmission type errors, I have been unable to find
 any research on its applicability to validating file/block writes to a
 disk drive. While it is to quote you unbeaten collision wise, xxhash,
 both the 32-bit and 64-bit version are its equal. Since there seems to
 be a lack of research on disk based error detection versus CRC polynomials,
 it seems likely that any of the proposed hash functions are on an equal
 footing in this regard. As Andres commented up-thread, xxhash comes along
 for free with lz4.

 Regards,
 Ken


For the sake of completeness, here are the results for xxhash64 on my machine:

xxhash64
speed:  7.365398 GB/s

Which is indeed very fast.


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-13 Thread Claudio Freire
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu k...@rice.edu wrote:
 Here're the results running the test program on my i5-4200M

 crc sb8: 90444623
 elapsed: 0.513688s
 speed: 1.485220 GB/s

 crc hw: 90444623
 elapsed: 0.048327s
 speed: 15.786877 GB/s

 xxhash: 7f4a8d5
 elapsed: 0.182100s
 speed: 4.189663 GB/s

 The hardware version is insanely and works on the majority of Postgres
 setups and the fallback software implementations is 2.8x slower than the
 fastest 32bit hash around.

 Hopefully it'll be useful in the discussion.

 Thank you for running this sample benchmark. It definitely shows that the
 hardware version of the CRC is very fast, unfortunately it is really only
 available on x64 Intel/AMD processors which leaves all the rest lacking.
 For current 64-bit hardware, it might be instructive to also try using
 the XXH64 version and just take one half of the hash. It should come in
 at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
 CRC. Also, while I understand that CRC has a very venerable history and
 is well studied for transmission type errors, I have been unable to find
 any research on its applicability to validating file/block writes to a
 disk drive. While it is to quote you unbeaten collision wise, xxhash,
 both the 32-bit and 64-bit version are its equal. Since there seems to
 be a lack of research on disk based error detection versus CRC polynomials,
 it seems likely that any of the proposed hash functions are on an equal
 footing in this regard. As Andres commented up-thread, xxhash comes along
 for free with lz4.


Bear in mind that

a) taking half of the CRC will invalidate all error detection
capability research, and it may also invalidate its properties,
depending on the CRC itself.

b) bit corruption, which is the target kind of error for CRCs, is resurging
in SSDs, as can be seen in table 4 of a paper that I think appeared on
this same list:
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

I would totally forget about taking half of whatever CRC. That's looking
for pain, in that it will invalidate all existing and future research
on that hash/CRC type.




Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 Actually, higher cardinality skip columns is better. Sorting of smaller
 groups is faster than sorting larger groups of same size. Also, with smaller
 groups you achieve limit more accurate (in average), i.e. sort smaller
 amount of total rows.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column v1) with much lower
cardinality/n_distinct than the column that had to be sorted on
(column v2). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

 I'm not sure if that's worth it to more or less
 duplicate heap_tuple_attr_equals() to save a mere n expensive
 comparisons, but it's something to think about (actually, there are
 probably less than even n comparisons in practice because there'll be
 a limit).

 Not correct. Smaller groups are not OK. Imagine that two representations of
 same skip column value exists. Index may return them in any order, even
 change them one by one. In this case sorting on other column never takes
 place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive real comparators, not just text or numeric.
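To make the shape of that concrete, here is a minimal sketch of an 
opportunistic memcmp()-first comparison. It is illustrative only: the real 
change would live in varstr_cmp()/bttextcmp(), which deal with 
non-NUL-terminated text datums, whereas this sketch assumes plain 
NUL-terminated C strings so that it can call strcoll() directly.

#include <string.h>

static int
compare_text_opportunistic(const char *a, size_t alen,
						   const char *b, size_t blen)
{
	/*
	 * Identical bytes are equal in any collation, so a cheap binary
	 * comparison can prove equality without calling strcoll() at all.
	 */
	if (alen == blen && memcmp(a, b, alen) == 0)
		return 0;

	/* otherwise fall back to the authoritative, locale-aware comparison */
	return strcoll(a, b);
}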

-- 
Peter Geoghegan




Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Peter Geoghegan
Some quick comments on partial-sort-basic-2.patch:

 *** a/src/include/utils/tuplesort.h
 --- b/src/include/utils/tuplesort.h
 ***
 *** 24,29 
 --- 24,30 
   #include executor/tuptable.h
   #include fmgr.h
   #include utils/relcache.h
 + #include utils/sortsupport.h

Why include sortsupport.h here?

I would like to see more comments, especially within ExecSort(). The
structure of that routine is quite unclear.

I don't really like this MakeSortSupportKeys() stuff within ExecSort():

 !   /* Support structures for cmpSortSkipCols - already sorted columns */
 !   if (skipCols)
 !   node->skipKeys = MakeSortSupportKeys(skipCols,
 !plannode->sortColIdx,
 !plannode->sortOperators,
 !plannode->collations,
 !plannode->nullsFirst);

 +   /* Only pass on remaining columns that are unsorted */
 tuplesortstate = tuplesort_begin_heap(tupDesc,
 ! plannode->numCols - skipCols,
 ! 
 &(plannode->sortColIdx[skipCols]),
 ! 
 &(plannode->sortOperators[skipCols]),
 ! 
 &(plannode->collations[skipCols]),
 ! 
 &(plannode->nullsFirst[skipCols]),
   work_mem,
   node->randomAccess);

You're calling the new function MakeSortSupportKeys() (which
encapsulates setting up sortsupport state for all attributes) twice;
first, to populate the skip keys (the indexed attribute(s)), and
second, when tuplesort_begin_heap() is called, so that there is state
for unindexed sort groups that must be manually sorted. That doesn't
seem great.

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

In this struct:

 *** a/src/include/nodes/execnodes.h
 --- b/src/include/nodes/execnodes.h
 *** typedef struct SortState
 *** 1670,1678 
 --- 1670,1682 
  boolbounded;/* is the result set bounded? */
  int64   bound;  /* if bounded, how many tuples are needed */
  boolsort_Done;  /* sort completed yet? */
 +   boolfinished;   /* fetching tuples from outer node
 +  is finished ? */
  boolbounded_Done;   /* value of bounded we did the sort with */
  int64   bound_Done; /* value of bound we did the sort with */
  void   *tuplesortstate; /* private state of tuplesort.c */
 +   SortSupport skipKeys;   /* columns already sorted in input */
 +   HeapTuple   prev;   /* previous tuple from outer node */
   } SortState;

I think you should be clearer about the scope and duration of fields
like finished, if this really belongs here. In general, there should
be some high-level comments about how the feature added by the patch
fits together, which there isn't right now.

That's all I have for now.

-- 
Peter Geoghegan

