[HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-13 Thread Richard Poole
The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
on systems that support it. It's based on Christian Kruse's patch from
last year, incorporating suggestions from Andres Freund.

On a system with 4GB of shared_buffers, with pgbench runs long enough for
each backend to touch most of the buffers, this patch saves nearly 8MB of
memory per backend and improves performance by just over 2% on average.

It is still WIP as there are a couple of points that Andres has pointed
out to me that haven't been addressed yet; also, the documentation is
incomplete.
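
For anyone skimming the thread, here is a minimal sketch (not taken from
the patch below; the function name and surrounding flags are illustrative
only) of the try-then-fall-back behaviour that huge_tlb_pages = try is
meant to provide: ask for the mapping with MAP_HUGETLB first, and retry
without it if the kernel refuses.

#include <sys/mman.h>

/*
 * Sketch only: try an anonymous shared mapping with MAP_HUGETLB and fall
 * back to an ordinary mapping if the kernel refuses (for example because
 * no huge pages are reserved).  The real patch drives this from the
 * huge_tlb_pages GUC and uses the backend's error reporting.
 */
static void *
map_shared_region(size_t size)
{
	void	   *ptr = MAP_FAILED;

#ifdef MAP_HUGETLB
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
	if (ptr == MAP_FAILED)
		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return ptr;
}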

Richard

-- 
Richard Poole http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..703b28f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1052,6 +1052,42 @@ include 'filename'
   </listitem>
  </varlistentry>

+ <varlistentry id="guc-huge-tlb-pages" xreflabel="huge_tlb_pages">
+  <term><varname>huge_tlb_pages</varname> (<type>enum</type>)</term>
+  <indexterm>
+   <primary><varname>huge_tlb_pages</varname> configuration parameter</primary>
+  </indexterm>
+  <listitem>
+   <para>
+    Enables/disables the use of huge TLB pages. Valid values are
+    <literal>on</literal>, <literal>off</literal> and <literal>try</literal>.
+    The default value is <literal>try</literal>.
+   </para>
+
+   <para>
+    Use of huge TLB pages reduces the CPU time spent on memory management and
+    the amount of memory used for page tables, and therefore improves
+    performance.
+   </para>
+
+   <para>
+    With <varname>huge_tlb_pages</varname> set to <literal>on</literal>,
+    <symbol>mmap()</symbol> will be called with <symbol>MAP_HUGETLB</symbol>.
+    If the call fails, the server will fail to start.
+   </para>
+
+   <para>
+    With <varname>huge_tlb_pages</varname> set to <literal>off</literal>,
+    <symbol>MAP_HUGETLB</symbol> will not be used at all.
+   </para>
+
+   <para>
+    With <varname>huge_tlb_pages</varname> set to <literal>try</literal>,
+    <symbol>MAP_HUGETLB</symbol> will be tried first, falling back to
+    <symbol>mmap()</symbol> without <symbol>MAP_HUGETLB</symbol> if that fails.
+   </para>
+  </listitem>
+ </varlistentry>
+
 <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
  <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
  <indexterm>
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..57fff35 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -27,10 +27,14 @@
 #ifdef HAVE_SYS_SHM_H
#include <sys/shm.h>
 #endif
+#ifdef MAP_HUGETLB
+#include <dirent.h>
+#endif
 
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -61,6 +65,13 @@ typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */
 #define MAP_FAILED ((void *) -1)
 #endif
 
+#ifdef MAP_HUGETLB
+#define PG_HUGETLB_BASE_ADDR (void *)(0x0UL)
+#define PG_MAP_HUGETLB MAP_HUGETLB
+#else
+#define PG_MAP_HUGETLB 0
+#endif
+
 
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
@@ -342,6 +353,161 @@ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
 }
 
 
+#ifdef MAP_HUGETLB
+#define HUGE_PAGE_INFO_DIR  "/sys/kernel/mm/hugepages"
+
+/*
+ *	static long InternalGetFreeHugepagesCount(const char *name)
+ *
+ * Attempt to read the number of available hugepages from
+ * /sys/kernel/mm/hugepages/hugepages-<size>/free_hugepages
+ * Returns -1 if the file could not be opened, 0 if no pages are available
+ * and > 0 if there are free pages.
+ *
+ */
+static long
+InternalGetFreeHugepagesCount(const char *name)
+{
+	int fd;
+	char buff[1024];
+	size_t len;
+	long result;
+	char *ptr;
+
+	len = snprintf(buff, 1024, "%s/%s/free_hugepages", HUGE_PAGE_INFO_DIR, name);
+	if (len >= 1024) /* I don't think that this will ever happen */
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Filename %s/%s/free_hugepages is too long", HUGE_PAGE_INFO_DIR, name),
+				 errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	fd = open(buff, O_RDONLY);
+	if (fd < 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Could not open file %s: %s", buff, strerror(errno)),
+				 errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	len = read(fd, buff, 1024);
+	if (len <= 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Error reading from file %s: %s", buff, strerror(errno)),
+				 errcontext("while checking hugepage size")));
+		close(fd);
+		return -1;
+	}
+
+	/*
+	 * If the content of free_hugepages is longer than or equal to 1024 bytes
+	 * the rest is irrelevant; we simply want to know if there are any
+	 * hugepages left
+	 */
+	if (len == 1024)
+	{
+		buff[1023] = 0;
+	}
+	else
+	{
+		buff[len] = 0;
+	}
+
+	close(fd);
+
+	result = strtol

[HACKERS] stray SIGALRM

2013-06-14 Thread Richard Poole
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
seconds, even if authentication has been successful. Most of the time
this doesn't hurt anyone, but there are cases, such as when the backend
is doing the open() for a backend COPY, where it breaks things and results
in an error being reported to the client. In particular, if you're doing
a COPY from a FIFO, it is normal for open() to block until the process at
the other end has data ready, so you're very likely to have it interrupted
by the SIGALRM and fail.
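
To illustrate the failure mode (this is not PostgreSQL code, just a
hypothetical sketch): a blocking open() can return -1 with errno set to
EINTR when the stray SIGALRM arrives, so unless the call is retried the
interruption surfaces as an error.

#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper, not part of the backend: open a FIFO for reading,
 * retrying when a signal such as the stray authentication_timeout SIGALRM
 * interrupts the blocking open().  Without a loop like this the caller
 * sees EINTR and reports it as an error.
 */
static int
open_fifo_retrying(const char *path)
{
	int			fd;

	do
	{
		fd = open(path, O_RDONLY);	/* blocks until a writer connects */
	} while (fd < 0 && errno == EINTR);

	return fd;
}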

To see the SIGALRM, just run psql, determine your backend's pid, attach
strace to it, and wait 60 seconds, or whatever you've got
authentication_timeout set to.

This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
side-effect of speeding things up by getting rid of setitimer() calls;
it's not obvious what's a good way to fix it without losing the benefits
of that commit.

Thanks Alvaro and Andres for helping me get from "why is my copy getting
these signals" to understanding what's actually going on.

Richard

-- 
Richard Poole http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] another plperl bug

2004-11-23 Thread Richard Poole
On Tue, Nov 23, 2004 at 11:37:22AM -0500, Tom Lane wrote:
>
> > CREATE FUNCTION test1() RETURNS TEXT AS $$
> > return ["test"];
> > $$ LANGUAGE plperl;
> >
> > SELECT test1();
> >       test1
> > -------------------
> >  ARRAY(0x8427a58)
> > (1 row)
>
> This is exactly what Perl will do if you try to coerce an array to a
> scalar:
>
> $ perl -e 'print ["test 1"], "\n"'
> ARRAY(0xa03ec28)
> $

To go a stage further, there's no array-to-scalar coercion happening
there; the [] syntax gives you a reference to an anonymous array, and
a reference to an array is a scalar, even when evaluated in list
context, as Tom's example is. If you wanted to return a list from
a sub in perl you'd just go return("test 1", "test 2").

> so I don't think a Perl programmer would find it surprising; if anything
> he'd probably complain if we *didn't* do that.

Indeed. It would be Perlish to have some magic so that when you called
one PL/Perl function from another you could return an array ref from
the inner one and have it Do What You Mean in the outer one, too.


Richard



Re: [HACKERS] CVS Messages

2002-08-17 Thread Richard Poole

On Fri, Aug 16, 2002 at 05:44:50PM -0400, Rod Taylor wrote:
> Is it possible for the cvs emails to include a URL to the appropriate
> entries in cvs web?
>
> The below is current:
>
> Modified files:
> src/backend/utils/adt: ruleutils.c

If you're using a Web browser with support for smart bookmarks, nicknames,
and javascript: URLs, then you can define a bookmark as something like:

javascript:re=/(:.)/;window.location="http://developer.postgresql.org/cvsweb.cgi/pgsql-server/"+"%s".replace(re, "/")

and then cut-and-paste the line from the email into your location field
using a nickname:

pgcvs src/backend/utils/adt: ruleutils.c

and have it bring up the cvsweb page. Works with galeon; I guess other
recent browsers (Konqueror, Moz, IE?) can do something very similar if
not quite identical.

Richard




Re: [HACKERS] Re: Outstanding patches

2001-05-08 Thread Richard Poole

On Tue, May 08, 2001 at 05:49:16PM -0400, Tom Lane wrote:
> I presume that Ian is not thinking about such a scenario, but only about
> using %type in a schema file that he will reload into a freshly created
> database each time he edits it.  That avoids the issue of whether %type
> declarations can or should track changes on the fly, but I think he's
> still going to run into problems with function naming: do
> fooey(foo.bar%type) and fooey(foo.baz%type) conflict, or not?  Maybe
> today the schema works and tomorrow you get an error.

How about a feature in psql which would read something like '%type' and
convert it to the appropriate thing before passing it to the backend?
Then you could use it without thinking about it in a script which you
would \i into psql. That would do what's wanted here without having
any backend nasties. I'm not offering to implement it myself - at least
not right now - but does it seem like a sensible idea?

Richard




Re: [HACKERS] Query not using index, please explain.

2001-03-08 Thread Richard Poole

On Thu, Mar 08, 2001 at 02:43:54PM -0500, Matthew Hagerty wrote:
> Richard,
>
> Thanks for the response, I guess I should have included a little more
> information.  The table contains 3.5 million rows.  The indexes were
> created after the data was imported into the table and I had just run
> vacuum and vacuum analyze on the database before trying the queries and
> sending this question to hackers.
>
> When I turned the seqscan variable off and ran the query with the
> '04-01-2000' date the results were literally instantaneous.  Turn the
> seqscan back on and it takes right around 3 minutes.  Also, the query for
> any date older than the '04-01-2000' returns zero rows.  The actual number
> of rows for the '04-01-2000' select is right around 8300.

This is where you need an expert. :) But I'll have a go and someone
will correct me if I'm wrong...

The statistics which are kept aren't fine-grained enough to be right
here. All the optimiser knows are the highest and lowest values of
the attribute, the most common value (not really useful here), the
number of nulls in the column, and the "dispersion" (a sort of
handwavy measure of how bunched-together the values are). So in a
case like this, where effectively the values are all different over
a certain range, all it can do is (more or less) linearly interpolate
in the range to guess how many tuples are going to be returned. Which
means it's liable to be completely wrong if your values aren't
evenly distributed over their whole range, which it seems they aren't.
It thinks you're going to hit around 1/28 of the tuples in this table,
presumably because '04/01/2000' is about 1/28 of the way from your
minimum value to your maximum.

This sort of thing will all become much better one fine day when
we have much better statistics available, and so many of us want
such things that that fine day will surely come. Until then, I think
you're best off turning off seqscans from your client code when
you know they'll be wrong. (That's what we do here in several similar
cases).

Can someone who really knows this stuff (Tom?) step in if what I've
just said is completely wrong?

> select domain from history_entries group by domain;
>
> To me, since there is an index on domain, it seems like this should be a
> rather fast thing to do?  It takes a *very* long time, no matter if I turn
> seqscan on or off.

The reason this is slow is that Postgres always has to look at heap
tuples, even when it's been sent there by indexes. This in turn is
because of the way the storage manager works (only by looking in the
heap can you tell for sure whether a tuple is valid for the current
transaction). So a "group by" always has to look at every heap tuple
(that hasn't been eliminated by a where clause). "select distinct"
has the same problem. I don't think there's a way to do what you
want here with your existing schema without a sequential scan over
the table.


Richard
