Re: [PATCHES] prevent invalidly encoded input

2007-09-12 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
>   
> addlitchar(unescape_single_char(yytext[1]));
> + if 
> (IS_HIGHBIT_SET(literalbuf[literallen]))
> + saw_high_bit = true;

Isn't that array subscript off-by-one?  Probably better to put the test
inside unescape_single_char(), anyway.

Otherwise it looks sane, though maybe shy a comment or so.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] prevent invalidly encoded input

2007-09-11 Thread Andrew Dunstan



Tom Lane wrote:


So it looks like you need to recheck if unescape_single_char sees a
high-bit-set char.

You should take a second look at the COPY code to see if there's a
similar case there --- I forget what it does with backslash followed
by non-digit.

  


It's covered. Revised patch attached. I'll probably apply this some time 
tomorrow.


cheers

andrew
Index: src/backend/commands/copy.c
===
RCS file: /cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.286
diff -c -r1.286 copy.c
*** src/backend/commands/copy.c	7 Sep 2007 20:59:26 -	1.286
--- src/backend/commands/copy.c	12 Sep 2007 03:21:25 -
***
*** 2685,2690 
--- 2685,2691 
  		char	   *start_ptr;
  		char	   *end_ptr;
  		int			input_len;
+ 		boolsaw_high_bit = false;
  
  		/* Make sure space remains in fieldvals[] */
  		if (fieldno >= maxfields)
***
*** 2749,2754 
--- 2750,2757 
  }
  			}
  			c = val & 0377;
+ 			if (IS_HIGHBIT_SET(c))
+ saw_high_bit = true;
  		}
  		break;
  	case 'x':
***
*** 2772,2777 
--- 2775,2782 
  	}
  }
  c = val & 0xff;
+ if (IS_HIGHBIT_SET(c))
+ 	saw_high_bit = true;			
  			}
  		}
  		break;
***
*** 2799,2805 
  		 * literally
  		 */
  }
! 			}
  
  			/* Add c to output string */
  			*output_ptr++ = c;
--- 2804,2810 
  		 * literally
  		 */
  }
! 			}			
  
  			/* Add c to output string */
  			*output_ptr++ = c;
***
*** 2808,2813 
--- 2813,2828 
  		/* Terminate attribute value in output area */
  		*output_ptr++ = '\0';
  
+ 		/* If we de-escaped a char with the high bit set, make sure
+ 		 * we still have valid data for the db encoding. Avoid calling strlen 
+ 		 * here for the sake of efficiency.
+ 		 */
+ 		if (saw_high_bit)
+ 		{
+ 			char *fld = fieldvals[fieldno];
+ 			pg_verifymbstr(fld, output_ptr - (fld + 1), false);
+ 		}
+ 
  		/* Check whether raw input matched null marker */
  		input_len = end_ptr - start_ptr;
  		if (input_len == cstate->null_print_len &&
Index: src/backend/parser/scan.l
===
RCS file: /cvsroot/pgsql/src/backend/parser/scan.l,v
retrieving revision 1.140
diff -c -r1.140 scan.l
*** src/backend/parser/scan.l	12 Aug 2007 20:18:06 -	1.140
--- src/backend/parser/scan.l	12 Sep 2007 03:21:26 -
***
*** 60,65 
--- 60,66 
  bool			standard_conforming_strings = false;
  
  static bool		warn_on_first_escape;
+ static bool saw_high_bit = false;
  
  /*
   * literalbuf is used to accumulate literal values when multiple rules
***
*** 426,431 
--- 427,433 
  
  {xqstart}		{
  	warn_on_first_escape = true;
+ 	saw_high_bit = false;
  	SET_YYLLOC();
  	if (standard_conforming_strings)
  		BEGIN(xq);
***
*** 435,440 
--- 437,443 
  }
  {xestart}		{
  	warn_on_first_escape = false;
+ 	saw_high_bit = false;
  	SET_YYLLOC();
  	BEGIN(xe);
  	startlit();
***
*** 443,448 
--- 446,453 
  {quotefail} {
  	yyless(1);
  	BEGIN(INITIAL);
+ 	if (saw_high_bit)
+ 		pg_verifymbstr(literalbuf, literallen, false);
  	yylval.str = litbufdup();
  	return SCONST;
  }
***
*** 469,486 
--- 474,497 
  	}
  	check_string_escape_warning(yytext[1]);
  	addlitchar(unescape_single_char(yytext[1]));
+ 	if (IS_HIGHBIT_SET(literalbuf[literallen]))
+ 		saw_high_bit = true;
  }
  {xeoctesc}  {
  	unsigned char c = strtoul(yytext+1, NULL, 8);
  
  	check_escape_warning();
  	addlitchar(c);
+ 	if (IS_HIGHBIT_SET(c))
+ 		saw_high_bit = true;
  }
  {xehexesc}  {
  	unsigned char c = strtoul(yytext+2, NULL, 16);
  
  	check_escape_warning();
  	addlitchar(c);
+ 	if (IS_HIGHBIT_SET(c))
+ 		saw_high_bit = true;
  }
  {quotecontinue} {
  	/* ignore */

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] prevent invalidly encoded input

2007-09-11 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Also, I'd kinda like to have the check-for-high-bit optimization in
>> scan.l too --- some people do throw big literals at the thing.
>> 
> OK, will do. Am I correct in thinking I don't need to worry about the 
>  case, only the  and  cases?

[ squint ... ]  Hm, wouldn't bet on it.  That leads to
unescape_single_char(), which is fine for the cases that it explicitly
knows about (\b and so on), but what if the following byte has the
high bit set?  Not only would that pass through a high bit to the
output, but very possibly this results in disassembling a multibyte
input character.

So it looks like you need to recheck if unescape_single_char sees a
high-bit-set char.

You should take a second look at the COPY code to see if there's a
similar case there --- I forget what it does with backslash followed
by non-digit.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] prevent invalidly encoded input

2007-09-11 Thread Andrew Dunstan



Tom Lane wrote:

Also, I'd kinda like to have the check-for-high-bit optimization in
scan.l too --- some people do throw big literals at the thing.


  
OK, will do. Am I correct in thinking I don't need to worry about the 
 case, only the  and  cases?


cheers

andrew

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] prevent invalidly encoded input

2007-09-11 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Attached is a patch to the scanner and the COPY code that checks for 
> invalidly encoded data that can currently leak into our system via \ 
> escapes in quoted literals or text mode copy fields, as recently 
> discussed. That would still leave holes via chr(), convert() and 
> possibly other functions, but these two paths are the biggest holes that 
> need plugging.

The COPY code looks sane.  On the scan.l change, I believe two out of
three of those calls are useless, because we do not do backslash
processing in dollar-quoted strings nor in quoted identifiers.
Also, I'd kinda like to have the check-for-high-bit optimization in
scan.l too --- some people do throw big literals at the thing.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match