Re: [Mesa-dev] [PATCH 2/7] glsl/glcpp: use ralloc_sprint_rewrite_tail to avoid slow vsprintf

2017-01-01 Thread Marek Olšák
On Jan 1, 2017 11:55 AM, "Vladislav Egorov"  wrote:



01.01.2017 06:41, Kenneth Graunke пишет:

On Sunday, January 1, 2017 1:34:27 AM PST Marek Olšák wrote:
>
>> From: Marek Olšák 
>>
>> This reduces compile times by 4.5% with the Gallium noop driver and
>> gl_constants::GLSLOptimizeConservatively == true.
>>
> Compile times of...what exactly?  Do you have any statistics for this
> by itself?
>
> Assuming we add your helper, this patch looks reasonable.
> Reviewed-by: Kenneth Graunke 
>
> BTW, I suspect you could get some additional speed up by changing
>
> parser->output = ralloc_strdup(parser, "");
>
> to something like:
>
> parser->output = ralloc_size(parser, strlen(orig_concatenated_src));
> parser->output[0] = '\0';
>
> to try and avoid reallocations.  rewrite_tail will realloc just enough
> space every time it allocates, which means once you reallocate, you're
> going to be calling realloc on every single token.  Yuck!
>
> ralloc/talloc's string libraries were never meant for serious string
> processing like the preprocessor does.  They're meant for convenience
> when constructing debug messages which don't need to be that efficient.
>
> Perhaps a better approach would be to have the preprocessor do this
> itself.  Just ralloc_size() output and initialize the null byte.
> reralloc to double the size if you need more space.  At the end of
> preprocessing, reralloc to output_length at the end of free any waste
> from doubling.
>
> I suspect that would be a *lot* more efficient, and is probably what
> we should have done in the first place...
>
I have similar patch (maybe need 1-2 days to clean it up), and I've tested
both variants. String in exponentially growing (by +50%) string buffer
works better, but not *THAT* much better as I expected. It seems that in
the sequence of str = realloc(str, 1001); str = realloc(str, 1002); str =
realloc(str, 1003), etc. most of reallocs will be non-moving in both
glibc's allocator and jemalloc. For example, jemalloc have size classes
that already grow exponentially by 15-25% - ..., 4K, 5K, 6K, 7K, 8K, 10K,
12K, 14K, 16K, 20K, 24K, .., 4M, 5M, ...  realloc will just test if the
requested size belongs to the same size class and do nothing. Reallocs
inside of the same size class will be always non-moving and almost free.
Overall avoiding formatted printing (DOUBLE formatted printing, which is
entirely avoidable too) gives the single largest boost to the pre-processor.

Benchmark on my shader-db (glcpp and shader-db's run smashed together to do
only preprocessing). Note that I used old jemalloc from Ubuntu 16.04, which
can be important, because jemalloc changed its size class strategy since
then.
perf stat --repeat 10
master8.91s
master+jemalloc   8.60s
Marek's patch 5.50s
Marek's patch+jemalloc5.03s
my string_buffer  4.57s
my string_buffer+jemalloc 4.43s
my series 3.83s
my series+jemalloc3.68s


Since you are further than me, let's merge your work instead.

Marek


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 2/7] glsl/glcpp: use ralloc_sprint_rewrite_tail to avoid slow vsprintf

2017-01-01 Thread Marek Olšák
On Jan 1, 2017 4:41 AM, "Kenneth Graunke"  wrote:

On Sunday, January 1, 2017 1:34:27 AM PST Marek Olšák wrote:
> From: Marek Olšák 
>
> This reduces compile times by 4.5% with the Gallium noop driver and
> gl_constants::GLSLOptimizeConservatively == true.

Compile times of...what exactly?  Do you have any statistics for this
by itself?


I always use my shader-db with almost 30k shaders and I used the time
command.


Assuming we add your helper, this patch looks reasonable.
Reviewed-by: Kenneth Graunke 

BTW, I suspect you could get some additional speed up by changing

   parser->output = ralloc_strdup(parser, "");

to something like:

   parser->output = ralloc_size(parser, strlen(orig_concatenated_src));
   parser->output[0] = '\0';

to try and avoid reallocations.  rewrite_tail will realloc just enough
space every time it allocates, which means once you reallocate, you're
going to be calling realloc on every single token.  Yuck!

ralloc/talloc's string libraries were never meant for serious string
processing like the preprocessor does.  They're meant for convenience
when constructing debug messages which don't need to be that efficient.

Perhaps a better approach would be to have the preprocessor do this
itself.  Just ralloc_size() output and initialize the null byte.
reralloc to double the size if you need more space.  At the end of
preprocessing, reralloc to output_length at the end of free any waste
from doubling.

I suspect that would be a *lot* more efficient, and is probably what
we should have done in the first place...


Sure. However, realloc is a no-op when there is free space after the
allocation. It's a useless call, but it doesn't look too bad in a profiler.

Marek
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 2/7] glsl/glcpp: use ralloc_sprint_rewrite_tail to avoid slow vsprintf

2017-01-01 Thread Vladislav Egorov



01.01.2017 06:41, Kenneth Graunke пишет:

On Sunday, January 1, 2017 1:34:27 AM PST Marek Olšák wrote:

From: Marek Olšák 

This reduces compile times by 4.5% with the Gallium noop driver and
gl_constants::GLSLOptimizeConservatively == true.

Compile times of...what exactly?  Do you have any statistics for this
by itself?

Assuming we add your helper, this patch looks reasonable.
Reviewed-by: Kenneth Graunke 

BTW, I suspect you could get some additional speed up by changing

parser->output = ralloc_strdup(parser, "");

to something like:

parser->output = ralloc_size(parser, strlen(orig_concatenated_src));
parser->output[0] = '\0';

to try and avoid reallocations.  rewrite_tail will realloc just enough
space every time it allocates, which means once you reallocate, you're
going to be calling realloc on every single token.  Yuck!

ralloc/talloc's string libraries were never meant for serious string
processing like the preprocessor does.  They're meant for convenience
when constructing debug messages which don't need to be that efficient.

Perhaps a better approach would be to have the preprocessor do this
itself.  Just ralloc_size() output and initialize the null byte.
reralloc to double the size if you need more space.  At the end of
preprocessing, reralloc to output_length at the end of free any waste
from doubling.

I suspect that would be a *lot* more efficient, and is probably what
we should have done in the first place...
I have similar patch (maybe need 1-2 days to clean it up), and I've 
tested both variants. String in exponentially growing (by +50%) string 
buffer works better, but not *THAT* much better as I expected. It seems 
that in the sequence of str = realloc(str, 1001); str = realloc(str, 
1002); str = realloc(str, 1003), etc. most of reallocs will be 
non-moving in both glibc's allocator and jemalloc. For example, jemalloc 
have size classes that already grow exponentially by 15-25% - ..., 4K, 
5K, 6K, 7K, 8K, 10K, 12K, 14K, 16K, 20K, 24K, .., 4M, 5M, ...  realloc 
will just test if the requested size belongs to the same size class and 
do nothing. Reallocs inside of the same size class will be always 
non-moving and almost free. Overall avoiding formatted printing (DOUBLE 
formatted printing, which is entirely avoidable too) gives the single 
largest boost to the pre-processor.


Benchmark on my shader-db (glcpp and shader-db's run smashed together to 
do only preprocessing). Note that I used old jemalloc from Ubuntu 16.04, 
which can be important, because jemalloc changed its size class strategy 
since then.

perf stat --repeat 10
master8.91s
master+jemalloc   8.60s
Marek's patch 5.50s
Marek's patch+jemalloc5.03s
my string_buffer  4.57s
my string_buffer+jemalloc 4.43s
my series 3.83s
my series+jemalloc3.68s

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 2/7] glsl/glcpp: use ralloc_sprint_rewrite_tail to avoid slow vsprintf

2016-12-31 Thread Kenneth Graunke
On Sunday, January 1, 2017 1:34:27 AM PST Marek Olšák wrote:
> From: Marek Olšák 
> 
> This reduces compile times by 4.5% with the Gallium noop driver and
> gl_constants::GLSLOptimizeConservatively == true.

Compile times of...what exactly?  Do you have any statistics for this
by itself?

Assuming we add your helper, this patch looks reasonable.
Reviewed-by: Kenneth Graunke 

BTW, I suspect you could get some additional speed up by changing

   parser->output = ralloc_strdup(parser, "");

to something like:

   parser->output = ralloc_size(parser, strlen(orig_concatenated_src));
   parser->output[0] = '\0';

to try and avoid reallocations.  rewrite_tail will realloc just enough
space every time it allocates, which means once you reallocate, you're
going to be calling realloc on every single token.  Yuck!

ralloc/talloc's string libraries were never meant for serious string
processing like the preprocessor does.  They're meant for convenience
when constructing debug messages which don't need to be that efficient.

Perhaps a better approach would be to have the preprocessor do this
itself.  Just ralloc_size() output and initialize the null byte.
reralloc to double the size if you need more space.  At the end of
preprocessing, reralloc to output_length at the end of free any waste
from doubling.

I suspect that would be a *lot* more efficient, and is probably what
we should have done in the first place...


signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH 2/7] glsl/glcpp: use ralloc_sprint_rewrite_tail to avoid slow vsprintf

2016-12-31 Thread Marek Olšák
From: Marek Olšák 

This reduces compile times by 4.5% with the Gallium noop driver and
gl_constants::GLSLOptimizeConservatively == true.
---
 src/compiler/glsl/glcpp/glcpp-parse.y | 39 +++
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/compiler/glsl/glcpp/glcpp-parse.y 
b/src/compiler/glsl/glcpp/glcpp-parse.y
index 63012bc..b84e5ff 100644
--- a/src/compiler/glsl/glcpp/glcpp-parse.y
+++ b/src/compiler/glsl/glcpp/glcpp-parse.y
@@ -202,21 +202,21 @@ add_builtin_define(glcpp_parser_t *parser, const char 
*name, int value);
 input:
/* empty */
 |  input line
 ;
 
 line:
control_line
 |  SPACE control_line
 |  text_line {
_glcpp_parser_print_expanded_token_list (parser, $1);
-   ralloc_asprintf_rewrite_tail (>output, 
>output_length, "\n");
+   ralloc_sprint_rewrite_tail(>output, 
>output_length, "\n", 1);
}
 |  expanded_line
 ;
 
 expanded_line:
IF_EXPANDED expression NEWLINE {
if (parser->is_gles && $2.undefined_macro)
glcpp_error(& @1, parser, "undefined macro %s in 
expression (illegal in GLES)", $2.undefined_macro);
_glcpp_parser_skip_stack_push_if (parser, & @1, $2.value);
}
@@ -252,21 +252,21 @@ define:
 |  FUNC_IDENTIFIER '(' ')' replacement_list NEWLINE {
_define_function_macro (parser, & @1, $1, NULL, $4);
}
 |  FUNC_IDENTIFIER '(' identifier_list ')' replacement_list NEWLINE {
_define_function_macro (parser, & @1, $1, $3, $5);
}
 ;
 
 control_line:
control_line_success {
-   ralloc_asprintf_rewrite_tail (>output, 
>output_length, "\n");
+   ralloc_sprint_rewrite_tail(>output, 
>output_length, "\n", 1);
}
 |  control_line_error
 |  HASH_TOKEN LINE pp_tokens NEWLINE {
 
if (parser->skip_stack == NULL ||
parser->skip_stack->type == SKIP_NO_SKIP)
{
_glcpp_parser_expand_and_lex_from (parser,
   LINE_EXPANDED, $3,
   
EXPANSION_MODE_IGNORE_DEFINED);
@@ -428,21 +428,22 @@ control_line_success:
 |  HASH_TOKEN VERSION_TOKEN version_constant IDENTIFIER NEWLINE {
if (parser->version_set) {
glcpp_error(& @1, parser, "#version must appear on the 
first line");
}
_glcpp_parser_handle_version_declaration(parser, $3, $4, true);
}
 |  HASH_TOKEN NEWLINE {
glcpp_parser_resolve_implicit_version(parser);
}
 |  HASH_TOKEN PRAGMA NEWLINE {
-   ralloc_asprintf_rewrite_tail (>output, 
>output_length, "#%s", $2);
+   ralloc_sprint_rewrite_tail(>output, 
>output_length, "#", 1);
+   ralloc_sprint_rewrite_tail(>output, 
>output_length, $2, strlen($2));
}
 ;
 
 control_line_error:
HASH_TOKEN ERROR_TOKEN NEWLINE {
glcpp_error(& @1, parser, "#%s", $2);
}
 |  HASH_TOKEN DEFINE_TOKEN NEWLINE {
glcpp_error (& @1, parser, "#define without macro name");
}
@@ -1109,71 +1110,73 @@ _token_list_equal_ignoring_space(token_list_t *a, 
token_list_t *b)
   node_b = node_b->next;
}
 
return 1;
 }
 
 static void
 _token_print(char **out, size_t *len, token_t *token)
 {
if (token->type < 256) {
-  ralloc_asprintf_rewrite_tail (out, len, "%c", token->type);
+  char s[2] = {token->type, 0};
+  ralloc_sprint_rewrite_tail(out, len, s, 1);
   return;
}
 
switch (token->type) {
case INTEGER:
   ralloc_asprintf_rewrite_tail (out, len, "%" PRIiMAX, token->value.ival);
   break;
case IDENTIFIER:
case INTEGER_STRING:
case OTHER:
-  ralloc_asprintf_rewrite_tail (out, len, "%s", token->value.str);
+  ralloc_sprint_rewrite_tail(out, len, token->value.str,
+ strlen(token->value.str));
   break;
case SPACE:
-  ralloc_asprintf_rewrite_tail (out, len, " ");
+  ralloc_sprint_rewrite_tail(out, len, " ", 1);
   break;
case LEFT_SHIFT:
-  ralloc_asprintf_rewrite_tail (out, len, "<<");
+  ralloc_sprint_rewrite_tail(out, len, "<<", 2);
   break;
case RIGHT_SHIFT:
-  ralloc_asprintf_rewrite_tail (out, len, ">>");
+  ralloc_sprint_rewrite_tail(out, len, ">>", 2);
   break;
case LESS_OR_EQUAL:
-  ralloc_asprintf_rewrite_tail (out, len, "<=");
+  ralloc_sprint_rewrite_tail(out, len, "<=", 2);
   break;
case GREATER_OR_EQUAL:
-  ralloc_asprintf_rewrite_tail (out, len, ">=");
+  ralloc_sprint_rewrite_tail(out, len, ">=", 2);
   break;
case EQUAL:
-  ralloc_asprintf_rewrite_tail (out, len, "==");
+  ralloc_sprint_rewrite_tail(out, len, "==", 2);