Re: [pcre-dev] Release candidate for 10.10
I don't think these are related. The problem is in test2, not in test19 (the serialization test). The Sparc64 JIT has not been implemented yet. ARM 64 supports unaligned access, so an alignment issue would be unusual. Perhaps the program counter is executing something from a wrong address.

Petr, please run

gdb --args ./pcre2test -q -S 16 -8 -jit ./testdata/testinput2 testtry

and start the program with r. When it crashes, please type bt 10 and disassemble $pc-128,$pc+128 and send me the output. I think it is enough to send me the dump privately.

Regards,
Zoltan

p...@hermes.cam.ac.uk wrote:
> On Thu, 26 Feb 2015, Giuseppe D'Angelo wrote:
>> On 26 February 2015 at 12:39, Zoltán Herczeg hzmes...@freemail.hu wrote:
>>> The message bus error is also interesting, not the usual segmentation fault. I don't know this error, but according to Wikipedia, a bus error is a fault raised by hardware when a process is trying to access memory that the CPU cannot physically address.
>>
>> It's also raised in some more common scenarios, such as misaligned memory access, or access beyond the end of a memory-mapped file. Maybe qemu is allowing an unaligned access that the native CPU would disallow for some reason.
>
> I recently fixed a misaligned memory bug that showed up on SPARC 64-bit. I think it was also a bus error. The patch was in the pcre2_serialize.c file, but I cannot remember whether the fix was before or after the -RC1 tarball was made. The bug was not related to JIT, but I think there are still JIT issues with SPARC 64-bit, aren't there? Could this be related?
>
> Philip
> --
> Philip Hazel

-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Release candidate for 10.10
Hi Petr,

thank you for running PCRE2 on many environments.

> The AArch64 is worse. It compiles with JIT but crashes on tests:
>
> FAIL: RunTest =
> PCRE2 C library tests using test data from ./testdata
> PCRE2 version 10.10-RC1 2015-02-20
> Testing 8-bit library
> Test 0: Unchecked pcre2test argument tests (to improve coverage)
>   OK
> Test 1: Main non-UTF, non-UCP functionality (compatible with Perl >= 5.10)
>   OK
>   OK with JIT
> Test 2: API, errors, internals, and non-Perl stuff (excluding UTF-8)
>   OK
> ./RunTest: line 446: 18631 Bus error (core dumped) $sim $valgrind ./pcre2test -q $test2stack $bmode $opt $testdata/testinput2 testtry
>
> I don't have time to debug it now. But it worked in 8.36 version.

I tried RunTest in my qemu emulated ARM-64 environment (I don't have access to real hardware), and it worked. Hence I definitely need some help to fix this. My ARM-64 GCC is Linaro 4.8.3, and I can only compile PCRE in static mode, otherwise qemu does not work.

It is interesting that test1 runs correctly and test2 fails. Perhaps a stack related issue? The message bus error is also interesting, not the usual segmentation fault. I don't know this error, but according to Wikipedia, a bus error is a fault raised by hardware when a process is trying to access memory that the CPU cannot physically address.

Could you run pcre2_jit_test as well? RunTest accepts test numbers between 1 and 18, e.g.: RunTest 1 7 9 17. Could you check which tests fail besides test2?

Regards,
Zoltan
Re: [pcre-dev] Release candidate for 10.10
Hi,

> src/sljit/sljitNativeARM_32.c:340:28: error: 'compiler' undeclared (first use in this function)
>   SLJIT_FREE(curr_patch, compiler->allocator_data);
>   ^
> src/pcre2_jit_compile.c:60:61: note: in definition of macro 'SLJIT_FREE'
>   #define SLJIT_FREE(ptr, allocator_data) pcre2_jit_free(ptr, allocator_data)
>   ^
> src/sljit/sljitNativeARM_32.c:340:28: note: each undeclared identifier is reported only once for each function it appears in
>   SLJIT_FREE(curr_patch, compiler->allocator_data);
>   ^
> src/pcre2_jit_compile.c:60:61: note: in definition of macro 'SLJIT_FREE'
>   #define SLJIT_FREE(ptr, allocator_data) pcre2_jit_free(ptr, allocator_data)

thank you for reporting this issue. It was a missing parameter. I fixed it in both PCRE and PCRE2.

Regards,
Zoltan
Re: [pcre-dev] Which limit is hit?
The (?:abc){1234} is a simple case, but sometimes you need to track the status of the subpattern. E.g. when you match /(?:aa|a){6}/ to aa (a badly written but possible pattern that requires exponential matching time). The engine greedily matches aa at the beginning, but eventually it needs to try the second alternative, since there are not enough 'a'-s in the input. The interpreter tracks this on the stack, where each iteration has its own stack frame. These frames only track the subpattern, they are not aware of the subpattern index. Tracking the subpattern index would require dynamic memory allocation, which is not preferred in PCRE, since memory allocation is slow.

Regards,
Zoltan

Jean-Christophe Deschamps jch.descha...@free.fr wrote:
> Zoltan,
>
> At 07:17 26/01/2015, you wrote:
>> the pattern is always compiled to byte code first, and JIT converts it back, so using JIT alone does not help.
>
> Ah, I didn't know that point.
>
>> The reason of not using an iterator in the interpreter is practical: PCRE interpreter uses stack recursion, and you cannot easily share variable data across function calls. This is not a problem for single character iterators, but matching brackets would require inspecting the machine stack. Finding the previous call of an iterator on the stack chain and getting local data from it is difficult (in C at least). Instead the byte code of a subpattern is repeated so there is no need for tracking the iterator count.
>
> I don't want to abuse your time and patience but I'd love to understand the whole picture. In the case of a fixed repetition factor, are there cases which need to backtrack in the middle of the iteration? I may be missing something obvious but (?:abc){1234} matches as a whole or not at all. If it doesn't and if at start of the pattern all is needed is to bump the matching point in the subject and backtrack at the beginning of the loop.
>
> Using pcretest -d indeed shows that for instance (?:abc){5,7} expands in fixed repetition of 'abc' 5 times then twice an optional 'abc'. I understand your point for the variable, optional, part but wouldn't it be worth implementing an iterator in the bytecode for the fixed part? That wouldn't solve the (?:abc){5,} case but still it would help in 2/3 of the cases, like (?:abc){} or (?:abc){,10022}.
>
> --
> j...@antichoc.net
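
The /(?:aa|a){6}/ case from the reply above can be sketched with a toy recursive matcher. This is only an illustration of "one stack frame per iteration" and of why the engine has to come back and try the second alternative; match_rep and its parameters are hypothetical names, and the code is not how the PCRE interpreter is actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Toy matcher for (?:aa|a){remaining}, anchored at s + pos.
   Each iteration of the group is one recursive call, so the state of
   every iteration lives in its own stack frame. *calls counts frames. */
int match_rep(const char *s, size_t len, size_t pos, int remaining, long *calls)
{
    assert(remaining >= 0);
    ++*calls;
    if (remaining == 0)
        return 1;                       /* all iterations matched */

    /* First alternative: "aa" (tried first, as the pattern is written). */
    if (pos + 2 <= len && s[pos] == 'a' && s[pos + 1] == 'a' &&
        match_rep(s, len, pos + 2, remaining - 1, calls))
        return 1;

    /* Backtrack into this frame and try the second alternative: "a". */
    if (pos + 1 <= len && s[pos] == 'a' &&
        match_rep(s, len, pos + 1, remaining - 1, calls))
        return 1;

    return 0;                           /* both alternatives failed here */
}
```

Calling match_rep("aa", 2, 0, 6, &calls) fails, because six iterations need at least six characters, while the same subject with two iterations succeeds only after the first frame gives up on "aa" and retries with "a".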
Re: [pcre-dev] Which limit is hit?
Hi,

the pattern is always compiled to byte code first, and JIT converts it back, so using JIT alone does not help.

The reason for not using an iterator in the interpreter is practical: the PCRE interpreter uses stack recursion, and you cannot easily share variable data across function calls. This is not a problem for single character iterators, but matching brackets would require inspecting the machine stack. Finding the previous call of an iterator on the stack chain and getting local data from it is difficult (in C at least). Instead the byte code of a subpattern is repeated so there is no need for tracking the iterator count.

JIT does not use the machine stack for recursion, and it has an infrastructure for iterator data sharing, so this is not an issue there.

Regards,
Zoltan

Jean-Christophe Deschamps jch.descha...@free.fr wrote:
> At 18:30 25/01/2015, you wrote:
>> I think the issue is that the byte code of the pattern is too big. It is basically (?:\d+=) times. It was easier to implement the interpreter this way (JIT converts back the byte code into an iterator again, because of the code size). To make this work, increase the link size to 3 or 4 (--with-link-size=4) when compiling PCRE.
>
> So if I understand you correctly, the only options are to either use a larger link size or use JIT, none of which is under my control since I'm using a script language interpreter embedding PCRE in linked form.
>
> While I regard PCRE as a superior engine and feel obliged by the work of the dev team I find unfortunate the choice to not implement an internal loop structure for fixed repetition of subpatterns.
>
> Thank you for your insight anyway.
>
> --
> j...@antichoc.net
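
A rough way to see why an unrolled repetition runs into the link size: internal offsets are stored in link-size bytes, so the compiled pattern cannot grow beyond what such an offset can address. The sketch below is only back-of-the-envelope arithmetic for the 8-bit library; max_link_offset, unrolled_fits and the per-iteration size are hypothetical, and the real limit also depends on per-item overhead that is not modelled here.

```c
#include <assert.h>

/* Largest offset that fits in link_size bytes (2, 3 or 4). */
unsigned long long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (1ULL << (8 * link_size)) - 1;
}

/* Assumption: an unrolled group costs body_units code units per copy.
   Returns 1 if 'repeat' copies still fit under the given link size. */
int unrolled_fits(unsigned long long body_units, unsigned long long repeat,
                  int link_size)
{
    return body_units * repeat <= max_link_offset(link_size);
}
```

With a made-up body of 10 code units, 1000 copies fit under link size 2, 10000 copies do not, and the same 10000 copies fit again with link size 3.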
Re: [pcre-dev] PCRE2 is released
Hi,

I wanted to say something similar. Now PCRE cannot be built without UCP if UTF is enabled, but using it is a different question. You can modify the behavior of \w etc. in the same way as in PCRE1.

Regards,
Zoltan

Giuseppe D'Angelo dange...@gmail.com wrote:
> On 5 January 2015 at 17:57, Jean-Christophe Deschamps jch.descha...@free.fr wrote:
>> While this seems reasonable at the first look, linking of these options has one unfortunate drawback: it dramatically changes the semantics of \w, \W, \b etc. and previously working patterns over UTF strings could produce different results.
>
> What do you mean? \w changes matching from ASCII to the Unicode property if you compile the pattern with PCRE2_UCP (*). Those are configure-time options to build PCRE2 itself with or without Unicode/UCP support.
>
> (*) and if you're afraid of a (*UCP) inside a pattern, then there's PCRE2_NEVER_UCP to always disable UCP.
>
> Cheers,
> --
> Giuseppe D'Angelo
Re: [pcre-dev] pcre2_jit_match(): match context needed?
Hi,

this was simply an oversight from my side (after some refactoring). The mcontext can be NULL. Thank you for noticing it. I hope I fixed it.

Regards,
Zoltan

Ralf Junker ralfjun...@gmx.de wrote:
> PCRE2 pcre2jit.html:
>
>   The fast path function is called pcre2_jit_match(), and it takes exactly the same arguments as pcre2_match().
>
> From this I concluded that I can use pcre2_jit_match() exactly as pcre2_match(), provided that I previously called pcre2_jit_compile().
>
> I hence called pcre2_jit_match() without providing a match context. As a result, I received an AV. Creating a match context and passing it to pcre2_jit_match() solved the problem.
>
> I wondered if pcre2_jit_match() indeed needs a match context, but the JIT FAST PATH API section does not mention it. pcre2_jit_match.c is not clear either: line 136 tests
>
>   if (mcontext != NULL)
>
> whereas line 159 de-references mcontext without prior testing - which leads to the AV described above.
>
> If pcre2_jit_match() indeed needs a match context, should it not return an error if missing (i.e. PCRE2_ERROR_JIT_BADOPTION) instead of raising an AV?
>
> Ralf
Re: [pcre-dev] PCRE2: PCRE2_INFO_JIT in docs, but not in pcre2.h
Hi,

> That's an error in the docs, incorrect editing of the PCRE1 document. What it should be saying is that you can use PCRE2_INFO_JITSIZE. If the result is non-zero, JIT compilation was successful. Thanks for noticing!

This gives a bit more depth to the JIT compilation check. In PCRE2, you can call jit_compile multiple times with different mode combinations (complete, partial soft, etc.). If a given mode is not yet compiled, the library performs the compilation. If the compilation is successful, PCRE2_INFO_JITSIZE grows. For example, let's consider the following steps:

1) Before any compilation, PCRE2_INFO_JITSIZE is 0.
2) Call jit_compile with complete mode. If the compilation is successful, PCRE2_INFO_JITSIZE will be increased. Otherwise it is unchanged.
3) Call jit_compile with partial hard mode. If the compilation is successful, PCRE2_INFO_JITSIZE will be increased. Otherwise it is unchanged.

To make life easier, jit_compile also returns an error code if the compilation fails. jit_compile returns success if no work is required (the requested compilation modes have already been compiled, or no compilation is requested). The only exception is when JIT is not available: jit_compile always returns an error code in that case.

Regards,
Zoltan
Re: [pcre-dev] Using the Google C++ PCRE code
Hi,

> It has almost as many matching features as PCRE with same syntax, and is guaranteed non-backtracking with linear time performance in the length of the input.

Hm, not exactly. Re2 never said that. They have a list of available features, and the more interesting ones are all grayed out: https://code.google.com/p/re2/wiki/Syntax

Linear runtime doesn't make it faster than PCRE. Sometimes it is much faster, sometimes it is much slower, sometimes they have the same speed. You can see some interesting facts in my presentation at CGO 2014: http://cgo.org/cgo2014/wp-content/uploads/2013/05/Extend_PCRE.pdf

Regards,
Zoltan
Re: [pcre-dev] SVN #1507 JIT _WIN32 compile error
Hi Ralf,

Thank you for the feedback. I hope I fixed it (r1512). Let me know if there are more issues.

Regards,
Zoltan

Ralf Junker ralfjun...@gmx.de wrote:
> Since SVN #1507, JIT no longer compiles for _WIN32. In particular, sljitUtils.c, lines 242 and 249 are missing the new allocator argument.
>
> Ralf
[pcre-dev] Enabled by default in PCRE2
Hi All,

let me start with some good news: the JIT compiler is ported to PCRE2, and mostly working (or, better to say, it has not been extensively tested yet).

But the main topic of my mail is starting a discussion about the default configuration in PCRE2. In PCRE1, we had a rather simple approach: everything was disabled by default. This avoided compatibility issues when new features were introduced, and kept the library small. But it also has disadvantages: the default library lacks features needed by many applications, and we get less feedback about issues on various systems.

We are thinking about changing this approach, and we would like to hear your opinion. The core feature set of PCRE2 is the following:

- pcre8
- pcre16
- pcre32
- unicode
- jit

These are commonly used on many systems, and they are a core part of PCRE2, rather than extensions. What would you enable by default? Of course all of them can be disabled if needed.

As for me: pcre8, pcre16, and unicode.

Regards,
Zoltan
Re: [pcre-dev] 8.35 RC1 on Aarch64
Hi Laurent,

thank you very much for spending time on fixing bugs in PCRE! I really appreciate it.

I also found these issues and fixed them yesterday: https://tahini.csx.cam.ac.uk/lurker/message/20140326.182344.f2586b5b.en.html

I don't have hardware either, but these changes are clearly needed. The cache flushes were commented out because of an older QEMU, and the missing macro is really a bug.

Regards,
Zoltan

Laurent Desnogues laurent.desnog...@gmail.com wrote:
> Hello,
>
> I was playing with pcre 8.35 on QEMU Aarch64. I think I have identified two issues in sljit/sljitNativeARM_64.c
>
> - the commented out calls to SLJIT_CACHE_FLUSH are needed
> - in emit_cmp_to0, there's a missing call to ADJUST_LOCAL_OFFSET(src, srcw);
>
> It would be nice if someone with hardware access could confirm.
>
> Hope this helps,
>
> Laurent
Re: [pcre-dev] 8.35 Release candidate available
Hi Petr,

thank you very much for the testing! This is really a great help for us. Could you try the MIPS-64 port as well?

You are the first one who actually has ARM-64 hardware :) The developer branch of qemu A64 does not support all instructions, so it fails with an undefined instruction when SLJIT_CACHE_FLUSH is called. I commented out these calls to run my code on qemu. Could you re-enable them in sljitNativeARM_64.c (there are 3 of them) and rerun the tests? Just remove the /* */ around them.

Regards,
Zoltan

Petr Pisar ppi...@redhat.com wrote:
> On Fri, Mar 14, 2014 at 04:09:41PM +, p...@hermes.cam.ac.uk wrote:
>> A first release candidate for 8.35 is now available here:
>> [...]
>> Please test as much as you can. The code has quite a few changes - as always, look in ChangeLog for details.
>
> I performed tests on Linux with glibc with JIT enabled where supported with following results:
>
> PPC:        passed
> s390:       passed
> amd64:      passed
> PPC64:      passed
> i686:       passed
> s390x:      passed
> n32 MIPS64: passed
> aarch64:    failed
> ARMv6j:     passed
>
> The aarch64 fails at two test suites:
>
> (1) pcre_jit_test:
>
> Running JIT regression tests
>   target CPU of SLJIT compiler: ARM-64 64bit (little endian + unaligned)
>   in 8 bit mode with UTF-8 enabled and ucp enabled:
>   in 16 bit mode with UTF-16 enabled and ucp enabled:
>   in 32 bit mode with UTF-32 enabled and ucp enabled:
> 8 and 16 bit: Ovector[0] value differs(J8:-4219968,I8:0,J16:-2164020,I16:0): [529] 'ab' @ 'a'
> 8 and 32 bit: Ovector[0] value differs(J8:-4219968,I8:0,J32:-1077914,I32:0): [529] 'ab' @ 'a'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:0,J32:-2164020,I32:0): [529] 'ab' @ 'a'
> ..
> 8 and 16 bit: Ovector[0] value differs(J8:-4219968,I8:0,J16:-2164020,I16:0): [532] '\b#' @ 'a'
> 8 and 32 bit: Ovector[0] value differs(J8:-4219968,I8:0,J32:-1077914,I32:0): [532] '\b#' @ 'a'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:0,J32:-2164020,I32:0): [532] '\b#' @ 'a'
> 8 and 16 bit: Ovector[0] value differs(J8:-4219968,I8:0,J16:-2164020,I16:0): [533] '(?=a)b' @ 'a'
> 8 and 32 bit: Ovector[0] value differs(J8:-4219968,I8:0,J32:-1077914,I32:0): [533] '(?=a)b' @ 'a'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:0,J32:-2164020,I32:0): [533] '(?=a)b' @ 'a'
> 8 and 16 bit: Ovector[0] value differs(J8:-4223416,I8:2,J16:-2164020,I16:2): [534] 'abc|(?=xxa)bc' @ 'xxab'
> 8 and 32 bit: Ovector[0] value differs(J8:-4223416,I8:2,J32:-1077914,I32:2): [534] 'abc|(?=xxa)bc' @ 'xxab'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:2,J32:-2164020,I32:2): [534] 'abc|(?=xxa)bc' @ 'xxab'
> 8 and 16 bit: Ovector[0] value differs(J8:-4219968,I8:0,J16:-2164020,I16:0): [535] 'a\B' @ 'a'
> 8 and 32 bit: Ovector[0] value differs(J8:-4219968,I8:0,J32:-1077914,I32:0): [535] 'a\B' @ 'a'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:0,J32:-2164020,I32:0): [535] 'a\B' @ 'a'
> ...
> 8 and 16 bit: Ovector[0] value differs(J8:-4219968,I8:0,J16:-2164020,I16:0): [563] 'a(*PRUNE)a|m' @ 'a'
> 8 and 32 bit: Ovector[0] value differs(J8:-4219968,I8:0,J32:-1077914,I32:0): [563] 'a(*PRUNE)a|m' @ 'a'
> 16 and 16 bit: Ovector[0] value differs(J16:-2164020,I16:0,J32:-2164020,I32:0): [563] 'a(*PRUNE)a|m' @ 'a'
> ..
> Successful test ratio: 99% (6 failed)
>
> (2) RunTest gets aborted by glibc stack protector or segfaults:
>
> PCRE C library tests using test data from ./testdata
> PCRE version 8.35-RC1 2014-03-14
> Testing 8-bit library
> Test 1: Main functionality (Compatible with Perl >= 5.10)
>   OK
>   OK with study
> ./RunTest: line 425: 8914 Segmentation fault $sim $valgrind ./pcretest -q $bmode $opt $testdata/testinput1 testtry
>
> This happens with the JIT mode (./pcretest -q -8 -s+ testdata/testinput2).
>
> If I disable JIT, tests pass on aarch64.
>
> Please do not consider the aarch64 tests seriously. The software (kernel, glibc, GCC) and the hardware (emulator) are still changing.
>
> -- Petr
Re: [pcre-dev] JIT and other builds almost work
Hi,

> If you don't mind I will just leave all 8 builds activated, personally I would consider sparc64-jit the most relevant build besides amd64-jit for Solaris so make it a +1 from me to encourage you to add the remaining part :-)

I will try to add it, but likely not in the near future. Unfortunately I don't have access to any Sparc64 machine, and lack of time is also a bit of a problem.

Regards,
Zoltan
Re: [pcre-dev] JIT and other builds almost work
Hi,

> One more issue is that I did not clean up the build directory between builds, it was essentially an svn update followed by configure, make, make check. Now that I do a full clean inbetween everything passes. Maybe a dependency is missing in the Makefile, but I can happily live with rebuilding every time.

this is what I suspected. It was clear that only a few files were rebuilt, and it tried to link the executables with the wrong libraries. In the long run it might be worth investigating this issue.

> The build is almost perfect, just the one on Solaris sparcv9 with 64 bit and JIT enabled dumps core: https://buildfarm.opencsw.org/buildbot_admin/builders/pcre-jit-solaris10-sparcv9/builds/0

I am a bit surprised, since according to the build log the compiler is not GCC. However the Solaris compiler seems to mimic most of the GCC features.

Unfortunately, SPARC-64 is not yet supported by the JIT compiler. I never had time to do it, and there were no strong requests on this list to do it (a few people said it would be a nice feature, but even they didn't have a sparc system). I would suggest removing (disabling) the sparc64-JIT bot until the CPU is supported.

Thank you very much for setting up these bots. This is a great help for us and the project.

Regards,
Zoltan
Re: [pcre-dev] Approved messages from build...@opencsw.org
Hi,

> Undefined                       first referenced
>  symbol                             in file
> pcre_jit_exec                       pcre_jit_test-pcre_jit_test.o
> pcre16_jit_exec                     pcre_jit_test-pcre_jit_test.o
> pcre32_jit_exec                     pcre_jit_test-pcre_jit_test.o
> ld: fatal: symbol referencing errors. No output written to .libs/pcre_jit_test

Interesting. It seems it just wants to compile a JIT test program without compiling anything else. I suspect it wants to link it with the system PCRE. Something is probably wrong with the build commands.

Setting up a buildbot for JIT is not urgent, I was just curious about your future plans. Since the compiler on Solaris is not GCC (LLVM), some porting work is likely needed before the bot could work.

Regards,
Zoltan
Re: [pcre-dev] Approved messages from build...@opencsw.org
Hi,

> How about making another list for such automatic messages? They clutter a bit this list (which is supposed to be for discussing development).

I agree that this list is mostly for porting questions, reporting bugs and requesting features. PCRE has only a few developers. When somebody makes a change, he can manually check the bots after his commit (he should, btw). This approach works well with other open-source projects where I am involved.

Regards,
Zoltan
[pcre-dev] Improve prefix search in JIT
Hi,

this mail is a summary of another performance optimization which was recently added to the JIT compiler. The new code is basically a simplified version of the Boyer–Moore string search algorithm, and its purpose is searching for fixed prefixes.

The first step is finding the longest fixed part of a pattern whose position is known. For example, the analysis of the /a.abcd.ab.*abcde/ pattern yields abcd, since this string is longer than both a and ab. Although abcde is even longer than abcd, its position is unknown, which makes it unsuitable for prefix searching. Strings shorter than four characters are considered too short, and the optimization is aborted. The analysis also detects the offset of the last character. The offset of 'd' is 5 in this particular case.

The next step is generating a table, which has 256 entries, one for each character in 8 bit mode. In 16 and 32 bit modes the entries are generated for character & 0xff codes (during runtime, only the last byte of a character is read). Each entry contains the number of characters which can be skipped when that particular character is read from the last character offset. In our example, the entry of 'd' will be zero, since this character can be part of a match. The entry of 'c' is one, 'b' is two, 'a' is three, and all other characters are four. Hence, if we read an 'e', we can advance the input pointer by four characters, and we don't need to spend CPU cycles on checking the skipped characters.

I saw a nice performance improvement on my Snort benchmark set (using an x86-64 system). The total runtime decreased to 40.8 seconds from 45.0 seconds, which is a 9.4% speedup.

Regards,
Zoltan
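
The table described above can be sketched in plain C. This is a minimal illustration on byte strings, assuming a Horspool-style loop around the table; build_skip_table and find_fixed are hypothetical names, and the code does not reproduce what the JIT compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 256

/* skip[c] is the number of characters that can be skipped when
   character c is read at the offset of the last character. */
void build_skip_table(const unsigned char *fixed, size_t len,
                      unsigned char skip[TABLE_SIZE])
{
    size_t i;
    assert(len >= 1 && len <= 255);
    /* A character that does not occur in the string allows a full skip. */
    memset(skip, (int)len, TABLE_SIZE);
    /* Later positions overwrite earlier ones, so the smallest distance wins. */
    for (i = 0; i < len; i++)
        skip[fixed[i]] = (unsigned char)(len - 1 - i);
}

/* Return the offset of the first occurrence of fixed in subject, or -1. */
long find_fixed(const unsigned char *subject, size_t subject_len,
                const unsigned char *fixed, size_t len)
{
    unsigned char skip[TABLE_SIZE];
    size_t last = len - 1;      /* index of the last character of the window */

    build_skip_table(fixed, len, skip);
    while (last < subject_len) {
        unsigned char step = skip[subject[last]];
        if (step != 0) {
            last += step;       /* a match cannot end here, skip ahead */
            continue;
        }
        /* Possible match ending at 'last': verify the whole string. */
        if (memcmp(subject + last - (len - 1), fixed, len) == 0)
            return (long)(last - (len - 1));
        last++;
    }
    return -1;
}
```

For the string abcd this gives the entries from the mail: 0 for 'd', 1 for 'c', 2 for 'b', 3 for 'a', and 4 for everything else.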
Re: [pcre-dev] \K in lookahead assertion gives unpredictable result
Hi,

> In Linux, it just prints the rest of the string, starting at the start position - I think this is probably accidental. When the length of the matched string is negative, it prints to the end of the string. Presumably in your case there isn't a trailing zero...

I didn't know it is accidental. I thought printing an empty string is less useful for a testing tool. Perhaps printing the offsets themselves would be the best.

I tried pcretest on a win7 box, compiled by the Visual C compiler (version 15.00.30729.01 for x64) and by MinGW 4.4.7, and both print a 'b'.

Regards,
Zoltan
Re: [pcre-dev] Atomic group optimizations issue
Hi,

> Can you explain, why capturing atomic groups with capturing brackets inside can't produce tail recursion in cases when '(?:' can do this?

because PCRE has to prepare for the worst case scenario. Your input string does not match the capturing bracket, so the capturing bracket does not increase the backtracking depth (it could if the input were different). However, pcre_compile does not know anything about the input. The compiled pattern must work even if the capturing bracket matches sometimes, so the engine must choose the costly OP_ONCE instead of OP_ONCE_NC.

Regards,
Zoltan
Re: [pcre-dev] \K in lookahead assertion gives unpredictable result
Hi,

>> Anyway I suspect the result is correct (it start with b)
>
> I don't think so. Result must not have 'b'.

Did you see my other mail yesterday? I explained there that pcretest yields 'b' for testing purposes (it handles the cases where startoffset is greater than endoffset this way).

I also described how this issue can be debugged. The 'ab' input should be terminated by zero (I showed where the zero is appended), but it seems something changes that value. Perhaps a read/write breakpoint could help. Could you check that the zero is appended in your case as well? If so, could you send me a backtrace when the zero is changed?

Regards,
Zoltan
Re: [pcre-dev] \K in lookahead assertion gives unpredictable result
Hi,

just a little help for debugging this. In such cases, where the start offset is bigger than the end offset (1 > 0 in this case), pcretest prints out the characters from the start offset to the end of the input. It could print an empty string (as Perl does), but that would not be helpful for testing/debugging purposes.

The characters are printed by this function:

static int pchars(pcre_uint8 *p, int length, FILE *f)

It starts with:

if (length < 0) length = strlen((char *)p);

I suspect something is wrong with p. The zero terminated string is ensured by line 4925 in pcretest:

if (pcre_mode == PCRE8_MODE) { *q8 = 0; len = (int)(q8 - (pcre_uint8 *)dbuffer); }

Please check what dbuffer contains after this line, and please check what the value of 'p' is in pchars(..).

Philip, perhaps we could write a note when the length is < 0. Such as:

 0: start offset (1) is bigger than end offset (0), print input to the end
    b

Hope this helps,
Zoltan

Zoltán Herczeg hzmes...@freemail.hu wrote:
> Hi,
>
> yes it seems there is a rubbish after the 'b'. However, I don't see this behavior in the recent release (under Linux at least), and your binary is half year old. Could you try it with a newer version?
>
> Anyway I suspect the result is correct (it start with b), just the printing does not stop after the end of the input. This might be a pcretest or windows libc bug.
>
> Regards,
> Zoltan
>
> ND nad...@mail.ru wrote:
>> Good day!
>>
>> Here is pcretest.exe listing:
>>
>> PCRE version 8.34-RC 2013-06-14
>>
>> /(?=a\K)/
>>     ab
>>  0: b\x89b\x1f\xe4J~\x04
>>
>> This match is unpredictable for me. May be a bug there.
>>
>> Thanks a lot.
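
The length check quoted from pchars() in the message above can be isolated in a few lines. This is a sketch under the assumption that the length handed to pchars() is the end offset minus the start offset; chars_to_print and match_length are hypothetical helpers, not pcretest functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Modelled on the first statement of pchars(): a negative length means
   "print everything up to the terminating zero". If that zero has been
   overwritten, strlen() runs on into whatever follows in memory. */
size_t chars_to_print(const unsigned char *p, int length)
{
    assert(p != NULL);
    if (length < 0)
        return strlen((const char *)p);
    return (size_t)length;
}

/* Assumed length for a match reported as start_offset..end_offset. */
int match_length(int start_offset, int end_offset)
{
    return end_offset - start_offset;
}
```

For /(?=a\K)/ on ab the offsets are 1 and 0, so the length is -1 and a correctly terminated subject prints exactly one character, the 'b'.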
Re: [pcre-dev] \K in lookahead assertion gives unpredictable result
Hi,

yes it seems there is rubbish after the 'b'. However, I don't see this behavior in the recent release (under Linux at least), and your binary is half a year old. Could you try it with a newer version?

Anyway I suspect the result is correct (it starts with b), just the printing does not stop after the end of the input. This might be a pcretest or windows libc bug.

Regards,
Zoltan

ND nad...@mail.ru wrote:
> Good day!
>
> Here is pcretest.exe listing:
>
> PCRE version 8.34-RC 2013-06-14
>
> /(?=a\K)/
>     ab
>  0: b\x89b\x1f\xe4J~\x04
>
> This match is unpredictable for me. May be a bug there.
>
> Thanks a lot.
Re: [pcre-dev] Atomic group optimizations issue
Hi Naden,

I like that you always have some interesting questions :) It is great to talk about the internals and optimizations of PCRE.

Here is the answer to your question: in the second case, you see the effect of an optimization called tail recursion. You can even see that in the first case, if you remove the capturing brackets around 1:

  re> /(?>1|2|.)*?(3|4)/
data> \Mabcdefghijklmnopqrstuvwxyz
Minimum match() limit = 190
Minimum match() recursion limit = 3
No match

However, in your first case, PCRE is forced to use OP_ONCE instead of OP_ONCE_NC, and that opcode is more costly (in terms of both stack and runtime), since the interpreter has to restore all capturing brackets (JIT is a bit more clever here, it only restores the brackets inside the atomic block). It does not know that the capturing bracket will never match. At least not for this particular input.

The result is also different if we put the capturing bracket around the dot:

  re> /(?:1|2|(.))*?(3|4)/
data> \Mabcdefghijklmnopqrstuvwxyz
Minimum match() limit = 217
Minimum match() recursion limit = 29
No match

You can see an increase here as well.

Suggestion: if it is possible, don't use capturing brackets inside atomic blocks.

Regards,
Zoltan

ND nad...@mail.ru wrote:
> Good day!
>
> Here is two pcretest.exe listings:
>
> PCRE version 8.34-RC 2013-06-14
>
> /(?>(1)|2|.)*?(3|4)/
>     \Mabcdefghijklmnopqrstuvwxyz
> Minimum match() limit = 217
> Minimum match() recursion limit = 55
> No match
>
> PCRE version 8.34-RC 2013-06-14
>
> /(?:(1)|2|.)*?(3|4)/
>     \Mabcdefghijklmnopqrstuvwxyz
> Minimum match() limit = 217
> Minimum match() recursion limit = 3
> No match
>
> In listing 2 (?> is replaced by (?:. And Minimum match() recursion limit unexpectedly reduces to 3 from 55. I guess this happens due some internal PCRE optimizations.
>
> Is there possibility to reduce Minimum match() recursion limit for atomic groups in (?:-way?
>
> Best regards
Re: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.
Hi,

no. This is an experimental new feature. The patch is not even in trunk. Anyway, I don't think it will affect your port, since it is UTF only.

Regards,
Zoltan

Ze'ev Atlas zatl...@yahoo.com wrote:
> Is the patch already incorporated in the latest version of 8.34? I'd like to download the best version in order to perform my port to z/OS.
>
> Ze'ev Atlas
>
> From: Zoltan Herczeg hzmes...@freemail.hu
> To: pcre-dev@exim.org
> Sent: Wednesday, December 18, 2013 1:20 AM
> Subject: [pcre-dev] [Bug 1419] PCRE slower at matching UTF8 character classes.
>
> --- You are receiving this mail because: ---
> You are on the CC list for the bug.
>
> http://bugs.exim.org/show_bug.cgi?id=1419
>
> Zoltan Herczeg hzmes...@freemail.hu changed:
>
>            What    |Removed |Added
> Attachment #673 is |0       |1
> obsolete           |        |
>
> --- Comment #17 from Zoltan Herczeg hzmes...@freemail.hu 2013-12-18 06:20:16 ---
> Created an attachment (id=674)
>  -- (http://bugs.exim.org/attachment.cgi?id=674)
> Second patch
>
> Better patch. Also includes JIT support.
>
> --
> Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
Re: [pcre-dev] JIT compilation of regexps
Hi,

> This doesn't answer the question whether the pre-compilation would fail on JIT-enabled systems, or perhaps it would still speed something up?

I am not sure I understand this question, but you can pre-compile a regex, save its byte code, reload it, and compile it with JIT. You cannot save and reload JIT code, because several resources are accessed by absolute addresses.

> The other question is whether it is impossible to store the JIT data in principle, or perhaps there could be an option to store it somehow in some of the future releases?

If you are willing to sacrifice performance, you could probably do it. But this is not an easy task, since you would need to generate position independent code with relative resource accesses. There is another solution, but it is complex and involves mmap magic. You would need an absolute address space where all regexps, character properties, JIT code, etc. are stored, and this address space could be saved to disk. Later you can map this address space and restore its content. The process is similar to how the ld tool loads binaries on Linux. You would also need to redirect all allocations to use this space.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] 8.34-RC1 release candidate available for testing
Hi Petr,

thank you for the report, and sorry for the late answer. The bug is fixed here: http://lists.pcre.org/lurker/message/20131130.070502.fb1cef50.en.html

Regards, Zoltan

Petr Pisar ppi...@redhat.com wrote:
> On Tue, Nov 19, 2013 at 03:47:09PM +, p...@hermes.cam.ac.uk wrote:
>> I have just made 8.34-RC1 available for testing here:
> This release does not pass tests on MIPS. It fails JIT RunTest `Test 6: Unicode property support (Compatible with Perl >= 5.10)':
>
> Test 6: Unicode property support (Compatible with Perl >= 5.10)
> OK
> OK with study
> --- ./testdata/testoutput6 2013-11-12 16:59:09.0 +0100
> +++ testtry 2013-11-22 11:55:28.0 +0100
> @@ -1336,7 +1336,7 @@
>  /^[[:print:]]*/8W
>  A z\x{a0}\x{a1}
> - 0: A z\x{a0}\x{a1}
> + 0: A
>  /^[[:punct:]]*/8W
>  .+\x{a1}\x{a0}
>
> and others. Full test-suite.log is attached.
> -- Petr
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Using PCRE upon Asian and other two-byte national codings
Hi,

currently PCRE character tables can only hold lowercase / flipped case and various type bits for the first 256 characters. Supporting the whole 64K character set in 16 bit mode would take 409600 bytes of memory, which is less than half a megabyte. Today, even smartphones can afford that cost. The trade-off would be that the same tables could no longer be shared between the 8/16/32 bit modes, since the lowercase / flipped case tables would depend on the natural character length. Hence a table with only 256 characters would be bigger in 16/32 bit mode than it is now. (Note: the table size would always be divisible by 256. This would allow us to leave 8 bit mode unchanged, and we could also support character sets which do not have 64K characters in 16 bit and especially in 32 bit mode, where we have 4096M characters.)

I am sure we cannot do this for 8.34 (this is not an easy task), but if this is important for many people, we might think about it later.

Regards, Zoltan

p...@hermes.cam.ac.uk wrote:
> On Sat, 23 Nov 2013, Zoltán Herczeg wrote:
>> PCRE supports 2 or 4 byte character encodings, but character properties are only supported for 0-255 character codes.
> I think I had better clarify that, for the record. The 16-bit and 32-bit PCRE libraries do support Unicode character properties, just like the 8-bit library. However, locale-based properties apply only to 0-255 character codes.
> Philip -- Philip Hazel
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
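The 409600-byte figure above can be reproduced with a little arithmetic. The sketch below assumes a four-part table layout (a lowercase table, a flipped-case table, ten class bitmaps, and one type byte per character); that layout and the name `table_bytes` are assumptions made for illustration and are not spelled out in the message.

```python
def table_bytes(chars, unit_bytes, classes=10):
    """Estimate the size of a character-table set covering `chars` characters.

    Assumed layout (illustrative only):
      - lowercase table:    one code unit per character
      - flipped-case table: one code unit per character
      - class bitmaps:      `classes` bitmaps with one bit per character
      - type table:         one byte per character
    """
    lower = chars * unit_bytes
    flipped = chars * unit_bytes
    class_bitmaps = classes * chars // 8
    type_bytes = chars
    return lower + flipped + class_bitmaps + type_bytes


if __name__ == "__main__":
    print("256 characters, 8-bit units:  ", table_bytes(256, 1))
    print("256 characters, 16-bit units: ", table_bytes(256, 2))
    print("64K characters, 16-bit units: ", table_bytes(65536, 2))
```

Under these assumptions the 64K-character, 16-bit case comes out at the quoted 409600 bytes, and a 256-character table with 16-bit units is larger than the 8-bit one, which is the trade-off the message describes.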
Re: [pcre-dev] [Bug 1404] New: build fail due to undefined reference to `check_char_prop' in pcre 8.34-trunk
Hi Rand,

thank you for the report, and sorry for the late answer. We did a lot of rework on auto-possessifying in PCRE, and the code still requires testing. I suspect --enable-unicode-properties was not passed to configure, since I only see this error when it is not there. The fix was trivial, but all \X related test cases were moved from test2 to test6, so the patch is quite big: https://lists.exim.org/lurker/message/20131025.173753.f7e011ef.en.html

Regards, Zoltan

r...@sent.com wrote:
> (I posted this as a bug -- still pending approval -- but had not yet joined this list; it appears the auto-post to this list didn't happen. Reposting ... )
>
> I build/install a local pcre instance from head,
>
> cd /usr/local/src/pcre
> svn info
> Path: .
> Working Copy Root Path: /usr/local/src/pcre
> URL: svn://vcs.exim.org/pcre/code/trunk
> Repository Root: svn://vcs.exim.org/pcre
> Repository UUID: 2f5784b3-3f2a-0410-8824-cb99058d5e15
> Revision: 1383
> Node Kind: directory
> Schedule: normal
> Last Changed Author: zherczeg
> Last Changed Rev: 1383
> Last Changed Date: 2013-10-18 10:50:06 -0700 (Fri, 18 Oct 2013)
>
> ./configure \
>  --disable-static \
>  --enable-jit \
>  --with-link-size=2 \
>  --with-match-limit=1000 \
>  --enable-utf \
>  --enable-unicode-properties \
>  --enable-newline-is-lf
> make
> make install
>
> pcre-config --version
> 8.34-RC
> pkg-config libpcre --libs --cflags
> -I/usr/local/include -L/usr/local/lib64 -lpcre
>
> Previously OK, building nginx 1.5.6 against that PCRE instance now fails @,
>
> ./configure \
>  ... --with-pcre=/usr/local/src/pcre --with-pcre-jit \
>  ...
> make
> ...
> objs/addon/naxsi_src/naxsi_json.o \
> objs/addon/src/ndk.o \
> objs/ngx_modules.o \
> -L/usr/local/ssl/lib64 -Wl,-rpath,/usr/local/ssl/lib64 -lssl -lcrypto -ldl -lz -Wl,-E -lpthread -lcrypt -L/usr/local/lib64/libluajit-5.1.so -lluajit-5.1 -lm /usr/local/src/pcre/.libs/libpcre.a -lssl -lcrypto -lz -lGeoIP
> /usr/local/src/pcre/.libs/libpcre.a(libpcre_la-pcre_compile.o): In function `compare_opcodes':
> pcre_compile.c:(.text+0x2811): undefined reference to `check_char_prop'
> collect2: error: ld returned 1 exit status
> make[1]: *** [objs/nginx] Error 1
> make[1]: Leaving directory `/data/src/nginx-1.5.6'
> make: *** [build] Error 2
>
> That undefined reference to 'check_char_prop' appears in the PCRE sources.
>
> fyi, this is on
>
> uname -a
> Linux test/loc 3.7.10-1.16-desktop #1 SMP PREEMPT Fri May 31 20:21:23 UTC 2013 (97c14ba) x86_64 x86_64 x86_64 GNU/Linux
> gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
> Target: x86_64-suse-linux
> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --program-suffix=-4.8 --enable-linux-futex --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
> Thread model: posix
> gcc version 4.8.2 20131016 [gcc-4_8-branch revision 203692] (SUSE Linux)
>
> rand
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Auto-possessifying improvements in PCRE
Hi all,

we have been working on the auto-possessifying optimization in PCRE for some time, and I would like to give you a brief summary of the results we achieved.

This optimization replaces greedy/non-greedy single character repetitions with their appropriate possessive form. A simple example is rewriting /a+b/ to /a++b/. This optimization has been part of PCRE for a very long time, but it only supported simple cases before. Now it can even replace \s* with \s*+ in /\s*(?:left|right)?hand/. Of course, using \s*+ directly in the pattern would have the same effect, but possessive quantifiers are among the lesser known regular expression features, and they are rarely used (this is my impression at least). The performance of possessive quantifiers is usually much higher than that of other quantifiers, since the backtracking phase can be skipped entirely.

However, before I show some results, let me tell you a bit more about possessive quantifiers. Their primary purpose is to define unbreakable multi-byte character sequences. The improved matching speed is just a side effect. For example, sch represents a single consonant in German, so splitting it into sc and h is meaningless. Another nice example is newlines: most of the time a newline can be \r, \n or \r\n. If we want to find the aa and bb strings separated by at least two newlines, we can use the /aa(?>\r\n|\r|\n){2,}bb/ pattern. Without the (?> bracket type, the pattern would happily accept aa\r\nbb as well, and that is incorrect.

Back to the results, then. Some time ago I got a pattern set used by an Intrusion Detection System, and I use it for benchmarking and also for getting ideas about how people use regular expressions. Sometimes I browse http://regexlib.com/ as well. I realized that most patterns are not exactly efficient, so regex compiler optimizations such as auto-possessifying seem very important. The gain provided by this particular optimization is the following (INT: interpreter, JIT: PCRE-JIT compiler, s: seconds):

was:      INT: 412.16 s, JIT: 86.22 s
now:      INT: 182.94 s, JIT: 45.46 s
progress: INT: 125%      JIT: 90%

Of course, on other pattern sets the results might be totally different, but we hope this helps to improve the overall performance of our favourite regex engine.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
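The newline example in the announcement above can be tried out with a small sketch. It uses Python's `re` module as a stand-in for PCRE, so it shows the matching semantics only, not PCRE's auto-possessifying itself. Because older Python versions lack atomic groups, the strict pattern emulates one with a lookahead plus a backreference; that emulation and the names `LOOSE`, `STRICT` and `has_newline_gap` are assumptions introduced here for illustration.

```python
import re

# Ordinary (non-atomic) group: when the match fails, the engine may go back
# and split "\r\n" into "\r" and "\n", counting one newline as two.
LOOSE = re.compile(r"aa(?:\r\n|\r|\n){2,}bb")

# Atomic-group emulation: the lookahead captures one whole newline and \1
# then consumes exactly that text. The engine does not retry alternatives
# inside a lookahead that has already succeeded, so "\r\n" stays in one piece.
STRICT = re.compile(r"aa(?:(?=(\r\n|\r|\n))\1){2,}bb")


def has_newline_gap(pattern, subject):
    """Return True if `pattern` finds aa and bb separated by 2+ newlines."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    for subject in ("aa\r\nbb", "aa\r\n\r\nbb", "aa\n\nbb"):
        print(repr(subject),
              "loose:", has_newline_gap(LOOSE, subject),
              "strict:", has_newline_gap(STRICT, subject))
```

The interesting subject is `aa\r\nbb`, which contains a single newline: the loose pattern accepts it, while the strict one does not.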
Re: [pcre-dev] question about PCRE performance
Hi Yaron,

I started to sketch up something about the advantages and disadvantages of different engine types: http://sljit.sourceforge.net/regex_compare.html I plan to add more info in the future.

Regards, Zoltan

Zoltán Herczeg hzmes...@freemail.hu wrote:
> Hi Yaron,
>
> It depends on your patterns, your input, etc. PCRE is a backtracking, NFA based engine, while RE2 is a DFA based engine. You can see a comparison of PCRE and other engines (including RE2) here: http://sljit.sourceforge.net/regex_perf.html
>
> Actually there is a myth that DFA based engines are faster because they have guaranteed linear runtime, but people tend to forget to mention that generating their state machine requires exponential runtime. In practice, both DFA and NFA based engines have patterns for which their runtime is exponential, and these patterns are called pathological cases.
>
> A PCRE pathological case: /(a*)*b/
> An RE2 pathological case: /a[^b]{64}a/
>
> The structure of these pathological cases differs across engine types, so a pathological case of an NFA engine is usually fast on a DFA based engine, and vice versa (you can see this on the link you sent: the author focuses on some PCRE pathological cases to advertise his engine). But you can probably combine them to get slow execution on any engine.
>
> Actually it is possible to make a DFA based engine with truly linear runtime by using on-the-fly state generation, but those engines are the slowest for typical patterns, so they are not popular (although they can be a good choice if you have many patterns which are pathological on both engine types).
>
> The other notable difference is that NFA engines have a much bigger feature set, since their state machine can contain arbitrary actions (conditional decisions, assertions, etc.). PCRE is really strong here, since it has more features than any other engine in the world. Source: http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
>
> People usually think choosing a regular expression engine is a simple task, but this is not really true. On the contrary, you need to weigh a lot of strengths and weaknesses to find the best choice. If you have enough time, you can try multiple options (e.g. the JIT compiler in PCRE) and decide according to the results.
>
> Regards, Zoltan
>
> Yaron Dayagi yday...@trustwave.com wrote:
>> Hello, I need your assistance regarding PCRE performance. Someone who works with me suggested using RE2. I'm a bit skeptical and would like to stick to PCRE. I got to a Google page about RE2 and there was a link to http://swtch.com/~rsc/regexp/regexp1.html. Is the data in the article correct? Can you tell me anything about the comparison between RE2 and PCRE?
>> Thank you, Yaron.
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
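The state blow-up behind the /a[^b]{64}a/ example can be made concrete with a textbook subset construction. The sketch below is written for this note and is not taken from RE2 or PCRE; the three-letter alphabet (where `c` stands for "any character other than a or b") and the name `dfa_state_count` are assumptions. It counts the DFA states reachable for an unanchored search of `a[^b]{n}a`.

```python
def dfa_state_count(n):
    """Count reachable DFA states for an unanchored search of a[^b]{n}a.

    NFA states: 0 is the start (loops on every character because the search
    is unanchored), 1..n+1 track progress through the pattern, and n+2 is
    the accepting state.
    """
    accept = n + 2
    alphabet = "abc"  # 'c' represents every character other than 'a' and 'b'

    def step(states, ch):
        nxt = {0}  # the start state is always active
        for s in states:
            if s == 0 and ch == "a":
                nxt.add(1)            # leading 'a'
            elif 1 <= s <= n and ch != "b":
                nxt.add(s + 1)        # one more [^b]
            elif s == n + 1 and ch == "a":
                nxt.add(accept)       # trailing 'a'
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            nxt = step(current, ch)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for n in range(0, 11):
        print(n, dfa_state_count(n))
```

In this model the count doubles every time n grows by one, so n = 64 is far out of reach for an eagerly built DFA, while a backtracking engine handles the same pattern without trouble.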
Re: [pcre-dev] [Bug 1380] Report match when doesn't match and vice-versa
Hi, How would you make a pattern to extract lines that violate the format? This should be really easy using negative assertions. E.g: a line must match /^PATTERN$/ (non-multiline match). A non-matching line can be listed as: /^(?!PATTERN$).*/ Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
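A minimal sketch of the negative-assertion idea, using Python's `re` module as a stand-in for PCRE. The line format (`ddd-dddd`) and the helper name `violating_lines` are made up for the example; each line is matched on its own, as in the non-multiline case described above.

```python
import re

# Hypothetical format that every valid line must satisfy.
PATTERN = r"\d{3}-\d{4}"

# A line is reported when it does NOT match ^PATTERN$ as a whole.
VIOLATION = re.compile(r"^(?!" + PATTERN + r"$).*")


def violating_lines(lines):
    """Return the lines that do not match the expected format."""
    return [line for line in lines if VIOLATION.match(line)]


if __name__ == "__main__":
    sample = ["555-1234", "5551234", "555-12345", "", "123-4567"]
    for line in violating_lines(sample):
        print(repr(line))
```

Note that the `$` sits inside the assertion, so a line that merely starts with a valid prefix (such as `555-12345`) is still reported.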
Re: [pcre-dev] JIT have very limited applicability
Hi,

JIT is typically used by server applications such as the NGINX web server, the Suricata Intrusion Detection System, the ModSecurity firewall, etc., and it seems they are happy with it. Some Korean researchers wrote a paper about integrating JIT into Snort: http://kiise.or.kr/e_journal/2013/4/CST/pdf/01.pdf They reach the same conclusion as you: JIT compilation has a considerable overhead, but since server applications are rarely restarted, this is not an issue. You can also use deferred compilation in a second thread if server startup time is important (and use the interpreter while a pattern is not yet compiled). The other typical use case is searching for a single pattern in a huge input, as in GNU grep, ag (the silver searcher), etc.

In short, JIT only helps if you compile a pattern once and use it several times. It is inefficient for searching for a pattern once in a small string, such as checking a version string.

I don't understand this:

> pattern is reused many times in a row in VM, without use of other patterns between.

Why can't other patterns be used?

Regards, Zoltan

ND nad...@mail.ru wrote:
> Good day! As I tried to say some time ago, PCRE-JIT is useless in most applications. It only slows down the matching process. Consider the timings. In most real situations JIT compile+run time greatly exceeds interpreter compile+run time. Taking this into account, there are very few circumstances in which JIT brings benefits to the user:
> - JIT compile+run time is less than interpreter run time (I think the number of such cases is about zero);
> - a pattern is reused many times in a row in the VM, without other patterns being used in between (IMHO the number of such cases is also about zero).
> So JIT without the ability to use JIT-precompiled patterns has a ve-e-e-ry limited applicability. Maybe I'm wrong. Please correct me if so.
>
> I propose adding to JIT the ability to save JIT-compiled data, so that a fully precompiled pattern can be used by the main application. This eliminates the JIT compile time cost and makes the speed benefits of JIT available in most cases. I understand that compiled JIT data saved on one platform probably can't be used on other platforms. But it can be used very effectively by an application that always starts on the same platform.
>
> I take this opportunity to thank Philip and Zoltan for their large efforts and a great product.
> Best regards.
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] New API for PCRE
Hi,

> I think one of the drivers for a new API is that the current (int) option bits are pretty much all used up. I don't know how the new API might address that. I'm saying that OPCRE would still define its options as (int) while the equivalent set of options in NPCRE might not fit in a single (int).

We are thinking about a context based approach for the new API. Instead of pre-defined data structures, there will be getters and setters for an appropriate list of flags or variables. We don't plan a getter-setter for each flag or variable; e.g. there will be a struct which contains all allocator related params (malloc, free, realloc pointers, and user params), and it can be read or written with pcre_[get|set]_allocator(context, pointer_to_this_struct). The details are not finalized. This provides a lot of freedom (internal structures can be reorganized, computed flags can be supported), so there is no need to worry about flags or arguments anymore. Copying and duplicating contexts are also planned, to make it easy to share the same context between several patterns.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT access violation
Hi,

yes, it helps a bit, since now the length is 69 on my side as well. However, I still cannot see the buffer overflow, since the offset is 64. The value (OP_KET) is also correct. Could you print re after the re = (REAL_PCRE *)(PUBL(malloc))(size); line, and common->start and GET(common->start, 1) as well? If the offset is really incorrect, common->start - re will probably not be equal to 56.

Regards, Zoltan

Ralf Junker ralfjun...@gmx.de wrote:
> On 13.05.2013 12:36, Zoltán Herczeg wrote:
>> this is quite interesting. Do I see it right that your pattern only contains two fixed characters (backslash and space)? On a 32 bit Linux system, in 8 bit mode, that is 67 bytes long (56 bytes for header, 11 for pattern) instead of 69. That read access reads byte 63, which is perfect.
> The pattern contains, without leading / trailing slashes: \Q\ \E
> The core pattern is one backslash and one space each.
>> This is the interesting part:
>> size = sizeof(REAL_PCRE) + (length + cd->names_found * cd->name_entry_size) * sizeof(pcre_uchar);
>> Could you print sizeof(REAL_PCRE), length, and size here?
> After this line, the numbers are as follows:
> sizeof(REAL_PCRE) = 56
> length = 13
> size = 69
> Does it matter that I compile with LINK_SIZE=3? Yes, it does. If I recompile with LINK_SIZE=2 (the default), I get these numbers:
> sizeof(REAL_PCRE) = 56
> length = 11
> size = 69
> Does this help?
> Ralf
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
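The numbers in this thread can be cross-checked with a little arithmetic. The sketch below assumes that the compiled form of the two-literal pattern is OP_BRA plus a link, one opcode and one character per literal, OP_KET plus a link, and OP_END; that layout is an assumption made here (it reproduces the 11 and 13 quoted above), and both function names are hypothetical. The size formula is the one quoted in the message.

```python
def compiled_length(link_size, literal_chars=2):
    """Code units needed for the pattern body in 8-bit mode (assumed layout)."""
    op_bra = 1 + link_size          # opening bracket opcode + link
    op_chars = 2 * literal_chars    # opcode + character for each literal
    op_ket = 1 + link_size          # closing bracket opcode + link
    op_end = 1                      # end-of-pattern opcode
    return op_bra + op_chars + op_ket + op_end


def total_size(header, length, names_found=0, name_entry_size=0, uchar_size=1):
    """size = sizeof(REAL_PCRE) + (length + names * entry_size) * sizeof(pcre_uchar)"""
    return header + (length + names_found * name_entry_size) * uchar_size


if __name__ == "__main__":
    for link_size in (2, 3):
        length = compiled_length(link_size)
        print("LINK_SIZE=%d length=%d size=%d"
              % (link_size, length, total_size(56, length)))
```

With a 56-byte header this gives 67 bytes for LINK_SIZE=2 and 69 bytes for LINK_SIZE=3, which agrees with the 67 and 69 byte figures mentioned at the start of the exchange.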
Re: [pcre-dev] Matching file contents as one string using PCRE_DOTALL
Hi,

dot (.) is the inverse of \R. If you need to match everything, use [\x00-\xff] or (.|\R) or \p{Any} (the latter only if Unicode is enabled). I would choose the first for ASCII matches and the third for Unicode matches.

Regards, Zoltan

pcun...@fsmail.net wrote:
> I have read the entire PCRE documentation and have not found something that states clearly what I'm looking to do. The documentation says: By default, PCRE treats the subject string as consisting of a single line of characters (even if it actually contains newlines). That is what I want to hear, but it is not working out in practice. Testing version 8.32.
>
> Goals:
> 1. Load a text file into a std::wstring buffer.
> 2. Have no regard for the concept of lines. One big string is fine.
> 3. Use positive lookahead to find terms in ANY order.
> 4. I don't ever care about capturing or using lookbehinds.
> 5. Any character before or after the query terms is fine.
>
> Regex: Thus I have come up with this: (?=.*hello.*)(?=.*world.*)
>
> Sample data from file.txt:
> First line, hello present.
> Second with world present.
>
> Please verify this solution: however, it only seems to work if I set PCRE_DOTALL. What are the consequences of using this flag? Is there a better way? I don't really want to use the ^$ combo, as that is the notion of lines, no? They don't work anyway.
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
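The behaviour discussed in this exchange can be reproduced with a short sketch. It uses Python's `re` module as a stand-in for PCRE, so `re.DOTALL` plays the role of PCRE_DOTALL; `\R` and `\p{Any}` are not available there, so only the dot and the `[\x00-\xff]` variants are shown. The helper name `contains_all` is made up for this example.

```python
import re

SUBJECT = "First line, hello present.\nSecond with world present."


def contains_all(subject, terms, flags=0, any_char="."):
    """Check that every term occurs somewhere in `subject`, in any order.

    One lookahead of the form (?=<any_char>*term) is built per term.
    """
    pattern = "".join("(?=%s*%s)" % (any_char, re.escape(term)) for term in terms)
    return re.search(pattern, subject, flags) is not None


if __name__ == "__main__":
    terms = ["hello", "world"]
    print("dot, no flag:   ", contains_all(SUBJECT, terms))
    print("dot with DOTALL:", contains_all(SUBJECT, terms, re.DOTALL))
    print("[\\x00-\\xff]:    ", contains_all(SUBJECT, terms, any_char=r"[\x00-\xff]"))
```

Without the flag, the dot stops at the newline, so no single start position can see both terms; with DOTALL, or with an explicit character class that includes the newline, the lookaheads can scan the whole buffer. For ASCII data the class variant avoids changing the meaning of every dot in a larger pattern.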
Re: [pcre-dev] PCRE Wiki and Development Repos etc
Hi,

> We could create a PCRE project on github, thus allowing some other people to take management of the whole thing - in this case we could move the wiki across very easily and change tahini's redirects. It would also be possible to reflect the svn repository into a github git repository which is a step towards being able to change revision control systems, but does give you access to some of github's tools.

Personally I don't mind moving the whole project somewhere else. But we should choose carefully; there are many project hosting services around the world. On github there are many projects called pcre, and I am not sure anyone could tell which one is the real one. Another thing is -f option, which can be used to The git tool is nice in general, but it is possible to destroy a whole project with -f.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT increase stack requirement for SVN 1295
Hi Ralf,

> Many thanks! I will keep an eye on SVN development and will be available for testing!

Philip told me that I can simply reconstruct repetitions in JIT, instead of introducing new opcodes and making other changes in the interpreter. The reconstruction is imprecise, e.g. /(?:ab){2}(?:ab)(?:ab)/ is predicted as /(?:ab){4}/, or /(?:ab)(?:(?:ab)(?:(?:ab)?)?)?/ as /(?:ab){1,3}/, but that is not a problem in practice. We can even call that an optimization :)

Here are the compilation statistics of some patterns recently reported as way too resource hungry:

NOW:
/(a)(?2){0,1999}?(b)/
  compile time: 0 ms, compile stack usage: 7008 bytes
/(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
  compile time: 0 ms, compile stack usage: 6088 bytes
/((\w{4}aa){4}aa){4}aa){3}aa){4}aa){2}aa){4}aa){3}aa){3}aa){11}aa){3}aa/
  compile time: 200 ms, compile stack usage: 1912 bytes

WAS:
/(a)(?2){0,1999}?(b)/
  compile time: 10 ms, compile stack usage: 889264 bytes
/(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
  compile time: 10 ms, compile stack usage: 889312 bytes
/((\w{4}aa){4}aa){4}aa){3}aa){4}aa){2}aa){4}aa){3}aa){3}aa){11}aa){3}aa/
  compile time: 6300 ms, compile stack usage: 1440775 bytes

However, compiling a pattern with 2000 nested brackets such as /...(a)...)/ can still consume a huge amount of resources, even though it can be considered a simple pattern. I have no plans for optimizing such patterns.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT speed tests
> Thanks... I'm so sorry, but my bad English doesn't allow me to understand this. Are there other words? I don't see how to speed up my application by using another thread.

You probably have a list of patterns (array, linked list, etc.). Each item should have a pcre and a study member. When your application is initialized, set every study to NULL and start your application. After all patterns are loaded, start a thread which processes the items of the list one by one and performs a pcre_study on each. Copy the returned value to the study member. This is thread safe, since the main thread only reads the study member when you call pcre_exec, and only the worker thread modifies it. Your application starts at normal speed, which gradually improves as more patterns are compiled by the worker thread. If you have a pattern priority order, you can compile the more important ones first.

If you already have a study member, you need to be a bit more careful when you replace it, because you cannot free it while the main thread is using it. However, it is not difficult to avoid such situations.

> Are these memory reads/writes more costly than JIT recompiling? Zoltan, can you answer this question please?

I have never tried it, so I don't know the exact performance overhead. You can try to implement it, and if it works, I am willing to review your patches.

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
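The worker-thread scheme described above might look roughly like the following sketch. It is written in Python rather than against the PCRE C API, so `re.compile` stands in for pcre_compile and the function `hypothetical_study` stands in for an expensive pcre_study with JIT; `PatternItem` and `study_in_background` are likewise made-up names. In C, publishing the study pointer would additionally need an atomic store or a similar guarantee; in this sketch a plain attribute assignment is relied upon.

```python
import re
import threading


def hypothetical_study(compiled):
    """Stand-in for an expensive study/JIT step; returns a 'fast' matcher."""
    return lambda subject: compiled.search(subject) is not None


class PatternItem:
    def __init__(self, source):
        self.pcre = re.compile(source)  # always available right after loading
        self.study = None               # filled in later by the worker thread

    def matches(self, subject):
        study = self.study              # read once; it may be published at any time
        if study is not None:
            return study(subject)       # fast path once the worker is done
        return self.pcre.search(subject) is not None  # fallback path


def study_in_background(items):
    """Start a worker that studies the items one by one, in list order."""
    def worker():
        for item in items:
            item.study = hypothetical_study(item.pcre)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


if __name__ == "__main__":
    items = [PatternItem(r"\d+"), PatternItem(r"^abc")]
    worker = study_in_background(items)
    # The main thread can match immediately, whichever path is available.
    print([item.matches("abc 42") for item in items])
    worker.join()
    print(all(item.study is not None for item in items))
```

Putting the most important patterns first in the list gives the priority ordering mentioned in the message, since the worker walks the list in order.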
Re: [pcre-dev] JIT speed tests
This is an exponential case for the code generator. It is similar to the one sent by Ralf recently. In PCRE, if P is not a character literal or a backreference, (P){n,m} is expanded to (P)(P)(P)...(P)(?:(P)(?:(P))?)? The code generator optimizes these copies one by one, and this process requires a lot of time and an enormous amount of stack space. Where do you use such a pattern? Can't you use a better pattern?

Ahead-of-time (AOT) compilation is not really useful in my experience, because pointers are not known at compile time, and you need to replace them by costly memory reads and writes. If you really prefer AOT, I would suggest other tools, such as lex/flex etc., which generate C code.

Regards, Zoltan

ND nad...@mail.ru wrote:
> Hi, Zoltan! I have started testing JIT and ran into two problems:
>
> 1. Here is a pcretest listing:
> PCRE version 8.33-RC1 2012-12-07
> /((\w{4}aa){4}aa){4}aa){3}aa){4}aa){2}aa){4}aa){3}aa){3}aa){11}aa){3}aa/imsxS+
> Compile time 4. milliseconds
> Study time 4637. milliseconds
> 01234567890123456789012345678901234567890123456789012345678901234567890123456789.
> Execute time 0. milliseconds
> No match
> Study time is 4 seconds!!! Wow! That is enormous! Is there a way to optimize it?
>
> 2. Since JIT study time is so large, why is there no possibility to precompile JIT data? Precompiling may be an excellent way to expand JIT's sphere of application.
> Regards
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT increase stack requirement for SVN 1295
Hi,

it is definitely worth investigating. However, I cannot see it on my 64 bit Linux machine. I have some questions:

- These patterns simply match ab, because of the non-greedy quantifier. That requires a very small amount of memory. Is this intentional? What is your input for these patterns?
- I modified your patterns by putting an 'x' at their end, and matched them against an abb...bbbx subject which contains 1999 'b'-s.
  With a non-greedy quantifier: 16008 bytes of stack are consumed
  With a greedy quantifier: 31992 bytes of stack are consumed
  That is nowhere near the 660K of memory you reported.

Regards, Zoltan

Ralf Junker ralfjun...@gmx.de wrote:
> The SVN 1295 JIT engine on Win32 requires more stack than before to compile patterns with a large number of repeated subpatterns. I did not track which code change exactly is responsible for the increase, but compared to SVN 1239 I had to almost double the maximum stack size from about 66 to 112 in order to prevent an out-of-stack exception for these patterns:
>
> (a)(?2){0,1999}?(b)
> (a)(?(DEFINE)(b))(?2){0,1999}?(?2)
>
> I am not sure if this is a bug or simply a side-effect required by the new JIT features. However, it might be worth noting that the non-JIT compiler requires considerably less stack, so improvement might be possible.
> Ralf
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT increase stack requirement for SVN 1295
Hi My compiler and environment are very different from yours. In particular I do not have alloca(). I replaced the only call to alloca() by char array[SLJIT_MAX_LOCAL_SIZE]; according to your recommendation a long time ago. But this also affects runtime only, not compile time. I am afraid it is not possible to tell where the jump is without narrowing the revision range. Can you precisely measure the stack size? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Announcing backtracking verb support in JIT.
Hi All, all backtracking verbs are now supported by the JIT compiler. It was a considerable amount of work, and many things were changed, so please test the engine. We also had discussions with the Perl devs, which resulted in some changes: there is no priority order between backtracking verbs anymore; instead, we perform the effect of the most recent one (the verb which we backtracked into). Assertions might also ignore these verbs as Perl does (except ACCEPT of course), but we are still discussing it. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Return last bumpalong offset in partial_hard matching
Hi,

> Zoltan, do you have the time and possibility to add support for backtracking verbs in JIT? It would be another great deal!

I don't have much free time nowadays. Which one is the most important for you? I was thinking about them and they are difficult (not surprisingly, they are the last unimplemented features).

Regards, Zoltan
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Return last bumpalong offset in partial_hard matching
Hi, JIT support is added. Regards, Zoltan p...@hermes.cam.ac.uk wrote: I have committed a patch that puts the bumpalong offset into the third element of the offsets vector when the interpreter is used. This applies to both hard and soft partial matching. pcretest has been modified to show this value when offsets[2] != offsets[0]. I have updated the pcrepartial documentation. I expect Zoltán will update JIT to support this as well in due course. At the moment, some partial matching tests don't work with JIT. (I added a nojit option to RunTest to make it easy to avoid running JIT when you know it won't work.) Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Return last bumpalong offset in partial_hard matching
Just to make sure I understand what you are suggesting: instead of returning the earliest character that was inspected, you want it to return the starting point of the last match attempt. Is that right? I presume you then expect to use that offset minus the max lookbehind to discover what characters to keep. Is that right? Just my two cents: we did incompatible changes before such as removing pcre_info, or disallowing 0xD800-0xDFFF range in UTF character sets. After these changes people needed to fix their software (Apache or PHP), and they did it and moved on. We never actually removed a feature, just reworked it. The original behavior was always odd to me, because it depends on the current subject. I feel that is not exactly consistent. After this change the matching code will be less complex (a little faster and more maintainable), and I think the use cases will not even change because ovector[0]-max_lookbehind character must be kept even now. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
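For readers of the archive, the "offset minus the max lookbehind" arithmetic discussed above can be sketched as follows. This is only an illustration: retain_from() is a hypothetical helper, not a PCRE function, and it assumes a non-UTF subject in which characters and offsets coincide. The lookbehind length would typically come from pcre_fullinfo() with PCRE_INFO_MAXLOOKBEHIND.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): given the start offset of the
   last match attempt reported after a partial match and the pattern's
   maximum lookbehind length, return the index of the first character
   that must be kept before the next chunk of input is appended. */
size_t retain_from(size_t attempt_start, size_t max_lookbehind)
{
    /* Clamp at zero when the lookbehind would reach before the buffer. */
    return attempt_start > max_lookbehind ? attempt_start - max_lookbehind : 0;
}
```

Everything before the returned index can be discarded by the caller; the rest is carried over to the next matching call.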
Re: [pcre-dev] JIT and callouts
Hi all, I would like to announce that callouts are working in JIT now, and all patches were landed! The only thing which is not supported is callouts between a conditional block and its condition. At the moment this can only be inserted using auto-callouts, which is probably not the best use case for JIT. Also, optimized capturing brackets are disabled when callouts are used (a little performance loss). Naden, could you try the callout feature? I am really curious whether it gives any performance boost for your application. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] PCRE GPU offload
Hi, yes, you can submit your patches to the exim bugzilla: http://bugs.exim.org/ As a starting point check pcre_exec() which is the primary matcher function. You can also play with pcretest, which is a tester program, which provides command line access to all features of PCRE. For pattern examples you can check the testdata subdirectory. It is also a good idea to check the pcre byte-code. Btw, PCRE has excellent documentation thanks to Philip's effort. Regards, Zoltan Roman Vasilyev ro...@bitzermobile.com írta: I'm new in libpcre, if I want to try OpenCL improvements for it, could you recommend me start point? As well as if I'll have patch for review can I send it here? -Original Message- From: Zoltán Herczeg [mailto:hzmes...@freemail.hu] Sent: Tuesday, February 12, 2013 1:47 PM To: Roman Vasilyev Cc: pcre-dev@exim.org Subject: RE: [pcre-dev] PCRE GPU offload Now I'm completely clear where is the problem. But seems like OpenCL doing it same way as OpenSSL makes AES en/decryption. It uses 16byte blocks for operation, just you know you using full 16 or just part of it. And in case of texture size, you can create smaller texture size and upload your 1MB string stream by blocks, once first entry is done, you can stop uploading. Minimum texture size 1pixel. Is this called texture atlas? Or is that something different? I am not a CL expert :) The other thing we need to worry about is the matching length. Let's say we start matching parallelly from the first 128 starting offsets, but the matching length is not known at compile time. It can be 100 bytes, 1000 bytes, anything. Are you able to do that without uploading the whole input, which is the worst case? E.g: /\p{Any}*/ matches the whole input. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT and callouts
Hi, JIT is designed to fall back to interpreted execution if a pattern is not supported. So you can enable it even if not all patterns will be supported by it. I hope the majority of your patterns are already covered by now. Regards, Zoltan ND nad...@mail.ru wrote: On 2013-02-13 18:44, Zoltán Herczeg wrote: Naden, could you try the callout feature? I am really curious whether it gives any performance boost for your application. Hi, Zoltan! Thank you very much for the great work. But as I wrote earlier, my application heavily uses callouts as well as all of the backtracking verbs. Without backtracking verbs implemented, JIT is still OFF in my PCRE computations. If you can spend another little chunk of your vim to finish full backtracking verb support, then I'll test JIT with really hard patterns and input streams. Best regards. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] PCRE GPU offload
Sorry, understood. So main idea, that OpenCl not fits PCRE let's call it string stream? Yes. If your input is let's say 1 Mbyte, and you search /abc/, and it is found in the first 100 bytes, running the same kernel on the remaining 1 million - 100 starting positions is a waste of time. I am not sure you can stop the kernels after the first match. And uploading the 1Mbyte input, programming the device is costly as well. It is much faster to scan the first 100 byte one-by-one. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
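To make the early-exit argument above concrete, here is a minimal sketch in plain C of a sequential scan that stops at the first occurrence and counts how many starting positions it actually examined. The function name first_match() and the counter are assumptions made for this illustration; it is a plain substring search, not PCRE code.

```c
#include <stddef.h>
#include <string.h>

/* Sequential scan that stops at the first occurrence of `needle`.
   Returns the match offset, or `len` if there is no match.
   `*tried` receives the number of starting positions examined. */
size_t first_match(const char *subject, size_t len,
                   const char *needle, size_t *tried)
{
    size_t nlen = strlen(needle);
    size_t i;

    *tried = 0;
    if (nlen > len)
        return len;
    for (i = 0; i + nlen <= len; i++) {
        (*tried)++;
        if (memcmp(subject + i, needle, nlen) == 0)
            return i;   /* early exit: later positions are never touched */
    }
    return len;
}
```

With a match in the first 100 bytes of a 1 Mbyte subject, `*tried` stays around 100, whereas a one-kernel-per-starting-position approach would schedule work for all remaining positions as well.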
Re: [pcre-dev] PCRE GPU offload
Now I'm completely clear where is the problem. But seems like OpenCL doing it same way as OpenSSL makes AES en/decryption. It uses 16byte blocks for operation, just you know you using full 16 or just part of it. And in case of texture size, you can create smaller texture size and upload your 1MB string stream by blocks, once first entry is done, you can stop uploading. Minimum texture size 1pixel. Is this called texture atlas? Or is that something different? I am not a CL expert :) The other thing we need to worry about is the matching length. Let's say we start matching parallelly from the first 128 starting offsets, but the matching length is not known at compile time. It can be 100 bytes, 1000 bytes, anything. Are you able to do that without uploading the whole input, which is the worst case? E.g: /\p{Any}*/ matches the whole input. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT and callouts
Hi Naden, ok, I will work on this feature. It is not an easy task, so I cannot promise when it will be done. Regards, Zoltan ND nad...@mail.ru wrote: Hi, Zoltan! The question is what to do. 1. Is it worth implementing a restricted callout mechanism (some members are set to an invalid value)? 2. What should we do with the ovector? 3. And a theoretical question: is JIT worth it when we call expensive C functions? 1. I see ovector and capture_top can be filled when the callout starts. Why is there no possibility to set capture_last? 2. Fill it when the callout starts. 3. My point of view is YES. Everybody must be aware that a callout is an expensive mechanism by its nature. The strategy of applying callouts includes understanding that they must be used only in a pinch. The JIT expense of preparing to start a callout is negligible compared with the callout action itself. But callouts are used not only when speed is unimportant. Their use is still effective when the JIT engine reaches them rarely while processing the rest of a hard pattern. In any case the effectiveness or ineffectiveness of a pattern depends much more on user skill. My application executes HUGE PCRE computations. But callouts are located in places that the interpreter passes very rarely. And I think use of JIT will necessarily bring a great advantage. Thanx. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Caching and PCRE
Hi, no. And it shouldn't. The effectiveness of caching depends on the use case, so we should let applications decide for themselves. Regardless, caching is good. We wrote a short research paper about it a long time ago: http://www.inf.u-szeged.hu/~akiss/pub/pdf/hodovan_regex.pdf Regards, Zoltan Ze'ev Atlas zatl...@yahoo.com wrote: Hi All I ask to make sure that I am not missing something! Does PCRE provide caching for the compiled (or original) Regular Expression or does it assume that the user would provide such functionality. I did not find anything in the man pages except for a note about caching C compiler usage at build time, which has nothing to do with my question. I need to know for sure before I consider developing such functionality myself. Ze'ev Atlas -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
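Since caching is left to the application, a rough sketch of what an application-side cache might look like follows. Everything here is an assumption for illustration: compiled_t and fake_compile() stand in for the pointer returned by pcre_compile(), the slot count and round-robin eviction are arbitrary choices, and the code is not thread safe.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8      /* hypothetical size; tune per application */
#define MAX_PATTERN 128    /* longer patterns are simply not cached */

/* Stand-in for a compiled pattern; a real cache would store the pointer
   returned by pcre_compile() and free it on eviction. */
typedef struct { int id; } compiled_t;

typedef struct {
    char pattern[MAX_PATTERN];
    compiled_t code;
    int used;
} cache_entry;

static cache_entry cache[CACHE_SLOTS];
static int compile_calls = 0;   /* counts cache misses, for demonstration */
static int next_victim = 0;

/* Placeholder for the expensive compile step. */
static compiled_t fake_compile(const char *pattern)
{
    compiled_t c;
    (void)pattern;
    c.id = ++compile_calls;
    return c;
}

/* Return the cached compiled form, compiling only on a miss.
   Returns NULL if the pattern is too long to be cached. */
const compiled_t *cache_get(const char *pattern)
{
    int i;

    if (strlen(pattern) >= MAX_PATTERN)
        return NULL;
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && strcmp(cache[i].pattern, pattern) == 0)
            return &cache[i].code;              /* hit */

    i = next_victim;                            /* miss: round-robin eviction */
    next_victim = (next_victim + 1) % CACHE_SLOTS;
    strcpy(cache[i].pattern, pattern);          /* fits: length checked above */
    cache[i].code = fake_compile(pattern);
    cache[i].used = 1;
    return &cache[i].code;
}

int cache_compile_count(void) { return compile_calls; }
```

Calling cache_get() twice with the same pattern triggers a single compile. Whether such a cache pays off depends on how often patterns repeat, which is the reason the decision is left to the application.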
Re: [pcre-dev] Fix for buffer read after the end
The second one. The bugs are in new code, and were released with 8.32. Release 8.31 is not affected. They are now fixed in trunk. Actually JIT in 8.32 contains several new optimizations, so it is much faster than 8.31, but it seems new bugs were introduced as well. Regards, Zoltan Giuseppe D'Angelo dange...@gmail.com wrote: Hi, On 26 January 2013 23:18, Zoltán Herczeg hzmes...@freemail.hu wrote: Hi, This is a heads-up about two recently landed fixes for those who maintain a binary pcre library. These are critical fixes, but easy to backport: Patch: https://lists.exim.org/lurker/message/20130126.175148.60d4ca3c.en.html Effect: input string might be read after the end. Maximum of 4 bytes. Affects: JIT in 16 and 32 bit mode Introduced: PCRE 8.32 Patch: https://lists.exim.org/lurker/message/20130118.082046.fcbace28.en.html Effect: no matches are reported when there is a match Affects: JIT when LINK_SIZE is not 2. Introduced: PCRE 8.32 Thank you for this message. When you say Introduced: PCRE 8.32, do you mean that the issue has been fixed in 8.32 or that the issue has been introduced with it (and fixed in current svn trunk)? Thanks, -- Giuseppe D'Angelo -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Fix for buffer read after the end
Hi, This is a heads-up about two recently landed fixes for those who maintain a binary pcre library. These are critical fixes, but easy to backport: Patch: https://lists.exim.org/lurker/message/20130126.175148.60d4ca3c.en.html Effect: input string might be read after the end. Maximum of 4 bytes. Affects: JIT in 16 and 32 bit mode Introduced: PCRE 8.32 Patch: https://lists.exim.org/lurker/message/20130118.082046.fcbace28.en.html Effect: no matches are reported when there is a match Affects: JIT when LINK_SIZE is not 2. Introduced: PCRE 8.32 Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Static linking with pcre16 library
Hi, I know this issue. PCRE_STATIC was defined when you built the library, but not when you used it. Regards, Zoltan Алексей Павлов alex...@gmail.com írta: Hi everybody! I have builded pcre-8.32 as static with mingw-w64. When I building Qt with system-pcre I got errors: c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:928: undefined reference to `_imp__pcre16_free' c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:929: undefined reference to `_imp__pcre16_free_study' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `ZN25QRegularExpressionPrivate14compilePatternEv': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:956: undefined reference to `_imp__pcre16_compile2' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `ZN25QRegularExpressionPrivate14getPatternInfoEv': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:975: undefined reference to `_imp__pcre16_fullinfo' c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:979: undefined reference to `_imp__pcre16_fullinfo' c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:985: undefined reference to `_imp__pcre16_config' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `ZN25QRegularExpressionPrivate15optimizePatternEv': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1106: undefined reference to `_imp__pcre16_study' c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1109: undefined reference to `_imp__pcre16_assign_jit_stack' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In 
function `ZNK25QRegularExpressionPrivate19captureIndexForNameERK7QString': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1130: undefined reference to `_imp__pcre16_get_stringnumber' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `pcre16SafeExec': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1150: undefined reference to `_imp__pcre16_exec' c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1157: undefined reference to `_imp__pcre16_exec' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `ZN20QPcreJitStackPointerC1Ev': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1026: undefined reference to `_imp__pcre16_jit_stack_alloc' C:/QtSDK/Qt-builds/work/build-x32-s/qt-5.0.0/qtbase/lib\libQt5Cored.a(qregularexpression.o): In function `ZN20QPcreJitStackPointerD1Ev': c:\QtSDK\Qt-builds\work\build-x32-s\qt-5.0.0\qtbase\src\corelib/tools/qregularexpression.cpp:1034: undefined reference to `_imp__pcre16_jit_stack_free' collect2.exe: error: ld returned 1 exit status Is anybody know how can I solve this? -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
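The `_imp__pcre16_*` names in the linker errors above are what a Windows compiler asks for when the declarations are marked as DLL imports. Below is a simplified imitation of the kind of switch pcre.h applies, for illustration only: the DEMO_ names are hypothetical, and in a real build PCRE_STATIC would normally be passed on the compiler command line (for example as -DPCRE_STATIC) when compiling the code that includes pcre.h, rather than being defined in the source.

```c
/* Normally supplied by the build system of the *application*, not only
   when building the PCRE library itself. */
#define PCRE_STATIC 1

#if defined(_WIN32) && !defined(PCRE_STATIC)
   /* DLL import: the linker will look for _imp__ prefixed symbols. */
#  define DEMO_EXP_DECL extern __declspec(dllimport)
#  define DEMO_USES_DLLIMPORT 1
#else
   /* Static archive (or non-Windows): plain external symbols. */
#  define DEMO_EXP_DECL extern
#  define DEMO_USES_DLLIMPORT 0
#endif

DEMO_EXP_DECL int demo_uses_dllimport(void);

int demo_uses_dllimport(void) { return DEMO_USES_DLLIMPORT; }
```

With PCRE_STATIC defined, the declarations resolve against the plain symbols in the static library, which is what the reply above suggests was missing when Qt was compiled.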
Re: [pcre-dev] Is pcre_fullinfo thread-safe?
Hi Giuseppe, I am interested in Philip's opinion, but in general I think PCRE should always be fully thread safe. Regards, Zoltan Giuseppe D'Angelo dange...@gmail.com wrote: I know it is already :-) By looking at its source code, it just extracts information from the compiled pattern and the extra data. However, the man page simply talks about matching from multiple threads and makes no mention of the other, ancillary PCRE functions. Can one always assume they're thread safe? Thanks, -- Giuseppe D'Angelo -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT failure?
Thanks Naden for finding this sophisticated bug. The fix has landed. I was thinking about security as well, whether this bug can be exploited in any way. Fortunately it is not possible to cause a buffer overflow or any other crash; only some patterns do not match when they should. Regards, Zoltan ND nad...@mail.ru wrote: Hi! Here is a pcretest.exe listing: PCRE version 8.33-RC1 2012-12-07 /^12345678abcd/imsxS+ 12345678abcd No match But match is expected. Thanx. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT failure?
Hi, thanks for the report! After some investigation I realized that link_size=3 is required for this bug to appear. It was a typo in fast_forward_first_n_chars(...): -pcre_uchar *cc = common->start + 1 + IMM2_SIZE; +pcre_uchar *cc = common->start + 1 + LINK_SIZE; Philip, do we have tests for link size other than 2? Or shall I just put it into testinput12? Regards, Zoltan ND nad...@mail.ru wrote: Which platform and toolchain are you using? Under Linux x86-64 I get a successful match: Windows 7 64 bit. PCRE version 8.33-RC1 2012-12-07 Compiled with 8-bit support UTF-8 support Unicode properties support Just-in-time compiler support: x86 32bit (little endian + unaligned) Newline sequence is LF \R matches all Unicode newlines Internal link size = 3 POSIX malloc threshold = 10 Default match limit = 1000 Default recursion depth limit = 1000 Match recursion uses stack -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] pcre_compile.c: error_texts
Each user gets its own writable data section; read-only data sections are instead shared between the users. I don't think anything would work without a private .data and .bss section. As far as I remember the point of static libraries is that they use position independent code, so they can be mapped anywhere, not just from a fixed starting offset. Otherwise they are the same as anything else. But again, this requires some sort of preprocessing in order to deal with the XSTRING(...), which are fixed at configure time. And if we accept to do the preprocessing, then I think we can statically build the array of the offsets. Well, we can generate such files at compile time, similar to unicode data, default char tables, etc. Lots of tables are already generated. One more probably doesn't matter. The compiler can simply include this table. We could also include a default copy (another .dist file) in the source. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] pcre_compile.c: error_texts
Hi Arpe, thanks for your suggestion! Just my two cents: the reason for the search is that it shouldn't be a bottleneck. I suspect it takes less than 1% of the total compilation time, although I never really measured it. I think before we jump to coding, we should measure it somehow. I wouldn't worry about the thread safety of your static array. If you write the same values to the memory from different threads, that is safe. Therefore, if for some reason the array were initialized multiple times, that should be thread safe (as long as you write exactly the same values to the array again). Alternatively, we can simply use something like: static char* error_texts[] = { text1, text2, ... }; error_texts[i] would return the text corresponding to the error code. Regards, Zoltan Kevin Connor Arpe kevina...@gmail.com wrote: Hello, This is my first mail to this list. I am doing some hacking on GLib (part of GNOME), specifically their GRegex module which is built on top of this (incredible) library. I was digging into pcre_compile.c (here: http://vcs.pcre.org/viewvc/code/trunk/pcre_compile.c?revision=1233&view=markup) to understand ownership of 'errorptr'. I know now that ownership is not transferred to the caller. I found this line in pcre_compile2 (): *errorptr = find_error_text(errorcode); Following the trail, I found this comment: /* The texts of compile-time error messages. These are char * because they are passed to the outside world. Do not ever re-use any error number, because they are documented. Always add a new error instead. Messages marked DEAD below are no longer used. This used to be a table of strings, but in order to reduce the number of relocations needed when a shared library is loaded dynamically, it is now one long string. We cannot use a table of offsets, because the lengths of inserts such as XSTRING(MAX_NAME_SIZE) are not known. 
Instead, we simply count through to the one we want - this isn't a performance issue because these strings are used only when there is a compilation error. Each substring ends with \0 to insert a null character. This includes the final substring, so that the whole string ends with \0\0, which can be detected when counting through. */ I have no doubt that this design exists for very good reasons, as explained in the comment. (I had never considered relocation impact from a huge list of char pointers.) Regarding the note about performance issue, I have an idea to improve. What if we create a (static) list of offsets that is only initialised at first compile error? Something like: static int error_text_offsets[78] = { -1 }; I am willing to write the code and submit a patch. However, since I am new to this project, I don't know enough about its style (example: cathedral vs. bazaar?). Also, I wanted to have my idea considered before writing code. If you think this idea is worthy, there are two areas of concern I can think of: Sizing error_text_offsets = I am unsure if error_text_offsets should be sized (1) static/precisely, (2) static/liberally, or (3) dynamically ((3a) precisely or (3b) liberally). (1) static/precisely: we need to keep error_text_offsets precisely sized to error_texts. Today, it looks to be 78. This uses the least amount of memory, but incurs higher risk/maintenance costs. I'm all about low-maintenance (future proofing) where possible/reasonable. This would probably require a test to be written to ensure the precise size is correct. Imagine a scenario where error_texts has an addition, but error_text_offsets does not grow. I have not yet looked if PCRE includes testing as part of its release procedure. I assume yes. (2) static/liberally: something big, but not crazy, like 255. ex: static int error_text_offsets[255] If error_texts grows over time, we have space. Low maintenance, but takes a bit more memory. 
(3) dynamically: static int *error_text_offsets = NULL allocate on first use via malloc() or PRCE's built-in/override method Again: do we allocate precisely or liberally? (a) Precisely can be done via static constant or at runtime (counting the number of embedded strings). (b) Liberally: Allocate something big enough, like 255, to prevent double scan/two-pass over error_texts. First pass: Calculate size of offsets array. Second pass: Calculate and store offsets. Multithreading == Finally, what about multithreading (MT)? Imagine the scenario where two parallel threads call pcre_compile2 () independently. Both their compiles fail. The init routine for error_text_offsets needs to be MT-aware. Generally, (in C) I don't know how to properly handle that issue when initialising this list the first time. (I know how to do it in Java/C#, but there you get a huge framework library with every install. C is bit trickier due to cross platform issues -- pthreads, etc.) Plus, I know nothing about how PRCE
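For context, the "count through one long string" scheme described in the quoted source comment can be sketched as below. This is a minimal illustration under stated assumptions: the table contents and the demo_ names are made up, and the real pcre_compile.c may differ in detail (for instance in what it returns for an unknown number).

```c
#include <stddef.h>

/* Illustrative table only: these are not PCRE's actual messages.
   Each message ends with an explicit \0; the compiler appends one more
   NUL to the literal, so the table ends with two NUL bytes. */
static const char demo_error_texts[] =
    "no error\0"
    "first demo message\0"
    "second demo message\0";

/* Count through the NUL-separated table to reach message n.
   Returns NULL when n is negative or past the end of the table. */
const char *demo_find_error_text(int n)
{
    const char *s = demo_error_texts;

    if (n < 0)
        return NULL;
    for (; n > 0; n--) {
        while (*s++ != '\0') { }   /* skip one message and its NUL */
        if (*s == '\0')
            return NULL;           /* reached the \0\0 terminator */
    }
    return s;
}
```

The lookup cost grows with the error number, but the single string needs no per-entry pointer relocations, which is the trade-off the comment describes; the lookup also only runs when a compile has already failed.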
Re: [pcre-dev] [PATCH] Quash deprecation warnings on Windows
Hi, do I understand correctly that no JIT is involved here? You tell me :-) I certainly built PCRE with JIT support... pcretest runs tests without JIT by default. -s+ needs to be passed to pcretest, or s+ to a specific test, to enable JIT. The bad pattern has the s flag, but no plus, so it is only studied. I linked it with /STACK:1000; should that be enough? I don't know that there's anything else I can do to increase the available stack space. Since there is no crash, this is likely not a stack related issue then. But if I redirect the output to a file, or even just pipe it into cat, all I get is the first line. Both behaviors occur consistently. I suspect the std library of MSVC has some limitation here. Even in the console mode, the results are still missing: 0: seite\x0adokumenteninformation\x0aseitentitel... 1: seite 2: \x0a 3: seite I didn't see such a thing in my simple MSVC environment. Perhaps you could catch the printf calls in a debugger and check which one is not working. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] Quash deprecation warnings on Windows
Hi, do I understand correctly that no JIT is involved here? The output stops after: ~(\w+)/?(.)*/(\1)~smgI The test itself requires a huge stack. (Did you enlarge the stack area for pcretest?) Could you try this test manually: pcretest.exe -q mytest The mytest file should only contain this particular test (lines 1426 and 1427 from testinput2). Regards, Zoltan Daniel Richard G. o...@teragram.com wrote: On Sun, 9 Dec 2012, Zoltán Herczeg wrote: Daniel, please test the current code, and let me know if any other issues remain. Alas, still getting some test failures from r1235 on 64-bit Windows with icl: Test 2: API, errors, internals, and non-Perl stuff failed executing command-line: ...\pcre-build-debug\pcretest.exe -q ...\pcre-8.33-RC1\testdata\testinput2 testout8\testoutput2 Test 2: Test with Study Override failed executing command-line: ...\pcre-build-debug\pcretest.exe -q -s ...\pcre-8.33-RC1\testdata\testinput2 testoutstudy8\testoutput2 (likewise with the 16- and 32-bit tests) All the testout{,study}{8,16,32}/testoutput2 files are identical; I've attached one of them. --Daniel -- Daniel Richard G. || dani...@teragram.com || Software Developer Teragram Linguistic Technologies (a division of SAS) http://www.teragram.com/ -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] opencl
No strong opinion. I think the use case matters. Your example is a simple implementation of posix regex. PCRE is much more complex, but if you don't need its features, this simple code can do as well. PCRE searches for the first occurrence of a pattern, and returns a lot of info like the position of capturing brackets. OpenCL is designed to run the same kernel on a large number of inputs. So if your use case is finding all occurrences of a pattern (including intersecting ones), OpenCL might be useful again, since you can run the same kernel from all starting positions at the same time. However, this could waste a lot of power if you only need the first one. Another question is the effectiveness of the code, since the OpenCL compiler is a lightweight compiler, which prefers compilation speed over compiler optimizations. Btw you don't need to copy the input if you use a system which supports host pointers like many embedded systems. Regards, Zoltan james jones james.voip+ker...@gmail.com wrote: What do you think about this approach? On Wed, Dec 5, 2012 at 5:34 PM, james jones james.voip+ker...@gmail.com wrote: I have been able to port and compile the one below but I am having a little trouble with getting strings from ram to the memory on the graphics. Getting a lot of memory access errors right before a seg fault. I am just doing it wrong and need to make sure I am copying the array correctly. I have not looked at it in about two weeks. Will pick it up again over the weekend. http://www.cse.yorku.ca/~oz/regex.bun -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] Quash deprecation warnings on Windows
Thank you, Philip. Daniel, please test the current code, and let me know if any other issues remain. Regards, Zoltan Philip Hazel p...@hermes.cam.ac.uk wrote: On Thu, 6 Dec 2012, Daniel Richard G. wrote: On Thu, 6 Dec 2012, Graycode wrote: Yet the CMakeLists.txt file currently contains: IF(MSVC) ADD_DEFINITIONS(-D_CRT_SECURE_NO_DEPRECATE) ENDIF(MSVC) I'm not a Cmake user, but that seems a more appropriate place for settings that are unique to a particular compiler brand and version. Ah, good catch! I missed that bit. New patch attached. I have applied and committed this patch. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] Quash deprecation warnings on Windows
Hi, I landed a somewhat bigger patch, which contains all changes you reported in the last few weeks. Thanks for porting and testing! I saw there are some conversion related warnings as well, let me know if you have any fixes for them. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] opencl
Hi, I don't think so. In WebKit we did some OpenCL optimizations, but I am not sure the general concept of a backtracking, NFA based engine is suitable for OpenCL. At least the whole engine is surely way too big. Perhaps something like the JIT compiler could work, just generating C like code for OpenCL. But uploading the input could be too expensive, and searching for only the first match is difficult, since I am not sure that other running kernels could be stopped when the first occurrence is found. Perhaps a DFA based engine is better for such a purpose, but the first occurrence issues are still similar. OpenCL is a powerful tool, but I am not sure this is an ideal use case for it. Regards, Zoltan james jones james.voip+ker...@gmail.com wrote: Has anyone gotten PCRE ported for OpenCL or Cuda? -James -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] PCRE 8.32 is released
And the JIT compiler is improved a lot (based on the data revealed by profiling). Many patterns run 20-40% faster. Regards, Zoltan Philip Hazel p...@hermes.cam.ac.uk wrote: I have just put the 8.32 release onto the FTP site: ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.32.tar.gz ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.32.tar.bz2 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.32.zip The main changes in this release are new support for 32-bit character strings and UTF-32, and improved Unicode support for \X and characters that have more than one other case. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] JIT support with Intel compiler, older Solaris
It would be easy enough to test---just disable the cpp case that uses __cpuid(). I cannot. GCC supports AT&T syntax only. Visual C does not support any inline assembly in 64 bit mode: http://msdn.microsoft.com/en-us/library/wbk4z78b.aspx (And I can confirm that is true). However, we might be able to test it with your ICC compiler, since it works on Win64 and supports inline assembly. Attached is the C source file I used, and an assembly file generated by icc -g -no-gcc -S. This is clearly cdecl. I also compiled it with GCC on Linux, and it generates 3 different types of assembly code depending on the calling convention. I will send it privately to you. (Perhaps this whole chat could be private, I am not sure many people are interested. If someone wants to join this discussion about low level stuff, just send me an email.) Intel really says that cdecl and stdcall only affect Windows. But this is not true. http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/fortran/win/lref_for/source_files/rfattcst.htm Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] JIT support with Intel compiler, older Solaris
Hi, Well, our audience here isn't just people following the list today, but also people Googling for this information years from now. Hello readers in 2017! :-) still feels out of scope. SourceForge provides some kind of forum, perhaps we could use that: http://sourceforge.net/projects/sljit/forums Btw, I recently updated the comparison of various regex engines: http://sljit.sourceforge.net/regex_perf.html Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] JIT support with Intel compiler, older Solaris
Hi Daniel, Only problem is, it's not clear how the assembly should refer to local variables (from and to). I was reviewing this article... I may misunderstand something here, but I thought you were targeting sparc, not x86. Why not provide an Intel-syntax equivalent of the AT&T assembly instead of an error? There may be other 64-bit compilers that aren't handled by the above cases. It would be a speculative fix, which may not work. Anyway, I still get segfaults on 32-bit (x86) Linux with the Intel compiler whether I use stdcall or cdecl (but not as badly as fastcall). Would it help if I provided you with built code from this compiler, so you can see exactly what it's putting out? Perhaps some functions can reveal how it works. int test(int a, int b, char* c) { return a + b + strlen(c); } Could you try this with various calling conventions and send me the assembly? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Version 8.32-RC1 (release candidate) available for testing
Hi Petr, thanks for the measurements. All warnings are caused by a single macro on PPC64: #define SLJIT_FUNC_OFFSET(func_name) ((sljit_sw)*(void**)func_name) Does anyone know how to fix this? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] JIT support with Intel compiler, older Solaris
Hi Daniel, thanks for the patch! Would it be a problem if I landed it after the next PCRE release? We are focusing on fixes now, and this is kind of a new feature. * Added defined(__INTEL_COMPILER) to the appropriate conditionals * Moved the 64-bit _MSC_VER case up so that this is used in preference on Windows (note that icl does define _MSC_VER on Windows, set to the appropriate SDK version) +#if defined(_MSC_VER) && _MSC_VER >= 1400 What happens if _MSC_VER < 1400 (btw that is a LARGE version number :P)? Is there any 64-bit x86 support before 1400? -#define SLJIT_CALL __stdcall Your patch removes __stdcall, which I think is necessary. Some x86/32 compilers use exotic ABIs, and this define helps to detect them. Btw, does the Intel C compiler work with its ABI type defined? The other modifications seem ok. About the SPARC: I think you can just add an ifdef around the cache flush instruction, which is accepted by your compiler. (I hope you tried to compile it in 32-bit mode, since sparc64 is not yet supported.) Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] Fixes for Sun cc + SLJIT, cpp compatibility
Hi, thanks for the fixes. They have all landed now (although in multiple fragments). And thanks for porting sljit to a large number of exotic systems. Regards, Zoltan Daniel Richard G. o...@teragram.com wrote: Building PCRE with JIT support on a Solaris x86-64 system with the vendor compiler gave me /path/to/pcre-r1213/sljit/sljitNativeX86_common.c, line 325: #error: SLJIT_DETECT_SSE2 is not implemented for this C compiler cc: acomp failed for /path/to/pcre-r1213/pcre_jit_compile.c In fixing that, I also noticed a number of indented cpp directives in the source, e.g. #else #error This is an error message #endif Officially, ANSI C supports this, but I've run into older (yet ANSI-capable) compilers that choke on the whitespace before the # mark. This goes for gcc -traditional, too. My patch, in addition to fixing the aforementioned Solaris build issue, changes these directives to the more compatible form #else # error This is an error message #endif which is already used elsewhere in the PCRE source. --Daniel -- Daniel Richard G. || dani...@teragram.com || Software Developer Teragram Linguistic Technologies (a division of SAS) http://www.teragram.com/ -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
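Because the archive flattens whitespace, the two directive styles discussed above look almost identical. The following is a small sketch of the "compatible" form: the # stays in the first column and the nesting is shown by spaces after it. The macro and function names (DEMO_FORCE_ERROR, DEMO_DIRECTIVE_STYLE_OK, demo_directive_style_ok) are invented for this example and do not come from the PCRE source.

```c
/* The '#' is always in column 1; indentation goes between '#' and the
   directive name, which older preprocessors are reported to accept. */
#if defined(DEMO_FORCE_ERROR)
#  error "This is an error message"
#else
#  define DEMO_DIRECTIVE_STYLE_OK 1
#endif

/* Returns 1 when the #else branch above was taken. */
int demo_directive_style_ok(void)
{
    return DEMO_DIRECTIVE_STYLE_OK;
}
```

The less portable form would put the whitespace in front of the # instead, which is the layout the patch moves away from.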
Re: [pcre-dev] [Bug 1295] add 32-bit library
Hi, I am still a little lost about this masking feature. I know why we need it in compile. But why do we need it in exec? I know that if you read a character which is > 0x10ffff, and read its UCD value (e.g. when matching a Unicode property), you get a crash regardless of masking. But that is ok, since the input must be a valid UTF stream in UTF mode (performance vs. safety: we prefer the first). I know there are other engines which prefer the second, but you have to pay its price. Honestly, I would never use PCRE in a security-critical environment. The code is in really good shape, but it is too complex. In WebKit, we use sandboxing, and we don't care whether WebKit itself is safe or not (the second is more likely, it is just too big). Tom, you could use that approach as well. So my question is, do we really need masking in exec? Regards, Zoltan Philip Hazel p...@hermes.cam.ac.uk wrote: On Sun, 28 Oct 2012, Tom Bishop, Wenlin Institute wrote: A naive PCRE user only wants to know whether a file begins with a particular character sequence, for example, #!/bin/bash. Not caring whether the file is valid UTF-32 and not having read the documentation very carefully, this programmer uses the flag PCRE_NO_UTF32_CHECK so that the program will run faster (or maybe just having copy-pasted it from somewhere). PCRE says the file matches #!/bin/bash, so the program executes the file as a bash script, causing a nuclear power plant to explode. With respect, I think this is a bit drastic. Anybody writing a program where the consequences of failure are so catastrophic *should* care whether the file is valid UTF, should read the documentation, and shouldn't just copy-paste. I know people are stupid. Are they really *that* stupid? Should we not implement PCRE_NO_UTFx_CHECK at all? Using it incorrectly can cause crashes and problems in all modes. I am happy to beef up the warnings in the docs. Do any of you happen to be on the mailing list for libcurl? A recent discussion is relevant. 
The subject line is The Most Dangerous Code in the World. Due to widespread misunderstanding of the API, many programs using libcurl have made this error: setting CURLOPT_SSL_VERIFYHOST to TRUE, will result in the SSL connection being insecure against a man-in-the-middle attacker. Sounds harmless, right? The word insecure doesn't sound harmless to me! Given an option named CURLOPT_SSL_VERIFYHOST, wouldn't TRUE be better than FALSE? In fact it's supposed to be a three valued option, not boolean, and the value 1 is dangerous. It is regretful that the C language does not have proper boolean values/variables, but instead subverts ints. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Major JIT compiler update in PCRE with Sparc 32 support
-Running in 16-bit mode but pattern was compiled in 32-bit mode +Running in 16-bit mode but pattern was compiled in 0-bit mode I fixed this issue. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Major JIT compiler update in PCRE with Sparc 32 support
Hi, I just landed a major JIT compiler update in PCRE. The new code mostly contains refactoring and bug fixes in the MIPS and PowerPC ports. Experimental Sparc 32 support has also been added. I noticed that all big-endian systems complain that: -Running in 16-bit mode but pattern was compiled in 32-bit mode +Running in 16-bit mode but pattern was compiled in 0-bit mode (These are binary tests as far as I remember.) Other than that, all tests pass. I will check this issue in due course as well. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Match for the first 100 charaters of a string
Hi, I am not sure I totally understand your mail. You have a string whose length is 1G, and you want to display the first 1000 characters and replace literals (characters?) with another symbol, is that correct? Why do you need PCRE for this purpose? A simple loop which checks every character could do that, and it would be much faster. Or you can simply pass a length of 1000 to stop PCRE after the 1000th character. Regards, Zoltan Sandhya Sriraj sandhyar1...@gmail.com wrote: Hi, I want to display first 1000 characters of a string (literals are replaced by special symbol). I am using pcre library to replace the literal. After replacing every literal I am checking for the length of the string and if it is 1000 then stop matching and display the string. My problem is, Suppose I am sending a string with length 1GB, and if there is no literal in that string, pcre will check for the entire string. I want to check the match for first 1000 characters. Is there any way to do this? Regards, Sandhya Sriraj -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
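The "pass a length of 1000" suggestion can be sketched as a tiny helper that caps the subject length before it is handed to the matcher. The helper name clamp_subject_length is hypothetical and the sketch deliberately does not call PCRE itself; the returned value is what one would pass as the length argument of pcre_exec().

```c
#include <stddef.h>

/* Hypothetical helper: how many code units of the subject the matcher
   should be allowed to see. The matcher never looks past this length,
   so a 1 GB subject costs no more than a 1000-unit one. */
size_t clamp_subject_length(size_t subject_length, size_t limit)
{
    return subject_length < limit ? subject_length : limit;
}
```

One caveat, stated as an assumption about the use case: in UTF-8 mode the limit counts bytes, not characters, so cutting at an arbitrary byte could split a multi-byte character; the cut point would then need to be moved back to a character boundary.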
Re: [pcre-dev] [PATCH] RunTest.bat, pcre_test.bat and CMake
I have a cmake + Windows box, so I will check this patch in due course. Regards, Zoltan Daniel Richard G. o...@teragram.com wrote: I was building PCRE on Windows, but the pcre_test_bat test was not working for me. There were a few issues with how RunTest.bat is invoked via CMake, and how the batch file makes use of quoting; the attached patch addresses these: ++ CMakeLists.txt * Added quoting to When testing is complete... message for better clarity when PROJECT_BINARY_DIR contains spaces * MESSAGE() does not print a blank line, but MESSAGE( ) does * Reworked the generation of pcre_test.txt (now pcre_test.bat) to handle path quoting correctly (Windows convention is to use quoting only with literal path values, not with variable dereferences), and use %CMAKE_CONFIG_TYPE% to play well with IDEs that build different configurations of pcretest.exe * Got rid of BatDriver.cmake, as the extra indirection should no longer be needed ++ RunTest.bat * Fixed quoting I proofed this using Visual Studio IDE and NMake-makefile CMake generators, with spaces in both the source and binary directory paths. --Daniel -- Daniel Richard G. || dani...@teragram.com || Software Developer Teragram Linguistic Technologies (a division of SAS) http://www.teragram.com/ -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Security risk or not? Changing PCRE options from patterns.
Hi, PCRE has a nice feature: you can change options by passing special control strings in the pattern. E.g. /(*UTF8)a/ makes the pattern a UTF-8 pattern. I am sure most people are not aware of this feature. Its side effect can be used for denial-of-service attacks, since the UTF validity checks are not affected by recursion limit checks. So the pattern above can slow down a web service which runs patterns on ASCII input where the input buffer is huge. My problem is that these flag changes cannot be prevented by software, and I think most developers are unaware of them (since this is just an extension). I know it is useful in certain cases, but I feel it may be exploited by harmful software. I do not have any solution for this issue at the moment; I am just curious what you think. Is this a real risk or not? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] A native pcre exec for JIT
Hi, Also IMHO for *new* API we shouldn't continue the problems of the old APIs; that means we should use size_t for the length, start_offset and offsetcount parameters and the offsets themselves. (If the current code can't cope, just return an error if length > INT_MAX, but then we can fix that without changing API.) Also, maybe options should be unsigned (it's flags, right?). I think this would be a major change, which should only happen if the whole PCRE API moved to the new form. Probably uint32_t would be the best for flags. We have been thinking about a complete redesign of the API for some time, and perhaps we should note these requirements as well. Perhaps we should introduce a pcre2.h at some point, together with some conversion functions which translate the arguments from the old format. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
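The "return an error if length > INT_MAX" idea could look roughly like the following conversion shim, which is the kind of helper a size_t-based front end might use before calling the existing int-based code. The names demo_length_to_int and DEMO_ERROR_BADLENGTH are invented for this sketch and are not part of any PCRE API.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical error code; PCRE-style APIs report errors as negative ints. */
#define DEMO_ERROR_BADLENGTH (-1)

/* Convert a size_t length to the int expected by an older interface.
   Returns the length as a non-negative int, or DEMO_ERROR_BADLENGTH
   when the value does not fit. */
int demo_length_to_int(size_t length)
{
    if (length > (size_t)INT_MAX)
        return DEMO_ERROR_BADLENGTH;
    return (int)length;
}
```

With this shape the public signature can already use size_t, and the INT_MAX restriction can be lifted later without another API change, which is the point made in the quoted suggestion.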
Re: [pcre-dev] A native pcre exec for JIT
But out of curiosity, how come that pcre_exec can't just do what this function is supposed to do when it has a JIT-compiled pattern, UTF checks are not requested, etc.? The point of the new API is not inventing something which was not possible before. Instead, it increases the performance of the most common case of performance hungry applications in release mode (by eliminating debug checks and some function calls). Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] A native pcre exec for JIT
Dear devs, I have been thinking for some time that the point of JIT is to offer outstanding pattern matching performance (and it has improved a lot lately), yet it is still used through pcre_exec. So I am planning to add a new interface for JIT only: int pcre[16]_jit_exec(const pcre_extra *extra_data, PCRE_SPTR subject, int length, int start_offset, int options, int *offsets, int offsetcount, pcre_jit_stack *stack) Basically it is the same as pcre[16]_exec, except that re is removed and a JIT stack argument is added. The interface of JIT is stable now, and I don't think it will change much in the future. The purpose of this function is to offer faster execution by skipping checks. I.e. it does not check that the input is valid UTF, that the pointers are non-NULL, or that JIT compilation was successful. Since the JIT stack is passed directly (and is a mandatory argument!), the JIT callback is not used either (much better for multithreaded software). This is suitable for software which requires high-performance matching and where all arguments are known to be valid because they were checked before. These checks are mainly for catching software bugs, and such debug features are not needed for production-ready, high-performance software. Other than skipping checks, it operates in the same way as pcre_exec. The new interface requires about 33% less time than pcre_exec. What do you think? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [Bug 1295] add 32-bit library
Hi, this still looks quite theoretical to me. However, if you come up with a patch which has negligible performance overhead, I am willing to review it. Regards, Zoltan Tom Bishop, Wenlin Institute tan...@wenlin.com wrote: On Sep 17, 2012, at 5:09 PM, Christian Persch (GNOME) c...@gnome.org wrote: It's absolutely certain that there will never be unicode characters > 10FFFF, so there's no forward compatibility problem. A few years ago it was absolutely certain there would never be Unicode characters > U+FFFF. As a result, a lot of supposedly Unicode-based software still in widespread use fails for characters outside the Basic Multilingual Plane. Should we learn from such mistakes, or repeat them? Now you seem to want some sort of UCS-4 mode that would allow any characters from the 31-bit range (up to 7fffffff) of UCS-4 ? I don't see how that would be useful; for example, which properties would those characters beyond the UTF-32 range have ? By default, the same properties as for unassigned code points less than U+110000. Especially relevant to this discussion, an essential property for each character is that it shouldn't be matched with some other character without a valid reason. One application of code points beyond U+10FFFF is for extended private use. Properties for all unassigned characters could be specified by the same protocols as for ordinary private-use characters. It should be possible to specify custom properties for each character, including those in the current private-use ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. For example, depending on the application, people may want to treat some private-use characters as letters, numbers, whitespace, or combining marks. (This is an ability PCRE really should have anyway.) (And if an actual use case for that UCS-4 mode ever arises, we can just add it at that point as a _new_ flag/mode.) 
It might be best to design the API and add a few lines of code now, while all the authors are alive, and before assumptions about PCRE have been hard-coded into applications that depend on it. Three possible behaviors are under consideration, when a 32-bit string contains a code unit > 0x0010FFFF: (1) trigger an error for invalid UTF-32; (2) mask it with 0x001FFFFF; or (3) treat it as a character in its own right. I think I understand that (1) will be the default (which is good), and that (2) can currently be obtained by turning on the PCRE_NO_UTF32_CHECK option. You said that the masking is only a temporary measure while developing this. It's not clear what that implies: once the development is complete, would the PCRE_NO_UTF32_CHECK option still produce behavior (2), or would the masking code be removed and the PCRE_NO_UTF32_CHECK option produce behavior (3)? It seems that there are three possible purposes for someone to specify an option named PCRE_NO_UTF32_CHECK: (A) simply to speed up the code a bit, since they're absolutely certain that their strings are valid UTF-32; (B) to obtain behavior (2) since they've included extra information in the eleven highest bits; or (C) to obtain behavior (3) to support characters beyond U+10FFFF. For purpose (A), suppose on some rare occasion the absolute certainty is mistaken; then the best behavior for PCRE is (3), since 0x10000021 isn't a valid code for an exclamation point (U+0021) and PCRE shouldn't report a match when in reality there isn't a match. The difference between behaviors (2) and (3) is huge. If only one or the other is supported, (3) is more appropriate -- again, PCRE shouldn't report a match when in reality there isn't a match. If the masking is considered a useful option for the long term and not only a temporary measure, then there could be two options in addition to the default (strict UTF-32 checking). They might be named: PCRE_MASK_UTF32_BEYOND_1FFFFF for behavior (2) and PCRE_ALLOW_UTF32_BEYOND_10FFFF for behavior (3). 
This might only require a few additional lines of code. I'm happy to help with the implementation. Best wishes, Tom 文林 Wenlin Institute, Inc. Software for Learning Chinese E-mail: wen...@wenlin.com Web: http://www.wenlin.com Telephone: 1-877-4-WENLIN (1-877-493-6546) ☯ -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
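To make the difference between the three behaviors in the message above concrete, here is a small sketch, not taken from the PCRE source. All names (demo_is_valid_utf32 and so on) are hypothetical, and the sample code unit 0x10000021 is chosen for illustration only: it has one of the upper eleven bits set and U+0021 in the low 21 bits.

```c
#include <stdint.h>

#define DEMO_UNICODE_MAX  0x10FFFFu  /* highest Unicode code point */
#define DEMO_MASK_21_BITS 0x1FFFFFu  /* keeps the low 21 bits, drops the top 11 */

/* Behavior (1): strict check. Returns 1 for a valid UTF-32 code unit,
   0 for values above U+10FFFF or inside the surrogate range. */
int demo_is_valid_utf32(uint32_t c)
{
    return c <= DEMO_UNICODE_MAX && !(c >= 0xD800u && c <= 0xDFFFu);
}

/* Behavior (2): silently mask away the upper bits. */
uint32_t demo_masked(uint32_t c)
{
    return c & DEMO_MASK_21_BITS;
}

/* Comparison under behavior (2): flag bits are ignored. */
int demo_same_char_masked(uint32_t a, uint32_t b)
{
    return demo_masked(a) == demo_masked(b);
}

/* Comparison under behavior (3): every 32-bit value is its own character. */
int demo_same_char_unmasked(uint32_t a, uint32_t b)
{
    return a == b;
}
```

Under masking, 0x10000021 compares equal to U+0021, which is exactly the false match the message warns about; under behavior (3) the two stay distinct, and under behavior (1) the value is rejected before any matching happens.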
Re: [pcre-dev] [Bug 1295] New: add 32-bit library
We have only a few (two) masks as far as I remember. In practice, 16- and 32-bit modes are basically useless without UTF, since you can set character types only for the first 256 code points. And UTF is not likely to go beyond 0x10ffff in the foreseeable future, since only a small fraction of it is filled with characters, and that already covers nearly all spoken and dead (but real!) languages. UTF is not intended to be a picture library for bored graphics designers, so we can say PCRE only supports characters <= 0xfff without limiting any practical use cases. I agree with the checks. Regards, Zoltan Tom Bishop, Wenlin Institute tan...@wenlin.com wrote: On Sep 14, 2012, at 3:26 PM, Christian Persch (GNOME) c...@gnome.org wrote: ...Since UTF-32 only occupies 21 bits of the 32-bit characters, it's useful for implementations to use the upper bits to store extra info (flags, etc). Since it's more efficient to pass the unmodified strings to pcre32, I aim to make pcre32 mask out those upper bits. This is done in the code but hasn't been debugged yet (it's not working yet). I suggest that such masking behavior should not be the default, but only enabled, if at all, by explicitly setting some configuration option. If a 32-bit string contains a code unit such as 0x10000021, the safer assumption is that it is *not* equivalent to U+0021. 0x10000021 might trigger a warning that the string is not valid UTF-32, or it might just be treated as a different character. But to treat it by default as matching U+0021 would be just as wrong as an ASCII-based program treating 0xA1 as equivalent to 0x21. The originally ASCII-based programs that continue to work well today (for Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1 differently from 0x21, and refrain from masking/bending/folding/mutilating it. 
Using the upper bits of 32-bit code units for flags, etc., risks incompatibility with future use of code points beyond U+10FFFF (such as for extended private use); developers need to weigh the risks and benefits of such an approach carefully. Anyway, if they do it, they should at least be responsible for setting an option instructing PCRE to mask the high bits. In general, most libraries shouldn't be expected to mask or ignore those bits. I hope this suggestion is helpful. A 32-bit PCRE is likely to be useful for the long-term future, especially if code points beyond U+10FFFF are eventually employed. Best wishes, Tom -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Forcing study to return with something
Hi all, it is inconvenient for several use cases that study sometimes returns something and sometimes does not, so that we need to allocate a pcre_extra manually. Shall we add a flag which forces it to return a plain, empty pcre_extra if it would otherwise return NULL (and there was no error, of course)? Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Forcing study to return with something
What about PCRE_STUDY_EXTRA_NEEDED? My problem with NONULL is that study would still return NULL in case of an error (a rare case; it means that the regex is invalid, e.g. wrong magic number). Thus, this feature could also be used for error checking. Regards, Zoltan Philip Hazel p...@hermes.cam.ac.uk wrote: On Sat, 11 Aug 2012, Zoltán Herczeg wrote: it is inconvenient for several uses cases, that study sometimes return with something and sometimes not, and we need to manually allocate a pcre_extra. Shall we add a flag which forces it to return with a plain, empty, pcre_extra if it would return with NULL (and there was no error of course)? I have no objection. Indeed, I am currently working on pcregrep (fixing various oddities with --include and --exclude) and its code could perhaps be made a bit tidier if this flag exists. What shall we call it? PCRE_STUDY_NEVER_NULL? PCRE_STUDY_NONULL? PCRE_STUDY_FORCE_ALLOC? Maybe NONULL is clearest, but I don't really mind (and maybe there's something better). Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Help with pcregrep on Windows
Hi, 7.0 is really old. We do not support it anymore. Could you help me in trying to update pcregrep on my Windows system? I am not a Windows expert myself, but with cmake and mingw (+make) it is easy to build it. Cygwin probably works as well. You can find mingw related things here: http://sourceforge.net/projects/mingw/files/ GCC: http://sourceforge.net/projects/mingw/files/MinGW/Base/gcc/Version4/ There are some installers there, but I have never tried them. I just downloaded the usual components (binutils, libc, etc.) until the compiler was finally able to compile a hello world. The make utility is also important: http://sourceforge.net/projects/mingw/files/MSYS/Base/ CMake is available here: http://www.cmake.org/ Download it as well. This command configures PCRE from its root directory if everything is available in the PATH: cmake -G "MinGW Makefiles" . -DCMAKE_C_FLAGS:STRING=-Wl,--stack,16777216 -DPCRE_BUILD_PCRE16=ON -DPCRE_BUILD_PCRE8=ON -DPCRE_SUPPORT_JIT=ON -DPCRE_SUPPORT_UNICODE_PROPERTIES=ON And a simple make command builds it. Testing: make test Detailed build process: make VERBOSE=1 Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] A PCRE 32-bit library?
Hi Christian, wow, this is awesome! The UTF-16 support was designed in such a way that UTF-32 support shouldn't be too much trouble, but I suspect there are some places where you needed to do a lot of work. I have some questions: 1) The first / required character data is only 16 bits wide. Did you change them to 32 bits? What about the alignment of the new structure? 2) Did you make new tests? 3) What is the status of the non-UTF, plain 32-bit mode? I remember places where the uint32 characters also contain some flags in the higher bits. 4) Which build systems support UTF-32? 5) What about JIT? Again, I think this is really nice work! I suspect only new symbols were added to pcre.h, so we shouldn't need to worry about compatibility. Regards, Zoltan I was wondering if there would be any interest in a contribution adding a 32-bit (UTF-32) PCRE library alongside the existing 8-bit and 16-bit libraries? I've got an almost finished patch against 8.31, which turned out to be *much* less work than anticipated, thanks to all the work done for the 16-bit library... -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Start optimization issue
If DOTALL is set the result is the same. It seems that is_anchored() is in action. I think PCRE must not assume such patterns as anchored. No. /* .* means start at start or after \n if it isn't in brackets that may be referenced. */ else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR) { if (scode[1] != OP_ANY || (bracket_map & backref_map) != 0) return FALSE; } This assumption is wrong for those .*-s which are inside an atomic block, or where backtracking is broken by some recursive control verb like this one: re> /.*?a(*PRUNE)b/ data> aab No match data> re> /(*NO_START_OPT).*?a(*PRUNE)b/ data> aab 0: ab Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] JIT doesn't build under WinCE
Hi, I renamed the variable to quit. I hope it will fix the WinCE build. http://www.exim.org/lurker/message/20120708.164441.53d8f041.hu.html Regards, Zoltan Giuseppe D'Angelo dange...@gmail.com wrote: Hi, a small followup: the JIT doesn't build under WinCE/x86 as well because the indirect, apparently unguarded inclusion of excpt.h from windows.h (#included from sljitUtils.c) has #define leave __leave and this breaks pcre_jit_compile.c that has a leave field in a struct. A simple #undef leave after the include seems to do the trick. :( Thanks, -- Giuseppe D'Angelo -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
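A self-contained sketch of the name clash described above and of the #undef workaround. The #define line only simulates what the report says excpt.h does on WinCE; struct demo_common and demo_leave_field are made-up names, not code from pcre_jit_compile.c.

```c
/* Simulate the system header: per the report, excpt.h has this macro. */
#define leave __leave

/* Workaround from the report: drop the macro before 'leave' is used as
   an ordinary identifier. With a compiler where __leave is a keyword,
   the struct below would not compile without this. */
#ifdef leave
#  undef leave
#endif

struct demo_common {
    int leave;  /* would otherwise be rewritten to '__leave' */
};

/* Store and read back the field to show it is usable as a normal member. */
int demo_leave_field(void)
{
    struct demo_common common;
    common.leave = 42;
    return common.leave;
}
```

The fix that actually landed, according to the reply, was to rename the field to quit, which avoids depending on include order at all.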
Re: [pcre-dev] Start optimization issue
Hi, I have investigated this issue, and in the optimized case PCRE_STARTLINE is set, so it searches for the first newline. There is a comment about this before is_startline(...): /* This is called to find out if every branch starts with ^ or .* so that first char processing can be done to speed things up in multiline matching and for non-DOTALL patterns that start with .* (which must start at the beginning or after \n). As in the case of is_anchored() (see above), we have to take account of back references to capturing brackets that contain .* because in that case we can't make the assumption. ... */ Probably the atomic block affects this case, since it removes the backtracking ability from .*, and maybe other recursion control verbs (like (*COMMIT)) can also do this. Regards, Zoltan ND nad...@mail.ru wrote: Good day! Here is pcretest.exe listing: PCRE version 8.31 2012-07-06 /(?>.*?a)(?<=ba)/ aba No match MATCH was inspected. More investigation returns that is start optimization issue. PCRE version 8.31 2012-07-06 /(*NO_START_OPT)(?>.*?a)(?<=ba)/ aba 0: ba What kind of start optimization doing things? I don't find in documentation anything about this case. Thanx. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] [PATCH] IBM xlc compatibility, pcretest memory-leak fixes
Hi Daniel, thank you very much for fixing these issues. I hope JIT is working well on your system now (If you can share some performance results as well, I am really interested). I would like to add your changes to the project but the patch has a strange syntax on my machine: --_e03c4e17-38a7-4221-a4b1-fe0372a519e7_ Content-Transfer-Encoding: uuencode Content-Disposition: attachment; filename=pcre-fixes.patch Content-Type: application/octet-stream; name=pcre-fixes.patch begin 666 pcre-fixes.patch M26YD97@Z(%)U;E1EW0*/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T] M/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/0HM+2T@4G5N M55S= DHF5V:7-I;VX@.3@R*0HK*RL@4G5N55S= DH=V]R:VEN9R!C;W!Y M*0I 0 M,34U+#@*S$U-2PX($! B C(%-E=!U!A('-U:71A8FQE()D M:69F(B!C;VUM86YD(9OB!C;VUP87)IV]N+B!3;VUE('-YW1E;7,*(,@ ... How can I decode your patch? Regards, Zoltan Daniel Richard G. o...@teragram.com írta: Hello folks, I've put in some work on PCRE lately, and would like to submit my changes. They largely address (1) compatibility with the IBM C compiler (xlc), and (2) memory leaks in the pcretest test driver. Everything is in the attached patch, against SVN trunk. A walk-through is below: ++ RunTest @ The initial goal here was to quell the illegal option error message from older diff(1) programs; the restructuring of the conditional is a parallel to my changes to RunGrepTest ++ pcre_jit_compile.c @ This comment is intended to help folks who are led to that SLJIT_MALLOC() by Valgrind. It certainly would have helped me, as pcretest was doing the clear-the-PCRE_EXTRA_EXECUTABLE_JIT-flag thing! ++ sljit/sljitNativePPC_common.c @ Allow this file to be compiled by IBM's xlc compiler, which specifically supports GCC asm syntax, but does not define __GNUC__. This support can be switched off with -qnoasm, however... http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/ruoptasm.htm ... so give a more helpful error in that case. 
Use __asm__ instead of plain asm, as the latter can be de-recognized as a keyword with -qnokeyword=asm, and so the former is more robust: http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/ruoptkey.htm Also, I encountered a mysterious assembly error when building with -O3 and -qfuncsect: 8 libtool: compile: /usr/vac/bin/xlc_r -DHAVE_CONFIG_H -I. -I/srcdir/pcre-8.30 -O3 -q64 -qfuncsect -c -M /srcdir/pcre-8.30/pcre_jit_compile.c -DPIC -o pcre_jit_compile.o /srcdir/pcre-8.30/sljit/sljitNativePPC_common.c, line 42.2: 1506-948 (W) #warning This file may fail to compile if -qfuncsect is used Assembler: pcre_jit_compile.s: line 59973: undefined symbol S.8704.IPRA._sljit_emit_cmp pcre_jit_compile.s: line 59973: illegal expression pcre_jit_compile.s: line 59988: undefined symbol S.8705.IPRA._compile_xclass_hotpath pcre_jit_compile.s: line 59988: illegal expression pcre_jit_compile.s: line 59991: undefined symbol S.8706.IPRA._compile_char1_hotpath pcre_jit_compile.s: line 59991: illegal expression pcre_jit_compile.s: line 6: undefined symbol S.8707.IPRA._compile_iterator_hotpath pcre_jit_compile.s: line 6: illegal expression 1500-067: (S) asm statement generates errors in assembler output. 1586-346 (U) An error occurred during code generation. The code generation return code was 1. make: The error code from the last command is 1. Stop. make: The error code from the last command is 2. 8 (I was using a handful of other options as well, but they had no bearing on this failure mode.) This error appears to be due to the inline assembly of ppc_cache_flush(), because if I comment out the asm statement, the error goes away. Removing -qfuncsect allows the source to compile, asm and all, so I added a #warning for folks using the IBM compiler to help them diagnose the problem. 
@ Quelled a warning about converting a function pointer to void* ++ sljit/sljitNativePPC_64.c @ Allow compilation with the IBM compiler and use __asm__ instead of asm ++ sljit/sljitConfigInternal.h @ The IBM compiler does define __powerpc__, but not __powerpc64__, even in 64-bit mode. It does, however, define _ARCH_PPC and _ARCH_PPC64: http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/topic/com.ibm.xlcpp8a.doc/compiler/ref/ruoptarc.htm ++ pcretest.c @ There were a few if (re != NULL) ... bits below this, and I thought, this is silly---just check whether re is NULL right after the new_malloc() call, and bail out if it is @ re was being leaked @ new_free(re) unconditionally, because we know it's not NULL @ re and f were being leaked @ extra->executable_jit was being leaked because an existing pcre_extra object with JIT data would enter this block, then have its PCRE_EXTRA_EXECUTABLE_JIT cleared, and then there was no way to subsequently tell whether extra->executable_jit was a valid pointer or not. I resolved this just by
Re: [pcre-dev] JIT doesn't build under WinCE
Hi, wow, WinCE is still alive? Anyway, according to this doc these functions should be available everywhere, but that was never tested, hence the ifdefs: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043c/IHI0043C_rtabi.pdf They can be removed if other platforms like them. Regards, Zoltan Giuseppe D'Angelo dange...@gmail.com wrote: Hello, when targeting Windows CE, the current ARMv5 JIT does not build due to line 1803 in sljitNativeARM_v5.c: #error Software divmod functions are needed The whole block (present also in other files): #if defined(__GNUC__) extern unsigned int __aeabi_uidivmod(unsigned numerator, unsigned denominator); extern unsigned int __aeabi_idivmod(unsigned numerator, unsigned denominator); #else #error Software divmod functions are needed #endif makes me think that building JIT for ARM with a non-GNU-compatible toolchain is unsupported. Is that the case? Cheers, -- Giuseppe D'Angelo -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] 8.31-RC1 test release is available
PPC with JIT
S/390 without JIT
AMD64 with JIT
PPC64 with JIT
i686 with JIT
S/390x without JIT
ARMv7 with JIT
ARMv5 with JIT

Awesome, you really have so many systems :) Thanks for the testing! Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Fw: Re: [PATCH] add malloc and alloc_size attributes to allocation functions
Hi,

> The thing is that PCRE is often built along with other projects. The more autoconf magic we add, the worse, since those options are unlikely to be picked up by these projects bundling PCRE. (and I

I totally agree with this. However, adding everything to pcre.h makes it far less readable. In sljit, I introduced separate header files for compiler magic. And __has_attribute breaks the rule that everything must be prefixed by pcre or PCRE.

> This attribute has nothing to do with glibc. It is used by gcc/clang to provide useful warnings for, e.g., array out-of-bounds indexing.

It still seems like a research thingy to me. Just out of curiosity, did you actually catch anything with it in PCRE?

> Apart from the warnings, it can be used for optimizations, and run-time code instrumentation.

Actually, both in pcre and in sljit, malloc is rarely used by design, so I am still unsure about its benefit. I am not against this feature, but I really would like to see a real use case which benefits from adding more symbols to a header file. Btw, did you try JIT? I suspect it would offer far more speedup than any malloc optimization. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Fw: Re: [PATCH] add malloc and alloc_size attributes to allocation functions
Hi,

> The __has_attribute macro is special in clang. It's defined by the compiler.

Oh, that is probably a misunderstanding. I mean that instead of defining this macro when it is missing, we should define something with a PCRE_ prefix, which is empty if __has_attribute is not supported.

> No, nothing appeared in clang, which is a good thing! :) But it will prevent overflows being introduced in the future.

Great!

> I didn't try the JIT compiler. But I'll pass it through clang and I'll let you know if it finds any bug. Anyway, the benefits add up. The malloc attribute will be used in JIT mode as well.

Thanks, let me know if you have any results.

Philip, I think we should organize the header file content better by separating it into blocks, and adding a comment about the content of each block. Something like this:

    /* === */
    /* This block contains compiler specific options */
    /* === */
    /* This block is not core part of the PCRE, but it helps to improve code generation on some compilers. */

    /* === */
    /* This block contains preprocessor defines */
    /* === */

    /* === */
    /* This block contains types */
    /* === */

Although we should not introduce too many of them; that can also be confusing. Regards, Zoltan -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
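The PCRE_-prefixed approach suggested above could be sketched roughly as follows. This is an illustration only, not code from pcre.h: the macro names PCRE_HAS_ATTRIBUTE, PCRE_ATTR_MALLOC and PCRE_ATTR_ALLOC_SIZE and the function example_alloc are all hypothetical. The point is that __has_attribute itself is never defined by the header; on compilers that lack it, the prefixed macros simply expand to nothing.

```c
#include <stddef.h>
#include <stdlib.h>

/* Wrap the compiler-provided query; never define __has_attribute itself.
   Compilers without it report every attribute as unsupported. */
#ifdef __has_attribute
#define PCRE_HAS_ATTRIBUTE(x) __has_attribute(x)
#else
#define PCRE_HAS_ATTRIBUTE(x) 0
#endif

/* Hypothetical prefixed macros: empty when the attribute is unavailable. */
#if PCRE_HAS_ATTRIBUTE(malloc)
#define PCRE_ATTR_MALLOC __attribute__((malloc))
#else
#define PCRE_ATTR_MALLOC
#endif

#if PCRE_HAS_ATTRIBUTE(alloc_size)
#define PCRE_ATTR_ALLOC_SIZE(n) __attribute__((alloc_size(n)))
#else
#define PCRE_ATTR_ALLOC_SIZE(n)
#endif

/* Hypothetical allocator declaration using the prefixed macros. */
void *example_alloc(size_t size) PCRE_ATTR_MALLOC PCRE_ATTR_ALLOC_SIZE(1);

void *example_alloc(size_t size)
{
    return malloc(size);
}
```

Keeping these definitions in one clearly commented block (or a separate header, as sljit does) would limit the impact on readability.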
Re: [pcre-dev] [PATCH] add malloc and alloc_size attributes to allocation functions
Hi, I feel this patch just adds unnecessary complexity to the header file. What exactly are these optimizations? For buffer overflows, valgrind is the perfect detection tool with its red zone based detection algorithm. What else can you do with these macros? Or better to ask: what is your exact use case where you need these features? Perhaps we can suggest some workarounds for them. Regards, Zoltan

Nuno Lopes nunoplo...@sapo.pt wrote:

Hi, Please find attached a patch to add the malloc and alloc_size attributes to PCRE's custom allocation functions. The malloc attribute specifies that a given function behaves like malloc, and therefore the returned pointer is fresh (i.e., doesn't alias anything else). It is used mostly for optimization purposes. (I didn't add it to the function pointers, because GCC doesn't support that, although clang does.) The alloc_size attribute specifies that a function allocates memory whose size is given by the specified parameters. In PCRE's case, it's only the first parameter. This attribute enables some optimizations and analysis of buffer overflows and related issues. Regards, Nuno

-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
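To make the question about "these optimizations" concrete, here is a small sketch of the kind of reasoning the malloc attribute permits. It is not taken from the patch: ex_malloc, ex_no_alias and EX_ALLOC_ATTRS are hypothetical names, and the example assumes a GCC-compatible compiler that accepts both attributes (otherwise the macro expands to nothing and the code behaves identically, just without the hint).

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: GCC-compatible compilers accept these attributes. */
#if defined(__GNUC__)
#define EX_ALLOC_ATTRS __attribute__((malloc, alloc_size(1)))
#else
#define EX_ALLOC_ATTRS
#endif

/* Hypothetical custom allocator: returns fresh memory of "size" bytes. */
EX_ALLOC_ATTRS void *ex_malloc(size_t size);

void *ex_malloc(size_t size)
{
    return malloc(size);
}

/* Returns the value of *counter after an unrelated store. */
int ex_no_alias(int *counter)
{
    int result;
    int *fresh = ex_malloc(sizeof *fresh);

    if (fresh == NULL)
        return -1;

    *counter = 1;
    *fresh = 2;         /* fresh aliases nothing, so *counter is untouched */
    result = *counter;  /* the compiler may fold this load to the constant 1 */

    free(fresh);
    return result;
}
```

Without the attribute, a compiler that cannot see the allocator's body has to assume the store through fresh might modify *counter and reload it. The alloc_size part additionally tells the compiler that the returned object is size bytes long, which is what size-based overflow diagnostics rely on.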