[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Matthew Barnett


Matthew Barnett  added the comment:

If we did decide to remove it, but there was still a demand for octal escapes, 
then I'd suggest introducing \oXXX.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Ma Lin


Ma Lin  added the comment:

> I'd still retain \0 as a special case, since it really is useful.

Yes, \0 may be widely used; I hadn't thought of that.
Changing it would be troublesome, so let's keep it as is.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Vedran Čačić

Vedran Čačić  added the comment:

Not very useful, surely (now that we have hex escapes).
[I'd still retain \0 as a special case, since it really is useful.] But a lot 
more useful than a hundred backreferences.

And I'm as a matter of principle opposed to changing something that's been in 
the language for decades for the benefit of someone that's by their own words 
"just learned Python". [Changing documentation is fine.] They by definition 
don't see the whole picture. Now that we don't have a BDFL anymore, I think 
it's vitally important to have some principles such as this one.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Ma Lin


Ma Lin  added the comment:

Octal escape:
\ooo    Character with octal value ooo
As in Standard C, up to three octal digits are accepted.

It only accepts UCS1 characters (ooo <= 0o377):
>>> ord('\377')
255
>>> len('\378')
2
>>> '\378' == '\37' + '8'
True

IMHO this is not useful, and it creates confusion.
Maybe it could be deprecated at the language level.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Vedran Čačić

Vedran Čačić  added the comment:

The documentation clearly says:

> This special sequence can only be used to match one of the first 99 groups. 
> If the first digit of number is 0, or number is 3 octal digits long, it will 
> not be interpreted as a group match, but as the character with octal value 
> number.

Maybe it should also mention Serhiy's technique at that place, something like

"If you need more than 99 groups, you can name them using..."
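
The rule quoted above can be checked with a few small patterns. This is only a sketch: the patterns and the result names are made up for illustration, and it assumes the Python 3 re module behaves as the quoted documentation describes.

```python
import re

# \1 is a backreference: it must repeat whatever group 1 matched.
backref_ok = re.fullmatch(r"(a|b)\1", "aa") is not None
backref_fail = re.fullmatch(r"(a|b)\1", "ab") is None

# Three octal digits: \101 is the character "A" (0o101 == 65), not group 101.
octal_three = re.fullmatch(r"\101", "A") is not None

# A leading 0 also means octal: \07 is the BEL character, "\x07".
octal_zero = re.fullmatch(r"\07", "\x07") is not None

print(backref_ok, backref_fail, octal_three, octal_zero)
```

All four checks should come out True if the documented rule holds.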




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread veaba


veaba <908662...@qq.com> added the comment:

Aha, it's me. It's the mysterious power from the East. I just learned Python.

I've solved my problem. A very simple replace() call does the job, in three 
lines.

I ran into this by accident while converting HTML text into Markdown files. 
The documents contain very complex strings, which is why I did it that way. It 
now seems that the method I used was an inappropriate way to implement this, 
which was my mistake.

However, I still think this regex overflow is a problem: it silently produces 
a bunch of meaningless strings without raising any error.

I didn't find this bug until I randomly selected and checked 2K documents. 
I don't know if that's unlucky or lucky.

I won't take part in the rest of the discussion, which is beyond my level.

Good luck.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I do not believe anybody uses handwritten regular expressions with more than 
100 groups. But if you generate regular expressions, you can use named groups: 
(?P<g12345>...) (?P=g12345).
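
A minimal sketch of that approach for a generated pattern follows. The group names g1..g120, the count n and the sample text are assumptions chosen for illustration; only the (?P<name>...), (?P=name) and \g<name> syntax comes from the re module itself.

```python
import re

n = 120  # more groups than a two-digit numeric backreference can address

# Build (?P<g1>.)(?P<g2>.)...(?P<g120>.) followed by a backreference to g120.
body = "".join(f"(?P<g{i}>.)" for i in range(1, n + 1))
pattern = re.compile(body + f"(?P=g{n})")

# 119 filler characters, then "z" captured by g120, then "z" repeated.
text = "x" * (n - 1) + "zz"
match = pattern.fullmatch(text)
captured = match.group(f"g{n}") if match else None

# The backreference rejects a text whose last character differs from g120.
mismatch = pattern.fullmatch("x" * (n - 1) + "zy")

# In a replacement template, \g<name> refers to the same named group.
replaced = pattern.sub(rf"\g<g{n}>", text)

print(captured, mismatch, replaced)
```

Because every reference goes through a name, the octal-escape ambiguity of \100 and above never comes up.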




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread veaba

veaba <908662...@qq.com> added the comment:

Yes, this is not a good place to use regular expressions.

Using regular expressions:

def actual_re_demo():
    import re
    # The input is an arbitrary string...
    text = "tf.where(condition, x=None, y=None, name=None) tf.batch_gather ..."

    # The terms to match are turned into a regular expression; how many
    # there are is also arbitrary.
    pattern_str = re.compile('(tf\\.batch_gather)|(None)|(a1)')

    # I don't know how many groups there will be, so the template can run
    # past \\100.
    x = re.sub(pattern_str, '`'+'\\1\\2'+'`', text)

    print(x)

    # Hoped-for result: a matched term with the tf. prefix comes out
    # as `tf.xx`.

    # In practice the terms are not limited to the tf. prefix: a term can
    # have any prefix or suffix, or be any other string.

    # With more than 100 groups, the result is:
    # 989¡¢£¤¥¦§89¨©ª«¬­®¯89°±²³´µ¶·89¸¹º»¼½¾¿890123`,
    # name=`None@ABCDEFG89HIJKLMNO89PQRSTUVW89XYZ[\]^_89`abcdefg89hijklmno89pqrstuvw89xyz{|}~8901234567890123456789

    # I noticed in the comments that this is caused by a radix mix-up
    # (octal vs. decimal), which is a little embarrassing.


Use replace to solve it. It looks much better.

def no_need_re():
    text = "tf.where(condition, x=None, y=None, name=None) tf.batch_gather ..."
    pattern_list = ['tf.batch_gather', 'None']
    # Wrap every occurrence of each term in backticks.
    for item in pattern_list:
        text = text.replace(item, '`'+item+'`')

    print(text)

no_need_re()

I would expect an error to be raised when the limit is exceeded, instead of 
silently producing overflowed characters like this:

989¡¢£¤¥¦§89¨©ª«¬­®¯89°±²³´µ¶·89¸¹º»¼½¾¿890123`, 
name=`None@ABCDEFG89HIJKLMNO89PQRSTUVW89XYZ[\]^_89`abcdefg89hijklmno89pqrstuvw89xyz{|}~8901234567890123456789
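
If a regular expression is still wanted for this task, one way to avoid numbered groups altogether is to reference the whole match with \g<0>. The sketch below is only an illustration under that assumption. The terms list and variable names are hypothetical, and this is not the code from the linked project.

```python
import re

text = "tf.where(condition, x=None, y=None, name=None) tf.batch_gather ..."
terms = ["tf.batch_gather", "None"]  # hypothetical list; could hold hundreds

# Escape each term so "." is literal, and try longer terms first.
alternatives = sorted(terms, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(t) for t in alternatives))

# \g<0> is the entire match, so the template does not depend on how many
# alternatives (or groups) the generated pattern contains.
result = pattern.sub(r"`\g<0>`", text)

print(result)
```

With the sample text this wraps each None and tf.batch_gather in backticks, the same outcome as the replace() loop, without any \1 ... \100 style references.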




[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Vedran Čačić

Vedran Čačić  added the comment:

I have no problem with long regexes. But those are not only long, those must be 
_deeply nested_ regexes, where simply 100 is an arbitrary limit. I'm quite sure 
if you really need depth 100, you must also need a dynamic depth of nesting, 
which you cannot really achieve with regexes.

Yes, if there is a will to change this, supporting \g<...> would be a way to go.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Ma Lin


Ma Lin  added the comment:

@veaba 
Posting only in English is fine.

> Is this actually needed?
Maybe a very few people dynamically generate large patterns.

> However, \g<...> is not accepted in a pattern.
> in the "regex" module I added support for it in a pattern too.
Yes, a backreference number in a pattern also can't be >= 100.
Supporting \g<...> in patterns is a good idea.

Fixing this issue may cause a backward compatibility problem: the parser 
would confuse backreference numbers with octal escape numbers.
Maybe clarifying the limit (<= 99) in the documentation is enough.




[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread veaba

veaba <908662...@qq.com> added the comment:

This comes from an actual project of mine 
(https://github.com/veaba/tensorflow-docs/blob/master/scripts/spider_tensorflow_docs.py#L39-L56).
Of course, my approach may not be the right one; it was just an attempt made 
while I was learning Python.

The project works like this: it extracts text according to the HTML tags and 
converts it to Markdown format. The contents of a <code> tag need to be 
surrounded by the symbol "`", so in a loop the matched strings are substituted 
back in through \\* references.

So, as you can see, I ran into this regex overflow error.

It would be very helpful to me if the limit on the number of backslash 
substitutions could be lifted. Of course, if that really isn't needed, I'll 
find another way myself.

--
hgrepos: +385




[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Vedran Čačić

Vedran Čačić  added the comment:

Is this actually needed? I can't remember ever needing more than 4 (in a 
pattern). I find it very hard to believe someone might actually have such a 
regex with more than a hundred backreferences. Probably it's just a misguided 
attempt to parse a nested structure with a regex.

--
nosy: +veky




[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Matthew Barnett


Matthew Barnett  added the comment:

A numeric escape of 3 digits is an octal (base 8) escape; the octal escape 
"\100" gives the same character as the hexadecimal escape "\x40".

In a replacement template, you can use "\g<100>" if you want group 100 because 
\g<...> accepts both numeric and named group references.

However, \g<...> is not accepted in a pattern.

(By the way, in the "regex" module I added support for it in a pattern too.)
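
The difference between the two template forms can be sketched as follows. The pattern with exactly 100 groups and the sample text are made up for illustration, and the sketch assumes a Python 3 re module that allows 100 or more capturing groups.

```python
import re

# 100 capturing groups: ninety-nine "(a)" groups, then "(b)" as group 100.
pattern = re.compile("(a)" * 99 + "(b)")
text = "a" * 99 + "b"

# Three octal digits are read as an octal escape: chr(0o100) is "@" ("\x40").
octal_result = pattern.sub(r"\100", text)

# \g<100> refers to group 100 unambiguously.
group_result = pattern.sub(r"\g<100>", text)

print(repr(octal_result), repr(group_result))
```

The first substitution yields "@" and the second yields "b", the text captured by group 100.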




[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Ma Lin


Ma Lin  added the comment:

A backreference number in a replacement string can't be >= 100:
https://github.com/python/cpython/blob/v3.8.0/Lib/sre_parse.py#L1022-L1036

If no one else takes this, I will try to fix this issue tomorrow.

--
nosy: +serhiy.storchaka
title: Regular match overflow -> re: backreference number in replace string can't >= 100
versions: +Python 3.7, Python 3.8, Python 3.9
