[issue21448] Email Parser use 100% CPU

2015-05-22 Thread Roundup Robot
Roundup Robot added the comment: New changeset 830bcf4fb29b by Raymond Hettinger in branch 'default': Issue #21448: Improve performance of the email feedparser https://hg.python.org/cpython/rev/830bcf4fb29b

[issue21448] Email Parser use 100% CPU

2015-05-22 Thread Raymond Hettinger
Changes by Raymond Hettinger raymond.hettin...@gmail.com: assignee: rhettinger -> ; resolution: -> fixed; status: open -> closed. Python tracker rep...@bugs.python.org http://bugs.python.org/issue21448

[issue21448] Email Parser use 100% CPU

2015-04-13 Thread R. David Murray
R. David Murray added the comment: Raymond, are you going to apply the deque patch (maybe after doing performance measurement), or should we close this?

[issue21448] Email Parser use 100% CPU

2014-08-12 Thread Roundup Robot
Roundup Robot added the comment: New changeset ba90bd01c5f1 by Serhiy Storchaka in branch '2.7': Issue #21448: Fixed FeedParser feed() to avoid O(N**2) behavior when parsing long line. http://hg.python.org/cpython/rev/ba90bd01c5f1 New changeset 1b1f92e39462 by Serhiy Storchaka in branch '3.4':

[issue21448] Email Parser use 100% CPU

2014-08-12 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Raymond wrote: "The test_parser.diff file catches the bug in fix_email_parse.diff". I don't see this. But well, it does no harm. Please commit fix_prepending2.diff yourself. -- assignee: serhiy.storchaka -> rhettinger; versions: -Python 2.7, Python 3.4

[issue21448] Email Parser use 100% CPU

2014-08-12 Thread Roundup Robot
Roundup Robot added the comment: New changeset 71cb8f605f77 by Serhiy Storchaka in branch '2.7': Decreased memory requirements of new tests added in issue21448. http://hg.python.org/cpython/rev/71cb8f605f77 New changeset c19d3465965f by Serhiy Storchaka in branch '3.4': Decreased memory

[issue21448] Email Parser use 100% CPU

2014-08-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is a patch that combines a fixed version of Raymond's patch with FeedParser tests. These tests cover this issue, a bug in my patch, and (surprisingly) a bug in Raymond's patch. I didn't include Raymond's test because it doesn't appear to catch any bug. If there are

[issue21448] Email Parser use 100% CPU

2014-08-10 Thread Raymond Hettinger
Raymond Hettinger added the comment: The test_parser.diff file catches the bug in fix_email_parse.diff and it provides some assurance that push() functions as an incremental version of str.splitlines(). I would like to have this test included. It does some good and does no harm.

[issue21448] Email Parser use 100% CPU

2014-08-05 Thread R. David Murray
R. David Murray added the comment: Serhiy: there was an issue with \r\n going across a chunk boundary that was fixed a while back, so there should be a test for that (I hope). As for how to handle line breaks, backward compatibility applies: we have to continue to do what we did before, and

[issue21448] Email Parser use 100% CPU

2014-08-05 Thread Antoine Pitrou
Antoine Pitrou added the comment: Should this be categorized as a security issue? You could easily DoS a server with that (email.parser is used by http.client to parse HTTP headers, it seems). -- nosy: +christian.heimes, pitrou

[issue21448] Email Parser use 100% CPU

2014-08-05 Thread Raymond Hettinger
Raymond Hettinger added the comment: Antoine Pitrou wrote: "Should this be categorized as a security issue? You could easily DoS a server with that (email.parser is used by http.client to parse HTTP headers, it seems)." I think it makes sense to treat this as a security issue. I don't have a preference about

[issue21448] Email Parser use 100% CPU

2014-08-05 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I found a bug in my patch. The following code:

    from email.parser import Parser
    BLOCKSIZE = 8192
    s = 'From: e...@example.com\nFoo: '
    s += 'x' * ((-len(s) - 1) % BLOCKSIZE) + '\rBar: '
    s += 'y' * ((-len(s) - 1) % BLOCKSIZE) + '\x85Baz: '
    s += 'z' * ((-len(s) - 1) %

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: fix_email_parse.diff does not work when one chunk ends with '\r' and the next chunk doesn't start with '\n'.
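The failure mode described here (a chunk that ends in '\r' followed by a chunk that does not start with '\n') can be probed by feeding the same text to FeedParser under every possible chunk size and comparing the results with a single-feed parse. This is a minimal sketch, not one of the tests attached to the issue; the helper names and the sample message are assumptions made for illustration.

```python
from email.parser import FeedParser


def parse_in_chunks(text, size):
    """Feed `text` to a FeedParser `size` characters at a time."""
    parser = FeedParser()
    for start in range(0, len(text), size):
        parser.feed(text[start:start + size])
    return parser.close()


def mismatched_chunk_sizes(text):
    """Return the chunk sizes whose parse differs from a single feed()."""
    whole = parse_in_chunks(text, len(text))
    expected = (whole.items(), whole.get_payload())
    bad = []
    for size in range(1, len(text)):
        msg = parse_in_chunks(text, size)
        if (msg.items(), msg.get_payload()) != expected:
            bad.append(size)
    return bad


# Hypothetical sample mixing '\r\n', bare '\r' and bare '\n' line endings,
# so that some chunk sizes end exactly on a '\r'.
SAMPLE = "From: a@example.com\r\nFoo: xxxx\rBar: yyyy\n\nbody line\r\n"

if __name__ == "__main__":
    print(mismatched_chunk_sizes(SAMPLE))
```

A parser whose push() is a faithful incremental line splitter should report no mismatching sizes.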

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Raymond Hettinger added the comment: Attaching revised patch. I forgot to reapply splitlines. -- Added file: http://bugs.python.org/file36230/fix_email_parse2.diff

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Raymond Hettinger added the comment: Attaching a more extensive test. -- Added file: http://bugs.python.org/file36231/test_parser.diff

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Changes by Raymond Hettinger raymond.hettin...@gmail.com: Added file: http://bugs.python.org/file36232/fix_prepending.diff

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Changes by Raymond Hettinger raymond.hettin...@gmail.com: Added file: http://bugs.python.org/file36233/fix_prepending2.diff

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Changes by Raymond Hettinger raymond.hettin...@gmail.com: Removed file: http://bugs.python.org/file36232/fix_prepending.diff

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: fix_email_parse2.diff slightly changes behavior. See my comments on Rietveld. As for fix_prepending2.diff, could you please provide any benchmark results? There is also one more bug in the current code. str.splitlines() splits a string not only breaking it at
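The point about str.splitlines() can be seen directly: besides '\n', '\r' and '\r\n', it also breaks on several other characters, while a text stream opened with newline='' only recognises the three classic line endings. The sketch below is for illustration only; the helper names are made up here and do not come from any of the attached patches.

```python
import io

# Characters that str.splitlines() treats as line boundaries although they
# are not '\n', '\r' or '\r\n'.
EXTRA_BREAKS = ["\x0b", "\x0c", "\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]


def split_like_str(text):
    """Split with str.splitlines(), keeping the line endings."""
    return text.splitlines(True)


def split_like_file(text):
    """Split the way a text file does: only on '\\n', '\\r' and '\\r\\n'."""
    return io.StringIO(text, newline="").readlines()


if __name__ == "__main__":
    for ch in EXTRA_BREAKS:
        header = "Foo: x" + ch + "Bar: y\n"
        print(repr(ch), split_like_str(header), split_like_file(header))
```

For a header parser the difference matters: with splitlines() a header value containing, say, '\x85' is cut into two lines.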

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Raymond Hettinger added the comment: Serhiy wrote: "As for fix_prepending2.diff, could you please provide any benchmark results?" No. Inserting at the beginning of a list is always O(n) and inserting at the beginning of a deque is always O(1).

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Yes, but if n is limited, O(n) becomes O(1). In our case n is the number of fed but not read lines. I suppose the worst case is a number of empty lines, in this case n=8192. I tried the following microbenchmark and did not notice a significant difference. $
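The microbenchmark itself is cut off in this archive. As a rough stand-in, and not the benchmark that was actually run, the sketch below times prepending n=8192 items to a list and to a deque with timeit. The function names and the repeat counts are assumptions.

```python
import timeit
from collections import deque

N = 8192  # worst case suggested above: one chunk made only of empty lines


def prepend_list(n=N):
    items = []
    for i in range(n):
        items.insert(0, i)    # shifts every existing element
    return items


def prepend_deque(n=N):
    items = deque()
    for i in range(n):
        items.appendleft(i)   # constant time per call
    return items


def measure(repeat=5, number=10):
    """Return the best time in seconds for each approach."""
    return {
        "list.insert(0, x)": min(timeit.repeat(prepend_list, repeat=repeat, number=number)),
        "deque.appendleft(x)": min(timeit.repeat(prepend_deque, repeat=repeat, number=number)),
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"{label}: {seconds:.4f}s")
```

Whether the gap is significant at this size depends on the machine and interpreter, which is the substance of the disagreement in the thread.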

[issue21448] Email Parser use 100% CPU

2014-08-03 Thread Raymond Hettinger
Raymond Hettinger added the comment: A deque is typically the right data structure when you need to append, pop, and extend on both the left and right side. It is designed specifically for that task. Also, it nicely cleans up the code by removing the backwards line list and the list
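To make the argument concrete, here is a minimal sketch of a line queue built on collections.deque. The class LineBuffer and its method names are hypothetical and only loosely modelled on what a buffered line reader needs; this is not the code from fix_prepending2.diff.

```python
from collections import deque


class LineBuffer:
    """Hypothetical queue of complete lines, stored in reading order."""

    def __init__(self):
        self._lines = deque()

    def pushlines(self, lines):
        self._lines.extend(lines)         # new lines go on the right

    def readline(self):
        if not self._lines:
            return None                   # caller decides what "no data" means
        return self._lines.popleft()      # oldest line comes off the left

    def unreadline(self, line):
        self._lines.appendleft(line)      # push a line back to be read again


if __name__ == "__main__":
    buf = LineBuffer()
    buf.pushlines(["a\n", "b\n"])
    first = buf.readline()
    buf.unreadline(first)
    print(buf.readline(), buf.readline(), buf.readline())
```

Because lines are kept in natural order, there is no need for a reversed list or for list.insert(0, ...) when a line is pushed back.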

[issue21448] Email Parser use 100% CPU

2014-08-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Parser reads small chunks (8192 characters) from the input file and feeds them to FeedParser, which pushes the data into BufferedSubFile. In BufferedSubFile.push(), chunks of incomplete data are accumulated in a buffer and repeatedly scanned for newlines. Every push() has
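The cost of rescanning an ever-growing buffer can be illustrated with two toy buffers that count how many characters they examine. Both classes are simplified assumptions written for this illustration: they treat only '\n' as a line ending and are not the real BufferedSubFile or any of the attached patches.

```python
def split_keep_newline(text):
    """Return (complete lines, trailing partial line), splitting on '\\n' only."""
    parts = text.split("\n")
    return [p + "\n" for p in parts[:-1]], parts[-1]


class RescanningBuffer:
    """Re-splits the whole accumulated buffer on every push()."""

    def __init__(self):
        self._partial = ""
        self.lines = []
        self.chars_scanned = 0

    def push(self, data):
        self._partial += data
        self.chars_scanned += len(self._partial)   # the split walks all of it
        lines, self._partial = split_keep_newline(self._partial)
        self.lines.extend(lines)


class DeferredJoinBuffer:
    """Stores partial chunks in a list and joins them only when a line ends."""

    def __init__(self):
        self._partial = []
        self.lines = []
        self.chars_scanned = 0

    def push(self, data):
        self.chars_scanned += len(data)            # only the new chunk is checked
        if "\n" not in data:
            self._partial.append(data)
            return
        joined = "".join(self._partial) + data
        self.chars_scanned += len(joined)          # one pass over the finished line
        lines, tail = split_keep_newline(joined)
        self._partial = [tail] if tail else []
        self.lines.extend(lines)


if __name__ == "__main__":
    chunks = ["x" * 8192] * 100 + ["\n"]           # one very long line
    for cls in (RescanningBuffer, DeferredJoinBuffer):
        buf = cls()
        for chunk in chunks:
            buf.push(chunk)
        print(cls.__name__, buf.chars_scanned)
```

With k chunks in one line, the first buffer examines on the order of k squared chunk-lengths of text, the second on the order of k.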

[issue21448] Email Parser use 100% CPU

2014-08-02 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: stage: -> patch review

[issue21448] Email Parser use 100% CPU

2014-08-02 Thread Raymond Hettinger
Raymond Hettinger added the comment: I'm looking at the patch today. -- assignee: -> rhettinger; nosy: +rhettinger

[issue21448] Email Parser use 100% CPU

2014-08-02 Thread Raymond Hettinger
Raymond Hettinger added the comment: I think the push() code can be a little cleaner. Attaching a revised patch that simplifies push() a bit. -- assignee: rhettinger -> serhiy.storchaka; Added file: http://bugs.python.org/file36216/fix_email_parse.diff

[issue21448] Email Parser use 100% CPU

2014-08-02 Thread Raymond Hettinger
Changes by Raymond Hettinger raymond.hettin...@gmail.com: Removed message: http://bugs.python.org/msg224577

[issue21448] Email Parser use 100% CPU

2014-08-01 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: So the bug is that the email parser is dramatically slow for abnormally long lines: its running time is quadratic in the line length. Minimal example:

    import email.parser
    import time
    data = 'From: exam...@example.com\n\n' + 'x' * 1000
    start = time.time()
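The example above is cut off in this archive. A self-contained timing harness in the same spirit might look like the sketch below; the function name, the sender address and the line lengths are assumptions and are not taken from the original report.

```python
import time
from email.parser import Parser


def time_long_line(length):
    """Return the seconds spent parsing a message whose body is one long line."""
    data = "From: sender@example.com\n\n" + "x" * length
    start = time.perf_counter()
    Parser().parsestr(data)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Each step is ten times longer than the previous one. With linear
    # behaviour the time should grow roughly tenfold per step; with
    # quadratic behaviour, roughly a hundredfold.
    for length in (10**4, 10**5, 10**6):
        print(f"{length:>8} chars: {time_long_line(length):.4f}s")
```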

[issue21448] Email Parser use 100% CPU

2014-05-09 Thread Tshepang Lekhonkhobe
Changes by Tshepang Lekhonkhobe tshep...@gmail.com: nosy: +tshepang

[issue21448] Email Parser use 100% CPU

2014-05-08 Thread jader fabiano
jader fabiano added the comment: Hi. I now understand why this problem was happening: I was writing the MIME attachments incorrectly. I read a 4 MB file, converted it to Base64, and wrote the result into the MIME message, but I wasn't breaking it into lines of 76 characters plus \r\n. I was
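For reference, Base64 content in a MIME body is normally written as short lines rather than one enormous line. A minimal sketch of producing such lines, assuming Python 3 (where base64.encodebytes wraps its output at 76 characters), is shown below; encode_attachment is a hypothetical helper name.

```python
import base64


def encode_attachment(raw: bytes) -> bytes:
    """Base64-encode `raw` as CRLF-terminated lines of at most 76 characters."""
    wrapped = base64.encodebytes(raw)         # adds b"\n" after every 76 characters
    return wrapped.replace(b"\n", b"\r\n")    # MIME messages use CRLF line endings


if __name__ == "__main__":
    payload = encode_attachment(bytes(4096))  # stand-in for the real file contents
    longest = max(len(line) for line in payload.split(b"\r\n"))
    print(longest)
```

Building the message with the email.mime classes would take care of this wrapping as well; the helper only shows what the resulting lines look like.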

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread jader fabiano
New submission from jader fabiano: When I use email.parser to extract the headers of a MIME message that has attachments, the process consumes 100% of the CPU, and it can take up to four minutes to finish parsing the headers. -- components: email messages: 218008 nosy: barry, jader.fabiano,

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread R. David Murray
R. David Murray added the comment: Can you provide more details on how to reproduce the problem, please? For example, a sample message and the sequence of python calls you use to parse it.

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread jader fabiano
jader fabiano added the comment: I am opening a file and passing the file object to Parser().parse(fp). This file has two attachments. Example:

    self.fileDescriptor( file, 'rb')
    headers = Parser().parse(self.fileDescriptor ) #Here the process starts to consume 100% of the

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread jader fabiano
jader fabiano added the comment: Sorry! The correct line is: self.fileDescriptor = open( file, 'rb')
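The snippet in this report appears to target Python 2. As a hedged sketch of the same call on Python 3, a file opened in binary mode is handed to BytesParser (Parser.parse is documented to take a text-mode file there), and a with block closes the file afterwards. The helper name parse_message_file is an assumption.

```python
from email.parser import BytesParser


def parse_message_file(path):
    """Parse the message stored at `path` and close the file afterwards."""
    with open(path, "rb") as fp:           # binary mode, as in the report
        return BytesParser().parse(fp)     # reads the file in chunks and feeds a FeedParser
```

For example, parse_message_file("message.eml")["Subject"] would return the subject header of a message saved under that (hypothetical) file name.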

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread R. David Murray
R. David Murray added the comment: We'll need the data file as well. This is going to be a data-dependent issue. With a 12MB body, I'm guessing there's some decoding pathology involved, which may or may not have been already fixed in python3. To confirm this you could use HeaderParser

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread jader fabiano
jader fabiano added the comment: No, the file is 12 MB because it has attachments. Here is an example. You can use a file like this:

    Date: Tue, May 10:27:17 6 -0300 (BRT)
    From: em...@email.com.br
    MIME-Version: 1.0
    To: exam...@example.com
    Subject:example
    Content-Type: multipart/mixed;

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread R. David Murray
R. David Murray added the comment: Sorry, I was using RFC-speak. A message is divided into 'headers' and 'body', and all of the attachments are part of the body in RFC terms. But think of it as 'initial headers' and 'everything else'. Please either attach the full file, and/or try your

[issue21448] Email Parser use 100% CPU

2014-05-06 Thread R. David Murray
R. David Murray added the comment: Also to clarify: HeaderParser will *also* read the entire message, it just won't look for MIME attachments in the 'everything else', it will just treat the 'everything else' as arbitrary data and record it as the payload of the top level Message object.
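The difference described here can be observed on a small message. In the sketch below, the two-part message is made up for illustration; Parser splits the body into sub-messages, while HeaderParser keeps everything after the initial headers as a single string payload.

```python
from email.parser import HeaderParser, Parser

# Hypothetical two-part message used only for illustration.
RAW = (
    "From: sender@example.com\n"
    "Subject: example\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XYZ--\n"
)

full = Parser().parsestr(RAW)
headers_only = HeaderParser().parsestr(RAW)

if __name__ == "__main__":
    print(full.is_multipart(), len(full.get_payload()))   # body split into parts
    print(headers_only.is_multipart())                    # body left as one string
    print(headers_only["Subject"])
```

Both parsers still read the whole input, so HeaderParser is a way to rule out MIME-structure handling as the cause of a slowdown, not a way to avoid reading the attachments.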