Re: Parsing Nested List
On Sunday, February 4, 2018 at 5:32:51 PM UTC-6, Stanley Denman wrote:
> On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using pyparsing module. This is the
> > result I get. What in the world is 'expandtabs' and why is that making a
> > difference to my parse attempt?
> >
> > Python Code
> >
> >     import PyPDF2, pyparsing
> >     from pyparsing import Word, alphas, nums
> >     pdfFileObj = open('x.pdf', 'rb')
> >     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
> >     List = pdfReader.getOutlines()
> >     myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
> >     myparser.parseString(List)
> >
> > This is the error I get:
> >
> >     Traceback (most recent call last):
> >       File "", line 1, in
> >         myparser.parseString(List)
> >       File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
> >         instring = instring.expandtabs()
> >     AttributeError: 'list' object has no attribute 'expandtabs'
> >
> > Thanks so much, not getting any helpful responses from
> > https://python-forum.io.

I have found that I can use the index values in the list to print out the
section I need. So print(MyList[7]) gets me to Section F, which is the one I
want. print(MyList[9][1]), for example, gives me a string that is the
bookmark entry for Exhibit 1F. But these index values would presumably be
different for each PDF file: there may not always be Sections A-E, but there
will always be a Section F. In other words, the index values that get me to
the right section would be different in each PDF file.

--
https://mail.python.org/mailman/listinfo/python-list
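One way around the shifting index values is to look the section up by its title instead of its position. What follows is only a minimal sketch, not tested against a real PDF. It assumes the outline has the shape printed elsewhere in this thread: each bookmark behaves like a dict with a '/Title' key, and a bookmark's children, if any, are the list that immediately follows it. The helper name find_section and the "Section F" and "1F" titles in the sample data are made up for illustration; plain dicts stand in for the objects that getOutlines() returns.

```python
def find_section(outline, prefix):
    """Return (bookmark, children) for the first top-level bookmark whose
    title starts with `prefix`, or (None, []) if there is none."""
    for index, item in enumerate(outline):
        if isinstance(item, dict) and item.get('/Title', '').startswith(prefix):
            # Children, when present, are the list right after the bookmark.
            following = outline[index + 1] if index + 1 < len(outline) else None
            children = following if isinstance(following, list) else []
            return item, children
    return None, []


# Stand-in data: plain dicts instead of real PyPDF2 objects.
# The Section F and 1F titles are placeholders, not taken from a real file.
outline = [
    {'/Title': 'Section A. Payment Documents/Decisions'},
    [{'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)'}],
    {'/Title': 'Section F. (placeholder title)'},
    [{'/Title': '1F: (placeholder exhibit)'}, [{'/Title': '1F (Page 1 of 1)'}]],
]

section, exhibits = find_section(outline, 'Section F')
# Keep only the exhibit bookmarks, skipping the nested per-page lists.
exhibit_titles = [entry['/Title'] for entry in exhibits if isinstance(entry, dict)]
print(section['/Title'])
print(exhibit_titles)
```

Because the lookup keys on the title text, it should not matter whether Sections A-E are present in a given file, as long as the Section F bookmark title really does start with "Section F".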
Re: Parsing Nested List
On Sunday, February 4, 2018 at 5:06:26 PM UTC-6, Steven D'Aprano wrote:
> On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:
>
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using pyparsing module.
>
> pyparsing parses strings, not lists.
>
> I fear that you have completely misunderstood what pyparsing does: it
> isn't a general-purpose parser of arbitrary Python objects like lists.
> Like most parsers (actually, all parsers that I know of...) it takes text
> as input and produces some sort of machine representation:
>
> https://en.wikipedia.org/wiki/Parsing#Computer_languages
>
> So your code is not working because you are calling parseString() with a
> list argument:
>
>     myparser.parseString(List)
>
> The name of the function, parseString(), should have been a hint that it
> requires a *string* as argument.
>
> You have generated an outline:
>
>     List = pdfReader.getOutlines()
>
> but do you know what the format of that list is? I'm going to assume that
> it looks something like this:
>
>     ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]
>
> since that matches the template you gave to pyparsing. Notice that:
>
> - words are separated by spaces;
> - the first word is any arbitrary word, made up of just letters;
> - followed by EXACTLY two digits;
> - followed by the word "of";
> - followed by EXACTLY two digits.
>
> Furthermore, I'm assuming it is a simple, non-nested list. If that is not
> the case, you will need to explain precisely what the format of the
> outline actually is.
>
> To parse this list is simple and pyparsing is not required:
>
>     for item in List:
>         words = item.split()
>         if len(words) != 4:
>             raise ValueError('bad input data: %r' % item)
>         first, number, x, total = words
>         number = int(number)
>         assert x == 'of'
>         total = int(total)
>         print(first, number, total)
>
> Hope this helps.
>
> (Please keep any replies on the list.)
>
> --
> Steve

Thank you so much Steve. I do seem to be barking up the wrong tree. The
result of running getOutlines() is indeed a nested list: it is the PDF's
bookmarks. There are 3 levels. Level 1 is the sections, from A to F. Within
a section there are exhibits, so in Section A we have exhibits 1A to nA.
Finally, there are bookmarks for the individual pages in an exhibit. So we
have this for Section A:

[{'/Title': 'Section A. Payment Documents/Decisions', '/Page': IndirectObject(1, 0), '/Type': '/FitB'},
 [{'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)', '/Page': IndirectObject(1, 0), '/Type': '/FitB'},
  [{'/Title': '1A (Page 1 of 1)', '/Page': IndirectObject(1, 0), '/Type': '/FitB'}],
  {'/Title': '2A: Disability Determination Explanation (DDE) Dec. Dt.: 05/27/2016 (10 pages)', '/Page': IndirectObject(6, 0), '/Type': '/FitB'},
  [{'/Title': '2A (Page 1 of 10)', '/Page': IndirectObject(6, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 2 of 10)', '/Page': IndirectObject(10, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 3 of 10)', '/Page': IndirectObject(14, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 4 of 10)', '/Page': IndirectObject(18, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 5 of 10)', '/Page': IndirectObject(22, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 6 of 10)', '/Page': IndirectObject(26, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 7 of 10)', '/Page': IndirectObject(30, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 8 of 10)', '/Page': IndirectObject(34, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 9 of 10)', '/Page': IndirectObject(38, 0), '/Type': '/FitB'},
   {'/Title': '2A (Page 10 of 10)', '/Page': IndirectObject(42, 0), '/Type': '/FitB'}],
  {'/Title': '3A: ALJ Hearing Decision (ALJDEC) Dec. Dt.: 12/17/2012 (22 pages)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'},
  [{'/Title': '3A (Page 1 of 22)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 2 of 22)', '/Page': IndirectObject(51, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 3 of 22)', '/Page': IndirectObject(55, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 4 of 22)', '/Page': IndirectObject(59, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 5 of 22)', '/Page': IndirectObject(63, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 6 of 22)', '/Page': IndirectObject(67, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 7 of 22)', '/Page': IndirectObject(71, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 8 of 22)', '/Page': IndirectObject(75, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 9 of 22)', '/Page': IndirectObject(79, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 10 of 22)', '/Page': IndirectObject(83, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 11 of 22)', '/Page': IndirectObject(88, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 12 of 22)', '/Page': IndirectObject(92, 0), '/Type': '/FitB'},
   {'/Title': '3A (Page 13 of 22)', '/Page':
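A nested structure like the one above is usually easier to handle with a small recursive walk than with a text parser. The sketch below is an illustration only, under the assumption that every non-list item is a dict-like bookmark with a '/Title' key, as in the printed output. The names walk, page_bookmarks and PAGE_TITLE are invented here, and the sample data uses plain dicts with a few of the titles from the output rather than real PyPDF2 objects.

```python
import re

# Matches page-level titles such as '2A (Page 3 of 10)'.
PAGE_TITLE = re.compile(r'^(\d+[A-Z]) \(Page (\d+) of (\d+)\)$')


def walk(outline, depth=0):
    """Yield (depth, title) for every bookmark, descending into sub-lists."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding bookmark.
            yield from walk(item, depth + 1)
        else:
            yield depth, item['/Title']


def page_bookmarks(outline):
    """Return (exhibit, page, total) tuples for the page-level bookmarks."""
    found = []
    for _depth, title in walk(outline):
        match = PAGE_TITLE.match(title)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found


# Cut-down stand-in for the Section A output, using plain dicts.
sample = [
    {'/Title': 'Section A. Payment Documents/Decisions'},
    [
        {'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)'},
        [{'/Title': '1A (Page 1 of 1)'}],
        {'/Title': '2A: Disability Determination Explanation (DDE) Dec. Dt.: 05/27/2016 (10 pages)'},
        [{'/Title': '2A (Page 1 of 10)'}, {'/Title': '2A (Page 2 of 10)'}],
    ],
]

for depth, title in walk(sample):
    print('    ' * depth + title)
print(page_bookmarks(sample))
```

In this layout depth 0 is the section, depth 1 the exhibits, and depth 2 the individual pages, which matches the three levels described above.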
Re: Parsing Nested List
On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> I am trying to parse a Python nested list that is the result of the
> getOutlines() function of module PyPDF2 using pyparsing module. This is the
> result I get. What in the world is 'expandtabs' and why is that making a
> difference to my parse attempt?
>
> Python Code
>
>     import PyPDF2, pyparsing
>     from pyparsing import Word, alphas, nums
>     pdfFileObj = open('x.pdf', 'rb')
>     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>     List = pdfReader.getOutlines()
>     myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
>     myparser.parseString(List)
>
> This is the error I get:
>
>     Traceback (most recent call last):
>       File "", line 1, in
>         myparser.parseString(List)
>       File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
>         instring = instring.expandtabs()
>     AttributeError: 'list' object has no attribute 'expandtabs'
>
> Thanks so much, not getting any helpful responses from
> https://python-forum.io.
Re: Parsing Nested List
On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:

> I am trying to parse a Python nested list that is the result of the
> getOutlines() function of module PyPDF2 using pyparsing module.

pyparsing parses strings, not lists.

I fear that you have completely misunderstood what pyparsing does: it
isn't a general-purpose parser of arbitrary Python objects like lists.
Like most parsers (actually, all parsers that I know of...) it takes text
as input and produces some sort of machine representation:

https://en.wikipedia.org/wiki/Parsing#Computer_languages

So your code is not working because you are calling parseString() with a
list argument:

    myparser.parseString(List)

The name of the function, parseString(), should have been a hint that it
requires a *string* as argument.

You have generated an outline:

    List = pdfReader.getOutlines()

but do you know what the format of that list is? I'm going to assume that
it looks something like this:

    ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]

since that matches the template you gave to pyparsing. Notice that:

- words are separated by spaces;
- the first word is any arbitrary word, made up of just letters;
- followed by EXACTLY two digits;
- followed by the word "of";
- followed by EXACTLY two digits.

Furthermore, I'm assuming it is a simple, non-nested list. If that is not
the case, you will need to explain precisely what the format of the
outline actually is.

To parse this list is simple and pyparsing is not required:

    for item in List:
        words = item.split()
        if len(words) != 4:
            raise ValueError('bad input data: %r' % item)
        first, number, x, total = words
        number = int(number)
        assert x == 'of'
        total = int(total)
        print(first, number, total)

Hope this helps.

(Please keep any replies on the list.)

--
Steve
Re: Parsing Nested List
On Mon, Feb 5, 2018 at 9:26 AM, Stanley Denman wrote:
> I am trying to parse a Python nested list that is the result of the
> getOutlines() function of module PyPDF2 using pyparsing module. This is the
> result I get. What in the world is 'expandtabs' and why is that making a
> difference to my parse attempt?
>
> Python Code
>
>     import PyPDF2, pyparsing
>     from pyparsing import Word, alphas, nums
>     pdfFileObj = open('x.pdf', 'rb')
>     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>     List = pdfReader.getOutlines()
>     myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
>     myparser.parseString(List)
>
> This is the error I get:
>
>     Traceback (most recent call last):
>       File "", line 1, in
>         myparser.parseString(List)
>       File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
>         instring = instring.expandtabs()
>     AttributeError: 'list' object has no attribute 'expandtabs'
>
> Thanks so much, not getting any helpful responses from
> https://python-forum.io.

By the look of this code, it's expecting a string. (The variable name
"instring" is suggestive of this, and strings DO have an expandtabs
method.) You're calling a method named "parseString", and presumably
giving it a list.

I don't know what you mean by "nested" though. Maybe you want to iterate
over the list and parse each of the strings in it? More info about what
you're trying to do here would help.

ChrisA
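To make the "iterate over the list and parse each of the strings" idea concrete, here is a rough sketch that uses only the standard library. The regular expression is a loose stand-in for the pyparsing grammar in the original post, not an exact equivalent of it. The names parse_entry and parse_all are hypothetical, and the sample strings follow the flat format that was assumed earlier in the thread rather than real getOutlines() output.

```python
import re

# Loose stand-in for:
#   Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
ENTRY = re.compile(r'\s*([A-Za-z]+)\s+(\d{2})\s+of\s+(\d{2})\b')


def parse_entry(text):
    """Parse one string such as 'ABCD 01 of 99' into (word, number, total)."""
    if not isinstance(text, str):
        # This is the mistake in the original call: a list was passed
        # where a single string was expected.
        raise TypeError('expected a string, got %s' % type(text).__name__)
    match = ENTRY.match(text)
    if match is None:
        raise ValueError('bad input data: %r' % text)
    word, number, total = match.groups()
    return word, int(number), int(total)


def parse_all(items):
    """Apply parse_entry to each string in a flat list."""
    return [parse_entry(item) for item in items]


# Strings have an expandtabs method, lists do not, hence the AttributeError.
print(hasattr('a\tb', 'expandtabs'), hasattr([], 'expandtabs'))

items = ['ABCD 01 of 99', 'EFGH 02 of 99']
print(parse_all(items))
```

The same per-item loop would apply with pyparsing itself: call the parser once for each string instead of once for the whole list.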