Re: Parsing Nested List

2018-02-04 Thread Stanley Denman
On Sunday, February 4, 2018 at 5:32:51 PM UTC-6, Stanley Denman wrote:
> On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> > I am trying to parse a Python nested list that is the result of the 
> > getOutlines() function of module PyPDF2 using the pyparsing module. This is 
> > the result I get. What in the world is 'expandtabs', and why is it making a 
> > difference to my parse attempt?
> > 
> > Python code:
> > 
> > import PyPDF2, pyparsing
> > from pyparsing import Word, alphas, nums
> > pdfFileObj = open('x.pdf', 'rb')
> > pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
> > List = pdfReader.getOutlines()
> > myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
> > myparser.parseString(List)
> > 
> > This is the error I get:
> > 
> > Traceback (most recent call last):
> >   File "", line 1, in 
> > myparser.parseString(List)
> >   File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
> > instring = instring.expandtabs()
> > AttributeError: 'list' object has no attribute 'expandtabs'
> > 
> > Thanks so much, not getting any helpful responses from 
> > https://python-forum.io.

I have found that I can use index values into the list to print out the 
section I need.  So print(MyList[7]) gets me to Section F, which is the one I 
want, and print(MyList[9][1]), for example, gives me a string that is the 
bookmark entry for Exhibit 1F.  But these index values would presumably be 
different for each PDF file: there may not always be Sections A-E, but there 
will always be a Section F.
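
One way around the shifting index values is to look the section up by its 
title rather than by its position. This is only a sketch: it assumes the 
outline is a list in which each bookmark is a dict with a '/Title' key, and 
that a bookmark's children are the list that immediately follows it. The 
find_section helper and the sample titles are made up for illustration (plain 
dicts stand in for the PyPDF2 objects), so check them against what 
getOutlines() really returns.

```python
def find_section(outline, prefix):
    """Return (entry, children) for the first top-level bookmark whose
    title starts with prefix, or (None, []) if there is no such bookmark."""
    for i, item in enumerate(outline):
        if isinstance(item, dict) and item.get('/Title', '').startswith(prefix):
            # Assumption: the children, if any, are the list right after the entry.
            following = outline[i + 1] if i + 1 < len(outline) else None
            children = following if isinstance(following, list) else []
            return item, children
    return None, []


# Hypothetical stand-in for the getOutlines() result; the titles are invented.
sample = [
    {'/Title': 'Section A.  Payment Documents/Decisions'},
    [{'/Title': '1A:  Disability Determination Transmittal'}],
    {'/Title': 'Section F.  Medical Records'},
    [{'/Title': '1F:  Office Treatment Records'},
     [{'/Title': '1F (Page 1 of 3)'}]],
]

entry, exhibits = find_section(sample, 'Section F')
print(entry['/Title'])
print(exhibits[0]['/Title'])
```

With a real file the call would be find_section(MyList, 'Section F'), whatever 
index Section F happens to land on.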
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Parsing Nested List

2018-02-04 Thread Stanley Denman
On Sunday, February 4, 2018 at 5:06:26 PM UTC-6, Steven D'Aprano wrote:
> On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:
> 
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using pyparsing module.
> 
> pyparsing parses strings, not lists.
> 
> I fear that you have completely misunderstood what pyparsing does: it 
> isn't a general-purpose parser of arbitrary Python objects like lists. 
> Like most parsers (actually, all parsers that I know of...) it takes text 
> as input and produces some sort of machine representation:
> 
> https://en.wikipedia.org/wiki/Parsing#Computer_languages
> 
> 
> So your code is not working because you are calling parseString() with a 
> list argument:
> 
> myparser.parseString(List)
> 
> 
> The name of the function, parseString(), should have been a hint that it 
> requires a *string* as argument.
> 
> You have generated an outline:
> 
> List = pdfReader.getOutlines()
> 
> but do you know what the format of that list is? I'm going to assume that 
> it looks something like this:
> 
> ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]
> 
> since that matches the template you gave to pyparsing. Notice that:
> 
> - words are separated by spaces;
> 
> - the first word is any arbitrary word, made up of just letters;
> 
> - followed by EXACTLY two digits;
> 
> - followed by the word "of";
> 
> - followed by EXACTLY two digits.
> 
> Furthermore, I'm assuming it is a simple, non-nested list. If that is not 
> the case, you will need to explain precisely what the format of the 
> outline actually is.
> 
> To parse this list is simple and pyparsing is not required:
> 
> for item in List:
>     words = item.split()
>     if len(words) != 4:
>         raise ValueError('bad input data: %r' % item)
>     first, number, x, total = words
>     number = int(number)
>     assert x == 'of'
>     total = int(total)
>     print(first, number, total)
> 
> 
> 
> 
> Hope this helps.
> 
> (Please keep any replies on the list.)
> 
> 
> 
> -- 
> Steve

Thank you so much, Steve.  I do seem to be barking up the wrong tree.  The 
result of running getOutlines() is indeed a nested list: it is the PDF's 
bookmarks.  There are 3 levels. Level 1 is the sections, A through F. Within 
each section there are exhibits, so in Section A we have exhibits 1A to nA. 
Finally, there are bookmarks for the individual pages in an exhibit.  So we 
have this for Section A:

[{'/Title': 'Section A.  Payment Documents/Decisions', '/Page': 
IndirectObject(1, 0), '/Type': '/FitB'}, [{'/Title': '1A:  Disability 
Determination Transmittal (831) Dec. Dt.:  05/27/2016 (1 page)', '/Page': 
IndirectObject(1, 0), '/Type': '/FitB'}, [{'/Title': '1A (Page 1 of 1)', 
'/Page': IndirectObject(1, 0), '/Type': '/FitB'}], {'/Title': '2A:  Disability 
Determination Explanation (DDE) Dec. Dt.:  05/27/2016 (10 pages)', '/Page': 
IndirectObject(6, 0), '/Type': '/FitB'}, [{'/Title': '2A (Page 1 of 10)', 
'/Page': IndirectObject(6, 0), '/Type': '/FitB'}, {'/Title': '2A (Page 2 of 
10)', '/Page': IndirectObject(10, 0), '/Type': '/FitB'}, {'/Title': '2A (Page 3 
of 10)', '/Page': IndirectObject(14, 0), '/Type': '/FitB'}, {'/Title': '2A 
(Page 4 of 10)', '/Page': IndirectObject(18, 0), '/Type': '/FitB'}, {'/Title': 
'2A (Page 5 of 10)', '/Page': IndirectObject(22, 0), '/Type': '/FitB'}, 
{'/Title': '2A (Page 6 of 10)', '/Page': IndirectObject(26, 0), '/Type': 
'/FitB'}, {'/Title': '2A (Page 7 of 10)', '/Page': IndirectObject(30, 0), 
'/Type': '/FitB'}, {'/Title': '2A 
(Page 8 of 10)', '/Page': IndirectObject(34, 0), '/Type': '/FitB'}, {'/Title': 
'2A (Page 9 of 10)', '/Page': IndirectObject(38, 0), '/Type': '/FitB'}, 
{'/Title': '2A (Page 10 of 10)', '/Page': IndirectObject(42, 0), '/Type': 
'/FitB'}], {'/Title': '3A:  ALJ Hearing Decision (ALJDEC) Dec. Dt.:  12/17/2012 
(22 pages)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'}, [{'/Title': '3A 
(Page 1 of 22)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'}, {'/Title': 
'3A (Page 2 of 22)', '/Page': IndirectObject(51, 0), '/Type': '/FitB'}, 
{'/Title': '3A (Page 3 of 22)', '/Page': IndirectObject(55, 0), '/Type': 
'/FitB'}, {'/Title': '3A (Page 4 of 22)', '/Page': IndirectObject(59, 0), 
'/Type': '/FitB'}, {'/Title': '3A (Page 5 of 22)', '/Page': IndirectObject(63, 
0), '/Type': '/FitB'}, {'/Title': '3A (Page 6 of 22)', '/Page': 
IndirectObject(67, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 7 of 22)', 
'/Page': IndirectObject(71, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 8 of 22)', '/Page': 
IndirectObject(75, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 9 of 22)', 
'/Page': IndirectObject(79, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 10 of 
22)', '/Page': IndirectObject(83, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 
11 of 22)', '/Page': IndirectObject(88, 0), '/Type': '/FitB'}, {'/Title': '3A 
(Page 12 of 22)', '/Page': IndirectObject(92, 0), '/Type': '/FitB'}, {'/Title': 
'3A (Page 13 of 22)', '/Page': 
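
The three levels described above can be walked with a short recursive 
function. This is a minimal sketch, assuming that every bookmark is a dict 
with a '/Title' key and that every nested list holds the children of the 
bookmark just before it. The walk helper is hypothetical, and the sample is 
the start of the Section A data with only the '/Title' keys kept.

```python
def walk(outline, depth=0):
    """Yield (depth, title) for every bookmark in a nested outline.

    A nested list is treated as one level deeper than the list containing it.
    """
    for item in outline:
        if isinstance(item, list):
            yield from walk(item, depth + 1)
        else:
            yield depth, item['/Title']


# Trimmed sample: section -> exhibit -> page.
sample = [
    {'/Title': 'Section A.  Payment Documents/Decisions'},
    [
        {'/Title': '1A:  Disability Determination Transmittal (831)'},
        [{'/Title': '1A (Page 1 of 1)'}],
    ],
]

result = list(walk(sample))
for depth, title in result:
    print('    ' * depth + title)
```

Depth 0 is a section, depth 1 an exhibit and depth 2 a page, so filtering on 
the depth picks out one level without relying on fixed index values.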

Re: Parsing Nested List

2018-02-04 Thread Steven D'Aprano
On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:

> I am trying to parse a Python nested list that is the result of the
> getOutlines() function of module PyPDF2 using pyparsing module.

pyparsing parses strings, not lists.

I fear that you have completely misunderstood what pyparsing does: it 
isn't a general-purpose parser of arbitrary Python objects like lists. 
Like most parsers (actually, all parsers that I know of...) it takes text 
as input and produces some sort of machine representation:

https://en.wikipedia.org/wiki/Parsing#Computer_languages


So your code is not working because you are calling parseString() with a 
list argument:

myparser.parseString(List)


The name of the function, parseString(), should have been a hint that it 
requires a *string* as argument.

You have generated an outline:

List = pdfReader.getOutlines()

but do you know what the format of that list is? I'm going to assume that 
it looks something like this:

['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]

since that matches the template you gave to pyparsing. Notice that:

- words are separated by spaces;

- the first word is any arbitrary word, made up of just letters;

- followed by EXACTLY two digits;

- followed by the word "of";

- followed by EXACTLY two digits.

Furthermore, I'm assuming it is a simple, non-nested list. If that is not 
the case, you will need to explain precisely what the format of the 
outline actually is.

To parse this list is simple and pyparsing is not required:

for item in List:
    words = item.split()
    if len(words) != 4:
        raise ValueError('bad input data: %r' % item)
    first, number, x, total = words
    number = int(number)
    assert x == 'of'
    total = int(total)
    print(first, number, total)
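
A regular-expression variant of the loop above is sketched here for anyone 
who also wants the "EXACTLY two digits" rules enforced, which split() alone 
does not check. It uses the same assumed input format. PATTERN and parse_item 
are illustrative names, and the pattern is only an approximation of the 
pyparsing template, not an exact equivalent.

```python
import re

# One word of ASCII letters, exactly two digits, the word "of",
# exactly two digits, separated by whitespace.
PATTERN = re.compile(r'([A-Za-z]+)\s+(\d{2})\s+of\s+(\d{2})')


def parse_item(item):
    """Return (first, number, total) or raise ValueError on bad input."""
    match = PATTERN.fullmatch(item)
    if match is None:
        raise ValueError('bad input data: %r' % item)
    first, number, total = match.groups()
    return first, int(number), int(total)


items = ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99']
parsed = [parse_item(item) for item in items]
print(parsed)
```

An item such as 'ABCD 1 of 99' raises ValueError here because of the 
single-digit number, whereas the split() version would accept it.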




Hope this helps.

(Please keep any replies on the list.)



-- 
Steve



Re: Parsing Nested List

2018-02-04 Thread Chris Angelico
On Mon, Feb 5, 2018 at 9:26 AM, Stanley Denman wrote:
> I am trying to parse a Python nested list that is the result of the 
> getOutlines() function of module PyPDF2 using the pyparsing module. This is 
> the result I get. What in the world is 'expandtabs', and why is it making a 
> difference to my parse attempt?
>
> Python code:
>
> import PyPDF2, pyparsing
> from pyparsing import Word, alphas, nums
> pdfFileObj = open('x.pdf', 'rb')
> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
> List = pdfReader.getOutlines()
> myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
> myparser.parseString(List)
>
> This is the error I get:
>
> Traceback (most recent call last):
>   File "", line 1, in 
> myparser.parseString(List)
>   File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
> instring = instring.expandtabs()
> AttributeError: 'list' object has no attribute 'expandtabs'
>
> Thanks so much, not getting any helpful responses from 
> https://python-forum.io.

By the look of this code, it's expecting a string. (The variable name
"instring" is suggestive of this, and strings DO have an expandtabs
method.) You're calling a method named "parseString", and presumably
giving it a list. I don't know what you mean by "nested" though.
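
A quick way to see the difference, using only built-in types (the variable 
names are just for illustration): a str has expandtabs(), a list does not, 
and calling it on a list produces the same AttributeError as in the traceback.

```python
text = 'a\tb'
expanded = text.expandtabs(4)   # the tab becomes spaces up to the next tab stop
print(repr(expanded))

outline = ['ABCD 01 of 99']     # a list, even one holding strings, is not a str
try:
    outline.expandtabs()
except AttributeError as exc:
    message = str(exc)
    print(message)
```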

Maybe you want to iterate over the list and parse each of the strings in it?

More info about what you're trying to do here would help.

ChrisA