Re: Regex on a Dictionary

2018-02-13 Thread Stanley Denman
On Tuesday, February 13, 2018 at 9:41:14 AM UTC-6, Mark Lawrence wrote:
> On 13/02/18 13:11, Stanley Denman wrote:
> > I am trying to perform a regex on a "string" of text that Python's 
> > isinstance() tells me is a dictionary.  When I run the code I get the 
> > following error:
> > 
> > {'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), 
> > '/Type': '/FitB'}
> > 
> > Traceback (most recent call last):
> >   File "C:\Users\stand\Desktop\", line 9, in <module>
> >     x=MyRegex.findall(MyDict)
> > TypeError: expected string or bytes-like object
> > 
> > Here is the "string" of code I am working with:
> Please call it a dictionary as in the subject line, quite clearly it is 
> not a string in any way, shape or form.
> > 
> > {'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), 
> > '/Type': '/FitB'}
> > 
> > I want to grab the name "MILANI, JOHN C" and the last date (the one after 
> > the "-", here 05/28/2014) as a pair, such that if I have N strings like 
> > the above I will end up with N pairs of values (name and date).  Here is 
> > my code:
> >   
> > import PyPDF2,re
> > pdfFileObj=open('x.pdf','rb')
> > pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
> > Result=pdfReader.getOutlines()
> > MyDict=(Result[-1][0])
> > print(MyDict)
> > print(isinstance(MyDict,dict))
> > MyRegex=re.compile(r"MILANI,")
> > x=MyRegex.findall(MyDict)
> > print(x)
> > 
> > Thanks in advance for any help.
> > 
> Was the string methods solution that I gave a week or so ago so bad that 
> you still think that you need a regex to solve this?
> -- 
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
> Mark Lawrence

My apologies, Mark.  You took the time to give me the basis of a non-regex 
solution and I had not taken the time to fully review your answer. I did not 
understand it at first blush, but I think I do now.
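
For reference, the TypeError in the quoted traceback arises because findall() was handed the whole dictionary; a regex can only be applied to the string stored under '/Title'. The following is a minimal sketch, assuming every title follows the "Src.: NAME Tmt. Dt.: start - end" layout shown above. The pattern and the names TITLE_RE and name_and_last_date are illustrative, and the '/Page' value is a placeholder string so the snippet runs without PyPDF2.

```python
import re

# Assumed layout: "... Src.:  NAME Tmt. Dt.:  mm/dd/yyyy - mm/dd/yyyy (n pages)"
# The start date and hyphen are treated as optional (an assumption).
TITLE_RE = re.compile(
    r"Src\.:\s+(?P<name>.+?)\s+Tmt\. Dt\.:\s+"
    r"(?:\d{2}/\d{2}/\d{4}\s+-\s+)?(?P<last>\d{2}/\d{2}/\d{4})"
)


def name_and_last_date(bookmark):
    """Return (name, last_date) from a bookmark dict, or None if no match."""
    title = bookmark.get("/Title", "")  # the regex needs a str, not the dict
    match = TITLE_RE.search(title)
    if match is None:
        return None
    return match.group("name"), match.group("last")


MyDict = {
    "/Title": "1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  "
              "05/12/2014 - 05/28/2014 (9 pages)",
    "/Page": "IndirectObject(465, 0)",  # placeholder for the PyPDF2 object
    "/Type": "/FitB",
}

print(name_and_last_date(MyDict))
```

Calling the function on each bookmark dictionary in turn would give one (name, date) pair per matching title; titles without a "Src.:" part return None and can be skipped.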

Regex on a Dictionary

2018-02-13 Thread Stanley Denman
I am trying to perform a regex on a "string" of text that Python's isinstance() 
tells me is a dictionary.  When I run the code I get the following error:

{'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  05/12/2014 - 
05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}

Traceback (most recent call last):
  File "C:\Users\stand\Desktop\", line 9, in <module>
    x=MyRegex.findall(MyDict)
TypeError: expected string or bytes-like object

Here is the "string" of code I am working with:

{'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  05/12/2014 - 
05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}

I want to grab the name "MILANI, JOHN C" and the last date (the one after the 
"-", here 05/28/2014) as a pair, such that if I have N strings like the above 
I will end up with N pairs of values (name and date).  Here is my code:
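import PyPDF2,re
pdfFileObj=open('x.pdf','rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
Result=pdfReader.getOutlines()
MyDict=(Result[-1][0])
print(MyDict)
print(isinstance(MyDict,dict))
MyRegex=re.compile(r"MILANI,")
x=MyRegex.findall(MyDict)
print(x)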

Thanks in advance for any help.

Re: Extracting data from Python dictionary object (Posting On Python-List Prohibited)

2018-02-09 Thread Stanley Denman
On Friday, February 9, 2018 at 12:20:29 AM UTC-6, Lawrence D’Oliveiro wrote:
> On Friday, February 9, 2018 at 6:04:48 PM UTC+13, Stanley Denman wrote:
> > {'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), 
> > '/Type': '/FitB'}
> > 
> > What I want is the following to end up as fields on my Word template merge:
> > MedSourceFirstName: "John"
> > MedSourceLastName: "Milani"
> > MedSourceLastTreatment: "05/28/2014"
> > 
> > If I use keys() on the dictionary I get this:
> > ['/Title', '/Page', '/Type']
> > I was hoping "Src." and "Tmt. Dt." would be treated as keys.  Seems like 
> > the key/value pair of a dictionary would translate nicely to fieldname 
> > and fielddata ...
> It would, except that’s not how the information is represented in the PDF 
> file. Looks like what you want is all in the title string. So extracting it 
> will require some string manipulation. Do all the title strings follow the 
> same format? That should simplify the manipulations you need to do.

Thank you, Lawrence, for your response. Sounds like I am going to have to dig 
into regex to get at the text I want.
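
As a point of comparison, the string manipulation suggested above can also be done with plain str methods instead of a regex. This is only a sketch, assuming every title follows the "Src.:  LAST, FIRST Tmt. Dt.:  start - end (n pages)" layout of the example; merge_fields is a hypothetical helper name, and the field names are the ones from the question.

```python
def merge_fields(title):
    """Split a bookmark title into the three Word merge fields.

    Assumes the 'Src.:  LAST, FIRST [MIDDLE] Tmt. Dt.:  start - end (n pages)'
    layout; raises ValueError when either marker is missing.
    """
    _, src_marker, rest = title.partition("Src.:")
    name, dt_marker, dates = rest.partition("Tmt. Dt.:")
    if not src_marker or not dt_marker:
        raise ValueError("unexpected title format: %r" % title)

    # "MILANI, JOHN C" -> last name before the comma, given names after it.
    last, _, given = name.strip().partition(",")
    given_names = given.split()
    first = given_names[0] if given_names else ""

    # Drop the trailing "(9 pages)" and keep the date after the last hyphen.
    date_part = dates.split("(")[0]
    last_treatment = date_part.split("-")[-1].strip()

    return {
        "MedSourceFirstName": first.title(),
        "MedSourceLastName": last.strip().title(),
        "MedSourceLastTreatment": last_treatment,
    }


title = ("1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  "
         "05/12/2014 - 05/28/2014 (9 pages)")
print(merge_fields(title))
```

Note that str.title() is a rough way to fix the capitalisation; names such as "MCDONALD" would need special handling.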

Re: Extracting data from Python dictionary object

2018-02-09 Thread Stanley Denman
On Friday, February 9, 2018 at 1:08:27 AM UTC-6, dieter wrote:
> Stanley Denman <> writes:
> > I am new to Python. I am trying to extract text from the bookmarks in a PDF 
> > file that would provide the data for a Word template merge. I have gotten 
> > down to a string of text pulled out of the list object that I got from 
> > using PyPDF2 module.  I am stuck on how to get the data I need out of 
> > the string.  I am calling it a string, but Python is recognizing it as a 
> > dictionary object.  
> >
> > Here is the string: 
> >
> > {'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), 
> > '/Type': '/FitB'}
> >
> > What I want is the following to end up as fields on my Word template merge:
> > MedSourceFirstName: "John"
> > MedSourceLastName: "Milani"
> > MedSourceLastTreatment: "05/28/2014"
> >
> > If I use keys() on the dictionary I get this:
> > ['/Title', '/Page', '/Type']
> > I was hoping "Src." and "Tmt. Dt." would be treated as keys.  Seems like 
> > the key/value pair of a dictionary would translate nicely to fieldname 
> > and fielddata for a Word document merge.  Here is my code so far. 
> A Python "dict" is a mapping of keys to values. Its "keys" method
> gives you the keys (as you have used above).
> The subscription syntax ("[]"; e.g.
> "pdf_info['/Title']") allows you to access the value associated with
> the given key.
> In your case, relevant information is coded inside the values themselves.
> You will need to extract this information yourself. Python's "re" module
> might be of help (see the "library reference", for details).

Thanks for your response.  Nice to know I am at least on the right path.  
Sounds like I am going to have to dig into regex to get at the text I want.
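
To illustrate the distinction made in the quoted reply, here is a small sketch of keys versus values, using a plain dict with the same three keys. The '/Page' value is a placeholder string standing in for the PyPDF2 object, and pdf_info is the name used in the reply.

```python
pdf_info = {
    "/Title": "1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  "
              "05/12/2014 - 05/28/2014 (9 pages)",
    "/Page": "IndirectObject(465, 0)",  # placeholder string, not the real object
    "/Type": "/FitB",
}

print(list(pdf_info.keys()))   # the three bookmark keys
title = pdf_info["/Title"]     # subscription returns the associated value
print(type(title).__name__)    # the value here is an ordinary str
print("Src.:" in title)        # "Src.:" is text inside the value ...
print("Src.:" in pdf_info)     # ... not a key; "in" on a dict tests keys only
```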

Extracting data from Python dictionary object

2018-02-08 Thread Stanley Denman
I am new to Python. I am trying to extract text from the bookmarks in a PDF 
file that would provide the data for a Word template merge. I have gotten down 
to a string of text pulled out of the list object that I got from using the 
PyPDF2 module.  I am stuck on how to get the data I need out of the string.  I 
am calling it a string, but Python is recognizing it as a dictionary object.  

Here is the string: 

{'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  05/12/2014 - 
05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}

What I want is the following to end up as fields on my Word template merge:
MedSourceFirstName: "John"
MedSourceLastName: "Milani"
MedSourceLastTreatment: "05/28/2014"

If I use keys() on the dictionary I get this:
['/Title', '/Page', '/Type']
I was hoping "Src." and "Tmt. Dt." would be treated as keys.  Seems like the 
key/value pair of a dictionary would translate nicely to fieldname and 
fielddata for a Word document merge.  Here is my code so far. 

import PyPDF2

I get this output in Sublime Text:
{'/Title': '1F:  Progress Notes  Src.:  MILANI, JOHN C Tmt. Dt.:  05/12/2014 - 
05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}
['/Title', '/Page', '/Type']
[Finished in 0.4s]

Thank you in advance for any suggestions.

Re: Parsing Nested List

2018-02-04 Thread Stanley Denman
On Sunday, February 4, 2018 at 5:32:51 PM UTC-6, Stanley Denman wrote:
> On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> > I am trying to parse a Python nested list that is the result of the 
> > getOutlines() function of module PyPDF2 using the pyparsing module. This 
> > is the result I get. What in the world is 'expandtabs' and why is that 
> > making a difference to my parse attempt?
> > 
> > Python Code
> > import PyPDF2,pyparsing
> > from pyparsing import Word, alphas, nums
> > pdfFileObj=open('x.pdf','rb')
> > pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
> > List=pdfReader.getOutlines()
> > myparser = Word( alphas ) + Word(nums, exact=2) +"of" + Word(nums, exact=2)
> > myparser.parseString(List)
> > 
> > This is the error I get:
> > 
> > Traceback (most recent call last):
> >   File "<pyshell#23>", line 1, in <module>
> >     myparser.parseString(List)
> >   File "C:\python\lib\site-packages\", line 1620, in parseString
> >     instring = instring.expandtabs()
> > AttributeError: 'list' object has no attribute 'expandtabs'
> > 
> > Thanks so much, not getting any helpful responses from 
> >

I have found that I can use the index values in the list to print out the 
section I need.  So print(MyList[7]) gets me to Section F, which I want.  
print(MyList[9][1]), for example, gives me a string that is the bookmark entry 
for Exhibit 1F.  But this index value would presumably be different for each 
PDF file; that is, there may not always be Sections A-E, but there will always 
be a Section F. In other words, the index values that get me to the right 
section would be different in each PDF file.
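
One way around the shifting index is to look a section up by its title rather than by position. The sketch below assumes the layout seen elsewhere in this thread: each bookmark is a dict, and the list element that follows it holds its children. find_section is a hypothetical helper, and the sample outline (including the "Section F" title text) is made up and shortened for illustration.

```python
def find_section(outline, prefix):
    """Return the child list of the top-level entry whose title starts with prefix.

    Assumes each bookmark is a dict and that, when it has children, the next
    list element is the list of those children.  Returns None if no title
    matches, and an empty list if the match has no children.
    """
    for index, entry in enumerate(outline):
        if isinstance(entry, dict) and entry.get("/Title", "").startswith(prefix):
            following = outline[index + 1] if index + 1 < len(outline) else None
            return following if isinstance(following, list) else []
    return None


# Hypothetical, shortened outline; Sections B-E are deliberately absent.
sample = [
    {"/Title": "Section A.  Payment Documents/Decisions"},
    [{"/Title": "1A:  Disability Determination Transmittal (831)"}],
    {"/Title": "Section F.  Medical Records"},
    [{"/Title": "1F:  Progress Notes  Src.:  MILANI, JOHN C"}],
]

section_f = find_section(sample, "Section F")
print([child["/Title"] for child in section_f])
```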

Re: Parsing Nested List

2018-02-04 Thread Stanley Denman
On Sunday, February 4, 2018 at 5:06:26 PM UTC-6, Steven D'Aprano wrote:
> On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using the pyparsing module.
> pyparsing parses strings, not lists.
> I fear that you have completely misunderstood what pyparsing does: it 
> isn't a general-purpose parser of arbitrary Python objects like lists. 
> Like most parsers (actually, all parsers that I know of...) it takes text 
> as input and produces some sort of machine representation.
> So your code is not working because you are calling parseString() with a 
> list argument:
> myparser.parseString(List)
> The name of the function, parseString(), should have been a hint that it 
> requires a *string* as argument.
> You have generated an outline:
> List = pdfReader.getOutlines()
> but do you know what the format of that list is? I'm going to assume that 
> it looks something like this:
> ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]
> since that matches the template you gave to pyparsing. Notice that:
> - words are separated by spaces;
> - the first word is any arbitrary word, made up of just letters;
> - followed by EXACTLY two digits;
> - followed by the word "of";
> - followed by EXACTLY two digits.
> Furthermore, I'm assuming it is a simple, non-nested list. If that is not 
> the case, you will need to explain precisely what the format of the 
> outline actually is.
> To parse this list is simple and pyparsing is not required:
> for item in List:
>     words = item.split()
>     if len(words) != 4:
>         raise ValueError('bad input data: %r' % item)
>     first, number, x, total = words
>     number = int(number)
>     assert x == 'of'
>     total = int(total)
>     print(first, number, total)
> Hope this helps.
> (Please keep any replies on the list.)
> -- 
> Steve

Thank you so much Steve.  I do seem to be barking up the wrong tree.  The 
result of running getOutlines() is indeed a nested list: it is the PDF's 
bookmarks.  There are 3 levels: level 1 is the sections, A-F. Within a section 
there are exhibits, so in Section A we have exhibits 1A to nA. Finally there 
are bookmarks for individual pages in an exhibit.   So we have this for Section 
A:

[{'/Title': 'Section A.  Payment Documents/Decisions', '/Page': 
IndirectObject(1, 0), '/Type': '/FitB'}, [{'/Title': '1A:  Disability 
Determination Transmittal (831) Dec. Dt.:  05/27/2016 (1 page)', '/Page': 
IndirectObject(1, 0), '/Type': '/FitB'}, [{'/Title': '1A (Page 1 of 1)', 
'/Page': IndirectObject(1, 0), '/Type': '/FitB'}], {'/Title': '2A:  Disability 
Determination Explanation (DDE) Dec. Dt.:  05/27/2016 (10 pages)', '/Page': 
IndirectObject(6, 0), '/Type': '/FitB'}, [{'/Title': '2A (Page 1 of 10)', 
'/Page': IndirectObject(6, 0), '/Type': '/FitB'}, {'/Title': '2A (Page 2 of 
10)', '/Page': IndirectObject(10, 0), '/Type': '/FitB'}, {'/Title': '2A (Page 3 
of 10)', '/Page': IndirectObject(14, 0), '/Type': '/FitB'}, {'/Title': '2A 
(Page 4 of 10)', '/Page': IndirectObject(18, 0), '/Type': '/FitB'}, {'/Title': 
'2A (Page 5 of 10)', '/Page': IndirectObject(22, 0), '/Type': '/FitB'}, 
{'/Title': '2A (Page 6 of 10)', '/Page': IndirectObject(26, 0), '/Type': 
'/FitB'}, {'/Title': '2A (Page 7 of 10)', '/Page': IndirectObject(30, 0), 
'/Type': '/FitB'}, {'/Title': '2A 
(Page 8 of 10)', '/Page': IndirectObject(34, 0), '/Type': '/FitB'}, {'/Title': 
'2A (Page 9 of 10)', '/Page': IndirectObject(38, 0), '/Type': '/FitB'}, 
{'/Title': '2A (Page 10 of 10)', '/Page': IndirectObject(42, 0), '/Type': 
'/FitB'}], {'/Title': '3A:  ALJ Hearing Decision (ALJDEC) Dec. Dt.:  12/17/2012 
(22 pages)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'}, [{'/Title': '3A 
(Page 1 of 22)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'}, {'/Title': 
'3A (Page 2 of 22)', '/Page': IndirectObject(51, 0), '/Type': '/FitB'}, 
{'/Title': '3A (Page 3 of 22)', '/Page': IndirectObject(55, 0), '/Type': 
'/FitB'}, {'/Title': '3A (Page 4 of 22)', '/Page': IndirectObject(59, 0), 
'/Type': '/FitB'}, {'/Title': '3A (Page 5 of 22)', '/Page': IndirectObject(63, 
0), '/Type': '/FitB'}, {'/Title': '3A (Page 6 of 22)', '/Page': 
IndirectObject(67, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 7 of 22)', 
'/Page': IndirectObject(71, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 8 of 
22)', '/Page': 
IndirectObject(75, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 9 of 22)', 
'/Page': IndirectObject(79, 0), '/Type': '/FitB'}, {'/Title': '3A (Page 10 of 
22)', '/Page': IndirectObject(83, 0
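
A nested list of this shape can be flattened with a short recursive generator. This is a sketch only, assuming that every non-list element is a dict with a '/Title' key and that a list holds the children of the bookmark just before it; walk_outline is a hypothetical name, and the sample is a shortened copy of the Section A data above without the '/Page' and '/Type' entries.

```python
def walk_outline(outline, depth=0):
    """Yield (depth, title) for every bookmark in a nested outline list."""
    for entry in outline:
        if isinstance(entry, list):
            # A list holds the children of the bookmark just before it.
            yield from walk_outline(entry, depth + 1)
        else:
            yield depth, entry["/Title"]


sample = [
    {"/Title": "Section A.  Payment Documents/Decisions"},
    [
        {"/Title": "1A:  Disability Determination Transmittal (831) "
                   "Dec. Dt.:  05/27/2016 (1 page)"},
        [{"/Title": "1A (Page 1 of 1)"}],
    ],
]

for depth, title in walk_outline(sample):
    print("  " * depth + title)
```

With this layout, depth 0 corresponds to sections, depth 1 to exhibits, and depth 2 to individual pages, so the exhibit titles can be picked out by filtering on depth == 1.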

Re: Parsing Nested List

2018-02-04 Thread Stanley Denman
On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> I am trying to parse a Python nested list that is the result of the 
> getOutlines() function of module PyPDF2 using the pyparsing module. This is 
> the result I get. What in the world is 'expandtabs' and why is that making a 
> difference to my parse attempt?
> Python Code
> import PyPDF2,pyparsing
> from pyparsing import Word, alphas, nums
> pdfFileObj=open('x.pdf','rb')
> pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
> List=pdfReader.getOutlines()
> myparser = Word( alphas ) + Word(nums, exact=2) +"of" + Word(nums, exact=2)
> myparser.parseString(List)
> This is the error I get:
> Traceback (most recent call last):
>   File "<pyshell#23>", line 1, in <module>
>     myparser.parseString(List)
>   File "C:\python\lib\site-packages\", line 1620, in parseString
>     instring = instring.expandtabs()
> AttributeError: 'list' object has no attribute 'expandtabs'
> Thanks so much, not getting any helpful responses from 


Parsing Nested List

2018-02-04 Thread Stanley Denman
I am trying to parse a Python nested list that is the result of the 
getOutlines() function of module PyPDF2 using the pyparsing module. This is 
the result I get. What in the world is 'expandtabs' and why is that making a 
difference to my parse attempt?

Python Code
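import PyPDF2,pyparsing
from pyparsing import Word, alphas, nums
pdfFileObj=open('x.pdf','rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
List=pdfReader.getOutlines()
myparser = Word( alphas ) + Word(nums, exact=2) +"of" + Word(nums, exact=2)
myparser.parseString(List)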

This is the error I get:

Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    myparser.parseString(List)
  File "C:\python\lib\site-packages\", line 1620, in parseString
    instring = instring.expandtabs()
AttributeError: 'list' object has no attribute 'expandtabs'

Thanks so much, not getting any helpful responses from
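
On the 'expandtabs' question: expandtabs() is a method of Python strings that replaces tab characters with spaces. The traceback shows parseString() calling it on its argument, which fails because a list has no such method. A small standard-library sketch of the same failure, with a made-up list standing in for the getOutlines() result:

```python
text = "1A\t(Page 1 of 1)"
print(text.expandtabs(4))              # str has the method

# Stand-in for the list returned by getOutlines() (made-up contents).
outline = ["1A (Page 1 of 1)", "2A (Page 1 of 10)"]
print(hasattr(outline, "expandtabs"))  # a list does not

try:
    outline.expandtabs()
except AttributeError as exc:
    print(exc)                         # same kind of error as in the traceback
```

So the parser has to be given one string at a time, for example each title in turn, rather than the list as a whole.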