Re: More Help with python .find function

2011-01-08 Thread Keith Anthony
Thanks ... I am new to Python ...

Comparing the result of find with -1 fixes the bug ... some
of the endobj start in the first position ...

You're right about the lines ending in \n by accident,
EXCEPT that in PDF files items are separated by obj \n
and endobj\n


-- 
- --- -- -
Posted with NewsLeecher v4.0 Final
Web @ http://www.newsleecher.com/?usenet
--- -  -- -

-- 
http://mail.python.org/mailman/listinfo/python-list


More Help with python .find function

2011-01-07 Thread Keith Anthony
My previous question asked how to read a file into a structure
a line at a time.  Figured it out.  Now I'm trying to use .find
to separate out the PDF objects.  (See code)  PROBLEM/QUESTION:
My call to lines[i].find does NOT find all instances of endobj.
Any help available?  Any insights?

#!/usr/bin/python

inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
lines = inputfile.readlines()         # read file one line at a time

linestart = []                        # Starting address for each line
lineend = []                          # Ending address for each line
linetype = []

print len(lines)                      # print number of lines

i = 0                                 # define an iterator, i
addr = 0                              # and address pointer

while i < len(lines):                 # Go through each line
    linestart = linestart + [addr]
    length = len(lines[i])
    lineend = lineend + [addr + (length-1)]
    addr = addr + length
    i = i + 1

i = 0
while i < len(lines):                 # Initialize line types as normal
    linetype = linetype + ['normal']
    i = i + 1

i = 0
while i < len(lines):                 #
    if lines[i].find(' obj') > 0:
        linetype[i] = 'object'
        print "At address ",linestart[i],"object found at line ",i,": ", lines[i]
    if lines[i].find('endobj') > 0:
        linetype[i] = 'endobj'
        print "At address ",linestart[i],"endobj found at line ",i,": ", lines[i]
    i = i + 1





Re: More Help with python .find function

2011-01-07 Thread Chris Rebert
On Fri, Jan 7, 2011 at 8:43 PM, Keith Anthony <kanth...@woh.rr.com> wrote:
> My previous question asked how to read a file into a structure
> a line at a time.  Figured it out.  Now I'm trying to use .find
> to separate out the PDF objects.  (See code)  PROBLEM/QUESTION:
> My call to lines[i].find does NOT find all instances of endobj.
> Any help available?  Any insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time
>
> linestart = []                        # Starting address for each line
> lineend = []                          # Ending address for each line
> linetype = []
>
> print len(lines)                      # print number of lines
>
> i = 0                                 # define an iterator, i
> addr = 0                              # and address pointer
>
> while i < len(lines):                 # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1
>
> i = 0
> while i < len(lines):                 # Initialize line types as normal
>     linetype = linetype + ['normal']
>     i = i + 1
>
> i = 0
> while i < len(lines):                 #
>     if lines[i].find(' obj') > 0:
>         linetype[i] = 'object'
>         print "At address ",linestart[i],"object found at line ",i,": ", lines[i]
>     if lines[i].find('endobj') > 0:
>         linetype[i] = 'endobj'
>         print "At address ",linestart[i],"endobj found at line ",i,": ", lines[i]
>     i = i + 1

Your code can be simplified significantly.
In particular:
- Don't add single-element lists. Use the list.append() method instead.
- One seldom manually tracks counters like `i` in Python; use range()
or enumerate() instead.
- Lists have a multiply method which gives the concatenation of n
copies of the list.
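
The three suggestions above can be seen side by side in a small sketch. It uses Python 3 print syntax, unlike the Python 2 code in this thread, and the sample `lines` list is made up for illustration rather than read from a real PDF.

```python
# Hypothetical sample data standing in for inputfile.readlines().
lines = ["1 0 obj\n", "<< /Type /Catalog >>\n", "endobj\n"]

# list.append() instead of adding single-element lists.
linestart = []
addr = 0
for line in lines:
    linestart.append(addr)
    addr += len(line)

# List multiplication (the * operator) instead of a loop that
# adds 'normal' once per line.
linetype = ['normal'] * len(lines)

# enumerate() instead of a manually tracked counter.
for i, line in enumerate(lines):
    print(i, linestart[i], linetype[i], repr(line))
```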

Revised version (untested obviously):

inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
lines = inputfile.readlines()         # read file one line at a time

linestart = []                        # Starting address for each line
lineend = []                          # Ending address for each line
linetype = ['normal']*len(lines)

print len(lines)                      # print number of lines

addr = 0                              # and address pointer

for line in lines:                    # Go through each line
    linestart.append(addr)
    length = len(line)
    lineend.append(addr + (length-1))
    addr += length

for i, line in enumerate(lines):
    if line.find(' obj') > 0:
        linetype[i] = 'object'
        print "At address ",linestart[i],"object found at line ",i,": ", line
    if line.find('endobj') > 0:
        linetype[i] = 'endobj'
        print "At address ",linestart[i],"endobj found at line ",i,": ", line


As to the bug: I think you want != -1 rather than > 0 for your
conditionals; remember that Python list/string indices are 0-based.
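
A minimal sketch of why the comparison matters, written with Python 3 print syntax and a made-up line: str.find() returns the index of the first match, which is 0 when the match starts the line, and -1 when there is no match at all.

```python
line = "endobj\n"   # made-up line where the keyword starts at index 0

print(line.find('endobj'))   # index of the match at the start of the line
print(line.find(' obj'))     # no match in this line

# `> 0` treats a match at index 0 as "not found"; `!= -1` does not.
found_buggy = line.find('endobj') > 0
found_fixed = line.find('endobj') != -1

# When only presence matters, the `in` operator avoids the index entirely.
found_in = 'endobj' in line

print(found_buggy, found_fixed, found_in)
```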

Cheers,
Chris
--
http://blog.rebertia.com


Re: More Help with python .find function

2011-01-07 Thread Steven D'Aprano
On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

> My previous question asked how to read a file into a structure a line at
> a time.  Figured it out.  Now I'm trying to use .find to separate out
> the PDF objects.  (See code)  PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj. Any help available?  Any
> insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time

That's incorrect. readlines() reads the entire file in one go, and splits 
it into individual lines.
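
A small sketch of the difference, using Python 3 and an in-memory io.BytesIO object as a stand-in for the opened file (the byte content is made up): readlines() builds the complete list up front, while iterating over the file object hands back one line per step.

```python
import io

# In-memory stand-in for open('sample.pdf', 'rb'); the contents are made up.
data = b"1 0 obj\n<< >>\nendobj\n"

# readlines() consumes the whole stream and returns a list of lines.
all_lines = io.BytesIO(data).readlines()

# Iterating over the file object yields one line at a time.
one_by_one = []
for line in io.BytesIO(data):
    one_by_one.append(line)

print(len(all_lines), len(one_by_one))
```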


> linestart = []   # Starting address for each line
> lineend = []     # Ending address for each line
> linetype = []

*raises eyebrow*

How is an empty list a starting or ending address?

The only thing worse than no comments where you need them is misleading 
comments. A variable called linestart implies that it should be a 
position, e.g. linestart = 0. Or possibly a flag.


> print len(lines)   # print number of lines
>
> i = 0              # define an iterator, i

Again, 0 is not an iterator. 0 is a number.


> addr = 0                  # and address pointer
>
> while i < len(lines):     # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1

Complicated and confusing and not the way to do it in Python. Something 
like this is much simpler:


linetypes = []  # note plural
inputfile = open('sample.pdf','rb')  # Don't use file, use open.

for line_number, line in enumerate(inputfile):
    # Process one line at a time. No need for that nonsense with manually
    # tracked line numbers, enumerate() does that for us.
    # No need to initialise linetypes.
    status = 'normal'
    i = line.find(' obj')
    if i >= 0:
        print "Object found at offset %d in line %d" % (i, line_number)
        status = 'object'
    i = line.find('endobj')
    if i >= 0:
        print "endobj found at offset %d in line %d" % (i, line_number)
        if status == 'normal': status = 'endobj'
        else: status = 'object & endobj'  # both found on the one line
    linetypes.append(status)
    # What if obj or endobj exist more than once in a line?
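
One possible way to handle repeated keywords on a line is to call find() again with a start offset just past the previous match. The sketch below uses Python 3 syntax; the helper name find_all and the sample line are made up for illustration.

```python
def find_all(line, token):
    """Return every offset at which token occurs in line (hypothetical helper)."""
    offsets = []
    start = line.find(token)
    while start != -1:
        offsets.append(start)
        # Resume the search just past the previous match.
        start = line.find(token, start + 1)
    return offsets

sample = "endobj 2 0 obj << >> endobj\n"   # made-up line with repeats
print(find_all(sample, 'endobj'))
print(find_all(sample, ' obj'))
```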



One last thing... if PDF files are a binary format, what makes you think 
that they can be processed line-by-line? They may not have lines, except 
by accident.
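
If lines cannot be relied on, one alternative is to search the raw bytes directly. The sketch below (Python 3, made-up data mixing \r and \n line endings) collects byte offsets with re.finditer. It is only an illustration, not a PDF parser: the same byte sequences could also occur inside stream data, which this does not account for.

```python
import re

# Made-up fragment; a real file would come from open('sample.pdf', 'rb').read().
data = b"1 0 obj\r<< >>\rendobj\r2 0 obj\n<< >>\nendobj\n"

# Byte offsets of each "endobj" keyword, independent of line endings.
endobj_offsets = [m.start() for m in re.finditer(rb'\bendobj\b', data)]

# Byte offsets of each "N G obj" header (object number, generation number).
obj_offsets = [m.start() for m in re.finditer(rb'\b\d+ \d+ obj\b', data)]

print(obj_offsets, endobj_offsets)
```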


-- 
Steven