Re: More Help with python .find function
Thanks ... I am new to Python ... Comparing the result of find with -1 fixes the bug ... some of the endobj start in the first position ...

You're right about the lines ending in \n by accident, EXCEPT in PDF files items are separated by obj \n and endobj\n

--
Posted with NewsLeecher v4.0 Final
Web @ http://www.newsleecher.com/?usenet
--
http://mail.python.org/mailman/listinfo/python-list
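A minimal sketch of how the corrected comparison could be used to separate out the objects, assuming (as the post above says) that each object opens on a line containing " obj" and closes on a line containing "endobj". The function name object_spans and the sample lines are hypothetical, made up for illustration, and real PDF files can be laid out differently. The syntax is chosen to run on Python 3.

```python
def object_spans(lines):
    """Pair each line containing ' obj' with the next line containing 'endobj'.

    Returns a list of (first_line, last_line) index pairs.
    """
    spans = []
    open_at = None  # index of the line that opened the current object, if any
    for number, line in enumerate(lines):
        # find() returns -1 when there is no match, so compare against -1;
        # a match at index 0 is still a match.
        if open_at is None and line.find(' obj') != -1:
            open_at = number
        if open_at is not None and line.find('endobj') != -1:
            spans.append((open_at, number))
            open_at = None
    return spans


# Hypothetical stand-in for the lines of a small PDF file.
sample_lines = [
    "%PDF-1.4\n",
    "1 0 obj\n",
    "<< /Type /Catalog >>\n",
    "endobj\n",
    "2 0 obj\n",
    "(text)\n",
    "endobj\n",
]

spans = object_spans(sample_lines)
print(spans)  # [(1, 3), (4, 6)]
```

Note that both "endobj" lines start at index 0, which is exactly the case a "> 0" test would skip.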
More Help with python .find function
My previous question asked how to read a file into a structure a line at a time. Figured it out. Now I'm trying to use .find to separate out the PDF objects. (See code)

PROBLEM/QUESTION: My call to lines[i].find does NOT find all instances of endobj. Any help available? Any insights?

    #!/usr/bin/python

    inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
    lines = inputfile.readlines()         # read file one line at a time

    linestart = []                        # Starting address for each line
    lineend = []                          # Ending address for each line
    linetype = []

    print len(lines)                      # print number of lines

    i = 0                                 # define an iterator, i
    addr = 0                              # and address pointer

    while i < len(lines):                 # Go through each line
        linestart = linestart + [addr]
        length = len(lines[i])
        lineend = lineend + [addr + (length-1)]
        addr = addr + length
        i = i + 1

    i = 0
    while i < len(lines):                 # Initialize line types as normal
        linetype = linetype + ['normal']
        i = i + 1

    i = 0
    while i < len(lines):                 #
        if lines[i].find(' obj') > 0:
            linetype[i] = 'object'
            print "At address ",linestart[i],"object found at line ",i,": ", lines[i]
        if lines[i].find('endobj') > 0:
            linetype[i] = 'endobj'
            print "At address ",linestart[i],"endobj found at line ",i,": ", lines[i]
        i = i + 1
Re: More Help with python .find function
On Fri, Jan 7, 2011 at 8:43 PM, Keith Anthony kanth...@woh.rr.com wrote:
> My previous question asked how to read a file into a structure a line at
> a time. Figured it out. Now I'm trying to use .find to separate out the
> PDF objects. (See code)
>
> PROBLEM/QUESTION: My call to lines[i].find does NOT find all instances
> of endobj. Any help available? Any insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time
>
> linestart = []                        # Starting address for each line
> lineend = []                          # Ending address for each line
> linetype = []
>
> print len(lines)                      # print number of lines
>
> i = 0                                 # define an iterator, i
> addr = 0                              # and address pointer
>
> while i < len(lines):                 # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1
>
> i = 0
> while i < len(lines):                 # Initialize line types as normal
>     linetype = linetype + ['normal']
>     i = i + 1
>
> i = 0
> while i < len(lines):                 #
>     if lines[i].find(' obj') > 0:
>         linetype[i] = 'object'
>         print "At address ",linestart[i],"object found at line ",i,": ", lines[i]
>     if lines[i].find('endobj') > 0:
>         linetype[i] = 'endobj'
>         print "At address ",linestart[i],"endobj found at line ",i,": ", lines[i]
>     i = i + 1

Your code can be simplified significantly. In particular:
- Don't add single-element lists. Use the list.append() method instead.
- One seldom manually tracks counters like `i` in Python; use range() or enumerate() instead.
- Lists support multiplication by an integer (the * operator), which gives the concatenation of n copies of the list.
Revised version (untested obviously):

    inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
    lines = inputfile.readlines()         # read file one line at a time

    linestart = []                        # Starting address for each line
    lineend = []                          # Ending address for each line
    linetype = ['normal']*len(lines)

    print len(lines)                      # print number of lines

    addr = 0                              # and address pointer

    for line in lines:                    # Go through each line
        linestart.append(addr)
        length = len(line)
        lineend.append(addr + (length-1))
        addr += length

    for i, line in enumerate(lines):
        if line.find(' obj') > 0:
            linetype[i] = 'object'
            print "At address ",linestart[i],"object found at line ",i,": ", line
        if line.find('endobj') > 0:
            linetype[i] = 'endobj'
            print "At address ",linestart[i],"endobj found at line ",i,": ", line

As to the bug: I think you want != -1 rather than > 0 for your conditionals; remember that Python list/string indices are 0-based, so find() returns 0 for a match at the very start of the line and -1 only when there is no match.

Cheers,
Chris
--
http://blog.rebertia.com
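The point about find() and 0-based indices can be checked with a short sketch. The sample line and the variable names below are made up for illustration, and the print call uses Python 3 syntax.

```python
# str.find returns the index of the first match, or -1 when there is none.
line = "endobj\n"              # hypothetical line where the keyword starts at index 0

position = line.find("endobj")  # 0: found at the very start of the line
missing = line.find("stream")   # -1: not present at all

buggy_hit = position > 0        # False: a match at index 0 is wrongly ignored
fixed_hit = position != -1      # True: only -1 means "not found"
simple_hit = "endobj" in line   # True: membership test, no index needed

print(position, missing, buggy_hit, fixed_hit, simple_hit)
# 0 -1 False True True
```

When only a yes/no answer is needed, the `in` operator avoids the comparison altogether; find() is still the right tool when the offset itself matters.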
Re: More Help with python .find function
On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

> My previous question asked how to read a file into a structure a line at
> a time. Figured it out. Now I'm trying to use .find to separate out the
> PDF objects. (See code)
>
> PROBLEM/QUESTION: My call to lines[i].find does NOT find all instances
> of endobj. Any help available? Any insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time

That's incorrect. readlines() reads the entire file in one go, and splits it into individual lines.

> linestart = []                        # Starting address for each line
> lineend = []                          # Ending address for each line
> linetype = []

*raises eyebrow*

How is an empty list a starting or ending address? The only thing worse than no comments where you need them is misleading comments. A variable called linestart implies that it should be a position, e.g. linestart = 0. Or possibly a flag.

> print len(lines)                      # print number of lines
>
> i = 0                                 # define an iterator, i

Again, 0 is not an iterator. 0 is a number.

> addr = 0                              # and address pointer
>
> while i < len(lines):                 # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1

Complicated and confusing and not the way to do it in Python. Something like this is much simpler:

    linetypes = []  # note plural
    inputfile = open('sample.pdf','rb')  # Don't use file, use open.
    for line_number, line in enumerate(inputfile):
        # Process one line at a time. No need for that nonsense with manually
        # tracked line numbers, enumerate() does that for us.
        # No need to initialise linetypes.
        status = 'normal'
        i = line.find(' obj')
        if i >= 0:
            print "Object found at offset %d in line %d" % (i, line_number)
            status = 'object'
        i = line.find('endobj')
        if i >= 0:
            print "endobj found at offset %d in line %d" % (i, line_number)
            if status == 'normal':
                status = 'endobj'
            else:
                status = 'object endobj'  # both found on the one line
        linetypes.append(status)

    # What if obj or endobj exist more than once in a line?

One last thing... if PDF files are a binary format, what makes you think that they can be processed line-by-line? They may not have lines, except by accident.

--
Steven
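One way to sidestep both open questions above (several keywords on one line, and a binary file that may not have meaningful lines) is to search the raw bytes and collect every offset. The following is only a sketch under that assumption: find_all is a hypothetical helper, not something from the thread, and the sample bytes stand in for a real file that would be read with open('sample.pdf', 'rb').read(). It is written for Python 3, where the data is a bytes object.

```python
def find_all(data, keyword):
    """Return the offset of every non-overlapping occurrence of keyword in data."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    offsets = []
    start = data.find(keyword)
    while start != -1:  # -1 means no further match
        offsets.append(start)
        # Resume the search just past the match that was found.
        start = data.find(keyword, start + len(keyword))
    return offsets


# Hypothetical stand-in for the contents of a PDF file.
sample = b"1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"

obj_offsets = find_all(sample, b" obj")
endobj_offsets = find_all(sample, b"endobj")

print(obj_offsets)     # [3, 24]
print(endobj_offsets)  # [14, 35]
```

Because the offsets are positions in the whole file, no separate bookkeeping of line start addresses is needed.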