Hi, I am trying to convert .pdf files to .txt files. The script below is mostly taken from what I found searching on Google; it appears to be the approach most consistently recommended (http://code.activestate.com/recipes/577095-convert-pdf-to-plain-text/).
I am using Win 7, Python 2.7.1.

My code:

#pdf2txt.py
import sys
import pyPdf
import os

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + " \n"
    # Collapse whitespace
    # content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

def main():
    pdf = sys.argv[1]
    filedir,filename = os.path.split(pdf)
    nameonly = os.path.splitext(filename)
    newname = nameonly[0] + ".txt"
    outtxt = os.path.join(filedir,newname)
    f = open(outtxt,'w')
    f.write(getPDFContent(pdf))
    f.close()

main()
exit()

==============================================================

The program runs for a while and then dies while in one of the pyPdf functions. The trace is below.

Any insight into how to resolve this situation will be most appreciated.

Thank you,
Robert

==============================================================

The trace I get is:

decimal.InvalidOperation: Invalid literal for Decimal: '.'
File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 28, in <module>
  main()
File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 25, in main
  f.write(getPDFContent(pdf))
File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 13, in getPDFContent
  content += pdf.getPage(i).extractText() + " \n"
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1381, in extractText
  content = ContentStream(content, self.pdf)
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1464, in __init__
  self.__parseContentStream(stream)
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1503, in __parseContentStream
  operands.append(readObject(stream, None))
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 87, in readObject
  return NumberObject.readFromStream(stream)
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 234, in readFromStream
  return FloatObject(name)
File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 207, in __new__
  return decimal.Decimal.__new__(cls, str(value), context)
File "C:\Python27\Lib\decimal.py", line 548, in __new__
  "Invalid literal for Decimal: %r" % value)
File "C:\Python27\Lib\decimal.py", line 3844, in _raise_error
  raise error(explanation)

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
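
For anyone trying to reproduce this: the last frames of the trace suggest that pyPdf passed the string '.' to decimal.Decimal, which rejects it. The sketch below, using only the standard decimal module, shows that a lone '.' fails while '.5' and '5.' are accepted. It also includes a small helper for finding which pages trigger the error. The names parses_as_decimal, find_failing_pages and extract_page are hypothetical and are not part of pyPdf; this is a diagnostic aid under those assumptions, not a fix.

```python
import decimal


def parses_as_decimal(token):
    """Return True if decimal.Decimal accepts the token, False otherwise."""
    try:
        decimal.Decimal(token)
        return True
    except decimal.InvalidOperation:
        return False


def find_failing_pages(num_pages, extract_page):
    """Call extract_page(i) for every page index and record the failures.

    extract_page is a hypothetical callable supplied by the caller.
    Returns a list of (page_index, exception) pairs.
    """
    failures = []
    for i in range(num_pages):
        try:
            extract_page(i)
        except decimal.InvalidOperation as exc:
            # Keep going so that every problematic page is reported.
            failures.append((i, exc))
    return failures


if __name__ == "__main__":
    for token in ("0.5", ".5", "5.", "."):
        print("%r -> %s" % (token, parses_as_decimal(token)))
```

In the script above, the helper could presumably be called as find_failing_pages(pdf.getNumPages(), lambda i: pdf.getPage(i).extractText()) to see whether the error comes from one page or from many.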