Hi Chris,
thanks for the fast reply and all the recommendations, they helped me a lot!
As you recommended, I used the PDFMiner module to extract the text from the
PDF files, and then with file.xreadlines() I located the lines where my
keyword ('factors' in this case) appears.
So far I only extract the matching lines, but I'm wondering whether it's
possible to extract just the whole sentences in which my keyword is
located.
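For what it's worth, here is a minimal sketch of pulling whole sentences (rather than lines) out of the already-extracted text. The function name `sentences_with` and the sample string are made up for illustration, and the regex split on sentence-ending punctuation is deliberately naive:

```python
import re

def sentences_with(keyword, text):
    # Naive sentence split: break after '.', '!' or '?' followed by
    # whitespace. Abbreviations common in scientific PDFs (e.g. "et al.",
    # "Fig. 2") will trip this up, so treat it as a rough first pass.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s for s in sentences if keyword in s]

sample = ("Elevation is the most important determinant. "
          "Other factors include slope and aspect. "
          "Rainfall also matters.")
```

Calling `sentences_with('factors', sample)` then returns only the one sentence containing the keyword, instead of a raw line.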
I used the following script:
import os, subprocess

path = 'C:\\PDF'  # insert the path to the directory of interest here
dirList = os.listdir(path)
for fname in dirList:
    # os.path.splitext avoids the rstrip('.pdf') pitfall: rstrip strips
    # any trailing '.', 'p', 'd', 'f' characters, not the '.pdf' suffix
    output = os.path.splitext(fname)[0] + '.txt'
    subprocess.call(['C:\\Python26\\python.exe', 'pdf2txt.py',
                     '-o', output, os.path.join(path, fname)])
    print fname
    f = open(output)
    for line in f:
        if 'factors' in line:
            print line
    f.close()
---
Robert Pazur
Mobile : +421 948 001 705
Skype : ruegdeg
2011/5/6 Chris Rebert c...@rebertia.com
On Thu, May 5, 2011 at 2:26 PM, Robert Pazur pazurrob...@gmail.com
wrote:
Dear all,
I would like to access some text and count word occurrences, as follows.
I have lots of PDFs of scientific articles, and I want to see which words
usually occur together with, for example, "determinants".
As an example, an article contains the sentence "elevation is the most
important determinant".
How can I acquire the "elevation" string?
Of course, I don't know where in the article the sentence is located or
which particular word might be there.
Any suggestions?
Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
use something similar to n-grams on the extracted text, filtering out
those that don't contain "determinant(s)". Then just keep a word
frequency table for the remaining n-grams.
Not-quite-pseudo-code:
from collections import defaultdict, deque

N = 7  # length of n-grams to consider; tune as needed
buf = deque(maxlen=N)
targets = frozenset(('determinant', 'determinants'))
steps_until_gone = 0
word2freq = defaultdict(int)
for word in words_from_pdf:
    if word in targets:
        steps_until_gone = N
    buf.append(word)
    if steps_until_gone:
        for related_word in buf:
            if related_word not in targets:
                word2freq[related_word] += 1
        steps_until_gone -= 1
for count, word in sorted((v, k) for k, v in word2freq.iteritems()):
    print(word, ':', count)
Making this more efficient and less naive is left as an exercise to the
reader.
There may very well already be something similar but more
sophisticated in NLTK[4]; I've never used it, so I dunno.
[1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
[2]: http://pybrary.net/pyPdf/
[3]: http://www.reportlab.com/software/#pagecatcher
[4]: http://www.nltk.org/
Cheers,
Chris
--
http://rebertia.com