Re: access to some text string in PDFs

2011-05-06 Thread Robert Pazur
Hi Chris,

Thanks for the fast reply and all the recommendations; they help me a lot!
As you recommended, I used the PDFMiner module to extract the text from the
PDF files, and then with file.xreadlines() I located the lines where my
keyword ("factors" in this case) appears.
So far I extract just the lines, but I'm wondering whether it's possible to
extract only the whole sentences in which my keywords appear.

I used the following script:

import os, subprocess

path = "C:\\PDF"  # insert the path to the directory of interest here
dirList = os.listdir(path)
for fname in dirList:
    # os.listdir() returns bare names, so join with the directory path
    src = os.path.join(path, fname)
    # note: fname.rstrip(".pdf") would strip any trailing '.', 'p', 'd', 'f'
    # characters, so drop the extension with os.path.splitext() instead
    output = os.path.splitext(fname)[0] + ".txt"
    subprocess.call(["C:\\Python26\\python.exe", "pdf2txt.py", "-o", output, src])
    print fname
    file = open(output)
    for line in file.xreadlines():
        if "driving" in line:
            print line
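
For whole sentences rather than lines, one rough sketch (the sample text and
the sentences_with helper are invented for illustration; the regex split
assumes ordinary sentence punctuation and will mis-fire on abbreviations such
as "e.g."):

```python
import re

def sentences_with(keyword, text):
    # naive split after '.', '!' or '?' followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s for s in sentences if keyword in s]

text = "Elevation is a key factor. Slope matters too. Many factors interact!"
for sentence in sentences_with("factor", text):
    print(sentence)
```

This works in Python 2 and 3; for real articles a proper sentence tokenizer
(for example NLTK's punkt) handles abbreviations much better.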

---
Robert Pazur
Mobile : +421 948 001 705
Skype  : ruegdeg


2011/5/6 Chris Rebert c...@rebertia.com

 On Thu, May 5, 2011 at 2:26 PM, Robert Pazur pazurrob...@gmail.com
 wrote:
  Dear all,
  I would like to access some text and count the occurrences as follows:
  I have lots of PDFs of scientific articles and I want to preview which
  words are usually related with, for example, "determinants".
  As an example, an article contains the sentence "elevation is the most
  important determinant".
  How can I acquire the "elevation" string?
  Of course I don't know where the sentence is located in the article, or
  which particular word it could be.
  Any suggestions?

 Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
 use something similar to n-grams on the extracted text, filtering out
 those that don't contain "determinant(s)". Then just keep a word
 frequency table for the remaining n-grams.

 Not-quite-pseudo-code:

 from collections import defaultdict, deque

 N = 7  # length of n-grams to consider; tune as needed
 buf = deque(maxlen=N)
 targets = frozenset(("determinant", "determinants"))
 steps_until_gone = 0
 word2freq = defaultdict(int)
 for word in words_from_pdf:
     if word in targets:
         steps_until_gone = N
     buf.append(word)
     if steps_until_gone:
         for related_word in buf:
             if related_word not in targets:
                 word2freq[related_word] += 1
         steps_until_gone -= 1
 for count, word in sorted((v, k) for k, v in word2freq.iteritems()):
     print(word, ':', count)

 Making this more efficient and less naive is left as an exercise to the
 reader.
 There may very well already be something similar but more
 sophisticated in NLTK[4]; I've never used it, so I dunno.

 [1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
 [2]: http://pybrary.net/pyPdf/
 [3]: http://www.reportlab.com/software/#pagecatcher
 [4]: http://www.nltk.org/

 Cheers,
 Chris
 --
 http://rebertia.com

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: access to some text string in PDFs

2011-05-05 Thread Chris Rebert
On Thu, May 5, 2011 at 2:26 PM, Robert Pazur pazurrob...@gmail.com wrote:
 Dear all,
 I would like to access some text and count the occurrences as follows:
 I have lots of PDFs of scientific articles and I want to preview which
 words are usually related with, for example, "determinants".
 As an example, an article contains the sentence "elevation is the most
 important determinant".
 How can I acquire the "elevation" string?
 Of course I don't know where the sentence is located in the article, or
 which particular word it could be.
 Any suggestions?

Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
use something similar to n-grams on the extracted text, filtering out
those that don't contain "determinant(s)". Then just keep a word
frequency table for the remaining n-grams.

Not-quite-pseudo-code:

from collections import defaultdict, deque

N = 7  # length of n-grams to consider; tune as needed
buf = deque(maxlen=N)
targets = frozenset(("determinant", "determinants"))
steps_until_gone = 0
word2freq = defaultdict(int)
for word in words_from_pdf:
    if word in targets:
        steps_until_gone = N
    buf.append(word)
    if steps_until_gone:
        for related_word in buf:
            if related_word not in targets:
                word2freq[related_word] += 1
        steps_until_gone -= 1
for count, word in sorted((v, k) for k, v in word2freq.iteritems()):
    print(word, ':', count)
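
For a concrete check of the logic, here is a self-contained Python 3 run on a
toy word list (words_from_pdf below is invented for illustration, and
iteritems() is spelled items() in Python 3):

```python
from collections import defaultdict, deque

N = 7  # length of n-grams to consider; tune as needed
buf = deque(maxlen=N)
targets = frozenset(("determinant", "determinants"))
steps_until_gone = 0
word2freq = defaultdict(int)

# toy input standing in for the words extracted from a PDF
words_from_pdf = "elevation is the most important determinant of local climate".split()

for word in words_from_pdf:
    if word in targets:
        steps_until_gone = N  # keep counting for N more words after a hit
    buf.append(word)
    if steps_until_gone:
        for related_word in buf:
            if related_word not in targets:
                word2freq[related_word] += 1
        steps_until_gone -= 1

for count, word in sorted((v, k) for k, v in word2freq.items()):
    print(word, ':', count)
# e.g. word2freq["the"] == 4, word2freq["climate"] == 1; words that stay in
# the sliding buffer across more iterations accumulate higher counts
```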

Making this more efficient and less naive is left as an exercise to the reader.
There may very well already be something similar but more
sophisticated in NLTK[4]; I've never used it, so I dunno.

[1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
[2]: http://pybrary.net/pyPdf/
[3]: http://www.reportlab.com/software/#pagecatcher
[4]: http://www.nltk.org/

Cheers,
Chris
--
http://rebertia.com