Oh man, that is so awesome!
I imagine that there's a lot that people can do with tools like this. I
opened a documentation page for it here:
https://trisquel.info/en/wiki/information-processing
I don't know whether Information Processing is the most appropriate name.
If there's a better name, please open a new page,
My first run of the script was successful, and I now have a
basename-matches PDF. I renamed the match file to word-word-#ofhits.pdf
I ran the script a second time, on that basename-matches PDF
(word-word-#ofhits.pdf), to achieve the "and" functionality. Unfortunately,
on this second run,
Trisquel comes with programs like mv and pdfunite, right? Do most GNU users
(like me until very recently) have them on the computer and not use them? Or
do popular GUI programs depend on these kinds of programs to do things (like
export to pdf in LibreOffice)?
Thanks for the tip; I hadn't thought about how to structure the search in
this way. I will review how I'm doing my searches with this in mind.
My largest set of PDFs is 80 files. In that set, some PDFs are as big as ~20
MB, some are only ~500 kB.
In the newest version of pdf-page-grep, the number of matching pages is
restricted to 1021, right? I can search my PDFs in smaller groups if this is
the case.
I think that I understand -- pdfjam lets the computer group the matches
without first creating an individual PDF for each page-match.
I will read the new script to spot the differences and to try to understand
how you did it.
I get an error message at the end of the run, and there doesn't seem to be a
matches file in my working folder. Maybe 500+ megs of PDFs is too much. I
did a few test runs with a few 10-15 page PDFs, and that seemed to work.
I/O Error: Couldn't open file '/tmp/pdf-page-grep.PHgWDa-1022': Too
Oh, I understand now -- when I read "pipe grep" earlier, I thought that
"pipe" referred to a script instruction or terminal command that I didn't
know yet.
You're right; no need to separate out the pages. I just have to pipe grep
with a second set of words to achieve "and".
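For readers new to pipes, here is a minimal sketch of the "pipe grep" idea on plain sample text. The helper name lines_with_both and the word lists are made up for illustration; they are not part of pdf-page-grep.

```shell
#!/bin/sh
# lines_with_both is a hypothetical helper, not part of pdf-page-grep.
# It prints the input lines that match BOTH extended regular expressions:
# the first grep keeps lines matching $1, the second filters those by $2.
lines_with_both() {
    grep -i -E -e "$1" | grep -i -E -e "$2"
}

# Sample text standing in for the output of: pdftotext some.pdf -
printf 'a green truck\na red apple\nan old bus\n' |
    lines_with_both 'car|truck|bus' 'blue|red|green'
```

Only "a green truck" passes both filters. Note that grep works line by line, so piped this way both words have to appear on the same line of text, not merely on the same page.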
The script writes the basename-matches.pdf file to the same folder where
the script and PDFs are, right?
I can't find that matches file. Is it a problem if my PDFs have spaces in
the name? (Particularly the last PDF, that the script uses to create the
matches.pdf file name)
I tried running the script again today, but am having trouble. When I cd to
the directory where pdf-page-grep is, and enter pdf-page-grep, the terminal
tells me that there is no such command.
I tried moving the script to the directory in my PATH variable; I'm not sure
that this went well.
That means there's no such command in your $PATH. If the script is in the
research folder and executable, you need to run it with ./scriptname.
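A small sketch of what this looks like in practice, using a throwaway script in a temporary directory (the name hello-script is invented for the example, and mktemp is assumed to be available, as it is on GNU systems):

```shell
#!/bin/sh
# Work in a temporary directory so nothing else is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Create a tiny script (hypothetical name) that just prints a message.
printf '#!/bin/sh\necho "it runs"\n' > hello-script

chmod +x hello-script   # make it executable
./hello-script          # the ./ prefix runs it from the current directory
```

Typing hello-script without the ./ prefix would only work if the current directory were listed in $PATH, which it normally is not.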
Right! Thank you!
I moved the script to the directory in my PATH variable like MB suggested.
But I'm not sure that it worked properly. I'll do a little research about
that and try again more carefully.
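One way to check whether the move worked, sketched with a hypothetical helper named in_path (command -v is a standard shell builtin; the script name is the one from this thread):

```shell
#!/bin/sh
# Show the directories the shell searches for commands, one per line.
printf '%s\n' "$PATH" | tr ':' '\n'

# in_path is a hypothetical helper: it succeeds if the shell can find
# the named command through $PATH.
in_path() {
    command -v "$1" >/dev/null 2>&1
}

if in_path pdf-page-grep; then
    echo "pdf-page-grep found at: $(command -v pdf-page-grep)"
else
    echo "pdf-page-grep is not in any PATH directory (or is not executable)"
fi
```

If the script is reported as missing, it is either in a directory that is not in the list printed above or it lacks the executable permission.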
Is it possible to search for pages that contain words -- at least one word
from each of two groups? For example:
First group of ORs: car, truck, bus, bicycle, or motorcycle
and
Second group of ORs: blue, red, green, purple, or beige
So a good hit could have the word green and truck on
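A rough sketch of such a page-level test, assuming the text of one page arrives on standard input. The helper name page_has_both is invented here and is not part of pdf-page-grep; the sample text stands in for real pdftotext output.

```shell
#!/bin/sh
# page_has_both is a hypothetical helper, not part of pdf-page-grep.
# It reads one page of text on stdin and succeeds only if the page
# contains at least one match for EACH of the two extended regexps.
page_has_both() {
    text=$(cat)
    printf '%s\n' "$text" | grep -q -i -E -e "$1" &&
        printf '%s\n' "$text" | grep -q -i -E -e "$2"
}

vehicles='car|truck|bus|bicycle|motorcycle'
colors='blue|red|green|purple|beige'

# With a real PDF, the text of page 3 could come from:
#   pdftotext -f 3 -l 3 file.pdf -
if printf 'The truck was parked.\nIts paint was green.\n' |
    page_has_both "$vehicles" "$colors"; then
    echo "match"
else
    echo "no match"
fi
```

Because each group is tested against the whole page, the two words may sit on different lines. As written, "car" also matches inside longer words such as "scar"; GNU grep's -w option restricts matches to whole words.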
I just thought of something. I could use pdf-page-grep to do a first pass
with my first group of ORs.
Then I could split the matches file into single-page PDFs.
And then use a new set of ORs on those single-page PDFs.
This would be like having an And in the search. Is there an automatic
There actually was a problem with the input PDFs: if they were not in the
working directory, the script was crashing. Also, the output pages were not
in the correct order (the order in which the user gave the PDFs). Finally,
the script was returning 0 even if no page matched the patterns (the
I want to learn Shell scripting now and will read those comments carefully --
thanks
Where's the license for the script?
Oops! I added those lines:
# Distributed under the terms of the GNU General Public License v3
# AUTHOR: Magic Banana
# e-mail: lc...@dcc.ufmg.br
I simplified the script: http://dcc.ufmg.br/~lcerf/utilities/pdf-page-grep
It is now closer to my original proposal since it extracts the individual
pages with matches and, in the end, joins them all.
Besides basic POSIX commands (such as 'grep' and 'awk'), the script now only
relies on
In case someone with a similar situation finds this page -- here's how to run
a script:
1. Open the terminal, and type:
cd [directory where your script is]
Example:
cd /home/username/Desktop/research/
Put the PDF files in the same directory
2. Then type the following, to give yourself
It took me a little while to figure out what I was looking at ... thank you
so much! I'm running the script now, and it's finding pages! This is so
cool. I'm going to PM you about that beer.
Also, thanks a lot Legimet. You guys are the best.
It is true that you have to make the script executable. You can do that with
'chmod +x' or from a graphical file browser (in Nautilus: right click,
Properties, Permissions tab, a box to check).
If you plan to frequently use the script, you had better move it to a
directory listed in your
The script now considers that the arguments that start with - (e.g., -F
or --ignore-case) are options for 'grep'. I put the script on my website:
http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep
MB, having the text would be way more useful than the PDF pages! Thanks for
recommending pdftotext and the -layout option.
I have some questions -- could you help me break this process down into
smaller steps?
I looked up pdfjam's split command online -- I think that it may be a little
'for file in pages/*' is a for loop. That means that it will execute the body
of the loop for each file in the directory pages/, setting the variable file
to the filename each time.
'if pdftotext $file - | grep -i regexps': the 'pdftotext $file -' part
outputs the text of the pdf to
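The loop described above can be tried out on plain text files first. This is only a sketch: the directory, file names, and search word are invented, and cat stands in for pdftotext so that it runs without any PDFs.

```shell
#!/bin/sh
# Build a small pages/ directory with two sample "pages".
# One file name contains a space on purpose.
dir=$(mktemp -d)
mkdir "$dir/pages"
printf 'A green truck.\n' > "$dir/pages/page 1.txt"
printf 'Nothing here.\n'  > "$dir/pages/page 2.txt"

matched=""
for file in "$dir"/pages/*; do
    # With real single-page PDFs, replace cat "$file" with: pdftotext "$file" -
    # grep -q prints nothing; it only reports success or failure to 'if'.
    if cat "$file" | grep -q -i 'truck'; then
        echo "match: ${file##*/}"
        matched="$matched${file##*/};"
    fi
done
```

Writing "$file" in double quotes matters when file names contain spaces: an unquoted $file would be split into several words.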
Is there a way to search PDF files for keywords, and then create new PDFs
that contain only the pages that contain those keywords?
I'd like to search the PDFs for words with various combinations of "and" and
"or". Is this possible?