On Thu, 26 Jun 2014, Richard wrote:
Have you by chance programmatically looped through a directory full of PDFs and used Tika to extract each one's contents into separate text or XML files? If so, what do you recommend for the extraction?

For a proof of concept, how about something simple like a bash for loop and the tika app?

for i in *.pdf; do
  j="${i%.pdf}"   # strip the trailing .pdf extension only
  java -jar tika-app.jar --text "$i" > "$j.txt"
  java -jar tika-app.jar --xml "$i" > "$j.xml"
done
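
If you want something a little more defensive than the one-off loop, here is a rough sketch of the same idea wrapped in a function. It assumes bash and that java is on your PATH; the names extract_pdfs and TIKA_JAR are made up for illustration and are not part of Tika. It skips the case where the directory has no PDFs and reports a failed extraction instead of stopping silently.

```shell
#!/usr/bin/env bash
# Sketch only: convert every PDF in a directory to .txt and .xml using the Tika app.
# TIKA_JAR and extract_pdfs are hypothetical names chosen for this example.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

extract_pdfs() {
  local dir="${1:-.}" pdf base
  for pdf in "$dir"/*.pdf; do
    # If the glob matched nothing, "$pdf" is the literal pattern; skip it.
    [ -e "$pdf" ] || continue
    base="${pdf%.pdf}"
    java -jar "$TIKA_JAR" --text "$pdf" > "$base.txt" \
      || echo "text extraction failed: $pdf" >&2
    java -jar "$TIKA_JAR" --xml "$pdf" > "$base.xml" \
      || echo "xml extraction failed: $pdf" >&2
  done
}
```

Source the file and call it as `extract_pdfs /path/to/pdfs`; with no argument it works on the current directory. Each call still starts a fresh JVM per file, so this is fine for a proof of concept but will be slow on a large directory.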

Nick
