[ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison closed TIKA-1776.
-----------------------------
    Resolution: Not A Problem

No problem. Let us know if you have any other surprises.

> tika stop converting at this pdf document
> -----------------------------------------
>
>                 Key: TIKA-1776
>                 URL: https://issues.apache.org/jira/browse/TIKA-1776
>             Project: Tika
>          Issue Type: Bug
>          Components: batch
>    Affects Versions: 1.10
>         Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
>            Reporter: tranquillo
>
> Hi and thank you all for this great project,
> I use https://github.com/offenesdresden/ratsinfo-scraper to download
> thousands of PDFs and convert them from PDF to XML. That works pretty well
> and needs at most 1-2 minutes even for big files. But for over 15 hours now
> the process has been hanging with CPU load = 0% at one file:
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
> which is just 5 MB in size, but contains text, scans and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
>
> class TikaApp
>   def initialize(document)
>     filename = File.basename(document)
>     t = Time.now
>     puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>     @document = document
>     java_cmd = 'java'
>     java_args = '-server -Djava.awt.headless=true'
>     tika_path = "tika-app.jar"
>     @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>   end
>
>   def get_xml
>     run_tika('--xml')
>   end
>
>   def get_metadata
>     run_tika('--metadata --json')
>   end
>
>   private
>
>   def run_tika(option)
>     final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>     pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>     stdout_result = stdout.read.strip
>     stderr_result = stderr.read.strip
>     unless strip_stderr(stderr_result).empty?
>     end
>     stdout_result
>   ensure
>     stdin.close
>     stdout.close
>     stderr.close
>   end
>
>   def strip_stderr(s)
>     s.gsub(/^(info|warn) - .*$/i, '').strip
>   end
> end
> ----------
> The tika command with this function looks like this:
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
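
The ticket does not record what caused the hang, so the following is only a sketch of a general precaution for batch runs like the one described: give each external conversion a time limit, and read stdout and stderr in parallel instead of one after the other. A child process that writes a lot to stderr while the parent is still blocked reading stdout can stall once the pipe buffer fills. The helper name `run_with_timeout`, its parameters, and the use of Ruby's standard `open3` library in place of the `open4` gem are assumptions for illustration, not part of ratsinfo-scraper or Tika.

```ruby
require 'open3'

# Hypothetical helper, not part of ratsinfo-scraper or Tika.
# cmd is an argument array (executed without a shell),
# timeout_s is the time limit in seconds.
def run_with_timeout(cmd, timeout_s)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes in parallel so neither can fill up and stall the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    if wait_thr.join(timeout_s)
      # The child finished within the limit.
      status = wait_thr.value.success? ? :ok : :failed
    else
      # The child is still running after timeout_s seconds: kill and reap it.
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just before the kill.
      end
      wait_thr.join
      status = :timeout
    end

    { status: status,
      stdout: out_reader.value.strip,
      stderr: err_reader.value.strip }
  end
end

# Possible use with the command from the report (the 300-second limit
# and the `path` variable are arbitrary, assumed values):
#   run_with_timeout(['java', '-server', '-Djava.awt.headless=true',
#                     '-jar', 'tika-app.jar', '--xml', path], 300)
```

With a wrapper along these lines, a document that never finishes would be reported as `:timeout` and the batch could move on to the next file, rather than waiting indefinitely on one PDF.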