[ 
https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1776.
-----------------------------
    Resolution: Not A Problem

No problem.  Let us know if you have any other surprises.

> tika stop converting at this pdf document
> -----------------------------------------
>
>                 Key: TIKA-1776
>                 URL: https://issues.apache.org/jira/browse/TIKA-1776
>             Project: Tika
>          Issue Type: Bug
>          Components: batch
>    Affects Versions: 1.10
>         Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
>            Reporter: tranquillo
>
> Hi, and thank you all for this great project.
> I use https://github.com/offenesdresden/ratsinfo-scraper to download 
> thousands of PDFs and convert them from PDF to XML. That works pretty well 
> and needs at most 1-2 minutes, even for big files. But for over 15 hours now 
> the process has been hanging with CPU load = 0% on one file: 
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do 
> which is just 5 MB in size but contains text, scans and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
> class TikaApp
>     def initialize(document)
>         filename = File.basename(document)
>         t = Time.now
>         puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>         @document = document
>         java_cmd = 'java'
>         java_args = '-server -Djava.awt.headless=true'
>         tika_path = "tika-app.jar"
>         @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>     end
>     def get_xml
>         run_tika('--xml')
>     end
>     def get_metadata
>         run_tika('--metadata --json')
>     end
>     private
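>     def run_tika(option)
>         final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>         pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>         # Drain stderr in a separate thread: reading stdout to EOF first can
>         # block forever if the child fills the stderr pipe in the meantime.
>         stderr_reader = Thread.new { stderr.read.strip }
>         stdout_result = stdout.read.strip
>         stderr_result = stderr_reader.value
>         unless strip_stderr(stderr_result).empty?
>             # stderr held more than info/warn lines; no action is taken here
>         end
>         stdout_result
>     ensure
>         # The pipes are nil if popen4 itself raised, so guard each close.
>         [stdin, stdout, stderr].each { |io| io.close if io && !io.closed? }
>     end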
>     def strip_stderr(s)
>         s.gsub(/^(info|warn) - .*$/i, '').strip
>     end
> end
> ----------
> The tika command with this function looks like this: 
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml 
> '~/data/00149624.pdf'
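
The thread does not establish why the process hung, so the following is only
a minimal sketch of one defensive pattern for batch runs like the one
described above: start the child without a shell, read stdout and stderr
concurrently, and kill the child once a deadline passes. It uses Ruby's
standard open3 library rather than the open4 gem from the report. The helper
name run_with_timeout and the timeout value are hypothetical, not part of the
reporter's class or of Tika.

```ruby
require 'open3'

# Hypothetical helper (not from the original report).
# cmd     : argv array, e.g. ['java', '-jar', 'tika-app.jar', '--xml', path]
# timeout : seconds to wait before the child is killed
# Returns [stdout, stderr, Process::Status]; raises RuntimeError on timeout.
def run_with_timeout(cmd, timeout)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Read both pipes concurrently so neither can fill up and block the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join returns nil when the limit expires before the child exits.
    if wait_thr.join(timeout).nil?
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just after the deadline.
      end
      wait_thr.join
      out_reader.join
      err_reader.join
      raise "command timed out after #{timeout}s: #{cmd.join(' ')}"
    end

    [out_reader.value, err_reader.value, wait_thr.value]
  end
end
```

Applied to the command quoted above, a call could look like
run_with_timeout(['java', '-Djava.awt.headless=true', '-jar', 'tika-app.jar',
'--xml', File.expand_path('~/data/00149624.pdf')], 300), where 300 seconds is
an arbitrary choice. Passing an argv array avoids shell quoting entirely, and
File.expand_path resolves the leading "~", which a shell leaves unexpanded
inside single quotes.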



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
