RE: [EXT] Re: PDF extraction using Tika

Hanjan, Harinderdeep S. Wed, 26 Aug 2020 07:24:06 -0700

I found it better to offload PDF parsing and text extraction to a standalone 
Tika Server instead. This way, if a PDF crashes the Tika Server, it will not 
take down the JVM where your code is running.
You could easily have multiple instances of Tika Server running (perhaps on 
another machine) and if one is not responding, move on to the next one. This 
will also allow you to easily incorporate using multiple PDF extraction tools, 
should Tika fail on a PDF.


The way this would work is something like this:
- Your code sees a PDF
- It sends the PDF to Tika Server
- Tika Server parses the PDF and returns the text to you (at this point you 
could also send the PDF to other extraction tools)
- You take the extracted text, add it to your SolrInputDocument and send it on 
its merry way to Solr
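
The steps above could be sketched roughly as follows. This is only an 
illustration, not code from this thread: the class name TikaServerClient, the 
server list and the timeout are assumptions made for the example. It relies on 
Tika Server returning plain text for a PUT to /tika with an 
"Accept: text/plain" header, and uses the java.net.http client from Java 11+. 
Adding the returned text to a SolrInputDocument is left out so the sketch has 
no SolrJ dependency.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

/** Hypothetical helper: sends a PDF to the first Tika Server that answers. */
public class TikaServerClient {
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final List<URI> servers;   // e.g. http://localhost:9998 (assumed)
    private final Duration timeout;    // per-request timeout (assumed value)

    public TikaServerClient(List<URI> servers, Duration timeout) {
        this.servers = servers;
        this.timeout = timeout;
    }

    /** Builds a PUT /tika request that asks for plain-text output. */
    static HttpRequest buildRequest(URI base, byte[] pdf, Duration timeout) {
        return HttpRequest.newBuilder(base.resolve("/tika"))
                .timeout(timeout)
                .header("Accept", "text/plain")
                .header("Content-Type", "application/pdf")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdf))
                .build();
    }

    /** Tries each server in order; a failure or timeout moves on to the next. */
    public String extractText(byte[] pdf) throws IOException {
        IOException last = null;
        for (URI base : servers) {
            try {
                HttpResponse<String> resp = http.send(
                        buildRequest(base, pdf, timeout),
                        HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() == 200) {
                    return resp.body();
                }
                last = new IOException(base + " returned HTTP " + resp.statusCode());
            } catch (IOException e) {
                last = e; // connection refused, timeout, server died mid-parse
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for " + base, e);
            }
        }
        throw last != null ? last : new IOException("no Tika Server configured");
    }
}
```

If every server fails, the caller gets an IOException and can log the file, 
skip it, or hand it to a different extraction tool, without its own JVM being 
at risk.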

One thing to keep in mind when using Tika Server is that it takes up a _lot_ of 
RAM. You'd want to limit its JVM's memory footprint. For example, the 
following will cap its maximum heap at 2GB:
> java -Xmx2048m -jar tika-server-1.24.jar

- H

-----Original Message-----
From: Jan Høydahl [mailto:jan....@cominvent.com]
Sent: August 26, 2020 6:19 AM
To: solr-user <solr-user@lucene.apache.org>
Subject: [EXT] Re: PDF extraction using Tika

When I worked for a search engine vendor in my previous life, the PDF parsing 
pipeline looked something like this

Try parsing the PDF file with tool X
If failure or timeout, try instead with tool Y
If failure or timeout, try instead with tool Z

In this case X would be the preferred parser, but Y and Z would be fallbacks 
that would hopefully not fail in the same place as X.
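
A fallback chain of this kind could be sketched roughly as below. This is an 
illustration under assumptions, not the vendor pipeline described above: the 
class name FallbackExtractor is made up, and each "tool" is modelled as a plain 
Function from PDF bytes to text. In practice each function would typically wrap 
a call to an external process or server, since a timeout can only interrupt an 
in-process parser, not forcibly stop it.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical helper: tries extraction tools in order of preference. */
public class FallbackExtractor {

    /**
     * Returns the text from the first tool that succeeds within the timeout,
     * or an empty Optional if every tool fails or times out.
     */
    public static Optional<String> extract(byte[] pdf,
                                           List<Function<byte[], String>> tools,
                                           long timeoutMillis) {
        // Daemon threads so a stuck tool cannot keep the JVM alive on exit.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        try {
            for (Function<byte[], String> tool : tools) {
                Callable<String> call = () -> tool.apply(pdf);
                Future<String> job = pool.submit(call);
                try {
                    String text = job.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    if (text != null) {
                        return Optional.of(text);
                    }
                    // A null result is treated as a failure: try the next tool.
                } catch (TimeoutException e) {
                    job.cancel(true); // timeout: interrupt and try the next tool
                } catch (ExecutionException e) {
                    // The tool threw an exception: try the next tool.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Called with the tools listed as X, Y, Z, the result comes from the first one 
that neither throws nor exceeds the timeout.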

Agreed that PDFBox and Tika are impressive. However, in your own code you could 
also fall back to some other tool if you want a more robust pipeline.

Jan

> 26. aug. 2020 kl. 11:06 skrev Charlie Hull <char...@flax.co.uk>:
>
> Hi Joe,
>
> Tika is pretty amazing at coping with the things people throw at it and I 
> know the team behind it have added a very extensive testing framework. 
> However, the reality is that malformed, huge or just plain crazy documents 
> may cause crashes - PDFs are mad, you can even embed Javascript in them I 
> believe, and I've also seen PDFs running to thousands of pages. There's *no 
> way* to design out every possible crash, and it's far better to design your 
> system to cope if necessary by separating the PDF processing from Solr.
>
> Charlie
>
> On 25/08/2020 11:46, Joe Doupnik wrote:
>> More properly, it would be best to fix Tika and thus not push extra 
>> complexity upon many many users. Error handling is one thing, crashes though 
>> ought to be designed out.
>>     Thanks,
>>     Joe D.
>>
>> On 25/08/2020 10:54, Charlie Hull wrote:
>>> On 25/08/2020 06:04, Srinivas Kashyap wrote:
>>>> Hi Alexandre,
>>>>
>>>> Yes, these are the same PDF files on Windows and Linux. There are 
>>>> around 30 PDF files, and I tried indexing a single file but got the same 
>>>> error. Is it related to how the PDFs are stored on Linux?
>>> Did you try running Tika (the same version as you're using in Solr) 
>>> standalone on the file as Alexandre suggested?
>>>>
>>>> And with regard to DIH and Tika going away, can you point to any program 
>>>> that extracts text from PDFs and pushes it into Solr?
>>>
>>> https://lucidworks.com/post/indexing-with-solrj/
>>>   is one example. You should run Tika separately as it's entirely possible 
>>> for it to fail to parse a PDF and crash - and if you're running it in DIH & 
>>> Solr it then brings down everything. Separate your PDF processing from your 
>>> Solr indexing.
>>>
>>>
>>> Cheers
>>>
>>> Charlie
>>>
>>>>
>>>> Thanks,
>>>> Srinivas Kashyap
>>>>
>>>> -----Original Message-----
>>>> From: Alexandre Rafalovitch <arafa...@gmail.com>
>>>> Sent: 24 August 2020 20:54
>>>> To: solr-user <solr-user@lucene.apache.org>
>>>> Subject: Re: PDF extraction using Tika
>>>>
>>>> The issue seems to be more with a specific file and at the level way below 
>>>> Solr's or possibly even Tika's:
>>>> Caused by: java.io.IOException: expected='>' actual='
>>>> ' at offset 2383
>>>>                  at
>>>> org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
>>>>
>>>> Are you indexing the same files on Windows and Linux? I am guessing
>>>> not. I would try to narrow down which of the files it is. One way
>>>> could be to get a standalone Tika (make sure to match the version
>>>> Solr embeds) and run it over the documents by itself. It will probably 
>>>> complain with the same error.
>>>>
>>>> Regards,
>>>>     Alex.
>>>> P.s. Additionally, both DIH and Embedded Tika are not recommended for 
>>>> production. And both will be going away in future Solr versions. You may 
>>>> have a much less brittle pipeline if you save the structured outputs from 
>>>> those Tika standalone runs and then index them into Solr, possibly 
>>>> pre-processed.
>>>>
>>>> On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
>>>> <srini...@bamboorose.com.invalid> wrote:
>>>>> Hello,
>>>>>
>>>>> We are using TikaEntityProcessor to extract the content out of PDF and 
>>>>> make the content searchable.
>>>>>
>>>>> When Jetty runs on a Windows machine, we can successfully load 
>>>>> documents using a DIH full import (Tika entity). Here the PDFs are 
>>>>> stored on the Windows file system.
>>>>>
>>>>> But when Jetty/Solr runs on a Linux machine and we run DIH, we get 
>>>>> the exception below (here the PDFs are stored on the Linux 
>>>>> file system):
>>>>>
>>>>> Full Import failed:java.lang.RuntimeException: 
>>>>> java.lang.RuntimeException: 
>>>>> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to 
>>>>> read content Processing Document # 1
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
>>>>>                  at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.RuntimeException: 
>>>>> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to 
>>>>> read content Processing Document # 1
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
>>>>>                  ... 4 more
>>>>> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
>>>>> Unable to read content Processing Document # 1
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
>>>>>                  ... 6 more
>>>>> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF 
>>>>> content
>>>>>                  at 
>>>>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>>>>>                  at 
>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>>>>>                  at 
>>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>                  at 
>>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>                  at 
>>>>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>>>>                  at 
>>>>> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
>>>>>                  ... 10 more
>>>>> Caused by: java.io.IOException: expected='>' actual='
>>>>> ' at offset 2383
>>>>>                  at 
>>>>> org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
>>>>>                  at 
>>>>> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:226)
>>>>>                  at 
>>>>> org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:163)
>>>>>                  at 
>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
>>>>>                  at 
>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>>>>>                  at 
>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>>>>>                  at 
>>>>> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>>>>                  at 
>>>>> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>>>>                  at 
>>>>> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>>>                  at 
>>>>> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>>>                  at 
>>>>> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>>>                  at 
>>>>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>>>                  ... 15 more
>>>>>
>>>>> Can you please suggest how to extract PDF content from a Linux-based file system?
>>>>>
>>>>> Thanks,
>>>>> Srinivas Kashyap
>>>
>>>
>>
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: http://www.o19s.com
>

