RE: PDF Processing

Allison, Timothy B. Wed, 02 Nov 2016 04:02:50 -0700

It depends (tm).  As soon as 1.14 is released, I'll add PDAction extraction 
from PDFs (TIKA-2090), and that will include JavaScript (as stored in 
PDActions).  That capability doesn't currently exist.  If there are other 
components that you'd like to have extracted, let us know, and we'll consider 
adding them.
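
Until PDAction extraction is available, a rough way to check whether a PDF 
carries JavaScript actions at all is to scan the raw bytes for the /JavaScript 
and /JS name tokens that the PDF format uses for such actions.  The sketch 
below is only an illustration using the Java standard library: PdfJsScan and 
countName are hypothetical names, not part of Tika or PDFBox.  It will miss 
anything stored in compressed object streams or written with #-escaped names, 
and it may report tokens that merely appear inside ordinary stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class PdfJsScan {

    // Counts occurrences of a PDF name token such as "/JS" in raw bytes.
    // A hit only counts if the following byte cannot continue the name,
    // so "/JS" does not match inside "/JSFoo".
    static int countName(byte[] data, String name) {
        byte[] token = name.getBytes(StandardCharsets.ISO_8859_1);
        int count = 0;
        for (int i = 0; i + token.length <= data.length; i++) {
            if (matchesAt(data, i, token) && endsName(data, i + token.length)) {
                count++;
            }
        }
        return count;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] token) {
        for (int j = 0; j < token.length; j++) {
            if (data[offset + j] != token[j]) {
                return false;
            }
        }
        return true;
    }

    // True at end of data, or when the byte is PDF whitespace or a delimiter.
    private static boolean endsName(byte[] data, int pos) {
        if (pos >= data.length) {
            return true;
        }
        int b = data[pos] & 0xFF;
        return b == 0 || b == '\t' || b == '\n' || b == '\f' || b == '\r'
                || b == ' ' || "()<>[]{}/%".indexOf(b) >= 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfJsScan <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/JavaScript: " + countName(data, "/JavaScript"));
        System.out.println("/JS: " + countName(data, "/JS"));
    }
}
```

A non-zero count is only a hint that the file is worth a closer look with 
PDFBox directly; a zero count does not prove the file is free of JavaScript.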


If you want a look at what JavaScript extraction will look like, I recently 
extracted ~70k JavaScript elements from our 500k regression corpus:
http://162.242.228.174/embedded_files

specifically:

http://162.242.228.174/embedded_files/js_in_pdfs.tar.bz2

> entire structure of a document and extract any or all pieces from it.
Within reason(tm), that _is_ the goal of Tika.  The focus is text, but we try 
to maintain some structural information where we can, e.g. bold/italic/lists 
and paragraph boundaries in MSOffice and related formats.  We do not do full 
stylistic extraction (font name, size, etc), but the general formatting 
components that apply across formats, we try to maintain.



From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Wednesday, November 2, 2016 3:30 AM
To: user@tika.apache.org
Subject: PDF Processing

I am wondering if I am using Tika for purposes it was not aimed at. I am 
beginning to think that its main aim is to extract text from documents, whereas 
I really want to get an entire structure of a document and extract any or all 
pieces from it. For instance, when parsing a PDF, if it has embedded streams, I 
want to be able to extract the embedded stream (for instance, JavaScript). 
PDFBox can do this, but such information does not turn up in a ContentHandler 
passed to Tika.

If I want to do more than get just the text, should I really use the underlying 
parsers directly and not try to abstract them using Tika?

Many thanks,

Jim
