RE: Extracting macros in 1.15

Allison, Timothy B. Mon, 05 Jun 2017 11:20:06 -0700

Ha!  I have no immediate plans…I am curious, though, about interest among our 
user base.  Shall we open a ticket to track this and any PRs?

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, June 5, 2017 2:12 PM
To: user@tika.apache.org
Subject: Re: Extracting macros in 1.15

On Jun 5, 2017, at 10:43am, Allison, Timothy B. 
<talli...@mitre.org> wrote:

Jim,
  Thank you, again, for reaching out to us.  Now that we have a user who 
actually cares about macros, I have some follow-up questions.  We aren’t 
treating JS in HTML as a macro…should we try to do that?  Are there other 
macro-like bits of code that we should be extracting?

Oddly enough, this just came up for me a few days ago.

I was going to use a custom mapper and content handler to extract the <script> 
data, but having built-in support that treats them as macros would be better.

So yes, please :)

How would you handle the src=xxx attribute? Ultimately I plan to treat these 
like an import statement in a regular source code file.
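
As a rough illustration of the "import statement" idea, the sketch below separates inline <script> bodies from src references. It uses plain Java and a crude regex rather than Tika's HTML parser or any built-in macro support. The ScriptExtractor and Script names are hypothetical, and real-world HTML would need a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: splits <script> elements into
// external references (src=...) and inline bodies.
class ScriptExtractor {

    /** One <script> element: src is null for inline scripts. */
    static final class Script {
        final String src;
        final String body;

        Script(String src, String body) {
            this.src = src;
            this.body = body;
        }

        boolean isExternal() {
            return src != null;
        }
    }

    // Crude matching: assumes no '>' inside attribute values.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b([^>]*)>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // src="x", src='x' or src=x, preceded by whitespace or start of attributes.
    private static final Pattern SRC = Pattern.compile(
            "(?:^|\\s)src\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    static List<Script> extract(String html) {
        List<Script> scripts = new ArrayList<>();
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            String src = null;
            Matcher s = SRC.matcher(m.group(1));
            if (s.find()) {
                // Exactly one of the three alternatives captured the value.
                src = s.group(1) != null ? s.group(1)
                        : s.group(2) != null ? s.group(2) : s.group(3);
            }
            scripts.add(new Script(src, m.group(2).trim()));
        }
        return scripts;
    }

    public static void main(String[] args) {
        String html = "<html><head><script src=\"lib/util.js\"></script></head>"
                + "<body><script>alert('hi');</script></body></html>";
        for (Script s : extract(html)) {
            // External scripts are reported like imports; inline ones as code.
            System.out.println(s.isExternal() ? "import " + s.src : "inline: " + s.body);
        }
    }
}
```

Keeping the src value separate from the body lets a caller decide later whether to fetch the referenced file or just record the dependency.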

Regards,

— Ken

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Sunday, June 4, 2017 4:07 AM
To: user@tika.apache.org
Subject: RE: Extracting macros in 1.15

Direct Java calls and "I am using the AutoDetectParser at the moment."

I found an online example buried in a test for another package, so I have worked 
out how to do it now, but it seems that if I have many different document 
types to support I will have to configure each parser separately. So be it, but 
it seems like there is a case for a subset of options that may apply to all, 
such as "extract anything that qualifies as a 'macro'", that all parsers would 
obey if they have not been told anything specifically.
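
To make the suggestion concrete, here is a minimal sketch of a global default that individual parsers obey unless they have been configured explicitly. This is not an existing Tika API: the ExtractionDefaults class, its methods, and the parser names used as keys are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, not Tika API: one global "extract macros" default plus
// optional per-parser overrides.
class ExtractionDefaults {
    private final boolean extractMacrosByDefault;
    private final Map<String, Boolean> perParser = new HashMap<>();

    ExtractionDefaults(boolean extractMacrosByDefault) {
        this.extractMacrosByDefault = extractMacrosByDefault;
    }

    /** Records an explicit setting for one parser; returns this for chaining. */
    ExtractionDefaults setForParser(String parserName, boolean extractMacros) {
        perParser.put(parserName, extractMacros);
        return this;
    }

    /** A parser that was told nothing specific falls back to the global default. */
    boolean shouldExtractMacros(String parserName) {
        return perParser.getOrDefault(parserName, extractMacrosByDefault);
    }

    public static void main(String[] args) {
        ExtractionDefaults defaults = new ExtractionDefaults(true)
                .setForParser("pdf", false);
        System.out.println("office: " + defaults.shouldExtractMacros("office"));
        System.out.println("pdf: " + defaults.shouldExtractMacros("pdf"));
    }
}
```

The point of the design is that supporting a new document type needs no extra configuration unless its behaviour should differ from the default.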

It is my opinion (for what it's worth 😉) that all parsers should extract 
everything they can unless told otherwise, but it is what it is, I guess, and I 
am pleased to have TIKA as an aid in analyzing all the myriad document types.

Jim

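        // Needs org.apache.tika.parser.ParseContext, org.apache.tika.parser.AutoDetectParser
        // and org.apache.tika.parser.microsoft.OfficeParserConfig
        ParseContext pc = new ParseContext();
        AutoDetectParser parser = new AutoDetectParser();
        // Turn macro extraction back on for the Office parsers
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setExtractMacros(true);
        pc.set(OfficeParserConfig.class, officeParserConfig);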

> -----Original Message-----
> From: Nick Burch [mailto:apa...@gagravarr.org]
> Sent: Saturday, June 3, 2017 16:36
> To: user@tika.apache.org
> Subject: Re: Extracting macros in 1.15
>
> On Sat, 3 Jun 2017, Jim Idle wrote:
> > After being baffled why macros no longer show up in 1.15 I found:
> > https://issues.apache.org/jira/browse/TIKA-2302
> >
> > Can anyone point me to an example of doing this? I am finding bits and
> > pieces but no example of turning macros back on. I basically want all
> > macros in all documents, office, pdf, anything really.
>
> How do you call Apache Tika? Tika App? Tika Server? Tika Java class facade?
> Direct Java calls to TikaConfig / AutoDetectParser etc?
>
> The solution will differ depending on which one you use.
>
> Nick

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
