Ha! I have no immediate plans… I am curious, though, about interest among our user base. Shall we open a ticket for tracking and PRs?

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, June 5, 2017 2:12 PM
To: user@tika.apache.org
Subject: Re: Extracting macros in 1.15

On Jun 5, 2017, at 10:43am, Allison, Timothy B. <talli...@mitre.org> wrote:

Jim,

Thank you, again, for reaching out to us. Now that we have a user who actually cares about macros, I have some follow-up questions. We aren’t treating js in html as a macro…should we try to do that? Are there other macro-like bits of code that we should be extracting?

Oddly enough, this just came up for me a few days ago. I was going to use a custom mapper and content handler to extract the <script> data, but having built-in support that treats them as macros would be better. So yes, please :)

How would you handle the src=xxx attribute? Ultimately I plan to treat these like an import statement in a regular source code file.

Regards,

-- Ken

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Sunday, June 4, 2017 4:07 AM
To: user@tika.apache.org
Subject: RE: Extracting macros in 1.15

Direct Java calls and "I am using the AutoDetectParser at the moment."

I found an online example buried in a test for another package, so I have worked out how to do it now, but it seems that if I have many different document types to support I will have to configure each parser separately. So be it, but it seems like there is a case for a subset of options that may apply to all, such as "extract anything that qualifies as a 'macro'", that all parsers would obey if they have not been told anything specifically.

It is my opinion (for what it's worth 😉) that all parsers should extract everything they can unless told otherwise, but it is what it is I guess, and I am pleased to have TIKA as an aid in analyzing all the myriad document types.

Jim

    pc = new ParseContext();
    parser = new AutoDetectParser();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setExtractMacros(true);
    pc.set(OfficeParserConfig.class, officeParserConfig);

> -----Original Message-----
> From: Nick Burch [mailto:apa...@gagravarr.org]
> Sent: Saturday, June 3, 2017 16:36
> To: user@tika.apache.org
> Subject: Re: Extracting macros in 1.15
>
> On Sat, 3 Jun 2017, Jim Idle wrote:
> > After being baffled why macros no longer show up in 1.15 I found:
> > https://issues.apache.org/jira/browse/TIKA-2302
> >
> > Can anyone point me to an example of doing this? I am finding bits and
> > pieces but no example of turning macros back on. I basically want all
> > macros in all documents, office, pdf, anything really.
>
> How do you call Apache Tika? Tika App? Tika Server? Tika java class facade?
> Direct Java calls to TikaConfig / AutoDetectParser etc?
>
> The solution will differ depending on which one you use
>
> Nick

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
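
For anyone following the <script> part of this thread: below is a minimal sketch of the "custom content handler" idea Ken mentions, not anything that ships with Tika. It assumes the <script> elements actually reach the handler as SAX events (per Ken's note, that takes a custom mapper on the Tika side). The names ScriptCollector and collect are hypothetical, and the demo feeds a well-formed XHTML string through the JDK's own SAX parser rather than through Tika. Inline script bodies are collected as code, and src=xxx values are kept in a separate list so they can be treated like import statements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects inline script bodies and external script references. */
class ScriptCollector extends DefaultHandler {
    private final List<String> inlineScripts = new ArrayList<>();
    private final List<String> externalSources = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <script> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isScript(localName, qName)) {
            current = new StringBuilder();
            // Treat src=... like an import: record the reference, do not fetch it.
            String src = atts.getValue("src");
            if (src != null) {
                externalSources.add(src);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isScript(localName, qName)) {
            String body = current.toString().trim();
            if (!body.isEmpty()) {
                inlineScripts.add(body);
            }
            current = null;
        }
    }

    private static boolean isScript(String localName, String qName) {
        return "script".equalsIgnoreCase(localName) || "script".equalsIgnoreCase(qName);
    }

    List<String> getInlineScripts() {
        return inlineScripts;
    }

    List<String> getExternalSources() {
        return externalSources;
    }

    /** Demo helper (an assumption for this sketch): parse well-formed XHTML with the JDK SAX parser. */
    static ScriptCollector collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ScriptCollector collector = new ScriptCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        ScriptCollector c = collect(
            "<html><head><script src=\"lib.js\"></script>"
            + "<script>alert(1);</script></head><body><p>hi</p></body></html>");
        System.out.println("imports: " + c.getExternalSources());
        System.out.println("inline:  " + c.getInlineScripts());
    }
}
```

Because the class only extends org.xml.sax.helpers.DefaultHandler, the same handler could in principle be passed wherever a SAX ContentHandler is accepted. Whether Tika's HTML parser forwards <script> elements to it depends on the mapper configuration, which this sketch does not cover.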