Re: Solr 6.4. Can't index MS Visio vsdx files

2017-07-04 Thread Charlie Hull

On 11/04/2017 20:48, Allison, Timothy B. wrote:

It depends.  We've been trying to make parsers more, erm, flexible, but there 
are some problems from which we cannot recover.

Tl;dr there isn't a short answer.  :(

My sense is that DIH/ExtractingDocumentHandler is intended to get people up and 
running with Solr easily but it is not really a great idea for production.  See 
Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/


+1. Tika extraction should happen *outside* Solr in production. A 
colleague even wrote a simple wrapper for Tika to help build this sort 
of thing: https://github.com/mattflax/dropwizard-tika-server
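
A minimal sketch of what "outside Solr" could look like, not the wrapper linked above. It assumes an Apache Tika server that answers PUT /tika with plain text, reachable at a hypothetical http://localhost:9998, and a hypothetical Solr core called mycore with a text field named content_txt. The function names are made up for illustration.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998"              # assumption: a running Tika server
SOLR_URL = "http://localhost:8983/solr/mycore"  # assumption: hypothetical core name


def build_tika_request(payload: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the Tika server for plain-text output."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=payload,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def build_solr_request(doc_id: str, text: str, solr_url: str = SOLR_URL) -> urllib.request.Request:
    """Build a JSON update request carrying one already-extracted document."""
    # content_txt is a hypothetical field name; use whatever the schema defines.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_file(path: str) -> bool:
    """Extract outside Solr, then index; a failure only skips this one file."""
    try:
        with open(path, "rb") as fh:
            payload = fh.read()
        with urllib.request.urlopen(build_tika_request(payload), timeout=60) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        with urllib.request.urlopen(build_solr_request(path, text), timeout=60):
            pass
        return True
    except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
        print(f"skipped {path}: {exc}")
        return False
```

With this split, a file that Tika cannot parse costs one skipped document on the client side, and Solr itself never loads the parser jars.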


Charlie




As for the Tika portion... at the very least, Tika _shouldn't_ cause the 
ingesting process to crash.  At most, it should fail at the file level and not 
cause greater havoc.  In practice, if you're processing millions of files from 
the wild, you'll run into bad behavior and need to defend against permanent 
hangs, oom, memory leaks.
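
One way to defend against hangs and crashes, sketched under assumptions: run each extraction in its own child process with a deadline, so that a bad file costs only that file. The command to run is left to the caller (make_cmd is a hypothetical callback). A real setup might launch a JVM with a capped heap around a Tika-based extractor, but nothing here depends on that.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run one extraction command in a child process and classify the outcome.

    Returns a (status, output) tuple where status is "ok", "failed" or "timeout".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed if it runs past the deadline
        )
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    if proc.returncode != 0:
        return ("failed", proc.stderr.decode("utf-8", errors="replace"))
    return ("ok", proc.stdout.decode("utf-8", errors="replace"))


def extract_all(paths, make_cmd, timeout_s=120):
    """Process every file; a hang or crash is recorded per file and the loop moves on."""
    results = {}
    for path in paths:
        results[path] = run_isolated(make_cmd(path), timeout_s)
    return results
```

Because the parser lives in a separate process, an out-of-memory error or a leak ends with that process instead of accumulating in the ingesting one.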

Also, at the least, if there's an exception with an embedded file, Tika should 
catch it and keep going with the rest of the file.  If this doesn't happen let 
us know!  We are aware that some types of embedded file stream problems were 
causing parse failures on the entire file, and we now catch those in Tika 
1.15-SNAPSHOT and don't let them percolate up through the parent file (they're 
reported in the metadata though).

Specifically for your stack traces:

For your initial problem with the missing class exceptions -- I thought we used 
to catch those in docx and log them.  I haven't been able to track this down, 
though.  I can look more if you have a need.

For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo' 
", this problem might go away if we implemented a pure SAX parser for vsdx.  We just 
did this for docx and pptx (coming in 1.15) and these are more robust to variation 
because they aren't requiring a match with the ooxml schema.  I haven't looked much at 
vsdx, but that _might_ help.

For "TODO Support v5 Pointers", this isn't supported and would require 
contributions.  However, I agree that POI shouldn't throw a Runtime exception.  Perhaps 
open an issue in POI, or maybe we should catch this special example at the Tika level?

For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team 
_might_ be able to modify the parser to ignore a stream if there's an exception, but 
that's often a sign that something needs to be fixed with the parser.  In short, the 
solution will come from POI.

Best,

 Tim

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, April 11, 2017 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Thanks for your responses.
Are there any possibilities to ignore parsing errors and continue indexing?
Right now Solr/Tika stops parsing the whole document if it hits any exception.

On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:


You might want to drop a note to the dev or user's list on Apache POI.

I'm not extremely familiar with the vsd(x) portion of our code base.

The first item ("PolylineTo") may be caused by a mismatch btwn your
doc and the ooxml spec.

The second item appears to be an unsupported feature.

The third item may be an area for improvement within our codebase...I
can't tell just from the stacktrace.

You'll probably get more helpful answers over on POI.  Sorry, I can't
help with this...

Best,

   Tim

P.S.

 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar


You shouldn't need both. ooxml-schemas-1.3.jar should be a superset
of poi-ooxml-schemas-3.15.jar










--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-07-03 Thread Allison, Timothy B.
Sorry.  Yes, you'll have to update commons-compress to 1.14.
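
A quick way to confirm which jar (if any) on the classpath actually ships the class named in the NoClassDefFoundError is to look inside the jars, since a jar is a zip archive. This is only a sketch; the function names are illustrative.

```python
import zipfile

# Class reported as missing in the stack trace quoted in this thread.
MISSING_CLASS = "org/apache/commons/compress/archivers/ArchiveStreamProvider"


def jar_has_class(jar_path, class_name=MISSING_CLASS):
    """Return True if the jar (a zip archive) contains the given class file."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def jars_providing(jar_paths, class_name=MISSING_CLASS):
    """List the jars, out of those given, that actually ship the class."""
    return [p for p in jar_paths if jar_has_class(p, class_name)]
```

If jars_providing returns an empty list for the jars Solr loads, the commons-compress jar in place is too old for the Tika version that was swapped in.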

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Monday, July 3, 2017 9:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

hi,

So I'm back from my long vacation :)

I'm trying to bring up a fresh Solr 6.6 standalone instance on a Windows
Server 2012 R2 machine.

Replaced:

poi-*3.15-beta1 ---> poi-*3.16
tika-*1.13 ---> tika-*1.15


I tried to index one txt file and got the error below (with the poi and tika jars
that come out of the box, Solr indexes this txt file without errors):


SimplePostTool: WARNING: Response:
Error 500 Server Error

HTTP ERROR 500
Problem accessing /solr/v20170703xxx/update/extract. Reason:
    Server Error
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/compress/archivers/ArchiveStreamProvider
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.ja

Re: Solr 6.4. Can't index MS Visio vsdx files

2017-07-03 Thread Gytis Mikuciunas
WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 500 for URL:
http://localhost:80/solr/v20170703xxx/update/extract?resource.name=xx
1 files indexed.
COMMITting Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350



On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the upgrade
> in Solr to Tika 1.15.  Please chime in on that issue.
>
> You should be able to swap in POI 3.16 (final) wherever you had earlier
> versions, make sure to include: poi, poi-scratchpad, poi-ooxml,
> poi-ooxml-schemas.  And make sure to include tika-parsers (1.15),
> tika-core, tika-java7, tika-xmp.  Also, include commons-collections4 (which
> is new in POI w Tika 1.14).  (I assume you have already added curvesapi?)
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Saturday, June 3, 2017 5:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Great Tim.
>
> What do I need to do to integrate it on my current installation?
>
>
> On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Apache Tika 1.15 is now available.
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Tuesday, May 9, 2017 7:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Probably better to ask on the Tika list.  We'll push the release asap
> after PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for
> PDFBox this Friday.  Tika will probably have an RC by Monday 5/15, with the
> release happening later in the week...That's if there are no surprises...[2]
>
> You can get a recent build if you'd like to test [1].
>
> Best,
>
>   Tim
>
> [1] https://builds.apache.org/view/Tika/job/Tika-trunk/
> [2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and
> 2.0.6-SNAPSHOT on ~500k pdfs, see:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
>
> -----Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, May 9, 2017 7:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Are there any news regarding Tika 1.15? Maybe it's already ready for
> download somewhere
>
> G.
>
> On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > The release candidate for POI was just cut...unfortunately, I think
> > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> > opening that!
> >
> > That'll be done within a week unless there are surprises.  Once that's
> > out, I have to update a few things, but I'd think we'd have a
> > candidate for Tika a week later, then a week for release.
> >
> > You can get nightly builds here: https://builds.apache.org/
> >
> > Please ask on the POI or Tika users lists for how to get the
> > latest/latest running, and thank you, again, for opening the issue on
> > POI's Bugzilla.
> >
> > Best,
> >
> >Tim
> >
> > -Original Message-
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Wednesday, April 12, 2017 1:00 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > when 1.15 will be released? maybe you have some beta version and I
> > could test it :)
> >
> > SAX sounds interesting, and from info that I found in google it could
> > solve my issues.
> >
> > On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> >
> > > It depends.  We've been trying to make parsers more, erm, flexible,
> > > but there are some problems from which we cannot recover.
> > >
> > > Tl;dr there isn't a short answer.  :(
> > >
> > > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > > people up and running with Solr easily but it is not really a great
> > > idea for production.  See Erick's gem:
> > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > >
> > > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > > the ingesting process to crash.  At most, it should fail at the file
> > > level and not cause greater havoc.  In practice, if you're
> > > processing millions of files from the wild, you'll run into bad
> 

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-06-05 Thread Allison, Timothy B.
https://issues.apache.org/jira/browse/SOLR-10335 is tracking the upgrade in 
Solr to Tika 1.15.  Please chime in on that issue.

You should be able to swap in POI 3.16 (final) wherever you had earlier 
versions, make sure to include: poi, poi-scratchpad, poi-ooxml, 
poi-ooxml-schemas.  And make sure to include tika-parsers (1.15), tika-core, 
tika-java7, tika-xmp.  Also, include commons-collections4 (which is new in POI 
w Tika 1.14).  (I assume you have already added curvesapi?)
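
When swapping jars by hand it is easy to leave an old version next to the new one or to forget an artifact. The sketch below checks a list of jar file names against the artifacts named above. The function names and the file-name pattern (name-version.jar) are assumptions for illustration.

```python
import re

# Artifacts named in the message above.
REQUIRED = [
    "poi", "poi-scratchpad", "poi-ooxml", "poi-ooxml-schemas",
    "tika-parsers", "tika-core", "tika-java7", "tika-xmp",
    "commons-collections4", "curvesapi",
]

# Assumed naming convention: <artifact>-<version>.jar, version starting with a digit.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def parse_jars(filenames):
    """Map artifact name -> sorted list of versions found among the jar file names."""
    found = {}
    for fname in filenames:
        m = JAR_RE.match(fname)
        if m:
            found.setdefault(m.group("name"), []).append(m.group("version"))
    return {name: sorted(versions) for name, versions in found.items()}


def check_lib_dir(filenames, required=REQUIRED):
    """Report required artifacts that are missing or present in more than one version."""
    found = parse_jars(filenames)
    missing = [name for name in required if name not in found]
    duplicated = {name: v for name, v in found.items() if len(v) > 1}
    return missing, duplicated
```

The input could come from os.listdir() on whichever directory holds the extraction jars in the installation at hand; both an empty missing list and an empty duplicated dict are what you want to see before restarting Solr.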

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Saturday, June 3, 2017 5:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Great Tim.

What do I need to do to integrate it on my current installation?


On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:

Apache Tika 1.15 is now available.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: 
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> opening that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on
> POI's Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-06-03 Thread Gytis Mikuciunas
Great Tim.

What do I need to do to integrate it on my current installation?


On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:

Apache Tika 1.15 is now available.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.  We'll push the release asap after
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox
this Friday.  Tika will probably have an RC by Monday 5/15, with the
release happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and
2.0.6-SNAPSHOT on ~500k pdfs, see:
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for
download somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> opening that!
>
> That'll be done within a week unless there are surprises.  Once that's
> out, I have to update a few things, but I'd think we'd have a
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the
> latest/latest running, and thank you, again, for opening the issue on
> POI's Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible,
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > people up and running with Solr easily but it is not really a great
> > idea for production.  See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > the ingesting process to crash.  At most, it should fail at the file
> > level and not cause greater havoc.  In practice, if you're
> > processing millions of files from the wild, you'll run into bad
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file,
> > Tika should catch it and keep going with the rest of the file.  If
> > this doesn't happen let us know!  We are aware that some types of
> > embedded file stream problems were causing parse failures on the
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't
> > let them percolate up through the parent file (they're reported in
> > the metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I
> > thought we used to catch those in docx and log them.  I haven't been
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a
> > pure SAX parser for vsdx.  We just did this for docx and pptx
> > (coming in 1.15) and these are more robust to variation because they
> > aren't requiring a match with the ooxml schema.  I haven't looked
> > much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would
> > require contributions.  However, I agree that POI shouldn't throw a
> > Runtime exception.  Perhaps open an issue in POI, or maybe we should
> > catch this sp

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-31 Thread Allison, Timothy B.
Apache Tika 1.15 is now available.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: 
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
 
-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening 
> that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on POI's 
> Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a 
> > pure SAX parser for vsdx.  We just did this for docx and pptx 
> > (coming in 1.15) and these are more robust to variation because they 
> > aren't requiring a match with the ooxml schema.  I haven't looked 
> > much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would 
> > require contributions.  However, I agree that POI shouldn't throw a 
> > Runtime exception.  Perhaps open an issue in POI, or maybe we should 
> > catch this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-09 Thread Allison, Timothy B.
Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: 
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
 
-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening 
> that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on POI's 
> Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a 
> > pure SAX parser for vsdx.  We just did this for docx and pptx 
> > (coming in 1.15) and these are more robust to variation because they 
> > aren't requiring a match with the ooxml schema.  I haven't looked 
> > much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would 
> > require contributions.  However, I agree that POI shouldn't throw a 
> > Runtime exception.  Perhaps open an issue in POI, or maybe we should 
> > catch this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 
> > team _might_ be able to modify the parser to ignore a stream if 
> > there's an exception, but that's often a sign that something needs 
> > to be fixed with the parser.  In short, the solution will come from POI.

Re: Solr 6.4. Can't index MS Visio vsdx files

2017-05-09 Thread Gytis Mikuciunas
Is there any news regarding Tika 1.15? Maybe it's already available for
download somewhere?

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think after
> Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!
>
> That'll be done within a week unless there are surprises.  Once that's
> out, I have to update a few things, but I'd think we'd have a candidate for
> Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the latest/latest
> running, and thank you, again, for opening the issue on POI's Bugzilla.
>
> Best,
>
>Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> When will 1.15 be released? Maybe you have a beta version that I could
> test :)
>
> SAX sounds interesting, and from what I found on Google it could
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible,
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > people up and running with Solr easily but it is not really a great
> > idea for production.  See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > the ingesting process to crash.  At most, it should fail at the file
> > level and not cause greater havoc.  In practice, if you're processing
> > millions of files from the wild, you'll run into bad behavior and need
> > to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file,
> > Tika should catch it and keep going with the rest of the file.  If
> > this doesn't happen let us know!  We are aware that some types of
> > embedded file stream problems were causing parse failures on the
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't
> > let them percolate up through the parent file (they're reported in the
> > metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I
> > thought we used to catch those in docx and log them.  I haven't been
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a
> > pure SAX parser for vsdx.  We just did this for docx and pptx (coming
> > in 1.15) and these are more robust to variation because they aren't
> > requiring a match with the ooxml schema.  I haven't looked much at
> > vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would require
> > contributions.  However, I agree that POI shouldn't throw a Runtime
> > exception.  Perhaps open an issue in POI, or maybe we should catch
> > this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI
> > team _might_ be able to modify the parser to ignore a stream if
> > there's an exception, but that's often a sign that something needs to
> > be fixed with the parser.  In short, the solution will come from POI.
> >
> > Best,
> >
> >  Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Tuesday, April 11, 2017 1:56 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
> >
> > Thanks for your responses.
> > Is there any possibility to ignore parsing errors and continue
> > indexing? Right now Solr/Tika stops parsing the whole document if it
> > hits any exception.
> >
> > On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > > You might want to drop a note to the dev or user's list on Apache POI.
> > >
> > > I'm not extremely familiar with the vsd(x) portion of our code base.

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-12 Thread Allison, Timothy B.
The release candidate for POI was just cut...unfortunately, I think after Nick 
Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!

That'll be done within a week unless there are surprises.  Once that's out, I 
have to update a few things, but I'd think we'd have a candidate for Tika a 
week later, then a week for release.

You can get nightly builds here: https://builds.apache.org/

Please ask on the POI or Tika users lists for how to get the latest/latest 
running, and thank you, again, for opening the issue on POI's Bugzilla.

Best,

   Tim

-----Original Message-----
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Wednesday, April 12, 2017 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

When will 1.15 be released? Maybe you have a beta version that I could test
:)

SAX sounds interesting, and from what I found on Google it could solve my
issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> It depends.  We've been trying to make parsers more, erm, flexible, 
> but there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get 
> people up and running with Solr easily but it is not really a great 
> idea for production.  See Erick's gem:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> the ingesting process to crash.  At most, it should fail at the file 
> level and not cause greater havoc.  In practice, if you're processing 
> millions of files from the wild, you'll run into bad behavior and need 
> to defend against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, 
> Tika should catch it and keep going with the rest of the file.  If 
> this doesn't happen let us know!  We are aware that some types of 
> embedded file stream problems were causing parse failures on the 
> entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> let them percolate up through the parent file (they're reported in the 
> metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I 
> thought we used to catch those in docx and log them.  I haven't been 
> able to track this down, though.  I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' 
> name 'PolylineTo' ", this problem might go away if we implemented a 
> pure SAX parser for vsdx.  We just did this for docx and pptx (coming 
> in 1.15) and these are more robust to variation because they aren't 
> requiring a match with the ooxml schema.  I haven't looked much at 
> vsdx, but that _might_ help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require 
> contributions.  However, I agree that POI shouldn't throw a Runtime 
> exception.  Perhaps open an issue in POI, or maybe we should catch 
> this special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 
> team _might_ be able to modify the parser to ignore a stream if 
> there's an exception, but that's often a sign that something needs to 
> be fixed with the parser.  In short, the solution will come from POI.
>
> Best,
>
>          Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Is there any possibility to ignore parsing errors and continue indexing?
> Right now Solr/Tika stops parsing the whole document if it hits any
> exception.
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your 
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our 
> > codebase...I can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI.  Sorry, I 
> > can't help with this...
> >
> > Best,
> >
> >Tim
> >
> > P.S.
> > >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set 
> > of poi-ooxml-schemas-3.15.jar
> >
> >
> >
>


Re: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Gytis Mikuciunas
When will 1.15 be released? Maybe you have a beta version that I could
test :)

SAX sounds interesting, and from what I found on Google it could solve
my issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> It depends.  We've been trying to make parsers more, erm, flexible, but
> there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get people
> up and running with Solr easily but it is not really a great idea for
> production.  See Erick's gem:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause the
> ingesting process to crash.  At most, it should fail at the file level and
> not cause greater havoc.  In practice, if you're processing millions of
> files from the wild, you'll run into bad behavior and need to defend
> against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, Tika
> should catch it and keep going with the rest of the file.  If this doesn't
> happen let us know!  We are aware that some types of embedded file stream
> problems were causing parse failures on the entire file, and we now catch
> those in Tika 1.15-SNAPSHOT and don't let them percolate up through the
> parent file (they're reported in the metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I thought we
> used to catch those in docx and log them.  I haven't been able to track
> this down, though.  I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name
> 'PolylineTo' ", this problem might go away if we implemented a pure SAX
> parser for vsdx.  We just did this for docx and pptx (coming in 1.15) and
> these are more robust to variation because they aren't requiring a match
> with the ooxml schema.  I haven't looked much at vsdx, but that _might_
> help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require
> contributions.  However, I agree that POI shouldn't throw a Runtime
> exception.  Perhaps open an issue in POI, or maybe we should catch this
> special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team
> _might_ be able to modify the parser to ignore a stream if there's an
> exception, but that's often a sign that something needs to be fixed with
> the parser.  In short, the solution will come from POI.
>
> Best,
>
>          Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Is there any possibility to ignore parsing errors and continue indexing?
> Right now Solr/Tika stops parsing the whole document if it hits any
> exception.
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our codebase...I
> > can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI.  Sorry, I can't
> > help with this...
> >
> > Best,
> >
> >Tim
> >
> > P.S.
> > >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set
> > of poi-ooxml-schemas-3.15.jar
> >
> >
> >
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
It depends.  We've been trying to make parsers more, erm, flexible, but there 
are some problems from which we cannot recover.

Tl;dr there isn't a short answer.  :(

My sense is that DIH/ExtractingDocumentHandler is intended to get people up and 
running with Solr easily but it is not really a great idea for production.  See 
Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/ 

As for the Tika portion... at the very least, Tika _shouldn't_ cause the 
ingesting process to crash.  At most, it should fail at the file level and not 
cause greater havoc.  In practice, if you're processing millions of files from 
the wild, you'll run into bad behavior and need to defend against permanent 
hangs, oom, memory leaks.
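
One way to get that file-level failure boundary is to run each extraction in its own child process with a time limit. What follows is only a minimal sketch in Python, not something prescribed in this thread: `extract_isolated` and the command lines are hypothetical stand-ins for whatever extractor you actually launch.

```python
import subprocess
import sys


def extract_isolated(cmd, timeout_s=60):
    """Run one extraction command in a child process.

    Returns (ok, text_or_reason). A hang or crash in the child fails only
    this one file; the calling indexer keeps going.
    """
    try:
        done = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return False, "timeout after %ss" % timeout_s
    if done.returncode != 0:
        return False, "exit code %d" % done.returncode
    return True, done.stdout


if __name__ == "__main__":
    # Hypothetical stand-ins for a real extractor: one behaves, one hangs.
    good = [sys.executable, "-c", "print('extracted text')"]
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(extract_isolated(good, timeout_s=10))
    print(extract_isolated(hung, timeout_s=1))
```

Because the child owns its own memory, an out-of-memory error or a leak there ends with the child and shows up to the caller as a failed file rather than a dead indexer.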

Also, at the least, if there's an exception with an embedded file, Tika should 
catch it and keep going with the rest of the file.  If this doesn't happen let 
us know!  We are aware that some types of embedded file stream problems were 
causing parse failures on the entire file, and we now catch those in Tika 
1.15-SNAPSHOT and don't let them percolate up through the parent file (they're 
reported in the metadata though).

Specifically for your stack traces:

For your initial problem with the missing class exceptions -- I thought we used 
to catch those in docx and log them.  I haven't been able to track this down, 
though.  I can look more if you have a need.

For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 
'PolylineTo' ", this problem might go away if we implemented a pure SAX parser 
for vsdx.  We just did this for docx and pptx (coming in 1.15) and these are 
more robust to variation because they aren't requiring a match with the ooxml 
schema.  I haven't looked much at vsdx, but that _might_ help.
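
To illustrate why a SAX-style reader tolerates this kind of variation, here is a small Python sketch. It is not Tika or POI code, and the sample XML is a simplified, hypothetical fragment rather than a real vsdx part. The handler only reacts to the element it cares about (`Text`), so a row type it has never heard of passes through without an error.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the character data of every <Text> element, ignore the rest."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_text = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "Text":
            self._in_text = True
            self._buf = []

    def characters(self, content):
        if self._in_text:
            self._buf.append(content)

    def endElement(self, name):
        if name == "Text":
            self._in_text = False
            self.chunks.append("".join(self._buf).strip())


def extract_text(xml_bytes):
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks


# Hypothetical, simplified fragment; the "PolylineTo" row is simply skipped.
SAMPLE = (
    b"<Shape><Section N='Geometry'><Row T='PolylineTo'/></Section>"
    b"<Text>span port 1</Text></Shape>"
)

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

A schema-bound parser has to map every element and attribute value onto a known type first, which is where an unexpected name can abort the whole document.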

For "TODO Support v5 Pointers", this isn't supported and would require 
contributions.  However, I agree that POI shouldn't throw a Runtime exception.  
Perhaps open an issue in POI, or maybe we should catch this special example at 
the Tika level?

For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team 
_might_ be able to modify the parser to ignore a stream if there's an 
exception, but that's often a sign that something needs to be fixed with the 
parser.  In short, the solution will come from POI.

Best,

 Tim

-----Original Message-----
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Tuesday, April 11, 2017 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Thanks for your responses.
Is there any possibility to ignore parsing errors and continue indexing?
Right now Solr/Tika stops parsing the whole document if it hits any exception.

On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:

> You might want to drop a note to the dev or user's list on Apache POI.
>
> I'm not extremely familiar with the vsd(x) portion of our code base.
>
> The first item ("PolylineTo") may be caused by a mismatch btwn your 
> doc and the ooxml spec.
>
> The second item appears to be an unsupported feature.
>
> The third item may be an area for improvement within our codebase...I 
> can't tell just from the stacktrace.
>
> You'll probably get more helpful answers over on POI.  Sorry, I can't 
> help with this...
>
> Best,
>
>Tim
>
> P.S.
> >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
>
> You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set 
> of poi-ooxml-schemas-3.15.jar
>
>
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Gytis Mikuciunas
Thanks for your responses.
Is there any possibility to ignore parsing errors and continue indexing?
Right now Solr/Tika stops parsing the whole document if it hits any exception.

On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:

> You might want to drop a note to the dev or user's list on Apache POI.
>
> I'm not extremely familiar with the vsd(x) portion of our code base.
>
> The first item ("PolylineTo") may be caused by a mismatch btwn your doc
> and the ooxml spec.
>
> The second item appears to be an unsupported feature.
>
> The third item may be an area for improvement within our codebase...I
> can't tell just from the stacktrace.
>
> You'll probably get more helpful answers over on POI.  Sorry, I can't help
> with this...
>
> Best,
>
>Tim
>
> P.S.
> >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
>
> You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set of
> poi-ooxml-schemas-3.15.jar
>
>
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
You might want to drop a note to the dev or user's list on Apache POI.

I'm not extremely familiar with the vsd(x) portion of our code base.

The first item ("PolylineTo") may be caused by a mismatch btwn your doc and the 
ooxml spec.

The second item appears to be an unsupported feature.

The third item may be an area for improvement within our codebase...I can't 
tell just from the stacktrace.

You'll probably get more helpful answers over on POI.  Sorry, I can't help with 
this...

Best,

   Tim

P.S.
>  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar

You shouldn't need both. ooxml-schemas-1.3.jar should be a superset of 
poi-ooxml-schemas-3.15.jar.




Re: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Gytis Mikuciunas
apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@4fa1aaa6\r\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\r\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)\r\n\tat
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)\r\n\t...
32 more\r\nCaused by: java.lang.ArrayIndexOutOfBoundsException:
1639168\r\n\tat
org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:161)\r\n\tat
org.apache.poi.util.LittleEndian.getUInt(LittleEndian.java:300)\r\n\tat
org.apache.poi.hdgf.streams.PointerContainingStream.<init>(PointerContainingStream.java:49)\r\n\tat
org.apache.poi.hdgf.streams.Stream.createStream(Stream.java:81)\r\n\tat
org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:88)\r\n\tat
org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:98)\r\n\tat
org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)\r\n\tat
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)\r\n\tat
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)\r\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\t...
35 more\r\n",
"metadata": [
"error-class",
"org.apache.solr.common.SolrException",
"root-error-class",
"java.lang.ArrayIndexOutOfBoundsException"
]
}
}
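
For anyone scripting around these failures: the "metadata" list in the response above alternates key, value, key, value, so the root cause can be pulled out without reading the whole trace. A minimal Python sketch follows. The enclosing "error" key is an assumption, since the start of the response is cut off here, and `error_metadata` is a hypothetical helper name.

```python
import json


def error_metadata(response_text):
    """Return the flat 'metadata' list of a JSON error response as a dict."""
    body = json.loads(response_text)
    # Assumption: the object holding "metadata" is keyed "error".
    flat = body.get("error", {}).get("metadata", [])
    # Pair up even-indexed keys with odd-indexed values.
    return dict(zip(flat[::2], flat[1::2]))


# Shortened stand-in for the response above; the trace is elided.
SAMPLE = """{"error": {"trace": "...", "metadata": [
  "error-class", "org.apache.solr.common.SolrException",
  "root-error-class", "java.lang.ArrayIndexOutOfBoundsException"]}}"""

if __name__ == "__main__":
    print(error_metadata(SAMPLE)["root-error-class"])
```

Grouping failed files by "root-error-class" is a quick way to tell how many distinct parser problems are hiding behind a batch of 500 responses.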

Regards,

Gytis

On Mon, Feb 6, 2017 at 6:54 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Shouldn't have taken you that much effort.  Sorry.
>
> Y, I should probably get around to a patch for:
> https://issues.apache.org/jira/browse/SOLR-9552
>
> Although, frankly, it might be time for Tika 1.15 shortly.
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Monday, February 6, 2017 11:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Tim, you saved my day ;)
>
> now vsdx files were indexed successfully.
>
> Thank you very much!!!
>
> summary: as a workaround I have in solr-6.4.0\contrib\extraction\lib:
>
> 1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> 2. curvesapi-1.03.jar
>
>
> So now I'm waiting for this to be included in an official version of
> Solr/Tika.
>
> Regards,
> Gytis
>
> On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.
> >
> > For now, add this jar:
> > https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
> >
> > See also [1]
> >
> > [1] http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Monday, February 6, 2017 8:19 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > sad, but didn't help.
> >
> > what I did:
> >
> > 1. stopped solr: bin\solr stop -p 80
> > 2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
> > 3. added ooxml-schemas-1.3.jar to contrib\extraction\lib
> > 4. restarted solr: bin\solr start -p 80 -m 4g
> > 5. tried again to parse vsdx file:
> >
> > java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx
> > -Drecursive=yes -jar example/exampledocs/post.jar "I:\Tools"
> >
> > SimplePostTool version 5.0.0
> > Posting files to [base] url http://localhost:80/solr/db_new02/update...
> > Entering auto mode. File endings considered are vsd,vsdx
> > Entering recursive mode, max depth=999, delay=0s
> > Indexing directory I:\Tools (1 files, depth=0)
> > POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
> > SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url:
> > http://localhost:80/solr/db_new02/update/extract?resource.name=I%3A%5CTools%5Cspan+ports.vsdx
> > SimplePostTool: WARNING: Response:
> > Error 500 Server Error
> >
> > HTTP ERROR 500
> > Problem accessing /solr/db_new02/update/extract. Reason:
> > Server Error
> > Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
>

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Shouldn't have taken you that much effort.  Sorry.

Y, I should probably get around to a patch for: 
https://issues.apache.org/jira/browse/SOLR-9552

Although, frankly, it might be time for Tika 1.15 shortly.

-----Original Message-----
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Monday, February 6, 2017 11:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Tim, you saved my day ;)

now vsdx files were indexed successfully.

Thank you very much!!!

summary: as a workaround I have in solr-6.4.0\contrib\extraction\lib:

1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
2. curvesapi-1.03.jar


So now I'm waiting for this to be included in an official version of 
Solr/Tika.

Regards,
Gytis

On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.
>
> For now, add this jar:
> https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
>
> See also [1]
>
> [1] http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Monday, February 6, 2017 8:19 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> sad, but didn't help.
>
> what I did:
>
> 1. stopped solr: bin\solr stop -p 80
> 2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
> 3. added ooxml-schemas-1.3.jar to contrib\extraction\lib
> 4. restarted solr: bin\solr start -p 80 -m 4g
> 5. tried again to parse vsdx file:
>
> java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx 
> -Drecursive=yes -jar example/exampledocs/post.jar "I:\Tools"
>
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:80/solr/db_new02/update...
> Entering auto mode. File endings considered are vsd,vsdx
> Entering recursive mode, max depth=999, delay=0s
> Indexing directory I:\Tools (1 files, depth=0)
> POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url:
> http://localhost:80/solr/db_new02/update/extract?resource.name=I%3A%5CTools%5Cspan+ports.vsdx
> SimplePostTool: WARNING: Response:
> Error 500 Server Error
>
> HTTP ERROR 500
> Problem accessing /solr/db_new02/update/extract. Reason:
> Server Error
> Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
> at java.lang.Class.getConstructor0(Unknown Source)
> at java.lang.Class.getDeclaredConstructor(Unknown Source)
> at org.apache.poi.xdgf.util.ObjectFactory.put(ObjectFactory.java:34)
> at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.<clinit>(GeometryRowFactory.java:39)
> at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
> at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
> at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
> at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
> at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
> at org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
> at org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
> at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
> at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
> at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
> at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.ja

Re: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Gytis Mikuciunas
Tim, you saved my day ;)

now vsdx files were indexed successfully.

Thank you very much!!!

summary: as a workaround I have in solr-6.4.0\contrib\extraction\lib:

1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
2. curvesapi-1.03.jar
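
In case it helps someone else, here is a rough Python sketch for checking a jar listing against this workaround. `check_extraction_lib` is a hypothetical helper, the rules only encode what came up in this thread, and the version-matching pattern is an assumption about how the jars are named.

```python
import re


def check_extraction_lib(jar_names):
    """Return warnings for a listing of contrib/extraction/lib jar names."""

    def has(artifact):
        # Match "<artifact>-<version>.jar" from the start of the name, so
        # "ooxml-schemas" does not also match "poi-ooxml-schemas".
        pattern = re.compile(re.escape(artifact) + r"-\d[\w.]*\.jar$")
        return any(pattern.match(name) for name in jar_names)

    warnings = []
    if not has("curvesapi"):
        warnings.append("curvesapi jar missing (com/graphbuilder/curve/Point)")
    if not has("ooxml-schemas"):
        warnings.append("full ooxml-schemas jar missing")
    elif has("poi-ooxml-schemas"):
        warnings.append("ooxml-schemas and poi-ooxml-schemas both present")
    return warnings


if __name__ == "__main__":
    print(check_extraction_lib(["poi-ooxml-schemas-3.15.jar"]))
    print(check_extraction_lib(["ooxml-schemas-1.3.jar", "curvesapi-1.03.jar"]))
```

To run it against a real install, pass in `os.listdir()` of the lib directory.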


So now I'm waiting for this to be included in an official version of
Solr/Tika.

Regards,
Gytis

On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.
>
> For now, add this jar:
> https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
>
> See also [1]
>
> [1] http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Monday, February 6, 2017 8:19 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> sad, but didn't help.
>
> what I did:
>
> 1. stopped solr: bin\solr stop -p 80
> 2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
> 3. added ooxml-schemas-1.3.jar to contrib\extraction\lib
> 4. restarted solr: bin\solr start -p 80 -m 4g
> 5. tried again to parse vsdx file:
>
> java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx -Drecursive=yes
> -jar example/exampledocs/post.jar "I:\Tools"
>
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:80/solr/db_new02/update...
> Entering auto mode. File endings considered are vsd,vsdx
> Entering recursive mode, max depth=999, delay=0s
> Indexing directory I:\Tools (1 files, depth=0)
> POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for url:
> http://localhost:80/solr/db_new02/update/extract?resource.name=I%3A%5CTools%5Cspan+ports.vsdx
> SimplePostTool: WARNING: Response:
> Error 500 Server Error
>
> HTTP ERROR 500
> Problem accessing /solr/db_new02/update/extract. Reason:
> Server Error
> Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
> at java.lang.Class.getConstructor0(Unknown Source)
> at java.lang.Class.getDeclaredConstructor(Unknown Source)
> at org.apache.poi.xdgf.util.ObjectFactory.put(ObjectFactory.java:34)
> at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.<clinit>(GeometryRowFactory.java:39)
> at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
> at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
> at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
> at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
> at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
> at org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
> at org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
> at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
> at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
> at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
> at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.

For now, add this jar:
https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03 

See also [1]

[1] 
http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Monday, February 6, 2017 8:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

sad, but didn't help.

what I did:

1. stopped solr: bin\solr stop -p 80
2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
3. added ooxml-schemas-1.3.jar to contrib\extraction\lib
4. restarted solr: bin\solr start -p 80 -m 4g
5. tried again to parse vsdx file:

java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx -Drecursive=yes -jar 
example/exampledocs/post.jar "I:\Tools"

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:80/solr/db_new02/update...
Entering auto mode. File endings considered are vsd,vsdx
Entering recursive mode, max depth=999, delay=0s
Indexing directory I:\Tools (1 files, depth=0)
POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for
url:
http://localhost:80/solr/db_new02/update/extract?resource.name=I%3A%5CTools%5Cspan+ports.vsdx
SimplePostTool: WARNING: Response:
Error 500 Server Error

HTTP ERROR 500
Problem accessing /solr/db_new02/update/extract. Reason:
Server Error
Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.getDeclaredConstructor(Unknown Source)
at org.apache.poi.xdgf.util.ObjectFactory.put(ObjectFactory.java:34)
at
org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.<clinit>(GeometryRowFactory.java:39)
at
org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
at
org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
at
org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
at
org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
at
org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
at
org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
at
org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
at
org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
at
org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
at
org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
at
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207)
at
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandle

Re: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Gytis Mikuciunas
.
> visio.x2012.main.ConnectsType
> at java.net.URLClassLoader.findClass(Unknown Source)
> at java.lang.ClassLoader.loadClass(Unknown Source)
> at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
> at java.lang.ClassLoader.loadClass(Unknown Source)
> ... 17 more
>
>
> So next step is to open bug ticket on tika's jira.
>
>
> And what about with your proposed workaround?
> "If this is a missing bean issue (sorry, I can't tell from your stacktrace
> which class is missing), as a temporary workaround, you can rm
> "poi-ooxml-schemas" and add the full "ooxml-schemas", and you should be
> good to go. [3]"
>
> as tika is failing, is it could help or not?
>
> Gytis
>
>
> On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a
> > nightly version of Tika [2] and run
> >
> > java -jar tika-app.jar 
> >
> > If the problem is fixed, we'll try to upgrade dependencies in Solr.
> > If it isn't fixed, please open a bug on Tika's Jira.
> >
> > If this is a missing bean issue (sorry, I can't tell from your
> > stacktrace which class is missing), as a temporary workaround, you can
> > rm "poi-ooxml-schemas" and add the full "ooxml-schemas", and you
> > should be good to go. [3]
> >
> > Cheers,
> >
> >   Tim
> >
> > [1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar
> >
> > [2] https://builds.apache.org/job/Tika-trunk/1193/org.apache.
> > tika$tika-app/artifact/org.apache.tika/tika-app/1.15-
> > 20170202.203920-124/tika-app-1.15-20170202.203920-124.jar
> >
> > [3] http://poi.apache.org/faq.html#faq-N10025
> >
> > -Original Message-
> > From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> > Sent: Friday, February 3, 2017 9:49 AM
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > This kind of information extraction comes from Apache Tika that is
> > shipped with Solr. However Solr does not ship every possible parser
> > with its installation. So, I think you are hitting Tika where it
> > manages to figure out what type of content you have, but does not have
> > (Apache POI - another O/S project) library installed.
> >
> > What you need to do is to get the additional jar from Tika/POI's
> > project/download and make it visible to Solr (probably as an extension
> > jar in a lib folder somewhere - I am a bit hazy on that for latest Solr).
> >
> > The version of Tika that Solr uses is part of the changes notes. For
> > 6.4, it is https://github.com/apache/lucene-solr/blob/releases/
> > lucene-solr/6.4.0/solr/CHANGES.txt
> > and it is Tika 1.13
> >
> > Hope it helps,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> > experienced
> >
> >
> > On 3 February 2017 at 05:57, Gytis Mikuciunas <gyt...@gmail.com> wrote:
> > > Hi,
> > >
> > >
> > > I'm using single core Solr 6.4 instance on windows server (windows
> > > server
> > > 2012 R2 standard),
> > > Java v8, (build 1.8.0_121-b13).
> > >
> > > All works more or less ok, except MS Visio vsdx files indexing.
> > >
> > >
> > > Every time it throws an error (no matters if it tries to index vsdx
> > > file or for example docx with visio diagram inside).
> > >
> > > Thx in advance for your help. If you need some additional info,
> > > please
> > ask.
> > >
> > >
> > > Error/Exception from log:
> > >
> > >
> > >  Null:java.lang.RuntimeException: java.lang.NoClassDefFoundError:
> > > Could not initialize class
> > > org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory
> > > at
> > > org.apache.poi.xdgf.usermodel.section.GeometrySection.
> > init(GeometrySection.java:55)
> > > at
> > > org.apache.poi.xdgf.usermodel.XDGFSheet.init(
> XDGFSheet.java:77)
> > > at
> > > org.apache.poi.xdgf.usermodel.XDGFShape.init(
> XDGFShape.java:113)
> > > at
> > > org.apache.poi.xdgf.usermodel.XDGFShape.init(
> XDGFShape.java:107)
> > > at
> > > org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(
> > XDGFBaseContents.java:82)
> > > at
> > > org.apach

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Ah, ConnectsType.  That's fixed in the most recent version of POI [1], and will 
soon be fixed in Tika [2].  So, no need to open a ticket on Tika's Jira.

> As Tika is failing, could it help or not?

Yes, that will absolutely help.  In your Solr contrib/extraction/lib directory,
you'll see poi-ooxml-schemas-3.xx.jar.  Remove that jar and add
ooxml-schemas.jar [3].  As documented in [4], poi-ooxml-schemas is a subset of
the much larger (complete) ooxml-schemas; ConnectsType was not in the subset,
but it _should_ be in ooxml-schemas.

Cheers,

 Tim



[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=60489
[2] https://issues.apache.org/jira/browse/TIKA-2208 
[3] https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas/1.3 
[4] http://poi.apache.org/faq.html#faq-N10025 
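
To see which jar (if any) in the lib directory actually carries ConnectsType before and after swapping the schema jars, one can scan the jars for the class entry. This is a rough sketch using only the JDK; FindClassInJars and its method names are made up for illustration, and the default directory is an assumption based on the paths mentioned in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

public class FindClassInJars {
    /** Converts a dotted class name to its entry path inside a jar. */
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns the names of all jars directly under libDir that contain the given class. */
    static List<String> jarsContaining(File libDir, String className) throws IOException {
        List<String> hits = new ArrayList<>();
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return hits; // not a directory, or not readable
        }
        String entry = toEntryName(className);
        for (File jar : jars) {
            try (JarFile jf = new JarFile(jar)) {
                if (jf.getJarEntry(entry) != null) {
                    hits.add(jar.getName());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults are assumptions; pass the lib directory and class name as arguments to override.
        File libDir = new File(args.length > 0 ? args[0] : "contrib/extraction/lib");
        String cls = args.length > 1 ? args[1]
                : "com.microsoft.schemas.office.visio.x2012.main.ConnectsType";
        List<String> hits = jarsContaining(libDir, cls);
        System.out.println(hits.isEmpty()
                ? cls + " not found in any jar under " + libDir
                : cls + " found in " + hits);
    }
}
```

If the class is not found in any jar after the swap, the replacement jar is probably not in the directory Solr loads from.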



Re: Solr 6.4. Can't index MS Visio vsdx files

2017-02-05 Thread Gytis Mikuciunas
Hi again,

I've tried with tika-app - didn't help

java -jar tika-app-1.14.jar "I:\Dat\span ports.vsdx"
Exception in thread "main" java.lang.NoClassDefFoundError:
com/microsoft/schemas/office/visio/x2012/main/ConnectsType
at com.microsoft.schemas.office.visio.x2012.main.impl.
PageContentsTypeImpl.getConnects(Unknown Source)
at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(
XDGFBaseContents.java:89)
at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(
XDGFPageContents.java:73)
at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(
XDGFPages.java:94)
at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(
XmlVisioDocument.java:108)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(
XmlVisioDocument.java:79)
at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(
XDGFVisioExtractor.java:41)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(
ExtractorFactory.java:207)
at org.apache.tika.parser.microsoft.ooxml.
OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.
parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(
CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(
CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(
AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.
visio.x2012.main.ConnectsType
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 17 more


So the next step is to open a bug ticket on Tika's Jira.


And what about your proposed workaround?
"If this is a missing bean issue (sorry, I can't tell from your stacktrace
which class is missing), as a temporary workaround, you can rm
"poi-ooxml-schemas" and add the full "ooxml-schemas", and you should be
good to go. [3]"

As Tika is failing, could it help or not?

Gytis


On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar 
>
> If the problem is fixed, we'll try to upgrade dependencies in Solr.  If it
> isn't fixed, please open a bug on Tika's Jira.
>
> If this is a missing bean issue (sorry, I can't tell from your stacktrace
> which class is missing), as a temporary workaround, you can rm
> "poi-ooxml-schemas" and add the full "ooxml-schemas", and you should be
> good to go. [3]
>
> Cheers,
>
>   Tim
>
> [1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar
>
> [2] https://builds.apache.org/job/Tika-trunk/1193/org.apache.
> tika$tika-app/artifact/org.apache.tika/tika-app/1.15-
> 20170202.203920-124/tika-app-1.15-20170202.203920-124.jar
>
> [3] http://poi.apache.org/faq.html#faq-N10025
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, February 3, 2017 9:49 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> This kind of information extraction comes from Apache Tika that is shipped
> with Solr. However Solr does not ship every possible parser with its
> installation. So, I think you are hitting Tika where it manages to figure
> out what type of content you have, but does not have (Apache POI - another
> O/S project) library installed.
>
> What you need to do is to get the additional jar from Tika/POI's
> project/download and make it visible to Solr (probably as an extension jar
> in a lib folder somewhere - I am a bit hazy on that for latest Solr).
>
> The version of Tika that Solr uses is part of the changes notes. For 6.4,
> it is https://github.com/apache/lucene-solr/blob/releases/
> lucene-solr/6.4.0/solr/CHANGES.txt
> and it is Tika 1.13
>
> Hope it helps,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 3 February 2017 at 05:57, Gytis Mikuciunas <gyt...@gmail.com> wrote:
> > Hi,
> >
> >
> > I'm using single core Solr 6.4 instance on windows server (windows
> > server
> > 2012 R2 standard),
> > Java v8, (build 1

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-03 Thread Gytis Mikuciunas
Thx guys for your ideas. I'll test and let you know.

Regards,

On Feb 3, 2017 22:31, "Allison, Timothy B." <talli...@mitre.org> wrote:

> This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar 
>
> If the problem is fixed, we'll try to upgrade dependencies in Solr.  If it
> isn't fixed, please open a bug on Tika's Jira.
>
> If this is a missing bean issue (sorry, I can't tell from your stacktrace
> which class is missing), as a temporary workaround, you can rm
> "poi-ooxml-schemas" and add the full "ooxml-schemas", and you should be
> good to go. [3]
>
> Cheers,
>
>   Tim
>
> [1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar
>
> [2] https://builds.apache.org/job/Tika-trunk/1193/org.apache.
> tika$tika-app/artifact/org.apache.tika/tika-app/1.15-
> 20170202.203920-124/tika-app-1.15-20170202.203920-124.jar
>
> [3] http://poi.apache.org/faq.html#faq-N10025
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, February 3, 2017 9:49 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> This kind of information extraction comes from Apache Tika that is shipped
> with Solr. However Solr does not ship every possible parser with its
> installation. So, I think you are hitting Tika where it manages to figure
> out what type of content you have, but does not have (Apache POI - another
> O/S project) library installed.
>
> What you need to do is to get the additional jar from Tika/POI's
> project/download and make it visible to Solr (probably as an extension jar
> in a lib folder somewhere - I am a bit hazy on that for latest Solr).
>
> The version of Tika that Solr uses is part of the changes notes. For 6.4,
> it is https://github.com/apache/lucene-solr/blob/releases/
> lucene-solr/6.4.0/solr/CHANGES.txt
> and it is Tika 1.13
>
> Hope it helps,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 3 February 2017 at 05:57, Gytis Mikuciunas <gyt...@gmail.com> wrote:
> > Hi,
> >
> >
> > I'm using single core Solr 6.4 instance on windows server (windows
> > server
> > 2012 R2 standard),
> > Java v8, (build 1.8.0_121-b13).
> >
> > All works more or less ok, except MS Visio vsdx files indexing.
> >
> >
> > Every time it throws an error (no matters if it tries to index vsdx
> > file or for example docx with visio diagram inside).
> >
> > Thx in advance for your help. If you need some additional info, please
> ask.
> >
> >
> > Error/Exception from log:
> >
> >
> >  Null:java.lang.RuntimeException: java.lang.NoClassDefFoundError:
> > Could not initialize class
> > org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory
> > at
> > org.apache.poi.xdgf.usermodel.section.GeometrySection.
> init(GeometrySection.java:55)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFSheet.init(XDGFSheet.java:77)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:113)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:107)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(
> XDGFBaseContents.java:82)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(
> XDGFMasterContents.java:66)
> > at
> > org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(
> XDGFMasters.java:101)
> > at
> > org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(
> XmlVisioDocument.java:106)
> > at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> > at
> > org.apache.poi.xdgf.usermodel.XmlVisioDocument.init(
> XmlVisioDocument.java:79)
> > at
> > org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(
> XDGFVisioExtractor.java:41)
> > at
> > org.apache.poi.extractor.ExtractorFactory.createExtractor(
> ExtractorFactory.java:212)
> > at
> > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(
> OOXMLExtractorFactory.java:86)
> > at
> > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.
> parse(OOXMLParser.java:87)
> > at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> > at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> > at
>

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-03 Thread Allison, Timothy B.
This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a nightly
version of Tika [2] and run it against the failing file:

java -jar tika-app.jar <path-to-your-file>

If the problem is fixed, we'll try to upgrade dependencies in Solr.  If it 
isn't fixed, please open a bug on Tika's Jira.

If this is a missing bean issue (sorry, I can't tell from your stacktrace which 
class is missing), as a temporary workaround, you can rm "poi-ooxml-schemas" 
and add the full "ooxml-schemas", and you should be good to go. [3]

Cheers,

  Tim

[1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar 

[2] 
https://builds.apache.org/job/Tika-trunk/1193/org.apache.tika$tika-app/artifact/org.apache.tika/tika-app/1.15-20170202.203920-124/tika-app-1.15-20170202.203920-124.jar

[3] http://poi.apache.org/faq.html#faq-N10025
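
If there are many files to check this way, the tika-app run can be scripted as a separate process with a timeout, so that one bad file cannot hang the whole check. The sketch below uses only the JDK. TikaAppCheck, buildCommand and parsesCleanly are hypothetical names, the 60-second timeout is arbitrary, and it assumes that tika-app exits with a non-zero code when parsing throws.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class TikaAppCheck {
    /** Builds the command line: java -jar <tikaAppJar> <document>. */
    static List<String> buildCommand(String tikaAppJar, String document) {
        return Arrays.asList("java", "-jar", tikaAppJar, document);
    }

    /** Runs tika-app on one document; true only if it exits with code 0 within the timeout. */
    static boolean parsesCleanly(String tikaAppJar, String document, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaAppJar, document));
        pb.redirectErrorStream(true);                        // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);  // show the child's output on our console
        Process p = pb.start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();                             // treat a hang as a failure
            return false;
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java TikaAppCheck <tika-app.jar> <file> [<file> ...]");
            return;
        }
        for (int i = 1; i < args.length; i++) {
            boolean ok = parsesCleanly(args[0], args[i], 60);
            System.out.println((ok ? "OK     " : "FAILED ") + args[i]);
        }
    }
}
```

Running the parser in its own process also keeps a crash or out-of-memory error in Tika away from the JVM that does the checking.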

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Friday, February 3, 2017 9:49 AM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

This kind of information extraction comes from Apache Tika that is shipped with 
Solr. However Solr does not ship every possible parser with its installation. 
So, I think you are hitting Tika where it manages to figure out what type of 
content you have, but does not have (Apache POI - another O/S project) library 
installed.

What you need to do is to get the additional jar from Tika/POI's 
project/download and make it visible to Solr (probably as an extension jar in a 
lib folder somewhere - I am a bit hazy on that for latest Solr).

The version of Tika that Solr uses is part of the changes notes. For 6.4, it is 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/CHANGES.txt
and it is Tika 1.13

Hope it helps,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 3 February 2017 at 05:57, Gytis Mikuciunas <gyt...@gmail.com> wrote:
> Hi,
>
>
> I'm using single core Solr 6.4 instance on windows server (windows 
> server
> 2012 R2 standard),
> Java v8, (build 1.8.0_121-b13).
>
> All works more or less ok, except MS Visio vsdx files indexing.
>
>
> Every time it throws an error (no matters if it tries to index vsdx 
> file or for example docx with visio diagram inside).
>
> Thx in advance for your help. If you need some additional info, please ask.
>
>
> Error/Exception from log:
>
>
>  Null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> Could not initialize class 
> org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory
> at
> org.apache.poi.xdgf.usermodel.section.GeometrySection.init(GeometrySection.java:55)
> at
> org.apache.poi.xdgf.usermodel.XDGFSheet.init(XDGFSheet.java:77)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:113)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:107)
> at
> org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.init(XmlVisioDocument.java:79)
> at
> org.apache.poi.xdgf.extractor.XDGFVisioExtractor.init(XDGFVisioExtractor.java:41)
> at
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> at
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
> at
> org.apache.tika

Re: Solr 6.4. Can't index MS Visio vsdx files

2017-02-03 Thread Alexandre Rafalovitch
This kind of information extraction comes from Apache Tika, which is
shipped with Solr. However, Solr does not ship every possible parser
with its installation. So I think you are hitting a case where Tika
manages to figure out what type of content you have, but does not have
the required library (Apache POI, another open-source project) installed.

What you need to do is to get the additional jar from Tika/POI's
project/download and make it visible to Solr (probably as an extension
jar in a lib folder somewhere - I am a bit hazy on that for latest
Solr).

The version of Tika that Solr uses is listed in the change notes. For
6.4, see
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/CHANGES.txt
which lists Tika 1.13.

Hope it helps,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 3 February 2017 at 05:57, Gytis Mikuciunas wrote:
> Hi,
>
>
> I'm using a single-core Solr 6.4 instance on Windows Server (Windows Server
> 2012 R2 Standard),
> Java v8 (build 1.8.0_121-b13).
>
> All works more or less OK, except indexing MS Visio vsdx files.
>
>
> Every time it throws an error (no matter whether it tries to index a vsdx file
> or, for example, a docx with a Visio diagram inside).
>
> Thanks in advance for your help. If you need any additional info, please ask.
>
>
> Error/Exception from log:
>
>
>  Null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: Could not
> initialize class
> org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory
> at
> org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55)
> at
> org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107)
> at
> org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
> at
> org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
> at
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> at
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
> at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
> at
>