Generally, I'd recommend opening an issue on PDFBox's Jira with the file that
you shared. Tika uses PDFBox...if a fix can be made there, it will propagate
back through Tika to Solr.
That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
mapping for CID+71 (71) in font 505Edd
Hi Johannes,
I suspect that Scott's answer would be more efficient than the following, and
I may be misunderstanding the problem!
This type of search is supported at the Lucene level by a SpanNearQuery with
inOrder set to false.
So, how do you get a SpanQuery in Solr? You might want to l
The other thing to check is the ComplexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
It uses the Span queries to build up the query...
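For anyone landing on this thread later, an unordered SpanNearQuery at the Lucene level looks roughly like the following sketch. The field name "text", the terms, and the slop of 3 are placeholders, not anything from the original question:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class UnorderedNearExample {
    public static void main(String[] args) {
        // "apache" within 3 positions of "solr", in either order
        SpanNearQuery q = new SpanNearQuery(
                new SpanQuery[]{
                        new SpanTermQuery(new Term("text", "apache")),
                        new SpanTermQuery(new Term("text", "solr"))
                },
                3,      // slop
                false); // inOrder = false: terms may appear in either order
        System.out.println(q);
    }
}
```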
Best,
Erick
On Fri, Dec 18, 2015 at 11:23 AM, Allison, Timothy B.
wrote:
> Hi Jo
I concur with Erick and Upayavira that it is best to keep Tika in a separate
JVM...well, ideally a separate box or rack or even data center [0][1]. :)
But seriously, if you're using DIH/SolrCell, you have to configure Tika to
parse documents recursively. This was made possible in SOLR-7189...se
Might want to look into:
https://github.com/flaxsearch/luwak
or
https://github.com/OpenSextant/SolrTextTagger
-Original Message-
From: Will Moy [mailto:w...@fullfact.org]
Sent: Tuesday, January 05, 2016 11:02 AM
To: solr-user@lucene.apache.org
Subject: Many patterns against many sen
Don't know what the answer from the Solr side is, but from the Tika side, I
recently failed to get TIKA-1830 into Tika 1.12...so there may be a need to
wait for Tika 1.13.
No matter the answer on when there'll be an upgrade within Solr, I strongly
encourage carving Tika into a separate JVM/serv
Three basic options:
1) one generic field that handles non-whitespace languages and normalization
robustly (downside: no language specific stopwords, stemming, etc)
2) one field per language (hope lang id works and that you don't have many
multilingual docs)
3) one Solr core per language (ditto)
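A rough schema sketch of options 1 and 2, with invented field names and assuming the stock text_* field types that ship in Solr's sample schemas:

```xml
<!-- option 1: one generic field with language-agnostic tokenization/folding -->
<field name="text_all" type="text_icu" indexed="true" stored="true"/>

<!-- option 2: one field per language, populated after a language-id step -->
<field name="text_en" type="text_en" indexed="true" stored="true"/>
<field name="text_de" type="text_de" indexed="true" stored="true"/>
<field name="text_ja" type="text_ja" indexed="true" stored="true"/>
```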
Might not have the parsers on your path within your Solr framework?
Which tika jars are on your path?
If you want the functionality of all of Tika, use the standalone tika-app.jar,
but do not use the app in the same JVM as Solr...without a custom class loader.
The Solr team carefully prunes
ork. I'm trying to
> use Tika from my own crawler application that uses SojrJ to send the
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B.
>
> wrote:
>
>> Mig
you'll have to grab that and add it to your class path. :)
See also, very recently:
https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
S
I have one answer here [0], but I'd be interested to hear what Solr
users/devs/integrators have experienced on this topic.
[0]
http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
-Original Me
ust catch any exceptions
in my code and "do the right thing". I'm not sure I see any real benefit in yet
another JVM.
FWIW,
Erick
On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. wrote:
> I have one answer here [0], but I'd be interested to hear what Solr
> user
Ha. Spoke too soon about this thread not getting swamped.
Will add the dropwizard-tika-server to our wiki page. Thank you for the link!
As a side note, I'll submit a pull request to update the AbstractTikaResource
to avoid a potential NPE if the mime type can't be parsed...we just fixed this
Yes, this looks like a Tika feature. If you run the tika-app.jar [1] on your file
and you get the same output, then that's Tika's doing.
Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs.
-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.co
control the document corpus,
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the
>
Should have looked at how we handle PSTs before my earlier response... sorry.
What you're seeing is Tika's default treatment of embedded documents: it
concatenates them all into one string. It'll do the same thing for zip files
and other container files. The default Tika format is xhtml, and we i
ut
>> > what's going on with the offending document(s). Or record the name
>> > somewhere and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
55 AM, Allison, Timothy B.
wrote:
> Should have looked at how we handle PSTs before my earlier response... sorry.
>
> What you're seeing is Tika's default treatment of embedded documents:
> it concatenates them all into one string. It'll do the same thing for
> zip fi
> If you're going to use Tika for production indexing, you should write
> a Java program using SolrJ and Tika so that you are in complete
> control, and so Solr isn't unstable.
+1
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3cby2pr09mb11210edfcfa297528940b07c7...@by
> I can tell you that Tika is quite the resource hog. It is likely chewing up
> CPU and memory
> resources at an incredible rate, slowing down your Solr server. You
> would probably see better performance than ERH if you incorporate Tika
> and SolrJ into a client indexing program that runs o
I think Solr is using a version of Tika that predates the addition of the
Grobid parser. You'll have to add that manually somehow until Solr upgrades to
Tika 1.13 (soon to be released...I think). SOLR-8981.
-Original Message-
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
Yes. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr
I think Solr is using
Yes, integrating Tika is non-trivial. I think Uwe adds the dependencies by hand
with great care, checking the dependency tree in Maven to make sure there
aren't any conflicts.
-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, May 4, 20
Only a month late to respond, and the response likely won't help.
I agree with Shawn that Tika can be a memory hog. I try to leave 1GB per
thread, but your mileage will vary dramatically depending on your docs. I'd
expect that you'd get an OOM, though, somewhere...
There have been rare bugs i
> There are some zip files inside the directory and have been addressed
> to in the database. I'm thinking those are the one's it's jumping
> right over.
With SOLR-7189, which should have kicked in for 5.1, Tika shouldn't skip over
zip files; it should process all the contents of those zips and
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or
POI's bugzilla...especially if you can share the triggering document.
-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user
Subjec
All,
I recently took a look at the source code for TikaEntityProcessor, and I
noticed that the code is not configuring the ParseContext to have Tika's
AutoDetectParser (or any parser) parse documents recursively. That is, if you
have a zip file or any other container document, DIH's TikaEnti
Newsletter:
http://www.solr-start.com/
On 4 March 2015 at 11:06, Allison, Timothy B. wrote:
> All,
>
> I recently took a look at the source code for TikaEntityProcessor, and I
> noticed that the code is not configuring the ParseContext to have Tika's
> AutoDetectPar
What class is origQuery?
You will have to do more rewriting/calculation if you're trying to convert a
PhraseQuery to a SpanNearQuery.
If you dig around in
org.apache.lucene.search.highlight.WeightedSpanTermExtractor in the Lucene
highlighter package, you might get some inspiration.
I have a h
ou informed.
Regards,Andy
Le Mardi 7 avril 2015 20h26, "Allison, Timothy B." a
écrit :
What class is origQuery?
You will have to do more rewriting/calculation if you're trying to convert a
PhraseQuery to a SpanNearQuery.
If you dig around in
org.apache.lucene.search.hig
I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you
can -- bad things can happen if you don't [1] [2].
Erick's blog on SolrJ is fantastic. If you want to have Tika parse embedded
documents/attachments, make sure to set the parser in the ParseContext before
parsing:
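The standard Tika pattern for recursive parsing is to register the parser with itself in the ParseContext; a minimal sketch follows (the container file name is a placeholder):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveParseExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // The key step: register the parser in the ParseContext so that
        // embedded documents (zip entries, attachments) are parsed too.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(Paths.get("some-container.zip"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```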
tools.
Thanks & Regards
Vijay
On 16 April 2015 at 12:33, Allison, Timothy B. wrote:
> I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> if you can -- bad things can happen if you don't [1] [2].
>
> Erick's blog on SolrJ is fantastic. If you wan
+1
:)
>PS: one more thing - please, tell your management that you will never
>ever successfully parse all real-world PDFs and cater for that fact in your
>requirements :-)
Trung,
I haven't experimented with our OCR parser yet, but this should give a good
start: https://wiki.apache.org/tika/TikaOCR .
Have you installed tesseract?
Tika colleagues,
Any other tips? What else has to be configured and how?
-Original Message-
From: trung.ht [mailto:trung...@
I completely agree with Erick about the utility of the TermsComponent to see
what is actually being indexed. If you find problems there and if you haven't
done so already, you might also investigate further down the stack. It might
make sense to run the tika-app.jar (whichever version you are
You'll need the ComplexPhraseQueryParser [1] to handle multiterm
(wildcard/fuzzy/regex) terms in proximity. Beware, though, that it does not
perform analysis on fuzzy/wildcard terms (IIRC).
The SurroundQueryParser can probably do both phrase near phrase and multiterm
within proximity. Same warning
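In Solr syntax, the complexphrase route looks something like this (field name and terms invented for illustration):

```
q={!complexphrase inOrder=false}title:"appl* pi*"~3
```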
A classic on the importance of prototyping with your data and on the
intractability of sizing in the abstract:
https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
This might be of use:
https://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/s
This may have been an issue with Solr's wrapper of Tika. See:
https://issues.apache.org/jira/browse/SOLR-7189
-Original Message-
From: 步青云 [mailto:mailliup...@qq.com]
Sent: Wednesday, June 17, 2015 10:17 PM
To: solr-user
Subject: About indexing embed file with solr
Hello,
Could a
Unfortunately, no. We can't even do that now with straight Tika. I imagine
this is for pdf files? If you'd like to add this as a feature, please submit a
ticket over on Tika.
-Original Message-
From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM
To: solr-
M
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page
using the DIH?
On 08/07/2015 20:39, Allison, Timothy B. wrote:
> Unfortunately, no. We can't even do that now with straight Tika. I
> imagine this is for pdf files? If y
>>Wow, that code looks familiar ;)...
Erick and Paden,
The following is not the source of your problem, but I thought I'd mention it
while you reference Erick's fantastic blog post on solrj
(http://lucidworks.com/blog/indexing-with-solrj/). I tried to comment on
Erick's blog post, but someth
If I understand the question correctly...
I'm assuming you are indexing rich documents (PDF/DOC/MSG, etc) with DIH's Tika
handler. Some of those documents have attachments.
If that's the case, all of the content of embedded docs _should_[0] be
extracted, but then all of that content across the
Depending on your needs, you might want to take a look at my SpanQueryParser
(LUCENE-5205/SOLR-5410). It does not offer dtsearch syntax, but if the
SurroundQueryParser was close enough, this parser may be of use. If you need
modifications to it, let me know. I'm in the process of adding
Span
>...and I've just blogged about some of the issues one can run into with this
>sort of project, hope this is useful!
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
+1 completely non-trivial task to roll your own.
I'd add that incorporating multiterm analysis (analysis/normalization
I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According to [0]...the
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"? This
would strip out the attr_ f
Of course, for greater control over indexing (and for more robust handling of
exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/
-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.ne
ext/css" charset="utf-8"
>> media="screen" href="/wiki/modernized/css/screen.css"/>
>> <link rel="stylesheet" type="text/css" charset="utf-8"
>> media="print" href="
All,
This is a toy example, but is there a way to search for, say, stores with
sales of > $x in the last 2 months with Solr?
$x and the time frame are selected by the user at query time.
If the queries could be constrained (this is still tbd), I could see updating
"stats" fields within eac
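If the stats fields were kept updated that way, the query side would just be ordinary range queries, possibly with Solr date math (field names invented for illustration):

```
fq=sales_last_2mo:[10000 TO *]
fq=last_sale_date:[NOW-2MONTHS/DAY TO NOW]
```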
uence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-DifferencesbetweenLuceneQueryParserandtheSolrStandardQueryParser
Regards,
Alex.
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
On 3 June 2016 at 23:23, Allison, Timothy B. wrote:
> All,
> This is a toy example, b
>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should
>be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/
+1
> We tend to prefer running Tika externally as it's entirely possible
> that Tika will crash or hang with certain files - and that will
I was just looking at SolrCellBuilder, and it looks like there's an assumption
that documents will not have attachments/embedded objects. Unless I
misunderstand the code, users will not be able to search documents inside zips,
or attachments in msg/ doc/pdf/etc (cf. SOLR-7189).
Are embedded do
From: Brandon Miller [mailto:computerengineer.bran...@gmail.com]
Sent: Monday, June 20, 2016 4:12 PM
To: Allison, Timothy B. ; solr-user@lucene.apache.org
Subject: Re: SpanQuery - How to wrap a NOT subquery
Thank you, Timothy.
I have support for and am using SpanNotQuery elsewhere. Maybe there is
> dtSearch allows a user to have NOTs embedded in proximity searches.
And, if you're heading down the path of building your own queryparser to handle
dtSearch's syntax, please read and heed Charlie Hull's post:
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
See also:
http://www.fl
>Perhaps I'm misunderstanding the pre/post parameters?
Pre/post parameters: "'six' or 'seven' should not appear within $pre tokens
before 'thirty' or within $post tokens after 'thirty'."
Maybe something like this:
spanNear([
spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
spanNot(
>Awesome, 0 pre and 1 post works!
Great!
> What if I wanted to match thirty, but exclude if six or seven are included
> anywhere in the document?
Any time you need "anywhere in the document", use a "regular" query (not
SpanQuery). As you wrote initially, you can construct a BooleanQuery that
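A sketch of both patterns at the Lucene level, with a made-up field name "f": SpanNotQuery for proximity-scoped exclusion, and a regular BooleanQuery with MUST_NOT clauses for document-wide exclusion:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanNotExamples {
    public static void main(String[] args) {
        // Proximity-scoped exclusion: match "thirty" unless "six" or "seven"
        // appears within 0 tokens before or 1 token after it.
        SpanQuery include = new SpanTermQuery(new Term("f", "thirty"));
        SpanQuery exclude = new SpanOrQuery(
                new SpanTermQuery(new Term("f", "six")),
                new SpanTermQuery(new Term("f", "seven")));
        SpanQuery near = new SpanNotQuery(include, exclude, 0, 1);

        // Document-wide exclusion: wrap in a regular BooleanQuery instead.
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        b.add(near, BooleanClause.Occur.MUST);
        b.add(new TermQuery(new Term("f", "six")), BooleanClause.Occur.MUST_NOT);
        b.add(new TermQuery(new Term("f", "seven")), BooleanClause.Occur.MUST_NOT);
        Query docWide = b.build();
        System.out.println(near + "\n" + docWide);
    }
}
```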
Not that I need any other book beyond this one... but I didn't realize that the
50% discount code applies to all books in the order. :)
Congratulations, Doug and John!
-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com]
Sent: Tuesday, June 21, 2016 2:12 P
+1 to langdetect
In Tika 2.0, we're going to remove our own language detection code and allow
users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl
library or Yalder (https://github.com/kkrugler/yalder). The first two are now
available in Tika 1.13.
-Original Message--
ICU normalization (ICUFoldingFilterFactory) will at least handle "ß" -> "ss"
(IIRC) and some other language-general variants that might get you close.
There are, of course, language specific analyzers
(https://wiki.apache.org/solr/LanguageAnalysis#German) , but I don't think
they'll get you Fo
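For reference, a minimal fieldType along those lines; it requires the analysis-extras contrib (ICU jars) on the classpath, and the names are illustrative:

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```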
I don't think that's configurable at the moment.
Tika-colleagues, any recommendations?
If you're able to share the file on Tika's jira, we'd be happy to take a look.
You shouldn't be getting the zip bomb unless there is a mismatch between
opening and closing tags (which could point to a bug
Yes, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika
currently.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 22, 2016 10:42 AM
To: solr-user@lucene.apache.org
Cc: 'u...@tika.apache.org'
Subject: RE
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.
Great! Tika 1.14 is around the corner...if this is an easy fix ... :)
Thank you.
va API and examples for SolrJ and Tika to
>>> achieve that...
>>>
>>> Just wanted to confirm. I'll try to get a sample HTML yielding to
>>> this problem and attach it to Jira.
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>> Em 22-09-
>> 11133_f6ef-eutelsat.htm
>>
>> I'll try to create a ticket for this on Jira if I find its location
>> but feel free to open it yourself if you prefer, just let me know.
>>
>> Em 22-09-2016 12:33, Allison, Timothy B. escreveu:
>>>>
This doesn't answer your question, but Erick Erickson's blog on this topic is
invaluable:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
-Original Message-
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016
All,
I recently blogged about some of the work we're doing with a large scale
regression corpus to make Tika, POI and PDFBox more robust and to identify
regressions before release. If you'd like to chip in with recommendations,
requests or Hadoop/Spark clusters (why not shoot for the stars), p
AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
Great Tim.
What do I need to do to integrate it on my current installation?
On May 31, 2017 16:24, "Allison, Timothy B." wrote:
Apache Tika 1.15 is now available.
-Original Message
> There is no standard across different types of docs as to what meta-data
> field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing.
On Tika, we _try_ to normalize fields according to various standards, the most
Yeah, Chris knows a thing or two about Tika. :)
-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting
matched text with context
No intenti
> So, if you are trying to make sure your index breaks words properly on
> eastern languages, just use ICU Tokenizer.
I defer to the expertise on this list, but last I checked ICUTokenizer uses
dictionary lookup to tokenize CJK. This may work well for some tasks, but I
haven't evaluated whe
>http - however, the big advantage of doing your indexing on different machine
>is that the heavy lifting that tika does in extracting text from documents,
>finding metadata etc is not happening on the server. If the indexer crashes,
>it doesn’t affect Solr either.
+1
for what can go wrong:
Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350
On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B.
wrote:
> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the
> upgrade in Solr to Tika 1.15. Please chime in on that issue.
>
>
>4. Write an external program that fetches the file, fetches the metadata,
>combines them, and send them to Solr.
I've done this with some custom crawls. Thanks to Erick Erickson, this is a
snap:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
With the caveat that Tika should really be i
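A bare-bones sketch of that pattern, combining Tika and SolrJ in a client-side indexer; the URL, core name, file path, and field names are all placeholders:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrJIndexer {
    public static void main(String[] args) throws Exception {
        // Tika runs here, in the client JVM -- a crash or OOM never touches Solr.
        AutoDetectParser parser = new AutoDetectParser();
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            Path file = Paths.get("docs/report.pdf");
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(Files.newInputStream(file), handler, metadata,
                    new ParseContext());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.toString());
            doc.addField("title", metadata.get("dc:title"));
            doc.addField("text", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}
```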
+1
I was hoping to use this as a case for arguing for turning off an overly
aggressive stemmer, but I checked on your 10 docs and query, and David is
right, of course -- if you change the default operator to AND, you only get the
one document back that you had intended to.
I can still use this
Solrians,
We have a request to drop phonetic strings from xlsx as the default in Tika.
I'm not familiar enough with Japanese to know if users would generally expect
to be able to search on these as well as the original. The current practice is
to include them.
Any recommendations? Thank y
bq: How do I get a list of all valid field names based on the file type
bq: You don't. At least I've never found any. Plus various document formats
will allow custom meta-data fields so there's no definitive list.
It would be trivial to add field counts per mime to tika-eval. If you're
interes
https://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
-Original Message-
From: Deeksha Sharma [mailto:dsha...@flexera.co
What version of Solr are you using?
I thought this had been fixed fairly recently, but I can't quickly find the
JIRA. Let me take a look.
Best,
Tim
This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and
[2], which handles analysis of multiterms even in phra
lob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117
-----Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, October 5, 2017 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Complexphrase treats wildca
e certain of it :-)
Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?
I'm on 6.6.1, so latest on the 6.x branch.
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level
ses, but the regular multiterms
should be ok.
Still no answer for you...
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level, but
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still
> not ap
That could be it. I'm not able to reproduce this with trunk. More next week.
In trunk, if I add this to schema15.xml:
This test passes.
@Test
public void testCharFilter() {
assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
assertU(commit());
1']"
);
Notice how cr\u00E6zy* is used as a query term which mimics the behaviour I
originally reported, namely that CPQP does not analyse it because of the
wildcard and thus does not hit the charfilter from the query side.
2017-10-06 20:54 GMT+02:00 Allison, Timothy B. :
> Th
The initial question wasn't about a phrasal search, but I largely agree that
different query parsers handle the analysis chain differently for multiterms.
Yes, Porter is crazily aggressive. USE WITH CAUTION!
As has been pointed out, use the Solr admin window and the "debug" in the query
option to see
What do you suggest to use for stemming instead of "Porter"? I guess it
wasn't chosen intentionally.
In the best we trust
Georgy Nevsky
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apa
I can't recommend Doug Turnbull and John Berryman's "Relevant Search" enough on
how to layer fields...among many other great insights:
https://www.manning.com/books/relevant-search
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 9:20 AM
To:
You've gotten far better answers on this already, but you can use the
SpanNotQuery in the SpanQueryParser I maintain and have published to maven
central [1][2][3].
This does not carry out any nlp, but this would allow literal "headache (no
not)"!~5,0 -> "headache" but not if "no" or "not" shows
> I don't see any weird character when I manual copy it to any text editor.
That's a good diagnostic step, but there's a chance that Adobe (or your viewer)
got it right, and Tika or PDFBox isn't getting it right.
If you run tika-app on the file [0], do you get the same problem? See our stub
on
This came up back in September [1] and [2]. Same trigger...crazy number of
divs.
I think we could modify the AutoDetectParser to enable configuration of maximum
zip-bomb depth via tika-config.
If there's any interest in this, re-open TIKA-2091, and I'll take a look.
Best,
Tim
This is a Tika/POI problem. Please download tika-app 1.14 [1] or a nightly
version of Tika [2] and run
java -jar tika-app.jar
If the problem is fixed, we'll try to upgrade dependencies in Solr. If it
isn't fixed, please open a bug on Tika's Jira.
If this is a missing bean issue (sorry, I c
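For reference, common tika-app invocations (jar version and file name are placeholders):

```shell
# extract plain text
java -jar tika-app-1.14.jar -t problem.doc
# metadata only
java -jar tika-app-1.14.jar -m problem.doc
# JSON output for the container and each embedded document
java -jar tika-app-1.14.jar -J -t problem.doc
```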
d to
go. [3]"
as tika is failing, is it could help or not?
Gytis
On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B.
wrote:
> This is a Tika/POI problem. Please download tika-app 1.14 [1] or a
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar
>
> If th
Argh. Looks like we need to add curvesapi (BSD 3-clause) to Solr.
For now, add this jar:
https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
See also [1]
[1]
http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
-Ori
ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar 2.
curvesapi-1.03.jar
So, now I'm waiting when this will be implemented in a official version of
solr/tika.
Regards,
Gytis
On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B.
wrote:
> Argh. Looks like we need to add curvesapi
>It is *strongly* recommended to *not* use >the Tika that's embedded within
>Solr, but >instead to do the processing outside of Solr >in a program of your
>own and index the results.
+1
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C
All,
I finally got around to documenting Apache Tika's MockParser[1]. As of Tika
1.15 (unreleased), add tika-core-tests.jar to your class path, and you can
simulate:
1. Regular catchable exceptions
2. OOMs
3. Permanent hangs
This will allow you to determine if your ingest framework is robust
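From memory (check the MockParser documentation for the exact schema), a mock file looks something like the following; Tika "parses" the XML and performs the actions it specifies:

```xml
<mock>
  <!-- emit a little "extracted" content -->
  <write element="p">some text</write>
  <!-- then throw a regular catchable exception -->
  <throw class="java.io.IOException">simulated parse failure</throw>
  <!-- or simulate a permanent hang: -->
  <!-- <hang millis="30000" heavy="false" pulse_millis="100"/> -->
</mock>
```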
Please also see:
https://wiki.apache.org/tika/TikaOCR
and
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
If you have any other questions about Apache Tika and OCR, please feel free to
ask on our users list as well: u...@tika.apache.org
Cheers,
Tim
-Origin
]
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents
I tried this solution from Tim Allison, and it works.
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
Regards,
Edwin
On 27 March 2017 at 20:07, A
> Note that the OCRing is a separate task from Solr indexing, and is best done
> on separate machines.
+1
-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significant
> Also we will try to decouple tika to solr.
+1
-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com]
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing
Hi, thanks for the feedback.
Yes, it is about OOM, indeed e
Please open an issue on Tika's JIRA and share the triggering file if possible.
If we can touch the file, we may be able to recommend alternate ways to
configure Tika's encoding detectors. We just added configurability to the
encoding detectors and that will be available with Tika 1.15. [1]
We
You might want to drop a note to the dev or user's list on Apache POI.
I'm not extremely familiar with the vsd(x) portion of our code base.
The first item ("PolylineTo") may be caused by a mismatch between your doc and the
ooxml spec.
The second item appears to be an unsupported feature.
The thir