I'd strongly recommend rolling your own ingest code. See Erick's
superb post: https://lucidworks.com/post/indexing-with-solrj/
You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParse
Unfortunately, It Depends(TM)*...these are the steps I take:
https://wiki.apache.org/tika/UpgradingTikaInSolr
There can be version conflicts and other awful, unforeseen things if
you don't get it right.
We're on the cusp of the release for 1.21 (I mean it this time)...I'll
upgrade Solr as soon as
Sorry, build #182: https://builds.apache.org/job/tika-branch-1x/
On Thu, May 2, 2019 at 12:01 PM Tim Allison wrote:
>
> I just pushed a fix for TIKA-2861. If you can either build locally or
> wait a few hours for Jenkins to build #182, let me know if that works
> with straight
I just pushed a fix for TIKA-2861. If you can either build locally or
wait a few hours for Jenkins to build #182, let me know if that works
with straight tika-app.jar.
On Thu, May 2, 2019 at 5:00 AM Where is Where wrote:
>
> Thank you Alex and Tim.
> I have looked at the solrconfig.xml file (I a
Related?
https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861
On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch
wrote:
> What happens when you run it against a standalone Tika (recommended option
> anyway)? Do you see the relevant fields?
>
> Not every Tika field is capture
TextField is a class name. Look in managed-schema and pick a field type by
name, e.g. text_general.
On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom
wrote:
> Hi Everyone,
>
> I'm really hating SOLR. All I want is to define a text field that data
> can be indexed into and which is searchable. Should
Should probably send this one from an anonymous email... :(
I can see from the results that elevate is working with this:
select?&defType=edismax&q=transcript&qf=my_field
However, elevate is not working with this:
select?&q={!edismax%20v=transcript%20qf=my_field}
This is Solr 4.x...y, I know..
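
A minimal sketch of how the two requests above differ, using only the JDK. The class and method names are hypothetical, and the idea that the difference in the `q` parameter is what affects elevation is an assumption, not something confirmed in this thread. In the first form `q` carries just the user's terms; in the second, `q` carries the whole local-params expression.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the two request forms from the message above
// so the contents of the q parameter can be compared side by side.
class ElevateQueryUrls {

    // Form 1: the parser is chosen with defType; q holds only the user's terms.
    static String defTypeForm(String terms, String field) {
        return "select?defType=edismax&q=" + enc(terms) + "&qf=" + enc(field);
    }

    // Form 2: the parser and its arguments are embedded in q as local params.
    static String localParamsForm(String terms, String field) {
        String q = "{!edismax v=" + terms + " qf=" + field + "}";
        return "select?q=" + enc(q);
    }

    // URL-encode a single parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(defTypeForm("transcript", "my_field"));
        System.out.println(localParamsForm("transcript", "my_field"));
    }
}
```

Printing both strings makes it easy to paste them into a browser or curl against the same core and compare the responses with debug output enabled.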
Haha, looks like Jörn just answered this... onError="skip|continue"
>greatly preferable if the indexing process could ignore exceptions
Please, no. I'm 100% behind the sentiment that DIH should gracefully
handle Tika exceptions, but the better option is to log the
exceptions, store the stacktrace
>At the end of the day it would be a much better architecture to parse the
> PDFs using plain standalone TikaServer
+1
Also, note that we added a -spawnChild switch to tika-server that will
run the server in a child process and kill+restart the child process
if there is an infinite loop/oom/segfa
All,
I don't know if this change was intended, but it feels like a bug to me...
TokenFilterFactory[] filters = new TokenFilterFactory[2];
filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP);
filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP);
TokenizerChain chain = new
User error...please ignore.
On Thu, Jan 17, 2019 at 4:36 PM Tim Allison wrote:
>
> All,
> I recently tried to upgrade a project that relies on the snapshot
> repos[1], but maven wasn't able to pull lucene-highlighter,
> lucene-test-framework, lucene-memory, among a
All,
I recently tried to upgrade a project that relies on the snapshot
repos[1], but maven wasn't able to pull lucene-highlighter,
lucene-test-framework, lucene-memory, among a few others. However,
maven was able to pull lucene-core and most other artifacts for
8.0.0-SNAPSHOT. I manually checke
Y, I tracked this down within Solr. This is a feature, not a bug. I
found a solution (set {{captureAttr}} to {{true}}):
https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
Please, though,
to ding Nuance (or tesseract), I just wish to point out that
> what to OCR is important, because OCR works well when it has good input.
>
> > -Original Message-
> > From: Tim Allison
> > Sent: Friday, November 2, 2018 11:03 AM
> > To: solr-user@lucene.apach
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We
have an open ticket to make it "just work", but we aren't there yet
(TIKA-2749).
You have to tell Tika how you want to process images from PDFs via the
tika-config.xml file.
You've seen this link in the links you mentioned:
ht
ariable to the path-variables pointing to
> > > "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish language from the CMD, but now I can't
> > > make the code work in Java, not even with default settings (which I
> > > could before). A
Tika relies on you to install tesseract and all the language libraries
you'll need.
If you can successfully call `tesseract testing/eurotext.png
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
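
A small sketch, assuming the goal is to confirm that the same tesseract call also works from inside the JVM, since the environment a Java process sees (PATH, tessdata location) can differ from the CMD session. The class and method names are hypothetical; only the command line itself comes from the message above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical check: run the same tesseract command from Java that works in CMD.
class TesseractLanguageCheck {

    // Builds: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    // Starts the process with the JVM's own environment and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> command =
                buildCommand("testing/eurotext.png", "testing/eurotext-dan", "dan");
        System.out.println("Running: " + String.join(" ", command));
        try {
            int exitCode = run(command);
            System.out.println(exitCode == 0
                    ? "tesseract ran from the JVM with -l dan"
                    : "tesseract exited with code " + exitCode);
        } catch (IOException e) {
            // Typically means the executable is not on the PATH seen by the JVM.
            System.out.println("Could not start tesseract: " + e.getMessage());
        }
    }
}
```

If this fails from the IDE but the same command succeeds in CMD, the difference is most likely in the environment variables the IDE passes to the JVM rather than in Tika.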
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) wr
r
>
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works 😊 But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.ap
how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also nee
llows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api1-7-24.jar
> Jcl-over--slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.
To follow up w Erick’s point, there are a bunch of transitive dependencies
from tika-parsers. If you aren’t using maven or similar build system to
grab the dependencies, it can be tricky to get it right. If you aren’t
using maven, and you can afford the risks of jar hell, consider using
tika-app or
This is probably caused by an encoding detection problem in Nutch and/or
Tika. If you can share the file on the Tika user’s list, I can take a look.
On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR
wrote:
> HI ALL,
>
> while i am using nutch for crawling and indexing in to solr,while storing
> data i
If you haven’t already, might want to check out maximal marginal
relevance...original paper: Carbonell and Goldstein.
On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein wrote:
> Yeah, I think your plan sounds fine.
>
> Do you have a specific use case for diversity of results. I've been
> wondering i
+1 to Shawn's and Erick's points about isolating Tika in a separate jvm.
Y, please do let us know: u...@tika.apache.org We might be able to
help out, and you, in turn, can help the community figure out what's
going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703
On Sun, Aug 5, 2018
> > the info is in our "official" place but the real story is in another
> > place,
> > > one we alternately tell people to sometimes ignore but sometimes keep
> up
> > to
> > > date? Even I'm confused.
> > >
> > > On Sat, May 26, 20
W00t! Thank you, Shawn!
> The "don't use ERH in production" response comes up frequently enough
> that I have created a wiki page we can use for responses:
>
> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>
> Tim, you are extremely well-qualified to expand and correct this page.
> Er
+1 as always to Erick’s advice. DIH is only a PoC.
We do have a DigestingParser in Tika, and when you combine that w the
RecursiveParserWrapper, you can get digests not only of the main file but
also on all embedded files/attachments...which can be pretty neat for some
use cases.
Operators are st
...@mail.gmail.com%3e
On Sat, May 26, 2018 at 6:34 AM Tim Allison wrote:
> You’ll need to provide a PasswordProvider in the ParseContext. I don’t
> think that is currently possible in the Solr integration. Please open a
> ticket if SolrJ doesn’t meet your needs.
>
> On Thu, May 24,
You’ll need to provide a PasswordProvider in the ParseContext. I don’t
think that is currently possible in the Solr integration. Please open a
ticket if SolrJ doesn’t meet your needs.
On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch
wrote:
> Hmm. If it works, then it is Tika magic. Which m