from:"Tim Allison"

Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Tim Allison

I'd strongly recommend rolling your own ingest code. See Erick's superb: https://lucidworks.com/post/indexing-with-solrj/ You can easily get attachments via the RecursiveParserWrapper, e.g. https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParse

Re: problem indexing GPS metadata for video upload

2019-05-10 Thread Tim Allison

Unfortunately, It Depends(TM)*...these are the steps I take: https://wiki.apache.org/tika/UpgradingTikaInSolr There can be version conflicts and other awful, unforeseen things if you don't get it right. We're on the cusp of the release for 1.21 (I mean it this time)...I'll upgrade Solr as soon as

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison

Sorry build #182: https://builds.apache.org/job/tika-branch-1x/ On Thu, May 2, 2019 at 12:01 PM Tim Allison wrote: > > I just pushed a fix for TIKA-2861. If you can either build locally or > wait a few hours for Jenkins to build #182, let me know if that works > with straight

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison

I just pushed a fix for TIKA-2861. If you can either build locally or wait a few hours for Jenkins to build #182, let me know if that works with straight tika-app.jar. On Thu, May 2, 2019 at 5:00 AM Where is Where wrote: > > Thank you Alex and Tim. > I have looked at the solrconfig.xml file (I a

Re: problem indexing GPS metadata for video upload

2019-05-01 Thread Tim Allison

Related? https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861 On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch wrote: > What happens when you run it against a standalone Tika (recommended option > anyway)? Do you see the relevant fields? > > Not every Tika field is capture

Re: SOLR Text Field

2019-04-06 Thread Tim Allison

TextField is a class name. Look in managed-schema and pick a field type by name, e.g. text_general On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom wrote: > Hi Everyone, > > I'm really hating SOLR. All I want is to define a text field that data > can be indexed into and which is searchable. Should

Why is elevate not working when I convert a request to local parameters?

2019-03-22 Thread Tim Allison

Should probably send this one from an anonymous email... :( I can see from the results that elevate is working with this: select?&defType=edismax&q=transcript&qf=my_field However, elevate is not working with this: select?&q={!edismax%20v=transcript%20qf=my_field} This is Solr 4.x...y, I know..

Re: Help with a DIH config file

2019-03-15 Thread Tim Allison

Haha, looks like Jörn just answered this... onError="skip|continue" >greatly preferable if the indexing process could ignore exceptions Please, no. I'm 100% behind the sentiment that DIH should gracefully handle Tika exceptions, but the better option is to log the exceptions, store the stacktrace

Re: by: java.util.zip.DataFormatException: invalid distance too far back reported by Solr API

2019-02-05 Thread Tim Allison

>At the end of the day it would be a much better architecture to parse the > PDFs using plain standalone TikaServer +1 Also, note that we added a -spawnChild switch to tika-server that will run the server in a child process and kill+restart the child process if there is an infinite loop/oom/segfa

TokenizerChain.getMultiTermAnalyzer().normalize() no longer normalizes multiterms in 8.x?!

2019-01-25 Thread Tim Allison

All, I don't know if this change was intended, but it feels like a bug to me... TokenFilterFactory[] filters = new TokenFilterFactory[2]; filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP); filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP); TokenizerChain chain = new

Re: 8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison

User error..please ignore. On Thu, Jan 17, 2019 at 4:36 PM Tim Allison wrote: > > All, > I recently tried to upgrade a project that relies on the snapshot > repos[1], but maven wasn't able to pull lucene-highlighter, > lucene-test-framework, lucene-memory, among a

8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison

All, I recently tried to upgrade a project that relies on the snapshot repos[1], but maven wasn't able to pull lucene-highlighter, lucene-test-framework, lucene-memory, among a few others. However, maven was able to pull lucene-core and most other artifacts for 8.0.0-SNAPSHOT. I manually checke

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-17 Thread Tim Allison

Y, I tracked this down within Solr. This is a feature, not a bug. I found a solution (set {{captureAttr}} to {{true}}): https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263 Please, though,

Re: Solr OCR Support

2018-11-02 Thread Tim Allison

to ding Nuance (or tesseract), I just wish to point out that > what to OCR is important, because OCR works well when it has good input. > > > -Original Message- > > From: Tim Allison > > Sent: Friday, November 2, 2018 11:03 AM > > To: solr-user@lucene.apach

Re: Solr OCR Support

2018-11-02 Thread Tim Allison

OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We have an open ticket to make it "just work", but we aren't there yet (TIKA-2749). You have to tell Tika how you want to process images from PDFs via the tika-config.xml file. You've seen this link in the links you mentioned: ht

Re: Tesseract language

2018-10-27 Thread Tim Allison

ariable to the path-variables pointing to > > > "Tesseract-OCR/tessdata". > > > > > > Now Tesseract works with Danish language from the CMD, but now I can't > > > make the code work in Java, not even with default settings (which I > > > could before). A

Re: Tesseract language

2018-10-26 Thread Tim Allison

Tika relies on you to install tesseract and all the language libraries you'll need. If you can successfully call `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan" with your code above. On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) wr

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison

r > > Hi Tim, > > It is msg files and I added tika-app-1.14.jar to the build path - and now > it works 😊 But how do I get it to read the attachments as well? > > -Original Message- > From: Tim Allison > Sent: 25. oktober 2018 21:57 > To: solr-user@lucene.ap

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison

how do I get it to read the attachments as well? > > -Original Message- > From: Tim Allison > Sent: 25. oktober 2018 21:57 > To: solr-user@lucene.apache.org > Subject: Re: Reading data using Tika to Solr > > If you’re processing actual msg (not eml), you’ll also nee

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison

llows: > > Tika-parsers-1.4.jar > Tika-core-1.4.jar > Commons-io-2.5.jar > Httpclient-4.5.3 > Httpcore-4.4.6.jar > Httpmime-4.5.3.jar > Slf4j-api1-7-24.jar > Jcl-over--slf4j-1.7.24.jar > Solr-cell-7.5.0.jar > Solr-core-7.5.0.jar > Solr-solrj-7.5.0.jar > Noggit-0.

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison

To follow up w Erick’s point, there are a bunch of transitive dependencies from tika-parsers. If you aren’t using maven or similar build system to grab the dependencies, it can be tricky to get it right. If you aren’t using maven, and you can afford the risks of jar hell, consider using tika-app or

Re: Encoding issue in solr

2018-10-05 Thread Tim Allison

This is probably caused by an encoding detection problem in Nutch and/or Tika. If you can share the file on the Tika user’s list, I can take a look. On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR wrote: > HI ALL, > > while i am using nutch for crawling and indexing in to solr,while storing > data i

Re: solr and diversification

2018-09-28 Thread Tim Allison

If you haven’t already, might want to check out maximal marginal relevance...original paper: Carbonell and Goldstein. On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein wrote: > Yeah, I think your plan sounds fine. > > Do you have a specific use case for diversity of results. I've been > wondering i
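For readers who have not met maximal marginal relevance (MMR), here is a minimal sketch of the greedy selection it describes. It is not Solr code and nothing from this thread: the class name `MmrSketch`, the method `select`, and the inputs (a precomputed query-relevance score per document and a pairwise document similarity matrix, both assumed to lie in [0, 1]) are hypothetical. Each round picks the unselected document that maximizes `lambda * relevance - (1 - lambda) * maxSimilarityToAlreadySelected`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of greedy MMR re-ranking; not part of Solr or Lucene.
final class MmrSketch {

    /**
     * Returns the indices of up to k documents chosen by greedy MMR.
     *
     * @param queryScore relevance of each document to the query (assumed in [0, 1])
     * @param sim        symmetric document-to-document similarity (assumed in [0, 1])
     * @param lambda     1.0 = pure relevance, 0.0 = pure diversity
     */
    static List<Integer> select(double[] queryScore, double[][] sim, double lambda, int k) {
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[queryScore.length];
        while (selected.size() < k && selected.size() < queryScore.length) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < queryScore.length; i++) {
                if (used[i]) {
                    continue;
                }
                // Redundancy penalty: similarity to the closest already-selected document.
                double maxSim = 0.0;
                for (int j : selected) {
                    maxSim = Math.max(maxSim, sim[i][j]);
                }
                double score = lambda * queryScore[i] - (1.0 - lambda) * maxSim;
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }
}
```

With two near-duplicate documents that both score highly and a third, less relevant but dissimilar one, `lambda = 0.5` picks one duplicate and then the dissimilar document, while `lambda = 1.0` falls back to plain relevance order.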

Re: Memory Leak in 7.3 to 7.4

2018-08-06 Thread Tim Allison

+1 to Shawn's and Erick's points about isolating Tika in a separate jvm. Y, please do let us know: u...@tika.apache.org We might be able to help out, and you, in turn, can help the community figure out what's going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703 On Sun, Aug 5, 2018

Re: Index protected zip

2018-05-29 Thread Tim Allison

t; > the info is in our "official" place but the real story is in another > > place, > > > one we alternately tell people to sometimes ignore but sometimes keep > up > > to > > > date? Even I'm confused. > > > > > > On Sat, May 26, 20

Re: Index protected zip

2018-05-26 Thread Tim Allison

W00t! Thank you, Shawn! The "don't use ERH in production" response comes up frequently enough > that I have created a wiki page we can use for responses: > > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika > > Tim, you are extremely well-qualified to expand and correct this page. > Er

Re: simple enrich uploaded binary documents with sha256 hashes

2018-05-26 Thread Tim Allison

+1 as always to Erick’s advice. DIH is only a PoC. We do have a DigestingParser in Tika, and when you combine that w the RecursiveParserWrapper, you can get digests not only of the main file but also on all embedded files/attachments...which can be pretty neat for some use cases. Operators are st
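For anyone rolling their own ingest code, a SHA-256 digest of the raw bytes can also be computed with the JDK alone before the document is sent to Solr. The sketch below is not Tika's DigestingParser and does not descend into embedded files or attachments; the class name `Sha256Sketch` and the method `sha256Hex` are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes a stream with the JDK's MessageDigest.
final class Sha256Sketch {

    /** Reads the stream to the end and returns its SHA-256 digest as lowercase hex. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to an unsigned value so negative bytes format as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

The resulting string could then be added as an ordinary field on the document in the SolrJ ingest code; the field name is up to your schema.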

Re: Index protected zip

2018-05-26 Thread Tim Allison

...@mail.gmail.com%3e On Sat, May 26, 2018 at 6:34 AM Tim Allison wrote: > You’ll need to provide a PasswordProvider in the ParseContext. I don’t > think that is currently possible in the Solr integration. Please open a > ticket if SolrJ doesn’t meet your needs. > > On Thu, May 24,

Re: Index protected zip

2018-05-26 Thread Tim Allison

You’ll need to provide a PasswordProvider in the ParseContext. I don’t think that is currently possible in the Solr integration. Please open a ticket if SolrJ doesn’t meet your needs. On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch wrote: > Hmm. If it works, then it is Tika magic. Which m

29 matches
