RE: Specialized Solr Application

2018-04-20 Thread Allison, Timothy B.
>1) the toughest pdfs to identify are those that are partly searchable (text) and partly not (image-based text).  However, I've found that such documents tend to exist in clusters. Agreed. We should do something better in Tika to identify image-only pages on a page-by-page basis, and
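A minimal sketch of what a page-by-page check could look like, assuming you already have the extracted text for each page as a list of strings. The function name and the `min_chars` threshold are hypothetical; this is not how Tika does (or will do) it, just one simple heuristic for flagging pages that probably need OCR.

```python
def likely_image_only_pages(page_texts, min_chars=20):
    """Return 0-based indexes of pages with almost no extracted text."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only letters and digits so stray whitespace or
        # punctuation from the extractor does not hide an empty page.
        visible = sum(1 for ch in text if ch.isalnum())
        if visible < min_chars:
            flagged.append(index)
    return flagged


pages = ["This page has a real, searchable text layer in it.", "", " \n . "]
print(likely_image_only_pages(pages))
```

The threshold is a guess and would need tuning against a sample of your own documents.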

RE: Specialized Solr Application

2018-04-18 Thread Allison, Timothy B.
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during content extraction.[1] I had two big concerns when I heard of your task: 1) image only pdfs, which can parse without problem, but which might yield 0 content. 2) emails (see, e.g. SOLR-12048) It sounds like

RE: Specialized Solr Application

2018-04-17 Thread Allison, Timothy B.
+1 to Charlie's guidance. And... >60,000 documents, mostly pdfs and emails. > However, there's a premium on precision (and recall) in searches. Please, oh, please, no matter what you're using for content/text extraction and/or OCR, run tika-eval[1] on the output to ensure that you are

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
There's also, of course, tika-server.  No matter the method, it is always best to isolate Tika to its own jvm, vm or m. -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 9, 2018 4:15 PM To: solr-user@lucene.apache.org Subject: Re: How to use Tika
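A rough sketch of the "own jvm" idea, assuming the indexer shells out to an extraction command and treats a crash or a hang as a per-file failure rather than a fatal one. The helper name `extract_isolated` and the 60-second default are hypothetical; in practice the command might be something like `java -jar tika-app.jar --text somefile.pdf`, but the demo below uses the Python interpreter itself so it runs anywhere.

```python
import subprocess
import sys


def extract_isolated(cmd, timeout_s=60):
    """Run an extraction command in a child process.

    Returns (status, text), where status is 'ok', 'error' or 'timeout'.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a
        # permanent hang costs one timeout, not the whole indexer.
        return ("timeout", "")
    if proc.returncode != 0:
        # A crash (including an OOM in the child) stays in the child.
        return ("error", proc.stderr)
    return ("ok", proc.stdout)


# Stand-in commands; swap in the real extractor invocation.
print(extract_isolated([sys.executable, "-c", "print('extracted text')"]))
print(extract_isolated([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1))
```

Starting a process per file is slow; it is only meant to show the isolation boundary, and a long-running tika-server with a watchdog is the more usual shape.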

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
+1 https://lucidworks.com/2012/02/14/indexing-with-solrj/ We should add a chatbot to the list that includes Charlie's advice and the link to Erick's blog post whenever Tika is used.  -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 9, 2018 12:44

RE: Query redg : diacritics in keyword search

2018-03-30 Thread Allison, Timothy B.
For a simple illustration of Charlie's point and a side bonus on the 78 reasons to use the ICUFoldingFilter if you happen to be processing Arabic script languages, see slides 31-33:

RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
Nice. Thank you! -Original Message- From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] Sent: Thursday, February 15, 2018 2:19 PM To: solr-user@lucene.apache.org Subject: Re: Solr search word NOT followed by another word Hi, I did not provide the right query. If you query as

RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I just updated the SpanQueryParser (LUCENE-5205) and its Solr plugin (SOLR-5410) for master and 7.2.1. What version of Solr are you using and which version of the plugin? These should be available on maven central shortly: version 7.2-0.1 (Maven coordinates org.tallison.solr:solr-5410:7.2-0.1) Or

RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I've been away from the ComplexQueryParser for a while, and I was wrong when I said in my earlier email that no currently included Solr parser generates a SpanNotQuery. You're right, Emir, that the ComplexQueryParser does generate a SpanNotQuery, and, y, I just tried this with 7.2.1, and it

RE: Solr search word NOT followed by another word

2018-02-14 Thread Allison, Timothy B.
On Mon, Feb 12, 2018 at 10:41 AM, Allison, Timothy B. <talli...@mitre.org> wrote: > That requires a SpanNotQuery. AFAIK, there is no way to do this with > the current parsers included in Solr. > > My SpanQueryParser does cover this, and I'm hoping to port it to 7.x > today or

RE: Solr search word NOT followed by another word

2018-02-12 Thread Allison, Timothy B.
That requires a SpanNotQuery. AFAIK, there is no way to do this with the current parsers included in Solr. My SpanQueryParser does cover this, and I'm hoping to port it to 7.x today or tomorrow. Syntax would be "Leonardo [da vinci]"!~0,1 https://issues.apache.org/jira/browse/LUCENE-5205
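To make the "word NOT followed by another word" idea concrete, here is a toy sketch of span-not matching over a plain token list, assuming the two numbers after `!~` are a "pre" and a "post" window in tokens. The function `span_not` is hypothetical and is not the SpanQueryParser or Lucene code; it only models the matching rule: keep a hit for the include term unless an excluded phrase overlaps the window around it.

```python
def span_not(tokens, include, excludes, pre=0, post=0):
    """Return positions of `include` with no excluded phrase nearby.

    excludes is a list of phrases, each a list of tokens.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != include:
            continue
        # Half-open window [lo, hi) around the include term.
        lo, hi = i - pre, i + 1 + post
        blocked = False
        for phrase in excludes:
            n = len(phrase)
            for s in range(len(tokens) - n + 1):
                # Excluded phrase at [s, s + n) overlaps the window.
                if tokens[s:s + n] == phrase and s < hi and s + n > lo:
                    blocked = True
        if not blocked:
            hits.append(i)
    return hits


text = "leonardo da vinci painted and leonardo dicaprio acted".split()
print(span_not(text, "leonardo", [["da", "vinci"]], pre=0, post=1))
```

With pre=0 and post=1 only the second "leonardo" survives, since the first is immediately followed by "da vinci".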

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
l and John Berryman's "Relevant Search" enough on how to layer fields...among many other great insights: https://www.manning.com/books/relevant-search -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, November 30, 2017 9:20 AM To: solr-user@lucen

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
g instead of "Porter" ? I guess, it wasn't chosen intentionally. In the best we trust Georgy Nevsky -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, November 30, 2017 8:25 AM To: solr-user@lucene.apache.org Subject: RE: Solr Wildcard Searc

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
The initial question wasn't about a phrasal search, but I largely agree that diff q parsers handle the analysis chain differently for multiterms. Yes, Porter is crazily aggressive. USE WITH CAUTION! As has been pointed out, use the Solr admin window and the "debug" in the query option to see

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-09 Thread Allison, Timothy B.
r\u00E6zy* is used as a query term which mimics the behaviour I originally reported, namely that CPQP does not analyse it because of the wildcard and thus does not hit the charfilter from the query side. 2017-10-06 20:54 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>: > That could

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-06 Thread Allison, Timothy B.
That could be it. I'm not able to reproduce this with trunk. More next week. In trunk, if I add this to schema15.xml: This test passes. @Test public void testCharFilter() { assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
he regular multiterms should be ok. Still no answer for you... 2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>: > There's every chance that I'm missing something at the Solr level, but > it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
of it :-) Do you remember any reason that multi term analysis is not happening in ComplexPhraseQueryParser? I'm on 6.6.1, so latest on the 6.x branch. 2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>: > There's every chance that I'm missing something at the S

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117 -----Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, October 5, 2017 8:02 AM To: solr-user@lucene.apache.org Subject: RE: Complexphrase treats wildcards differ

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
What version of Solr are you using? I thought this had been fixed fairly recently, but I can't quickly find the JIRA. Let me take a look. Best, Tim This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and [2], which handles analysis of multiterms even in

RE: DataImport Handler Out of Memory

2017-09-27 Thread Allison, Timothy B.
https://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F -Original Message- From: Deeksha Sharma

RE: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Allison, Timothy B.
bq: How do I get a list of all valid field names based on the file type bq: You don't. At least I've never found any. Plus various document formats will allow custom meta-data fields so there's no definitive list. It would be trivial to add field counts per mime to tika-eval. If you're

TIKA-2440 Remove Furigana/phonetic as default for xlsx?

2017-08-09 Thread Allison, Timothy B.
Solrians, We have a request to drop phonetic strings from xlsx as the default in Tika. I'm not familiar enough with Japanese to know if users would generally expect to be able to search on these as well as the original. The current practice is to include them. Any recommendations? Thank

RE: Arabic words search in solr

2017-08-02 Thread Allison, Timothy B.
+1 I was hoping to use this as a case for arguing for turning off an overly aggressive stemmer, but I checked on your 10 docs and query, and David is right, of course -- if you change the default operator to AND, you only get the one document back that you had intended to. I can still use

RE: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Allison, Timothy B.
>4. Write an external program that fetches the file, fetches the metadata, >combines them, and send them to Solr. I've done this with some custom crawls. Thanks to Erick Erickson, this is a snap: https://lucidworks.com/2012/02/14/indexing-with-solrj/ With the caveat that Tika should really be

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-07-03 Thread Allison, Timothy B.
o http://localhost:80/solr/v20170703xxx/update... Time spent: 0:00:00.350 On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > https://issues.apache.org/jira/browse/SOLR-10335 is tracking the > upgrade in Solr to Tika 1.15. Please chime in on that issue. >

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
>http - however, the big advantage of doing your indexing on different machine >is that the heavy lifting that tika does in extracting text from documents, >finding metadata etc is not happening on the server. If the indexer crashes, >it doesn’t affect Solr either. +1 for what can go wrong:

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Allison, Timothy B.
> So, if you are trying to make sure your index breaks words properly on > eastern languages, just use ICU Tokenizer. I defer to the expertise on this list, but last I checked ICUTokenizer uses dictionary lookup to tokenize CJK. This may work well for some tasks, but I haven't evaluated

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
Yeah, Chris knows a thing or two about Tika. :) -Original Message- From: ZiYuan [mailto:ziyu...@gmail.com] Sent: Tuesday, June 20, 2017 8:00 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context No

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Allison, Timothy B.
> There is no standard across different types of docs as to what meta-data > field is >> included. PDF might have a "last_edited" field. Word might have a >> "last_modified" field where the two mean the same thing. On Tika, we _try_ to normalize fields according to various standards, the most

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-06-05 Thread Allison, Timothy B.
AM To: solr-user@lucene.apache.org Subject: RE: Solr 6.4. Can't index MS Visio vsdx files Great Tim. What do I need to do to integrate it on my current installation? On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote: Apache Tika 1.15 is now available. -

Re: XLSB files not indexed

2017-05-31 Thread Allison, Timothy B.
Apache Tika version 1.15 now handles XLSB files. The behavior described below is the expected behavior if a file type is identified but there is no parser to handle that file type. A little late to the game, I admit... :) Cheers, Tim From: Roland Everaert

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-31 Thread Allison, Timothy B.
Apache Tika 1.15 is now available. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 7:45 AM To: solr-user@lucene.apache.org Subject: RE: Solr 6.4. Can't index MS Visio vsdx files Probably better to ask on the Tika list. We'll push

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-09 Thread Allison, Timothy B.
. On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > The release candidate for POI was just cut...unfortunately, I think > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening > that! > > That'll be done

RE: keyword-in-content for PDF document

2017-04-13 Thread Allison, Timothy B.
If you don't care about sentence boundaries, but just want a window around target terms and you want concordance functionality (sort before, after, etc), you might check out LUCENE-5317, which is available as a standalone jar on my github site [1] and is available through maven central. Using

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-12 Thread Allison, Timothy B.
, and from info that I found in google it could solve my issues. On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > It depends. We've been trying to make parsers more, erm, flexible, > but there are some problems from which we cannot recover. > >

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
a stops parsing whole document if it finds any exception On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote: > You might want to drop a note to the dev or user's list on Apache POI. > > I'm not extremely familiar with the vsd(x) portion of our code base. >

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
You might want to drop a note to the dev or user's list on Apache POI. I'm not extremely familiar with the vsd(x) portion of our code base. The first item ("PolylineTo") may be caused by a mismatch btwn your doc and the ooxml spec. The second item appears to be an unsupported feature. The

RE: Japanese character is garbled when using TikaEntityProcessor

2017-04-10 Thread Allison, Timothy B.
Please open an issue on Tika's JIRA and share the triggering file if possible. If we can touch the file, we may be able to recommend alternate ways to configure Tika's encoding detectors. We just added configurability to the encoding detectors and that will be available with Tika 1.15. [1]

RE: Solr performance issue on indexing

2017-04-04 Thread Allison, Timothy B.
> Also we will try to decouple tika to solr. +1 -Original Message- From: tstusr [mailto:ulfrhe...@gmail.com] Sent: Friday, March 31, 2017 4:31 PM To: solr-user@lucene.apache.org Subject: Re: Solr performance issue on indexing Hi, thanks for the feedback. Yes, it is about OOM, indeed

RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Allison, Timothy B.
> Note that the OCRing is a separate task from Solr indexing, and is best done > on separate machines. +1 -Original Message- From: Rick Leir [mailto:rl...@leirtech.com] Sent: Thursday, March 30, 2017 7:37 AM To: solr-user@lucene.apache.org Subject: Re: Indexing speed reduced

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
, March 27, 2017 11:48 AM To: solr-user@lucene.apache.org Subject: Re: Index scanned documents I tried this solution from Tim Allison, and it works. http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 27 March 2017 at 20:07, Allison, Timothy

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
Please also see: https://wiki.apache.org/tika/TikaOCR and https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR If you have any other questions about Apache Tika and OCR, please feel free to ask on our users list as well: u...@tika.apache.org Cheers, Tim

Testing an ingest framework that uses Apache Tika

2017-02-16 Thread Allison, Timothy B.
All, I finally got around to documenting Apache Tika's MockParser[1]. As of Tika 1.15 (unreleased), add tika-core-tests.jar to your class path, and you can simulate: 1. Regular catchable exceptions 2. OOMs 3. Permanent hangs This will allow you to determine if your ingest framework is robust

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

2017-02-08 Thread Allison, Timothy B.
>It is *strongly* recommended to *not* use the Tika that's embedded within Solr, but instead to do the processing outside of Solr in a program of your own and index the results. +1

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
-1.3.jar instead of poi-ooxml-schemas-3.15.jar 2. curvesapi-1.03.jar So, now I'm waiting when this will be implemented in an official version of solr/tika. Regards, Gytis On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > Argh. Looks like we need to add

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Argh. Looks like we need to add curvesapi (BSD 3-clause) to Solr. For now, add this jar: https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03 See also [1] [1] http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
ling, is it could help or not? Gytis On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > This is a Tika/POI problem. Please download tika-app 1.14 [1] or a > nightly version of Tika [2] and run > > java -jar tika-app.jar > > If the probl

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-03 Thread Allison, Timothy B.
This is a Tika/POI problem. Please download tika-app 1.14 [1] or a nightly version of Tika [2] and run java -jar tika-app.jar If the problem is fixed, we'll try to upgrade dependencies in Solr. If it isn't fixed, please open a bug on Tika's Jira. If this is a missing bean issue (sorry, I

RE: Zip Bomb Exception in HTML File

2017-01-04 Thread Allison, Timothy B.
This came up back in September [1] and [2]. Same trigger...crazy number of divs. I think we could modify the AutoDetectParser to enable configuration of maximum zip-bomb depth via tika-config. If there's any interest in this, re-open TIKA-2091, and I'll take a look. Best, Tim

RE: Unicode Character Problem

2016-12-12 Thread Allison, Timothy B.
> I don't see any weird character when I manual copy it to any text editor. That's a good diagnostic step, but there's a chance that Adobe (or your viewer) got it right, and Tika or PDFBox isn't getting it right. If you run tika-app on the file [0], do you get the same problem? See our stub

RE: negation search help

2016-11-23 Thread Allison, Timothy B.
You've gotten far better answers on this already, but you can use the SpanNotQuery in the SpanQueryParser I maintain and have published to maven central [1][2][3]. This does not carry out any nlp, but this would allow literal "headache (no not)"!~5,0 -> "headache" but not if "no" or "not"

Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All, I recently blogged about some of the work we're doing with a large scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars),

RE: SOLR Sizing

2016-10-03 Thread Allison, Timothy B.
This doesn't answer your question, but Erick Erickson's blog on this topic is invaluable: https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ -Original Message- From: Vasu Y [mailto:vya...@gmail.com] Sent: Monday, October 3, 2016

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Archives/edgar/data/1472033/000119380513001310/e6 >> 11133_f6ef-eutelsat.htm >> >> I'll try to create a ticket for this on Jira if I find its location >> but feel free to open it yourself if you prefer, just let me know. >> >> Em 22-09-2016 12:33, Allison, T

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
rJ and Tika to >>> achieve that... >>> >>> Just wanted to confirm. I'll try to get a sample HTML yielding to >>> this problem and attach it to Jira. >>> >>> Thanks, >>> Rodrigo. >>> >>> Em 22-09-2016 11:48, Allison, Ti

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
> I'll try to get a sample HTML yielding to this problem and attach it to Jira. Great! Tika 1.14 is around the corner...if this is an easy fix ... :) Thank you.

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Y, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika currently. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, September 22, 2016 10:42 AM To: solr-user@lucene.apache.org Cc: 'u...@tika.apache.org' <u...@tika.apache.

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
I don't think that's configurable at the moment. Tika-colleagues, any recommendations? If you're able to share the file on Tika's jira, we'd be happy to take a look. You shouldn't be getting the zip bomb unless there is a mismatch between opening and closing tags (which could point to a bug

RE: Solr 6.1 :: language specific analysis

2016-08-10 Thread Allison, Timothy B.
ICU normalization (ICUFoldingFilterFactory) will at least handle "ß" -> "ss" (IIRC) and some other language-general variants that might get you close. There are, of course, language specific analyzers (https://wiki.apache.org/solr/LanguageAnalysis#German) , but I don't think they'll get you
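As a rough illustration of the kind of language-general folding being described, the sketch below uses only Python's standard library: Unicode case folding (which maps "ß" to "ss") followed by stripping combining marks. The name `rough_fold` is hypothetical, and this is only an approximation for intuition; it is not ICU and does not reproduce everything ICUFoldingFilterFactory does.

```python
import unicodedata


def rough_fold(text):
    """Approximate search-time folding: case fold, then drop accents."""
    # casefold() applies Unicode case folding, e.g. "ß" becomes "ss".
    folded = text.casefold()
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", folded)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(rough_fold("Straße"))
print(rough_fold("Crème"))
```

Applying the same folding at index time and query time is what lets "Strasse" and "Straße" match each other.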

RE: Automatic Language Identification

2016-07-01 Thread Allison, Timothy B.
+1 to langdetect In Tika 2.0, we're going to remove our own language detection code and allow users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl library or Yalder (https://github.com/kkrugler/yalder). The first two are now available in Tika 1.13. -Original

RE: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-21 Thread Allison, Timothy B.
Not that I need any other book beyond this one... but I didn't realize that the 50% discount code applies to all books in the order. :) Congratulations, Doug and John! -Original Message- From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com] Sent: Tuesday, June 21, 2016 2:12

RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
>Awesome, 0 pre and 1 post works! Great! > What if I wanted to match thirty, but exclude if six or seven are included > anywhere in the document? Any time you need "anywhere in the document", use a "regular" query (not SpanQuery). As you wrote initially, you can construct a BooleanQuery that

RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
>Perhaps I'm misunderstanding the pre/post parameters? Pre/post parameters: " 'six' or 'seven' should not appear $pre tokens before 'thirty' or $post tokens after 'thirty' Maybe something like this: spanNear([ spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),

RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
> dtSearch allows a user to have NOTs embedded in proximity searches. And, if you're heading down the path of building your own queryparser to handle dtSearch's syntax, please read and heed Charlie Hull's post: http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/ See also:

RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
From: Brandon Miller [mailto:computerengineer.bran...@gmail.com] Sent: Monday, June 20, 2016 4:12 PM To: Allison, Timothy B. <talli...@mitre.org>; solr-user@lucene.apache.org Subject: Re: SpanQuery - How to wrap a NOT subquery Thank you, Timothy. I have support for and am using SpanNo

Morphlines.cell and attachments in complex docs?

2016-06-17 Thread Allison, Timothy B.
I was just looking at SolrCellBuilder, and it looks like there's an assumption that documents will not have attachments/embedded objects. Unless I misunderstand the code, users will not be able to search documents inside zips, or attachments in msg/ doc/pdf/etc (cf. SOLR-7189). Are embedded

RE: Bypassing ExtractingRequestHandler

2016-06-13 Thread Allison, Timothy B.
>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should >be straightforward: http://searchhub.org/2012/02/14/indexing-with-solrj/ +1 > We tend to prefer running Tika externally as it's entirely possible > that Tika will crash or hang with certain files - and that

RE: find stores with sales of > $x in last 2 months ?

2016-06-06 Thread Allison, Timothy B.
tandardQueryParser-DifferencesbetweenLuceneQueryParserandtheSolrStandardQueryParser Regards, Alex. Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 3 June 2016 at 23:23, Allison, Timothy B. <talli...@mitre.org> wrote: > All, > This is a t

find stores with sales of > $x in last 2 months ?

2016-06-03 Thread Allison, Timothy B.
All, This is a toy example, but is there a way to search for, say, stores with sales of > $x in the last 2 months with Solr? $x and the time frame are selected by the user at query time. If the queries could be constrained (this is still tbd), I could see updating "stats" fields within

RE: Metadata and HTML ending up in searchable text

2016-05-31 Thread Allison, Timothy B.
media="screen" href="/wiki/modernized/css/screen.css"/ >> link rel="stylesheet" type="text/css" charset="utf-8" >> media="print" href="/wiki/modernized/css/print.css"/... >> >

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
Of course, for greater control over indexing (and for more robust handling of exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ: http://searchhub.org/2012/02/14/indexing-with-solrj/ -Original Message- From: Simon Blandford

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
I'm only minimally familiar with Solr Cell, but... 1) It looks like you aren't setting extractFormat=text. According to [0]...the default is xhtml which will include a bunch of the metadata. 2) is there an attr_* dynamic field in your index with type="ignored"? This would strip out the attr_

RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
>...and I've just blogged about some of the issues one can run into with this >sort of project, hope this is useful! http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/ +1 completely non-trivial task to roll your own. I'd add that incorporating multiterm analysis (analysis/normalization

RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
Depending on your needs, you might want to take a look at my SpanQueryParser (LUCENE-5205/SOLR-5410). It does not offer dtsearch syntax, but if the SurroundQueryParser was close enough, this parser may be of use. If you need modifications to it, let me know. I'm in the process of adding

RE: Indexing a (File attached to a document)

2016-05-12 Thread Allison, Timothy B.
If I understand the question correctly... I'm assuming you are indexing rich documents (PDF/DOC/MSG, etc) with DIH's Tika handler. Some of those documents have attachments. If that's the case, all of the content of embedded docs _should_[0] be extracted, but then all of that content across

RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y, integrating Tika is non-trivial. I think Uwe adds the dependencies with great care by hand by carefully looking at the dependency tree in Maven and making sure there weren't any conflicts. -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Wednesday, May 4,

RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, May 4, 2016 10:29 AM To: solr-user@lucene.apache.org Subject: RE: Integrating grobid with Tika in solr I think Solr is using

RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
I think Solr is using a version of Tika that predates the addition of the Grobid parser. You'll have to add that manually somehow until Solr upgrades to Tika 1.13 (soon to be released...I think). SOLR-8981. -Original Message- From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]

RE: Overall large size in Solr across collections

2016-04-26 Thread Allison, Timothy B.
> I can tell you that Tika is quite the resource hog. It is likely chewing up > CPU and memory > resources at an incredible rate, slowing down your Solr server. You > would probably see better performance than ERH if you incorporate Tika > and SolrJ into a client indexing program that runs

RE: Indexing docuements in Solr 5 Using Tika extraction error

2016-03-28 Thread Allison, Timothy B.
> If you're going to use Tika for production indexing, you should write a Java program using SolrJ and Tika so that you are in complete control, and so Solr isn't unstable. +1

RE: outlook email file pst extraction problem

2016-03-02 Thread Allison, Timothy B.
, Allison, Timothy B. <talli...@mitre.org> wrote: > Should have looked at how we handle psts before earlier response...sorry. > > What you're seeing is Tika's default treatment of embedded documents, > it concatenates them all into one string. It'll do the same thing for >

RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Should have looked at how we handle psts before earlier response...sorry. What you're seeing is Tika's default treatment of embedded documents: it concatenates them all into one string. It'll do the same thing for zip files and other container files. The default Tika format is xhtml, and we

RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Y, this looks like a Tika feature. If you run the tika-app.jar [1]on your file and you get the same output, then that's Tika's doing. Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs. -Original Message- From: Sreenivasa Kallu

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
and _especially_ where you don't control the document corpus, > you have to build something far more tolerant as per Tim's comments. > > FWIW, > Erick > > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > > I completely agree o

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
just used tika from > my own jvm. > > On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > >> x-post to Tika user's >> >> Y and n. If you run tika app as: >> >> java -jar tika-app.jar >> >> It r

RE: How is Tika used with Solr

2016-02-10 Thread Allison, Timothy B.
catch any exceptions in my code and "do the right thing". I'm not sure I see any real benefit in yet another JVM. FWIW, Erick On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > I have one answer here [0], but I'd be interested to hear what Solr

RE: How is Tika used with Solr

2016-02-10 Thread Allison, Timothy B.
Ha. Spoke too soon about this thread not getting swamped. Will add the dropwizard-tika-server to our wiki page. Thank you for the link! As a side note, I'll submit a pull request to update the AbstractTikaResource to avoid a potential NPE if the mime type can't be parsed...we just fixed this

RE: How is Tika used with Solr

2016-02-09 Thread Allison, Timothy B.
I have one answer here [0], but I'd be interested to hear what Solr users/devs/integrators have experienced on this topic. [0] http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E -Original

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
framework. I'm trying to > use Tika from my own crawler application that uses SolrJ to send the > raw text to Solr for indexing. > > What is it that I am missing?! > > Steve > > On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B. > <talli...@mitre.org> > wrote:

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
ave to grab that and add it to your class path. :) See also, very recently: https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, F

RE: Multi-lingual search

2016-02-02 Thread Allison, Timothy B.
Three basic options: 1) one generic field that handles non-whitespace languages and normalization robustly (downside: no language specific stopwords, stemming, etc) 2) one field per language (hope lang id works and that you don't have many multilingual docs) 3) one Solr core for language
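
Options 1 and 2 can be combined on the indexing side: route text to a per-language field when language id gives a known answer, and fall back to the generic field otherwise. A minimal sketch, assuming hypothetical field names (`text_en`, `text_de`, `text_ja`, `text_general`) that a real schema would have to define with suitable analyzers; `route_text` is an illustrative helper, not a Solr API.

```python
# Hypothetical field names; a real schema would give each its own analyzer chain.
LANG_FIELDS = {"en": "text_en", "de": "text_de", "ja": "text_ja"}
GENERIC_FIELD = "text_general"


def route_text(doc_id, text, lang=None):
    """Build a Solr-style document dict, picking the text field by language."""
    # Unknown or missing language codes fall back to the generic field.
    field = LANG_FIELDS.get(lang, GENERIC_FIELD)
    return {"id": doc_id, "lang": lang or "unknown", field: text}


print(route_text("1", "Guten Tag", lang="de"))
print(route_text("2", "text in an unsupported language", lang="xx"))
```

The fallback keeps documents searchable when language id fails, at the cost of losing language-specific stemming and stopwords for those documents.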

RE: Using Tika that comes with Solr 5.2

2016-02-02 Thread Allison, Timothy B.
Might not have the parsers on your path within your Solr framework? Which tika jars are on your path? If you want the functionality of all of Tika, use the standalone tika-app.jar, but do not use the app in the same JVM as Solr...without a custom class loader. The Solr team carefully prunes
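
Running the standalone tika-app.jar in its own JVM can be as simple as spawning a child process per file from the crawler. A minimal sketch, assuming `java` is on the PATH and using tika-app's `-t` (plain text) and `-J` (recursive JSON) options; the function names, the jar path, and the 60-second timeout are illustrative choices, not recommendations from the thread.

```python
import subprocess


def tika_command(jar_path, file_path, recursive_json=False):
    """Build the argument list for one tika-app run in a separate JVM."""
    flag = "-J" if recursive_json else "-t"
    return ["java", "-jar", jar_path, flag, file_path]


def extract_text(jar_path, file_path, timeout_s=60):
    """Run Tika in a child JVM; return its stdout, or None on any failure."""
    try:
        result = subprocess.run(
            tika_command(jar_path, file_path),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if a parse hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout


print(tika_command("tika-app.jar", "report.pdf"))
```

A hung or crashed parse then costs one child process and one `None`, rather than taking down the JVM that hosts Solr. Starting a JVM per file is slow, so for larger corpora tika-server or a batch mode is likely the better fit.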

RE: When does Solr plan to update its embedded Apache Tika version?

2016-02-02 Thread Allison, Timothy B.
Don't know what the answer from the Solr side is, but from the Tika side, I recently failed to get TIKA-1830 into Tika 1.12...so there may be a need to wait for Tika 1.13. No matter the answer on when there'll be an upgrade within Solr, I strongly encourage carving Tika into a separate

RE: Many patterns against many sentences, storing all results

2016-01-05 Thread Allison, Timothy B.
Might want to look into: https://github.com/flaxsearch/luwak or https://github.com/OpenSextant/SolrTextTagger -Original Message- From: Will Moy [mailto:w...@fullfact.org] Sent: Tuesday, January 05, 2016 11:02 AM To: solr-user@lucene.apache.org Subject: Many patterns against many

RE: Unable to extract images content (OCR) from PDF files using Solr

2016-01-05 Thread Allison, Timothy B.
I concur with Erick and Upayavira that it is best to keep Tika in a separate JVM...well, ideally a separate box or rack or even data center [0][1]. :) But seriously, if you're using DIH/SolrCell, you have to configure Tika to parse documents recursively. This was made possible in

RE: Permutations of entries in a multivalued field

2015-12-18 Thread Allison, Timothy B.
Hi Johannes, I suspect that Scott's answer would be more efficient than the following, and I may be misunderstanding the problem! This type of search is supported at the Lucene level by a SpanNearQuery with inOrder set to false. So, how do you get a SpanQuery in Solr? You might want to
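
To make the idea of unordered proximity matching concrete, here is a simplified model of it: every term must occur, in any order, with at most `slop` unmatched positions inside the span that covers them. This is an illustration of the semantics only, not Lucene's implementation, and `unordered_near` is a hypothetical helper rather than a Lucene or Solr API.

```python
from itertools import product


def unordered_near(tokens, terms, slop):
    """True if all terms occur within one span, in any order, with <= slop gaps."""
    positions = []
    for term in terms:
        hits = [i for i, tok in enumerate(tokens) if tok == term]
        if not hits:
            return False  # a required term is missing entirely
        positions.append(hits)
    # Brute force: try every way of picking one position per term.
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # the same token position cannot satisfy two terms
        gaps = (max(combo) - min(combo) + 1) - len(combo)
        if gaps <= slop:
            return True
    return False


tokens = "red green blue".split()
print(unordered_near(tokens, ["blue", "red", "green"], slop=0))
print(unordered_near("red x green".split(), ["green", "red"], slop=0))
```

With `slop=0` this accepts any permutation of adjacent terms, which is the behavior the question about permuted entries is after.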

RE: Permutations of entries in a multivalued field

2015-12-18 Thread Allison, Timothy B.
ntries in a multivalued field The other thing to check is the ComplexPhraseQueryParser, see: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser It uses the Span queries to build up the query... Best, Erick On Fri, Dec 18, 2015 at 11:23 AM, Allison, Tim

RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.
Generally, I'd recommend opening an issue on PDFBox's Jira with the file that you shared. Tika uses PDFBox...if a fix can be made there, it will propagate back through Tika to Solr. That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode mapping for CID+71 (71) in font

RE: tikaparser docx file fails with exception

2015-11-06 Thread Allison, Timothy B.
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or POI's bugzilla...especially if you can share the triggering document. -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Thursday, November 05, 2015 6:05 PM To: solr-user
