Generally, I'd recommend opening an issue on PDFBox's Jira with the file that
you shared. Tika uses PDFBox...if a fix can be made there, it will propagate
back through Tika to Solr.
That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
mapping for CID+71 (71) in font 505Edd
Hi Johannes,
I suspect that Scott's answer would be more efficient than the following, and
I may be misunderstanding the problem!
This type of search is supported at the Lucene level by a SpanNearQuery with
inOrder set to false.
So, how do you get a SpanQuery in Solr? You might want to l
The other thing to check is the ComplexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
It uses the Span queries to build up the query...
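For anyone landing on this thread later, an unordered SpanNearQuery at the Lucene level looks roughly like the following sketch. The field name "text", the terms, and the slop of 3 are placeholders, not anything from the original question:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class UnorderedNearExample {
    public static void main(String[] args) {
        // "apache" within 3 positions of "solr", in either order
        SpanNearQuery q = new SpanNearQuery(
                new SpanQuery[]{
                        new SpanTermQuery(new Term("text", "apache")),
                        new SpanTermQuery(new Term("text", "solr"))
                },
                3,      // slop
                false); // inOrder = false: terms may appear in either order
        System.out.println(q);
    }
}
```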
Best,
Erick
On Fri, Dec 18, 2015 at 11:23 AM, Allison, Timothy B.
wrote:
> Hi Jo
I concur with Erick and Upayavira that it is best to keep Tika in a separate
JVM...well, ideally a separate box or rack or even data center [0][1]. :)
But seriously, if you're using DIH/SolrCell, you have to configure Tika to
parse documents recursively. This was made possible in SOLR-7189...se
Might want to look into:
https://github.com/flaxsearch/luwak
or
https://github.com/OpenSextant/SolrTextTagger
-Original Message-
From: Will Moy [mailto:w...@fullfact.org]
Sent: Tuesday, January 05, 2016 11:02 AM
To: solr-user@lucene.apache.org
Subject: Many patterns against many sen
Don't know what the answer from the Solr side is, but from the Tika side, I
recently failed to get TIKA-1830 into Tika 1.12...so there may be a need to
wait for Tika 1.13.
No matter the answer on when there'll be an upgrade within Solr, I strongly
encourage carving Tika into a separate JVM/serv
Three basic options:
1) one generic field that handles non-whitespace languages and normalization
robustly (downside: no language specific stopwords, stemming, etc)
2) one field per language (hope lang id works and that you don't have many
multilingual docs)
3) one Solr core per language (ditto)
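A rough schema sketch of options 1 and 2, with invented field names and assuming the stock text_* field types that ship in Solr's sample schemas:

```xml
<!-- option 1: one generic field with language-agnostic tokenization/folding -->
<field name="text_all" type="text_icu" indexed="true" stored="true"/>

<!-- option 2: one field per language, populated after a language-id step -->
<field name="text_en" type="text_en" indexed="true" stored="true"/>
<field name="text_de" type="text_de" indexed="true" stored="true"/>
<field name="text_ja" type="text_ja" indexed="true" stored="true"/>
```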
Might not have the parsers on your path within your Solr framework?
Which tika jars are on your path?
If you want the functionality of all of Tika, use the standalone tika-app.jar,
but do not use the app in the same JVM as Solr...without a custom class loader.
The Solr team carefully prunes
ork. I'm trying to
> use Tika from my own crawler application that uses SojrJ to send the
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B.
>
> wrote:
>
>> Mig
you'll have to grab that and add it to your class path. :)
See also, very recently:
https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
S
I have one answer here [0], but I'd be interested to hear what Solr
users/devs/integrators have experienced on this topic.
[0]
http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
-Original Me
ust catch any exceptions
in my code and "do the right thing". I'm not sure I see any real benefit in yet
another JVM.
FWIW,
Erick
On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. wrote:
> I have one answer here [0], but I'd be interested to hear what Solr
> user
Ha. Spoke too soon about this thread not getting swamped.
Will add the dropwizard-tika-server to our wiki page. Thank you for the link!
As a side note, I'll submit a pull request to update the AbstractTikaResource
to avoid a potential NPE if the mime type can't be parsed...we just fixed this
Yes, this looks like a Tika feature. If you run the tika-app.jar [1] on your file
and you get the same output, then that's Tika's doing.
Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs.
-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.co
control the document corpus,
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the
>
Should have looked at how we handle PSTs before my earlier response... sorry.
What you're seeing is Tika's default treatment of embedded documents: it
concatenates them all into one string. It'll do the same thing for zip files
and other container files. The default Tika format is xhtml, and we i
ut
>> > what's going on with the offending document(s). Or record the name
>> > somewhere and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
55 AM, Allison, Timothy B.
wrote:
> Should have looked at how we handle PSTs before my earlier response... sorry.
>
> What you're seeing is Tika's default treatment of embedded documents:
> it concatenates them all into one string. It'll do the same thing for
> zip fi
> If you're going to use Tika for production indexing, you should write
> a Java program using SolrJ and Tika so that you are in complete
> control, and so Solr isn't unstable.
+1
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3cby2pr09mb11210edfcfa297528940b07c7...@by
> I can tell you that Tika is quite the resource hog. It is likely chewing up
> CPU and memory
> resources at an incredible rate, slowing down your Solr server. You
> would probably see better performance than ERH if you incorporate Tika
> and SolrJ into a client indexing program that runs o
I think Solr is using a version of Tika that predates the addition of the
Grobid parser. You'll have to add that manually somehow until Solr upgrades to
Tika 1.13 (soon to be released...I think). SOLR-8981.
-Original Message-
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
Yes. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr
I think Solr is using
Yes, integrating Tika is non-trivial. I think Uwe adds the dependencies by hand
with great care, checking the dependency tree in Maven to make sure there
aren't any conflicts.
-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, May 4, 20
Only a month late to respond, and the response likely won't help.
I agree with Shawn that Tika can be a memory hog. I try to leave 1GB per
thread, but your mileage will vary dramatically depending on your docs. I'd
expect that you'd get an OOM, though, somewhere...
There have been rare bugs i
> There are some zip files inside the directory and have been addressed
> to in the database. I'm thinking those are the one's it's jumping
> right over.
With SOLR-7189, which should have kicked in for 5.1, Tika shouldn't skip over
zip files; it should process all the contents of those zips and
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or
POI's bugzilla...especially if you can share the triggering document.
-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user
Subjec
All,
I recently took a look at the source code for TikaEntityProcessor, and I
noticed that the code is not configuring the ParseContext to have Tika's
AutoDetectParser (or any parser) parse documents recursively. That is, if you
have a zip file or any other container document, DIH's TikaEnti
Newsletter:
http://www.solr-start.com/
On 4 March 2015 at 11:06, Allison, Timothy B. wrote:
> All,
>
> I recently took a look at the source code for TikaEntityProcessor, and I
> noticed that the code is not configuring the ParseContext to have Tika's
> AutoDetectPar
What class is origQuery?
You will have to do more rewriting/calculation if you're trying to convert a
PhraseQuery to a SpanNearQuery.
If you dig around in
org.apache.lucene.search.highlight.WeightedSpanTermExtractor in the Lucene
highlighter package, you might get some inspiration.
I have a h
ou informed.
Regards,Andy
Le Mardi 7 avril 2015 20h26, "Allison, Timothy B." a
écrit :
What class is origQuery?
You will have to do more rewriting/calculation if you're trying to convert a
PhraseQuery to a SpanNearQuery.
If you dig around in
org.apache.lucene.search.hig
I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you
can -- bad things can happen if you don't [1] [2].
Erick's blog on SolrJ is fantastic. If you want to have Tika parse embedded
documents/attachments, make sure to set the parser in the ParseContext before
parsing:
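The standard Tika pattern for recursive parsing is to register the parser with itself in the ParseContext; a minimal sketch follows (the container file name is a placeholder):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveParseExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // The key step: register the parser in the ParseContext so that
        // embedded documents (zip entries, attachments) are parsed too.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(Paths.get("some-container.zip"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```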
tools.
Thanks & Regards
Vijay
On 16 April 2015 at 12:33, Allison, Timothy B. wrote:
> I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> if you can -- bad things can happen if you don't [1] [2].
>
> Erick's blog on SolrJ is fantastic. If you wan
+1
:)
>PS: one more thing - please, tell your management that you will never
>ever successfully parse all real-world PDFs and cater for that fact in your
>requirements :-)
Trung,
I haven't experimented with our OCR parser yet, but this should give a good
start: https://wiki.apache.org/tika/TikaOCR .
Have you installed tesseract?
Tika colleagues,
Any other tips? What else has to be configured and how?
-Original Message-
From: trung.ht [mailto:trung...@
I completely agree with Erick about the utility of the TermsComponent to see
what is actually being indexed. If you find problems there and if you haven't
done so already, you might also investigate further down the stack. It might
make sense to run the tika-app.jar (whichever version you are
You'll need the ComplexPhraseQueryParser [1] to handle multiterm
(wildcard/fuzzy/regex) terms in proximity. Beware, though, that it does not
perform analysis on fuzzy/wildcard terms (IIRC).
The SurroundQueryParser can probably do both phrase near phrase and multiterm
within proximity. Same warning
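In Solr syntax, the complexphrase route looks something like this (field name and terms invented for illustration):

```
q={!complexphrase inOrder=false}title:"appl* pi*"~3
```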
A classic on the importance of prototyping with your data and on the
intractability of sizing in the abstract:
https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
This might be of use:
https://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/s
This may have been an issue with Solr's wrapper of Tika. See:
https://issues.apache.org/jira/browse/SOLR-7189
-Original Message-
From: 步青云 [mailto:mailliup...@qq.com]
Sent: Wednesday, June 17, 2015 10:17 PM
To: solr-user
Subject: About indexing embed file with solr
Hello,
Could a
Unfortunately, no. We can't even do that now with straight Tika. I imagine
this is for pdf files? If you'd like to add this as a feature, please submit a
ticket over on Tika.
-Original Message-
From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM
To: solr-
M
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page
using the DIH?
On 08/07/2015 20:39, Allison, Timothy B. wrote:
> Unfortunately, no. We can't even do that now with straight Tika. I
> imagine this is for pdf files? If y
>>Wow, that code looks familiar ;)...
Erick and Paden,
The following is not the source of your problem, but I thought I'd mention it
while you reference Erick's fantastic blog post on solrj
(http://lucidworks.com/blog/indexing-with-solrj/). I tried to comment on
Erick's blog post, but someth
If I understand the question correctly...
I'm assuming you are indexing rich documents (PDF/DOC/MSG, etc) with DIH's Tika
handler. Some of those documents have attachments.
If that's the case, all of the content of embedded docs _should_[0] be
extracted, but then all of that content across the
Depending on your needs, you might want to take a look at my SpanQueryParser
(LUCENE-5205/SOLR-5410). It does not offer dtsearch syntax, but if the
SurroundQueryParser was close enough, this parser may be of use. If you need
modifications to it, let me know. I'm in the process of adding
Span
>...and I've just blogged about some of the issues one can run into with this
>sort of project, hope this is useful!
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
+1 completely non-trivial task to roll your own.
I'd add that incorporating multiterm analysis (analysis/normalization
I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According to [0]...the
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"? This
would strip out the attr_ f
Of course, for greater control over indexing (and for more robust handling of
exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/
-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.ne
ext/css" charset="utf-8"
>> media="screen" href="/wiki/modernized/css/screen.css"/>
>> <link rel="stylesheet" type="text/css" charset="utf-8"
>> media="print" href="
All,
This is a toy example, but is there a way to search for, say, stores with
sales of > $x in the last 2 months with Solr?
$x and the time frame are selected by the user at query time.
If the queries could be constrained (this is still tbd), I could see updating
"stats" fields within eac
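If the stats fields were kept updated that way, the query side would just be ordinary range queries, possibly with Solr date math (field names invented for illustration):

```
fq=sales_last_2mo:[10000 TO *]
fq=last_sale_date:[NOW-2MONTHS/DAY TO NOW]
```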
uence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-DifferencesbetweenLuceneQueryParserandtheSolrStandardQueryParser
Regards,
Alex.
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
On 3 June 2016 at 23:23, Allison, Timothy B. wrote:
> All,
> This is a toy example, b
>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should
>be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/
+1
> We tend to prefer running Tika externally as it's entirely possible
> that Tika will crash or hang with certain files - and that will
I was just looking at SolrCellBuilder, and it looks like there's an assumption
that documents will not have attachments/embedded objects. Unless I
misunderstand the code, users will not be able to search documents inside zips,
or attachments in msg/ doc/pdf/etc (cf. SOLR-7189).
Are embedded do
From: Brandon Miller [mailto:computerengineer.bran...@gmail.com]
Sent: Monday, June 20, 2016 4:12 PM
To: Allison, Timothy B. ; solr-user@lucene.apache.org
Subject: Re: SpanQuery - How to wrap a NOT subquery
Thank you, Timothy.
I have support for and am using SpanNotQuery elsewhere. Maybe there is
> dtSearch allows a user to have NOTs embedded in proximity searches.
And, if you're heading down the path of building your own queryparser to handle
dtSearch's syntax, please read and heed Charlie Hull's post:
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
See also:
http://www.fl
>Perhaps I'm misunderstanding the pre/post parameters?
Pre/post parameters: "'six' or 'seven' should not appear within $pre tokens
before 'thirty' or within $post tokens after 'thirty'."
Maybe something like this:
spanNear([
spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
spanNot(
>Awesome, 0 pre and 1 post works!
Great!
> What if I wanted to match thirty, but exclude if six or seven are included
> anywhere in the document?
Any time you need "anywhere in the document", use a "regular" query (not
SpanQuery). As you wrote initially, you can construct a BooleanQuery that
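A sketch of both patterns at the Lucene level, with a made-up field name "f": SpanNotQuery for proximity-scoped exclusion, and a regular BooleanQuery with MUST_NOT clauses for document-wide exclusion:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanNotExamples {
    public static void main(String[] args) {
        // Proximity-scoped exclusion: match "thirty" unless "six" or "seven"
        // appears within 0 tokens before or 1 token after it.
        SpanQuery include = new SpanTermQuery(new Term("f", "thirty"));
        SpanQuery exclude = new SpanOrQuery(
                new SpanTermQuery(new Term("f", "six")),
                new SpanTermQuery(new Term("f", "seven")));
        SpanQuery near = new SpanNotQuery(include, exclude, 0, 1);

        // Document-wide exclusion: wrap in a regular BooleanQuery instead.
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        b.add(near, BooleanClause.Occur.MUST);
        b.add(new TermQuery(new Term("f", "six")), BooleanClause.Occur.MUST_NOT);
        b.add(new TermQuery(new Term("f", "seven")), BooleanClause.Occur.MUST_NOT);
        Query docWide = b.build();
        System.out.println(near + "\n" + docWide);
    }
}
```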
Not that I need any other book beyond this one... but I didn't realize that the
50% discount code applies to all books in the order. :)
Congratulations, Doug and John!
-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com]
Sent: Tuesday, June 21, 2016 2:12 P
+1 to langdetect
In Tika 2.0, we're going to remove our own language detection code and allow
users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl
library or Yalder (https://github.com/kkrugler/yalder). The first two are now
available in Tika 1.13.
-Original Message--
ICU normalization (ICUFoldingFilterFactory) will at least handle "ß" -> "ss"
(IIRC) and some other language-general variants that might get you close.
There are, of course, language specific analyzers
(https://wiki.apache.org/solr/LanguageAnalysis#German) , but I don't think
they'll get you Fo
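For reference, a minimal fieldType along those lines; it requires the analysis-extras contrib (ICU jars) on the classpath, and the names are illustrative:

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```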
I don't think that's configurable at the moment.
Tika-colleagues, any recommendations?
If you're able to share the file on Tika's jira, we'd be happy to take a look.
You shouldn't be getting the zip bomb unless there is a mismatch between
opening and closing tags (which could point to a bug
Yes, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika
currently.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 22, 2016 10:42 AM
To: solr-user@lucene.apache.org
Cc: 'u...@tika.apache.org'
Subject: RE
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.
Great! Tika 1.14 is around the corner...if this is an easy fix ... :)
Thank you.
va API and examples for SolrJ and Tika to
>>> achieve that...
>>>
>>> Just wanted to confirm. I'll try to get a sample HTML yielding to
>>> this problem and attach it to Jira.
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>> Em 22-09-
>> 11133_f6ef-eutelsat.htm
>>
>> I'll try to create a ticket for this on Jira if I find its location
>> but feel free to open it yourself if you prefer, just let me know.
>>
>> Em 22-09-2016 12:33, Allison, Timothy B. escreveu:
>>>>
This doesn't answer your question, but Erick Erickson's blog on this topic is
invaluable:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
-Original Message-
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016
All,
I recently blogged about some of the work we're doing with a large scale
regression corpus to make Tika, POI and PDFBox more robust and to identify
regressions before release. If you'd like to chip in with recommendations,
requests or Hadoop/Spark clusters (why not shoot for the stars), p
AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
Great Tim.
What do I need to do to integrate it on my current installation?
On May 31, 2017 16:24, "Allison, Timothy B." wrote:
Apache Tika 1.15 is now available.
-Original Message
> There is no standard across different types of docs as to what meta-data
> field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing.
On Tika, we _try_ to normalize fields according to various standards, the most
Yeah, Chris knows a thing or two about Tika. :)
-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting
matched text with context
No intenti
> So, if you are trying to make sure your index breaks words properly on
> eastern languages, just use ICU Tokenizer.
I defer to the expertise on this list, but last I checked ICUTokenizer uses
dictionary lookup to tokenize CJK. This may work well for some tasks, but I
haven't evaluated whe
>http - however, the big advantage of doing your indexing on different machine
>is that the heavy lifting that tika does in extracting text from documents,
>finding metadata etc is not happening on the server. If the indexer crashes,
>it doesn’t affect Solr either.
+1
for what can go wrong:
Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350
On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B.
wrote:
> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the
> upgrade in Solr to Tika 1.15. Please chime in on that issue.
>
>
>4. Write an external program that fetches the file, fetches the metadata,
>combines them, and send them to Solr.
I've done this with some custom crawls. Thanks to Erick Erickson, this is a
snap:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
With the caveat that Tika should really be i
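A bare-bones sketch of that pattern, combining Tika and SolrJ in a client-side indexer; the URL, core name, file path, and field names are all placeholders:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrJIndexer {
    public static void main(String[] args) throws Exception {
        // Tika runs here, in the client JVM -- a crash or OOM never touches Solr.
        AutoDetectParser parser = new AutoDetectParser();
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            Path file = Paths.get("docs/report.pdf");
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(Files.newInputStream(file), handler, metadata,
                    new ParseContext());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.toString());
            doc.addField("title", metadata.get("dc:title"));
            doc.addField("text", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}
```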
+1
I was hoping to use this as a case for arguing for turning off an overly
aggressive stemmer, but I checked on your 10 docs and query, and David is
right, of course -- if you change the default operator to AND, you only get the
one document back that you had intended to.
I can still use this
Solrians,
We have a request to drop phonetic strings from xlsx as the default in Tika.
I'm not familiar enough with Japanese to know if users would generally expect
to be able to search on these as well as the original. The current practice is
to include them.
Any recommendations? Thank y
bq: How do I get a list of all valid field names based on the file type
bq: You don't. At least I've never found any. Plus various document formats
will allow custom meta-data fields so there's no definitive list.
It would be trivial to add field counts per mime to tika-eval. If you're
interes
https://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
-Original Message-
From: Deeksha Sharma [mailto:dsha...@flexera.co
What version of Solr are you using?
I thought this had been fixed fairly recently, but I can't quickly find the
JIRA. Let me take a look.
Best,
Tim
This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and
[2], which handles analysis of multiterms even in phra
lob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117
-----Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, October 5, 2017 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Complexphrase treats wildca
e certain of it :-)
Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?
I'm on 6.6.1, so latest on the 6.x branch.
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level
ses, but the regular multiterms
should be ok.
Still no answer for you...
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level, but
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still
> not ap
That could be it. I'm not able to reproduce this with trunk. More next week.
In trunk, if I add this to schema15.xml:
This test passes.
@Test
public void testCharFilter() {
assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
assertU(commit());
1']"
);
Notice how cr\u00E6zy* is used as a query term which mimics the behaviour I
originally reported, namely that CPQP does not analyse it because of the
wildcard and thus does not hit the charfilter from the query side.
2017-10-06 20:54 GMT+02:00 Allison, Timothy B. :
> Th
The initial question wasn't about a phrasal search, but I largely agree that
different query parsers handle the analysis chain differently for multiterms.
Yes, Porter is crazily aggressive. USE WITH CAUTION!
As has been pointed out, use the Solr admin window and the "debug" in the query
option to see
What do you suggest to use for stemming instead of "Porter"? I guess it
wasn't chosen intentionally.
In the best we trust
Georgy Nevsky
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apa
I can't recommend Doug Turnbull and John Berryman's "Relevant Search" enough on
how to layer fields...among many other great insights:
https://www.manning.com/books/relevant-search
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 9:20 AM
To:
You've gotten far better answers on this already, but you can use the
SpanNotQuery in the SpanQueryParser I maintain and have published to maven
central [1][2][3].
This does not carry out any nlp, but this would allow literal "headache (no
not)"!~5,0 -> "headache" but not if "no" or "not" shows
> I don't see any weird character when I manual copy it to any text editor.
That's a good diagnostic step, but there's a chance that Adobe (or your viewer)
got it right, and Tika or PDFBox isn't getting it right.
If you run tika-app on the file [0], do you get the same problem? See our stub
on
This came up back in September [1] and [2]. Same trigger...crazy number of
divs.
I think we could modify the AutoDetectParser to enable configuration of maximum
zip-bomb depth via tika-config.
If there's any interest in this, re-open TIKA-2091, and I'll take a look.
Best,
Tim
This is a Tika/POI problem. Please download tika-app 1.14 [1] or a nightly
version of Tika [2] and run
java -jar tika-app.jar
If the problem is fixed, we'll try to upgrade dependencies in Solr. If it
isn't fixed, please open a bug on Tika's Jira.
If this is a missing bean issue (sorry, I c
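For reference, common tika-app invocations (jar version and file name are placeholders):

```shell
# extract plain text
java -jar tika-app-1.14.jar -t problem.doc
# metadata only
java -jar tika-app-1.14.jar -m problem.doc
# JSON output for the container and each embedded document
java -jar tika-app-1.14.jar -J -t problem.doc
```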
d to
go. [3]"
as tika is failing, is it could help or not?
Gytis
On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B.
wrote:
> This is a Tika/POI problem. Please download tika-app 1.14 [1] or a
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar
>
> If th
Argh. Looks like we need to add curvesapi (BSD 3-clause) to Solr.
For now, add this jar:
https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
See also [1]
[1]
http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
-Ori
ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar 2.
curvesapi-1.03.jar
So, now I'm waiting when this will be implemented in a official version of
solr/tika.
Regards,
Gytis
On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B.
wrote:
> Argh. Looks like we need to add curvesapi
>It is *strongly* recommended to *not* use >the Tika that's embedded within
>Solr, but >instead to do the processing outside of Solr >in a program of your
>own and index the results.
+1
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C
All,
I finally got around to documenting Apache Tika's MockParser[1]. As of Tika
1.15 (unreleased), add tika-core-tests.jar to your class path, and you can
simulate:
1. Regular catchable exceptions
2. OOMs
3. Permanent hangs
This will allow you to determine if your ingest framework is robust
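From memory (check the MockParser documentation for the exact schema), a mock file looks something like the following; Tika "parses" the XML and performs the actions it specifies:

```xml
<mock>
  <!-- emit a little "extracted" content -->
  <write element="p">some text</write>
  <!-- then throw a regular catchable exception -->
  <throw class="java.io.IOException">simulated parse failure</throw>
  <!-- or simulate a permanent hang: -->
  <!-- <hang millis="30000" heavy="false" pulse_millis="100"/> -->
</mock>
```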
Please also see:
https://wiki.apache.org/tika/TikaOCR
and
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
If you have any other questions about Apache Tika and OCR, please feel free to
ask on our users list as well: u...@tika.apache.org
Cheers,
Tim
-Origin
]
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents
I tried this solution from Tim Allison, and it works.
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
Regards,
Edwin
On 27 March 2017 at 20:07, A
> Note that the OCRing is a separate task from Solr indexing, and is best done
> on separate machines.
+1
-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significant
> Also we will try to decouple tika to solr.
+1
-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com]
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing
Hi, thanks for the feedback.
Yes, it is about OOM, indeed e
Please open an issue on Tika's JIRA and share the triggering file if possible.
If we can touch the file, we may be able to recommend alternate ways to
configure Tika's encoding detectors. We just added configurability to the
encoding detectors and that will be available with Tika 1.15. [1]
We
You might want to drop a note to the dev or user's list on Apache POI.
I'm not extremely familiar with the vsd(x) portion of our code base.
The first item ("PolylineTo") may be caused by a mismatch between your doc and the
ooxml spec.
The second item appears to be an unsupported feature.
The thir