WordDelimiterFilter preserveOriginal position increment
Hi, I'm having an issue with the WDF preserveOriginal=1 setting and the matching of a phrase query. Here's an example of the text that is being indexed: "...obtained with the Southern African Large Telescope,SALT..." A lot of our text is extracted from PDFs, so this kind of formatting junk is very common. The phrase query that is failing is: "Southern African Large Telescope". From looking at the analysis debugger I can see that the WDF is getting the term Telescope,SALT and correctly splitting on the comma. The problem seems to be that the original term is given the first position, e.g.:

Pos Term
1   Southern
2   African
3   Large
4   Telescope,SALT  -- original term
5   Telescope
6   SALT

Only by adding a phrase slop of ~1 do I get a match. I realize that the WDF is behaving correctly in this case (or at least I can't imagine a rational alternative), but I'm curious if anyone can suggest a way to work around this issue that doesn't involve adding phrase query slop. Thanks, --jay
Re: WordDelimiterFilter preserveOriginal position increment
Bah... While attempting to duplicate this on our 4.x instance I realized I was mis-reading the analysis output. In the example I mentioned it was actually a SynonymFilter in the analysis chain that was affecting the term position (we have several synonyms for telescope). Regardless, it seems to not be a problem in 4.x. Thanks, --jay On Tue, Oct 23, 2012 at 10:45 AM, Shawn Heisey s...@elyograg.org wrote: On 10/23/2012 8:16 AM, Jay Luker wrote: From looking at the analysis debugger I can see that the WDF is getting the term Telescope,SALT and correctly splitting on the comma. The problem seems to be that the original term is given the 1st position, e.g.:

Pos Term
1   Southern
2   African
3   Large
4   Telescope,SALT  -- original term
5   Telescope
6   SALT

Jay, I have WDF with preserveOriginal turned on. I get the following from WDF parsing in the analysis page on either 3.5 or 4.1-SNAPSHOT, and the analyzer shows that all four of the query words are found in consecutive fields. On the new version, I had to slide a scrollbar to the right to see the last term. Visually they were not in consecutive fields on the new version (they were on 3.5), but the position number says otherwise.

1 Southern
2 African
3 Large
4 Telescope,SALT
4 Telescope
5 SALT
5 TelescopeSALT

My full WDF parameters:

index: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, splitOnNumerics=1, stemEnglishPossessive=1, luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0, catenateNumbers=1}
query: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, splitOnNumerics=1, stemEnglishPossessive=1, luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0, catenateNumbers=0}

I understand from other messages on the mailing list that I should not have preserveOriginal on the query side, but I have not yet changed it. If your position numbers really are what you indicated, you may have found a bug.
I have not tried the released 4.0.0 version, I expect to deploy from the 4.x branch under development. Thanks, Shawn
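For readers hitting this thread later: here is a toy model (plain Java, not Lucene code) of how position increments produce the two behaviors discussed above. In the older-style output the original term consumed its own position, pushing the split parts one slot to the right and breaking the exact phrase match; in the 4.x output Shawn pasted, the original term and the first split part share position 4, so the phrase "Southern African Large Telescope" still lines up as positions 1-4.

```java
import java.util.ArrayList;
import java.util.List;

public class Positions {
    // Each token carries a position increment; its absolute position is the
    // running sum of increments. Returns "pos:term" strings.
    public static List<String> absolute(String[] terms, int[] increments) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            out.add(pos + ":" + terms[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"Southern", "African", "Large",
                          "Telescope,SALT", "Telescope", "SALT"};
        // 4.x-style: the split part "Telescope" gets increment 0, stacking it
        // at the same position as the preserved original.
        int[] stacked = {1, 1, 1, 1, 0, 1};
        System.out.println(absolute(terms, stacked));
        // Older-style: every token advances the position, so "Telescope"
        // lands at 5 and the exact phrase query misses.
        int[] shifted = {1, 1, 1, 1, 1, 1};
        System.out.println(absolute(terms, shifted));
    }
}
```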
Re: NumericRangeQuery: what am I doing wrong?
On Wed, Dec 14, 2011 at 5:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote: I'm a little lost in this thread ... if you are programmatically constructing a NumericRangeQuery object to execute in the JVM against a Solr index, that suggests you are writing some sort of Solr plugin (or embedding Solr in some way) It's not you; it's me. I'm just doing weird things, partly, I'm sure, due to ignorance, but sometimes out of expediency. I was experimenting with ways to do a NumericRangeFilter, and the tests I was trying used the Lucene API to query a Solr index, therefore I didn't have access to the IndexSchema. Also my question might have been better directed at the lucene-general list to avoid confusion. Thanks, --jay
NumericRangeQuery: what am I doing wrong?
I can't get NumericRangeQuery or TermQuery to work on my integer id field. I feel like I must be missing something obvious. I have a test index that has only two documents, id:9076628 and id:8003001. The id field is defined like so:

<field name="id" type="tint" indexed="true" stored="true" required="true" />

A MatchAllDocsQuery will return the 2 documents, but any queries I try on the id field return no results. For instance,

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);
    System.out.println("query: " + q);
    assertEquals(2, searcher.search(q, 5).totalHits);
}

public void testIdSearch() throws IOException {
    Query q = new TermQuery(new Term("id", "9076628"));
    System.out.println("query: " + q);
    assertEquals(1, searcher.search(q, 5).totalHits);
}

Both tests fail with totalHits being 0. This is using solr/lucene trunk, but I tried also with 3.2 and got the same results. What could I be doing wrong here? Thanks, --jay
Re: NumericRangeQuery: what am I doing wrong?
On Wed, Dec 14, 2011 at 2:04 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, seems like it should work, but there are two things you might try: 1) just execute the query in Solr: id:[1 TO 100]. Does that work? Yep, that works fine. 2) I'm really grasping at straws here, but it's *possible* that you need to use the same precisionStep as tint (8?)? There's a constructor that takes precisionStep as a parameter, but the default is 4 in the 3.x code. Ah-ha, that was it. I did not notice the alternate constructor. The field was originally indexed with solr's default int type, which has precisionStep=0 (i.e., don't index at different precision levels). The equivalent value for the NumericRangeQuery constructor is 32. This isn't exactly intuitive, but I was able to figure it out with a careful reading of the javadoc. Thanks! --jay
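To make the precisionStep relationship concrete: trie-encoded numeric fields index each value at several precisions, shifting off precisionStep low bits per level, and the query must assume the same set of levels the index actually produced. A rough stdlib-only illustration of the level count per value (my simplification; Lucene's actual terms also encode the shift amount in the term bytes):

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixes {
    // Shift amounts at which a 32-bit int gets an indexed term for a given
    // precisionStep. precisionStep=4 -> 8 terms per value (shifts 0,4,...,28);
    // precisionStep=32 -> a single full-precision term, which is what Solr's
    // plain "int" (precisionStep=0, i.e. no trie levels) boils down to. A
    // query built with a different step looks for prefix terms the index
    // never created, hence the zero hits in this thread.
    public static List<Integer> shifts(int precisionStep) {
        List<Integer> result = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            result.add(shift);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shifts(4));  // [0, 4, 8, 12, 16, 20, 24, 28]
        System.out.println(shifts(32)); // [0]
    }
}
```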
Re: RegexQuery performance
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson erickerick...@gmail.com wrote: My off-the-top-of-my-head notion is you implement a Filter whose job is to emit some special tokens when you find strings like this that allow you to search without regexes. For instance, in the example you give, you could index something like...oh... I don't know, ###VER### as well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0. Now, when searching for docs with the pattern you used as an example, you look for ###VER### instead. I guess it all depends on how many regexes you need to allow. This wouldn't work at all if you allow users to put in arbitrary regexes, but if you have a small enough number of patterns you'll allow, something like this could work. This is a great suggestion. I think the number of users that need this feature, as well as the variety of regexes that would be used, is small enough that it could definitely work. It turns it into a problem of collecting the necessary regexes, plus the UI details. Thanks! --jay
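Erick's marker-token idea can be sketched in a few lines of plain Java. In a real deployment this logic would live in a custom TokenFilter in the analysis chain, but the core transformation is just this (the regex is the one quoted later in the thread; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MarkerTokens {
    private static final Pattern VERSIONED_ID =
        Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");

    // For each whitespace-delimited token, emit the token itself plus a
    // ###VER### marker token when it looks like a versioned identifier.
    // (A production TokenFilter would also need to cope with trailing
    // punctuation like "V6.0," before matching.)
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            out.add(tok);
            if (VERSIONED_ID.matcher(tok).matches()) {
                out.add("###VER###");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("see IRAS-A-FPA-3-RDR-IMPS-V6.0 for details"));
    }
}
```

A query for documents containing any versioned identifier then becomes an ordinary term query on ###VER###, sidestepping the linear scan over the term dictionary entirely.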
Re: RegexQuery performance
Hi Erick, On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson erickerick...@gmail.com wrote: Could you show us some examples of the kinds of things you're using regex for? I.e. the raw text and the regex you use to match the example? Sure! An example identifier would be IRAS-A-FPA-3-RDR-IMPS-V6.0, which identifies a particular Planetary Data System data set. Another example is ULY-J-GWE-8-NULL-RESULTS-V1.0. These kinds of strings frequently appear in the references section of the articles, so the context looks something like, "... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System Tholen, D. J. 1989, in Asteroids II, ed ..." The simple, straightforward regex I've been using is /[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach, but I haven't put my mind to it because I assumed the primary performance issue was elsewhere. The reason I ask is that perhaps there are other approaches, especially thinking about some clever analyzing at index time. For instance, perhaps NGrams are an option. Perhaps just making WordDelimiterFilterFactory do its tricks. Perhaps. WordDelimiter does help in the sense that if you search for a specific identifier you will usually find fairly accurate results, even for cases where the hyphens resulted in the term being broken up. But I'm not sure how WordDelimiter can help if I want to search for a pattern. I tried a few tweaks to the index, like putting a minimum character count for terms, making sure WordDelimiter's preserveOriginal is turned on, and indexing without lowercasing so that I don't have to use Pattern.CASE_INSENSITIVE. Performance was not improved significantly. The new RegexpQuery mentioned by R. Muir looks promising, but I haven't built an instance of trunk yet to try it out. Any other suggestions appreciated. Thanks!
--jay In other words, this could be an XY problem Best Erick On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir rcm...@gmail.com wrote: On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote: Hi, I am trying to provide a means to search our corpus of nearly 2 million fulltext astronomy and physics articles using regular expressions. A small percentage of our users need to be able to locate, for example, certain types of identifiers that are present within the fulltext (grant numbers, dataset identifiers, etc). My straightforward attempts to do this using RegexQuery have been successful only in the sense that I get the results I'm looking for. The performance, however, is pretty terrible, with most queries taking five minutes or longer. Is this the performance I should expect considering the size of my index and the massive number of terms? Are there any alternative approaches I could try?

Things I've already tried:
* reducing the sheer number of terms by adding a LengthFilter, min=6, to my index analysis chain
* swapping in the JakartaRegexpCapabilities

Things I intend to try if no one has any better suggestions:
* chunk up the index and search concurrently, either by sharding or using a RangeQuery based on document id

Any suggestions appreciated. This RegexQuery is not really scalable in my opinion; it's always linear in the number of terms, except in super-rare circumstances where it can compute a common prefix (and slow to boot). You can try svn trunk's RegexpQuery -- don't forget the p -- instead, from lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/ etc). The performance is faster, but keep in mind it's only as good as the regular expressions; if the regular expressions are like /.*foo.*/, then it's just as slow as a wildcard of *foo*. -- lucidimagination.com
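For reference, the identifier regex from this thread, runnable standalone against the sample context with nothing but java.util.regex (nothing Solr-specific):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierRegex {
    private static final Pattern DATASET_ID =
        Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");

    // Scan free text and collect every dataset-identifier-looking match.
    public static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DATASET_ID.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String context = "... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System";
        System.out.println(findAll(context));
    }
}
```

Running a pattern like this per document at index time (to emit marker tokens or a dedicated identifier field) is cheap; running it per term over a 2-million-document term dictionary at query time is where RegexQuery falls over.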
RegexQuery performance
Hi, I am trying to provide a means to search our corpus of nearly 2 million fulltext astronomy and physics articles using regular expressions. A small percentage of our users need to be able to locate, for example, certain types of identifiers that are present within the fulltext (grant numbers, dataset identifiers, etc). My straightforward attempts to do this using RegexQuery have been successful only in the sense that I get the results I'm looking for. The performance, however, is pretty terrible, with most queries taking five minutes or longer. Is this the performance I should expect considering the size of my index and the massive number of terms? Are there any alternative approaches I could try?

Things I've already tried:
* reducing the sheer number of terms by adding a LengthFilter, min=6, to my index analysis chain
* swapping in the JakartaRegexpCapabilities

Things I intend to try if no one has any better suggestions:
* chunk up the index and search concurrently, either by sharding or using a RangeQuery based on document id

Any suggestions appreciated. Thanks, --jay
Re: PatternTokenizer failure
On Tue, Nov 29, 2011 at 9:37 AM, Michael Kuhlmann k...@solarier.de wrote: Jay, I think the problem is this: You're checking whether the character preceding the run of at least one whitespace is not a hyphen. However, when you have more than one whitespace, like this: "foo- \n bar", then there's another run of whitespace, "\n ", which is preceded by the first whitespace " ". Therefore, you'll need to not only check for preceding hyphens, but also for preceding whitespaces. I'll leave this as an exercise for you. ;) -Kuli Just for the sake of closure, you were correct. I needed to update the regex to include a whitespace character in the negative lookbehind, i.e., (?<![-\s])\s+. Thanks, --jay
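The corrected pattern can be sanity-checked with nothing but String.split, since the tokenizer factory ultimately hands the same pattern to java.util.regex:

```java
import java.util.Arrays;

public class HyphenSplit {
    // Split on runs of whitespace EXCEPT where the run is preceded by a
    // hyphen or by more whitespace -- the corrected pattern from this
    // thread. The earlier attempt, (?<!-)\s+, failed on "foo- \n bar"
    // because the inner positions of the run " \n " are preceded by
    // whitespace rather than by the hyphen, so the lookbehind passed and
    // the run split after all.
    public static String[] split(String text) {
        return text.split("(?<![-\\s])\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("foo bar")));     // splits
        System.out.println(Arrays.toString(split("foo- bar")));    // stays whole
        System.out.println(Arrays.toString(split("foo- \n bar"))); // stays whole
    }
}
```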
Re: InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting
I am having a similar issue with OffsetExceptions during highlighting. In all of the explanations and bug reports I'm reading there is a mention that this is all the result of a problem with HTMLStripCharFilter. But my analysis chains don't (that I'm aware of) make use of HTMLStripCharFilter, so can someone explain what else might be going on? Or is it acknowledged that the bug may exist elsewhere? Thanks, --jay On Fri, Nov 11, 2011 at 4:37 AM, Vadim Kisselmann v.kisselm...@googlemail.com wrote: Hi Edwin, Chris, it's an old bug. I have big problems too with OffsetExceptions when I use Highlighting, or Carrot. It looks like a problem with HTMLStripCharFilter. The patch doesn't work. https://issues.apache.org/jira/browse/LUCENE-2208 Regards Vadim 2011/11/11 Edwin Steiner edwin.stei...@gmail.com I just entered a bug: https://issues.apache.org/jira/browse/SOLR-2891 Thanks regards, Edwin On Nov 7, 2011, at 8:47 PM, Chris Hostetter wrote: : finally I want to use Solr highlighting. But there seems to be a problem : if I combine the char filter and the compound word filter in combination : with highlighting (an : org.apache.lucene.search.highlight.InvalidTokenOffsetsException is : raised). Definitely sounds like a bug somewhere in dealing with the offsets. Can you please file a Jira, and include all of the data you have provided here? It would also be helpful to know what the analysis tool says about the various attributes of your tokens at each stage of the analysis.
: SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
:   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
:   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
:   at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
:   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
:   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
:   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
:   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
:   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
:   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
:   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
:   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
:   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
:   at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
:   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
:   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
:   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
:   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
:   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
:   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
:   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
:   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
:   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
:   at java.lang.Thread.run(Thread.java:680)
: Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
:   at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
:   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
:   ... 23 more
-Hoss
PatternTokenizer failure
Hi all, I'm trying to use PatternTokenizer and not getting expected results. Not sure where the failure lies. What I'm trying to do is split my input on whitespace except in cases where the whitespace is preceded by a hyphen character. So to do this I'm using a negative lookbehind assertion in the pattern, e.g. (?<!-)\s+. Expected behavior:

"foo bar"       -> [foo, bar]       - OK
"foo \n bar"    -> [foo, bar]       - OK
"foo- bar"      -> [foo- bar]       - OK
"foo-\nbar"     -> [foo-\nbar]      - OK
"foo- \n bar"   -> [foo- \n bar]    - FAILS

Here's a test case that demonstrates the failure:

public void testPattern() throws Exception {
    Map<String,String> args = new HashMap<String,String>();
    args.put(PatternTokenizerFactory.GROUP, "-1");
    args.put(PatternTokenizerFactory.PATTERN, "(?<!-)\\s+");
    Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
    PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
    tokFactory.init(args);
    TokenStream stream = tokFactory.create(reader);
    assertTokenStreamContents(stream,
        new String[] { "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
}

This fails with the following output:

org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>

Am I doing something wrong? Incorrect expectations? Or could this be a bug? Thanks, --jay
Re: Document has fields with different update frequencies: how best to model
You are correct that ExternalFileField values can only be used in query functions (i.e. scoring, basically). Sorry for firing off that answer without reading your use case more carefully. I'd be inclined towards giving your Option #1 a try, but that's without knowing much about the scale of your app, size of your index, documents, etc. Unneeded field updates are only a problem if they're causing performance problems, right? Otherwise, trying to avoid them seems like premature optimization. --jay On Sat, Jun 11, 2011 at 5:26 AM, lee carroll lee.a.carr...@googlemail.com wrote: Hi Jay, I thought external file field could not be returned as a field but only used in scoring. Trunk has pseudo-fields which can take a function value, but we can't move to trunk. Also it's a more general question around schema design: what happens if you have several fields with different update frequencies? It does not seem external file field is the use case for this. On 10 June 2011 20:13, Jay Luker lb...@reallywow.com wrote: Take a look at ExternalFileField [1]. It's meant for exactly what you want to do here. FYI, there is an issue with caching of the external values introduced in v1.4 but, thankfully, resolved in v3.2 [2] --jay [1] http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html [2] https://issues.apache.org/jira/browse/SOLR-2536 On Fri, Jun 10, 2011 at 12:54 PM, lee carroll lee.a.carr...@googlemail.com wrote: Hi, We have a document type which has fields which are pretty static. Say they change once every 6 months. But the same document has a field which changes hourly. What are the best approaches to index this document? E.g. Hotel ID (static), Hotel Description (static and costly to get from a url etc), FromPrice (changes hourly). Option 1: Index hourly as a single document and don't worry about the unneeded field updates. Option 2: Split into 2 document types and index independently. This would require the front end application to query multiple times?
doc1: ID, Description, DocType
doc2: ID, HotelID, Price, DocType

The application performs searches based on hotel attributes; for each hotel match, issue a query to get the price. Any other options? Can you query across documents? We run 1.4.1; we could maybe update to 3.2, but I don't think I could swing to trunk for the JOIN feature (if that indeed is JOIN's use case). Thanks in advance. PS: Am I just worrying about de-normalised data and should sort the source data out, maybe by caching, and get over it...? cheers Lee c
Re: Document has fields with different update frequencies: how best to model
Take a look at ExternalFileField [1]. It's meant for exactly what you want to do here. FYI, there is an issue with caching of the external values introduced in v1.4 but, thankfully, resolved in v3.2 [2] --jay [1] http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html [2] https://issues.apache.org/jira/browse/SOLR-2536 On Fri, Jun 10, 2011 at 12:54 PM, lee carroll lee.a.carr...@googlemail.com wrote: Hi, We have a document type which has fields which are pretty static. Say they change once every 6 months. But the same document has a field which changes hourly. What are the best approaches to index this document? E.g. Hotel ID (static), Hotel Description (static and costly to get from a url etc), FromPrice (changes hourly). Option 1: Index hourly as a single document and don't worry about the unneeded field updates. Option 2: Split into 2 document types and index independently. This would require the front end application to query multiple times?

doc1: ID, Description, DocType
doc2: ID, HotelID, Price, DocType

The application performs searches based on hotel attributes; for each hotel match, issue a query to get the price. Any other options? Can you query across documents? We run 1.4.1; we could maybe update to 3.2, but I don't think I could swing to trunk for the JOIN feature (if that indeed is JOIN's use case). Thanks in advance. PS: Am I just worrying about de-normalised data and should sort the source data out, maybe by caching, and get over it...? cheers Lee c
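For anyone landing here via search, the declaration looks roughly like this (the field names are invented for the hotel example, and the exact attribute set may vary by Solr version; the values themselves live in a file named external_<fieldname> in the index data directory, one key=value line per document):

```xml
<!-- schema.xml: float values keyed on the unique id, loaded from an
     external file rather than the index, so hourly price updates don't
     require reindexing the whole document -->
<fieldType name="priceFile" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="fromPrice" type="priceFile" indexed="false" stored="false"/>
```

The caveat from the thread still applies: these values are only usable in function queries (scoring/sorting), not as returned stored fields.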
Re: Solr performance
On Wed, May 11, 2011 at 7:07 AM, javaxmlsoapdev vika...@yahoo.com wrote: I have some 25-odd fields with stored=true in schema.xml. Retrieving 5,000 records takes a few seconds. I also tried passing fl and only including one field in the response, but the response time is the same. What are the things to look at to tune the performance? Confirm that you have enableLazyFieldLoading set to true in solrconfig.xml. I suspect it is, since that's the default. Is the request taking a few seconds the first time, but returning quickly on subsequent requests? Also, may or may not be relevant, but you might find a few bits of info in this thread enlightening: http://lucene.472066.n3.nabble.com/documentCache-clarification-td1780504.html --jay
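The setting mentioned above lives in the query section of solrconfig.xml; for reference:

```xml
<!-- solrconfig.xml: only deserialize stored fields that are actually
     requested via fl, instead of loading all 25 stored fields per doc -->
<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```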
Re: Text Only Extraction Using Solr and Tika
Hi Emyr, You could try using the extractOnly=true parameter [1]. Of course, you'll need to repost the extracted text manually. --jay [1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only On Thu, May 5, 2011 at 9:36 AM, Emyr James emyr.ja...@sussex.ac.uk wrote: Hi All, I have solr and tika installed and am happily extracting and indexing various files. Unfortunately on some Word documents it blows up since it tries to auto-generate a 'title' field but my title field in the schema is single-valued. Here is my config for the extract handler...

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Is there a config option to make it only extract text, or ideally to allow me to specify which metadata fields to accept? E.g. I'd like to use any author metadata it finds but not any title metadata it finds, as I want title to be single-valued and set explicitly using a literal.title in the post request. I did look around for some docs but all I can find are very basic examples; there's no comprehensive configuration documentation out there as far as I can tell. ALSO... I get some other bad responses coming back such as...
HTTP Status 500 - java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
  at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
  at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
  at java.lang.Thread.run(Thread.java:636)

For the above my url was...

http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten

I guess there's something special I need to be able to process PowerPoint files? Maybe I need to get the latest Apache POI? Any suggestions welcome... Regards, Emyr
tika/pdfbox knobs & levers
Hi all, I'm wondering if there are any knobs or levers I can set in solrconfig.xml that affect how pdfbox text extraction is performed by the extraction handler. I would like to take advantage of pdfbox's ability to normalize diacritics and ligatures [1], but that doesn't seem to be the default behavior. Is there a way to enable this? Thanks, --jay [1] http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html
Re: UIMA example setup w/o OpenCalais
Thank you, that worked. For the record, my objection to the OpenCalais service is that their ToS states that they will retain a copy of the metadata submitted by you, and that by submitting data to the service you grant Thomson Reuters a non-exclusive, perpetual, sublicensable, royalty-free license to that metadata. The AlchemyAPI service ToS states only that they retain the *generated* metadata. Just a warning to anyone else thinking of experimenting with Solr UIMA. --jay On Fri, Apr 8, 2011 at 6:45 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi Jay, you should be able to do so by simply removing the OpenCalaisAnnotator from the execution pipeline, commenting out line 124 of the file: solr/contrib/uima/src/main/resources/org/apache/uima/desc/OverridingParamsExtServicesAE.xml Hope this helps, Tommaso 2011/4/7 Jay Luker lb...@reallywow.com Hi, I would like to experiment with the UIMA contrib package, but I have issues with the OpenCalais service's ToS and would rather not use it. Is there a way to adapt the UIMA example setup to use only the AlchemyAPI service? I tried simply leaving out the OpenCalais api key but I get exceptions thrown during indexing. Thanks, --jay
UIMA example setup w/o OpenCalais
Hi, I would like to experiment with the UIMA contrib package, but I have issues with the OpenCalais service's ToS and would rather not use it. Is there a way to adapt the UIMA example setup to use only the AlchemyAPI service? I tried simply leaving out the OpenCalais api key but I get exceptions thrown during indexing. Thanks, --jay
Re: Highlight snippets for a set of known documents
It turns out the answer is I'm a moron; I had an unnoticed rows=1 nestled in the querystring I was testing with. Anyway, thanks for replying! --jay On Fri, Apr 1, 2011 at 4:25 AM, Stefan Matheis matheis.ste...@googlemail.com wrote: Jay, I'm not sure, but did you try it w/ brackets? q=foobar&fq={!q.op=OR}(id:1 id:5 id:11) Regards Stefan On Thu, Mar 31, 2011 at 6:40 PM, Jay Luker lb...@reallywow.com wrote: Hi all, I'm trying to get highlight snippets for a set of known documents and I must be doing something wrong because it's only sort of working. Say my query is "foobar" and I already know that docs 1, 5 and 11 are matches. Now I want to retrieve the highlight snippets for the term "foobar" for docs 1, 5 and 11. What I assumed would work was something like: ...q=foobar&fq={!q.op=OR}id:1 id:5 id:11 This returns numFound=3 in the response, but I only get the highlight snippets for document id:1. What am I doing wrong? Thanks, --jay
Help with parsing configuration using SolrParams/NamedList
Hi, I'm trying to use a CustomSimilarityFactory and pass in per-field options from the schema.xml, like so:

<similarity class="org.ads.solr.CustomSimilarityFactory">
  <lst name="field_a">
    <int name="min">500</int>
    <int name="max">1</int>
    <float name="steepness">0.5</float>
  </lst>
  <lst name="field_b">
    <int name="min">500</int>
    <int name="max">2</int>
    <float name="steepness">0.5</float>
  </lst>
</similarity>

My problem is I am utterly failing to figure out how to parse this nested option structure within my CustomSimilarityFactory class. I know that the settings are available as a SolrParams object within the getSimilarity() method. I'm convinced I need to convert to a NamedList using params.toNamedList(), but my java fu is too feeble to code the dang thing. The closest I seem to get is the top level as a NamedList where the keys are field_a and field_b, but then my values are strings, e.g., "{min=500,max=1,steepness=0.5}". Anyone who could dash off a quick example of how to do this? Thanks, --jay
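I can't run this against Solr here, but the shape of the traversal being asked for looks like the following, with nested Maps standing in for NamedList (in the real factory you would iterate the outer NamedList's entries and cast each value to NamedList; the point is that the inner values arrive as typed Integer/Float objects rather than the flattened strings seen when going through SolrParams). Class and method names are invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerFieldOptions {
    // Read one typed option out of the nested per-field structure.
    public static float steepness(Map<String, Map<String, Object>> opts, String field) {
        return ((Number) opts.get(field).get("steepness")).floatValue();
    }

    public static void main(String[] args) {
        // Mimics the parsed form of the <lst name="field_a"> block above.
        Map<String, Object> fieldA = new LinkedHashMap<>();
        fieldA.put("min", 500);
        fieldA.put("max", 1);
        fieldA.put("steepness", 0.5f);
        Map<String, Map<String, Object>> opts = new LinkedHashMap<>();
        opts.put("field_a", fieldA);
        System.out.println(steepness(opts, "field_a"));
    }
}
```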
Re: Sending binary data as part of a query
On Mon, Jan 31, 2011 at 9:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: that class should probably have been named ContentStreamUpdateHandlerBase or something like that -- it tries to encapsulate the logic that most RequestHandlers using ContentStreams (for updating) need to worry about. Your QueryComponent (as used by SearchHandler) should be able to access the ContentStreams the same way that class does ... call req.getContentStreams(). Sending a binary stream from a remote client depends on how the client is implemented -- you can do it via HTTP using the POST body (with or w/o multi-part mime) in any language you want. If you are using SolrJ you may again run into an assumption that using ContentStreams means you are doing an Update, but that's just a vernacular thing ... something like a ContentStreamUpdateRequest should work just as well for a query (as long as you set the necessary params and/or request handler path). Thanks for the help. I was just about to reply to my own question for the benefit of future googlers when I noticed your response. :) I actually got this working, much the way you suggest. The client is python. I created a gist with the script I used for testing [1]. On the solr side my QueryComponent grabs the stream, uses jzlib.ZInputStream to do the inflating, then translates the incoming integers in the bitset (my solr schema.xml integer ids) to the lucene ids and creates a docSetFilter with them. Very relieved to get this working as it's the basis of a talk I'm giving next week [2]. :-) --jay [1] https://gist.github.com/806397 [2] http://code4lib.org/conference/2011/luker
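A sketch of both halves of the wire format using only java.util.zip (the thread's actual client is the python script in the gist and the server side uses jzlib, but java.util.zip speaks the same zlib stream format; class and method names here are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BitsetStream {

    // Client side: pack the ids as 4-byte big-endian ints and zlib-deflate
    // them -- roughly what a python client doing
    // zlib.compress(struct.pack(...)) would POST as the request body.
    public static byte[] compress(int[] ids) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(bytes));
            for (int id : ids) out.writeInt(id);
            out.close(); // finishes the deflate stream
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Server side: a QueryComponent would read the request's content stream
    // back the same way before mapping the ids to Lucene docids.
    public static int[] decompress(byte[] data, int count) {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            int[] ids = new int[count];
            for (int i = 0; i < count; i++) ids[i] = in.readInt();
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] ids = {1, 5, 11, 9076628};
        System.out.println(Arrays.toString(decompress(compress(ids), ids.length)));
    }
}
```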
Sending binary data as part of a query
Hi all, Here is what I am interested in doing: I would like to send a compressed integer bitset as a query to solr. The bitset integers represent my document ids, and what I want to get back is the facet data for those documents. I have successfully created a QueryComponent class that, assuming it has the integer bitset, can turn that into the necessary DocSetFilter to pass to the searcher, get back the facets, etc. That part all works right now because I'm using either canned or randomly generated bitsets on the server side. What I'm unsure how to do is actually send this compressed bitset from a client to solr as part of the query. From what I can tell, the Solr API classes that are involved in handling binary data as part of a request assume that the data is a document to be added. For instance, extending ContentStreamHandlerBase requires implementing some kind of document loader and an UpdateRequestProcessorChain and a bunch of other stuff that I don't really think I should need. Is there a simpler way? Anyone tried or succeeded in doing anything similar to this? Thanks, --jay
Re: Using jetty's GzipFilter in the example solr.war
On Sun, Nov 14, 2010 at 12:49 AM, Kiwi de coder kiwio...@gmail.com wrote: try putting your filter at the top of web.xml (instead of the middle or bottom); I tried this a few days ago and it's a simple solution (not sure if the spec requires it to be at the top or if it's a bug) Thank you. An explanation of why this worked is probably better explored on the jetty list, but, for the record, it did. --jay
Using jetty's GzipFilter in the example solr.war
Hi, I thought I'd try turning on gzip compression but I can't seem to get jetty's GzipFilter to actually compress my responses. I unpacked the example solr.war and tried adding variations of the following to the web.xml (and then re-jarred), but as far as I can tell, jetty isn't actually compressing anything.

<filter>
  <filter-name>GZipFilter</filter-name>
  <display-name>Jetty's GZip Filter</display-name>
  <description>Filter that zips all the content on-the-fly</description>
  <filter-class>org.mortbay.servlet.GzipFilter</filter-class>
  <init-param>
    <param-name>mimeTypes</param-name>
    <param-value>*</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>GZipFilter</filter-name>
  <url-pattern>*</url-pattern>
</filter-mapping>

I've also tried explicitly listing mime-types and assigning the filter-mapping using servlet-name. I can see that the GzipFilter is being loaded when I add -DDEBUG to the jetty startup command, but as far as I can tell from looking at the response headers nothing is being gzipped. I'm expecting to see Content-Encoding: gzip in the response headers. Anyone successfully gotten this to work? Thanks, --jay
Re: documentCache clarification
On Thu, Oct 28, 2010 at 7:27 PM, Chris Hostetter hossman_luc...@fucit.org wrote: The queryResultCache is keyed on Query,Sort,Start,Rows,Filters and the value is a DocList object ... http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html Unlike the Document objects in the documentCache, the DocLists in the queryResultCache never get modified (technically Solr doesn't actually modify the Documents either; the Document just keeps track of its fields and updates itself as lazy-loaded fields are needed). If a DocList containing results 0-10 is put in the cache, it's not going to be of any use for a query with start=50, but if it contains 0-50 it *can* be used if start < 50 and rows < 50 -- that's where the queryResultWindowSize comes in. If you use start=0&rows=10, but your window size is 50, SolrIndexSearcher will (under the covers) use start=0&rows=50 and put that in the cache, returning a slice from 0-10 for your query. The next query asking for 10-20 will be a cache hit. This makes sense but still doesn't explain what I'm seeing in my cache stats. When I issue a request with rows=10 the stats show an insert into the queryResultCache. If I send the same query, this time with rows=1000, I would not expect to see a cache hit but I do. So it seems like there must be something useful in whatever gets cached on the first request for rows=10 for it to be re-used by the request for rows=1000. --jay
Re: documentCache clarification
On Wed, Oct 27, 2010 at 9:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : schema.) My evidence for this is the documentCache stats reported by : solr/admin. If I request rows=10&fl=id followed by : rows=10&fl=id,title I would expect to see the 2nd request result in : a 2nd insert to the cache, but instead I see that the 2nd request hits : the cache from the 1st request. rows=10&fl=* does the same thing. Your evidence is correct, but your interpretation is incorrect. The objects in the documentCache are Lucene Documents, which contain a List of Field references. When enableLazyFieldLoading=true is set and a Document is fetched from the IndexReader, it only contains the Fields specified in the fl, and all other Fields are marked as LOAD_LAZY. When there is a cache hit on that uniqueKey at a later date, the Fields already loaded are used directly if requested, but the Fields marked LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader, and then the Document updates the reference to the newly actualized fields (which are no longer marked LOAD_LAZY). So with different fl params, the same Document object is continually used, but the Fields in that Document grow as the fields requested (using the fl param) change. Great stuff. Makes sense. Thanks for the clarification, and if no one objects I'll update the wiki with some of this info. I'm still not clear on this statement from the wiki's description of the documentCache: (Note: This cache cannot be used as a source for autowarming because document IDs will change when anything in the index changes so they can't be used by a new searcher.) Can anyone elaborate a bit on that? I think I've read it at least 10 times and I'm still unable to draw a mental picture. I'm wondering if the document IDs referred to are the ones I'm defining in my schema, or are they the underlying Lucene ids, i.e. the ones that, according to the Lucene in Action book, are relative within each segment?
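A toy model of the lazy-loading behavior Hoss describes above may help (Python for brevity; this is not Solr code, and the field names are made up). One cached Document object serves every fl combination, and lazy fields are materialized on first access:

```python
class LazyDocument:
    """Toy model of a documentCache entry with enableLazyFieldLoading=true:
    fields requested so far are materialized; the rest are fetched from the
    index on first access and then kept on the same cached object."""

    def __init__(self, stored_fields, initial_fl):
        self._stored = stored_fields            # stands in for the IndexReader
        self.loaded = {f: stored_fields[f] for f in initial_fl}
        self.lazy = set(stored_fields) - set(initial_fl)
        self.index_reads = 1                    # the initial fetch

    def get(self, field):
        if field in self.lazy:                  # first access: actualize it
            self.loaded[field] = self._stored[field]
            self.lazy.discard(field)
            self.index_reads += 1
        return self.loaded[field]

# First request was fl=id, so only "id" is loaded; "title" and "body"
# are marked lazy on the cached Document.
doc = LazyDocument({"id": "x1", "title": "SALT", "body": "..."}, ["id"])
doc.get("id")      # served from the already-loaded field
doc.get("title")   # lazy-loaded now, then stays loaded for later requests
```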
: will *not* result in an insert to queryResultCache. I have tried : various increments--10, 100, 200, 500--and it seems the magic number : is somewhere between 200 (cache insert) and 500 (no insert). Can : someone explain this? In addition to the queryResultMaxDocsCached config option already mentioned (which controls whether a DocList is cached based on its size), there is also the queryResultWindowSize config option, which may confuse your cache observations. If the window size is 50 and you ask for start=0&rows=10, what actually gets cached is 0-50 (assuming there are more than 50 results), so a subsequent request for start=10&rows=10 will be a cache hit. Just so I'm clear, does the queryResultCache operate in a similar manner to the documentCache as to what is actually cached? In other words, is it the caching of the DocList object that is reported in the cache statistics hits/inserts numbers? And would that object get updated with a new set of ordered doc ids on subsequent, larger requests? (I'm flailing a bit to articulate the question, I know.) For example, if my queryResultMaxDocsCached is set to 200 and I issue a request with rows=500, then I won't get a DocList entry in the queryResultCache. However, if I issue a request with rows=10, I will get an insert, and then a later request for rows=500 would re-use and update that original cached DocList. Right? And would it be updated with the full list of 500 ordered doc ids or only 200? Thanks, --jay
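The window/max-docs interaction described in this thread can be sketched as a toy cache (Python for brevity; this follows the description above, not Solr's actual implementation):

```python
class QueryResultCache:
    """Toy model of queryResultWindowSize / queryResultMaxDocsCached."""

    def __init__(self, window_size=50, max_docs_cached=200):
        self.window = window_size
        self.max_docs = max_docs_cached
        self.cache = {}      # (query, sort, filters) key -> list of doc ids
        self.inserts = self.hits = 0

    def fetch(self, key, start, rows, all_results):
        cached = self.cache.get(key)
        if cached is not None and start + rows <= len(cached):
            self.hits += 1                      # the window covers the request
            return cached[start:start + rows]
        # Miss: round the requested range up to a multiple of the window size.
        upper = -(-(start + rows) // self.window) * self.window
        result = all_results[:upper]
        if len(result) <= self.max_docs:        # queryResultMaxDocsCached check
            self.cache[key] = result
            self.inserts += 1
        return result[start:start + rows]

docs = list(range(10_000))
c = QueryResultCache()
c.fetch("q1", 0, 10, docs)    # miss: caches rows 0-50, one insert
c.fetch("q1", 10, 10, docs)   # hit: served from the cached window
c.fetch("q1", 0, 500, docs)   # 500 > maxDocsCached: returned but not cached
```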
documentCache clarification
Hi all, The solr wiki says this about the documentCache: The more fields you store in your documents, the higher the memory usage of this cache will be. OK, but if I have enableLazyFieldLoading set to true and in my request parameters specify fl=id, then the number of fields per document shouldn't affect the memory usage of the document cache, right? Thanks, --jay
Re: documentCache clarification
(btw, I'm running 1.4.1) It looks like my assumption was wrong. Regardless of the fields selected using the fl parameter and the enableLazyFieldLoading setting, Solr apparently fetches from disk and caches all the fields in the document (or maybe just those that are stored=true in my schema). My evidence for this is the documentCache stats reported by solr/admin. If I request rows=10&fl=id followed by rows=10&fl=id,title I would expect to see the 2nd request result in a 2nd insert to the cache, but instead I see that the 2nd request hits the cache from the 1st request. rows=10&fl=* does the same thing. I.e., the first request, even though I have enableLazyFieldLoading=true and I'm only asking for the ids, fetches the entire document from disk and inserts it into the documentCache. Subsequent requests, regardless of which fields I actually select, don't hit the disk but are loaded from the documentCache. Is this really the expected behavior and/or am I misunderstanding something? A 2nd question: while watching these stats I noticed something else weird with the queryResultCache. It seems that inserts to the queryResultCache depend on the number of rows requested. For example, an initial request (solr restarted, clean cache, etc.) with rows=10 will result in an insert. A 2nd request of the same query with rows=1000 will result in a cache hit. However, if you reverse that order, starting with a clean cache, an initial request for rows=1000 will *not* result in an insert to the queryResultCache. I have tried various increments--10, 100, 200, 500--and it seems the magic number is somewhere between 200 (cache insert) and 500 (no insert). Can someone explain this? Thanks, --jay On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: I've been wondering about this too some time ago. I've found more information on enableLazyFieldLoading in SOLR-52 [1] and some correspondence [2] on this, but it didn't give me a definitive answer.
[1]: https://issues.apache.org/jira/browse/SOLR-52 [2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html On Wednesday 27 October 2010 16:39:44 Jay Luker wrote: Hi all, The solr wiki says this about the documentCache: The more fields you store in your documents, the higher the memory usage of this cache will be. OK, but if i have enableLazyFieldLoading set to true and in my request parameters specify fl=id, then the number of fields per document shouldn't affect the memory usage of the document cache, right? Thanks, --jay -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
Re: Autocommit not happening
For the sake of any future googlers I'll report my own clueless but thankfully brief struggle with autocommit. There are two parts to the story: in Part One I realized my autoCommit config was not contained within my updateHandler. In Part Two I realized I had typed autocommit rather than autoCommit. --jay On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa jo...@ipstreet.com wrote: On Jul 23, 2010, at 9:37 AM, John DeRosa wrote: Hi! I'm a Solr newbie, and I don't understand why autocommits aren't happening in my Solr installation. [snip] Never mind... I have discovered my boneheaded mistake. It's so silly, I wish I could retract my question from the archives.
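In other words, the working shape in solrconfig.xml is the following (the maxDocs/maxTime values here are just example numbers; tune them for your load):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- note: autoCommit with a capital C, nested inside updateHandler -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```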