Re: camel-casing and dismax troubles
On Wed, May 13, 2009 at 6:23 AM, Yonik Seeley yo...@lucidimagination.com wrote:

> On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young ge...@modperlcookbook.org wrote:
>> hi all :) I'm having trouble with camel-cased query strings and the
>> dismax handler. a user query "LeAnn Rimes" isn't matching the indexed
>> term "Leann Rimes".
>
> This is the camel-case case that can't currently be handled by a single
> WordDelimiterFilter. If the indexed doc had "LeAnn", then it would be
> indexed as le,ann/leann, and hence queries of both forms ("le ann" and
> "leann") would match. However, since the indexed term is simply "leann",
> a WordDelimiterFilter configured to split won't match (a search for
> "LeAnn" will be translated into a search for "le ann").

but the catenateWords and/or catenateAll options should handle splicing
the tokens back together, right?

> One way to work around this now is to do a copyField into another field
> that catenates split terms in the query analyzer instead of
> generating/splitting, and then search across both fields.

yeah, unfortunately, that's not an option for me :)

> BTW, your parsed query below shows you turned on both catenation and
> generation (or perhaps preserveOriginal) for split subwords in your
> query analyzer. Unfortunately this configuration doesn't work, due to
> the ambiguity of what it means to have multiple terms at the same
> position (this is the same problem as multi-word synonyms at query
> time). The query shown below looks for "leann", or for "le" followed by
> "ann", and hence an indexed term of "leann" won't match.

ugh. ok, thanks for letting me know.

I'm not using the same catenate parameters on the index side as on the
query side, based on the solr wiki docs, but I've always wondered if that
was a good idea. I'll see if matching them up helps at all.

thanks. I'll let you know what I find.

--Geoff
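for the archives, a rough sketch of the copyField workaround Yonik
describes - the "search-en-cat" field and type names are placeholders of
mine, and the config is untested:

  <!-- schema.xml: copy the same source text into a second field -->
  <field name="search-en"     type="search-en"     indexed="true" stored="false"/>
  <field name="search-en-cat" type="search-en-cat" indexed="true" stored="false"/>
  <copyField source="search-en" dest="search-en-cat"/>

  <!-- the query analyzer of the "search-en-cat" type catenates subwords
       instead of generating/splitting them -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0"
          catenateWords="1" catenateNumbers="1" catenateAll="1"/>

then search across both fields, e.g. qf="search-en search-en-cat" with
dismax, so either the split or the glued form can match.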
camel-casing and dismax troubles
hi all :)

I'm having trouble with camel-cased query strings and the dismax handler.
a user query "LeAnn Rimes" isn't matching the indexed term "Leann Rimes",
even though both are lower-cased in the end. furthermore, the analysis
tool shows a match.

the debug query looks like

  parsedquery: +((DisjunctionMaxQuery((search-en:"(leann le) ann"))
  DisjunctionMaxQuery((search-en:rimes)))~2) ()

I have a feeling it's due to how the broken-up tokens are added back into
the token stream with preserveOriginal, and some strange interaction
between that order and dismax, but I'm not entirely sure.

configs follow. thoughts appreciated.

--Geoff

  <fieldType name="search-en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>
  </fieldType>
dismax and WordDelimiterFilterFactory+PreserveOriginal
hi all :)

I have two filters combined with dismax on the query side:

  WordDelimiterFilterFactory { preserveOriginal=1, generateWordParts=1,
  generateNumberParts=1, catenateWords=0, catenateNumbers=0, catenateAll=0 }

followed by a lowercase filter factory. the analyzer shows the phrase
"gUYS and dOLLS" being tokenized as guys/uys/g and dolls/olls/d, and
matching an index where everything is like you would expect (lowercased,
etc). anyway, dismax is failing to get a match, even though the analyzer
says all is ok. dismax reports the following:

  rawquerystring: "gUYS and dOLLS"
  querystring: "gUYS and dOLLS"
  parsedquery: +((DisjunctionMaxQuery((search:"(guys g) uys"))
  DisjunctionMaxQuery((search:"(dolls d) olls")))~2) ()
  parsedquery_toString: +(((search:"(guys g) uys")
  (search:"(dolls d) olls"))~2) ()

so it seems like preserveOriginal is mucking with the token order in a
way that makes dismax very unhappy.

thoughts?

--Geoff
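reading that parsed query back (my own annotation, not from the debug
output):

  search:"(guys g) uys"

is a phrase query with two positions: position 1 holds both "guys" (the
preserved original) and "g" (the first subword), while position 2 holds
"uys". an indexed token "guys" occupies a single position, so this phrase
only matches an index containing "g" or "guys" *followed by* "uys" -
which a plain lowercased index doesn't have.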
filtering on blank OR specific range
hi all :)

I'm having difficulty filtering my documents when a field is either blank
or set to a specific value. I would have thought this would work:

  fq=-Type:[* TO *] OR Type:blue

which I would expect to find all documents where either Type is undefined
or Type is "blue". my actual result set is zero. using a similar filter

  fq=-Type:[* TO *] OR OtherThing:cat

does what I would expect (documents with undefined Type or documents with
cats), so it feels like solr is getting confused by the range negation
and OR'ing, but only when the field is the same. adding various
parentheses makes no difference.

I know this is kind of nebulous sounding, but I was hoping someone would
look at this and go "you're doing it wrong. your filter should be..."

the field is defined as

  <field name="Type" type="string" indexed="true" stored="true" multiValued="true"/>

if it matters.

tia

--Geoff
Re: filtering on blank OR specific range
Lance Norskog wrote:

> Try:
>
>   Type:blue OR -Type:[* TO *]
>
> You can't have a negative clause at the beginning. Yes, Lucene should
> barf about this.

I did try that, before and again now, and still no luck. anything else?

--Geoff
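for the archives, the workaround that usually does the trick (untested
against this exact schema, so treat it as a sketch): a pure negative
clause has nothing to subtract from, so hand the negation the full
document set explicitly:

  fq=(*:* -Type:[* TO *]) OR Type:blue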
Re: solr 1.3 snapshooter doesn't work, commit never ending
sunnyfr wrote:

> I tried last evening before leaving, and this morning the time elapsed
> was very large, as you can see above, yet no snapshot and no error in
> the logs.

I'm actually having similar trouble. I've enabled postCommit and
postOptimize hooks with an absolute path to snapshooter. not only are the
snapshots not created, I can't even see calls (or errors) in catalina.out.
it's supposed to be this easy, right?

  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">/path/to/solr/bin/snapshooter</str>
    <bool name="wait">true</bool>
  </listener>

nothing else? of course, calling it manually is just fine.

--Geoff
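for comparison, the stock solrconfig.xml example for RunExecutableListener
also sets a working directory and optional args/env, which may matter if
snapshooter expects to be run from its own directory:

  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">solr/bin/snapshooter</str>
    <str name="dir">solr/bin</str>           <!-- cwd for the process -->
    <bool name="wait">true</bool>
    <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
    <arr name="env"> <str>MYVAR=val1</str> </arr>
  </listener>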
Re: using DataImportHandler instead of POST?
Chris Hostetter wrote:

> : I chug away at 1.5 million records in a single file, but solr never
> : commits. specifically, it ignores my autocommit settings. (I can
> : commit separately at the end, of course :)
>
> the way the autocommit settings work is something i always get confused
> by -- the autocommit logic may not kick in until the add is finished,
> regardless of how many docs are in it -- but i'm not certain (and if
> i'm correct, i'm not sure if that's a bug or a feature)

ok, that makes sense. fwiw, I tried to break the records into <add>
chunks in the same file, but solr complained about multiple root
entities. I knew you couldn't mix adds and deletes (rats ;) but I figured
multiple <add> blocks would be ok. I guess not :)

> this may be a motivating reason to use DIH in your use case, even
> though you've already got it in the XmlUpdateRequestHandler format.

yeah, I'll check. though I don't know what I'd do about trying to figure
out which records were committed and which weren't...

> : but I might be misunderstanding autocommit. I have it set as the
> : default solrconfig.xml does, in the updateHandler section (mapped to
> : DirectUpdateHandler2) but /update is mapped to XmlUpdateRequestHandler.
> : should I be shuffling some things around?
>
> due to some unfortunate naming decisions several years ago, an "update
> handler" and a request handler that does updates aren't the same thing
> ... updateHandler (which should always be DirectUpdateHandler2) is the
> low-level internal code that is responsible for actually making the
> index modifications -- XmlUpdateRequestHandler (or DataImportHandler)
> parses the raw input and hands off to DirectUpdateHandler2 to make the
> changes.

ok, thanks. I kind of inferred that from the wiki, but it was still
confusing, so thanks for the clarification.

--Geoff
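for reference, the autocommit block under discussion lives inside
<updateHandler> in solrconfig.xml and looks roughly like this (the
thresholds are illustrative, not recommendations):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after this many ms, whichever comes first -->
    </autoCommit>
  </updateHandler>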
Re: using DataImportHandler instead of POST?
Geoffrey Young wrote:

> Chris Hostetter wrote:
>
>> : I have a well-formed xml file, suitable for POSTing to solr. that
>> : works just fine. it's very large, though, and using curl in
>> : production is so very lame. is there a very simple config that will
>> : let solr just slurp up the file via the DataImportHandler? solr
>> : already has
>>
>> You don't even need DIH for this, just enableRemoteStreaming and use
>> the stream.file param and you can load the file from local disk...
>>
>> http://wiki.apache.org/solr/ContentStream
>
> this is the solution I think I'm going to go with - it seems to work
> perfectly.

well, with one exception. I chug away at 1.5 million records in a single
file, but solr never commits. specifically, it ignores my autocommit
settings. (I can commit separately at the end, of course :)

but I might be misunderstanding autocommit. I have it set as the default
solrconfig.xml does, in the updateHandler section (mapped to
DirectUpdateHandler2) but /update is mapped to XmlUpdateRequestHandler.
should I be shuffling some things around?

thanks.

--Geoff
Re: using DataImportHandler instead of POST?
Chris Hostetter wrote:

> : I have a well-formed xml file, suitable for POSTing to solr. that
> : works just fine. it's very large, though, and using curl in
> : production is so very lame. is there a very simple config that will
> : let solr just slurp up the file via the DataImportHandler? solr
> : already has
>
> You don't even need DIH for this, just enableRemoteStreaming and use
> the stream.file param and you can load the file from local disk...
>
> http://wiki.apache.org/solr/ContentStream

this is the solution I think I'm going to go with - it seems to work
perfectly. thanks (to both of you).

--Geoff
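to spell out the stream.file approach (paths and port are placeholders of
mine): first allow remote streaming in solrconfig.xml

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="true"
                    multipartUploadLimitInKB="2048"/>
  </requestDispatcher>

then point /update at the file on the server's local disk - no POST body
required:

  http://localhost:8983/solr/update?stream.file=/path/to/data.xml&commit=true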
using DataImportHandler instead of POST?
hi all :)

I'm sorry I need to ask this, but after reading and re-reading the wiki I
don't see a clear path...

I have a well-formed xml file, suitable for POSTing to solr. that works
just fine. it's very large, though, and using curl in production is so
very lame. is there a very simple config that will let solr just slurp up
the file via the DataImportHandler? solr already has everything it needs
in schema.xml, so I don't think this would be very hard... if I fully
understood the DataImportHandler :)

tia

--Geoff
Re: spellchecker problems (bugs)
> This issue has been fixed in the trunk. Can you please use the latest
> trunk code and try?

current trunk looks good. thanks!

--Geoff
Re: Multiple search components in one handler - ie spellchecker
Andrew Nagy wrote:

> Hello - I am attempting to add the spellCheck component to my search
> requesthandler so that when a user does a search, they get the results
> and spelling corrections all in one query, just like the way the facets
> work. I am having some trouble accomplishing this - can anyone point me
> to documentation (other than http://wiki.apache.org/solr/SpellCheckComponent)
> on how to do this, or an example solrconfig that would do this correctly?

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200806.mbox/[EMAIL PROTECTED]

in general, just add the

  <arr name="last-components">
    <str>spellcheck</str>
  </arr>

bit to your existing handler after following the setup in the wiki docs.
you can ignore the part about the exceptions, as that has been fixed in
trunk.

HTH

--Geoff
Re: Multiple search components in one handler - ie spellchecker
Andrew Nagy wrote:

> Thanks for getting back to me Geoff. Although, that is pretty much what
> I have. Maybe if I show my solrconfig someone might be able to point
> out what I have incorrect? The problem is that nothing related to the
> spelling options is shown in the results, just the normal expected
> search results.

right. the spellcheck component does not issue a separate query *after*
running the spellcheck, it merely offers suggestions in parallel with
your existing query. the results are more like "below are the results for
$query. did you mean $suggestions?"

HTH

--Geoff
Re: spell-checker and faceting
dudes dudes wrote:

> Hi, I'm trying to couple the spell-checking mechanism with faceting in
> one url statement.. I can get the spell check right, but the facet
> doesn't work when it's combined with the spell-checker...
>
> http://localhost:8080/solr/spellCheckCompRH?q=smath&spellcheck.q=smath&spellcheck=true&spellcheck.build=true&select?q=smath&rows=0&facet=true&facet.limit=1&facet.field=firstname
>
> it corrects smath to Smith, but doesn't facet it.

I was able to get faceting working without issue. it seems to me your
query string is off - note the 'select?q=smath' in the middle of your
query. I'd try again with that part removed. also note you only need
spellcheck.build=true once, not on each request.

--Geoff
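something like this is probably what was intended (host, port, and field
names taken from the original post):

  http://localhost:8080/solr/spellCheckCompRH?q=smath&spellcheck=true&spellcheck.q=smath&rows=0&facet=true&facet.limit=1&facet.field=firstname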
Re: spellchecker problems (bugs)
Jonathan Lee wrote:

> I don't see the patch attached to my original email either -- does
> solr-user not allow attachments? This is ugly, but here's the patch
> inline:

issue created in jira: https://issues.apache.org/jira/browse/SOLR-648

--Geoff
Re: spellchecker problems (bugs)
Shalin Shekhar Mangar wrote:

> The problems you described in the spellchecker are noted in
> https://issues.apache.org/jira/browse/SOLR-622 -- I shall create an
> issue to synchronize spellcheck.build so that the index is not
> corrupted.

I'd like to discuss this a little... I'm not sure that I want to rebuild
the spelling index each time the underlying data index changes - the
process takes very long, and my updates are frequent changes to
non-spelling related data. what I'd really like is for a change to my
index to not cause an exception.

IIRC the old way of using a spellchecker didn't work like this at all - I
could completely rm data/index and leave data/spell in place, add new
data, not issue cmd=build, and the spelling parts still worked just fine
(albeit with old data).

not to say that SOLR-622 isn't a good idea (it is), but I don't really
think the entire solution is keeping the spellcheck index in sync. do
they need to be kept in sync for things not to implode on me?

--Geoff
Re: problems with SpellCheckComponent
> When I made:
>
>   http://localhost:8080/solr/spellCheckCompRH?q=*:*&spellcheck.q=ruck&spellcheck=true
>
> I have this exception:
>
>   HTTP Status 500 - null java.lang.NullPointerException
>     at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:217)

I see this all the time - to the point where I wonder how stable the new
component is. I *think* I've traced it to

  o the presence of both q *and* spellcheck.q
  o and *any* restart of solr without re-issuing spellcheck.build=true

I haven't been using any form of spellchecker for long, but I'm
reasonably sure that I didn't need to rebuild on every restart. I also
used to think it was changes to schema.xml (and not a simple restart)
that caused the issue, but I've seen the exception with no changes. I've
also seen the exception pop up without a restart when the server sits
overnight (last query of the day ok, go to sleep, query again in the
morning and *boom*).

but regardless of restart issues, I've never seen it happen with just the
q or just the spellcheck.q fields in my query - it's always when they're
both there.

--Geoff
Re: problems with SpellCheckComponent
Shalin Shekhar Mangar wrote:

> Hi Geoff, I can't find anything in the code which would give this
> exception when both q and spellcheck.q are specified. Though, this
> exception is certainly possible when you restart solr. Anyways, I'll
> look into it more deeply.

great, thanks.

> There are a few ways in which we can improve this component. For
> example, a lot of this trouble can go away if we can reload the spell
> index on startup if it exists, or build it if it does not exist
> (SOLR-593 would need to be resolved for this). With SOLR-605 committed,
> we can now add an option to re-build the index (built from Solr fields)
> on commits by adding a listener using the API. There are a few issues
> with collation which are being handled in SOLR-606. I'll open new
> issues to track these items. Please bear with us since this is a new
> component and may take a few iterations to stabilize. Thank you for
> helping us find these issues :)

np - this is a great feature to have, and it's going to save me some
effort as we prepare for deployment, so it's worth taking the time to
work out the bugs. thanks for your effort.

--Geoff
Re: SpellCheckerRequestHandler qt parameter
I had null pointer exceptions left and right while composing this
email... then I added spellcheck.build=true to one and they went away. do
you need to rebuild the spelling index every time you alter (certain
parts of) solrconfig.xml? it was very consistent as reported below, but
after simply issuing a rebuild I can't reproduce the null pointer.

this seems to happen every time I stop and start solr:

  o q=term&spellcheck.q=term - ok
  o stop/start solr (we're using tomcat 6.0.16)
  o q=term&spellcheck.q=term - null pointer
    (SpellCheckComponent.getTokens(SpellCheckComponent.java:215))
  o q=$term - ok
  o q=term&spellcheck.q=term&spellcheck.build=true - ok

--Geoff
Re: SpellCheckerRequestHandler qt parameter
Norberto Meijome wrote:

> Hi there,
>
> Short and sweet: is SCRH intended to honour qt= ?
>
> longer... I'm testing the newest SCRH (SOLR-572), using last night's
> nightly build. I have defined a 'dismax' request handler which searches
> across a number of fields. When I use the SCRH in a query and I pass
> the qt=dismax parameter, it is ignored. Furthermore, the default field
> is shown as being used when I add debugQuery=true. I could replace some
> of dismax's capabilities with a longer query string, but some
> parameters such as mm don't seem to exist with the standard handler.

it seems like it ought to work as a component of your dismax handler.
this works for me:

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">none</str>
      <str name="indent">off</str>
      <str name="qf">search-en</str>
    </lst>
    <lst name="invariants">
      <str name="mm">100%</str>
      <str name="wt">json</str>
    </lst>
    <lst name="appends">
      <str name="fq">Type:Event</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck"
                   class="org.apache.solr.handler.component.SpellCheckComponent">
    ... from docs ...
  </searchComponent>

well, *almost* - it works most excellently with q=$term, but when I add
spellcheck.q=$term things implode:

  HTTP Status 500 - null java.lang.NullPointerException
    at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215)
    at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:183)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
    at...

not being a java guy, I need to use solr out of the box, and adding
spellcheck.q makes my multi-word terms checked at the phrase level
("mickey mouse"), which is the behavior I'm seeking, instead of at the
word level ("mickey", "mouse"). the docs make it sound like I could write
my own SpellingQueryConverter, but... well, they also use both q and
spellcheck.q at the same time, so it shouldn't implode like that :)

anyway, HTH

--Geoff
Re: SpellCheckerRequestHandler qt parameter
Grant Ingersoll wrote:

> On Jun 26, 2008, at 5:25 PM, Geoffrey Young wrote:
>
>> well, *almost* - it works most excellently with q=$term, but when I
>> add spellcheck.q=$term things implode:
>>
>>   HTTP Status 500 - null java.lang.NullPointerException
>>     at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215)
>>     at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:183)
>>     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
>>     at...
>>
>> not being a java guy, I need to use solr out of the box, and adding
>> spellcheck.q makes my multi-word terms checked at the phrase level
>> ("mickey mouse"), which is the behavior I'm seeking, instead of at the
>> word level ("mickey", "mouse"). the docs make it sound like I could
>> write my own SpellingQueryConverter, but... well, they also use both q
>> and spellcheck.q at the same time, so it shouldn't implode like that :)
>
> What's your searchComponent look like for the SpellCheckComponent,
> exactly?

sorry for the long post - first some trivial stuff...

I had null pointer exceptions left and right while composing this
email... then I added spellcheck.build=true to one and they went away. do
you need to rebuild the spelling index every time you alter (certain
parts of) solrconfig.xml? it was very consistent as reported below, but
after simply issuing a rebuild I can't reproduce the null pointer.

my original problem was... a request to

  /solr/select?q=celin+dion&qf=search-en&qt=Search::Model::JSON::Search::Scan&spellcheck=true&indent=on&echoParams=all

succeeds as

  {
    "responseHeader": {
      "status": 0,
      "QTime": 9,
      "params": {
        "fq": "Type:Event OR Type:Attraction OR Type:Venue",
        "echoParams": "all",
        "indent": "on",
        "qf": "search-en",
        "defType": "dismax",
        "spellcheck": "true",
        "echoParams": "all",
        "indent": "on",
        "q": "celin dion",
        "qf": "search-en",
        "qt": "Search::Model::JSON::Search::Scan",
        "mm": "100%",
        "facet": "false",
        "start": "0",
        "wt": "json",
        "rows": "0"}},
    "response": {"numFound": 59, "start": 0, "docs": []},
    "spellcheck": {
      "suggestions": [
        "celin", [
          "numFound", 1,
          "startOffset", 0,
          "endOffset", 5,
          "suggestion", ["celina"]]]}}

while a request to

  /solr/select?q=celin+dion&qf=search-en&qt=Search::Model::JSON::Search::Scan&spellcheck=true&indent=on&echoParams=all&spellcheck.q=celin+dion

implodes with

  null java.lang.NullPointerException
    at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215)
    at...

if it makes a difference, it's svn trunk from last night + SOLR-14
applied.
thanks for taking the time - I'm hoping this isn't now a wild goose
chase :)

--Geoff

solrconfig.xml:

  <requestHandler name="Search::Model::JSON::Search::Scan"
                  class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">none</str>
      <str name="indent">off</str>
      <str name="qf">search sAttractionName sVenueName</str>
    </lst>
    <lst name="invariants">
      <str name="mm">100%</str>
      <str name="wt">json</str>
      <int name="start">0</int>
      <int name="rows">0</int>
      <str name="facet">false</str>
    </lst>
    <lst name="appends">
      <str name="fq">Type:Event OR Type:Attraction OR Type:Venue</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck"
                   class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="defaults">
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.extendedResults">false</str>
    </lst>
    <lst name="invariants">
      <str name="spellcheck.count">5</str>
    </lst>
    <str name="queryAnalyzerFieldType">spell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">defaultspell</str>
    </lst>
  </searchComponent>

  <queryConverter name="queryConverter"
                  class="org.apache.solr.spelling.SpellingQueryConverter"/>

schema.xml:

  <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
Re: missing document count?
Chris Hostetter wrote:

> : not hard, but useful information to have handy without additional
> : manipulations on my part.
> : our pages are the results of multiple queries. so, given a max number
> : of records per page (or total), the rows asked of query2 is max -
> : query1, of
>
> in the common case, counting the number of docs in a result is just as
> easy as reading some attribute containing the count.

I suppose :) in my mind, one (potentially) requires just a read, while
the other requires some further manipulations. but I suppose most modern
languages have optimizations for things like array size :)

> It sounds like you have a more complicated case where what you really
> want is the count of how many docs there are in the entire response

I don't know how complex it is to ask for documents in the response, but
yes :)

> (ie: multiple result sections) ...

multiple results from multiple queries, not a single query. but really, I
wasn't planning on having anyone (solr or otherwise) solve my needs. I
just find it odd that I need to discern the number of returned results.

> that count is admittedly a little more work but would also be
> completely useless to most clients if it was included in the response

perhaps :)

> (just as the number of fields in each doc, or the total number of
> strings in the response) ... there is a lot of metadata that *could* be
> included in the response, but we don't bother when the client can
> compute that metadata just as easily as the server -- among other
> things, it helps keep the response size smaller.

agreed - smaller is better. as for "client as easily as the server", I
assumed that solr was keeping track of the document count already, if
only to see when the number of documents exceeds the rows parameter. if
so, all the people who care about the number of documents in the result
(which, I'll assume, is more than those who care about total strings in
the response ;) are all re-computing a known value.

> This was actually one of the original guiding principles of Solr:
> support features that are faster/cheaper/easier/more-efficient on the
> central server than they would be on the clients (sorting, docset
> caching, faceting, etc...)

sure, I'll buy that. but in my mind it was only exposing something solr
was already calculating anyway.

regardless, thanks for taking the time :)

--Geoff
Re: searching only within allowed documents
> Solr allows you to specify filters in separate parameters that are
> applied to the main query, but cached separately.
>
>   q=the user query&fq=folder:f13&fq=folder:f24

I've been wanting more explanation around this for a while, so maybe now
is a good time to ask :)

the "cached separately" verbiage here is the same as in the wiki, but I
don't really understand what it means. more precisely, I'm wondering what
the real performance, caching, etc differences are between

  q=fielda:foo+fieldb:bar&mm=100%

and

  q=fielda:foo&fq=fieldb:bar

my situation is similar to the original poster's in that the set of
documents matching fielda is very large and common (say theaters across
the world), while fieldb would narrow it considerably (one by country,
then one by zipcode, etc).

thanks

--Geoff
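for the archives, my current understanding of "cached separately" (pieced
together from the wiki, so treat it as hedged): each fq is materialized
as a document set and stored in the filterCache keyed on the filter query
itself, so a repeated filter becomes a cheap set intersection instead of
a re-scored query:

  # request 1: computes the fieldb:bar filter once and caches the DocSet
  q=fielda:foo&fq=fieldb:bar

  # request 2: different main query, same filter - the fieldb:bar DocSet
  # comes straight from the filterCache
  q=fielda:baz&fq=fieldb:bar

an fq clause also stays out of scoring, whereas folding fieldb:bar into q
influences the relevance ranking.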
adding expand=true to WordDelimiterFilter
hi :)

I'm having an interesting problem with my data. in general, I want the
results of the WordDelimiterFilter for better matching, but there are
times when it's just too aggressive. for example:

  boys2men => boys 2 men   (good)
  p!nk     => pnk          (maybe)
  !!!      =>              (nothing - bad)

there's a special place for bands who name themselves just punctuation
marks :)

anyway, one way around this is synonyms. but if I do that, then I need to
run the synonym filter multiple times. the first might expand

  !!!  => chk chk chk
  p!nk => pink

while the next would need to run after the WordDelimiterFilter for

  boys 2 men => boyz II men

I'd really like to avoid multiple passes (and multiple synonym files) if
at all possible, but that's the solution I'm faced with currently...
unless an 'expand' option were added to the WordDelimiterFilter, in which
case I'd have

  p!nk => p!nk, pnk

after it runs, so I could just apply the synonyms once. or maybe there's
another solution I'm missing.

would it be difficult (or desirable) to add an expand option?

--Geoff
Re: adding expand=true to WordDelimiterFilter
Chris Hostetter wrote:

> by "expand=true" it sounds like you mean you are looking for a way to
> preserve the original term without any characters removed.

yes, that's it.

> This sounds like SOLR-14 ... you might want to take a look at it, see
> if the patch is still usable, and if not see if you can bring it up to
> date.

I'm working with a team that deploys this all for me, so I've asked them.
I'll report back. thanks for pointing it out :)

--Geoff
Re: token concat filter?
Otis Gospodnetic wrote:

> Geoff,
>
> Whether synonyms are applied at index time or query time is controlled
> via schema.xml - it depends on where you put the synonym factory,
> whether in the index-time or query-time section of a fieldType.
> Synonyms are read once on start, I believe. It might be good to have
> them read at index reader open time, as is done with the elevate
> component...

I'm looking a bit more into this now. I don't think you need synonyms
applied at both query and index time if you're using expand - one or the
other ought to work properly. in fact, I suspect I'm the last person to
figure this out ;)

the question is, then, which is the more efficient place to apply them?
my first inclination is to apply them (and other similar expanding
mechanisms) to just the index, so that the expansion happens only once
and is held in (an efficient) index, as opposed to manipulating every
query. the SpellCheckerRequestHandler example on the wiki has the
opposite configuration, expanding synonyms on (only) the query.

thoughts on which approach is the more efficient one?

--Geoff
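for anyone skimming, the placement difference looks like this (a sketch;
the fieldType is abbreviated to just the relevant pieces):

  <!-- index-time expansion: bigger index, synonyms baked in at add time -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>

  <!-- query-time expansion: smaller index, but every query is rewritten,
       and synonym changes take effect without re-indexing -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>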
Re: token concat filter?
Otis Gospodnetic wrote:

> There is actually a Wiki page explaining this pretty well... have you
> seen it?

I guess not. I've been reading the wiki, but the trouble with wikis
always seems to be (for me) finding stuff. can you point it out?

> Index-time expansion means larger indices and an inability to easily
> change synonyms (e.g. you thought of a new synonym for "fish" and want
> to add it to the already indexed docs).

yes, I've thought of the latter limitation. due to other factors, I'm
hoping to re-index all of our documents from scratch nightly, so that's
not much of a concern.

--Geoff
Re: Sort results on a field not ordered
Erik Hatcher wrote:

> What field type is chapterTitle? I'm betting it is an analyzed field
> with multiple values (tokens/terms) per document. To successfully sort,
> you'll need to have a single value per document - using copyField can
> help with this, to have both a searchable field and a sortable version.

does this apply to facet fields as well? I noticed that if I set
facet.sort=true, the results are indeed sorted by count... until the
counts are the same, after which they are in random order (instead of
ascii alpha).

--Geoff
token concat filter?
hi :)

I'm looking for a filter that will compress all tokens into a single
token. the WordDelimiterFilterFactory does it for tokens it finds itself,
but not ones passed to it.

basically, I'm trying to match "Radiohead" in the index with "radio head"
in the query. if it were spelled "RadioHead" or "Radio-head" in the index
I'd find it, but as it is, I'm missing it... unless I could squish all
the query terms into a single token. or maybe there's another route I
haven't thought about yet.

--Geoff
Re: token concat filter?
Yonik Seeley wrote:

> If there are only a few such cases, it might be better to use synonyms
> to correct them.

unfortunately, there are too many to handle this way.

> Off the top of my head there's no concatenating token filter, but it
> wouldn't be hard to make one.

hmm, ok. I'm not a java guy, so I'll try the PatternTokenizerFactory
before trying to write my own. thanks :)

speaking of synonyms... will changes to synonyms.txt (and the other
files) take effect on each re-indexing, or does the solr server read it
once on load and then hold on to it until restart?

--Geoff
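for the adventurous, a minimal sketch of the filter Yonik describes,
written against the Lucene 2.x TokenStream API that Solr used at the
time. the class name is my own invention and this is untested - and Solr
would still need a small factory class to wire it into schema.xml:

  import java.io.IOException;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /**
   * Glues every token in the stream into one, so "radio", "head"
   * comes out as the single token "radiohead".
   */
  public class ConcatFilter extends TokenFilter {

      public ConcatFilter(TokenStream input) {
          super(input);
      }

      public Token next() throws IOException {
          StringBuffer buf = new StringBuffer();
          int start = -1;
          int end = 0;
          Token tok;
          while ((tok = input.next()) != null) {
              if (start < 0) {
                  start = tok.startOffset();  // remember where the first token began
              }
              end = tok.endOffset();
              buf.append(tok.termText());     // glue the term text together
          }
          if (start < 0) {
              return null;                    // stream already exhausted
          }
          // emit the single concatenated token spanning all the originals
          return new Token(buf.toString(), start, end);
      }
  }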
Re: token concat filter?
Walter Underwood wrote:

> I've been doing it with synonyms and I have several hundred of them.

I'm dealing mostly with proper names, so I expect more like 80k of them
for our data :)

> Concatenating bi-word groups is pretty useful for English. We have a
> habit of gluing words together. "database" used to be two words.
> Dictionaries still think it should be "web server". :)

--Geoff
Re: token concat filter?
Walter Underwood wrote:

> I doubt it would be that many. I recommend tracking the searches and
> the clicks, and working on queries with low clickthrough.

the trouble is I'm in a dynamic biz - last week's popular clicks are very
different from this week's, so by the time I analyze last week's popular
misses, it's too late to add them. but the non-space issue represents 10%
of my misses consistently over time.

> Here are a few of mine from that sort of analysis:
>
>   ghost dog => ghost dog, ghostdog
>   ghost hunters => ghost hunters, ghosthunters
>   ghost rider => ghost rider, ghostrider
>   ghost world => ghost world, ghostworld
>   ghostbusters => ghostbusters, ghost busters
>
> I don't see as many in personal names. Mostly, things like De Niro and
> DiCaprio.

hannahmontana? ;)

--Geoff
Re: token concat filter?
Otis Gospodnetic wrote:

> Geoff,
>
> Whether synonyms are applied at index time or query time is controlled
> via schema.xml - it depends on where you put the synonym factory,
> whether in the index-time or query-time section of a fieldType.
> Synonyms are read once on start, I believe. It might be good to have
> them read at index reader open time, as is done with the elevate
> component...

coolio, thanks.

--Geoff
Re: Got parseException when search keyword AND on a text field
Otis Gospodnetic wrote:

> Not in one place and documented. The places to look are the query
> parsers, but things like AND OR NOT TO are the ones to look out for.

this seems like something solr ought to handle gracefully on the backend
for me - if I need to write logic to make sure a malicious query for
"AND NOT" all by itself (in all caps) doesn't make solr implode, then so
does everyone else...

--Geoff
another spellchecker question
hi :)

I've noticed that (with solr 1.2) the returned order (as well as the
actual matched set) is affected by the number of matches you ask for:

  q=hanna&suggestionCount=1
    suggestions: [Yanna]

  q=hanna&suggestionCount=2
    suggestions: [Manna, Yanna]

  q=hanna&suggestionCount=5
    suggestions: [Manna, Nanna, Sanna, Vanna, Shanna]

note how the #1 result is completely missing from the top 5... or at
least that's how I _used_ to think about the sets :)

unfortunately, extendedResults seems to be a 1.3-only option, so I can't
see what's going on here. but I guess I'm asking if this is expected
behavior.

--Geoff
Re: another spellchecker question
Shalin Shekhar Mangar wrote:

> Hi Geoffrey,
>
> Yes, this is a caveat in the lucene contrib spellchecker which Solr
> uses. From the lucene spell checker javadocs:
>
>   * <p>As the Lucene similarity that is used to fetch the most relevant
>   * n-grammed terms is not the same as the edit distance strategy used
>   * to calculate the best matching spell-checked word from the hits
>   * that Lucene found, one usually has to retrieve a couple of
>   * numSug's in order to get the true best match.
>   *
>   * <p>I.e. if numSug == 1, don't count on that suggestion being the
>   * best one. Thus, you should set this value to <b>at least</b> 5 for
>   * a good suggestion.
>
> Therefore what you're seeing is by design. Probably we should change
> the default number of suggestions when querying the lucene spellchecker
> to 5, and give back the top result if the user asks for only one
> suggestion from solr.

great, thanks for all that - I'm still trying to figure out where all the
relevant docs live. you've been really helpful.

--Geoff
Re: config for very frequent solr updates
Otis Gospodnetic wrote:

> Geoff,
>
> There was just another thread where the person said he was doing
> updates every 2 minutes.

ok, I see that now. unfortunately, the data is sparse there :)

> Like you said, with the way Solr warms searchers, this could be too
> frequent for instances with large caches and high autowarmCount.

ok, thanks. I'll have a better sense of the size of my data soon, but I
suspect it's nowhere near the scale of most of the people here - maybe a
million documents, tops. right now I'm proof-of-concept'ing nearly all
our data (but in a single language) and it's 500K documents with an index
of 100M :)

> You may be better off playing with the combination of a larger, older
> index and a smaller index with updates kept in RAM (on the slave, of
> course).

good info, thanks. sorry for the basic questions. and thanks for the
(later) pointer to SOLR-303 - I found the distributed search docs from
there and will keep that in mind as I move forward.

--Geoff

> Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Geoffrey Young [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Sent: Thursday, April 17, 2008 8:28:09 AM
> Subject: config for very frequent solr updates
>
> hi all :) I didn't see any documentation on this, so I was wondering
> what the experience here was with updating solr with a small but
> constant trickle of daemon-style updates. unfortunately, it's a
> business requirement that backend db updates make it to search as the
> changes roll in (5 minutes is out of the question). with the way solr
> handles new searchers, warming, etc, I was wondering if anyone had
> experience with this kind of thing and could share thoughts/general
> config stuff on it. thanks. --Geoff
config for very frequent solr updates
hi all :)

I didn't see any documentation on this, so I was wondering what the
experience here was with updating solr with a small but constant trickle
of daemon-style updates. unfortunately, it's a business requirement that
backend db updates make it to search as the changes roll in (5 minutes is
out of the question).

with the way solr handles new searchers, warming, etc, I was wondering if
anyone had experience with this kind of thing and could share
thoughts/general config stuff on it.

thanks.

--Geoff
Re: schema help
Rachel McConnell wrote:

> Our Solr use consists of several rather different data types, some of
> which have one-to-many relationships with other types. We don't need to
> do any searching of quite the kind you describe, but I have an idea
> about it, depending on what you need to do with the book data. It is
> rather hacky, but maybe you can improve it.

coolio, thanks :)

[snip]

> If your 'authors' 'write' 'books' with great frequency, you'd need to
> update a lot...

yeah, unfortunately that's the case :) I was using the book analogy
because I figured it was simple to explain, not necessarily because I was
trying to be vague :)

> Another possibility is to do two searches, with this kind of structure,
> which sort of mimics an RDBMS:
>
> * everything in Solr has a field, "type" (book, author, library, etc).
>   these can be filtered on a search-by-search basis
> * books have a field, authorId, uniquely referencing the author
> * your first search will be restricted to just authors, from which you
>   will extract the IDs
> * your second search will be restricted to just books, whose authorId
>   field is exactly one of the IDs from the first search

I think this approach solves the mindset issues I was having - I didn't
want to be left with a schema like this:

  authorId
  bookId1
  bookId2
  ...

but since lucene allows for all kinds of slots to exist and be empty, it
seems I can simplify that to

  authorId
  bookId

and use multiple queries to satisfy the display needs. it's probably more
a duh! moment for the majority, but lucene is sufficiently different from
what I'm used to that it's taking me a bit of time :)

> As you have noticed, Lucene is not an RDBMS. Searching through all the
> text of all the books is more the use it was designed around; of course
> the analogy might not be THAT strong with your need!

I think the fulltext search capabilities will serve us well for some
aspects of our search needs. the stemming, language, and other filters
will definitely be a help to just about everything we do.

speaking of language, this is my last question for now... what's the
idiomatic way to represent multiple languages? left to my own devices I'd
probably do something like

  name_en-us
  name_es-us

anyway, thanks so much for your help.

--Geoff
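to make the two-search flow concrete (field names and ids are
illustrative, not from a real schema):

  # search 1: restricted to authors; collect the ids from the results
  q=type:author AND name:hemingway
    -> authorId 17, authorId 42

  # search 2: restricted to books referencing those ids
  q=type:book AND (authorId:17 OR authorId:42)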
Re: schema help
>> the trouble I'm having is one of dimension. an author has many, many
>> attributes (name, birthdate, biography in $language, etc). as does
>> each book (title in $language, summary in $language, genre, etc). as
>> does each library (name, address, directions in $language, etc). so an
>> author with N books doesn't seem to scale very well in the flat
>> representations I'm finding in all the lucene/solr docs and
>> examples... at least not in some way I can wrap my head around.
>
> OG: I'm not sure why the number of attributes worries you. Imagine it
> as a wide RDBMS table, if it helps. Indices with dozens of fields are
> not uncommon.

it's not necessarily the number of fields, it's the attribute1 ..
attributeN-style numbering that worries me. but I think it's all starting
to make sense now... if wanting to pull data in multiple queries was my
holdup.

> OG: You certainly can do that. I'm not sure I understand where the hard
> part is. You seem to know what attributes each entity has. Maybe you
> are confused by how to handle N different types of entities in a single
> index?

yes... or, more properly, how to relate them to each other. I understand
that the schema can hold tons of attributes that are unused in different
documents. my question seems to be how to organize my data such that I
can answer the question "how do I get a list of libraries with $book like
$pattern" - where does the de-normalization typically occur? if a
document fully represents a book by an author in a library, such that the
same book (with all its attributes) is in my index multiple times (once
for each library), how do I drill down to showing just the directions to
a specific library?

> OG: (I'm assuming a single index is what you currently have in mind)

using different indices is what my lucene+compass counterparts are doing.
I couldn't find an example of that in the solr docs (unless the answer is
running multiple, distinct instances at the same time).

>> eew :) seriously, though, that's what we have now - all rdbms driven.
>> if solr could only conceptually handle the initial lookup there
>> wouldn't be much point.
>
> OG: Well, there might or might not be, depending on how much data you
> have, how flexible and fast your RDBMS-powered (full-text?) search is,
> and so on. The Lucene/Solr for full-text search + RDBMS/BDB for display
> data is a common combination.

the decision has been made to use lucene to replace all rdbms
functionality for search *cough* :)

>> maybe I'm thinking about this all wrong (as is to be expected :), but
>> I just can't believe that nobody is using solr to represent data a bit
>> more complex than the examples out there.
>
> OG: Oh, lots of people are, it's just that the examples are simple, so
> people new to Solr, Lucene, etc. have an easier time learning. :)

thanks for your help here.

--Geoff
schema help
hi :)

I'm trying to work out a schema for our widgets. more than just coming up
with something, I'd like something idiomatic in solr terms. any help is
much appreciated. here's a similar problem space to what I'm working
with...

let's say we're talking books. books are written by authors and held in
libraries. a sister company is using lucene+compass, and they seem to
have completely different collections (or whatever the technical term
is :)

  authors
  books
  libraries

so that a search for authors hits only the authors dataset. all of the
solr examples I can find don't seem to address this kind of data
disparity. what is the standard and idiomatic approach for solr? for my
particular data I'd want to display something like this

  author
    book in library
    book in library

on the same result page, but using a completely flat, single schema
doesn't seem to scale very well.

collective wisdom most welcome :)

--Geoff
Re: schema help
Otis Gospodnetic wrote:

> Geoff,
>
> I'm not sure if I understood your problem correctly, but it sounds like
> you want your search to be restricted to authors, but then you want to
> list all of his/her books when displaying results.

that's about right. add that I may also want to search on libraries and
show all the books (and authors) stored there. in real life, it's not
books or authors, of course, but the parallels are close enough :) in
fact, the library example is a good one for me... or at least a network
of public libraries linked together.

> The easiest thing to do would be to create an index where each
> row/Document has the author name, the book title, etc. For each
> author-matching Document you'd pull his/her books out of the result
> set. Yes, this means the author name would be denormalized, in
> RDBMS-speak.

I think I can live with the denormalization - it seems lucene is flat and
very different conceptually from a database :)

the trouble I'm having is one of dimension. an author has many, many
attributes (name, birthdate, biography in $language, etc). as does each
book (title in $language, summary in $language, genre, etc). as does each
library (name, address, directions in $language, etc). so an author with
N books doesn't seem to scale very well in the flat representations I'm
finding in all the lucene/solr docs and examples... at least not in some
way I can wrap my head around.

part of what seemed really appealing about lucene in general was that you
could stuff all this (unindexed) information into a document and retrieve
it all based on some search criteria. but it's seeming very difficult for
me to wrap my head around the data I need to represent.

> Another option is not to index/store book titles, but rather have only
> an author index to search against. The book data (mapped to author
> identities) would then be pulled from an external source (e.g. RDBMS:
> select title from books where author_id in (1,2,3)) at search results
> display time.

eew :) seriously, though, that's what we have now - all rdbms driven. if
solr could only conceptually handle the initial lookup, there wouldn't be
much point.

maybe I'm thinking about this all wrong (as is to be expected :), but I
just can't believe that nobody is using solr to represent data a bit more
complex than the examples out there.

thanks for the feedback.

--Geoff

> Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Geoffrey Young [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Sent: Tuesday, March 11, 2008 12:17:32 PM
> Subject: schema help
>
> hi :) I'm trying to work out a schema for our widgets. more than just
> coming up with something, I'd like something idiomatic in solr terms.
> any help is much appreciated. here's a similar problem space to what
> I'm working with... let's say we're talking books. books are written by
> authors and held in libraries. a sister company is using lucene+compass,
> and they seem to have completely different collections (or whatever the
> technical term is :) authors / books / libraries, so that a search for
> authors hits only the authors dataset. all of the solr examples I can
> find don't seem to address this kind of data disparity. what is the
> standard and idiomatic approach for solr? for my particular data I'd
> want to display something like "author / book in library / book in
> library" on the same result page, but using a completely flat, single
> schema doesn't seem to scale very well. collective wisdom most
> welcome :) --Geoff
multiple things in a document
hi all :)

I'm just getting up to speed with solr (and lucene, for that matter) for
a new project. after reading through the available docs, I'm not finding
an answer to my most basic (newbie, certainly) question. please feel free
to just point me to the proper doc :)

this isn't my actual use case, but it's close enough for general
understanding... say I want to store data on a collection of SKUs, which
(for the unfamiliar :) are a combination of item + location. so we might
have

  sku
    id
    name
    item
    location

  item
    id
    name

  location
    id
    name

all of the schema.xml examples seem to deal with just a flat "thing",
perhaps with multiple entries of the same field. what I'm after is how to
represent this kind of relationship in the schema, such that I can limit
my result set to, say, a sku or item, but if I search on sku I can
discriminate between the sku name and the item name in my results.

from my reading on lucene this is pretty basic stuff, but I don't see how
the solr layer approaches this at all. again, doc pointers much
appreciated.

thanks for listening :)

--Geoff
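one common answer, sketched here under the denormalization approach
discussed in the "schema help" thread above (all field names are made
up): flatten each sku into a single document and prefix the field names,
so the sku name and the item name stay distinct at query time:

  <!-- schema.xml sketch -->
  <field name="sku_id"        type="string" indexed="true" stored="true"/>
  <field name="sku_name"      type="text"   indexed="true" stored="true"/>
  <field name="item_id"       type="string" indexed="true" stored="true"/>
  <field name="item_name"     type="text"   indexed="true" stored="true"/>
  <field name="location_id"   type="string" indexed="true" stored="true"/>
  <field name="location_name" type="text"   indexed="true" stored="true"/>

a query can then target sku_name:foo or item_name:foo explicitly, or a
catch-all copyField can feed both into a single default search field.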