Re: Newbie with Java + typo
Daniel:

As a fellow 'non-java' person I feel your pain (well, felt it anyway). A lot depends on your load and the machine, but I successfully ran the stock Jetty setup on a box last summer for work and didn't have performance problems. The bigger issue was the other Java people complaining that I hadn't used the standard JBoss setup they already had working. However, I didn't have access to that machine, nor would anyone give it to me at the time, so it was a catch-22.

Performance-wise, the stock Jetty will probably do just fine for you. Longer term, you may want to learn more about JBoss, Tomcat, or something else that gives you more application management options. But don't let those things stop you from running Jetty/Solr in production - it's worked fine for me.

On Jan 21, 2008 10:48 AM, Daniel Andersson [EMAIL PROTECTED] wrote:
> Hi people
>
> First the typo on http://wiki.apache.org/solr/mySolr:
>
> Production
> Typically it's not recommended do have your front end
>
> it should probably be "..recommended To have.."
>
> Second, I don't know much about Java, nor about Jetty/Resin/JBoss/Tomcat. I went through the tutorial and was impressed with how easy it all seemed. Until the tutorial ended..
>
> As a newbie, should I use Tomcat, JBoss, Resin, Jetty or the thing that comes with the example (Jetty, or?)? All the installation pages talk about this and that that doesn't make much sense to non-Java people like myself :-/
>
> Would be MUCH appreciated with some after-tutorial page for us newbies. Right now I'm just looking for something that can be used on a production level machine. It doesn't have to be the fastest, as long as it's fairly easy to install.
>
> Recommendations and pointers are very welcome :)
>
> Thanks in advance!
>
> / d

--
Michael Kimsal
http://webdevradio.com
Re: Leading WildCard in Query
Please vote for SOLR-218. I'm not aware of any other convenient way to get leading wildcard functionality. SOLR-218 is not asking that it be enabled by default, only that the functionality be exposed to SOLR admins via config.xml.

On Dec 12, 2007 6:29 AM, Eswar K [EMAIL PROTECTED] wrote:
> Hi All,
>
> I understand that a leading wildcard search is not allowed as it is a very costly operation. There is an issue logged for it (http://issues.apache.org/jira/browse/SOLR-218). Is there any other way of enabling leading wildcards apart from doing it in code by calling QueryParser.setAllowLeadingWildcard(true);?
>
> Regards,
> Eswar

--
Michael Kimsal
http://webdevradio.com
Re: can I do *thing* substring searches at all?
https://issues.apache.org/jira/browse/SOLR-218

Please vote for SOLR-218 and perhaps this setting will make it into the next version. It's explicitly shut off in SOLR, but available in Lucene.

On Dec 2, 2007 9:56 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
> Would n- -g gr ra am mi in ng that field work for you?
>
> foothingbar -> fo oo ot th hi in ng gb ba ar
> *thing* -> th hi in ng -> bingo, a match
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message -----
> From: Brian Whitman [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Sent: Thursday, November 29, 2007 11:51:37 PM
> Subject: can I do *thing* substring searches at all?
>
> With a fieldtype of string, can I do any sort of *thing* search? I can do thing* but not *thing or *thing*. Workarounds?

--
Michael Kimsal
http://webdevradio.com
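For anyone who wants to see the n-gram idea from Otis's reply spelled out, here is a rough sketch in plain Ruby. It only illustrates the matching logic; in Solr the splitting would be done by an n-gram analyzer at index time, and the helper names `ngrams` and `substring_match?` are made up for this example.

```ruby
# Split a string into overlapping character n-grams (bigrams by default).
def ngrams(text, n = 2)
  return [] if text.length < n
  (0..text.length - n).map { |i| text[i, n] }
end

# Treat the query's n-grams like a phrase: the value matches when they
# appear consecutively, in order, among the indexed value's n-grams.
def substring_match?(indexed_value, query_term, n = 2)
  wanted = ngrams(query_term, n)
  return false if wanted.empty?
  ngrams(indexed_value, n).each_cons(wanted.length).any? { |window| window == wanted }
end

puts ngrams("foothingbar").join(" ")
puts substring_match?("foothingbar", "thing")
```

The trade-off is a larger index (every value is stored as many small tokens) in exchange for not needing leading wildcards at query time.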
Re: leading wildcards
Vote for that issue and perhaps it'll gain some more traction. A former colleague of mine was the one who contributed the patch in SOLR 218 and it would be nice to have that configuration option 'standard' (if off by default) in the next SOLR release. On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote: Seems like there is no way to enable leading wildcard queries except code editing and files repacking. :( On 11/12/07, Bill Au [EMAIL PROTECTED] wrote: The related bug is still open: http://issues.apache.org/jira/browse/SOLR-218 Bill On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote: Hi I found the thread about enabling leading wildcards in Solr as additional option in config file. I've got nightly Solr build and I can't find any options connected with leading wildcards in config files. How I can enable leading wildcard queries in Solr? Thank you -- Best regards, Traut -- Best regards, Traut -- Michael Kimsal http://webdevradio.com
Re: Term extraction
Not sure if this is in the same league or not, but Yahoo offers a term extraction web service. http://developer.yahoo.com/search/content/V1/termExtraction.html On 9/20/07, Grant Ingersoll [EMAIL PROTECTED] wrote: You might investigate some tools like Alias-i's LingPipe or do some searches for phrase recognition software, etc. -Grant On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote: I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and values returned by the mlt.interestingTerms parameter and so far this approach has worked well. However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as open source, Microsoft Office, Bill Gates rather than splitting each word into separate tokens (the field is never used in search queries so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings. Has anybody else dealt with this problem before or able to offer any insights into achieve the desired results? Thanks in advance, Pieter -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ -- Michael Kimsal http://webdevradio.com
Re: Using Ruby to POST to Solr
The curl man page states:

"If you start the data with the letter @, the rest should be a file name to read the data from, or - if you want curl to read the data from stdin. The contents of the file must already be url-encoded. Multiple files can also be specified. Posting data from a file named 'foobar' would thus be done with --data @foobar."

On 9/11/07, Matt Mitchell [EMAIL PROTECTED] wrote:
> Hi, I just posted this to the ruby/google group. It probably belongs here! Also, anyone know exactly what the @ symbol in the curl command is doing? Thanks, Matt
>
> I've got a script that uses curl, and would like (for educational purposes mind you) to use ruby instead. This is the curl command that works:
>
> F=./my_data.xml
> curl 'http://localhost:8080/update' --data-binary @$F -H 'Content-type:text/xml; charset=utf-8'
>
> I've been messing with Net::Http using something like below, with variations (Base64.encode64) but nothing works yet. Anyone know the ruby equivalent to the curl version above? Thanks!
>
> # NOT WORKING:
> my_url = 'http://localhost:8080/update'
> data = File.read('my_data.xml')
> url = URI.parse(my_url)
> post = Net::HTTP::Post.new(url.path)
> post.body = data
> post.content_type = 'application/x-www-form-urlencoded; charset=utf-8'
> response = Net::HTTP.start(url.host, url.port) do |http|
>   http.request(post)
> end
> puts response.body

--
Michael Kimsal
http://webdevradio.com
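A possible Ruby equivalent, offered as a sketch rather than a tested recipe: the most visible difference between the working curl command and the Ruby attempt is the Content-Type header, so this version sends the file contents unmodified with `text/xml; charset=utf-8`. The helper names `build_update_request` and `post_update` are invented for this example; the URL and file name come from the question.

```ruby
require 'net/http'
require 'uri'

# Build a POST that mirrors
#   curl URL --data-binary @file -H 'Content-type:text/xml; charset=utf-8'
# The body is sent as-is, with no form encoding and no Base64.
def build_update_request(update_url, xml_body)
  uri = URI.parse(update_url)
  request = Net::HTTP::Post.new(uri.path)
  request.body = xml_body
  request['Content-Type'] = 'text/xml; charset=utf-8'
  request
end

# Send the request and return the Net::HTTPResponse.
def post_update(update_url, xml_body)
  uri = URI.parse(update_url)
  request = build_update_request(update_url, xml_body)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Usage (needs a running Solr instance):
#   response = post_update('http://localhost:8080/update', File.read('my_data.xml'))
#   puts response.body
```

Keeping the request construction separate from the network call makes it easy to inspect the headers and body before anything is sent.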
Indexing HTML
Hello

I'm trying to index individual lines of an HTML file, and I'm hitting this error:

TEXT must be immediately followed by END_TAG and not START_TAG

I've got something that looks like

<add>
<doc>
<field name="id">4</field>
<field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
</doc>
</add>

Actually, that sample code above, as its own data file POSTed to SOLR, throws

parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="line"><a href="foobar">... @4:37

as an error.

Any clues as to how I can do this? I'd like to keep the original copy of each line intact in the index.

Thanks!

--
Michael Kimsal
http://webdevradio.com
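One approach that should avoid this error, sketched here under the assumption that the parser is objecting to real child elements inside the field: XML-escape the HTML before placing it in the field, so the markup travels as text. The helpers `xml_escape` and `solr_add_xml` are hypothetical names for this example, not part of any Solr client.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that have special meaning in XML text and attributes.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build an <add> message from a hash of field name => value, escaping
# every value so embedded markup is not parsed as nested elements.
def solr_add_xml(fields)
  field_xml = fields.map do |name, value|
    %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
  end
  "<add><doc>#{field_xml.join}</doc></add>"
end

puts solr_add_xml('id' => 4, 'line' => '<a href="foobar"><b><i>linktext</i></b></a>')
```

An XML parser turns the entities back into the original characters, so the stored value should be the original HTML line.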
Re: I'm using PHP curl post xml command to Solr,Is it the only way to post data?
Using PHP5 (5.1 or higher I think), http://us.php.net/manual/en/function.http-post-fields.php is available. From the example on that page:

$fields = array(
    'name' => 'mike',
    'pass' => 'passwordt'
);
$response = http_post_fields("http://www.example.com/", $fields);

Looks pretty simple, but I haven't tried it yet.

On 6/25/07, Kijiji Xu, Ping [EMAIL PROTECTED] wrote:
> What about fsockopen, Or any other simple method? Thanks
>
> --
> Regards
> Xp from china

--
Michael Kimsal
http://webdevradio.com
Re: Date range problem
I've only been able to get date/time stuff to work when the entire full date/time format is used:

2007-05-30T12:34:56Z

Or is there a + in there too?

On 6/25/07, Stu Hood [EMAIL PROTECTED] wrote:
> Hello,
>
> Searching by date ranges doesn't seem to work in the example Solr install. A query like `timestamp:[20070101 TO 20080101]` returns:
>
> message: Invalid Date String:'20070101'
> description: The request sent by the client was syntactically incorrect (Invalid Date String:'20070101').
>
> That query should be valid according to http://lucene.apache.org/java/docs/queryparsersyntax.html#Range%20Searches
>
> Any ideas?
>
> Stu Hood
> Webmail.us
> You manage your business. We'll manage your email.(r)

--
Michael Kimsal
http://webdevradio.com
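A small sketch of building such a range query with full timestamps, assuming (as reported above) that the date field only accepts the complete UTC form ending in Z. `solr_date` and `solr_date_range` are illustrative names, and the field name is taken from the quoted query.

```ruby
# Format a Time as a full UTC timestamp such as 2007-05-30T12:34:56Z.
def solr_date(time)
  time.getutc.strftime('%Y-%m-%dT%H:%M:%SZ')
end

# Build an inclusive range clause for the given field.
def solr_date_range(field, from, to)
  "#{field}:[#{solr_date(from)} TO #{solr_date(to)}]"
end

puts solr_date_range('timestamp', Time.utc(2007, 1, 1), Time.utc(2008, 1, 1))
```

The resulting string still has to be URL-encoded when it is passed as the q parameter of an HTTP request.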
Benefit of schema
Is there any benefit to using a fixed schema as opposed to the 'wildcard' approach demonstrated in the sample schema.xml file? -- Michael Kimsal http://webdevradio.com
Re: Benefit of schema
I wasn't sure if I was perhaps missing some sort of optimization that may occur under the hood during querying. I sort of thought that what you just wrote may be the case. Thanks! On 6/23/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Jun 23, 2007, at 1:38 PM, Michael Kimsal wrote: Is there any benefit to using a fixed schema as opposed to the 'wildcard' approach demonstrated in the sample schema.xml file? It's nice to have known straightforward field names for querying, like: title:web development AND author:kimsal With wildcarded fields, you won't end up with such clean field names. Other than aesthetics and how the client application will interact with Solr, there really is no difference. Erik -- Michael Kimsal http://webdevradio.com
Re: Question to php to do with multi index
The curl_multi is probably the most effective way, using straight PHP. Another option would be to spawn several jobs, assuming unix/linux, and wait for them to get done. It doesn't give you very good error handling (well, none at all actually!) but would let you run multiple indexing jobs at once. Visit http://us.php.net/shell_exec and look at the 'class exec' contributed note about halfway down the page. It'll give you an idea of how to easily spawn multiple jobs. If you're using PHP5, the proc_open function may be another way to go. proc_open was available in 4, but there were a number of extra parameters and controls made available in 5. http://us.php.net/manual/en/function.proc-open.php An adventurous soul could combine the two concepts in to one class to manage pipes communication between multiple child processes effectively. On 4/26/07, James liu [EMAIL PROTECTED] wrote: php not support multi thread,,,and how can u solve with multi index in parallel? now i use curl_multi maybe more effect way i don't know,,,so if u know, tell me. thks. -- regards jl -- Michael Kimsal http://webdevradio.com
Re: case sensitivity
Can you point me to the process for submitting these small patches? I'm looking at the jira site but don't see much of anything there outlining a process for submitting patches. Sorry to be so basic about this, but I'm trying to follow correct procedures on both sides of the aisle, so to speak. On 4/27/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote: We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items to be in the solrconf.xml file and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all). That's fine, as long as your company has agreed to contribute back the patch (under the Apache license). Apache enjoys a lot of business support (being business friendly) and a *lot* of contributions is done on company time. Anything really big would probably need a CLA, but patches only require clicking the grant license to ASF button in JIRA. -Yonik -- Michael Kimsal http://webdevradio.com
case sensitivity
I've looked through the mailing lists and can't find much of anything regarding case sensitivity. It seems SOLR is case sensitive by default - I'm using the default settings with a very basic schema - just text fields. Is there any way to tell the query parser to be case insensitive during a query? Or do I have to reindex all my data again with lowercase values? -- Michael Kimsal http://webdevradio.com
Re: case sensitivity
I was just writing a followup. I'm using the default text field type

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

That looks to me like it's got LowerCaseFilterFactory in both the query analyzer and the index analyzer. I'm still digging in to this, but are there any other things to look for that anyone can point me to? (Thanks Erik!)

On 4/26/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Apr 26, 2007, at 5:43 PM, Michael Kimsal wrote:
> > I've looked through the mailing lists and can't find much of anything regarding case sensitivity. It seems SOLR is case sensitive by default - I'm using the default settings with a very basic schema - just text fields.
>
> All depends on the analysis you have set up for the fields. If you're indexing string-type fields in the default example schema, there is effectively no analysis, so searches must be exact matches, case and all.
>
> > Is there any way to tell the query parser to be case insensitive during a query? Or do I have to reindex all my data again with lowercase values?
>
> Terms are indexed in a case-sensitive manner, so if you need case insensitivity you need to lowercase on the way in and on querying.
>
> Erik

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
type:changelog AND ( ( (listing:Fox) or (listing:Fox*) or (listing:*Fox) ) )

and

type:changelog AND ( ( (listing:fox) or (listing:fox*) or (listing:*fox) ) )

Is this to do with the wildcards? Actually, I've just answered my own question.

type:changelog AND ( ( (listing:fox) ) )

and

type:changelog AND ( ( (listing:Fox) ) )

give the same results. But adding in the or listing:fox* or listing:*fox is always case-sensitive. However, http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a seems to say that wildcard searches are not case-sensitive.

Unless someone can point out a way around this, it seems I'll need to manually reindex and lower-case everything on the way in, then reformat my search queries to be lower-case as well.

--
Michael Kimsal
http://webdevradio.com
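If the field is already lowercased at index time, one client-side workaround is to lowercase only the wildcard terms before sending the query. This is a rough sketch, not something tested against Solr; the function name `lowercase_wildcard_terms` is made up, and the regular expression only handles simple field:term clauses like the ones in this thread.

```ruby
# Lowercase only the terms that contain a wildcard (* or ?), leaving field
# names, boolean operators, and non-wildcard terms untouched.
def lowercase_wildcard_terms(query)
  query.gsub(/(\w+):([^\s()]*[*?][^\s()]*)/) { "#{$1}:#{$2.downcase}" }
end

puts lowercase_wildcard_terms('type:changelog AND (listing:Fox OR listing:Fox* OR listing:*Fox)')
```

Non-wildcard terms are left alone because they already pass through the query analyzer. The example also writes OR in upper case, since Lucene's query syntax only recognizes boolean operators in capitals.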
Re: case sensitivity
My colleague, after some digging, found in SolrQueryParser (around line 62)

setLowercaseExpandedTerms(false);

The default for Lucene is true. Was this intentional? Or an oversight? Perhaps it's not related to my problem, but it seems that it might be.

Thanks in advance!

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items so they can be set in the solrconfig.xml file, and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all).

On 4/26/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Apr 26, 2007, at 6:03 PM, Michael Kimsal wrote:
> > My colleague, after some digging, found in SolrQueryParser (around line 62) setLowercaseExpandedTerms(false); The default for Lucene is true. Was this intentional? Or an oversight?
>
> I was just about to respond that this is likely the issue with your non-totally-lowercased wildcard terms.
>
> I don't consider it an oversight, but rather this whole analysis business and wildcards are things that vary from project to project on how they should be handled. If you have, for example, a string field and want to do prefix queries on it (trailing asterisk), you wouldn't want the term to be lowercased.
>
> I think we should open up as many of the switches as we can to QueryParser, allowing users to tinker with them if they want, setting the defaults to the most common reasonable settings we can agree upon.
>
> Erik

--
Michael Kimsal
http://webdevradio.com
expressing this logic
Hello all:

I'm trying to find a record in my index where the 'type' is changelog and the 'filename' has 'angel' in it. Expressing this as

type:changelog filename:+angel or filename:+angel* or filename:+*angel

throws a parse error (probably understandably).

type:changelog (filename:+angel or filename:+angel* or filename:+*angel)

doesn't seem to work either. I've tried this a number of ways and I either get a parse error or *everything* is returned - I only want records where the type is 'changelog' and the filename has 'angel' in it. How would this be expressed?

--
Michael Kimsal
http://webdevradio.com
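A hedged sketch of how this could be written: Lucene's query syntax treats AND/OR as operators only in upper case, and a + belongs in front of a whole clause rather than after the field's colon. The helper `required_with_any` is a made-up name, and the *angel form is left out because stock Solr rejects leading wildcards.

```ruby
# Build "required_field must equal required_value AND field must match
# at least one of the given patterns".
def required_with_any(required_field, required_value, field, patterns)
  alternatives = patterns.map { |pattern| "#{field}:#{pattern}" }.join(' OR ')
  "#{required_field}:#{required_value} AND (#{alternatives})"
end

puts required_with_any('type', 'changelog', 'filename', ['angel', 'angel*'])
```

Grouping the filename alternatives in parentheses keeps the type restriction applied to every one of them.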
Re: AW: Leading wildcards
Maarten:

Would you mind sharing your custom query parser?

On 4/20/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> thanks, this worked like a charm !!
>
> we built a custom QueryParser and we integrated the *foo** in it, so basically we can now search leading, trailing and both ...
>
> only crappy thing is the max Boolean clauses, but i'm going to look into that after the weekend
>
> for the next release of Solr : do not make this default, too many risks, but do make an option in the config to enable it, it's a very nice feature
>
> thanks everybody for the help and have a nice weekend,
> maarten
>
> Burkamp, Christian [EMAIL PROTECTED]
> 19/04/2007 12:37
> Please respond to solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org
> cc:
> Subject: AW: Leading wildcards
>
> > Hi there,
> >
> > Solr does not support leading wildcards, because it uses Lucene's standard QueryParser class without changing the defaults. You can easily change this by inserting the line
> >
> > parser.setAllowLeadingWildcard(true);
> >
> > in QueryParsing.java line 92. (This is after creating a QueryParser instance in QueryParsing.parseQuery(...)) and it obviously means that you have to change solr's source code. It would be nice to have an option in the schema to switch leading wildcards on or off per field. Leading wildcards really make no sense on richly populated fields because queries tend to result in "too many clauses" exceptions most of the time.
> >
> > This works for leading wildcards. Unfortunately it does not enable searches with leading AND trailing wildcards. (E.g. searching for *lega* does not find results even if the term elegance is in the index. If you put a second asterisk at the end, the term elegance is found; search for *lega** to get hits.) Can anybody explain this, though it seems to be more of a lucene QueryParser issue?
> >
> > -- Christian
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > Sent: Thursday, April 19, 2007 08:35
> > To: solr-user@lucene.apache.org
> > Subject: Leading wildcards
> >
> > hi,
> >
> > we have been trying to get the leading wildcards to work. we have been looking around the Solr website, the Lucene website, wiki's and the mailing lists etc ... but we found a lot of contradictory information. so we have a few questions :
> > - is the latest version of lucene capable of handling leading wildcards ?
> > - is the latest version of solr capable of handling leading wildcards ?
> > - do we need to make adjustments to the solr source code ?
> > - if we need to adjust the solr source, what do we need to change ?
> >
> > thanks in advance !
> > Maarten

--
Michael Kimsal
http://webdevradio.com
Re: Leading wildcards
I've investigated this recently, and it looks like the latest Lucene dev version supposedly supports leading and trailing wildcards at the same time. However, I couldn't get the latest dev Solr to build with the latest dev Lucene (as of two weeks ago).

A thread on a Lucene mailing list seemed to indicate that Lucene, as of the last official build, supports both leading and trailing wildcards at the same time, but it then seemed to indicate that this was still in an 'in development branch only' state. I can't find that thread, but that's my understanding of the current situation.

It's bugged us a little bit, because it's something that we need (to be able to emulate the previous foo LIKE '%bar%' SQL behaviour we're replacing), but can't offer our users yet.

On 4/19/07, Burkamp, Christian [EMAIL PROTECTED] wrote:
> Hi there,
>
> Solr does not support leading wildcards, because it uses Lucene's standard QueryParser class without changing the defaults. You can easily change this by inserting the line
>
> parser.setAllowLeadingWildcard(true);
>
> in QueryParsing.java line 92. (This is after creating a QueryParser instance in QueryParsing.parseQuery(...)) and it obviously means that you have to change solr's source code. It would be nice to have an option in the schema to switch leading wildcards on or off per field. Leading wildcards really make no sense on richly populated fields because queries tend to result in "too many clauses" exceptions most of the time.
>
> This works for leading wildcards. Unfortunately it does not enable searches with leading AND trailing wildcards. (E.g. searching for *lega* does not find results even if the term elegance is in the index. If you put a second asterisk at the end, the term elegance is found; search for *lega** to get hits.) Can anybody explain this, though it seems to be more of a lucene QueryParser issue?
>
> -- Christian
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> Sent: Thursday, April 19, 2007 08:35
> To: solr-user@lucene.apache.org
> Subject: Leading wildcards
>
> hi,
>
> we have been trying to get the leading wildcards to work. we have been looking around the Solr website, the Lucene website, wiki's and the mailing lists etc ... but we found a lot of contradictory information. so we have a few questions :
> - is the latest version of lucene capable of handling leading wildcards ?
> - is the latest version of solr capable of handling leading wildcards ?
> - do we need to make adjustments to the solr source code ?
> - if we need to adjust the solr source, what do we need to change ?
>
> thanks in advance !
> Maarten

--
Michael Kimsal
http://webdevradio.com
Re: Leading wildcards
Agreed, but in our tests (100M index) it wasn't a performance hit, and much better (as in it actually worked) than MSSQL ;)

On 4/19/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Apr 19, 2007, at 6:56 AM, Michael Kimsal wrote:
> > It's bugged us a little bit, because it's something that we need (to be able to emulate the previous foo LIKE '%bar%' SQL behaviour we're replacing), but can't offer our users yet.
>
> I have also run into this issue and have intended to fix up Solr to allow configuring that switch on QueryParser. I'll eventually get to this, but someone supplying a patch with a test case would get it done sooner.
>
> I must, however, caveat discussion of leading wildcards with the underlying effect you get. If you use standard analysis and perform a leading wildcard query, you incur a (possibly) dramatic hit in terms of performance. Lucene has to scan *every* term in the specified field. In fact, with my 3.7M index, a fuzzy query, for the very same reason, kills the query. There is also a switch on fuzzy query that needs to be configurable through Solr, to adjust the number of leading characters that are fixed to avoid this all-term scanning.
>
> There are techniques that can be used to improve the performance of in-string types of queries like this, at the expense of indexing time and size and clever query creation. One such technique I've used successfully is term rotation enumeration (cat => cat$, at$c, t$ca). This involves custom analyzers and query creation.
>
> Once Solr supports this switch, you may find performance fine with leading wildcard queries, but at least be forewarned that there are scalability skeletons in this closet.
>
> Erik

--
Michael Kimsal
http://webdevradio.com
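To make the term rotation idea a little more concrete, here is one possible reading of it in plain Ruby. It is only a guess at the technique Erik describes, not his implementation: each term is indexed under all of its rotations, and an in-string search then becomes a prefix search over those rotations. The names `rotations` and `infix_match?` are invented for this sketch.

```ruby
# Enumerate the rotations of a term with an end marker appended:
# "cat" gives "cat$", "at$c", "t$ca".
def rotations(term, marker = '$')
  marked = term + marker
  (0...term.length).map { |i| marked[i..-1] + marked[0, i] }
end

# An in-string search turns into a prefix test: the fragment occurs in the
# term exactly when some rotation starts with it.
def infix_match?(indexed_term, fragment)
  rotations(indexed_term).any? { |rotation| rotation.start_with?(fragment) }
end

puts rotations('cat').join(', ')
puts infix_match?('foothingbar', 'thing')
```

The cost is the one mentioned in the quoted message: every term is stored once per character, so the index grows and indexing takes longer.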
Re: Leading wildcards
It still seems like it's only something that would be invoked by a user's query. If I queried for *foobar and leading wildcards were not on in the server, I'd get back nothing, which isn't really correct. I'd think the application should tell the user that that syntax isn't supported.

Perhaps I'm simplifying it a bit. It would certainly help out our comfort level to have it either be on or configurable by default, rather than having to maintain a 'patched' version (yes, the patch is only one line, but it's the principle of the thing). I suspect this would be the same for others.

On 4/19/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Apr 19, 2007, at 10:39 AM, Yonik Seeley wrote:
> > On 4/19/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> > > > parser.setAllowLeadingWildcard(true);
> > >
> > > I have also run into this issue and have intended to fix up Solr to allow configuring that switch on QueryParser.
> >
> > Any reason that parser.setAllowLeadingWildcard(true) shouldn't be the default?
>
> That's fine by me. But...
>
> > Does it really need to be configurable?
>
> It all depends on how bad of a hit it'd take on Solr. What's the breaking point where the performance of full-term scanning (and subsequently faceting, of course) keels over or dies? FuzzyQuery's die on my 3.7M index and not-super-beefy hardware and system setup.
>
> Erik

--
Michael Kimsal
http://webdevradio.com
Re: Leading wildcards
I'm in the middle of looking into that. For *you* ;) it may seem like a quick thing to do. For me, who's not an expert at this stuff, it's a balance between delving in deeply enough to figure out how to do it and hitting our deadlines. It's actually on someone else's plate here, but he's backed up with two other projects first. It's not that I don't *want* to contribute, but I hardly have enough time to get the basics done some days.

On 4/19/07, Erik Hatcher [EMAIL PROTECTED] wrote:

On Apr 19, 2007, at 11:04 AM, Michael Kimsal wrote:

Perhaps I'm simplifying it a bit. It would certainly help out our comfort level to have it either on by default or configurable, rather than having to maintain a 'patched' version (yes, the patch is only one line, but it's the principle of the thing). I suspect this would be the same for others.

And here's where your effort could go the extra mile to help _yourself_ out as well as the community... instead of the one-line change, make it a few more lines and make it a switch from the configuration (like the toggle for AND/OR default operator) and even better round it out with a test case. Submit it, lobby for it to be reviewed and applied, and step 3... profit! :)

Erik

-- Michael Kimsal http://webdevradio.com
Re: Solr logo poll
My wife votes for A. :)

On 4/9/07, Nitin Borwankar [EMAIL PROTECTED] wrote:

B

Yonik Seeley wrote:

Quick poll... Solr 1.2 release planning is underway, and a new logo may be a part of that. What form of logo do you prefer, A or B? There may be further tweaks to these pictures, but I'd like to get a sense of what the user community likes.

A) http://issues.apache.org/jira/secure/attachment/12349897/logo-solr-d.jpg
B) http://issues.apache.org/jira/secure/attachment/12353535/12353535_solr-nick.gif

Just respond to this thread with your preference.

-Yonik

--
Nitin Borwankar
http://walruscarpenter.wordpress.com Of shoes and ships and sealing wax of cabbages and kings
http://greener.com Find, Learn, Act Greener, the search engine for the planet
http://tagschema.com Implementation of tag database applications
[EMAIL PROTECTED] 510-872-7066

-- Michael Kimsal http://webdevradio.com
Re: SOLR hosting
Thanks. Perhaps I should have clarified a bit. I was looking more for the first option. And part of what I was asking for was to gauge some interest. If there are no companies offering that, is there any demand for a service like that?

On 3/23/07, Tim Archambault [EMAIL PROTECTED] wrote:

Is your question inherently asking if someone out there provides a service that manages the indexes, etc. for you and pre-installs and configures the software? If NOT, I can tell you that I bought a Linux VPS at Hostmysite.com cheaply and dedicated 1 virtual domain to my SOLR instance and it worked fairly easily. I'm no tech expert and got it to run. Hope that helps.

Tim

On 3/21/07, Michael Kimsal [EMAIL PROTECTED] wrote:

Are there any companies that offer hosted SOLR services? If not, is there any interest in the community in a service like this?

-- Michael Kimsal http://webdevradio.com
Wildcards
Hello all: While I realize this goes against the grain of an indexing server, is there any way to do wildcard searching like the following: Term indexed is 123456789 Searching for *456* would find 123456789 Is there any mechanism to enable or allow for that scenario? Thanks! -- Michael Kimsal http://webdevradio.com
Re: Wildcards
This looks like a lucene issue.

http://www.nabble.com/-jira--Created%3A-%28LUCENE-839%29-WildcardQuery-do-not-find-documents-if-leading-and-trailing-*-is-used-tf3435336.html

And it appears to have been fixed recently: "This problem was already fixed since 2.1.0."

When was 2.1.0 out? Oh - last month. Will there be new SOLR package bundles with the latest lucene?

On 3/21/07, Michael Kimsal [EMAIL PROTECTED] wrote:

I changed the 'leading wildcard' setting in the query parser (well, actually someone else here did, but it works). *789 works but *456* still doesn't work. Yeah, I guess I'm seeing the same behaviour as you are. Does this seem like a potential bug? Like the first * is cancelling out the logic for the second * ?

On 3/21/07, Erik Hatcher [EMAIL PROTECTED] wrote:

Lucene now supports *456* type queries, however it requires setting an attribute to allow leading wildcards on the QueryParser. Solr does not set this flag (that I can tell in my quick search) so I don't believe you can do this with Solr currently, until/unless an option is made to set that flag. However, I just tried with my dataset and I don't get parse errors from a *foo* query, but I don't get results either (strange, it seems).

Erik

On Mar 21, 2007, at 2:59 PM, Michael Kimsal wrote:

Hello all: While I realize this goes against the grain of an indexing server, is there any way to do wildcard searching like the following:

Term indexed is 123456789
Searching for *456* would find 123456789

Is there any mechanism to enable or allow for that scenario? Thanks!

-- Michael Kimsal http://webdevradio.com
Re: Wildcards
Well, I recompiled SOLR against the latest lucene release (2.1.0) and it still doesn't work. The nabble reference page there indicates that it might not have worked right in 2.1.0 but someone there is suggesting that it works in the latest trunk. Is there perhaps something else that would need to be enabled in SOLR beyond the leadingWildCard to have this work? Thanks for everyone's patience.

On 3/21/07, Michael Kimsal [EMAIL PROTECTED] wrote:

This looks like a lucene issue.

http://www.nabble.com/-jira--Created%3A-%28LUCENE-839%29-WildcardQuery-do-not-find-documents-if-leading-and-trailing-*-is-used-tf3435336.html

And it appears to have been fixed recently: "This problem was already fixed since 2.1.0."

When was 2.1.0 out? Oh - last month. Will there be new SOLR package bundles with the latest lucene?

On 3/21/07, Michael Kimsal [EMAIL PROTECTED] wrote:

I changed the 'leading wildcard' setting in the query parser (well, actually someone else here did, but it works). *789 works but *456* still doesn't work. Yeah, I guess I'm seeing the same behaviour as you are. Does this seem like a potential bug? Like the first * is cancelling out the logic for the second * ?

On 3/21/07, Erik Hatcher [EMAIL PROTECTED] wrote:

Lucene now supports *456* type queries, however it requires setting an attribute to allow leading wildcards on the QueryParser. Solr does not set this flag (that I can tell in my quick search) so I don't believe you can do this with Solr currently, until/unless an option is made to set that flag. However, I just tried with my dataset and I don't get parse errors from a *foo* query, but I don't get results either (strange, it seems).

Erik

On Mar 21, 2007, at 2:59 PM, Michael Kimsal wrote:

Hello all: While I realize this goes against the grain of an indexing server, is there any way to do wildcard searching like the following:

Term indexed is 123456789
Searching for *456* would find 123456789

Is there any mechanism to enable or allow for that scenario? Thanks!

-- Michael Kimsal http://webdevradio.com
Re: Spelling
Any opinions on commenting out the stemmer in the default text field? It might be less confusing to have a more intuitive example, while easily showing the way to the more advanced analysis.

I'm in favor of that. I imagine there are others like me who want to get started with the defaults first, and having them be more useful for 'average' use cases would be helpful, with comments on how to do advanced stuff left in. Thanks!

-- Michael Kimsal http://webdevradio.com
Re: Spelling
This isn't something I use that approach on. Let me explain. I work in a call center, and I'm doing a search for specific keywords in customer notes every night. For example, we might need a report of which customers called up about apple, banana or pear. I have a script which generates a report for the required keywords, and the report is mailed to the appropriate staff for review/action. The highlighting comes in to help them quickly locate the problem words. But not being able to highlight the misspellings (bannana, peaar, etc.) means that they may overlook the particular entries when reviewing the email.

When you say rewrite the query, what specifically do you mean? I'm googling (direct and on the solr site) for query.rewrite, but nothing is jumping out at me as anything that's useful/pertinent. It sounds like you're telling me to do some manipulation on the query first, but I'm currently just passing queries as part of the GET string in an HTTP request (this was my main attraction to SOLR in the first place). Is there a way to trigger the 'rewrite' functionality via another GET parameter? Thanks all!

On 2/6/07, karl wettin [EMAIL PROTECTED] wrote:

On 6 Feb 2007, at 04:19, Michael Kimsal wrote:

Thanks Erik. That worked, then threw me for another loop, which I sort of have fixed I think. I'm using the highlighter functionality, but it doesn't seem to highlight the 'matched' word if it's a partial match, although it does in fact return that record. Am I missing something obvious here, or is highlighting of partial matches not supported?

You need to rewrite the query. See Query.rewrite. (I think that's it.) But fuzzy queries are sort of slow, at least compared to many other things. Depending on your server load and corpus size, I would perhaps recommend using some sort of "did you mean" functionality rather than fuzzy queries.

-- karl

-- Michael Kimsal http://webdevradio.com
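As a rough illustration of the "did you mean" direction Karl suggests, the sketch below finds words in a note that closely resemble a keyword, so a reporting script could highlight them itself. This is a client-side workaround using Python's standard difflib module, not Solr's highlighter and not Query.rewrite; the function name and the 0.8 cutoff are assumptions picked for the example.

```python
import difflib
import re


def near_matches(text, keywords, cutoff=0.8):
    """Map each word in text that closely resembles a keyword to that keyword."""
    found = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        # get_close_matches returns the best candidates scoring >= cutoff.
        close = difflib.get_close_matches(word, keywords, n=1, cutoff=cutoff)
        if close:
            found[word] = close[0]
    return found


notes = "Customer called about a bannana and a peaar."
print(near_matches(notes, ["apple", "banana", "pear"]))
```

A lower cutoff catches more misspellings but also more false positives, so it would need tuning against real notes.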
Re: Date ranges
Thanks Hoss - I'll give that a try. Intuitively that sounds like it'll work (I'm still new to this - it's not second nature to me just yet!)

On 2/3/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: However, when I run the following search
: foobar date:[2005-08-01T00:00:00Z TO 2005-08-01T23:59:59Z]
: I get values back that do not have a date value in the 08/01/2005 range.

Unless you changed something else to make queries default to all clauses mandatory (aka: an AND query), that's searching for anything matching foobar, or anything in that date range. Try this...

+foobar +date:[2005-08-01T00:00:00Z TO 2005-08-01T23:59:59Z]

: Does anyone have any clues/pointers to help me debug this?

Adding debugQuery=1 to any URL will help you see exactly what query is being used, and show you an explanation of why each document matched.

: Thanks!

-Hoss

-- Michael Kimsal http://webdevradio.com
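Since the queries here are passed in the GET string, Hoss's suggestion can be sketched as a small URL builder. The helper name and the base URL in the usage lines are assumptions for illustration (adjust to wherever your Solr select handler lives); the query syntax and the debugQuery parameter come from the message above. Note that the + operators must be URL-encoded, otherwise they arrive at the server as spaces.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, terms, date_field, start, end, debug=False):
    """Build a select URL in which every clause is required."""
    # Prefix each clause with + so it is mandatory rather than optional.
    clauses = ["+" + t for t in terms]
    clauses.append("+%s:[%s TO %s]" % (date_field, start, end))
    params = {"q": " ".join(clauses)}
    if debug:
        # Ask Solr to explain the parsed query and why each document matched.
        params["debugQuery"] = "1"
    return base_url + "?" + urlencode(params)


# Hypothetical base URL; replace with your own host and path.
print(solr_select_url(
    "http://localhost:8983/solr/select",
    ["foobar"], "date",
    "2005-08-01T00:00:00Z", "2005-08-01T23:59:59Z",
    debug=True,
))
```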
Date ranges
I'm having a devil of a time getting date searching to work properly. I've created a 'date' field in my schema, and I put values like 2005-08-01T23:59:59Z in it. However, when I run the following search

foobar date:[2005-08-01T00:00:00Z TO 2005-08-01T23:59:59Z]

I get values back that do not have a date value in the 08/01/2005 range. Does anyone have any clues/pointers to help me debug this? Thanks!
possible FAQ - lucene interop
Hello all: We've got one java-based project at work using lucene. I'm looking to use solr as a search system for some other projects at work. Once data is indexed in solr, can we get at it using standard lucene libraries? I know how I want to use solr, but if the java devs need to get at the data as well, I'd rather that 1) they be able to use their existing tech and skills and 2) I not have to reindex everything in lucene-only indexes. I've read the FAQs and some of the mailing list and couldn't find this question addressed. Thanks. -- Michael Kimsal http://webdevradio.com