Re: Making stemming dynamic at query time
On Dec 18, 2007 9:41 PM, Kamran Shadkhast [EMAIL PROTECTED] wrote: ...it would be great if we could dynamically control this during search if we want to search with stemming or not... The easiest is probably to have two copies of your field, using copyField, one stemmed and one not, and search in one or the other. -Bertrand
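A minimal sketch of how a client could choose between the two copies at query time. The field names text_stemmed and text_exact and the base URL are assumptions for illustration; the thread does not name them.

```python
from urllib.parse import urlencode

# Hypothetical field names: both are assumed to be filled from the same
# source field through copyField in schema.xml.
STEMMED_FIELD = "text_stemmed"
EXACT_FIELD = "text_exact"


def build_select_url(base_url, term, stemming=True):
    """Return a select URL that searches the stemmed or the unstemmed copy."""
    field = STEMMED_FIELD if stemming else EXACT_FIELD
    return base_url + "/select?" + urlencode({"q": f"{field}:{term}"})


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed local example instance
    print(build_select_url(base, "running", stemming=True))
    print(build_select_url(base, "running", stemming=False))
```

The index grows because the text is stored twice, but the choice between stemmed and unstemmed search becomes a plain query parameter on the client side.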
Re: Which terms in the query match
On 10/16/07, Nishant Soni [EMAIL PROTECTED] wrote: ...So is there a way to query solr about which of the tokens in the query actually matched ?... The analyzer admin page should help, see http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9 -Bertrand
Re: Strange behavior when searching with accents
On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote: ..when we search for matthé or for matthe, we get two totally different results The analyzer admin tool should help you find out what's happening, see http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9 -Bertrand
Re: Strange behavior when searching with accents
On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote: ...Thank you very much. Moving the <filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it... Yes, the problem was the EnglishPorterFilterFactory before the accents removal: the stemmer doesn't know about accents, so no stemming occurred on matthé whereas matthe was stemmed to matth. BTW, your rené example makes me think you're indexing French, if that's the case you might want to use a stemmer configured for that language, for example <filter class="solr.SnowballPorterFilterFactory" language="French"/> -Bertrand
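To illustrate why the order of the filters matters, here is a small sketch that simulates an analyzer chain. The stemmer is a deliberately naive stand-in (it only drops a trailing ASCII "e"), not the Porter algorithm, and the accent folding uses Unicode decomposition rather than the actual Solr filter.

```python
import unicodedata


def fold_accents(token):
    """Strip combining marks, roughly what an accent-removal filter does."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(token):
    """Toy stand-in for a stemmer: drops one trailing ASCII 'e' only."""
    return token[:-1] if token.endswith("e") else token


def analyze(token, filters):
    """Apply the filters in order, like an analyzer chain."""
    for f in filters:
        token = f(token)
    return token


if __name__ == "__main__":
    for chain_name, chain in [
        ("stem, then fold", [toy_stem, fold_accents]),
        ("fold, then stem", [fold_accents, toy_stem]),
    ]:
        results = [analyze(t, chain) for t in ("matth\u00e9", "matthe")]
        print(chain_name, results)
```

With the stemmer first, the accented and unaccented forms end up as different terms; with the accent filter first, both reach the stemmer in the same form and produce the same term.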
Re: Strange behavior when searching with accents
On 9/20/07, Thorsten Scherler [EMAIL PROTECTED] wrote: ...Bertrand, does the French Snowball work fine?... I've seen some weirdnesses, like tennis and tenir (means to hold) both stemmed to ten, but in all of our (simple) tests it was ok. The application where we're using it does not require high precision though, so it looked good enough and we didn't create very extensive tests for it. -Bertrand
Re: SOLR developer
On 8/31/07, Tim Archambault [EMAIL PROTECTED] wrote: ...I'm thinking of sending a similar list-serv item out, but I noticed this is a solr-user list, not necessarily a developers list so I thought I'd ask Note that there's also [EMAIL PROTECTED] for such purposes, see http://www.apachenews.org/archives/000465.html But AFAIK, project-related job offers are ok on ASF lists, preferably with a [JOB] marker in the subject line. -Bertrand (*not* available for consulting ATM, and currently inactive on Solr anyway)
Re: solr question
On 7/21/07, Alessandro Ferrucci [EMAIL PROTECTED] wrote: ... the user could enter the following combinations of words: ... WORD WORD ...where the second instance is either last-name first-name OR first-name last-name. ... The dismax handler can indeed search terms in several fields, but I'd also suggest, as an alternative, copying all names to an additional allnames field at indexing time. This is done using copyField in your schema.xml, see http://wiki.apache.org/solr/SchemaXml and the Solr example schema.xml. You can then search in this allnames field when you don't know if terms belong to the first or last names, and also easily combine this with other searches, boost it, etc. -Bertrand
Re: LIUS/Fulltext indexing
On 6/12/07, Yonik Seeley [EMAIL PROTECTED] wrote: ... I think Tika will be the way forward (some of the code for Tika is coming from LIUS)... Work has indeed started to incorporate the Lius code into Tika, see https://issues.apache.org/jira/browse/TIKA-7 and http://incubator.apache.org/projects/tika.html -Bertrand
Re: LIUS/Fulltext indexing
On 6/12/07, Vish D. [EMAIL PROTECTED] wrote: ...Sounds interesting. I can't seem to find any clear dates on the project website. Do you know? ...V1 shipping date?... Not at the moment, Tika just entered incubation and it's impossible to predict what will happen. But help is welcome, of course ;-) -Bertrand
Re: how to crawl when Solr is search engine?
On 6/7/07, Ian Holsman [EMAIL PROTECTED] wrote: ...it's called XSLT. most modern browsers can do the transform on the client side. otherwise there are some server side tools (cocoon I think does this) to do the transform on the server before sending it out... Solr also does server-side XSLT, see http://wiki.apache.org/solr/XsltResponseWriter -Bertrand
Re: Solr in Windows
On 4/26/07, guruprasad [EMAIL PROTECTED] wrote: ...Is it only for Linux or can I install Solr on my Windows Desktop too?... Solr itself should run fine on any JVM 1.5, including on Windows (and several Solr developers are working on Windows IIUC). Some of our docs refer to auxiliary scripts that do not run under plain Windows. The SimplePostTool described in http://lucene.apache.org/solr/tutorial.html helps, it's not released yet but you can get it from https://issues.apache.org/jira/browse/SOLR-194 -Bertrand
Re: Re[2]: Things are not quite stable...
On 4/25/07, Jack L [EMAIL PROTECTED] wrote: ...Maybe it's time to think about upgrading Jetty... It's in the pipeline, see https://issues.apache.org/jira/browse/SOLR-128 -Bertrand
Re: Re[6]: Things are not quite stable...
On 4/25/07, Jack L [EMAIL PROTECTED] wrote: ...Regardless, I think it's a good idea to use a newer, released (not RC) version in general, considering 5.1 is one major version behind Agreed, but note that we don't have any factual evidence that the Jetty RC that we use is indeed the cause of SOLR-118, so upgrading might not solve the problem. We're just at the wild guess stage at this point, and many of us have never seen the problem. In my case, we have more urgent stuff to do before looking at the problem in more detail. -Bertrand
Re: snapshooter on OS X
On 4/23/07, Grant Ingersoll [EMAIL PROTECTED] wrote: ...The error says something about command not found line 15, but all the files I looked at, line 15 was a comment... Running your script with bash -x myscript should help, it will echo commands before executing them. -Bertrand
Re: finalizer() in SolrCore (was: Commits and Container Shutdown)
On 4/16/07, Yonik Seeley [EMAIL PROTECTED] wrote: ...Yes, it's a typo. Fixed in revision 529367. -Bertrand
finalizer() in SolrCore (was: Commits and Container Shutdown)
On 4/16/07, Erik Hatcher [EMAIL PROTECTED] wrote: ...Further details on this: SolrCore has a finalizer() method that closes the update handler. I'm not clear on finalizer() though. How/ when is that invoked? I know about Object.finalize(), but not finalizer()... Looking at the code, it seems like SolrCore.finalizer() is not called anywhere. A typo maybe? There's also a similar SolrIndexWriter.finalizer(). -Bertrand
Re: Solr Query Language
On 4/16/07, Jack L [EMAIL PROTECTED] wrote: Is the lucene query syntax available in solr? ... The syntax depends on the request handler used, if you're using the standard one the docs are at http://wiki.apache.org/solr/StandardRequestHandler -Bertrand
Re: Posting PDF,DOC,TXT
On 4/6/07, Suresh Kannan [EMAIL PROTECTED] wrote: I would like to post PDF, DOC, TXT into SOLR to do the indexing. There's no way to do that directly at the moment, you'll need to convert them to the XML format that Solr expects. The Lucene FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ lists a number of tools that can help extract content and metadata from various formats. -Bertrand
Re: Solr logo poll
On 4/6/07, Yonik Seeley [EMAIL PROTECTED] wrote: ...What form of logo do you prefer, A or B? B -Bertrand (a Tex Avery fan ;-)
Re: Instructables on solr
On 4/4/07, Ryan McKinley [EMAIL PROTECTED] wrote: ...We have been running solr for months as a band-aid, this release integrates solr deeply... Awesome - thanks for sharing this! If you don't mind, it'd be cool to add some info to http://wiki.apache.org/solr/PublicServers -Bertrand
Re: Reposting unABLE to match
On 3/27/07, Shridhar Venkatraman [EMAIL PROTECTED] wrote: ...Reposting unABLE to match No need to repost if your message made it to the list. If it hasn't been answered yet, it either means that no one knows the answer or that no one has had the time to answer yet. We're all volunteers here. -Bertrand
Re: schema field type doesn't work
On 3/24/07, Dimitar Ouzounov [EMAIL PROTECTED] wrote: ...I must be doing something wrong, maybe in the schema. Does anyone have any suggestions?.. The best way to debug such problems is with the analyzer admin tool: http://localhost:8983/solr/admin/analysis.jsp You can try various combinations of analyzers and see what Solr actually indexes for various values. HTH, -Bertrand
Re: How to assure a permanent index.
On 3/21/07, Thierry Collogne [EMAIL PROTECTED] wrote: ...I mean if I do the following. - delete all documents from the index - add all documents - do a commit. Will this result in a temporary empty index, or will I always have results?... Changes to the index are invisible to the search components until a <commit/> is sent to Solr, so you should be fine (although personally I'd feel safer replacing documents in smaller batches). You could also use the index switching mechanism used when replicating Solr indexes (see http://wiki.apache.org/solr/CollectionDistribution) to prepare the index in another Solr instance and activate it instantly when needed. -Bertrand
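The two approaches can be sketched as sequences of update messages. This only builds the message lists, it does not talk to a server. It assumes that the query parser of your Solr version accepts *:* as a match-all delete query, and that documents carry a unique key so that re-adding a document replaces the old one; neither is stated in the thread.

```python
DELETE_ALL = "<delete><query>*:*</query></delete>"  # assumed match-all syntax
COMMIT = "<commit/>"


def full_rebuild(add_messages):
    """Delete everything, add everything, then commit once at the end.

    Searchers keep seeing the old index until the final commit.
    """
    return [DELETE_ALL, *add_messages, COMMIT]


def batched_replace(add_messages, batch_size):
    """Re-add documents in small batches, committing after each batch."""
    out = []
    for start in range(0, len(add_messages), batch_size):
        out.extend(add_messages[start:start + batch_size])
        out.append(COMMIT)
    return out


if __name__ == "__main__":
    adds = [f"<add><doc><field name=\"id\">{i}</field></doc></add>" for i in range(3)]
    print(full_rebuild(adds))
    print(batched_replace(adds, batch_size=2))
```

The batched variant never empties the index, but documents that no longer exist in the source have to be deleted separately.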
Re: Problems with special characters
On 3/21/07, Thierry Collogne [EMAIL PROTECTED] wrote: ...I am using the post.jar file to update the search indexes. Problem is that foreign characters like é, à, ... don't work correctly... You're right, I have entered the issue in https://issues.apache.org/jira/browse/SOLR-194 For now, using this as a workaround should help: java -Dfile.encoding=UTF-8 -jar post.jar http://localhost:8983/solr/update utf8-example.xml -Bertrand
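As an illustration of what goes wrong when a tool reads UTF-8 bytes with another default encoding, here is a short sketch. It assumes the platform default was Latin-1; the thread does not say which encoding was actually in effect.

```python
def misread(text, actual="utf-8", assumed="latin-1"):
    """Encode text as `actual`, then decode the bytes as `assumed`."""
    return text.encode(actual).decode(assumed)


if __name__ == "__main__":
    # One accented character becomes two unrelated characters.
    print(repr(misread("\u00e9")))
    # Reading with the right encoding round-trips cleanly.
    print(repr(misread("\u00e9", assumed="utf-8")))
```

Forcing the reader's encoding to match the file, which is what -Dfile.encoding=UTF-8 does for the JVM, avoids the mismatch.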
Re: Problems with special characters
On 3/21/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote: ...For now, using this as a workaround should help: java -Dfile.encoding=UTF-8 -jar post.jar http://localhost:8983/solr/update utf8-example.xml.. Should be fixed now, if you can grab the latest SimplePostTool code [1] it should work irrespective of the default JVM encoding. Please confirm if you test it. It's a kind of brute force fix, I have hardcoded the encoding as UTF-8, I'm keeping SOLR-194 open so that we don't forget to fix this (but considering SOLR-190 it's not urgent to fix). -Bertrand [1] https://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java
Re: Date range boost
On 3/12/07, stefano nicolai [EMAIL PROTECTED] wrote: ...All of these items have a field containing the date they were created (it's a string field at the moment, as i have this type inside my DB). I want to give a higher score to the ones with the most recent date... You should be able to use boost functions for this, see for example http://www.mail-archive.com/solr-user@lucene.apache.org/msg01877.html and http://lucene.apache.org/solr/api/org/apache/solr/search/QueryParsing.html#parseFunction(java.lang.String,%20org.apache.solr.schema.IndexSchema) -Bertrand
Re: production solr - app server choice ?
On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote: ...The site is a local portal and the traffic is very high and I am not sure if Jetty is enough maybe it is... Just an additional note on this: asking four people about what very high traffic means might also give you five different answers ;-) FWIW, I've been testing Solr on the plain Jetty example config at more than 100 semi-random queries per second and it ran just fine, on a medium-range server (dual Xeon 2GHz IIRC). But this is with our data and our type of queries - I agree with Erik that testing is the only way to find out how your setup will perform with your own data and queries. Simply generating a lot of semi-random requests from a collection of possible query parameters, and feeding the resulting URLs to multiple instances of curl or wget to generate some load, will tell you a lot about how your setup performs, and where the hotspots are. -Bertrand
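A minimal sketch of the URL generation step, assuming a small hand-made pool of terms and row counts. The pools, the base URL and the parameter choice are placeholders; real tests should draw from queries that resemble what your users send.

```python
import random
from urllib.parse import urlencode

# Hypothetical pools of parameter values, for illustration only.
TERMS = ["solr", "lucene", "jetty", "tomcat", "index"]
ROWS = [10, 20, 50]


def random_query_urls(base_url, count, seed=None):
    """Return `count` semi-random select URLs built from the pools above."""
    rng = random.Random(seed)
    urls = []
    for _ in range(count):
        params = {
            "q": " ".join(rng.sample(TERMS, rng.randint(1, 2))),
            "rows": rng.choice(ROWS),
        }
        urls.append(base_url + "/select?" + urlencode(params))
    return urls


if __name__ == "__main__":
    # One URL per line, ready to be split across several curl or wget runs.
    for url in random_query_urls("http://localhost:8983/solr", 5, seed=1):
        print(url)
```

Passing a fixed seed makes a run repeatable, which helps when comparing two configurations under the same load.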
Re: Adding data as UTF-8
On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote: It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override the XML encoding declaration. application/xml will not... I agree, but did you try this with our example setup, started with java -jar start.jar? It doesn't seem to work here: If I change our example/exampledocs/post.sh to use curl $URL --data-binary @$f -H 'Content-type:application/xml' instead of curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' the encoding declaration of my posted XML is ignored, characters are interpreted according to my JVM encoding (-Dfile.encoding makes a difference in that case). Are you seeing something different, or do you know why this is so? -Bertrand
Re: Adding data as UTF-8
On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote: If it does something different, that is a bug. RFC 3023 is clear. --wunder.. Sure - just wanted to confirm what I'm seeing, thanks! -Bertrand
Re: production solr - app server choice ?
On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote: ...I am wondering what everyone is using when it comes to app server i.e. Jetty, Resin, Tomcat etc... I suspect that asking four people might give you five different answers on this one ;-) Whichever servlet container you use, IMHO the important thing is to learn how to tune it according to your needs, traffic patterns, hardware and software environment, etc. -Bertrand
Re: Error with bin/optimize and multiple solr webapps
On 3/7/07, Jeff Rodenburg [EMAIL PROTECTED] wrote: Oops, my bad I didn't see either 186 or 187 before entering 188. :-) I have closed SOLR-186 and SOLR-187 as duplicates, please add relevant info to SOLR-188 if needed. -Bertrand
Re: merely a suggestion: schema.xml validator or better schema validation logging
On 3/3/07, Ryan McKinley [EMAIL PROTECTED] wrote: ...The rationale with the solrconfig stuff is that a broken config should behave as best it can. This is great if you are running a real site with people actively using it - it is a pain in the ass if you are getting started and don't notice errors I think it's a PITA in any case, I like my systems to fail loudly when something's wrong in the configs (with details about what's happening, of course). -Bertrand
Re: merely a suggestion: schema.xml validator or better schema validation logging
On 3/2/07, Jed Reynolds [EMAIL PROTECTED] wrote: ...my first try at defining a schema.xml file was tough because my only feedback for a long time was NullPointerException from SolrCore when I was trying to add content... Can you give us enough information to reproduce the problem? What was wrong in your schema, exactly? Please indicate also which version of Solr you used. -Bertrand
Re: MoreLikeThis and term vectors - documentation suggestion
On 2/26/07, Ken Krugler [EMAIL PROTECTED] wrote: ...I was trying out the MoreLikeThis support, and getting some odd results... Thanks for the info, I have added a link to your message at https://issues.apache.org/jira/browse/SOLR-69 -Bertrand
Re: Tagging
On 2/14/07, Erik Hatcher [EMAIL PROTECTED] wrote: ...Sorry if I'm sending things mangled somehow - and if anyone has suggestions on correcting I'm all ears For long links I tend to use http://tinyurl.com/, but it's a bit painful to do that for all links. -Bertrand
Re: Incremental replication...
On 2/13/07, escher2k [EMAIL PROTECTED] wrote: ...At least from looking at the snapshooter script, it doesn't seem to be doing anything specific... The snapshooter script only makes an instant snapshot of the index directory using cp -lr, which creates hard links rather than copying the index data. The actual replication is done using rsync in the other scripts, by copying the index snapshot elsewhere. Rsync only copies what has changed since the last copy, and not many files change in a Lucene index when adding documents, so it's correct that replication uses little bandwidth when adding documents. Index optimization, OTOH, causes much larger changes in the index directory, so after an optimization rsync will usually have much more data to transfer. -Bertrand
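To show why such a snapshot is instant and costs no extra disk space, here is a Python analogue of the hard-link step. It is not the snapshooter script itself: it handles a flat directory only, and the file name segment.dat is made up for the demo.

```python
import os
import tempfile


def snapshot(index_dir, snapshot_dir):
    """Hard-link every file of index_dir into snapshot_dir (flat dirs only)."""
    os.makedirs(snapshot_dir)
    for name in os.listdir(index_dir):
        src = os.path.join(index_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(snapshot_dir, name))


def demo():
    """Create a tiny fake index, snapshot it, and compare the two entries."""
    with tempfile.TemporaryDirectory() as root:
        index_dir = os.path.join(root, "index")
        os.makedirs(index_dir)
        # Hypothetical file name standing in for a Lucene index file.
        with open(os.path.join(index_dir, "segment.dat"), "wb") as f:
            f.write(b"index data")
        snap_dir = os.path.join(root, "snapshot.1")
        snapshot(index_dir, snap_dir)
        original = os.stat(os.path.join(index_dir, "segment.dat"))
        linked = os.stat(os.path.join(snap_dir, "segment.dat"))
        return original.st_ino == linked.st_ino, linked.st_nlink


if __name__ == "__main__":
    print(demo())
```

Both directory entries point at the same data on disk, so the snapshot stays valid even after the live index drops the file, as long as existing files are not modified in place.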
Re: performance testing practices
On 2/5/07, Erik Hatcher [EMAIL PROTECTED] wrote: ...What numbers are folks capturing? What techniques are you using to capture numbers?... I've been using my httpstone utility (http://code.google.com/p/httpstone/) along with ab (http://httpd.apache.org/docs/2.2/programs/ab.html) to generate many concurrent search requests, based on semi-random query URLs generated by shell scripts. The goal was to find out, on our hardware, how many typical queries per second we could serve with acceptable response times (less than 2.5 seconds). In our case, we found out that 100-200 requests per second were not a problem, and stopped testing as this is much more than we need currently. So I don't have precise numbers, but we know that we're safe with our current load. HTH, but it's more empirical than structured testing ;-) -Bertrand
Re: MoreLikeThis similarity-type queries in Solr
On 1/31/07, Brian Whitman [EMAIL PROTECTED] wrote: Does Solr have support for the Lucene query-contrib MoreLikeThis query type or anything like it? ... Yes, there's a patch in http://issues.apache.org/jira/browse/SOLR-69 - if you try it, please add your comments on that page. -Bertrand
Re: MoreLikeThis similarity-type queries in Solr
On 1/31/07, Andrew Nagy [EMAIL PROTECTED] wrote: ... Yes, there's a patch in http://issues.apache.org/jira/browse/SOLR-69 -... Any word on something like this being incorporated into the official SOLR release?... The patch is quite simple, I think we could commit it soon if the other committers agree. What's missing are unit tests, I'll try to write them next week unless someone beats me to it (I'm quite busy with other stuff ATM). -Bertrand
Re: How to Index Word, Excel, PDF files?
On 1/29/07, Leandro Saad [EMAIL PROTECTED] wrote: ...I'd like to know if solr can index Word, Excel and PDF files or I must create a xml representation of those files matching my schema?... Currently you must create the XML yourself outside of Solr. This might change, see https://issues.apache.org/jira/browse/SOLR-104 and the recent related update plugins discussions. -Bertrand
Re: Split one string into many fields
On 1/22/07, Yonik Seeley [EMAIL PROTECTED] wrote: ...When we get to it, I'd like to hear why it (things like PDF parsing) should be inside Solr rather than outside using our update interfaces Same here. I haven't had time to follow the recent (rich) design discussions about this stuff, but if I was designing this, I'd put all the document processing code in a separate module (separate servlet?) and keep the Solr core lean and mean, with as thin an interface as possible. -Bertrand
Re: Document freshness and Boost Functions
On 1/17/07, Luis Neves [EMAIL PROTECTED] wrote: ...I see that it is possible to use Boost Functions to influence the score. How would that work in order to improve the score of recent documents? (I have a timestamp field in the schema)... I've been using expressions like these in boolean queries, based on a broadcast_date field: _val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0) Where recip computes an age-based score, and linear is used to boost it. See http://incubator.apache.org/solr/docs/api/org/apache/solr/search/QueryParsing.html, and also the list archives, these functions have been discussed before. I'm not sure off the top of my head how to use this with dismax queries though. -Bertrand
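To get a feel for the numbers this expression produces, here is a sketch that evaluates it outside Solr. It assumes the function definitions as I recall them from the Solr function query documentation: recip(x,m,a,b) = a/(m*x+b), linear(x,m,c) = m*x+c, and rord giving 1 to the highest (most recent) value in the field.

```python
def recip(x, m, a, b):
    """Assumed definition: a / (m*x + b)."""
    return a / (m * x + b)


def linear(x, m, c):
    """Assumed definition: m*x + c."""
    return m * x + c


def freshness_boost(rord):
    """Value of linear(recip(rord(field),1,1000,1000),11,0) for a given rord."""
    return linear(recip(rord, 1, 1000, 1000), 11, 0)


if __name__ == "__main__":
    # rord=1 stands for the most recent document in the index.
    for rord in (1, 100, 1000, 10000):
        print(rord, round(freshness_boost(rord), 3))
```

Under these assumptions the boost stays just below 11 for the newest documents, halves once a document is 1000 positions from the newest, and decays toward 0 beyond that.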
Re: Calling Solr requests from java code - examples?
On 1/16/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: ...Could someone give me some code examples on how Solr requests can be called by Java code... Although our Java client landscape is still a bit fuzzy (there are several variants floating around), you might want to look at the code found in http://issues.apache.org/jira/browse/SOLR-20 If you're new to Java, I'd recommend playing with HttpClient first (http://jakarta.apache.org/commons/httpclient/), see the tutorial there for the basics. The standard Java library classes are also usable to write HTTP clients, but HttpClient will help a lot in getting the details right, if you don't mind depending on that library. -Bertrand
Re: Calling Solr requests from java code - examples?
On 1/16/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: ...and how would you do it calling it from another web application, let's say from a servlet or so?... Doesn't make much difference if your client is a standalone or a web application: your Solr client class will need to be configured with the base URL of the Solr server, it will make HTTP requests to it and parse the results as needed. -Bertrand
Re: Calling Solr requests from java code - examples?
On 1/16/07, Pavel Penchev [EMAIL PROTECTED] wrote: ...What about the case where solr and my application are deployed in the same instance of say tomcat. Is there a way to skip the http requests and use a direct api?... The javax.servlet.RequestDispatcher interface allows you to access other resources (including servlets) running in the same container. I've never used it but it looks like what you'd need (including a custom HttpServletResponse class to capture the other servlet's output). See http://java.sun.com/j2ee/1.4/docs/tutorial/doc/Servlets9.html#wp64684 which is part of http://java.sun.com/j2ee/1.4/docs/tutorial/doc/index.html Depending on how much faster this is than going the http way, it might be interesting to include it as another protocol in a Java Solr client. -Bertrand
Re: Faceted Dates
On 1/9/07, Ryan McKinley [EMAIL PROTECTED] wrote: ...I would like to use faceted browsing to group documents by year, month, and day. I can think of a few ways to do this, but I'd like to see what folks think before i start down the wrong track Dunno if you've already read it, but I found this page interesting when it comes to date queries, it might give you some additional ideas: http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing -Bertrand
Re: Handling disparate data sources in Solr
On 12/23/06, Alan Burlison [EMAIL PROTECTED] wrote: ...As well as centralising the index, I also want to centralise the handling of the different document types... My Subversion and Solr presentation from the last Cocoon GetTogether might give you ideas for how to handle this, see the link at http://wiki.apache.org/solr/SolrResources. Although it does not handle all binary formats out of the box (might need to write some java glue code to implement new formats), Cocoon is a good tool for transforming various document formats to XML and filter the results to generate the appropriate XML for Solr. I wouldn't add functionality to Solr for doing this, it's best to keep things loosely-coupled IMHO. -Bertrand
Re: Opinions wanted about a new Solr logo (SOLR-58)
On 12/18/06, Linda Tan [EMAIL PROTECTED] wrote: I just learned no attachments are allowed on this list. I've put the image in the jira.. Thanks, it looks good indeed! -Bertrand
Re: post the output of a URL to solr
On 11/30/06, Mike Klaas [EMAIL PROTECTED] wrote: ...Try something like: wget http://localhost:/gaz/solr/f0.xml -O - | curl http://localhost:8983/solr/update --data-binary - -H 'Content-type:text/xml; charset=utf-8' and if you use curl you can use it on both sides to avoid the dependency on both tools: curl http://localhost:/gaz/solr/f0.xml | curl ... -Bertrand
Re: Solr and Oracle
On 11/23/06, Nicolas St-Laurent [EMAIL PROTECTED] wrote: ...I index huge Oracle tables with Lucene with a custom made indexer/search engine. But I would prefer to use Solr instead... Instead of using Lucene's API directly, with Solr you'll have to add your documents to the index using HTTP POST messages. There are a few Java clients for Solr floating around on the wiki and in Jira IIRC, but you just need a POST, any way of doing it is fine (using jakarta httpclient for example). See http://wiki.apache.org/solr/SolrResources for more info. -Bertrand
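Since any way of doing the POST is fine, here is a minimal sketch using only the Python standard library. The field names and values are invented for the example, and the update URL is the one from the Solr example setup; the request is built but not sent, so the sketch runs without a server.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def add_xml(doc):
    """Build an <add> message from a {field name: value} dict."""
    fields = "".join(
        f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
        for name, value in doc.items()
    )
    return f"<add><doc>{fields}</doc></add>"


def build_post(update_url, xml_body):
    """Prepare (but do not send) the HTTP POST carrying the message."""
    return urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical row, e.g. one record read from a database table.
    body = add_xml({"id": "42", "name": "R&D"})
    req = build_post("http://localhost:8983/solr/update", body)
    print(req.get_method(), req.full_url)
    print(body)
    # urllib.request.urlopen(req) would send it; a <commit/> message
    # posted the same way makes the documents searchable.
```

Escaping field values matters as soon as the source data contains characters such as & or <, which is common in database text columns.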
Re: Extending Solr's Admin functionality
On 9/24/06, Erik Hatcher [EMAIL PROTECTED] wrote: ...perhaps some authentication/ authorization as well as HTTPS should eventually make it into the core, but getting more fine grained is unnecessary... If meaningful URLs are used (admin/stats, admin/config, admin/analysis, etc.), it is relatively easy to use either the servlet container or something like mod_proxy to implement security. Designing a good URL scheme might remove the need to address security concerns at the Solr level. -Bertrand
Re: Re: Doc add limit
On 7/28/06, Yonik Seeley [EMAIL PROTECTED] wrote: ...Getting all the little details of connection handling correct can be tough... it's probably a good idea if we work toward common client libraries so everyone doesn't have to reinvent them... Jakarta's HttpClient [1] is IMHO a good base for Java clients, and it's easy to use, see the PostXML example in [2]. -Bertrand [1] http://jakarta.apache.org/commons/httpclient/ [2] http://svn.apache.org/viewvc/jakarta/commons/proper/httpclient/trunk/src/examples/PostXML.java?revision=410848&view=markup
Re: Re: Cyrillic characters
On 7/19/06, Tricia Williams [EMAIL PROTECTED] wrote: ...What I called the _solr url encoding_ was the q= parameter translated into I'm not sure what encoding in the url... I think I've seen the same problem, haven't investigated deeper but IIUC the encoding used when posting a form is related to both the encoding indicated by the web server in the HTTP headers, and the encoding indicated (optionally) in the HTML page with something like <meta content="text/html; charset=UTF-8" http-equiv="content-type"/>. In my case I've found that, running SOLR from start.jar with default settings: - If I search for désormais from the solr/admin page, it is translated to q=d%E9sormais in the URL, and nothing's found (the word is in my index). - If I replace the q= value with q=d%C3%A9sormais (which is the encoding that I get when entering this word in the Google search form), my query works. I haven't seen the problem with my own search form, which includes the above http-equiv meta and is served as a static page from my web server. So I think something's wrong with the encoding on the solr/admin/ search page, but I haven't investigated further. Hope this helps...not sure if it does but the above scenario looks similar to yours. -Bertrand
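The two URL forms above are simply the same word percent-encoded from two different byte encodings, which a short sketch can reproduce. That the server decodes query strings as UTF-8 is an assumption that matches the observed behaviour, not something confirmed in the thread.

```python
from urllib.parse import quote, unquote

word = "d\u00e9sormais"

# Form submitted from a page treated as ISO-8859-1.
latin1_form = quote(word, encoding="latin-1")
# Form submitted from a page declared as UTF-8.
utf8_form = quote(word, encoding="utf-8")

if __name__ == "__main__":
    print(latin1_form)
    print(utf8_form)
    # A server decoding as UTF-8 cannot recover the Latin-1 form:
    print(repr(unquote(latin1_form, encoding="utf-8")))
    print(repr(unquote(utf8_form, encoding="utf-8")))
```

Decoded as UTF-8, the Latin-1 form yields a replacement character instead of é, so the query term no longer matches the indexed word.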