Re: Internationalization
Way to go Bess! This is great stuff you're sharing. I have a question though... On Jan 16, 2007, at 11:48 AM, Bess Sadler wrote: Currently, we are assigning all fields, no matter what language, to type string, defined as <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/> This does string matching very well, but doesn't do any stop words, or stemming, or anything fancy. We are toying with the idea of a custom Tibetan indexer to better break up the Tibetan into discrete words, but for this particular project (because it mostly has to do with proper names, not long passages of text) this hasn't been a problem yet, and the above solution seems to be doing the trick. Why are you assigning all fields to a string type? That indexes each field as-is, with no tokenization at all. How are you using that field from the front-end? I'd think you'd want to copyField everything into a text field. Elizabeth (Bess) Sadler, Head, Technical and Metadata Services, Digital Scholarship Services, Box 400129, Alderman Library, University of Virginia, Charlottesville, VA 22904. Just two floors down -- what amazing folks we have on this! Erik
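Erik's point about string vs. text fields can be sketched roughly in Python. This is only an illustration of the indexing difference, not Solr's code, and the example name is made up; Solr's real text analyzers also handle stop words, stemming, etc.

```python
# Sketch: contrast an untokenized StrField-style field with a
# whitespace-tokenized text-style field. The example value is made up.

def index_as_string(value):
    # A string (StrField) field indexes the whole value as one term.
    return [value]

def index_as_text(value):
    # A text field tokenizes; here, a naive lowercase whitespace split.
    return value.lower().split()

name = "Tenzin Gyatso"

# String field: only the exact, full value is a match.
assert index_as_string(name) == ["Tenzin Gyatso"]
assert "tenzin" not in index_as_string(name)

# Text field: searches for individual name parts can match.
assert "tenzin" in index_as_text(name)
```

This is why a query for a single name part finds nothing against a pure string field, and why copyField-ing into a tokenized text field is the usual recommendation for free-text search.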
Re: XML querying
Hi, Thorsten Scherler wrote: On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote: I think you should explain your use case a wee bit more. What I do now to index XML documents is to use a Filter to strip the markup; this works, but it's impossible to know where in the document the match is located. Why do you need to know where? Poorly phrased on my part. Ideally I want to apply Lucene filters to the XML content. Something like what Nux does: http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html -- Luis Neves
Document freshness and Boost Functions
Hello, Reading the javadocs for the DisMaxRequestHandler, I see that it is possible to use Boost Functions to influence the score. How would that work in order to improve the score of recent documents? (I have a timestamp field in the schema)... I'm assuming it's possible (right?), but I can't figure out the syntax. -- Luis Neves
Re: XML querying
On Wed, 2007-01-17 at 09:36 +, Luis Neves wrote: Hi, Thorsten Scherler wrote: On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote: I think you should explain your use case a wee bit more. What I do now to index XML documents is to use a Filter to strip the markup; this works, but it's impossible to know where in the document the match is located. Why do you need to know where? Poorly phrased on my part. Ideally I want to apply Lucene filters to the XML content. Something like what Nux does: http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html http://dsd.lbl.gov/nux/#Google-like realtime fulltext search via Apache Lucene engine If you have a look at this you will see that the Lucene search is plain and not XQuery based. It is more that you can define relations like in SQL, connecting two tables via keys. As I understand it, it will return the docs that have the xpath /books/book[author=James and the lucene:match(abstract, $query), where the lucene match is based on a normal Lucene query. I reckon it should be very easy to do something like this in a client environment like cocoon/forrest. See the Nux code to get an idea. If I needed to solve this, I would look for a component that allows me to do XQuery like Nux, and a component that lets me query a Solr server. Then you just need a custom method to match the documents for which both components return a result. salu2 -- Luis Neves
Re: Document freshness and Boost Functions
On 1/17/07, Luis Neves [EMAIL PROTECTED] wrote: ...I see that it is possible to use Boost Functions to influence the score. How would that work in order to improve the score of recent documents? (I have a timestamp field in the schema)... I've been using expressions like these in boolean queries, based on a broadcast_date field: _val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0) where recip computes an age-based score, and linear is used to boost it. See http://incubator.apache.org/solr/docs/api/org/apache/solr/search/QueryParsing.html, and also the list archives; these functions have been discussed before. I'm not sure off the top of my head how to use this with dismax queries though. -Bertrand
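For reference, the pieces of Bertrand's expression can be sketched in Python from the documented function-query formulas: recip(x,m,a,b) computes a/(m*x+b) and linear(x,a,b) computes a*x+b, with rord(broadcast_date) giving the reverse ordinal (1 for the newest document). This is a sketch of the formulas, not Solr's implementation:

```python
def recip(x, m, a, b):
    # Solr function query: recip(x,m,a,b) = a / (m*x + b)
    return a / (m * x + b)

def linear(x, a, b):
    # Solr function query: linear(x,a,b) = a*x + b
    return a * x + b

def freshness_boost(reverse_ordinal):
    # _val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0)
    # reverse_ordinal is 1 for the newest document, larger for older ones.
    return linear(recip(reverse_ordinal, 1, 1000, 1000), 11, 0)

# The newest document gets close to the full boost of 11,
# and the boost decays smoothly as documents age.
assert freshness_boost(1) > freshness_boost(100) > freshness_boost(10000)
assert abs(freshness_boost(100) - 10.0) < 1e-9
```

The constants control the decay curve: b=1000 keeps the first ~1000 documents near the top boost, and a=1000 with the outer linear factor of 11 sets the maximum.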
Re: my think about solr replication
On 1/17/07, James liu [EMAIL PROTECTED] wrote: when i use mysql replication, i think why not use it? Perhaps doable, but every slave would need to re-index the same documents pulled from the db. It would be more CPU and resource intensive, and harder to keep in sync. If you get a corrupted disk, how do you recover except by rebuilding everything from the db (and that means a long outage)? The same issues apply to other document distribution methods, such as using a message queue. Anyway, if this type of distribution works for you, use it! Solr's distribution mechanism is optional. -Yonik
Solr graduates and joins Lucene as sub-project
Solr has just graduated from the Incubator, and has been accepted as a Lucene sub-project! Thanks to all the Lucene and Solr users, contributors, and developers who helped make this happen! I have a feeling we're just getting started :-) -Yonik
Re: solr + cocoon problem
Hi, I agree, this is not a legal URL. But the thing is that cocoon itself is sending the unescaped URL. That is why I thought I am not using the right tools from cocoon. mirko Quoting Chris Hostetter [EMAIL PROTECTED]: : java.io.IOException: Server returned HTTP response code: 505 for URL: : http://hostname/solr/select/?q=a b : : : The interesting thing is that if I access http://hostname/solr/select/?q=a b : directly it works. i don't know anything about cocoon, but that is not a legal URL; URLs can't have spaces in them ... if you type a space into your browser, it's probably being nice and URL-escaping it for you (that's what most browsers seem to do nowadays). i'm guessing Cocoon automatically un-escapes the input to your app, and you need to re-URL-escape it before sending it to Solr. -Hoss
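The re-escaping Hoss describes looks like this in Python's standard library (the hostname and parameter are taken from the thread and are illustrative only):

```python
from urllib.parse import quote, urlencode

user_input = "a b"

# Escaping just the value turns the space into %20...
assert quote(user_input) == "a%20b"

# ...or build the whole query string, which encodes spaces as '+'.
params = urlencode({"q": user_input})
assert params == "q=a+b"

# Either form yields a legal URL to send to Solr.
url = "http://hostname/solr/select/?" + params
assert " " not in url
```

Both %20 and + are valid encodings of a space in a query string; the key point is that the raw space must never reach the server.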
Re: solr + cocoon problem
On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote: Hi, I agree, this is not a legal URL. But the thing is that cocoon itself is sending the unescaped URL. ...because you told it so. You use <map:generate src="http://hostname/solr/select/?q={request-param:q}" type="file"/> The request param module will not escape the param by default. salu2
Re: solr + cocoon problem
Thanks Thorsten, that really was helpful. Cocoon's url-encode module does solve my problem. mirko Quoting Thorsten Scherler [EMAIL PROTECTED]: On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote: Hi, I agree, this is not a legal URL. But the thing is that cocoon itself is sending the unescaped URL. ...because you told it so. You use <map:generate src="http://hostname/solr/select/?q={request-param:q}" type="file"/> The request param module will not escape the param by default. salu2
Re: Solr graduates and joins Lucene as sub-project
Congrats to all involved committers on the project as well. Solr is an invaluable system in my operation. Great job. On 1/17/07, Yonik Seeley [EMAIL PROTECTED] wrote: Solr has just graduated from the Incubator, and has been accepted as a Lucene sub-project! Thanks to all the Lucene and Solr users, contributors, and developers who helped make this happen! I have a feeling we're just getting started :-) -Yonik
Re: Solr graduates and joins Lucene as sub-project
Congratulations Yonik and the Solr team! I just got started playing with Solr (having done all with raw Lucene and Java object caches only until now) Too bad I can't reach the issue tracker now, as I want to contribute a PHP responsewriter to Solr. This work is also a start for a set of generic classes (first release within a few weeks I guess) to be used in PHP apps and frameworks. Paul On 1/17/07, Yonik Seeley [EMAIL PROTECTED] wrote: Solr has just graduated from the Incubator, and has been accepted as a Lucene sub-project! Thanks to all the Lucene and Solr users, contributors, and developers who helped make this happen! I have a feeling we're just getting started :-) -Yonik -- http://walhalla.wordpress.com
possible FAQ - lucene interop
Hello all: We've got one java-based project at work using lucene. I'm looking to use solr as a search system for some other projects at work. Once data is indexed in solr, can we get at it using standard lucene libraries? I know how I want to use solr, but if the java devs need to get at the data as well, I'd rather that 1) they be able to use their existing tech and skills and 2) I not have to reindex everything in lucene-only indexes. I've read the FAQs and some of the mailing list and couldn't find this question addressed. Thanks. -- Michael Kimsal http://webdevradio.com
Re: possible FAQ - lucene interop
Hi Michael, What Solr is really doing is building a Lucene index. In most cases a Java developer should be able to access the index that Solr built through the IndexReader/IndexSearcher classes and the location of the index that Solr built. See the Lucene API for details on these and other classes. The default index location is in solr/data/index relative to where you start the servlet which is running Solr. Hope you find that helpful, Tricia On Wed, 17 Jan 2007, Michael Kimsal wrote: Hello all: We've got one java-based project at work using lucene. I'm looking to use solr as a search system for some other projects at work. Once data is indexed in solr, can we get at it using standard lucene libraries? I know how I want to use solr, but if the java devs need to get at the data as well, I'd rather that 1) they be able to use their existing tech and skills and 2) I not have to reindex everything in lucene-only indexes. I've read the FAQs and some of the mailing list and couldn't find this question addressed. Thanks. -- Michael Kimsal http://webdevradio.com
Re: possible FAQ - lucene interop
: Thanks - that helps, and ideally should help with adoption questions here. : You said most cases - I've read something about solr extends lucene in : the docs. Are there some specific solr-only bits of functionality that : would preclude vanilla-lucene code from accessing solr-created indexes? the notion that Solr extends Lucene is primarily in terms of the HTTP API it provides, but there is lots of code in the Solr code base that extends the functionality of Lucene in various ways ... FunctionQueries for example, and support for them in the SolrQueryParser (which is a subclass of the Lucene QueryParser). If your primary concern is that you want to allow people writing apps against the raw Lucene APIs to access your index, your only real concern is how you design your schema ... whatever analyzers you use on text fields will need to be available to the other clients, and if you use any of the complex field types (sortable ints, dates, etc) then those other apps will need to know how to convert values before querying those fields. in addition to the solr.war, the solr distributions include a jar containing all of the stock code that ships with Solr -- primarily for compiling against when building plugins, but that same code JAR could be used by standalone Lucene apps to access the various TokenFilters and FieldTypes that Solr provides, if you use them in your schema. -Hoss
Re: my think about solr replication
: i tried it but without success, maybe because i'm not experienced with freebsd. (if you know how to configure and use it, tell me and i will be very happy. :) ) for the record, i'm sure the bug with using the distribution scripts on FreeBSD is a minor one; it just needs someone with some expertise in BSD/bash to take a look at it ... regrettably i am not one of those people... https://issues.apache.org/jira/browse/SOLR-93 -Hoss
Re: Document freshness and Boost Functions
: Boost Functions to influence the score. How would that work in order to : improve the score of recent documents? (I have a timestamp field in the : I've been using expressions like these in boolean queries, based on a : broadcast_date field: : : _val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0) : I'm not sure off the top of my head how to use this with dismax queries though. with the dismax request handler, you can specify a bq param which takes in a raw lucene query for boosting -- the query above with the _val_ syntax would work there -- but the DisMax handler also has explicit support for boost function parsing with the bf param, so you could say... http://localhost:8983/solr/search?qt=dismax&q=hoss&bf=linear(recip(rord(broadcast_date),1,1000,1000),11,0) http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html -Hoss
Re: One item, multiple fields, and range queries
: OK, you lost me. It sounds as if this PhraseQuery-ish approach involves : breaking datetime and lat/long values into pieces, and evaluation occurs : with positioning. Is that accurate? i'm not sure what you mean by pieces ... the idea is that you would have a single latitude field and a single longitude field and a single when field, and if an item had a single event, you would store a single value in each field ... but if the item has multiple events, you would store them in the same relative ordering, and then use the same kind of logic PhraseQuery uses to verify that if the latitude field has a value in the right range, and the longitude field has a value in the right range, and the when field has a value in the right range, then all of those values have the same position (specifically: are within a set amount of slop from each other, which you would always set to 0) : It seems like this could even be done in the same field if one had a : query type that allowed querying for tokens at the same position. : Just index _noun at the same position as house (and make sure : there can't be collisions between real terms and markers via escaping, : or use \0 instead of _, etc). true ... but the point Doug made way back when is that with a generalized multi-field phrase query you wouldn't have to do that escaping ... the hard part in this case is the numeric ranges. -Hoss
Bucketing result set (User list posting)...
I have a requirement wherein the documents that are retrieved based on the similarity computation are bucketed and re-sorted based on user score. An example - let us say a search returns the following data set:

Doc ID  Lucene score  User score
1000    1000          125
1000    900           225
1000    800           25
1000    700           525
1000    50            25
1000    40            125

Assuming two buckets are created, the expected result is:

Doc ID  Lucene score  User score
1000    900           225
1000    1000          125
1000    800           25
---
1000    700           525
1000    40            125
1000    50            25

I am assuming that the only way to do this is to change some of the Solr internals. Any pointers would be most helpful on the best way to go about it. I will also post this on the Dev list. Thanks. -- View this message in context: http://www.nabble.com/Bucketing-result-set-%28User-list-posting%29...-tf3031129.html#a8421968 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Bucketing result set (User list posting)...
Please don't post solr-user questions on solr-dev. Crossposting is bad; multi-posting is even worse. Most if not all of the Solr devs read solr-user and will respond to you there. On 1/17/07, escher2k [EMAIL PROTECTED] wrote: I have a requirement wherein the documents that are retrieved based on the similarity computation are bucketed and re-sorted based on user score. An example - let us say a search returns the following data set:

Doc ID  Lucene score  User score
1000    1000          125
1000    900           225
1000    800           25
1000    700           525
1000    50            25
1000    40            125

Assuming two buckets are created, the expected result is:

Doc ID  Lucene score  User score
1000    900           225
1000    1000          125
1000    800           25
---
1000    700           525
1000    40            125
1000    50            25

I am assuming that the only way to do this is to change some of the Solr internals. Any pointers would be most helpful on the best way to go about it. I will also post this on the How is the bucketing done? How are the user scores stored? It looks like you are picking constant-sized groups from the solr-sorted result list. In this case, surely this can be done trivially client-side? I could be totally misinterpreting your question, however. cheers, -Mike
Re: One item, multiple fields, and range queries
Now I follow. I was misreading the first comments, thinking that the field content would be deconstructed into smaller components or pieces. Too much (or not enough) coffee. I'm expecting the index doc needs to be constructed with lat/long/dates in sequential order, i.e.:

<add>
  <doc>
    <field name="event_id">123</field>
    <field name="latitude">32.123456</field>
    <field name="longitude">-88.987654</field>
    <field name="when">01/31/2007</field>
    <field name="latitude">42.123456</field>
    <field name="longitude">-98.987654</field>
    <field name="when">01/31/2007</field>
    <field name="latitude">40.123456</field>
    <field name="longitude">-108.987654</field>
    <field name="when">01/30/2007</field>
    ...etc.
  </doc>
</add>

Assuming a slop count of 0, while the intention is to match lat/long/when in that order, could it possibly match long/when/lat, or when/lat/long? Does PhraseQuery enforce order and starting point as well? Assuming all of this, how does range query come into play? Or could the PhraseQuery portion be applied as a filter? On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote: : OK, you lost me. It sounds as if this PhraseQuery-ish approach involves : breaking datetime and lat/long values into pieces, and evaluation occurs : with positioning. Is that accurate? i'm not sure what you mean by pieces ... the idea is that you would have a single latitude field and a single longitude field and a single when field, and if an item had a single event, you would store a single value in each field ... but if the item has multiple events, you would store them in the same relative ordering, and then use the same kind of logic PhraseQuery uses to verify that if the latitude field has a value in the right range, and the longitude field has a value in the right range, and the when field has a value in the right range, then all of those values have the same position (specifically: are within a set amount of slop from each other, which you would always set to 0) : It seems like this could even be done in the same field if one had a : query type that allowed querying for tokens at the same position. : Just index _noun at the same position as house (and make sure : there can't be collisions between real terms and markers via escaping, : or use \0 instead of _, etc). true ... but the point Doug made way back when is that with a generalized multi-field phrase query you wouldn't have to do that escaping ... the hard part in this case is the numeric ranges. -Hoss
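The position-matching idea in this thread can be sketched as a parallel-array check: an item matches only if some single event position satisfies all three range predicates at once. This is a conceptual sketch with made-up values, not the PhraseQuery machinery itself (dates are simplified to integers):

```python
def matches_at_same_position(latitudes, longitudes, whens,
                             lat_range, lon_range, when_range):
    # Parallel arrays: position i holds the lat/long/when of event i.
    # With slop 0, all three range tests must hit the *same* position.
    for lat, lon, when in zip(latitudes, longitudes, whens):
        if (lat_range[0] <= lat <= lat_range[1]
                and lon_range[0] <= lon <= lon_range[1]
                and when_range[0] <= when <= when_range[1]):
            return True
    return False

# Three events for one item (values made up, dates as YYYYMMDD ints).
lats = [32.1, 42.1, 40.1]
lons = [-88.9, -98.9, -108.9]
whens = [20070131, 20070131, 20070130]

# Match: event 1 satisfies all three ranges simultaneously.
assert matches_at_same_position(lats, lons, whens,
                                (40, 45), (-100, -95), (20070101, 20070201))

# No match: a qualifying latitude and a qualifying longitude both
# exist, but never at the same event position.
assert not matches_at_same_position(lats, lons, whens,
                                    (30, 35), (-100, -95), (20070101, 20070201))
```

Because the check is per-position rather than per-field-order, the long/when/lat ordering question dissolves: what matters is that all three values share a position, not the order in which the fields are tested.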
Re: Solr graduates and joins Lucene as sub-project
On 1/17/07, Paul Borgermans [EMAIL PROTECTED] wrote: Congratulations Yonik and the Solr team! I just got started playing with Solr (having done all with raw Lucene and Java object caches only until now) Too bad I can't reach the issue tracker now, as I want to contribute a PHP responsewriter to Solr. This work is also a start for a set of generic classes (first release within a few weeks I guess) to be used in PHP apps and frameworks. Cool, can't wait to see it! I bet some of the guys at the upcoming code4lib pre-conference thing that Erik is leading, http://code4lib.org/node/139 will appreciate more PHP support too. -Yonik