Re: Internationalization

2007-01-17 Thread Erik Hatcher

Way to go Bess!   This is great stuff you're sharing.

I have a question though...

On Jan 16, 2007, at 11:48 AM, Bess Sadler wrote:
Currently, we are assigning all fields, no matter what language, to
type string, defined as


<fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>


This does string matching very well, but doesn't do any stop words,  
or stemming, or anything fancy. We are toying with the idea of a  
custom Tibetan indexer to better break up the Tibetan into discrete  
words, but for this particular project (because it mostly has to do  
with proper names, not long passages of text) this hasn't been a  
problem yet, and the above solution seems to be doing the trick.


Why are you assigning all fields to a string type?  That indexes  
each field as-is, with no tokenization at all.  How are you using  
that field from the front-end?   I'd think you'd want to copyField  
everything into a text field.
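A minimal sketch of Erik's suggestion in schema.xml terms (the field names here are illustrative, not from Bess's actual schema):

```xml
<!-- untokenized string field: good for exact match, sorting, faceting -->
<field name="name_exact" type="string" indexed="true" stored="true"/>

<!-- tokenized text field: good for full-text search -->
<field name="name_text" type="text" indexed="true" stored="false"/>

<!-- index every value in both fields -->
<copyField source="name_exact" dest="name_text"/>
```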



Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904


Just two floors down... what amazing folks we have working on this!

Erik



Re: XML querying

2007-01-17 Thread Luis Neves

Hi,

Thorsten Scherler wrote:

On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:




I think you should explain your use case a wee bit more.


What I do now to index XML documents is to use a Filter to strip
the markup.

This works, but it's impossible to know where in the document the match
is located.


why do you need to know where? 


Poorly phrased on my part. Ideally I want to apply Lucene filters to the xml
content.

Something like what Nux does:
http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html


--
Luis Neves


Document freshness and Boost Functions

2007-01-17 Thread Luis Neves


Hello,
Reading the javadocs for the DisMaxRequestHandler I see that it is possible to use 
Boost Functions to influence the score. How would that work in order to 
improve the score of recent documents? (I have a timestamp field in the 
schema.) I'm assuming it's possible (right?), but I can't figure out the syntax.


--
Luis Neves






Re: XML querying

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 09:36 +, Luis Neves wrote:
 Hi,
 
 Thorsten Scherler wrote:
  On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:
 
  
  I think you should explain your use case a wee bit more.
  
  What I do now to index XML documents is to use a Filter to strip
  the markup.
  This works but it's impossible to know where in the document the
  match is located.
  
  why do you need to know where? 
 
 Poorly phrased on my part. Ideally I want to apply Lucene filters to the
 xml
 content.
 Something like what Nux does:
 http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html
 

http://dsd.lbl.gov/nux/ ("Google-like realtime fulltext search via Apache
Lucene engine")

If you have a look at this you will see that the Lucene search is plain
and not XQuery based. It is more that you can define relations, like in
SQL connecting two tables via keys. As I understand it, it will return
the docs that have the xpath /books/book[author=James and the
lucene:match(abstract, $query), where the lucene match is based on a
normal lucene query.

I reckon it should be very easy to do something like this in a client
environment like Cocoon/Forrest. See the Nux code to get an idea.
If I needed to solve this I would look for a component that gives me
XQuery like Nux, and a component that lets me do queries against a Solr
server.

Then you just need a custom method to match the documents that both
components return.

salu2

 
 --
 Luis Neves



Re: Document freshness and Boost Functions

2007-01-17 Thread Bertrand Delacretaz

On 1/17/07, Luis Neves [EMAIL PROTECTED] wrote:


...I see that it is possible to use
Boost Functions to influence the score. How would that work in order to
improve the score of recent documents? (I have a timestamp field in the
schema)...


I've been using expressions like these in boolean queries, based on  a
broadcast_date field:

_val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0)

Where recip computes an age-based score, and linear is used to boost it.
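As a rough illustration of what that expression computes (the function semantics follow Solr's FunctionQuery docs; the sample rank values are made up):

```python
def recip(x, m, a, b):
    # Solr's recip(x,m,a,b) = a / (m*x + b): large for small x, decaying as x grows
    return a / (m * x + b)

def linear(x, a, b):
    # Solr's linear(x,a,b) = a*x + b: here it just scales the recip score
    return a * x + b

def freshness_boost(reverse_ord):
    # reverse_ord plays the role of rord(broadcast_date):
    # 1 for the newest document, N for the oldest
    return linear(recip(reverse_ord, 1, 1000, 1000), 11, 0)

# the newest document gets a boost just under 11;
# older documents get progressively smaller boosts
newest, older, oldest = freshness_boost(1), freshness_boost(100), freshness_boost(10000)
```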

See 
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/QueryParsing.html,
and also the list archives, these functions have been discussed
before.

I'm not sure off the top of my head how to use this with dismax queries though.

-Bertrand


Re: my think about solr replication

2007-01-17 Thread Yonik Seeley

On 1/17/07, James liu [EMAIL PROTECTED] wrote:

when i use mysql replication, i think why not use it?


Perhaps doable, but every slave would need to re-index the same
documents pulled from the db.  It would be more CPU and resource
intensive and harder to keep in sync.  If you get a corrupted disk,
how do you recover except by rebuilding everything from the db (and
that means a long outage)?

The same issues apply to other document distribution methods, such as
using message queues.

Anyway, if this type of distribution works for you, use it!
Solr's distribution mechanism is optional.

-Yonik


Solr graduates and joins Lucene as sub-project

2007-01-17 Thread Yonik Seeley

Solr has just graduated from the Incubator, and has been accepted as a
Lucene sub-project!
Thanks to all the Lucene and Solr users, contributors, and developers
who helped make this happen!

I have a feeling we're just getting started :-)
-Yonik


Re: solr + cocoon problem

2007-01-17 Thread mirko
Hi,

I agree, this is not a legal URL.  But the thing is that Cocoon itself is
sending the unescaped URL.  That is why I thought I was not using the right
tools from Cocoon.

mirko


Quoting Chris Hostetter [EMAIL PROTECTED]:


 : java.io.IOException: Server returned HTTP response code: 505 for URL:
 : http://hostname/solr/select/?q=a b
 :
 :
 : The interesting thing is that if I access http://hostname/solr/select/?q=a
 b
 : directly it works.

 i don't know anything about cocoon, but that is not a legal URL; URLs
 can't have spaces in them ... if you type a space into your browser, it's
 probably being nice and URL-escaping it for you (that's what most browsers
 seem to do nowadays)

 i'm guessing Cocoon automatically un-escapes the input to your app, and you
 need to re-URL-escape it before sending it to Solr.




 -Hoss
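The fix Hoss describes, sketched in Python (the hostname and query value come from this thread; the helper name is made up):

```python
from urllib.parse import quote

def solr_select_url(host, query):
    # Re-escape the user-supplied query before building the Solr request
    # URL, so spaces and other reserved characters are percent-encoded.
    return "http://%s/solr/select/?q=%s" % (host, quote(query))

# "a b" becomes "a%20b", which is a legal URL
url = solr_select_url("hostname", "a b")
```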





Re: solr + cocoon problem

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote:
 Hi,
 
 I agree, this is not a legal URL.  But the thing is that cocoon itself is
 sending the unescaped URL. 

...because you told it so.

You use 
<map:generate
  src="http://hostname/solr/select/?q={request-param:q}"
  type="file"/>

The request param module will not escape the param by default.

salu2



Re: solr + cocoon problem

2007-01-17 Thread mirko
Thanks Thorsten,

that really was helpful.  Cocoon's url-encode module does solve my problem.

mirko


Quoting Thorsten Scherler [EMAIL PROTECTED]:

 On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote:
  Hi,
 
  I agree, this is not a legal URL.  But the thing is that cocoon itself is
  sending the unescaped URL.

 ...because you told it so.

 You use
  <map:generate
    src="http://hostname/solr/select/?q={request-param:q}"
    type="file"/>

 The request param module will not escape the param by default.

 salu2





Re: Solr graduates and joins Lucene as sub-project

2007-01-17 Thread Jeff Rodenburg

Congrats to all involved committers on the project as well.  Solr is an
invaluable system in my operation.  Great job.

On 1/17/07, Yonik Seeley [EMAIL PROTECTED] wrote:


Solr has just graduated from the Incubator, and has been accepted as a
Lucene sub-project!
Thanks to all the Lucene and Solr users, contributors, and developers
who helped make this happen!

I have a feeling we're just getting started :-)
-Yonik



Re: Solr graduates and joins Lucene as sub-project

2007-01-17 Thread Paul Borgermans

Congratulations Yonik and the Solr team!

I just got started playing with Solr (having done everything with raw Lucene
and Java object caches until now)

Too bad I can't reach the issue tracker now, as I want to contribute a PHP
responsewriter to Solr. This work is also a start for a set of generic
classes (first release within a few weeks I guess) to be used in PHP apps
and frameworks.

Paul

On 1/17/07, Yonik Seeley [EMAIL PROTECTED] wrote:


Solr has just graduated from the Incubator, and has been accepted as a
Lucene sub-project!
Thanks to all the Lucene and Solr users, contributors, and developers
who helped make this happen!

I have a feeling we're just getting started :-)
-Yonik





--
http://walhalla.wordpress.com


possible FAQ - lucene interop

2007-01-17 Thread Michael Kimsal

Hello all:

We've got one java-based project at work using lucene.  I'm looking to use
solr as a search system for some other projects at work.  Once data is
indexed in solr, can we get at it using standard lucene libraries?  I know
how I want to use solr, but if the java devs need to get at the data as
well, I'd rather that 1) they be able to use their existing tech and skills
and 2) I not have to reindex everything in lucene-only indexes.

I've read the FAQs and some of the mailing list and couldn't find this
question addressed.

Thanks.

--
Michael Kimsal
http://webdevradio.com


Re: possible FAQ - lucene interop

2007-01-17 Thread Tricia Williams

Hi Michael,

   What Solr is really doing is building a Lucene index.  In most cases a 
Java developer should be able to access the index that Solr built through 
the IndexReader/IndexSearcher classes, given the location of the index. 
See the Lucene API for details on these and other classes. 
The default index location is solr/data/index, relative to where you 
start the servlet container which is running Solr.


Hope you find that helpful,
Tricia


On Wed, 17 Jan 2007, Michael Kimsal wrote:


Hello all:

We've got one java-based project at work using lucene.  I'm looking to use
solr as a search system for some other projects at work.  Once data is
indexed in solr, can we get at it using standard lucene libraries?  I know
how I want to use solr, but if the java devs need to get at the data as
well, I'd rather that 1) they be able to use their existing tech and skills
and 2) I not have to reindex everything in lucene-only indexes.

I've read the FAQs and some of the mailing list and couldn't find this
question addressed.

Thanks.

--
Michael Kimsal
http://webdevradio.com



Re: possible FAQ - lucene interop

2007-01-17 Thread Chris Hostetter

: Thanks - that helps, and ideally should help with adoption questions here.
: You said "most cases" - I've read something about "Solr extends Lucene" in
: the docs.  Are there some specific solr-only bits of functionality that
: would preclude vanilla-lucene code from accessing solr-created indexes?

the notion that Solr extends Lucene is primarily in terms of the HTTP
API it provides, but there is lots of code in the Solr code base that
extends the functionality of Lucene in various ways ... FunctionQueries for
example, and support for them in the SolrQueryParser (which is a subclass
of the Lucene QueryParser).  If your primary concern is that you
want to allow people writing apps using the raw Lucene APIs to be
able to access your index, your only real concern is how you design
your schema ... whatever analyzers you use on text fields will need to be
available to the other clients, and if you use any of the complex field
types (sortable ints, dates, etc) then those other apps will need to know
how to convert values before querying those fields.

in addition to the solr.war, the solr distributions include a jar
containing all of the stock code that ships with Solr -- primarily for
compiling against when building plugins, but that same code JAR could be
used by standalone Lucene apps to access the various TokenFilters and
FieldTypes that Solr provides, if you use them in your schema.


-Hoss



Re: my think about solr replication

2007-01-17 Thread Chris Hostetter

: i try it but not success maybe because i m poor in freebsd.(if u know how to
: config and use, tell me and i will be very happy.:) )

for the record, i'm sure the bug with using the distribution scripts on
FreeBSD is a minor one; it just needs someone with some expertise in
BSD/bash to take a look at it ... regrettably i am not one of those
people...

https://issues.apache.org/jira/browse/SOLR-93

-Hoss



Re: Document freshness and Boost Functions

2007-01-17 Thread Chris Hostetter

:  Boost Functions to influence the score. How would that work in order to
:  improve the score of recent documents? (I have a timestamp field in the

: I've been using expressions like these in boolean queries, based on  a
: broadcast_date field:
:
: _val_:linear(recip(rord(broadcast_date),1,1000,1000),11,0)

: I'm not sure off the top of my head how to use this with dismax queries 
though.

with the dismax request handler, you can specify a bq param which takes
a raw lucene query for boosting -- the query above with the _val_ syntax
would work there -- but the DisMax handler also has explicit support for
boost function parsing with the bf param, so you could say...

http://localhost:8983/solr/search?qt=dismax&q=hoss&bf=linear(recip(rord(broadcast_date),1,1000,1000),11,0)

http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html

-Hoss



Re: One item, multiple fields, and range queries

2007-01-17 Thread Chris Hostetter

: OK, you lost me.  It sounds as if this PhraseQuery-ish approach involves
: breaking datetime and lat/long values into pieces, and evaluation occurs
: with positioning.  Is that accurate?

i'm not sure what you mean by pieces ... the idea is that you would have a
single latitude field and a single longitude field and a single when
field, and if an item had a single event, you would store a single value
in each field ... but if the item has multiple events, you would store
them in the same relative ordering, and then use the same kind of logic
PhraseQuery uses to verify that if the latitude field has a value in the
right range, and the longitude field has a value in the right range, and
the when field has a value in the right range, then all of those values
have the same position (specifically: are within a set amount of slop of
each other, which you would always set to 0)

:  It seems like this could even be done in the same field if one had a
:  query type that allowed querying for tokens at the same position.
:  Just index "_noun" at the same position as "house" (and make sure
:  there can't be collisions between real terms and markers via escaping,
:  or use \0 instead of "_", etc).

true ... but the point doug made way back when is that with a generalized
multi-field phrase query you wouldn't have to do that escaping ... the
hard part in this case is the numeric ranges.


-Hoss
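The position-alignment check Hoss describes can be sketched in Python (this is an in-memory illustration only; the real approach would operate on term positions inside the Lucene index, and the integer date encoding here is made up):

```python
def multi_field_range_match(latitudes, longitudes, whens,
                            lat_range, lon_range, when_range):
    # Each field stores its values in the same relative order (one slot
    # per event).  With slop 0, a document matches only if some single
    # position satisfies all three range constraints at once.
    for lat, lon, when in zip(latitudes, longitudes, whens):
        if (lat_range[0] <= lat <= lat_range[1]
                and lon_range[0] <= lon <= lon_range[1]
                and when_range[0] <= when <= when_range[1]):
            return True
    return False

# event 0 satisfies all three ranges -> match
hit = multi_field_range_match(
    [32.1, 42.1], [-88.9, -98.9], [20070131, 20070131],
    (30, 35), (-90, -85), (20070101, 20070201))

# lat matches only at position 0 and lon only at position 1 -> no match
miss = multi_field_range_match(
    [32.1, 42.1], [-98.9, -88.9], [20070131, 20070131],
    (30, 35), (-90, -85), (20070101, 20070201))
```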



Bucketing result set (User list posting)...

2007-01-17 Thread escher2k

I have a requirement wherein the documents that are retrieved based on the
similarity computation are bucketed and re-sorted based on user score.
An example -

Let us say a search returns the following data set -

Doc ID   Lucene score   User score
1000     1000           125
1000     900            225
1000     800            25
1000     700            525
1000     50             25
1000     40             125

Assuming two buckets are created, the expected result is - 
Doc ID   Lucene score   User score
1000     900            225
1000     1000           125
1000     800            25
---
1000     700            525
1000     40             125
1000     50             25

I am assuming that the only way to do this is to change some of the Solr
internals.  Any pointers on the best way to go about it would be most
helpful. I will also post this on the Dev list.

Thanks.

-- 
View this message in context: 
http://www.nabble.com/Bucketing-result-set-%28User-list-posting%29...-tf3031129.html#a8421968
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Bucketing result set (User list posting)...

2007-01-17 Thread Mike Klaas


Please don't post solr-user questions on solr-dev.  Crossposting is
bad; multi-posting is even worse.  Most if not all of the Solr devs read
solr-user and will respond to you there.

On 1/17/07, escher2k [EMAIL PROTECTED] wrote:


I have a requirement wherein the documents that are retrieved based on the
similarity computation are bucketed and re-sorted based on user score.
An example -

Let us say a search returns the following data set -

Doc ID   Lucene score   User score
1000     1000           125
1000     900            225
1000     800            25
1000     700            525
1000     50             25
1000     40             125

Assuming two buckets are created, the expected result is -
Doc ID   Lucene score   User score
1000     900            225
1000     1000           125
1000     800            25
---
1000     700            525
1000     40             125
1000     50             25

I am assuming that the only way to do this is to change some of the Solr
internals.  Any pointers on the best way to go about it would be most
helpful. I will also post this on the


How is the bucketing done?  How are the user scores stored?  It looks
like you are picking constant-sized groups from the Solr-sorted result
list.  In this case, surely this can be done trivially client-side?  I
could be totally misinterpreting your question, however.

cheers,
-Mike
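A rough sketch of the client-side approach Mike suggests (the bucket size and the shape of the result tuples are assumptions for illustration, not anything Solr returns natively):

```python
def bucket_resort(results, bucket_size):
    # results: list of (doc_id, lucene_score, user_score) tuples, already
    # sorted by Lucene score descending.  Within each fixed-size bucket,
    # re-sort by user score descending.
    out = []
    for i in range(0, len(results), bucket_size):
        bucket = results[i:i + bucket_size]
        bucket.sort(key=lambda r: r[2], reverse=True)
        out.extend(bucket)
    return out

# the data set from the original post, split into two buckets of three
results = [
    ("1000", 1000, 125), ("1000", 900, 225), ("1000", 800, 25),
    ("1000", 700, 525), ("1000", 50, 25), ("1000", 40, 125),
]
resorted = bucket_resort(results, 3)
```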


Re: One item, multiple fields, and range queries

2007-01-17 Thread Jeff Rodenburg

Now I follow.  I was misreading the first comments, thinking that the field
content would be deconstructed into smaller components or pieces.  Too much
(or not enough) coffee.

I'm expecting the index doc needs to be constructed with lat/long/dates in
sequential order, i.e.:

<add>
<doc>
  <field name="event_id">123</field>

  <field name="latitude">32.123456</field>
  <field name="longitude">-88.987654</field>
  <field name="when">01/31/2007</field>

  <field name="latitude">42.123456</field>
  <field name="longitude">-98.987654</field>
  <field name="when">01/31/2007</field>

  <field name="latitude">40.123456</field>
  <field name="longitude">-108.987654</field>
  <field name="when">01/30/2007</field>
</doc>
</add>
...etc.

Assuming slop count of 0, while the intention is to match lat/long/when in
that order, could it possibly match long/when/lat, or when/lat/long?  Does
PhraseQuery enforce order and starting point as well?

Assuming all of this, how does range query come into play?  Or could the
PhraseQuery portion be applied as a filter?



On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: OK, you lost me.  It sounds as if this PhraseQuery-ish approach involves
: breaking datetime and lat/long values into pieces, and evaluation occurs
: with positioning.  Is that accurate?

i'm not sure what you mean by pieces ... the idea is that you would have a
single latitude field and a single longitude field and a single when
field, and if an item had a single event, you would store a single value
in each field ... but if the item has multiple events, you would store
them in the same relative ordering, and then use the same kind of logic
PhraseQuery uses to verify that if the latitude field has a value in the
right range, and the longitude field has a value in the right range, and
the when field has a value in the right range, then all of those values
have the same position (specifically: are within a set amount of slop of
each other, which you would always set to 0)

:  It seems like this could even be done in the same field if one had a
:  query type that allowed querying for tokens at the same position.
:  Just index "_noun" at the same position as "house" (and make sure
:  there can't be collisions between real terms and markers via escaping,
:  or use \0 instead of "_", etc).

true ... but the point doug made way back when is that with a generalized
multi-field phrase query you wouldn't have to do that escaping ... the
hard part in this case is the numeric ranges.


-Hoss




Re: Solr graduates and joins Lucene as sub-project

2007-01-17 Thread Yonik Seeley

On 1/17/07, Paul Borgermans [EMAIL PROTECTED] wrote:

Congratulations Yonik and the Solr team!

I just got started playing with Solr (having done all with raw Lucene and
Java object caches only until now)

Too bad I can't reach the issue tracker now, as I want to contribute a PHP
responsewriter to Solr. This work is also a start for a set of generic
classes (first release within a few weeks I guess) to be used in PHP apps
and frameworks.


Cool, can't wait to see it!  I bet some of the guys at the upcoming
code4lib pre-conference that Erik is leading
(http://code4lib.org/node/139) will appreciate more PHP support too.

-Yonik