Re: Solr like for autocomplete field?

2010-11-03 Thread Amit Nithian
I implemented the edge ngrams solution and it's an awesome one
compared to any other I could think of, because I can index more
than just text (other metadata) that can be used to *rank* the
autocomplete results, eventually getting to ranking by the probability of
selection, which is, after all, what you want to maximize with
such systems.
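For reference, a minimal edge n-gram field type sketch (field and type names are illustrative, not from this thread, and the gram sizes would need tuning):

<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="city_ac" type="text_edge" indexed="true" stored="true"/>

With this, a query against city_ac for the typed prefix ("new") matches the indexed grams of "new york", "new hampshire", etc.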


On Tue, Nov 2, 2010 at 6:30 PM, Lance Norskog goks...@gmail.com wrote:
 And the SpellingComponent.

 There's nothing to help you with phrases.

 On Tue, Nov 2, 2010 at 11:21 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Also, you might want to consider TermsComponent, see:

 http://wiki.apache.org/solr/TermsComponent
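 A prefix lookup through that component might look like this (a sketch, assuming a /terms request handler is configured as in the example solrconfig.xml):

 http://localhost:8983/solr/db/terms?terms=true&terms.fl=city&terms.prefix=new&terms.limit=10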

 Also, note that there's an autosuggest component that's recently been
 committed.

 Best
 Erick

 On Tue, Nov 2, 2010 at 1:56 PM, PeterKerk vettepa...@hotmail.com wrote:


 I have a city field. Now when a user starts typing in a city textbox I want
 to return found matches (like Google).

 So for example, the user types "new", and I will return "new york", "new
 hampshire" etc.

 my schema.xml

 <field name="city" type="string" indexed="true" stored="true"/>

 my current url:


 http://localhost:8983/solr/db/select/?indent=on&facet=true&q=*:*&start=0&rows=25&fl=id&facet.field=city&fq=city:new


 Basically 2 questions here:
 1. Is the URL I'm using best practice when implementing autocomplete?
 What I wanted to do is use the facets for found matches.
 2. How can I match PART of the city name just like the SQL LIKE command,
 cityname LIKE '%userinput'?


 Thanks!
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-like-for-autocomplete-field-tp1829480p1829480.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com



Re: Solr like for autocomplete field?

2010-11-03 Thread truebner
Have a look at ajax-solr
http://evolvingweb.github.com/ajax-solr/

in the tutorial there is an example of an autocompletion widget.

tob



From:
Amit Nithian anith...@gmail.com
To:
solr-user@lucene.apache.org
Date:
03.11.2010 07:36
Subject:
Re: Solr like for autocomplete field?



I implemented the edge ngrams solution and it's an awesome one
compared to any other that I could think of because I can index more
than just text (other metadata) that can be used to *rank* the
autocomplete results eventually getting to rank by the probability of
selection which is, after all, what you want to try and maximize with
such systems.


On Tue, Nov 2, 2010 at 6:30 PM, Lance Norskog goks...@gmail.com wrote:
 And the SpellingComponent.

 There's nothing to help you with phrases.

 On Tue, Nov 2, 2010 at 11:21 AM, Erick Erickson 
erickerick...@gmail.com wrote:
 Also, you might want to consider TermsComponent, see:

 http://wiki.apache.org/solr/TermsComponent

 Also, note that there's an autosuggestcomponent, that's recently been
 committed.

 Best
 Erick

 On Tue, Nov 2, 2010 at 1:56 PM, PeterKerk vettepa...@hotmail.com 
wrote:


 I have a city field. Now when a user starts typing in a city textbox I 
want
 to return found matches (like Google).

 So for example, user types new, and I will return new york, new
 hampshire etc.

 my schema.xml

 <field name="city" type="string" indexed="true" stored="true"/>

 my current url:


 
http://localhost:8983/solr/db/select/?indent=on&facet=true&q=*:*&start=0&rows=25&fl=id&facet.field=city&fq=city:new



 Basically 2 questions here:
 1. is the url Im using the best practice when implementing 
autocomplete?
 What I wanted to do, is use the facets for found matches.
 2. How can I match PART of the cityname just like the SQL LIKE 
command,
 cityname LIKE '%userinput'


 Thanks!
 --
 View this message in context:
 
http://lucene.472066.n3.nabble.com/Solr-like-for-autocomplete-field-tp1829480p1829480.html

 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com





_

Sachsen DV Betriebs- und Servicegesellschaft mbH
Täubchenweg 26
04317 Leipzig
Amtsgericht Leipzig, HRB 18545

Geschäftsführer: Herbert Roller Brandão, Dr. Jean-Michael Pfitzner

Aufsichtsratsvorsitzender: Sven Petersen

RE: Searching Across Multiple Cores

2010-11-03 Thread Lohrenz, Steven
Sorry about the late response to this, but I was on holiday.

No, as of right now there is not the same schema in each shard. 

I need to be able to search a set of data resources with manually defined data 
fields. All of those fields are searchable. 

Any one of these resources can be added to an individual's favourites list with 
the possibility of them adding additional tags, which are also searchable. The 
favourites folder needs to be searchable on all the same fields as the main 
data set and on the additional user defined tags. 

Search fields for the main data schema are:
resourceId
resourceType
resourceGradeLevel
resourceKeywords
resourceLength
resourceSubjectArea
and about 30 more fields

The searchable fields for the My Favourites schema are:
userId
userFolder
userDefinedGradeLevel
userDefinedTags
plus all of those in the main data set. 

Search queries:
1. Search the main data set for all those resources with keyword 'foo'.
2. Search the main data set for all those resources with keyword 'foo' and are 
for grade 3. 
3. Search the main data set for all those resources with subject area of 
'grammar'.
4. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the keyword 'foo'. 
5. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the keyword 'foo' and are for grade 3 and are in the folder 
'testing'. 
6. Search My Favourites folder for all the resources I have moved there (userId 
= 12321) with the subject area of 'grammar' and I have tagged with 
'interesting'. 
7. Various combinations of the above. 

The simplest way I came up with to do this is to have 2 separate schemas. One 
for the main data set and one for My Favourites. When someone adds a resource 
from the main data set to their My Favourites folder, all the data from the main 
data set is copied over to the My Favourites schema, and the userId, folder and 
other user-specific information is added as well. 

But there could be 1 million copies of basically the same data in the My 
Favourites (if 1 million users add the same resource to their favourites). I 
thought that would waste a lot of space, so was looking for another way to do 
this (using a type of join - see below). Are there any other possibilities?

Cheers,
Steve

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: 14 October 2010 18:58
To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

The point/use-case of sharding/distributed search is for performance, 
not for segregating different data in different places. Distributed 
search assumes the same schema in each shard -- do you have that?

I don't think distributed search means to support the kind of joining 
you describe, that's not really what Solr does.

But if you actually do have the same schema across your shards, and 
have distributed search set up properly -- then you don't need to do any 
special joining, the shards end up forming one 'logical' index, that's 
the point of it.  I don't think you can do what you describe. Solr 
doesn't do joins like an rdbms, Solr works on a single set of 
documents, not multiple tables or collections. 

If you describe your data and the kind of queries you want to run, 
someone might be able to figure out a way to de-normalize the data to 
support what you want to do.  Which won't really have anything to do 
with shards/distributed search -- you add in distributed search for 
performance or giant-size-of-index purposes, but it doesn't change your 
schema design or queries.

Lohrenz, Steven wrote:
 Ken, 

 Ok, I understand how the distributed search works, but I don't understand how 
 to build my query appropriately so that the results returned from the two 
 shards only return values that exist in both result sets. 

 In essence, I'm doing a join across the two shards on the resourceId. 

 So Core0 has:
 resourceId (unique key)
 title 
 tag1
 tag2 
 tag3

 And Core1 has:
 resourceId + folder + userId + grade (concatenated - this is the uniqueId)
 resourceId
 folder
 userId
 grade

 For example, I would want to find all the content with userId = 893489 and 
 tag1 = 'contentTagX'. 

 My thought of how to do this is to search Core1 for all the items with userId 
 = 893489. This would return a set of results for that user with resourceId. 
 Then I would need to search Core0 for where tag1 = 'contentTagX' and where 
 resourceId = those returned in the result set from Core1. 

 I can probably do this in a search handler (say Core3 with a mashup of the 2 
 schemas but just redirects to the other shards), but is there an easier way 
 to do it?

 Or am I missing something?

 Thanks for your help,
 Steve


 -Original Message-
 From: Ken Stanley [mailto:doh...@gmail.com] 
 Sent: 14 October 2010 18:19
 To: solr-user@lucene.apache.org
 Subject: Re: Searching Across Multiple Cores

 Steve,

 Using shards is actually quite simple; it's just a matter of setting 

Re: Updating last_modified field when using DIH

2010-11-03 Thread Stefan Matheis
Juan,

that's correct .. solr will not touch your database, that's part of your
application code. solr uses an update timestamp (which is available
through dataimporter.last_index_time).

so, imagine the following situation: solr import runs every 10 minutes .. last
run at 11:00, your entity gets updated at 11:03, the next solr run at 11:10 will
detect this as changed, import the entity and run again at 11:20 .. then, no
entity will match the delta-query, because solr will ask for a
modification_date > 11:10 (the time of the last solr run).

you'll only need to update the last_modified field (in your application)
when the entity is changed and you want solr to (re-)index your data.

HTH,
Stefan

On Tue, Nov 2, 2010 at 7:35 PM, Juan Manuel Alvarez naici...@gmail.comwrote:

 Hello everyone!

 I would like to ask you a question about DIH and delta import.

 I am trying to sync Solr with a PostgreSQL database and I have a field
 ent_lastModified of type timestamp without timezone.

 Here is my xml file:

 <dataConfig>
    <dataSource name="jdbc" driver="org.postgresql.Driver"
        url="jdbc:postgresql://host" user="XXX" password="XXX" readOnly="true"
        autoCommit="false"
        transactionIsolation="TRANSACTION_READ_COMMITTED"
        holdability="CLOSE_CURSORS_AT_COMMIT"/>
    <document>
        <entity name='myEntity' dataSource='jdbc' pk='id'
            query='SELECT * FROM Entities'
            deltaImportQuery='SELECT ent_id AS id FROM Entities WHERE ent_id=${dataimporter.delta.id}'
            deltaQuery='SELECT ent_id AS id FROM Entities WHERE ent_lastModified &gt; &#39;${dataimporter.last_index_time}&#39;'>
        </entity>
    </document>
 </dataConfig>

 Full-import works fine, and when I run a delta-import I get the corresponding
 records, but the ent_lastModified field stays the same, so if I make another
 delta-import the same records are retrieved.

 I have read all the documentation at
 http://wiki.apache.org/solr/DataImportHandler but I could not find an
 update query for the last_modified field and Solr does not seem to
 do this automatically.
 I have also tried to name the field last_modified as in the example,
 but its value remains unchanged after a delta-import.

 Can anyone point me in the right direction?

 Thanks in advance!
 Juan M.



RE: Updating last_modified field when using DIH

2010-11-03 Thread Ephraim Ofir
Also, your deltaImportQuery should be:
deltaImportQuery='SELECT * FROM Entities WHERE
ent_id=${dataimporter.delta.id}'

Otherwise you're just importing the ids and not the rest of the data.
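With that correction, the entity definition from the config earlier in this
thread would look roughly like this (a sketch, untested):

<entity name='myEntity' dataSource='jdbc' pk='id'
    query='SELECT * FROM Entities'
    deltaImportQuery='SELECT * FROM Entities WHERE ent_id=${dataimporter.delta.id}'
    deltaQuery='SELECT ent_id AS id FROM Entities WHERE ent_lastModified &gt; &#39;${dataimporter.last_index_time}&#39;'>
</entity>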

If performance is important to you, you might also want to check out
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3
c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com
%3E

Ephraim Ofir


-Original Message-
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] 
Sent: Wednesday, November 03, 2010 12:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Updating last_modified field when using DIH

Juan,

that's correct .. solr will not touch your database, that's part of your
application-code. solr uses an updated timestamp (which is available
through dataimporter.last_index_time).

so, imagine the following situation: solr import runs every 10 minutes ..
last
run at 11:00, your entity gets updated at 11:03, next solr-run at 11:10
will
detect this as changed, import the entity and run again at 11:20 ..
then, no
entity will match the delta-query because solr will ask for a
modification_date > 11:10 (last solr-run at this time).

you'll only need to update the last_modified field (in your application)
when the entity is changed and you want solr to (re-)index your data.

HTH,
Stefan

On Tue, Nov 2, 2010 at 7:35 PM, Juan Manuel Alvarez
naici...@gmail.comwrote:

 Hello everyone!

 I would like to ask you a question about DIH and delta import.

 I am trying to sync Solr with a PostgreSQL database and I have a field
 ent_lastModified of type timestamp without timezone.

 Here is my xml file:

 <dataConfig>
    <dataSource name="jdbc" driver="org.postgresql.Driver"
        url="jdbc:postgresql://host" user="XXX" password="XXX" readOnly="true"
        autoCommit="false"
        transactionIsolation="TRANSACTION_READ_COMMITTED"
        holdability="CLOSE_CURSORS_AT_COMMIT"/>
    <document>
        <entity name='myEntity' dataSource='jdbc' pk='id'
            query='SELECT * FROM Entities'
            deltaImportQuery='SELECT ent_id AS id FROM Entities WHERE ent_id=${dataimporter.delta.id}'
            deltaQuery='SELECT ent_id AS id FROM Entities WHERE ent_lastModified &gt; &#39;${dataimporter.last_index_time}&#39;'>
        </entity>
    </document>
 </dataConfig>

 Full-import works fine, but when I run a delta-import the
 ent_lastModified field, I get the corresponding records, but the
 ent_lastModified stays the same, so if I make another delta-import,
 the same records are retreived.

 I have read all the documentation at
 http://wiki.apache.org/solr/DataImportHandler but I could not find an
 update query for the last_modified field and Solr does not seem to
 do this automatically.
 I have also tried to name the field last_modified as in the example,
 but its value keeps unchanged after a delta-import.

 Can anyone point me in the right direction?

 Thanks in advance!
 Juan M.



Re: Query question

2010-11-03 Thread Ahmet Arslan
 My impression was that
 
 city:Chicago^10 +Romantic +View
 
 would do what you want (with the standard lucene query
 parser and default operator OR), and I'm not sure about
 this, but I have a feeling that the version with Boolean
 operators AND/OR and parens might actually net out to the
 same thing, since under the hood all the terms have to be
 translated into optional, required or forbidden:
 lucene doesn't actually have true binary boolean
 operators.  At least that was the impression I got
 after some discussion at a recent conference.  I may
 have misunderstood - if so, could someone who knows set me
 straight?

Yes, you are completely right. If the default operator is set to OR, your query 
would do the trick. And it is better to use, and think in terms of, the unary 
operators.





Re: Query question

2010-11-03 Thread kenf_nc

Unfortunately the default operator is set to AND and I can't change that at
this time. 

If I do  (city:Chicago^10 OR Romantic OR View) it returns way too many
unwanted results.
If I do (city:Chicago^10 OR (Romantic AND View)) it returns less unwanted
results, but still a lot.
iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:*
-city:Chicago))) does seem to work. Chicago results are at the top, and the
remaining results seem to fit the other search parameters. It's an ugly
query, but does seem to do the trick for now until I master Dismax.
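For what it's worth, a dismax version of the same intent might look something
like this (a sketch; qf=text assumes a default search field named text):
q=Romantic View&defType=dismax&qf=text&bq=city:Chicago^10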

Thanks all!

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-question-tp1828367p1834793.html
Sent from the Solr - User mailing list archive at Nabble.com.


Core status uptime and startTime

2010-11-03 Thread Marc Sturlese

As far as I know, in the core admin page you can find the last time an index
was modified and committed by checking lastModified.
But what do startTime and uptime mean?
Thanks in advance
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Core-status-uptime-and-startTime-tp1834806p1834806.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query question

2010-11-03 Thread Mike Sokolov

Another alternative (prettier to my eye), would be:

(city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)


-Mike



On 11/03/2010 09:28 AM, kenf_nc wrote:

Unfortunately the default operator is set to AND and I can't change that at
this time.

If I do  (city:Chicago^10 OR Romantic OR View) it returns way too many
unwanted results.
If I do (city:Chicago^10 OR (Romantic AND View)) it returns less unwanted
results, but still a lot.
iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:*
-city:Chicago))) does seem to work. Chicago results are at the top, and the
remaining results seem to fit the other search parameters. It's an ugly
query, but does seem to do the trick for now until I master Dismax.

Thanks all!

   


Corename after Swap in MultiCore

2010-11-03 Thread sivaram

Hi everyone,

Long question, but please hold on. I'm using a multicore Solr instance to
index different documents from different sources (around 4) and I'm using a
common config for all the cores. For each source I have a core and a temp
core, like 'doc' and 'doc-temp'. Every time I want to get new data, I do a
dataimport to the temp core and then swap the cores. For swapping I'm using
the postCommit event listener to make sure the swap is done after the
commit completes.

After the first swap, when I use solr.core.name on doc-temp it returns doc as
its name (because the commit is done on doc's data dir after the first swap).
How do I get the core name of doc-temp here in order to swap again with
.swap?
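For reference, a core swap is normally issued through the CoreAdmin handler; a
sketch (host, port and core names as used in this thread):

http://localhost:8983/solr/admin/cores?action=SWAP&core=doc&other=doc-temp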

I'm stuck here. Please help me. Also, does anyone know for sure whether, if a
dataimport is being run on a core, the next swap request will be executed only
after that dataimport is finished?

Thanks in advance.
Ram.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Corename-after-Swap-in-MultiCore-tp1835325p1835325.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Query question

2010-11-03 Thread cbennett
Another option is to override the default operator in the query.

{!lucene q.op=OR}city:Chicago^10 +Romantic +View

Colin.

 -Original Message-
 From: Mike Sokolov [mailto:soko...@ifactory.com]
 Sent: Wednesday, November 03, 2010 9:42 AM
 To: solr-user@lucene.apache.org
 Cc: kenf_nc
 Subject: Re: Query question
 
 Another alternative (prettier to my eye), would be:
 
 (city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)
 
 
 -Mike
 
 
 
 On 11/03/2010 09:28 AM, kenf_nc wrote:
  Unfortunately the default operator is set to AND and I can't change
 that at
  this time.
 
  If I do  (city:Chicago^10 OR Romantic OR View) it returns way too
 many
  unwanted results.
  If I do (city:Chicago^10 OR (Romantic AND View)) it returns less
 unwanted
  results, but still a lot.
  iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:*
  -city:Chicago))) does seem to work. Chicago results are at the top,
 and the
  remaining results seem to fit the other search parameters. It's an
 ugly
  query, but does seem to do the trick for now until I master Dismax.
 
  Thanks all!
 
 





Re: Influencing scores on values in multiValue fields

2010-11-03 Thread Jonathan Rochkind
Be careful of multi-term queries and String types.   By multi-term here, 
I mean multi-term according to the 'pre-tokenization' that dismax and 
standard parsers do -- basically on whitespace.  If you have a string 
with whitespace as a single (non-tokenized field) in a Solr String type, 
and you have a q that is that identical string (with whitespace, but NOT 
enclosed in phrase quotes) -- it still won't match.  Because of the 
pre-tokenization-on-whitespace that the query parsers do.


It WILL still match if you put the q in double quotes for a phrase. And 
it WILL still match for a dismax pf phrase boost.  But it will not match 
a dismax qf field, or a standard query parser fielded q search.


This makes this approach to solving the problem not always do what you'd 
like. I haven't figured out a better one though. With dismax, if you 
include it both as a boosted field in qf (which will match on 
single-term queries, but not on queries with whitespace) AND as a 
boosted field in pf (which will match on queries with whitespace, but 
wont' be used at all for queries without whitespace, as dismax doesn't 
even bring the pf into play unless the pre-tokenization comes up with 
more than one term) -- it seems to mostly do what you'd want.  An 
alternate strategy might be trying to use it as a dismax bq query, since 
you can tell bq to use an alternate query parser (for example !field or 
!raw) that won't do the pre-tokenization.
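A sketch of that qf/pf combination (field names and boosts are illustrative,
not from this thread): defType=dismax&qf=text myExactField^5&pf=myExactField^10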


Imran wrote:

Thanks Mike for your suggestion. It did take me down the correct route. I
basically created another multiValue field of type 'string' and boosted
that. To get the partial matches to avoid length normalisation, I set the
'text' type multiValue field to omitNorms. The results look as expected
so far with this configuration.
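A minimal schema.xml sketch of that kind of setup (field names are illustrative, not from this thread):

<field name="tags" type="text" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="tags_exact" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="tags" dest="tags_exact"/>

The exact-match string field can then be boosted higher than the tokenized one at query time.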

Cheers
-- Imran

On Fri, Oct 29, 2010 at 1:09 PM, Michael Sokolov soko...@ifactory.comwrote:

  

How about creating another field for doing exact matches (a string);
searching both and boosting the string match?

-Mike



-Original Message-
From: Imran [mailto:imranboho...@gmail.com]
Sent: Friday, October 29, 2010 6:25 AM
To: solr-user@lucene.apache.org
Subject: Influencing scores on values in multiValue fields

Hi All

We've got an index in which we have a multiValued field per document.

Assume the multivalue field values in each document to be;

Doc1:
bar lifters

Doc2:
truck tires
back drops
bar lifters

Doc 3:
iron bar lifters

Doc 4:
brass bar lifters
iron bar lifters
tire something
truck something
oil gas

Now when we search for 'bar lifters' the expectation (based on the
requirements) is that we get results in the order of Doc1, Doc2, Doc4 and Doc3:
Doc 1 - since there's an exact match (and only one) for the search terms
Doc 2 - since there's an exact match amongst the values
Doc 4 - since there's a partial match on the values, but the number of
matches is more than Doc 3
Doc 3 - since there's a partial match

However, the results come out as Doc1, Doc3, Doc2, Doc4.
Looking at the explanation of the result it appears Doc 2 is
losing to Doc3 and Doc 4 is losing to Doc3 based on length
normalisation.

We think we can see the reason for that - the field length in
doc2 is greater than in doc3, and doc4's is greater than doc3's.
However, is there any mechanism by which I can force doc2 to beat doc3
and doc4 to beat doc3 with this structure?

We did look at using omitNorms=true, but that messes up the
scores for all docs. The result comes out as Doc4, Doc1,
Doc2, Doc3 (where Doc1, Doc2 and
Doc3 get the same score).
This is because the fieldNorm is not taken into account anymore (as
expected) and the term frequency becomes the only contributing
factor. So trying to avoid length normalisation through
omitNorms is not helping.

Is there any way we can make an exact match of a
value in a multiValue field add to the overall score
whilst keeping the length normalisation?

Hope that makes sense.

Cheers
-- Imran

  



  


Re: Searching Across Multiple Cores

2010-11-03 Thread Jonathan Rochkind
Basically, Solr doesn't do that. It seems to be a frequent topic on the 
listserv, people wanting Solr to be able to do something like that. But, 
as far as I know, it doesn't -- and I don't have a good idea of 
alternate ways to solve that kind of problem either.


Try putting everything in the same core; that's the general answer.

Solr shard distribution is designed for performance scaling, not for 
accomplishing join-like behavior across two different schemas; the 
distribution/shard thing isn't going to get you that.


Lohrenz, Steven wrote:
Sorry about the late response to this, but was on holidays. 

No, as of right now there is not the same schema in each shard. 

I need to be able to search a set of data resources with manually defined data fields. All of those fields are searchable. 

Any one of these resources can be added to an individual's favourites list with the possibility of them adding additional tags, which are also searchable. The favourites folder needs to be searchable on all the same fields as the main data set and on the additional user defined tags. 


Search fields for the main data schema are:
resourceId
resourceType
resourceGradeLevel
resourceKeywords
resourceLength
resourceSubjectArea
and about 30 more fields

The searchable fields for the My Favourites schema are:
userId
userFolder
userDefinedGradeLevel
userDefinedTags
plus all of those in the main data set. 


Search queries:
1. Search the main data set for all those resources with keyword 'foo'.
2. Search the main data set for all those resources with keyword 'foo' and are for grade 3. 
3. Search the main data set for all those resources with subject area of 'grammar'.
4. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo'. 
5. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo' and are for grade 3 and are in the folder 'testing'. 
6. Search My Favourites folder for all the resources I have moved there (userId = 12321) with the subject area of 'grammar' and I have tagged with 'interesting'. 
7. Various combinations of the above. 

The simplest way I came up with to do this is to have 2 separate schemas. One for the main data set and one for My Favourites. When someone adds a resource from the main data set to their My Favourites folder all the data from the main data set is copied over the My Favourites schema and the userId, folder and other user specific information is added also. 


But there could be 1 million copies of basically the same data in the My 
Favourites (if 1 million users add the same resource to their favourites). I 
thought that would waste a lot of space, so was looking for another way to do 
this (using a type of join - see below). Are there any other possibilities?

Cheers,
Steve

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: 14 October 2010 18:58

To: solr-user@lucene.apache.org
Subject: Re: Searching Across Multiple Cores

The point/use-case of sharding/distributed search is for performance, 
not for segregating different data in different places. Distributed 
search assumes the same schema in each shard -- do you have that?


I don't think distributed search means to support the kind of joining 
you describe, that's not really what Solr does.


But if you actually do have the same schema accross your shards, and 
have distributed search set up properly -- then you don't need to do any 
special joining, the shards end up forming one 'logical' index, that's 
the point of it.  I don't think you can do what you describe. Solr 
doesn't do joins like an rdbms, Solr works on a single set of 
documents, not multiple tables or collections. 

If you describe your data and the kind of queries you want to run, 
someone might be able to figure out a way to de-normalize the data to 
support what you want to do.  Which won't really have anything to do 
with shards/distributed search -- you add in distributed search for 
performance or giant-size-of-index purposes, but it doesn't change your 
schema design or queries.


Lohrenz, Steven wrote:
  
Ken, 

Ok, I understand how the distributed search works, but I don't understand how to build my query appropriately so that the results returned from the two shards only return values that exist in both result sets. 

In essence, I'm doing a join across the two shards on the resourceId. 


So Core0 has:
resourceId (unique key)
title 
tag1
tag2 
tag3


And Core1 has:
resourceId + folder + userId + grade (concatenated - this is the uniqueId)
resourceId
folder
userId
grade

For example, I would want to find all the content with userId = 893489 and tag1 = 'contentTagX'. 

My thought of how to do this is to search Core1 for all the items with userId = 893489. This would return a set of results for that user with resourceId. Then I would need to search Core0 for where tag1 = 'contentTagX' and where resourceId = those returned in the 

Re: Possible memory leaks with frequent replication

2010-11-03 Thread Jonathan Rochkind
I hadn't looked at the code, am not familiar with Solr code, and can't 
say what that code does.


But I have experienced issues that I _believe_ were caused by too 
frequent commits causing overlapping searcher preparation. And I've 
definitely seen Solr documentation that suggests this is an issue. Let 
me find it now to see if the experts think these documented suggestions are 
still correct or not:


On the other hand, autowarming (populating) a new collection could take 
a lot of time, especially since it uses only one thread and one CPU. If 
your settings fire off snapinstaller too frequently, then a Solr slave 
could be in the undesirable condition of handing-off queries to one 
(old) collection, and, while warming a new collection, a second “new” 
one could be snapped and begin warming!


If we attempted to solve such a situation, we would have to invalidate 
the first “new” collection in order to use the second one, then when a 
“third” new collection would be snapped and warmed, we would have to 
invalidate the “second” new collection, and so on ad infinitum. A 
completely warmed collection would never make it to full term before it 
was aborted. This can be prevented with a properly tuned configuration 
so new collections do not get installed too rapidly. 


http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs

I think I've seen that same advice on another wiki page, not specifically 
about replication, but just about commit 
frequency balanced with auto-warming, leading to overlapping warming, 
leading to spiraling RAM/CPU usage -- but NOT an exception being thrown 
or HTTP error delivered.


I can't find it on the wiki, but here's a listserv post with someone 
reporting findings that match my understanding: 
http://osdir.com/ml/solr-user.lucene.apache.org/2010-09/msg00528.html


How does this advice square with the code Lance found?  Is my 
understanding of how frequent commits can interact with the time it takes to 
warm a new collection correct? Appreciate any additional info.





Lance Norskog wrote:

Isn't that what this code does?

  onDeckSearchers++;
  if (onDeckSearchers < 1) {
    // should never happen... just a sanity check
    log.error(logid+"ERROR!!! onDeckSearchers is " + onDeckSearchers);
    onDeckSearchers=1;  // reset
  } else if (onDeckSearchers > maxWarmingSearchers) {
    onDeckSearchers--;
    String msg="Error opening new searcher. exceeded limit of
maxWarmingSearchers="+maxWarmingSearchers+", try again later.";
    log.warn(logid+""+ msg);
    // HTTP 503==service unavailable, or 409==Conflict
    throw new
SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,msg,true);
  } else if (onDeckSearchers > 1) {
    log.info(logid+"PERFORMANCE WARNING: Overlapping
onDeckSearchers=" + onDeckSearchers);
  }
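
For context, the limit that code enforces is set in solrconfig.xml; a minimal
sketch (the value shown is just the common example default):

<maxWarmingSearchers>2</maxWarmingSearchers>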


On Tue, Nov 2, 2010 at 10:02 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
  

It's definitely a known 'issue' that you can't replicate (or do any other
kind of index change, including a commit) at a faster frequency than your
warming queries take to complete, or you'll wind up with something like
you've seen.

It's in some documentation somewhere I saw, for sure.

The advice to 'just query against the master' is kind of odd, because,
then... why have a slave at all, if you aren't going to query against it?  I
guess just for backup purposes.

But even with just one solr, or querying master, if you commit at rate such
that commits come before the warming queries can complete, you're going to
have the same issue.

The only answer I know of is Don't commit (or replicate) at a faster rate
than it takes your warming to complete.  You can reduce your warming
queries/operations, or reduce your commit/replicate frequency.

Would be interesting/useful if Solr noticed this going on, and gave you some
kind of error in the log (or even an exception when started with a certain
parameter for testing) Overlapping warming queries, you're committing too
fast or something. Because it's easy to make this happen without realizing
it, and then your Solr does what Simon says, runs out of RAM and/or uses a
whole lot of CPU and disk io.

Lance Norskog wrote:


You should query against the indexer. I'm impressed that you got 5s
replication to work reliably.

On Mon, Nov 1, 2010 at 4:27 PM, Simon Wistow si...@thegestalt.org wrote:

  

We've been trying to get a setup in which a slave replicates from a
master every few seconds (ideally every second but currently we have it
set at every 5s).

Everything seems to work fine until, periodically, the slave just stops
responding from what looks like it running out of memory:

org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: Java heap space


(our monitoring seems to confirm this).

Looking around my suspicion is that it takes new Readers longer to warm
than 

Re: Possible memory leaks with frequent replication

2010-11-03 Thread Jonathan Rochkind
Ah, but reading Peter's email message I reference more carefully, it 
seems that Solr already DOES provide an info-level log warning about 
overlapping warming, awesome. (But again, I'm pretty sure it does NOT 
throw an exception or return an HTTP error in that condition, based on my and 
others' experience.)



 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

Sweet, good to know, and I'll definitely add this to my debugging 
toolbox. Peter's listserv message really ought to be a wiki page, I 
think.  Any reason for me not to just add it as a new one with title 
Commit frequency and auto-warming or something like that?  Unless it's 
already in the wiki somewhere I haven't found, assuming the wiki will 
let an ordinary user-created account add a new page.

//
Jonathan Rochkind wrote:
I hadn't looked at the code, am not familiar with Solr code, and can't 
say what that code does.


But I have experienced issues that I _believe_ were caused by too 
frequent commits causing overlapping searcher preparation. And I've 
definitely seen Solr documentation that suggests this is an issue. Let 
me find it now to see if the experts think these documented suggestions are 
still correct or not:


On the other hand, autowarming (populating) a new collection could take 
a lot of time, especially since it uses only one thread and one CPU. If 
your settings fire off snapinstaller too frequently, then a Solr slave 
could be in the undesirable condition of handing-off queries to one 
(old) collection, and, while warming a new collection, a second “new” 
one could be snapped and begin warming!


If we attempted to solve such a situation, we would have to invalidate 
the first “new” collection in order to use the second one, then when a 
“third” new collection would be snapped and warmed, we would have to 
invalidate the “second” new collection, and so on ad infinitum. A 
completely warmed collection would never make it to full term before it 
was aborted. This can be prevented with a properly tuned configuration 
so new collections do not get installed too rapidly. 


http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs

I think I've seen that same advice on another wiki page without being 
specifically regarding replication, but just being about commit 
frequency balanced with auto-warming, leading to overlapping warming, 
leading to spiraling RAM/CPU usage -- but NOT an exception being thrown 
or HTTP error delivered.


I can't find it on the wiki, but here's a listserv post with someone 
reporting findings that match my understanding: 
http://osdir.com/ml/solr-user.lucene.apache.org/2010-09/msg00528.html


How does this advice square with the code Lance found?  Is my 
understanding of how frequent commits can interact with time it takes to 
warm a new collection correct? Appreciate any additional info.





Lance Norskog wrote:
  

Isn't that what this code does?

  onDeckSearchers++;
  if (onDeckSearchers < 1) {
    // should never happen... just a sanity check
    log.error(logid+"ERROR!!! onDeckSearchers is " + onDeckSearchers);
    onDeckSearchers=1;  // reset
  } else if (onDeckSearchers > maxWarmingSearchers) {
    onDeckSearchers--;
    String msg="Error opening new searcher. exceeded limit of
maxWarmingSearchers="+maxWarmingSearchers+", try again later.";
    log.warn(logid+""+ msg);
    // HTTP 503==service unavailable, or 409==Conflict
    throw new
SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,msg,true);
  } else if (onDeckSearchers > 1) {
    log.info(logid+"PERFORMANCE WARNING: Overlapping
onDeckSearchers=" + onDeckSearchers);
  }


On Tue, Nov 2, 2010 at 10:02 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
  


It's definitely a known 'issue' that you can't replicate (or do any other
kind of index change, including a commit) at a faster frequency than your
warming queries take to complete, or you'll wind up with something like
you've seen.

It's in some documentation somewhere I saw, for sure.

The advice to 'just query against the master' is kind of odd, because,
then... why have a slave at all, if you aren't going to query against it?  I
guess just for backup purposes.

But even with just one solr, or querying master, if you commit at rate such
that commits come before the warming queries can complete, you're going to
have the same issue.

The only answer I know of is Don't commit (or replicate) at a faster rate
than it takes your warming to complete.  You can reduce your warming
queries/operations, or reduce your commit/replicate frequency.

Would be interesting/useful if Solr noticed this going on, and gave you some
kind of error in the log (or even an exception when started with a certain
parameter for testing) Overlapping warming queries, you're committing too
fast or something. Because it's easy to make this happen 

Re: A bug in ComplexPhraseQuery ?

2010-11-03 Thread jmr


iorixxx wrote:
 
 
 I added this change to SOLR-1604, can you test it give us feedback?
 
 

Hi,

Sorry for the delay.
We have tested the change and it is OK for this.

However, we have found that this query crashes when using
ComplexPhraseQuery:
"sulfur-reducing bacteria"

It is due to the dash inside the phrase.
Here is the trace:
java.lang.IllegalArgumentException: Unknown query type
org.apache.lucene.search.PhraseQuery found in phrase query string
sulfur-reducing bacteria
 at
org.apache.lucene.queryParser.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:290)
 at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
 at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
 at org.apache.lucene.search.Query.weight(Query.java:98)
 at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
 at org.apache.lucene.search.Searcher.search(Searcher.java:171)
 at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
...

Regards
Jean-Michel

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/A-bug-in-ComplexPhraseQuery-tp1744659p1835918.html
Sent from the Solr - User mailing list archive at Nabble.com.


Override SynonymFilterFactory to load synonyms from alternate data source

2010-11-03 Thread Will Milspec
Hi all,

Can anyone comment on the ease/merit of overriding the shipped
SynonymFilterFactory with a version that could load the synonyms from an
alternate data source?

Our application currently maintains synonyms in its database; we could
export this data to 'synonyms.txt', but would prefer a db-aware
implementation of SynonymFilterFactory, i.e. avoiding that middle step.

From the looks of the class (private instances, static methods), it doesn't
lend itself to easy subclassing.

Any comments or recommendations?

thanks

will


Re: Override SynonymFilterFactory to load synonyms from alternate data source

2010-11-03 Thread Ahmet Arslan
 Our application currently maintains synonyms in its
 database ; we could
 export this data to 'synonyms.txt', but would prefer a db
 aware
 implementationv of SynonymFilterFactory, i.e. avoiding that
 middle step.
 
 From the looks of the class (private instances, static
 methods), it doesn't
 lend itself to easy subclassing..

Just write your own DataBaseSynonymFilterFactory
that loads the synonyms from your db using your custom logic and then
constructs the SynonymFilter objects the way the existing factory does [1].

[1] http://search-lucene.com/m/Av4xC1PtNLW1/
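
Once such a factory exists, wiring it in is just a schema.xml filter
declaration; a sketch, assuming a hypothetical
com.example.DataBaseSynonymFilterFactory class and an illustrative jdbcUrl
attribute it would read:

<filter class="com.example.DataBaseSynonymFilterFactory" jdbcUrl="jdbc:postgresql://host/db" ignoreCase="true" expand="true"/>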


  


Negative or zero value for fieldNorm

2010-11-03 Thread Markus Jelsma
Hi all,

I've got a puzzling issue here. During tests I noticed a document at the 
bottom of the results where it should not be. I query using DisMax on the title 
and content fields and have a boost on title using qf. Out of 30 results, only 
two documents also have the term in the title.

Using debugQuery and fl=*,score I quickly noticed a large negative maxScore of 
the complete resultset and a portion of the resultset where scores sum up to 
zero because of a product with 0 (fieldNorm).

See below for debug output for a result with score = 0:

0.0 = (MATCH) sum of:
  0.0 = (MATCH) max of:
0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
  0.75658196 = queryWeight(content:kunstgrasveld), product of:
6.6516633 = idf(docFreq=33, maxDocs=9682)
0.113743275 = queryNorm
  0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
2.236068 = tf(termFreq(content:kunstgrasveld)=5)
6.6516633 = idf(docFreq=33, maxDocs=9682)
0.0 = fieldNorm(field=content, doc=7)
0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
  1.0 = tf(termFreq(title:kunstgrasveld)=1)
  8.791729 = idf(docFreq=3, maxDocs=9682)
  0.0 = fieldNorm(field=title, doc=7)

And one with a negative score:

3.0716116E-4 = (MATCH) sum of:
  3.0716116E-4 = (MATCH) max of:
3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
  0.75658196 = queryWeight(content:kunstgrasveld), product of:
6.6516633 = idf(docFreq=33, maxDocs=9682)
0.113743275 = queryNorm
  4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product 
of:
1.0 = tf(termFreq(content:kunstgrasveld)=1)
6.6516633 = idf(docFreq=33, maxDocs=9682)
6.1035156E-5 = fieldNorm(field=content, doc=1462)

There are no funky issues with term analysis for the text fieldType; in fact, 
the term passes through unchanged. I don't do omitNorms, I store termVectors, 
etc.

Because fieldNorm = fieldBoost / sqrt(numTermsForField), I suspect my input from 
Nutch is messed up. A fieldNorm can never be <= 0 for a normal positive boost, 
and field boosts should not be zero or negative (correct me if I'm wrong). But, 
since I can't yet figure out what field boosts Nutch sends to me, I thought I'd 
drop by on this mailing list first.

There are quite a few query terms that return with zero or negative scores and 
many that behave as I expect. I also find it a bit hard to comprehend why the 
docs with negative score rank higher in the result set than documents with 
zero score. Sorting defaults to score DESC, but this is perhaps another 
issue.

Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the hood. 
Help or directions are appreciated =)

Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


blacklist docs by uniqueKey

2010-11-03 Thread Ravi Kiran
Hello,
I have a single core servicing 3 different applications, and one of the
applications doesn't want some specific docs to show up (driven by Editorial
decision). Over a period of time the number of blacklisted docs could grow,
hence I do not want to restrict them in a query, as the query could get
extremely large. Is there a configuration option where we can blacklist ids
(uniqueKey) from showing up in results?

Is there anything similar to the ElevationComponent that demotes docs? This
could be ideal. I tried to look and see whether there was a boosting option in
the elevation component so that I could negatively boost certain docs, but could
not find any.

Can anybody kindly point me in the right direction.

Thanks

Ravi Kiran Bhaskar


Question about morelikethis and multiple fields

2010-11-03 Thread ahammad

Hello,

I'm trying to implement a Related Articles feature within my search
application using the mlt handler.

To give you a little background information, my Solr index contains a single
core that is created by merging 10+ other cores. Within this core is my main
data item known as an article; however, there are other data items like
technical documents, tickets, etc.

When a user opens an article on my web application, I want to show Related
Articles based on 2 fields (title and body). I am using SolrJ as a back-end
for this .

The way I'm thinking of doing it is to search on the title of the existing
article, and hope that the first hit is that actual article. This works in
most of the cases, but occasionally it grabs either the wrong article or a
different type of data item altogether (the first hit may be a technical
document, which is totally unrelated to articles). The following is my
query:

?qt=%2Fmlt&mlt.match.include=true&mlt.mindf=1&mlt.mintf=1&mlt.fl=title,body&q=search
string&fq=dataItem:article&debugQuery=true

One main thing I noticed is that this only seems to match on
the body field and not the title field. I think it's doing what it's
supposed to and I'm not fully grasping the idea of mlt.

So when it does the initial search to find the document against which it
will find related articles, what search handlers would it use? Normally, my
queries are carried out using dismax with some boosting functionality
applied to them. When I use the standard query handler however, with the qt
parameter defining mlt, what happens for the initial search?

Also, if anybody can suggest an alternative implementation to this I would
greatly appreciate it. Like I said, it's entirely possible that I don't
fully understand mlt and it's causing me to implement stuff in a weird way.

Thanks/

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-morelikethis-and-multiple-fields-tp1836778p1836778.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative or zero value for fieldNorm

2010-11-03 Thread Yonik Seeley
Regarding Negative or zero value for fieldNorm, I don't see any
negative fieldNorms here... just very small positive ones?

Anyway the fieldNorm is the product of the lengthNorm and the
index-time boost of the field (which is itself the product of the
index time boost on the document and the index time boost of all
instances of that field).  Index time boosts default to 1 though, so
they have no effect unless something has explicitly set a boost.

-Yonik
http://www.lucidimagination.com



On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi all,

 I've got some puzzling issue here. During tests i noticed a document at the
 bottom of the results where it should not be. I query using DisMax on title
 and content field and have a boost on title using qf. Out of 30 results, only
 two documents also have the term in the title.

 Using debugQuery and fl=*,score i quickly noticed large negative maxScore of
 the complete resultset and a portion of the resultset where scores sum up to
 zero because of a product with 0 (fieldNorm).

 See below for debug output for a result with score = 0:

 0.0 = (MATCH) sum of:
  0.0 = (MATCH) max of:
    0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
        2.236068 = tf(termFreq(content:kunstgrasveld)=5)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.0 = fieldNorm(field=content, doc=7)
    0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
      1.0 = tf(termFreq(title:kunstgrasveld)=1)
      8.791729 = idf(docFreq=3, maxDocs=9682)
      0.0 = fieldNorm(field=title, doc=7)

 And one with a negative score:

 3.0716116E-4 = (MATCH) sum of:
  3.0716116E-4 = (MATCH) max of:
    3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product
 of:
        1.0 = tf(termFreq(content:kunstgrasveld)=1)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        6.1035156E-5 = fieldNorm(field=content, doc=1462)

 There are no funky issues with term analysis for the text fieldType, in fact,
 the term passes through unchanged. I don't do omitNorms, i store termVectors
 etc.

 Because fieldNorm = fieldBoost / sqrt(numTermsForField) i suspect my input 
 from
 Nutch is messed up. A fieldNorm can never be = 0 for a normal positive boost
 and field boosts should not be zero or negative (correct me if i'm wrong). 
 But,
 since i can't yet figure out what field boosts Nutch sends to me i thought i'd
 drop by on this mailing list first.

 There are quite a few query terms that return with zero or negative scores and
 many that behave as i expect. I find it also a bit hard to comprehend why the
 docs with negative score rank higher in the result set than documents with
 zero score. Sorting defaults to score DESC,  but this is perhaps another
 issue.

 Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the hood.
 Help or directions are appreciated =)

 Cheers,

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: blacklist docs by uniqueKey

2010-11-03 Thread Erick Erickson
How dynamic is this list? Is it feasible to add a field to your docs like
blacklisteddocs, and at editorial's discretion add values to that field
like app1, app2?

At that point you can just filter them out via a filter query...
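
For instance, the application that should not see those docs could then send
something like this with every request (field and values are the illustrative
ones above): fq=-blacklisteddocs:app1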

Best
Erick

On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote:

 Hello,
I have a single core servicing 3 different applications, one of the
 application doesnt want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could grow,
 hence I do not want to restrict them in a query as it the query could get
 extremely large. Is there a configuration option where we can blacklist ids
 (uniqueKey) from showing up in results.

 Is there anything similar to EvelationComponent that demotes docs ? This
 could be ideal. I tried to look up and see if there was a boosting option
 in
 elevation component so that I could negatively boost certain docs but could
 not find any.

 Can anybody kindly point me in the right direction.

 Thanks

 Ravi Kiran Bhaskar



Re: blacklist docs by uniqueKey

2010-11-03 Thread Yonik Seeley
On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote:
 How dynamic is this list? Is it feasable to add a field to your docs like
 blacklisteddocs, and at editorial's discretion add values to that field
 like app1, app2?

 At that point you can just filter them out via a filter query...

Right, or a combination of the two approaches.
For a realtime approach, add the newest filters (say any filters added
that day) to a filter query, and roll those into a nightly reindex.
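A sketch of the real-time part, assuming the uniqueKey field is named id and
with made-up values: fq=-id:(doc123 OR doc456)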

-Yonik
http://www.lucidimagination.com


 Best
 Erick

 On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote:

 Hello,
        I have a single core servicing 3 different applications, one of the
 application doesnt want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could grow,
 hence I do not want to restrict them in a query as it the query could get
 extremely large. Is there a configuration option where we can blacklist ids
 (uniqueKey) from showing up in results.

 Is there anything similar to EvelationComponent that demotes docs ? This
 could be ideal. I tried to look up and see if there was a boosting option
 in
 elevation component so that I could negatively boost certain docs but could
 not find any.

 Can anybody kindly point me in the right direction.

 Thanks

 Ravi Kiran Bhaskar




How to display the synonyms

2010-11-03 Thread jayant

Hi. If the synonym.txt file defines the following:
castle,fort
I am able to match 'fort' when the user searches for 'castle'.
However, I would like to tell the user that 'castle' is a synonym for
'fort', for those users who may wonder why they got a different
search result when they were looking for 'castle'. Is there a way to get
that info when the search is made?
Thanks.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-display-the-synonyms-tp1837103p1837103.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative or zero value for fieldNorm

2010-11-03 Thread Markus Jelsma

 Regarding Negative or zero value for fieldNorm, I don't see any
 negative fieldNorms here... just very small positive ones?

Of course, you're right. The E-# got twisted in my mind and became negative. 
Silly me.

 Anyway the fieldNorm is the product of the lengthNorm and the
 index-time boost of the field (which is itself the product of the
 index time boost on the document and the index time boost of all
 instances of that field).  Index time boosts default to 1 though, so
 they have no effect unless something has explicitly set a boost.

I've just checked docs 7 and 1462 (resp. first and second in debug output 
below) with Luke. The title and content fields have no index time boosts, thus 
defaulting to 1.0f which is fine.

Then why does doc 7 have a fieldNorm of 0.0 on title (thus giving the doc a 0.0 
score in the result set), and why does doc 1462 have a very, very small 
fieldNorm?

debugOutput for doc 7:
0.0 = fieldNorm(field=title, doc=7)

Luke on the title field of doc 7:
<float name="boost">1.0</float>

Thanks for your reply!


 -Yonik
 http://www.lucidimagination.com
 
 
 
 On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma
 
 markus.jel...@openindex.io wrote:
  Hi all,
  
  I've got some puzzling issue here. During tests i noticed a document at
  the bottom of the results where it should not be. I query using DisMax
  on title and content field and have a boost on title using qf. Out of 30
  results, only two documents also have the term in the title.
  
  Using debugQuery and fl=*,score i quickly noticed large negative maxScore
  of the complete resultset and a portion of the resultset where scores
  sum up to zero because of a product with 0 (fieldNorm).
  
  See below for debug output for a result with score = 0:
  
  0.0 = (MATCH) sum of:
   0.0 = (MATCH) max of:
 0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
   0.75658196 = queryWeight(content:kunstgrasveld), product of:
 6.6516633 = idf(docFreq=33, maxDocs=9682)
 0.113743275 = queryNorm
   0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
 2.236068 = tf(termFreq(content:kunstgrasveld)=5)
 6.6516633 = idf(docFreq=33, maxDocs=9682)
 0.0 = fieldNorm(field=content, doc=7)
 0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
   1.0 = tf(termFreq(title:kunstgrasveld)=1)
   8.791729 = idf(docFreq=3, maxDocs=9682)
   0.0 = fieldNorm(field=title, doc=7)
  
  And one with a negative score:
  
  3.0716116E-4 = (MATCH) sum of:
   3.0716116E-4 = (MATCH) max of:
 3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product
  of: 0.75658196 = queryWeight(content:kunstgrasveld), product of:
  6.6516633 = idf(docFreq=33, maxDocs=9682)
 0.113743275 = queryNorm
   4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462),
  product of:
 1.0 = tf(termFreq(content:kunstgrasveld)=1)
 6.6516633 = idf(docFreq=33, maxDocs=9682)
 6.1035156E-5 = fieldNorm(field=content, doc=1462)
  
  There are no funky issues with term analysis for the text fieldType, in
  fact, the term passes through unchanged. I don't do omitNorms, i store
  termVectors etc.
  
  Because fieldNorm = fieldBoost / sqrt(numTermsForField) i suspect my
  input from Nutch is messed up. A fieldNorm can never be <= 0 for a
  normal positive boost and field boosts should not be zero or negative
  (correct me if i'm wrong). But, since i can't yet figure out what field
  boosts Nutch sends to me i thought i'd drop by on this mailing list
  first.
  
  There are quite a few query terms that return with zero or negative
  scores and many that behave as i expect. I find it also a bit hard to
  comprehend why the docs with negative score rank higher in the result
  set than documents with zero score. Sorting defaults to score DESC,  but
  this is perhaps another issue.
  
  Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the
  hood. Help or directions are appreciated =)
  
  Cheers,
  
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536600 / 06-50258350


Re: Question about morelikethis and multiple fields

2010-11-03 Thread ahammad

I don't quite understand what you mean by that. Did you mean the
TermVectorComponent?

Also, I did some more digging and I found some messages on this mailing list
about filtering. From what I understand, using the standard query handler
(solr/select/?q=...) with a qt parameter allows you to filter the initial
response using the fq parameter. While this is not a perfect solution for my
application, it would greatly reduce any errors that I may get in the data.
However, when I tried fq, it only filters the result set from the MLT handler,
not the initial response. I need to filter both the initial response and the
result set.
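For reference, the two request styles being compared here look roughly like this
(a hedged sketch; the field names and the /mlt registration are assumptions, and
which result set fq constrains differs between them, which is exactly the issue
described above):

Dedicated MoreLikeThisHandler, assuming one is registered at /mlt in solrconfig.xml:
http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,content&fq=source:appA

MoreLikeThis as a search component on the standard handler:
http://localhost:8983/solr/select?q=id:12345&mlt=true&mlt.fl=title,content&fq=source:appA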
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-morelikethis-and-multiple-fields-tp1836778p1837351.html
Sent from the Solr - User mailing list archive at Nabble.com.


Filter by relevance

2010-11-03 Thread Jason Brown
Is it possible to filter my search results by relevance? For example, anything 
below a certain value shouldn't be returned?

I also retrieve facet counts in my search queries, so it would be useful if the 
facet counts also respected the filter on the relevance.

Thank You.

Jason.



Re: blacklist docs by uniqueKey

2010-11-03 Thread Jonathan Rochkind
I don't believe there is, but it occurs to me that the additional 
feature that Tom Burton-West contemplates in the thread "filter query 
from external list of Solr unique IDs" could potentially address your 
problem too, if it existed. I think that feature could address a 
variety of problems; I've been thinking about it.


http://apache.markmail.org/message/etqwbv6piikaqgo5

Ravi Kiran wrote:

Hello,
I have a single core servicing 3 different applications; one of the
applications doesn't want some specific docs to show up (driven by Editorial
decision). Over a period of time the number of blacklisted docs could grow,
hence I do not want to restrict them in a query, as the query could get
extremely large. Is there a configuration option where we can blacklist ids
(uniqueKey) from showing up in results?

Is there anything similar to ElevationComponent that demotes docs? This
could be ideal. I tried to look up and see if there was a boosting option in
the elevation component so that I could negatively boost certain docs, but could
not find any.

Can anybody kindly point me in the right direction.

Thanks

Ravi Kiran Bhaskar

  


phrase boost on dismax query

2010-11-03 Thread Jason Brown

I have 3 fields in my index that I use in a dismax query with boosts and phrase 
boosts.

I've realised that there is one field I'm not really interested in at all, unless the 
search term appears in that field as a phrase.

Is it realistic to set the normal boost to zero for this field, but the phrase 
boost to something much higher, in order to achieve the desired effect?
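For concreteness, a hedged sketch of the kind of dismax parameters being described
(field names hypothetical):

defType=dismax
qf=title^2 description extrafield^0
pf=title^2 description extrafield^5

An alternative sometimes used is to drop extrafield from qf entirely and list it only
in pf; since pf only adds a phrase boost to documents that already match via qf, the
field then never causes a match on its own, which may or may not be the effect wanted
here.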

Thank You



Re: blacklist docs by uniqueKey

2010-11-03 Thread Jan Høydahl / Cominvent
How does the exclude="true" option in elevate.xml perform with a large number of 
excludes?
Then you could have a separate elevate config for that client.
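If that route is taken, a hedged sketch of wiring a per-client elevation config in
solrconfig.xml (component name, handler name and file name are hypothetical):

<searchComponent name="elevateAppA" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate-appA.xml</str>
</searchComponent>

<requestHandler name="/selectAppA" class="solr.SearchHandler">
  <arr name="last-components">
    <str>elevateAppA</str>
  </arr>
</requestHandler>

so the client that needs the exclusions queries /selectAppA while the other
applications keep using the default handler.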

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 3. nov. 2010, at 20.11, Yonik Seeley wrote:

 On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 How dynamic is this list? Is it feasable to add a field to your docs like
 blacklisteddocs, and at editorial's discretion add values to that field
 like app1, app2?
 
 At that point you can just filter them out via a filter query...
 
 Right, or a combination of the two approaches.
 For a realtime approach, add the newest filters (say any filters added
 that day) to a filter query, and roll those into a nightly reindex.
 
 -Yonik
 http://www.lucidimagination.com
 
 
 Best
 Erick
 
 On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote:
 
 Hello,
I have a single core servicing 3 different applications, one of the
 application doesnt want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could grow,
 hence I do not want to restrict them in a query as it the query could get
 extremely large. Is there a configuration option where we can blacklist ids
 (uniqueKey) from showing up in results.
 
 Is there anything similar to EvelationComponent that demotes docs ? This
 could be ideal. I tried to look up and see if there was a boosting option
 in
 elevation component so that I could negatively boost certain docs but could
 not find any.
 
 Can anybody kindly point me in the right direction.
 
 Thanks
 
 Ravi Kiran Bhaskar
 
 



Re: Filter by relevance

2010-11-03 Thread Ahmet Arslan
 Is it possible to filter my search
 results by relevance? For example, anything below a certain
 value shouldn't be returned?
 

http://search-lucene.com/m/4AHNF17wIJW1/
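For anyone experimenting with a hard cutoff anyway, the trick that is sometimes
suggested is a function-range filter over the query's own score, which, being a
filter query, would also constrain the facet counts (a hedged, untested sketch,
assuming the {!frange} parser and the query() function available in Solr 1.4):

q=some query
fq={!frange l=0.5}query($q)

As Erick points out further down the thread, absolute score values are not
comparable across queries, so picking the cutoff value is the hard part.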


  


RE: blacklist docs by uniqueKey

2010-11-03 Thread Andrew Cogan
A filter that could accept a list of SOLR document IDs as articulated by Tom
Burton-West would enable some important features for our application. So if
anyone is wondering if this would be a useful feature, consider this a yes
vote.


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, November 03, 2010 3:55 PM
To: solr-user@lucene.apache.org
Subject: Re: blacklist docs by uniqueKey

I don't believe there is, but it occurs to me that the additional 
feature that Tom Burton-West contemplates in the thread filter query 
from external list of Solr unique IDs could potentially address your 
problem too, if it existed. I think that feature could also address a 
variety of problems, I've been thinking about it.

http://apache.markmail.org/message/etqwbv6piikaqgo5

Ravi Kiran wrote:
 Hello,
 I have a single core servicing 3 different applications, one of
the
 application doesnt want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could
grow,
 hence I do not want to restrict them in a query as it the query could get
 extremely large. Is there a configuration option where we can blacklist
ids
 (uniqueKey) from showing up in results.

 Is there anything similar to EvelationComponent that demotes docs ? This
 could be ideal. I tried to look up and see if there was a boosting option
in
 elevation component so that I could negatively boost certain docs but
could
 not find any.

 Can anybody kindly point me in the right direction.

 Thanks

 Ravi Kiran Bhaskar

   





Re: Possible memory leaks with frequent replication

2010-11-03 Thread Lance Norskog
Do you use EmbeddedSolr in the query server? There is a memory leak
that shows up when taking a lot of replications.

On Wed, Nov 3, 2010 at 8:28 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Ah, but reading Peter's email message I reference more carefully, it seems
 that Solr already DOES provide an info-level log warning you about
 over-lapping warming, awesome. (But again, I'm pretty sure it does NOT throw an
 exception or return an HTTP error in that condition, based on my and others' experience).


 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 Sweet, good to know, and I'll definitely add this to my debugging toolbox.
 Peter's listserv message really ought to be a wiki page, I think.  Any
 reason for me not to just add it as a new one with the title "Commit frequency
 and auto-warming" or something like that?  Unless it's already in the wiki
 somewhere I haven't found, assuming the wiki will let an ordinary
 user-created account add a new page.
 //
 Jonathan Rochkind wrote:

 I hadn't looked at the code, am not familiar with Solr code, and can't say
 what that code does.

 But I have experienced issues that I _believe_ were caused by too frequent
 commits causing over-lapping searcher preparation. And I've definitely seen
 Solr documentation that suggests this is an issue. Let me find it now to see
 if the experts think these documented suggestions are still correct or not:

 On the other hand, autowarming (populating) a new collection could take a
 lot of time, especially since it uses only one thread and one CPU. If your
 settings fire off snapinstaller too frequently, then a Solr slave could be
 in the undesirable condition of handing-off queries to one (old) collection,
 and, while warming a new collection, a second “new” one could be snapped and
 begin warming!

 If we attempted to solve such a situation, we would have to invalidate the
 first “new” collection in order to use the second one, then when a “third”
 new collection would be snapped and warmed, we would have to invalidate the
 “second” new collection, and so on ad infinitum. A completely warmed
 collection would never make it to full term before it was aborted. This can
 be prevented with a properly tuned configuration so new collections do not
 get installed too rapidly. 


 http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs

 I think I've seen that same advice on another wiki page without being
 specifically regarding replication, but just being about commit frequency
 balanced with auto-warming, leading to overlapping warming, leading to
 spiraling RAM/CPU usage -- but NOT an exception being thrown or HTTP error
 delivered.

 I can't find it on the wiki, but here's a listserv post with someone
 reporting findings that match my understanding:
 http://osdir.com/ml/solr-user.lucene.apache.org/2010-09/msg00528.html

 How does this advice square with the code Lance found?  Is my
 understanding of how frequent commits can interact with time it takes to
 warm a new collection correct? Appreciate any additional info.




 Lance Norskog wrote:


 Isn't that what this code does?

      onDeckSearchers++;
      if (onDeckSearchers < 1) {
        // should never happen... just a sanity check
        log.error(logid + "ERROR!!! onDeckSearchers is " + onDeckSearchers);
        onDeckSearchers = 1;  // reset
      } else if (onDeckSearchers > maxWarmingSearchers) {
        onDeckSearchers--;
        String msg = "Error opening new searcher. exceeded limit of maxWarmingSearchers="
            + maxWarmingSearchers + ", try again later.";
        log.warn(logid + "" + msg);
        // HTTP 503==service unavailable, or 409==Conflict
        throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, msg, true);
      } else if (onDeckSearchers > 1) {
        log.info(logid + "PERFORMANCE WARNING: Overlapping onDeckSearchers=" + onDeckSearchers);
      }
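 For reference, the limit that triggers the 503 in the snippet above is set in
 solrconfig.xml; a minimal sketch of the relevant knobs (values illustrative, not
 recommendations):

 <maxWarmingSearchers>2</maxWarmingSearchers>
 <useColdSearcher>false</useColdSearcher>

 Lowering autowarmCount on the caches, trimming firstSearcher/newSearcher warming
 queries, or spacing out commits/replication are the usual ways to keep warming time
 below the commit interval so the limit is never hit.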


 On Tue, Nov 2, 2010 at 10:02 AM, Jonathan Rochkind rochk...@jhu.edu
 wrote:


 It's definitely a known 'issue' that you can't replicate (or do any
 other
 kind of index change, including a commit) at a faster frequency than
 your
 warming queries take to complete, or you'll wind up with something like
 you've seen.

 It's in some documentation somewhere I saw, for sure.

 The advice to 'just query against the master' is kind of odd, because,
 then... why have a slave at all, if you aren't going to query against
 it?  I
 guess just for backup purposes.

 But even with just one solr, or querying master, if you commit at rate
 such
 that commits come before the warming queries can complete, you're going
 to
 have the same issue.

 The only answer I know of is Don't commit (or replicate) at a faster
 rate
 than it takes your warming to complete.  You can reduce your warming
 queries/operations, or reduce your commit/replicate frequency.

 Would be interesting/useful if Solr noticed this going on, and gave you
 

Re: Filter by relevance

2010-11-03 Thread Erick Erickson
Be aware, though, that relevance isn't absolute, it's only interesting
#within# a query. And it's
then normed between 0 and 1. So picking a certain value is rarely doing
what you think it will.
Limiting to the top N docs is usually more reasonable

But this may be an XY problem. What is it you're trying to accomplish?
Perhaps if you
state the problem, some other suggestions may be in the offing

Best
Erick

On Wed, Nov 3, 2010 at 4:48 PM, Jason Brown jason.br...@sjp.co.uk wrote:

 Is it possible to filter my search results by relevance? For example,
 anything below a certain value shouldn't be returned?

 I also retrieve facet counts in my search queries, so it would be useful if
 the facet counts also respected the filter on the relevance.

 Thank You.

 Jason.




ZendCon 2010 - Slides on Building Intelligent Search Applications with Apache Solr and PHP 5

2010-11-03 Thread Israel Ekpo
Due to popular demand, the link to my slides @ ZendCon is now available
here, in case anyone else is looking for it.

http://slidesha.re/bAXNF3

The sample code will be uploaded shortly.

Feedback is also appreciated

http://joind.in/2261

-- 
°O°
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: blacklist docs by uniqueKey

2010-11-03 Thread Ravi Kiran
Mr. Rochkind pointed out the exact requirement I had in mind, i.e. "filter
query from external list of Solr unique IDs". On the flip side, even filter
queries can be dicey for me, as I could very easily blow past the 1024-byte
URL GET limit since my original queries are already very long; just adding 100
or 200 IDs to exclude could cause trouble. This is exactly why I am
trying to find a configuration option as opposed to writing filter queries.
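For what it's worth, one common way around URL length limits is to send the query as
an HTTP POST instead of a GET; Solr's select handler accepts form-encoded POST
bodies, so a long exclusion filter never has to fit in the URL. A hedged sketch with
curl (field name and ids hypothetical):

curl http://localhost:8983/solr/select \
  --data-urlencode 'q=my original long query' \
  --data-urlencode 'fq=*:* -id:(101 OR 205 OR 319)' \
  --data-urlencode 'wt=xml'

The servlet container's POST size limit (e.g. Tomcat's maxPostSize or Jetty's form
content size) then becomes the ceiling instead of the GET URL length.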

Thank you all for actively helping me out.

Ravi Kiran Bhaskar
Principal Software Engineer
Washington Post
1150 15th Street NW, Washington, DC 20071

On Wed, Nov 3, 2010 at 4:55 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I don't believe there is, but it occurs to me that the additional feature
 that Tom Burton-West contemplates in the thread filter query from external
 list of Solr unique IDs could potentially address your problem too, if it
 existed. I think that feature could also address a variety of problems, I've
 been thinking about it.

 http://apache.markmail.org/message/etqwbv6piikaqgo5


 Ravi Kiran wrote:

 Hello,
I have a single core servicing 3 different applications, one of the
 application doesnt want some specific docs to show up (driven by Editorial
 decision). Over a period of time the amount of blacklisted docs could
 grow,
 hence I do not want to restrict them in a query as it the query could get
 extremely large. Is there a configuration option where we can blacklist
 ids
 (uniqueKey) from showing up in results.

 Is there anything similar to EvelationComponent that demotes docs ? This
 could be ideal. I tried to look up and see if there was a boosting option
 in
 elevation component so that I could negatively boost certain docs but
 could
 not find any.

 Can anybody kindly point me in the right direction.

 Thanks

 Ravi Kiran Bhaskar






Re: blacklist docs by uniqueKey

2010-11-03 Thread Ravi Kiran
Yes, I also saw the exclude="true" in an example elevate.xml... I was
wondering what it does precisely and whether the text attribute MUST have a value?
I couldn't find any documentation explaining it.

 <query text="ipod">
   <doc id="MA147LL/A" />  <!-- put the actual ipod at the top -->
   <doc id="IW-02" exclude="true" /> <!-- exclude this cable -->
 </query>

Ravi Kiran Bhaskar
Principal Software Engineer
Washington Post
1150 15th Street NW, Washington, DC 20071

On Wed, Nov 3, 2010 at 5:12 PM, Jan Høydahl / Cominvent 
jan@cominvent.com wrote:

 How does the exclude=true option in elevate.xml perform with large number
 of excludes?
 Then you could have a separate elevate config for that client.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 3. nov. 2010, at 20.11, Yonik Seeley wrote:

  On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com
 wrote:
  How dynamic is this list? Is it feasable to add a field to your docs
 like
  blacklisteddocs, and at editorial's discretion add values to that field
  like app1, app2?
 
  At that point you can just filter them out via a filter query...
 
  Right, or a combination of the two approaches.
  For a realtime approach, add the newest filters (say any filters added
  that day) to a filter query, and roll those into a nightly reindex.
 
  -Yonik
  http://www.lucidimagination.com
 
 
  Best
  Erick
 
  On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com
 wrote:
 
  Hello,
 I have a single core servicing 3 different applications, one of
 the
  application doesnt want some specific docs to show up (driven by
 Editorial
  decision). Over a period of time the amount of blacklisted docs could
 grow,
  hence I do not want to restrict them in a query as it the query could
 get
  extremely large. Is there a configuration option where we can blacklist
 ids
  (uniqueKey) from showing up in results.
 
  Is there anything similar to EvelationComponent that demotes docs ?
 This
  could be ideal. I tried to look up and see if there was a boosting
 option
  in
  elevation component so that I could negatively boost certain docs but
 could
  not find any.
 
  Can anybody kindly point me in the right direction.
 
  Thanks
 
  Ravi Kiran Bhaskar
 
 




Re: A bug in ComplexPhraseQuery ?

2010-11-03 Thread Ahmet Arslan
 However, we have found that this query is crashing when
 using
 ComplexPhraseQuery:
 sulfur-reducing bacteria
 
 It is due to the dash inside the phrase.
 Here is the trace:
 java.lang.IllegalArgumentException: Unknown query type
 org.apache.lucene.search.PhraseQuery found in phrase
 query string
 sulfur-reducing bacteria

I added Terje Eggestad's fix [1]; can you test it and give us feedback?

[1]https://issues.apache.org/jira/browse/LUCENE-1486?focusedCommentId=12900278page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12900278


  


Does Solr support Natural Language Search

2010-11-03 Thread jayant

Does Solr support Natural Language Search? I did not find anything about
this in the reference manual. Please let me know.
Thanks.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-Solr-support-Natural-Language-Search-tp1839262p1839262.html
Sent from the Solr - User mailing list archive at Nabble.com.


Problem escaping question marks

2010-11-03 Thread Stephen Powis
I'm having difficulty properly escaping "?" in my search queries. It seems as
though it still matches any character.

Some info, a simplified schema and query to explain the issue I'm having.
I'm currently running solr1.4.1

Schema:

<field name="id" type="sint" indexed="true" stored="true" required="true" />
<field name="first_name" type="string" indexed="true" stored="true"
required="false" />

I want to return any first name with a question mark in it.
Query: first_name: *\?*

Returns all documents with any character in it.

Can anyone lend a hand?
Thanks!
Stephen


Re: replication not working between 1.4.1 and 3.1-dev

2010-11-03 Thread Shawn Heisey

On 10/29/2010 4:33 PM, Shawn Heisey wrote:
The recommended method of safely upgrading Solr that I've read about 
is to upgrade slave servers, keeping your production application 
pointed either at another set of slave servers or your master 
servers.  Then you test it with a dev copy of your application, and 
once you're sure it's working, you can switch production traffic over 
to the upgraded set.  If it falls over, you just switch back to the 
old version.  Once you're sure it's TRULY working, you upgrade 
everything else.  To convert fully to the new index format, you have 
the option of reindexing or optimizing your existing indexes.


I like this method, and this is the way I want to do it, except that 
the new javabin format makes it impossible.  I need a viable way to 
replicate indexes from a set of 1.4.1 master servers to 3.1-dev 
slaves.  Delving into the source and tackling the problem myself is 
something I would truly love to do, but I lack the necessary skills. 


Since I don't have the java skills required to solve the underlying 
problem, I have come up with a solution in the realm that I do 
understand - my build scripts.  I will update the scripts so that they 
can safely work on the slave machines as well as the masters.  They are 
currently hard-coded to work on the masters.  By turning replication off 
and running the scripts against both server sets, I'll be able to do all 
my testing.


IMHO this incompatibility with replication is a bug that needs to be 
fixed before the official release, which is why I filed SOLR-2204.  I 
have found a way around it, but the workaround might not be a viable 
option for everyone.


Thanks,
Shawn