Re: Delete from Solr index...

2010-01-29 Thread vnchoudhary

I am looking for the following solution in C#. Please provide sample code if
possible:

1. Delete the entire index using a delete query.
2. Take a backup of the old index before regenerating.
3. Write a negated ("unlike") query on a field to delete stale documents.
4. Use a transaction around index generation (delete all old documents and
regenerate the index), so that if any error occurs it will not affect the old
index.





ryantxu wrote:
 
 escher2k wrote:
 I am trying to remove documents from my index using delete by query.
 However when I did this, the deleted
 items seem to remain. This is the format of the XML file I am using -
 
 <delete><query>load_id:20070424150841</query></delete>
 <delete><query>load_id:20070425145301</query></delete>
 <delete><query>load_id:20070426145301</query></delete>
 <delete><query>load_id:20070427145302</query></delete>
 <delete><query>load_id:20070428145301</query></delete>
 <delete><query>load_id:20070429145301</query></delete>
 
 When I do the deletes individually, it seems to work (i.e. create each of
 the above in a separate file). Does this
 mean that each delete query request has to be executed separately ?
 
 
 correct, delete (unlike add) only accepts one command.
 
 Just to note, if load_id is your unique key, you could also use:
   <delete><id>20070424150841</id></delete>
 
 This will give you better performance and does not commit the changes 
 until you explicitly send <commit/>
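
Since delete-by-query takes a full query, another option (a sketch, assuming
the default Lucene query parser handles the delete query) is to collapse
several of the individual requests into one boolean query:

  <delete>
    <query>load_id:(20070424150841 OR 20070425145301 OR 20070426145301)</query>
  </delete>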
 
 




Re: Querying for multi-term phrases only . . .

2010-01-29 Thread Erik Hatcher
You can avoid one-word terms by setting outputUnigrams="false" on the
ShingleFilterFactory configuration.
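
A minimal field type sketch for reference (the field type name, tokenizer
choice and maxShingleSize here are assumptions, not from the thread):

  <fieldType name="shingles" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
    </analyzer>
  </fieldType>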


Erik

On Jan 28, 2010, at 11:29 PM, Christopher Ball wrote:


I am curious how I can query for multi-term phrases using the
TermsComponent?



The field I am searching has been shingled so it contains 2 and 3 word
phrases.



For example, in the sample results below I want to only get back multi-word
phrases such as "table of contents" and "under the", but not the single-word
terms such as "year" and "significant":



<int name="table of contents">25302</int>

<int name="including">25162</int>

<int name="year">25097</int>

<int name="significant">17501</int>

<int name="under the">17359</int>



Appreciate any ideas,



Christopher





Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Erik Hatcher
dismax won't quite give you the same query result.  What you can do
pretty easily, though, is create a QParser and QParserPlugin pair,
register it in solrconfig.xml, and then use defType=<your registered name>.
Pretty straightforward.  Have a look at Solr's various QParserPlugin
implementations for details.
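
A minimal sketch of such a pair (the class and registration names are made up,
and the parse() body is only a stub - the real logic would build the
BooleanQuery shown in the quoted message below):

  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class CustomQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws ParseException {
          // build the hand-rolled query from qstr here, e.g. one boosted
          // clause per field wrapped in a MUST BooleanQuery
          BooleanQuery must = new BooleanQuery();
          // ... addToBooleanQuery-style clauses from qstr ...
          BooleanQuery top = new BooleanQuery();
          top.add(must, Occur.MUST);
          return top;
        }
      };
    }
  }

registered in solrconfig.xml and selected per request:

  <queryParser name="custom" class="com.example.CustomQParserPlugin"/>
  ...&defType=custom&q=Microsoft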


Erik

On Jan 29, 2010, at 12:30 AM, Abin Mathew wrote:

Hi I want to generate my own customized query from the input string  
entered

by the user. It should look something like this

*Search field : Microsoft*
*
Generated Query*  :
description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The lucene code we used is like this*
BooleanQuery must = new BooleanQuery();

addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
addToBooleanQuery(must, "company", inputData, standardAnalyzer);
addToBooleanQuery(must, "city", inputData, standardAnalyzer);
must.setBoost(5.0f);
query.add(must, Occur.MUST);
addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);

*
In simple English*
addToBooleanQuery will add the particular field to the query after
analysing it with the analyzer mentioned and setting the boost as specified.
So there MUST be a keyword match in any of the fields
tags, title, role, description, requirement, company, city, and it SHOULD
occur in the fields tags, title and functionalArea.

Hope you have got an idea of my requirement. I am not asking anyone to do it
for me. Please let me know where I can start and give me some useful tips to
move ahead with this. I believe that it has to do with modifying the XML
configuration file and setting the parameters in the Dismax handler. But I am
still not sure. Please help.

Thanks & Regards
Abin Mathew




Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi,

 

I was wondering if anyone had come across this use case, and if this type of 
faceting is possible:

 

The requirement is to build a query such that an aggregated facet count of 
common (and'ed) field values form the basis of each returned facet count.

 

For example:

Let's say I have a number of documents in an index with, among others, the 
fields 'host' and 'user':

 

Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1

 

Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4

 

Doc7  host:machine_1   user:user_4

 

Is it possible to get facets back that would give the count of documents that 
have common host AND user values (preferably ordered - i.e. host then user for 
this example, so as not to create a factorial explosion)? Note that the caller 
wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for 
this, but I believe facet queries work on a different plane than this 
requirement (narrowing the term count, as opposed to aggregating).

 

For the example above, the desired result would be:

 

machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)

 

machine_2/user_1 (2)

machine_2/user_4 (1)

 

Has anyone had a need for this type of faceting and found a way to achieve it?

 

Many thanks,

Peter

 

 
  

Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
When faced with this type of situation where the data is entirely  
available at index-time, simply create an aggregated field that glues  
the two pieces together, and facet on that.
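
A sketch of that with the host/user example from the original message: have
the indexing client write the glued pair into an extra field (host_user is a
made-up name), e.g.

  <field name="host_user">machine_1/user_1</field>

and then facet on it:

  ...&facet=on&facet.field=host_user

which yields exactly the machine_x/user_y counts Peter listed.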


Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



[...]




RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi Erik,

 

Thanks for your reply. That's an interesting idea doing it at index-time, and a 
good idea for known field combinations.

The only thing is

How to handle arbitrary field combinations? - i.e. to allow the caller to 
specify any combination of fields at query-time?

So, yes, the data is available at index-time, but the combination isn't (short 
of creating fields for every possible combination).

 

Peter


 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 06:30:27 -0500

 [...]

loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Marc Sturlese

I am testing trunk and have seen a different behaviour when loading
updateProcessors, which I don't know if it's normal (at least with multicore).
Previously I used an updateProcessorChain this way:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="myChain">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

It does not work in current trunk. I have debugged the code and have seen that
the UpdateProcessorChain is now loaded via:

  public <T> T initPlugins(List<PluginInfo> pluginInfos,
                           Map<String, T> registry,
                           Class<T> type, String defClassName) {
    T def = null;
    for (PluginInfo info : pluginInfos) {
      T o = createInitInstance(info, type, type.getSimpleName(), defClassName);
      registry.put(info.name, o);
      if (info.isDefault()) {
        def = o;
      }
    }
    return def;
  }

As I don't have default="true" in the configuration, my custom
processor chain is not used. Setting default="true" makes it work:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="myChain" default="true">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As far as I understand, if you specify the chain you want to use in here:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

it shouldn't be necessary to set it as default.
Is it going to be kept this way?

Thanks in advance






Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Creating values for every possible combination is what you're asking
Solr to do at query-time, and as far as I know there isn't really a
way to accomplish that.  Is the need really to be arbitrary here?


Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:

[...]




Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess default="true" should not be necessary if there is only one
updateRequestProcessorChain specified. Open an issue

On Fri, Jan 29, 2010 at 6:06 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 [...]





-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Well, it wouldn't be 'every' combination - more of 'any' combination at 
query-time.
 
The 'arbitrary' part of the requirement is because it's not practical to 
predict every combination a user might ask for, although generally users would 
tend to search for similar/the same query combinations (but perhaps with 
different date ranges, for example).
 
If 'predicted aggregate fields' were calculated at index-time on, say, 10 
fields (the schema in question actually has 73 fields), that's 3,628,801 new 
fields. A large percentage of these would likely never be used (which ones 
would depend on the user, environment etc.).
 

Perhaps a more 'typical' use case than my network-based example would be a 
product search web page, where you want to show the number of products that are 
made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] 
(15) ). To obtain the (15) facet count value, you would have to correlate the 
number of Sony products (say, (861)), and the products that fall into the [600 
TO 800] price range (say, (1226) ). The (15) would be the intersection of the 
Sony hits and the price range hits by 'manufacturer:Sony'. Am I right that 
filter queries could only do this for document hits if you know the field 
values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The 
facets could then be derived by simply counting the numFound for each result 
set.

 

If there were subsearch support in Solr (i.e. take the output of a query and 
use it as input into another) that included facets [perhaps there is such 
support?], it might be used to achieve this effect.


A custom query parser plugin could work, maybe? I suppose it would need to 
gather up all the separate facets and correlate them according to the input 
query (e.g. host and user, or manufacturer and price range). Such a mechanism 
would be crying out for caching, but perhaps it could leverage the existing 
field and query caches.
 

Peter

 


 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 07:39:44 -0500

 [...]

multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith
Hi,

So over the course of the last two weeks I have been trying to come up with an
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can
have German, English, Italian or French names; people have an additional
firstname field. We also want to do auto suggest on the street and city names
as well as on emails and telephone numbers; as such we are treating phone
numbers as text.

We do have the option for the user to use phonetic searches or to split words
(especially the German compound words), but I guess we will leave that out of
the auto suggest.
We do expect that some users will type in properly cased strings, while some
may just type in all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above-mentioned fields
(name, firstname, city, street, email, telefon) into a new field called "all".
It seems the best approach is to use facet.prefix for our requirements. We will
therefore split off the last term in the query and pass it in as the
facet.prefix, while the rest is passed in as the q parameter.

Since facets are driven out of the index, we will use the following type
definition for this "all" field:
<fieldType name="textplain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

So essentially the idea is to just split on whitespace, remove stop words and 
word delimiters.

The query would then look something like the following if the user entered
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver
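
A minimal sketch of the client-side term splitting described above (Java for
illustration; the variable names are made up, and with dismax an empty q would
need q.alt instead - an assumption worth testing):

  // split the raw input into completed terms (q) and the last,
  // still-being-typed term (facet.prefix)
  String input = "Kaltenreider Ver";
  int lastSpace = input.lastIndexOf(' ');
  String q = lastSpace < 0 ? "" : input.substring(0, lastSpace);
  String prefix = input.substring(lastSpace + 1);
  // request: .../select?defType=dismax&qf=all&q=<q>&rows=0
  //          &facet=on&facet.field=all&facet.mincount=1&facet.prefix=<prefix>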

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad-core machine with 16GB of
RAM, albeit all of that will be shared with Apache, a MySQL slave and a PHP app?
Ah well, questions like that are impossible to answer, so I'm just trying to ask
if you expect this to be really heavy. I noticed in my initial testing with 2M
entries on my laptop that facets seemed to be fine, though the first request was
slow and the memory use spiked to 300MB. But I presume it's just loading stuff
into caches, and concurrent requests shouldn't cause the memory use to go up
linearly.

I am still a bit unsure how to handle both the lowercased and the
case-preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse", and with
"kreu" I would get "kreuzstrasse".
Since I do not expect any words to start with a lowercase letter and still
contain some uppercase letter, we should be fine with this approach.

As in, I doubt there would be stuff like "fooBar", which would lead to
suggesting both "foobar" and "fooBar".

How can I achieve this?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Is optimizing always necessary?

2010-01-29 Thread Marcus Herou
If one only has additions, do I then need to optimize the index at all?

I thought that only updates/deletes created holes in the index. Or should
the index be sorted on disk at all times - is that the reason?

Cheers

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Sounds like what you're asking for is tree faceting.  A basic  
implementation is available in SOLR-792, but one that could also take  
facet.queries, numeric or date range buckets, to tree on would be a  
nice improvement.


Still, the underlying implementation will simply enumerate all the  
possible values (SOLR-792 has some short-circuiting when the top-level  
has zero, of course).  A client-side application could do this with  
multiple requests to Solr.
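
A sketch of that client-side approach using SolrJ (assuming Solr 1.4's SolrJ
API; the server URL and the host/user field names come from the example
earlier in the thread):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TreeFacetSketch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // pass 1: facet on host
      SolrQuery q1 = new SolrQuery("*:*");
      q1.setRows(0).setFacet(true).addFacetField("host");
      QueryResponse r1 = server.query(q1);

      // pass 2: for each host value, facet on user within that host
      for (FacetField.Count host : r1.getFacetField("host").getValues()) {
        SolrQuery q2 = new SolrQuery("*:*");
        q2.setRows(0).setFacet(true).addFacetField("user");
        q2.addFilterQuery("host:" + host.getName());
        QueryResponse r2 = server.query(q2);
        for (FacetField.Count user : r2.getFacetField("user").getValues()) {
          System.out.println(host.getName() + "/" + user.getName()
              + " (" + user.getCount() + ")");
        }
      }
    }
  }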


Subsearch - sure, just make more requests to Solr, rearranging the  
parameters.


I'd still say that in general, for this type of need, it'll be less arbitrary
- and locking some things in during indexing will be the pragmatic way to go -
for most cases.


Erik



On Jan 29, 2010, at 9:28 AM, Peter S wrote:

[...]

Re: Is optimizing always necessary?

2010-01-29 Thread Wangsheng Mei
In addition to removing the holes in the index, optimization is also used
to merge multiple small index segments into a bigger one.
Although I have not got specific performance data, I can imagine that this
will lead to performance benefits.
Supposing you have thousands of small segments, opening and closing them
again and again would be costly.
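
For reference, an optimize is just another update command - a minimal sketch
(the URL form is an assumption based on the 1.4 update handler, worth
verifying):

  <optimize/>

posted to /update, or equivalently http://localhost:8983/solr/update?optimize=true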

2010/1/30 Marcus Herou marcus.he...@tailsweep.com

 If one only has additions, do I then need to optimize the index at all?

 I thought that only updates/deletes created holes in the index. Or should
 the index be sorted on disk at all times - is that the reason?

 Cheers

 //Marcus

 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/




-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Wangsheng Mei
What's the point of generating your own query?
Are you sure that the Solr query syntax cannot satisfy your need?

2010/1/29 Abin Mathew abin.mat...@toostep.com

 [...]



-- 
梅旺生


Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Document Duplication Detection (Solr 1.4)

 Overview

Preventing duplicate or near duplicate documents from entering an index or
tagging documents with a signature/fingerprint for duplicate field
collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr should natively support deduplication techniques of this
type and allow for the easy addition of new hash/signature implementations.

Goals

   - Efficient, hash based exact/near document duplication detection and
   blocking.
   - Allow for both duplicate collapsing in search results as well as
   deduplication on adding a document.

 Design

Signature

A class capable of generating a signature String from the concatenation of a
group of specified document fields.

public abstract class Signature {
  public void init(SolrParams nl) {
  }

  public abstract String calculate(String content);
}
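
To illustrate the extension point, a toy Signature subclass (a sketch, not
from the wiki page - a real implementation would use a proper low-collision
hash over the full content):

  public class ToySignature extends Signature {
    @Override
    public String calculate(String content) {
      // toy fingerprint: content length mixed with the first character
      // (illustrative only - real implementations hash the whole content)
      int h = content.length() * 31 + (content.length() > 0 ? content.charAt(0) : 0);
      return Integer.toHexString(h);
    }
  }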

Implementations:

MD5Signature

128-bit hash used for exact duplicate detection.

Lookup3Signature

64-bit hash used for exact duplicate detection; much faster than MD5 and
smaller to index.

TextProfileSignature

Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
tunable but works best on longer text.

There are other more sophisticated algorithms for fuzzy/near hashing that
could be added later.

Notes

Adding in the dedupe process will change the allowDups setting so that it
applies to an update Term (with field signatureField in this case) rather
than the unique field Term (of course the signatureField could be the unique
field, but generally you want the unique field to be unique)

When a document is added, a signature will automatically be generated and
attached to the document in the specified signatureField.

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as
part of the UpdateRequestProcessor chain:

Accepting all defaults:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Example settings:

  <!-- An example dedup update processor that creates the id field on the fly
       based on the hash code of some other fields.  This example has
       overwriteDupes set to false since we are using the id field as the
       signatureField and Solr will maintain uniqueness based on that anyway. -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 Note

Also be sure to change your update handlers to use the defined chain, i.e.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The update processor can also be specified per request with a parameter of
update.processor=dedupe

Settings

signatureClass (default: org.apache.solr.update.processor.Lookup3Signature)
  A Signature implementation for generating a signature hash.

fields (default: all fields)
  The fields to use to generate the signature hash, as a comma-separated list.
  By default, all fields on the document will be used.

signatureField (default: signatureField)
  The name of the field used to hold the fingerprint/signature. Be sure the
  field is defined in schema.xml.

enabled (default: true)
  Enable/disable dedupe factory processing.


-- 
梅旺生


Re: Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Sorry for sending the wrong message - it should have gone to my own mailbox :(

2010/1/30 Wangsheng Mei hairr...@gmail.com

 [...]



-- 
梅旺生


Deleting spell checker index

2010-01-29 Thread darniz

Hello all,
We are using the index-based spell checker.
I was wondering whether, with the help of any URL parameters, we can delete the
spell check index directory.
Please let me know.
Thanks
darniz





Auto Suggest with multiple space separated words

2010-01-29 Thread Nair, Manas
Hi Experts,
 
I need an auto suggest functionality using Solr which gives me the feel of 
using the Firefox browser. In short, if I type in a prefix, the results should 
drop down even if the prefix is not at the start of the drop-down items.
 
Example: If I search for "Lin", then the results could be 
[Abe Lincoln, Lindsay Lohan, Sarah Palin, Gasoline, ...].
 
Please suggest the best approach.
 
Any help is greatly appreciated.
 
Thankyou,
Manas Nair


distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search when a shard goes down, an error is
returned and the search fails. is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: Basic questions about Solr cost in programming time

2010-01-29 Thread Sven Maurmann

Hi!

Of course the answer depends (as usual) very much on the features
you want to realize. But Solr can be set up very fast. When we created
our first prototype, it took us about a week to get it running with
phoneme search, spell checking, faceting - and even collapsing
(using the famous SOLR-236 field-collapsing patch).

It is definitely very nice that you can do a lot of things using the
available components and only configuring them inside solrconfig.xml
and schema.xml.

And you may well start with the standard distribution.

Cheers,
   Sven

--On Tuesday, 26 January 2010 12:00 -0800 Jeff Crump 
jcr...@hq.mercycorps.org wrote:



Hi,
I hope this message is OK for this list.

I'm looking into search solutions for an intranet site built with Drupal.
Eventually we'd like to scale to enterprise search, which would include
the Drupal site, a document repository, and Jive SBS (collaboration
software). I'm interested in Lucene/Solr because of its scalability,
faceted search and optimization features, and because it is free. Our
problem is that we are a non-profit organization with only three very
busy programmers/sys admins supporting our employees around the world.

To help me argue for Solr in terms of total cost, I'm hoping that members
of this list can share their insights about the following:

* About how many hours of programming did it take you to set up your
instance of Lucene/Solr (not counting time spent on optimization)?

* Are there any disadvantages of going with a certified distribution
rather than the standard distribution?


Thanks and best regards,
Jeff

Jeff Crump
jcr...@hq.mercycorps.org


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Tree faceting - that sounds very interesting indeed. I'll have a look into that 
and see how it fits, as well as any improvements for adding facet queries, 
cross-field aggregation, date ranges etc. There could be some very nice 
use cases for such functionality. Just wondering how this would work with 
distributed shards/multi-core...


Many Thanks! 

Peter

 

 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 12:20:07 -0500

 [...]

sort items by whether the user has viewed it or not

2010-01-29 Thread a8910b-solr
hi,

i want to query for documents that have certain values, but i want it first 
sorted by documents that this person has viewed in the past.  i can't store 
each user's view information in the document, so i want to pass that in to the 
search.  is it possible to do something like this:

http://solr?q=baseball&sort=doc_isbn(ABC or DEF or GHI) desc, title desc

any help is appreciated,
r



Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Chris Hostetter


: I guess . default=true should not be necessary if there is only one
: updateRequestProcessorChain specified . Open an issue

No... that doesn't seem right.  If you declare your own chains, but you 
don't mark any of them as default="true", then it shouldn't matter how many 
of them you declare; SolrCore should create a default for you.


The real question here is: why isn't he getting his explicitly defined 
chain when he references it by name?


declaring that he wants his explicitly named chain to be the default is 
fine, and that should work, but w/o declaring it as the default he should 
still be able to ask for it by name ... why isn't that working? ...


:  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
:    <lst name="defaults">
:      <str name="update.processor">myChain</str>
:    </lst>

Marc, can you confirm that when you don't declare your chain as 
default="true" that...
1) an instance of your CustomUpdateProcessorFactory is actually getting 
instantiated by solr (via logging or running in a debugger)
2) whether your custom chain is used if you pass update.processor=myChain 
as a request param instead of relying on the configured defaults


(I wonder if some handler refactoring caused the default 
processing logic to no longer respect the defaults)




-Hoss

Re: update doc success, but could not find the new value

2010-01-29 Thread Chris Hostetter

: Subject: update doc success, but could not find the new value
: In-Reply-To: 449216.59315...@web56308.mail.re3.yahoo.com
: References: 27335403.p...@talk.nabble.com
: 449216.59315...@web56308.mail.re3.yahoo.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



RE: Solr wiki link broken

2010-01-29 Thread Chris Hostetter

: Why don't we change the links to have FrontPage explicitly?
: Wouldn't it be the easiest fix unless there are numerous
: other pages that references the default page w/o FrontPage?

I'm fairly confident that there are more links pointing to 
http://wiki.apache.org/solr/ than there are alternate versions in 
different languages ... particularly when you start factoring in all of 
the webpages in the world that we don't have the ability to edit 
directly.


-Hoss



Re: NullPointerException in ReplicationHandler.postCommit + question about compression

2010-01-29 Thread Chris Hostetter

: never keep a <str name="maxOptimizedCommitsToKeep">0</str>.
: 
: It is better to not mention the deletionPolicy at all. The
: defaults are usually fine.

if setting the keep values to 0 results in NPEs, we should do one (if not 
both) of the following...

1) change the init code to warn/fail if the values are 0 (not sure if 
there is ever a legitimate use for 0 as a value)

2) change the code that's currently throwing an NPE to check its 
assumptions and log a more meaningful error if it can't function because of 
the existing config.


-Hoss



RE: How to Implement SpanQuery in Solr . . ?

2010-01-29 Thread Chris Hostetter

: and Solr. I was hoping to start by getting a simple example working in SOLR
: and then iterate towards the more complex, given this is my first attempt at
: extending Solr.

wise choice.

: For my first iteration of SpanQuery in Solr I am thinking of starting with a
: simple syntax to combine:

...honestly: since you already mentioned that you might eventually want to 
integrate Qsol, i would suggest you start with that directly.  that way 
you are taking an existing parser (that you evidently understand) and just 
hooking it in via the QParser abstraction (as opposed to writing a 
Lucene String-to-Query parser *and* learning the QParser/Solr internals).

: implementation on the Lucene side and the FooQParserPlugin as a reference
: implementation on the SOLR side?

The FooQParserPlugin is fairly primitive and doesn't really make obvious 
some of the things you can do with a QParser, so you may also want 
to skim the LuceneQParserPlugin as well

: The other part of the riddle I would really appreciate some guidance on is
: how to get it to plug-in to SOLR correctly?

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin
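
To make the plumbing concrete, here is a rough, untested sketch of a 
span-flavored QParserPlugin (the package, class name, "f" param, and the 
fixed slop of 5 are all made up for illustration):

  package com.example;

  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class SimpleSpanQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        public Query parse() throws ParseException {
          // naive: treat the raw query string as whitespace-separated
          // terms that must appear within 5 positions of each other, in order
          String field = getParam("f");  // field name passed as a param
          String[] words = getString().split("\\s+");
          SpanQuery[] clauses = new SpanQuery[words.length];
          for (int i = 0; i < words.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, words[i]));
          }
          return new SpanNearQuery(clauses, 5, true);
        }
      };
    }
  }

registered in solrconfig.xml with

  <queryParser name="simplespan" class="com.example.SimpleSpanQParserPlugin"/>

and invoked as q={!simplespan f=body}big red dog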


-Hoss



Re: Solr Cache Viewing/Browsing

2010-01-29 Thread Chris Hostetter
: used in a modified DisMaxHandler) and I was wondering if there is a way to
: get at this data from the JSP pages? I then thought that it might be nice to
: view more information about the respective caches like the current elements,
: recently evicted etc to help debug performance issues. Has anyone worked on
: this or have any ideas surrounding this?

I don't believe anyone has looked into this.

It would be hard to implement in a generic manner since the SolrCache API 
doesn't provide any mechanism for inspecting the contents, but you could 
write an implementation that exposes some of these things through the 
getStatistics method (or some other new introspection-based API)



-Hoss



Re: replication setup

2010-01-29 Thread Chris Hostetter

: Subject: replication setup
: In-Reply-To: 83ec2c9c1001260724t110d6595m5071e0a40e1b1...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: Analysis tool vs search query

2010-01-29 Thread Chris Hostetter

: I've run into this issue that I have no way of resolving, since the analysis
: tool doesn't show me there is an error. I copy the exact field value into
: the analysis tool and i type in the exact query request i'm issuing and the
: tool finds it a match. However running the query with that exact same

the analysis tool doesn't do query parsing .. so pasting a *query* 
string into the analysis tool isn't going to give you any meaningful 
information.

what the query section of the analysis tool lets you do is see what the 
query time analyzer (that is used by most query parsers at query time) 
will do with your input ... but the QueryParser is still in control, and 
it decides which input to pass to your analyzer -- special characters 
(like whitespace) have meaning to most query parsers before they ever 
have a chance of getting passed to the analyzer.

: <tokenizer class="solr.KeywordTokenizerFactory"/>

A keyword tokenizer results in a single token for each input string, but 
the (default) query parser is going to chunk the input up on whitespace 
before the analyzer is ever invoked, unless you put it in a quoted string.
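
For example, with a KeywordTokenizer field (field name made up):

  q=category:best buy      ... the parser splits on the space first
  q=category:"best buy"    ... the whole phrase reaches the analyzer intact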


-Hoss



Re: using bq with standard request handler

2010-01-29 Thread Chris Hostetter

: I am using a query like:
: 
: http://localhost:8080/solr/select?q=product_category:Grocery&fq=in_stock:true&debugQuery=true&sort=display_priority+desc,prop_l_price+asc
...
: Is it possible to use display_priority/price fields in bq itself to acheive
: the same result.. I tried forming some queries that but was unable to get
: desired results...

bf and bq are features of the dismax parser, so the default query parser 
won't use them -- it really wouldn't even make sense as a possible 
new feature, because the types of queries that might be specified using 
the lucene QParser are too broad to be able to define a consistent 
mechanism for knowing how/where to add the boosting queries to the 
structure.

if all of your queries have that identical structure, you might, however, 
consider something like...

http://localhost:8080/solr/select?qf=product_category&q=Grocery&fq=in_stock:true&bq=...


-Hoss



Re: Mail config

2010-01-29 Thread Chris Hostetter

: I do not want to receive all the emails from this mail list, I only want to
: receive the answers to my questions, is this possible?

That's not how mailing lists work.  If you want to participate in the 
community, you have to participate fully.

: If I am not mistaken when I unsubscribed I sent an email which did not reach
: the mail list at all (therefore there was of course no chance to get any
: replies).

The same mechanism that prevents you from posting when you are not 
subscribed is the mechanism that prevents thousands of spam messages from 
getting sent to the list every day .. you have to take the bad with the 
good.

: I am newbie for Solr and I doubt I can contribute much by answering to other
: posts.

But you can learn from those posts, and the discussion/responses they 
stimulate...
http://people.apache.org/~hossman/#private_q




-Hoss



Re: Lock problems: Lock obtain timed out

2010-01-29 Thread Chris Hostetter

: Can anyone think of a reason why these locks would hang around for more than
: 2 hours?
: 
: I have been monitoring them and they look like they are very short lived.

Typically the lock files are only left around for more than a few seconds 
when there was a fatal crash of some kind ... an OOM Error for example, or, 
as already mentioned in this thread...

:SEVERE: java.io.IOException: No space left on device

...if you check your solr logs for messages in the immediate time frame 
following the lastModified time of the lock file, you'll probably find 
something interesting.


-Hoss



Re: scenario with FQ parameter

2010-01-29 Thread Chris Hostetter

:    qf=field1^10 field2^20 field^100&fq=*:9+OR+(field1:xyz)
...
: I know I can use copy field (say 'text') to copy all the fields and then
...
: but doing so , the boost weights specified in the 'qf' field have no effect
: on the score.

An FQ never has any impact on the score, so your question is a bit 
confusing.

If you want to influence the scores, you'll need to use bq instead of 
fq.

as discussed in another current thread on this list, it's possible to make 
the bq param use the dismax parser as well, but there are some tricky 
issues involved with that ... unless your use case is actually more 
complicated than you are describing, you should probably just use 
something like...

...qf=field1^10+field2^20+field^100&bq=field1:9^10+field2:9^20+field:9^100+field1:xyz

-Hoss



Re: How can I boost bq in FieldQParserPlugin?

2010-01-29 Thread Chris Hostetter

: q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=}&qq=12345&qt=dismax&debugQuery=on
: 
: I try to debug the above query, it turned out to be as:
: +DisjunctionMaxQuery((content:ipod | title:ipod^4.0)~0.01) ()
: +DisjunctionMaxQuery((userId:12345^0.5)~0.01)

...hmmm, i'm not sure why that's happening, but it certainly seems like a 
bug -- i just have no idea what that bug is.  

the inner dismax parser should definitely be producing a query where the 
DisjunctionMaxQuery for 12345 is mandatory but that mandatory clause 
should be wrapped inside of another boolean query which should be added to 
the outermost query as an optional clause.

somewhere that BooleanQuery produced by the inner dismax parser is getting 
thrown away ... hmmm, actually this is a necessary behavior of
DismaxQParser for some cases (it sheds its own outermost 
BooleanQuery when not needed), but in this case it's screwing you because 
it doesn't realize you really do need it.

does this work better? ...

q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=*:*^0}&qq=12345&qt=dismax&debugQuery=on

...it's kind of kludgy, but it should guarantee you that the wrapping 
BooleanQuery is preserved.



-Hoss



Re: Large Query Strings and performance

2010-01-29 Thread Chris Hostetter

: I am using Solr 1.4 with large query strings with 20+ terms and faceting on
: a single multi-valued field in a 1 million record system. I am using Solr to
: categorize text; that's why the query strings are big.
: 
: The performance gets worse the more search terms are used.  Is there any

can you elaborate more on the types of query strings you are using? ... 
are they simply BooleanQueries consisting of many terms? ... are they all 
optional?

We have to understand your goal, what exactly you are currently doing, and 
what exactly you have already tried before we can suggest ways of 
achieving your goal faster than the things you've already tried.



-Hoss



Re: Master Read Timeout

2010-01-29 Thread Chris Hostetter

: Is there any way to increase the Slave's timeout value? Are there any 

http://wiki.apache.org/solr/SolrReplication?highlight=%28timeout%29
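
If i'm remembering the wiki correctly, the relevant slave-side settings 
look like this (values are in milliseconds; the master URL is made up):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master:8983/solr/replication</str>
      <str name="httpConnTimeout">5000</str>
      <str name="httpReadTimeout">10000</str>
    </lst>
  </requestHandler>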


-Hoss



RE: matching exact/whole phrase

2010-01-29 Thread Chris Hostetter

: Is it safe to say in order to do exact matches the field should be a string.

It depends on your definition of "exact".

If you want exact matches, including unicode codepoints and 
leading/trailing whitespace, then StrField would probably make sense -- 
but you could equally use TextField with a KeywordTokenizer and nothing 
else.

If you want *some* normalization (ie: trim leading/trailing whitespace, 
map equivalent codepoints to a canonical representation, etc...) then you 
need TextField.
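
A sketch of that second option (the type name and the particular filters 
are just one illustrative normalizing chain):

  <fieldType name="text_exactish" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>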

: Now in my dismax handler if i have the qf defined as a text field and run a
: phrase search on the text field
: "my car is the best car in the world"
: i don't get back any results. looking with debugQuery=on this is the
: parsedQuery
: text:my tire pressure warning light came my honda civic
: This will not work since text was indexed by removing all stop words.

it *can* work if the query analyzer for your text field type is also 
configured to remove stopwords, and if you either: configure the 
StopFilter(s) to deal with token positions in the way the parser expects 
(i forget which one works, you have to play with it); OR use a qs (query 
slop) value that gives you enough slop to miss those empty stop word gaps.
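
e.g. (slop value picked arbitrarily):

  ...&q="my car is the best car in the world"&qs=3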


-Hoss



Re: Deleting spelll checker index

2010-01-29 Thread Chris Hostetter

: We are using Index based spell checker.
: i was wondering with the help of any url parameters can we delete the spell
: check index directory.

I don't think so.

You might be able to configure two different spell check components that 
point at the same directory -- one that builds off of a real field, and one 
that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
then you could trigger a rebuild of an empty spell checking index using 
the second component.

But i've never tried it so i have no idea if it would work.
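
Something like this, perhaps (totally untested; the dictionary names, 
field, and paths are made up):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">wipe</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="sourceLocation">empty.txt</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
  </searchComponent>

...and then a request with 
&spellcheck=true&spellcheck.dictionary=wipe&spellcheck.build=true would 
rebuild the shared index from the empty source.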


-Hoss



DataImportHandler multivalued field Collection<String> not working

2010-01-29 Thread Jason Rutherglen
DataImportHandler multivalued field Collection<String> isn't
working the way I'd expect, meaning not at all. I logged that the
collection is there, however the multivalued collection field
just isn't being indexed (according to the DIH web UI, and it's
not in the index).


Re: DataImportHandler multivalued field Collection<String> not working

2010-01-29 Thread Wangsheng Mei
Did you correctly set multiValued (note the capital V) to "true" in schema.xml?
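
e.g. (field name and type made up):

  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>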

2010/1/30 Jason Rutherglen jason.rutherg...@gmail.com

 DataImportHandler multivalued field Collection<String> isn't
 working the way I'd expect, meaning not at all. I logged that the
 collection is there, however the multivalued collection field
 just isn't being indexed (according to the DIH web UI, and it's
 not in the index).




-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Abin Mathew
Hi, I realized the power of the Dismax Query Handler recently, and now I
don't need to generate my own query since Dismax is giving better
results. Thanks a lot

2010/1/29 Wangsheng Mei hairr...@gmail.com:
 What's the point of generating your own query?
 Are you sure that solr query syntax cannot satisfy your need?

 2010/1/29 Abin Mathew abin.mat...@toostep.com

 Hi I want to generate my own customized query from the input string entered
 by the user. It should look something like this

 *Search field : Microsoft*
 *
 Generated Query*  :
 description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
 role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0
 title:microsoft^3.5 functionalArea:microsoft

 *The lucene code we used is like this*
 BooleanQuery must = new BooleanQuery();

 addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
 addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
 addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "company", inputData, standardAnalyzer);
 addToBooleanQuery(must, "city", inputData, standardAnalyzer);
 must.setBoost(5.0f);
 query.add(must, Occur.MUST);
 addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
 addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
 addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
 *
 In Simple english*
 addToBooleanQuery will add the particular field to the query after
 analysing it using the analyser mentioned and setting a boost as specified.
 So there MUST be a keyword match with any of the fields
 tags, title, role, description, requirement, company, city, and it SHOULD
 occur in the fields tags, title and functionalArea.

 Hope you have got an idea of my requirement. I am not asking anyone to do
 it for me. Please let me know where I can start and give me some useful
 tips to move ahead with this. I believe that it has to do with modifying
 the XML configuration file and setting the parameters in the Dismax
 handler. But I am still not sure. Please help

 Thanks & Regards
 Abin Mathew




 --
 梅旺生



Looking for a Solr volunteer for www.comics.org

2010-01-29 Thread Henry Andrews
Hi folks,
  I apologize if this isn't the right place to post this (alternate suggestions 
welcome alongside appropriate chastisement :-)

  I'm trying to recruit a volunteer to implement a Solr-based search system for 
the Grand Comic-Book Database (http://www.comics.org/).  We're a non-profit, 
non-commercial, international group researching and indexing comic books, and 
we have only two active programmers (we're both unpaid volunteers, as are all 
GCD personnel).  We'd love to have better search, and Solr looks like the right 
tool, but we're swamped with other technical work.

  So if anyone reading this would like to help out a comic book-related web 
site with their Solr experience, for absolutely no monetary compensation 
whatsoever, do please let me know :-D  It would help to be into comic books, 
but that's not strictly required.  Your work would be used quite heavily, and 
you could of course point that out to anyone you might wish to impress with 
your expertise.  Our technical work is open-source, and therefore available for 
inspection and showing off.

  To clarify:  I'm not looking for assistance with or pointers about setting 
Solr up myself (no matter how easy it is).  And I'm not trying to get the list 
as a whole to do our work for us.  I'm just trying to find out if any individual 
feels like joining our tech team and volunteering for the project, and I couldn't 
think of a more likely place to find candidates than here.  If we don't find a 
volunteer, I'll end up doing it next year, and I'll be reading a lot more 
documentation before asking any questions here.

thanks,
-henry



Re: Deleting spelll checker index

2010-01-29 Thread darniz

Then I assume the easiest way is to delete the directory itself.

darniz


hossman wrote:
 
 
 : We are using Index based spell checker.
 : i was wondering with the help of any url parameters can we delete the
 spell
 : check index directory.
 
 I don't think so.
 
 You might be able to configure two different spell check components that 
 point at the same directory -- one that builds off of a real field, and one 
 that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
 then you could trigger a rebuild of an empty spell checking index using 
 the second component.
 
 But i've never tried it so i have no idea if it would work.
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27381620.html
Sent from the Solr - User mailing list archive at Nabble.com.