Re: Delete from Solr index...

2010-01-29 Thread vnchoudhary

I am looking for the following solution in C#. Please provide sample code if
possible:

1. Delete the entire index using a delete query.
2. Take a backup of the old index before regenerating.
3. Write a negated ("unlike") query on a field to delete stale documents.
4. Use a transaction around index generation (delete all old documents and
regenerate the index), so that if any error occurs it will not affect the old
index.





ryantxu wrote:
 
 escher2k wrote:
 I am trying to remove documents from my index using delete by query.
 However when I did this, the deleted
 items seem to remain. This is the format of the XML file I am using -
 
 <delete><query>load_id:20070424150841</query></delete>
 <delete><query>load_id:20070425145301</query></delete>
 <delete><query>load_id:20070426145301</query></delete>
 <delete><query>load_id:20070427145302</query></delete>
 <delete><query>load_id:20070428145301</query></delete>
 <delete><query>load_id:20070429145301</query></delete>
 
 When I do the deletes individually, it seems to work (i.e. create each of
 the above in a separate file). Does this
 mean that each delete query request has to be executed separately ?
 
 
 correct, delete (unlike add) only accepts one command.
 
 Just to note, if load_id is your unique key, you could also use:
   <delete><id>20070424150841</id></delete>
 
 This will give you better performance and does not commit the changes 
 until you explicitly send <commit/>
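
Since delete-by-query takes a full query, another option (a sketch, assuming
the default Lucene query parser handles the delete query) is to collapse
several of the individual requests into one boolean query:

  <delete>
    <query>load_id:(20070424150841 OR 20070425145301 OR 20070426145301)</query>
  </delete>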
 
 




Re: Querying for multi-term phrases only . . .

2010-01-29 Thread Erik Hatcher
You can avoid one-word terms by setting outputUnigrams="false" on the
ShingleFilterFactory configuration.
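
A minimal field type sketch for reference (the field type name, tokenizer
choice and maxShingleSize here are assumptions, not from the thread):

  <fieldType name="shingles" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
    </analyzer>
  </fieldType>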


Erik

On Jan 28, 2010, at 11:29 PM, Christopher Ball wrote:


I am curious how I can query for multi-term phrases using the
TermsComponent?



The field I am searching has been shingled so it contains 2 and 3 word
phrases.



For example, in the sample results below I want to only get back multi-word
phrases such as "table of contents" and "under the", but not the single-word
terms such as "year" and "significant":



<int name="table of contents">25302</int>

<int name="including">25162</int>

<int name="year">25097</int>

<int name="significant">17501</int>

<int name="under the">17359</int>



Appreciate any ideas,



Christopher





Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Erik Hatcher
dismax won't quite give you the same query result.  What you can do
pretty easily, though, is create a QParser and QParserPlugin pair,
register it in solrconfig.xml, and then use defType=<your registered name>.
Pretty straightforward.  Have a look at Solr's various QParserPlugin
implementations for details.
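
A minimal sketch of such a pair (the class and registration names are made up,
and the parse() body is only a stub - the real logic would build the
BooleanQuery shown in the quoted message below):

  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class CustomQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws ParseException {
          // build the hand-rolled query from qstr here, e.g. one boosted
          // clause per field wrapped in a MUST BooleanQuery
          BooleanQuery must = new BooleanQuery();
          // ... addToBooleanQuery-style clauses from qstr ...
          BooleanQuery top = new BooleanQuery();
          top.add(must, Occur.MUST);
          return top;
        }
      };
    }
  }

registered in solrconfig.xml and selected per request:

  <queryParser name="custom" class="com.example.CustomQParserPlugin"/>
  ...&defType=custom&q=Microsoft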


Erik

On Jan 29, 2010, at 12:30 AM, Abin Mathew wrote:

Hi I want to generate my own customized query from the input string  
entered

by the user. It should look something like this

*Search field : Microsoft*
*
Generated Query*  :
description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0)
tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The lucene code we used is like this*
BooleanQuery must = new BooleanQuery();

addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
addToBooleanQuery(must, "company", inputData, standardAnalyzer);
addToBooleanQuery(must, "city", inputData, standardAnalyzer);
must.setBoost(5.0f);
query.add(must, Occur.MUST);
addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);

*
In simple English*
addToBooleanQuery will add the particular field to the query after
analysing it with the analyzer mentioned and setting the boost as specified.
So there MUST be a keyword match in any of the fields
tags, title, role, description, requirement, company, city, and it SHOULD
occur in the fields tags, title and functionalArea.

Hope you have got an idea of my requirement. I am not asking anyone to do it
for me. Please let me know where I can start and give me some useful tips to
move ahead with this. I believe that it has to do with modifying the XML
configuration file and setting the parameters in the Dismax handler. But I am
still not sure. Please help.

Thanks & Regards
Abin Mathew




Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi,

 

I was wondering if anyone had come across this use case, and if this type of 
faceting is possible:

 

The requirement is to build a query such that an aggregated facet count of 
common (and'ed) field values form the basis of each returned facet count.

 

For example:

Let's say I have a number of documents in an index with, among others, the 
fields 'host' and 'user':

 

Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1

 

Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4

 

Doc7  host:machine_1   user:user_4

 

Is it possible to get facets back that would give the count of documents that 
have common host AND user values (preferably ordered - i.e. host then user for 
this example, so as not to create a factorial explosion)? Note that the caller 
wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for 
this, but I believe facet queries work on a different plane than this 
requirement (narrowing the term count, as opposed to aggregating).

 

For the example above, the desired result would be:

 

machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)

 

machine_2/user_1 (2)

machine_2/user_4 (1)

 

Has anyone had a need for this type of faceting and found a way to achieve it?

 

Many thanks,

Peter

 

 
  

Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
When faced with this type of situation where the data is entirely  
available at index-time, simply create an aggregated field that glues  
the two pieces together, and facet on that.
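
A sketch of that with the host/user example from the original message: have
the indexing client write the glued pair into an extra field (host_user is a
made-up name), e.g.

  <field name="host_user">machine_1/user_1</field>

and then facet on it:

  ...&facet=on&facet.field=host_user

which yields exactly the machine_x/user_y counts Peter listed.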


Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:



[...]




RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi Erik,

 

Thanks for your reply. That's an interesting idea doing it at index-time, and a 
good idea for known field combinations.

The only thing is

How to handle arbitrary field combinations? - i.e. to allow the caller to 
specify any combination of fields at query-time?

So, yes, the data is available at index-time, but the combination isn't (short 
of creating fields for every possible combination).

 

Peter


 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 06:30:27 -0500

 [...]

loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Marc Sturlese

I am testing trunk and have seen a different behaviour when loading
updateProcessors, which I don't know if it's normal (at least with multicore).
Previously I used an updateProcessorChain this way:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="myChain">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

It does not work in current trunk. I have debugged the code and have seen that
the UpdateProcessorChain is now loaded via:

  public <T> T initPlugins(List<PluginInfo> pluginInfos,
                           Map<String, T> registry,
                           Class<T> type, String defClassName) {
    T def = null;
    for (PluginInfo info : pluginInfos) {
      T o = createInitInstance(info, type, type.getSimpleName(), defClassName);
      registry.put(info.name, o);
      if (info.isDefault()) {
        def = o;
      }
    }
    return def;
  }

As I don't have default="true" in the configuration, my custom
processor chain is not used. Setting default="true" makes it work:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="myChain" default="true">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As far as I understand, if you specify the chain you want to use in here:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

it shouldn't be necessary to set it as default.
Is it going to be kept this way?

Thanks in advance






Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Creating values for every possible combination is what you're asking
Solr to do at query-time, and as far as I know there isn't really a
way to accomplish that.  Is the need really to be arbitrary here?


Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:

[...]




Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess default="true" should not be necessary if there is only one
updateRequestProcessorChain specified. Open an issue

On Fri, Jan 29, 2010 at 6:06 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 [...]





-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Well, it wouldn't be 'every' combination - more of 'any' combination at 
query-time.
 
The 'arbitrary' part of the requirement is because it's not practical to 
predict every combination a user might ask for, although generally users would 
tend to search for similar/the same query combinations (but perhaps with 
different date ranges, for example).
 
If 'predicted aggregate fields' were calculated at index-time on, say, 10 
fields (the schema in question actually has 73 fields), that's 3,628,801 new 
fields. A large percentage of these would likely never be used (which ones 
would depend on the user, environment etc.).
 

Perhaps a more 'typical' use case than my network-based example would be a 
product search web page, where you want to show the number of products that are 
made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] 
(15) ). To obtain the (15) facet count value, you would have to correlate the 
number of Sony products (say, (861)), and the products that fall into the [600 
TO 800] price range (say, (1226) ). The (15) would be the intersection of the 
Sony hits and the price range hits by 'manufacturer:Sony'. Am I right that 
filter queries could only do this for document hits if you know the field 
values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The 
facets could then be derived by simply counting the numFound for each result 
set.

 

If there were subsearch support in Solr (i.e. take the output of a query and 
use it as input into another) that included facets [perhaps there is such 
support?], it might be used to achieve this effect.


A custom query parser plugin could work, maybe? I suppose it would need to 
gather up all the separate facets and correlate them according to the input 
query (e.g. host and user, or manufacturer and price range). Such a mechanism 
would be crying out for caching, but perhaps it could leverage the existing 
field and query caches.
 

Peter

 


 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 07:39:44 -0500

 [...]

multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith
Hi,

So over the course of the last two weeks I have been trying to come up with an
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can
have German, English, Italian or French names; people have an additional
firstname field. We also want to do auto suggest on the street and city names
as well as on emails and telephone numbers; as such we are treating phone
numbers as text.

We do have the option for the user to use phonetic searches or to split words
(especially the German compound words), but I guess we will leave that out of
the auto suggest.
We do expect that some users will type in properly cased strings, while some
may just type in all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above-mentioned fields
(name, firstname, city, street, email, telefon) into a new field called "all".
It seems the best approach is to use facet.prefix for our requirements. We will
therefore split off the last term in the query and pass it in as the
facet.prefix, while the rest is passed in as the q parameter.

Since facets are driven out of the index, we will use the following type
definition for this "all" field:
<fieldType name="textplain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

So essentially the idea is to just split on whitespace, remove stop words and 
word delimiters.

The query would then look something like the following if the user entered
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver
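
A minimal sketch of the client-side term splitting described above (Java for
illustration; the variable names are made up, and with dismax an empty q would
need q.alt instead - an assumption worth testing):

  // split the raw input into completed terms (q) and the last,
  // still-being-typed term (facet.prefix)
  String input = "Kaltenreider Ver";
  int lastSpace = input.lastIndexOf(' ');
  String q = lastSpace < 0 ? "" : input.substring(0, lastSpace);
  String prefix = input.substring(lastSpace + 1);
  // request: .../select?defType=dismax&qf=all&q=<q>&rows=0
  //          &facet=on&facet.field=all&facet.mincount=1&facet.prefix=<prefix>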

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad-core machine with 16GB of
RAM, albeit all of that will be shared with Apache, a MySQL slave and a PHP app?
Ah well, questions like that are impossible to answer, so I'm just trying to ask
if you expect this to be really heavy. I noticed in my initial testing with 2M
entries on my laptop that facets seemed to be fine, though the first request was
slow and the memory use spiked to 300MB. But I presume it's just loading stuff
into caches, and concurrent requests shouldn't cause the memory use to go up
linearly.

I am still a bit unsure how to handle both the lowercased and the
case-preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse", and with
"kreu" I would get "kreuzstrasse".
Since I do not expect any words to start with a lowercase letter and still
contain some uppercase letter, we should be fine with this approach.

As in, I doubt there would be stuff like "fooBar", which would lead to
suggesting both "foobar" and "fooBar".

How can I achieve this?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Is optimizing always necessary?

2010-01-29 Thread Marcus Herou
If one only has additions, do I then need to optimize the index at all?

I thought that only updates/deletes created holes in the index. Or should
the index be sorted on disk at all times - is that the reason?

Cheers

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Aggregated facet value counts?

2010-01-29 Thread Erik Hatcher
Sounds like what you're asking for is tree faceting.  A basic  
implementation is available in SOLR-792, but one that could also take  
facet.queries, numeric or date range buckets, to tree on would be a  
nice improvement.


Still, the underlying implementation will simply enumerate all the  
possible values (SOLR-792 has some short-circuiting when the top-level  
has zero, of course).  A client-side application could do this with  
multiple requests to Solr.
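
A sketch of that client-side approach using SolrJ (assuming Solr 1.4's SolrJ
API; the server URL and the host/user field names come from the example
earlier in the thread):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TreeFacetSketch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // pass 1: facet on host
      SolrQuery q1 = new SolrQuery("*:*");
      q1.setRows(0).setFacet(true).addFacetField("host");
      QueryResponse r1 = server.query(q1);

      // pass 2: for each host value, facet on user within that host
      for (FacetField.Count host : r1.getFacetField("host").getValues()) {
        SolrQuery q2 = new SolrQuery("*:*");
        q2.setRows(0).setFacet(true).addFacetField("user");
        q2.addFilterQuery("host:" + host.getName());
        QueryResponse r2 = server.query(q2);
        for (FacetField.Count user : r2.getFacetField("user").getValues()) {
          System.out.println(host.getName() + "/" + user.getName()
              + " (" + user.getCount() + ")");
        }
      }
    }
  }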


Subsearch - sure, just make more requests to Solr, rearranging the  
parameters.


I'd still say that in general, for this type of need, it'll be less arbitrary
- and locking some things in during indexing will be the pragmatic way to go -
for most cases.


Erik



On Jan 29, 2010, at 9:28 AM, Peter S wrote:

[...]

Re: Is optimizing always necessary?

2010-01-29 Thread Wangsheng Mei
In addition to removing the holes in the index, optimization is also used
to merge multiple small index segments into a bigger one.
Although I have not got specific performance data, I can imagine that this
will lead to performance benefits.
Supposing you have thousands of small segments, opening and closing them
again and again would be costly.
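
For reference, an optimize is just another update command - a minimal sketch
(the URL form is an assumption based on the 1.4 update handler, worth
verifying):

  <optimize/>

posted to /update, or equivalently http://localhost:8983/solr/update?optimize=true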

2010/1/30 Marcus Herou marcus.he...@tailsweep.com

 If one only has additions, do I then need to optimize the index at all?

 I thought that only updates/deletes created holes in the index. Or should
 the index be sorted on disk at all times - is that the reason?

 Cheers

 //Marcus

 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/




-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Wangsheng Mei
What's the point of generating your own query?
Are you sure that the Solr query syntax cannot satisfy your need?

2010/1/29 Abin Mathew abin.mat...@toostep.com

 [...]



-- 
梅旺生


Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Document Duplication Detection (Solr 1.4)

 Overview

Preventing duplicate or near duplicate documents from entering an index or
tagging documents with a signature/fingerprint for duplicate field
collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr should natively support deduplication techniques of this
type and allow for the easy addition of new hash/signature implementations.

Goals

   - Efficient, hash based exact/near document duplication detection and
   blocking.
   - Allow for both duplicate collapsing in search results as well as
   deduplication on adding a document.

 Design

Signature

A class capable of generating a signature String from the concatenation of a
group of specified document fields.

public abstract class Signature {
  public void init(SolrParams nl) {
  }

  public abstract String calculate(String content);
}
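
To illustrate the extension point, a toy Signature subclass (a sketch, not
from the wiki page - a real implementation would use a proper low-collision
hash over the full content):

  public class ToySignature extends Signature {
    @Override
    public String calculate(String content) {
      // toy fingerprint: content length mixed with the first character
      // (illustrative only - real implementations hash the whole content)
      int h = content.length() * 31 + (content.length() > 0 ? content.charAt(0) : 0);
      return Integer.toHexString(h);
    }
  }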

Implementations:

MD5Signature

128-bit hash used for exact duplicate detection.

Lookup3Signature

64-bit hash used for exact duplicate detection; much faster than MD5 and
smaller to index.

TextProfileSignature

Fuzzy hashing implementation from Nutch for near-duplicate detection. It's
tunable but works best on longer text.

There are other more sophisticated algorithms for fuzzy/near hashing that
could be added later.

Notes

Adding in the dedupe process will change the allowDups setting so that it
applies to an update Term (with field signatureField in this case) rather
than the unique field Term (of course the signatureField could be the unique
field, but generally you want the unique field to be unique)

When a document is added, a signature will automatically be generated and
attached to the document in the specified signatureField.

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as
part of the UpdateRequestProcessor chain:

Accepting all defaults:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Example settings:

  <!-- An example dedup update processor that creates the id field on the fly
       based on the hash code of some other fields.  This example has
       overwriteDupes set to false since we are using the id field as the
       signatureField and Solr will maintain uniqueness based on that anyway. -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 Note

Also be sure to change your update handlers to use the defined chain, i.e.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

The update processor can also be specified per request with a parameter of
update.processor=dedupe

Settings

signatureClass (default: org.apache.solr.update.processor.Lookup3Signature)
  A Signature implementation for generating a signature hash.

fields (default: all fields)
  The fields to use to generate the signature hash, as a comma-separated list.
  By default, all fields on the document will be used.

signatureField (default: signatureField)
  The name of the field used to hold the fingerprint/signature. Be sure the
  field is defined in schema.xml.

enabled (default: true)
  Enable/disable dedupe factory processing.


-- 
梅旺生


Re: Solr duplicates detection!!

2010-01-29 Thread Wangsheng Mei
Sorry for sending the wrong message - it should have gone to my own mailbox :(

2010/1/30 Wangsheng Mei hairr...@gmail.com

 [...]



-- 
梅旺生


Deleting spell checker index

2010-01-29 Thread darniz

Hello all,
We are using the index-based spell checker.
I was wondering whether, with the help of any URL parameters, we can delete the
spell check index directory.
Please let me know.
Thanks
darniz





Auto Suggest with multiple space separated words

2010-01-29 Thread Nair, Manas
Hi Experts,
 
I need an auto suggest functionality using Solr which gives me the feel of 
using the Firefox browser. In short, if I type in a prefix, the results should 
drop down even if the prefix is not at the start of the drop-down items.
 
Example: If I search for "Lin", then the results could be 
[Abe Lincoln, Lindsay Lohan, Sarah Palin, Gasoline, ...].
 
Please suggest the best approach.
 
Any help is greatly appreciated.
 
Thankyou,
Manas Nair


distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search when a shard goes down, an error is
returned and the search fails. is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: Basic questions about Solr cost in programming time

2010-01-29 Thread Sven Maurmann

Hi!

Of course the answer depends (as usual) very much on the features
you want to realize. But Solr can be set up very fast. When we created
our first prototype, it took us about a week to get it running with
phoneme search, spell checking, faceting - and even collapsing
(using the famous SOLR-236 field-collapsing patch).

It is definitely very nice that you can do a lot of things using the
available components and only configuring them inside solrconfig.xml
and schema.xml.

And you may well start with the standard distribution.

Cheers,
   Sven

--On Tuesday, 26 January 2010 12:00 -0800 Jeff Crump 
jcr...@hq.mercycorps.org wrote:



Hi,
I hope this message is OK for this list.

I'm looking into search solutions for an intranet site built with Drupal.
Eventually we'd like to scale to enterprise search, which would include
the Drupal site, a document repository, and Jive SBS (collaboration
software). I'm interested in Lucene/Solr because of its scalability,
faceted search and optimization features, and because it is free. Our
problem is that we are a non-profit organization with only three very
busy programmers/sys admins supporting our employees around the world.

To help me argue for Solr in terms of total cost, I'm hoping that members
of this list can share their insights about the following:

* About how many hours of programming did it take you to set up your
instance of Lucene/Solr (not counting time spent on optimization)?

* Are there any disadvantages of going with a certified distribution
rather than the standard distribution?


Thanks and best regards,
Jeff

Jeff Crump
jcr...@hq.mercycorps.org


RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Tree faceting - that sounds very interesting indeed. I'll have a look into that 
and see how it fits, as well as any improvements for adding facet queries, 
cross-field aggregation, date ranges etc. There could be some very nice 
use cases for such functionality. Just wondering how this would work with 
distributed shards/multi-core...


Many Thanks! 

Peter

 

 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 12:20:07 -0500

 [...]

sort items by whether the user has viewed it or not

2010-01-29 Thread a8910b-solr
hi,

i want to query for documents that have certain values, but i want it first 
sorted by documents that this person has viewed in the past.  i can't store 
each user's view information in the document, so i want to pass that in to the 
search.  is it possible to do something like this:

http://solr?q=baseball&sort=doc_isbn(ABC or DEF or GHI) desc, title desc

any help is appreciated,
r



Re: loading an updateProcessorChain with multicore in trunk

2010-01-29 Thread Chris Hostetter


: I guess . default=true should not be necessary if there is only one
: updateRequestProcessorChain specified . Open an issue

No... that doesn't seem right.  If you declare your own chains, but you 
don't mark any of them as default="true", then it shouldn't matter how many 
of them you declare; SolrCore should create a default for you.


The real question here is: why isn't he getting his explicitly defined 
chain when he references it by name?


declaring that he wants his explicitly named chain to be the default is 
fine, and that should work, but w/o declaring it as the default he should 
still be able to ask for it by name ... why isn't that working? ...


:  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
:    <lst name="defaults">
:      <str name="update.processor">myChain</str>
:    </lst>

Marc, can you confirm that when you don't declare your chain as 
default="true" that...
1) an instance of your CustomUpdateProcessorFactory is actually getting 
instantiated by solr (via logging or running in a debugger)
2) whether your custom chain is used if you pass update.processor=myChain 
as a request param instead of relying on the configured defaults


(I wonder if some handler refactoring caused the default 
processing logic to no longer respect the defaults)




-Hoss

Re: update doc success, but could not find the new value

2010-01-29 Thread Chris Hostetter

: Subject: update doc success, but could not find the new value
: In-Reply-To: 449216.59315...@web56308.mail.re3.yahoo.com
: References: 27335403.p...@talk.nabble.com
: 449216.59315...@web56308.mail.re3.yahoo.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



RE: Solr wiki link broken

2010-01-29 Thread Chris Hostetter

: Why don't we change the links to have FrontPage explicitly?
: Wouldn't it be the easiest fix unless there are numerous
: other pages that references the default page w/o FrontPage?

I'm fairly confident that there are more links pointing to 
http://wiki.apache.org/solr/ than there are alternate versions in 
different languages ... particularly when you start factoring in all of 
the webpages in the world that we don't have the ability to edit 
directly.


-Hoss



Re: NullPointerException in ReplicationHandler.postCommit + question about compression

2010-01-29 Thread Chris Hostetter

: never keep a <str name="maxOptimizedCommitsToKeep">0</str>.
: 
: It is better to not mention the deletionPolicy at all. The
: defaults are usually fine.

if setting the keep values to 0 results in NPEs, we should do one (if not 
both) of the following...

1) change the init code to warn/fail if the values are 0 (not sure if 
there is ever a legitimate use for 0 as a value)

2) change the code that's currently throwing an NPE to check its 
assumptions and log a more meaningful error if it can't function because of 
the existing config.


-Hoss



RE: How to Implement SpanQuery in Solr . . ?

2010-01-29 Thread Chris Hostetter

: and Solr. I was hoping to start by getting a simple example working in SOLR
: and then iterate towards the more complex, given this is my first attempt at
: extending Solr.

wise choice.

: For my first iteration of SpanQuery in Solr I am thinking of starting with a
: simple syntax to combine:

...honestly: since you already mentioned that you might eventually want to 
integrate Qsol, i would suggest you start with that directly.  that way 
you are taking an existing parser (that you evidently understand) and just 
hooking it in via the QParser abstraction (as opposed to writing a 
Lucene String-to-Query parser *and* learning the QParser/Solr internals).

: implementation on the Lucene side and the FooQParserPlugin as a reference
: implementation on the SOLR side?

The FooQParserPlugin is fairly primitive and doesn't really make obvious 
some of the things you can do with a QParser, so you may also want 
to skim the LuceneQParserPlugin as well

: The other part of the riddle I would really appreciate some guidance on is
: how to get it to plug-in to SOLR correctly?

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin
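
To make the plumbing concrete, here is a rough, untested sketch of a 
span-flavored QParserPlugin (the package, class name, "f" param, and the 
fixed slop of 5 are all made up for illustration):

  package com.example;

  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class SimpleSpanQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        public Query parse() throws ParseException {
          // naive: treat the raw query string as whitespace-separated
          // terms that must appear within 5 positions of each other, in order
          String field = getParam("f");  // field name passed as a param
          String[] words = getString().split("\\s+");
          SpanQuery[] clauses = new SpanQuery[words.length];
          for (int i = 0; i < words.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, words[i]));
          }
          return new SpanNearQuery(clauses, 5, true);
        }
      };
    }
  }

registered in solrconfig.xml with

  <queryParser name="simplespan" class="com.example.SimpleSpanQParserPlugin"/>

and invoked as q={!simplespan f=body}big red dog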


-Hoss



Re: Solr Cache Viewing/Browsing

2010-01-29 Thread Chris Hostetter
: used in a modified DisMaxHandler) and I was wondering if there is a way to
: get at this data from the JSP pages? I then thought that it might be nice to
: view more information about the respective caches like the current elements,
: recently evicted etc to help debug performance issues. Has anyone worked on
: this or have any ideas surrounding this?

I don't believe anyone has looked into this.

It would be hard to implement in a generic manner since the SolrCache API 
doesn't provide any mechanism for inspecting the contents, but you could 
write an implementation that exposes some of these things through the 
getStatistics method (or some other new introspection-based API)



-Hoss



Re: replication setup

2010-01-29 Thread Chris Hostetter

: Subject: replication setup
: In-Reply-To: 83ec2c9c1001260724t110d6595m5071e0a40e1b1...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: Analysis tool vs search query

2010-01-29 Thread Chris Hostetter

: I've run into this issue that I have no way of resolving, since the analysis
: tool doesn't show me there is an error. I copy the exact field value into
: the analysis tool and i type in the exact query request i'm issuing and the
: tool finds it a match. However running the query with that exact same

the analysis tool doesn't do query parsing .. so pasting a *query* 
string into the analysis tool isn't going to give you any meaningful 
information.

what the query section of the analysis tool lets you do is see what the 
query time analyzer (that is used by most query parsers at query time) 
will do with your input ... but the QueryParser is still in control, and 
it decides which input to pass to your analyzer -- special characters 
(like whitespace) have meaning to most query parsers before they ever 
have a chance of getting passed to the analyzer.

: <tokenizer class="solr.KeywordTokenizerFactory"/>

A keyword tokenizer results in a single token for each input string, but 
the (default) query parser is going to chunk the input up on whitespace 
before the analyzer is ever invoked, unless you put it in a quoted string.
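
For example, with a KeywordTokenizer field (field name made up):

  q=category:best buy      ... the parser splits on the space first
  q=category:"best buy"    ... the whole phrase reaches the analyzer intact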


-Hoss



Re: using bq with standard request handler

2010-01-29 Thread Chris Hostetter

: I am using a query like:
: 
: http://localhost:8080/solr/select?q=product_category:Grocery&fq=in_stock:true&debugQuery=true&sort=display_priority+desc,prop_l_price+asc
...
: Is it possible to use display_priority/price fields in bq itself to acheive
: the same result.. I tried forming some queries that but was unable to get
: desired results...

bf and bq are features of the dismax parser, so the default query parser 
won't use them -- it really wouldn't even make sense as a possible 
new feature, because the types of queries that might be specified using 
the lucene QParser are too broad to be able to define a consistent 
mechanism for knowing how/where to add the boosting queries to the 
structure.

if all of your queries have that identical structure, you might, however, 
consider something like...

http://localhost:8080/solr/select?qf=product_category&q=Grocery&fq=in_stock:true&bq=...


-Hoss



Re: Mail config

2010-01-29 Thread Chris Hostetter

: I do not want to receive all the emails from this mail list, I only want to
: receive the answers to my questions, is this possible?

That's not how mailing lists work.  If you want to participate in the 
community, you have to participate fully.

: If I am not mistaken when I unsubscribed I sent an email which did not reach
: the mail list at all (therefore there was of course no chance to get any
: replies).

The same mechanism that prevents you from posting when you are not 
subscribed is the mechanism that prevents thousands of spam messages from 
getting sent to the list every day .. you have to take the bad with the 
good.

: I am newbie for Solr and I doubt I can contribute much by answering to other
: posts.

But you can learn from those posts, and the discussion/responses they 
stimulate...
http://people.apache.org/~hossman/#private_q




-Hoss



Re: Lock problems: Lock obtain timed out

2010-01-29 Thread Chris Hostetter

: Can anyone think of a reason why these locks would hang around for more than
: 2 hours?
: 
: I have been monitoring them and they look like they are very short lived.

Typically the lock files are only left around for more than a few seconds 
when there was a fatal crash of some kind ... an OOM Error for example, or, 
as already mentioned in this thread...

:SEVERE: java.io.IOException: No space left on device

...if you check your solr logs for messages in the immediate time frame 
following the lastModified time of the lock file, you'll probably find 
something interesting.


-Hoss



Re: scenario with FQ parameter

2010-01-29 Thread Chris Hostetter

:    qf=field1^10 field2^20 field^100&fq=*:9+OR+(field1:xyz)
...
: I know I can use copy field (say 'text') to copy all the fields and then
...
: but doing so , the boost weights specified in the 'qf' field have no effect
: on the score.

An FQ never has any impact on the score, so your question is a bit 
confusing.

If you want to influence the scores, you'll need to use bq instead of 
fq.

as discussed in another current thread on this list, it's possible to make 
the bq param use the dismax parser as well, but there are some tricky 
issues involved with that ... unless your use case is actually more 
complicated than you are describing, you should probably just use 
something like...

...qf=field1^10+field2^20+field^100&bq=field1:9^10+field2:9^20+field:9^100+field1:xyz

-Hoss



Re: How can I boost bq in FieldQParserPlugin?

2010-01-29 Thread Chris Hostetter

: q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=}&qq=12345&qt=dismax&debugQuery=on
: 
: I try to debug the above query, it turned out to be as:
: +DisjunctionMaxQuery((content:ipod | title:ipod^4.0)~0.01) ()
: +DisjunctionMaxQuery((userId:12345^0.5)~0.01)

...hmmm, i'm not sure why that's happening, but it certainly seems like a 
bug -- i just have no idea what that bug is.  

the inner dismax parser should definitely be producing a query where the 
DisjunctionMaxQuery for 12345 is mandatory but that mandatory clause 
should be wrapped inside of another boolean query which should be added to 
the outermost query as an optional clause.

somewhere that BooleanQuery produced by the inner dismax parser is getting 
thrown away ... hmmm, actually this is a necessary behavior of
DismaxQParser for some cases (it sheds its own outermost 
BooleanQuery when not needed), but in this case it's screwing you because 
it doesn't realize you really do need it.

does this work better? ...

q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=*:*^0}&qq=12345&qt=dismax&debugQuery=on

...it's kind of kludgy, but it should guarantee you that the wrapping 
BooleanQuery is preserved.



-Hoss



Re: Large Query Strings and performance

2010-01-29 Thread Chris Hostetter

: I am using Solr 1.4 with large query strings with 20+ terms and faceting on
: a single multi-valued field in a 1 million record system. I am using Solr to
: categorize text; that's why the query strings are big.
: 
: The performance gets worse the more search terms are used.  Is there any

can you elaborate more on the types of query strings you are using? ... 
are they simply BooleanQueries consisting of many terms? ... are they all 
optional?

We have to understand your goal, what exactly you are currently doing, and 
what exactly you have already tried before we can suggest ways of 
achieving your goal faster than the things you've already tried.



-Hoss



Re: Master Read Timeout

2010-01-29 Thread Chris Hostetter

: Is there any way to increase the Slave's timeout value? Are there any 

http://wiki.apache.org/solr/SolrReplication?highlight=%28timeout%29
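
If i'm remembering the wiki correctly, the relevant slave-side settings 
look like this (values are in milliseconds; the master URL is made up):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master:8983/solr/replication</str>
      <str name="httpConnTimeout">5000</str>
      <str name="httpReadTimeout">10000</str>
    </lst>
  </requestHandler>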


-Hoss



RE: matching exact/whole phrase

2010-01-29 Thread Chris Hostetter

: Is it safe to say in order to do exact matches the field should be a string.

It depends on your definition of "exact".

If you want exact matches, including unicode codepoints and 
leading/trailing whitespace, then StrField would probably make sense -- 
but you could equally use TextField with a KeywordTokenizer and nothing 
else.

If you want *some* normalization (ie: trim leading/trailing whitespace, 
map equivalent codepoints to a canonical representation, etc...) then you 
need TextField.
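
A sketch of that second option (the type name and the particular filters 
are just one illustrative normalizing chain):

  <fieldType name="text_exactish" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>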

: Now in my dismax handler if i have the qf defined as a text field and run a
: phrase search on the text field
: "my car is the best car in the world"
: i don't get back any results. looking with debugQuery=on this is the
: parsedQuery
: text:my tire pressure warning light came my honda civic
: This will not work since text was indexed by removing all stop words.

it *can* work if the query analyzer for your text field type is also 
configured to remove stopwords, and if you either: configure the 
StopFilter(s) to deal with token positions in the way the parser expects 
(i forget which one works, you have to play with it); OR use a qs (query 
slop) value that gives you enough slop to miss those empty stop word gaps.
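
e.g. (slop value picked arbitrarily):

  ...&q="my car is the best car in the world"&qs=3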


-Hoss



Re: Deleting spelll checker index

2010-01-29 Thread Chris Hostetter

: We are using Index based spell checker.
: i was wondering with the help of any url parameters can we delete the spell
: check index directory.

I don't think so.

You might be able to configure two different spell check components that 
point at the same directory -- one that builds off of a real field, and one 
that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
then you could trigger a rebuild of an empty spell checking index using 
the second component.

But i've never tried it so i have no idea if it would work.
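
Something like this, perhaps (totally untested; the dictionary names, 
field, and paths are made up):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">wipe</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="sourceLocation">empty.txt</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
  </searchComponent>

...and then a request with 
&spellcheck=true&spellcheck.dictionary=wipe&spellcheck.build=true would 
rebuild the shared index from the empty source.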


-Hoss



DataImportHandler multivalued field Collection<String> not working

2010-01-29 Thread Jason Rutherglen
DataImportHandler multivalued field Collection<String> isn't
working the way I'd expect, meaning not at all. I logged that the
collection is there, however the multivalued collection field
just isn't being indexed (according to the DIH web UI, and it's
not in the index).


Re: DataImportHandler multivalued field Collection<String> not working

2010-01-29 Thread Wangsheng Mei
Did you correctly set multiValued (note the capital V) to "true" in schema.xml?
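
e.g. (field name and type made up):

  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>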

2010/1/30 Jason Rutherglen jason.rutherg...@gmail.com

 DataImportHandler multivalued field Collection<String> isn't
 working the way I'd expect, meaning not at all. I logged that the
 collection is there, however the multivalued collection field
 just isn't being indexed (according to the DIH web UI, and it's
 not in the index).




-- 
梅旺生


Re: Newbie Question on Custom Query Generation

2010-01-29 Thread Abin Mathew
Hi, I realized the power of the Dismax Query Handler recently, and now I
don't need to generate my own query since Dismax is giving better
results. Thanks a lot

2010/1/29 Wangsheng Mei hairr...@gmail.com:
 What's the point of generating your own query?
 Are you sure that solr query syntax cannot satisfy your need?

 2010/1/29 Abin Mathew abin.mat...@toostep.com

 Hi I want to generate my own customized query from the input string entered
 by the user. It should look something like this

 *Search field : Microsoft*
 *
 Generated Query*  :
 description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0
 role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0
 title:microsoft^3.5 functionalArea:microsoft

 *The lucene code we used is like this*
 BooleanQuery must = new BooleanQuery();

 addToBooleanQuery(must, "tags", inputData, synonymAnalyzer, 1.5f);
 addToBooleanQuery(must, "title", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "role", inputData, synonymAnalyzer);
 addToBooleanQuery(query, "description", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "requirement", inputData, synonymAnalyzer);
 addToBooleanQuery(must, "company", inputData, standardAnalyzer);
 addToBooleanQuery(must, "city", inputData, standardAnalyzer);
 must.setBoost(5.0f);
 query.add(must, Occur.MUST);
 addToBooleanQuery(query, "tags", includeAll, synonymAnalyzer, 2.0f);
 addToBooleanQuery(query, "title", includeAll, synonymAnalyzer, 3.5f);
 addToBooleanQuery(query, "functionalArea", inputData, synonymAnalyzer);
 *
 In Simple english*
 addToBooleanQuery will add the particular field to the query after
 analysing it using the analyser mentioned and setting a boost as specified.
 So there MUST be a keyword match with any of the fields
 tags, title, role, description, requirement, company, city, and it SHOULD
 occur in the fields tags, title and functionalArea.

 Hope you have got an idea of my requirement. I am not asking anyone to do
 it for me. Please let me know where I can start and give me some useful
 tips to move ahead with this. I believe that it has to do with modifying
 the XML configuration file and setting the parameters in the Dismax
 handler. But I am still not sure. Please help

 Thanks & Regards
 Abin Mathew




 --
 梅旺生



Looking for a Solr volunteer for www.comics.org

2010-01-29 Thread Henry Andrews
Hi folks,
  I apologize if this isn't the right place to post this (alternate suggestions 
welcome alongside appropriate chastisement :-)

  I'm trying to recruit a volunteer to implement a Solr-based search system for 
the Grand Comic-Book Database (http://www.comics.org/).  We're a non-profit, 
non-commercial, international group researching and indexing comic books, and 
we have only two active programmers (we're both unpaid volunteers, as are all 
GCD personnel).  We'd love to have better search, and Solr looks like the right 
tool, but we're swamped with other technical work.

  So if anyone reading this would like to help out a comic book-related web 
site with their Solr experience, for absolutely no monetary compensation 
whatsoever, do please let me know :-D  It would help to be into comic books, 
but that's not strictly required.  Your work would be used quite heavily, and 
you could of course point that out to anyone you might wish to impress with 
your expertise.  Our technical work is open-source, and therefore available for 
inspection and showing off.

  To clarify:  I'm not looking for assistance with or pointers about setting 
Solr up myself (no matter how easy it is).  And I'm not trying to get the list 
as a whole to do our work for us.  I'm just trying to find out if any individual 
feels like joining our tech team and volunteering for the project, and I couldn't 
think of a more likely place to find candidates than here.  If we don't find a 
volunteer, I'll end up doing it next year, and I'll be reading a lot more 
documentation before asking any questions here.

thanks,
-henry



Re: Deleting spelll checker index

2010-01-29 Thread darniz

Then I assume the easiest way is to delete the directory itself.

darniz


hossman wrote:
 
 
 : We are using Index based spell checker.
 : i was wondering with the help of any url parameters can we delete the
 spell
 : check index directory.
 
 I don't think so.
 
 You might be able to configure two different spell check components that 
 point at the same directory -- one that builds off of a real field, and one 
 that builds off of an (empty) text field (using FileBasedSpellChecker) .. 
 then you could trigger a rebuild of an empty spell checking index using 
 the second component.
 
 But i've never tried it so i have no idea if it would work.
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27381620.html
Sent from the Solr - User mailing list archive at Nabble.com.