Re: Delete from Solr index...
I am looking for the following solutions in C#. Please provide sample code if possible:

1. Delete all documents from the index using a delete query.
2. Take a backup of the old index before regenerating it.
3. Write an "unlike" (negated) query on a field to delete stale documents.
4. Use a transaction during index generation (delete all old documents, then regenerate the index), so that if any error occurs the old index is not affected.

ryantxu wrote:

escher2k wrote:

I am trying to remove documents from my index using delete by query. However, when I did this, the deleted items seemed to remain. This is the format of the XML file I am using:

<delete><query>load_id:20070424150841</query></delete>
<delete><query>load_id:20070425145301</query></delete>
<delete><query>load_id:20070426145301</query></delete>
<delete><query>load_id:20070427145302</query></delete>
<delete><query>load_id:20070428145301</query></delete>
<delete><query>load_id:20070429145301</query></delete>

When I do the deletes individually (i.e. put each of the above in a separate file), it seems to work. Does this mean that each delete query request has to be executed separately?

Correct, delete (unlike add) only accepts one command.

Just to note, if load_id is your unique key, you could also use:

<delete><id>20070424150841</id></delete>

This will give you better performance and does not commit the changes until you explicitly send <commit/>.

-- View this message in context: http://old.nabble.com/Delete-from-Solr-index...-tp10264940p27369849.html Sent from the Solr - User mailing list archive at Nabble.com.
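Since each delete message carries only one command, a client has to produce one message per query and then a separate commit. The following is a minimal sketch of that idea in Java (not the C# the poster asked for, and not an official client). The class and method names are hypothetical, and it only builds the XML payloads; posting them to the update handler is left out. The match-all query *:* used for "delete everything" is standard Solr query syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds Solr XML update messages as plain strings.
public class DeleteMessages {

    /** Escapes the characters that are special inside XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** One delete-by-query message; send each of these as its own request. */
    static String deleteByQuery(String query) {
        return "<delete><query>" + escapeXml(query) + "</query></delete>";
    }

    /** One delete-by-id message, usable when the value is the unique key. */
    static String deleteById(String id) {
        return "<delete><id>" + escapeXml(id) + "</id></delete>";
    }

    /** One message per load_id, followed by an explicit commit. */
    static List<String> messagesFor(List<String> loadIds) {
        List<String> messages = new ArrayList<>();
        for (String loadId : loadIds) {
            messages.add(deleteByQuery("load_id:" + loadId));
        }
        messages.add("<commit/>");
        return messages;
    }

    public static void main(String[] args) {
        for (String m : messagesFor(Arrays.asList("20070424150841", "20070425145301"))) {
            System.out.println(m);
        }
        // Deleting every document uses the match-all query.
        System.out.println(deleteByQuery("*:*"));
    }
}
```

Each string would be posted separately to the update handler; nothing becomes visible to searchers until the commit message is sent.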
Re: Querying for multi-term phrases only . . .
You can avoid one-word terms by setting outputUnigrams="false" in the ShingleFilterFactory configuration.

Erik

On Jan 28, 2010, at 11:29 PM, Christopher Ball wrote:

I am curious how I can query for multi-term phrases using the TermsComponent. The field I am searching has been shingled, so it contains 2- and 3-word phrases. For example, in the sample results below I want to get back only multi-word phrases such as "table of contents" and "under the", but not single-word terms such as "year" and "significant":

<int name="table of contents">25302</int>
<int name="including">25162</int>
<int name="year">25097</int>
<int name="significant">17501</int>
<int name="under the">17359</int>

Appreciate any ideas,
Christopher
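To see what the outputUnigrams switch changes, here is a small self-contained Java sketch of shingling. It is an illustration only, not Lucene's ShingleFilter; the class name, method name and parameters are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy shingler, not the Lucene implementation.
public class ShingleSketch {

    /**
     * Emits word n-grams of 2..maxSize tokens starting at every position.
     * Single tokens are emitted only when outputUnigrams is true.
     */
    static List<String> shingles(List<String> tokens, int maxSize, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (outputUnigrams) {
                out.add(tokens.get(start));
            }
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("table", "of", "contents");
        System.out.println(shingles(tokens, 3, true));
        System.out.println(shingles(tokens, 3, false));
    }
}
```

With unigrams switched off, terms such as "year" never reach the index for that field, so the TermsComponent has nothing single-word to return.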
Re: Newbie Question on Custom Query Generation
dismax won't quite give you the same query result. What you can do pretty easily, though, is create a QParser and QParserPlugin pair, register it in solrconfig.xml, and then use defType=<name registered>. Pretty straightforward. Have a look at Solr's various QParserPlugin implementations for details.

Erik

On Jan 29, 2010, at 12:30 AM, Abin Mathew wrote:

Hi,

I want to generate my own customized query from the input string entered by the user. It should look something like this:

*Search field: Microsoft*

*Generated query*: description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0 role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The Lucene code we used is like this*

    BooleanQuery must = new BooleanQuery();
    addToBooleanQuery(must, tags, inputData, synonymAnalyzer, 1.5f);
    addToBooleanQuery(must, title, inputData, synonymAnalyzer);
    addToBooleanQuery(must, role, inputData, synonymAnalyzer);
    addToBooleanQuery(query, description, inputData, synonymAnalyzer);
    addToBooleanQuery(must, requirement, inputData, synonymAnalyzer);
    addToBooleanQuery(must, company, inputData, standardAnalyzer);
    addToBooleanQuery(must, city, inputData, standardAnalyzer);
    must.setBoost(5.0f);
    query.add(must, Occur.MUST);
    addToBooleanQuery(query, tags, includeAll, synonymAnalyzer, 2.0f);
    addToBooleanQuery(query, title, includeAll, synonymAnalyzer, 3.5f);
    addToBooleanQuery(query, functionalArea, inputData, synonymAnalyzer);

*In simple English*

addToBooleanQuery adds the given field to the query after analysing the input with the given analyzer and setting the specified boost.

So there MUST be a keyword match in any of the fields tags, title, role, description, requirement, company, city, and it SHOULD occur in the fields tags, title and functionalArea.

I hope you have got an idea of my requirement. I am not asking anyone to do it for me. Please let me know where I can start, and give me some useful tips to move ahead with this. I believe that it has to do with modifying the XML configuration file and setting the parameters in the Dismax handler, but I am still not sure.

Please help.

Thanks & Regards
Abin Mathew
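Before writing a QParserPlugin, it can help to pin down the exact query string the parser is supposed to produce. The sketch below assembles the "Generated query" from the message above as plain text. It is an illustration under assumptions: the class and method names are hypothetical, lowercasing stands in for the real analyzers, and the input is not escaped for query syntax.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the query string from the message.
public class QueryStringSketch {

    /** Formats one field:term clause, appending ^boost only when a boost is given. */
    static String clause(String field, String term, Float boost) {
        String c = field + ":" + term;
        return boost == null ? c : c + "^" + boost;
    }

    /** Builds: optional description clause, one required boosted group, optional extras. */
    static String build(String userInput) {
        // Lowercasing is a stand-in for analysis; no synonym expansion happens here.
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        String must = String.join(" ",
                clause("tags", term, 1.5f),
                clause("title", term, 3.0f),
                clause("role", term, null),
                clause("requirement", term, null),
                clause("company", term, null),
                clause("city", term, null));
        return String.join(" ",
                clause("description", term, null),
                "+((" + must + ")^5.0)",
                clause("tags", term, 2.0f),
                clause("title", term, 3.5f),
                clause("functionalArea", term, null));
    }

    public static void main(String[] args) {
        System.out.println(build("Microsoft"));
    }
}
```

A custom QParser would build the equivalent BooleanQuery objects directly rather than a string, but the clause structure is the same.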
Aggregated facet value counts?
Hi,

I was wondering if anyone had come across this use case, and if this type of faceting is possible:

The requirement is to build a query such that an aggregated facet count of common (and'ed) field values forms the basis of each returned facet count.

For example, let's say I have a number of documents in an index with, among others, the fields 'host' and 'user':

Doc1 host:machine_1 user:user_1
Doc2 host:machine_1 user:user_2
Doc3 host:machine_1 user:user_1
Doc3 host:machine_1 user:user_1
Doc4 host:machine_2 user:user_1
Doc5 host:machine_2 user:user_1
Doc6 host:machine_2 user:user_4
Doc7 host:machine_1 user:user_4

Is it possible to get facets back that would give the count of documents that have common host AND user values (preferably ordered, i.e. host then user for this example, so as not to create a factorial explosion)? Note that the caller wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for this, but I believe facet queries work on a different plane than this requirement (narrowing the term count, a.o.t. aggregating).

For the example above, the desired result would be:

machine_1/user_1 (3)
machine_1/user_2 (1)
machine_1/user_4 (1)
machine_2/user_1 (2)
machine_2/user_4 (1)

Has anyone had a need for this type of faceting and found a way to achieve it?

Many thanks,
Peter
Re: Aggregated facet value counts?
When faced with this type of situation where the data is entirely available at index-time, simply create an aggregated field that glues the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote: [...]
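The index-time approach can be sketched without Solr at all: glue the two values into one key per document and count the keys. The Java below is only an in-memory illustration with hypothetical names. In a real setup the glued value would be written to its own field (for example a field named host_user, which is an assumption here) and faceted on with facet.field.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of faceting on a glued host/user value.
public class CombinedFacetSketch {

    // The eight example rows from the original question, as {host, user}.
    static final String[][] DOCS = {
        {"machine_1", "user_1"}, {"machine_1", "user_2"},
        {"machine_1", "user_1"}, {"machine_1", "user_1"},
        {"machine_2", "user_1"}, {"machine_2", "user_1"},
        {"machine_2", "user_4"}, {"machine_1", "user_4"},
    };

    /** Counts documents per glued "host/user" key, sorted by key. */
    static Map<String, Integer> count(String[][] docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] doc : docs) {
            counts.merge(doc[0] + "/" + doc[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : count(DOCS).entrySet()) {
            System.out.println(e.getKey() + " (" + e.getValue() + ")");
        }
    }
}
```

The counts match the desired result listed in the question, with one facet value per host/user pair.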
RE: Aggregated facet value counts?
Hi Erik,

Thanks for your reply. That's an interesting idea, doing it at index-time, and a good idea for known field combinations. The only question is how to handle arbitrary field combinations, i.e. how to allow the caller to specify any combination of fields at query-time. So, yes, the data is available at index-time, but the combination isn't (short of creating fields for every possible combination).

Peter

From: erik.hatc...@gmail.com To: solr-user@lucene.apache.org Subject: Re: Aggregated facet value counts? Date: Fri, 29 Jan 2010 06:30:27 -0500 [...]
loading an updateProcessorChain with multicore in trunk
I am testing trunk and have seen a different behaviour when loading updateProcessors, which I don't know is normal (at least with multicore).

Previously I used an updateProcessorChain this way:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

It does not work in current trunk. I have debugged the code and seen that the UpdateProcessorChain is now loaded via:

public <T> T initPlugins(List<PluginInfo> pluginInfos, Map<String, T> registry, Class<T> type, String defClassName) {
  T def = null;
  for (PluginInfo info : pluginInfos) {
    T o = createInitInstance(info, type, type.getSimpleName(), defClassName);
    registry.put(info.name, o);
    if (info.isDefault()) {
      def = o;
    }
  }
  return def;
}

As I don't have default="true" in the configuration, my custom processor chain is not used. Setting default="true" makes it work:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="myChain" default="true">
  <processor class="org.apache.solr.update.processor.CustomUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As far as I understand, if you specify the chain you want to use here:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">myChain</str>
  </lst>
</requestHandler>

it shouldn't be necessary to set it as default. Is it going to be kept this way?

Thanks in advance

-- View this message in context: http://old.nabble.com/loading-an-updateProcessorChain-with-multicore-in-trunk-tp27371375p27371375.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Aggregated facet value counts?
Creating values for every possible combination is what you're asking Solr to do at query-time, and as far as I know there isn't really a way to accomplish that like you're asking. Is the need really to be arbitrary here?

Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote: [...]
Re: loading an updateProcessorChain with multicore in trunk
I guess default="true" should not be necessary if there is only one updateRequestProcessorChain specified. Open an issue.

On Fri, Jan 29, 2010 at 6:06 PM, Marc Sturlese marc.sturl...@gmail.com wrote: [...]

-- - Noble Paul | Systems Architect| AOL | http://aol.com
RE: Aggregated facet value counts?
Well, it wouldn't be 'every' combination, more 'any' combination at query-time. The 'arbitrary' part of the requirement is because it's not practical to predict every combination a user might ask for, although generally users would tend to search for similar or identical query combinations (but perhaps with different date ranges, for example).

If 'predicted aggregate fields' were calculated at index-time on, say, 10 fields (the schema in question actually has 73 fields), that's 3,628,801 new fields. A large percentage of these would likely never be used (which ones would depend on the user, environment etc.).

Perhaps a more 'typical' use case than my network-based example would be a product search web page, where you want to show the number of products that are made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] (15)). To obtain the (15) facet count value, you would have to correlate the number of Sony products (say, (861)) and the products that fall into the [600 TO 800] price range (say, (1226)). The (15) would be the intersection of the Sony hits and the price range hits by 'manufacturer:Sony'.

Am I right that filter queries could only do this for document hits if you know the field values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could then be derived by simply counting the numFound for each result set.

If there were subsearch support in Solr (i.e. take the output of a query and use it as input into another) that included facets [perhaps there is such support?], it might be used to achieve this effect.

A custom query parser plugin could work, maybe? I suppose it would need to gather up all the separate facets and correlate them according to the input query (e.g. host and user, or manufacturer and price range). Such a mechanism would be crying out for caching, but perhaps it could leverage the existing field and query caches.
Peter

From: erik.hatc...@gmail.com To: solr-user@lucene.apache.org Subject: Re: Aggregated facet value counts? Date: Fri, 29 Jan 2010 07:39:44 -0500 [...]
multi term, multi field, auto suggest
Hi,

Over the course of the last two weeks I have been trying to come up with an optimal solution for auto suggest in the project I am currently working on. In the application we have names of people and companies. The companies can have German, English, Italian or French names. People have an additional firstname field. We also want to do auto suggest on the street and city names as well as on emails and telephone numbers; as such we are treating phone numbers as text.

We do have the option for the user to use phonetic searches or to split words (especially compound German words), but I guess we will leave that out of the auto suggest. We do expect that some users will type in properly cased strings, while some may just type in all lowercase. We are using the dismax defType for our normal queries. There will probably be fewer than 20M entities.

As such I guess the best approach is to copy all of the above-mentioned fields (name, firstname, city, street, email, telefon) into a new field called all. It seems the best approach for our requirements is to use facet.prefix. We will therefore split off the last term in the query and pass it in as the facet.prefix, while the rest is passed in as the q parameter. Since facets are driven out of the index, we will use the following type definition for this all field:

<fieldType name="textplain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

So essentially the idea is to just split on whitespace, remove stop words and handle word delimiters.

The query would then look something like the following if the user entered "Kaltenreider Ver":

http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver

Does this approach make sense so far? Do you expect this to perform decently on a dual quad core machine with 16 GB of RAM, albeit all of that will be shared with Apache, a MySQL slave and a PHP app? Ah well, questions like that are impossible to answer, so I am just trying to ask whether you expect this to be really heavy. I noticed in my initial testing with 2M entities on my laptop that facets seemed to be fine, though the first request was slow and the memory use spiked to 300 MB. But I presume it's just loading stuff into cache, and concurrent requests shouldn't cause the memory use to go up linearly.

I am still a bit unsure how to handle both the lowercased and the case-preserved version. Here are some examples:

UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse", and with "kreu" I would get "kreuzstrasse". Since I do not expect any words to start with a lowercase letter and still contain some uppercase letter, we should be fine with this approach. That is, I doubt there would be terms like "fooBar", which would lead to suggesting both "foobar" and "fooBar".

How can I achieve this?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
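The split described above (last term becomes facet.prefix, the rest becomes q) can be sketched as a small URL builder. This is a client-side illustration only; the class and method names are hypothetical, and falling back to q.alt=*:* when the user has typed a single term is an assumption of this sketch rather than something stated in the message.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client-side helper for the facet.prefix auto-suggest request.
public class SuggestUrlSketch {

    /** Returns {leading terms, last term}; leading terms are empty for a single word. */
    static String[] splitInput(String input) {
        String trimmed = input.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            return new String[] {"", trimmed};
        }
        return new String[] {trimmed.substring(0, lastSpace).trim(), trimmed.substring(lastSpace + 1)};
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds the select URL with rows=0 so that only facet counts come back. */
    static String buildUrl(String selectUrl, String input) {
        String[] parts = splitInput(input);
        StringBuilder url = new StringBuilder(selectUrl).append("?defType=dismax&qf=all");
        if (parts[0].isEmpty()) {
            // Assumption: with no leading terms, match everything via dismax's q.alt.
            url.append("&q.alt=").append(encode("*:*"));
        } else {
            url.append("&q=").append(encode(parts[0]));
        }
        url.append("&rows=0&facet=on&facet.field=all&facet.limit=10&facet.mincount=1");
        url.append("&facet.prefix=").append(encode(parts[1]));
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/core0/select", "Kaltenreider Ver"));
    }
}
```

Because the prefix is matched against indexed terms, whatever casing the analyzer leaves in the index is the casing the prefix has to match.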
Is optimizing always necessary?
If one only has additions, do I then need to optimize the index at all? I thought that only updates and deletes created holes in the index. Or should the index be sorted on disk at all times; is that the reason?

Cheers
//Marcus

-- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/
Re: Aggregated facet value counts?
Sounds like what you're asking for is tree faceting. A basic implementation is available in SOLR-792, but one that could also take facet.queries, or numeric or date range buckets, to tree on would be a nice improvement. Still, the underlying implementation will simply enumerate all the possible values (SOLR-792 has some short-circuiting when the top level has zero, of course).

A client-side application could do this with multiple requests to Solr.

Subsearch: sure, just make more requests to Solr, rearranging the parameters.

I'd still say that, for this type of need, the combinations will generally be less arbitrary than expected, and locking some things in during indexing will be the pragmatic way to go for most cases.

Erik

On Jan 29, 2010, at 9:28 AM, Peter S wrote: [...]
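The client-side variant mentioned above (one facet request for the top level, then one filtered facet request per top-level value) can be sketched as follows. An in-memory array stands in for Solr, and the facet method stands in for a request with facet.field plus an optional fq filter; all names are hypothetical and this is not the SOLR-792 implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for "facet on host, then facet on user per host".
public class TreeFacetSketch {

    // The example rows from the thread, as {host, user}.
    static final String[][] DOCS = {
        {"machine_1", "user_1"}, {"machine_1", "user_2"},
        {"machine_1", "user_1"}, {"machine_1", "user_1"},
        {"machine_2", "user_1"}, {"machine_2", "user_1"},
        {"machine_2", "user_4"}, {"machine_1", "user_4"},
    };

    /**
     * Counts values of column `field` among rows whose column `filterField`
     * equals `filterValue`; pass a negative filterField for no filter.
     */
    static Map<String, Integer> facet(String[][] docs, int field, int filterField, String filterValue) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] doc : docs) {
            if (filterField >= 0 && !doc[filterField].equals(filterValue)) {
                continue;
            }
            counts.merge(doc[field], 1, Integer::sum);
        }
        return counts;
    }

    /** First "request": hosts. Then one filtered "request" per host for users. */
    static Map<String, Map<String, Integer>> tree(String[][] docs) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (String host : facet(docs, 0, -1, null).keySet()) {
            result.put(host, facet(docs, 1, 0, host));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tree(DOCS));
    }
}
```

The number of second-level requests grows with the number of top-level values, which is the enumeration cost pointed out in the reply.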
Re: Is optimizing always necessary?
In addition to removing the holes in the index, optimization also merges multiple small indexes into a bigger one. Although I do not have specific performance data, I can imagine that this leads to performance benefits. Suppose you have thousands of small indexes: opening and closing them again and again would be time-consuming.

2010/1/30 Marcus Herou marcus.he...@tailsweep.com [...]

-- 梅旺生
Re: Newbie Question on Custom Query Generation
What's the point of generating your own query? Are you sure that Solr query syntax cannot satisfy your need?

2010/1/29 Abin Mathew abin.mat...@toostep.com

Hi,

I want to generate my own customized query from the input string entered by the user. It should look something like this:

Search field: Microsoft

Generated query:

    description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0 role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

The Lucene code we used is like this:

    BooleanQuery must = new BooleanQuery();
    addToBooleanQuery(must, tags, inputData, synonymAnalyzer, 1.5f);
    addToBooleanQuery(must, title, inputData, synonymAnalyzer);
    addToBooleanQuery(must, role, inputData, synonymAnalyzer);
    addToBooleanQuery(query, description, inputData, synonymAnalyzer);
    addToBooleanQuery(must, requirement, inputData, synonymAnalyzer);
    addToBooleanQuery(must, company, inputData, standardAnalyzer);
    addToBooleanQuery(must, city, inputData, standardAnalyzer);
    must.setBoost(5.0f);
    query.add(must, Occur.MUST);
    addToBooleanQuery(query, tags, includeAll, synonymAnalyzer, 2.0f);
    addToBooleanQuery(query, title, includeAll, synonymAnalyzer, 3.5f);
    addToBooleanQuery(query, functionalArea, inputData, synonymAnalyzer);

In simple English: addToBooleanQuery adds the given field to the query, after analysing the input with the specified analyzer and setting the specified boost. So there MUST be a keyword match in at least one of the fields tags, title, role, description, requirement, company, city, and it SHOULD occur in the fields tags, title and functionalArea.

I hope this gives you an idea of my requirement. I am not asking anyone to do it for me. Please let me know where I can start, and give me some useful tips to move ahead with this. I believe it has to do with modifying the XML configuration file and setting the parameters of the Dismax handler, but I am still not sure.

Please help.

Thanks and regards,
Abin Mathew

-- 梅旺生
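For readers following the thread, the generated query quoted above can be reproduced with plain string assembly. This is only a sketch in Python, not the poster's Lucene code: `clause` and `build_query` are hypothetical helper names, and the field lists and boosts are copied from the generated query shown in the message (which gives title a 3.0 boost that the quoted Java snippet does not set).

```python
def clause(field, term, boost=None):
    """Format one field:term clause, appending ^boost only when a boost is given."""
    text = f"{field}:{term}"
    return text if boost is None else f"{text}^{boost}"


def build_query(term):
    """Assemble the query string shown in the message for a single input term."""
    # Clauses grouped into the mandatory (+) sub-query, boosted as a whole by 5.0.
    must = [("tags", 1.5), ("title", 3.0), ("role", None),
            ("requirement", None), ("company", None), ("city", None)]
    # Optional clauses that only influence the score.
    should = [("tags", 2.0), ("title", 3.5), ("functionalArea", None)]

    must_part = " ".join(clause(f, term, b) for f, b in must)
    should_part = " ".join(clause(f, term, b) for f, b in should)
    return f"{clause('description', term)} +(({must_part})^5.0) {should_part}"
```

A string like this is what the standard query syntax accepts directly, which is the reason the reply asks whether a custom query generator is needed at all.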
Solr duplicates detection!!
Document Duplication Detection (Solr 1.4)

Overview

Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be efficiently achieved with a low-collision or fuzzy hash algorithm. Solr should natively support deduplication techniques of this type and allow for the easy addition of new hash/signature implementations.

Goals

- Efficient, hash-based exact/near document duplication detection and blocking.
- Allow for both duplicate collapsing in search results as well as deduplication on adding a document.

Design

Signature: a class capable of generating a signature String from the concatenation of a group of specified document fields.

    public abstract class Signature {
      public void init(SolrParams nl) { }
      public abstract String calculate(String content);
    }

Implementations:

- MD5Signature: 128-bit hash used for exact duplicate detection.
- Lookup3Signature: 64-bit hash used for exact duplicate detection, much faster than MD5 and smaller to index.
- TextProfileSignature: fuzzy hashing implementation from Nutch for near-duplicate detection. It's tunable but works best on longer text.

There are other, more sophisticated algorithms for fuzzy/near hashing that could be added later.

Notes

Adding in the dedupe process will change the allowDups setting so that it applies to an update Term (with field signatureField in this case) rather than the unique field Term (of course the signatureField could be the unique field, but generally you want the unique field to be unique). When a document is added, a signature will automatically be generated and attached to the document in the specified signatureField.

Configuration

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of the UpdateRequest chain.

Accepting all defaults:

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      </processor>
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Example settings:

    <!-- An example dedup update processor that creates the "id" field on the fly
         based on the hash code of some other fields. This example has overwriteDupes
         set to false since we are using the id field as the signatureField and Solr
         will maintain uniqueness based on that anyway. -->
    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <bool name="overwriteDupes">false</bool>
        <str name="signatureField">id</str>
        <str name="fields">name,features,cat</str>
        <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Note

Also be sure to change your update handlers to use the defined chain, i.e.

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">dedupe</str>
      </lst>
    </requestHandler>

The update processor can also be specified per request with a parameter of update.processor=dedupe

Settings

Setting: signatureClass
Default: org.apache.solr.update.processor.Lookup3Signature
Description: A Signature implementation for generating a signature hash.

Setting: fields
Default: all fields
Description: The fields to use to generate the signature hash, in a comma-separated list. By default, all fields on the document will be used.

Setting: signatureField
Default: signatureField
Description: The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml.

Setting: enabled
Default: true
Description: Enable/disable dedupe factory processing.

-- 梅旺生
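To make the signature idea above concrete, here is a small Python sketch of exact-duplicate detection by hashing a concatenation of selected fields. It mirrors the role of `Signature.calculate(String content)` but is not Solr's implementation: the function name `md5_signature`, the plain concatenation without separators, and the sample documents are all assumptions made for illustration.

```python
import hashlib


def md5_signature(doc, fields):
    """Concatenate the chosen field values and return a 128-bit MD5 hex digest.

    Assumption: values are joined with no separator, so ("ab", "c") and
    ("a", "bc") would collide; a real implementation may handle this differently.
    """
    content = "".join(str(doc.get(name, "")) for name in fields)
    return hashlib.md5(content.encode("utf-8")).hexdigest()


# Hypothetical documents: same name/features/cat, different id.
fields = ["name", "features", "cat"]
doc_a = {"id": "1", "name": "iPod", "features": "30GB", "cat": "music"}
doc_b = {"id": "2", "name": "iPod", "features": "30GB", "cat": "music"}
doc_c = {"id": "3", "name": "iPod", "features": "60GB", "cat": "music"}

sig_a = md5_signature(doc_a, fields)
sig_b = md5_signature(doc_b, fields)
sig_c = md5_signature(doc_c, fields)
```

Because `id` is not among the signature fields, `doc_a` and `doc_b` get the same signature and would be treated as duplicates, while `doc_c` differs in `features` and gets a different one.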
Re: Solr duplicates detection!!
Sorry for sending the wrong message; this should have gone to my own mailbox :(

-- 梅旺生
Deleting spelll checker index
Hello all,

We are using the index-based spell checker. I was wondering whether there is any URL parameter we can use to delete the spell check index directory. Please let me know.

Thanks,
darniz

-- View this message in context: http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27376823.html Sent from the Solr - User mailing list archive at Nabble.com.
Auto Suggest with multiple space separated words
Hi Experts,

I need an auto-suggest feature using Solr that behaves like the one in the Firefox browser. In short, when I type some text, the suggestions should drop down even if the typed text is not at the start of the suggested item.

Example: if I search for Lin, then the results could be [Abe Lincoln, Lindsay Lohan, Sarah Palin, Gasoline].

Please suggest the best approach. Any help is greatly appreciated.

Thank you,
Manas Nair
distributed search and failed core
hello *, in distributed search when a shard goes down, an error is returned and the search fails, is there a way to avoid the error and return the results from the shards that are still up? thx much --joe
Re: Basic questions about Solr cost in programming time
Hi!

Of course the answer depends (as usual) very much on the features you want to realize. But Solr can be set up very fast. When we created our first prototype, it took us about a week to get it running with spell phoneme search, spell checking, faceting, and even collapsing (using the famous 236-patch).

It is definitely very nice that you can do a lot of things using the available components, configuring them only inside solrconfig.xml and schema.xml. And you may well start with the standard distribution.

Cheers,
Sven

--On Tuesday, 26 January 2010 12:00 -0800 Jeff Crump jcr...@hq.mercycorps.org wrote:

Hi,

I hope this message is OK for this list. I'm looking into search solutions for an intranet site built with Drupal. Eventually we'd like to scale to enterprise search, which would include the Drupal site, a document repository, and Jive SBS (collaboration software). I'm interested in Lucene/Solr because of its scalability, faceted search and optimization features, and because it is free.

Our problem is that we are a non-profit organization with only three very busy programmers/sys admins supporting our employees around the world. To help me argue for Solr in terms of total cost, I'm hoping that members of this list can share their insights about the following:

* About how many hours of programming did it take you to set up your instance of Lucene/Solr (not counting time spent on optimization)?
* Are there any disadvantages of going with a certified distribution rather than the standard distribution?

Thanks and best regards,
Jeff

Jeff Crump jcr...@hq.mercycorps.org
RE: Aggregated facet value counts?
Tree faceting - that sounds very interesting indeed. I'll have a look into that and see how it fits, as well as any improvements for adding facet queries, cross-field aggregation, date range etc. There could be some very nice use-cases for such functionality. Just wondering how this would work with distributed shards/multi-core...

Many Thanks!
Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 12:20:07 -0500

Sounds like what you're asking for is tree faceting. A basic implementation is available in SOLR-792, but one that could also take facet.queries, numeric or date range buckets, to tree on would be a nice improvement. Still, the underlying implementation will simply enumerate all the possible values (SOLR-792 has some short-circuiting when the top-level has zero, of course). A client-side application could do this with multiple requests to Solr.

Subsearch - sure, just make more requests to Solr, rearranging the parameters.

I'd still say that in general, for this type of need, it'll generally be less arbitrary, and locking some things in during indexing will be the pragmatic way to go for most cases.

Erik

On Jan 29, 2010, at 9:28 AM, Peter S wrote:

Well, it wouldn't be 'every' combination - more of 'any' combination at query-time. The 'arbitrary' part of the requirement is because it's not practical to predict every combination a user might ask for, although generally users would tend to search for similar/the same query combinations (but perhaps with different date ranges, for example).

If 'predicted aggregate fields' were calculated at index-time on, say, 10 fields (the schema in question actually has 73 fields), that's 3,628,801 new fields. A large percentage of these would likely never be used (which ones would depend on the user, environment etc.).
Perhaps a more 'typical' use case than my network-based example would be a product search web page, where you want to show the number of products that are made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] (15) ). To obtain the (15) facet count value, you would have to correlate the number of Sony products (say, (861)), and the products that fall into the [600 TO 800] price range (say, (1226) ). The (15) would be the intersection of the Sony hits and the price range hits by 'manufacturer:Sony'.

Am I right that filter queries could only do this for document hits if you know the field values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could then be derived by simply counting the numFound for each result set.

If there were subsearch support in Solr (i.e. take the output of a query and use it as input into another) that included facets [perhaps there is such support?], it might be used to achieve this effect.

A custom query parser plugin could work, maybe? I suppose it would need to gather up all the separate facets and correlate them according to the input query (e.g. host and user, or manufacturer and price range). Such a mechanism would be crying out for caching, but perhaps it could leverage the existing field and query caches.

Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 07:39:44 -0500

Creating values for every possible combination is what you're asking Solr to do at query-time, and as far as I know there isn't really a way to accomplish that like you're asking. Is the need really to be arbitrary here?

Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:

Hi Erik,

Thanks for your reply. That's an interesting idea doing it at index-time, and a good idea for known field combinations. The only thing is: how to handle arbitrary field combinations? - i.e. to allow the caller to specify any combination of fields at query-time?
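The filter-query approach mentioned above (one request per combination, reading numFound as the count) can be sketched as a URL builder. This is a hedged Python illustration: the base URL and the function name `intersection_count_url` are assumptions, while `q=*:*`, `rows=0` and repeated `fq` parameters are ordinary Solr request parameters. The request itself is not sent here.

```python
from urllib.parse import urlencode, parse_qs


def intersection_count_url(base_url, filters):
    """Build a select URL whose numFound is the size of the intersection
    of all given filter queries (no documents are returned, rows=0)."""
    params = [("q", "*:*"), ("rows", "0")] + [("fq", f) for f in filters]
    return base_url + "?" + urlencode(params)


# Hypothetical Solr endpoint; adjust host, port and path to your installation.
url = intersection_count_url(
    "http://localhost:8983/solr/select",
    ["manufacturer:Sony", "price:[600 TO 800]"],
)

# Decoding the query string again shows both filters were kept as separate fq values.
decoded = parse_qs(url.split("?", 1)[1])
```

A client wanting the "Sony [$600-$800] (15)" style of count for several combinations would issue one such request per combination, which is the "multiple requests" approach Erik describes.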
So, yes, the data is available at index-time, but the combination isn't (short of creating fields for every possible combination). Peter From: erik.hatc...@gmail.com To: solr-user@lucene.apache.org Subject: Re: Aggregated facet value counts? Date: Fri, 29 Jan 2010 06:30:27 -0500 When faced with this type of situation where the data is entirely available at index-time, simply create an aggregated field that glues the two pieces together, and facet on that. Erik On Jan 29, 2010, at 6:16 AM, Peter S wrote: Hi, I was wondering if anyone had come across this use case, and if this type of faceting is possible: The requirement is to build a query such that an aggregated facet count of common (and'ed) field values form the basis of each returned facet count. For example: Let's say I have a
sort items by whether the user has viewed it or not
Hi,

I want to query for documents that have certain values, but I want the results sorted first by whether this person has viewed the document in the past. I can't store each user's view information in the document, so I want to pass it in with the search. Is it possible to do something like this:

http://solr?q=baseball&sort=doc_isbn(ABC or DEF or GHI) desc, title desc

Any help is appreciated,
r
Re: loading an updateProcessorChain with multicore in trunk
: I guess . default=true should not be necessary if there is only one
: updateRequestProcessorChain specified . Open an issue

No... that doesn't seem right. If you declare your own chains, but you don't mark any of them as default=true, then it shouldn't matter how many of them you declare; SolrCore should create a default for you.

The real question here is: why isn't he getting his explicitly defined chain when he references it by name? Declaring that he wants his explicitly named chain to be the default is fine, and that should work, but w/o declaring it as the default he should still be able to ask for it by name ... why isn't that working?

...
: <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
:   <lst name="defaults">
:     <str name="update.processor">myChain</str>
:   </lst>

Marc, can you confirm that when you don't declare your chain as default=true that...

1) an instance of your CustomUpdateProcessorFactory is actually getting instantiated by Solr (via logging or running in a debugger)

2) whether your custom chain is used if you pass update.processor=myChain as a request param instead of relying on the configured defaults (I wonder if some handler refactoring caused the default processing logic to no longer respect the defaults)

-Hoss
Re: update doc success, but could not find the new value
: Subject: update doc success, but could not find the new value : In-Reply-To: 449216.59315...@web56308.mail.re3.yahoo.com : References: 27335403.p...@talk.nabble.com : 449216.59315...@web56308.mail.re3.yahoo.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
RE: Solr wiki link broken
: Why don't we change the links to have FrontPage explicitly?
: Wouldn't it be the easiest fix unless there are numerous
: other pages that references the default page w/o FrontPage?

I'm fairly confident that there are more links pointing to http://wiki.apache.org/solr/ than there are alternate versions in different languages ... particularly when you start factoring in all of the webpages in the world that we don't have the ability to edit directly.

-Hoss
Re: NullPointerException in ReplicationHandler.postCommit + question about compression
: never keep a <str name="maxOptimizedCommitsToKeep">0</str>.
:
: It is better not to mention the deletionPolicy at all. The
: defaults are usually fine.

If setting the keep values to 0 results in NPEs, we should do one (if not both) of the following...

1) change the init code to warn/fail if the values are 0 (not sure if there is ever a legitimate use for 0 as a value)

2) change the code that's currently throwing an NPE to check its assumptions and log a more meaningful error if it can't function because of the existing config.

-Hoss
RE: How to Implement SpanQuery in Solr . . ?
: and Solr. I was hoping to start by getting a simple example working in SOLR
: and then iterate towards the more complex, given this is my first attempt at
: extending Solr.

Wise choice.

: For my first iteration of SpanQuery in Solr I am thinking of starting with a
: simple syntax to combine:

...honestly: since you already mentioned that you might eventually want to integrate Qsol, I would suggest you start with that directly. That way you are taking an existing parser (that you evidently understand) and just hooking it in via the QParser abstraction, as opposed to writing a Lucene String-to-Query parser *and* learning the QParser/Solr internals.

: implementation on the Lucene side and the FooQParserPlugin as a reference
: implementation on the SOLR side?

The FooQParserPlugin is fairly primitive and doesn't really make it obvious some of the things you can do with a QParser, so you may also want to skim the LuceneQParserPlugin as well.

: The other part of the riddle I would really appreciate some guidance on is
: how to get it to plug-in to SOLR correctly?

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin

-Hoss
Re: Solr Cache Viewing/Browsing
: used in a modified DisMaxHandler) and I was wondering if there is a way to
: get at this data from the JSP pages? I then thought that it might be nice to
: view more information about the respective caches like the current elements,
: recently evicted etc to help debug performance issues. Has anyone worked on
: this or have any ideas surrounding this?

I don't believe anyone has looked into this. It would be hard to implement in a generic manner, since the SolrCache API doesn't provide any mechanism for inspecting the contents, but you could write an implementation that exposes some of these things through the getStatistics method (or some other new introspection-based API).

-Hoss
Re: replication setup
: Subject: replication setup : In-Reply-To: 83ec2c9c1001260724t110d6595m5071e0a40e1b1...@mail.gmail.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: Analysis tool vs search query
: I've run into this issue that I have no way of resolving, since the analysis
: tool doesn't show me there is an error. I copy the exact field value into
: the analysis tool and i type in the exact query request i'm issuing and the
: tool finds it a match. However running the query with that exact same

The analysis tool doesn't do query parsing .. so pasting a *query* string into the analysis tool isn't going to give you any meaningful information.

What the query section of the analysis tool lets you do is see what the query-time analyzer (the one used by most query parsers at query time) will do with your input ... but the QueryParser is still in control, and it decides which input to pass to your analyzer -- special characters (like whitespace) have meaning to most query parsers, before they ever have a chance of getting passed to the analyzer.

: <tokenizer class="solr.KeywordTokenizerFactory"/>

A keyword tokenizer results in a single token for each input string, but the (default) query parser is going to chunk the input up on whitespace before the analyzer is ever invoked, unless you put it in a quoted string.

-Hoss
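The interaction described above (the query parser splits on whitespace first, and only then hands each chunk to the analyzer) can be mimicked with a toy model. The Python below is a deliberately simplified sketch, not Lucene's query grammar: `parser_chunks`, `keyword_analyze` and `tokens_for_query` are hypothetical names, and only whitespace and double quotes are treated specially.

```python
import re


def parser_chunks(query):
    """Toy query parser: split on whitespace, keeping double-quoted phrases whole."""
    return [m.strip('"') for m in re.findall(r'"[^"]*"|\S+', query)]


def keyword_analyze(chunk):
    """KeywordTokenizer-like analysis: the entire input becomes a single token."""
    return [chunk]


def tokens_for_query(query):
    """Tokens that end up being looked up: the analyzer only sees what the parser passes it."""
    return [tok for chunk in parser_chunks(query) for tok in keyword_analyze(chunk)]


unquoted = tokens_for_query("honda civic")
quoted = tokens_for_query('"honda civic"')
```

In this model the unquoted query yields two separate tokens, neither of which equals the single indexed token "honda civic", while the quoted form reaches the analyzer intact. That is the difference the analysis tool cannot show, because it skips the parsing step.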
Re: using bq with standard request handler
: I am using a query like:
: http://localhost:8080/solr/select?q=product_category:Grocery&fq=in_stock:true&debugQuery=true&sort=display_priority+desc,prop_l_price+asc
...
: Is it possible to use display_priority/price fields in bq itself to achieve
: the same result.. I tried forming some queries that but was unable to get
: desired results...

bf and bq are features of the dismax parser, so the default query parser won't use them -- it really wouldn't even make sense as a possible new feature, because the types of queries that might be specified using the lucene QParser are too broad to be able to define a consistent mechanism for knowing how/where to add the boosting queries to the structure.

If all of your queries have that identical structure, you might however consider something like...

http://localhost:8080/solr/select?qf=product_category&q=Grocery&fq=in_stock:true&bq=...

-Hoss
Re: Mail config
: I do not want to receive all the emails from this mail list, I only want to
: receive the answers to my questions, is this possible?

That's not how mailing lists work. If you want to participate in the community, you have to participate fully.

: If I am not mistaken when I unsubscribed I sent an email which did not reach
: the mail list at all (therefore there was of course no chance to get any
: replies).

The same mechanism that prevents you from posting when you are not subscribed is the mechanism that prevents thousands of spam messages from getting sent to the list every day .. you have to take the bad with the good.

: I am newbie for Solr and I doubt I can contribute much by answering to other
: posts.

But you can learn from those posts, and the discussion/responses they stimulate...

http://people.apache.org/~hossman/#private_q

-Hoss
Re: Lock problems: Lock obtain timed out
: Can anyone think of a reason why these locks would hang around for more than
: 2 hours?
:
: I have been monitoring them and they look like they are very short lived.

Typically the lock files are only left around for more than a few seconds when there was a fatal crash of some kind ... an OOM Error for example, or as already mentioned in this thread...

: SEVERE: java.io.IOException: No space left on device

...if you check your Solr logs for messages in the immediate time frame following the lastModified time of the lock file, you'll probably find something interesting.

-Hoss
Re: scenario with FQ parameter
: qf=field1^10 field2^20 field^100&fq=*:9+OR+(field1:xyz)
...
: I know I can use copy field (say 'text') to copy all the fields and then
...
: but doing so , the boost weights specified in the 'qf' field have no effect
: on the score.

An fq never has any impact on the score, so your question is a bit confusing. If you want to influence the scores, you'll need to use bq instead of fq.

As discussed in another current thread on this list, it's possible to make the bq param use the dismax parser as well, but there are some tricky issues involved with that ... unless your use case is actually more complicated than you are describing, you should probably just use something like...

...qf=field1^10+field2^20+field^100&bq=field1:9^10+field2:9^20+field:9^100+field1:xyz

-Hoss
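The bq value suggested above simply repeats the term once per field, carrying over each field's qf boost. A short Python sketch of that expansion follows; `boosted_clauses` is a hypothetical helper, and the field names, boosts and the trailing `field1:xyz` clause are taken from the message. The spaces in the resulting string are what appear as `+` once the value is URL-encoded.

```python
def boosted_clauses(term, field_boosts):
    """Expand one term into space-separated field:term^boost clauses."""
    return " ".join(f"{field}:{term}^{boost}" for field, boost in field_boosts)


# Same fields and weights as the qf parameter in the message.
field_boosts = [("field1", 10), ("field2", 20), ("field", 100)]

bq = boosted_clauses("9", field_boosts) + " field1:xyz"
```

Keeping the boosts in one list makes it easy to generate both qf and bq from the same source, so the two parameters cannot drift apart.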
Re: How can I boost bq in FieldQParserPlugin?
: q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=}&qq=12345&qt=dismax&debugQuery=on
:
: I try to debug the above query, it turned out to be as:
: +DisjunctionMaxQuery((content:ipod | title:ipod^4.0)~0.01) ()
: +DisjunctionMaxQuery((userId:12345^0.5)~0.01)

...hmmm, I'm not sure why that's happening, but it certainly seems like a bug -- I just have no idea what that bug is.

The inner dismax parser should definitely be producing a query where the DisjunctionMaxQuery for 12345 is mandatory, but that mandatory clause should be wrapped inside of another boolean query which should be added to the outermost query as an optional clause. Somewhere that BooleanQuery produced by the inner dismax parser is getting thrown away

... hmmm, actually this is a necessary behavior of DismaxQParser for some cases (it sheds its own outermost BooleanQuery when not needed), but in this case it's screwing you because it doesn't realize you really do need it.

Does this work better? ...

q=ipod&bq={!dismax qf=userId^0.5 v=$qq bq=*:*^0}&qq=12345&qt=dismax&debugQuery=on

...it's kind of kludgy, but it should guarantee you that the wrapping BooleanQuery is preserved.

-Hoss
Re: Large Query Strings and performance
: I am using Solr 1.4 with large query strings with 20+ terms and faceting on
: a single multi-valued field in a 1 million record system. I am using Solr to
: categorize text, that's why the query strings are big.
:
: The performance gets worse the more search terms are used. Is there any

Can you elaborate more on the types of query strings you are using? ... are they simply BooleanQueries consisting of many terms? ... are they all optional?

We have to understand your goal, what exactly you are currently doing, and what exactly you have already tried before we can suggest ways of achieving your goal faster than the things you've already tried.

-Hoss
Re: Master Read Timeout
: Is there any way to increase the Slave's timeout value? Are there any http://wiki.apache.org/solr/SolrReplication?highlight=%28timeout%29 -Hoss
RE: matching exact/whole phrase
: Is it safe to say in order to do exact matches the field should be a string.

It depends on your definition of "exact".

If you want exact matches, including unicode codepoints and leading/trailing whitespace, then StrField would probably make sense -- but you could equally use TextField with a KeywordTokenizer and nothing else.

If you want *some* normalization (ie: trim leading/trailing whitespace, map equivalent codepoints to a canonical representation, etc...) then you need TextField.

: Now in my dismax handler if i have the qf defined as text field and run a
: phrase search on text field
: my car is the best car in the world
: i dont get back any results. looking with debugQuery=on this is the
: parsedQuery
: text:my tire pressure warning light came my honda civic
: This will not work since text was indexed by removing all stop words.

It *can* work if the query analyzer for your text field type is also configured to remove stopwords, and if you either: configure the StopFilter(s) to deal with token positions in the way the parser expects (I forget which one works, you have to play with it); OR use a qs (query slop) value that gives you enough slop to miss those empty stop word gaps.

-Hoss
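The two notions of "exact" contrasted above can be shown on a pair of strings that look identical but differ in whitespace and Unicode composition. This Python sketch only illustrates the distinction: `exact_key` and `normalized_key` are hypothetical names, and NFC plus trimming is just one possible normalization, not a statement about which Solr filters to configure.

```python
import unicodedata


def exact_key(value):
    """StrField-like matching: the value is compared as-is, codepoint by codepoint."""
    return value


def normalized_key(value):
    """Matching with some normalization: trim whitespace and compose to NFC."""
    return unicodedata.normalize("NFC", value.strip())


precomposed = "caf\u00e9"        # 'e' with acute accent as a single codepoint
decomposed = "  cafe\u0301 "     # 'e' plus combining accent, with stray whitespace

same_when_exact = exact_key(precomposed) == exact_key(decomposed)
same_when_normalized = normalized_key(precomposed) == normalized_key(decomposed)
```

With strict comparison the two values do not match; after trimming and canonical composition they do, which is the kind of behaviour that calls for an analyzed field type rather than a plain string.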
Re: Deleting spelll checker index
: We are using Index based spell checker.
: i was wondering with the help of any url parameters can we delete the spell
: check index directory.

I don't think so.

You might be able to configure two different spell check components that point at the same directory -- one that builds off of a real field, and one that builds off of an (empty) text field (using FileBasedSpellChecker) .. then you could trigger a rebuild of an empty spell checking index using the second component. But I've never tried it, so I have no idea if it would work.

-Hoss
DataImportHandler multivalued field CollectionString not working
DataImportHandler multivalued field CollectionString isn't working the way I'd expect, meaning not at all. I logged the collection is there, however the multivalue collection field just isn't being indexed (according to the DIH web UI and it's not in the index).
Re: DataImportHandler multivalued field CollectionString not working
Did you correctly set multiValued (not multivalue) to true in schema.xml?

2010/1/30 Jason Rutherglen jason.rutherg...@gmail.com

DataImportHandler multivalued field CollectionString isn't working the way I'd expect, meaning not at all. I logged the collection is there, however the multivalue collection field just isn't being indexed (according to the DIH web UI and it's not in the index).

-- 梅旺生
Re: Newbie Question on Custom Query Generation
Hi, I recently realized the power of the Dismax query handler, and now I don't need to generate my own query, since Dismax is giving better results. Thanks a lot.

2010/1/29 Wangsheng Mei hairr...@gmail.com: What's the point of generating your own query? Are you sure that Solr query syntax cannot satisfy your need?

2010/1/29 Abin Mathew abin.mat...@toostep.com: Hi, I want to generate my own customized query from the input string entered by the user. It should look something like this:

*Search field*: Microsoft

*Generated query*: description:microsoft +((tags:microsoft^1.5 title:microsoft^3.0 role:microsoft requirement:microsoft company:microsoft city:microsoft)^5.0) tags:microsoft^2.0 title:microsoft^3.5 functionalArea:microsoft

*The Lucene code we used is like this:*

    // Outer query; its declaration is not shown in the post and is assumed here.
    BooleanQuery query = new BooleanQuery();

    // Clauses grouped under one required (MUST) sub-query.
    BooleanQuery must = new BooleanQuery();
    addToBooleanQuery(must, tags, inputData, synonymAnalyzer, 1.5f);
    addToBooleanQuery(must, title, inputData, synonymAnalyzer, 3.0f);
    addToBooleanQuery(must, role, inputData, synonymAnalyzer);
    addToBooleanQuery(query, description, inputData, synonymAnalyzer);
    addToBooleanQuery(must, requirement, inputData, synonymAnalyzer);
    addToBooleanQuery(must, company, inputData, standardAnalyzer);
    addToBooleanQuery(must, city, inputData, standardAnalyzer);
    must.setBoost(5.0f);
    query.add(must, Occur.MUST);

    // Optional clauses added directly to the outer query.
    addToBooleanQuery(query, tags, includeAll, synonymAnalyzer, 2.0f);
    addToBooleanQuery(query, title, includeAll, synonymAnalyzer, 3.5f);
    addToBooleanQuery(query, functionalArea, inputData, synonymAnalyzer);

*In simple English:* addToBooleanQuery adds the given field to the query after analysing the input with the given analyzer and setting the boost as specified. So there MUST be a keyword match in at least one of the fields tags, title, role, description, requirement, company, city, and it SHOULD occur in the fields tags, title and functionalArea.

I hope this gives you an idea of my requirement. I am not asking anyone to do it for me. Please let me know where I can start, and give me some useful tips to move ahead with this. I believe it has to do with modifying the XML configuration file and setting the parameters of the Dismax handler, but I am still not sure. Please help.

Thanks, Regards, Abin Mathew

-- 梅旺生
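For anyone following this thread, here is a rough sketch of what the Dismax route can look like from Java: instead of assembling a BooleanQuery by hand, the per-field boosts are passed in the dismax qf parameter. The base URL, the class name DismaxUrlSketch and its helper methods are hypothetical. The fields and boosts are copied from the must group in Abin's code above, with 1.0 assumed for fields that had no explicit boost. It is an illustration under those assumptions, not the configuration Abin ended up using.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a Solr select URL that uses the dismax parser.
class DismaxUrlSketch {

    // Turns field -> boost pairs into a qf value such as "tags^1.5 title^3.0"
    // and appends it, URL-encoded, to a dismax select request.
    static String buildUrl(String baseUrl, String userInput, Map<String, Float> fieldBoosts) {
        StringJoiner qf = new StringJoiner(" ");
        for (Map.Entry<String, Float> entry : fieldBoosts.entrySet()) {
            qf.add(entry.getKey() + "^" + entry.getValue());
        }
        return baseUrl + "/select?defType=dismax"
                + "&q=" + encode(userInput)
                + "&qf=" + encode(qf.toString());
    }

    // URL-encodes a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Boosts taken from the "must" group in the post; 1.0 is an assumed default.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("tags", 1.5f);
        boosts.put("title", 3.0f);
        boosts.put("role", 1.0f);
        boosts.put("requirement", 1.0f);
        boosts.put("company", 1.0f);
        boosts.put("city", 1.0f);
        // The host and port are placeholders for a local Solr instance.
        System.out.println(buildUrl("http://localhost:8983/solr", "Microsoft", boosts));
    }
}
```

The optional SHOULD clauses of the original query have no counterpart in this sketch. Dismax parameters such as bq or pf may be the place to look for that, but that mapping is untested here.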
Looking for a Solr volunteer for www.comics.org
Hi folks, I apologize if this isn't the right place to post this (alternate suggestions welcome alongside appropriate chastisement :-) I'm trying to recruit a volunteer to implement a Solr-based search system for the Grand Comic-Book Database (http://www.comics.org/). We're a non-profit, non-commercial, international group researching and indexing comic books, and we have only two active programmers (we're both unpaid volunteers, as are all GCD personnel). We'd love to have better search, and Solr looks like the right tool, but we're swamped with other technical work. So if anyone reading this would like to help out a comic book-related web site with their Solr experience, for absolutely no monetary compensation whatsoever, do please let me know :-D It would help to be into comic books, but that's not strictly required. Your work would be used quite heavily, and you could of course point that out to anyone you might wish to impress with your expertise. Our technical work is open-source, and therefore available for inspection and showing off. To clarify: I'm not looking for assistance with or pointers about setting Solr up myself (no matter how easy it is). And I'm not trying to get the list as a whole to do our work for us. I'm just trying to find if any individual feels like joining our tech team and volunteering for the project and couldn't think of a more likely place to find candidates than here. If we don't find a volunteer, I'll end up doing it next year, and I'll be reading a lot more documentation before asking any questions here. thanks, -henry
Re: Deleting spelll checker index
Then I assume the easiest way is to delete the directory itself.

darniz

hossman wrote:
: We are using an index-based spell checker.
: I was wondering whether, with the help of any URL parameters, we can delete the spell
: check index directory.
I don't think so. You might be able to configure two different spell check components that point at the same directory -- one that builds off of a real field, and one that builds off of an (empty) text field (using FileBasedSpellChecker) .. then you could trigger a rebuild of an empty spell checking index using the second component. But I've never tried it, so I have no idea if it would work.
-Hoss

-- View this message in context: http://old.nabble.com/Deleting-spelll-checker-index-tp27376823p27381620.html
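If deleting the directory is the route taken, a minimal sketch in plain Java might look like the following. The class and method names are hypothetical, and the demo runs against a throwaway temp directory rather than a real spell check index. It assumes nothing else has the index open (for example, Solr is stopped), and that the index is rebuilt afterwards, for instance with spellcheck.build=true.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for removing a spell check index directory from disk.
class SpellIndexCleanupSketch {

    // Deletes dir and everything under it; does nothing if dir does not exist.
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order puts children before their parent directories.
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path path : paths) {
                Files.delete(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a throwaway directory tree, deletes it, and reports whether it is gone.
    static boolean demoOnTempDir() {
        try {
            Path dir = Files.createTempDirectory("spellchecker-demo");
            Files.createFile(dir.resolve("segments_1"));
            Files.createDirectories(dir.resolve("sub"));
            Files.createFile(dir.resolve("sub").resolve("_0.cfs"));
            deleteRecursively(dir);
            return !Files.exists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("removed: " + demoOnTempDir());
    }
}
```

In a real setup the path passed to deleteRecursively would be whatever spellcheckIndexDir resolves to in solrconfig.xml. That lookup is left out here.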