no segments* file found
I'm using Solr to index our file servers (480K files). If I don't optimize, I get a "too many open files" error at about 450K files and a 3 GB index. If I optimize, I get this stack trace during the commit of every following update:

<result status="1">java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/root/trunk/example/solr/data/index: files: _7xr.tis _7xt.fdt _7o1.tii _7xq.tis _7xn.nrm _7ws.fdt _7xt.prx _7xp.nrm _7ws.nrm _7xo.nrm _7ws.tis _7xs.fdt _7vc.fnm _7u6.tis _7vx.fnm _7vx.frq _7xs.nrm _7xn.tis _7xq.frq _7xs.tis _7xq.prx _7vx.fdx _7ur.tii _7ur.frq _7xq.fnm _7xr.nrm _7vc.fdt _7xt.frq _7xp.fdx _7ws.prx _7xs.frq _7xo.prx _7xq.nrm _7vx.tii _7vx.prx _7xq.tii _7xs.fnm _7xs.tii _7ws.tii _7xt.fdx _7vc.nrm _7vc.prx _7vc.tis _7xq.fdt _7ur.prx _7xn.fdx _7xp.frq _7vx.nrm _7ur.fdt _7xr.fnm _7ws.fdx _7u6.tii _7xr.tii _7vc.frq _7vx.tis _7xp.fdt _7xr.frq _7ur.tis _7xp.prx _7xr.fdx _7xt.fnm _7xn.tii _7vc.fdx _7xo.fdt _7u6.fnm _7xn.frq _7xp.tis _7o1.frq _7xn.prx _7ur.fdx _7ur.fnm _7o1.fdx _7xs.fdx _7xn.fdt _7xt.tis _7xp.fnm _7xo.fnm _7xn.fnm _7u6.prx _7xq.fdx _7xo.tii _7ws.fnm _7vc.tii _7o1.prx _7xr.fdt _7o1.fdt _7ur.nrm _7ws.frq _7u6.nrm _7o1.nrm _7vx.fdt _7xt.tii _7u6.fdx _7xo.frq _7u6.frq _7xs.prx _7xr.prx _7o1.tis _7xt.nrm _7xp.tii _7xo.tis _7u6.fdt _7xo.fdx _7o1.fnm segments.gen
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:516)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:243)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:616)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:410)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:97)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:121)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:189)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:267)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
at org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:57)
</result>

If I restart Solr, I get a NullPointerException in DispatchFilter. Tested with Solr 1.2 and 1.3; the behaviour is the same.

Regards,
Florent BEAUCHAMP
Re: no segments* file found
Are you using embedded Solr? I had stumbled on a similar error: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06085.html

-V

On Nov 12, 2007 2:16 PM, SDIS M. Beauchamp [EMAIL PROTECTED] wrote:
> I'm using Solr to index our file servers (480K files). If I don't optimize, I get a "too many open files" error at about 450K files and a 3 GB index. If I optimize, I get this stack trace during the commit of every following update [...]
Re: Trim filter active for solr.StrField?
What is your specific SolrQuery? Calling query.setQuery(" stuff with spaces ") does not call trim(), but some other calls do.

My query looks like, e.g.:

(myField:_T8sY05EAEdyU7fJs63mvdA OR myField:_T8sY0ZEAEdyU7fJs63mvdA OR myField:_T8sY0pEAEdyU7fJs63mvdA) AND NOT myField:_T8sY1JEAEdyU7fJs63mvdA

So I want to find all documents where the field myField contains any of one set of UUIDs and must not contain any of another set of UUIDs. The only other thing I do is set the result limit: solrQuery.setRows(resultLimit). The actual strings which are truncated are in other fields of the returned documents. Any idea?
RE: no segments* file found
No, I'm using a custom indexer written in C# which submits content via POST requests. I let Lucene manage the index on its own.

Florent BEAUCHAMP

-----Original Message-----
From: Venkatraman S [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 12, 2007 10:19 AM
To: solr-user@lucene.apache.org
Subject: Re: no segments* file found

> Are you using embedded Solr? I had stumbled on a similar error: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06085.html [...]
RE: Multiple indexes
Hello,

Until now, I've used two instances of Solr, one for each of my collections; it works fine, but I wonder if there is an advantage to using multiple indexes in one instance over several instances with one index each. Note that the two indexes have different schema.xml files.

Thanks,
PL

Date: Thu, 8 Nov 2007 18:05:43 -0500 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Multiple indexes

> Hi, I am looking for a way to utilize multiple indexes in a single Solr instance. I saw that there is patch 215 available and would like to ask someone who knows how to use multiple indexes. Thanks, Jae Joo
solr range query
Hello,

I would like to use Solr to return ranges of searches on an integer field. If I put offset:[0 TO 10] in the URL, it returns only documents with offset values 0, 1 and 10, but I want the whole range 0, 1, 2, 3, 4, ..., 10. How can I do that with Solr?

Thanks in advance.

Best regards,
Heba Farouk
Software Engineer
Bibliotheca Alexandrina
Re: I18N with SOLR?
I'd say yes. Solr supports Unicode, ships with language-specific analyzers, and allows you to provide your own custom analyzers if you need them. This lets you create different fieldType definitions for the languages you want to support. For example, here is an example field type for French text which uses a French stopword list and French stemming:

<fieldType name="text_french" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.FrenchStopFilterFactory" ignoreCase="true" words="stopwords_french.txt"/>
    <filter class="solr.FrenchPorterFilterFactory" protected="protwords_french.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Then you can create a dynamicField definition that allows you to index and query your documents using the correct field type:

<dynamicField name="*_french" type="text_french" indexed="true" stored="true"/>

This means that when you index, you need to know what language your data is in so that you know what field names to use in your document (e.g. title_french). And at search time you need to know what language you are in so you know which fields to search. Most user interfaces are in a single-language context, so from the query perspective you'll most likely know the language the user wants to search in. If you don't know the language context in either case, you could try to guess using something like org.apache.nutch.analysis.lang.LanguageIdentifier.

I hope this helps. We used this technique (without the guessing) quite effectively at the Library of Congress recently for a prototype application that needed to provide search functionality in 7 different languages.

//Ed

On Nov 12, 2007 1:56 AM, Dilip.TS [EMAIL PROTECTED] wrote:
> Hello, Does SOLR support I18N (with multiple language support)? Thanks in advance. Regards, Dilip TS
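To make the indexing side concrete, a document posted to Solr's /update handler would then use the suffixed field name - a minimal sketch, with made-up id and title values:

<add>
  <doc>
    <field name="id">book-1</field>
    <field name="title_french">Le Petit Prince</field>
  </doc>
</add>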
Re: Multiple indexes
The advantages of a multi-core setup are configuration flexibility and dynamically changing the available options (without a full restart). For high-performance production Solr servers, I don't think there is much reason for it. You may want to split the two indexes onto two machines. You may want to run each index in a separate JVM (so if one crashes, the other does not).

Maintaining 2 indexes is pretty easy; if that were a larger number, or you needed to create indexes for each user in a system, then it would be worth investigating the multi-core setup (it is still in development).

ryan

Pierre-Yves LANDRON wrote:
> Hello, Until now, i've used two instance of solr, one for each of my collections ; it works fine, but i wonder if there is an advantage to use multiple indexes in one instance over several instances with one index each ? [...]
Re: solr range query
On Nov 12, 2007 8:02 AM, Heba Farouk [EMAIL PROTECTED] wrote:
> I would like to use solr to return ranges of searches on an integer field, if I wrote in the url offset:[0 TO 10], it returns documents with offset values 0, 1, 10 only [...]

Use fieldType sint (sortable int... see the schema.xml), and reindex. A plain string-based int field does range queries lexicographically, which is why only 0, 1 and 10 match.

-Yonik
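For reference, the example schema.xml that ships with Solr declares the sortable type along these lines; the offset field declaration here is an assumption based on the question:

<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<field name="offset" type="sint" indexed="true" stored="true"/>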
Query and heap Size
In my system, the heap usage (old generation) keeps growing under heavy traffic. I have adjusted the size of the young generation, but it does not help much. Does anyone have any recommendations regarding this issue - Solr configuration and/or web.xml, etc.?

Thanks, Jae
Re: Multiple indexes
Here is my situation. I have 6 million articles indexed and am adding about 10K articles every day. If I maintain only one index, then whenever the daily feed is running it consumes the heap and causes full GCs. I am thinking of a way to have multiple indexes - one for the ongoing query service and one for updates. Once an update is done, switch the indexes automatically and/or from my application.

Thanks, Jae Joo

On Nov 12, 2007 8:48 AM, Ryan McKinley [EMAIL PROTECTED] wrote:
> The advantages of a multi-core setup are configuration flexibility and dynamically changing available options (without a full restart). [...]
Re: Multiple indexes
Just use the standard collection distribution stuff - that is what it is made for! http://wiki.apache.org/solr/CollectionDistribution

Alternatively, open up two indexes using the same config/dir - do your indexing on one and the searching on the other. When indexing is done (or finishes a big chunk), send <commit/> to the 'searching' one and it will see the new stuff.

ryan

Jae Joo wrote:
> Here is my situation. I have 6 millions articles indexed and adding about 10k articles everyday. [...]
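With the stock example setup, that commit can be sent with a one-liner like this (the URL assumes the default port and context):

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'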
Re: Best way to create multiple indexes
For starters, do you need to be able to search across groups or sub-groups (in one query)? If so, then you have to stick everything in one index. You can add a field to each document saying what 'group' or 'sub-group' it is in and then limit it at query time: q=kittens +group:A

The advantage of splitting it into multiple indexes is that you could put each index on independent hardware. Depending on your queries and index size, that may make a big difference.

ryan

Rishabh Joshi wrote:
> Hi, I have a requirement and was wondering if someone could help me in how to go about it. We have to index about 8-9 million documents and their size can be anywhere from a few KBs to a couple of MBs. These documents are categorized into many 'groups' and 'sub-groups'. I wanted to know if we can create multiple indexes based on 'groups' and then on 'sub-groups' in Solr? If yes, then how do we go about it? I tried going through the section on 'Collections' in the Solr Wiki, but could not make much use of it. Regards, Rishabh Joshi
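As an aside, the same restriction is often expressed as a separate filter query, which Solr caches independently of the main query; the group field name here is whatever field you add for this purpose:

q=kittens&fq=group:A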
Re: Best way to create multiple indexes
Hi Guys,

How do we add Word / PDF / text / etc. documents in Solr? How is the content of the files stored or indexed? Are the documents stored as XML in the filesystem?

Regards, Dwarak R

----- Original Message ----- From: Ryan McKinley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 12, 2007 7:43 PM Subject: Re: Best way to create multiple indexes

> For starters, do you need to be able to search across groups or sub-groups (in one query?) If so, then you have to stick everything in one index. [...]
solr workflow ?
Hi Guys,

How do we add Word / PDF / text / etc. documents in Solr? How is the content of the files stored or indexed? Are these documents stored as XML in the SOLR filesystem?

Regards, Dwarak R
RE: Best way to create multiple indexes
Ryan,

We currently have 8-9 million documents to index, and this number will grow in the future. Also, we will never have a query that searches across groups, but we will have queries that search across sub-groups for sure. Keeping this in mind, we were thinking we could have multiple indexes at the 'group' level at least. Also, can multiple indexes be created dynamically? For example: in my application, if I create a 'logical group', then an index should be created for that group.

Rishabh

-----Original Message----- From: Ryan McKinley [mailto:[EMAIL PROTECTED]] Sent: Monday, November 12, 2007 7:44 PM To: solr-user@lucene.apache.org Subject: Re: Best way to create multiple indexes

> For starters, do you need to be able to search across groups or sub-groups (in one query?) [...]
Re: no segments* file found
On Nov 12, 2007 3:46 AM, SDIS M. Beauchamp [EMAIL PROTECTED] wrote:
> If I don't optimize, I've got a "too many open files" error at about 450K files and 3 Gb index

You may need to increase the number of file descriptors in your system. If you're using Linux, see this: http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html

Check the system-wide limit and the per-process limit.

-Yonik
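For the impatient, these are the typical commands on Linux; the exact numbers and the preferred mechanism (limits.conf vs. sysctl) vary by distribution, so treat the values as placeholders:

ulimit -n                           # show the per-process limit for this shell
ulimit -n 8192                      # raise it (may require root / limits.conf)
cat /proc/sys/fs/file-max           # show the system-wide limit
echo 65536 > /proc/sys/fs/file-max  # raise it (as root)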
Does SOLR support multiple instances within the same web application?
Hello,

Does SOLR support multiple instances within the same web application? If so, how is this achieved?

Thanks in advance.

Regards, Dilip TS
leading wildcards
Hi

I found the thread about enabling leading wildcards in Solr as an additional option in the config file. I've got a nightly Solr build and I can't find any options connected with leading wildcards in the config files. How can I enable leading wildcard queries in Solr?

Thank you

-- Best regards, Traut
Re: solr workflow ?
rtfm :) http://lucene.apache.org/solr/tutorial.html

On Nov 12, 2007 4:33 PM, Dwarak R [EMAIL PROTECTED] wrote:
> Hi Guys How do we add word documents / pdf / text / etc documents in solr ?. How do the content of the files are stored or indexed ?. [...]

-- Best regards, Traut
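For context: the tutorial only covers posting XML, since Solr itself doesn't parse Word or PDF in these releases. The usual pattern is to extract the text yourself (e.g. with PDFBox or POI) and post a plain XML document to /update, roughly like this (the field names depend on your schema):

<add>
  <doc>
    <field name="id">report-2007-11</field>
    <field name="text">...text extracted from the PDF goes here...</field>
  </doc>
</add>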
Re: Does SOLR support multiple instances within the same web application?
Dilip.TS wrote:
> Hello, Does SOLR support multiple instances within the same web application? If so, how is this achieved?

If you want multiple indices, you can run multiple web-apps. If you need multiple indices in the same web-app, check SOLR-350 -- it is still in development, and make sure you *really* need it before going that route.

ryan
Re: solr workflow ?
Highly unfortunate!

On Nov 12, 2007 9:07 PM, Traut [EMAIL PROTECTED] wrote:
> rtfm :) http://lucene.apache.org/solr/tutorial.html [...]
Re: leading wildcards
Seems like there is no way to enable leading wildcard queries except editing the code and repacking the files. :(

On 11/12/07, Bill Au [EMAIL PROTECTED] wrote:
> The related bug is still open: http://issues.apache.org/jira/browse/SOLR-218
> Bill
> On Nov 12, 2007 10:25 AM, Traut wrote: Hi, I found the thread about enabling leading wildcards in Solr as an additional option in the config file. [...]

-- Best regards, Traut
Re: leading wildcards
Vote for that issue and perhaps it'll gain some more traction. A former colleague of mine was the one who contributed the patch in SOLR-218, and it would be nice to have that configuration option 'standard' (even if off by default) in the next SOLR release.

On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote:
> Seems like there is no way to enable leading wildcard queries except code editing and files repacking. :( [...]

-- Michael Kimsal http://webdevradio.com
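For anyone embedding Lucene directly in the meantime, the underlying switch that a patch like SOLR-218 exposes is QueryParser.setAllowLeadingWildcard. A minimal sketch, with the field name and analyzer chosen arbitrarily:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class LeadingWildcardDemo {
    public static void main(String[] args) throws Exception {
        // QueryParser rejects leading wildcards unless explicitly allowed
        QueryParser qp = new QueryParser("text", new StandardAnalyzer());
        qp.setAllowLeadingWildcard(true);
        Query q = qp.parse("*raut");
        System.out.println(q); // prints: text:*raut
    }
}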
Re: Multiple indexes
I have built the master Solr instance and indexed some files. Once I run snapshooter, it complains with the error below.

snapshooter -d data/index (in the solr/bin directory)

Did I miss something?

++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2007/11/12 12:38:40 taking snapshot /solr/master/solr/data/index/snapshot.20071112123840
+ [[ -n '' ]]
+ mv /solr/master/solr/data/index/temp-snapshot.20071112123840 /solr/master/solr/data/index/snapshot.20071112123840
mv: cannot access /solr/master/solr/data/index/temp-snapshot.20071112123840

Jae

On Nov 12, 2007 9:09 AM, Ryan McKinley [EMAIL PROTECTED] wrote:
> just use the standard collection distribution stuff. That is what it is made for! http://wiki.apache.org/solr/CollectionDistribution [...]
RE: Solr + autocomplete
Thanks Ryan,

This looks like the way to go. However, when I set up my schema I get: Error loading class 'solr.EdgeNGramFilterFactory'. For some reason the class is not found. I tried the stable 1.2 build and even tried the nightly build. I'm using:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>

Any suggestions?

Thanks, Mike

-----Original Message----- From: Ryan McKinley [mailto:[EMAIL PROTECTED]] Sent: Monday, October 15, 2007 4:44 PM To: solr-user@lucene.apache.org Subject: Re: Solr + autocomplete

>> I would imagine there is a library to set up an autocomplete search with Solr. Does anyone have any suggestions? Scriptaculous has a JavaScript autocomplete library. However, the server must return an unordered list.

> Solr does not provide an autocomplete UI, but it can return JSON that a JS library can use to populate an autocomplete. Depending on your index size / query speed, you may be fine with a standard faceting prefix filter. If the index is large, you may want to index using the EdgeNGramFilterFactory. Check the last comment in: https://issues.apache.org/jira/browse/SOLR-357
>
> ryan
RE: Solr + autocomplete
: Error loading class 'solr.EdgeNGramFilterFactory'. For some reason EdgeNGramFilterFactory didn't exist when Solr 1.2 was released, but the EdgeNGramTokenizerFactory did. (the javadocs that come with each release list all of the various factories in that release) -Hoss
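On a build that does have the filter factory (a 1.3 nightly), the kind of fieldType this enables looks roughly like the sketch below; the type name is made up, and the gram sizes are taken from Mike's message. Grams are generated at index time only, so a query prefix like "moto" matches an indexed gram directly:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>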
DISTINCT ON functionality in Solr?
Is there a way to define a query such that the search result contains only one representative of every set of documents which are equal on a given field (it is not important which representative document), i.e. to have the DISTINCT ON concept from relational databases in Solr? If this cannot be done with the search API of Lucene, maybe one can use Solr server-side hooks or filters to achieve this? How?

The reason why I do not want to do this filtering manually is that I want to have as many matches as possible with respect to my defined result limit for the query (and filtering the search result on the client side may kick me far away from this limit).

Thanks..
Phrase-based (vs. Word-Based) Proximity Search
I gather that the standard Solr query parser uses the same syntax for proximity searches as Lucene, and that the Lucene syntax is described at http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches

This syntax lets me look for terms that are within x words of each other. Their example is that "jakarta apache"~10 will find documents where jakarta and apache occur within 10 words of one another.

What I would like to do is find documents where *phrases*, not just terms, are within x words of each other. I want to be able to say things like: find the documents where the phrases "apache jakarta" and "sun microsystems" occur within ten words of one another. If I gave such a search, I would *not* want it to count as a match if, for instance, apache appeared near microsystems but apache wasn't followed immediately by jakarta, or microsystems wasn't preceded immediately by sun. I would also not want it to match if "apache jakarta" appeared but "sun microsystems" did not.

Is there any way to do such a search currently? I suppose it might work to say

"apache jakarta sun microsystems"~10 +"apache jakarta" +"sun microsystems"

but that seems like an unfortunate hack. In any case it's not really something I can expect my users to be able to type in by themselves. In our current query language (which I'm hoping to wean our users off of), they can type

"apache jakarta" near/10 "sun microsystems"

which I believe is more intuitive.

Any ideas?

Chris
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful...

For your line number, page number etc. perspective, it is possible to index special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum data along with SpanQueries to figure this out at search time. For instance, coincident with the last term in each line, index the token $. Coincident with the last token of every paragraph, index the token #. If you get the offsets of the matching terms, you can quite quickly count the number of line and paragraph tokens using TermDocs/TermEnums and correlate hits to lines and paragraphs. The trick is to index your special tokens with an increment of 0 (see SynonymAnalyzer in Lucene In Action for more on this).

Another possibility is to add a special field to each document with the offsets of each end-of-sentence and end-of-paragraph (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your particular problem space; I'm not sure either of them is suitable for very high volume applications. Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't even *think* of asking me how to really make this work in SOLR <g>.

Best
Erick

On Nov 11, 2007 12:44 AM, David Neubert [EMAIL PROTECTED] wrote:
> Ryan (and others who need something to put them to sleep :) )
>
> Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way. I need to rethink my whole approach now that I understand (from reviewing the schema.xml closer and playing with the Analyzer) how compatible index and query policies can be applied automatically on a field-by-field basis by SOLR at both index and query time. I still may have a stumper here, but I need to give it some thought, and may return again with another question. The problem is that my text is book text (fairly large) that looks very much like one would expect:
>
> <book>
>   <chapter>
>     <para><sen>...</sen><sen>...</sen></para>
>     <para><sen>...</sen><sen>...</sen></para>
>   </chapter>
> </book>
>
> The search results need to return exact sentences or paragraphs with their exact page:line numbers (which are available in the embedded markup in the text). There were previous responses by others suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene. However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is because I was planning on pre-parsing the text myself (outside of SOLR) -- feeding separate doc elements to the add, because that way I could produce the page:line reference in the pre-parsing (again outside of SOLR) and feed it in as an explicit field in the doc elements of the add requests. Therefore at query time, I will have the exact page:line corresponding to the start of the paragraph or sentence. But I am beginning to suspect I was planning to do a lot of work that SOLR can do for me. I will continue to study this and respond when I am a bit clearer, but the closer I could get to just submitting the books a chapter at a time -- and letting SOLR do the work, the better (because I have all the books in well-formed XML at chapter level).
>
> However, I don't see yet how I could get par/sen-granular search result hits, along with their exact page:line coordinates, unless I approach it by explicitly indexing the pars and sens as single documents, not chapter hits, and also return the entire text of the sen or par and highlight the keywords within (for the search result hit). Once a search result hit is selected, it would then act as expected and position into the chapter at the selected reference, highlighting the key words again, but this time in the context of an entire chapter (the whole document to the user's mind). Even with my new understanding you (and others) have given me, which I can certainly use to improve my approach -- it still seems to me that because multi-valued fields concatenate text -- even if you use the positionIncrementGap feature to prohibit unwanted phrase matches -- how do you produce a well-defined search result hit, bounded by the exact sen or par, unless you index them as single documents? Should I still read up on the payload discussion?
>
> Dave
>
> ----- Original Message ----- From: Ryan McKinley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 5:00:43 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
>> David Neubert wrote: Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field
>> yes!
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik,

Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant by a "case" field, but I am not clear about your wording "per-book setting attached at index time" - would you mind elaborating on that, so I am clear?

Dave

----- Original Message ----- From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Sunday, November 11, 2007 5:21:45 AM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

Solr query syntax is documented here: http://wiki.apache.org/solr/SolrQuerySyntax

What Yonik is referring to is creating your own "case" field with the per-book setting attached at index time.

Erik

On Nov 11, 2007, at 12:55 AM, David Neubert wrote:
> Yonik (or anyone else), do you know where on-line documentation on the +case: syntax is located? I can't seem to find it.
> Dave

----- Original Message ----- From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 4:56:40 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote:
> So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to resort to consecutive multiple queries instead?

Solr handles that for you automatically.

> The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book.

You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though:

(+case:sensitive +(normal relevancy query on the case sensitive fields goes here)) OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here))

-Yonik
RE: Solr + autocomplete
Will I need to use Solr 1.3 with the EdgeNGramFilterFactory in order to get the autosuggest feature?

-----Original Message----- From: Chris Hostetter [mailto:[EMAIL PROTECTED]] Sent: Monday, November 12, 2007 1:05 PM To: solr-user@lucene.apache.org Subject: RE: Solr + autocomplete

> EdgeNGramFilterFactory didn't exist when Solr 1.2 was released, but the EdgeNGramTokenizerFactory did. [...]
Re: Phrase-based (vs. Word-Based) Proximity Search
Hi Chris,

> I gather that the standard Solr query parser uses the same syntax for proximity searches as Lucene [snip] What I would like to do is find documents where *phrases*, not just terms, are within x words of each other. [snip]

I'd thought that span queries would allow you to do this type of thing, but they're not supported (currently) by the standard query parser. E.g. check out the SpanNearQuery support in (recent) Lucene releases:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/spans/SpanNearQuery.html

I would recommend re-posting this on the Lucene user list.

-- Ken

-- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it"
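At the Lucene level, the query Chris describes maps naturally onto nested span queries: each phrase becomes an in-order, zero-slop SpanNearQuery, and an outer SpanNearQuery allows up to 10 positions between them in either order. A minimal sketch, assuming the indexed field is named "text":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class PhraseProximity {
    public static void main(String[] args) {
        // "apache jakarta" as an exact phrase: adjacent terms, in order
        SpanQuery apacheJakarta = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "apache")),
                new SpanTermQuery(new Term("text", "jakarta"))}, 0, true);

        // "sun microsystems" as an exact phrase
        SpanQuery sunMicrosystems = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "sun")),
                new SpanTermQuery(new Term("text", "microsystems"))}, 0, true);

        // the two phrases within 10 positions of each other, in any order
        SpanQuery bothNear = new SpanNearQuery(
                new SpanQuery[] {apacheJakarta, sunMicrosystems}, 10, false);

        System.out.println(bothNear);
    }
}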
Re: Associating pronouns instances to proper nouns?
Attempting to answer my own question (which I should probably just try), assuming I can doctor the indexed text: I suppose I could do something like change all instances of I, he, etc. that refer to one person to IJBA, HEJBA, HIMJBA (making sure they would never equal a normal word), then use the synonym feature to link IJBA, HEJBA, HIMJBA, Joe Book Author, J.B.Author (although, even if this were a good approach, I don't know if you can link synonyms for phrases as opposed to a single word). And of course this would require a correlative translation mechanism at display time to render I, he, him instead of the indexed acronym.

----- Original Message ----- From: David Neubert [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 12, 2007 2:54:11 PM Subject: Associating pronouns instances to proper nouns?

> All, I am working with very exact text and search over permanent documents (books). It would be great to associate pronouns like he, she, him, her, I, my, etc. with the actual author or person the pronoun refers to. I can see how I could get pretty darn close with the synonym feature in Lucene. Unfortunately though, as I understand it, this would associate all instances of I, he, she, etc. instead of particular instances. I have come up with a crude mechanism, adding the initials of the referred person immediately after the pronoun ... him{DGN}, but this of course complicates word counts and potential phrase lookups, etc. (which I could probably live with and work around). But after understanding how easy it is to add synonyms for any particular word in a document, is there any standard practical way to add synonyms to a particular word instance within a document? That would really do the trick.
>
> Dave
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik - thanks, I am considering this approach versus explicit redundant indexing -- and am also considering Lucene -- problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)

Dave

----- Original Message ----- From: Erick Erickson [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 12, 2007 2:11:14 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

> DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful... For your line number, page number etc perspective, it is possible to index special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum data along with SpanQueries to figure this out at search time. [...]
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote:
> Erik - thanks, I am considering this approach, versus explicit redundant indexing -- and am also considering Lucene

There's not a well-defined solution in either, IMO.

> problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled.

-Yonik
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
: - problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)
:
: Unfortunately the OS Summit has been canceled.

Or rescheduled to 2008 ... depending on whether you are a half-empty / half-full kind of person.

And let's not forget Atlanta ... starting today and all... http://us.apachecon.com/us2007/

-Hoss
Re: Associating pronouns instances to proper nouns?
All - I have found (from using the Admin/Analysis page) that if I were to append unique initials (that don't match any other word or acronym) to each pronoun (e.g. I-WCN, she-WCN, my-WCN, etc.), the default parsing and tokenization for the text field in SOLR might actually do the trick -- it parses down to I, wcn, IWCN, i, idgn -- all at the same word position -- so that is perfect. I haven't exhaustively tested all capitalization nuances, but am not too worried about that. If I want to do an exhaustive search for person WCN, I just have to enter his/her initials and can then get all references, including pronouns. Anybody see any holes in this? (Sounds alarmingly easy so far.)

Dave

----- Original Message ----- From: David Neubert [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 12, 2007 3:04:20 PM Subject: Re: Associating pronouns instances to proper nouns?

> Attempting to answer my own question, which I should probably just try, assuming I can doctor the indexed text: I suppose I could do something like change all instances of I, he, etc. that refer to one person to IJBA, HEJBA, HIMJBA (making sure they would never equal a normal word), then use the synonym feature to link them. [...]
Re: Faceting over limited result set
On 13/11/2007, Chris Hostetter [EMAIL PROTECTED] wrote:
> can you elaborate on your use case ... the only time i've ever seen people ask about something like this it was because true facet counts were too expensive to compute, so they were doing sampling of the first N results. In Solr, sampling like this would likely be just as expensive as getting the full count.

It's not really a performance-related issue; the primary goal is to use the facet information to determine the most relevant product category related to the particular search being performed. Generally the facets returned by simple, generic queries are fine for this purpose (e.g. a search for "nokia" will correctly return Mobile / Cell Phone as the most frequent facet). However, facet data for more specific searches is not as clear-cut (e.g. "samsung tv", where TVs will appear at the top of the search results but will also match other Samsung products like mobile phones and MP3 players - obviously I could tweak the 'mm' parameter to fix this particular case, but it wouldn't really solve my problem).

The theory is that facet information generated from the first 'x' (let's say 100) matches to a query (ordered by score / relevance) will be more accurate (for the above purpose) than facets obtained over the entire result set. So ideally, it would be useful to be able to constrain the size of the DocSet somehow (as you mention below).

> matching occurs in increasing order of docid, so even if there was a hook to say "stop matching after N docs", those N wouldn't be a good representative sample; they would be biased towards older documents (based on when they were indexed, not on any particular date field). if what you are interested in is stats on the first N docs according to a specific sort (score or otherwise), then you could write a custom request handler that executed a search with a limit of N, got the DocList, iterated over it to build a DocSet, and then used that DocSet to do faceting ... but that would probably take even longer than just using the full DocSet matching the entire query.

I was hoping to avoid having to write a custom request handler, but your suggestion above sounds like it would do the trick. I'm also debating whether to extract my own facet info from a result set on the client side, but this would be even slower.

Thanks for your suggestions so far,
Piete
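For reference, the DocList-to-DocSet step Hoss describes would look something like this inside a custom request handler. The class and method names are from the Solr APIs of this era, so treat it as a sketch to verify against your build rather than a drop-in implementation:

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.HashDocSet;
import org.apache.solr.search.SolrIndexSearcher;

public class TopNFacetSampler {
    /** Collect the internal ids of the top-n docs (by score) into a DocSet
     *  that can then be handed to the faceting code. */
    public static DocSet topDocsAsSet(SolrIndexSearcher searcher, Query q, int n)
            throws IOException {
        // score-ordered top n; no filter
        DocList top = searcher.getDocList(q, (Query) null, Sort.RELEVANCE, 0, n);
        int[] ids = new int[top.size()];
        DocIterator it = top.iterator();
        int i = 0;
        while (it.hasNext()) {
            ids[i++] = it.nextDoc(); // internal Lucene doc ids
        }
        return new HashDocSet(ids, 0, i);
    }
}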
Re: DINSTINCT ON functionality in Solr?
Currently this functionality is not available in Solr out of the box; however, there is a patch implementing field collapsing, http://issues.apache.org/jira/browse/SOLR-236, which might be similar to what you are trying to achieve.

Piete

On 13/11/2007, Jörg Kiegeland [EMAIL PROTECTED] wrote:
> Is there a way to define a query in that way that a search result contains only one representative of every set of documents which are equal on a given field [...]
Re: Faceting over limited result set
: It's not really a performance-related issue, the primary goal is to use the
: facet information to determine the most relevant product category related to
: the particular search being performed.

ah ... ok, i understand now. the order does matter: you want the top N documents sorted by some criteria (either score, or maybe popularity i would imagine) and then you want to pick the categories based on that.

i had to build this for CNET back before solr went open source, but yes - i did it using a custom subclass of dismax similar to what i described before. one thing to watch out for: you probably want to use a consistent sort independent of the user's sort -- if the user re-sorts by price, it can be disconcerting for them if that changes the navigation links.

-Hoss
Re: Does SOLR supports multiple instances within the same webapplication?
If I understand correctly, you just do it like this (I use PHP) - each Solr instance has its own URL, and you query each one separately:

$data1 = file_get_contents($instance1Url); // e.g. http://localhost:8983/solr1/select?q=...
$data2 = file_get_contents($instance2Url); // e.g. http://localhost:8983/solr2/select?q=...

You simply have multiple Solr instances and fetch the data from each one.

On Nov 12, 2007 11:15 PM, Dilip.TS [EMAIL PROTECTED] wrote:
> Hello, Does SOLR support multiple instances within the same web application? If so, how is this achieved? [...]

-- regards jl