RE: Customizing Solr to handle Leading Wildcard queries
Hi,

Thanks Otis, Newton and everyone else for the help on this issue. Most of the data I index are documents like PDFs, Word docs, open office documents, etc. I store the content of the document in a field called content, and the remaining metadata of the document (name, id, created by, modified by, created on, etc.) in a copy field called metadata. I am not particularly interested in enabling leading wildcard characters in the content (although such a possibility would be a bonus). For this, I've tried implementing the suggestion to store reversed strings as well as the original strings for the metadata field. All leading wildcard queries like "*abc" are searched as "cba*" against the reversed metadata field. So far so good. Thank you :)

But now I ran into the scenario where the query string is *abc* :( and the whole thing came crashing down again. I cannot ignore such queries. I would rather take the risk of Solr OOMing by enabling leading wildcard query searches. Can someone please tell me the steps to turn on this feature in the Lucene QueryParser? I am sure it would be helpful to many to document such a procedure on the Wiki or somewhere else. (I am definitely going to do that once I fix this. Too much trouble this seems to be.) Also, which query parser does Solr use by default?

Thanks,
Kumar

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Thursday, January 15, 2009 10:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Customizing Solr to handle Leading Wildcard queries

Hi ramuK,

I believe you can turn that "on" via the Lucene QueryParser, but of course such searches will be slo(oo)w. You can also index reversed tokens (e.g. *kumar --> ramuk*) or you could index n-grams with begin/end delim characters (e.g.
kumar -> ^ k u m a r $, *kumar -> "k u m a r $") Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "Jana, Kumar Raja" > To: solr-user@lucene.apache.org > Sent: Thursday, January 15, 2009 9:49:24 AM > Subject: RE: Customizing Solr to handle Leading Wildcard queries > > Hi Erik, > > Thanks for the quick reply. > I want to enable leading wildcard query searches in general. The case > mentioned in the earlier mail is just one of the many instances I use > this feature. > > -Kumar > > > > > -Original Message- > From: Erik Hatcher [mailto:e...@ehatchersolutions.com] > Sent: Thursday, January 15, 2009 7:59 PM > To: solr-user@lucene.apache.org > Subject: Re: Customizing Solr to handle Leading Wildcard queries > > > On Jan 15, 2009, at 8:23 AM, Jana, Kumar Raja wrote: > > Not being able to perform Leading Wildcard queries is a major > > handicap. > > I want to be able to perform searches like *.pdf to fetch all pdf > > documents from Solr. > > For this particular case, I recommend indexing the document type as a > separate field. Something like type:pdf (or use a MIME type string). > Then you can do a very direct and fast query to search or facet by > document types. > > Erik
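For what it's worth, the Lucene-side switch Kumar is asking about is QueryParser.setAllowLeadingWildcard(true) on the parser instance; Solr 1.3's standard request handler does not expose it, so enabling it means customizing the query parser. Otis's reversed-token trick, and the reason a double wildcard like *abc* defeats it, can be sketched outside Solr. This is illustrative Python, not Solr code, assuming a single reversed copy of each indexed term:

```python
def index_forms(term):
    """Index both the normal and the reversed form of a term
    (the reversed copy goes into the reversed metadata field)."""
    return {"normal": term, "reversed": term[::-1]}

def rewrite_leading_wildcard(q):
    """Rewrite *abc to cba* against the reversed field; a double
    wildcard like *abc* cannot be rewritten this way and needs
    the delimited n-gram approach instead."""
    if q.startswith("*") and q.endswith("*"):
        raise ValueError("double wildcard: use delimited n-grams")
    if q.startswith("*"):
        return q[1:][::-1] + "*"
    return q

print(rewrite_leading_wildcard("*abc"))   # cba*
print(index_forms("kumar")["reversed"])   # ramuk
```

The *abc* case is exactly where this scheme breaks down, which is why the begin/end-delimited n-grams (^ k u m a r $) are the more general answer.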
Re: Optimizing & Improving results based on user feedback
OK, I've implemented this before and written academic papers and patents related to this task. Here are some hints:

- you're on the right track with the editorial boosting elevators
- http://wiki.apache.org/solr/UserTagDesign
- be darn careful about assuming that one click is enough evidence to boost a long 'distance'
- first page effects in search will skew the learning badly if you don't compensate. 95% of users never go past the first page of results, 1% go past the second page. So perfectly good results on the second page get permanently locked out
- consider forgetting what you learn under some condition

In fact this whole area is called 'learning to rank' and is a hot research topic in IR.
http://web.mit.edu/shivani/www/Ranking-NIPS-05/
http://research.microsoft.com/en-us/um/people/lr4ir-2007/
https://research.microsoft.com/en-us/um/people/lr4ir-2008/

- Neal Richter

On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo wrote:
> Hello folks!
>
> We've been thinking about ways to improve organic search results for a while
> (really, who hasn't?) and I'd like to get some ideas on ways to implement a
> feedback system that uses user behavior as input. Basically, it'd work on
> the premise that what the user actually clicked on is probably a really good
> match for their search, and should be boosted up in the results for that
> search.
>
> For example, if I search for "rain boots", and really love the 10th result
> down (and show it by clicking on it), then we'd like to capture this and use
> the data to boost up that result //for that search//. We've thought about
> using index time boosts for the documents, but that'd boost it regardless of
> the search terms, which isn't what we want. We've thought about using the
> Elevator handler, but we don't really want to force a product to the top -
> we'd prefer it slowly rises over time as more and more people click it from
> the same search terms.
> Another way might be to stuff the keyword into the document, the more times
> it's in the document the higher it'd score - but there's gotta be a better
> way than that.
>
> Obviously this can't be done 100% in solr - but if anyone had some clever
> ideas about how this might be possible it'd be interesting to hear them.
>
> Thanks for your time!
>
> Matthew Runo
> Software Engineer, Zappos.com
> mr...@zappos.com - 702-943-7833
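Neal's first-page warning can be handled by weighting each click by the inverse of how likely users were to even look at that position, so a click deep in the results counts as stronger evidence. A toy sketch of the idea; the examination probabilities and the log format are invented for illustration, and this is not Solr code:

```python
from collections import defaultdict

def examine_prob(pos):
    """Invented examination probabilities: nearly everyone sees the
    top results, very few look past the first page (10 results)."""
    return 1.0 / pos if pos <= 10 else 0.05

def debiased_click_weights(click_log):
    """click_log: (query, doc_id, position) click events.
    A click at a rarely-examined position counts for more, so good
    results stuck on page two are not permanently locked out."""
    weights = defaultdict(float)
    for query, doc, pos in click_log:
        weights[(query, doc)] += 1.0 / examine_prob(pos)
    return dict(weights)

log = [("rain boots", "docA", 1), ("rain boots", "docJ", 10)]
w = debiased_click_weights(log)
# one click at position 10 now outweighs one click at position 1
print(w[("rain boots", "docJ")] > w[("rain boots", "docA")])  # True
```

The resulting per-(query, doc) weight could then feed a slowly-rising boost, with decay over time to implement the "consider forgetting" hint.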
Re: multilanguage prototype
Hi, I am getting this error in the tomcat log file on passing Chinese text to the content field. The content field uses the CJK tokenizer. INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=69 Jan 28, 2009 12:17:03 PM org.apache.solr.common.SolrException log SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 1)) at [row,col {unknown-source}]: [2,76] at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675) at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4556) at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2888) at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019) at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321) at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) at java.lang.Thread.run(Thread.java:619) regards On 1/28/09, revathy arun wrote: > > Hi, > > > This is the only info in the tomcat log at indexing > > Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute > INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191 > I dont see any ohter errors in the logs . > > when i use curl to update i get success message. > > and commit data in solr admin is showing positive ,where in the index file > there are not indexes created. > > regards > sujatha > > > On 1/27/09, Erik Hatcher wrote: >> >> errors: 11 >> >> What were those? >> >> My hunch is your indexer had issues. What did Solr output into the >> console or log during indexing? 
>> >>Erik >> >> On Jan 27, 2009, at 6:56 AM, revathy arun wrote: >> >> Hi Shalin, >>> >>> The admin page stats are as follows >>> searcherName : searc...@1d4c3d5 main >>> caching : true >>> numDocs : 0 >>> maxDoc : 0 >>> >>> *name: * /update *class: * >>> org.apache.solr.handler.XmlUpdateRequestHandler >>> *version: * $Revision: 690026 $ *description: * Add documents with XML >>> * >>> stats: *handlerStart : 1232692774389 >>> requests : 22 >>> errors : 11 >>> timeouts : 0 >>> totalTime : 1181 >>> avgTimePerRequest : 53.68182 >>> avgRequestsPerSecond : 6.0431463E-5 >>> >>> *stats: *commits : 9 >>> autocommits : 0 >>> optimizes : 2 >>> docsPending : 0 >>> adds : 0 >>> deletesById : 0 >>> deletesByQuery : 0 >>> errors : 0 >>> cumulative_adds : 0 >>> cumulative_deletesById : 0 >>> cumulative_deletesByQuery : 0 >>> cumulative_errors : 0 >>> >>> in the solrconfg.xml i have commented this line >>> >>> >>> >>> >>> so the index will be created in the default data folder under solr home, >>> >>> >>> >>> Thanks for ur time >>> >>> regards >>> >>> sujatha >>> On 1/27/09, Shalin Shekhar Mangar wrote: >>> Are you looking for it in the right place? It is very unlikely that a commit happens and index is not created. The index is usually created inside the data directory as configured in your solconfig.xml Can you search for *:* from the solr admin page and see if documents are returned? On Tue, Jan 27, 2009 at 5:01 PM, revathy arun wrote: this is the stats of my updatehandler > but i still dont see any index created > *stats: *commits : 7 > autocommits : 0 > optimizes : 2 > docsPending : 0 > adds : 0 >>
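The SEVERE above is the XML parser (Woodstox) rejecting a CTRL-A (code 1) byte inside the posted document, not a tokenizer problem: such characters are illegal in XML 1.0 regardless of language or analyzer. The usual client-side fix is to scrub field values before building the update message; a Python sketch of that (illustrative, not part of Solr):

```python
import re

# XML 1.0 allows only tab, newline, and carriage return below U+0020;
# everything else in that range must be stripped before posting.
_ILLEGAL_XML_CHARS = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_illegal_xml(text):
    """Remove control characters that Solr's XML parser rejects."""
    return _ILLEGAL_XML_CHARS.sub("", text)

dirty = "chinese text\x01with a CTRL-A"
print(strip_illegal_xml(dirty))  # chinese textwith a CTRL-A
```

With the control characters removed, the same document should post cleanly, and the zero-docs symptom earlier in the thread (requests: 22, errors: 11) goes away if those errors were all of this kind.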
Re: Text classification with Solr
On Tue, Jan 27, 2009 at 2:21 PM, Grant Ingersoll wrote: > One of the things I am interested in is the marriage of Solr and Mahout > (which has some Genetic Algorithms support) and other ML (Weka, etc.) tools. [snip] I love it, good to know you are thinking big here. Here's another big thought: http://www.eml-r.org/nlp/papers/ponzetto07b.pdf .. but assume we want to extract this type of structure from the full text of Wikipedia rather than the narrow categories DB. > Things that can help with all this: LukeReqHandler, TermVectorComponent, > TermsComponent, others > [snip] > Neal, what did you have in mind for a JIRA issue? I'd love to see a patch. More research needed, but the initial idea would be to enable passing in a weighted term vector as a query and allowing a more-like-this type search on it. Anyone attempt this yet? An interesting point about faceting here is that it would give outgoing feedback on which /new/ words (not in the initial query) would, if added to the query, result in additional discrimination between the matched categories. So Solr outputs a set of categories for a document, and also emits a set of words related to the initial query! Categorization and recommendation in one. - Neal
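Until such a handler exists, a weighted term vector can be approximated today by serializing it into boosted clauses for the standard query syntax. The ^boost operator is standard Lucene/Solr; the field name and weights below are invented for illustration:

```python
def term_vector_to_query(field, weights):
    """Turn a {term: weight} vector into a boosted OR query string
    like 'content:solr^2.50 content:lucene^1.20', suitable for the
    standard request handler (field and weights are examples)."""
    parts = ["%s:%s^%.2f" % (field, term, w)
             for term, w in sorted(weights.items(), key=lambda kv: -kv[1])]
    return " ".join(parts)

print(term_vector_to_query("content", {"solr": 2.5, "lucene": 1.2}))
# content:solr^2.50 content:lucene^1.20
```

A dedicated request handler would skip the string round-trip and score the vector directly, which is presumably what the JIRA issue would propose.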
Re: Store limited text
If you're using a Solr build post-r721758, then copyField has a maxChars property you can take advantage of. I'm probably misremembering some of the exact names of these elements/attributes, but you can basically have this in your schema.xml: Then anything you store in field f will get copied for storage into f_for_retrieval -- but only up to 1M chars. Here the truncation is done by the field copy. Not sure that there's a way to do it right now without a field copy. On Tue, Jan 27, 2009 at 8:45 PM, Gargate, Siddharth wrote: > Hi All, >Is it possible to store only limited text in the field, say, max 1 > mb? The maxFieldLength setting limits only the number of tokens to be > indexed, but stores complete content. > > Thanks, > Siddharth >
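The schema snippet referred to above (stripped by the list archive) would look something like the following; the field names and the 1M figure are illustrative, and since the author says the exact names may be misremembered, check the current schema documentation:

```xml
<!-- Sketch: copy field f into a stored-only field, truncating at
     ~1M chars via maxChars (names here are examples). -->
<field name="f" type="text" indexed="true" stored="false"/>
<field name="f_for_retrieval" type="text" indexed="false" stored="true"/>
<copyField source="f" dest="f_for_retrieval" maxChars="1000000"/>
```

Queries then search f but retrieve the truncated text from f_for_retrieval.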
Re: Setting dataDir in multicore environment
This is just what I needed, thank you so much for the quick response! It's really appreciated! Mark On Tue, Jan 27, 2009 at 9:59 PM, Noble Paul നോബിള് नोब्ळ् < noble.p...@gmail.com> wrote: > There is a patch given for SOLR-883 . > > On Wed, Jan 28, 2009 at 9:43 AM, Noble Paul നോബിള് नोब्ळ् > wrote: > > I shall give a patch today > > > > On Tue, Jan 27, 2009 at 11:58 PM, Mark Ferguson > > wrote: > >> Oh I see, thanks for the clarification. > >> > >> Unfortunately this brings me back to same problem I started with: > implicit > >> properties aren't available when managing indexes through the REST api. > I > >> know there is a patch in the works for this issue but I can't wait for > it. > >> Is there any way to share the solrconfig.xml file and create indexes > >> dynamically? > >> > >> Mark > >> > >> > >> On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള് नोब्ळ् < > >> noble.p...@gmail.com> wrote: > >> > >>> The behavior is expected > >>> properties set in solr.xml are not implicitly used anywhere. > >>> you will have to use those variables explicitly in > >>> solrconfig.xml/schema.xml > >>> instead of hardcoding dataDir in solrconfig.xml you can use it as a > >>> variable $$dataDir > >>> > >>> BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943) > >>> which helps you specify the dataDir in solr.xml > >>> > >>> > >>> On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson > >>> wrote: > >>> > Hi, > >>> > > >>> > In my solr.xml file, I am trying to set the dataDir property the way > it > >>> is > >>> > described in the CoreAdmin page on the wiki: > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > However, the property is being completed ignored. It is using > whatever I > >>> > have set in the solrconfig.xml file (or ./data, the default value, if > I > >>> set > >>> > nothing in that file). Any idea what I am doing wrong? 
I am trying > this > >>> > approach to avoid using ${solr.core.name} in the solrconfig.xml > file, > >>> since > >>> > dynamic properties are broken for creating cores via the REST api. > >>> > > >>> > Mark > >>> > > >>> > >>> > >>> > >>> -- > >>> --Noble Paul > >>> > >> > > > > > > > > -- > > --Noble Paul > > > > > > -- > --Noble Paul >
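For reference, the two approaches discussed in this thread look roughly like this; the per-core dataDir attribute depends on the SOLR-943/SOLR-883 patches mentioned above, so treat this as a sketch rather than released syntax:

```xml
<!-- solr.xml: per-core dataDir (needs the SOLR-943 patch discussed above) -->
<cores adminPath="/admin/cores">
  <core name="core0" instanceDir="core0" dataDir="/var/solr/data/core0"/>
</cores>

<!-- solrconfig.xml: reference the property as a variable instead of
     hardcoding the path, as Noble suggests -->
<dataDir>${dataDir}</dataDir>
```

The variable form is what lets a single shared solrconfig.xml serve cores created dynamically through the admin API.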
Re: question about dismax and parentheses
I found Hoss's explanations at http://www.nabble.com/Dismax-and-Grouping-query-td12938168.html#a12938168; it seems I can't do this. So my question becomes: can I join multiple dismax queries into one? For instance, if I'm looking for +WORD1 +(WORD2 WORD3), it could be translated into a +WORD1 +WORD2 query and a +WORD1 +WORD3 query. Or can I join standard request handler queries against different fields into one? -- View this message in context: http://www.nabble.com/question-about-dismax-and-parentheses-tp21699822p21700182.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Setting dataDir in multicore environment
There is a patch given for SOLR-883 . On Wed, Jan 28, 2009 at 9:43 AM, Noble Paul നോബിള് नोब्ळ् wrote: > I shall give a patch today > > On Tue, Jan 27, 2009 at 11:58 PM, Mark Ferguson > wrote: >> Oh I see, thanks for the clarification. >> >> Unfortunately this brings me back to same problem I started with: implicit >> properties aren't available when managing indexes through the REST api. I >> know there is a patch in the works for this issue but I can't wait for it. >> Is there any way to share the solrconfig.xml file and create indexes >> dynamically? >> >> Mark >> >> >> On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള് नोब्ळ् < >> noble.p...@gmail.com> wrote: >> >>> The behavior is expected >>> properties set in solr.xml are not implicitly used anywhere. >>> you will have to use those variables explicitly in >>> solrconfig.xml/schema.xml >>> instead of hardcoding dataDir in solrconfig.xml you can use it as a >>> variable $$dataDir >>> >>> BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943) >>> which helps you specify the dataDir in solr.xml >>> >>> >>> On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson >>> wrote: >>> > Hi, >>> > >>> > In my solr.xml file, I am trying to set the dataDir property the way it >>> is >>> > described in the CoreAdmin page on the wiki: >>> > >>> > >>> > >>> > >>> > >>> > However, the property is being completed ignored. It is using whatever I >>> > have set in the solrconfig.xml file (or ./data, the default value, if I >>> set >>> > nothing in that file). Any idea what I am doing wrong? I am trying this >>> > approach to avoid using ${solr.core.name} in the solrconfig.xml file, >>> since >>> > dynamic properties are broken for creating cores via the REST api. >>> > >>> > Mark >>> > >>> >>> >>> >>> -- >>> --Noble Paul >>> >> > > > > -- > --Noble Paul > -- --Noble Paul
Re: [dummy question] applying patch
Since you are asking about a 'batch file', are you using Windows? I recommend using TortoiseSVN to apply the patch. On Wed, Jan 28, 2009 at 10:05 AM, surfer10 wrote: > > i'm a little bit noob in java compiler so could you please tell me what tools > are used to apply patch SOLR-236 (Field groupping), does it need to be > applied on current solr-1.3 (and nightly builds of 1.4) or it already in > box? > > what batch file stands for solr compilation in its distributive? > -- > View this message in context: > http://www.nabble.com/-dummy-question--applying-patch-tp21699846p21699846.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
Store limited text
Hi All, Is it possible to store only limited text in a field, say, max 1 MB? The maxFieldLength setting limits only the number of tokens to be indexed, but the complete content is still stored. Thanks, Siddharth
Re: multilanguage prototype
Hi, This is the only info in the tomcat log at indexing Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191 I dont see any ohter errors in the logs . when i use curl to update i get success message. and commit data in solr admin is showing positive ,where in the index file there are not indexes created. regards sujatha On 1/27/09, Erik Hatcher wrote: > > errors: 11 > > What were those? > > My hunch is your indexer had issues. What did Solr output into the console > or log during indexing? > >Erik > > On Jan 27, 2009, at 6:56 AM, revathy arun wrote: > > Hi Shalin, >> >> The admin page stats are as follows >> searcherName : searc...@1d4c3d5 main >> caching : true >> numDocs : 0 >> maxDoc : 0 >> >> *name: * /update *class: * >> org.apache.solr.handler.XmlUpdateRequestHandler >> *version: * $Revision: 690026 $ *description: * Add documents with XML * >> stats: *handlerStart : 1232692774389 >> requests : 22 >> errors : 11 >> timeouts : 0 >> totalTime : 1181 >> avgTimePerRequest : 53.68182 >> avgRequestsPerSecond : 6.0431463E-5 >> >> *stats: *commits : 9 >> autocommits : 0 >> optimizes : 2 >> docsPending : 0 >> adds : 0 >> deletesById : 0 >> deletesByQuery : 0 >> errors : 0 >> cumulative_adds : 0 >> cumulative_deletesById : 0 >> cumulative_deletesByQuery : 0 >> cumulative_errors : 0 >> >> in the solrconfg.xml i have commented this line >> >> >> >> >> so the index will be created in the default data folder under solr home, >> >> >> >> Thanks for ur time >> >> regards >> >> sujatha >> On 1/27/09, Shalin Shekhar Mangar wrote: >> >>> >>> Are you looking for it in the right place? It is very unlikely that a >>> commit >>> happens and index is not created. >>> >>> The index is usually created inside the data directory as configured in >>> your >>> solconfig.xml >>> >>> Can you search for *:* from the solr admin page and see if documents are >>> returned? 
>>> >>> On Tue, Jan 27, 2009 at 5:01 PM, revathy arun >>> wrote: >>> >>> this is the stats of my updatehandler but i still dont see any index created *stats: *commits : 7 autocommits : 0 optimizes : 2 docsPending : 0 adds : 0 deletesById : 0 deletesByQuery : 0 errors : 0 cumulative_adds : 0 cumulative_deletesById : 0 cumulative_deletesByQuery : 0 cumulative_errors : 0 regards On 1/27/09, revathy arun wrote: > > Hi > > I have committed.The admin page does not show any docs pending or > committed > or any errors. > > Regards > Sujatha > > > On 1/27/09, Shalin Shekhar Mangar wrote: > >> >> Did you commit after the updates? >> >> 2009/1/27 revathy arun >> >> Hi, >>> >>> I have downloade solr1.3.0 . >>> >>> I need to index chinese content ,for this i have defined a new field >>> >> in > the >> >>> schema >>> >>> as >>> >>> >>> >> positionIncrementGap="100"> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> I beleive solr1.3 already has the cjkanalyzer by default. >>> >>> my schema in the testing stage has only 2 fields >>> >>> >> >> required="true" >> >>> /> >>> >>> >> >> /> >>> >>> >>> >>> However when i index the chinese text into content , no index is >>> >> being >>> created.i dont see any errors in tomcat as well . >>> >>> this is only entry in tomcat on updating >>> >>> Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute >>> INFO: [] webapp=/lang_prototype path=/update params={} status=0 >>> >> QTime=191 >> >>> >>> I have attached the chinese text file for reference. >>> >>> >>> >>> Regards >>> >>> sujatha >>> >>> >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> >> > > >>> >>> >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >>> >>> >
[dummy question] applying patch
I'm a bit of a noob with the Java toolchain, so could you please tell me what tools are used to apply the patch SOLR-236 (field grouping)? Does it need to be applied to current solr-1.3 (or the nightly builds of 1.4), or is it already in the box? Also, which batch file in the distribution is used to compile Solr? -- View this message in context: http://www.nabble.com/-dummy-question--applying-patch-tp21699846p21699846.html Sent from the Solr - User mailing list archive at Nabble.com.
question about dismax and parentheses
Hello, dear members. I'm a little bit confused about the dismax syntax. As far as I know (and I might be wrong) it supports a default query language such as +WORD -WORD. What about parentheses? The title of my doc consists of WORD1 WORD2 WORD3. When I try to search +WORD1 +(WORD2 WORD4) +WORD3 it does not match. How can I query for that? Also, could you please tell me how I can search for such a construction as a phrase? -- View this message in context: http://www.nabble.com/question-about-dismax-and-parentheses-tp21699822p21699822.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Connection mismanagement in Solrj?
If you are making requests in parallel, then it is likely that you see many connections open at a time. They will get cleaned up over time. But if you wish to clean them up explicitly, use httpclient.getHttpConnectionManager()#closeIdleConnections(). On Tue, Jan 27, 2009 at 8:22 PM, Walter Underwood wrote: > Making requests in parallel, using the default connection manager, > which is multi-threaded, and we are reusing a single CommonsHttpSolrServer > for all requests. > > wunder > > On 1/26/09 10:59 PM, "Noble Paul നോബിള് नोब्ळ्" > wrote: > >> are you making requests in parallel ? >> which ConnectionManager are you using for HttpClient? >> >> On Tue, Jan 27, 2009 at 11:58 AM, Noble Paul നോബിള് नोब्ळ् >> wrote: >>> you can set any connection parameters for the HttpClient and pass on >>> the instance to CommonsHttpSolrServer and that will be used for making >>> requests >>> >>> make sure that you are not reusing instance of CommonsHttpSolrServer >>> >>> On Tue, Jan 27, 2009 at 10:59 AM, Walter Underwood >>> wrote: We just switched to Solrj from a home-grown client and we have a huge jump in the number of connections to the server, enough that our load balancer was rejecting connections in production tonight. Does that sound familiar? We're running 1.3. I set the timeouts and connection pools to the same values I'd used in my other code, also based on HTTPClient. We can roll back to my code temporarily, but we want some of the Solrj facet support for a new project. wunder >>> >>> >>> >>> -- >>> --Noble Paul >>> >> >> > > -- --Noble Paul
Re: Setting dataDir in multicore environment
I shall give a patch today On Tue, Jan 27, 2009 at 11:58 PM, Mark Ferguson wrote: > Oh I see, thanks for the clarification. > > Unfortunately this brings me back to same problem I started with: implicit > properties aren't available when managing indexes through the REST api. I > know there is a patch in the works for this issue but I can't wait for it. > Is there any way to share the solrconfig.xml file and create indexes > dynamically? > > Mark > > > On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള് नोब्ळ् < > noble.p...@gmail.com> wrote: > >> The behavior is expected >> properties set in solr.xml are not implicitly used anywhere. >> you will have to use those variables explicitly in >> solrconfig.xml/schema.xml >> instead of hardcoding dataDir in solrconfig.xml you can use it as a >> variable $$dataDir >> >> BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943) >> which helps you specify the dataDir in solr.xml >> >> >> On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson >> wrote: >> > Hi, >> > >> > In my solr.xml file, I am trying to set the dataDir property the way it >> is >> > described in the CoreAdmin page on the wiki: >> > >> > >> > >> > >> > >> > However, the property is being completed ignored. It is using whatever I >> > have set in the solrconfig.xml file (or ./data, the default value, if I >> set >> > nothing in that file). Any idea what I am doing wrong? I am trying this >> > approach to avoid using ${solr.core.name} in the solrconfig.xml file, >> since >> > dynamic properties are broken for creating cores via the REST api. >> > >> > Mark >> > >> >> >> >> -- >> --Noble Paul >> > -- --Noble Paul
Re: Connection mismanagement in Solrj?
Could it be the framework you are using around it? I know some IOC containers will auto pool objects underneath as a service without you really knowing it is being done or has to be explicitly turned off. Just a thought. I use a single server for all requests behind a Hivemind setup ... umm not by choice :-\ - Jon On Tue, Jan 27, 2009 at 12:32 PM, Ryan McKinley wrote: > if you use this constructor: > > public CommonsHttpSolrServer(URL baseURL, HttpClient client) > > then solrj never touches the HttpClient configuration. > > I normally reuse a single CommonsHttpSolrServer as well. > > > > On Jan 27, 2009, at 9:52 AM, Walter Underwood wrote: > > Making requests in parallel, using the default connection manager, >> which is multi-threaded, and we are reusing a single CommonsHttpSolrServer >> for all requests. >> >> wunder >> >> On 1/26/09 10:59 PM, "Noble Paul നോബിള് नोब्ळ्" >> wrote: >> >> are you making requests in parallel ? >>> which ConnectionManager are you using for HttpClient? >>> >>> On Tue, Jan 27, 2009 at 11:58 AM, Noble Paul നോബിള് नोब्ळ् >>> wrote: >>> you can set any connection parameters for the HttpClient and pass on the instance to CommonsHttpSolrServer and that will be used for making requests make sure that you are not reusing instance of CommonsHttpSolrServer On Tue, Jan 27, 2009 at 10:59 AM, Walter Underwood wrote: > We just switched to Solrj from a home-grown client and we have a huge > jump in the number of connections to the server, enough that our > load balancer was rejecting connections in production tonight. > > Does that sound familiar? We're running 1.3. > > I set the timeouts and connection pools to the same values I'd > used in my other code, also based on HTTPClient. > > We can roll back to my code temporarily, but we want some of > the Solrj facet support for a new project. > > wunder > > > -- --Noble Paul >>> >>> >> >
Re: Highlighting does not work?
They are documented in http://wiki.apache.org/solr/FieldOptionsByUseCase and in the FAQ, but I agree that it could be more readily accessible. -Mike On 27-Jan-09, at 5:26 AM, Jarek Zgoda wrote: Finally found that the fields have to have an analyzer to be highlighted. Neat. Can I ask somebody to document all these requirements? Message written on 2009-01-27 at 13:49 by Jarek Zgoda: I turned these fields to indexed + stored but the results are exactly the same, no matter if I search in these fields or elsewhere. Message written on 2009-01-27 at 13:09 by Jarek Zgoda: Solr 1.3 I'm trying to get highlighting working, with no luck so far. Query with params q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description finds 182 documents in my index. All of the top 10 hits contain the word "cyrus", but the highlights list is empty. The fields "title" and "description" are stored but not indexed. If I specify "*" as the hl.fl value I get the same results. Do I need to add some special configuration to enable the highlighting feature? -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl
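The requirement Jarek ran into (fields must be indexed, stored, and run through an analyzer to be highlighted) translates to schema.xml entries along these lines; "text" here stands for the stock analyzed field type in the example schema, and the field names match his query:

```xml
<!-- Sketch: highlightable fields must be stored AND indexed with an
     analyzed (text) type, not stored-only as in the original schema. -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
```

A full reindex is needed after this change, since stored-but-unindexed fields have no terms to highlight against.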
Tools for Managing Synonyms, Elevate, etc.
I'm considering building some tools for our internal non-technical staff to write to synonyms.txt, elevate.xml, spellings.txt, and protwords.txt so software developers don't have to maintain them. Before my team starts building these tools, has anyone done this before? If so, are these tools available as open source? Thanks, Mark Cohen
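Most of the work in such tools is validating what non-technical staff type before it reaches Solr. As a hedged illustration (the format rules below are paraphrased from the stock synonyms.txt example file: comma-separated groups, optional `=>` mappings, `#` comments), a minimal Python check might look like:

```python
def validate_synonyms(lines):
    """Return a list of (line_no, problem) for a synonyms.txt body."""
    problems = []
    for no, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if line.count("=>") > 1:
            problems.append((no, "more than one '=>' mapping"))
            continue
        for side in line.split("=>"):
            terms = [t.strip() for t in side.split(",")]
            if any(not t for t in terms):
                problems.append((no, "empty term (stray comma?)"))
    return problems

body = ["# comment", "GB, gigabyte, gigabytes",
        "foo => bar, baz", "bad,,list"]
print(validate_synonyms(body))  # [(4, 'empty term (stray comma?)')]
```

A tool like this can gate writes to the file, with analogous checks for elevate.xml (well-formed XML, known doc ids) before developers ever see it.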
Re: Optimizing & Improving results based on user feedback
I've been thinking about the same thing. We have a set of queries that defy straightforward linguistics and ranking, like figuring out how to match "charlie brown" to "It's the Great Pumpkin, Charlie Brown" in October and to "A Charlie Brown Christmas" in December. I don't have any solutions yet, but I recommend analyzing click logs and looking at queries where the most-clicked item is not #1. wunder On 1/27/09 1:06 PM, "Matthew Runo" wrote: > Hello folks! > > We've been thinking about ways to improve organic search results for a > while (really, who hasn't?) and I'd like to get some ideas on ways to > implement a feedback system that uses user behavior as input. > Basically, it'd work on the premise that what the user actually > clicked on is probably a really good match for their search, and > should be boosted up in the results for that search. > > For example, if I search for "rain boots", and really love the 10th > result down (and show it by clicking on it), then we'd like to capture > this and use the data to boost up that result //for that search//. > We've thought about using index time boosts for the documents, but > that'd boost it regardless of the search terms, which isn't what we > want. We've thought about using the Elevator handler, but we don't > really want to force a product to the top - we'd prefer it slowly > rises over time as more and more people click it from the same search > terms. Another way might be to stuff the keyword into the document, > the more times it's in the document the higher it'd score - but > there's gotta be a better way than that. > > Obviously this can't be done 100% in solr - but if anyone had some > clever ideas about how this might be possible it'd be interesting to > hear them. > > Thanks for your time! > > Matthew Runo > Software Engineer, Zappos.com > mr...@zappos.com - 702-943-7833
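Wunder's suggestion, finding queries where the most-clicked item is not ranked #1, is easy to sketch over a click log. An illustrative Python version with an invented log format of (query, clicked_doc, position_clicked) events:

```python
from collections import Counter, defaultdict

def mismatched_queries(click_log):
    """Return (query, doc) pairs where the most-clicked document was
    never seen at position 1, i.e. ranking and users disagree."""
    clicks = defaultdict(Counter)   # query -> doc -> click count
    best_pos = {}                   # (query, doc) -> best position seen
    for query, doc, pos in click_log:
        clicks[query][doc] += 1
        key = (query, doc)
        best_pos[key] = min(pos, best_pos.get(key, pos))
    out = []
    for query, counter in clicks.items():
        favorite, _ = counter.most_common(1)[0]
        if best_pos[(query, favorite)] != 1:
            out.append((query, favorite))
    return out

log = [("rain boots", "docA", 1), ("rain boots", "docJ", 10),
       ("rain boots", "docJ", 10), ("umbrella", "docB", 1)]
print(mismatched_queries(log))  # [('rain boots', 'docJ')]
```

The resulting list is exactly the set of queries worth manual curation (or seasonal boosts, for the Charlie Brown case).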
Re: Text classification with Solr
I guess I've been called to the chalkboard... I haven't looked specifically at putting the taxonomy in Lucene/Solr, but it is an interesting idea. In reading the paper you mentioned, there are some interesting ideas there and Solr could obviously just as easily be used as Lucene, I think. One of the things I am interested in is the marriage of Solr and Mahout (which has some Genetic Algorithms support) and other ML (Weka, etc.) tools. So, for instance in the paper, they have multiple indexes, one for negative and positive sets; well, that could be done with Solr cores or just through intelligent filtering. Then, you could have Mahout do its training/clustering/whatever in the background as needed just by sending ReqHandler commands, and output its model so that it can be shared on the "output" side and you can nicely serve up your results as part of search results or even standalone, either as a SearchComponent or from the ReqHandler. Of course, the tricky part is in the implementation and managing the memory, threading, etc. Things that can help with all this: LukeReqHandler, TermVectorComponent, TermsComponent, and others. As for Hannes' question about "Why Solr": I think you can still get close to the metal w/ Solr just as with Lucene, but now you have the built-in framework that makes experimentation so much easier, IMO, plus you have all the features that Solr has to offer. For instance, a reasonable thing to do with the output from the classification is, of course, to facet on it. Neal, what did you have in mind for a JIRA issue? I'd love to see a patch. On Jan 26, 2009, at 12:29 PM, Neal Richter wrote: Hey all, I'm in the process of implementing a system to do 'text classification' with Solr. The basic idea is to take an ontology/taxonomy like dmoz of {label: "X", tags: "a,b,c,d,e"}, index it and then classify documents into the taxonomy by pushing the parsed document into the Solr search API. Why?
Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here. Has anyone worked on this or a similar project yet? I've seen some talk on the list about this area but it's pretty thin... December thread "Taxonomy Support on Solr". I'm assuming Grant Ingersoll is looking at similar things with his 'taming text' project. I store the 'documents' in another repository and they are far too dynamic (write intensive) for direct indexing in Solr... so the previously suggested procedure of 1) store document 2) execute more-like-this and 3) delete document would be too slow. If people are interested I could start a JIRA issue on this (I do not see anything there at the moment). Thanks - Neal Richter http://aicoder.blogspot.com -- Grant Ingersoll http://www.lucidimagination.com/ Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
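A toy sketch of the approach the thread describes: index taxonomy nodes as {label, tags}, then classify a document by scoring its tokens against each node's tags and keeping labels above a threshold. The overlap scoring below is a crude stand-in for a real Solr/Lucene query with term boosting; all names are illustrative.

```python
# Stand-in for "push the parsed document at the taxonomy index":
# score each taxonomy node by the fraction of its tags found in the
# document, and keep labels above a threshold. A real implementation
# would issue a boosted Solr query instead of this set arithmetic.
def classify(doc_text, taxonomy, threshold=0.2):
    doc_tokens = set(doc_text.lower().split())
    results = []
    for node in taxonomy:
        tags = set(node["tags"].lower().split(","))
        overlap = len(doc_tokens & tags) / len(tags)  # fraction of tags matched
        if overlap >= threshold:
            results.append((node["label"], overlap))
    return sorted(results, key=lambda pair: -pair[1])

taxonomy = [
    {"label": "Sports", "tags": "football,tennis,goal,match"},
    {"label": "Cooking", "tags": "recipe,oven,flour,bake"},
]
print(classify("the match ended without a goal", taxonomy))
```

As Hannes notes later in the thread, the hard part is not the matching but calibrating the in/out thresholds.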
Optimizing & Improving results based on user feedback
Hello folks! We've been thinking about ways to improve organic search results for a while (really, who hasn't?) and I'd like to get some ideas on ways to implement a feedback system that uses user behavior as input. Basically, it'd work on the premise that what the user actually clicked on is probably a really good match for their search, and should be boosted up in the results for that search. For example, if I search for "rain boots", and really love the 10th result down (and show it by clicking on it), then we'd like to capture this and use the data to boost up that result //for that search//. We've thought about using index time boosts for the documents, but that'd boost it regardless of the search terms, which isn't what we want. We've thought about using the Elevator handler, but we don't really want to force a product to the top - we'd prefer it slowly rises over time as more and more people click it from the same search terms. Another way might be to stuff the keyword into the document, the more times it's in the document the higher it'd score - but there's gotta be a better way than that. Obviously this can't be done 100% in solr - but if anyone had some clever ideas about how this might be possible it'd be interesting to hear them. Thanks for your time! Matthew Runo Software Engineer, Zappos.com mr...@zappos.com - 702-943-7833
Re: Indexing documents in multiple languages
First, I'd search the mail archive for the topic of languages; it's been discussed often and there's a wealth of information that might be of benefit, far more than I can remember. As to whether your approach will be "too big, too slow...", you really haven't given enough information to go on. Here are a few of the questions whose answers would help: How many e-mails are you indexing? Are you indexing attachments? How many users do you expect to be using this system? What are your target response times? What is your design queries-per-second? How dynamic is the index (that is, how many e-mails do you expect to add per day, and what latency can you live with between the time an e-mail is indexed and when it's searchable)? If you're indexing 10,000 e-mails, it's one thing. If you're indexing 1,000,000,000 e-mails it's another. Best Erick On Tue, Jan 27, 2009 at 3:05 PM, Alejandro Valdez < alejandro.val...@gmail.com> wrote: > Hi, I plan to use solr to index a large number of documents extracted > from email bodies. Such documents could be in different languages, > and a single document could be in more than one language. In the same > way, the query string could contain words in different languages. > > I read that a common approach to indexing multilingual documents is to > use some algorithm (n-gram) to determine the document language, then use a > stemmer, and finally index the document in a different index for each > language. > > As the document language and the query string can't be detected in a > reliable way, I think that it makes no sense to use a stemmer on them, > because a stemmer is tied to a specific language. > > My plan is to index all the documents in the same index, without any > stemming (the users will have to search for the exact words that > they are looking for). > > But I'm not sure if this approach will make the index too big or too > slow, or if there is a better way to index these kinds of documents. 
> > Any suggestion will be very appreciated. >
Indexing documents in multiple languages
Hi, I plan to use solr to index a large number of documents extracted from email bodies. Such documents could be in different languages, and a single document could be in more than one language. In the same way, the query string could contain words in different languages. I read that a common approach to indexing multilingual documents is to use some algorithm (n-gram) to determine the document language, then use a stemmer, and finally index the document in a different index for each language. As the document language and the query string can't be detected in a reliable way, I think that it makes no sense to use a stemmer on them, because a stemmer is tied to a specific language. My plan is to index all the documents in the same index, without any stemming (the users will have to search for the exact words that they are looking for). But I'm not sure if this approach will make the index too big or too slow, or if there is a better way to index these kinds of documents. Any suggestion would be much appreciated.
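For reference, the n-gram language-detection idea mentioned above can be illustrated with a tiny character-trigram profile matcher. Real detectors use ranked profiles (Cavnar & Trenkle style); this toy version just counts shared trigrams, and the sample profiles are assumptions.

```python
# Minimal n-gram language guessing sketch: build a set of character
# trigrams per language from sample text, then pick the language whose
# profile shares the most trigrams with the input. Toy quality only.
def trigrams(text):
    text = f"  {text.lower()}  "           # pad so word edges form trigrams
    return {text[i:i + 3] for i in range(len(text) - 2)}

def guess_language(text, profiles):
    grams = trigrams(text)
    return max(profiles, key=lambda lang: len(grams & profiles[lang]))

profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog"),
    "es": trigrams("el veloz zorro marron salta sobre el perro perezoso"),
}
print(guess_language("the dog jumps", profiles))
```

As the mail points out, confidence degrades badly on short query strings and mixed-language documents, which is exactly why skipping stemming altogether can be the safer choice.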
index size tripled during optimization
Hi, Starting about one week ago, our index size triples during optimization. The current index statistics are: numDocs : 192702132 size: 76G And we optimize after every 6M-doc update. Since we keep getting new data, the index size increases every day. Before, the index size only doubled during optimization. Why does the index size get tripled instead of doubled during optimization? Is there anything we can do to keep the index to only doubling during optimization? Thanks. Qingdi -- View this message in context: http://www.nabble.com/index-size-tripled-during-optimization-tp21691596p21691596.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Text classification with Solr
27 jan 2009 kl. 17.23 skrev Neal Richter: Is it really necessary to use Solr for it? Things go much faster with the Lucene low-level API, and much faster still if you're loading the classification corpus into RAM. Good points. At the moment I'd rather have a daemon with a service API, as well as the filtering/tokenization capabilities Solr has built in. Probably will attempt to get the corpus' index in memory via a large memory allocation. If it doesn't scale then I'll either go to the Lucene API or implement a custom inverted index via memcached. Other note: /at the moment/ it's not going to be a deeply hierarchical taxonomy, much less a full indexing of an RDF/OWL schema.. there are some gotchas for that. If your corpus is small enough you may want to take a look at lucene/contrib/instantiated. It was made just for these sorts of things. karl
Re: Setting dataDir in multicore environment
Oh I see, thanks for the clarification. Unfortunately this brings me back to the same problem I started with: implicit properties aren't available when managing indexes through the REST API. I know there is a patch in the works for this issue but I can't wait for it. Is there any way to share the solrconfig.xml file and create indexes dynamically? Mark On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള് नोब्ळ् < noble.p...@gmail.com> wrote: > The behavior is expected. > Properties set in solr.xml are not implicitly used anywhere; > you will have to use those variables explicitly in > solrconfig.xml/schema.xml. > Instead of hardcoding dataDir in solrconfig.xml you can use it as a > variable, ${dataDir}. > > BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943) > which helps you specify the dataDir in solr.xml > > > On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson > wrote: > > Hi, > > > > In my solr.xml file, I am trying to set the dataDir property the way it > is > > described in the CoreAdmin page on the wiki: > > > > > > > > > > > > However, the property is being completely ignored. It is using whatever I > > have set in the solrconfig.xml file (or ./data, the default value, if I > set > > nothing in that file). Any idea what I am doing wrong? I am trying this > > approach to avoid using ${solr.core.name} in the solrconfig.xml file, > since > > dynamic properties are broken for creating cores via the REST api. > > > > Mark > > > > > > -- > --Noble Paul >
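A sketch of the setup Noble Paul describes, with hypothetical paths and core names: declare the property per core in solr.xml, then reference it as a variable in the shared solrconfig.xml rather than hardcoding a path.

```xml
<!-- solr.xml: per-core property (paths and core names are examples) -->
<cores adminPath="/admin/cores">
  <core name="core1" instanceDir="core1">
    <property name="dataDir" value="/var/solr/data/core1"/>
  </core>
</cores>
```

```xml
<!-- shared solrconfig.xml: substitute the variable instead of a fixed path -->
<dataDir>${dataDir}</dataDir>
```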
multiple indexes
Hi, I would like to know how this can be implemented. Index1 has fields id,1,2,3 and index2 has fields id,5,6,7. The id in both indexes is a unique id. Can I use "a kind of" distributed search and/or multicore to search, sort, and facet across the two indexes (index1 and index2)? Thanks, Jae joo
Re: Connection mismanagement in Solrj?
if you use this constructor: public CommonsHttpSolrServer(URL baseURL, HttpClient client) then solrj never touches the HttpClient configuration. I normally reuse a single CommonsHttpSolrServer as well. On Jan 27, 2009, at 9:52 AM, Walter Underwood wrote: Making requests in parallel, using the default connection manager, which is multi-threaded, and we are reusing a single CommonsHttpSolrServer for all requests. wunder On 1/26/09 10:59 PM, "Noble Paul നോബിള് नो ब्ळ्" wrote: are you making requests in parallel ? which ConnectionManager are you using for HttpClient? On Tue, Jan 27, 2009 at 11:58 AM, Noble Paul നോബിള് नोब्ळ् wrote: you can set any connection parameters for the HttpClient and pass on the instance to CommonsHttpSolrServer and that will be used for making requests make sure that you are not reusing instance of CommonsHttpSolrServer On Tue, Jan 27, 2009 at 10:59 AM, Walter Underwood wrote: We just switched to Solrj from a home-grown client and we have a huge jump in the number of connections to the server, enough that our load balancer was rejecting connections in production tonight. Does that sound familiar? We're running 1.3. I set the timeouts and connection pools to the same values I'd used in my other code, also based on HTTPClient. We can roll back to my code temporarily, but we want some of the Solrj facet support for a new project. wunder -- --Noble Paul
Re: Connection mismanagement in Solrj?
That's interesting: SolrJ doesn't touch the HttpClient params if one is provided in the constructor. I guess I'd try to sniff the headers first and see if any difference sticks out between the clients. I normally just use netcat and pretend to be the solr server. -Yonik On Tue, Jan 27, 2009 at 12:29 AM, Walter Underwood wrote: > We just switched to Solrj from a home-grown client and we have a huge > jump in the number of connections to the server, enough that our > load balancer was rejecting connections in production tonight. > > Does that sound familiar? We're running 1.3. > > I set the timeouts and connection pools to the same values I'd > used in my other code, also based on HTTPClient. > > We can roll back to my code temporarily, but we want some of > the Solrj facet support for a new project. > > wunder > >
query with stemming, prefix and fuzzy?
Hello, I am trying to get Solr to work properly. I have set up a Solr test server (using jetty as mentioned in the tutorial). I also had to modify the schema.xml so that I have different fields for different languages (with their own stemmers) that occur in the content management system that I am indexing. So far everything works fine, including snippet highlighting. But now I am having problems with two things: A) fuzzy search. When trying to do a fuzzy search, the analyzers seem to break up a search string like "house~0.6" into "house", "0" and "6", so that e.g. a single "6" is highlighted, too. So I tried to use an additional raw field without any stemming, just a lower-case and whitespace analyzer. This seems to work fine. But the fuzzy query is very slow and takes 100% CPU for several seconds with only one query at a time. What can I do to speed up the fuzzy query? I have found, for example, a Lucene parameter prefixLength, but no corresponding Solr option. Does this exist? Are there other options to pay attention to? B) combine stemming, prefix and fuzzy search. Is there a way to combine all three query types in one query? Especially stemming and prefixing? I think it would be problematic, as "house*" would be analyzed to "house" with the usual analyzers that are required for stemming. Do I need different query-type fields and combine them with a boolean OR in the query? Something like data:house OR data_fuzzy:house~0.6 OR data_prefix:house* This feels a little bit circuitous. Is there a way to use "house*~.6" including correct stemming? Thank you, Gert
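The per-purpose-field workaround described above can be composed client-side; a tiny sketch, assuming field names `data`, `data_fuzzy`, and `data_prefix` as in the mail:

```python
# Build the boolean-OR query from the mail: same term sent to a stemmed
# field, a raw field for fuzzy matching, and a raw field for prefixing.
# Field names are the hypothetical ones used in the thread.
def combined_query(term, fuzzy=0.6):
    return (f"data:{term}"
            f" OR data_fuzzy:{term}~{fuzzy}"
            f" OR data_prefix:{term}*")

print(combined_query("house"))
# data:house OR data_fuzzy:house~0.6 OR data_prefix:house*
```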
Re: Text classification with Solr
On Tue, Jan 27, 2009 at 1:36 AM, Hannes Carl Meyer wrote: > Yeah, I know it; the challenge in this method is the calculation of the score > and parametrization of thresholds. Not as worried about the score itself as about the score thresholds for in/out prediction. > Is it really necessary to use Solr for it? Things go much faster with > the Lucene low-level API, and much faster still if you're loading the classification > corpus into RAM. Good points. At the moment I'd rather have a daemon with a service API, as well as the filtering/tokenization capabilities Solr has built in. Probably will attempt to get the corpus' index in memory via a large memory allocation. If it doesn't scale then I'll either go to the Lucene API or implement a custom inverted index via memcached. Other note: /at the moment/ it's not going to be a deeply hierarchical taxonomy, much less a full indexing of an RDF/OWL schema.. there are some gotchas for that. Thanks - Neal
Re: QParserPlugin
So it was me defining it in schema.xml rather than solrconfig.xml. 17:17 < erikhatcher> where are you defining the qparser plugin? 17:18 < erikhatcher> it's very odd... if it isn't picking them up but you reference them, it would certainly give an error 17:18 < karlwettin> as a first level child to schema element in schema.xml 17:19 < erikhatcher> qparser plugins go in solrconfig, not schema 17:19 < karlwettin> aha 17:19 < karlwettin> :) 17:19 < erikhatcher> :) 27 jan 2009 kl. 08.25 skrev Erik Hatcher: Karl - where did you put your a.b.QParserPlugin? You should put it in /lib within a JAR file. I'm surprised you aren't seeing an error though. Erik On Jan 27, 2009, at 1:07 AM, Karl Wettin wrote: Hi forum, I'm trying to get QParserPlugin to work, I've got but still get Unknown query type 'myqueryparser' when I /solr/select/?defType=myqueryparser&q=foo There is no warning about myqueryparser from Solr at startup. I do however manage to get this working: So it shouldn't be my Solr environment or a classpath problem? That's the level of me setting up Solr, I'm left with no clues to why it doesn't register. gratefully, karl
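As the IRC log concludes, query parser plugins are registered in solrconfig.xml, not schema.xml. A minimal sketch of the registration, with a hypothetical plugin class name (the actual class from the mail was stripped by the archive):

```xml
<!-- solrconfig.xml (NOT schema.xml); class name is illustrative -->
<queryParser name="myqueryparser" class="a.b.MyQParserPlugin"/>
```

With this in place, `/solr/select/?defType=myqueryparser&q=foo` should resolve the parser instead of failing with "Unknown query type".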
Re: solrj delete by Id problem
On Tue, Jan 27, 2009 at 8:51 PM, Parisa wrote: > > I found how the issue is created: when solr warms up the new searcher with > cacheLists, if the queryResultCache is enabled the issue appears. > > Notice: as I mentioned before, I commit with waitFlush=false and > waitSearcher=false. > > So it has the problem when the queryResultCache is on. > Ah, so that is the issue. The problem is that when you call commit with waitSearcher=false and waitFlush=false, the call returns immediately without waiting for the commit to complete and the new searcher to be registered. Therefore any queries you make until autowarming completes do not give you the results from the new index. You should call commit with both waitSearcher and waitFlush as true. That should solve the problem. -- Regards, Shalin Shekhar Mangar.
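For clients posting XML updates rather than calling SolrJ's commit(true, true), the equivalent blocking commit message would look like this (a sketch of the standard update message):

```xml
<!-- block until the index is flushed and the new searcher is registered,
     so queries issued afterwards see the deletes -->
<commit waitFlush="true" waitSearcher="true"/>
```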
Re: solrj delete by Id problem
I found how the issue is created: when solr warms up the new searcher with cacheLists, if the queryResultCache is enabled the issue appears. Notice: as I mentioned before, I commit with waitFlush=false and waitSearcher=false, so it has the problem when the queryResultCache is on. But I don't know why the issue is created only in deleteById mode; we don't have the problem when we add a doc and commit with waitFlush=false and waitSearcher=false. I think they both use the same method for warming up the new searcher!!! There is also a comment in the SolrCore class that I am concerned about: solrCore.java public RefCounted getSearcher(boolean forceNew, boolean returnSearcher, final Future[] waitSearcher) throws IOException { -- line 1132 (nightly version) // warm the new searcher based on the current searcher. // should this go before the other event handlers or after? if (currSearcher != null) { future = searcherExecutor.submit( new Callable() { public Object call() throws Exception { try { newSearcher.warm(currSearcher); } catch (Throwable e) { SolrException.logOnce(log,null,e); } return null; } } ); } - -- } -- View this message in context: http://www.nabble.com/solrj-delete-by-Id-problem-tp21433056p21687431.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fastest way to index/reindex
*:* will default to sorting by document insertion order (Lucene's document id, _not_ your Solr uniqueKey). And no, you won't miss any by paging - order will be maintained. Erik On Jan 27, 2009, at 9:52 AM, Ian Connor wrote: When you query by *:*, what order does it use? Is there a chance they will come in a different order as you page through the results (and miss/duplicate some)? Is it best to put the order explicitly by 'id' or is that implied already? On Mon, Jan 26, 2009 at 12:00 PM, Ian Connor wrote: *:* took it up to 45/sec from 28/sec so a nice 60% bump in performance - thanks! On Sun, Jan 25, 2009 at 5:46 PM, Ryan McKinley wrote: I don't know of any standard export/import tool -- i think luke has something, but it will be faster if you write your own. Rather than id:[* TO *], just try *:* -- this should match all documents without using a range query. On Jan 25, 2009, at 3:16 PM, Ian Connor wrote: Hi, Given the only real way to reindex is to save the document again, what is the fastest way to extract all the documents from a solr index to resave them. I have tried the id:[* TO *] trick however, it takes a while once you get a few thousand into the index. Are there any tools that will quickly export the index to a text file, or is making queries 1000 at a time the best option, dealing with the time it takes to query once you are deep into the index? -- Regards, Ian Connor -- Regards, Ian Connor -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
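The reindex loop discussed above can be sketched as simple start/rows paging. Here `fetch` stands in for a real Solr request taking (start, rows) and returning a page of documents; per Erik's note, the *:* ordering is stable, so walking pages this way neither misses nor duplicates documents.

```python
# Paging sketch for a full reindex pass: walk all results in fixed-size
# pages until an empty page signals the end. `fetch(start, rows)` is a
# stand-in for an HTTP query like q=*:*&start=...&rows=...
def iter_all(fetch, rows=1000):
    start = 0
    while True:
        page = fetch(start, rows)
        if not page:
            break
        for doc in page:
            yield doc
        start += rows

corpus = [{"id": i} for i in range(2500)]          # stand-in index
fetch = lambda start, rows: corpus[start:start + rows]
print(sum(1 for _ in iter_all(fetch)))             # 2500, no misses or dupes
```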
Re: fastest way to index/reindex
When you query by *:*, what order does it use? Is there a chance they will come in a different order as you page through the results (and miss/duplicate some)? Is it best to put the order explicitly by 'id' or is that implied already? On Mon, Jan 26, 2009 at 12:00 PM, Ian Connor wrote: > *:* took it up to 45/sec from 28/sec so a nice 60% bump in performance - > thanks! > > > On Sun, Jan 25, 2009 at 5:46 PM, Ryan McKinley wrote: > >> I don't know of any standard export/import tool -- i think luke has >> something, but it will be faster if you write your own. >> >> Rather than id:[* TO *], just try *:* -- this should match all documents >> without using a range query. >> >> >> >> On Jan 25, 2009, at 3:16 PM, Ian Connor wrote: >> >> Hi, >>> >>> Given the only real way to reindex is to save the document again, what is >>> the fastest way to extract all the documents from a solr index to resave >>> them. >>> >>> I have tried the id:[* TO *] trick however, it takes a while once you get >>> a >>> few thousand into the index. Are there any tools that will quickly export >>> the index to a text file or making queries 1000 at a time is the best >>> option >>> and dealing with the time it takes to query once you are deep into the >>> index? >>> >>> -- >>> Regards, >>> >>> Ian Connor >>> >> >> > > > -- > Regards, > > Ian Connor > -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
Re: Connection mismanagement in Solrj?
Making requests in parallel, using the default connection manager, which is multi-threaded, and we are reusing a single CommonsHttpSolrServer for all requests. wunder On 1/26/09 10:59 PM, "Noble Paul നോബിള് नोब्ळ्" wrote: > are you making requests in parallel ? > which ConnectionManager are you using for HttpClient? > > On Tue, Jan 27, 2009 at 11:58 AM, Noble Paul നോബിള് नोब्ळ् > wrote: >> you can set any connection parameters for the HttpClient and pass on >> the instance to CommonsHttpSolrServer and that will be used for making >> requests >> >> make sure that you are not reusing instance of CommonsHttpSolrServer >> >> On Tue, Jan 27, 2009 at 10:59 AM, Walter Underwood >> wrote: >>> We just switched to Solrj from a home-grown client and we have a huge >>> jump in the number of connections to the server, enough that our >>> load balancer was rejecting connections in production tonight. >>> >>> Does that sound familiar? We're running 1.3. >>> >>> I set the timeouts and connection pools to the same values I'd >>> used in my other code, also based on HTTPClient. >>> >>> We can roll back to my code temporarily, but we want some of >>> the Solrj facet support for a new project. >>> >>> wunder >>> >>> >> >> >> >> -- >> --Noble Paul >> > >
Re: Error in Integrating JBoss 4.2 and Solr-1.3.0:
I am also getting the same issue. Has anyone found the solution for this? Please respond. sbutalia wrote: > > I'm having the same issue.. have you had any progress with this? > -- View this message in context: http://www.nabble.com/Error-in-Integrating-JBoss-4.2-and-Solr-1.3.0%3A-tp20202032p21686321.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting does not work?
Finally found that the fields have to have an analyzer to be highlighted. Neat. Can I ask somebody to document all these requirements? Message written on 2009-01-27, at 13:49, by Jarek Zgoda: I turned these fields to indexed + stored but the results are exactly the same, no matter if I search in these fields or elsewhere. Message written on 2009-01-27, at 13:09, by Jarek Zgoda: Solr 1.3 I'm trying to get highlighting working, with no luck so far. A query with params q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description finds 182 documents in my index. All of the top 10 hits contain the word "cyrus", but the highlights list is empty. The fields "title" and "description" are stored but not indexed. If I specify "*" as the hl.fl value I get the same results. Do I need to add some special configuration to enable the highlighting feature? -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl
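A sketch of schema.xml entries that satisfy what this thread discovered: a highlighted field must be stored (so the text can be fetched) and indexed with an analyzed field type. Field names mirror the mail; the `text` type is assumed to be an analyzed type as in the Solr example schema.

```xml
<!-- hypothetical schema.xml entries; "text" is assumed to be an
     analyzed fieldType, which highlighting requires -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
```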
Re: multilanguage prototype
errors: 11 What were those? My hunch is your indexer had issues. What did Solr output into the console or log during indexing? Erik On Jan 27, 2009, at 6:56 AM, revathy arun wrote: Hi Shalin, The admin page stats are as follows searcherName : searc...@1d4c3d5 main caching : true numDocs : 0 maxDoc : 0 *name: * /update *class: * org.apache.solr.handler.XmlUpdateRequestHandler *version: * $Revision: 690026 $ *description: * Add documents with XML * stats: *handlerStart : 1232692774389 requests : 22 errors : 11 timeouts : 0 totalTime : 1181 avgTimePerRequest : 53.68182 avgRequestsPerSecond : 6.0431463E-5 *stats: *commits : 9 autocommits : 0 optimizes : 2 docsPending : 0 adds : 0 deletesById : 0 deletesByQuery : 0 errors : 0 cumulative_adds : 0 cumulative_deletesById : 0 cumulative_deletesByQuery : 0 cumulative_errors : 0 in the solrconfg.xml i have commented this line so the index will be created in the default data folder under solr home, Thanks for ur time regards sujatha On 1/27/09, Shalin Shekhar Mangar wrote: Are you looking for it in the right place? It is very unlikely that a commit happens and index is not created. The index is usually created inside the data directory as configured in your solconfig.xml Can you search for *:* from the solr admin page and see if documents are returned? On Tue, Jan 27, 2009 at 5:01 PM, revathy arun wrote: this is the stats of my updatehandler but i still dont see any index created *stats: *commits : 7 autocommits : 0 optimizes : 2 docsPending : 0 adds : 0 deletesById : 0 deletesByQuery : 0 errors : 0 cumulative_adds : 0 cumulative_deletesById : 0 cumulative_deletesByQuery : 0 cumulative_errors : 0 regards On 1/27/09, revathy arun wrote: Hi I have committed.The admin page does not show any docs pending or committed or any errors. Regards Sujatha On 1/27/09, Shalin Shekhar Mangar wrote: Did you commit after the updates? 2009/1/27 revathy arun Hi, I have downloade solr1.3.0 . 
I need to index Chinese content; for this I have defined a new field in the schema, as I believe solr1.3 already has the CJKAnalyzer by default. My schema in the testing stage has only 2 fields required="true" /> stored="false" /> However when I index the Chinese text into content, no index is being created. I don't see any errors in tomcat either. This is the only entry in tomcat on updating: Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191 I have attached the Chinese text file for reference. Regards sujatha -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
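The schema snippets in this thread were stripped by the archive; as a guess at what such a definition could look like in Solr 1.3, this sketch wires Lucene's contrib CJKAnalyzer into a field type (type and field names are illustrative, not recovered from the mail):

```xml
<!-- hypothetical reconstruction: a CJK-analyzed field type -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>
<field name="content" type="text_cjk" indexed="true" stored="false"/>
```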
Re: Highlighting does not work?
I turned these fields to indexed + stored but the results are exactly the same, no matter if I search in these fields or elsewhere. Message written on 2009-01-27, at 13:09, by Jarek Zgoda: Solr 1.3 I'm trying to get highlighting working, with no luck so far. A query with params q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description finds 182 documents in my index. All of the top 10 hits contain the word "cyrus", but the highlights list is empty. The fields "title" and "description" are stored but not indexed. If I specify "*" as the hl.fl value I get the same results. Do I need to add some special configuration to enable the highlighting feature? -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl
Highlighting does not work?
Solr 1.3 I'm trying to get highlighting working, with no luck so far. A query with params q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description finds 182 documents in my index. All of the top 10 hits contain the word "cyrus", but the highlights list is empty. The fields "title" and "description" are stored but not indexed. If I specify "*" as the hl.fl value I get the same results. Do I need to add some special configuration to enable the highlighting feature? -- We read Knuth so you don't have to. - Tim Peters Jarek Zgoda, R&D, Redefine jarek.zg...@redefine.pl
Re: multilanguage prototype
Hi Shalin, The admin page stats are as follows searcherName : searc...@1d4c3d5 main caching : true numDocs : 0 maxDoc : 0 *name: * /update *class: * org.apache.solr.handler.XmlUpdateRequestHandler *version: * $Revision: 690026 $ *description: * Add documents with XML * stats: *handlerStart : 1232692774389 requests : 22 errors : 11 timeouts : 0 totalTime : 1181 avgTimePerRequest : 53.68182 avgRequestsPerSecond : 6.0431463E-5 *stats: *commits : 9 autocommits : 0 optimizes : 2 docsPending : 0 adds : 0 deletesById : 0 deletesByQuery : 0 errors : 0 cumulative_adds : 0 cumulative_deletesById : 0 cumulative_deletesByQuery : 0 cumulative_errors : 0 in the solrconfg.xml i have commented this line so the index will be created in the default data folder under solr home, Thanks for ur time regards sujatha On 1/27/09, Shalin Shekhar Mangar wrote: > > Are you looking for it in the right place? It is very unlikely that a > commit > happens and index is not created. > > The index is usually created inside the data directory as configured in > your > solconfig.xml > > Can you search for *:* from the solr admin page and see if documents are > returned? > > On Tue, Jan 27, 2009 at 5:01 PM, revathy arun wrote: > > > this is the stats of my updatehandler > > but i still dont see any index created > > *stats: *commits : 7 > > autocommits : 0 > > optimizes : 2 > > docsPending : 0 > > adds : 0 > > deletesById : 0 > > deletesByQuery : 0 > > errors : 0 > > cumulative_adds : 0 > > cumulative_deletesById : 0 > > cumulative_deletesByQuery : 0 > > cumulative_errors : 0 > > > > regards > > > > On 1/27/09, revathy arun wrote: > > > > > > Hi > > > > > > I have committed.The admin page does not show any docs pending or > > committed > > > or any errors. > > > > > > Regards > > > Sujatha > > > > > > > > > On 1/27/09, Shalin Shekhar Mangar wrote: > > >> > > >> Did you commit after the updates? > > >> > > >> 2009/1/27 revathy arun > > >> > > >> > Hi, > > >> > > > >> > I have downloade solr1.3.0 . 
> > >> > > > >> > I need to index chinese content ,for this i have defined a new field > > in > > >> the > > >> > schema > > >> > > > >> > as > > >> > > > >> > > > >> > > >> > positionIncrementGap="100"> > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > I beleive solr1.3 already has the cjkanalyzer by default. > > >> > > > >> > my schema in the testing stage has only 2 fields > > >> > > > >> > > >> required="true" > > >> > /> > > >> > > > >> > /> > > >> > > > >> > > > >> > > > >> > However when i index the chinese text into content , no index is > being > > >> > created.i dont see any errors in tomcat as well . > > >> > > > >> > this is only entry in tomcat on updating > > >> > > > >> > Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute > > >> > INFO: [] webapp=/lang_prototype path=/update params={} status=0 > > >> QTime=191 > > >> > > > >> > I have attached the chinese text file for reference. > > >> > > > >> > > > >> > > > >> > Regards > > >> > > > >> > sujatha > > >> > > > >> > > > >> > > > >> > > >> > > >> > > >> -- > > >> Regards, > > >> Shalin Shekhar Mangar. > > >> > > > > > > > > > > > > -- > Regards, > Shalin Shekhar Mangar. >
Re: multilanguage prototype
Are you looking for it in the right place? It is very unlikely that a commit happens and the index is not created. The index is usually created inside the data directory configured in your solrconfig.xml. Can you search for *:* from the Solr admin page and see if documents are returned?

On Tue, Jan 27, 2009 at 5:01 PM, revathy arun wrote:
> These are the stats of my update handler, but I still don't see any index created:
>
> commits : 7
> autocommits : 0
> optimizes : 2
> docsPending : 0
> adds : 0
> deletesById : 0
> deletesByQuery : 0
> errors : 0
> cumulative_adds : 0
> cumulative_deletesById : 0
> cumulative_deletesByQuery : 0
> cumulative_errors : 0
>
> regards
>
> [earlier quoting in the thread trimmed]

--
Regards,
Shalin Shekhar Mangar.
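For anyone reading this thread in the archive: the match-all sanity check Shalin suggests can also be scripted instead of run from the admin page. A minimal sketch that only builds the query URL (the host, port, and `/select` path are assumptions for a default single-core install; adjust for your deployment):

```python
from urllib.parse import urlencode

# Placeholder base URL for a default Solr install; adjust host/port/path.
SOLR_URL = "http://localhost:8983/solr/select"

# The match-all query: if numFound in the response is 0, nothing was indexed.
params = urlencode({"q": "*:*", "rows": "0", "wt": "json"})
url = SOLR_URL + "?" + params
print(url)
```

Fetching this URL (e.g. with curl or a browser) and checking `numFound` tells you immediately whether any documents made it into the index.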
Re: multilanguage prototype
These are the stats of my update handler, but I still don't see any index created:

commits : 7
autocommits : 0
optimizes : 2
docsPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 0
cumulative_deletesByQuery : 0
cumulative_errors : 0

regards

On 1/27/09, revathy arun wrote:
> Hi
>
> I have committed. The admin page does not show any docs pending or committed, or any errors.
>
> Regards
> Sujatha
>
> [earlier quoting in the thread trimmed]
Re: multilanguage prototype
Hi

I have committed. The admin page does not show any docs pending or committed, or any errors.

Regards
Sujatha

On 1/27/09, Shalin Shekhar Mangar wrote:
> Did you commit after the updates?
>
> 2009/1/27 revathy arun
> > [quoted original message trimmed]
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: multilanguage prototype
Did you commit after the updates?

2009/1/27 revathy arun
> [quoted original message trimmed]

--
Regards,
Shalin Shekhar Mangar.
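Since missing commits come up so often on this list: the XML messages that Solr's `/update` handler expects can be built and inspected before posting. A stdlib-only sketch (field names are taken from this thread; the actual HTTP POST to `/update` is left out):

```python
import xml.etree.ElementTree as ET

def add_doc_xml(fields):
    """Build the <add><doc>...</doc></add> message Solr's /update handler expects."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

# These strings are what you would POST to /update, in this order.
message = add_doc_xml({"id": "doc1", "content": "sample chinese text"})
print(message)
print("<commit/>")  # without a commit, added docs never become searchable
```

The second message is the crucial one here: until `<commit/>` is posted, adds stay pending and no searchable index segment appears.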
multilanguage prototype
Hi,

I have downloaded Solr 1.3.0. I need to index Chinese content; for this I have defined a new field in the schema as [schema snippet stripped by the list archiver]. I believe Solr 1.3 already has the CJKAnalyzer by default.

My schema in the testing stage has only 2 fields [field definitions also stripped].

However, when I index the Chinese text into content, no index is being created. I don't see any errors in Tomcat either; this is the only entry in Tomcat on updating:

Jan 27, 2009 3:46:15 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/lang_prototype path=/update params={} status=0 QTime=191

I have attached the Chinese text file for reference.

Regards

sujatha

[Attachment: "West Nile Virus: Important Information" (西尼羅河病毒症﹕重要資訊), a Chinese-language CDC public-health fact sheet dated August 2004, used here as the sample text for indexing. Full text omitted.]
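Since the schema definitions in the message above were stripped by the archiver, here is a rough sketch of what a CJK field setup in a Solr 1.3 schema.xml could look like. The fieldType and field names are assumptions, but Solr 1.3 does bundle Lucene's CJKAnalyzer, and a fieldType can reference a Lucene analyzer class directly:

```xml
<!-- Hypothetical names; only the analyzer class is certain to exist in Solr 1.3. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<field name="content" type="text_cjk" indexed="true" stored="true"/>
```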
Re: Text classification with Solr
>> Instead of indexing documents about 'sports' and searching for hits
>> based upon 'basketball', 'football' etc., I simply want to index the
>> taxonomy and classify documents into it. This is an ancient
>> AI/Data-Mining discipline, but the standard methods of 'indexing' the
>> taxonomy are/were primitive compared to what one /could/ do with
>> something like Lucene.

Yeah, I know it; the challenge with this method is the calculation of the score and the parametrization of thresholds. Is it really necessary to use Solr for it? Things go much faster with the Lucene low-level API, and faster still if you load the classification corpus into RAM.

On Mon, Jan 26, 2009 at 7:24 PM, Neal Richter wrote:
> Thanks for the link Shalin... played with that a while back. It's
> possibly got some indirect possibilities.
>
> On Mon, Jan 26, 2009 at 10:46 AM, Hannes Carl Meyer wrote:
> > I didn't understand: is the corpus of documents you want to use for
> > classification fixed?
>
> Assume the 'documents' are not stored in the same index, and I want to
> store only the taxonomy or ontology in this index.
>
> Instead of indexing documents about 'sports' and searching for hits
> based upon 'basketball', 'football' etc., I simply want to index the
> taxonomy and classify documents into it. This is an ancient
> AI/Data-Mining discipline, but the standard methods of 'indexing' the
> taxonomy are/were primitive compared to what one /could/ do with
> something like Lucene.
>
> Here's a 2007 research paper that used Lucene directly for
> classification, but doing the inverse of what I described:
> http://www.cs.ucl.ac.uk/staff/R.Hirsch/papers/gecco_HHS.pdf
>
> >>> previously suggested procedure of 1) store document 2) execute
> >>> more-like-this and 3) delete document would be too slow.
> > Do you mean the document to classify?
> > Why do you then want to put it into the index (very expensive)? You just
> > need its contents to build a query!
>
> Exactly. In the December Taxonomy thread Walter Underwood outlined a
> store/classify/delete procedure. Too slow if you have no need to
> index the document itself.
>
> - Neal
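The "index the taxonomy, query it with the document's text" idea discussed above can be illustrated without Lucene at all. A toy sketch in plain Python (category labels, descriptions, and the threshold are invented for illustration; a real system would use Lucene/Solr scoring with tuned thresholds, as noted in the thread):

```python
from collections import Counter

# "Index" each taxonomy node as a bag of words describing it.
taxonomy = {
    "sports": "basketball football soccer team game score player",
    "health": "virus disease symptom fever infection vaccine",
}

def tokenize(text):
    return text.lower().split()

def classify(document, taxonomy, threshold=0.1):
    """Use the document's terms as the 'query' against each category;
    return the best-scoring category, or None below the threshold."""
    doc_terms = Counter(tokenize(document))
    best, best_score = None, 0.0
    for label, description in taxonomy.items():
        cat_terms = set(tokenize(description))
        overlap = sum(n for term, n in doc_terms.items() if term in cat_terms)
        # Score = fraction of document terms matched by the category.
        score = overlap / max(1, sum(doc_terms.values()))
        if score > best_score:
            best, best_score = label, score
    return best if best_score >= threshold else None

print(classify("the team won the basketball game", taxonomy))
```

This is the inverse of the store/more-like-this/delete procedure: the taxonomy is the index, the incoming document is the query, and nothing expensive is ever written per document.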