FunctionQuery of FloatFieldSource (Lucene 5.0)
Hi, I am having problems accessing float values in a Lucene 5.0 index via a FunctionQuery. My setup is as follows.

Indexing time:

```java
Document doc = new Document();
FieldType f = new FieldType();
f.setStored(false);
f.setNumericType(NumericType.FLOAT);
f.setDocValuesType(DocValuesType.NUMERIC);
f.setNumericPrecisionStep(4);
f.setIndexOptions(IndexOptions.DOCS);
for (Entry<Integer, Float> component : vector.entrySet()) {
    String w = component.getKey().toString();
    Float score = component.getValue();
    doc.add(new FloatField(w, score, f));
}
writer.addDocument(doc);
```

At the end of indexing I do:

```java
writer.forceMerge(1);
writer.close();
```

Search time:

```java
for (Entry<Integer, Float> vector : vectors.entrySet()) {
    w = vector.getKey().toString();
    Float score = (Float) vector.getValue();
    Query tq = NumericRangeQuery.newFloatRange(w, 0.0f, 100.0f, true, true);
    FunctionQuery fq = new FunctionQuery(new FloatFieldSource(w));
    CustomScoreQuery customQ = new My_CustomScorerQuery(tq, fq, score);
    TopDocs topDocs = indexSearcher.search(customQ, 1);
}
```

where My_CustomScorerQuery is defined as follows:

```java
public class My_CustomScorerQuery extends CustomScoreQuery {

    public My_CustomScorerQuery(Query mainQuery, FunctionQuery valSrcQuery, Float mainQueryScore) {
        super(mainQuery, valSrcQuery);
        this.mainQueryScore = mainQueryScore;
    }

    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext r) {
        return new My_CustomScorer(r);
    }

    private class My_CustomScorer extends CustomScoreProvider {

        public My_CustomScorer(LeafReaderContext context) {
            super(context);
        }

        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            System.out.println("\thit lucene docID: " + doc
                + "\n\tquery score: " + mainQueryScore
                + "\n\tsubQueryScore: " + subQueryScore
                + "\n\tvalSrcScore: " + valSrcScore);
            return (float) (mainQueryScore * valSrcScore);
        }
    }
}
```

The problem I am seeing is that `valSrcScore` is always 0, and it sometimes disappears entirely if I raise setNumericPrecisionStep above 4.
I am indexing the following 2 docs:

```java
Map<Integer, Float> doc1 = new LinkedHashMap<Integer, Float>();
doc1.put(12, 0.5f);
doc1.put(18, 0.4f);
doc1.put(10, 0.1f);
indexer.indexVector(doc1, doc1);

Map<Integer, Float> doc2 = new LinkedHashMap<Integer, Float>();
doc2.put(10, 0.9f);
doc2.put(1, 0.8f);
doc2.put(9, 0.2f);
doc2.put(2, 0.1f);
```

and testing with the following query:

```java
Map<Integer, Float> query = new LinkedHashMap<Integer, Float>();
query.put(10, 0.8f);
query.put(9, 0.6f);
query.put(2, 0.01f);
```

So field `10` in the query should contribute the following scores for the two documents in the index:

score(query, doc0) = 0.8 * 0.1
score(query, doc1) = 0.8 * 0.9

but I only see:

score(query, doc0) = 0.8 * 0.0
score(query, doc1) = 0.8 * 0.0

i.e. FloatFieldSource is always returning 0. If I subclass FloatFieldSource, then accessing

```java
NumericDocValues arr = DocValues.getNumeric(readerContext.reader(), field);
```

tells me the NumericDocValues of doc0 is 0, which _seems_ to suggest the index does not contain the doc values? I can see the docs fine in Luke. There is a subtle nuance (related to the way I am indexing the fields: some fields are present in a given doc and some are not). Any pointers would be much appreciated. Peyman
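One thing worth checking (an aside, not from the original thread, and hedged on my reading of the 5.0 source): FloatFieldSource decodes the NumericDocValues long with Float.intBitsToFloat, so whatever writes the doc values must store the float's raw IEEE-754 bits (this is what FloatDocValuesField writes), not the numeric value truncated to a long. Truncation collapses every score below 1.0 to exactly the 0 you are seeing. A stdlib sketch of the two encodings (class and method names are illustrative):

```java
public class FloatBitsDemo {

    // Encode the way FloatDocValuesField does: keep the raw IEEE-754 bits.
    static long encodeRawBits(float v) {
        return Float.floatToRawIntBits(v);
    }

    // Broken encoding: numeric truncation, which loses everything below 1.0.
    static long encodeTruncated(float v) {
        return (long) v;
    }

    // Decode the way FloatFieldSource does with the stored long.
    static float decode(long stored) {
        return Float.intBitsToFloat((int) stored);
    }

    public static void main(String[] args) {
        float score = 0.5f;
        System.out.println(decode(encodeRawBits(score)));   // round-trips to 0.5
        System.out.println(decode(encodeTruncated(score))); // collapses to 0.0
    }
}
```

If the truncated encoding is what ends up in the index, adding an explicit FloatDocValuesField(w, score) alongside the FloatField may be the fix, though I have not verified that against 5.0.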
custom search component on solrcloud
Hi, I am trying to port my non-SolrCloud custom search handler to a SolrCloud one. I have read the WritingDistributedSearchComponents wiki page and looked at the TermsComponent and QueryComponent code, but the control flow of execution is still fuzzy (even given the "distributed algorithm" description). Concretely, I have a non-SolrCloud algorithm that, given a sequence of tokens T, would:

1- split T into single tokens
2- for each token t_i, get all the DocLists for t_i by executing rb.req.getSearcher().getDocList() in the process() method of the custom search component
3- do some magic on the collection of doclists

My question is how can I

1) do the splitting (step 1 above) on a single shard,
2) distribute the getDocList for each token t_i to all shards,
3) wait until I have all the doclists from all shards, and then
4) do something with the results on the original calling shard (from step 1 above).

Thank you for your help
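For reference, the distributed hooks on SearchComponent map onto steps 1-4 roughly as below. This is a sketch under assumptions, not a confirmed answer from the thread: the hook names are the Solr 4.x SearchComponent API as I understand it, while the class name and method bodies are illustrative only.

```java
public class TokenComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) {
        // step 1: runs on the aggregating shard -- split T here
    }

    @Override
    public int distributedProcess(ResponseBuilder rb) {
        // step 2: fan a ShardRequest out to every shard
        ShardRequest sreq = new ShardRequest();
        sreq.params = new ModifiableSolrParams(rb.req.getParams());
        rb.addRequest(this, sreq);
        return ResponseBuilder.STAGE_DONE;
    }

    @Override
    public void process(ResponseBuilder rb) {
        // runs on each shard: compute the per-shard doclists for each t_i
    }

    @Override
    public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        // step 3: called on the aggregator once the shard responses are in
        for (ShardResponse rsp : sreq.responses) {
            // collect per-shard results from rsp.getSolrResponse().getResponse()
        }
    }

    @Override
    public void finishStage(ResponseBuilder rb) {
        // step 4: combine everything and write the merged result into rb.rsp
    }
}
```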
commons-configuration NoClassDefFoundError: Predicate
Hi, I've tried all permutations with no results, so I thought I'd write to the group for help. I am running Commons Configuration (http://commons.apache.org/proper/commons-configuration/) just fine via Maven and Ant, but when I try to run the class calling PropertiesConfiguration via a Solr search component I get the following error:

```
org.eclipse.jetty.servlet.ServletHandler - Error for /solr/ArticlesRaw/ingest
java.lang.NoClassDefFoundError: org/apache/commons/collections/Predicate
	at com.xyz.logic(Ingest.java:106)
	at com.xyz.logic.process(Runngest.java:76)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:217)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:533)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections.Predicate
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:430)
	at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
```

Following the suggestions here, http://stackoverflow.com/questions/7651799/proper-usage-of-apache-commons-configuration/7651867#7651867, I am including the appropriate jars in solrconfig.xml:

```xml
<lib dir="${mvnRepository}/commons-lang/commons-lang/2.6/" regex=".*\.jar"/>
<lib dir="${mvnRepository}/commons-collections/commons-collections/3.2.1/" regex=".*\.jar"/>
<lib dir="${mvnRepository}/commons-logging/commons-logging/1.1.1/" regex=".*\.jar"/>
<lib dir="${mvnRepository}/commons-configuration/commons-configuration/1.10/" regex=".*\.jar"/>
```

(the class org.apache.commons.collections.Predicate is in the commons-collections 3.2.1 jar). I am running Solr 4.7.1. Any help would be much appreciated. Peyman
Deleting and committing inside a SearchComponent
Hi, Is it possible to delete and commit updates to an index inside a custom SearchComponent? I know I can do it with SolrJ, but due to several business logic requirements I need to build the logic inside the search component. I am using Solr 4.5.0. Thank you
Re: Deleting and committing inside a SearchComponent
On Dec 3, 2013, at 8:41 PM, Upayavira u...@odoko.co.uk wrote:

On Tue, Dec 3, 2013, at 03:22 PM, Peyman Faratin wrote: Hi, Is it possible to delete and commit updates to an index inside a custom SearchComponent? I know I can do it with SolrJ, but due to several business logic requirements I need to build the logic inside the search component. I am using Solr 4.5.0.

That just doesn't make sense. Search components are read only.

I can think of many situations where it makes sense. For instance, you search for a document and your index contains many duplicates that differ only by one field, such as the time they were indexed (think news feeds from multiple sources). So after the search we want to delete the duplicate documents that satisfy some policy (here date, but it could be some other policy).

What are you trying to do? What stuff do you need to change? Could you do it within an UpdateProcessor?

The solution I am working with:

```java
UpdateRequestProcessorChain processorChain =
    rb.req.getCore().getUpdateProcessingChain(rb.req.getParams().get(UpdateParams.UPDATE_CHAIN));
UpdateRequestProcessor processor = processorChain.createProcessor(rb.req, rb.rsp);
...
docId = f();
...
DeleteUpdateCommand cmd = new DeleteUpdateCommand(req);
cmd.setId(docId.toString());
processor.processDelete(cmd);
```

Upayavira
deleting a doc inside a custom UpdateRequestProcessor
Hi, I am building a custom UpdateRequestProcessor to intercept any doc heading to the index. Basically, I want to check whether the current index already has a doc with the same title (I am using IDs as the uniqueKey so I can't use that, and besides, the checking logic is a little more complicated). If the incoming doc has a duplicate and some other conditions hold, then one of 2 things can happen:

1- we don't index the incoming document
2- we index the incoming doc and delete the duplicate currently in the index

I think (1) can be done by simply not passing the call up the chain (not calling super.processAdd(cmd)). However, I don't know how to implement the second condition, deleting the duplicate document, inside a custom UpdateRequestProcessor. This thread is the closest to my goal: http://lucene.472066.n3.nabble.com/SOLR-4-3-0-Migration-How-to-use-DeleteUpdateCommand-td4062454.html -- however, I am not clear how to proceed. Code snippets below. Thank you in advance for your help.

```java
class isDuplicate extends UpdateRequestProcessor {

    public isDuplicate(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        try {
            boolean indexIncomingDoc = checkIfIsDuplicate(cmd);
            if (indexIncomingDoc)
                super.processAdd(cmd);
        } catch (SolrServerException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    public boolean checkIfIsDuplicate(AddUpdateCommand cmd) ... {
        SolrInputDocument incomingDoc = cmd.getSolrInputDocument();
        if (incomingDoc == null)
            return false;
        String title = (String) incomingDoc.getFieldValue("title");
        SolrIndexSearcher searcher = cmd.getReq().getSearcher();
        boolean addIncomingDoc = true;
        Integer idOfDuplicate = searcher.getFirstMatch(new Term("title", title));
        if (idOfDuplicate != -1) {
            addIncomingDoc = compareDocs(searcher, incomingDoc, idOfDuplicate, title, addIncomingDoc);
        }
        return addIncomingDoc;
    }

    private boolean compareDocs(...) {
        if (condition 1) {
            // -- DELETE DUPLICATE DOC in INDEX --
            addIncomingDoc = true;
        }
        return addIncomingDoc;
    }
}
```
Re: subindex
Hi Erick it makes sense. Thank you for this. peyman On Sep 5, 2013, at 4:11 PM, Erick Erickson erickerick...@gmail.com wrote: Nope. You can do this if you've stored _all_ the fields (with the exception of _version_ and the destinations of copyField directives). But there's no way I know of to do what you want if you haven't. If you have, you'd be essentially spinning through all your docs and re-indexing just the fields you cared about. But if you still have access to your original docs this would be slower/more complicated than just re-indexing from scratch. Best Erick On Wed, Sep 4, 2013 at 1:51 PM, Peyman Faratin pey...@robustlinks.comwrote: Hi Is there a way to build a new (smaller) index from an existing (larger) index where the smaller index contains a subset of the fields of the larger index? thank you
subindex
Hi Is there a way to build a new (smaller) index from an existing (larger) index where the smaller index contains a subset of the fields of the larger index? thank you
Re: State sharing
Got it. Thank you, Jack and Shalin.

On Aug 19, 2013, at 9:52 AM, Jack Krupansky j...@basetechnology.com wrote: Generally, you shouldn't be trying to maintain, let alone share, state in Solr itself. It sounds like you need an application layer between your application clients and Solr, which could then maintain whatever state it needs. -- Jack Krupansky

-----Original Message----- From: Peyman Faratin Sent: Saturday, August 17, 2013 12:29 PM To: solr-user@lucene.apache.org Subject: State sharing

Hi, I have subclassed a SearchComponent (call this class S), and would like to implement the following transaction logic:

1- Client K calls S's handler
2- S spawns a thread and immediately acks K using rb.rsp.add("status", "complete"), then terminates:

```java
public void process(ResponseBuilder rb) {
    SolrParams params = rb.req.getParams();
    try {
        ExecutorService executorService = Executors.newCachedThreadPool();
        Processor job = new Processor(rb);
        executorService.submit(job);
        rb.rsp.add("status", "complete");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
```

3- The thread S started (job above) does two chunks of logic in serial (call these B and C):
i) B does some processing and sends client K a series of status updates, then
ii) C does some processing and in turn sends K a series of status updates, then one final complete message
iii) the transaction ends

I am using Solr 4.3.1. How can I support such a transaction in Solr? I've tried sharing S's ResponseBuilder with the thread, but presumably because S terminates in step 2, K never sees the response from B and C. In general, I would like to implement a mechanism that can share processing state with the client in the same HTTP session. Thank you for your help, Peyman
State sharing
Hi, I have subclassed a SearchComponent (call this class S), and would like to implement the following transaction logic:

1- Client K calls S's handler
2- S spawns a thread and immediately acks K using rb.rsp.add("status", "complete"), then terminates:

```java
public void process(ResponseBuilder rb) {
    SolrParams params = rb.req.getParams();
    try {
        ExecutorService executorService = Executors.newCachedThreadPool();
        Processor job = new Processor(rb);
        executorService.submit(job);
        rb.rsp.add("status", "complete");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
```

3- The thread S started (job above) does two chunks of logic in serial (call these B and C):
i) B does some processing and sends client K a series of status updates, then
ii) C does some processing and in turn sends K a series of status updates, then one final complete message
iii) the transaction ends

I am using Solr 4.3.1. How can I support such a transaction in Solr? I've tried sharing S's ResponseBuilder with the thread, but presumably because S terminates in step 2, K never sees the response from B and C. In general, I would like to implement a mechanism that can share processing state with the client in the same HTTP session. Thank you for your help, Peyman
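The failure mode in step 2 can be reproduced with plain JDK concurrency, independently of Solr (a sketch; class and variable names are illustrative): once process() returns, the response is sealed and written, so anything the background job produces afterwards has no channel back to K. Here the "response" list is committed before B and C ever run:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FireAndForgetDemo {
    static final List<String> response = new ArrayList<String>();
    static final CountDownLatch responseSent = new CountDownLatch(1);
    static final CountDownLatch jobDone = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newCachedThreadPool();
        executor.submit(() -> {
            try {
                responseSent.await();   // B and C only run after the ack
                // B's and C's status updates have nowhere to go:
                // the "HTTP response" below has already been sent.
                jobDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        response.add("status=complete"); // the immediate ack K receives
        responseSent.countDown();        // process() returns here
        jobDone.await();
        executor.shutdown();
        System.out.println(response);    // only the ack; never B/C updates
    }
}
```

Supporting the stated goal would need a separate channel (e.g. K polling a status handler, or an external queue), since the per-request response object cannot be appended to after the component returns.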
Re: cores sharing an instance
I see. If I wanted to try the second option (finding a place inside Solr before the core is created), where would that place be in the flow of the app waking up? Currently, each core loads its app caches via a requestHandler (in solrconfig.xml) that initializes the Java class that does the loading. For instance:

```xml
<requestHandler name="/cachedResources" class="solr.SearchHandler" startup="lazy">
  <arr name="last-components">
    <str>AppCaches</str>
  </arr>
</requestHandler>

<searchComponent name="AppCaches" class="com.name.Project.AppCaches"/>
```

So each core has its own core-specific cachedResources handler. Where in Solr would I need to place the AppCaches code to make it visible to all the other cores? Thank you, Roman.

On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote: Cores can be reloaded; they live inside the SolrCore loader (I forget the exact name), and they will have different classloaders (that's a servlet thing). So if you want singletons you must load them outside of the core, using a parent classloader: in the case of Jetty, this means writing your own Jetty initialization or config to force shared classloaders, or finding a place inside Solr before the core is created. Google for montysolr to see an example of the first approach. But unless you really have no other choice, using singletons is IMHO a bad idea in this case. Roman

On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote: It's the singleton pattern, where in my case I want an object (which is RAM expensive) to be a centralized coordinator of application logic. Thank you.

On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is very little shared between multiple cores (instanceDir paths, logging config maybe?). Why are you trying to do this?

On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi, I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e.:

At run time, core 1 makes an instance of object O_i:

core 1 -- object O_i
core 2 ---
core n

Then can core K access O_i? I know they can share properties, but is it possible to share objects? Thank you

-- Regards, Shalin Shekhar Mangar.
Re: cores sharing an instance
That is what I had assumed, but it appears not to be the case. A class (and its properties) of one core is not visible to another class in another core, in the same JVM. Peyman

On Jun 29, 2013, at 1:23 PM, Erick Erickson erickerick...@gmail.com wrote: Well, the code is all in the same JVM, so there's no reason a singleton approach wouldn't work that I can think of. All the multithreaded caveats apply. Best, Erick

On Fri, Jun 28, 2013 at 3:44 PM, Peyman Faratin pey...@robustlinks.com wrote: Hi, I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e.:

At run time, core 1 makes an instance of object O_i:

core 1 -- object O_i
core 2 ---
core n

Then can core K access O_i? I know they can share properties, but is it possible to share objects? Thank you
Re: cores sharing an instance
It's the singleton pattern, where in my case I want an object (which is RAM expensive) to be a centralized coordinator of application logic. Thank you.

On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is very little shared between multiple cores (instanceDir paths, logging config maybe?). Why are you trying to do this?

On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi, I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e.:

At run time, core 1 makes an instance of object O_i:

core 1 -- object O_i
core 2 ---
core n

Then can core K access O_i? I know they can share properties, but is it possible to share objects? Thank you

-- Regards, Shalin Shekhar Mangar.
cores sharing an instance
Hi, I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e.:

At run time, core 1 makes an instance of object O_i:

core 1 -- object O_i
core 2 ---
core n

Then can core K access O_i? I know they can share properties, but is it possible to share objects? Thank you
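As the replies in this thread note, a static singleton only behaves as one instance if every core resolves the class through the same (parent) classloader, e.g. from the servlet container's shared lib directory rather than an individual core's lib directory. The holder idiom itself is plain Java (a generic sketch, not Solr code; the registry name is made up):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical shared registry. For all cores to see the SAME statics, this
// class must be loaded by a parent classloader shared across cores; if each
// core's classloader loads its own copy, each copy gets its own static map.
public final class SharedRegistry {
    private static final ConcurrentMap<String, Object> objects =
        new ConcurrentHashMap<String, Object>();

    private SharedRegistry() {}

    public static void put(String key, Object value) {
        objects.put(key, value);
    }

    @SuppressWarnings("unchecked")
    public static <T> T get(String key) {
        return (T) objects.get(key);
    }

    public static void main(String[] args) {
        SharedRegistry.put("O_i", "expensive-shared-object");
        String o = SharedRegistry.get("O_i");
        System.out.println(o); // both "cores" would see the same instance
    }
}
```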
Upgrading from 3.6.1 to 4.3.0 and Custom collector
Hi, I am migrating from Lucene 3.6.1 to 4.3.0. I am, however, not sure how to migrate my custom collector below. This page, http://lucene.apache.org/core/4_3_0/MIGRATE.html, gives some hints, but the instructions are incomplete, and looking at the source examples of custom collectors makes me want to go and eat cheesecake, every time!!! Any advice would be very much appreciated. Thank you.

```java
public class AllInLinks extends Collector {
    private Scorer scorer;
    private int docBase;
    private String[] store;
    private HashSet<String> outLinks = new HashSet<String>();

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
        store = FieldCache.DEFAULT.getStrings(reader, "title");
    }

    public void collect(int doc) throws IOException {
        String page = store[doc];
        outLinks.add(page);
    }

    public void reset() {
        outLinks.clear();
        store = null;
    }

    public int getOutLinks() {
        return outLinks.size();
    }
}
```
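For reference, a hedged sketch of what the 4.3 shape of this collector might look like: setNextReader now receives a per-segment AtomicReaderContext, and FieldCache.getStrings is replaced by getTerms, which returns BinaryDocValues. This is untested against 4.3.0, so verify the signatures against MIGRATE.html and the 4.3 javadoc before relying on it:

```java
public class AllInLinks extends Collector {
    private BinaryDocValues store;
    private final BytesRef scratch = new BytesRef();
    private final HashSet<String> outLinks = new HashSet<String>();

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    @Override
    public void setScorer(Scorer scorer) {
        // scores are not needed for link collection
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // called once per segment: pull this segment's "title" values
        store = FieldCache.DEFAULT.getTerms(context.reader(), "title");
    }

    @Override
    public void collect(int doc) throws IOException {
        store.get(doc, scratch);          // doc is segment-relative here
        outLinks.add(scratch.utf8ToString());
    }

    public int getOutLinks() {
        return outLinks.size();
    }
}
```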
setting bq in searchcomponent
Hi, If I run a main query "cheeze" jointly with a boost query bq=spell:cheeze (boosting results whose spell field matches cheeze), as

/select?fl=title&qf=main&bq=spell:cheeze&bq=trans:cheeze&q=cheeze

everything works fine (with defType=dismax). What I'd like to do is programmatically generate the bq query inside a custom SearchComponent's process() method and issue a query similar to the above. I can achieve my goal by explicitly constructing and running a query as follows:

```java
StringBuilder queryStr = new StringBuilder();
queryStr.append("echoParams=none");
queryStr.append("&debugQuery=off");
queryStr.append("&defType=dismax");
queryStr.append("&df=main");
queryStr.append("&q=" + token);
queryStr.append("&bq=spell:" + token);
queryStr.append("&bq=trans:" + token);
SolrParams query = SolrRequestParsers.parseQueryString(queryStr.toString());
rb.req.setParams(query);
Query q = QParser.getParser(token, defType, rb.req).parse();
DocList hits = searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE, offset, rows, fieldFlags);
```

But is there a way to directly set the request parameters inside:

```java
public void process(ResponseBuilder rb) throws IOException {
    ...
    String token = rb.req.getParams().get("token");
    String bqfield = rb.req.getParams().get(DisMaxParams.BQ);
    ...
    Query q = QParser.getParser(token, defType, rb.req).parse();
    DocList hits = searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE, offset, rows, fieldFlags);
}
```

without explicitly constructing the query string? What would be the best way to do this? Thank you, Peyman
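One possible way to set the params directly, sketched rather than confirmed: wrap the request's params in a ModifiableSolrParams and add the boost queries before they are parsed. This assumes Solr's ModifiableSolrParams and the DisMaxParams.BQ constant; note that the standard QueryComponent reads its params in prepare(), so a custom component doing this would need to run early enough in the chain:

```java
// Hedged sketch: mutate the request params so the dismax parser sees the
// generated boost queries, instead of hand-building a parameter string.
ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
params.set("defType", "dismax");
params.add(DisMaxParams.BQ, "spell:" + token);   // one add() call per bq value
params.add(DisMaxParams.BQ, "trans:" + token);
rb.req.setParams(params);
```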
faceting and clustering on MLT via stream.body
Hi, I would like to run an MLT search (in SolrJ) on a short piece of text delivered via stream.body. This part works. What I would like to be able to do is 2 things:

- facet on some number (not ALL) of the results
- cluster (using Carrot2) all of the results

Is this possible? I believe faceting occurs on all of the docs matched (numFound), not on the requested number of results (rows). Is this correct? Thank you for your help, Peyman
recommended SSD
Hi, Is there an SSD brand and spec that the community recommends for an index of size 56G with mostly reads? We are evaluating this one: http://www.newegg.com/Product/Product.aspx?Item=N82E16820227706 Thank you, Peyman
synonym file
Hi, I have a (23M) synonym file that takes a long time (3 or so minutes) to load, and once included it seems to adversely affect the QTime of the application by approximately 4 orders of magnitude. Any advice on how to load it faster and lower the QTime would be much appreciated. Best, Peyman
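Not an answer from the thread, but one common mitigation is shrinking and normalizing the file before Solr ever parses it: lowercase, trim, deduplicate terms and whole lines, and drop identity rows that map a term only to itself. A stdlib preprocessing sketch (class name illustrative; explicit "=>" mapping lines are passed through untouched, and the once-per-core-load parse cost of whatever remains is still paid):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class SynonymCompactor {

    // Normalizes one "a,b,c" synonym line: lowercase, trim, dedupe terms.
    // Returns null when nothing useful remains (identity or empty rows).
    static String normalizeLine(String line) {
        if (line.contains("=>")) return line.trim(); // leave explicit mappings alone
        Set<String> terms = new LinkedHashSet<String>();
        for (String t : line.toLowerCase().split(",")) {
            String trimmed = t.trim();
            if (!trimmed.isEmpty()) terms.add(trimmed);
        }
        return terms.size() > 1 ? String.join(",", terms) : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) return;                  // usage: in-file out-file
        Set<String> out = new LinkedHashSet<String>(); // also dedupes whole lines
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String n = normalizeLine(line);
            if (n != null) out.add(n);
        }
        Files.write(Paths.get(args[1]), out, StandardCharsets.UTF_8);
    }
}
```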
Re: index writer in searchComponent
Hi Dmitry, Which SolrJ API would I use to receive the user query? I was under the impression that the request handler mechanism was the (RESTful) interface between the user query and the index(es). Thank you, Peyman

On Jul 1, 2012, at 10:11 AM, Dmitry Kan wrote: Hi Peyman, Could you just use the SolrJ API for this purpose? That is, ask (1-2) via the SolrJ API and perform (3) if entity X (assuming you mean a document or some field value by X) didn't exist, i.e. add it to the index. // Dmitry

On Sun, Jul 1, 2012 at 6:03 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi Erick, The workflow I'd like to implement is:

1- search the index using the incoming query
2- the query is of the type "does entity X exist"
3- if X does not exist in the index then I'd like to add X to the index

Currently I am using a custom search component to achieve this by creating a SolrServer within the init() (or inform()) method of the search component and using that instance to update (and commit) the index. I am not sure this is the best approach either, and thought using the IndexReader of the search component itself might be better. Is there a better approach in your opinion? Thank you, Erick. Peyman

On Jun 30, 2012, at 8:13 PM, Erick Erickson wrote: Lots of the index modification (all of it?) has been removed in 4.0 from IndexReaders... It seems like you could always get the directory and open a SolrIndexWriter wherever you wanted, but I'm not sure it's a good idea; are there other processes that will be writing to the index at the same time? What's the purpose here, anyway? There might be a better approach. Best, Erick

On Thu, Jun 28, 2012 at 4:02 PM, Peyman Faratin pey...@robustlinks.com wrote: Hi, Is it possible to add a new document to the index in a custom SearchComponent (that also implements SolrCoreAware)? I can get a reference to the IndexReader via the ResponseBuilder parameter of the process() method using rb.req.getSearcher().getReader(). But is it possible to actually add a new document to the index _after_ searching the index, i.e. accessing the IndexWriter? Thank you, Peyman

-- Regards, Dmitry Kan
Re: index writer in searchComponent
Hi Erick, The workflow I'd like to implement is:

1- search the index using the incoming query
2- the query is of the type "does entity X exist"
3- if X does not exist in the index then I'd like to add X to the index

Currently I am using a custom search component to achieve this by creating a SolrServer within the init() (or inform()) method of the search component and using that instance to update (and commit) the index. I am not sure this is the best approach either, and thought using the IndexReader of the search component itself might be better. Is there a better approach in your opinion? Thank you, Erick. Peyman

On Jun 30, 2012, at 8:13 PM, Erick Erickson wrote: Lots of the index modification (all of it?) has been removed in 4.0 from IndexReaders... It seems like you could always get the directory and open a SolrIndexWriter wherever you wanted, but I'm not sure it's a good idea; are there other processes that will be writing to the index at the same time? What's the purpose here, anyway? There might be a better approach. Best, Erick

On Thu, Jun 28, 2012 at 4:02 PM, Peyman Faratin pey...@robustlinks.com wrote: Hi, Is it possible to add a new document to the index in a custom SearchComponent (that also implements SolrCoreAware)? I can get a reference to the IndexReader via the ResponseBuilder parameter of the process() method using rb.req.getSearcher().getReader(). But is it possible to actually add a new document to the index _after_ searching the index, i.e. accessing the IndexWriter? Thank you, Peyman
index writer in searchComponent
Hi, Is it possible to add a new document to the index in a custom SearchComponent (that also implements SolrCoreAware)? I can get a reference to the IndexReader via the ResponseBuilder parameter of the process() method using rb.req.getSearcher().getReader(). But is it possible to actually add a new document to the index _after_ searching the index, i.e. accessing the IndexWriter? Thank you, Peyman
KeywordTokenizerFactory with SynonymFilterFactory
Hi, I have the following 2 field types:

```xml
<fieldType name="tokenizer1" class="solr.TextField" sortMissingLast="true" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="tokenizer2" class="solr.TextField" sortMissingLast="true" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>
```

The problem I am seeing is that if I have an entry like this in the synonyms.txt file:

helping hand => assistance

then issuing the query "helping hand" (with dismax) against the field tokenized with tokenizer1 returns the correct query (assistance), whereas there is no synonym mapping for tokenizer2 (confirmed in the Solr admin panel). Am I doing something wrong? Thank you
Re: KeywordTokenizerFactory with SynonymFilterFactory
Thank you, Michael.

On Jun 16, 2012, at 6:40 PM, Michael Ryan wrote: Try changing the tokenizer2 SynonymFilterFactory filter to this:

```xml
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
```

By default, it seems that it uses WhitespaceTokenizer. -Michael
Kernel methods in SOLR
Hi, Has there been any work that tries to integrate kernel methods [1] with Solr? I am interested in using kernel methods to solve synonym, hyponym and polysemy (disambiguation) problems which Solr's vector space model (bag of words) does not capture. For example, imagine we have only 3 words in our corpus: puma, cougar and feline. The 3 words obviously have interdependencies (puma disambiguates to cougar; cougar and puma are instances of felines, i.e. hyponyms). Now imagine 2 docs, d1 and d2, that have the following TF-IDF vectors:

       puma  cougar  feline
d1 = [  2,     0,      0   ]
d2 = [  0,     1,      0   ]

i.e. d1 has no mention of the terms cougar or feline and, conversely, d2 has no mention of the terms puma or feline. Hence under the vector approach d1 and d2 are not related at all (and each interpretation of the terms has a unique vector), which is not what we want to conclude. What I need is to include a kernel matrix (as data), such as the following, that captures these relationships:

          puma  cougar  feline
puma   = [ 1,     1,     0.4 ]
cougar = [ 1,     1,     0.4 ]
feline = [ 0.4,  0.4,    1   ]

then recompute the TF-IDF vector as the product of (1) the original vector and (2) the kernel matrix, resulting in:

       puma  cougar  feline
d1 = [  2,     2,     0.8  ]
d2 = [  1,     1,     0.4  ]

(note the new vectors are much less sparse). I can solve this problem (inefficiently) at the application layer, but I was wondering if there have been any attempts within the community to solve similar problems efficiently, without paying a hefty response-time price? Thank you, Peyman

[1] http://en.wikipedia.org/wiki/Kernel_methods
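The recomputation described above is just a vector-matrix product, d' = dK. A stdlib sketch reproducing the example's numbers (class name illustrative; this is the application-layer version, not a Solr integration):

```java
import java.util.Arrays;

public class KernelDemo {

    // Returns d * K, i.e. the kernel-smoothed term vector.
    static double[] apply(double[] d, double[][] K) {
        double[] out = new double[K[0].length];
        for (int j = 0; j < out.length; j++) {
            for (int i = 0; i < d.length; i++) {
                out[j] += d[i] * K[i][j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // term order: puma, cougar, feline
        double[][] K = {
            {1.0, 1.0, 0.4},
            {1.0, 1.0, 0.4},
            {0.4, 0.4, 1.0}
        };
        System.out.println(Arrays.toString(apply(new double[]{2, 0, 0}, K))); // [2.0, 2.0, 0.8]
        System.out.println(Arrays.toString(apply(new double[]{0, 1, 0}, K))); // [1.0, 1.0, 0.4]
    }
}
```

Note the product makes the previously orthogonal d1 and d2 strongly correlated, which is exactly the effect the kernel matrix is meant to encode.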
custom field default qf of requestHandler
Hi, I have a problem in the following context. I have a field with a custom type, shingledcontent, defined as follows in schema.xml:

```xml
<field name="shingledContent" type="shingledcontent" compressed="true" omitNorms="false"
       termVectors="true" termOffsets="true" termPositions="true"
       indexed="true" stored="false" multiValued="false" required="true"/>
```

where:

```xml
<fieldType name="shingledcontent" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="2"/>
  </analyzer>
</fieldType>
```

I then define a request handler as follows:

```xml
<requestHandler name="/test" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="q.alt"> </str>
    <str name="fl">title,score</str>
    <int name="start">0</int>
    <int name="rows">2000</int>
    <str name="echoParams">all</str>
    <str name="qf">titleAnalyzed^2.0 shingledContent^1.0 content^1.0</str>
  </lst>
  <lst name="appends">
    <str name="fq">someTest:false</str>
    <str name="fq">anotherTest:false</str>
  </lst>
  <arr name="last-components">
    <str>Test</str>
  </arr>
</requestHandler>

<searchComponent name="Test" class="com.a.b.c"/>
```

The problem I am seeing is that the shingledContent field never shows up in the parsed query. What I see is:

Q: +(content:dog | titleAnalyzed:dog^2.0) ()

(content and titleAnalyzed are both of type text_general, as found in the default schema.xml). If I change the shingledContent field to be of type text_general, it is correctly included in the query fields. Is this correct behavior, or am I making an error somewhere? Thank you
query score across ALL docs
Hi

What is the best way to retrieve the score of a query across ALL documents in the index? i.e. given:

1) M docs, [A, B, C, D, E, ... M]
2) a query q

the searcher outputs (efficiently) the score of q across _all_ M documents, ordered by index number, i.e.

score(q) = [A=0.1, B=0.0, ..., M=0.76]

Currently the searcher outputs the top N matches, where (often) N << M in the case of large indices. My index is ~9MM docs. Using a custom collector will not work.

Any advice would be much appreciated

Peyman
QueryHandler
Hi

A noobie question. I am uncertain about the best way to design for my requirement, which is the following. I want to allow another client, in SolrJ, to query Solr with a query that is handled by a custom handler:

localhost:9090/solr/tokenSearch?tokens={!dismax qf=content}pear,apples,oyster,king kong&fl=score&rows=1000

i.e. a list of tokens (single words and phrases) is sent in one HTTP call. What I would like to do is search over each individual token and compose a single response back to the client. The current approach I have taken is to create a custom search handler as follows:

<requestHandler name="/tokenSearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
  </lst>
  <arr name="components">
    <str>myHandler</str>
  </arr>
</requestHandler>

<searchComponent name="myHandler" class="com.a.RequestHandlers.myHandler"/>

myHandler (which extends SearchComponent) overrides the prepare and process methods, extracting and iterating over each token in the input. The problem I am hitting in this design is that the prepare() method is passed a reference to the SolrIndexSearcher in the ResponseBuilder parameter (so for efficiency reasons I don't want to open up another server connection for the search). I can construct a Lucene query and search just fine, but what I would like to do instead is use the e/dismax query parsers (rather than construct my own - to reduce errors). The getDocList() method of SolrIndexSearcher, on the other hand, requires a Lucene query object.

Is this an appropriate design for my requirement? And if so, what is the best way to send a SolrQuery to the SolrIndexSearcher?

Thank you

Peyman
Re: Faster Solr Indexing
Hi Erick, Dimitry and Mikhail

thank you all for your time. I tried all of the suggestions below and am happy to report that indexing speeds have improved. There were several confounding problems, including:

- a bank of (~20) regexes that were poorly optimized and compiled at each indexing step
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging

However, the biggest bottleneck was 2 Lucene searches (across ~9MM docs) at the time of building the Solr document. Indexing sped up after precomputing these values offline.

Thank you all for your help.

best

Peyman

On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:

How have you determined that it's the Solr add? By timing the call on the SolrJ side or by looking at the machine where Solr is running? This is the very first thing you have to answer. You can get a rough idea with any simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows box). The point is just to see whether the indexer machine is being well utilized. I'd guess it's not, actually. One quick experiment would be to try using StreamingUpdateSolrServer (SUSS), which has the capability of having multiple threads fire at Solr at once. It is possible that your performance is spent waiting for I/O. Once you have that question answered, you can refine. But until you know which side of the wire the problem is on, you're flying blind.

Both Yandong and Peyman: These times are quite surprising. Running everything locally on my laptop, I'm indexing between 5-7K documents/second. The source is the Wikipedia dump. I'm particularly surprised by the difference Yandong is seeing based on the various analysis chains. The first thing I'd back off is the MaxPermSize. 512M is huge for this parameter. If you're getting that kind of time differential and your CPU isn't pegged, you're probably swapping, in which case you need to give the processes more memory. I'd just take the MaxPermSize out completely as a start.
Not sure if you've seen this page, something there might help: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed But throw a profiler at the indexer as a first step, just to see where the problem is, CPU or I/O.

Best
Erick

On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin pey...@robustlinks.com wrote:

Hi

I am trying to index 12MM docs faster than is currently happening in Solr (using SolrJ). We have identified Solr's add method as the bottleneck (and not commit - which is tuned OK through mergeFactor, maxRamBufferSize and JVM RAM). Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming removing HTTP overhead would speed things up with embedding), but the difference is marginal.

The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice of the Solr Enterprise Search Server book - p. 305, section The Solution: Shingling). Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields, resulting in halving the indexing time). We've tried compressed=true also, but that is not working.

Any guidance on how to support our application logic (without having to change the schema too much) and speed up indexing (from the current 212 days for 12MM docs) would be much appreciated.

thank you

Peyman
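One of the fixes reported in this thread - the bank of regexes that was recompiled at each indexing step - is worth illustrating, since it is a common Java indexing bottleneck. A minimal sketch (the pattern and input are made-up examples, not from the original code), showing a Pattern compiled once and reused versus one recompiled per document:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledRegex {
    // Compile once, at class-load time, not once per document.
    private static final Pattern TOKEN = Pattern.compile("\\b[a-z]{2,}\\b");

    // Slow variant: recompiles the pattern on every call
    // (String.matches() has the same hidden cost).
    static int countSlow(String text) {
        Matcher m = Pattern.compile("\\b[a-z]{2,}\\b").matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    // Fast variant: reuses the precompiled pattern.
    static int countFast(String text) {
        Matcher m = TOKEN.matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";
        // Both produce the same result; only the compilation cost differs.
        System.out.println(countSlow(doc)); // 4
        System.out.println(countFast(doc)); // 4
    }
}
```

With ~20 regexes applied to every one of 12MM documents, hoisting the Pattern.compile calls out of the per-document path removes millions of redundant compilations.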
Faster Solr Indexing
Hi

I am trying to index 12MM docs faster than is currently happening in Solr (using SolrJ). We have identified Solr's add method as the bottleneck (and not commit - which is tuned OK through mergeFactor, maxRamBufferSize and JVM RAM). Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming removing HTTP overhead would speed things up with embedding), but the difference is marginal.

The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice of the Solr Enterprise Search Server book - p. 305, section The Solution: Shingling). Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields, resulting in halving the indexing time). We've tried compressed=true also, but that is not working.

Any guidance on how to support our application logic (without having to change the schema too much) and speed up indexing (from the current 212 days for 12MM docs) would be much appreciated.

thank you

Peyman
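The batching the post describes (adding and committing in batches of ~1000 rather than one document at a time) reduces per-request overhead. A minimal, Solr-independent sketch of the partitioning step; the batch size of 1000 matches the post, and the SolrJ add(Collection) call that would consume each batch is omitted:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split a list of documents into batches of at most batchSize, so
    // each batch can be sent with a single add() call followed by (at
    // most) one commit, instead of one round trip per document.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(i);
        List<List<Integer>> batches = partition(ids, 1000);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 500
    }
}
```

Each batch would then be handed to the indexing client in one call; the thread that later resolved this issue confirms that, beyond batching, the dominant costs here were per-document regex compilation and per-document Lucene lookups.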