Re: Multiple servers support
Erick, many thanks for your suggestions and pointers. I am proceeding with my study and looking forward to doing a POC with Solr. Thanks again.

On Sun, Sep 25, 2011 at 7:40 PM, Erick Erickson erickerick...@gmail.com wrote:

Well, this is not a neutral forum <G>... A common use-case for Solr is exactly to replace database searches because, as you say, search performance in a database is often slow and limited. RDBMSs do very complex stuff very well, but they are not designed for text searching.

Scaling is accomplished by either replication or sharding. Replication is used when the entire index fits on a single machine and you can get reasonable responses; I've seen 40-50M docs fit quite comfortably on one machine. But 150 TB *probably* indicates that this isn't reasonable in your case. If you can't fit the entire index on one machine, then you shard, which splits the single logical index into multiple slices, and Solr will automatically query all the shards and assemble the parts into a single response.

But you absolutely cannot guess the hardware requirements ahead of time. It's like answering "How big is a Java program?" There are too many variables. But Solr is free, right? So you absolutely have to get a copy, put your 2.5M docs on it, and test (SolrMeter or jMeter are good options). If you get adequate throughput, add another 1M docs to the machine. Keep on until your QPS rate drops and you'll have a good idea how many documents you can put on a single machine. There's really no other way to answer that question.

Best
Erick

On Sun, Sep 25, 2011 at 5:55 AM, Raja Ghulam Rasool the.r...@gmail.com wrote:

Hi, I am new to Solr and am currently studying it. We are planning to implement Solr in our production setup. We have 15 servers where we are getting the data. The data is huge: we are supposed to keep 150 TB of data across all servers combined (in terms of documents, around 2,592,000 documents per server). We have the necessary storage capacity.

Can anyone let me know whether Solr will be a good solution for our text search needs? We are required to provide text searches on a certain limited number of fields.

1. Does Solr support such an architecture, i.e. multiple servers? What specific areas of Solr do I need to explore (shards, cores, etc.)?
2. Any idea whether we will really benefit from a Solr implementation for text searches vs., let us say, Oracle Text Search? Currently our Oracle Text search is giving very bad performance and we are looking to somehow improve our text search performance.

Any high-level pointers or help will be greatly appreciated. Thanks in advance, guys.

--
Regards,
Raja

--
Regards,
Ghulam Rasool.
Blog: http://ghulamrasool.blogspot.com
Mobile: +971506141872
Unique Key error on trunk
Hello,

We use solr.UUIDField to generate unique ids. Using the latest trunk (change list 1163767) seems to throw the error "Document is missing mandatory uniqueKey field: id". The schema is set up to generate an id field on updates:

  <field name="id" type="uuid" indexed="true" stored="true" default="NEW" />

Thanks
Viswa

SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
        at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:80)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:145)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1406)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
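For reference, the moving parts of the setup described above look roughly like this in schema.xml (a sketch reconstructed from the report, not a verified-working config; whether default="NEW" still populates the uniqueKey on this trunk build is exactly what is in question here):

```xml
<!-- UUIDField type plus an auto-generated "id" used as the uniqueKey -->
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>

<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

<uniqueKey>id</uniqueKey>
```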
error while replication
Hi,

I am replicating Solr and getting this error. I am unable to make out the cause, so please kindly help:

26 Sep, 2011 8:00:14 AM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey@150f0455:java.lang.NullPointerException
        at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
        at org.apache.lucene.index.Term.<init>(Term.java:38)
        at org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.next(NumericRangeQuery.java:530)
        at org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.<init>(NumericRangeQuery.java:476)
        at org.apache.lucene.search.NumericRangeQuery.getEnum(NumericRangeQuery.java:307)
        at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:160)
        at org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery.java:116)
        at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:81)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
        at org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:268)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:258)
        at org.apache.lucene.search.Searcher.search(Searcher.java:171)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
        at org.apache.solr.search.SolrIndexSearcher.access$000(SolrIndexSearcher.java:51)
        at org.apache.solr.search.SolrIndexSearcher$3.regenerateItem(SolrIndexSearcher.java:332)
        at org.apache.solr.search.LRUCache.warm(LRUCache.java:194)
        at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
        at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1130)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)

regards
rajat rastogi

--
View this message in context: http://lucene.472066.n3.nabble.com/error-while-replication-tp3368783p3368783.html
Sent from the Solr - User mailing list archive at Nabble.com.
multiple dateranges/timeslots per doc: modeling openinghours.
Sorry for the somewhat lengthy post. I would like to make clear that I covered my bases here, and I'm looking for an alternative solution, because the more trivial solutions don't seem to work for my use-case.

Consider bars, musea, etc. These places have multiple openinghours that can depend on:

REQ 1. day of week
REQ 2. special days on which they are closed, or in another way have different openinghours than their related 'day of week'

Now, I want to model these 'places' in a way so I'm able to do temporal queries like:
- which bars are open NOW (and stay open for at least another 3 hours)
- which musea are (already) open at 25-12-2011 10AM, and stay open until (at least) 3PM

I believe having opening/closing hours available for each day at least gives me the data needed to query the above. (Note that having dayOfWeek*openinghours is not enough, because of the special cases in REQ 2.) Okay, knowing I need openinghours*dates for each place, how would I format this in documents?

OPTION A)
---------
Considering granularity: I want documents to represent places and not places*dates. Although the latter would trivially allow me to do the querying mentioned above, it has these disadvantages:
- The same place is returned multiple times (each with a different date) when queries are not constrained to date.
- Lots of data needs to be duplicated, all for the conceptually 'simple' functionality of needing multiple date-ranges. It feels bad and a simpler solution should exist?
- Exploding the resultset (documents = say, 100 dates * 1,000,000 = 100,000,000) suddenly moves the size of the resultset from 'easily doable' to 'hmmm, I have to think about this'. Given that places also have some other fields to sort on, Lucene fieldcache mem-usage would explode by a factor of 100.

OPTION B)
---------
Another, faulty, option would be to model opening/closing hours in 2 multivalued date-fields, i.e. 'open' and 'close', and insert open/close for each day, e.g.:

open: 2011-11-08:1800 - close: 2011-11-09:0300
open: 2011-11-09:1700 - close: 2011-11-10:0500
open: 2011-11-10:1700 - close: 2011-11-11:0300

Queries would be of the form: 'open < now AND close > now+3h'. But since there is no way to indicate that 'open' and 'close' are pairwise related, I will get a lot of false positives. E.g. the above document would be returned for:

open < 2011-11-09:0100 AND close > 2011-11-09:0600

because SOME open date is before 2011-11-09:0100 (i.e. 2011-11-08:1800) and SOME close date is after 2011-11-09:0600 (for example 2011-11-11:0300), but these open and close dates are not pairwise related.

OPTION C) The best of what I have now:
---------
I have been thinking about a totally different approach using Solr dynamic fields, in which each and every opening and closing date gets its own dynamic field, e.g.:

_date_2011-11-09_open: 1800
_date_2011-11-09_close: 0300
_date_2011-11-09_open: 1700
_date_2011-11-10_close: 0500
_date_2011-11-10_open: 1700
_date_2011-11-11_close: 0300

Then the client should know the date to query, and thus the correct fields to query. This would solve the problem, since startdate/enddate are now pairwise related, but I fear this can be a big issue from a performance standpoint (especially memory consumption of the Lucene fieldcache).

IDEAL OPTION D)
---------
I'm pretty sure this does not exist out-of-the-box, but might be extended. Okay, Solr has a fieldtype 'date', but what if it also had a fieldtype 'Daterange'? A Daterange would be modeled as <DateTimeA,DateTimeB> or <DateTimeA,Delta DateTimeA>. Then this problem would be really easily modelled as a multivalued field 'openinghours' of type 'Daterange'. However, I have the feeling that the standard range-query implementation can't be used on this fieldtype, or perhaps should be run for each of the N daterange values in 'openinghours'.

To make matters worse (I didn't want to introduce this above):

REQ 3: It may be possible that certain places have multiple opening-hours / timeslots each day. Consider a museum in Spain which gets closed around noon because of siesta-time.

OPTION D) would be able to handle this natively; all other options can't.

I would very much appreciate any pointers to:
- how to start with option D, and whether this approach is at all feasible
- whether option C would suffice (excluding REQ 3), and whether I'm likely to run into performance / memory troubles
- any other possible solutions I haven't thought of to tackle this

Thanks a lot.

Cheers,
Geert-Jan

--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-dateranges-timeslots-per-doc-modeling-openinghours-tp3368790p3368790.html
Sent from the Solr - User mailing list archive at Nabble.com.
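The false positive in option B comes down to a pairwise constraint that two independent multivalued fields cannot express: both bounds must be satisfied by the *same* open/close pair. As a plain-Java illustration of the required semantics (hypothetical names, not Solr code), using the example data from option B:

```java
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical sketch: a place matches only if a SINGLE open/close interval
// covers the whole requested window [from, until]. This is exactly the
// pairwise relation that two independent multivalued fields lose.
record Interval(LocalDateTime open, LocalDateTime close) {}

class OpeningHours {
    static boolean coversWindow(List<Interval> hours,
                                LocalDateTime from, LocalDateTime until) {
        // Both bounds are checked against the SAME interval.
        return hours.stream().anyMatch(iv ->
                !iv.open().isAfter(from) && !iv.close().isBefore(until));
    }
}
```

Option D would essentially push this "anyMatch over intervals" down into the index; if memory serves, later Solr versions grew a DateRangeField along these lines, but nothing like it exists on the (2011) versions discussed here.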
Solr stopword problem in Query
Hi all,

I have a text field named *textForQuery*. The following content has been indexed into Solr in the field textForQuery: *Coke Studio at MTV*.

When I fired the query *textForQuery:(coke studio at mtv)* the results showed 0 documents. After running the same query in debug mode I got the following results:

<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
  <str name="querystring">textForQuery:(coke studio at mtv)</str>
  <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
  <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
</lst>

Why did the query not match any document, even when there is a document with the textForQuery value *Coke Studio at MTV*? Is this because of the stopword *at* present in the stopword list?

--
Thanks & Regards,
Isan Fulia.
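If the stop filter is the culprit, the usual suspects are asymmetric index/query analyzer chains or differing enablePositionIncrements settings (the "?" in the parsed PhraseQuery is the position hole the stop filter leaves behind). A sketch of a symmetric 3.x-style fieldType, as an assumption about what the schema should look like rather than Isan's actual config:

```xml
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100">
  <!-- Identical chains at index and query time, so stopwords and the
       position holes they leave behave the same on both sides. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis page in the Solr admin UI shows exactly what each side produces for "Coke Studio at MTV", which is the quickest way to confirm the mismatch.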
AW: How to map database table for faceted search?
Thx for your response, we will try dynamic fields for this.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, September 24, 2011 21:33
To: solr-user@lucene.apache.org
Subject: Re: How to map database table for faceted search?

In general, you flatten the data when you put things into Solr. I know that's anathema to DB training, but this is searching <G>...

If you have a reasonable number of distinct column names, you could just define your schema to have an entry for each and index the associated values that way. Then your facets become easy: you're just faceting on the facet_hobby field in your example.

If that's impractical (say you can add arbitrary columns), you can do something very similar with dynamic fields.

You could also create a field with the column/name pairs (watch your tokenizer!) in a single field and facet by prefix, where the prefix is the column name (e.g. index tokens like hobby_sailing hobby_camping interest_reading, then facet with facet.prefix=hobby_).

There are tradeoffs for each that you'll have to experiment with. Note that there is no penalty in Solr for defining fields in your schema but not using them.

Best
Erick

On Fri, Sep 23, 2011 at 12:06 AM, Chorherr Nikolaus nikolaus.chorh...@umweltbundesamt.at wrote:

Hi All!

We are working with Solr for the first time and have a simple data model: entity Person (column surname) has 1:n Attribute (column name) has 1:n Value (column text). We need faceted search on the content of Attribute.name, not on Attribute.name itself. E.g. if an Attribute of a person has name=hobby, we would like to have something like ...&facet=true&facet.name=hobby and get back all related Values with counts. (We do not need a facet.name=name that gets back all distinct values of the name column of Attribute.)

How do we have to map our database, define our document and/or define our schema? Any help is highly appreciated. Thx in advance

Niki
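To make the dynamic-field route concrete, a sketch (the field naming is an assumption, not from a tested schema): each Attribute name becomes its own facet field at index time.

```xml
<!-- schema.xml: catch-all for per-attribute facet fields.
     A Person with Attribute name=hobby and Value=sailing is indexed
     with facet_hobby=sailing; arbitrary new columns need no schema change. -->
<dynamicField name="facet_*" type="string"
              indexed="true" stored="false" multiValued="true"/>
```

The query side then looks like q=*:*&facet=true&facet.field=facet_hobby, which returns each distinct Value of the hobby attribute with its count.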
Re: Seek your wisdom for implementing 12 million docs..
On Sun, 2011-09-25 at 22:00 +0200, Ikhsvaku S wrote:

> Documents: We have close to ~12 million XML docs of varying sizes, average size 20 KB. These documents have 150 fields, which should be searchable & indexed. [...] Approximately ~6000 such documents are updated & 400-800 new ones are added each day.
> Queries: [...] Also each one would want to grab as many result rows as possible (we are limiting this to 2000). The output shall contain only 1-5 fields.

Except for the result rows (which I guess is equal to returned documents in Solr-world), nothing you say raises any alarms. It actually sounds very much like our local index (~10M documents, ~100 fields, 10,000+ updates/day) at the State and University Library, Denmark.

> Available hardware: Some of the existing hardware we could find consists of a ~300GB SAN each on 4 boxes with ~96 gig each. We do have a couple of older HP DL380s (mainly want to use for offline indexing). All of this is on 10G Ethernet.

Yikes! We only use two mirrored machines for fallback, not performance. They have 16GB each and handle index updates as well as searches. The indexes (~60GB) reside on local SSDs.

> Questions: Our priority is to provide results fast, [...]

What is fast in milliseconds, and how many queries/second do you anticipate? From what you're telling, your hardware looks like overkill. However, as Erick says, your mileage may vary: try stuffing all your data into your mock-up and see what happens. It shouldn't take long, and you might discover that your test machine is perfectly capable of handling it all alone.
SOLR Index Speed
Hi,

We have 500K web documents and are using Solr (trunk) to index them. We have a special analyzer which is a little heavy on CPU. Our machine config:

32 x CPU
32 gig RAM
SAS HD

We are sending documents with 16 reduce clients (from Hadoop) to the standalone Solr server. The problem is we couldn't get faster than 500 docs/sec; the 500K documents took 7-8 hours to index :( While indexing, the Solr server CPU load is around 5-6 (32 max), which means about 20% of the total CPU power. We have plenty of RAM... I turned off auto commit and gave an 8198 RAM buffer. There is no IO wait.

How can I make it faster?

PS: Solr streaming indexing is not an option because we need to submit javabin.

thanks..
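For reference, the two knobs mentioned above live in solrconfig.xml. A sketch with illustrative values (assumptions, not tuned recommendations; with CPU at ~20% and no IO wait, the bottleneck is more likely too little concurrency on the update path than these settings):

```xml
<!-- solrconfig.xml (3.x/trunk era) -->
<indexDefaults>
  <!-- The "8198 rambuffer" above would go here. Note that Lucene only
       makes effective use of roughly up to ~2048 MB, so anything larger
       is wasted. -->
  <ramBufferSizeMB>2048</ramBufferSizeMB>
</indexDefaults>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- autoCommit left commented out = disabled; commit once at the end.
  <autoCommit>
    <maxDocs>10000</maxDocs>
  </autoCommit>
  -->
</updateHandler>
```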
Re: NRT and commit behavior
Tirthankar,

are you indexing 1. smaller docs or 2. books?

If 1., your caches are too big for your memory, as Erick already said. Try allocating 10GB to the JVM, leave 14GB for your HDD cache, and make your Solr caches smaller.

If 2., read the blog posts on HathiTrust: http://www.hathitrust.org/blogs/large-scale-search

Regards
Vadim

2011/9/24 Erick Erickson erickerick...@gmail.com

No <G>. The problem is that number of documents isn't a reliable indicator of resource consumption. Consider the difference between indexing a twitter message and a book: I can put a LOT more docs of 140 chars on a single machine of size X than I can books.

Unfortunately, the only way I know of is to test. Use something like jMeter or SolrMeter to fire enough queries at your machine to determine when you're over-straining resources, and shard at that point (or get a bigger machine <G>)..

Best
Erick

On Wed, Sep 21, 2011 at 8:24 PM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Okay, but is there any number such that if we reach it on the index size, total docs in the index, or the size of physical memory, sharding should be considered? I am trying to find the winning combination.

Tirthankar

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, September 16, 2011 7:46 AM
To: solr-user@lucene.apache.org
Subject: Re: NRT and commit behavior

Uhm, you're putting a lot of index into not very much memory. I really think you're going to have to shard your index across several machines to get past this problem. Simply increasing the size of your caches is still limited by the physical memory you're working with. You really have to put a profiler on the system to see what's going on.
At that size there are too many things that it *could* be to definitively answer it with e-mails.

Best
Erick

On Wed, Sep 14, 2011 at 7:35 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Erick,

Also, in our solrconfig we have tried increasing the caches. Setting the autowarmCount values below to 0 helps the commit call return within a second, but that will slow us down on searches:

<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>

<!-- Cache used to hold field values that are quickly accessible by document id. The fieldValueCache is created by default even if not configured here.
<fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="128" showItems="32"/>
-->

<!-- queryResultCache caches results of searches - ordered lists of document ids (DocList) based on a query, a sort, and the range of documents requested. -->
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>

<!-- documentCache caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. -->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="512"/>

-----Original Message-----
From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com]
Sent: Wednesday, September 14, 2011 7:31 AM
To: solr-user@lucene.apache.org
Subject: RE: NRT and commit behavior

Erick,

Here are the answers to your questions:

Our index is 267 GB.
We are not optimizing...
No, we have not profiled yet to check the bottleneck, but logs indicate opening the searchers is taking time...
Nothing except SOLR runs on the machine.
Total memory is 16GB; Tomcat has 8GB allocated.
Everything is 64-bit: OS, JVM and Tomcat.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, September 11, 2011 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: NRT and commit behavior

Hmm, OK. You might want to look at the non-cached filter query stuff, it's quite recent. The point here is that it is a filter that is applied only after all of the less expensive filter queries are run. One of its uses is exactly ACL calculations: rather than calculate the ACL for the entire doc set, it only calculates access for docs that have made it past all the other elements of the query. See SOLR-2429 and note that it is 3.4 (currently being released) only.

As to why your commits are taking so long, I have no idea, given that you really haven't given us much to work with. How big is your index? Are you optimizing? Have you profiled the application to see what the bottleneck is (I/O, CPU, etc.)? What else is running on your machine? It's quite surprising that it takes that long. How much memory are you giving the JVM? etc...

You might want to review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Fri, Sep 9, 2011 at 9:41 AM,
RE: Best Solr escaping?
I won't guarantee this is the 'best algorithm', but here's what we use. (This is in a final class with only static helper methods):

// Set of characters / Strings SOLR treats as having special meaning in a query, and the corresponding escaped versions.
// Note that the actual operators '&&' and '||' don't show up here - we'll just escape the characters '&' and '|' wherever they occur.
private static final String[] SOLR_SPECIAL_CHARACTERS =
    new String[] {"+", "-", "&", "|", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
private static final String[] SOLR_REPLACEMENT_CHARACTERS =
    new String[] {"\\+", "\\-", "\\&", "\\|", "\\!", "\\(", "\\)", "\\{", "\\}", "\\[", "\\]", "\\^", "\\\"", "\\~", "\\*", "\\?", "\\:", ""};

/**
 * Escapes all special characters from the search terms, so they don't get confused with
 * the Solr query language special characters.
 * @param value - search term to escape
 * @return - escaped search value, suitable for a Solr q parameter
 */
public static String escapeSolrCharacters(String value) {
    return StringUtils.replaceEach(value, SOLR_SPECIAL_CHARACTERS, SOLR_REPLACEMENT_CHARACTERS);
}

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

-----Original Message-----
From: Bill Bell [mailto:billnb...@gmail.com]
Sent: Sunday, September 25, 2011 12:22 AM
To: solr-user@lucene.apache.org
Subject: Best Solr escaping?

What is the best algorithm for escaping strings before sending to Solr? Does someone have some code? A few things I have witnessed in q using the DIH handler:

* Double quotes that are not balanced can cause several issues, from an error (strip the double quote?) to no results.
* Should we use + or %20, and what cases make sense? E.g. "Dr. Phil Smith" or Dr.+Phil+Smith or Dr.%20Phil%20Smith - also, what is the impact of double quotes?
* Unmatched parentheses, i.e. opening ( and not closing:
  * (Dr. Holstein
  * Cardiologist+(Dr. Holstein

Regular encoding of strings does not always work for the whole string due to several issues like white space: white space works better when we backslash-escape it (Bill\ Bell), especially when using facets.

Thoughts? Code? Ideas? Better wikis?
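For comparison, a dependency-free sketch of the same backslash-escaping idea (SolrJ users can reach for org.apache.solr.client.solrj.util.ClientUtils.escapeQueryChars instead; the class below is a hypothetical illustration, not that implementation):

```java
// Minimal sketch: backslash-escape every character the Lucene query parser
// treats as special. Escaping the single characters '&' and '|' also covers
// the '&&' and '||' operators.
class SolrEscape {
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Note it deliberately does not touch whitespace; Bill's facet case (Bill\ Bell) would need a Character.isWhitespace check added, which I believe newer ClientUtils versions do.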
Re: email - DIH
Hi Alonso, Gora,

I ran into the same problem with the MailEntityProcessor. I have an email folder called Test; inside there are only two messages. When I run the DIH everything looks fine, except that the two emails don't get indexed. Is there any additional information on this problem? I'm using Solr 3.4.0 (earlier versions show the same problem).

Here is my config:

<dataConfig>
  <document>
    <entity name="email"
            transformer="TemplateTransformer"
            processor="MailEntityProcessor"
            user="s...@zahn-gmbh.de"
            password="SHI-Test"
            host="mail.zahn-gmbh.de"
            protocol="imap"
            folders="*"
            fetchMailsSince="2000-01-01 00:00:00"
            deltaFetch="false"
            processAttachement="false"
            batchSize="100"
            fetchSize="1024"
            recurse="true">
      <field column="id" template="email-${email.messageId}"/>
      <field column="quelle" template="Email"/>
      <field column="title" template="${email.subject}"/>
      <field column="author" template="${email.from}"/>
      <field column="last_modified" template="${email.sentDate}" dateTimeFormat="yyyy-MM-dd hh:mm:ss"/>
      <field column="text" template="${email.content}"/>
      <field column="content_type" template="Email"/>
      <field column="quelle" template="Comunigate"/>
      <field column="doctype" template="Email"/>
    </entity>
  </document>
</dataConfig>

And here is my response (using the command http://localhost:8080/apache-solr-3.4.0/dataimport-mail?command=full-import&commit=true):

26.09.2011 15:52:53 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-3.4.0 path=/dataimport-mail params={commit=true&command=full-import} status=0 QTime=16
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport-mail.properties
26.09.2011 15:52:53 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
26.09.2011 15:52:53 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_4,version=1317035795833,generation=4,filenames=[segments_4]
26.09.2011 15:52:53 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1317035795833
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.MailEntityProcessor logConfig
INFO: user : s...@zahn-gmbh.de
pwd : SHI-Test
protocol : imap
host : mail.zahn-gmbh.de
folders : Test
recurse : true
exclude : []
include : []
batchSize : 20
fetchSize : 1024
read timeout : 6
conection timeout : 3
custom filter :
fetch mail since : Sat Jan 01 00:00:00 CET 2000
26.09.2011 15:52:54 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
26.09.2011 15:52:54 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_4,version=1317035795833,generation=4,filenames=[segments_4]
commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_5,version=1317035795834,generation=5,filenames=[segments_5]
26.09.2011 15:52:54 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1317035795834
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@17af46e main
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@17af46e main from Searcher@5e8d7d main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=1,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0,item_doctype={field=doctype,memSize=4224,tindexSize=32,time=0,phase1=0,nTerms=0,bigTerms=0,termInstances=0,uses=2}}
26.09.2011 15:52:54 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@17af46e main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@17af46e main from Searcher@5e8d7d main
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=2,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@17af46e main
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming
Re: mlt content stream help
On 9/24/11 12:17 PM, Erick Erickson wrote:

> What version of Solr?

I am using Solr 3.2.

> When you copied the default, did you set up default values for MLT?

This is what I need help with. How should the request handler / solrconfig be set up?

> Showing us the request you used

The request is exactly the same as the URL in the wiki, using the example solr / exampledocs.

> and the relevant portions of your solrconfig file would help a lot, you might want to review: http://wiki.apache.org/solr/UsingMailingLists
>
> Best
> Erick
>
> On Thu, Sep 22, 2011 at 9:08 AM, dan whelan d...@adicio.com wrote:
>
> I would like to use MLT and the content stream feature in solr like on this page: http://wiki.apache.org/solr/MoreLikeThisHandler
>
> How should the request handler / solrconfig be set up? I enabled streaming and I set a requestHandler up by copying the default request handler and changing the name to name="/mlt", but when accessing the URL like the example on the wiki I get an NPE because q is not supplied. I'm sure I am just doing it wrong, just not sure what.
>
> Thanks,
> dan
Re: Update ingest rate drops suddenly
Just to bring closure on this one, we were slurping data from the wrong DB (hardly desktop class machine)... Solr did not cough on 41Mio records @34k updates / sec., single threaded. Great! On Sat, Sep 24, 2011 at 9:18 PM, eks dev eks...@yahoo.co.uk wrote: just looking for hints where to look for... We were testing single threaded ingest rate on solr, trunk version on atypical collection (a lot of small documents), and we noticed something we are not able to explain. Setup: We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD, machine with enough memory and 8 cores. Schema has 5 stored fields, 4 of them indexed no positions no norms. Average net document size (optimized index size / number of documents) is around 100 bytes. On a test with 40 Mio document: - we had update ingest rate on first 4,4Mio documents @ incredible 34k records / second... - then it dropped, suddenly to 20k records per second and this rate remained stable (variance 1k) until... - we hit 13Mio, where ingest rate dropped again really hard, from one instant in time to another to 10k records per second. it stayed there until we reached the end @40Mio (slightly reducing, to ca 9k, but this is not long enough to see trend). Nothing unusual happening with jvm memory ( tooth-saw 200- 450M fully regular). CPU in turn was following the ingest rate trend, inicating that we were waiting on something. No searches , no commits, nothing. autoCommit was turned off. Updates were streaming directly from the database. - I did not expect something like this, knowing lucene merges in background. Also, having such sudden drops in ingest rate is indicative that we are not leaking something. (drop would have been much more gradual). It is some caches, but why two really significant drops? 33k/sec to 20k and than to 10k... We would love to keep it @34 k/second :) I am not really acquainted with the new MergePolicy and flushing settings, but I suspect this is something there we could tweak. 
Could it be windows is somehow, hmm, quirky with the solr default directory on win64/jvm (I think it is MMap by default)... We did not saturate IO with such small documents, I guess; it is just a couple of gigs over 1-2 hours. All in all, it works well, but are such hard update ingest rate drops normal? Thanks, eks.
Re: Solr stopword problem in Query
This is a pretty serious issue. Bill Bell Sent from mobile On Sep 26, 2011, at 4:09 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have a text field named *textForQuery*. The following content has been indexed into solr in the field textForQuery: *Coke Studio at MTV*. When I fired the query *textForQuery:(coke studio at mtv)*, the results showed 0 documents. After running the same query in debug mode I got the following:

<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
  <str name="querystring">textForQuery:(coke studio at mtv)</str>
  <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
  <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
</lst>

Why did the query not match any document even though there is a document with textForQuery value *Coke Studio at MTV*? Is this because of the stopword *at* present in the stopword list? -- Thanks & Regards, Isan Fulia.
Re: mlt content stream help
Please don't say it's just like the example. If it were, then it would most likely be working. If you don't take the time to show us what you've tried and the results you get back, then there's not much we can do to help. Best Erick On Mon, Sep 26, 2011 at 7:18 AM, dan whelan d...@adicio.com wrote: On 9/24/11 12:17 PM, Erick Erickson wrote: What version of Solr? I am using solr 3.2 When you copied the default, did you set up default values for MLT? This is what I need help with. How should the request handler / solrconfig be setup? Showing us the request you used The request is exactly the same as the url in the wiki using the example solr / exampledocs and the relevant portions of your solrconfig file would help a lot, you might want to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Thu, Sep 22, 2011 at 9:08 AM, dan wheland...@adicio.com wrote: I would like to use MLT and the content stream feature in solr like on this page: http://wiki.apache.org/solr/MoreLikeThisHandler How should the request handler / solrconfig be setup? I enabled streaming and I set a requestHandler up by copying the default request handler and I changed the name to: name=/mlt but when accessing the url like the example on the wiki I get an NPE because q is not supplied. I'm sure I am just doing it wrong, just not sure what. Thanks, dan
drastic performance decrease with 20 cores
Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has 13GB allocated on a 16GB RAM server. If I run several sets of queries sequentially against each of the cores, the I/O access goes very high, and so does the system load, while the CPU percentage remains low. It takes almost 1 hour to complete the set of queries. If I stop solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me it is MUCH more performant to have 2 solr instances running with half the data and half the memory than a single instance with all the data and memory. It would even be way faster to have 1 instance with half the cores/memory, run the queries, shut it down, start a new instance and repeat the process than to have one big instance running everything. Furthermore, if I take the 20-core/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior remains slow, taking about 30 minutes. Am I missing something here? Does solr change its caching policy depending on the number of cores at startup or something similar? Any hints will be much appreciated. Thanks
Solr Cloud Number of Shard Limitation?
Is there any limitation, be it technical or for sanity reasons, on the number of shards that can be part of a solr cloud implementation?
Re: mlt content stream help
OK. This is exactly what I did. With a fresh download of solr 3.2: unpack and go to the example directory, start solr with java -jar start.jar, then go to exampledocs and run: ./post.sh *.xml Then go here: http://localhost:8983/solr/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0 Problem accessing /solr/mlt. Reason: NOT_FOUND The page gives no instructions on setting up mlt, or the url is incorrect. On 9/26/11 8:25 AM, Erick Erickson wrote: Please don't say it's just like the example. If it was, then it would most likely be working. If you don't take the time to show us what you've tried, and the results you get back, then there's not much we can do to help. Best Erick On Mon, Sep 26, 2011 at 7:18 AM, dan wheland...@adicio.com wrote: On 9/24/11 12:17 PM, Erick Erickson wrote: What version of Solr? I am using solr 3.2 When you copied the default, did you set up default values for MLT? This is what I need help with. How should the request handler / solrconfig be setup? Showing us the request you used The request is exactly the same as the url in the wiki using the example solr / exampledocs and the relevant portions of your solrconfig file would help a lot, you might want to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Thu, Sep 22, 2011 at 9:08 AM, dan wheland...@adicio.com wrote: I would like to use MLT and the content stream feature in solr like on this page: http://wiki.apache.org/solr/MoreLikeThisHandler How should the request handler / solrconfig be setup? I enabled streaming and I set a requestHandler up by copying the default request handler and I changed the name to: name=/mlt but when accessing the url like the example on the wiki I get a NPE because q is not supplied I'm sure I am just doing it wrong just not sure what. Thanks, dan
Re: Solr stopword problem in Query
Hi Isan, Does your search return any documents when you remove the 'at' keyword and just search for Coke studio MTV? Also, can you please provide the snippet of the schema.xml file where you have defined this field name and its type? On Mon, Sep 26, 2011 at 6:09 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have a text field named *textForQuery*. The following content has been indexed into solr in the field textForQuery: *Coke Studio at MTV*. When I fired the query *textForQuery:(coke studio at mtv)*, the results showed 0 documents. After running the same query in debug mode I got the following:

<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
  <str name="querystring">textForQuery:(coke studio at mtv)</str>
  <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
  <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
</lst>

Why did the query not match any document even though there is a document with textForQuery value *Coke Studio at MTV*? Is this because of the stopword *at* present in the stopword list? -- Thanks & Regards, Isan Fulia. -- Thanks and Regards Rahul A. Warawdekar
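One frequent cause of the "?" hole in the parsed phrase query (a general note, not a diagnosis of Isan's exact config): the StopFilter settings differ between the index-time and query-time analyzers, so the position gap produced at query time never lines up with the indexed positions. A sketch of a fieldType where they match — the type name and stopwords file are illustrative, not taken from the actual schema:

```xml
<!-- Hypothetical fieldType: the key point is that StopFilterFactory is
     configured identically (same words file, same enablePositionIncrements)
     on the index and query sides, so phrase queries line up. -->
<fieldType name="text_stopped" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With matching settings, "coke studio at mtv" analyzes to the same token positions at both index and query time, so the phrase can match.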
solr DIH for mongodb
hi, is there any DIH plugin for MongoDB? regards, kiwi
drastic performance decrease with 20 cores
Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has 13GB allocated on a 16GB RAM server. If I run several sets of queries sequentially against each of the cores, the I/O access goes very high, and so does the system load, while the CPU percentage always remains low. It takes almost 1 hour to complete the set of queries. If I stop solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me it is MUCH more performant to have 2 solr instances running with half the data and half the memory than a single instance with all the data and memory. It would even be way faster to have 1 instance with half the cores/memory, run the queries, shut it down, start a new instance and repeat the process than to have one big instance running everything. Furthermore, if I take the 20-core/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior remains slow, taking about 30 minutes. Am I missing something here? Does solr change its caching policy depending on the number of cores at startup or something similar? Any hints will be much appreciated. Thanks, Victor
Re: drastic performance decrease with 20 cores
On 9/26/2011 9:33 AM, Bictor Man wrote: Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has allocated 13GB in a 16GB RAM server. If I run several sets of queries sequentially in each of the cores, the I/O access goes very high, so does the system load, while the CPU percentage remains low. It takes almost 1 hour to complete the set of queries. If I stop solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. With 13 of your 16GB of RAM being gobbled up by the Java process running Solr, and some of your memory taken up by the OS itself, you've probably only got about 2GB of free RAM left for the OS disk cache. Not knowing what kind of data you're indexing, I can only guess how big your indexes are, but with around 80 million total documents, I imagine that it is MUCH larger than 2GB. If I'm right, this means that your Solr server is unable to keep index data in RAM, so it ends up going out to the disk every time it needs to make a query, and that is SLOW. The ideal situation is to have enough free memory so that the OS can put all index data into its disk cache, making access to it nearly instantaneous. You may never reach that ideal with your setup, but if you can get between a third and half the index into RAM, it'll probably still perform well. Do you really need to allocate 13GB to Solr? If it crashes when you allocate less, you may have very large Solr caches in solrconfig.xml that you can reduce. You do want to take advantage of Solr caching, but if you have to choose between disk caching and Solr caching, go for disk. It's unusual, but not necessarily wrong, to have so many large cores on one machine. Why are things set up that way? Are you using a distributed index, or do you have 20 separate indexes?
The bottom line - you need more memory. Running with 32GB or even 64GB would probably serve you very well. You probably also need more machines. For redundancy purposes, you'll want to have two complete copies of your index on separate hardware and some kind of load balancer with failover capability. You may also want to look into increasing your I/O speed, with 15k RPM SAS drives, RAID10, or even SSD. Depending on the needs of your application, you may be able to decrease your index size by changing your schema and re-indexing, especially in the area of stored fields. Typically what you want to do is store only the data required to construct a search results grid, and go to the original data source for full details when someone opens a specific result. You can also look into changing the field types on your index to remove Lucene features you don't need. The needs of every Solr installation are different, and even my advice might be wrong for your particular setup, but you can rarely go wrong by adding memory. Thanks, Shawn
Re: drastic performance decrease with 20 cores
You have not said how big your index is, but I suspect that allocating 13GB for your 20 cores is starving the OS of memory for caching file data. Have you tried 6GB with 20 cores? I suspect you will see the same performance as with 6GB and 10 cores. Generally it is better to allocate just enough memory to SOLR to run optimally rather than as much as possible. 'Just enough' varies, though; you will need to try out different allocations and see where the sweet spot is. Cheers François On Sep 26, 2011, at 9:53 AM, Bictor Man wrote: Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has allocated 13GB in a 16GB RAM server. If I run several sets of queries sequentially in each of the cores, the I/O access goes very high, so does the system load, while the CPU percentage remains always low. It takes almost 1 hour to complete the set of queries. If I stop solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me is MUCH more performant having 2 solr instances running with half the data and half the memory than a single instance will all the data and memory. It would be even way faster to have 1 instance with half the cores/memory, run the queues, shut it down, start a new instance and repeat the process than having a big instance running everything. Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior still remains slow taking like 30 minutes. am I missing something here? does solr change its caching policy depending on the number of cores at startup or something similar? Any hints will be very appreciated. Thanks, Victor
how to implemente a query like like '%pattern%'
Hi all. How can we do a query similar to SQL's 'like'? If I have this phrase as a single token in the index: "This phrase has various words" (using KeywordTokenizerFactory) and I'd like an exact match on "phrase has various" or "various words", for instance... How can I do this? Thanks a lot. Rode.
Re: SOLR error with custom FacetComponent
: : Unfortunately the facet fields are not static. The fields are dynamic SOLR : fields and are generated by different applications. : The field names will be populated into a data store (like memcache) and : facets have to be driven from that data store. : : I need to write a Custom FacetComponent which picks up the facet fields from : the data store. It sounds like you don't need custom facet *code*, you just need to dynamically decide what fields to facet on -- I would suggest in that case that instead of subclassing FacetComponent you write a standalone SearchComponent that you configure to run before the FacetComponent, which would modify the request params to add the new facet.field (and any f.*.facet.field.*) params you decide you want to use at run time -- the more you can decouple your custom code from the existing code, the less maintenance headaches you are likely to have. As for your original problem: : I'm getting an error saying Error instantiating SearchComponent My Custom : Class is not a org.apache.solr.handler.component.SearchComponent. : : My custom class inherits from *FacetComponent* which extends from * : SearchComponent*. ...this sounds like it is likely a problem with the classloaders -- even though you subclass FacetComponent, if a different branch of the classloader tree loads your custom code, it may not recognize that the FacetComponent class instance you subclass is the same as the FacetComponent class it already knows about. Where exactly did you put the class/jar containing your subclass? Did you specify a lib/ directive in your solrconfig.xml for it? If you added/moved/copied *any* jars into example/lib, that's a good tip-off that you made a mistake... https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins -Hoss
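A sketch of the wiring Hoss describes (the component class and name here are made up for illustration); the custom component's prepare() method would append the facet.field params fetched from the data store before FacetComponent sees them:

```xml
<!-- Hypothetical custom component; the class name is an assumption. -->
<searchComponent name="dynamicFacetFields"
                 class="com.example.DynamicFacetFieldsComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <!-- runs before the standard components, including FacetComponent,
         so it can rewrite the request params at query time -->
    <str>dynamicFacetFields</str>
  </arr>
</requestHandler>
```

Because the component only touches request params, it stays decoupled from FacetComponent internals, which is exactly the maintenance benefit described above.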
aggregate functions in Solr?
Hello guys, I need to implement a functionality which requires something similar to aggregate functions in SQL. My Solr schema looks like this: -doc_id: integer -date: date -value1: integer -value2: integer Basically the index contains some numerical values (value1, value2, etc.) per doc and date. Given a date range query, I need to return some stats consolidated by docs for that given date range. A typical response could be something like this: doc_id, sum(value1), avg(value2), sum(value1)/sum(value2). I checked StatsComponent using stats.facet=doc_id but it seems it doesn't cover my needs (especially for complex stats like sum(value1)/sum(value2)). Also checked FieldCollapsing but I couldn't find a way to configure an aggregate function there. Is there any way to implement this, or will I have to resolve it outside of Solr? Regards, Esteban
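If the stats do have to be resolved outside of Solr (StatsComponent, as noted, has no way to express sum(value1)/sum(value2)), one workaround is to fetch the per-document values for the date range and consolidate them client-side. A minimal sketch of just the consolidation step, with plain Java standing in for the Solr fetch:

```java
// Client-side aggregation sketch: rows would come from a Solr query over the
// date range, returning doc_id, value1 and value2 per document. No Solr
// dependency is shown here; only the consolidation logic.
import java.util.*;

public class Aggregator {
    // Per-doc totals keyed by doc_id: index 0 holds sum(value1), index 1 sum(value2).
    public static Map<Integer, long[]> consolidate(List<int[]> rows) {
        // each row: {doc_id, value1, value2}
        Map<Integer, long[]> byDoc = new HashMap<>();
        for (int[] r : rows) {
            long[] acc = byDoc.computeIfAbsent(r[0], k -> new long[2]);
            acc[0] += r[1];
            acc[1] += r[2];
        }
        return byDoc;
    }

    // The ratio stat that StatsComponent cannot produce directly.
    public static double ratio(long[] sums) {
        return sums[1] == 0 ? Double.NaN : (double) sums[0] / sums[1];
    }
}
```

The obvious caveat is that this pulls every matching row out of Solr, so it only scales as far as the result set does.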
Re: Unique Key error on trunk
You can replicate it with the example app by replacing the id definition in schema.xml with <field name="id" type="uuid" indexed="true" stored="true" default="NEW" />, removing the id field from one of the example doc xml files, and posting it to solr. Thanks Viswa On Sep 26, 2011, at 12:15 AM, Viswa S wrote: Hello, We use solr.UUIDField to generate unique ids; using the latest trunk (change list 1163767) seems to throw an error: Document is missing mandatory uniqueKey field: id. The schema is set up to generate an id field on updates: <field name="id" type="uuid" indexed="true" stored="true" default="NEW" /> Thanks Viswa SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:80) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:145) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1406) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Re: mlt content stream help
Dan: The disconnect here seems to be that the example urls on the MoreLikeThisHandler wiki page assume a /mlt request handler exists, but no handler by that name has ever actually existed in the solr example configs. (The wiki page doesn't explicitly state that those URLs will work with the example configs, but it certainly suggests it.) Instead of copying the *default* request handler config (using SearchHandler) verbatim, you need to create a handler declaration that uses the MoreLikeThisHandler class, ala...

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>

...you can add more configuration (to specify things like default params and whatnot) but that's the minimum config that you need to get the MLT Handler up and running. I've updated the wiki page to reflect this -- thanks for helping to catch the mistake. : http://wiki.apache.org/solr/MoreLikeThisHandler : : How should the request handler / solrconfig be setup? : : I enabled streaming and I set a requestHandler up by copying the : default : request handler and I changed the name to: : : name=/mlt : : but when accessing the url like the example on the wiki I get a NPE : because : q is not supplied -Hoss
RE: SOLR Index Speed
500/second would be 1,800,000 per hour (much more than 500K documents). 1) how big is each document? 2) how big are your index files? 3) as others have recently written, make sure you don't give your JRE so much memory that your OS is starved for memory to use for the file system cache. JRJ -Original Message- From: Lord Khan Han [mailto:khanuniver...@gmail.com] Sent: Monday, September 26, 2011 6:09 AM To: solr-user@lucene.apache.org Subject: SOLR Index Speed Hi, We have 500K web documents and are using solr (trunk) to index them. We have a special analyzer which is a little bit CPU-heavy. Our machine config: 32 x cpu, 32 gig ram, SAS HD. We are sending documents with 16 reduce clients (from hadoop) to the stand-alone solr server. The problem is we couldn't get faster than 500 doc/per sec. 500K documents took 7-8 hours to index :( While indexing, the solr server cpu load is around 5-6 (32 max), meaning %20 of the total cpu power. We have plenty of ram... I turned off auto commit and gave an 8198 rambuffer.. there is no io wait.. How can I make it faster? PS: solr streamindex is not an option because we need to submit javabin... thanks..
How to reserve ids?
Hello, While indexing there are certain urls/ids I'd never want to appear in the search results (i.e., never be indexed). Is there already a 'supported by design' mechanism for that to point me to, or should I just implement this blacklist as a processor in the update chain? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
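For the update-chain route the question mentions, the wiring might look like the following sketch; the factory class and the blacklistFile parameter are hypothetical, not an existing Solr processor:

```xml
<!-- Sketch only: com.example.BlacklistUpdateProcessorFactory is a made-up
     custom factory. Its processAdd() would drop any document whose unique
     key appears in the blacklist, then delegate to the next processor. -->
<updateRequestProcessorChain name="blacklist" default="true">
  <processor class="com.example.BlacklistUpdateProcessorFactory">
    <str name="blacklistFile">blacklist.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Keeping RunUpdateProcessorFactory last is what actually performs the index write, so silently returning from processAdd() before delegating is enough to blacklist a document.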
Boost Exact matches on Specific Fields
Hi all, I am new to SOLR and have a question about boosting exact terms to the top on a particular field. For example: I have a text field named ts_category and I want to give more boost to this field than to other fields, so in my query I pass the following in the qf param: qf=body^4.0 title^5.0 ts_category^21.0 and also sort on score desc. When I do a search for Hospitals, I get Hospitalization Management and Hospital Equipment Supplies on top rather than the exact matches of Hospitals. It would be great if I could get some help here. Thanks in advance, Balaji -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to implemente a query like like '%pattern%'
If you need those kinds of searches then you should probably not be using the KeywordTokenizerFactory. Is there any reason why you can't switch to a WhitespaceTokenizer, for example? Then you could use a simple phrase query for your search case. If you need everything as one token, you could use a copyField to duplicate the field and have them both. Are those acceptable options for you? Tomás 2011/9/26 Rode González (libnova) r...@libnova.es Hi all. How can we do a query similar to 'like'? If I have this phrase as a single token in the index: "This phrase has various words" (using KeywordTokenizerFactory) and I'd like an exact match on "phrase has various" or "various words", for instance... How can I do this? Thanks a lot. Rode.
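A sketch of that suggestion, keeping both variants of the field (the type and field names here are illustrative, not from Rode's schema):

```xml
<!-- Hypothetical schema fragment: a whitespace-tokenized copy alongside
     the original single-token field. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="phrase_exact" type="string"  indexed="true" stored="true"/>
<field name="phrase_text"  type="text_ws" indexed="true" stored="false"/>
<copyField source="phrase_exact" dest="phrase_text"/>
```

With this in place, a phrase query such as phrase_text:"phrase has various" matches mid-value, while phrase_exact still supports whole-value matching.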
RE: A fieldType for a address street
We used copyField to copy the address to two fields: 1. Which contains just the first token, up to the first whitespace. 2. Which copies all of it, but translates to lower case. Then our users can enter either a street number, a street name, or both. We copied all of it to the second field because it is not, in general, possible to distinguish between a house number and something else: a house number is not always present, and when present is not always numeric. Both are solr.TextField:

<fieldType name="streetnumber" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="(^\S+)" group="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

JRJ -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Friday, September 23, 2011 9:27 AM To: solr-user@lucene.apache.org Subject: Re: A fieldType for a address street Nicolas, A text or ngram field should do it. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message - From: Nicolas Martin nmar...@doyousoft.com To: solr-user@lucene.apache.org Cc: Sent: Friday, September 23, 2011 5:55 AM Subject: A fieldType for a address street Hi solR users! I'd like to search my client database; in particular, I need to find clients by their address (ex: 100 avenue des champs élysée). Does anyone know a good fieldType to store my addresses, to enable me to search clients by address easily? thank you all
RE: SOLR Index Speed
Are you batching the documents before sending them to the solr server? Are you doing a commit only at the end? Also since you have 32 cores, you can try upping the number of concurrent updaters from 16 to 32. Jaeger, Jay - DOT wrote: 500 / second would be 1,800,000 per hour (much more than 500K documents). 1) how big is each document? 2) how big are your index files? 3) as others have recently written, make sure you don't give your JRE so much memory that your OS is starved for memory to use for file system cache. JRJ -Original Message- From: Lord Khan Han [mailto:khanuniver...@gmail.com] Sent: Monday, September 26, 2011 6:09 AM To: solr-user@lucene.apache.org Subject: SOLR Index Speed Hi, We have 500K web document and usind solr (trunk) to index it. We have special anaylizer which little bit heavy cpu . Our machine config: 32 x cpu 32 gig ram SAS HD We are sending document with 16 reduce client (from hadoop) to the stand alone solr server. the problem is we couldnt get speedier than the 500 doc / per sec. 500K document tooks 7-8 hours to index :( While indexin the the solr server cpu load is around : 5-6 (32 max) it means %20 of the cpu total power. We have plenty ram ... I turned of auto commit and give 8198 rambuffer .. there is no io wait .. How can I make it faster ? PS: solr streamindex is not option because we need to submit javabin... thanks.. -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Index-Speed-tp3368945p3370765.html Sent from the Solr - User mailing list archive at Nabble.com.
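The batching suggestion above can be separated from any particular client API. A minimal sketch of the chunking logic itself; the sendBatch consumer is a stand-in for whatever actually posts to Solr (e.g. a SolrJ add call), and a single commit would follow the last batch:

```java
// Generic batching helper: splits a document list into fixed-size batches
// and hands each batch to a sender. No Solr dependency here; the Consumer
// is where a real client would post the batch.
import java.util.*;
import java.util.function.Consumer;

public class Batcher {
    public static <T> int sendInBatches(List<T> docs, int batchSize,
                                        Consumer<List<T>> sendBatch) {
        int batches = 0;
        for (int i = 0; i < docs.size(); i += batchSize) {
            sendBatch.accept(docs.subList(i, Math.min(i + batchSize, docs.size())));
            batches++;
        }
        return batches; // the caller issues one commit after the final batch
    }
}
```

Batch sizes in the hundreds to low thousands are a common starting point; the right value depends on document size and has to be measured.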
Re: how to implemente a query like like '%pattern%'
: References: : cafwsjvnqkaufwspqrkm4sckb-0gvak-vktkfrnmfwgzwltm...@mail.gmail.com : In-Reply-To: : cafwsjvnqkaufwspqrkm4sckb-0gvak-vktkfrnmfwgzwltm...@mail.gmail.com : Subject: how to implemente a query like like '%pattern%' https://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. -Hoss
Re: Unique Key error on trunk
: Subject: Re: Unique Key error on trunk : : : You can replicate it with the example app by replacing the id definition in schema.xml with : : <field name="id" type="uuid" indexed="true" stored="true" default="NEW" /> thanks for reporting this Viswa, I've filed a bug to track it... https://issues.apache.org/jira/browse/SOLR-2796 -Hoss
Searching multiple fields
I have a use case where I would like to search across two fields, but I do not want to weight a document that has a match in both fields higher than a document that has a match in only one field. For example: Document 1 - Field A: Foo Bar - Field B: Foo Baz Document 2 - Field A: Foo Blarg - Field B: Something else Now when I search for Foo I would like documents 1 and 2 to be similarly scored; however, document 1 will be scored much higher in this use case because it matches in both fields. I could create a third field and use the copyField directive to search across that, but I was wondering if there is an alternative way. It would be nice if we could search across some sort of virtual field that uses both underlying fields but does not actually increase the size of the index. Thanks
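One thing that may be worth checking (an assumption about fit, not a confirmed answer from the thread): the dismax query parser scores a term as the maximum over the queried fields plus the tie parameter times the remaining field scores, so with tie=0.0 a document matching in both fields gets only its best single-field score rather than the sum. The combination rule in miniature, assuming non-negative scores:

```java
// Miniature model of dismax's per-term score combination:
// score = max(fieldScores) + tie * sum(all the other field scores).
// With tie = 0.0, matching in extra fields does not inflate the score.
public class DismaxScore {
    public static double score(double[] fieldScores, double tie) {
        double max = 0, sum = 0; // assumes scores are non-negative
        for (double s : fieldScores) {
            if (s > max) max = s;
            sum += s;
        }
        return max + tie * (sum - max);
    }
}
```

So a request like q=Foo&defType=dismax&qf=fieldA fieldB&tie=0.0 may give the behavior described, without a third copyField.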
Re: How to apply filters to stored data
: Hi Erick, The problem I am trying to solve is to filter invalid entities. : Users might misspell or enter a new entity name. These new/invalid entities : need to pass through a KeepWordFilter so that they won't pollute our : autocomplete results. How are you doing autocomplete? If you are using the Suggest feature of solr, then that's based on the indexed terms anyway (last time I checked), so you don't need to manipulate the stored field values. In general, the only way to manipulate the stored field values is to do it in an update processor -- which can mutate the documents long before the schema is ever even consulted. -Hoss
Re: drastic performance decrease with 20 cores
Hi guys, thanks for your replies. Indeed the filesystem caching seems to be the difference. Sadly I can't add more memory, and the 6GB/20-core combination doesn't work, so I'll just try to tweak it as much as I can. Thanks a lot.

2011/9/26 François Schiettecatte fschietteca...@gmail.com

You have not said how big your index is, but I suspect that allocating 13GB for your 20 cores is starving the OS of memory for caching file data. Have you tried 6GB with 20 cores? I suspect you will see the same performance as 6GB with 10 cores. Generally it is better to allocate just enough memory to Solr to run optimally rather than as much as possible. 'Just enough' depends as well; you will need to try out different allocations and see where the sweet spot is. Cheers, François

On Sep 26, 2011, at 9:53 AM, Bictor Man wrote:

Hi everyone, sorry if this issue has been discussed before, but I'm new to the list. I have a Solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has 13GB allocated on a 16GB RAM server. If I run several sets of queries sequentially against each of the cores, I/O access goes very high, and so does the system load, while the CPU percentage always remains low. It takes almost 1 hour to complete the set of queries. If I stop Solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me it is MUCH more performant to have 2 Solr instances running with half the data and half the memory than a single instance with all the data and memory. It would even be way faster to have 1 instance with half the cores/memory, run the queries, shut it down, start a new instance and repeat the process than to have one big instance running everything. Furthermore, if I take the 20-core/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior remains slow, taking around 30 minutes.

Am I missing something here? Does Solr change its caching policy depending on the number of cores at startup, or something similar? Any hints will be much appreciated. Thanks, Victor
Re: How to apply filters to stored data
Is UpdateProcessor triggered when updating an existing document, or for new documents also? On Tue, Sep 27, 2011 at 6:00 AM, Chris Hostetter-3 [via Lucene] ml-node+s472066n3371110...@n3.nabble.com wrote:

: Hi Erick, The problem I am trying to solve is to filter invalid entities.
: Users might misspell or enter a new entity name. These new/invalid entities
: need to pass through a KeepWordFilter so that they won't pollute our
: autocomplete results.

How are you doing autocomplete? If you are using the Suggest feature of Solr, then that's based on the indexed terms anyway (last time I checked), so you don't need to manipulate the stored field values. In general, the only way to manipulate the stored field values is to do it in an update processor -- which can mutate the documents long before the schema is ever even consulted. -Hoss

-- Thanks, Jithin Emmanuel -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-apply-filters-to-stored-data-tp3366230p3371200.html Sent from the Solr - User mailing list archive at Nabble.com.
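For anyone landing on this thread later: Hoss's update-processor approach gets wired up in solrconfig.xml roughly as below. This is only a sketch; the com.example.KeepWordsUpdateProcessorFactory class name is hypothetical and stands in for whatever custom processor you write to drop or rewrite invalid entity values.

```xml
<!-- solrconfig.xml sketch (hypothetical processor class, not a shipped one):
     a custom chain that mutates documents before they reach the schema -->
<updateRequestProcessorChain name="keepwords">
  <processor class="com.example.KeepWordsUpdateProcessorFactory">
    <!-- hypothetical config: file of entity names to keep -->
    <str name="words">keepwords.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is selected per request with the update.chain parameter (update.processor in older releases), or set as a default on the update handler. As to Jithin's question: the chain runs for every add, and in Solr an update of an existing document is just a re-add of the whole document.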
Re: external file field partial data match in key field
I found the answer to my question: it basically works only with a complete match. -- View this message in context: http://lucene.472066.n3.nabble.com/external-file-field-partial-data-match-in-key-field-tp3368547p3371328.html Sent from the Solr - User mailing list archive at Nabble.com.
Any plans to support function queries on score?
Hi guys, do you have any plans to support function queries on the score field? For example, sort=floor(product(score, 100)+0.5) desc? So far I am getting the following error: undefined field score. I can't use a subquery in this case because I am trying to use secondary sorting; however, I am open to that if someone has successfully used another field to boost the results. Thanks, YH http://thetechietutorials.blogspot.com/
Re: Searching multiple fields
Hi Mark, Eh, I don't have Lucene/Solr source code handy, but I *think* for that you'd need to write custom Lucene similarity. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Mark static.void@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 8:12 PM Subject: Searching multiple fields I have a use case where I would like to search across two fields but I do not want to weight a document that has a match in both fields higher than a document that has a match in only 1 field. For example. Document 1 - Field A: Foo Bar - Field B: Foo Baz Document 2 - Field A: Foo Blarg - Field B: Something else Now when I search for Foo I would like document 1 and 2 to be similarly scored however document 1 will be scored much higher in this use case because it matches in both fields. I could create a third field and use copyField directive to search across that but I was wondering if there is an alternative way. It would be nice if we could search across some sort of virtual field that will use both underlying fields but not actually increase the size of the index. Thanks
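For reference, the copyField approach Mark describes would look something like this in schema.xml (a sketch; catchall and the source field names are placeholders, and some index growth is unavoidable since the combined field is indexed, even though it is not stored):

```xml
<!-- schema.xml sketch: search one combined field so a term matching in
     both source fields is scored like a single-field match -->
<field name="fieldA" type="text" indexed="true" stored="true"/>
<field name="fieldB" type="text" indexed="true" stored="true"/>
<field name="catchall" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="fieldA" dest="catchall"/>
<copyField source="fieldB" dest="catchall"/>
```

With stored="false" only the inverted index grows, not the stored-document portion, which keeps the overhead of the extra field modest.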
Re: How to reserve ids?
Hi Gabriele, Either the latter option, or just treat them as stop words if you just want to remove those urls/ids from indexed docs (may still get highlighted). Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Gabriele Kahlout gabri...@mysimpatico.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 3:33 PM Subject: How to reserve ids? Hello, While indexing there are certain urls/ids I'd never want to appear in the search results (so be indexed). Is there already a 'supported by design' mechanism to do that to point me too, or should I just create this blacklist as an processor in the update chain? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
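A sketch of the stop-word variant Otis mentions, assuming the ids/urls live in their own field; the fieldType and file names here are made up. Note the caveat above: this keeps the terms out of the index, but stored values can still be returned and highlighted.

```xml
<!-- schema.xml sketch: blacklist applied at index time;
     url_blacklist.txt (one url/id per line) is a hypothetical file -->
<fieldType name="url_filtered" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="url_blacklist.txt"/>
  </analyzer>
</fieldType>
```

The alternative Gabriele suggests, a blacklist processor in the update chain, would instead drop the whole document (or field) before indexing, which also keeps the stored value out.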
Re: Boost Exact matches on Specific Fields
If I were you, I would probably try defining two fields: 1. ts_category as a string type 2. ts_category1 as a text_en type. Make sure you copy ts_category to ts_category1. You can use the following as qf in your dismax: qf=body^4.0 title^5.0 ts_category^10.0 ts_category1^5.0 or something like that. YH http://thetechietutorials.blogspot.com/ On Mon, Sep 26, 2011 at 2:06 PM, balaji mcabal...@gmail.com wrote: Hi all, I am new to Solr and have a doubt about boosting exact terms to the top on a particular field. For example: I have a text field named ts_category and I want to give more boost to this field than to other fields, so in my query I pass the following in the qf params: qf=body^4.0 title^5.0 ts_category^21.0 and also sort on score desc. When I do a search for Hospitals, I get Hospitalization Management, Hospital Equipment Supplies on top rather than exact matches for Hospitals. So it would be great if I could be helped over here. Thanks in advance, Balaji -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr DIH for mongodb
Hi, Here is a 1 month old thread I found on search-lucene -- didn't even have to do a search, I got it as a suggestion from AutoComplete when I started typing the word mongodb :) http://search-lucene.com/m/8AEE31AaTd32 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Kiwi de coder kiwio...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 11:58 AM Subject: solr DIH for mongodb hi, do we got any DIH plugin which is for mongodb? regards, kiwi
Re: Update ingest rate drops suddenly
Aha! See, it was the DB after all! ;) Thanks for following up, I was curious. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: eks dev eks...@yahoo.co.uk To: solr-user solr-user@lucene.apache.org Sent: Monday, September 26, 2011 10:21 AM Subject: Re: Update ingest rate drops suddenly

Just to bring closure on this one, we were slurping data from the wrong DB (hardly a desktop-class machine)... Solr did not cough on 41Mio records @34k updates/sec, single threaded. Great!

On Sat, Sep 24, 2011 at 9:18 PM, eks dev eks...@yahoo.co.uk wrote:

Just looking for hints where to look... We were testing single-threaded ingest rate on Solr, trunk version, on an atypical collection (a lot of small documents), and we noticed something we are not able to explain. Setup: we use defaults for index settings, Windows 64-bit, JDK 7 U2, on SSD, a machine with enough memory and 8 cores. The schema has 5 stored fields, 4 of them indexed, no positions, no norms. Average net document size (optimized index size / number of documents) is around 100 bytes. On a test with 40 Mio documents:

- we had an update ingest rate on the first 4.4Mio documents @ an incredible 34k records/second...
- then it dropped, suddenly, to 20k records per second, and this rate remained stable (variance 1k) until...
- we hit 13Mio, where the ingest rate dropped again really hard, from one instant to another, to 10k records per second. It stayed there until we reached the end @40Mio (slightly reducing, to ca 9k, but this is not long enough to see a trend).

Nothing unusual happening with JVM memory (saw-tooth 200-450M, fully regular). CPU in turn was following the ingest rate trend, indicating that we were waiting on something. No searches, no commits, nothing. autoCommit was turned off. Updates were streaming directly from the database. I did not expect something like this, knowing Lucene merges in the background.

Also, having such sudden drops in ingest rate suggests that we are not leaking something (a leak would give a much more gradual drop). It is some cache, but why two really significant drops? 34k/sec to 20k and then to 10k... We would love to keep it @34k/second :) I am not really acquainted with the new MergePolicy and flushing settings, but I suspect there is something there we could tweak. Could it be that Windows is somehow, hmm, quirky with the Solr default directory on win64/JVM (I think it is MMap by default)? We did not saturate IO with such small documents, I guess; it is just a couple of gigs over 1-2 hours. All in all, it works well, but are such hard update ingest rate drops normal? Thanks, eks.
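The flush/merge knobs eks mentions wanting to tweak live in solrconfig.xml; on trunk the section is <indexConfig> (older 3.x configs use <indexDefaults>/<mainIndex> instead). The values below are placeholders showing where the knobs are, not recommendations:

```xml
<!-- solrconfig.xml sketch: index-time buffer and merge policy settings;
     the numbers are illustrative only -->
<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <double name="segmentsPerTier">10.0</double>
  </mergePolicy>
</indexConfig>
```

Raising ramBufferSizeMB delays flushes (fewer, larger segments); segmentsPerTier/maxMergeAtOnce trade merge frequency against segment count.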
Re: SOLR Index Speed
Hello, PS: solr streamindex is not option because we need to submit javabin... If you are referring to StreamingUpdateSolrServer, then the above statement makes no sense and you should give SUSS a try. Are you sure your 16 reducers produce more than 500 docs/second? I think somebody already suggested increasing the number of reducers to ~32. What happens to your CPU load and indexing speed then? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Lord Khan Han khanuniver...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 7:09 AM Subject: SOLR Index Speed

Hi, we have 500K web documents and are using Solr (trunk) to index them. We have a special analyzer which is a little CPU-heavy. Our machine config: 32 x CPU, 32 GB RAM, SAS HD. We are sending documents with 16 reduce clients (from Hadoop) to the standalone Solr server. The problem is we couldn't get faster than 500 docs/sec, so 500K documents took 7-8 hours to index :( While indexing, the Solr server CPU load is around 5-6 (32 max), which means about 20% of the total CPU power. We have plenty of RAM. I turned off autoCommit and gave an 8198 RAM buffer. There is no IO wait. How can I make it faster? PS: solr streamindex is not an option because we need to submit javabin... thanks.
Re: error while replication
Rajat, What version? If not 3.4.0, I'd try 3.4.0 first. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: shinkanze rajatrastogi...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 5:45 AM Subject: error while replication

Hi, I am replicating Solr and getting this error. I am unable to make out the cause, so please kindly help.

26 Sep, 2011 8:00:14 AM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey@150f0455:java.lang.NullPointerException
    at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
    at org.apache.lucene.index.Term.<init>(Term.java:38)
    at org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.next(NumericRangeQuery.java:530)
    at org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.<init>(NumericRangeQuery.java:476)
    at org.apache.lucene.search.NumericRangeQuery.getEnum(NumericRangeQuery.java:307)
    at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:160)
    at org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery.java:116)
    at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:81)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
    at org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:268)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:258)
    at org.apache.lucene.search.Searcher.search(Searcher.java:171)
    at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
    at org.apache.solr.search.SolrIndexSearcher.access$000(SolrIndexSearcher.java:51)
    at org.apache.solr.search.SolrIndexSearcher$3.regenerateItem(SolrIndexSearcher.java:332)
    at org.apache.solr.search.LRUCache.warm(LRUCache.java:194)
    at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
    at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1130)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)

regards, rajat rastogi -- View this message in context: http://lucene.472066.n3.nabble.com/error-while-replication-tp3368783p3368783.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: matching reponse and request
Hi Roland, Have a look at hit #1 here: http://search-lucene.com/?q=manifoldcf&fc_project=Solr I think this is what you are after. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Roland Tollenaar rwatollen...@gmail.com To: solr-user@lucene.apache.org Sent: Sunday, September 25, 2011 4:24 AM Subject: Re: matching reponse and request

Hi Otis, this is absolutely brilliant! I did not think it was possible. It opens up a new possibility. If I insert device IDs in this manner (as in a unique identifier of the device sending the request), might it be possible to control (at least block or permit) the permissions of the user? It seems like something of the sort is possible, but I only came up with this: http://search-lucene.com/m/Yuib11zCeYN No redirect to where the permissions can be set (in the schema) and how the requests are identified to come from a particular user/device. Thanks for your help. Kind regards, Roland

Otis Gospodnetic wrote: Hi Roland, check this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">solr</str>
      <str name="foo">1</str>   <=== from foo=1
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
</response>

I added foo=1 to the request to Solr and got the above back. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Roland Tollenaar rwatollen...@gmail.com To: solr-user@lucene.apache.org Sent: Saturday, September 24, 2011 4:07 AM Subject: matching reponse and request

Hi, sorry for this question, but I am hoping it has a quick solution. I am sending multiple GET request queries to Solr, but Solr is not returning the responses in the sequence I send the requests. The shortest responses arrive back first. I am wondering whether I can add a tag to the request which will be given back to me in the response, so that when the response comes I can connect it to the original request and handle it in the appropriate manner. If this is possible, how? Help appreciated! Regards, Roland.
Re: solr DIH for mongodb
Wow, this search engine is powerful! Too bad that after looking through it, I still have no solution. Seems like I need to get my hands dirty and make one :) kiwi On Tue, Sep 27, 2011 at 12:08 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Here is a 1 month old thread I found on search-lucene -- didn't even have to do a search, I got it as a suggestion from AutoComplete when I started typing the word mongodb :) http://search-lucene.com/m/8AEE31AaTd32 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Kiwi de coder kiwio...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 11:58 AM Subject: solr DIH for mongodb hi, do we have any DIH plugin for mongodb? regards, kiwi
Re: Boost Exact matches on Specific Fields
Hi, You mean to say copy the string field to a text field, or the reverse? This is the approach I am currently following:

Step 1: Created a FieldType

<fieldType name="string_lower" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Step 2:

<field name="str_category" type="string_lower" indexed="true" stored="true"/>

Step 3:

<copyField source="ts_category" dest="str_category"/>

And in the Solr query I am planning: q=hospitals&qf=body^4.0 title^5.0 ts_category^10.0 str_category^8.0

The one question I have here is: all the above-mentioned fields will have Hospital present in them; will the above approach work to get the exact match on top and bring Hospitalization below it in the results?

Thanks, Balaji

On Tue, Sep 27, 2011 at 9:38 AM, Way Cool way1.wayc...@gmail.com wrote: If I were you, I would probably try defining two fields: 1. ts_category as a string type 2. ts_category1 as a text_en type. Make sure you copy ts_category to ts_category1. You can use the following as qf in your dismax: qf=body^4.0 title^5.0 ts_category^10.0 ts_category1^5.0 or something like that. YH http://thetechietutorials.blogspot.com/ On Mon, Sep 26, 2011 at 2:06 PM, balaji mcabal...@gmail.com wrote: Hi all, I am new to Solr and have a doubt about boosting exact terms to the top on a particular field. For example: I have a text field named ts_category and I want to give more boost to this field than to other fields, so in my query I pass the following in the qf params: qf=body^4.0 title^5.0 ts_category^21.0 and also sort on score desc. When I do a search for Hospitals, I get Hospitalization Management, Hospital Equipment Supplies on top rather than exact matches for Hospitals. So it would be great if I could be helped over here. Thanks in advance, Balaji -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: drastic performance decrease with 20 cores
The following should help with size estimation: http://search-lucene.com/?q=estimate+memory&fc_project=Solr http://issues.apache.org/jira/browse/LUCENE-3435 I'll just add that with that much RAM you'll be more than fine. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: François Schiettecatte fschietteca...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 12:43 PM Subject: Re: drastic performance decrease with 20 cores

You have not said how big your index is, but I suspect that allocating 13GB for your 20 cores is starving the OS of memory for caching file data. Have you tried 6GB with 20 cores? I suspect you will see the same performance as 6GB with 10 cores. Generally it is better to allocate just enough memory to Solr to run optimally rather than as much as possible. 'Just enough' depends as well; you will need to try out different allocations and see where the sweet spot is. Cheers François

On Sep 26, 2011, at 9:53 AM, Bictor Man wrote: Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a Solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has 13GB allocated on a 16GB RAM server. If I run several sets of queries sequentially against each of the cores, I/O access goes very high, and so does the system load, while the CPU percentage always remains low. It takes almost 1 hour to complete the set of queries. If I stop Solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me it is MUCH more performant to have 2 Solr instances running with half the data and half the memory than a single instance with all the data and memory. It would even be way faster to have 1 instance with half the cores/memory, run the queries, shut it down, start a new instance and repeat the process than to have one big instance running everything. Furthermore, if I take the 20-core/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior remains slow, taking around 30 minutes. Am I missing something here? Does Solr change its caching policy depending on the number of cores at startup, or something similar? Any hints will be much appreciated. Thanks, Victor
Re: solr DIH for mongodb
From: Kiwi de coder kiwio...@gmail.com Wow, this search engine is powerful! Thanks, glad it helps. Too bad that after looking through it, I still have no solution. Seems like I need to get my hands dirty and make one :) :) Please consider contributing: http://wiki.apache.org/solr/HowToContribute Otis kiwi On Tue, Sep 27, 2011 at 12:08 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Here is a 1 month old thread I found on search-lucene -- didn't even have to do a search, I got it as a suggestion from AutoComplete when I started typing the word mongodb :) http://search-lucene.com/m/8AEE31AaTd32 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Kiwi de coder kiwio...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 11:58 AM Subject: solr DIH for mongodb hi, do we have any DIH plugin for mongodb? regards, kiwi
Re: How to implement Spell Checker using Solr?
I have been able to set up the Solr spell checker on my web application. It is a file-based spell checker that I have implemented. I would like to add that it isn't that accurate, since I haven't applied any specific algorithm for getting the most relevant suggestions. Kindly let me know in case you have any issues implementing the same at your end. regards, Anupam -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-implement-Spell-Checker-using-Solr-tp3268450p3371563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to implement Spell Checker using Solr?
Firstly, just to make it clear: the dictionary is made out of already indexed terms, or rather, it is built on top of the index, if you are using <str name="classname">solr.IndexBasedSpellChecker</str>, which you are. Next, a lot of changes are required in your solrconfig.xml:

1. <str name="field">spell</str> is the name of the field which will be used to create your dictionary. Does it exist in schema.xml?
2. <str name="queryAnalyzerFieldType">textSpell</str> is the name of the FieldType used for building your dictionary, i.e. the field named in <str name="field">spell</str> should be of type textSpell in schema.xml. Is it so?

Now for your internal error from crawling: this is most probably because your solrconfig.xml/schema.xml has been changed. I assume so because, as you say, this was working before you tried to implement spellcheck.

"Also, I am not too sure how I can make my search work based on the search control in my application. Like how can I search with the word and have the suggestion at the same time, since when the search item is, say, form/formm, then I should have essentially a separate URL created. Does the Solr spellcheck component take care of it on its own? If so, how, and exactly how should solrconfig.xml and schema.xml be configured for the same? Please note: I would prefer to use a file-based dictionary for the search, so kindly suggest along those lines."

If you are looking for file-based searching, you are going in the wrong direction. You are trying to use the IndexBasedSpellChecker class when actually what you need is:

<lst name="spellchecker">
  <str name="name">file</str>
  <str name="classname">solr.FileBasedSpellChecker</str>
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>

Kindly read more about the spellchecker. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-implement-Spell-Checker-using-Solr-tp3268450p3371620.html Sent from the Solr - User mailing list archive at Nabble.com.
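To round this out: the spellchecker definition only declares the dictionary; getting results and suggestions in one response also needs the spellcheck component attached to a request handler, roughly like this (the /spell handler name is arbitrary):

```xml
<!-- solrconfig.xml sketch: one request returns both search results and
     suggestions from the spellchecker named "file" -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">file</str>
    <str name="spellcheck.count">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

A request like /spell?q=formm then carries suggestions alongside the normal result list, so no separate URL is needed for the suggestion lookup.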
Re: Solr stopword problem in Query
Hi Rahul, I also tried searching Coke Studio MTV but no documents were returned. Here is the snippet of my schema file:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content" type="text" indexed="false" stored="true" multiValued="false"/>
<field name="title" type="text" indexed="false" stored="true" multiValued="false"/>
<field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true" omitTermFreqAndPositions="true"/>
<copyField source="content" dest="textForQuery"/>
<copyField source="title" dest="textForQuery"/>

Thanks, Isan Fulia.

On 26 September 2011 21:19, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi Isan, Does your search return any documents when you remove the 'at' keyword and just search for Coke studio MTV? Also, can you please provide the snippet of the schema.xml file where you have mentioned this field name and its type description? On Mon, Sep 26, 2011 at 6:09 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have a text field named textForQuery. The following content has been indexed into Solr in the field textForQuery: Coke Studio at MTV. When I fired the query textForQuery:(coke studio at mtv) the results showed 0 documents. After running the same query in debug mode I got the following results:

<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
  <str name="querystring">textForQuery:(coke studio at mtv)</str>
  <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
  <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
</lst>

Why did the query not match any document even when there is a document with the value of textForQuery as Coke Studio at MTV? Is this because of the stopword at present in the stopword list? -- Thanks and Regards, Isan Fulia. -- Thanks and Regards, Rahul A. Warawdekar -- Thanks and Regards, Isan Fulia.
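One thing worth checking here (my assumption, not something confirmed in the thread): the parsed query is a PhraseQuery with a position hole where the stopword was, and phrase matching needs term positions, yet textForQuery is declared with omitTermFreqAndPositions="true". A sketch of the change to try:

```xml
<!-- schema.xml sketch (untested assumption): keep term positions on the
     field so the PhraseQuery with the stopword gap can match -->
<field name="textForQuery" type="text" indexed="true" stored="false"
       multiValued="true" omitTermFreqAndPositions="false"/>
```

A change like this requires reindexing the affected documents before it takes effect.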
Re: what is delata query and how to write?
On Tue, Sep 27, 2011 at 10:51 AM, nagarjuna nagarjuna.avul...@gmail.com wrote: Hi everybody. Right now I have a little bit of an idea about the Solr query, but I am not clear about the delta query: what is it, and how do I write one? Any sample delta query? http://lmgtfy.com/?q=solr+delta+query There are many useful links among the first several. Regards, Gora
Re: what is delata query and how to write?
Hi Gora, can you please quit answers like these? I may get the perfect answer from anybody but not you, so kindly please be quiet. I already googled and saw many links; as a beginner I was unable to get the main intention behind using the delta query (even though we have query), and I didn't find any samples. That's why I posted this thread. If you really want to help me, then try for the samples and send me the link. I will also try; you know, I am still googling, and if I get it I will post the answer to my thread; if anybody gets it, I will get the answer. That's my intention. -- View this message in context: http://lucene.472066.n3.nabble.com/what-is-delata-query-and-how-to-write-tp3371639p3371681.html Sent from the Solr - User mailing list archive at Nabble.com.
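Since the thread never got a concrete sample: in DataImportHandler terms, a delta query is the pair of SQL statements in data-config.xml that lets a delta-import fetch only rows changed since the last run. A generic sketch, with made-up table and column names:

```xml
<!-- data-config.xml sketch: deltaQuery selects the primary keys of rows
     changed since the last import; deltaImportQuery re-fetches each changed
     row by its pk. Table/column names (item, id, name, last_modified) are
     illustrative only. -->
<entity name="item" pk="id"
        query="SELECT id, name FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, name FROM item
                          WHERE id = '${dataimporter.delta.id}'"/>
```

A full-import runs query; hitting /dataimport?command=delta-import runs deltaQuery first and then deltaImportQuery once per changed id.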
Re: How to reserve ids?
I'm interested in the stopwords solution as it sounds like less work, but I'm not sure I understand how it works. Having msn.com as a stopword doesn't mean I won't get msn.com as a result for, say, 'hotmail'. My understanding is that msn.com will never make it to the similarity function and thus affect the score calculation. But the url seldom does anyway (in my searches on content)!