exclude docs with null field
Hi there, say my search query is "new york" and I am searching field1 and field2 for it. How do I specify that I want to exclude docs where field3 doesn't exist? Thanks.
Multi word synonyms + highlighting
Hi, here's a field type using synonyms:

<fieldtype name="SFR" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="french-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  </analyzer>
</fieldtype>

Here are the contents of 'french-synonyms.txt' that I used for testing:

PC,parti communiste
PS,parti socialiste

When I query a field for the words "parti communiste", these things are highlighted: parti communiste, parti socialiste, parti, PC, PS, communiste. Having "parti socialiste" highlighted is a problem. I expected only "parti communiste", parti, communiste and PC to be highlighted. Is there a way to make things work as I expected? Here is the query I use:

wt=json
q=qAndMSFR%3A%28parti%20communiste%29
q.op=AND
start=0
rows=5
fl=id,studyId,questionFR,modalitiesFR,variableLabelFR,variableName,nesstarVariableId,lang,studyTitle,nesstarStudyId,CevipofConcept,studyQuestionCount,questionPosition,preQuestionText,
sort=score%20desc
facet=true
facet.field=CevipofConceptCode
facet.field=studyDateAndId
facet.sort=lex
spellcheck=true
spellcheck.collate=on
spellcheck.count=10
hl=on
hl.fl=questionSMFR,modalitiesSMFR,variableLabelSMFR
hl.fragsize=1
hl.snippets=100
hl.usePhraseHighlighter=true
hl.highlightMultiTerm=true
hl.simple.pre=%3Cb%3E
hl.simple.post=%3C%2Fb%3E
Re: exclude docs with null field
Say my search query is "new york" and I am searching field1 and field2 for it; how do I specify that I want to exclude docs where field3 doesn't exist? See http://search-lucene.com/m/1o5mEk8DjX1/
Re: exclude docs with null field
I could be wrong, but it seems this way has a performance hit? Or am I missing something? field1:new york+field2:new york+field3:[* TO *] 2010/6/4 bluestar sea...@butterflycluster.net Hi there, say my search query is "new york" and I am searching field1 and field2 for it. How do I specify that I want to exclude docs where field3 doesn't exist? Thanks.
Re: exclude docs with null field
I could be wrong, but it seems this way has a performance hit? Or am I missing something? Did you read Chris's message at http://search-lucene.com/m/1o5mEk8DjX1/ ? He proposes an alternative (more efficient) way, other than [* TO *].
Re: Logs for Java Replication in Solr
Hoss, thanks a lot! (We are using Tomcat, so the logging properties file is fine.) Do you know what the reason for the mentioned exception could be? It seems to me that if this exception occurs, even the replication for that index does not work. If I then remove the data directory + reload + poll a replication, all is fine. But sometimes it occurs again :-/ Regards, Peter. : : where can I find more information about a failure of a Java replication : in Solr 1.4? : (Dashboard does not seem to be the best place!?) All the log messages are written using the JDK Logging framework, so it really depends on your servlet container, and where it's configured to write the logs... http://wiki.apache.org/solr/SolrLogging -Hoss
Re: exclude docs with null field
Nice one! Thanks. I could be wrong, but it seems this way has a performance hit? Or am I missing something? Did you read Chris's message at http://search-lucene.com/m/1o5mEk8DjX1/ ? He proposes an alternative (more efficient) way, other than [* TO *].
Re: exclude docs with null field
Additionally, I should have mentioned that you can instead do: fq=field_3:[* TO *], which uses the filterCache. The method presented by Chris will probably outperform the above method, but only on the first request; from then on the filterCache takes over. From a performance standpoint it's probably not worth going the 'default value for null' approach, imho. It IS useful, however, if you want to be able to query on docs with a null value (instead of excluding them). 2010/6/4 bluestar sea...@butterflycluster.net Nice one! Thanks. I could be wrong, but it seems this way has a performance hit? Or am I missing something? Did you read Chris's message at http://search-lucene.com/m/1o5mEk8DjX1/ ? He proposes an alternative (more efficient) way, other than [* TO *].
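To make the suggestion concrete, a request along these lines restricts results to documents where field3 has a value (field names are taken from the question above; the exact main query is just an assumption):

q=field1:"new york" OR field2:"new york"
fq=field3:[* TO *]

To do the opposite and keep only documents where field3 is missing, negate the filter instead: fq=-field3:[* TO *]. Because it is an fq, the filter is cached separately from the main query and reused across requests.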
MultiValue Exclusion
How would you model this? We have a table of news items that people can view in their news stream and comment on. Users have the ability to mute items so they never see them in their feed or search results. From what I can see there are a couple of ways to accomplish this. 1 - Post-process the results and do not render any muted news items. The downside is that pagination becomes problematic. It's possible we may forgo pagination because of this, but for now assume that pagination is a requirement. 2 - Whenever we query for a given user we append a clause that excludes all muted items. I assume in Solr we'd need to do something like -item_id(1 AND 2 AND 3). Obviously this doesn't scale very well. 3 - Have a multi-valued property in the index that contains all ids of users who have muted the item. Being new to Solr I don't even know how (or if it's possible) to run a query that says the user id is not in this multi-valued property. Can this even be done (sample query please)? Again, I know this doesn't scale very well. Any other suggestions? Thanks in advance for the help. -- View this message in context: http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MultiValue Exclusion
I guess the following works.

A. Similar to your option 2, but using the filterCache: fq=-item_id:001 -item_id:002
B. Similar to your option 3, but using the filterCache: fq=-users_excluded_field:userid

The advantage is that the filter is cached independently from the rest of the query, so it can be reused efficiently. Advantage of A over B: the 'muted news items' can be queried dynamically, i.e. they aren't set in stone at index time. B will probably perform a little bit better the first time (when not cached), but I'm not sure. Hope that helps, Geert-Jan 2010/6/4 homerlex homerlex.nab...@gmail.com How would you model this? We have a table of news items that people can view in their news stream and comment on. Users have the ability to mute items so they never see them in their feed or search results. From what I can see there are a couple of ways to accomplish this. 1 - Post-process the results and do not render any muted news items. The downside is that pagination becomes problematic. It's possible we may forgo pagination because of this, but for now assume that pagination is a requirement. 2 - Whenever we query for a given user we append a clause that excludes all muted items. I assume in Solr we'd need to do something like -item_id(1 AND 2 AND 3). Obviously this doesn't scale very well. 3 - Have a multi-valued property in the index that contains all ids of users who have muted the item. Being new to Solr I don't even know how (or if it's possible) to run a query that says the user id is not in this multi-valued property. Can this even be done (sample query please)? Again, I know this doesn't scale very well. Any other suggestions? Thanks in advance for the help. -- View this message in context: http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html Sent from the Solr - User mailing list archive at Nabble.com.
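For option B, a sketch of what the schema field and the per-user request could look like (the field name, type and user id are illustrative assumptions, not something defined in this thread):

<field name="users_excluded_field" type="string" indexed="true" stored="false" multiValued="true"/>

q=the user's normal search terms
fq=-users_excluded_field:12345

A purely negative filter query like this matches every document whose multi-valued field does not contain the given user id, and after the first request the filterCache makes it cheap to apply per user.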
Faceted Search Slows Down as index gets larger
Hello, I have been dealing with real-time data. As the number of total indexed documents gets larger (now 5M), a faceted search on a text field limited by the creation time, which we use to find the most-used word in all these text fields, slows down. query string: created_time:[NOW-1HOUR TO NOW] facet.field=text facet.mincount=1 The document count matching the query is around 9000. It takes around 80 seconds on a decent computer with 4GB RAM and a quad-core CPU. I do not know the internal details of term indexing and term counts for faceting. Any suggestion for speeding up this query is appreciated. Thanks in advance. -- Furkan Kuru
Re: Faceted Search Slows Down as index gets larger
Faceting on a full-text field is hard. What version of Solr are you using? If it's 1.4 or later, try setting facet.method=enum And to use the filterCache less, try facet.enum.cache.minDf=100 -Yonik http://www.lucidimagination.com On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru furkank...@gmail.com wrote: Hello, I have been dealing with real-time data. As the number of total indexed documents gets larger (now 5 M) a faceted search on a text field limited by the creation time, which we use to find the most used word in all these text fields, gets slow down. query string: created_time:[NOW-1HOUR TO NOW] facet.field=text facet.mincount=1 the document count matching the query is around 9000. It takes around 80 seconds in a decent computer with 4GB ram, quad core cpu I do not know the internal details of term indexing and their counts for faceting. Any suggestion for speeding up this query is appreciated. Thanks in advance. -- Furkan Kuru
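Put together, the suggested request would look something like this (the query and field names come from the original post; the parameter values are simply the ones suggested above):

q=created_time:[NOW-1HOUR TO NOW]
facet=true
facet.field=text
facet.mincount=1
facet.method=enum
facet.enum.cache.minDf=100

facet.method=enum walks the terms of the field and intersects a filter per term, and facet.enum.cache.minDf=100 skips the filterCache for terms that appear in fewer than 100 documents, which keeps memory use down on a full-text field with many rare terms.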
Re: OverlappingFileLockException when using <str name="replicateAfter">startup</str>
Hi guys, I'm experiencing the same issue with a single war. I'm using a brand new Solr war built from yesterday's version of the trunk. I've got one master with 2 cores and one slave with a single core. I'm using one core from the master as the master of the second core (which is configured as a repeater), so that the slave's core can poll the repeater for index changes. (I was using Solr 1.4, but experienced some issues with replication. While rebuilding the index on the one master core, the new index was not replicated successfully to the other master core. Files were copied over but the final commit failed on the snappuller. But sometimes, while restarting the master, the replication would work fine between master cores, then no replication would be successful from the master to the slave core. I had the same issue as described here: https://issues.apache.org/jira/browse/SOLR-1769 , which seems to be fixed in the trunk. So I moved on to the trunk version of Solr in order to test the fix.) This seems to work better, as master core replication works fine. But I've got a weird behavior on the slave. The index replication is successful only the second time the slave tries to get it, and for each replication attempt the slave spits out the following exception (see below). There seems to be a concurrency issue but I don't quite understand where the concurrency is really happening. Can you please help with this issue?

org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.channels.OverlappingFileLockException
at sun.nio.ch.FileChannelImpl$SharedFileLockTable.checkList(FileChannelImpl.java:1170)
at sun.nio.ch.FileChannelImpl$SharedFileLockTable.add(FileChannelImpl.java:1072)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:878)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
at org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:260)
at org.apache.lucene.store.Lock.obtain(Lock.java:72)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1061)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:950)
at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:192)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more
-- View this message in context: http://lucene.472066.n3.nabble.com/OverlappingFileLockException-when-using-str-name-replicateAfter-startup-str-tp488686p870589.html Sent from the Solr - User mailing list archive at Nabble.com.
String Sort Not Working
All, I am trying to sort on a text field and can't get it to work. I try sorting on sortTitle and get no errors; it just doesn't appear to sort. The pertinent parts of my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  ... lots of filters that do work ...
</fieldType>
<fieldType name="sortString" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

<field name="title" type="text" indexed="true" stored="true" termVectors="true" />
<field name="sortTitle" type="sortString" indexed="true" stored="true" />

<copyfield source="title" dest="sortTitle" />

I set stored="true" on the sort field so I could see if anything was getting copied there, and it would appear that this is not the case. I don't see any top-10 summaries like I do for other fields, including another field populated by copyField. Is this just because of the filters I am using? I'm sure this horse has, or similar horses have, been beaten to death before, but I'm new to this mailing list, so sorry about that. Any help is greatly appreciated! Thanks, Patrick
Re: Faceted Search Slows Down as index gets larger
I am using 1.4 version. I have tried your suggestion, it takes around 25-30 seconds now. Thank you, On Fri, Jun 4, 2010 at 5:54 PM, Yonik Seeley yo...@lucidimagination.comwrote: Faceting on a full-text field is hard. What version of Solr are you using? If it's 1.4 or later, try setting facet.method=enum And to use the filterCache less, try facet.enum.cache.minDf=100 -Yonik http://www.lucidimagination.com On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru furkank...@gmail.com wrote: Hello, I have been dealing with real-time data. As the number of total indexed documents gets larger (now 5 M) a faceted search on a text field limited by the creation time, which we use to find the most used word in all these text fields, gets slow down. query string: created_time:[NOW-1HOUR TO NOW] facet.field=text facet.mincount=1 the document count matching the query is around 9000. It takes around 80 seconds in a decent computer with 4GB ram, quad core cpu I do not know the internal details of term indexing and their counts for faceting. Any suggestion for speeding up this query is appreciated. Thanks in advance. -- Furkan Kuru -- Furkan Kuru
RE: index growing with updates
Ok so I think that Solr (Lucene) will only remove deleted/updated documents from the disk after an optimize or after an 'expungeDeletes' request. Is there a way to trigger the expunsion (new word) across the entire index? I tried:

final UpdateRequest request = new UpdateRequest();
request.setParam("expungeDeletes", true);
request.add someofmydocs
server.sendrequest..

But that didn't seem to do the trick as I know I have about 7 Gigs of documents that should be removed from the disk and the index size hasn't really budged. Any ideas? Thanks, Kallin Nagelberg -Original Message- From: Nagelberg, Kallin Sent: Thursday, June 03, 2010 1:36 PM To: 'solr-user@lucene.apache.org' Subject: RE: index growing with updates Is there a way to trigger a purge, or under what conditions does it occur? -Kallin Nagelberg -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, June 03, 2010 12:40 PM To: solr-user@lucene.apache.org Subject: Re: index growing with updates Assuming your config is set up to replace unique keys, you're really doing a delete and an add (under the covers). It could very well be that the deleted version of the document is still in your index taking up space and will be until it is purged. HTH Erick On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: Hey, If I add a document to the index that already exists (same uniquekey) what is the expected behavior? I would imagine that if the document is the same then the index should not grow, but mine appears to be growing. Any ideas? Thanks, -Kallin Nagelberg
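For reference, expungeDeletes is a commit option rather than an option on the add request, so (if I'm not mistaken) it has to be sent with a commit, e.g. by posting this to the /update handler:

<commit expungeDeletes="true"/>

That said, as the reply further down explains, deletes are also reclaimed naturally as segments get merged, so an explicit expunge is rarely required.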
Re: Highlighting a field with a certain value
(10/05/25 0:31), n...@frameweld.com wrote: Hello, How am I able to highlight a field that contains a specific value? If I have a field called type, how am I able to highlight the rows whose values contain something like title? http://localhost:8983/solr/select?q=title&hl=on&hl.fl=type Koji -- http://www.rondhuit.com/en/
Re: String Sort Not Working
<copyfield source="title" dest="sortTitle" /> The lowercase 'f' is what's causing this. It should be copyField.
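That is, the corrected line in schema.xml would read:

<copyField source="title" dest="sortTitle" />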
RE: String Sort Not Working
That did it. Thank you =) P.S. Might it be helpful for Solr to complain about invalid XML during startup? Does it do this and I'm just not noticing? -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Friday, June 04, 2010 12:18 PM To: solr-user@lucene.apache.org Subject: Re: String Sort Not Working <copyfield source="title" dest="sortTitle" /> The lowercase 'f' is what's causing this. It should be copyField.
Need help to install Solr on JBoss
I installed Solr on my local machine and it works fine with Jetty. I am trying to install on JBoss which is running on a Sun Solaris box and I have the following questions: 1. Do I need to copy the entire example folder from my local machine to Solr home on Sun Solaris box? 2. How can I have multiple cores on the Sun Solaris box? Any help is appreciated. Thanks, Murali
RE: String Sort Not Working
P.S. Might it be helpful for Solr to complain about invalid XML during startup? Does it do this and I'm just not noticing? Chris's explanation about a similar topic: http://search-lucene.com/m/11JWX1hxL4u/
RE: String Sort Not Working
Very informative - thank you! I think it might be useful to have this feature - maybe have an interface for plugins to register an XSD or otherwise declare their expected XML elements and attributes. I'm not sure if there's enough demand for this to justify the time it would take to make this change, though. Just a thought. -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Friday, June 04, 2010 1:41 PM To: solr-user@lucene.apache.org Subject: RE: String Sort Not Working P.S. Might it be helpful for Solr to complain about invalid XML during startup? Does it do this and I'm just not noticing? Chris's explanation about a similar topic: http://search-lucene.com/m/11JWX1hxL4u/
conditional Document Boost
Hello out there, I am searching for a solution for conditional document boosting. While analyzing the fields of a document, I want to create a document boost based on some metrics. There are three approaches:

First: I preprocess the data. The main problem with this is that I need to take care of the preprocessing part and I can't do it out of the box (implementing an analyzer, computing the boosting value, and afterwards storing those values or sending them to Solr).

Second: Using an UpdateRequestProcessor (does it work with DIH?). However, this would also be custom work, plus taking care that the used params are up to date.

Third: Setting the document boost while the analysis process is running, with the help of a TokenFilter (is this possible?).

What would you do? I think what I want to do is quite the same as working with Mahout and Solr. I never worked with Mahout - but how can I use it to improve the user's search experience? Where can I use Mahout in Solr, if I want to influence documents' boosts? And where in general (i.e. for classification)? References, ideas and whatever could be useful are welcome :-). Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/conditional-Document-Boost-tp871108p871108.html Sent from the Solr - User mailing list archive at Nabble.com.
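For the second approach, a minimal sketch of what such an UpdateRequestProcessor could look like (the field name "popularity", the threshold and the boost value are assumptions; the factory class and the updateRequestProcessorChain registration in solrconfig.xml are omitted):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class ConditionalBoostProcessor extends UpdateRequestProcessor {

  public ConditionalBoostProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    // Compute a boost from some metric carried in the document (hypothetical field).
    Object metric = doc.getFieldValue("popularity");
    if (metric != null && Float.parseFloat(metric.toString()) > 100f) {
      // Boost the whole document at index time.
      doc.setDocumentBoost(2.0f);
    }
    // Pass the (possibly boosted) document down the processor chain.
    super.processAdd(cmd);
  }
}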
Re: TikaEntityProcessor not working?
You are my hero. I replaced the Tika 0.8 snapshots that were included with Solr with 0.6 and it works now. Thank you! Brad On Jun 3, 2010, at 6:22 AM, David George wrote: Which version of Tika do you have? There was a problem introduced somewhere between Tika 0.6 and Tika 0.7 whereby the TikaConfig method config.getParsers() was returning an empty parser list due to class loader scope issues with Solr running under an application server. There is a fix in the Tika 0.8 branch, and I note that a 0.8 snapshot of Tika is included in the Solr trunk. I've not tried to get this to work and am not sure what config is needed to make it work. I simply installed Tika 0.6, which can be downloaded from the Apache Tika website. -- View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: index growing with updates
: Ok so I think that Solr (lucene) will only remove deleted/updated : documents from the disk after an optimize or after an 'expungeDeletes' : request. Is there a way to trigger the expunsion (new word) across the : entire index? I tried : deletes are removed when segments are merged -- an optimize merges all segments, so it forcibly removes all deleted docs, but regular merges as documents are added/updated will clean things up periodically -- so if you have a fixed set of documents that you keep updating over and over, your index size will not grow without bounds -- it will oscillate between a min (completely optimized) and a max (lots of segments with lots of deletions just about to be merged) -Hoss
Range query on long value
Hi, I have an issue with range queries on a long value in our dataset (the dataset is fairly large, but I believe the problem still exists for smaller datasets). When I query the index with a range, such as id:[1 TO 2000], I get values back that are well outside that range. It's as if the range query is ignoring the values and doing something like id:[* TO *]. We are running Solr 1.3. The value is set as the unique key for the index. Our schema is similar to this:

<field name="id" type="long" indexed="true" stored="true" required="true" />
<field name="field_1" type="slong" indexed="true" stored="false" required="true" />
<field name="field_2" type="long" indexed="true" stored="false" required="false" />
<field name="field_3" type="long" indexed="true" stored="false" required="false" />
. . .
<field name="field_n" type="long" indexed="true" stored="true" required="false" />

<uniqueKey>id</uniqueKey>

Has anyone else had this problem? If so, how did you correct it? Thanks in advance.
Need help with document format
Hi guys, I have a list of consultants, and the users (people who work for the company) are supposed to be able to search for consultants based on the time frame they worked for a company. For example, I should be able to search for all consultants who worked for Bear Stearns in the month of July. What is the best way of accomplishing this? I was thinking of formatting the document like this:

<company>
  <name>Bear Stearns</name>
  <startDate>2000-01-01</startDate>
  <endDate>present</endDate>
</company>
<company>
  <name>AIG</name>
  <startDate>1999-01-01</startDate>
  <endDate>2000-01-01</endDate>
</company>

Is this possible? Thanks, Moazzam
Re: Need help to install Solr on JBoss
Check the wiki 1. Do I need to copy the entire example folder from my local machine to Solr home on Sun Solaris box? http://wiki.apache.org/solr/SolrJBoss 2. How can I have multiple cores on the Sun Solaris box? http://wiki.apache.org/solr/CoreAdmin Regards Juan www.linebee.com Bondiga, Murali wrote: I installed Solr on my local machine and it works fine with Jetty. I am trying to install on JBoss which is running on a Sun Solaris box and I have the following questions: 1. Do I need to copy the entire example folder from my local machine to Solr home on Sun Solaris box? 2. How can I have multiple cores on the Sun Solaris box? Any help is appreciated. Thanks, Murali
Index-time vs. search-time boosting performance
Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Range query on long value
I have an issue with range queries on a long value in our dataset (the dataset is fairly large, but I believe the problem still exists for smaller datasets). When I query the index with a range, such as id:[1 TO 2000], I get values back that are well outside that range. It's as if the range query is ignoring the values and doing something like id:[* TO *]. We are running Solr 1.3. The value is set as the unique key for the index. Our schema is similar to this:

<field name="id" type="long" indexed="true" stored="true" required="true" />
<field name="field_1" type="slong" indexed="true" stored="false" required="true" />
<field name="field_2" type="long" indexed="true" stored="false" required="false" />
<field name="field_3" type="long" indexed="true" stored="false" required="false" />

You need to use the sortable long type (type="slong") in Solr 1.3.0 for range queries to work correctly. The default schema.xml has an explanation of the sortable types (sint etc.).
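For reference, the relevant pieces from the Solr 1.3 example schema.xml look roughly like this (quoted from memory, so double-check against your own example schema):

<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>

<field name="id" type="slong" indexed="true" stored="true" required="true" />

After changing the field type you will need to re-index for range queries to start behaving correctly.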
Re: general debugging techniques?
: to format the data from my sources. I can read through the catalina : log, but this seems to just log requests; not much info is given about : errors or when the service hangs. Here are some examples: if you are only seeing one log line per request, then you are just looking at the request log ... there should be more logs with messages from all over the code base with various levels of severity -- and using standard java log level controls you can turn these up/down for various components. : Although I am keeping document size under 5MB, I regularly see : SEVERE: java.lang.OutOfMemoryError: Java heap space errors. How can : I find what component had this problem? that's one of java's most annoying problems -- even if you have the full stack trace of the OOM, that just tells you which code path was the straw that broke the camel's back -- it doesn't tell you where all your memory was being used. for that you really need to use a java profiler, or turn on heap dumps and use a heap dump analyzer after the OOM occurs. : After the above error, I often see this followup error on the next : document: SEVERE: org.apache.lucene.store.LockObtainFailedException: : Lock obtain timed out: NativeFSLock@/var/lib/solr/data/ : index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock . This has : a backtrace, so I could dive directly into the code. Is this the best : way to track down the problem, or are there debugging settings that : could help show why the lock is being held elsewhere? probably not -- after an OOM (or any other low level error), most java apps are just screwed in general. : I attempted to turn on indexing logging with the line : : <infoStream file="INFOSTREAM.txt">true</infoStream> : : but I can't seem to find this file in either the tomcat or the index directory. it will probably be in whatever the Current Working Directory (CWD) is -- assuming the file permissions allow writing to it. the top of the Solr admin screen tells you what the CWD is, in case it's not clear from how your servlet container is run. -Hoss
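If it helps, on Sun/Oracle HotSpot JVMs the heap dump mentioned above can be produced automatically by adding flags along these lines to the servlet container's JVM options (the dump path is just an example):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/solr/dumps

The resulting .hprof file can then be opened in a heap analyzer (e.g. jhat or Eclipse MAT) to see which objects were holding the memory when the OOM occurred.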
RE: general debugging techniques?
: That is still really small for 5MB documents. I think the default solr : document cache is 512 items, so you would need at least 3 GB of memory : if you didn't change that and the cache filled up. that assumes that the text tika extracts from each document is the same size as the original raw files *and* that he's configured that content field to be stored ... in practice if you only mark the summary fields (title, author, short summary, etc...) as stored="true", the document cache isn't going to be nearly that big (and even if you do store the entire content field, the plain text is usually *much* smaller than the binary source file) : -Xmx128M - my understanding is that this bumps heap size to 128M. FWIW: depending on how many docs you are indexing, and whether you want to support things like faceting that rely on building in-memory caches to be fast, 128MB is really, really, really small for a typical Solr instance. Even on a box that is only doing indexing (no queries) I would imagine Tika likes to have a lot of ram when doing extraction (most doc types are going to require the raw binary data to be entirely in the heap, plus all the extracted Strings, plus all of the connecting objects to build the DOM, etc.). And that's before you even start thinking about Solr, Lucene and the index itself. -Hoss
Re: Index-time vs. search-time boosting performance
Index time boosting is different than search time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory). ...index time boosting is a way of saying this documents' title is more important than other documents' titles. Search time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query HTH Erick On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Faceted Search Slows Down as index gets larger
Yonik, Just curious why does using enum improve the facet performance. Furkan was faceting on a text field with each word being a facet value. I'd imagine that'd mean there's a large number of facet values. According to the documentation (http://wiki.apache.org/solr/SimpleFacetParameters#facet.method) facet.method=fc is faster when a field has many unique terms. So how come enum, not fc, is faster in this case? Also why use filterCache less? Thanks Andy --- On Fri, 6/4/10, Furkan Kuru furkank...@gmail.com wrote: From: Furkan Kuru furkank...@gmail.com Subject: Re: Faceted Search Slows Down as index gets larger To: solr-user@lucene.apache.org, yo...@lucidimagination.com Date: Friday, June 4, 2010, 11:25 AM I am using 1.4 version. I have tried your suggestion, it takes around 25-30 seconds now. Thank you, On Fri, Jun 4, 2010 at 5:54 PM, Yonik Seeley yo...@lucidimagination.comwrote: Faceting on a full-text field is hard. What version of Solr are you using? If it's 1.4 or later, try setting facet.method=enum And to use the filterCache less, try facet.enum.cache.minDf=100 -Yonik http://www.lucidimagination.com On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru furkank...@gmail.com wrote: Hello, I have been dealing with real-time data. As the number of total indexed documents gets larger (now 5 M) a faceted search on a text field limited by the creation time, which we use to find the most used word in all these text fields, gets slow down. query string: created_time:[NOW-1HOUR TO NOW] facet.field=text facet.mincount=1 the document count matching the query is around 9000. It takes around 80 seconds in a decent computer with 4GB ram, quad core cpu I do not know the internal details of term indexing and their counts for faceting. Any suggestion for speeding up this query is appreciated. Thanks in advance. -- Furkan Kuru -- Furkan Kuru
Re: Index-time vs. search-time boosting performance
Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if instead of using the boost function at search time, I indexed the documents with a date-based boost. On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.comwrote: Index time boosting is different than search time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory). ...index time boosting is a way of saying this documents' title is more important than other documents' titles. Search time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query HTH Erick On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Help with Shingled queries
the queryparser first splits on whitespace. so each individual word of your query: short,red,evil,fox gets its own tokenstream, and therefore isn't shingled. On Fri, Jun 4, 2010 at 6:21 PM, Greg Bowyer gbow...@shopzilla.com wrote: Hi all, interesting and by the looks of things very solid project you have here with SOLR, however... I have an index that contains a large number of phrases that I need to search over; each of these phrases is fairly small, being on average about 4 words long. The search terms that I am given to search these phrases are very long and quite arbitrary; sometimes the search terms will be up to 25 words long. As such the performance of my index when built naively is sporadic: sometimes searches are very fast, on average they are somewhat slower. I have attempted to improve this situation by using shingling for the phrases and the related search queries. In my schema I have the following:

<fieldType name="bigramed_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" outputUnigramIfNoNgram="true" />
  </analyzer>
</fieldType>

In the indexes, as seen with Luke, I do indeed have a large range of shingled terms. When I run the analyzer for either query or index terms I also see the breakdown with the shingled terms correctly displayed. However, when I attempt to use this in a query I do not see the terms applied in the debug output. For example, with the term "short red evil fox" I would expect to see the shingles 'short_red' 'red_evil' 'evil_fox', but instead I get the following:

debug:{
  rawquerystring:"short red evil fox",
  querystring:"short red evil fox",
  parsedquery:"+() ()",
  parsedquery_toString:"+() ()",
  explain:{},
  QParser:"DisMaxQParser",
  altquerystring:null,
  boostfuncs:null,
  filter_queries:["atomId:(8235 10914 10911 )"],
  parsed_filter_queries:["atomId:8235 atomId:10914 atomId:10911"],
  timing:{ ..

Does anyone know what I could be doing wrong here? Is it a bug in the debug output, a stupid mistake, misconception or piece of idiocy on my part, or something else? Many thanks -- Greg Bowyer -- Robert Muir rcm...@gmail.com
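If it fits the use case, one workaround (my assumption, not something suggested in this thread) is to hand the whole string to the analyzer as a single unit by quoting it, since a quoted phrase is analyzed as one token stream and the shingles then get produced, e.g.:

q=shingled_field:"short red evil fox"

(the field name is hypothetical; with the dismax setup above the equivalent would be quoting the user query). Whether the resulting phrase/shingle query behaves the way you want still needs to be verified against your index.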
Re: Index-time vs. search-time boosting performance
I've done a lot of recency boosting to documents, and I'm wondering why you would want to do that at index time. If you are continuously indexing new documents, what was recent when it was indexed becomes, over time less recent. Are you unsatisfied with your current performance with the boost function? Query-time recency boosting is a fairly common thing to do, and, if done correctly, shouldn't be a performance concern. -Jay http://lucidimagination.com On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman a...@newscred.com wrote: Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if instead of using the boost function at search time, I indexed the documents with a date-based boost. On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.com wrote: Index time boosting is different than search time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory). ...index time boosting is a way of saying this documents' title is more important than other documents' titles. Search time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query HTH Erick On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
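For what it's worth, a typical query-time recency boost with dismax looks something like this (the field name and constants are only illustrative; 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so scores decay over about a year):

bf=recip(ms(NOW,published_date),3.16e-11,1,1)

If I remember correctly, the ms() function was added in Solr 1.4 and works on trie-based date fields, so the date field type may need adjusting before this can be used.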
Re: Faceted Search Slows Down as index gets larger
On Fri, Jun 4, 2010 at 7:33 PM, Andy angelf...@yahoo.com wrote: Yonik, Just curious why does using enum improve the facet performance. Furkan was faceting on a text field with each word being a facet value. I'd imagine that'd mean there's a large number of facet values. According to the documentation (http://wiki.apache.org/solr/SimpleFacetParameters#facet.method) facet.method=fc is faster when a field has many unique terms. So how come enum, not fc, is faster in this case? facet.method=fc is faster when there are many unique terms, and relatively few terms per document. A full-text field doesn't fit that bill. Also why use filterCache less? Takes up a lot of memory. -Yonik http://www.lucidimagination.com
Re: Does SolrJ support nested annotated beans?
+1 Good question, my use of Solr would benefit from nested annotated beans as well. Awaiting the reply, Thom On 2010-06-03, at 1:35 PM, Peter Hanning wrote: When modeling documents with a lot of fields (hundreds) the bean class used with SolrJ to interact with the Solr index tends to get really big and unwieldy. I was hoping that it would be possible to extract groups of properties into nested beans and move the @Field annotations along. Basically, I want to refactor something like the following:

// Imports have been omitted for this example.
public class TheBigOne {
    @Field("UniqueKey")
    private String uniqueKey;
    @Field("Name_en")
    private String name_en;
    @Field("Name_es")
    private String name_es;
    @Field("Name_fr")
    private String name_fr;
    @Field("Category")
    private String category;
    @Field("Color")
    private String color;
    // Additional properties, getters and setters have been omitted for this example.
}

into something like the following:

// Imports have been omitted for this example.
public class TheBigOne {
    @Field("UniqueKey")
    private String uniqueKey;
    private Names names = new Names();
    private Classification classification = new Classification();
    // Additional properties, getters and setters have been omitted for this example.
}

// Imports have been omitted for this example.
public class Names {
    @Field("Name_en")
    private String name_en;
    @Field("Name_es")
    private String name_es;
    @Field("Name_fr")
    private String name_fr;
    // Additional properties, getters and setters have been omitted for this example.
}

// Imports have been omitted for this example.
public class Classification {
    @Field("Category")
    private String category;
    @Field("Color")
    private String color;
    // Additional properties, getters and setters have been omitted for this example.
}

This did not work however as the DocumentObjectBinder does not seem to walk the nested object graph. Am I doing something wrong, or is this not supported? I see JIRA tickets 1129 and 1357 could alleviate this issue somewhat for the Name* fields once 1.5 comes out. Still, it would be great to be able to nest beans without using dynamic names in the field annotations like in the Classification example above. As a quick and naive test I tried to change the DocumentObjectBinder's collectInfo method to something like the following:

private List<DocField> collectInfo(Class clazz) {
    List<DocField> fields = new ArrayList<DocField>();
    Class superClazz = clazz;
    ArrayList<AccessibleObject> members = new ArrayList<AccessibleObject>();
    while (superClazz != null && superClazz != Object.class) {
        members.addAll(Arrays.asList(superClazz.getDeclaredFields()));
        members.addAll(Arrays.asList(superClazz.getDeclaredMethods()));
        superClazz = superClazz.getSuperclass();
    }
    for (AccessibleObject member : members) {
        if (member.isAnnotationPresent(Field.class)) {
            member.setAccessible(true);
            fields.add(new DocField(member));
        }
        // BEGIN changes
        else {
            // A quick test supporting only Field, not Method and others
            if (member instanceof java.lang.reflect.Field) {
                java.lang.reflect.Field field = (java.lang.reflect.Field) member;
                fields.addAll(collectInfo(field.getType()));
            }
        }
        // END changes
    }
    return fields;
}

This worked in that SolrJ started walking down into nested beans, checking for and handling @Field annotations in the nested beans. However, when trying to retrieve the values of the fields in the nested beans, SolrJ still tried to look for them in the main bean as far as I can tell.
ERROR 2010-06-02 09:28:35,326 (main) () (SolrIndexer.java:335 main) - Exception encountered:
java.lang.RuntimeException: Exception while getting value: private java.lang.String Names.Name_en
at org.apache.solr.client.solrj.beans.DocumentObjectBinder$DocField.get(DocumentObjectBinder.java:377)
at org.apache.solr.client.solrj.beans.DocumentObjectBinder.toSolrInputDocument(DocumentObjectBinder.java:71)
at org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:56)
...
Caused by: java.lang.IllegalArgumentException: Can not set java.lang.String field Names.Name_en to TheBigOne
at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146)
at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150)
at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:37)
at sun.reflect.UnsafeObjectFieldAccessorImpl.get(UnsafeObjectFieldAccessorImpl.java:18)
at java.lang.reflect.Field.get(Field.java:358)
at org.apache.solr.client.solrj.beans.DocumentObjectBinder$DocField.get(DocumentObjectBinder.java:374)
... 7 more

My conclusion is
Re: Index-time vs. search-time boosting performance
It seems like it would be far more efficient to calculate the boost factor once and store it rather than calculating it for each request in real-time. Some of our queries match tens of thousands if not hundreds of thousands of documents in a 15GB index. However, I'm not well-versed in lucene internals so I may be misunderstanding what is going on here. On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill jayallenh...@gmail.com wrote: I've done a lot of recency boosting to documents, and I'm wondering why you would want to do that at index time. If you are continuously indexing new documents, what was recent when it was indexed becomes, over time less recent. Are you unsatisfied with your current performance with the boost function? Query-time recency boosting is a fairly common thing to do, and, if done correctly, shouldn't be a performance concern. -Jay http://lucidimagination.com On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman a...@newscred.com wrote: Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if instead of using the boost function at search time, I indexed the documents with a date-based boost. On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.com wrote: Index time boosting is different than search time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory). ...index time boosting is a way of saying this documents' title is more important than other documents' titles. Search time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query HTH Erick On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
RE: Index-time vs. search-time boosting performance
The SolrRelevancyFAQ does suggest that both index-time and search-time boosting can be used to boost the score of newer documents, but it doesn't suggest in which contexts one might choose one over the other. It only provides an example of a search-time boost, though, so it doesn't answer the question of how to do an index-time boost, if that was a question. http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents Sorry, this doesn't answer your question, but it does contribute the fact that some author of the FAQ at some point considered an index-time boost not necessarily unreasonable. From: Asif Rahman [a...@newscred.com] Sent: Friday, June 04, 2010 11:31 PM To: solr-user@lucene.apache.org Subject: Re: Index-time vs. search-time boosting performance It seems like it would be far more efficient to calculate the boost factor once and store it rather than calculating it for each request in real-time. Some of our queries match tens of thousands if not hundreds of thousands of documents in a 15GB index. However, I'm not well-versed in lucene internals so I may be misunderstanding what is going on here. On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill jayallenh...@gmail.com wrote: I've done a lot of recency boosting to documents, and I'm wondering why you would want to do that at index time. If you are continuously indexing new documents, what was recent when it was indexed becomes, over time less recent. Are you unsatisfied with your current performance with the boost function? Query-time recency boosting is a fairly common thing to do, and, if done correctly, shouldn't be a performance concern. -Jay http://lucidimagination.com On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman a...@newscred.com wrote: Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if instead of using the boost function at search time, I indexed the documents with a date-based boost. On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.com wrote: Index time boosting is different than search time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory). ...index time boosting is a way of saying this documents' title is more important than other documents' titles. Search time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query HTH Erick On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
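For completeness, an index-time boost is normally attached to the document (or to a field) in the update message itself, e.g. in the XML update format (the id value is just an example):

<add>
  <doc boost="2.0">
    <field name="id">doc-123</field>
    ...
  </doc>
</add>

With SolrJ the equivalent is SolrInputDocument.setDocumentBoost(). The downside mentioned earlier in the thread still applies: a date-based boost baked in at index time goes stale as the documents age, so documents would have to be periodically re-indexed to keep the recency effect.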