Solr commit takes too long
Hi, We're having a problem when committing to Solr. Our application commits right after each update - we need the data to be available instantaneously. The index's size is about 166M, Solr has 1024M on a dual quad. The update takes a few milliseconds, but the commit takes about 1 minute. Could you please recommend what we should check for? Or perhaps some tuning parameters? Thanks, Marius
Re: caching query result
Here is the response XML faceted by multiple fields including state.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1782</int>
    <lst name="params">
      <str name="facet.limit">-1</str>
      <str name="wt"/>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="sort">score desc</str>
      <str name="facet">true</str>
      <str name="facet.mincount">1</str>
      <str name="fl">duns_number,company_name,phys_state,phys_city,score</str>
      <str name="q">phys_country:United States</str>
      <str name="qt"/>
      <str name="version">2.2</str>
      <str name="explainOther"/>
      <str name="hl.fl"/>
      <arr name="facet.field">
        <str>sales_range</str>
        <str>total_emp_range</str>
        <str>company_type</str>
        <str>phys_state</str>
        <str>sic1</str>
      </arr>
      <str name="indent">on</str>
    </lst>
  </lst>

On 9/6/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 9/6/07, Jae Joo [EMAIL PROTECTED] wrote: I have 13 millions and have facets by states (50). If there is a mechanism to cache, I may get faster results back. How fast are you getting results back with standard field faceting (facet.field=state)?
RE: Solr and KStem
Yes, I don't think the licensing will be a problem as KStem already includes a wrapper for Lucene. Cheers! harry -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, September 07, 2007 4:40 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem Look for KStem in Lucene JIRA. Many years ago something KStem related was contributed, and there was a discussion about licenses then. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Walter Underwood [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, September 7, 2007 4:31:25 PM Subject: Re: Solr and KStem Even if KStem isn't ASL, we could include the plug-in code with notes about how to get the stemmer. Or, the Solr plug-in could be contributed to the group that manages the KStem distribution: http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi wunder On 9/7/07 12:59 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for organization do NOT match on organ!). If there is any interest in feeding this back into Solr I would be happy to contribute it. Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik
quirks with sorting
Hi All. I'm seeing a weird problem with sorting that I can't figure out. I have a query that uses two fields -- a source column and a date column. I search on the source and I sort by the date descending. What I'm seeing is that depending on the value in the source, the date sort works in reverse. For example, the query: content_source:(mv); content_date desc returns 2007-09-10T09:25:00.000Z in its first row, which is what I expect. BUT, the query: content_source:(thomson); content_date desc returns 2008-08-17T00:00:00.000Z, which is the first date we put into Solr. So, simply by changing the value in the field, the sort seems to be reversed (or ignored outright). Now, before you ask, I did a sanity-check query to make sure that there is in fact data for that source from today, and there is. Can anyone help shed some light on this? TIA DW
Re: quirks with sorting
On 9/10/07, David Whalen [EMAIL PROTECTED] wrote: I'm seeing a weird problem with sorting that I can't figure out. I have a query that uses two fields -- a source column and a date column. I search on the source and I sort by the date descending. What I'm seeing is that depending on the value in the source, the date sort works in reverse. For example, the query: content_source:(mv); content_date desc returns 2007-09-10T09:25:00.000Z in its first row, which is what I expect. BUT, the query: content_source:(thomson); content_date desc returns 2008-08-17T00:00:00.000Z, which is the first date we put into Solr. Isn't it the last (highest date) since it's 2008? -Yonik
RE: quirks with sorting
<red-faced>You know, I must have looked at that date 10 times and I never noticed the year. Sorry everyone!</red-faced> -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Monday, September 10, 2007 11:23 AM To: solr-user@lucene.apache.org Subject: Re: quirks with sorting On 9/10/07, David Whalen [EMAIL PROTECTED] wrote: I'm seeing a weird problem with sorting that I can't figure out. I have a query that uses two fields -- a source column and a date column. I search on the source and I sort by the date descending. What I'm seeing is that depending on the value in the source, the date sort works in reverse. For example, the query: content_source:(mv); content_date desc returns 2007-09-10T09:25:00.000Z in its first row, which is what I expect. BUT, the query: content_source:(thomson); content_date desc returns 2008-08-17T00:00:00.000Z, which is the first date we put into Solr. Isn't it the last (highest date) since it's 2008? -Yonik
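The resolution above is easy to see in code: Solr's canonical ISO-8601 timestamps compare lexicographically in the same order as chronologically, so the 2008 document really is the highest date and correctly tops a descending sort. A small standalone sketch (plain Java, illustrative only, not Solr code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DateSortDemo {
    public static void main(String[] args) {
        // ISO-8601 timestamps sort lexicographically in chronological order,
        // so a plain string sort reproduces "content_date desc".
        List<String> dates = new ArrayList<>();
        dates.add("2007-09-10T09:25:00.000Z");
        dates.add("2008-08-17T00:00:00.000Z");
        dates.add("2006-01-01T00:00:00.000Z");
        Collections.sort(dates, Collections.reverseOrder()); // descending
        // The 2008 document comes first: it is the latest date, not a sort bug.
        System.out.println(dates.get(0)); // prints 2008-08-17T00:00:00.000Z
    }
}
```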
Re: Distribution Information?
I guess your solr home isn't configured correctly. FYI, you can set master_status_dir to use full path name (ie /opt/solr/logs/clients in your case). Bill On 9/7/07, Matthew Runo [EMAIL PROTECTED] wrote: OK. I made the change, but it seemed not to pick up the files. When I changed distributiondump.jsp to say... File masterdir = new File("/opt/solr/logs/clients"); it worked. Thank you for your help! ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 7, 2007, at 2:21 PM, Bill Au wrote: I just double checked distribution.jsp. The directory where it looks for status files is hard coded to logs/clients. So for now master_status_dir in your solr/conf/scripts.conf has to be set to that so the scripts will put the status files there. It looks like they are currently in your logs directory. The status files are snapshot.current.search2 and snapshot.status.search2. Bill On 9/7/07, Matthew Runo [EMAIL PROTECTED] wrote: Actually I don't have the clients directory... [EMAIL PROTECTED]: .../logs]$ pwd /opt/solr/logs [EMAIL PROTECTED]: .../logs]$ ls rsyncd-enabled rsyncd.log rsyncd.pid snapcleaner.log snapshooter.log snapshot.current.search2 snapshot.status.search2 [EMAIL PROTECTED]: .../logs]$ It does look like it could be a path issue. I wonder why, though, no clients sub directory was created. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 7, 2007, at 7:43 AM, Bill Au wrote: In that case, definitely take a look at SOLR-333: http://issues.apache.org/jira/browse/SOLR-333 On the master there should be a logs/clients directory. Do you have any files in there? Bill On 9/6/07, Matthew Runo [EMAIL PROTECTED] wrote: Well, I do get... Distribution Info Master Server No distribution info present ... But there appears to be no information filled in. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 6, 2007, at 6:09 AM, Bill Au wrote: That is very strange. 
Even if there is something wrong with the config or code, the static HTML contained in distributiondump.jsp should show up. Are you using the latest version of the JSP? There has been a recent fix: http://issues.apache.org/jira/browse/SOLR-333 Bill On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote: When I load the distributiondump.jsp, there is no output in my catalina.out file. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 5, 2007, at 1:55 PM, Matthew Runo wrote: Not that I've noticed. I'll do a more careful grep soon here - I just got back from a long weekend. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Aug 31, 2007, at 6:12 PM, Bill Au wrote: Are there any error messages in your appserver log files? Bill On 8/31/07, Matthew Runo [EMAIL PROTECTED] wrote: Hello! /solr/admin/distributiondump.jsp This server is set up as a master server, and other servers use the replication scripts to pull updates from it every few minutes. My distribution information screen is blank.. and I couldn't find any information on fixing this in the wiki. Any chance someone would be able to explain how to get this page working, or what I'm doing wrong? ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++
Re: My Solr index keeps growing
On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote: I had created a new index over the weekend, and the final size was a few hundred megs. I just checked and now the index folder is up to 1.7 Gig. Is this due to results being cached? can I set a limit to how large the index will grow? is there anything else that could be affecting this file size? index normally refers to the index files on the disk... is this what you mean? If so, it shouldn't grow unless new documents are added. -Yonik
Re: My Solr index keeps growing
Yes I am talking about the files in the solr/data/index folder. So that folder should stay the same size unless documents are added, and I guess commit and optimize are run. I'll have to watch my app and make sure it is not adding some extra stuff to the index I am not aware of. On 9/10/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote: I had created a new index over the weekend, and the final size was a few hundred megs. I just checked and now the index folder is up to 1.7 Gig. Is this due to results being cached? can I set a limit to how large the index will grow? is there anything else that could be affecting this file size? index normally refers to the index files on the disk... is this what you mean? If so, it shouldn't grow unless new documents are added. -Yonik
Re: How to patch
On 9-Sep-07, at 8:57 PM, James liu wrote: I want to try this patch: https://issues.apache.org/jira/browse/SOLR-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel and I downloaded the Solr 1.2 release patch SOLR-269*.patch (when in '/tmp/apache-solr-1.2.0/src/test/org/apache/solr/update' ) Patches should generally be applied from the top-level solr directory with 'patch -p0' -Mike
RE: adding without overriding dups - DirectUpdateHandler2.java does not implement?
I was unclear. Our use case is that for some data sources we submit the same thing over and over. Overwriting deletes the first one and we end up with long commit times, and also we lose the earliest known date for the document. We would like to have the second update attempt dropped. So we would like to use allowDups=false overwritePending=false overwriteCommitted=false. In DUH2, this case is rejected and contains the comment: // this would need a reader to implement (to be able to check committed // before adding.) Anyway, I think we'll live with it. Thanks, Lance Norskog -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Friday, September 07, 2007 2:47 PM To: solr-user@lucene.apache.org Subject: Re: adding without overriding dups - DirectUpdateHandler2.java does not implement? On 7-Sep-07, at 1:35 PM, Lance Norskog wrote: Hi- It appears that DirectUpdateHandler2.java does not actually implement the parameters that control whether to override existing documents. Should I use No? allowDups=true overwritePending=false overwriteCommitted=false should result in adding docs with no overwriting with DUH2. As yonik said, overwriting is the default behaviour. It is based on uniqueKey, which must be defined for overwriting to work. DirectUpdateHandler instead? Apparently DUH is slower than DUH2, but DUH implements these parameters. (We do so many overwrites that switching to DUH is probably a win.) DUH also does not implement many newer update features, like autoCommit. -Mike
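Until DUH2 grows the reader-based check mentioned in that comment, the "drop the second update" behavior Lance wants can be approximated on the client side by remembering which uniqueKey values have already been submitted and skipping repeats before they ever reach Solr. A hypothetical sketch (class and method names invented for illustration, not a Solr API):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical client-side guard: drop re-submissions of the same uniqueKey
// so the first-seen document (and its earliest-known date) is preserved.
public class FirstWriteWins {
    private final Set<String> seenIds = new HashSet<>();

    /** Returns true the first time a uniqueKey is seen, false on duplicates. */
    public boolean shouldSubmit(String uniqueKey) {
        return seenIds.add(uniqueKey); // Set.add() is false if already present
    }

    public static void main(String[] args) {
        FirstWriteWins guard = new FirstWriteWins();
        System.out.println(guard.shouldSubmit("doc-1")); // true: first sighting, send it
        System.out.println(guard.shouldSubmit("doc-1")); // false: duplicate, drop it
    }
}
```

For large feeds the in-memory set would need to be bounded or persisted, but it avoids the delete-then-re-add cycle that inflates commit times.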
Re: Solr and KStem
Hi Harry, Thanks for your contribution! Unfortunately, we can't include it in Solr unless the necessary legal hurdles are cleared. An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR and you have to attach the file and check the Grant License to ASF button. It is also important to verify that you have the legal right to grant the code to ASF (since it is probably your employer's intellectual property). Legal issues are a hassle, but are unavoidable, I'm afraid. Thanks again, -Mike On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote: Hi Yonik, The modified KStemmer source is attached. The original KStemFilter is now wrapped (and replaced) by KStemFilterFactory. I also changed the path to avoid any naming collisions with existing Lucene code. I included the jar file also, for anyone who wants to just drop and play: - put KStem2.jar in your solr/lib directory. - change your schema to use: <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/> - restart your app server I don't know if you credit contributions, but if so please include OCLC. Seems only fair since I did this on their dime :) Cheers! harry -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 07, 2007 3:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for organization do NOT match on organ!). If there is any interest in feeding this back into Solr I would be happy to contribute it. Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik kstem_solr.tar.gz
Re: New user question: How to show all stored fields in a result
Well, I figured out my problem. User error of course ;-) I was processing documents in two separate steps. The first step added the id and the doctext fields. The second step did an update to add the metadata. I didn't realize that an update command replaced the whole document rather than just the pieces you specify. I altered the process so that everything was added in one step and now things are working much better. The other change I made (which may or may not have contributed to the solution) was to remove all line breaks from the text being submitted to the doctext field. The line breaks were causing solr to interpret the text as having multiple values and forced me to put a multivalued=true attribute in the schema.xml. Removing the line breaks allowed me to remove this attribute. *Breathes giant sigh of relief* -- View this message in context: http://www.nabble.com/New-user-question%3A-How-to-show-all-stored-fields-in-a-result-tf4394773.html#a12599438 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and KStem
Some other notes: I just read the license... it's nice and short, and appears to be ASL compatible to me. We could either include the source in Solr and build it, or add it as a pre-compiled jar into lib. The FilterFactory should probably have its package changed to org.apache.solr.analysis (definitely if it will be included in source form in our repository). -Yonik On 9/10/07, Mike Klaas [EMAIL PROTECTED] wrote: Hi Harry, Thanks for your contribution! Unfortunately, we can't include it in Solr unless the necessary legal hurdles are cleared. An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR and you have to attach the file and check the Grant License to ASF button. It is also important to verify that you have the legal right to grant the code to ASF (since it is probably your employer's intellectual property). Legal issues are a hassle, but are unavoidable, I'm afraid. Thanks again, -Mike On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote: Hi Yonik, The modified KStemmer source is attached. The original KStemFilter is now wrapped (and replaced) by KStemFilterFactory. I also changed the path to avoid any naming collisions with existing Lucene code. I included the jar file also, for anyone who wants to just drop and play: - put KStem2.jar in your solr/lib directory. - change your schema to use: <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/> - restart your app server I don't know if you credit contributions, but if so please include OCLC. Seems only fair since I did this on their dime :) Cheers! harry -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 07, 2007 3:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. 
KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for organization do NOT match on organ!). If there is any interest in feeding this back into Solr I would be happy to contribute it. Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik kstem_solr.tar.gz
RE: Solr and KStem
Hi Yonik and Mike, No problem regarding my employer. I've checked and they are happy to contribute it. I'm not sure what to do about the KStem code though. It was originally written by Bob Krovetz and then modified for Lucene by Sergio Guzman-Lara (both from UMASS Amherst). I modified the Guzman version for Solr. Perhaps I should contribute only what I modified, with instructions for making it work? Let me know... harry -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Monday, September 10, 2007 2:49 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem Hi Harry, Thanks for your contribution! Unfortunately, we can't include it in Solr unless the necessary legal hurdles are cleared. An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR and you have to attach the file and check the Grant License to ASF button. It is also important to verify that you have the legal right to grant the code to ASF (since it is probably your employer's intellectual property). Legal issues are a hassle, but are unavoidable, I'm afraid. Thanks again, -Mike On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote: Hi Yonik, The modified KStemmer source is attached. The original KStemFilter is now wrapped (and replaced) by KStemFilterFactory. I also changed the path to avoid any naming collisions with existing Lucene code. I included the jar file also, for anyone who wants to just drop and play: - put KStem2.jar in your solr/lib directory. - change your schema to use: <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/> - restart your app server I don't know if you credit contributions, but if so please include OCLC. Seems only fair since I did this on their dime :) Cheers! 
harry -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 07, 2007 3:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for organization do NOT match on organ!). If there is any interest in feeding this back into Solr I would be happy to contribute it. Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik kstem_solr.tar.gz
Re: DirectSolrConnection, write.lock and Too Many Open Files
We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one. That doesn't sound promising. I'll throw in synchronization around the update code and see what happens. That doesn't seem good for performance though. Can Solr as a web app handle multiple updates at once or does it synchronize to avoid it? Thanks, Adrian Sutton http://www.symphonious.net
Re: DirectSolrConnection, write.lock and Too Many Open Files
On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote: We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one. That doesn't sound promising. I'll throw in synchronization around the update code and see what happens. That doesn't seem good for performance though. Can Solr as a web app handle multiple updates at once or does it synchronize to avoid it? Solr can handle multiple simultaneous updates. The entire request processing is concurrent, as is the document analysis. Only the final write is synchronized (this includes lucene segment merging). In the future, segment merging will occur in a separate thread, further improving concurrency. -Mike
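The structure Mike describes — analysis running on many threads at once, with only the final index write serialized — looks roughly like this. A simplified sketch with invented names (not actual Solr source):

```java
// Simplified sketch of Solr's update concurrency model: document analysis
// touches no shared state and runs concurrently; only the final write into
// the index is guarded by a lock.
public class UpdatePipeline {
    private final Object writeLock = new Object();
    private final StringBuilder index = new StringBuilder(); // stand-in for the Lucene index

    public void update(String rawDoc) {
        // Analysis phase: per-document work, safe on any number of threads.
        String analyzed = rawDoc.trim().toLowerCase();
        // Write phase: one thread at a time, like the synchronized index write.
        synchronized (writeLock) {
            index.append(analyzed).append('\n');
        }
    }

    public int indexedLength() {
        synchronized (writeLock) {
            return index.length();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        UpdatePipeline pipeline = new UpdatePipeline();
        Thread a = new Thread(() -> pipeline.update("Hello"));
        Thread b = new Thread(() -> pipeline.update("World"));
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(pipeline.indexedLength()); // 12: two 5-char docs + 2 newlines
    }
}
```

Because only the short write section is serialized, throughput scales with the number of threads doing analysis, which is usually the expensive part.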
Re: DirectSolrConnection, write.lock and Too Many Open Files
On Sep 10, 2007, at 5:00 PM, Mike Klaas wrote: On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote: We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one. That doesn't sound promising. I'll throw in synchronization around the update code and see what happens. That doesn't seem good for performance though. Can Solr as a web app handle multiple updates at once or does it synchronize to avoid it? Solr can handle multiple simultaneous updates. The entire request processing is concurrent, as is the document analysis. Only the final write is synchronized (this includes lucene segment merging). Yes, I do want to disclaim that it's very likely my thread problems are an implementation detail w/ JNI, nothing to do w/ DSC. -b
Re: DirectSolrConnection, write.lock and Too Many Open Files
The other problem is that after some time we get a Too Many Open Files error when autocommit fires. Have you checked your ulimit settings? http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82 ulimit -n <number>. As Mike mentioned, you may also want to use 'single' as the lockType. In solrconfig set: <indexDefaults> ... <lockType>single</lockType> </indexDefaults> I could of course switch to using the Solr webapp since we're running in Tomcat anyway, however I really like the ability to have a single WAR file that contains everything and also not have to worry about actually making HTTP requests and the complexity that adds. This sounds like a good candidate to try solrj: http://wiki.apache.org/solr/Solrj This way you write your app independent of how you connect to solr. It also takes care of the XML parsing for you and lets you work with objects rather than strings. ryan
Removing lengthNorm from the calculation
I know I'm missing something really obvious, but I'm spinning my wheels figuring out how to eliminate lengthNorm from the calculations. The specific problem I'm trying to solve is that naive queries are resulting in crummy short records near the top of the list. The reality is that the longer records tend to be higher quality, so if anything, they need to be emphasized. However, I'm missing something simple. Any advice or a pointer to an example I could model off would be greatly appreciated. Thanks, kyle
Re: Removing lengthNorm from the calculation
If you aren't using index-time document boosting, or field boosting for that field specifically, then set omitNorms=true for that field in the schema, shut down solr, completely remove the index, and then re-index. The norms for each field consist of the index-time boost multiplied by the length normalization. -Yonik On 9/10/07, Kyle Banerjee [EMAIL PROTECTED] wrote: I know I'm missing something really obvious, but I'm spinning my wheels figuring out how to eliminate lengthNorm from the calculations. The specific problem I'm trying to solve is that naive queries are resulting in crummy short records near the top of the list. The reality is that the longer records tend to be higher quality, so if anything, they need to be emphasized. However, I'm missing something simple. Any advice or a pointer to an example I could model off would be greatly appreciated. Thanks, kyle
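For context on exactly what omitNorms turns off: Lucene's default length normalization is 1/sqrt(numTokens), so a 5-token field gets ten times the norm of a 500-token one — which is precisely why crummy short records float to the top. A quick standalone illustration (plain Java, mirroring the default formula, not Solr code):

```java
// Lucene's default length normalization is 1 / sqrt(numTokens): short
// fields get a large per-field boost, long fields a small one. Setting
// omitNorms="true" removes this factor (and index-time boosts) entirely.
public class LengthNormDemo {
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(5));   // ~0.447 for a 5-token field
        System.out.println(lengthNorm(500)); // ~0.0447 for a 500-token field
        // A 100x length difference yields a 10x scoring advantage for the
        // short field; with norms omitted, both contribute equally.
    }
}
```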
Re: DirectSolrConnection, write.lock and Too Many Open Files
Adrian Sutton wrote: On 11/09/2007, at 7:21 AM, Ryan McKinley wrote: The other problem is that after some time we get a Too Many Open Files error when autocommit fires. Have you checked your ulimit settings? http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82 ulimit -n <number>. Yeah I'm aware of the ulimit, I'm just keen to identify what's causing it to happen before starting to increase limits. Given the write.lock errors as well I'm particularly suspicious of it. That said, most likely it happens whenever a search and a write are happening at the same time and two sets of the files get opened which is enough to kick it over the limit. The fact that it fixes itself is a good indication that it's not a file handle leak. lucene opens a lot of files. It can easily get beyond 1024 (I think the default). I'm no expert on how the file handling works, but I think more files are open if you are searching and writing at the same time. If you can't increase the limit you can try: <useCompoundFile>true</useCompoundFile> It is slower, but if you are unable to change the ulimit on the deployed machines... As Mike mentioned, you may also want to use 'single' as the lockType. In solrconfig set: <indexDefaults> ... <lockType>single</lockType> </indexDefaults> I'll give that a go. Looks like it didn't make it into Solr 1.2 so I'll try upgrading to the nightly build. If you need to use this in production soon, I'd suggest sticking with 1.2 for a while. There has been a LOT of action in trunk and it may be good to let it settle before upgrading a production system. You should not need to upgrade to fix the write.lock and Too Many Open Files problem. Try increasing ulimit or using a compound file before upgrading. Just when you think you know everything on the wiki someone finally updates it!
Re: DirectSolrConnection, write.lock and Too Many Open Files
On 11/09/2007, at 8:46 AM, Ryan McKinley wrote: lucene opens a lot of files. It can easily get beyond 1024. (I think the default). I'm no expert on how the file handling works, but I think more files are open if you are searching and writing at the same time. If you can't increase the limit you can try: useCompoundFiletrue/useCompoundFile It is slower, but if you are unable to change the ulimit on the deployed machines I've done a bit of poking on the server and ulimit doesn't seem to be the problem: e2wiki:~$ ulimit unlimited e2wiki:~$ cat /proc/sys/fs/file-max 170355 So there's either something going on behind my back (quite possible, it's a VM) or lucene is opening a really insane number of files. I did check that those values were the same for the tomcat55 user that Tomcat actually runs as. An lsof -p on the Tomcat process always shows 40 files in use, the total open files sits around 1000-1500 even when reindexing all the content. I'll watch it a bit more over time and see what happens. I notice that Confluence recommends at least 20 for the max file limit, at least before they switched to compound indexing so it's possible that the 170355 limit could be reached, but it seems unlikely with our load. If you need to use this in production soon, I'd suggest sticking with 1.2 for a while. There has been a LOT of action in trunk and it may be good to let it settle before upgrading a production system. You should not need to upgrade to fix the write.lock and Too Many Open Files problem. Try increasing ulimit or using a compoundfile before upgrading. We're quite a way off of real production, it's just internal use at the moment (on the real product server, but we're a small company so we can handle having some problems). I'll try out the current nightly build and see how it goes, as much as anything out of interest but probably won't pull new builds very often. Thanks again, Adrian Sutton http://www.symphonious.net
Re: Removing lengthNorm from the calculation
On 10-Sep-07, at 3:31 PM, Kyle Banerjee wrote: I know I'm missing something really obvious, but I'm spinning my wheels figuring out how to eliminate lengthNorm from the calculations. The specific problem I'm trying to solve is that naive queries are resulting in crummy short records near the top of the list. The reality is that the longer records tend to be higher quality, so if anything, they need to be emphasized. However, I'm missing something simple. Any advice or a pointer to an example I could model off would be greatly appreciated. Thanks, My lengthNorm() method is filled with clauses like: } else if (whatever.equals(fieldName)) { return super.lengthNorm(fieldName, Math.max(numTokens, MIN_LENGTH)); where MIN_LENGTH can be quite long for some fields. -Mike
Re: DirectSolrConnection, write.lock and Too Many Open Files
I've done a bit of poking on the server and ulimit doesn't seem to be the problem: e2wiki:~$ ulimit unlimited e2wiki:~$ cat /proc/sys/fs/file-max 170355 try: ulimit -n ulimit on its own is something else. On my machine I get: [EMAIL PROTECTED]:~$ ulimit unlimited [EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max 364770 [EMAIL PROTECTED]:~$ ulimit -n 1024 I have to run: ulimit -n 2 to get lucene to run w/ a large index...
Re: New user question: How to show all stored fields in a result
On Sep 10, 2007, at 3:07 PM, Mike Klaas wrote: On 10-Sep-07, at 11:54 AM, melkink wrote: The other change I made (which may or may not have contributed to the solution) was to remove all line breaks from the text being submitted to the doctext field. The line breaks were causing solr to interpret the text as having multiple values and forced me to put a multivalued=true attribute in the schema.xml. Removing the line breaks allowed me to remove this attribute. Interesting--I've never seen this behaviour (I definitely store fields with linebreaks in strings). Are you sure that it isn't your own framework that is generating multiple field entries for this input case? Interestingly the solr-ruby library would create multiple field versions before I fixed the issue. A document like this: {:id = 123, :text = a newline\nin the middle} would require text to be multiValued. The reason was because the magic under the covers looks at the field value objects and iterates over them if they implement the #each method. String#each returns each _line_ - *sigh* (going away in later versions of Ruby, thank goodness). melkink - are you using solr-ruby? If so, that bug has been fixed in later versions ;) Erik
Re: DirectSolrConnection, write.lock and Too Many Open Files
On 11/09/2007, at 9:48 AM, Ryan McKinley wrote: try: ulimit -n ulimit on its own is something else. On my machine I get: [EMAIL PROTECTED]:~$ ulimit unlimited [EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max 364770 [EMAIL PROTECTED]:~$ ulimit -n 1024 I have to run: ulimit -n 2 to get lucene to run w/ a large index... Bingo, I'm an idiot - or rather, I now know *why* I'm an idiot. :) I'll give it a go. Also, this is likely to be the cause of my write.lock problems - the Too many files exception just occurred and the write.lock file gets left around (should have seen that one coming too). Thanks for your help, I'm anticipating that this will solve our problems. Regards, Adrian Sutton http://www.symphonious.net
Re: Solr and KStem
Hello, I would like to test this and have a few questions (please excuse what may seem naive questions). I would like to verify that this is purely a configuration feature -- since the schema.xml defines the analysis/tokenizer chain, no other changes are required. Also, the source seems to say that a lower case factory needs to be farther down the tokenizer chain. So does this mean that the KStem factory appears before the lower case filter factory in the schema.xml? Is there a recommended (required?) tokenizer factory? I am using the WhiteSpaceFactory which seems OK. Finally, I take it that I need to remove the EnglishPorterFilterFactory item in the schema.xml -- or no? Thanks, Bill On 9/10/07, Wagner,Harry [EMAIL PROTECTED] wrote: Hi Yonik, The modified KStemmer source is attached. The original KStemFilter is now wrapped (and replaced) by KStemFilterFactory. I also changed the path to avoid any naming collisions with existing Lucene code. I included the jar file also, for anyone who wants to just drop and play: - put KStem2.jar in your solr/lib directory. - change your schema to use: <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/> - restart your app server I don't know if you credit contributions, but if so please include OCLC. Seems only fair since I did this on their dime :) Cheers! harry -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, September 07, 2007 3:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr and KStem On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote: I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (i.e., searches for organization do NOT match on organ!). If there is any interest in feeding this back into Solr I would be happy to contribute it. Absolutely. We need to make sure that the license for that k-stemmer is ASL compatible of course. -Yonik