How to write a custom stemmer for Apache Solr
I have figured out that the stemmers already built into Apache Solr are contained in org.apache.lucene.analysis.nl.* (for Dutch), but I can't find this package in my Lucene folder. I also need to write a stemmer for the Marathi language; any advice on how I should proceed?
Re: error in html???
Hi Satya, This is not a Solr issue. The client that makes the JSON request needs to do some error checking so you catch the error. Occasionally people have Apache set up to return a 200 OK HTTP response with a custom page on HTTP errors (often for spurious security considerations), but this breaks REST-like services such as Solr and IMO should not be done. Take a look at the response coming back from Solr and make sure you are getting the correct HTTP status (500, etc.) when your query errors. If you are, great stuff: you can then check your JSON invocation documentation and catch and handle these HTTP errors in the client. If you're getting a 200 response, check your Apache config. Lee C

On 24 December 2010 05:18, satya swaroop satya.yada...@gmail.com wrote:
Hi Erick, Every result comes in XML format, but when you get an error such as HTTP 500 or HTTP 400, it comes back in HTML format. My question is: can't we get that HTML as JSON, or vice versa? Regards, Satya
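[Editor's note: to make Lee's point concrete, here is a minimal, hypothetical Java sketch of the client-side check he describes: inspect the HTTP status code before treating the body as a JSON result. The host, core, and query parameters are assumptions; adjust them to your setup.]

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SolrJsonClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical query URL; adjust host, core and parameters to your setup.
            URL url = new URL("http://localhost:8983/solr/select?q=test&wt=json");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            int status = conn.getResponseCode();
            // Read the normal stream on success, the error stream on 4xx/5xx,
            // instead of blindly parsing whatever body comes back.
            InputStream in = (status == HttpURLConnection.HTTP_OK)
                    ? conn.getInputStream()
                    : conn.getErrorStream();

            StringBuilder body = new StringBuilder();
            if (in != null) {
                BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
                for (String line; (line = reader.readLine()) != null; ) {
                    body.append(line).append('\n');
                }
                reader.close();
            }

            if (status == HttpURLConnection.HTTP_OK) {
                System.out.println(body); // safe to hand to a JSON parser
            } else {
                // A proxy misconfigured to always return 200 would never reach this
                // branch, which is exactly the problem Lee describes.
                System.err.println("Solr returned HTTP " + status + ": " + body);
            }
        }
    }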
Re: spellcheck
Hi, I'm facing the same problem; did anyone find a solution?
Re: Map failed at getSearcher
At root, it's an OOM: Caused by: java.lang.OutOfMemoryError: Map failed. I'm guessing that you're optimizing after the import? What are the JVM settings you're using? The standard response is to increase the amount of memory available to the JVM, but it's expensive to change it and only find out you're running over the limit *after* a billion docs. The standard advice is to allow the JVM about half the memory available on the machine, leaving the rest for the operating system to use as it sees fit, but that's just a starting point. Hope that helps. Erick

On Fri, Dec 24, 2010 at 1:19 AM, Rok Rejc rokrej...@gmail.com wrote:
Hi all, I have created a new index (using a Solr trunk version from 17th December, running on Windows 7, Tomcat 6, 64-bit JVM) with around 1.1 billion documents (index size around 550GB, mergeFactor=20). After the (csv) import I committed the data and got this error:
HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.IOException: Map failed at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095) at org.apache.solr.core.SolrCore.init(SolrCore.java:587) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4001) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4651) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:546) at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:637) at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:563) at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:498) at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277) at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321) at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at org.apache.catalina.core.StandardHost.start(StandardHost.java:785) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445) at org.apache.catalina.core.StandardService.start(StandardService.java:519) at org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at org.apache.catalina.startup.Catalina.start(Catalina.java:581) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(Unknown Source) at org.apache.lucene.store.MMapDirectory$MultiMMapIndexInput.init(MMapDirectory.java:327) at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:209) at org.apache.lucene.index.CompoundFileReader.init(CompoundFileReader.java:68) at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:208) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:529) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:504) at org.apache.lucene.index.DirectoryReader.init(DirectoryReader.java:123) at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:91) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86) at org.apache.lucene.index.IndexReader.open(IndexReader.java:437) at org.apache.lucene.index.IndexReader.open(IndexReader.java:316) at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084) ... 33 more Caused by: java.lang.OutOfMemoryError: Map failed at
Re: How to write a custom stemmer for Apache Solr
In trunk, it'll be somewhere like: \modules\analysis\common\src\java\org\apache\lucene\analysis\nl, but you haven't said what version you're using. Modules is a relatively new division of the code, so it may be in contrib if you're on an earlier version. I have no clue about the details of what a Marathi stemmer should *do*, but it's just another filter from the Solr perspective, so model it on any of the filters. Subclass TokenFilter; LowerCaseFilter is probably a good model. Drop the resulting jar in a place Solr can find it and you should be good. Best, Erick

On Fri, Dec 24, 2010 at 1:56 AM, nitishgarg nitishgarg1...@gmail.com wrote:
I have figured out that the stemmers already built into Apache Solr are contained in org.apache.lucene.analysis.nl.* (for Dutch), but I can't find this package in my Lucene folder. I also need to write a stemmer for the Marathi language; any advice on how I should proceed?
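[Editor's note: as a rough illustration of Erick's suggestion, a custom stemmer is just a TokenFilter that rewrites each term in incrementToken(). The sketch below uses the trunk/3.1-era attribute API; the class name and the suffix-trimming rule are purely hypothetical placeholders, not real Marathi stemming. To expose it to Solr you would additionally write a small TokenFilterFactory and reference it in your field type's analyzer chain.]

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Hypothetical skeleton of a Marathi stemming filter, modeled on LowerCaseFilter.
    public final class MarathiStemFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public MarathiStemFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false; // no more tokens from the upstream tokenizer
            }
            // Placeholder "stemming": trim one trailing character from long terms.
            // Real Marathi rules (suffix tables, orthographic normalization) go here.
            int len = termAtt.length();
            if (len > 4) {
                termAtt.setLength(len - 1);
            }
            return true;
        }
    }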
Re: Map failed at getSearcher
Hmm, I think you are actually running out of virtual address space, even on 64-bit! http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits Apparently Windows limits you to 8TB of virtual address space (ridiculous), so I think you should try one of the following:

* continue using MMapDirectory, but specify MMapDirectoryFactory yourself and set the maxChunkSize parameter. The default maxChunkSize is Integer.MAX_VALUE, but with a smaller one you might be able to work around fragmentation problems.
* continue using MMapDirectory, but adjust index params such as merge factor.
* use SimpleFSDirectory instead (SimpleFSDirectoryFactory). The big downside is that it's slower and you have no I/O concurrency.

Separately, it might be a good idea to consider splitting your 1.1B documents / 550GB index across more than one machine... :)

On Fri, Dec 24, 2010 at 1:19 AM, Rok Rejc rokrej...@gmail.com wrote:
Hi all, I have created a new index (using a Solr trunk version from 17th December, running on Windows 7, Tomcat 6, 64-bit JVM) with around 1.1 billion documents (index size around 550GB, mergeFactor=20). After the (csv) import I committed the data and got this error: [...]
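[Editor's note: for what the maxChunkSize knob controls underneath, here is a hedged Java sketch against the Lucene API of this era (the index path is hypothetical). MMapDirectory maps each file in chunks of up to maxChunkSize bytes; smaller chunks mean more mappings per file, but each one needs less contiguous virtual address space, which is the fragmentation workaround Robert describes. In Solr you would pass the same parameter to MMapDirectoryFactory in solrconfig.xml rather than call the API directly.]

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.store.MMapDirectory;

    public class MMapChunkSizeExample {
        public static void main(String[] args) throws IOException {
            // Hypothetical index location; point this at a real index directory.
            MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));

            // Default is Integer.MAX_VALUE: each file is mapped in as few, as large
            // regions as possible. Smaller chunks mean more mappings per file, but
            // each mapping needs less contiguous virtual address space.
            dir.setMaxChunkSize(256 * 1024 * 1024); // 256 MB chunks

            System.out.println("max chunk size: " + dir.getMaxChunkSize() + " bytes");
            dir.close();
        }
    }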
Extensibility of compressed='true'
I'd like to have a field that transparently uses FastInfoset to store XML compactly. Ideally, I could supply the XML already in FIS format to SolrJ, but have the application retrieve the field and get the XML 'reconstituted'. Obviously, I'm writing code here, but what? The field would be indexed='false', so it's not an analyzer. Is there some other pluggable component that gets into the pipeline here that could look at bytes arriving and handle bytes upon retrieval?
Re: Map failed at getSearcher
On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
> Hmm, I think you are actually running out of virtual address space, even on 64-bit!

I don't know if there are any x86 processors that allow 64 bits of address space yet. AFAIK, they are mostly 48-bit.

> http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits
> Apparently Windows limits you to 8TB of virtual address space (ridiculous), so I think you should try one of the following:
> * continue using MMapDirectory, but specify MMapDirectoryFactory yourself and set the maxChunkSize parameter. The default maxChunkSize is Integer.MAX_VALUE, but with a smaller one you might be able to work around fragmentation problems.

Hmmm, maybe we should default to a smaller value? Perhaps something like 1G wouldn't impact performance, but could help avoid OOM due to fragmentation?

-Yonik
http://www.lucidimagination.com
Re: Map failed at getSearcher
On Fri, Dec 24, 2010 at 12:28 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
>> Hmm, I think you are actually running out of virtual address space, even on 64-bit!
>
> I don't know if there are any x86 processors that allow 64 bits of address space yet. AFAIK, they are mostly 48-bit.

Right, but I think 128TB (Linux/OS X/Solaris x86, et al.) is a world of difference from Windows' 44-bit view (8TB).

> Hmmm, maybe we should default to a smaller value? Perhaps something like 1G wouldn't impact performance, but could help avoid OOM due to fragmentation?

We already conditionalize the default value... if it would actually help, I think this could be a good idea, but maybe only for Windows (44-bit)?
Re: Map failed at getSearcher
OK, I opened https://issues.apache.org/jira/browse/LUCENE-2832

On Fri, Dec 24, 2010 at 12:44 PM, Robert Muir rcm...@gmail.com wrote:
> [...]
Re: [Import Timeout] using /dataimport
All, That link is great, but I am still getting timeout issues, which cause the entire import to fail. The feeds that are failing are the likes of Newsweek and USA Today, which are very widely used. It's strange, because sometimes they work and sometimes they don't. I think there are still timeout issues, and adding the params suggested in that article doesn't seem to fix it. Adam

On Tue, Dec 21, 2010 at 8:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:
(10/12/22 9:35), Adam Estrada wrote:
> All, I've noticed that there are some RSS feeds that are slow to respond, especially during high-usage times throughout the day. Is there a way to set the timeout to something really high, or have it just wait until the feed is returned? The entire thing stops working when a feed doesn't respond. Your ideas are greatly appreciated. Adam

readTimeout? http://wiki.apache.org/solr/DataImportHandler#Configuration_of_URLDataSource_or_HttpDataSource

Koji
--
http://www.rondhuit.com/en/
Re: Solr branch_3x problems
More details, please. Have you tried all of the different GC implementations? Is there enough memory assigned to the JVM to run comfortably, but not much more? (The OS uses spare memory as disk buffers a lot better than Java does.) How many threads are there? Distributed search uses two searches, both parallelized with one thread per shard. Perhaps they're building up? Do a heap scan with text output every, say, 6 hours. If there is something building up, you might spot it. YourKit is really nice for this kind of problem. Also, RMI is very bad on GC. Are you connecting to Solr or to Tomcat with it? Lance

On Tue, Dec 21, 2010 at 7:09 PM, Alexey Kovyrin ale...@kovyrin.net wrote:
Hello guys, We at scribd.com have recently deployed our new search cluster based on Dec 1st, 2010 branch_3x Solr code, and we're very happy about the new features it brings. Though it looks like we have a weird problem here: once a day our servers handling sharded search queries (frontend servers that receive requests and then fan them out to backend machines) die. Everything looks fine for a day, memory usage is stable, GC is doing its work as usual, and then eventually we get a weird GC activity spike that kills the whole VM, and the only way to bring it back is to kill -9 the tomcat6 VM and restart it. We've tried different GC tuning options and tried to reduce caches to almost zero size; still no luck. So I was wondering if there were any known issues with Solr branch_3x in the last month that could have caused this kind of problem, or if we could provide any more information that could help to track down the issue. Thanks.
--
Alexey Kovyrin
http://kovyrin.net/

--
Lance Norskog
goks...@gmail.com
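[Editor's note: not a substitute for the heap histograms or profiler snapshots Lance has in mind, but a minimal, hypothetical sketch of a lighter-weight check: log heap usage from inside the JVM on a fixed schedule and watch whether the post-GC baseline keeps creeping up between the daily spikes.]

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class HeapWatcher {
        public static void main(String[] args) {
            final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            // Log heap usage periodically (hourly here); a steadily rising
            // post-GC baseline suggests something is accumulating.
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    MemoryUsage heap = memory.getHeapMemoryUsage();
                    System.out.println(System.currentTimeMillis()
                            + " heap used=" + heap.getUsed()
                            + " committed=" + heap.getCommitted()
                            + " max=" + heap.getMax());
                }
            }, 0, 1, TimeUnit.HOURS);
        }
    }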
Re: Explanation of the different caches.
The Field Cache is down in Lucene and has no eviction policy. You search, it loads into the Field Cache, and that's it. If there's not enough memory allocated, you get OutOfMemory. In fact, there's a separate one for each segment. You flush the Field Cache by closing the index and re-opening it; under Solr you would use the multicore API, but sorry, I don't know how. A commit won't flush the field cache of existing segments. If a commit loads a new segment in an index update, that segment will have an empty cache. For the Unix OSes there are little system tricks that make them flush the disk buffers.

On Tue, Dec 21, 2010 at 7:19 AM, Stijn Vanhoorelbeke stijn.vanhoorelb...@gmail.com wrote:
I am aware of the power of the caches. I do not want to completely remove the caches - I want them to be small, so I can launch a stress test with a small amount of data. (Some items may come from the cache, some need to be looked up; right now everything comes from the cache.)

2010/12/21 Toke Eskildsen t...@statsbiblioteket.dk:
Stijn Vanhoorelbeke [stijn.vanhoorelb...@gmail.com] wrote: I want to do a quick-and-dirty load test, but all my results are cached. I commented out all the Solr caches, but still everything is cached. * Can the caching come from the 'Field Collapsing Cache'? Although I don't see this element in my config file. (The system now jumps from 1GB to 7GB of RAM when I do a load test with lots of queries.)

If you allow the JVM to use a maximum of 7GB heap, it is not that surprising that it allocates it when you hammer the searcher. Whether the heap is used for caching or just filled with dead objects waiting for garbage collection is hard to say at this point. Try lowering the maximum heap to 1GB and do your testing again. Also note that Lucene/Solr performance on conventional hard disks benefits a lot from disk caching: if you perform the same search more than once, the speed will increase significantly as relevant parts of the index will (probably) be in RAM. Remember to flush your disk cache between tests.

--
Lance Norskog
goks...@gmail.com
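[Editor's note: to illustrate Lance's point about the Lucene FieldCache, here is a rough sketch against the Lucene 3.x-era API; the index path and field name are hypothetical. The first access un-inverts the whole field into an array keyed by the reader, and nothing is evicted until that reader goes away, which is why heap usage grows under load.]

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.store.FSDirectory;

    public class FieldCacheDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical index path; open read-only.
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")), true);

            // The first call un-inverts the field into an array held by the cache;
            // subsequent calls for the same reader/field return the same array.
            // There is no eviction: the entry lives until the reader is closed
            // and garbage collected.
            String[] titles = FieldCache.DEFAULT.getStrings(reader, "title");
            System.out.println("cached " + titles.length + " values for 'title'");

            reader.close();
        }
    }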