How to write a custom stemmer for Apache Solr

2010-12-24 Thread nitishgarg

I have figured out that the stemmers already built into Apache Solr are
contained in org.apache.lucene.analysis.nl.* (for Dutch), but I can't find
this package in my Lucene folder.
Also, I need to write a stemmer for the Marathi language; any help on how I
should proceed?


Re: error in html???

2010-12-24 Thread lee carroll
Hi Satya,

This is not a Solr issue. In your client, which makes the JSON request, you
need to have some error checking so you catch the error.

Occasionally people have Apache set up to return a 200 OK HTTP response with
a custom page on HTTP errors (often for spurious security considerations),
but this breaks REST-like services such as Solr and IMO should not
be done.

Take a look at the response coming back from Solr and make sure you are
getting the correct HTTP status (500 etc.) when your queries error.
If you are, great stuff: you can then check your JSON client's
documentation and catch and deal with these HTTP errors in the client. If
you're getting a 200 response, check your Apache config.
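
For example, a minimal client-side sketch of that check (Java with
HttpURLConnection; the URL, core and query are placeholders for your setup):

  import java.io.IOException;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class SolrJsonCheck {
      public static void main(String[] args) throws IOException {
          URL url = new URL("http://localhost:8983/solr/select?q=foo&wt=json");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          int status = conn.getResponseCode();
          if (status >= 400) {
              // 4xx/5xx from Solr (or a proxy error page) ends up here;
              // conn.getErrorStream() has the body if you want to log it.
              System.err.println("Solr returned HTTP " + status
                  + "; not parsing the body as JSON");
              return;
          }
          // 2xx: safe to read conn.getInputStream() and parse it as JSON.
      }
  }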

lee c

On 24 December 2010 05:18, satya swaroop satya.yada...@gmail.com wrote:

 Hi Erick,
   Every result comes in XML format. But when we get any errors
 like HTTP 500 or HTTP 400, we get them in HTML format. My query is:
 can't we make that HTML into JSON, or vice versa?

 Regards,
 satya



Re: spellcheck

2010-12-24 Thread Hasnain

Hi,
   
   I'm facing the same problem; did anyone find a solution?



Re: Map failed at getSearcher

2010-12-24 Thread Erick Erickson
At root, it's an OOM:
Caused by: java.lang.OutOfMemoryError: Map failed at

I'm guessing that you're optimizing after the import? What are the
JVM settings you're using? The standard response is to increase
the amount of memory available to the JVM, but it's expensive
to change it only to find out you're running over the limit
*after* a billion docs.

The standard advice is to allow the JVM about half the memory available
on the machine, leaving the rest for the operating system to use as it sees fit,
but that's just a starting point.
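
For example (a sketch only; the right numbers depend on your hardware and
index), on a machine with 32GB of RAM you might start Tomcat with roughly
half of that as heap and watch GC behaviour from there:

  CATALINA_OPTS="-Xms16g -Xmx16g"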

Hope that helps
Erick

On Fri, Dec 24, 2010 at 1:19 AM, Rok Rejc rokrej...@gmail.com wrote:

 Hi all,

 I have created a new index (using a Solr trunk version from 17th December,
 running on Windows 7 & Tomcat 6, 64-bit JVM) with around 1.1 billion
 documents (index size around 550GB, mergeFactor=20).

 After the (csv) import I have committed the data and got this error:

 HTTP Status 500 - Severe errors in solr configuration. Check your log files
 for more detailed information on what may be wrong.
 -
 java.lang.RuntimeException: java.io.IOException: Map failed at
 org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095) at
 org.apache.solr.core.SolrCore.init(SolrCore.java:587) at
 org.apache.solr.core.CoreContainer.create(CoreContainer.java:660) at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:412) at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at

 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
 at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
 at

 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
 at

 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
 at

 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
 at

 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4001)
 at
 org.apache.catalina.core.StandardContext.start(StandardContext.java:4651)
 at

 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
 at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
 at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:546) at

 org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:637)
 at

 org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:563)
 at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:498)
 at
 org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277) at
 org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321)
 at

 org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
 at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at
 org.apache.catalina.core.StandardHost.start(StandardHost.java:785) at
 org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at
 org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445) at
 org.apache.catalina.core.StandardService.start(StandardService.java:519) at
 org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at
 org.apache.catalina.startup.Catalina.start(Catalina.java:581) at
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
 sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
 java.lang.reflect.Method.invoke(Unknown Source) at
 org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
 org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
 java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(Unknown
 Source) at

 org.apache.lucene.store.MMapDirectory$MultiMMapIndexInput.init(MMapDirectory.java:327)
 at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:209)
 at

 org.apache.lucene.index.CompoundFileReader.init(CompoundFileReader.java:68)
 at

 org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:208)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:529) at
 org.apache.lucene.index.SegmentReader.get(SegmentReader.java:504) at
 org.apache.lucene.index.DirectoryReader.init(DirectoryReader.java:123) at
 org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:91)
 at

 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86) at
 org.apache.lucene.index.IndexReader.open(IndexReader.java:437) at
 org.apache.lucene.index.IndexReader.open(IndexReader.java:316) at

 org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084) ... 33
 more
 Caused by: java.lang.OutOfMemoryError: Map failed at
 

Re: How to write a custom stemmer for Apache Solr

2010-12-24 Thread Erick Erickson
In trunk, it'll be somewhere like:
\modules\analysis\common\src\java\org\apache\lucene\analysis\nl

but you haven't said what version you're using. Modules is a relatively
new division of code, so it may be in contrib if you're on an earlier
version.

I have no clue about the details of what a Marathi stemmer should *do*, but
it's just another filter from the Solr perspective, so model it on
any of the filters. Subclass from TokenFilter. Probably LowerCaseFilter
is a good model. Drop the resulting jar in a place Solr can find it and you
should be good.
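
A bare-bones sketch of such a filter (the class name and the empty stem()
are placeholders, and this assumes the CharTermAttribute API from
branch_3x/trunk; older releases use TermAttribute instead):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class MarathiStemFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      public MarathiStemFilter(TokenStream input) {
          super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
          if (!input.incrementToken()) {
              return false;                 // end of the token stream
          }
          // Rewrite the term buffer in place and record its new length.
          int newLength = stem(termAtt.buffer(), termAtt.length());
          termAtt.setLength(newLength);
          return true;
      }

      // Placeholder for the actual Marathi stemming rules.
      private int stem(char[] buffer, int length) {
          return length;
      }
  }

You'd also wrap it in a small factory class (Solr's BaseTokenFilterFactory
is the usual parent) so the filter can be referenced from schema.xml.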

Best
Erick

On Fri, Dec 24, 2010 at 1:56 AM, nitishgarg nitishgarg1...@gmail.comwrote:


  I have figured out that the stemmers already built into Apache Solr are
 contained in org.apache.lucene.analysis.nl.* (for Dutch), but I can't find
 this package in my Lucene folder.
 Also, I need to write a stemmer for the Marathi language; any help on how I
 should proceed?



Re: Map failed at getSearcher

2010-12-24 Thread Robert Muir
hmm, i think you are actually running out of virtual address space,
even on 64-bit!

http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits

Apparently Windows limits you to 8TB of virtual address space
(ridiculous), so I think you should try one of the following:
* continue using mmap directory, but specify MMapDirectoryFactory
yourself, and specify the maxChunkSize parameter (see the sketch after
this list). The default maxChunkSize is Integer.MAX_VALUE, but with a
smaller one you might be able to work around fragmentation problems.
* continue using mmap directory, but adjust index params such as merge factor.
* use SimpleFSDirectory instead (SimpleFSDirectoryFactory). But the
big downside is that it's slower and you have no i/o concurrency.
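
A hypothetical solrconfig.xml fragment for the first option (this assumes the
MMapDirectoryFactory in your trunk build reads maxChunkSize from its init
args; the 256MB value is just an example starting point):

  <directoryFactory name="DirectoryFactory"
                    class="org.apache.solr.core.MMapDirectoryFactory">
    <!-- smaller chunks mean more, smaller mappings, which can sidestep
         address-space fragmentation at some cost -->
    <int name="maxChunkSize">268435456</int>
  </directoryFactory>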

separately, it might be a good idea to consider splitting up your 1.1B
documents/550GB index across more than one machine... :)

On Fri, Dec 24, 2010 at 1:19 AM, Rok Rejc rokrej...@gmail.com wrote:
 Hi all,

 I have created a new index (using a Solr trunk version from 17th December,
 running on Windows 7 & Tomcat 6, 64-bit JVM) with around 1.1 billion
 documents (index size around 550GB, mergeFactor=20).

 After the (csv) import I have committed the data and got this error:

 HTTP Status 500 - Severe errors in solr configuration. Check your log files
 for more detailed information on what may be wrong.
 -
 java.lang.RuntimeException: java.io.IOException: Map failed at
 org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095) at
 [... rest of the stack trace snipped; identical to the one quoted earlier in this thread ...]

Extensibility of compressed='true'

2010-12-24 Thread Benson Margulies
I'd like to have a field that transparently uses FastInfoset to store
XML compactly.

Ideally, I could supply the XML already in FIS format to SolrJ, but
have the application retrieve the field and get the XML 'reconstituted'.

Obviously, I'm writing code here, but what? The field would be
indexed='false', so it's not an analyzer. Is there some other
pluggable component that gets into the pipeline here that could look at
the bytes arriving and handle the bytes upon retrieval?


Re: Map failed at getSearcher

2010-12-24 Thread Yonik Seeley
On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
 hmm, i think you are actually running out of virtual address space,
 even on 64-bit!

I don't know if there are any x86 processors that allow 64 bits of
address space yet.
AFAIK, they are mostly 48 bit.

 http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits

 Apparently windows limits you to 8TB virtual address space
 (ridiculous), so i think you should try one of the following:
 * continue using mmap directory, but specify MMapDirectoryFactory
 yourself, and specify the maxChunkSize parameter. The default
 maxChunkSize is Integer.MAX_VALUE, but with a smaller one you might be
 able to work around fragmentation problems.

Hmmm, maybe we should default to a smaller value?  Perhaps something
like 1G wouldn't impact performance, but could help avoid OOM due to
fragmentation?

-Yonik
http://www.lucidimagination.com


Re: Map failed at getSearcher

2010-12-24 Thread Robert Muir
On Fri, Dec 24, 2010 at 12:28 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
 hmm, i think you are actually running out of virtual address space,
 even on 64-bit!

 I don't know if there are any x86 processors that allow 64 bits of
 address space yet.
 AFAIK, they are mostly 48 bit.

Right, but 128TB (Linux/OSX/Solaris x86 et al.) is, I think, a world of
difference from Windows' 44-bit view (8TB).


 Hmmm, maybe we should default to a smaller value?  Perhaps something
 like 1G wouldn't impact performance, but could help avoid OOM due to
 fragmentation?


We already conditionalize the default value... if it would actually
help, I think this could be a good idea, but maybe only for Windows
(44-bit)?


Re: Map failed at getSearcher

2010-12-24 Thread Robert Muir
OK, I opened https://issues.apache.org/jira/browse/LUCENE-2832

On Fri, Dec 24, 2010 at 12:44 PM, Robert Muir rcm...@gmail.com wrote:
 On Fri, Dec 24, 2010 at 12:28 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote:
 hmm, i think you are actually running out of virtual address space,
 even on 64-bit!

 I don't know if there are any x86 processors that allow 64 bits of
 address space yet.
 AFAIK, they are mostly 48 bit.

 Right, but 128TB (Linux/OSX/Solaris x86 et al.) is, I think, a world of
 difference from Windows' 44-bit view (8TB).


 Hmmm, maybe we should default to a smaller value?  Perhaps something
 like 1G wouldn't impact performance, but could help avoid OOM due to
 fragmentation?


 We already conditionalize the default value... if it would actually
 help, I think this could be a good idea, but maybe only for Windows
 (44-bit)?



Re: [Import Timeout] using /dataimport

2010-12-24 Thread Adam Estrada
All,

That link is great, but I am still getting timeout issues, which cause the
entire import to fail. The feeds that are failing, like Newsweek and USA
Today, are very widely used. It's strange because sometimes they work
and sometimes they don't. I think there are still timeout issues, and
adding the params suggested in that article doesn't seem to fix it.
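
For reference, the timeouts in question are attributes on the DIH data
source, e.g. (a sketch; values are in milliseconds and only a guess at a
starting point):

  <dataSource type="URLDataSource"
              connectionTimeout="5000"
              readTimeout="30000"/>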

Adam

On Tue, Dec 21, 2010 at 8:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/12/22 9:35), Adam Estrada wrote:

 All,

 I've noticed that there are some RSS feeds that are slow to respond,
 especially during high usage times throughout the day. Is there a way to
 set
 the timeout to something really high or have it just wait until the feed
 is
 returned? The entire thing stops working when the feed doesn't respond.

 Your ideas are greatly appreciated.
 Adam

  readTimeout?

 http://wiki.apache.org/solr/DataImportHandler#Configuration_of_URLDataSource_or_HttpDataSource

 Koji
 --
 http://www.rondhuit.com/en/



Re: Solr branch_3x problems

2010-12-24 Thread Lance Norskog
More details, please. You tried all of the different GC
implementations? Is there enough memory assigned to the JVM to run
comfortably, but not much more? (The OS uses spare memory as disk
buffers a lot better than Java does.)

How many threads are there? Distributed search uses two searches, both
parallelized with 1 thread per shard. Perhaps they're building up?

Do a heap scan with text output every, say, 6 hours. If there is
something building up, you might spot it.
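
For example (a sketch; substitute your Tomcat process id), jmap's text
histogram is enough to diff between runs:

  jmap -histo <tomcat-pid> > heap-histo.txt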

YourKit is really nice for this kind of problem.

Also RMI is very bad on GC. Are you connecting to Solr or the Tomcat with it?

Lance

On Tue, Dec 21, 2010 at 7:09 PM, Alexey Kovyrin ale...@kovyrin.net wrote:
 Hello guys,

 We at scribd.com have recently deployed our new search cluster based
 on the Dec 1st, 2010 branch_3x Solr code, and we're very happy about the
 new features it brings.
 Though it looks like we have a weird problem here: once a day our servers
 handling sharded search queries (frontend servers that receive
 requests and then fan them out to backend machines) die. Everything
 looks cool for a day, memory usage is stable, GC is doing its work as
 usual, and then eventually we get a weird GC activity spike that
 kills the whole VM; the only way to bring it back is to kill -9 the
 tomcat6 VM and restart it. We've tried different GC tuning options and
 tried to reduce caches to almost zero size, still no luck.

 So I was wondering if there were any known issues with solr branch 3x
 in the last month that could have caused this kind of problems or if
 we could provide any more information that could help to track down
 the issue.

 Thanks.

 --
 Alexey Kovyrin
 http://kovyrin.net/




-- 
Lance Norskog
goks...@gmail.com


Re: Explanation of the different caches.

2010-12-24 Thread Lance Norskog
The Field Cache is down in Lucene and has no eviction policy. You
search, it loads into the Field Cache, and that's it. If there's not
enough memory allocated, you get OutOfMemory. In fact there's a
separate one for each segment.

You flush the Field Cache by closing the index and re-opening it.
Under Solr you would use the multicore API. Sorry, don't know how. A
commit won't flush the field cache of existing segments. If a commit()
loads a new segment in an index update, that will have an empty cache.
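
For what it's worth, the multicore call would look roughly like this (assuming
a core named "core0" and the default CoreAdmin path; whether the Field Cache is
actually released depends on the old readers really being closed):

  http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0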

For the Unix OSs there are little system tricks that make it flush the
disk buffer.
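
On Linux, for example, this (run as root) drops the OS page cache between
test runs:

  sync; echo 3 > /proc/sys/vm/drop_caches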

On Tue, Dec 21, 2010 at 7:19 AM, Stijn Vanhoorelbeke
stijn.vanhoorelb...@gmail.com wrote:
 I am aware of the power of the caches.
 I do not want to completely remove the caches - I want them to be small.
  - So I can launch a stress test with a small amount of data.
 ( Some items may come from cache - some need to be searched up -
 right now everything comes from the cache... )

 2010/12/21 Toke Eskildsen t...@statsbiblioteket.dk:
 Stijn Vanhoorelbeke [stijn.vanhoorelb...@gmail.com] wrote:
  I want to do a quick & dirty load test - but all my results are cached.
 I commented out all the Solr caches - but still everything is cached.

 * Can the caching come from the 'Field Collapsing Cache'.
   -- although I don't see this element in my config file.
 ( As the system now jumps from 1GB to 7 GB of RAM when I do a load
 test with lots of queries ).

 If you allow the JVM to use a maximum of 7GB heap, it is not that surprising 
 that it allocates it when you hammer the searcher. Whether the heap is used 
  for caching or just filled with dead objects waiting for garbage collection 
 is hard to say at this point. Try lowering the maximum heap to 1 GB and do 
 your testing again.

 Also note that Lucene/Solr performance on conventional harddisks benefits a 
 lot from disk caching: If you perform the same search more than one time, 
 the speed will increase significantly as relevant parts of the index will 
 (probably) be in RAM. Remember to flush your disk cache between tests.




-- 
Lance Norskog
goks...@gmail.com