RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Renaud Waldura
There is quite a bit of literature available on this topic. This paper presents a summary. Nothing immediately applicable, I'm afraid. "Retrieving OCR Text: A survey of current approaches", Steven M. Beitzel, Eric C. Jensen, David A. Grossman, Illinois Institute of Technology. It lists a number of

RE: Performance dead-zone due to garbage collection

2009-01-28 Thread Renaud Waldura
I'm coming in late on this thread, but I want to recommend the YourKit Profiler product. It helped me track a performance problem similar to what you describe. I had been futzing with GC logging etc. for days before YourKit pinpointed the issue within minutes. http://www.yourkit.com/ (My problem

RE: Accented search

2008-03-11 Thread Renaud Waldura
Peter: Very interesting. To take care of the issue you mention, could you add multiple synonyms with progressively fewer accents? E.g. you'd index préférence as 4 tokens: préférence (unchanged), preférence (stripped one accent), préference (stripped the other accent), preference (stripped both
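
The variant generation described above can be sketched in a few lines. This is only an illustration, assuming plain Unicode normalization is enough to strip an accent; the function names `strip_accent` and `accent_variants` are hypothetical and are not part of Solr or Lucene. Note that a word with n accented letters yields 2^n tokens, so this may only be practical for short words.

```python
import itertools
import unicodedata


def strip_accent(ch):
    """Return ch with any combining marks (accents) removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_variants(word):
    """Return every spelling of word with some subset of its accents stripped."""
    word = unicodedata.normalize("NFC", word)
    options = []
    for ch in word:
        base = strip_accent(ch)
        # An unaccented letter has one option; an accented one has two.
        options.append((ch,) if base == ch else (ch, base))
    return ["".join(combo) for combo in itertools.product(*options)]


if __name__ == "__main__":
    for token in accent_variants("préférence"):
        print(token)
```

Each returned string would be indexed as a token at the same position, in the way synonyms usually are.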

RE: Color search

2007-09-28 Thread Renaud Waldura
Here's another idea: encode color mixes as one RGB value (32 bits) and sort according to those values. Finding the closest color is then like finding the closest point in the color space. It would be like a distance search. 70% black #000000 = #000000; 20% gray #f0f0f0 = #303030; 10% brown #8b4513 = #0e0702
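
A minimal sketch of this idea, using the mix from the example above. It assumes the weighted components are simply added per channel and that "closest" means squared Euclidean distance in RGB space; the helper names (`parse_hex`, `mix`, `pack`, `closest`) are made up for illustration.

```python
def parse_hex(color):
    """Convert "#rrggbb" to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def mix(parts):
    """Blend (weight, "#rrggbb") pairs into one (r, g, b) tuple.

    Weights are assumed to sum to 1.
    """
    channels = [0.0, 0.0, 0.0]
    for weight, color in parts:
        for i, value in enumerate(parse_hex(color)):
            channels[i] += weight * value
    return tuple(round(c) for c in channels)


def pack(rgb):
    """Pack (r, g, b) into a single integer, 8 bits per channel."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b


def distance_sq(a, b):
    """Squared Euclidean distance between two (r, g, b) tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(target, candidates):
    """Return the candidate color nearest to target."""
    return min(candidates, key=lambda c: distance_sq(target, c))


if __name__ == "__main__":
    blend = mix([(0.7, "#000000"), (0.2, "#f0f0f0"), (0.1, "#8b4513")])
    print("#%06x" % pack(blend))
```

One caveat: sorting by the packed integer orders colors mainly by the red channel, so neighbors in sort order are not necessarily the nearest points in the color space. The sketch therefore compares distances per channel.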

Non-HTTP Indexing

2007-09-06 Thread Renaud Waldura
Dear Solr Users: Is it possible to index documents directly without going through any XML/HTTP bridge? I have a large collection (10^7 documents, some very large) and indexing speed is a concern. Thanks! --Renaud