There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable, I'm afraid.
Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Illinois Institute of Technology
It lists a number of
I'm coming in late on this thread, but I want to recommend the YourKit
Profiler product. It helped me track a performance problem similar to what
you describe. I had been futzing with GC logging etc. for days before
YourKit pinpointed the issue within minutes.
http://www.yourkit.com/
(My problem
Peter:
Very interesting. To take care of the issue you mention, could you add
multiple synonyms with progressively fewer accents?
E.g. you'd index préférence as 4 tokens:
préférence (unchanged)
preférence (stripped one accent)
préference (stripped the other accent)
preference (stripped both accents)
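
To make the idea concrete, here is a rough sketch of how such variants could be generated before indexing. It is only an illustration in plain Python, not anything Solr or Lucene ships: the function names strip_accent and accent_variants are made up for this example, and it assumes that "stripping an accent" means dropping the Unicode combining marks of a single character.

```python
import itertools
import unicodedata


def strip_accent(ch):
    """Return ch without its combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_variants(token):
    """List every variant of token with any subset of its accents removed.

    The unchanged token comes first and the fully stripped one last.
    """
    token = unicodedata.normalize("NFC", token)
    options = []
    for ch in token:
        base = strip_accent(ch)
        # An accented character contributes two choices, a plain one only one.
        options.append((ch, base) if base != ch else (ch,))
    return ["".join(combo) for combo in itertools.product(*options)]


if __name__ == "__main__":
    for variant in accent_variants("préférence"):
        print(variant)
```

Note that a token with k accented characters yields 2^k variants, so this is probably only reasonable when words carry a handful of accents at most.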
Here's another idea: encode each color mix as one RGB value (it fits in a
32-bit integer) and sort according to those values. Finding the closest color
is then like finding the nearest point in the color space, i.e. a distance
search.
70% black #000000 = #000000
20% gray  #f0f0f0 = #303030
10% brown #8b4513 = #0e0702
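
A quick sketch of that arithmetic, in case it helps. This is plain Python and only an illustration: the helper names (parse_hex, mix, pack, nearest) are invented here, weights are taken as integer percentages, and "closest" is assumed to mean smallest squared Euclidean distance in RGB space.

```python
def parse_hex(color):
    """Turn '#rrggbb' into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def mix(components):
    """Blend (percent, '#rrggbb') pairs into one (r, g, b) tuple.

    Percentages are assumed to add up to 100.
    """
    totals = [0, 0, 0]
    for percent, color in components:
        for i, channel in enumerate(parse_hex(color)):
            totals[i] += percent * channel
    return tuple(min(255, round(t / 100)) for t in totals)


def pack(rgb):
    """Pack (r, g, b) into a single integer, 0xRRGGBB."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b


def nearest(target, candidates):
    """Return the name of the candidate closest to target in RGB space."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(target, candidates[name]))
    return min(candidates, key=squared_distance)


if __name__ == "__main__":
    blend = mix([(70, "#000000"), (20, "#f0f0f0"), (10, "#8b4513")])
    print("#%06x" % pack(blend))
```

For the mix above, the three weighted values add up to #3e3732. One caveat: ordering by the packed integer mostly orders by the red channel, so two neighbours in sort order are not necessarily close in color. The distance comparison is what actually finds the closest match.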
Dear Solr Users:
Is it possible to index documents directly without going through any
XML/HTTP bridge?
I have a large collection (10^7 documents, some very large) and indexing
speed is a concern.
Thanks!
--Renaud