No offense taken.
More on this topic (opinion only): even the best OCR has a quality ratio, say
95% or 98% correct. And OCR is slow, maybe a minute per image. So it is best to
OCR into a filesystem or DB, assess the quality, then index from the DB.
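A minimal sketch of that pipeline, assuming a hypothetical run_ocr() helper standing in for a real Tesseract call (via pytesseract or similar), with a SQLite staging table and a confidence threshold checked before anything is sent to the indexer:

```python
import sqlite3

# Hypothetical OCR step; a real pipeline would invoke Tesseract here
# (e.g. via pytesseract) and return the text plus a confidence score.
def run_ocr(image_path):
    return "recognized text", 0.97  # (text, confidence in [0, 1])

def ocr_to_db(image_paths, db_path=":memory:"):
    """OCR each image and stage the result in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages "
                 "(path TEXT PRIMARY KEY, text TEXT, confidence REAL)")
    for path in image_paths:
        text, conf = run_ocr(path)
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                     (path, text, conf))
    conn.commit()
    return conn

def pages_to_index(conn, min_confidence=0.95):
    """Return only the pages that pass the quality check."""
    return conn.execute(
        "SELECT path, text FROM pages WHERE confidence >= ?",
        (min_confidence,)).fetchall()
```

The point of the staging table is that the slow OCR pass runs once, and the quality assessment and (re-)indexing can then run separately against the DB.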
Cheers -- Rick
On February 12, 2017
Actually, I don't know how to do it. For now I've just created a request
handler and an update chain processor for it, with the capability to detect
the language during the recognition process (LanguageDetect or something like
that). I'd really appreciate any instructions.
Sorry if I was rude; my English is not great.
Yes, you are right. I was just trying to help, and did not have time to dig out
the details. So the question is: how do you tell Solr to pass the language arg
to Tika and Tesseract?
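I am not sure Solr exposes this directly through the extraction handler; one route (an assumption, not verified against a specific Solr release) is to point Solr's Tika at a custom tika-config.xml that sets the OCR language on the Tesseract parser:

```xml
<!-- tika-config.xml: ask Tesseract for Russian plus English.
     The language codes must match installed .traineddata files. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">rus+eng</param>
      </params>
    </parser>
  </parsers>
</properties>
```

This requires a Tika version recent enough to support per-parser params in tika-config.xml; check the version bundled with your Solr.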
On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" wrote:
Hi, Rick.
I didn't mean that he needs to train, because Tesseract works well separately.
The issue is that the Tika included in Solr doesn't try to use the Russian
dictionary to recognize Cyrillic text, so the result uses only the English alphabet.
On 10 Feb 2017 at 15:28, "Rick Leir" wrote:
My guess is that you are using Tika and Tesseract. The latter is
complex, and you can start learning at
https://wiki.apache.org/tika/TikaOCR <-- shows you how to work with TIFF
The traineddata for Cyrillic is here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
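Before involving Solr at all, it is worth confirming that Tesseract itself handles the image once the Cyrillic traineddata is installed. The paths below are assumptions; adjust them to your installation:

```shell
# Put the Russian traineddata where Tesseract looks for it
# (TESSDATA_PREFIX or the system tessdata directory).
cp rus.traineddata /usr/share/tesseract-ocr/tessdata/

# List the languages Tesseract now knows about; 'rus' should appear.
tesseract --list-langs

# OCR a TIFF with the Russian model; output goes to out.txt.
tesseract scan.tif out -l rus
```

If this produces correct Cyrillic text but Solr still returns transliterated Latin, the problem is in how Tika invokes Tesseract, not in Tesseract itself.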
Hello, community!
Did you manage to recognize jpg, tiff, or whatever with Cyrillic text inside?
At the moment I get only Latin letters (it looks like ugly transliterated
text) in the result. For an image containing only Latin letters it works fine.
Does anyone have any suggestions, best practices, or case studies?
. However
q=h%e9llo would find matching results (the result set I'd match in Luke).
So assuming that I can fix the form encoding errors so that the characters
are encoded as UTF-8, I believe I would still return incorrect
results. Will Cyrillic characters be treated any differently than
On 7/19/06, Tricia Williams [EMAIL PROTECTED] wrote:
...What I called the _solr url encoding_ was the q= parameter
translated into I'm not sure what encoding in the url...
I think I've seen the same problem, haven't investigated deeper but
IIUC the encoding used when posting a form is related
Hi all,
I'm trying to adapt our old Cocoon/Lucene-based web search application
to one that is more Solr-ish. Our old web app was capable of searching for
queries with Cyrillic characters in them. I'm finding that using the
packaged example admin interface, entering a query with a string
OK, let's split up the indexing side from the query side for a moment,
and assume that you are indexing correctly (setting the content-type
correctly, etc.).
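The content-type point matters because the same bytes decode to different strings depending on the declared charset; a small illustration (plain Python, not Solr-specific):

```python
# 'héllo' encoded as UTF-8 on the wire:
raw = "héllo".encode("utf-8")          # b'h\xc3\xa9llo'

# If the server honors "charset=utf-8", the text round-trips intact.
assert raw.decode("utf-8") == "héllo"

# If it falls back to Latin-1 instead, the two UTF-8 bytes
# of 'é' turn into two mojibake characters.
assert raw.decode("latin-1") == "hÃ©llo"
```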
I just added a new value to the multi-valued features field in the
solr.xml example document:
Good unicode support: héllo (hello with an
: ps. I am using mozilla firefox as my main browser which leads to the
: behaviour I reported above. IE 6.0 works fine for cyrillics although
: there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for
: the same query as before).
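That %CA%E0%ED%E0%E4%E0 sequence is consistent with IE encoding the query in windows-1251 rather than UTF-8 (assuming the query was "Канада"); the difference is easy to reproduce:

```python
from urllib.parse import quote

query = "Канада"

# windows-1251 (what IE appears to send): one byte per Cyrillic letter.
assert quote(query.encode("cp1251")) == "%CA%E0%ED%E0%E4%E0"

# UTF-8: two bytes per Cyrillic letter, so a longer escape sequence.
assert quote(query.encode("utf-8")) == "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"
```

So the server cannot tell the charset from the percent-escapes alone; the two browsers are sending genuinely different bytes for the same query.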
The problem may not be in the Solr internals as
Definitely some Firefox bugs with UTF8 at least:
If I go to the admin screen, and paste in héllo into the query box,
then kill Solr and run netcat to see exactly what I get, it's the
following:
$ nc -l -p 8983
GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
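One reason h%E9llo goes wrong: 0xE9 is the Latin-1 byte for 'é', but it is not valid UTF-8 on its own, so a server decoding the query as UTF-8 either errors out or substitutes a replacement character:

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("h%E9llo")      # b'h\xe9llo' - raw Latin-1 bytes

# Strict UTF-8 decoding rejects the lone 0xE9 byte...
try:
    raw.decode("utf-8")
    decoded_ok = False
except UnicodeDecodeError:
    decoded_ok = True
assert decoded_ok

# ...while decoding with the right charset recovers the query.
assert raw.decode("latin-1") == "héllo"
```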
I've started poking around and have fixed already one bug related to
URL encoding of data. I'm going to work some more on this tonight
and will hopefully have a patch for you soon.
phil.
On Jul 18, 2006, at 6:19 PM, Yonik Seeley wrote:
On 7/18/06, WHIRLYCOTT [EMAIL PROTECTED] wrote:
How
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:
that using the packaged example admin interface entering a query
with a string of cyrillic characters causes a
java.lang.ArrayIndexOutOfBoundsException
... I have this much fixed as well.
However, I'm still walking data through the stack