Re: OCR image contains cyrillic characters

2017-02-12 Thread Rick Leir
No offense taken. More on this topic (opinion only): even the best OCR has a limited accuracy rate, say 95% or 98% correct. And OCR is slow, maybe a minute per image. So it is best to OCR into a filesystem or DB, assess the quality, then index from the DB. Cheers -- Rick On February 12, 2017
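
The staged approach described above (OCR into a DB, assess quality, then index from the DB) could be sketched roughly as follows. This is only an illustration: the table name `ocr_result`, the helper names, and the idea of storing a single `confidence` score per image are assumptions, not anything from the thread, and the actual OCR and indexing steps are left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the staging database. Table layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_result ("
        "image_path TEXT PRIMARY KEY, "
        "text TEXT NOT NULL, "
        "confidence REAL NOT NULL)"
    )
    return conn


def store_result(conn, image_path, text, confidence):
    """Save one OCR result so the slow OCR step never has to be repeated."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_result VALUES (?, ?, ?)",
        (image_path, text, confidence),
    )
    conn.commit()


def rows_to_index(conn, min_confidence=0.95):
    """Return only the results judged good enough to send to the index."""
    cur = conn.execute(
        "SELECT image_path, text FROM ocr_result "
        "WHERE confidence >= ? ORDER BY image_path",
        (min_confidence,),
    )
    return cur.fetchall()
```

Keeping the OCR output in its own store means a reindex only needs to reread the table, and low-scoring images can be reviewed or re-run separately.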

Re: OCR image contains cyrillic characters

2017-02-12 Thread Игорь Абрашин
Actually, I don't know how to do it. For now I've just created a request handler and an update chain processor for it, with the capability to detect the language during the recognition process (LanguageDetect or something like that). I'd really appreciate any instructions. Sorry if I was rude, bad english skill for good

Re: OCR image contains cyrillic characters

2017-02-11 Thread Rick Leir
Yes, you are right. I was just trying to help, and did not have time to dig out the details. So the question is: how do you tell Solr to pass the language arg to Tika and Tesseract? On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" wrote: >Hi, Rick. >I didn't mean
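
For reference, this is what the language argument looks like when Tesseract is run on its own, outside Solr and Tika. It does not answer the Solr configuration question above; it is only a sketch assuming the `tesseract` binary is on the PATH and the `rus` traineddata is installed. The function names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path, languages=("rus", "eng")):
    """Build a Tesseract CLI call; '-l rus+eng' selects the language data."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("rus", "eng")):
    """Run Tesseract and return the recognized text (assumes UTF-8 output)."""
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Without `-l`, Tesseract falls back to its default language, which is consistent with getting only Latin letters back for a Cyrillic image.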

Re: OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hi, Rick. I didn't mean that it needs to be trained, because Tesseract works well separately. So, the Tika included in Solr doesn't try to use the Russian dictionary to recognize Cyrillic text, and the result comes out using only the English alphabet. On Feb 10, 2017, at 15:28, "Rick Leir" wrote: >

Re: OCR image contains cyrillic characters

2017-02-10 Thread Rick Leir
My guess is that you are using Tika and Tesseract. The latter is complex, and you can start learning at https://wiki.apache.org/tika/TikaOCR <-- shows you how to work with TIFF. The traineddata for Cyrillic is here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hello, community! Did you manage to recognize JPG, TIFF or whatever with Cyrillic text inside? I've got only Latin letters (it looks like ugly transliterated text) in the result at the moment. For images containing only Latin letters it works fine. Does anyone have any suggestions, best practices or case studies

Re: Cyrillic characters

2006-07-19 Thread Tricia Williams
. However q=h%e9llo would find matching results (the result set I'd match in Luke). So assuming that I can fix the form encoding errors so that the characters are encoded as UTF-8, I believe that I would continue to return incorrect results. Will cyrillic characters be treated any differently than

Re: Re: Cyrillic characters

2006-07-19 Thread Bertrand Delacretaz
On 7/19/06, Tricia Williams [EMAIL PROTECTED] wrote: ...What I called the _solr url encoding_ was the q= parameter translated into I'm not sure what encoding in the url... I think I've seen the same problem, haven't investigated deeper but IIUC the encoding used when posting a form is related

Cyrillic characters

2006-07-18 Thread Tricia Williams
Hi all, I'm trying to adapt our old cocoon/lucene based web search application to one that is more solrish. Our old web app was capable of searching for queries with cyrillic characters in them. I'm finding that using the packaged example admin interface entering a query with a string

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
OK, let's split up the indexing side from the query side for a moment and assume that you are indexing correctly (setting the content-type correctly, etc). I just added a new value to the multi-valued features field to the solr.xml example document: Good unicode support: héllo (hello with an

Re: Cyrillic characters

2006-07-18 Thread Chris Hostetter
: ps. I am using mozilla firefox as my main browser which leads to the : behaviour I reported above. IE 6.0 works fine for cyrillics although : there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for : the same query as before). The problem may not be in the Solr internals as

Re: Cyrillic characters

2006-07-18 Thread Yonik Seeley
Definitely some Firefox bugs with UTF-8 at least: If I go to the admin screen, and paste in héllo into the query box, then kill Solr and run netcat to see exactly what I get, it's the following: $ nc -l -p 8983 GET /solr/select/?stylesheet=&q=h%E9llo&version=2.1&start=0&rows=10&indent=on HTTP/1.1
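
To make the encoding mismatch concrete, here is a small sketch using Python's standard `urllib.parse`. It shows that `h%E9llo` is the Latin-1 percent-encoding of héllo, not the UTF-8 one, and that the IE 6.0 string quoted earlier in the thread decodes to readable Cyrillic only if it is assumed to be windows-1251. That charset is a guess based on the byte values, not something stated in the thread.

```python
from urllib.parse import quote, unquote

# What a UTF-8 client would put in the URL for "héllo".
utf8_form = quote("héllo", encoding="utf-8")

# What was actually seen on the wire in the netcat capture.
latin1_form = quote("héllo", encoding="latin-1")

# Decoding the Latin-1 bytes as UTF-8 fails: 0xE9 alone is not valid UTF-8,
# so it is replaced with U+FFFD (the replacement character).
misread = unquote("h%E9llo", encoding="utf-8", errors="replace")

# The IE 6.0 query string, decoded under the windows-1251 assumption.
ie_query = unquote("%CA%E0%ED%E0%E4%E0", encoding="cp1251")

print(utf8_form)    # h%C3%A9llo
print(latin1_form)  # h%E9llo
print(misread)      # h\ufffdllo, shown as h�llo
print(ie_query)     # Канада
```

A server that decodes query strings as UTF-8 will therefore never see the original characters unless the browser also submits the form as UTF-8.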

Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
I've started poking around and have already fixed one bug related to URL encoding of data. I'm going to work some more on this tonight and will hopefully have a patch for you soon. phil. On Jul 18, 2006, at 6:19 PM, Yonik Seeley wrote: On 7/18/06, WHIRLYCOTT [EMAIL PROTECTED] wrote: How

Re: Cyrillic characters

2006-07-18 Thread WHIRLYCOTT
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote: that using the packaged example admin interface entering a query with a string of cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException ... I have this much fixed as well. However, I'm still walking data through the stack