
Now, I suspect that in future development of the OpenAI API and their 
models it will become possible to query a custom GPT which you could 
pre-train manually by simply conversing with it about a sample text you 
provide and telling it what it transcribed wrong and how to correct it. I 
have already tested this in the browser interface, and my finding is that 
if you present GPT-4 in the browser with just one image and ask it to 
transcribe it, it does a fine job. You then proceed by pointing out its 
errors and asking it to correct them, thereby helping GPT-4 extend its 
knowledge base within that conversation. You then feed it a second text and 
repeat the process. After a certain number of images it will transcribe 
perfectly on the first attempt, simply because it has acquired the skills 
to read the documents based on your specific steering and instructions. If 
you could then query this model through the API, your prompt would only 
need to contain an image; it would know what to do and return the 
transcribed text perfectly. For now I still need the long prompt ;).

I thought this info might be useful to you, but I can imagine you are 
well aware of it already, seeing as you are already in this field.

I have attached my code and the docs you need to run it. All paths in the 
code need to be changed, of course, to the locations where you put the 
source files.
You also need your own OpenAI API key, which you can get here: 
https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key (it 
is a paid service, so you need to add a minimum of $5 to your account).
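One tip on the key: rather than pasting it into the script, a common pattern (and what the official openai Python package does by default) is to read it from the OPENAI_API_KEY environment variable, so it never ends up in the code you share:

```python
import os

def load_api_key():
    """Read the API key from the environment; the official openai
    package looks for this same variable on its own."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the script")
    return key
```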

Some additional info:
- The script uses a combination of OCR, classic image recognition and GPT-4 
Vision to get all the different data points. Where OCR or image recognition 
sufficed I used those, because a deterministic procedure seems preferable 
when it is sufficient.
- The special symbols in the text were found with classical image 
recognition and put in a dictionary keyed by their location in the body 
text. I then used this dict to replace all symbol placeholders in the GPT-4 
transcription with the actual symbols from the image recognition dict. This 
is overly complex, but it was the only way to get the accuracy I wanted 
while only prompting vanilla GPT-4 Vision. Once you can query your own 
custom-trained GPT-4 Vision, the replacement step will no longer be 
necessary, since with training it can learn to recognize the symbols 
itself. I have tested this in the browser interface and that is indeed the 
case: when I correct a symbol transcription once, it remembers this for 
future transcriptions in the same chat.
- The GPT-4 Vision part of the script already implements a batching method, 
but as batching is not yet allowed on the OpenAI API side, the batch size 
is set to 1. However, in the browser interface you can upload up to 10 
images in one prompt, so I suspect this will become available via the API 
sometime in the future. It then suffices to increase BATCH_SIZE on line 
489 to start using this option.
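To give an idea of the symbol replacement step mentioned above, the mechanics boil down to something like this — a simplified sketch, since the placeholder token and function name here are illustrative and the real script keys the dict on the symbol's location in the body text:

```python
def restore_symbols(transcription, symbols_in_order, placeholder="[SYM]"):
    """Replace each placeholder the model emitted with the symbol that
    image recognition found at the corresponding position in the text."""
    parts = transcription.split(placeholder)
    if len(parts) - 1 != len(symbols_in_order):
        raise ValueError("placeholder count does not match detected symbols")
    out = [parts[0]]
    for symbol, tail in zip(symbols_in_order, parts[1:]):
        out.append(symbol)   # the symbol recovered by image recognition
        out.append(tail)     # the transcription text that follows it
    return "".join(out)
```

The length check matters: if GPT-4 drops or invents a placeholder, you want the script to fail loudly rather than shift every symbol by one position.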
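As for the batching remark: the chunking itself is trivial, so once multi-image prompts work over the API nothing else has to change. Roughly (illustrative — the real constant sits on line 489 of the attached script):

```python
BATCH_SIZE = 1  # bump this once the API accepts several images per request

def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of `items`, `size` elements at a time;
    the last batch may be shorter."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each batch then becomes one request, with all of its images attached to a single prompt.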

So, this was a very long explanation, and I am sorry for that. I am, 
however, very enthusiastic about the results, and you certainly helped me 
along. Thanks again! I am also curious to see how you would use this in 
your field :)

Let me know what you think! :)

Kind regards
Paulus

>> Attachments

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
