Oh right, for those facing a similar issue, here is what I did:
1. Replace the eng.traineddata file with the eng.traineddata found in the
tesseract-ocr/tessdata repository ("Trained models with fast variant of the
"best" LSTM models + legacy models"):
<https://github.com/tesseract-ocr/tessdata/tree/main>
I didn't delete the original file but renamed it.
2. Test the orientation command directly with tesseract in the terminal,
like so:
tesseract "C:\Users\osain\OneDrive\Desktop\2000\Document_20240110_0001.jpg" stdout --psm 0 --oem 0
If this command works in the terminal, then it should also work in the node
wrapper version. Here is how I called it:
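const tesseract = require("node-tesseract-ocr")

// path is the image file path (a string), defined elsewhere
tesseract.recognize(path, {
    oem: 0,    // legacy engine
    psm: 0,    // orientation and script detection (OSD) only
    lang: "eng"
})
    .then((data) => {
        return data
    })
    .catch((error) => {
        console.log(error.message)
    })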
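
In case it helps with the next step: the data that comes back from a --psm 0 run is plain text made of "Key: value" lines, so it still has to be parsed before you can act on the rotation. Below is a minimal sketch of that, not something from the wrapper itself. The parseOsd name is hypothetical, and the sample text is an assumed example of what tesseract prints for OSD, not output captured from my files.

```javascript
// Hypothetical helper: turn the "Key: value" lines of OSD output into an object.
// Numeric values are converted to numbers, everything else stays a string.
function parseOsd(stdout) {
  const result = {};
  for (const line of stdout.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines without a key/value separator
    const key = line.slice(0, idx).trim();
    const raw = line.slice(idx + 1).trim();
    if (key === "") continue;
    const num = Number(raw);
    result[key] = raw !== "" && !Number.isNaN(num) ? num : raw;
  }
  return result;
}

// Assumed sample of OSD output (illustrative values only).
const sample = [
  "Page number: 0",
  "Orientation in degrees: 180",
  "Rotate: 180",
  "Orientation confidence: 5.04",
  "Script: Latin",
  "Script confidence: 1.27",
].join("\n");

const osd = parseOsd(sample);
console.log(osd["Rotate"], osd["Script"]); // prints: 180 Latin
```

With something like this, the .then() handler above could return parseOsd(data) instead of the raw string.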
On Friday, January 12, 2024 at 8:21:03 PM UTC-5 Oliver Saintilien wrote:
> Great it works like a charm now, thanks very much for your help.
>
> On Friday, January 12, 2024 at 10:42:05 AM UTC-5 [email protected] wrote:
>
>> On Fri, 12 Jan 2024, 14:08 Oliver Saintilien, <[email protected]>
>> wrote:
>>
>>> Something else I tried was this:
>>> const tesseract = require("node-tesseract-ocr")
>>>
>>> tesseract
>>> .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>>> lang: "eng",
>>> oem: 1,
>>> psm: 0,
>>> "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata"
>>> })
>>>
>>> That's when I get the error about the TESSDATA_PREFIX env var. I have
>>> pasted it below:
>>>
>>> Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992
>>> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3
>>> --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
>>> Error opening data file C:\Program/eng.traineddata
>>> Please make sure the TESSDATA_PREFIX environment variable is set to your
>>> "tessdata" directory.
>>>
>>
>> Adding to Zdenko's answer: what you need to do is fix / patch
>> node-tesseract-ocr (or file a bug report there and see if someone else does
>> it for you; since this is open source I suggest fork+fix+pull request at
>> node-tesseract-ocr instead ;-) ) so that it correctly converts paths
>> with spaces, as specified in the js config struct, into correctly escaped,
>> operating-system-dependent command-line arguments for the tesseract
>> executable that node-tesseract-ocr invokes.
>> The quickest fix would be to wrap the --tessdata-dir path argument in double
>> quotes, which fixes most path issues (including yours) on MS Windows, as
>> long as the path itself is not adversarial, i.e. does not contain a double
>> quote of its own.
>>
>> In other words: currently node-tesseract-ocr produces this commandline,
>> as reported by you:
>>
>> tesseract "C:\Users\osain\OneDrive\Desktop\1992
>> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3
>> --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
>>
>> which is interpreted like this (extra newlines added to show the
>> arguments separated):
>>
>> tesseract
>> "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
>> stdout
>> -l eng
>> --oem 1
>> --psm 3
>> --tessdata-dir C:\Program
>> Files\Tesseract-OCR\tessdata
>>
>> so tesseract receives this and gets a damaged path PLUS a surplus
>> argument it apparently ignored: "Files\Tesseract-OCR\tessdata".
>>
>> What SHOULD have been generated by node-tesseract-ocr is this (with
>> extra newlines again):
>>
>>
>> tesseract
>> "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
>> stdout
>> -l eng
>> --oem 1
>> --psm 3
>> --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"
>>
>> as was intended in the js code.
>>
>>
>> HTH,
>>
>> Ger
>>
>>
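
As a footnote to Ger's explanation above: one way to sidestep the quoting problem without patching the wrapper is to call the tesseract executable yourself and pass the arguments as an array, so no command string is ever split on spaces. This is only a sketch under assumptions: buildTesseractArgs and runTesseract are hypothetical names, tesseract is assumed to be on the PATH, and I am assuming (from the quoted error) that the wrapper builds a single command string. It uses Node's child_process.execFile, which takes an argument array and does not go through a shell by default.

```javascript
const { execFile } = require("child_process");

// Hypothetical helper: build the argument array for the tesseract CLI.
// Each option becomes two separate array entries, so a value that contains
// spaces (like the tessdata path) stays a single argument.
function buildTesseractArgs(imagePath, options = {}) {
  const args = [imagePath, "stdout"];
  for (const [key, value] of Object.entries(options)) {
    if (key === "lang") {
      args.push("-l", String(value)); // language uses the short -l flag
    } else {
      args.push(`--${key}`, String(value)); // e.g. --oem 1, --psm 0
    }
  }
  return args;
}

// Hypothetical helper: run tesseract and resolve with whatever it prints to stdout.
function runTesseract(imagePath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile("tesseract", buildTesseractArgs(imagePath, options), (error, stdout) => {
      if (error) reject(error);
      else resolve(stdout);
    });
  });
}

// Show the arguments that would be passed for the call from this thread.
const args = buildTesseractArgs(
  "C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg",
  {
    lang: "eng",
    oem: 1,
    psm: 0,
    "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata",
  }
);
console.log(args);

// Usage (not run here, needs tesseract installed):
// runTesseract(imagePath, options).then(console.log).catch((e) => console.log(e.message));
```

The point of the design is that "C:\Program Files\Tesseract-OCR\tessdata" travels as one array element, which is the same effect as the double quotes in the corrected command line above.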