I've never done any OCR work, so I don't really know. However, one problem I foresee is that you won't know which way the text is facing. Is the page portrait or landscape? Is it rotated by 90, 180, or 270 degrees? I'm not sure how best to solve this. One option would be to run the OCR four times (once at each rotation: 0, 90, 180, 270) and keep the "best" result, which would probably mean the one with the largest amount of recognized text. This would use four times the CPU time, and I don't know what your requirements are. Maybe speed is very important, or maybe you don't care how long it takes as long as the results are the best possible.
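
Here is a rough sketch of the try-every-rotation idea, in Python. Everything in it is an assumption for illustration: `ocr` is a placeholder for whatever OCR engine you end up using, the image is just a list of pixel rows so the example stays self-contained, and `score` (a count of letters and digits) is only my guess at what "best" should mean.

```python
from typing import Callable, List, Tuple

# Rows of pixel values; a stand-in for a real image type (assumption).
Image = List[List[int]]


def rotate_90_clockwise(image: Image) -> Image:
    """Return a copy of the pixel grid rotated 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def score(text: str) -> int:
    """Crude quality measure: count the letters and digits in the OCR output."""
    return sum(ch.isalnum() for ch in text)


def best_rotation(image: Image, ocr: Callable[[Image], str]) -> Tuple[int, str]:
    """Run `ocr` at 0, 90, 180 and 270 degrees; return (angle, text) of the best run.

    `ocr` is a hypothetical callable supplied by the caller; it is not part
    of any particular OCR library.
    """
    best_angle, best_text, best_score = 0, "", -1
    current = image
    for angle in (0, 90, 180, 270):
        text = ocr(current)
        current_score = score(text)
        if current_score > best_score:
            best_angle, best_text, best_score = angle, text, current_score
        # Prepare the next orientation for the following pass.
        current = rotate_90_clockwise(current)
    return best_angle, best_text
```

A real OCR engine may produce plenty of garbage characters on a sideways page, so counting characters alone could pick the wrong orientation; a dictionary check or the engine's own confidence value, if it exposes one, would likely be a better score.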
I've also heard that reducing the number of colors can help. For example, instead of keeping greyscale, convert the image to 1-bit pixels (either black or white). This makes all the edges sharp, and most OCR algorithms work better that way. Of course, this could backfire badly if the text is a light shade of grey (the entire image would be converted to white), if the text is in a light color (yellow, light blue, etc.), or if the background is dark (green text on a black background, for instance).

Here again, you could analyze the image to detect the right filters to apply (inverting colors so you have dark text on a light background, adjusting saturation or contrast, etc.), or you could run the same image through OCR with several different filters and take the best result. It's just a matter of how creative you want to get, how much CPU power you have to work with, how much development time you have, and how important it is that the results are as close to perfect as possible.

But like I said, I've never actually done any OCR myself, so maybe the OCR libraries out there already take some, most, or all of this into account. There might be someone else on this list who has experience and can provide some advice. If not, check with the developers of OCR libs; I'm sure they'll have many good suggestions :-)

----
Thanks,
Adam

From: Daniel Sánchez González <[email protected]>
To: <[email protected]>
Date: 06/23/2011 09:47
Subject: Re: Text extraction results in strange characters

Thank you very much for your explanation. I'll try to convert the PDF to an image and then to text via OCR. Which is the most accurate way to do this?

----- Original Message -----
From: <[email protected]>
To: <[email protected]>
Cc: <[email protected]>
Sent: Thursday, June 23, 2011 6:12 PM
Subject: Re: Text extraction results in strange characters

Dani,

The type of font being used is probably embedded and mapped to images of the characters.
This works great for viewing the document, but if you don't have characters (ASCII or Unicode), you're not going to get reasonable results when copying and pasting. If my theory is correct, you'll find that you will also be unable to copy & paste using Adobe Reader. The only way to get the text out of a file like this would be to convert it to an image, and then try to use OCR (optical character recognition) to extract the text. As you probably already know, OCR is not 100% accurate, but it'd be better than nothing.

Developers, I suggest we add this to the FAQ on the website. I've seen it come up a few times, and it's a very interesting explanation.

----
Thanks,
Adam

From: Daniel Sánchez González <[email protected]>
To: [email protected]
Date: 06/23/2011 04:55
Subject: Text extraction results in strange characters

When I try to convert a PDF to text the operation results in strange characters. If I copy some text from the PDF file and paste it in a text editor, I get the same result. What is wrong?

Thanks in advance.
Dani

- FHA 203b; 203k; HECM; VA; USDA; Conventional
- Warehouse Lines; FHA-Authorized Originators
- Lending and Servicing in over 45 States
www.swmc.com - www.simplehecmcalculator.com
Visit www.swmc.com/resources for helpful links on Training, Webinars, Lender Alerts and Submitting Conditions

This email and any content within or attached hereto from Sun West Mortgage Company, Inc. is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege.
Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call (800) 453 7884.

