VietOCR uses a regular expression for this. Here is the Java function it
uses:
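/**
 * Removes line breaks within paragraphs.
 * Blank lines between paragraphs are preserved.
 *
 * @param text the text to clean up
 * @return the text with each paragraph joined onto a single line
 */
public static String removeLineBreaks(String text) {
    // First pass: trim spaces and tabs at the start and end of every line.
    // Second pass: replace a newline that sits between two non-newline
    // characters with a space; blank lines are left alone.
    return text.replaceAll("(?<=\n|^)[\t ]+|[\t ]+(?=$|\n)", "")
               .replaceAll("(?<=.)\n(?=.)", " ");
}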
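
As a rough illustration of how the function behaves, here is a minimal, self-contained sketch. The class name LineBreakDemo and the sample string are only assumptions for the demo and are not part of VietOCR; the method body is the same as the one above.

```java
class LineBreakDemo {

    // Same logic as the function quoted above.
    static String removeLineBreaks(String text) {
        return text.replaceAll("(?<=\n|^)[\t ]+|[\t ]+(?=$|\n)", "")
                   .replaceAll("(?<=.)\n(?=.)", " ");
    }

    public static void main(String[] args) {
        // Hypothetical sample: paragraphs separated by a blank line,
        // with hard line breaks inside each paragraph.
        String ocr = "  Chapter One  \n"
                   + "\n"
                   + "A royal-red Ford F-150 Super-\n"
                   + "Crew rolled through\n"
                   + "the streets.\n"
                   + "\n"
                   + "Life here is going\n"
                   + "to be good.";

        // Each paragraph ends up on one line; blank lines are kept.
        System.out.println(removeLineBreaks(ocr));
    }
}
```

Two things to be aware of, judging from the regex alone: a paragraph break is only kept when it is a blank line (two consecutive '\n'), so a paragraph separated by a single '\n' gets merged with the next one; and a word hyphenated at a line end, such as "Super-" followed by "Crew", comes out as "Super- Crew" and would need a separate clean-up step.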
On Friday, August 8, 2014 10:49:27 AM UTC-5, Bruce wrote:
>
> Aww... I tried removing the '\n' line breaks manually; however, for some
> articles, the paragraph break still consists of a single '\n' line break, so
> if I remove that too with a find/replace, I lose the paragraph break. How
> did VietOCR solve this issue?
>
> On Thursday, August 7, 2014 7:23:27 AM UTC+8, Quan Nguyen wrote:
>>
>> I'm afraid not. You can use any programming editor that supports Regex
>> find/replace to do it for you, or use a tool such as VietOCR
>> <http://vietocr.sf.net> to remove line breaks from the output text.
>>
>> On Wednesday, August 6, 2014 10:51:34 AM UTC-5, Bruce wrote:
>>>
>>> For example with the image attached, I get the output:
>>>
>>> - Chapter One
>>> -
>>> - A royal-red Ford F—150 Super-
>>> - Crew rolled through the streets
>>> - of Albany, Georgia. The pickup’s
>>> - driver brimmed with optimism, so
>>> - much that he couldn’t possibly
>>> - foresee the battles about to hit
>>> - his hometown.
>>> -
>>> - Life here is going to be good,
>>> - thirty—seven—year—old Nathan
>>> - Hayes told himself. After eight
>>> - years in Atlanta, Nathan had
>>> - come home to Albany, three
>>> - hours south, with his wife and
>>>
>>> Is there a way to get the output below instead, without the line breaks
>>> within a paragraph?
>>>
>>> - Chapter One
>>> -
>>> - A royal-red Ford F—150 Super-Crew rolled through the streets of
>>> Albany, Georgia. The pickup’s driver brimmed with optimism, so much that
>>> he couldn’t possibly foresee the battles about to hit his hometown.
>>> -
>>> - Life here is going to be good, thirty—seven—year—old Nathan Hayes
>>> told himself. After eight years in Atlanta, Nathan had come home to
>>> Albany, three hours south, with his wife and
>>>
>>> Thanks in advance!
>>>
>>