Re: Paragraph identification in apache pdf box

Aravind Swarana Tue, 11 Aug 2020 01:37:09 -0700

Hi Tilman,

Thank you so much for your reply.
I have looked at Icecite, but it seems to be buggy; the Apache PDFBox
algorithm is far better. Do you know of anything else?
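
To make the question concrete, here is a minimal sketch of the kind of line-grouping heuristic I mean. It is not PDFBox's actual algorithm and not what Aspose does: the ParagraphGrouper class, its Line type, and the gapFactor / indentFactor parameters are hypothetical names made up for illustration. It assumes the text lines have already been extracted with their positions, and it starts a new paragraph on a large vertical gap, a first-line indent, or a jump back up the page. PDFTextStripper exposes comparable knobs through setDropThreshold and setIndentThreshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class ParagraphGrouper {

    /** One extracted text line: text, left edge x, vertical position y (growing downward), height. */
    static final class Line {
        final String text;
        final float x;
        final float y;
        final float height;

        Line(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups lines (in reading order) into paragraphs.
     * A new paragraph starts when the vertical gap to the previous line exceeds
     * gapFactor * previous height, when the line is indented by more than
     * indentFactor * previous height, or when y moves backwards (new column or page).
     */
    static List<String> group(List<Line> lines, float gapFactor, float indentFactor) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;

        for (Line line : lines) {
            if (previous != null) {
                float gap = line.y - previous.y;
                boolean bigGap = gap > gapFactor * previous.height;
                boolean indented = line.x - previous.x > indentFactor * previous.height;
                boolean movedUp = gap < 0;
                if ((bigGap || indented || movedUp) && current.length() > 0) {
                    // Close the paragraph collected so far.
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            }
            String text = line.text.trim();
            if (!text.isEmpty()) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(text);
            }
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

The two factors are relative to the line height so that the same settings can work across font sizes; the values would need tuning per document, which is exactly where a better published algorithm would help.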


Thanks & regards,
Aravind Swarna


On Mon, Aug 10, 2020 at 11:59 PM Aravind Swarana <aravindswar...@gmail.com>
wrote:

> I missed the analysis attachment in my previous mail.
>
>
> Thanks & regards,
> Aravind Swarna
>
>
> On Mon, Aug 10, 2020 at 11:49 PM Aravind Swarana <aravindswar...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I want to extract text as paragraphs using Apache PDFBox. From my
>> reading, I understand that extracting text from a PDF is not that simple.
>>
>> I have extracted paragraphs from a PDF using the PDFBox API, but the
>> results are not that great.
>>
>> Meanwhile, I have evaluated a paid PDF parsing product, Aspose PDF,
>> which extracts paragraphs with very few errors.
>>
>> I'm trying to implement a similar algorithm for Apache PDFBox. Can you
>> suggest any recent research papers or open source libraries that have
>> efficient paragraph identification algorithms? I'll need to evaluate and
>> implement them.
>>
>> So far I have found:
>> https://github.com/elacin/PDFExtract (I observed some errors while
>> evaluating it, and it is not as accurate as Aspose)
>>
>> https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
>> Apache PDFBox)
>>
>> I just need some suggestions on whether there are any other algorithms I
>> can look at and implement.
>>
>>
>>
>>
>> Thanks & regards,
>> Aravind Swarna
>>
>
