Hi Tim,
This is what I am actually doing: I’m parsing the XHTML and collecting the pages
by identifying the corresponding DIVs. This works really nicely, and the whole
engine is actually working well.
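Roughly, the page collection in my handler looks like this (a simplified sketch; the class and field names here are illustrative, not the actual project code):
[code]
// Simplified sketch only; "PageCollectingHandler" and its fields are illustrative.
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PageCollectingHandler extends DefaultHandler {

    private final List<String> pages = new ArrayList<>();
    private final StringBuilder currentPage = new StringBuilder();
    private boolean inPage = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Tika emits one <div class="page"> per PDF page in its XHTML output.
        if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
            inPage = true;
            currentPage.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPage) {
            currentPage.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumes no nested divs inside a page div.
        if (inPage && "div".equals(localName)) {
            pages.add(currentPage.toString());
            inPage = false;
        }
    }

    public List<String> getPages() {
        return pages;
    }
}
[/code]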
I’m just having the very specific and small problem I described in my original
email, a problem I would like to understand better before delivering a suitable
fix.
I think we should focus on the original problem. Now you have the context… 😉
This is what I understand, and please correct me if I am wrong: in a normal text
flow, we would expect a text section/paragraph to end with a period or something
similar.
On the page I posted in my original message, there is no ending period:
[code]
// Extracted.html
<i>IDEC-102
</i></p>
<p>Rivaroxaban. Ci9Hi8ClN3O5S. 5-Chloro-N-({(5S)-2-oxo-3-
[/code]
And because I am simply appending the text delivered by Tika, in the example above
IDEC-102 gets merged with Rivaroxaban. As I said, one possible way to get rid
of the problem would be to use the following in my custom content handler:
[code]
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
    if (length > 0) {
        builder.append(ch, start, length);
    }
}
[/code]
Instead of:
[code]
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
    // We ignore whitespace
}
[/code]
But this does not feel very natural to me. When switching to a new
section/paragraph, I would expect Tika to give me a new line or a space, but NOT
as ignorable whitespace. Usually, within a given section/paragraph, I do get a
trailing space for each sentence, right?
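If the separator really is only delivered as ignorable whitespace, the workaround I would lean towards instead is to append one myself when a block element closes, along these lines (just a sketch; the element list is only an example, and builder is the same StringBuilder used in characters()):
[code]
// Sketch of the alternative fix in my custom content handler: treat the end of a
// block-level element as the boundary instead of relying on ignorable whitespace.
@Override
public void endElement(String uri, String localName, String qName) {
    if ("p".equals(localName) || "div".equals(localName)) {
        builder.append('\n');
    }
}
[/code]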
Is my problem now clearer?
Thanks a lot for your time and your patience,
christian
From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 20:09
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
We add <div class="page">.*</div> markers in our xhtml. Would that meet your
needs? Parse the xhtml and send to Elastic? Or are you looking to send data
per page directly to Elasticsearch during the parse?
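Roughly (untested sketch; method and variable names are just for illustration):
[code]
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public static String toPageMarkedXhtml(InputStream stream) throws Exception {
    ToXMLContentHandler handler = new ToXMLContentHandler();
    new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    // The returned xhtml contains one <div class="page">...</div> per PDF page.
    return handler.toString();
}
[/code]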
On Mon, Oct 31, 2022 at 1:43 PM Christian Ribeaud <[email protected]> wrote:
Hi Tim,
Sorry for not being clear enough. I want to index each page as a separate document
in OpenSearch (formerly Elasticsearch).
The page count and language are relevant for the book metadata only (which is
stored in a DynamoDB table).
Cheers,
christian
From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 18:37
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
I'm sorry. I'm missing the context. What are you trying to accomplish? Do
you want to index each page as a separate document in Elasticsearch? Or, is
the langid + pagecount critical for your needs and somehow you need to create
your own handler for those?
On Mon, Oct 31, 2022 at 1:16 PM Christian Ribeaud <[email protected]> wrote:
Good evening,
Thanks for the prompt answer. AFAIR (the project is old, but the problem is new),
I needed a mechanism to process pages in batches.
The software handles huge books (I think the biggest ones are around half a GB)
as a Lambda in AWS.
Due to the memory and CPU limitations of AWS Lambda, this is the way we decided
to go.
Does Tika already offer such a content handler? If not, which strategy would you
suggest?
We have our custom TikaPageContentHandler, which is plugged in as follows:
[code]
public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes, InputStream stream,
        long fileSize, String fileName) throws TikaException, IOException, SAXException {
    String baseName = FilenameUtils.getBaseName(fileName);
    final int bulkSize = getElasticBulkSize();
    URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
    assert tikaConfigUrl != null : "Unspecified Tika configuration";
    TikaPageContentHandler tikaPageContentHandler =
            new TikaPageContentHandler(elasticsearchClient, baseName, bulkSize);
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
    LogUtils.info(LOG, () -> String.format("Using following PDF parser configuration '%s'.",
            ToStringBuilder.reflectionToString(pdfParserConfig, ToStringStyle.MULTI_LINE_STYLE)));
    // Overrides the default values specified in 'tika-config.xml'
    parseContext.set(PDFParserConfig.class, pdfParserConfig);
    TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
    // Auto-detecting parser, so we can theoretically handle any document.
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    parseContext.set(Parser.class, parser);
    parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
    int pageCount = tikaPageContentHandler.getPageCount();
    LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document identified by ID '%s' have been submitted.",
            tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
    LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
    String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
    // Put an entry into DynamoDB.
    putItem(baseName, fileSize, pageCount, language);
}
[/code]
Christian
From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 16:22
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
Y, I agree with Nick. Tika appears to add a new line in the correct spot at
least for IDEC-102...
On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my
> problem to following code snippet:
Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should take care of everything for you: paragraphs,
plain text vs HTML, etc.
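For example, a minimal sketch of what I mean (here BodyContentHandler; untested, names just for illustration):
[code]
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public static String extractPlainText(InputStream stream) throws Exception {
    // -1 disables the default write limit on the extracted text
    BodyContentHandler handler = new BodyContentHandler(-1);
    new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    return handler.toString();
}
[/code]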
Nick