Good evening,
Thanks for the prompt answer. AFAIR (the project is old, but the problem is
new) I needed a mechanism to process pages in batches.
The software processes huge books (I think the biggest ones are around half a
GB) in an AWS Lambda function.
Because of the memory and CPU limits of AWS Lambda, we decided to go with
batching.
Does Tika already offer such a content handler? If not, which strategy would you
suggest instead?
We have our own custom TikaPageContentHandler, which is plugged in as follows:
[code]
public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes,
        InputStream stream, long fileSize, String fileName)
        throws TikaException, IOException, SAXException {
    String baseName = FilenameUtils.getBaseName(fileName);
    final int bulkSize = getElasticBulkSize();
    URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
    assert tikaConfigUrl != null : "Unspecified Tika configuration";
    TikaPageContentHandler tikaPageContentHandler =
            new TikaPageContentHandler(elasticsearchClient, baseName, bulkSize);
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
    LogUtils.info(LOG, () -> String.format(
            "Using following PDF parser configuration '%s'.",
            ToStringBuilder.reflectionToString(pdfParserConfig,
                    ToStringStyle.MULTI_LINE_STYLE)));
    // Overrides the default values specified in 'tika-config.xml'
    parseContext.set(PDFParserConfig.class, pdfParserConfig);
    TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
    // Auto-detecting parser. So, we theoretically are able to handle any document.
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    parseContext.set(Parser.class, parser);
    parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
    int pageCount = tikaPageContentHandler.getPageCount();
    LogUtils.info(LOG, () -> String.format(
            "%d/%d page(s) of document identified by ID '%s' have been submitted.",
            tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
    LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
    String language = languageResult.isReasonablyCertain()
            ? languageResult.getLanguage() : null;
    // Put an entry into DynamoDb.
    putItem(baseName, fileSize, pageCount, language);
}
[/code]
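For illustration only, a page-batching handler of this kind might look roughly
like the sketch below. It is not the actual TikaPageContentHandler: the class
name PageBatchingHandler, the sink callback and the flushing logic are all
assumptions. It relies on Tika's PDF parser wrapping each page in
<div class="page">, and it appends a newline at every </p> so that words from
adjacent paragraphs are not run together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch: collects the text of each <div class="page"> element
 * and hands the pages to a sink in batches of at most batchSize.
 */
class PageBatchingHandler extends DefaultHandler {
    private final int batchSize;
    private final Consumer<List<String>> sink; // e.g. wraps a bulk index request
    private final List<String> batch = new ArrayList<>();
    private final StringBuilder page = new StringBuilder();
    private boolean inPage = false;
    private int divDepth = 0; // <div> nesting depth, counted from the page div
    private int pageCount = 0;

    PageBatchingHandler(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        if (!"div".equals(localName)) {
            return;
        }
        if (inPage) {
            divDepth++; // nested div inside the current page
        } else if ("page".equals(atts.getValue("class"))) {
            inPage = true;
            divDepth = 1;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPage) {
            page.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inPage) {
            return;
        }
        if ("p".equals(localName)) {
            page.append('\n'); // keep paragraph boundaries in the page text
        }
        if ("div".equals(localName) && --divDepth == 0) {
            // The page div has closed: store the page, flush if the batch is full.
            inPage = false;
            pageCount++;
            batch.add(page.toString());
            page.setLength(0);
            if (batch.size() >= batchSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // submit the last, possibly partial, batch
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(batch)); // copy, since batch is reused
        batch.clear();
    }

    int getPageCount() {
        return pageCount;
    }
}
```

Such a handler would be passed to parser.parse(...) in place of
tikaPageContentHandler. Only one batch of page text is held in memory at a time,
which is the point of the exercise on Lambda.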
Christian
From: Tim Allison <[email protected]>
Date: Monday, 31 October 2022 at 16:22
To: [email protected] <[email protected]>
Subject: Re: Paragraph words getting merged
Y, I agree with Nick. Tika appears to add a new line in the correct spot at
least for IDEC-102...
On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my
> problem to following code snippet:
Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should be taking care of everything for you with
paragraphs, plain text vs. HTML, etc.
Nick