Sorry, I'm missing some context. What are you trying to accomplish? Do you want to index each page as a separate document in Elasticsearch? Or are the langid + pagecount critical for your needs, so that you'd somehow need to create your own handler for those?
On Mon, Oct 31, 2022 at 1:16 PM Christian Ribeaud <[email protected]> wrote:

> Good evening,
>
> Thanks for the prompt answer. AFAIR (the project is old, but the problem is new) I needed a mechanism to process pages in batches.
>
> The software is handling huge books (I think the biggest ones are around half a GB) as a *Lambda* in *AWS*. Due to the memory and CPU limitations of *AWS Lambda*, this is the way we decided to go.
>
> Does *Tika* already offer such a content handler? If not, which strategy would you then suggest?
>
> We have our custom *TikaPageContentHandler*, which is plugged in as follows:
>
> [code]
> public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes, InputStream stream,
>         long fileSize, String fileName) throws TikaException, IOException, SAXException {
>     String baseName = FilenameUtils.getBaseName(fileName);
>     final int bulkSize = getElasticBulkSize();
>     URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
>     assert tikaConfigUrl != null : "Unspecified Tika configuration";
>     TikaPageContentHandler tikaPageContentHandler =
>             new TikaPageContentHandler(elasticsearchClient, baseName, bulkSize);
>     Metadata metadata = new Metadata();
>     ParseContext parseContext = new ParseContext();
>     PDFParserConfig pdfParserConfig = new PDFParserConfig();
>     pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
>     LogUtils.info(LOG, () -> String.format("Using following PDF parser configuration '%s'.",
>             ToStringBuilder.reflectionToString(pdfParserConfig, ToStringStyle.MULTI_LINE_STYLE)));
>     // Overrides the default values specified in 'tika-config.xml'
>     parseContext.set(PDFParserConfig.class, pdfParserConfig);
>     TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
>     // Auto-detecting parser, so we are theoretically able to handle any document.
>     AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>     parseContext.set(Parser.class, parser);
>     parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
>     int pageCount = tikaPageContentHandler.getPageCount();
>     LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document identified by ID '%s' have been submitted.",
>             tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
>     LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
>     String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
>     // Put an entry into DynamoDB.
>     putItem(baseName, fileSize, pageCount, language);
> }
> [/code]
>
> Christian
>
> *From:* Tim Allison <[email protected]>
> *Date:* Monday, 31 October 2022 at 16:22
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Paragraph words getting merged
>
> Y, I agree with Nick. Tika appears to add a new line in the correct spot, at least for IDEC-102...
>
> On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:
>> On Sun, 30 Oct 2022, Christian Ribeaud wrote:
>>> I am using the default configuration. I think we could reduce my problem to the following code snippet:
>>
>> Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with paragraphs, plain text vs html etc
>>
>> Nick
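For anyone following along: a per-page batching handler of the kind described above can be sketched with plain JDK SAX, since Tika's PDF parser emits XHTML in which each page is wrapped in a `<div class="page">` element. This is only an illustration, not the poster's actual *TikaPageContentHandler* — the class name `PageBatchingHandler`, the `bulkSize` constructor argument, and the `flush()` stand-in for the Elasticsearch bulk upload are all assumptions for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of a SAX handler that batches page text on the per-page
// <div class="page"> markers in Tika's XHTML output for PDFs.
// Assumes pages contain no nested <div> elements, for simplicity.
public class PageBatchingHandler extends DefaultHandler {

    private final int bulkSize;
    private final List<String> batch = new ArrayList<>();
    private final StringBuilder currentPage = new StringBuilder();
    private int pageCount = 0;
    private boolean inPage = false;

    public PageBatchingHandler(int bulkSize) {
        this.bulkSize = bulkSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("div".equals(qName) && "page".equals(attrs.getValue("class"))) {
            inPage = true;
            currentPage.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inPage && "div".equals(qName)) {
            inPage = false;
            pageCount++;
            batch.add(currentPage.toString().trim());
            if (batch.size() >= bulkSize) {
                flush();
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPage) {
            currentPage.append(ch, start, length);
        }
    }

    @Override
    public void endDocument() {
        flush(); // submit any trailing partial batch
    }

    private void flush() {
        if (!batch.isEmpty()) {
            // In the real handler this would be a bulk request to Elasticsearch.
            System.out.println("Submitting batch of " + batch.size() + " page(s)");
            batch.clear();
        }
    }

    public int getPageCount() {
        return pageCount;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML stream that Tika's PDF parser would produce.
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>Page one text</p></div>"
                + "<div class=\"page\"><p>Page two text</p></div>"
                + "</body></html>";
        PageBatchingHandler handler = new PageBatchingHandler(10);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println("Pages: " + handler.getPageCount());
    }
}
```

In the real setup this handler would be passed to `parser.parse(...)` directly (Tika content handlers are SAX `ContentHandler`s), so the same page-boundary logic applies without any intermediate XHTML string.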
