Re: Paragraph words getting merged

Tim Allison Mon, 31 Oct 2022 12:09:05 -0700

We add <div class="page">.*</div> markers in our xhtml.  Would that meet
your needs?  Parse the xhtml and send to Elastic?  Or are you looking to
send data per page directly to Elasticsearch during the parse?


On Mon, Oct 31, 2022 at 1:43 PM Christian Ribeaud <
[email protected]> wrote:

> Hi Tim,
>
>
>
> Sorry to not be clear enough. I want to index each page as separate
> document in *OpenSearch* (aka ex-*Elasticsearch*).
>
> The page count and language are relevant for the book metadata only (which
> get stored in a *DynamoDB* table).
>
>
>
> Cheers,
>
>
>
> christian
>
>
>
>
>
> *From: *Tim Allison <[email protected]>
> *Date: *Monday, 31 October 2022 at 18:37
> *To: *[email protected] <[email protected]>
> *Subject: *Re: Paragraph words getting merged
>
> I'm sorry.  I'm missing the context.  What are you trying to accomplish?
> Do you want to index each page as a separate document in Elasticsearch?
> Or, is the langid + pagecount critical for your needs and somehow you need
> to create your own handler for those?
>
>
>
> On Mon, Oct 31, 2022 at 1:16 PM Christian Ribeaud <
> [email protected]> wrote:
>
> Good evening,
>
>
>
> Thanks for the prompt answer. AFAIR (the project is old, but the problem
> is new) I needed a mechanism to process pages in batches.
>
>
>
> The software is handling huge books (I think, the biggest ones are around
> half GB) as *Lambda* in *AWS*.
>
> Due to the memory and CPU limitations of *AWS Lambda*, this is the way we
> decided to go.
>
>
>
> Does *Tika* already offer such content handler? If not, which strategy
> would you then suggest?
>
>
>
> We have our custom *TikaPageContentHandler*, which is plugged as
> following:
>
>
>
> [code]
>
> public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes,
> InputStream stream, long fileSize, String fileName) throws TikaException,
> IOException, SAXException {
>
>     String baseName = FilenameUtils.getBaseName(fileName);
>
>     final int bulkSize = getElasticBulkSize();
>
>     URL tikaConfigUrl =
> TikaLambda.class.getResource("/config/tika-config.xml");
>
>     assert tikaConfigUrl != null : "Unspecified Tika configuration";
>
>     TikaPageContentHandler tikaPageContentHandler = new
> TikaPageContentHandler(elasticsearchClient, baseName,
>
>             bulkSize);
>
>     Metadata metadata = new Metadata();
>
>     ParseContext parseContext = new ParseContext();
>
>     PDFParserConfig pdfParserConfig = new PDFParserConfig();
>
>     pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
>
>     LogUtils.info(LOG, () -> String.format("Using following PDF parser
> configuration '%s'.",
>
>             ToStringBuilder.reflectionToString(pdfParserConfig,
> ToStringStyle.MULTI_LINE_STYLE)));
>
>     // Overrides the default values specified in 'tika-config.xml'
>
>     parseContext.set(PDFParserConfig.class, pdfParserConfig);
>
>     TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
>
>     // Auto-detecting parser. So, we theoretically are able to handle any
> document.
>
>     AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>
>     parseContext.set(Parser.class, parser);
>
>     parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
>
>     int pageCount = tikaPageContentHandler.getPageCount();
>
>     LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document
> identified by ID '%s' have been submitted.",
>
>             tikaPageContentHandler.getSubmittedPageCount(), pageCount,
> baseName));
>
>     LanguageResult languageResult =
> tikaPageContentHandler.getLanguageResult();
>
>     String language = languageResult.isReasonablyCertain() ?
> languageResult.getLanguage() : null;
>
>     // Put an entry into DynamoDb.
>
>     putItem(baseName, fileSize, pageCount, language);
>
> }
>
> [/code]
>
>
>
> Christian
>
>
>
> *From: *Tim Allison <[email protected]>
> *Date: *Monday, 31 October 2022 at 16:22
> *To: *[email protected] <[email protected]>
> *Subject: *Re: Paragraph words getting merged
>
> Y, I agree with Nick. Tika appears to add a new line in the correct spot
> at least for IDEC-102...
>
>
>
> On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:
>
> On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> > I am using the default configuration. I think, we could reduce my
> > problem to following code snippet:
>
> Is there a reason that you aren't using one of the built-in Tika content
> handlers? Generally they should be taking care of everything for you with
> paragraphs, plain text vs html etc
>
> Nick
>
>

Re: Paragraph words getting merged

Reply via email to