Sorry, I'm missing some context. What are you trying to accomplish? Do you want to index each page as a separate document in Elasticsearch? Or are the langid + pagecount critical for your needs, so that you'd somehow need to create your own handler for those?
On Mon, Oct 31, 2022 at 1:16 PM Christian Ribeaud <[email protected]> wrote:

> Good evening,
>
> Thanks for the prompt answer. AFAIR (the project is old, but the problem is new) I needed a mechanism to process pages in batches.
>
> The software is handling huge books (I think the biggest ones are around half a GB) as a *Lambda* in *AWS*. Due to the memory and CPU limitations of *AWS Lambda*, this is the way we decided to go.
>
> Does *Tika* already offer such a content handler? If not, which strategy would you then suggest?
>
> We have our custom *TikaPageContentHandler*, which is plugged in as follows:
>
> [code]
> public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes, InputStream stream,
>         long fileSize, String fileName) throws TikaException, IOException, SAXException {
>     String baseName = FilenameUtils.getBaseName(fileName);
>     final int bulkSize = getElasticBulkSize();
>     URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
>     assert tikaConfigUrl != null : "Unspecified Tika configuration";
>     TikaPageContentHandler tikaPageContentHandler =
>             new TikaPageContentHandler(elasticsearchClient, baseName, bulkSize);
>     Metadata metadata = new Metadata();
>     ParseContext parseContext = new ParseContext();
>     PDFParserConfig pdfParserConfig = new PDFParserConfig();
>     pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
>     LogUtils.info(LOG, () -> String.format("Using following PDF parser configuration '%s'.",
>             ToStringBuilder.reflectionToString(pdfParserConfig, ToStringStyle.MULTI_LINE_STYLE)));
>     // Overrides the default values specified in 'tika-config.xml'
>     parseContext.set(PDFParserConfig.class, pdfParserConfig);
>     TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
>     // Auto-detecting parser, so we are theoretically able to handle any document.
>     AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>     parseContext.set(Parser.class, parser);
>     parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
>     int pageCount = tikaPageContentHandler.getPageCount();
>     LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document identified by ID '%s' have been submitted.",
>             tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
>     LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
>     String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
>     // Put an entry into DynamoDB.
>     putItem(baseName, fileSize, pageCount, language);
> }
> [/code]
>
> Christian
>
> *From:* Tim Allison <[email protected]>
> *Date:* Monday, 31 October 2022 at 16:22
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Paragraph words getting merged
>
> Y, I agree with Nick. Tika appears to add a new line in the correct spot, at least for IDEC-102...
>
> On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <[email protected]> wrote:
>> On Sun, 30 Oct 2022, Christian Ribeaud wrote:
>>> I am using the default configuration. I think we could reduce my problem to the following code snippet:
>>
>> Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with paragraphs, plain text vs html etc
>>
>> Nick
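For anyone following along: a per-page batching handler of the kind described above can be sketched with plain JDK SAX, since Tika's PDF parser emits XHTML in which each page is wrapped in a `<div class="page">` element. This is only an illustration, not the poster's actual *TikaPageContentHandler* — the class name `PageBatchingHandler`, the `bulkSize` constructor argument, and the `flush()` stand-in for the Elasticsearch bulk upload are all assumptions for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of a SAX handler that batches page text on the per-page
// <div class="page"> markers in Tika's XHTML output for PDFs.
// Assumes pages contain no nested <div> elements, for simplicity.
public class PageBatchingHandler extends DefaultHandler {

    private final int bulkSize;
    private final List<String> batch = new ArrayList<>();
    private final StringBuilder currentPage = new StringBuilder();
    private int pageCount = 0;
    private boolean inPage = false;

    public PageBatchingHandler(int bulkSize) {
        this.bulkSize = bulkSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("div".equals(qName) && "page".equals(attrs.getValue("class"))) {
            inPage = true;
            currentPage.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inPage && "div".equals(qName)) {
            inPage = false;
            pageCount++;
            batch.add(currentPage.toString().trim());
            if (batch.size() >= bulkSize) {
                flush();
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPage) {
            currentPage.append(ch, start, length);
        }
    }

    @Override
    public void endDocument() {
        flush(); // submit any trailing partial batch
    }

    private void flush() {
        if (!batch.isEmpty()) {
            // In the real handler this would be a bulk request to Elasticsearch.
            System.out.println("Submitting batch of " + batch.size() + " page(s)");
            batch.clear();
        }
    }

    public int getPageCount() {
        return pageCount;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML stream that Tika's PDF parser would produce.
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>Page one text</p></div>"
                + "<div class=\"page\"><p>Page two text</p></div>"
                + "</body></html>";
        PageBatchingHandler handler = new PageBatchingHandler(10);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println("Pages: " + handler.getPageCount());
    }
}
```

In the real setup this handler would be passed to `parser.parse(...)` directly (Tika content handlers are SAX `ContentHandler`s), so the same page-boundary logic applies without any intermediate XHTML string.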
