Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
Hi Tim, This is what I am actually doing: I’m parsing the XHTML and collecting the pages by identifying the corresponding DIVs. This works really nicely. And, actually, the whole engine is working nicely. I’m just having this very specific and small problem I described in my original email, a
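The per-page approach described here can be sketched roughly as follows. This is a minimal illustration, not Christian's actual code: it assumes the XHTML marks each page with `<div class="page">` and wraps paragraphs in `<p>`, and the names `PageSplitter`, `PageHandler`, and `splitPages` are hypothetical. It uses only the JDK's SAX parser, not Tika itself. Appending a line break at the end of every `<p>` is one way to keep the last word of a paragraph from merging with the first word of the next.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PageSplitter {

    /** Collects the text of each <div class="page"> as one string per page. */
    static class PageHandler extends DefaultHandler {
        final List<String> pages = new ArrayList<>();
        private StringBuilder current; // null while outside a page div
        private int depth;             // div nesting depth inside the current page

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (current == null) {
                // Assumption: pages are marked as <div class="page">.
                if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
                    current = new StringBuilder();
                    depth = 1;
                }
            } else if ("div".equals(qName)) {
                depth++; // nested div inside the current page
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;
            }
            if ("p".equals(qName)) {
                current.append('\n'); // paragraph break so adjacent words stay apart
            } else if ("div".equals(qName) && --depth == 0) {
                // Page div closed: normalize whitespace and store the page text.
                pages.add(current.toString().trim().replaceAll("\\s+", " "));
                current = null;
            }
        }
    }

    /** Parses an XHTML string and returns the text of each page, in order. */
    public static List<String> splitPages(String xhtml) throws Exception {
        PageHandler handler = new PageHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.pages;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>first paragraph</p><p>second paragraph</p></div>"
                + "<div class=\"page\"><p>next page</p></div>"
                + "</body></html>";
        splitPages(xhtml).forEach(System.out::println);
    }
}
```

Each returned string could then be indexed as its own document. Without the `'\n'` appended in `endElement`, the first page in the sample input would come out as "first paragraphsecond paragraph".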

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
We add <div class="page">.*</div> markers in our xhtml. Would that meet your needs? Parse the xhtml and send to Elastic? Or are you looking to send data per page directly to Elasticsearch during the parse? On Mon, Oct 31, 2022 at 1:43 PM Christian Ribeaud < christian.ribe...@karakun.com> wrote: > Hi Tim, > > > >

Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
Hi Tim, Sorry for not being clear enough. I want to index each page as a separate document in OpenSearch (aka ex-Elasticsearch). The page count and language are relevant for the book metadata only (which gets stored in a DynamoDB table). Cheers, christian From: Tim Allison Date: Monday, 31

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
I'm sorry. I'm missing the context. What are you trying to accomplish? Do you want to index each page as a separate document in Elasticsearch? Or, is the langid + pagecount critical for your needs and somehow you need to create your own handler for those? On Mon, Oct 31, 2022 at 1:16 PM

Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
Good evening, Thanks for the prompt answer. AFAIR (the project is old, but the problem is new) I needed a mechanism to process pages in batches. The software is handling huge books (I think the biggest ones are around half a GB) as a Lambda in AWS. Due to the memory and CPU limitations of AWS

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
Y, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102... On Mon, Oct 31, 2022 at 9:22 AM Nick Burch wrote: > On Sun, 30 Oct 2022, Christian Ribeaud wrote: > > I am using the default configuration. I think, we could reduce my > > problem to following code

Re: Paragraph words getting merged

2022-10-31 Thread Nick Burch
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think we could reduce my problem to the following code snippet:

Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with