Re: Paragraph words getting merged

2022-11-01 Thread Christian Ribeaud
Hi Tim, Thank you so much for clarifying that part for me. THAT is really useful. Kindest regards, christian From: Tim Allison Date: Tuesday, 1 November 2022 at 17:09 To: user@tika.apache.org Subject: Re: Paragraph words getting merged Sorry. Took a while to make time to look in detail. Yes

Re: Paragraph words getting merged

2022-11-01 Thread Tim Allison
Best, christian From: Tim Allison Date: Monday, 31 October 2022 at 16:22 To: user@tika.apache.org Subject: Re: Paragraph words getting merged Y, I agree with Nick. Tika appears to add a new line in the correct spot at

Re: Paragraph words getting merged

2022-11-01 Thread Christian Ribeaud
words getting merged Y, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102... On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <n...@apache.org> wrote: On Sun, 30 Oct 2022, Christian Ribeaud wrote: I am using the default configuration. I think,

Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
lly and within a given section/paragraph I get an ending space for each sentence, right? Is my problem now clearer? Thanks a lot for your time and your patience, christian From: Tim Allison Date: Monday, 31 October 2022 at 20:09 To: user@tika.apache.org Subject: Re: Paragraph words getting merged

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
christian From: Tim Allison Date: Monday, 31 October 2022 at 18:37 To: user@tika.apache.org Subject: Re: Paragraph words getting merged I'm sorry. I'm missing the context. What are you trying to accomplish?

Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
October 2022 at 18:37 To: user@tika.apache.org Subject: Re: Paragraph words getting merged I'm sorry. I'm missing the context. What are you trying to accomplish? Do you want to index each page as a separate document in Elasticsearch? Or, is the langid + pagecount critical for your needs and somehow

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
ageCount, baseName)); LanguageResult languageResult = tikaPageContentHandler.getLanguageResult(); String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null; // Put an entry into DynamoDb. putIt

Re: Paragraph words getting merged

2022-10-31 Thread Christian Ribeaud
anguageResult.getLanguage() : null; // Put an entry into DynamoDb. putItem(baseName, fileSize, pageCount, language); } [/code] Christian From: Tim Allison Date: Monday, 31 October 2022 at 16:22 To: user@tika.apache.org Subject: Re: Paragraph words getting merged Y, I agree with Nick. Tika

Re: Paragraph words getting merged

2022-10-31 Thread Tim Allison
Y, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102... On Mon, Oct 31, 2022 at 9:22 AM Nick Burch wrote: On Sun, 30 Oct 2022, Christian Ribeaud wrote: I am using the default configuration. I think, we could reduce my problem to following code

Re: Paragraph words getting merged

2022-10-31 Thread Nick Burch
On Sun, 30 Oct 2022, Christian Ribeaud wrote: I am using the default configuration. I think, we could reduce my problem to following code snippet: Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with