Hi Tim,
Thank you so much for clarifying that part for me. That is really useful.
Kindest regards,
christian
From: Tim Allison
Date: Tuesday, 1 November 2022 at 17:09
To: user@tika.apache.org
Subject: Re: Paragraph words getting merged
Sorry. Took a while to make time to look in detail. Yes
; Best,
>
> christian
>
> *From: *Tim Allison
> *Date: *Monday, 31 October 2022 at 16:22
> *To: *user@tika.apache.org
> *Subject: *Re: Paragraph words getting merged
>
Y, I agree with Nick. Tika appears to add a new line in the correct spot at
least for IDEC-102...
On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <n...@apache.org> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think,
lly and within a given section/paragraph I get an
ending space for each sentence, right?
Is my problem now clearer?
Thanks a lot for your time and your patience,
christian
From: Tim Allison
Date: Monday, 31 October 2022 at 20:09
To: user@tika.apache.org
Subject: Re: Paragraph words getting merged
>
> christian
>
> *From: *Tim Allison
> *Date: *Monday, 31 October 2022 at 18:37
> *To: *user@tika.apache.org
> *Subject: *Re: Paragraph words getting merged
>
I'm sorry. I'm missing the context. What are you trying to accomplish? Do
you want to index each page as a separate document in Elasticsearch? Or, is
the langid + pagecount critical for your needs and somehow
ageCount,
baseName));
LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
// Put an entry into DynamoDb.
putItem(baseName, fileSize, pageCount, language);
}
[/code]
Christian
From: Tim Allison
Date: Monday, 31 October 2022 at 16:22
To: user@tika.apache.org
Subject: Re: Paragraph words getting merged
Y, I agree with Nick. Tika appears to add a new line in the correct spot at
least for IDEC-102...
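
For readers following along, here is a minimal sketch of what "adding a new line in the correct spot" can look like at the SAX level. It is not Tika's own code and not the handler from this thread: the class name ParagraphAwareText, the list of block elements, and the sample input are all made up for illustration, and it assumes the merged words come from a custom handler that concatenates character data without looking at element boundaries.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text and emits a line break when a
// block-level element closes, so adjacent paragraphs do not run together.
public class ParagraphAwareText extends DefaultHandler {
    // Illustrative list of block elements; adjust to the markup you receive.
    private static final Set<String> BLOCKS = Set.of("p", "div", "h1", "h2", "li");
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default JAXP parser used here is not namespace-aware, so the
        // element name arrives in qName. A namespace-aware pipeline would
        // need to check localName instead.
        if (BLOCKS.contains(qName)) {
            text.append('\n');
        }
    }

    // Parses a small XHTML-like string and returns the extracted text.
    public static String extract(String xhtml) throws Exception {
        ParagraphAwareText handler = new ParagraphAwareText();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract("<body><p>First sentence.</p><p>Second one.</p></body>"));
    }
}
```

Without the endElement override, the two sample paragraphs would come out as "First sentence.Second one.", which is the kind of merging the subject line describes.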
On Mon, Oct 31, 2022 at 9:22 AM Nick Burch wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
I am using the default configuration. I think we could reduce my
problem to the following code snippet:
Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should be taking care of everything for you with