I didn't find any language metadata either, but I thought Tika's
language detector extension might be able to provide it:

org.apache.tika.language.detect.LanguageDetector
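
Since the setup in this thread is a running tika server, one way to try the detector without writing against the Java class above might be the server's language endpoint. The following is only a minimal sketch under some assumptions: that the server listens on http://localhost:9998, and that it exposes PUT /language/string as described in the Tika server documentation. I have not verified this against 2.7. The class name LanguageRequestSketch and the method buildRequest are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class LanguageRequestSketch {

    // Builds a PUT request that carries already-extracted text to the
    // (assumed) /language/string endpoint of a tika server.
    static HttpRequest buildRequest(String serverBase, String text) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/language/string"))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .PUT(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: LanguageRequestSketch <text> [serverBase]");
            return;
        }
        // Assumption: default tika server address; override with a second argument.
        String serverBase = args.length > 1 ? args[1] : "http://localhost:9998";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(serverBase, args[0]), HttpResponse.BodyHandlers.ofString());
        // Print the status code and the raw response body (the server's language guess).
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The text passed in would be whatever rmeta/text returned for the document, so the language comes from detection on the extracted content rather than from a field stored in the file.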

On Wed, Apr 12, 2023, 22:38 Tim Allison <[email protected]> wrote:

> I'm not seeing language hints in the document.xml within the docx nor
> in the metadata.  Do you know where it might be stored inside the
> docx?
>
> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <[email protected]> wrote:
> >
> > I am calling Tika through the rmeta/text endpoint of tika server 2.7.
> > Yes, by language detection I mean any metadata field that shows the
> > language in which the document is written. For example, the attached
> > document contains Spanish content, so we would expect the metadata
> > to include Content-Language:"es".
> >
> >
> >
> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <[email protected]> wrote:
> >>
> >> How are you calling Tika?  By "language", do you mean language
> >> detection on the extracted text or an internal metadata flag that says
> >> "I'm X language"?
> >>
> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > After parsing, Tika does not return the language as part of the
> >> > parsing result for some documents, such as .docx and .msg files.
> >> > Below is an example document.
> >> > Please assist.
>
