I didn't find any language metadata either, but I think the Tika language detector extension should be able to detect it:
org.apache.tika.language.detect.LanguageDetector

On Wed, Apr 12, 2023, 22:38 Tim Allison <[email protected]> wrote:

> I'm not seeing language hints in the document.xml within the docx nor
> in the metadata. Do you know where it might be stored inside the
> docx?
>
> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <[email protected]> wrote:
> >
> > I am calling tika using rmeta/text endpoint by running tika server 2.7.
> > Yes, language detection means any metadata field which shows language in which document is written.
> > like for example- in our case attached document contains spanish content in it then metadata Content-Language:"es"
> >
> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <[email protected]> wrote:
> >>
> >> How are you calling Tika? By "language", do you mean language
> >> detection on the extracted text or an internal metadata flag that says
> >> "I'm X language"?
> >>
> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > After parsing documents tika does not return language as part parsing result for some of the documents like docx,.msg files.
> >> > Below is the example document.
> >> > please assist.
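
Since the documents in this thread are parsed through tika-server rather than the Java API, here is a minimal sketch of running language detection on already-extracted text over HTTP. It rests on assumptions not stated in the thread: that the server listens on http://localhost:9998 and that its language resource is reachable at /language/string and accepts a PUT with the raw text as the body. The function names build_language_request and detect_language are made up for this illustration.

```python
import urllib.request

# Assumption: tika-server runs locally on its usual default port.
TIKA_URL = "http://localhost:9998"


def build_language_request(text: str, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the text to detect."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/language/string",  # assumed endpoint path
        data=text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def detect_language(text: str, base_url: str = TIKA_URL) -> str:
    """Send the text to the server and return the response body, stripped.

    The body is expected to be a short language code, but that depends on
    the server version and on which language detector it has loaded.
    """
    request = build_language_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8").strip()
```

With a server running, the text returned by the rmeta/text call (the X-TIKA:content field) could be passed to detect_language(...) and the result stored next to the other metadata, which gives roughly the Content-Language value asked for above.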
