I am using the Tika 2.7 server standard runnable jar, which I believe has a built-in language detection feature. Do we need to do any other configuration, or install any other extension, in order to get language detection as shown below?
[image: image.png]

Please assist.

Thanks

On Fri, Apr 14, 2023, 22:05 Chetan Bikire <[email protected]> wrote:

> I too didn't find any metadata for language, but thought using tika
> language detector extension can be able to get it.
>
> org.apache.tika.language.detect.LanguageDetector
>
> On Wed, Apr 12, 2023, 22:38 Tim Allison <[email protected]> wrote:
>
>> I'm not seeing language hints in the document.xml within the docx nor
>> in the metadata. Do you know where it might be stored inside the
>> docx?
>>
>> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <[email protected]> wrote:
>> >
>> > I am calling tika using rmeta/text endpoint by running tika server 2.7.
>> > Yes, language detection means any metadata field which shows language in which document is written.
>> > like for example- in our case attached document contains spanish content in it then metadata Content-Language:"es"
>> >
>> >
>> >
>> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <[email protected]> wrote:
>> >>
>> >> How are you calling Tika? By "language", do you mean language
>> >> detection on the extracted text or an internal metadata flag that says
>> >> "I'm X language"?
>> >>
>> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <[email protected]> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > After parsing documents tika does not return language as part parsing result for some of the documents like docx,.msg files.
>> >> > Below is the example document.
>> >> > please assist.
>> >
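
For reference, the distinction raised in the thread (language detection on extracted text versus a metadata flag stored in the file) can be illustrated by asking a running Tika server for a language guess on text directly. This is only a minimal Python sketch, not a confirmed fix for the problem above. It assumes the server listens on the default port 9998 and exposes the `/language/stream` endpoint described in the tika-server documentation; `build_language_request` and `detect_language` are hypothetical helper names introduced here.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_language_request(text: str, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends UTF-8 text to the language endpoint."""
    # Assumption: /language/stream accepts the raw text as the request body.
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/language/stream",
        data=text.encode("utf-8"),
        method="PUT",
    )


def detect_language(text: str, base_url: str = TIKA_URL) -> str:
    """Send the text to the server and return the response body as a string."""
    request = build_language_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8").strip()
```

If the endpoint is available in the running build, calling `detect_language` on text previously extracted through rmeta/text should return a short language code. Whether the same value can be made to appear in the rmeta metadata is a separate configuration question that this sketch does not address.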
