Thank you, will look into this.

On Mon, Apr 17, 2023, 20:53 Tim Allison <[email protected]> wrote:
> Two options:
> 1) send the extracted text to the /language endpoint
> (https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-LanguageResource).
> 2) If you are using the /rmeta endpoint or the json output from the /tika
> endpoint, you can get language id from a slightly different lang id
> mechanism via tika-eval. Add the tika-eval.jar to your class path (see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaServer 's section
> titled "Integration with tika-eval").
>
> On Mon, Apr 17, 2023 at 8:16 AM Chetan Bikire <[email protected]> wrote:
>
>> As I am using the tika 2.7 server standard runnable jar package
>> and which has a built-in language detection feature I believe, do we need
>> to do any other configuration or need to install any other extension in
>> order to achieve language detection as mentioned below.
>>
>> [image: image.png]
>>
>> Please assist.
>> Thanks
>>
>> On Fri, Apr 14, 2023, 22:05 Chetan Bikire <[email protected]> wrote:
>>
>>> I too didn't find any metadata for language, but thought using tika
>>> language detector extension can be able to get it.
>>>
>>> org.apache.tika.language.detect.LanguageDetector
>>>
>>> On Wed, Apr 12, 2023, 22:38 Tim Allison <[email protected]> wrote:
>>>
>>>> I'm not seeing language hints in the document.xml within the docx nor
>>>> in the metadata. Do you know where it might be stored inside the
>>>> docx?
>>>>
>>>> On Wed, Apr 12, 2023 at 1:01 PM Chetan Bikire <[email protected]> wrote:
>>>> >
>>>> > I am calling tika using rmeta/text endpoint by running tika server 2.7.
>>>> > Yes, language detection means any metadata field which shows language
>>>> > in which document is written.
>>>> > like for example- in our case attached document contains spanish
>>>> > content in it then metadata Content-Language:"es"
>>>> >
>>>> > On Wed, Apr 12, 2023 at 8:32 PM Tim Allison <[email protected]> wrote:
>>>> >>
>>>> >> How are you calling Tika? By "language", do you mean language
>>>> >> detection on the extracted text or an internal metadata flag that says
>>>> >> "I'm X language"?
>>>> >>
>>>> >> On Wed, Apr 12, 2023 at 10:48 AM Chetan Bikire <[email protected]> wrote:
>>>> >> >
>>>> >> > Hi,
>>>> >> >
>>>> >> > After parsing documents tika does not return language as part
>>>> >> > parsing result for some of the documents like docx,.msg files.
>>>> >> > Below is the example document.
>>>> >> > please assist.
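
For reference, option 1 from Tim's reply (send the extracted text to the /language endpoint) might be sketched in Python roughly as follows. This is only an illustration under assumptions not stated in the thread: the server is assumed to run at http://localhost:9998, the text is assumed to be accepted as a PUT body on /language/stream, and the extracted text is assumed to sit under the "X-TIKA:content" key of the /rmeta/text JSON. The helper names (extract_text, build_language_request, detect_language) are hypothetical.

```python
import json
import urllib.request

# Assumption: tika-server is running locally on its usual default port.
TIKA_URL = "http://localhost:9998"


def extract_text(rmeta_json: str) -> str:
    """Join the "X-TIKA:content" values from an /rmeta/text JSON response."""
    records = json.loads(rmeta_json)
    return "\n".join(r.get("X-TIKA:content", "") for r in records).strip()


def build_language_request(text: str, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request carrying plain text for the language resource.

    Assumption: the resource is exposed at /language/stream.
    """
    return urllib.request.Request(
        url=f"{base_url}/language/stream",
        data=text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def detect_language(text: str, base_url: str = TIKA_URL) -> str:
    """Send the text to the server and return whatever language code it reports."""
    with urllib.request.urlopen(build_language_request(text, base_url)) as resp:
        return resp.read().decode("utf-8").strip()
```

With a server running, the call would be along the lines of detect_language(extract_text(body)), where body is the JSON returned by /rmeta/text for the document. The detected code then has to be attached to the metadata on the client side, since the parse result itself does not carry it.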
