Look in the ManifoldCF source tree for files named "common_en_US.properties". For each of these you will need to create a corresponding file for your specific locale (e.g. "common_en_GB.properties").
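The warning discussed below comes from the JDK's standard bundle lookup order: for locale en_GB it searches common_en_GB.properties, then common_en.properties, then common.properties, and throws MissingResourceException only if none exists. A minimal, self-contained sketch of that fallback chain (the base name "common" is illustrative; ManifoldCF's real bundles live under org.apache.manifoldcf.ui.i18n):

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocaleFallback {
    public static void main(String[] args) {
        // Ask the JDK for the candidate locales it would try for a
        // properties-format bundle with locale en_GB.
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        List<Locale> candidates =
            control.getCandidateLocales("common", new Locale("en", "GB"));

        // Prints the file names tried, most specific first:
        // common_en_GB.properties, common_en.properties, common.properties
        for (Locale l : candidates) {
            System.out.println(l.toString().isEmpty()
                ? "common.properties"
                : "common_" + l + ".properties");
        }
    }
}
```

So adding a common_en_GB.properties (or letting the lookup fall through to an existing common_en.properties) silences the warning.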
Thanks,
Karl

On Tue, Dec 18, 2018 at 2:07 AM Nikita Ahuja <[email protected]> wrote:

> Thanks Karl,
>
> But I want to know how to add these files, so that such warnings no
> longer appear and the flow runs smoothly.
>
> Is there any way to do that?
>
> Thanks,
> Nikita
>
> On Wed, Dec 12, 2018 at 4:47 PM Karl Wright <[email protected]> wrote:
>
>> Hi Nikita,
>>
>> This is occurring because en_GB does not have a translations file. It's
>> a warning, and the code falls back to using en_US.
>>
>> Karl
>>
>> On Wed, Dec 12, 2018 at 4:39 AM Nikita Ahuja <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Thanks for the suggestion; the language of the data and content can now
>>> be detected. But there is one issue while ingesting the records into the
>>> Elasticsearch index, recorded in the log file as:
>>>
>>> ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource
>>> bundle 'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't
>>> find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale
>>> en_GB; trying en
>>> java.util.MissingResourceException: Can't find bundle for base name
>>> org.apache.manifoldcf.ui.i18n.common, locale en_GB
>>>   at java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown Source) ~[?:?]
>>>   at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
>>>   at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
>>>   at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
>>>   at org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132) [mcf-core.jar:?]
>>>   at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178) [mcf-core.jar:?]
>>>   at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216) [mcf-core.jar:?]
>>>   at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343) [mcf-ui-core.jar:?]
>>>   at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119) [mcf-ui-core.jar:?]
>>>   at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67) [mcf-ui-core.jar:?]
>>>   at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]
>>>
>>> Can this be resolved by adding resource files, or does some other
>>> solution have to be adopted?
>>>
>>> On Wed, Nov 21, 2018 at 5:36 PM Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> The Tika transformer may well generate a language attribute. You would
>>>> need to check with Tika, though, to know for sure, and under what
>>>> conditions it might generate this. It should not be confused with document
>>>> format detection, which Tika definitely does in order to extract content.
>>>>
>>>> It looks like language detection in Tika either comes from document
>>>> metadata already present, or via a Java interface that you need to
>>>> explicitly call to get it. If your documents need the latter, the Tika
>>>> connector does not currently do this:
>>>>
>>>> https://tika.apache.org/1.19.1/detection.html#Language_Detection
>>>>
>>>> and
>>>>
>>>> https://tika.apache.org/1.19.1/examples.html#Language_Identification
>>>>
>>>> The documentation does not clarify whether a language attribute is
>>>> actually generated; the architecture seems more suited to plugging in
>>>> machine translators for your content. I suspect you would need to run the
>>>> output of the Tika transformer into the NullOutputConnector in order to
>>>> see what attributes are being generated, to know for sure.
>>>>
>>>> Karl
>>>>
>>>> On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Thanks for the timely replies. But I am basically concerned with the
>>>>> language detection of the .doc, .pdf, or any other data present in the
>>>>> repository.
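The "Java interface that you need to explicitly call" that Karl mentions can be tried outside ManifoldCF. A hedged sketch against the Tika 1.19 language-detection API, assuming the tika-core and tika-langdetect jars (and their Optimaize dependency) are on the classpath; the sample text is obviously illustrative:

```java
import java.io.IOException;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectLanguage {
    public static void main(String[] args) throws IOException {
        // Load the bundled Optimaize models once; the detector is reusable.
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();

        // Run detection on extracted plain text, e.g. the output of the
        // Tika transformer, and report the ISO 639-1 code plus confidence.
        LanguageResult result =
            detector.detect("This is a small sample of English text.");
        System.out.println(result.getLanguage() + " (" + result.getConfidence() + ")");
    }
}
```

Since the Tika transformation connector does not invoke this interface, the detected code would have to be attached to the document (for instance via the Metadata Adjuster, or a custom transformer) before it reaches Elasticsearch.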
>>>>>
>>>>> As per my understanding, the Tika transformation provides functionality
>>>>> for this, but there is no output for the language of the documents.
>>>>>
>>>>> The sequence used is:
>>>>> 1. Repository Connector (Web)
>>>>> 2. Tika Transformation
>>>>> 3. Metadata Adjuster
>>>>> 4. Output Connector (Elasticsearch)
>>>>>
>>>>> Is there anything being missed here for the language detection of the
>>>>> documents?
>>>>>
>>>>> On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Nikita,
>>>>>>
>>>>>> First of all, OpenNLP is a transformation connector in ManifoldCF and
>>>>>> should be enabled by default. It extracts named entities (people,
>>>>>> locations, and organizations) from documents.
>>>>>>
>>>>>> You need to download trained models to run the OpenNLP connector. You
>>>>>> can check here for that purpose: https://opennlp.apache.org/models.html
>>>>>>
>>>>>> Check here for a detailed explanation:
>>>>>> https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
>>>>>>
>>>>>> Feel free to ask any questions when you try to integrate it. Also,
>>>>>> please describe where you are stuck if you cannot get it running.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>> On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Nikita,
>>>>>>>
>>>>>>> Can you be more specific when you say "OpenNLP is not working"? All
>>>>>>> that this connector does is integrate OpenNLP as a ManifoldCF transformer.
>>>>>>> It uses a specific directory to deliver the models that OpenNLP uses to
>>>>>>> match and extract content from documents. Thus, you can provide any models
>>>>>>> you want that are compatible with the OpenNLP version we're including.
>>>>>>>
>>>>>>> Can you describe the steps you are taking and what you are seeing?
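The pretrained models Furkan points to above can be exercised standalone before wiring anything into the connector, which helps separate "model problem" from "connector problem". A hedged sketch using OpenNLP's language-detector API (1.8.3+); the model file name "langdetect-183.bin" is an assumption here, standing in for whichever model was downloaded from the models page:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class OpenNlpLangDetect {
    public static void main(String[] args) throws IOException {
        // Pretrained model downloaded from https://opennlp.apache.org/models.html;
        // the file name below is a placeholder for whatever you fetched.
        try (InputStream in = new FileInputStream("langdetect-183.bin")) {
            LanguageDetectorModel model = new LanguageDetectorModel(in);
            LanguageDetectorME detector = new LanguageDetectorME(model);

            // predictLanguage returns the best-scoring language; note that
            // this model reports ISO 639-3 codes such as "eng" or "nld".
            Language best =
                detector.predictLanguage("Dit is een voorbeeldzin in het Nederlands.");
            System.out.println(best.getLang() + " " + best.getConfidence());
        }
    }
}
```

If this works on sample text but the connector still produces nothing, the problem is likely in the model directory configuration Karl describes rather than in OpenNLP itself.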
>>>>>>>
>>>>>>> On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a query related to detecting the language of the records/data
>>>>>>>> which are going to be ingested into the output connector.
>>>>>>>>
>>>>>>>> The OpenNLP connector is being used for the detection as per the
>>>>>>>> user documentation, but it is not working appropriately. Please
>>>>>>>> suggest whether NLP has to be used; if yes, how should it be used,
>>>>>>>> or is there any other solution for this?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and Regards,
>>>>>>>> Nikita
>>>>>>>> Email: [email protected]
>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>> a "Smartshore" Company
>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>> http://www.smartshore.nl
