See an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117
where the 'ngramsize' parameter is set; it's declared in the
AggregateSentenceAE.xml descriptor and is then overridden with the given
actual value.
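
If it helps, the way the patch works (as far as I can tell from the diff
above) is that any extra attribute on the <tokenizer> element, beyond
descriptorPath / tokenType / featurePath, is forwarded to the underlying
AnalysisEngine as a configuration parameter override. So a sketch (the
ngramsize value here is just illustrative):

```xml
<tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
  descriptorPath="/uima/AggregateSentenceAE.xml"
  tokenType="org.apache.uima.TokenAnnotation"
  featurePath="posTag"
  ngramsize="2"/>
```

The parameter name has to match a parameter declared in the descriptor.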
HTH,

Tommaso


2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>

> Regarding configuration parameters have a look at
> https://issues.apache.org/jira/browse/LUCENE-4749
> Regards,
> Tommaso
>
>
> 2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>
>
>> Thanks Kai for your feedback, I'll look into it and let you know.
>> Regards,
>> Tommaso
>>
>>
>> 2013/2/1 Kai Gülzau <kguel...@novomind.com>
>>
>>> I now use the "stupid" way to use the German corpus with UIMA: copy +
>>> paste :-)
>>>
>>> I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
>>> ...
>>> <fileResourceSpecifier>
>>>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
>>> </fileResourceSpecifier>
>>> ...
>>> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
>>>
>>>
>>> Next step is to replace every occurrence of "HmmTagger" in
>>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
>>> with "HmmTaggerDE" and save it as
>>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
>>>
>>> This can be used in your schema.xml:
>>> <fieldType name="uima_nouns_de" class="solr.TextField"
>>> positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>>       descriptorPath="/uima/AggregateSentenceDEAE.xml"
>>> tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
>>>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>>> types="/uima/whitelist_de.txt" />
>>>   </analyzer>
>>> </fieldType>
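>>>
>>> FWIW, /uima/whitelist_de.txt is just one type (here: a posTag value)
>>> per line. A minimal sketch, assuming the tagger emits STTS tags and
>>> that nouns are what I want to keep (the tag names are my assumption,
>>> not from a shipped file):
>>>
>>> NN
>>> NE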
>>>
>>> There should be a way to accomplish this via config though.
>>>
>>>
>>>
>>> Last open issue: Performance!
>>>
>>> First run via the Admin GUI analysis page, index value "Klaus geht in
>>> das Haus und sieht eine Maus." / query: "": ~5 seconds
>>> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>>
>>> Second run via the Admin GUI analysis page, index value "Klaus geht in
>>> das Haus und sieht eine Maus." / query: "": ~4 seconds
>>> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information:
>>> "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>>  Information: "Whitespace tokenizer finished processing"
>>>
>>> Initialized 3 times?
>>> I think some of the components are not reused during analysis.
>>>
>>> Is this a known issue?
>>>
>>>
>>> Regards,
>>>
>>> Kai Gülzau
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Kai Gülzau [mailto:kguel...@novomind.com]
>>> Sent: Thursday, January 31, 2013 6:48 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>>>
>>> UIMA:
>>>
>>> I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
>>> Now I am able to use this analyzer for English texts and filter
>>> (un)wanted token types :-)
>>>
>>> <fieldType name="uima_nouns_en" class="solr.TextField"
>>> positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>>       descriptorPath="/uima/AggregateSentenceAE.xml"
>>> tokenType="org.apache.uima.TokenAnnotation"
>>>       featurePath="posTag"/>
>>>     <filter class="solr.TypeTokenFilterFactory"
>>> types="/uima/stoptypes.txt" />
>>>   </analyzer>
>>> </fieldType>
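>>>
>>> By the way, /uima/stoptypes.txt holds one token type (here: a posTag
>>> value) per line to filter out; a sketch with a few Penn Treebank tags
>>> (the concrete entries are my choice, not from a shipped file):
>>>
>>> DT
>>> IN
>>> CC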
>>>
>>> Open issue: how do I set the Tagger's ModelFile to
>>> "german/TuebaModel.dat"?
>>>
>>>
>>> Kai Gülzau
>>>
>>>
>>
>
