Re: Lexical analysis tools for German language data

Bernd Fehling Thu, 12 Apr 2012 05:09:38 -0700

Paul,

nearly two years ago I requested an evaluation license and tested BASIS Tech
Rosette for Lucene & Solr. It worked excellently, but the price was much, much
too high.


Yes, they also have compound analysis for several languages including German.
Just configure the analysis chain in Solr, set up the processing pipeline in
Rosette Language Processing (RLP), and that's it.

Example from my very old schema.xml config:

<fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-index-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postStem="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-query-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldtype>

So you just point the tokenizer to RLP and configure two RLP pipelines,
one for indexing (rlp-index-context.xml) and one for querying
(rlp-query-context.xml).
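
To actually use the field type, a field in schema.xml has to reference it. A
minimal sketch, not taken from my original schema: the field name "text_de" and
the indexed/stored settings are assumptions for illustration only.

<!-- hypothetical field; only type="text_rlp" refers to the fieldtype above -->
<field name="text_de" type="text_rlp" indexed="true" stored="true"/>

Anything indexed into or searched against such a field then runs through the
index and query analyzers of that fieldtype.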

Example from my rlp-index-context.xml config:

<contextconfig>
  <properties>
    <property name="com.basistech.rex.optimize" value="false"/>
    <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>Unicode Converter</languageprocessor>
    <languageprocessor>Language Identifier</languageprocessor>
    <languageprocessor>Encoding and Character Normalizer</languageprocessor>
    <languageprocessor>European Language Analyzer</languageprocessor>
<!--    <languageprocessor>Script Region Locator</languageprocessor>
    <languageprocessor>Japanese Language Analyzer</languageprocessor>
    <languageprocessor>Chinese Language Analyzer</languageprocessor>
    <languageprocessor>Korean Language Analyzer</languageprocessor>
    <languageprocessor>Sentence Breaker</languageprocessor>
    <languageprocessor>Word Breaker</languageprocessor>
    <languageprocessor>Arabic Language Analyzer</languageprocessor>
    <languageprocessor>Persian Language Analyzer</languageprocessor>
    <languageprocessor>Urdu Language Analyzer</languageprocessor> -->
    <languageprocessor>Stopword Locator</languageprocessor>
    <languageprocessor>Base Noun Phrase Locator</languageprocessor>
<!--    <languageprocessor>Statistical Entity Extractor</languageprocessor> -->
    <languageprocessor>Exact Match Entity Extractor</languageprocessor>
    <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
    <languageprocessor>Entity Redactor</languageprocessor>
    <languageprocessor>REXML Writer</languageprocessor>
  </languageprocessors>
</contextconfig>

As you can see, I used the "European Language Analyzer".

Bernd



On 12.04.2012 12:58, Paul Libbrecht wrote:
> Bernd,
> 
> can you please say a little more?
> I think it is OK for this list to contain some description of commercial
> solutions that satisfy a request formulated on the list.
> 
> Is there any product at BASIS Tech that provides a compound-analyzer with a 
> big dictionary of decomposed compounds in German? 
> If yes, for which domain? 
> The Google search result (I wonder if it is politically correct not to have
> yours ;-)) shows me that a certain amount of work has been done in this
> direction (e.g. Gärten matching Garten), but a precise answer to this
> question would be more helpful!
> 
> paul
> 
>
