 Fri Jul 20 12:27:14 2012
@@ -0,0 +1,1483 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+<!ENTITY imgroot "images/tools/tools.textmarker/" >
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more
+       contributor license agreements. See the NOTICE file distributed with this work for
+       additional information regarding copyright ownership. The ASF licenses this file
+       to you under the Apache License, Version 2.0 (the "License"); you may not
+       use this file except in compliance with the License. You may obtain a copy
+       of the License at
+       Unless required
+       by applicable law or agreed to in writing, software distributed under
+       the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+       CONDITIONS OF ANY KIND, either express or implied. See the License for the
+       specific language governing permissions and limitations under the License. -->
+<chapter id="">
+       <title>TextMarker Workbench</title>
+       <para>
+       </para>
+       <section id="">
+               <title>Installation</title>
+               <para>
+                       <orderedlist>
+                               <listitem>
+                                       <para>Download, install and start Eclipse 3.5 or Eclipse 3.6.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>Add the Apache UIMA update site and the TextMarker update
+                                               site to the available software sites in your Eclipse
+                                               installation. This can be achieved in the "Install New
+                                               Software" dialog in the Help menu of Eclipse.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>Eclipse 3.6 only: TextMarker is currently based on DLTK 1.0.
+                                               Therefore, adding the DLTK 1.0 update site is required, since
+                                               the Eclipse 3.6 update site only supports DLTK 2.0.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>Select "Install New Software" in the Help menu of Eclipse,
+                                               if not done yet.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>Select the TextMarker update site at "Work with", deselect
+                                               "Group items by category" and select "Contact all update sites
+                                               during install to find required software".
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>Select the TextMarker feature and continue the dialog. The
+                                               CEV feature is already contained in the TextMarker feature.
+                                               Eclipse automatically installs the Apache UIMA (version 2.3)
+                                               plugins and the DLTK Core Framework (version 1.X) plugins.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>(Optional) If additional HTML visualizations are desired,
+                                               also install the CEV HTML feature. However, you need to
+                                               install the XPCom and XULRunner features beforehand, for
+                                               example by using an appropriate update site. Please refer to
+                                               the CEV installation instructions for details.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>After the successful installation, switch to the TextMarker
+                                               perspective.
+                                       </para>
+                               </listitem>
+                       </orderedlist>
+                       You can also download the TextMarker plugins and
+                       install the plugins mentioned above manually.
+               </para>
+       </section>
+       <section id="">
+               <title>TextMarker Projects</title>
+               <para>
+                       Similar to Java projects in Eclipse, the TextMarker workbench
+                       provides the possibility to create TextMarker projects. These
+                       projects require a certain folder structure that is created with the
+                       project. The most important folders are the script folder, which
+                       contains the TextMarker rule files in a package, and the descriptor
+                       folder, which contains the generated UIMA components. The input folder
+                       contains the text files or xmiCAS files that are processed when
+                       starting a TextMarker script. The result is placed in the
+                       output folder.
+                       <programlisting><![CDATA[
+  Project element              | Used for
+  Project                      | the TextMarker project
+  - script                     | source folder with TextMarker scripts
+  -- my.package                | the package, resulting in several folders
+  ---                          | a TextMarker script
+  - descriptor                 | build folder for UIMA components
+  -- my/package                | the folder structure for the components
+  --- ScriptEngine.xml         | the analysis engine of the script
+  --- ScriptTypeSystem.xml     | the type system of the script
+  -- BasicEngine.xml           | the analysis engine template for all generated engines in this project
+  -- BasicTypeSystem.xml       | the type system template for all generated type systems in this project
+  -- InternalTypeSystem.xml    | a type system with TextMarker types
+  -- Modifier.xml              | the analysis engine of the optional modifier that creates the "modified" view
+  - input                      | folder that contains the files that will be processed when launching a TextMarker script
+  -- test.html                 | an input file containing html
+  -- test.xmi                  | an input file containing text and
+  - output                     | folder that contains the files that were processed by a TextMarker script
+  -- test.html.modified.html   | the result of the modifier: replaced text and colored html
+  -- test.html.xmi             | the result CAS with optional information
+  -- test.xmi.modified.html    | the result of the modifier: replaced text and colored html
+  -- test.xmi.xmi              | the result CAS with optional information
+  - resources                  | default folder for word lists and dictionaries
+  -- Dictionary.mtwl           | a dictionary in the "multi tree word list"
+  -- FirstNames.txt            | a simple word list with first names: one first name per line
+  - test                       | test-driven development is still under
+]]></programlisting>
+               </para>
+       </section>
+       <section id="">
+               <title>Explanation</title>
+               <para>
+                       Handcrafting rules is laborious, especially if newly
+                       written rules do not behave as expected. The TextMarker system is
+                       able to record the application of every single rule and block in
+                       order to provide an explanation of the rule inference and a minimal
+                       debug functionality. The explanation component is built upon the CEV
+                       plugin. The information about the application of the rules is
+                       stored in the result xmiCAS if the parameters of the executed
+                       analysis engine are configured correctly. The simplest way to generate
+                       this information is to open a TextMarker file and click on the common
+                       "Debug" button (it looks like a green bug) in Eclipse. The current
+                       TextMarker file is then executed on the text files in the input
+                       directory, and xmiCAS files are created in the output directory that
+                       contain the additional UIMA feature structures describing the rule
+                       inference. The resulting xmiCAS needs to be opened with the CEV
+                       plugin. However, only additional views are capable of displaying the
+                       debug information. In order to open the necessary views, you can
+                       either open the "Explain" perspective or open the views individually
+                       and arrange them as you like.
+                       There are currently seven views that display information about the
+                       execution of the rules: Applied Rules, Selected Rules, Rule List,
+                       Matched Rules, Failed Rules, Rule Elements and Basic Stream.
+               </para>
+       </section>
+       <section id="">
+               <title>Dictionaries</title>
+               <para>
+                       The TextMarker system currently supports the usage of dictionaries in
+                       four different ways. The files are always encoded in UTF-8. The
+                       generated analysis engines provide a parameter
+                       that specifies the folder that contains the external dictionary
+                       files. The parameter is initially set to the resource folder of the
+                       current TextMarker project. In order to use a different folder,
+                       change the value of the parameter and rebuild all
+                       TextMarker rule files in the project so that all analysis
+                       engines are updated.
+                       The following algorithm detects the entries of a
+                       dictionary:
+                       <programlisting><![CDATA[
+for all basic annotations of the matched annotation do
+  set current candidate to current basic
+  loop
+    if the dictionary contains current candidate then
+      remember candidate
+    else if an entry of the dictionary starts with the current candidate then
+      add next basic annotation to the current candidate
+      continue loop
+    else
+      stop loop
+]]></programlisting>
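+                       The loop above can be sketched in Python as follows. This is only an
+                       illustration under stated assumptions, not the TextMarker implementation:
+                       basic annotations are modeled as a list of strings that includes whitespace
+                       tokens, the dictionary is a plain list of strings, and the function name
+                       find_entries is hypothetical. The pseudocode leaves open what happens after
+                       a candidate has been remembered; the sketch assumes that the candidate is
+                       extended further as long as a longer entry starts with it.

```python
def find_entries(basics, dictionary):
    """Return (begin, end, text) triples for dictionary entries found in basics.

    basics     -- list of token strings (assumed to include whitespace tokens)
    dictionary -- iterable of entry strings
    begin/end are token indices, end is exclusive.
    """
    entries = set(dictionary)
    found = []
    for begin in range(len(basics)):
        candidate = basics[begin]  # set current candidate to current basic
        end = begin + 1
        while True:
            if candidate in entries:
                found.append((begin, end, candidate))  # remember candidate
            # Linear scan over all entries, as with a simple word list.
            longer_entry_exists = any(
                entry.startswith(candidate) and len(entry) > len(candidate)
                for entry in entries
            )
            if longer_entry_exists and end < len(basics):
                candidate += basics[end]  # add next basic annotation
                end += 1
            else:
                break  # stop loop
    return found


if __name__ == "__main__":
    tokens = ["Peter", " ", "Pan", " ", "flies"]
    print(find_entries(tokens, ["Peter", "Peter Pan", "Pan"]))
```

+                       The prefix test scans every entry for each candidate, which corresponds to
+                       the simple word list case; the tree word lists described below avoid that scan.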
+                       Word List (.txt)
+                       Word lists are simple text files that contain one term or string per
+                       line. The strings may include white space and are separated by
+                       line breaks.
+                       Usage:
+                       Content of a file named FirstNames.txt
+                       (located in the resource folder of a
+                       TextMarker project):
+                       <programlisting><![CDATA[
+]]></programlisting>
+                       Exemplary rules:
+                       <programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.txt';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+                       In this example, all first names in the given text file are
+                       annotated in the input document with the type FirstName.
+                       Tree Word List (.twl)
+                       A tree word list is a compiled word list similar to a
+                       trie. A .twl file is an XML file that contains a tree-like structure
+                       with a node for each character. The nodes themselves refer to child
+                       nodes that represent all characters that succeed the character of the
+                       parent node. For single word entries, this results in a
+                       complexity of O(m*log(n)) instead of a complexity of O(m*n) (simple
+                       .txt file), where m is the number of basic annotations in the
+                       document and n is the number of entries in the dictionary.
+                       Usage:
+                       A .twl file is generated using the popup menu. Select one or more
+                       .txt files (or a folder containing .txt files), click the right
+                       mouse button and choose "Convert to TWL". Then, one or more .twl
+                       files are generated with the corresponding file names.
+                       Exemplary rules:
+                       <programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.twl';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+                       In this example, all first names in the given text file are again
+                       annotated in the input document with the type FirstName.
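+                       The character tree described above can be illustrated with a small
+                       in-memory trie in Python. This is a sketch under assumptions: the names
+                       TrieNode, build_trie and lookup are hypothetical, and the real .twl file is
+                       an XML serialization whose exact layout is not shown here.

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_entry = False  # True if a word list entry ends at this node


def build_trie(words):
    """Compile a word list (one entry per string) into a character tree."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True
    return root


def lookup(root, text):
    """Classify text as 'entry', 'prefix' (of some entry) or 'none'."""
    node = root
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return "none"
    return "entry" if node.is_entry else "prefix"


if __name__ == "__main__":
    trie = build_trie(["Peter", "Petra", "Paul"])
    for text in ("Peter", "Pet", "Max"):
        print(text, lookup(trie, text))
```

+                       A lookup walks one node per character of the candidate, so its cost does
+                       not depend on scanning all entries of the word list.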
+                       Multi Tree Word List (.mtwl)
+                       A multi tree word list is generated from multiple
+                       .txt files and contains special nodes that provide additional
+                       information about the original file. The .mtwl files are useful if
+                       several different dictionaries are used in a TextMarker file. For
+                       five dictionaries, for example, five MARKFAST rules are
+                       necessary. The matched text is therefore searched five times, and the
+                       resulting complexity is 5 * O(m*log(n)). Using a .mtwl file reduces the
+                       complexity to about O(m*log(5*n)).
+                       Usage:
+                       A .mtwl file is generated using the popup menu. Select one or more
+                       .txt files (or a folder containing .txt files), click the right
+                       mouse button and choose "Convert to MTWL". A .mtwl file named
+                       "generated.mtwl" is then generated that contains the word lists of
+                       all selected .txt files. Renaming the .mtwl file is recommended.
+                       If, for example, two or more word lists named
+                       "FirstNames.txt", "Companies.txt" and so on are given and the generated
+                       .mtwl file is renamed to "Dictionary.mtwl", then the following rule
+                       annotates all companies and first names in the complete document.
+                       Exemplary rules:
+                       <programlisting><![CDATA[
+LIST Dictionary = 'Dictionary.mtwl';
+DECLARE FirstName, Company;
+Document{-> TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company, Dictionary, false, 0, false, 0, "")};
+]]></programlisting>
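+                       The idea of nodes that remember their original file can be sketched in
+                       Python as follows. This is an assumption-laden illustration, not the .mtwl
+                       format: the functions build_multi_trie and sources_of are hypothetical, and
+                       the example word lists are invented for the demonstration.

```python
def _new_node():
    # Each node keeps its children and the names of the word lists
    # in which an entry ends at this node.
    return {"children": {}, "sources": set()}


def build_multi_trie(word_lists):
    """Merge several word lists into one tree.

    word_lists -- dict mapping a file name to a list of entries
    """
    root = _new_node()
    for source, words in word_lists.items():
        for word in words:
            node = root
            for ch in word:
                node = node["children"].setdefault(ch, _new_node())
            node["sources"].add(source)
    return root


def sources_of(root, text):
    """Return the names of all word lists that contain text as an entry."""
    node = root
    for ch in text:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["sources"])


if __name__ == "__main__":
    # Hypothetical contents, for illustration only.
    tree = build_multi_trie({
        "FirstNames.txt": ["Peter", "Paul"],
        "Companies.txt": ["Paul", "Acme"],
    })
    print(sources_of(tree, "Paul"))
```

+                       A single pass over the merged tree yields the originating list for each
+                       match, which is what allows one rule to assign a different type per word list.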
+                       Table (.csv)
+                       The TextMarker system also supports tables in the form of
+                       .csv files.
+                       Usage:
+                       Content of a file named TestTable.csv
+                       (located in the resource folder of a
+                       TextMarker project):
+                       <programlisting><![CDATA[
+]]></programlisting>
+                       Exemplary rules:
+                       <programlisting><![CDATA[
+TABLE TestTable = 'TestTable.csv';
+DECLARE Annotation Struct (STRING first);
+Document{-> MARKTABLE(Struct, 1, TestTable, "first" = 2)};
+]]></programlisting>
+                       In this example, the document is searched for all occurrences of the
+                       entries of the first column of the given table. For each occurrence, an
+                       annotation of the type Struct is created and its feature "first" is
+                       filled with the entry of the second column.
+                       For the input document with the content
+                       "Peter", the result is a single
+                       annotation of the type Struct with
+                       "P" assigned to its feature
+                       "first".
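+                       The behavior described for MARKTABLE can be approximated in Python as
+                       shown below. This is a sketch only: the function mark_table is hypothetical,
+                       the table is passed as already parsed rows because the file content and its
+                       delimiter are not shown here, and the single row is invented so that it is
+                       consistent with the "Peter" example above.

```python
def mark_table(document, rows, key_column, feature_columns):
    """Find table entries in a document and attach features from other columns.

    document        -- the text to search
    rows            -- list of rows, each a list of cell strings
    key_column      -- 1-based index of the column whose entries are searched
    feature_columns -- dict mapping a feature name to a 1-based column index
    """
    annotations = []
    for row in rows:
        key = row[key_column - 1]
        if not key:
            continue  # nothing to search for
        begin = document.find(key)
        while begin != -1:
            features = {name: row[col - 1] for name, col in feature_columns.items()}
            annotations.append(
                {"begin": begin, "end": begin + len(key), "features": features}
            )
            begin = document.find(key, begin + 1)
    return annotations


if __name__ == "__main__":
    hypothetical_rows = [["Peter", "P"]]  # assumed table content
    print(mark_table("Peter", hypothetical_rows, 1, {"first": 2}))
```

+                       The sketch uses plain substring search, whereas TextMarker matches on
+                       basic annotations, so the two can differ on partial words.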
+               </para>
+       </section>
+       <section id="">
+               <title>Parameters</title>
+               <para>
+                       <itemizedlist>
+                               <listitem>
+                                       <para>mainScript (String): This is the TextMarker script that
+                                               will be loaded and executed by the generated engine. The string
+                                               references the name of the file without file extension but
+                                               with its complete namespace, e.g.,
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>scriptPaths (Multiple Strings): The given strings
+                                               specify the folders that contain TextMarker script files, the
+                                               main script file and the additional script files in particular.
+                                               Currently, only one folder is supported in the TextMarker
+                                               workbench (script).
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>enginePaths (Multiple Strings): The given strings
+                                               specify the folders that contain additional analysis engines that
+                                               are called from within a script file. Currently, only
+                                               one folder is supported in the TextMarker workbench (descriptor).
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>resourcePaths (Multiple Strings): The given strings
+                                               specify the folders that contain the word lists and dictionaries.
+                                               Currently, only one folder is supported in the TextMarker
+                                               workbench (resources).
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>additionalScripts (Multiple Strings): This parameter
+                                               contains a list of all known script file references with their
+                                               complete namespace, e.g.,
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>additionalEngines (Multiple Strings): This parameter
+                                               contains a list of all known analysis engines.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>additionalEngineLoaders (Multiple Strings): This
+                                               parameter contains the class names of the implementations that
+                                               help to load more complex analysis engines.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>scriptEncoding (String): The encoding of the script
+                                               files. Not yet supported, please use
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>defaultFilteredTypes (Multiple Strings): The complete
+                                               names of the types that are filtered by default.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>defaultFilteredMarkups (Multiple Strings): The names of
+                                               the markups that are filtered by default.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>seeders (Multiple Strings):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>useBasics (String):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>removeBasics (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>debug (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>profile (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>debugWithMatches (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>statistics (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>debugOnlyFor (Multiple Strings):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>style (Boolean):
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>styleMapLocation (String):
+                                       </para>
+                               </listitem>
+                       </itemizedlist>
+               </para>
+       </section>
+       <section id="">
+               <title>Query</title>
+               <para>
+                       The Query view can be used to write queries on several documents
+                       within a folder with the TextMarker language.
+                       A short example of how to use the Query view:
+                       <itemizedlist>
+                               <listitem>
+                                       <para>In the first field, "Query Data", the folder in which the
+                                               query is executed is added, for example with drag and drop from
+                                               the script explorer. If the checkbox is activated, then all
+                                               subfolders are included in the query.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>The next field, "Type System", must contain a type system
+                                               or a TextMarker script that specifies all types that are used in
+                                               the query.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>The query, in the form of one or more TextMarker rules, is
+                                               specified in the text field in the middle of the view. In the
+                                               example of the screenshot, all "Author" annotations are
+                                               selected that contain a "FalsePositive" or "FalseNegative"
+                                               annotation.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>If the start button near the tab of the view in the upper
+                                               right corner is pressed, then the results are displayed.
+                                       </para>
+                               </listitem>
+                       </itemizedlist>
+                       <screenshot>
+                               <mediaobject>
+                                       <imageobject>
+                                               <imagedata scale="80" format="PNG" fileref="&imgroot;Query.png" />
+                                       </imageobject>
+                                       <textobject>
+                                               <phrase>Query View</phrase>
+                                       </textobject>
+                               </mediaobject>
+                       </screenshot>
+               </para>
+       </section>
+       <section id="">
+               <title>Views</title>
+               <para>
+               </para>
+               <section id="">
+                       <title>Annotation Browser</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>Annotation Editor</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>Marker Palette</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>Selection</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>Basic Stream</title>
+                       <para>
+                               The Basic Stream view lists the complete disjoint
+                               partition of the document into TextMarkerBasic annotations, which are
+                               used for the inference and the annotation seeding.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Applied Rules</title>
+                       <para>
+                               The Applied Rules view displays how often a rule tried to
+                               apply and how often the rule succeeded. Additionally, some
+                               information is added after a short verbalization of the rule. The
+                               information is structured: if BLOCK constructs were used in the
+                               executed TextMarker file, the rules contained in a block are
+                               represented as child nodes in the tree of the view. Each TextMarker
+                               file is itself a BLOCK construct named after the file. Therefore,
+                               the root node of the view is always a BLOCK containing the rules of
+                               the executed TextMarker script. Additionally, if a rule calls a
+                               different TextMarker file, then the root block of that file is the
+                               child of that rule. Selecting a rule in this view directly
+                               changes the information visualized in the other views.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Selected Rules</title>
+                       <para>
+                               This view is very similar to the Applied Rules view, but
+                               displays only the rules and blocks under a given selection. If the user
+                               clicks on the document, then an Applied Rules view is generated that
+                               contains only the elements that affect that position in the document.
+                               The Rule Elements view then only contains match information for that
+                               position, but the result of the rule element match is still
+                               displayed.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Rule List</title>
+                       <para>
+                               This view is very similar to the Applied Rules view and the
+                               Selected Rules view, but displays only rules and no blocks under
+                               a given selection. If the user clicks on the document, then a list
+                               of rules is generated that matched or tried to match at that
+                               position in the document. The Rule Elements view then only contains
+                               match information for that position, but the result of the rule
+                               element match is still displayed. Additionally, this view provides a
+                               text field for filtering the rules. Only those rules remain that
+                               contain the entered text in their verbalization.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Matched Rules</title>
+                       <para>
+                               If a rule is selected in the Applied Rules view, then this
+                               view displays the instances (text passages) where this rule
+                               matched.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Failed Rules</title>
+                       <para>
+                               If a rule is selected in the Applied Rules view, then this view
+                               displays the instances (text passages) where this rule failed
+                               to match.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Rule Elements</title>
+                       <para>
+                               If a successful or failed rule match in the Matched Rules view
+                               or Failed Rules view is selected, then this view contains a
+                               listing of the rule elements and their conditions. There is
+                               detailed information available on what text each rule element
+                               matched and which conditions evaluated to true.
+                       </para>
+               </section>
+               <section id="">
+                       <title>Statistics</title>
+                       <para>
+                               This view displays the used conditions and actions of the
+                               TextMarker language. Three numbers are given for each element:
+                               the total time of execution, the number of executions and the
+                               time per execution.
+                       </para>
+               </section>
+               <section id="">
+                       <title>False Positive</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>False Negative</title>
+                       <para>
+                       </para>
+               </section>
+               <section id="">
+                       <title>True Positive</title>
+                       <para>
+                       </para>
+               </section>
+       </section>
+       <section id="">
+               <title>Testing</title>
+               <para>
+                       The TextMarker software comes bundled with its own testing
+                       environment that allows you to test and evaluate TextMarker
+                       scripts. It provides full back end testing capabilities and
+                       allows you to examine test results in detail. As a product of
+                       the testing operation, a new document file will be created, and
+                       detailed information on how well the script performed in the
+                       test will be added to this document.
+               </para>
+               <section id="">
+                       <title>Overview</title>
+                       <para>
+                               The testing procedure compares a previously 
annotated gold standard
+                               file with the result of the selected TextMarker 
script using an
+                               evaluator. The evaluators compare the offsets 
of annotations in
+                               both documents and, depending on the evaluator, 
mark a result
+                               document with true positive, false positive or 
false negative
+                               annotations. Afterwards the f1-score is 
calculated for the whole
+                               set of tests, each test file and each type in 
the test file.
+                               The testing environment contains the following parts:
+                               <itemizedlist>
+                                       <listitem>
+                                               <para>Main view</para>
+                                       </listitem>
+                                       <listitem>
+                                               <para>Result views: true positive, false positive and false
+                                                       negative view
+                                               </para>
+                                       </listitem>
+                                       <listitem>
+                                               <para>Preference page</para>
+                                       </listitem>
+                               </itemizedlist>
+                               <screenshot>
+                                       <mediaobject>
+                                               <imageobject>
+                                                       <imagedata scale="80" 
fileref="&imgroot;Screenshot_main.png" />
+                                               </imageobject>
+                                               <textobject>
+                                                       <phrase>Eclipse with 
open TextMarker and testing environment.
+                                                       </phrase>
+                                               </textobject>
+                                       </mediaobject>
+                               </screenshot>
+                               All control elements that are needed for the interaction with
+                               the testing environment are located in the main view. This is
+                               also where test files can be selected and where information on
+                               how well the script performed is displayed. During the testing
+                               process a result CAS file is produced that will contain new
+                               annotation types like true positives (tp), false positives (fp)
+                               and false negatives (fn). While displaying the result .xmi file
+                               in the editor, additional views allow easy navigation through
+                               the new annotations. Additional tree views, like the true
+                               positive view, display the corresponding annotations in a
+                               hierarchic structure. This allows easy tracing of the results
+                               inside the testing document. A preference page allows
+                               customization of the behavior of the testing plug-in.
+                       </para>
+                       <section id="">
+                               <title>Main View</title>
+                               <para>
+                                       The following picture shows a close-up view of the testing
+                                       environment's main view. The toolbar contains all buttons
+                                       needed to operate the plug-in. The first line shows the name
+                                       of the script that is going to be tested and a combo box
+                                       where the view that should be tested is selected. On the
+                                       right follow fields that show some basic information about
+                                       the results of the test run. Below and on the left, the test
+                                       list is located. This list contains the different test
+                                       files. Right beside it, you will find a table with
+                                       statistical information. It shows the total tp, fp and fn
+                                       counts, as well as precision, recall and f1-score of every
+                                       test file and for every type in each file.
+                                       <screenshot>
+                                               <mediaobject>
+                                                       <imageobject>
+                                                               <imagedata 
scale="80" format="PNG"
fileref="&imgroot;Screenshot_testing_desc_3_resize.png" />
+                                                       </imageobject>
+                                                       <textobject>
+                                                               <phrase>The 
main view of the testing environment.</phrase>
+                                                       </textobject>
+                                               </mediaobject>
+                                       </screenshot>
+                               </para>
+                       </section>
+                       <section id="">
+                               <title>Result Views</title>
+                               <para>
+                                       These views add additional information to the CAS View once
+                                       a result file is opened. Each view displays one of the
+                                       following annotation types in a hierarchic tree structure:
+                                       true positives, false positives and false negatives. Adding
+                                       a check mark to one of the annotations in a result view will
+                                       highlight the annotation in the CAS Editor.
+                                       <screenshot>
+                                               <mediaobject>
+                                                       <imageobject>
+                                                               <imagedata 
scale="80" format="PNG"
fileref="&imgroot;Screenshot_result.png" />
+                                                       </imageobject>
+                                                       <textobject>
+                                                               <phrase>The result views of the testing environment.</phrase>
+                                                       </textobject>
+                                               </mediaobject>
+                                       </screenshot>
+                               </para>
+                       </section>
+                       <section id="">
+                               <title>Preference Page</title>
+                               <para>
+                                       The preference page offers a few options that will modify the
+                                       plug-in's general behavior. For example, the preloading of
+                                       previously collected result data can be turned off, should it
+                                       cause too long a loading time. An important option in the
+                                       preference page is the evaluator you can select. By default
+                                       the "exact evaluator" is selected, which compares the offsets
+                                       of the annotations that are contained in the file produced by
+                                       the selected script with the annotations in the test file.
+                                       Other evaluators will compare annotations in a different way.
+                                       <screenshot>
+                                               <mediaobject>
+                                                       <imageobject>
+                                                               <imagedata 
scale="80" format="PNG"
fileref="&imgroot;Screenshot_preferences.png" />
+                                                       </imageobject>
+                                                       <textobject>
+                                                               <phrase>The 
preference page of the testing environment.
+                                                               </phrase>
+                                                       </textobject>
+                                               </mediaobject>
+                                       </screenshot>
+                               </para>
+                       </section>
+                       <section id="">
+                               <title>The TextMarker Project Structure</title>
+                               <para>
+                                       The picture shows the TextMarker script explorer. Every
+                                       TextMarker project contains a folder called "test". This
+                                       folder is the default location for the test files. In this
+                                       folder each script file has its own sub-folder with a
+                                       relative path equal to the script's package path in the
+                                       "script" folder. This sub-folder contains the test files. In
+                                       every script's test folder you will also find a result folder
+                                       with the results of the tests. Should you use test files from
+                                       another location in the file system, the results will be
+                                       saved in the "temp" sub-folder of the project's "test"
+                                       folder. All files in the "temp" folder will be deleted once
+                                       Eclipse is closed.
+                                       <screenshot>
+                                               <mediaobject>
+                                                       <imageobject>
+                                                               <imagedata 
scale="80" format="PNG"
fileref="&imgroot;folder_struc_sep_desc_cut.png" />
+                                                       </imageobject>
+                                                       <textobject>
+                                                               <phrase>Script 
Explorer with the test folder expanded.</phrase>
+                                                       </textobject>
+                                               </mediaobject>
+                                       </screenshot>
+                               </para>
+                       </section>
+               </section>
+               <section id="">
+                       <title>Usage</title>
+                       <para>
+                               This section demonstrates how to use the testing environment.
+                               It shows the basic actions needed to perform a test run.
+                       </para>
+                       <para>
+                               Preparing Eclipse:
+                               The testing environment provides its own perspective called
+                               "TextMarker Testing". It will display the main view as well as
+                               the different result views on the right-hand side. Using this
+                               perspective is recommended, especially when working with the
+                               testing environment for the first time.
+                       </para>
+                       <para>
+                               Selecting a script for testing:
+                               TextMarker will always test the script that is currently open
+                               in the script editor. Should another editor be open, for
+                               example a Java editor displaying a Java class, you will see
+                               that the testing view is not available.
+                       </para>
+                       <para>
+                               Creating a test file:
+                               A test file is a previously annotated .xmi file that can be
+                               used as a gold standard for the test. No additional tools are
+                               provided for creating such a file, because the TextMarker
+                               system already provides such tools.
+                       </para>
+                       <para>
+                               Selecting a test-file:
+                               Test files can be added to the test-list
+                               by simply dragging them from
+                               the Script Explorer into the test-file
+                               list. Depending on the
+                               setting in the preference page, test-files
+                               from a script's "test"
+                               folder might already be loaded into the list.
+                               A different way to
+                               add test-files is to use the "Add files from
+                               folder" button. It can
+                               be used to add all .xmi files from a selected
+                               folder. The "del" key
+                               can be used to remove files from the
+                               test-list.
+                       </para>
+                       <para>
+                               Selecting a CAS View to test:
+                               TextMarker supports different views that allow you to operate
+                               on different levels in a document. The InitialView is selected
+                               by default; however, you can also switch the evaluation to
+                               another view by typing the view's name into the combo box or by
+                               selecting the view you wish to use from its list.
+                       </para>
+                       <para>
+                               Selecting the evaluator:
+                               The testing environment supports
+                               different evaluators that allow a
+                               sophisticated analysis of the
+                               behavior of a TextMarker script. The
+                               evaluator can be chosen in the
+                               testing environment's preference
+                               page. The preference page can be
+                               opened either through the menu or
+                               by clicking the blue preference
+                               button in the testing view's
+                               toolbar. The default evaluator is the
+                               "Exact CAS Evaluator" which
+                               compares the offsets of the annotations
+                               between the test file and
+                               the file annotated by the tested script.
+                       </para>
+                       <para>
+                               Excluding Types:
+                               During a test-run it might be convenient to
+                               disable testing for specific
+                               types like punctuation or tags. The
+                               "exclude types" button will
+                               open a dialog where all types can be
+                               selected that should not be
+                               considered in the test.
+                       </para>
+                       <para>
+                               Running the test:
+                               A test-run can be started by clicking on the
+                               green start button in
+                               the toolbar.
+                       </para>
+                       <para>
+                               Result Overview:
+                               The testing main view displays some information on how well the
+                               script did after every test run. It will display the overall
+                               number of true positive, false positive and false negative
+                               annotations of all result files as well as an overall f1-score.
+                               Furthermore, a table will be displayed that contains the overall
+                               statistics of the selected test file as well as statistics for
+                               every single type in the test file. The values displayed are
+                               true positives, false positives, false negatives, precision,
+                               recall and f1-measure.
+                       </para>
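+                       <para>
+                               Score calculation (sketch):
+                               The precision, recall and f1 values in the table can be derived
+                               from the tp, fp and fn counts with the standard formulas. The
+                               following is only a minimal sketch of those formulas; the class
+                               and method names are hypothetical and not part of the
+                               TextMarker API, and returning 0 for an empty denominator is an
+                               assumption about how such edge cases might be reported.
+                       </para>
+                       <programlisting><![CDATA[
```java
public class F1Sketch {
    // Hypothetical helpers for illustration; not part of the TextMarker API.

    /** Fraction of produced annotations that are correct: tp / (tp + fp). */
    static double precision(int tp, int fp) {
        int produced = tp + fp;
        // Assumption: report 0 when the script produced no annotations.
        return produced == 0 ? 0.0 : (double) tp / produced;
    }

    /** Fraction of gold standard annotations that were found: tp / (tp + fn). */
    static double recall(int tp, int fn) {
        int expected = tp + fn;
        // Assumption: report 0 when the gold standard is empty.
        return expected == 0 ? 0.0 : (double) tp / expected;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int tp = 8, fp = 2, fn = 4; // made-up counts for illustration
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn));
    }
}
```
+]]></programlisting>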
+                       <para>
+                               The testing environment also supports the export of the overall
+                               data in the form of a comma-separated table. Clicking the export
+                               evaluation data button will open a dialog window that contains
+                               this table. The text in this table can be copied and easily
+                               imported into a spreadsheet application such as MS Excel.
+                       </para>
+                       <para>
+                               Result Files:
+                               When running a test, the evaluator will create a new result
+                               .xmi file and will add new true positive, false positive and
+                               false negative annotations. By clicking on a file in the
+                               test-file list, you can open the corresponding result .xmi file
+                               in the TextMarker script editor. When opening a result file in
+                               the script explorer, additional views will open that allow easy
+                               access to and browsing of the additional debugging annotations.
+                               <screenshot>
+                                       <mediaobject>
+                                               <imageobject>
+                                                       <imagedata scale="80" 
fileref="&imgroot;Screenshot_Result_TP_desc_close_cut.png" />
+                                               </imageobject>
+                                               <textobject>
+                                                       <phrase>Open result 
file and selected true positive annotation
+                                                               in the true 
positive view.
+                                                       </phrase>
+                                               </textobject>
+                                       </mediaobject>
+                               </screenshot>
+                       </para>
+               </section>
+               <section id="">
+                       <title>Evaluators</title>
+                       <para>
+                               When testing a CAS file, the system compares the offsets of the
+                               annotations of a previously annotated gold standard file with
+                               the offsets of the annotations of the result file the script
+                               produced. Evaluators are responsible for comparing the
+                               annotations in the two CAS files. These evaluators implement
+                               different methods and strategies for comparing the annotations.
+                               An extension point is also provided that allows easy
+                               implementation of new evaluators.
+                       </para>
+                       <para>
+                               Exact Match Evaluator:
+                               The Exact Match Evaluator compares the offsets of the
+                               annotations in the result and the gold standard file. Any
+                               difference will be marked with either a false positive or a
+                               false negative annotation.
+                       </para>
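+                       <para>
+                               Exact matching (sketch):
+                               The comparison described above can be illustrated as a set
+                               comparison of annotation spans. The following is a minimal
+                               sketch, not the actual evaluator implementation: the Span class
+                               and the evaluate method are hypothetical, and treating the
+                               annotation type as part of the comparison is an assumption
+                               based on the per-type statistics shown in the main view.
+                       </para>
+                       <programlisting><![CDATA[
```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class ExactMatchSketch {

    /** Hypothetical stand-in for an annotation: type name plus offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Span)) {
                return false;
            }
            Span other = (Span) o;
            return type.equals(other.type) && begin == other.begin && end == other.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, begin, end);
        }
    }

    /** Returns the counts {tp, fp, fn} for exact offset (and type) equality. */
    static int[] evaluate(Set<Span> gold, Set<Span> result) {
        int tp = 0;
        int fp = 0;
        for (Span s : result) {
            if (gold.contains(s)) {
                tp++; // identical span exists in the gold standard
            } else {
                fp++; // produced by the script, but not in the gold standard
            }
        }
        int fn = gold.size() - tp; // gold spans the script did not reproduce
        return new int[] {tp, fp, fn};
    }

    public static void main(String[] args) {
        Set<Span> gold = new HashSet<>();
        gold.add(new Span("Author", 0, 20));
        gold.add(new Span("Title", 22, 40));

        Set<Span> result = new HashSet<>();
        result.add(new Span("Author", 0, 20)); // exact match
        result.add(new Span("Title", 22, 41)); // end offset differs by one

        int[] counts = evaluate(gold, result);
        System.out.println("tp=" + counts[0] + " fp=" + counts[1] + " fn=" + counts[2]);
    }
}
```
+]]></programlisting>
+                       <para>
+                               In this sketch a span whose offsets differ by a single
+                               character counts as both a false positive (for the result) and
+                               a false negative (for the gold standard).
+                       </para>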
+                       <para>
+                               Partial Match Evaluator:
+                               The Partial Match Evaluator compares
+                               the offsets of the annotations in
+                               the result and the gold standard
+                               file. It will allow differences in
+                               the beginning or the end of an
+                               annotation. For example "corresponding" and 
"corresponding " will
+                               not be
+                               annotated as an error.
+                       </para>
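+                       <para>
+                               Partial matching (sketch):
+                               One plausible reading of "differences in the beginning or the
+                               end" is that two annotations match as soon as their character
+                               ranges overlap. The following minimal sketch implements that
+                               reading only; the overlap criterion, the class name and the
+                               greedy one-to-one pairing are assumptions and may differ from
+                               what the actual evaluator does. Spans are given as int arrays
+                               of the form {begin, end}.
+                       </para>
+                       <programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

public class PartialMatchSketch {

    /** True if the half-open ranges [a0, a1) and [b0, b1) share a character. */
    static boolean overlaps(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    /**
     * Returns the counts {tp, fp, fn}. Each gold span is paired with at most
     * one result span (greedy, in list order). This is an assumption.
     */
    static int[] evaluate(List<int[]> gold, List<int[]> result) {
        boolean[] used = new boolean[gold.size()];
        int tp = 0;
        int fp = 0;
        for (int[] r : result) {
            boolean matched = false;
            for (int i = 0; i < gold.size() && !matched; i++) {
                if (!used[i] && overlaps(gold.get(i), r)) {
                    used[i] = true;
                    matched = true;
                }
            }
            if (matched) {
                tp++;
            } else {
                fp++;
            }
        }
        int fn = gold.size() - tp; // gold spans without any overlapping result
        return new int[] {tp, fp, fn};
    }

    public static void main(String[] args) {
        List<int[]> gold = new ArrayList<>();
        gold.add(new int[] {0, 13}); // "corresponding"

        List<int[]> result = new ArrayList<>();
        result.add(new int[] {0, 14}); // "corresponding " with trailing blank

        int[] counts = evaluate(gold, result);
        System.out.println("tp=" + counts[0] + " fp=" + counts[1] + " fn=" + counts[2]);
    }
}
```
+]]></programlisting>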
+                       <para>
+                               Core Match Evaluator:
+                               The Core Match Evaluator accepts annotations that share a core
+                               expression. In this context a core expression is at least four
+                               characters long and starts with a capital letter. For example,
+                               the two annotations "L404-123-421" and "L404-321-412" would be
+                               considered a true positive match, because "L404" is considered
+                               a core expression that is contained in both annotations.
+                       </para>
+                       <para>
+                               Word Accuracy Evaluator:
+                               Compares the labels of all words/numbers in an annotation,
+                               where the label equals the type of the annotation. This has the
+                               consequence, for example, that each word or number that is not
+                               part of the annotation is counted as a single false negative.
+                               For example, take the sentence: "Christmas is on the 24.12
+                               every year." The script labels "Christmas is on the 12" as a
+                               single sentence, while the test file labels the sentence
+                               correctly with a single sentence annotation. While the Exact
+                               CAS Evaluator will only assign a single false negative
+                               annotation, the Word Accuracy Evaluator will mark every word or
+                               number as a single false negative.
+                       </para>
+                       <para>
+                               Template Only Evaluator:
+                               This evaluator compares the offsets of the annotations and the
+                               features that have been created by the script. For example, the
+                               text "Alan Mathison Turing" is marked with the author
+                               annotation and "author" contains two features: "FirstName" and
+                               "LastName". If the script now creates an author annotation with
+                               only one feature,
+                               the annotation will be marked as a false 
+                       </para>
+                       <para>
+                               Template on Word Level Evaluator:
+                               The Template on Word Level Evaluator compares the offsets of
+                               the annotations. In addition, it also compares the features and
+                               feature structures and the values stored in the features. For
+                               example, the annotation "author" might have features like
+                               "FirstName" and "LastName". The author's name is "Alan Mathison
+                               Turing" and the script correctly assigns the author annotation.
+                               The features assigned by the script are "FirstName : Alan",
+                               "LastName : Mathison", while the correct feature values would
+                               be "FirstName : Alan", "LastName : Turing". In this case the
+                               Template Only Evaluator will mark an annotation as a false
+                               positive, since the feature values differ.
+                       </para>
+               </section>
+       </section>
+       <section id="">
+               <title>TextRuler</title>
+               <para>
+                       Using the knowledge engineering approach, a knowledge engineer
+                       normally writes handcrafted rules to create a domain dependent
+                       information extraction application, often supported by a gold
+                       standard. When starting the engineering process for the
+                       acquisition of the extraction knowledge for possibly new slots
+                       or, more generally, for new concepts, machine learning methods
+                       are often able to offer support in an iterative engineering
+                       process. This section gives a conceptual overview of the process
+                       model for the semi-automatic development of rule-based
+                       information extraction applications.
+               </para>
+               <para>
+                       First, a suitable set of documents that contain the text
+                       fragments with
+                       interesting patterns needs to be selected and
+                       annotated with the
+                       target concepts. Then, the knowledge engineer
+                       chooses and configures
+                       the methods for automatic rule acquisition to
+                       the best of his
+                       knowledge for the learning task: Lambda expressions
+                       based on tokens
+                       and linguistic features, for example, differ in their
+                       application
+                       domain from wrappers that process generated HTML pages.
+               </para>
+               <para>
+                       Furthermore, parameters like the window size defining features
+                       need to be set to an appropriate level. Before the annotated
+                       training documents form the input of the learning task, they are
+                       enriched with features generated by the partial rule set of the
+                       developed application. The results of the methods, that is, the
+                       rules, are proposed to the knowledge engineer for the extraction
+                       of the target concept.
+               </para>
+               <para>
+                       The knowledge engineer has different options to proceed: If the
+                       quality, amount or generality of the presented rules is not
+                       sufficient, then additional training documents need to be annotated
+                       or additional rules have to be handcrafted to provide more features
+                       in general or more appropriate features. Rules or rule sets of high
+                       quality can be modified, combined or generalized and transferred to
+                       the rule set of the application in order to support the extraction
+                       task of the target concept. If the methods did not learn reasonable
+                       rules at all, the knowledge engineer proceeds with writing
+                       handcrafted rules.
+               </para>
+               <para>
+                       Once enough extraction knowledge has been gathered for the
+                       concept, the semi-automatic process is iterated and the focus
+                       moves to the next concept until the development of the
+                       application is completed.
+               </para>
+               <section id="">
+                       <title>Available Learners</title>
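+                       <table>
+                               <title>Overview</title>
+                               <tgroup cols="5">
+                                       <thead>
+                                               <row>
+                                                       <entry>Name</entry>
+                                                       <entry>Strategy</entry>
+                                                       <entry>Document</entry>
+                                                       <entry>Slots</entry>
+                                                       <entry>Status</entry>
+                                               </row>
+                                       </thead>
+                                       <tbody>
+                                               <row>
+                                                       <entry>BWI (1)</entry>
+                                                       <entry>Boosting, Top Down</entry>
+                                                       <entry>Struct, Semi</entry>
+                                                       <entry>Single, Boundary</entry>
+                                                       <entry>Planning</entry>
+                                               </row>
+                                               <row>
+                                                       <entry>LP2 (2)</entry>
+                                                       <entry>Bottom Up Cover</entry>
+                                                       <entry>All</entry>
+                                                       <entry>Single, Boundary</entry>
+                                                       <entry></entry>
+                                               </row>
+                                               <row>
+                                                       <entry>RAPIER (3)</entry>
+                                                       <entry>Top Down/Bottom Up Compr.</entry>
+                                                       <entry>Semi</entry>
+                                                       <entry>Single</entry>
+                                                       <entry></entry>
+                                               </row>
+                                               <row>
+                                                       <entry>WHISK (4)</entry>
+                                                       <entry>Top Down Cover</entry>
+                                                       <entry>All</entry>
+                                                       <entry>Multi</entry>
+                                                       <entry>Prototype</entry>
+                                               </row>
+                                               <row>
+                                                       <entry>WIEN (5)</entry>
+                                                       <entry>CSP</entry>
+                                                       <entry>Struct</entry>
+                                                       <entry>Multi, Rows</entry>
+                                                       <entry>Prototype</entry>
+                                               </row>
+                                       </tbody>
+                               </tgroup>
+                       </table>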
+                       <itemizedlist>
+                               <listitem>
+                                       <para>
+                                               Strategy: The strategies used by the learning methods
+                                               are commonly coverage algorithms.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>
+                                               Document: The type of the document may be "free" like
+                                               in newspapers, "semi" or "struct" like HTML pages.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>
+                                               Slots: A slot refers to a single annotation that
+                                               represents the goal of the learning task. Some rules
+                                               are able to create several annotations at once in the
+                                               same context. However, only single slots are supported
+                                               by the current implementations.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>
+                                               Status: The current status of the implementation in
+                                               the TextRuler framework.
+                                       </para>
+                               </listitem>
+                       </itemizedlist>
+                       <para>
+                               Publications
+                       </para>
+                       <para>
+                               (1) Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper
+                               Induction. In AAAI/IAAI, pages 577–583, 2000.
+                       </para>
+                       <para>
+                               (2) F. Ciravegna. (LP)2, Rule Induction for Information
+                               Extraction Using Linguistic Constraints. Technical Report,
+                               Department of Computer Science, University of Sheffield,
+                               2003.
+                       </para>
+                       <para>
+                               (3) Mary Elaine Califf and Raymond J. Mooney. Bottom-up
+                               Relational Learning of Pattern Matching Rules for Information
+                               Extraction. Journal of Machine Learning Research,
+                               4:177–210, 2003.
+                       </para>
+                       <para>
+                               (4) Stephen Soderland, Claire Cardie, and Raymond Mooney.
+                               Learning Information Extraction Rules for Semi-Structured and
+                               Free Text. In Machine Learning, volume 34, pages 233–272.
+                       </para>
+                       <para>
+                               (5) N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper
+                               Induction for Information Extraction. In Proc. IJC Artificial
+                               Intelligence, 1997.
+                       </para>
+                       <para>
+                               BWI
+                               BWI (Boosted Wrapper Induction) uses boosting techniques to
+                               improve the performance of simple pattern-matching boundary
+                               wrappers (boundary detectors). Two sets of detectors are
+                               learned: the "fore" and the "aft" detectors. Weighted by their
+                               confidences and combined with a slot length histogram derived
+                               from the training data, they can classify a given pair of
+                               boundaries within a document. BWI can be used for structured,
+                               semi-structured and free text. The patterns are token-based,
+                               with special wildcards for more general rules.
+                       </para>
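+                       <para>
+                               How detector confidences and the length histogram might be
+                               combined can be sketched in Python. The sketch assumes that the
+                               three values are multiplied and compared with a threshold; the
+                               function names, the dictionaries and all numbers are hypothetical
+                               and do not describe a TextRuler implementation.
+                       </para>

```python
def slot_score(fore_conf, aft_conf, length_hist, start, end):
    """Score a candidate slot that starts at `start` and ends at `end`.

    fore_conf / aft_conf map boundary positions to detector confidences,
    length_hist maps a slot length (in tokens) to its relative frequency
    in the training data. Missing entries count as 0.0.
    """
    length = end - start
    return (fore_conf.get(start, 0.0)
            * aft_conf.get(end, 0.0)
            * length_hist.get(length, 0.0))


def extract(fore_conf, aft_conf, length_hist, threshold):
    """Return all (start, end) pairs whose score exceeds the threshold."""
    return sorted(
        (start, end)
        for start in fore_conf
        for end in aft_conf
        if end > start
        and slot_score(fore_conf, aft_conf, length_hist, start, end) > threshold
    )


# Made-up confidences for a toy document.
fore_conf = {3: 0.9, 7: 0.2}      # "fore" detectors: likely slot starts
aft_conf = {5: 0.8, 9: 0.5}       # "aft" detectors: likely slot ends
length_hist = {2: 0.6, 6: 0.1}    # slots of length 2 are most common

slots = extract(fore_conf, aft_conf, length_hist, threshold=0.1)
print(slots)
```

+                       <para>
+                               With these made-up numbers only the pair (3, 5) passes the
+                               threshold: the pair (3, 9) has a strong start but an unlikely
+                               length, and (7, 9) has a plausible length but weak detectors.
+                       </para>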
+                       <para>
+                               Implementations
+                               No implementations are yet available.
+                       </para>
+                       <para>
+                               Parameters
+                               No parameters are yet available.
+                       </para>
+                       <para>
+                               LP2
+                               This method operates on all three kinds of documents. It
+                               learns separate rules for the beginning and the end of a single
+                               slot. So-called tagging rules insert boundary SGML tags, and
+                               additionally induced correction rules shift misplaced tags to
+                               their correct positions in order to improve precision. The
+                               strategy is a bottom-up covering algorithm. It starts by
+                               creating a specific seed instance with a window of w tokens to
+                               the left and right of the target boundary and searches for the
+                               best generalization. Other linguistic NLP features can be used
+                               in order to generalize over the flat word sequence.
+                       </para>
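+                       <para>
+                               The seed instance mentioned above can be pictured with a small
+                               Python sketch. It only shows how a window of w tokens around a
+                               boundary could be collected; the function name, the whitespace
+                               tokenization and the example sentence are assumptions, and the
+                               generalization step is not covered.
+                       </para>

```python
def seed_window(tokens, boundary, w):
    """Collect up to w tokens on each side of a tag boundary.

    `boundary` is the index of the token in front of which the tag
    would be inserted. Returns (left_context, right_context).
    """
    left = tokens[max(0, boundary - w):boundary]
    right = tokens[boundary:boundary + w]
    return left, right


# Hypothetical example: the start tag of a time slot belongs before "4".
tokens = "the seminar starts at 4 pm in room 12".split()
left, right = seed_window(tokens, boundary=4, w=2)
print(left, right)
```

+                       <para>
+                               A generalization step would then relax single token constraints
+                               of such a seed, for example by replacing a word with a more
+                               general feature.
+                       </para>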
+                       <para>
+                               Implementations
+                               LP2 (naive):
+                               LP2 (optimized):
+                       </para>
+                       <para>
+                               Parameters
+                               Context Window Size (to the left and right):
+                               Best Rules List Size:
+                               Minimum Covered Positives per Rule:
+                               Maximum Error Threshold:
+                               Contextual Rules List Size:
+                       </para>
+                       <para>
+                               RAPIER
+                               RAPIER induces single-slot extraction rules for
+                               semi-structured documents. The rules consist of three patterns:
+                               a pre-filler, a filler and a post-filler pattern. Each can hold
+                               several constraints on tokens and their corresponding POS tag
+                               and semantic information. The algorithm uses a bottom-up
+                               compression strategy, starting with a most specific seed rule
+                               for each training instance. This initial rule base is compressed
+                               by randomly selecting rule pairs and searching for the best
+                               generalization. For two rules, the least general generalization
+                               (LGG) of the slot fillers is created and specialized by adding
+                               rule items to the pre- and post-filler until the new rules
+                               operate well on the training set. The best of the k rules
+                               (k-beam search) is added to the rule base and all empirically
+                               subsumed rules are removed.
+                       </para>
+                       <para>
+                               Implementations
+                               RAPIER:
+                       </para>
+                       <para>
+                               Parameters
+                               Maximum Compression Fail Count:
+                               Internal Rules List Size:
+                               Rule Pairs for Generalizing:
+                               Maximum 'No improvement' Count:
+                               Maximum Noise Threshold:
+                               Minimum Covered Positives Per Rule:
+                               PosTag Root Type:
+                               Use All 3 GenSets at Specialization:
+                       </para>
+                       <para>
+                               WHISK
+                               WHISK is a multi-slot method that operates on all three
+                               kinds of documents and learns single- or multi-slot rules that
+                               look similar to regular expressions. The top-down covering
+                               algorithm begins with the most general rule and specializes it
+                               by adding single rule terms until the rule makes no errors on
+                               the training set. Domain-specific classes or linguistic
+                               information obtained by a syntactic analyzer can be used as
+                               additional features. The exact definition of a rule term (e.g.
+                               a token) and of a problem instance (e.g. a whole document or a
+                               single sentence) depends on the operating domain and document
+                               type.
+                       </para>
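+                       <para>
+                               To give an impression of a multi-slot rule that resembles a
+                               regular expression, the following Python sketch uses an ordinary
+                               regular expression with two capturing groups. It is an analogy
+                               only: the pattern, the slot names and the example text are
+                               assumptions and are not written in WHISK's own rule syntax.
+                       </para>

```python
import re

# Two slots in one rule: skip text, capture a number in front of "br",
# skip more text, then capture the number after a dollar sign.
RENTAL_RULE = re.compile(
    r".*?(?P<bedrooms>\d+)\s*br\b.*?\$(?P<price>\d+)",
    re.IGNORECASE | re.DOTALL,
)


def apply_rule(text):
    """Return the filled slots as a dict, or None if the rule does not match."""
    match = RENTAL_RULE.search(text)
    return match.groupdict() if match else None


ad = "Capitol Hill - 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675."
filled = apply_rule(ad)
print(filled)
```

+                       <para>
+                               Both slots are filled by a single match, which is what
+                               distinguishes a multi-slot rule from a pair of independent
+                               single-slot rules.
+                       </para>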
+                       <para>
+                               Implementations
+                               WHISK (token):
+                               WHISK (generic):
+                       </para>
+                       <para>
+                               Parameters
+                               Window Size:
+                               Maximum Error Threshold:
+                               PosTag Root Type:
+                       </para>
+                       <para>
+                               WIEN
+                               WIEN is the only method listed here that operates on
+                               highly structured texts only. It induces so-called wrappers
+                               that anchor the slots by the structured context around them.
+                               The HLRT (head left right tail) wrapper class, for example, can
+                               determine and extract several multi-slot templates by first
+                               separating the important information block from unimportant
+                               head and tail portions and then extracting multiple data rows
+                               from table-like data structures in the remaining document.
+                               Inducing a wrapper is done by solving a CSP for all possible
+                               pattern combinations from the training data.
+                       </para>
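+                       <para>
+                               How an already induced HLRT wrapper could be applied to a page
+                               is sketched below in Python. The sketch covers only the
+                               execution of a wrapper, not its induction by constraint solving;
+                               the function name, the delimiters and the example page are
+                               hypothetical.
+                       </para>

```python
def exec_hlrt(page, head, tail, delimiters):
    """Apply an HLRT wrapper and return the extracted rows.

    head / tail delimit the information block; delimiters is a list of
    (left, right) string pairs, one pair per slot of a row.
    """
    rows = []
    pos = page.find(head)
    if pos < 0:
        return rows
    pos += len(head)
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further row starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


page = ("<html><b>Countries</b><p>"
        "<b>Congo</b> <i>242</i><br>"
        "<b>Egypt</b> <i>20</i><br>"
        "<hr><b>End</b></html>")
rows = exec_hlrt(page, head="<p>", tail="<hr>",
                 delimiters=[("<b>", "</b>"), ("<i>", "</i>")])
print(rows)
```

+                       <para>
+                               The head and tail delimiters keep the bold text in the page
+                               header and footer from being mistaken for data rows, although it
+                               uses the same left and right delimiters as the first slot.
+                       </para>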
+                       <para>
+                               Implementations
+                               WIEN:
+                       </para>
+                       <para>
+                               Parameters
+                               No parameters are available.
+                       </para>
+               </section>
+       </section>
\ No newline at end of file