README.txt

ogrisel Thu, 12 May 2011 10:25:37 -0700

Author: ogrisel
Date: Thu May 12 17:25:15 2011
New Revision: 1102392

URL: http://svn.apache.org/viewvc?rev=1102392&view=rev
Log:
typos / small fixes in README.txt


Modified:
    incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt

Modified: incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt?rev=1102392&r1=1102391&r2=1102392&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt (original)
+++ incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt Thu May 12 17:25:15 2011
@@ -26,7 +26,7 @@ Within this folder all the
  - distribution files (indexing/distribution)
 will be located.
 
-The indexing itself can be started by 
+The indexing itself can be started by
 java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
 but before doing this please note the points (2), (3) and (4)
 
@@ -36,13 +36,13 @@ All RDF dumps need to be copied to the d
   indexing/resources/rdfData
 
 The RDF dump of DBpedia.org is splited up in a number of different files.
-The actual files needed depend on the configuration of the mappings 
+The actual files needed depend on the configuration of the mappings
 (indexing/config/mappings.txt). Generally one need to make sure that all the
 RDF dumps with the source data for the specified mappings are available.
 A best is to use the previews of the dumps to check if the data of a dump is
 required or not.
 
-During the initialisation of the Indeing all the RDF files within the 
+During the initialisation of the Indeing all the RDF files within the
 "indexing/resources/rdfData" directory will be imported to an Jena TDB RDF
 triple store. The imported data are stored under
   indexing/resources/tdb
@@ -52,8 +52,8 @@ To avoid (re)importing of already import
 RDF files from the "indexing/resources/rdfData" or - typically the better
 option - rename the "rdfData" folder after the initial run.
 
-It is also save to 
-  - cancel the indexing process after the initialisation has competed 
+It is also save to
+  - cancel the indexing process after the initialisation has competed
     (as soon as the loging says that the indexing has started).
   - load additinal RDF dumps by putting additional RDF files to the "rdfData"
     directory. This files will be added to the others on the next start of the
@@ -63,43 +63,42 @@ It is also save to 
 
 The DBpedia.org indexer uses the incomming links from other wikipages to
 calculate the rank of entities. Entities with more incomming links get an
-higher rank.
-A RDF dump containing all outgoing wiki links is available on DBpedia 
-(TODO: add link). This file need to be processed with the following command
-to get an file containing an ordered list of incomming count and the local
-name of the entity.
+higher rank. A RDF dump containing all outgoing wiki links is available
+on DBpedia (page_links_en.nt.bz2). This file need to be processed with the
+following command to get an file containing an ordered list of incomming
+count and the local name of the entity.
 
-time curl http://downloads.dbpedia.org/{version}/en/page_links_en.nt.bz2 \
+curl http://downloads.dbpedia.org/{version}/en/page_links_en.nt.bz2 \
   | bzcat \
-  | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/'
+  | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
   | sort \
   | uniq -c  \
   | sort -nr > incoming_links.txt
 
-Depending on the machine and the download speed for the source file the 
-execution of this command will take several hours. 
+Depending on the machine and the download speed for the source file the
+execution of this command will take several hours.
 
-Importnat NOTES:
- - Links to Categories use wrong URLs in the current version (3.6) of the 
+Important NOTES:
+ - Links to Categories use wrong URLs in the current version (3.6) of the
    page_links_en.nt.bz2 dump.
    All categories start with "CAT:{categoryName}" but the correct local name
    would be "Category:{categoryName}". because of this categories would not be
    indexed.
    It is strongly suggested to
     - first check if still Category: is used as prefix (e.g. by checking if
-      http://dbpedia.org/page/Category:Political_culture is still valid) and 
-    - second if that is the case replace all appearances of "CAT:" to "Category:"   
-    
+      http://dbpedia.org/page/Category:Political_culture is still valid) and
+    - second if that is the case replace all appearances of "CAT:" to "Category:"
+
 The resulting file MUST BE copied to
   indexing/resources/incoming_links.txt
-  
+
 There is also the possibility do download a precomputed file form
   TODO: add download loaction
 
 (4) Configuration of the Index
 
  The configurations are contained within the "indexing/config" folder:
-  - indexing.properties: Main configuration for the indexing process. It 
+  - indexing.properties: Main configuration for the indexing process. It
       defines the used components and there configurations. Usually no need to
       make any changes.
   - mapping.txt: Define the fields, data type requirements and languages to be
@@ -115,7 +114,7 @@ There is also the possibility do downloa
       value of 0 will result in all entities to be indexed.
   - scoreRange.properties: Can be use to set the upper bound for entities score.
       The entities with the most incomming links will get this score. Entities
-      with no incomming links would get a score of zero.  
+      with no incomming links would get a score of zero.
 
 
 Default configuration:
@@ -126,7 +125,7 @@ of the indexing tool.
 
 The default configuration stores creates an index with the following features:
 
-Languages: 
+Languages:
 By default English, German, France and Italien and all literals without any
 language information are indexed. Pleas note also that one needs to provide
 also the RDF dumps for this languages.
@@ -155,8 +154,8 @@ values defined by "dcterms:subject" are 
 Categories itself are hierarchical. Parent categories can be used by following
 "skos:broader" relations.
 e.g.
-   Berlin -> skos:subject 
-      -> Category:City-states -> skos:broader 
+   Berlin -> skos:subject
+      -> Category:City-states -> skos:broader
            -> Category:Cities -> skos:broader
                -> Category:Populated_places -> skos:broader
                    -> Category:Human_habitats ...
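
The "Important NOTES" hunk above asks for "CAT:" to be replaced with
"Category:" but gives no command for it. A minimal sketch follows, assuming
incoming_links.txt has the "uniq -c" layout produced by the pipeline in the
README (optional leading spaces, the count, one space, then the local name).
The function name fix_category_prefix is hypothetical and not part of the
indexing tool.

```shell
# Hypothetical helper, not part of the Stanbol indexing tool.
# Reads "<count> <local name>" lines (uniq -c layout) on stdin and rewrites
# a "CAT:" prefix at the start of the local name to "Category:".
# Lines without that prefix pass through unchanged.
fix_category_prefix() {
  sed -e 's/^\( *[0-9][0-9]* \)CAT:/\1Category:/'
}
```

Possible usage, writing to a new file before copying the result to
indexing/resources/incoming_links.txt:

  fix_category_prefix < incoming_links.txt > incoming_links.fixed.txt

The expression is anchored to the count column so that only a prefix of the
local name is rewritten. If a dump contained both spellings of the same
category, their counts would remain on separate lines.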

svn commit: r1102392 - /incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt
