Hi there,

I am trying to index about 1000 HTML-Articles on our website via local
filesystem. I use the "Alias" command to map the links to the real
location.

The problem is that when I search for any pattern, most results show
directories not files. It seems that the indexer is parsing the
directory listings for matching filenames. Nice Feature, but nothing
what I really need. 

For Example a search result for "linux" looks like this:

Displaying documents 1-20 of total 55 found. 

1. http://www.linux-magazin.de/ausgabe/1996/08/ [1]

2.  [1]

[...]

13. http://www.linux-magazin.de/ausgabe/1996/06/News/biodata.html [1]


The Allow/Disallow settings are kept simple:

Allow \.html$ \.htm$ \.txt$ \/$
Disallow .*

I tried it without the \/$ to prevent directory indexing, but then the
program indexed nothing except the rootdir
(http://www.linux-magazin.de/ausgabe/). I have also tried "CheckOnly
\/$", but it doesnt seem to change anything.

Ideas anyone?

-- 
Tobias Freitag          | http://www.linux-magazin.de
Stefan-George-Ring 24   | Tel:  +49 (0) 89 993411-0
D-81929 München         | Fax:  +49 (0) 89 993411-99
# This is sample indexer config file
# To start using it please edit and rename to indexer.conf
# You may want to keep the original indexer.conf-dist for future references.
# Use '#' to comment out lines.
# All command names are case insensitive (DBHost=DBHOST=dbhost).
# You may use '\' character to prolong current command to next line
# when it is required.

###########################################################################
# DBAddr <URL-style database description>
# Database options (type, host, database name, port, user and password) 
# to connect to SQL database.
# Do not matter for built-in text files support.
# Should be used only once and before any other commands.
# Command have global effect for whole config file.
# Format:
#DBAddr <DBType>:[//[DBUser[:DBPass]]DBHost[:DBPort]]/DBName/
#
# ODBC notes:
#       Use DBName to specify ODBC data source name (DSN)
#       DBHost does not matter, use "localhost".
# Solid notes:
#       Use DBHost to specify Solid server
#       DBName does not matter for Solid
#
# Currently supported DBType values are 
# mysql, pgsql, msql, solid, mssql, oracle, ibase.
# Actually, it does not matter for native libraries support.
# But ODBC users should specify one of supported values.
# If your database type is not supported, you may use "unknown" instead.

#DBAddr         mysql://****:*******@localhost/udmsearch/


#######################################################################
# DBMode single/multi/crc/crc-multi
# Does not matter for built-in text files support
# You may select SQL database mode of words storage.
# When "single" is specified, all words are stored in the same
#table. If "multi" is selected, words will be located in different
#tables depending of their lengths. "multi" mode is usually faster
#but requires more tables in database. 
#
# If "crc" mode is selected, UdmSearch will store 32 bit integer
# word IDs calculated by CRC32 algorythm instead of words. This
# mode requres less disc space and it is faster comparing with "single"
# and "multi" modes. "crc-multi" uses the same storage structure with
# the "crc" mode, but also stores words in different tables depending on 
# words lengths like "multi" mode.
#
#Default DBMode value is "single":
DBMode multi


#######################################################################
#SyslogFacility <facility>
# This is used if indexer was compiled with syslog support and if you
# don't like the default value. Argument is the same as used in syslog.conf
# file. For list of possible facilities see syslog.conf(5)
#SyslogFacility local7


#######################################################################
# LocalCharset <charset>
# Defines charset of local file system. It is required if you are using 
# 8 bit charsets and does not matter for 7 bit charsets.
# This command should be used once and takes global effect for the config file.
# Choose currently supported one:
#
# Western Europe: Germany
LocalCharset iso-8859-1
#
# Central Europe: Czech
#LocalCharset iso-8859-2
#
# ISO Cyrillic
#LocalCharset iso-8859-5
#
# Unix Cyrillic
#LocalCharset koi8-r
#
# MS Central Europe: Czech
#LocalCharset cp1250
#
# MS DOS Cyrillic
#LocalCharset cp866
#
# MS Cyrillic
#LocalCharset cp1251
#
# MS Arabic
#LocalCharset cp1256
#
# Mac Cyrillic
#LocalCharset x-mac-cyrillic


###########################################################################
# Ispell support commands. Detailed description is given in /doc/ispell.txt
# Ispell commands MUST be given after LocalCharset definition.
# Load ispell affix file:
#Affix <lang> <ispell affixes file name>
# Load ispell dictionary file
#Spell <lang> <ispell dictionary file name>
# File names are relative to UdmSearch /etc directory
# Absolute paths can be also specified.
#
Affix de german.aff
Spell de german.dict



#######################################################################
# MaxDocSize bytes
# Default value 1048576 (1 Mb)
# Takes global effect for whole config file
#MaxDocSize 104857600


#######################################################################
# Include <filename>
# Include enother configuration file.
# Absolute path if <filename> starts with "/":
#Include /usr/local/udmsearch/etc/inc1.conf
# Relative path else:
#Include inc1.conf


#######################################################################
# HTTPHeader <header>
# You may add your desired headers in indexer HTTP request
# You should not use "If-Modified-Since","Accept-Charset" headers,
# these headers are composed by indexer itself.
# "User-Agent: UdmSearch/version" is sent too, but you may override it.
# Command has global effect for all configuration file.
#
#HTTPHeader User-Agent: My_Own_Agent
#HTTPHeader Accept-Language: ru, en
#HTTPHeader From: [EMAIL PROTECTED]



#################################################################
#ForceIISCharset1251 yes/no
#This option is useful for users which deals with Cyrillic content and broken
#(or misconfigured?) Microsoft IIS web servers, which tends to not report
#charset correctly. This is really dirty hack, but if this option is turned on
#it is assumed that all servers which reports as 'Microsoft' or 'IIS' have
#content in Windows-1251 charset.
#This command should be used only once in configuration file and takes global
#effect.
#Default: no
#ForceIISCharset1251 no



#######################################################################
#Period <seconds>
# Does not matter for built-in text files support
# Reindex period in seconds, 604800 = 1 week
# Can be set many times before "Server" command and
# takes effect till the end of config file or till next Period command.
#Period 604800


#######################################################################
#Tag <number>
# Use this field for your own purposes. For example for grouping
# some servers into one group, etc...
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next Tag command.
#Tag 0


#######################################################################
#MaxHops <number>
# Maximum way in "mouse clicks" from start url.
# Default value is 256.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next MaxHops command.
#MaxHops 256


#######################################################################
#MaxNetErrors <number>
# Maximum network errors for each server.
# Default value is 16.
# If there too many network errors on some server 
# (server is down, host unreachable, etc) indexer will try to do 
# not more then 'number' attempts to connect to this server.
# Takes effect till the end of config file or till next MaxNetErrors command.
#MaxNetErrors 16


#######################################################################
#ReadTimeOut <seconds>
# Maximum timeout while downloading document for each server.
# Default value is 90.
# Can be set any times before "Server" command and
# takes effect till the end of config file or till next ReadTimeOut command.
#ReadTimeOut 90


#######################################################################
#Robots yes/no
# Allows/disallows using robots.txt and <META NAME="robots">
# exclusions. Use "no", for example for link validation of your server(s).
# Command may be used several times before "Server" command and
# takes effect till the end of config file or till next Robots command.
# Default value is "yes".
#Robots yes

#######################################################################
#Clones yes/no
# Allow/disallow clone eliminating
# Default value is "yes".
#Clones yes


#######################################################################
#TitleWeight <number>
# Weight of the words in the <title>...</title>
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next TitleWeight command.
# Default value is 2
#TitleWeight 2


#######################################################################
#BodyWeight <number>
# Weight of the words in the <body>...</body> of the html documents 
# and in the content of the text/plain documents.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next BodyWeight command.
# Default value is 1
BodyWeight 3


#######################################################################
#DescWeight <number>
# Weight of the words in the <META NAME="Description" Content="...">
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next DescWeight command.
# Default value is 2
#DescWeight 2


#######################################################################
#KeywordWeight <number>
# Weight of the words in the <META NAME="Keywords" Content="...">
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next KeywordWeight command.
# Default value is 2
#KeywordWeight 2


#######################################################################
#UrlWeight <number>
# Weight of the words in the URL of the documents.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next UrlWeight command.
# Default value is 0
UrlWeight 0


#######################################################################
#UrlHostWeight <number>
# Weight of the words in the hostname part of URL of the documents.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next UrlHostWeight command.
# Default value is 0
UrlHostWeight 0


#######################################################################
#UrlPathWeight <number>
# Weight of the words in the path part of URL of the documents.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next UrlPathWeight command.
# Default value is 0
UrlPathWeight -1


#######################################################################
#UrlFileWeight <number>
# Weight of the words in the filename part of URL of the documents.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next UrlFileWeight command.
# Default value is 0
UrlFileWeight -1


######################################################################
# Spell checking. You can change the factors of word weight depending on
# whether word is found in Ispell dictionaries or not. Setting the 
# "IspellCorrectFactor" to 0 will prevent indexer from storing words with
# right spelling in database. The only incorrect words will be stored
# in database in this case. Then you may easily find incorrect words
# and correspondent URLs where those words are found. If no
# ispell files are used all word are considered as "incorrect".
#
IspellCorrectFactor     1
IspellIncorrectFactor   1

#######################################################################
# Numbers indexing. By default numbers and words which contain both
# digits and letters (like "3a","U2") are stored in database. You may change 
# this behaviour by setting into "0" weight factors. Usefull for spell checking
# in combination with previous commands.
#
#NumberFactor 1
#AlnumFactor  1

#######################################################################
# Word lengths. You may change default length range of words
# stored in database. By default, words with the length in the
# range from 1 to 32 are stored. Note that setting MaxWordLength more
# than 32 will not work as expected.
#
#MinWordLength 1
#MaxWordLength 32


#######################################################################
#DeleteBad yes/no
# Use it to choose whether delete or not bad (not found, forbidden etc) URLs
# from database. 
# May be used multiple times before "Server" command and
# takes effect till the end of config file or till next DeleteBad command.
# Default value is "no", that means do not delete bad URLs.
#DeleteBad no


#######################################################################
#DeleteNoServer yes/no
# Use it to choose whether delete or not those URLs which have no
# correspondent "Server" commands.
# Default value is "yes".
#DeleteNoServer yes


#######################################################################
#Index yes/no
# Prevent indexer from storing words into database.
# Useful for example for link validation.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next Index command.
# Default value is "yes".
#Index yes


#######################################################################
#Follow yes/no
# Allow/disallow indexer to store <a href="..."> into database.
# Can be set multiple times before "Server" command and
# takes effect till the end of config file or till next Follow command.
# Default value is "yes".
#Follow no


#######################################################################
#FollowOutside yes/no
# Allow/disallow indexer to walk outside servers given in config file.
# Should be used carefully (see MaxHops command)
# Takes effect till the end of config file or till next FollowOutside command.
# Default value "no"
#FollowOutside no


########################################################################
#CharSet <charset>
# Useful for 8 bit character sets.
# WWW-servers send data in different charsets.
#<Charset> is default character set of server in next "Server" command(s).
#This is required only for "bad" servers that do not send information
#about charset in header: "Content-type: text/html; charset=some_charset"
# and have not <META NAME="Content" Content="text/html; charset=some_charset">
#Can be set before every "Server" command and
# takes effect till the end of config file or till next CharSet command.
#CharSet windows-1251


#########################################################################
#Proxy your.proxy.host[:port]
# Use proxy rather then connect directly
#One can index ftp servers when using proxy
#Default port value if not specified is 3128 (Squid)
#If proxy host is not specified direct connect will be used.
#Can be set before every "Server" command and
# takes effect till the end of config file or till next Proxy command.
#If no one "Proxy" command specified indexer will use direct connect.
#
#           Examples:
#           Proxy on atoll.anywhere.com, port 3128:
#Proxy atoll.anywhere.com
#
#           Proxy on lota.anywhere.com, port 8090:
#Proxy lota.anywhere.com:8090
#
#           Disable proxy (direct connect):
#Proxy


#########################################################################
#AuthBasic login:passwd
# Use basic http authorization 
# Can be set before every "Server" command and
# takes effect only for next one Server command!
# Examples:
#AuthBasic somebody:something  

# If you have password protected directory(ies), but whole server is open,use:
#AuthBasic login1:passwd1
#Server http://my.server.com/my/secure/directory1/
#AuthBasic login2:passwd2
#Server http://my.server.com/my/secure/directory2/
#Server http://my.server.com/


#########################################################################
#Server <url>
# It is the main command of indexer.conf file
# It's used to add start URL of server
# You may use "Server" command as many times as number of different
# servers required to be indexed.
# You can index ftp servers when using proxy:
#Server ftp://localhost/
#Server http://bashir.linux-magazin.de/web_mirror
#Server file:/user/public/web_mirror/
Server http://www.linux-magazin.de/ausgabe/


#########################################################################
#Alias <master> <mirror>
# You can use this command for example to organize search through 
#master site by indexing mirror site.
# UdmSearch will display URLs from master site while searching
# but go to the mirror site while indexing. This command has global
# indexer.conf file effect. You may use several aliases in one indexer.conf.
#Alias http://www.mysql.com/ http://mysql.udm.net/
Alias http://www.linux-magazin.de/ausgabe/ file://foobar/user/public/web_mirror/



##########################################################################
#DisallowNoMatch <regexp> [<regexp> ... ]
# Use this to disallow URLs that do not match given regexps.
# Takes global effect for config file.
# Examples:
# Disalow URLs that are not in udm.net domains:
#DisallowNoMatch  udm\.net\/  
# or
# Disallow any except known extensions and directory index:
#DisallowNoMatch \/$|\.htm$|\.html$|\.shtml$|\.txt$
#Disallow \/$
#Allow \.html$|\.htm$|\.txt$
#DisallowNoMatch \.html$ \.htm$ \.txt$

#Disallow /[.]{1,2} /\%2e /\%2f
#Disallow ^\. \/\.\.\/

Allow \.html$ \.htm$ \.txt$ \/$
Disallow .*

CheckOnly /$


################################################################
#AddType <mime type> <regexp> [<regrexp>..]
# This command associates filename extensions (for services
# that don't automatically include them - like file:) with a mime types
#
AddType text/plain      \.pl$ \.js$ \.txt$ \.h$ \.c$ \.pm$ \.e$
AddType text/html       \.html$ \.htm$
AddType image/x-xpixmap \.xpm$
AddType image/x-xbitmap \.xbm$
AddType image/gif       \.gif$
AddType application/unknown     .*

#
# Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
#
# This is used to add support for parsing documents with mime types other
# than text/plain and text/html. It can be done via external parser (which
# should provide output in plain or html text) or just by substituting mime
# type so indexer will understand it.
#
# <from_mime> and <to_mime> are standard mime types
# <to_mime> should be either text/plain or text/html
#
# We assume external parser generates results on stdout (if not, you have to
# write a little script and cat results to stdout)
#
# Optional charset parameter used to change charset if needed
#
# Command line parameter is optional. If there's no command line, this is
# used to change mime type. Command line could also have $1 parameter which
# stands for temporary file name. Some parsers could not operate on stdin,
# so indexer creates temporary file for parser and it's name passed instead
# of $1
#
#       from_mime               to_mime[;charset]       [command line [$1]]
#
#Mime application/msword        text/plain;cp1251      "catdoc $1"
#Mime application/x-troff-man   text/plain             "deroff"
#Mime text/x-postscript         text/plain              "ps2ascii"


##############################################################
# Mirroring parameters commands.
#
# You may specify a path to root dir to enable sites mirroring
#MirrorRoot /path/to/mirror
#
# You may specify as well root dir of mirrored document's headers
#indexer will store HTTP headers to disc too.
#MirrorHeadersRoot /path/to/headers
#
# You may specify period in seconds during wich earlier mirrored files 
#will be used while indexing instead of real downloading.
# It is very usefull when you do some expiriments with UdmSearch
# indexing the same hosts and do not want much traffic from/to Internet.
# If MirrorHeadersRoot is not specified and headers are not stored on disc
# default Content-Type's given in AddType commands will be used.
# Default value of the MirrorPeriod is -1, wich means do not use 
# mirrored files.
# This will force reading local copies during 1 day:
#
#MirrorPeriod 86400

Reply via email to