from:"Bo Gundersen"

Re: Free software to crawl internet site?

2004-09-29 Thread Bo Gundersen

Zhang, Lisheng wrote:
Hi,
Does anyone know if there is free software to crawl an internet site
(a web crawler)? I know Lucene does not currently have this feature,
according to the official Lucene FAQ.
Thanks very much for any help,

In the Lucene sandbox there is a pretty advanced crawler called LARM.
You can check it out at this URL:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/ (right at the bottom).

--
Bo Gundersen
DBA/Software Developer
M.Sc.CS.
www.atira.dk
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Accent filter

2004-09-28 Thread Bo Gundersen

Hi,
I am certainly not the first, and probably not the last, to have had 
problems with accented characters in my index. Unfortunately, I 
couldn't find anything in either Lucene or the lucene-sandbox to solve 
the problem.
So I wrote an accent filter and thought that I might as well share it 
with you guys :)

--
Bo Gundersen
DBA/Software Developer
M.Sc.CS.
www.atira.dk
package dk.atira.search;

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * This filter converts accented characters to their non-accented versions.
 * It also strips unwanted characters from the tokens, meaning anything 
 * but A-Z, a-z, 0-9, ÆØÅ, æøå and -
 * The valid characters can be changed by adding them to the string validCharsStr.
 * 
 * Created by Bo Gundersen at Sep 28, 2004 12:39:04 PM 
 *
 * @author Bo Gundersen ([EMAIL PROTECTED])
 */
public class AccentFilter
extends TokenFilter
{
private static final Collection validChars = new HashSet();
private static final String validCharsStr = 
"abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5" +
"ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8\u00C5" +
"0123456789" +
"-";
static {
for(int i=0; i<validCharsStr.length(); i++)
validChars.add(new Character(validCharsStr.charAt(i)));
}

private static final Map accents = new HashMap();
static {
accents.put(new Character('\u00C0'), "A");
accents.put(new Character('\u00C1'), "A");
accents.put(new Character('\u00C2'), "A");
accents.put(new Character('\u00C3'), "A");
accents.put(new Character('\u00E0'), "a");
accents.put(new Character('\u00E1'), "a");
accents.put(new Character('\u00E2'), "a");
accents.put(new Character('\u00E3'), "a");
accents.put(new Character('\u00E4'), "a");

accents.put(new Character('\u00C8'), "E");
accents.put(new Character('\u00C9'), "E");
accents.put(new Character('\u00CA'), "E");
accents.put(new Character('\u00CB'), "E");
accents.put(new Character('\u00E8'), "e");
accents.put(new Character('\u00E9'), "e");
accents.put(new Character('\u00EA'), "e");
accents.put(new Character('\u00EB'), "e");

accents.put(new Character('\u00CC'), "I");
accents.put(new Character('\u00CD'), "I");
accents.put(new Character('\u00CE'), "I");
accents.put(new Character('\u00CF'), "I");
accents.put(new Character('\u00EC'), "i");
accents.put(new Character('\u00ED'), "i");
accents.put(new Character('\u00EE'), "i");
accents.put(new Character('\u00EF'), "i");

accents.put(new Character('\u00D1'), "N");
accents.put(new Character('\u00F1'), "n");

accents.put(new Character('\u00D2'), "O");
accents.put(new Character('\u00D3'), "O");
accents.put(new Character('\u00D4'), "O");
accents.put(new Character('\u00D5'), "O");
accents.put(new Character('\u00D6'), "O");
accents.put(new Character('\u00F2'), "o");
accents.put(new Character('\u00F3'), "o");
accents.put(new Character('\u00F4'), "o");
accents.put(new Character('\u00F5'), "o");
accents.put(new Character('\u00F6'), "o");

accents.put(new Character('\u00D9'), "U");
accents.put(new Character('\u00DA'), "U");
accents.put(new Character('\u00DB'), "U");
accents.put(new Character('\u00DC'), "U");
accents.put(new Character('\u00F9'), "u");
accents.put(new Character('\u00FA'), "u");
accents.put(new Character('\u00FB'), "u");
accents.put(new Character('\u00FC'), "u");

accents.put(new Character('\u00DD'), "Y");
accents.put(new Character('\u00FD'), "y");
accents.put(new Character('\u00FF'), "y");
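
The attached listing breaks off here in the archive, before the part of the filter that processes tokens. As a rough, library-free sketch of the two steps the class comment describes (replace accented characters, then drop anything outside the valid set), the idea could look like the following. The class name AccentFolding and the method fold are hypothetical and not part of the original AccentFilter. The sketch works on a plain String rather than a Lucene Token, and its accent table is only a small subset chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT the original AccentFilter: it applies the same
// two steps (fold accents, strip invalid characters) to a plain String.
class AccentFolding {
    // Characters that are kept unchanged, mirroring validCharsStr in the listing.
    private static final String VALID =
        "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5" +
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8\u00C5" +
        "0123456789" +
        "-";

    // A small subset of the accent table, for illustration only.
    private static final Map<Character, String> ACCENTS = new HashMap<>();
    static {
        ACCENTS.put('\u00E0', "a");
        ACCENTS.put('\u00E8', "e");
        ACCENTS.put('\u00E9', "e");
        ACCENTS.put('\u00FB', "u");
        ACCENTS.put('\u00F1', "n");
        ACCENTS.put('\u00D6', "O");
    }

    // Returns the term with mapped accents replaced and all characters
    // that are neither mapped nor valid removed.
    static String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String plain = ACCENTS.get(c);
            if (plain != null) {
                out.append(plain);      // accented character: use the plain form
            } else if (VALID.indexOf(c) >= 0) {
                out.append(c);          // valid character: keep as-is
            }
            // anything else is silently dropped
        }
        return out.toString();
    }
}
```

Calling AccentFolding.fold on a single token strips the accents and any punctuation, while the Danish letters in the valid set pass through untouched. In a real TokenFilter the same transformation would be applied to each token's text.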

Re: re-indexing

2004-09-28 Thread Bo Gundersen

Jason wrote:
I am having trouble reindexing.
Basically what I want to do is:
1. Delete the old index
2. Write the new index.
The environment:
The index is searched by a web app running on the Orion App Server. This
code runs fine and reindexes fine prior to any searches. After the first
search against the index is completed, the index ends up being read-only
(or not writeable); I cannot reindex and subsequently cannot search
because the index is incomplete.

We have several apps running like this, only on Tomcat and JBoss, with no 
problems...

1. Why doesn't IndexReader.delete(i) really delete the file? It seems to
just make another 1K file with a .del extension that the IndexWriter still
cannot contend with.

I have never tried the IndexReader.delete() method. We generally build the new 
index in a temporary directory, and when the index is done we delete the 
current online directory (using java.io.File methods) and then rename 
the temp directory to online.
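
The directory swap described above could be sketched roughly as follows, using only java.io.File. The class name IndexSwap and the methods deleteRecursively and swap are hypothetical names for illustration. The sketch assumes that both directories sit on the same filesystem (so that renameTo can succeed) and that nothing still has the online index open.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper: replaces the online index directory with a freshly
// built one. Not taken from the original poster's code.
class IndexSwap {
    // Deletes a file or a directory tree; does nothing if it does not exist.
    static void deleteRecursively(File f) throws IOException {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        if (f.exists() && !f.delete()) {
            throw new IOException("Could not delete " + f);
        }
    }

    // Removes the current online directory, then renames the temp
    // directory (holding the new index) to take its place.
    static void swap(File tempDir, File onlineDir) throws IOException {
        deleteRecursively(onlineDir);
        if (!tempDir.renameTo(onlineDir)) {
            throw new IOException("Could not rename " + tempDir + " to " + onlineDir);
        }
    }
}
```

Between the delete and the rename there is a short window in which no online index exists, so searches arriving at that moment would need to be retried or held back.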

2. How can I make this work?

This may just be silly, but do you remember to close your 
org.apache.lucene.search.IndexSearcher when you are done with your search?

--
Bo Gundersen
DBA/Software Developer
M.Sc.CS.
www.atira.dk
