which lucene

2009-03-28 Thread Timon Roth
hello luceners

i have installed lucene on my debian testing system (linux), so there is the 
jar file lucene-1.4.3.jar under /usr/share/java.

so far so good. it contains a german stemmer and a german analyzer under 
org.apache.lucene.analysis.de, which work pretty well.

but the official release, e.g. from 
http://mirror.switch.ch/mirror/apache/dist/lucene/java/, is 2.4.1, which is 
quite different.

there is no german analyzer in this package, but some other features are 
available, like setAllowLeadingWildcard(true), which are not included in the 
official debian package of 1.4.3.

so my questions:

which of the releases is recommended, 1.4.3 or 2.4.1?
how do i get a 2.4.1 release with a german stemmer/analyzer?

my goal is to search with lucene over a large number of text files containing 
german, french and italian text.

thank you for your attention, and greetings from switzerland (the land with the 
many äöü's..:-),
timon

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



newbie question again

2009-03-30 Thread Timon Roth
hello list

sorry, this may be a stupid question, but i can't resolve it, so maybe you can 
help me:

i am trying to compile the indexer example from the book lucene in action, 2nd 
edition (http://www.manning.com/hatcher3/hatcher_meapch1.pdf), but i get the 
following error:
--
javac -Xlint -cp ":.:./lucene/lucene-core-2.4.1.jar" Indexer.java
Indexer.java:37: cannot find symbol
symbol  : constructor FSDirectory(java.io.File,<nulltype>)
location: class org.apache.lucene.store.FSDirectory
Directory dir = new FSDirectory(new File(indexDir), null);
^
1 error
--

it refers to the following code segment:

public Indexer(String indexDir) throws IOException {
Directory dir = new FSDirectory(new File(indexDir), null);
writer = new IndexWriter(dir, new StandardAnalyzer(), true, 
IndexWriter.MaxFieldLength.UNLIMITED);
}


i'm using debian testing with:
java -version
java version "1.6.0_0"
OpenJDK  Runtime Environment (build 1.6.0_0-b11)
OpenJDK Server VM (build 1.6.0_0-b11, mixed mode)

source code and makefile are attached:

thanks for your help,
timon
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;

import java.io.File;
import java.io.IOException;
import java.io.FileReader;

public class Indexer {

private IndexWriter writer;

public static void main(String[] args) throws Exception {
	if (args.length != 2) {
		throw new Exception("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
		}

	String indexDir = args[0];
	String dataDir = args[1];

	long start = System.currentTimeMillis();
	Indexer indexer = new Indexer(indexDir);
	int numIndexed = indexer.index(dataDir);
	indexer.close();
	long end = System.currentTimeMillis();

	System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
	}



public Indexer(String indexDir) throws IOException {
	
	Directory dir = new FSDirectory(new File(indexDir), null);
	writer = new IndexWriter(dir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
	}

public void close() throws IOException {
	writer.close();
	}

public int index(String dataDir) throws Exception {
	
	File[] files = new File(dataDir).listFiles();

	for (int i = 0; i < files.length; i++) {
		
		File f = files[i];
		if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && acceptFile(f)) {
			indexFile(f);
			}
		}
	return writer.numDocs();
	}


protected boolean acceptFile(File f) {
	return f.getName().endsWith(".txt");
	}


protected Document getDocument(File f) throws Exception {
	Document doc = new Document();
	doc.add(new Field("contents", new FileReader(f)));
	doc.add(new Field("filename", f.getCanonicalPath(),
	Field.Store.YES, Field.Index.NOT_ANALYZED));
	return doc;
	}


private void indexFile(File f) throws Exception {
	System.out.println("Indexing " + f.getCanonicalPath());
	Document doc = getDocument(f);
	if (doc != null) {
		writer.addDocument(doc);
		}
	}
}
LANG='de_CH';
CP="$(CLASSPATH):.:./lucene/lucene-core-2.4.1.jar"

all:
	javac -Xlint -cp $(CP) Indexer.java



problem with index format and luke

2009-05-08 Thread Timon Roth
hello list

i am using lucene 2.9. when i try to open the index with luke i get an error:

unknown format version: -8

any hints?




Re: problem with index format and luke

2009-05-09 Thread Timon Roth
sounds bad...

i use luke 0.9.2 (2009-03-20), which supports lucene up to version 2.4. so why 
am i using lucene 2.9? are there any other monitoring tools?


On Friday, 8 May 2009, Grant Ingersoll wrote:
> This usually means that your index was created using a newer version  
> of Lucene than is bundled with Luke.  You will need to get the Luke  
> minimal jars (no Lucene) and use that along with the Lucene versions  
> you have.
> 
> On May 8, 2009, at 12:42 PM, Timon Roth wrote:
> 
> > hello list
> >
> > i am using lucene 2.9. when i try to open the index with luke i get  
> > an error:
> >
> > unknown format version: -8
> >
> > any hints?
> >
> >
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 
> 



-- 
Timon Roth
Triemlistrasse 92
8047 Zürich
--
043 817 40 31
079 636 57 28
--
digitalforce.ch
timon.r...@digitalforce.ch
http://tel.search.ch/zuerich/triemlistrasse-92/timon-roth




german analyzer vexes me

2009-05-12 Thread Timon Roth
hello list

a little confusion with a phrase query. i'm using lucene 2.9 and have indexed 
all the data with the GermanAnalyzer.

i have one field (full_text) for the searchable data and a few fields for 
sorting. the full_text field is analyzed but not stored. the fields for sorting 
are stored but not analyzed.

doc.add(new Field("full_text", value,Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("needs_sort", value,Field.Store.YES, 
Field.Index.NOT_ANALYZED));

so i run the following phrase search: "öffentliche finanzen und abgaberecht"...

the queryparser is fed with the GermanAnalyzer and translates the phrase 
to "offentlich finanx abgaberech".

QueryParser parser = new QueryParser("full_text", new GermanAnalyzer());

but the result is not as expected.

it returns all documents that have the phrase in a sort field, which i am not 
using for searching.

other search queries work pretty well, for example "gemeindeautonomie; 
art. 8, 9 und 26 bv"

any hints?




confusion with questionmark

2009-05-20 Thread Timon Roth
dear list

i'm searching through a lucene (2.9) index built with the GermanAnalyzer 
(from the analyzers 2.9 package).

when i search for the word deutschland (the query, parsed with the german 
analyzer, is transformed to deutschla) i get a few hits.

when i search for deu?schland i get no results, because the word is left 
as it is (deu?schland).

when i try deu?schla (which matches the stem deutschla), i get the same number 
of hits as when i search for deutschland.

so where did i go wrong?..:-)

greetings,
timon
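
A likely explanation, stated here as an assumption since the thread contains no reply: lucene's QueryParser does not run wildcard terms through the analyzer, so deu?schland is compared as typed against the stemmed terms in the index. The JDK-only sketch below shows why that pattern cannot match the stem deutschla while deu?schla can. The class and method names are hypothetical, and the regex translation only imitates wildcard matching; it is not lucene code.

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    /** Translates a wildcard pattern (? = exactly one char, * = any run of chars) into a regex. */
    public static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                // Quote every literal character so it has no regex meaning.
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** True if the wildcard pattern matches the whole indexed term. */
    public static boolean matches(String wildcard, String indexedTerm) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    public static void main(String[] args) {
        // Stem that the message reports for "deutschland".
        String indexedTerm = "deutschla";
        System.out.println(matches("deu?schland", indexedTerm)); // prints "false"
        System.out.println(matches("deu?schla", indexedTerm));   // prints "true"
    }
}
```

Under that assumption nothing went wrong in the index. The wildcard pattern simply has to be written against the stemmed form, or the field has to be indexed without stemming.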



read between the lines of an index

2009-05-20 Thread Timon Roth
dear list

i want to add an entry to an index together with a custom synonym list. for 
example with the following text:

[i worry about nothing because this world is crazy]

and i want to add the two custom synonyms

[anything]=>[nothing] 
and 
[lazy]=>[crazy]

so that a search for lazy, crazy, nothing or anything gives me a hit on the 
entry in the index.

the point is that phrase search must still work. for example, a search for:

"this world is crazy" or "i worry about nothing" must result in a hit, so i 
cannot just paste the synonyms after the existing words like this:

[i worry about nothing anything because this world is crazy lazy]

how can i do this? is there a possibility to insert more than one word at the 
same position?

greetings,
timon
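
For what it is worth, lucene tokens carry a position increment, and a token with an increment of 0 is indexed at the same position as the token before it. That is the usual way to stack synonyms without breaking phrase search. The JDK-only sketch below models the idea. PositionSketch and its methods are hypothetical names and only handle increments of 0 and 1; it is not lucene code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionSketch {

    /** terms.get(p) holds every term indexed at position p. */
    private final List<Set<String>> terms = new ArrayList<Set<String>>();

    /** Increment 1 opens a new position; increment 0 stacks the term on the previous one. */
    public void add(String term, int positionIncrement) {
        if (positionIncrement > 0 || terms.isEmpty()) {
            terms.add(new HashSet<String>());
        }
        terms.get(terms.size() - 1).add(term);
    }

    /** True if the words occur at consecutive positions somewhere in the document. */
    public boolean matchesPhrase(String... phrase) {
        for (int start = 0; start + phrase.length <= terms.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.length && ok; i++) {
                ok = terms.get(start + i).contains(phrase[i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Builds the example sentence with the two synonyms stacked in place. */
    public static PositionSketch example() {
        PositionSketch doc = new PositionSketch();
        for (String w : "i worry about nothing because this world is crazy".split(" ")) {
            doc.add(w, 1);
            if (w.equals("nothing")) doc.add("anything", 0);
            if (w.equals("crazy")) doc.add("lazy", 0);
        }
        return doc;
    }

    public static void main(String[] args) {
        PositionSketch doc = example();
        System.out.println(doc.matchesPhrase("this", "world", "is", "crazy"));  // prints "true"
        System.out.println(doc.matchesPhrase("this", "world", "is", "lazy"));   // prints "true"
        System.out.println(doc.matchesPhrase("i", "worry", "about", "anything")); // prints "true"
        System.out.println(doc.matchesPhrase("nothing", "anything"));           // prints "false"
    }
}
```

In lucene itself this would presumably be done in a custom TokenFilter that emits the synonym with setPositionIncrement(0) right after the original token.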



where's the word

2009-06-24 Thread Timon Roth
hello list

i'm puzzling over the following problem. in my index i can't find the word BE, 
but it exists in two documents. i'm using lucene 2.4 with the standardanalyzer.

other queries with words like de, et or de la work fine. any ideas?

greetings,
timon


Re: where's the word

2009-06-25 Thread Timon Roth
hi paul

i have now tried the hint from mark miller: disabling all the stopwords of the 
standardanalyzer...

String stop_words[] = new String[0];
...StandardAnalyzer(stop_words);

this works perfectly..;-)

greetings,
timon

On Thursday, 25 June 2009, Paul Libbrecht wrote:
> 
> On 25 June 2009, at 01:28, Mark Miller wrote:
> >> i'm puzzling over the following problem. in my index i can't find  
> >> the word BE, but it exists in two documents. i'm using lucene 2.4  
> >> with the standardanalyzer.
> >> other queries with words like de, et or de la work fine. any ideas?
> > be is a stopword. Do yourself a favor and turn off stopwords. Best  
> > to remove them at query time if you really need to.
> 
> Timon, you spotted it: the analyzer. You need to take care to pick the  
> right analyzer and, if that language token (?) is in a different field,  
> you need to use a different analyzer, e.g. WhitespaceAnalyzer...
> 
> paul
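
To make the effect concrete, here is a JDK-only sketch of stop word filtering. The class name, the whitespace tokenizer and the short stop list are assumptions for illustration. Only the fact that "be" is a stop word comes from the thread, and the real StandardAnalyzer tokenizes differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    /** Lowercases the text, splits it on whitespace and drops every token found in stopWords. */
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!tok.isEmpty() && !stopWords.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative subset of an English stop list (assumption); "be" is the entry that matters here.
        Set<String> english = new HashSet<String>(Arrays.asList("a", "an", "and", "be", "the", "to"));
        Set<String> none = new HashSet<String>();

        System.out.println(analyze("BE de et de la", english)); // prints "[de, et, de, la]"
        System.out.println(analyze("BE de et de la", none));    // prints "[be, de, et, de, la]"
    }
}
```

With the stop list in place, BE never reaches the index, which matches the symptom described above. With an empty list it is indexed like any other word.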

