Problem in WebLucene

2004-10-07 Thread Sumathi


  Hello,

I'm trying to use WebLucene in our application. I have created the index
using the IndexRunner class successfully.

  When I try to access the web application using
http://localhost:8080/weblucene/search?dir=blog&query=query

  I get a blank page, with the following error in the console:

  Caught error: java.io.IOException: D:\home\weblucene\webapp\WEB-INF\var\blog\index not a directory
  java.io.IOException: D:\home\weblucene\webapp\WEB-INF\var\blog\index not a directory

  Where should I set the path for the WebLucene directory?
  Where could the problem be?

  Thanks in advance!



RE: Search Lucene documents returns 0 hits

2004-10-07 Thread Fred Yu
Thanks Lars, thanks heaps!

-Original Message-
From: Lars Klevan [mailto:[EMAIL PROTECTED]
Sent: Friday, October 08, 2004 3:30 AM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits


Use BooleanQuery to combine multiple Queries:

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("type", "stockSingle")), true, false);
query.add(new TermQuery(new Term("seqNo", "1000")), true, false);
...
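To get the effect asked about later in this thread -- filtering on all three terms at once -- one approach is to wrap such a BooleanQuery in a QueryFilter. A sketch against the Lucene 1.4-era API; the combination shown is an assumption, not code from the thread:

```java
// Sketch: combine required TermQuery clauses in a BooleanQuery and wrap
// the result in a QueryFilter. Class and method signatures follow the
// Lucene 1.4-era API; adjust for other versions.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class FilterSketch {
    public static QueryFilter buildTypeCodeSeqFilter() {
        BooleanQuery filterQuery = new BooleanQuery();
        // (required=true, prohibited=false) makes each clause mandatory
        filterQuery.add(new TermQuery(new Term("type", "stockSingle")), true, false);
        filterQuery.add(new TermQuery(new Term("code", "1234")), true, false);
        filterQuery.add(new TermQuery(new Term("seqNo", "1000")), true, false);
        return new QueryFilter(filterQuery);
    }
}
```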

-Original Message-
From: Fred Yu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 06, 2004 5:36 PM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits


Hi Lars

Thanks for that! That solved my problem.

By the way, I need to build a QueryFilter using MultiTermQuery. How do I
create a MultiTermQuery object that contains three terms, e.g. new Term("type",
"stockSingle"), new Term("code", "1234"), new Term("seqNo", "1000")?

Thanks

Fred


-Original Message-
From: Lars Klevan [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 07, 2004 9:59 AM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits


If you're indexing with a Keyword field you need to use a TermQuery.
QueryParser will only work for Text fields.

The reason for this is that both the Text field and the QueryParser use
the Analyzer to chop up the input into searchable chunks.  Depending on
the Analyzer this includes converting to lower-case, stripping trailing
"s" and "ing" and removing stopwords like "the" and "and".  The
TermQuery and Keyword field both treat the input exactly as is.
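In other words (a sketch, assuming the Lucene 1.4-era API used elsewhere in this thread -- the helper class itself is illustrative):

```java
// Sketch: a Keyword field is stored un-analyzed, so it must be matched
// with a TermQuery carrying the exact term text. QueryParser would run
// "stockSingle" through the analyzer (e.g. lowercasing it) and miss.
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class KeywordSearch {
    public static Hits searchType(IndexSearcher searcher) throws IOException {
        // exact, unanalyzed match against the Keyword field
        Query query = new TermQuery(new Term("type", "stockSingle"));
        return searcher.search(query);
    }
}
```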

-Lars

-Original Message-
From: Fred Yu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 06, 2004 3:25 PM
To: [EMAIL PROTECTED]
Subject: Search Lucene documents returns 0 hits


Hi

Does anyone know why Lucene returns 0 hits when there are in fact three
matches? Attached are two Java classes that reproduce the problem. In the
example, I created a Keyword field "type" for each document added. Lucene
can correctly find the documents if I use a "Text" field instead of a
"Keyword" field.


Thanks in advance
Fred

package test;

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexItems {
  public static void main(String[] args) throws IOException {
try {
  IndexWriter writer = new IndexWriter("/test/index", new StandardAnalyzer(), true);
  indexDocs(writer);

  writer.optimize();
  writer.close();

  System.out.println("index finished.");

} catch (IOException e) {
  System.out.println(" caught a " + e.getClass() +
      "\n with message: " + e.getMessage());
}
  }

  private static void indexDocs(IndexWriter writer)
throws IOException {
Document document=new Document();

document.add(Field.Keyword("type", "stockSingle"));
document.add(Field.Text("desc", "test single 1"));
writer.addDocument(document);

document=new Document();
document.add(Field.Keyword("type", "stockSingle"));
document.add(Field.Text("desc", "test single 2"));
writer.addDocument(document);

document=new Document();
document.add(Field.Keyword("type", "stockItem"));
document.add(Field.Text("desc", "test single 3"));
writer.addDocument(document);
  }
}

package test;

import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;

public class SearchItems {
  public static void main(String[] args) {
try {
  Searcher searcher = new IndexSearcher("/test/index");
  QueryParser qp=new QueryParser("type", new StandardAnalyzer());
  Query query=qp.parse("type:stockSingle");

  Hits hits = searcher.search(query);
  System.out.println("search found: " + hits.length() + " total matching documents");

  searcher.close();
} catch (Exception e) {
  System.out.println(" caught a " + e.getClass() +
 "\n with message: " + e.getMessage());
}
  }
}




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: multifield-boolean vs singlefield-enum query performance

2004-10-07 Thread Doug Cutting
Tea Yu wrote:
> For the following implementations:
> 1) storing boolean strings in fields X and Y separately
> 2) storing the same info in a field XY as 3 enums: X, Y, B, N, meaning only X
>    is True, only Y is True, both are True, or both are False
> Is there a significant performance gain when we substitute "X:T OR Y:T" with
> "XY:B", and a significant loss in substituting "X:T" with "XY:X OR XY:B"? Or
> are they negligible?

As with most performance questions, it's best to try both and measure!
It depends on the size of your index, the relative frequencies of X and
Y, etc.

Doug


Re: Analyzer reuse

2004-10-07 Thread Justin Swanhart
Yes, you can reuse analyzers.  The only performance gain will come from
not having to create the objects and not having garbage collection
overhead.  I create one for each of my index-reading threads.

On Thu, 07 Oct 2004 16:59:38 +, sam s <[EMAIL PROTECTED]> wrote:
> Hi,
> Can an instance of an analyzer be reused?
> If yes, will it give any performance gain?
> 
> sam
> 
> 
> 
>




Analyzer reuse

2004-10-07 Thread sam s
Hi,
Can an instance of an analyzer be reused?
If yes, will it give any performance gain?
sam



Re: Arabic analyzer

2004-10-07 Thread Grant Ingersoll
Someone posted an Arabic analyzer about a year ago; however, I don't
think the licensing was very friendly, and we no longer use it.

We have a cross-language system that works with Arabic (among other
languages).  We have written several stemmers based on the literature
that perform pretty well and were not too difficult to implement (but
are not available as open source at this point).  Light stemming seems
to work much better in IR applications than aggressive stemmers, due to
the problems with roots discussed earlier.

-Grant

--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
http://www.cnlp.org 



>>> [EMAIL PROTECTED] 10/7/2004 8:45:42 AM >>>
Dawid Weiss wrote:

>> nothing to do with each other furthermore, Arabic uses phonetic
>> indicators on each letter called diacritics that change the way you
>> pronounce the word which in turn changes the words meaning so two
>> words spelled exactly the same way with different diacritics will
>> mean two separate things,

> Just to point out the fact: most Slavic languages also use diacritic
> marks (above, like 'acute', or 'dot' marks, or below, like the Polish
> 'ogonek' mark). Some people argue that they can be stripped off the
> text upon indexing and that the queries usually disambiguate the
> context of the word.

Hmm. This brings up a question: the algorithmic stemmer package from
Egothor works quite well for Polish (http://www.getopt.org/stempel),
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two or
three Arabic words ;-) ), but I can certainly help someone get
started with testing...

-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)





Re: Clustering lucene's results

2004-10-07 Thread Dawid Weiss
No problem. Let people know if it worked for you -- I look forward to 
hearing your experiences (good or bad).

Dawid
William W wrote:
Thanks Dawid ! :)



RE: Search Lucene documents returns 0 hits

2004-10-07 Thread Lars Klevan
Use BooleanQuery to combine multiple Queries:

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("type", "stockSingle")), true, false);
query.add(new TermQuery(new Term("seqNo", "1000")), true, false);
...



Re: Clustering lucene's results

2004-10-07 Thread William W
Thanks Dawid ! :)

From: Dawid Weiss <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Subject: Re: Clustering lucene's results
Date: Thu, 07 Oct 2004 10:39:26 +0200
Hi William,
Ok, here is some demo code I've put together that shows how you can achieve 
clustering of Lucene's results. I hope this will get you started on your 
projects. If you have questions, please don't hesitate to ask -- cross 
posts to carrot2-developers would be a good idea too.

The code (plus the binaries so that you don't have to check out all of 
Carrot2 ;) are at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and Carrot. 
Play with the parameters, I used 100 as the number of search results to be 
clustered. Adjust it to your needs.

int start = 0;
int requiredHits = 100;
I hope the code will be self-explanatory.
Good luck,
Dawid
From the readme file:
An example of using Carrot2 components to cluster search
results from Lucene.
===
Prerequisites
--
You must have an index created with Lucene and containing
documents with the following fields: url, title, summary.
The Lucene demo works with exactly these fields -- I just indexed
all of Lucene's source code and documentation using the following line:
mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create -index 
index .

The index is now in 'index' folder.
Remember that the quality of snippets and titles heavily influences the
output of the clustering; in fact, the above example index of Lucene's API 
is
not too good because most queries will return nonsensical cluster labels
(see below).

Building Carrot2-Lucene demo

Basically you should have all of Carrot2 source code checked out and
issue the building command:
ant -Dcopy.dependencies=true
All of the required libraries and Carrot2 components will end up
in 'tmp/dist/deps-carrot2-lucene-example-jar' folder.
You can also spare yourself some time and download precompiled binaries
I've put at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
Now, once you have the compiled binaries, issue the following command
(all on one line of course):
java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
com.dawidweiss.carrot.lucene.Demo index query
The first argument is the location of the Lucene index created before.
The second argument is a query. In the output you should see the clusters,
with at most three documents from each cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
 :> Search Lucene Rc1 Dev API
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html
  Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev 
API)
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html
  org.apache.lucene.search (Lucene 1.5-rc1-dev API)
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html
  Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
  (and 19 more)

 :> Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
  Jakarta Lucene API
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
  Jakarta Lucene - Who We Are - Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
  Jakarta Lucene - Overview - Jakarta Lucene
  (and 12 more)
If you look at the source code of Demo.java, there are plenty of things
apt for customization -- the number of results from each cluster, the
number of displayed clusters (I would cut it to some reasonable number,
say 10 or 15 -- the further a cluster is from the "top", the less it is
likely to be important). Also keep in mind that some Carrot2 components
produce hierarchical clusters. This demonstration works with the "flat"
version of the Lingo algorithm, so you don't need to worry about it.

Hope this gets you started with using Carrot2 and Lucene.
Please let me know about any successes or failures.
Dawid


Re: Arabic analyzer

2004-10-07 Thread Nader Henein
I'd be happy to help anyone test this out, my Arabic is pretty good.
Nader
Andrzej Bialecki wrote:
Dawid Weiss wrote:

nothing to do with each other furthermore, Arabic uses phonetic 
indicators on each letter called diacritics that change the way you 
pronounce the word which in turn changes the words meaning so two 
word spelled exactly the same way with different diacritics will 
mean two separate things, 

Just to point out the fact: most slavic languages also use diacritic 
marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
'ogonek' mark). Some people argue that they can be stripped off the 
text upon indexing and that the queries usually disambiguate the 
context of the word.

Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish (http://www.getopt.org/stempel), 
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two 
or three arabic words ;-) ), but I can certainly help someone to get 
started with testing...



Re: Arabic analyzer

2004-10-07 Thread Andrzej Bialecki
Dawid Weiss wrote:

nothing to do with each other furthermore, Arabic uses phonetic 
indicators on each letter called diacritics that change the way you 
pronounce the word which in turn changes the words meaning so two words 
spelled exactly the same way with different diacritics will mean two 
separate things, 

Just to point out the fact: most slavic languages also use diacritic 
marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
'ogonek' mark). Some people argue that they can be stripped off the text 
upon indexing and that the queries usually disambiguate the context of 
the word.
Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish (http://www.getopt.org/stempel), 
wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two or 
three arabic words ;-) ), but I can certainly help someone to get 
started with testing...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
maybe inline?

http://www.w3.org/2001/XMLSchema-instance";>
 
  japan
 
 
  

フィールドサービスエンジニア

  



Indexing the above document using the HTMLParser demo and the
CJKAnalyzer, only the term "japan" is found in the content. This is not
correct, is it? Should I convert the entities by hand?
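If the demo's HTMLParser is not decoding numeric character references, one workaround is to decode them yourself before the text reaches the analyzer. A plain-Java sketch (handles BMP characters only; the class and method names are illustrative, not part of Lucene):

```java
// Sketch: decode numeric HTML character references (&#26085; decimal or
// &#x65E5; hex) into the characters they name, leaving other text as-is.
// Limitation: casts to char, so code points above U+FFFF are not handled.
public class EntityDecoder {
    public static String decode(String in) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '&' && i + 2 < in.length() && in.charAt(i + 1) == '#') {
                int semi = in.indexOf(';', i + 2);
                if (semi > 0) {
                    String num = in.substring(i + 2, semi);
                    try {
                        int code = (num.startsWith("x") || num.startsWith("X"))
                                ? Integer.parseInt(num.substring(1), 16)
                                : Integer.parseInt(num);
                        out.append((char) code);
                        i = semi + 1;
                        continue;
                    } catch (NumberFormatException e) {
                        // not a valid reference, fall through and copy literally
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Run the decoded string through the CJKAnalyzer instead of the raw source, and the Japanese terms should be indexed alongside "japan".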


Sorry for the mess I sent before.


-- 
The information contained in this communication and any attachments is confidential 
and may be privileged, and is for the sole use of the intended recipient(s). Any 
unauthorized review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please notify the sender immediately by replying to this message 
and destroy all copies of this message and any attachments. ASML is neither liable for 
the proper and complete transmission of the information contained in this 
communication, nor for any delay in its receipt.





Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland




I guess something went wrong; here it is again.

Daan Hoogland wrote:

> Daan Hoogland wrote:
>
>> Daan Hoogland wrote:
>>
>>> Hello,
>>>
>>> Does anyone do indexing of numeric entities for Japanese characters? I
>>> have (non-x)html containing those entities and need to index and search
>>> them.
>>
>> Can the CJKAnalyzer index a string like "●入社"? It
>> seems to be ignored completely when used with the demo. There was talk
>> on this list of fixes for the demo HTMLParser; do these address this
>> issue? When I look at the code it seems that the entities should have
>> been interpreted before indexing. What am I missing?
>>
>> Any comment please?
>> Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>
> Indexing the attached document using the HTMLParser demo and the
> CJKAnalyzer, only the term "japan" is found in the content. This is not
> correct, is it?
> Should I convert the entities by hand?
>
>> thanks,









Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>  
>
>>Hello,
>>
>>Does anyone do indexing of numeric entities for Japanese characters? I 
>>have (non-x)html containing those entities and need to index and search 
>>them.
>>
>>
>> 
>>
>>
>>
>Can the CJKAnalyzer index a string like "●入社"? It 
>seems to be ignored completely when used with the demo. There was talk 
>on this list of fixes for the demo HTMLParser; do these address this 
>issue? When I look at the code it seems that the entities should have 
>been interpreted before indexing. What am I missing?
>
>Any comment please?
>Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>  
>
Indexing the attached document using the HTMLParser demo and the 
CJKAnalyzer, only the term "japan" is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?

>
>thanks,
>
>
>  
>





Re: *term search

2004-10-07 Thread sergiu gordea
[EMAIL PROTECTED] wrote:
.. and here is the way to do it:
(See attached file: SUPPOR~1.RAR)
 

Hi all,

I got from Iouli the solution to enable prefix queries (*term). In fact,
you can find the solution in the Lucene source: a comment in
QueryParser.jj explains how to enable prefix queries.

I did so, but I found a lot of bugs. If you define WildTerm as
|  | ( [ "*", "?" ] ))* >
a lot of constructions will be validated, and you will get a lot of
errors. For example, "" and "+" are considered valid, * is considered
valid, and they generate TooManyBooleanClausesExceptions.

I'm not very good at creating regular expressions, but I successfully use
the following construction:

|  (<_TERM_CHAR> | ( [ "*", "?" ] ))* )
   | ( [ "*", "?" ] <_TERM_START_CHAR> (<_TERM_CHAR> | 
( [ "*", "?" ] ) )* ) >

Can anyone improve the construction and update the comment in
QueryParser.jj?

 Thanks a lot,
 Sergiu

 
From: Erik Hatcher <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Date: 08.09.2004 12:46
Subject: Re: *term search

On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:

> I want to discuss a little problem: lucene doesn't support *Term-like
> queries.

First of all, this is untrue.  WildcardQuery itself most definitely
supports wildcards at the beginning.

> I would like to use "*schreiben".

The dilemma you've encountered is that QueryParser prevents queries
that begin with a wildcard.

> So my question is if there is a simple solution for implementing the
> functionality mentioned above.
> Maybe subclassing one class and overwriting some methods will suffice.

It will require more than that in this case.  You will need to create a
custom parser that allows the grammar you'd like.  Feel free to use the
JavaCC source code to QueryParser as a basis of your customizations.

Erik
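Erik's first point -- that WildcardQuery itself accepts a leading wildcard -- can be sketched by bypassing QueryParser entirely (illustrative, Lucene 1.4-era API; the helper is an assumption, not code from the thread):

```java
// Sketch: construct the leading-wildcard query directly instead of
// going through QueryParser, which rejects terms like "*schreiben".
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcard {
    public static Query forSuffix(String field, String suffix) {
        // note: a leading wildcard forces a scan of the whole term
        // index for the field, so this can be slow on large indexes
        return new WildcardQuery(new Term(field, "*" + suffix));
    }
}
```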


 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: leakage in RAMDirectory ?

2004-10-07 Thread Rupinder Singh Mazara
Hi,

The major issue is that when indexing with FSDirectory straight to a directory there are
no missing entries, whereas when indexed via RAMDirectory I get missing entries.

Currently I am investigating which entries are missing. Since the application
is configured to shut down in the event of an exception, either all get indexed or none.

Rupinder
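The pattern under discussion, buffering documents in a RAMDirectory and then merging into an on-disk index, can be sketched roughly like this (Lucene 1.4-era API; the analyzer choice and index path are assumptions, not from the original posts):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RamThenFsIndexer {
    public static void buildAndMerge(String fsIndexPath) throws java.io.IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        // ... ramWriter.addDocument(doc) for each database record ...
        ramWriter.close();  // must close before merging, or still-buffered docs are lost

        IndexWriter fsWriter = new IndexWriter(fsIndexPath, new StandardAnalyzer(), true);
        fsWriter.addIndexes(new Directory[] { ramDir });  // merge RAM segments to disk
        fsWriter.close();
    }
}
```

A common cause of "missing entries" with this pattern is merging before the RAM-side writer has been closed, so its buffered documents never reach the segments copied to disk.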

>-Original Message-
>From: Daniel Naber [mailto:[EMAIL PROTECTED]
>Sent: 06 October 2004 20:22
>To: Lucene Users List
>Subject: Re: leakage in RAMDirectory ?
>
>
>On Tuesday 05 October 2004 20:31, Rupinder Singh Mazara wrote:
>
>>  ( there
>> are 18746 records in the table. )
>>  using a database result set , i loop over all the records ,
>>  creating a document object and indexing into ramDirectory and then onto
>> the fileSystem
>>
>>  when I open a IndexReader and output numDoc i get 18740,
>
>It seems even in this case some documents are lost. Do you maybe ignore 
>exceptions? Could you build a self-contained test case that shows the 
>problem? The interesting question is of course *which* documents are lost 
>and if the behaviour is reproducible. The test case will either help you 
>to fix the bug in your code, or it will help us fix the bug in Lucene, if 
>there is any.
>
>Regards
> Daniel
>
>-- 
>http://www.danielnaber.de
>
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering lucene's results

2004-10-07 Thread Dawid Weiss
Nope, because the example I showed is based on the "local interfaces" 
pipeline and output-xsltrenderer is for remote components only.

Anyway, I don't think it makes much sense -- if you need xslt badly, 
just modify the source code to output the results as XML and put an xslt 
filter on top of what it returns. Shouldn't be too hard.

Dawid
Albert Vila wrote:
That's great, thanks dawid.
Just a question, how can I modify your code in order to use the 
carrot2-output-xsltrenderer to output the clustering results in a html 
page?

Can you provide an example?
Thanks
Dawid Weiss wrote:
> [...]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering lucene's results

2004-10-07 Thread Albert Vila
That's great, thanks dawid.
Just a question, how can I modify your code in order to use the 
carrot2-output-xsltrenderer to output the clustering results in a html page?

Can you provide an example?
Thanks
Dawid Weiss wrote:
> [...]

--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente «La información con más beneficios»]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
Daan Hoogland wrote:

>Hello,
>
>Does anyone do indexing of numeric entities for Japanese characters? I 
>have (non-x)html containing those entities and need to index and search 
>them.
>
>
>  
>
Can the CJKAnalyzer index a string like "入社" supplied as numeric 
entities? It seems to be ignored completely when used with the demo. 
There was talk on this list of fixes for the demo HTMLParser; do these 
address this issue? When I look at the code it seems that the entities 
should have been interpreted before indexing. What am I missing?
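A minimal sketch of the interpretation step the poster expects, decoding numeric character references such as &#20837;&#31038; into 入社 before the text reaches the analyzer (the class and method names are hypothetical, not from any Lucene or demo API):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal (&#20837;) and hexadecimal (&#x5165;) references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));");

    // Replace numeric character references with the characters they name,
    // so the analyzer sees real CJK text instead of entity markup.
    public static String decodeNumericEntities(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Running the raw HTML through a pass like this before handing it to the analyzer would make the entity-encoded Japanese text searchable.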

Any comment please?
Or a pointer to a howto for dumm^H^H^H^H^H westerners?


thanks,


-- 
The information contained in this communication and any attachments is confidential 
and may be privileged, and is for the sole use of the intended recipient(s). Any 
unauthorized review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please notify the sender immediately by replying to this message 
and destroy all copies of this message and any attachments. ASML is neither liable for 
the proper and complete transmission of the information contained in this 
communication, nor for any delay in its receipt.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering lucene's results

2004-10-07 Thread Dawid Weiss
Hi William,
Ok, here is some demo code I've put together that shows how you can 
achieve clustering of Lucene's results. I hope this will get you started 
on your projects. If you have questions, please don't hesitate to ask -- 
cross posts to carrot2-developers would be a good idea too.

The code (plus the binaries so that you don't have to check out all of 
Carrot2 ;) are at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and 
Carrot. Play with the parameters, I used 100 as the number of search 
results to be clustered. Adjust it to your needs.

int start = 0;
int requiredHits = 100;
I hope the code will be self-explanatory.
Good luck,
Dawid
From the readme file:
An example of using Carrot2 components to clustering search
results from Lucene.
===
Prerequisities
--
You must have an index created with Lucene and containing
documents with the following fields: url, title, summary.
The Lucene demo works with exactly these fields -- I just indexed
all of Lucene's source code and documentation using the following line:
mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create 
-index index .

The index is now in 'index' folder.
Remember that the quality of snippets and titles heavily influences the
output of the clustering; in fact, the above example index of Lucene's 
API is
not too good because most queries will return nonsensical cluster labels
(see below).

Building Carrot2-Lucene demo

Basically you should have all of Carrot2 source code checked out and
issue the building command:
ant -Dcopy.dependencies=true
All of the required libraries and Carrot2 components will end up
in 'tmp/dist/deps-carrot2-lucene-example-jar' folder.
You can also spare yourself some time and download precompiled binaries
I've put at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
Now, once you have the compiled binaries, issue the following command
(all on one line of course):
java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
com.dawidweiss.carrot.lucene.Demo index query
The first argument is the location of the Lucene's index created before. 
The second argument
is a query. In the output you should have clusters and max. three 
documents from every cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
 :> Search Lucene Rc1 Dev API
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html
  Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev API)
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html
  org.apache.lucene.search (Lucene 1.5-rc1-dev API)
- 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html
  Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
  (and 19 more)

 :> Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
  Jakarta Lucene API
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
  Jakarta Lucene - Who We Are - Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
  Jakarta Lucene - Overview - Jakarta Lucene
  (and 12 more)
If you look at the source code of Demo.java, there are plenty of things
apt for customization -- number of results from each cluster, number of 
displayed
clusters (I would cut it to some reasonable number, say 10 or 15 -- the 
further a
cluster is from the "top", the less it is likely to be important). Also keep
in mind that some of Carrot2 components produce hierarchical clusters. 
This demonstration
works with "flat" version of Lingo algorithm, so you don't need to worry 
about it.

Hope this gets you started with using Carrot2 and Lucene.
Please let me know about any successes or failures.
Dawid
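The Lucene side of the hand-off that Demo.java performs might be sketched like this (Lucene 1.4-era API; how the results are fed into the Carrot2 pipeline is elided, and the exact structure of Demo.java is an assumption):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ClusteringHandoff {
    public static void collect(IndexSearcher searcher, Query query)
            throws java.io.IOException {
        Hits hits = searcher.search(query);
        int start = 0;
        int requiredHits = Math.min(100, hits.length());  // results to cluster
        for (int i = start; i < requiredHits; i++) {
            Document doc = hits.doc(i);
            // The demo index carries exactly these three fields:
            String url = doc.get("url");
            String title = doc.get("title");
            String summary = doc.get("summary");
            // ... hand (url, title, summary) to the Carrot2 pipeline ...
        }
    }
}
```

The quality point from the readme applies here: clustering operates only on the titles and summaries passed across, so poor snippets yield poor cluster labels.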
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Arabic analyzer

2004-10-07 Thread Nader Henein
There is a way of writing an Arabic stemmer; it's just not a weekend 
project. I've seen the translate/stem option as well, and even tried it 
with Lucene. We've implemented Lucene on our database: about a million 
records with 19 indexed fields each (some of which are CLOBs), and the 
free-text fields are in many cases Arabic. We do not provide stemming on 
those simply because I couldn't find a valid stemming or translation 
option that held up to proper testing. Some were OK, but after collecting 
data from user searches (averaging 5 searches per second), the Arabic 
stemming options could not manage user expectations, which is what it 
comes down to. Sometimes theory does not translate well to practice.

Nader Henein
Dawid Weiss wrote:

nothing to do with each other; furthermore, Arabic uses phonetic 
indicators on each letter called diacritics that change the way you 
pronounce the word, which in turn changes the word's meaning, so two 
words spelled exactly the same way with different diacritics will mean 
two separate things, 

Just to point out the fact: most slavic languages also use diacritic 
marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
'ogonek' mark). Some people argue that they can be stripped off the 
text upon indexing and that the queries usually disambiguate the 
context of the word.
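The stripping idea mentioned above can be illustrated with a small sketch (using the modern java.text.Normalizer API, which is an assumption beyond the 2004-era code under discussion):

```java
import java.text.Normalizer;

public class DiacriticStripper {
    // Decompose to NFD so each diacritic becomes a separate combining
    // mark, then drop all combining marks (Unicode category M).
    // Note: this handles combining diacritics only; letters like the
    // Polish "l with stroke" do not decompose and pass through unchanged.
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applying such a filter at both index and query time implements the strip-on-indexing approach, relying on the query context to disambiguate words that collapse to the same form.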

It is just a digression. Now back to the arabic stemmer -- there has 
to be a way of doing it. I know Vivisimo has clustering options for 
arabic. They must be using a stemmer (and an English translation 
dictionary), although it might be a commercial one. Take a look:

http://vivisimo.com/search?v:file=cnnarabic
D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Indexing XML using lucene_xml_indexing

2004-10-07 Thread Sumathi

  Hello,

I have tried indexing XML files in a standalone application. Has anyone 
tried indexing XML with lucene_xml_indexing from isogen.com in a web 
application, similar to the 'luceneweb' demo? It would be great if I 
could get a web demo for indexing XML.

  Expecting some guidance!
  Thanks.