Hebrew support
Hello, do you know anything about Hebrew support in Lucene? Thanks in advance, Alex Kiselevsky Speech Technology Tel: 972-9-776-43-46 R&D, Amdocs - Israel Mobile: 972-53-63 50 38 mailto:[EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you.
Re: online and offline Directory
Ernesto De Santis writes: Hi Aviran, thanks for the response. I forgot important information for you to understand my issue. My process does something like this: the index has contents from different sources, identified by a special field 'source'. So the index has documents with source: S1 or source: S2 ... etc. When I reindex the source S1, I first delete all documents with source: S1, otherwise the index would contain repeated content. Then I add the new index result. In the middle of the process the IndexSearcher uses an incomplete index. Is it possible to do it like a database transaction? It's not like a database transaction, but any index reader/searcher that was opened before the changes won't see them until it is closed and reopened. AFAIK that also applies to deletions, though I never checked that. So you have two options: a) use a second index for indexing, move the indexes after the indexing is done, and make sure index readers/searchers are closed and reopened after the move. b) use one index and make sure that you do not open any index reader/searcher during the update. Searches may only use already opened readers/searchers. I guess it depends on index size, update frequency and so on, which scenario is easier to handle. Given that the index isn't too large and the update frequency is rather low, I'd use a second index. But you'll need to copy that index and should consider the time and disk IO needed for that. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
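Option (a) above — build into a second directory, then move it into place once all readers are closed — can be sketched with plain java.io.File operations. This is a minimal sketch of the approach, not Lucene API; the helper and directory names are illustrative:

```java
import java.io.File;

public class IndexSwapper {
    /**
     * Replace the live index directory with a freshly built one.
     * The caller must ensure every IndexReader/IndexSearcher on the
     * live directory is closed before calling this, and must reopen
     * searchers afterwards. The old index is parked in "trash" rather
     * than deleted, so a failed rename leaves something recoverable.
     */
    public static boolean swap(File live, File fresh, File trash) {
        // park the current index first (rename is atomic on one filesystem)
        if (live.exists() && !live.renameTo(trash)) return false;
        // promote the freshly built index to the live location
        return fresh.renameTo(live);
    }
}
```

Note that File.renameTo only works reliably when both paths are on the same filesystem; across filesystems you would fall back to a copy.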
Re: Sorting Info
On Sep 27, 2004, at 6:32 PM, [EMAIL PROTECTED] wrote: I'm interested in doing sorting in Lucene. Is there a FAQ or an article that will show me how to do this? I already have my indexing and searching working. From IndexSearcher, use the search(Query, Sort) method (or other variants that take a Sort): http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Sort) Following the Javadocs for Sort should be (hopefully) self-explanatory. Sorting is a pretty new feature, and is only described in the Javadocs and this e-mail list as far as I know. I wrote a pretty extensive section on sorting in Lucene in Action - http://www.manning.com/hatcher2 - which is primarily in Otis' hands right now (*nudge nudge*) and has been technical and copy edited. In other words, LIA should be available in print 6-8 weeks (e-book in PDF format available much sooner) after Otis is done. Erik
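The search(Query, Sort) call Erik refers to looks roughly like the sketch below. The field names ("contents", "date") are illustrative assumptions, not from the thread; the Sort/SortField classes are the 1.4-era Lucene API:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class SortedSearch {
    public static Hits search(IndexSearcher searcher) throws java.io.IOException {
        Query query = new TermQuery(new Term("contents", "lucene"));
        // sort primarily by an untokenized "date" field, then by relevance
        Sort sort = new Sort(new SortField[] {
            new SortField("date", SortField.STRING),
            SortField.FIELD_SCORE
        });
        return searcher.search(query, sort);
    }
}
```

Fields used for sorting must be indexed and untokenized so each document contributes a single term.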
Using lucene in Tomcat
Hi all, I have implemented Lucene search for my documents and db successfully. Now my problem is, the index I created is indexing to my local disk; I want the index to be created with reference to my server. Right now I index C:/tomcat/webapps/jetspeed/document, but I want to index wrt /jetspped/document. Let me know if someone can help.
Re: Using lucene in Tomcat
mahaveer jain wrote: Hi all, I have implemented Lucene search for my documents and db successfully. Now my problem is, the index I created is indexing to my local disk; I want the index to be created with reference to my server. Right now I index C:/tomcat/webapps/jetspeed/document, but I want to index wrt /jetspped/document. Maybe you can create a small method to convert /jetspped/document to a File object, then call its toString() method and pass the result to the indexer.
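A small sketch of that suggestion. Inside a servlet the web app root would typically come from ServletContext.getRealPath("/"); here it is passed in so the logic is container-independent. The helper and names are illustrative, not from the thread:

```java
import java.io.File;

public class IndexLocator {
    /**
     * Resolve an app-relative path such as "jetspped/document" against
     * the web application's root directory, yielding an absolute path
     * string that can be handed to the indexer.
     */
    public static String resolve(File webappRoot, String relativePath) {
        return new File(webappRoot, relativePath).getPath();
    }
}
```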
Accent filter
Hi, I am certainly not the first, and probably not the last, to have had problems with accented characters in my index. But unfortunately I couldn't find anything in either Lucene or the lucene-sandbox to solve the problem. So I wrote an accent filter and thought that I might as well share it with you guys :) -- Bo Gundersen DBA/Software Developer M.Sc.CS. www.atira.dk

package dk.atira.search;

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * This filter converts accented characters to their non-accented versions.
 * It also strips unwanted characters from the tokens, meaning anything
 * but A-Z, a-z, 0-9, ÆÅØæøå and '-'.
 * The valid characters can be changed by adding them to the string validCharsStr.
 *
 * Created by Bo Gundersen at Sep 28, 2004 12:39:04 PM
 *
 * @author Bo Gundersen ([EMAIL PROTECTED])
 */
public class AccentFilter extends TokenFilter {
    private static final Collection validChars = new HashSet();
    private static final String validCharsStr =
        "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5"
        + "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8\u00C5"
        + "0123456789"
        + "-";
    static {
        for (int i = 0; i < validCharsStr.length(); i++)
            validChars.add(new Character(validCharsStr.charAt(i)));
    }

    private static final Map accents = new HashMap();
    static {
        accents.put(new Character('\u00C0'), "A"); accents.put(new Character('\u00C1'), "A");
        accents.put(new Character('\u00C2'), "A"); accents.put(new Character('\u00C3'), "A");
        accents.put(new Character('\u00E0'), "a"); accents.put(new Character('\u00E1'), "a");
        accents.put(new Character('\u00E2'), "a"); accents.put(new Character('\u00E3'), "a");
        accents.put(new Character('\u00E4'), "a");
        accents.put(new Character('\u00C8'), "E"); accents.put(new Character('\u00C9'), "E");
        accents.put(new Character('\u00CA'), "E"); accents.put(new Character('\u00CB'), "E");
        accents.put(new Character('\u00E8'), "e"); accents.put(new Character('\u00E9'), "e");
        accents.put(new Character('\u00EA'), "e"); accents.put(new Character('\u00EB'), "e");
        accents.put(new Character('\u00CC'), "I"); accents.put(new Character('\u00CD'), "I");
        accents.put(new Character('\u00CE'), "I"); accents.put(new Character('\u00CF'), "I");
        accents.put(new Character('\u00EC'), "i"); accents.put(new Character('\u00ED'), "i");
        accents.put(new Character('\u00EE'), "i"); accents.put(new Character('\u00EF'), "i");
        accents.put(new Character('\u00D1'), "N"); accents.put(new Character('\u00F1'), "n");
        accents.put(new Character('\u00D2'), "O"); accents.put(new Character('\u00D3'), "O");
        accents.put(new Character('\u00D4'), "O"); accents.put(new Character('\u00D5'), "O");
        accents.put(new Character('\u00D6'), "O");
        accents.put(new Character('\u00F2'), "o"); accents.put(new Character('\u00F3'), "o");
        accents.put(new Character('\u00F4'), "o"); accents.put(new Character('\u00F5'), "o");
        accents.put(new Character('\u00F6'), "o");
        accents.put(new Character('\u00D9'), "U"); accents.put(new Character('\u00DA'), "U");
        accents.put(new Character('\u00DB'), "U"); accents.put(new Character('\u00DC'), "U");
        accents.put(new Character('\u00F9'), "u"); accents.put(new Character('\u00FA'), "u");
        accents.put(new Character('\u00FB'), "u"); accents.put(new Character('\u00FC'), "u");
        accents.put(new Character('\u00DD'), "Y"); accents.put(new Character('\u00FD'), "y");
        accents.put(new Character('\u00FF'), "y");
        accents.put(new Character('\u00C6'), "AE"); accents.put(new Character('\u00E6'), "ae");
        accents.put(new Character('\u00D8'), "OE"); accents.put(new Character('\u00F8'), "oe");
        accents.put(new Character('\u00C5'), "AA"); accents.put(new Character('\u00E5'), "aa");
    }
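The posted fragment ends before the class's constructor and next() method. The per-token replacement it implies can be factored into a pure helper like the sketch below (my reconstruction, not Bo's code; only a subset of the map is shown). Inside AccentFilter.next() one would return new Token(normalize(t.termText()), t.startOffset(), t.endOffset()) for each non-null token from the wrapped stream:

```java
import java.util.HashMap;
import java.util.Map;

public class AccentNormalizer {
    // A small subset of the accents map from the post; the full filter
    // maps the complete Latin-1 accented range the same way.
    private static final Map ACCENTS = new HashMap();
    static {
        ACCENTS.put(new Character('\u00E9'), "e");  // é -> e
        ACCENTS.put(new Character('\u00E8'), "e");  // è -> e
        ACCENTS.put(new Character('\u00F6'), "o");  // ö -> o
        ACCENTS.put(new Character('\u00C5'), "AA"); // Å -> AA
    }

    /** Replace each mapped accented character; pass everything else through. */
    public static String normalize(String term) {
        StringBuffer sb = new StringBuffer(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String mapped = (String) ACCENTS.get(new Character(c));
            if (mapped != null) sb.append(mapped);
            else sb.append(c);
        }
        return sb.toString();
    }
}
```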
Re: Accent filter
Loads of very well thought out ISO-8859 + French/Irish filters available here too (I think they are all GPL'd): http://www.nongnu.org/sdx/ Best Regards, John

Bo Gundersen wrote: Hi, I am certainly not the first, and probably not the last, to have had problems with accented characters in my index. But unfortunately I couldn't find anything in either Lucene or the lucene-sandbox to solve the problem. So I wrote an accent filter and thought that I might as well share it with you guys :)

[...]
        accents.put(new Character('\u00C6'), "AE"); accents.put(new Character('\u00E6'), "ae");
        accents.put(new Character('\u00D8'), "OE"); accents.put(new Character('\u00F8'), "oe");
        accents.put(new Character('\u00C5'), "AA"); accents.put(new Character('\u00E5'), "aa");
    }
RE: Hebrew support
As far as I know there is no Analyzer for Hebrew. Aviran -----Original Message----- From: Alex Kiselevski [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 28, 2004 3:12 AM To: [EMAIL PROTECTED] Subject: Hebrew support Hello, do you know anything about Hebrew support in Lucene? Thanks in advance, Alex Kiselevsky
PorterStemAnalyzer versus SnowballAnalyser?
I use the PorterStemAnalyzer provided by Otis (see http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2). In http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/snowball/ there are other implementations of the Porter algorithm. At first glance they look somehow very different. What is the difference/advantage/disadvantage of using Otis' code versus the one from the lucene-sandbox Snowball contribution? Thanks in advance, J.
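For comparison, instantiating the sandbox version looks roughly like the sketch below. The stemmer names and stop-word choice are assumptions based on the Snowball contribution's conventions, where "Porter" selects Snowball's implementation of the original Porter algorithm and "English" selects the revised "porter2" stemmer, which produces different stems for some words:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class StemmingChoice {
    // comparable to a hand-rolled PorterStemAnalyzer
    public static Analyzer porter() {
        return new SnowballAnalyzer("Porter", StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    // the newer, revised English stemmer
    public static Analyzer english() {
        return new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}
```

Whichever analyzer is chosen, the same one must be used at both index and query time, so switching implementations means reindexing.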
Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?
Christian Rodriguez wrote: Now the problem I have is that I don't have a way to force a flush of the IndexWriter without closing it, and I need to do that before committing a transaction or I would get random errors. Shouldn't that function be public, in case the user wants to force a flush at some point other than when the IndexWriter is closed? If not, I am forced to create a new IndexWriter and close it EVERY TIME I commit a transaction (which in my application is very often). Opening and closing IndexWriters should be a lightweight operation. Have you tried this and found it to be too slow? A flush() would have to do just about the same work. Doug
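Doug's suggestion can be sketched as a per-commit open/close cycle: close() flushes the buffered segments, so a short-lived writer per transaction behaves like an explicit flush. The helper, path handling, and error policy here are illustrative assumptions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class TransactionalAdd {
    /** Add a batch of documents and flush them by closing the writer. */
    public static void commit(String indexPath, Document[] docs) throws java.io.IOException {
        // create=false: append to the existing index rather than replacing it
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try {
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]);
            }
        } finally {
            writer.close(); // flushes buffered in-memory segments to disk
        }
    }
}
```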
How to pull document scoring values
Hi, I'm trying to learn the scoring mechanism of Lucene. I want to fetch each parameter value individually, as they are collectively dumped out by Explanation. I've managed to pull out the tf and idf values using DefaultSimilarity and FilterIndexReader, but I am not sure where to get the fieldNorm and queryNorm. Also, is there any reference about how normalisation has been implemented? Any idea? Thanks, Zia -- Zia Syed [EMAIL PROTECTED] Smartweb Research Center, Robert Gordon University
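One way to get at all the factors, including fieldNorm and queryNorm, is Searcher.explain(Query, int), which returns the nested Explanation tree per document. A minimal sketch (the class and loop are illustrative; explain() is the 1.4-era Lucene API):

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ScoreInspector {
    /** Print the full scoring breakdown for every hit of a query. */
    public static void dump(IndexSearcher searcher, Query query) throws java.io.IOException {
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Explanation exp = searcher.explain(query, hits.id(i));
            // the nested tree contains tf, idf, fieldNorm and queryNorm factors
            System.out.println(exp.toString());
        }
    }
}
```

To pull individual values programmatically rather than as text, walk the Explanation tree via its child nodes instead of printing toString().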
Sorting on a long string
I am new to Lucene, and trying to perform a sorted query on a list of people's names. Lucene seems unable to properly sort on the name field of my indexed documents. If I sort by the other (shorter) fields, it seems to work fine. The name sort seems to be close, almost as if the last few iterations through the sort loop are not being done. The records are obviously not in the normally random order, but not fully sorted either. I have tried different ways of sorting, including a SortField array/object with the field cast as a string. The index I am sorting has about 1.2 million documents. Are there known limitations in the sorting functionality that I am running into? I can provide more details if needed. Thanks for any help, -Pete
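A common cause of almost-sorted results like this is sorting on a tokenized field: the sort field must be indexed, untokenized, and present in every document so each document contributes exactly one term, otherwise multi-word names sort by an arbitrary one of their tokens. A sketch of an explicit string sort on the "name" field (the surrounding class is illustrative):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class NameSort {
    /** Run the query sorted by an untokenized "name" keyword field. */
    public static Hits byName(IndexSearcher searcher, Query query) throws java.io.IOException {
        Sort sort = new Sort(new SortField("name", SortField.STRING));
        return searcher.search(query, sort);
    }
}
```

If "name" is currently added as a tokenized text field, a second untokenized keyword copy of it can serve as the sort field.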
re-indexing
I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. The environment: the index is searched by a web app running from the Orion App Server. This code runs fine and reindexes fine prior to any searches. After the first search against the index is completed, the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete. 1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with. 2. How can I make this work? Thanks, Jason

The code below produces the following output when run AFTER an initial search against the index has been completed:

IndexerDrug-disableLuceneLocks: true
Directory: [EMAIL PROTECTED]:\lucene_index_drug
Deleted [0]: true
... (output from the for loop confirming deleted items)
Deleted [367]: true
Hit uncaught exception java.io.IOException
java.io.IOException: Cannot delete _ba.cfs
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
 at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:193)
 at IndexerDrug.index(IndexerDrug.java:103)
 at IndexerDrug.main(IndexerDrug.java:246)
Exception in thread main

=-=-=-=-=-=-=-=-=-=-=-=-=-
My indexing code (some items have been deleted to protect the innocent)
=-=-=-=-=-=-=-=-=-=-=-=-=-

import java.io.*;
import java.sql.*;
import javax.naming.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class IndexerDrug {
    private String sql = "my query code";
    public static String[] stopWords =
        org.apache.lucene.analysis.standard.StandardAnalyzer.STOP_WORDS;
    public File indexDir = new File("C:\\lucene_index_drug\\");
    public Directory fsDir;

    public void index() throws IOException {
        try {
            // Delete old index
            fsDir = FSDirectory.getDirectory(indexDir, false);
            if (indexDir.list().length > 0) {
                IndexReader reader = IndexReader.open(fsDir);
                System.out.println("Directory: " + reader.directory().toString());
                reader.unlock(fsDir);
                for (int i = 0; i < reader.maxDoc() - 1; i++) {
                    reader.delete(i);
                    System.out.println("Deleted [" + i + "]: " + reader.isDeleted(i));
                }
                reader.close();
            }
        } catch (Exception ex) {
            System.out.println("Error while deleting index: " + ex.getMessage());
        }
        // Write new index
        Analyzer analyzer = new StandardAnalyzer(stopWords);
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true); // fails here *
        writer.mergeFactor = 1000;
        indexDirectory(writer);
        writer.setUseCompoundFile(true);
        writer.optimize();
        writer.close();
    }

    private void indexDirectory(IndexWriter writer) throws IOException {
        Connection c = null;
        ResultSet rs = null;
        Statement stmt = null;
        long startTime = System.currentTimeMillis();
        System.out.println("Start Time: "
            + new java.sql.Timestamp(System.currentTimeMillis()).toString());
        try {
            Class.forName(/* driver class removed */);
            c = DriverManager.getConnection(/* url, user, password removed */);
            stmt = c.createStatement();
            rs = stmt.executeQuery(this.sql);
            System.out.println("Query Completed: "
                + new java.sql.Timestamp(System.currentTimeMillis()).toString());
            int total = 0;
            String resourceID = "";
            String resourceName = "";
            String summary = "";
            String shortSummary = "";
            String hciPick = "";
            String url = "";
            String format = "";
            String orgType = "";
            String holdingType = "";
            String indexText = "";
            String c_indexText = "";
            boolean ready = false;
            Document doc = null;
            String oldResourceID = null;
            String newResourceID = null;
            while (rs.next()) {
                newResourceID = rs.getString("resourceID") != null ? rs.getString("resourceID") : "";
                resourceID = newResourceID;
                resourceName = rs.getString("resourceName") != null ? rs.getString("resourceName") : "";
                summary = rs.getString("summary") != null ? rs.getString("summary") : "";
                if (summary.length() > 300) {
                    shortSummary = summary.substring(0, 300) + "...";
                } else {
                    shortSummary = summary;
                }
                hciPick = rs.getString("hciPick") != null ? rs.getString("hciPick") : "";
                url = rs.getString("url") != null ? rs.getString("url") : "";
                format = rs.getString("format") != null ? rs.getString("format") : "";
                orgType = rs.getString("orgType") != null ? rs.getString("orgType") : "";
                holdingType = rs.getString("holdingType") != null ? rs.getString("holdingType") : "";
                indexText = rs.getString("indexText") != null ? rs.getString("indexText") : "";
                if
Re: re-indexing
Jason wrote: I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. The environment: the index is searched by a web app running from the Orion App Server. This code runs fine and reindexes fine prior to any searches. After the first search against the index is completed, the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete.

We have several apps running like this, only on Tomcat and JBoss, with no problems...

1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with.

Never tried the IndexReader.delete() method; we generally build the new index in a temporary directory, and when the index is done we delete the current online directory (using java.io.File methods) and then rename the temp directory to online.

2. How can I make this work?

This may just be silly, but do you remember to close your org.apache.lucene.search.IndexSearcher when you are done with your search? -- Bo Gundersen DBA/Software Developer M.Sc.CS. www.atira.dk
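Bo's build-in-temp, delete-online, rename-temp approach can be sketched with plain java.io.File operations. This is a minimal sketch of that flow, not Lucene API; the helper and names are illustrative, and searchers on the old index must be closed before publishing:

```java
import java.io.File;

public class IndexPublisher {
    /** Recursively delete a directory tree, returning true on success. */
    static boolean deleteTree(File dir) {
        File[] kids = dir.listFiles();
        for (int i = 0; kids != null && i < kids.length; i++) {
            if (kids[i].isDirectory()) deleteTree(kids[i]);
            else kids[i].delete();
        }
        return dir.delete();
    }

    /**
     * Publish a freshly built index: remove the online copy and move
     * the temp build into its place with a single rename.
     */
    public static boolean publish(File tempBuild, File online) {
        if (online.exists() && !deleteTree(online)) return false;
        return tempBuild.renameTo(online);
    }
}
```

Because the rename is a single filesystem operation, the window in which no index exists is much smaller than rebuilding in place.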