Lock obtain timed out

2003-12-16 Thread Hohwiller, Joerg
Hi there,

I have not yet got any response about my problem.

While debugging into the depths of Lucene (really hard to read that deep inside) I
discovered that it is possible to disable the locks using a system property.

When I start my application with -DdisableLuceneLocks=true, 
I do not get the error anymore.

I just wonder if this is legal and won't cause other trouble?
As far as I could understand the source, proper thread
synchronization is done using locks on Java objects, and
the index-store locks seem to be required only if multiple
Lucene instances (in different VMs) work on the same index.
In my situation there is only one Java VM running and only one
Lucene instance working on one index.

Am I safe disabling the locking?
Can anybody tell me where to get documentation about the locking
strategy (I still would like to know why I have that problem)?

Or does anybody know where to get an official example of how to
handle concurrent index modification and searches?

Thank you so much
  Jörg

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lock obtain timed out

2003-12-16 Thread MOYSE Gilles (Cetelem)
Hi.

I obtained this exception when I had more than one thread trying to create
an IndexWriter.
I solved it by placing the code that uses the IndexWriter in a synchronized
method.
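
For illustration, a minimal sketch of that approach (hedged: the class is hypothetical, assuming the Lucene 1.3-era IndexWriter API):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class Indexer {

    private final String indexPath;

    public Indexer(String indexPath) {
        this.indexPath = indexPath;
    }

    // Only one thread at a time can enter, so two threads can never race
    // to obtain the index write lock ("Lock obtain timed out").
    public synchronized void addDocument(Document doc) throws IOException {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try {
            writer.addDocument(doc);
        } finally {
            writer.close(); // releases the write lock
        }
    }
}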

Hope it will help,

Gilles.

-Original Message-
From: Hohwiller, Joerg [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 11:37 AM
To: [EMAIL PROTECTED]
Subject: Lock obtain timed out


Hi there,

I have not yet got any response about my problem.

While debugging into the depths of Lucene (really hard to read that deep inside) I
discovered that it is possible to disable the locks using a system property.

When I start my application with -DdisableLuceneLocks=true, 
I do not get the error anymore.

I just wonder if this is legal and won't cause other trouble?
As far as I could understand the source, proper thread
synchronization is done using locks on Java objects, and
the index-store locks seem to be required only if multiple
Lucene instances (in different VMs) work on the same index.
In my situation there is only one Java VM running and only one
Lucene instance working on one index.

Am I safe disabling the locking?
Can anybody tell me where to get documentation about the locking
strategy (I still would like to know why I have that problem)?

Or does anybody know where to get an official example of how to
handle concurrent index modification and searches?

Thank you so much
  Jörg

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Disabling modifiers?

2003-12-16 Thread Iain Young
 Treating them as two separate words when quoted is indicative of your 
 analyzer not being sufficient for your domain.  What Analyzer are you 
 using?  Do you have knowledge of what it is tokenizing text into?

I have created a custom analyzer (CobolAnalyzer) which contains some custom
stop words for the language, but it's using the StandardTokenizer and
StandardFilters. I'll have a look and see if I can see what it's actually
tokenizing the text into...

 Any ideas, or am I going to have to try and write my own query parser?

Well, if I manage to get something working, I'll let you know :-)

Thanks,
Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
Thanks Gregor, I'll give it a try...

Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*

-Original Message-
From: Gregor Heinrich [mailto:[EMAIL PROTECTED]
Sent: 15 December 2003 18:32
To: 'Lucene Users List'
Subject: RE: Disabling modifiers?


If you don't want to fiddle with the JavaCC source of QueryParser.jj, you
could work with a regular expression pass that runs in front of the actual query
parser. I just did something similar because I feed Lucene's query strings
into a latent semantic analysis algorithm and remove words with + and ?
wildcards and boosting modifiers, as well as NOT and - clauses and groupings.
Such as:

/**
 *  exclude words that have these modifiers
 */
public final String excludeWildcards = "\\w+\\+|\\w+\\?";
/**
 *  remove these operators
 */
public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";
/**
 *  remove these modifiers
 */
public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";
/**
 *  exclude phrases that have these modifiers
 */
public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\"";

/**
 * remove any groupings
 */
public final String removeGrouping = "[\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and
can use and re-use the compiled patterns.

excludeWildcardsPattern = Pattern.compile(excludeWildcards);

lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20-minute approach does not recognise
nested parentheses with NOT or -, i.e., the term "NOT ((a OR b) AND (c OR d))"
will result in the removal of "NOT ((a OR b)", and "c d" will still be in the
output query.
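
As a usage illustration, a small self-contained sketch built from the patterns above (the PreParser class name and the sample query are made up):

import java.util.regex.Pattern;

public class PreParser {

    // Compile once, re-use the compiled patterns (as described above).
    private static final Pattern EXCLUDE_WILDCARDS =
            Pattern.compile("\\w+\\+|\\w+\\?");
    private static final Pattern REMOVE_MODIFIERS =
            Pattern.compile("~[0-9\\.]*|~|\\^[0-9\\.]*|\\*");

    // Strip unwanted constructs before handing the query on.
    public static String clean(String q) {
        q = EXCLUDE_WILDCARDS.matcher(q).replaceAll("");
        q = REMOVE_MODIFIERS.matcher(q).replaceAll("");
        return q.trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("title^2 lucen* quick?")); // prints "title lucen"
    }
}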

Best regards,

Gregor

-Original Message-
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 6:13 PM
To: Lucene mailing list (E-mail)
Subject: Disabling modifiers?


A quick question. Is there any way to disable the - and + modifiers in the
QueryParser? I'm trying to use Lucene to provide indexing of COBOL source
code, and allow me to highlight matches when the code is displayed. In COBOL
you can have variable names such as DISP-NAME and WS-DATE-1 for example.
Unfortunately the query parser interprets the - signs as modifiers and so
the query does not do what is required.

I've had a bit of success by putting quotes around the offending names (as
suggested on this list), but the results are still less than satisfactory:
it removes the NOT from the query, but still treats DISP and NAME as two
separate words rather than one word, and so the results are not quite
correct.

Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lock obtain timed out

2003-12-16 Thread Morus Walter
Hohwiller, Joerg writes:
 
 Am I safe disabling the locking?

No.

 Can anybody tell me where to get documentation about the locking
 strategy (I still would like to know why I have that problem)?
 
I guess -- but given your input I really have to guess; the source you
wanted to attach didn't make it to the list -- your problem is that
you cannot have a writing (deleting) IndexReader and an IndexWriter open
at the same time.
There can only be one instance that writes to an index at a time.

Disabling locking disables the checks, but then you have to take care of that
yourself. So in practice disabling locking is useful only for read-only access
to static indices.
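
For the read-only case, a minimal sketch (hedged: the index path is made up, and this assumes the disableLuceneLocks system property discussed in this thread plus the Lucene 1.3-era search API):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ReadOnlySearch {
    public static void main(String[] args) throws Exception {
        // Only safe because nothing will ever write to this static index.
        System.setProperty("disableLuceneLocks", "true");

        IndexSearcher searcher = new IndexSearcher("/path/to/static/index");
        Hits hits = searcher.search(new TermQuery(new Term("text", "lucene")));
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}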

Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lock obtain timed out

2003-12-16 Thread David Townsend
Does this mean that if you can ensure that an IndexWriter and an IndexReader (doing
deletion) are never open at the same time (e.g. using a database instead of Lucene's
locking), there will be no problem with removing locking?   If you do not use an
IndexReader to do deletion, can you open and close it at any time?

David
-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: 16 December 2003 11:08
To: Lucene Users List
Subject: Re: Lock obtain timed out


Hohwiller, Joerg writes:
 
 Am I safe disabling the locking?

No.

 Can anybody tell me where to get documentation about the locking
 strategy (I still would like to know why I have that problem)?
 
I guess -- but given your input I really have to guess; the source you
wanted to attach didn't make it to the list -- your problem is that
you cannot have a writing (deleting) IndexReader and an IndexWriter open
at the same time.
There can only be one instance that writes to an index at a time.

Disabling locking disables the checks, but then you have to take care of that
yourself. So in practice disabling locking is useful only for read-only access
to static indices.

Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
I think it is a problem with the indexing. I've found another example...

WS-CA-PP00-PROCESS-YYMM

I've looked at the index, and it has been tokenized into 3 words...

WS
CA-PP00-PROCESS
YYMM

Looks as though I might have to use a custom tokenizer as well as an
analyzer then, but any ideas as to why the standard tokenizer would have
split the variable up like this (i.e. why didn't it split the middle bit,
only the word off either end)? The only thing I can think of is that there
are several other variables in the source beginning with WS- or ending with
-YYMM, so could the tokenizer have seen this and be doing something clever
with them?

Thanks,
Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Disabling modifiers?

2003-12-16 Thread Erik Hatcher
On Tuesday, December 16, 2003, at 05:46  AM, Iain Young wrote:
Treating them as two separate words when quoted is indicative of your
analyzer not being sufficient for your domain.  What Analyzer are you
using?  Do you have knowledge of what it is tokenizing text into?
I have created a custom analyzer (CobolAnalyzer) which contains some 
custom
stop words for the language, but it's using the StandardTokenizer and
StandardFilters. I'll have a look and see if I can see what it's 
actually
tokenizing the text into...
Look at my article at java.net and try out the AnalyzerDemo code using 
some sample text and your custom analyzer:

	http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

One of the things I plan to do with an enhanced Lucene demo to ship 
with Lucene's binary distributions is integrate this type of 
"analyze the analyzer" feature.  It is the root of a lot of questions 
about Lucene.  You can really only search for what you index, and you 
only index what the Analyzer creates, so understanding it is key to a 
lot.
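
In that spirit, a token-dump sketch along the lines of the AnalyzerDemo (hedged: assumes the Lucene 1.3-era TokenStream API; swap in the CobolAnalyzer for StandardAnalyzer):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDump {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // or your CobolAnalyzer
        TokenStream stream = analyzer.tokenStream("text",
                new StringReader("WS-CA-PP00-PROCESS-YYMM"));
        // Print each token the analyzer produces, with its offsets.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText() + " [" + t.startOffset() + "-" + t.endOffset() + "]");
        }
    }
}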

And yes, if you are using StandardTokenizer, you are probably not 
tokenizing COBOL quite like you expect.  Is there a COBOL parser you 
could tap into that could give you the tokens you want?

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Disabling modifiers?

2003-12-16 Thread Iain Young
<grin> Yes, we have got one or two parsers floating around somewhere or other
;)

Unfortunately, I'm unlikely to be able to tap into these before the next version
of the product I'm working on (can't say too much because of the NDA etc),
and so for now I'm having to make do with a basic text search. I'll give the
whitespace analyzer a try and see if I get any better results.

Thanks,
Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: 16 December 2003 12:31
To: Lucene Users List
Subject: Re: Disabling modifiers?


On Tuesday, December 16, 2003, at 07:28  AM, Erik Hatcher wrote:
 And yes, if you are using StandardTokenizer, you are probably not 
 tokenizing COBOL quite like you expect.  Is there a COBOL parser you 
 could tap into that could give you the tokens you want?

Ummm. nevermind that last question... I just realized where you 
work!  :)

So, my recommendation would be to tap into some parser for the COBOL 
language that you have handy and have it feed your Analyzer 
appropriately.  Or, use something very very simple like the 
WhitespaceAnalyzer as a first try.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lock obtain timed out

2003-12-16 Thread Hohwiller, Joerg
Hi there,

thanks for your responses, guys!

From the answers I take it that I must not have an IndexWriter
and an IndexReader open at the same time if both want to modify
the index - even sequentially.

What I have is the following:

1 Thread works off events such as a resource (file or folder)
  being added/removed/deleted/etc. All index modifications are
  synchronized against a write-lock object.

1 Thread does index switching, which means that it synchronizes on
  the write lock and then closes the modifying index-reader and index-writer.
  Next it copies that index completely and reopens the index-reader and
  -writer on the copied index.
  Then it syncs on the read lock, closes the index searcher and
  reopens it on the index that was previously copied.

N Threads that perform search requests but sync against the read-lock.

Since I can guarantee that there is only one thread working off the
change events sequentially, the index-writer and index-reader will never
do any concurrent modifications.

This time I will attach my source as text in this mail to be sure.
For those who do not know Avalon/Excalibur: it is a framework that
will be the only one calling the configure/start/stop methods.
No one can access the instance until it is properly created, configured
and started, so synchronization is not necessary in the start method.

Thanks again
  Jörg

/**
 * This is the implementation of the ISearchManager using Lucene as the
 * underlying search engine.<br/>
 * Everything would be so simple if Lucene was thread-safe for concurrently
 * modifying and searching on the same index, but it is not.<br/>
 * My first idea was to have a single index that is continuously modified and a
 * background thread that continuously closes and reopens the index searcher.
 * This should bring the most recent search results but it did not work properly
 * with Lucene.<br/>
 * My strategy now is to have multiple indexes and to cycle over all of them
 * in a background thread, copying the most recent one to the next (least recent)
 * one. Index modifications are always performed on the most recent index,
 * while searching is always performed on the second most recent (copy of the) index.
 * This strategy results in lower (but still very acceptable) actuality
 * of search results. Further, it produces a lot more disk space overhead, but
 * with the advantage of having backups of the index.<br/>
 * Because the search must filter the search results the user does not have
 * read access on, it can also filter the results that do not exist anymore
 * without further costs.
 *
 * @author Joerg Hohwiller (jhohwill)
 */
public class SearchManager
    extends AbstractManager
    implements
        ISearchManager,
        IDataEventListener,
        Startable,
        Serviceable,
        Disposable,
        Configurable,
        Runnable,
        ThreadSafe {

    /**
     * A background thread is switching/updating the index used for indexing
     * and/or searching. The thread sleeps this constant's amount of
     * milliseconds until the next switch is done.<br/>
     * The shorter the delay, the more up-to-date the search results, but also
     * the more performance overhead is produced.<br/>
     * Be aware that the delay does not determine the index switching frequency,
     * because after a sleep of the delay, the index is copied and then switched.
     * The time required for this operation depends on the size of the
     * index. This also means that the bigger the index, the less actual are
     * the search results.<br/>
     * A value of 60 seconds (60 * 1000L) should be OK.
     */
    private static final long INDEX_SWITCH_DELAY = 30 * 1000L;

    /** the URI field name */
    public static final String FIELD_URI = "uri";

    /** the title field name */
    public static final String FIELD_TITLE = "dc_title";

    /** the text field name */
    public static final String FIELD_TEXT = "text";

    /** the read action */
    private static final String READ_ACTION_URI = "/actions/read";

    /** the name of the configuration tag for the index settings */
    private static final String CONFIGURATION_TAG_INDEXER = "indexer";

    /** the name of the configuration attribute for the index path */
    private static final String CONFIGURATION_ATTRIBUTE_INDEX_PATH = "index-path";

    /** the user used to access resources for indexing (global read access) */
    private static final String SEARCH_INDEX_USER = "indexer";

    /** the maximum number of search hits */
    private static final int MAX_SEARCH_HITS = 100;

    /** the default analyzer used for the search index */
    private static final Analyzer ANALYZER = new StandardAnalyzer();

    /**
     * the number of indexes used, must be at least 3:
     * <ul>
     *   <li>one for writing/updating</li>
     *   <li>one for read/search</li>
     *   <li>one temporary where the index is copied to</li>
     * </ul>
     * All further indexes will act as extra backups of the index but will
     * also

Re: Disabling modifiers?

2003-12-16 Thread Karl Penney
One of the token patterns defined by the StandardTokenizer.jj is this:
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
      | <HAS_DIGIT> <P> <ALPHANUM>
      | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
      | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
      | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )>

So basically if you have sequences of characters separated by a "-"
character, sequences that contain a digit will be combined with the sequences
adjacent to them to form a single token.  That explains why the WS
and YYMM sequences got separated out.  You can alter this behavior with
some simple changes to StandardTokenizer.jj.
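
A quick way to see the difference (hedged sketch, assuming the Lucene 1.3-era analyzers; WhitespaceAnalyzer keeps the whole COBOL name as one token because it splits on whitespace only):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class TokenCheck {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new WhitespaceAnalyzer()
                .tokenStream("code", new StringReader("WS-CA-PP00-PROCESS-YYMM"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText()); // prints WS-CA-PP00-PROCESS-YYMM unchanged
        }
    }
}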

- Original Message -
From: Iain Young [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?


 I think it is a problem with the indexing. I've found another example...

 WS-CA-PP00-PROCESS-YYMM

 I've looked at the index, and it has been tokenized into 3 words...

 WS
 CA-PP00-PROCESS
 YYMM

 Looks as though I might have to use a custom tokenizer as well as an
 analyzer then, but any ideas as to why the standard tokenizer would have
 split the variable up like this (i.e. why didn't it split the middle bit,
 only the word off either end)? The only thing I can think of is that there
 are several other variables in the source beginning with WS- or ending
with
 -YYMM, so could the tokenizer have seen this and be doing something clever
 with them?

 Thanks,
 Iain

 *
 *  Micro Focus Developer Forum 2004 *
 *  3 days that will make a difference   *
 *  www.microfocus.com/devforum  *
 *



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disabling modifiers?

2003-12-16 Thread Iain Young
The WhitespaceTokenizer fixed the problem, so that'll do as a stop gap until
I can figure out how to write our own COBOL tokenizer.

Thanks for the help,
Iain

*
*  Micro Focus Developer Forum 2004 *
*  3 days that will make a difference   *
*  www.microfocus.com/devforum  *
*

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: 16 December 2003 12:31
To: Lucene Users List
Subject: Re: Disabling modifiers?


On Tuesday, December 16, 2003, at 07:28  AM, Erik Hatcher wrote:
 And yes, if you are using StandardTokenizer, you are probably not 
 tokenizing COBOL quite like you expect.  Is there a COBOL parser you 
 could tap into that could give you the tokens you want?

Ummm. nevermind that last question... I just realized where you 
work!  :)

So, my recommendation would be to tap into some parser for the COBOL 
language that you have handy and have it feed your Analyzer 
appropriately.  Or, use something very very simple like the 
WhitespaceAnalyzer as a first try.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lock obtain timed out

2003-12-16 Thread Tatu Saloranta
On Tuesday 16 December 2003 03:37, Hohwiller, Joerg wrote:
 Hi there,

 I have not yet got any response about my problem.

 While debugging into the depths of Lucene (really hard to read that deep inside) I
 discovered that it is possible to disable the locks using a system
 property.
...
 Am I safe disabling the locking?
 Can anybody tell me where to get documentation about the locking
 strategy (I still would like to know why I have that problem)?

 Or does anybody know where to get an official example of how to
 handle concurrent index modification and searches?

One problem I have seen, and am still trying to solve, is that if my web app
is terminated (running from console during development, ctrl+c on unix),
sometimes a commit.lock file is left behind. The problem is that the method
that checks whether there is a lock (so that one could subsequently ask for
it to be removed via the API) apparently doesn't consider that file to be the
lock (sorry for not having details, writing this from home without source). So
I'll probably see if disabling locks would get rid of this lock file (as I
never have multiple writers, or even a writer and a reader, working on the same
index... I just always make a full file copy of the index before doing incremental
updates), or physically delete commit.lock if necessary when starting the
app.
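
For the startup-cleanup variant, a minimal sketch (hedged: the index path is made up; in the Lucene 1.3 era the commit.lock file sits in the index directory itself):

import java.io.File;

public class LockCleanup {
    public static void main(String[] args) {
        // Only safe when no other process can be using the index right now.
        File lock = new File("/path/to/index", "commit.lock");
        if (lock.exists() && !lock.delete()) {
            System.err.println("could not remove stale lock: " + lock);
        }
    }
}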

The problem I describe above happens fairly infrequently, but that's actually 
what makes it worse... our QA people (on a different continent) have been 
bitten by it a couple of times. :-/

-+ Tatu +-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: WebLucene 0.4 released: added full featured demo(dump data php scripts and demo data in Chinese)

2003-12-16 Thread Che Dong
sorry, demo address is:
http://www.blogchina.com/weblucene/


Che, Dong
- Original Message - 
From: Che Dong [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:33 AM
Subject: WebLucene 0.4 released: added full featured demo(dump data php scripts and 
demo data in Chinese)


 http://sourceforge.net/projects/weblucene/
 
 WebLucene: 
 Lucene search engine XML interface, providing SAX-based indexing, indexing-sequence 
 based result sorting and XML output with highlight support. 
 
 The key features:
 1 The bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer. The 
 CJKTokenizer supports Chinese, Japanese and Korean together with Western languages simultaneously.
 
 2 DocID based result sorting: org/apache/lucene/search/IndexOrderSearcher
 
 3 xml output: com/chedong/weblucene/search/DOMSearcher
 
 4 sax based indexing: com/chedong/weblucene/index/SAXIndexer
 
 5 token based highlighter: 
 reverse StopTokenzier:
 org/apache/lucene/anlysis/HighlightAnalyzer.java
   HighlightFilter.java
 with abstract:
 com/chedong/weblucene/search/WebluceneHighlighter
 
 6 A simplified query parser:
 google like syntax with term limit
 org/apache/lucene/queryParser/SimpleQueryParser
 modified from early version of Lucene :)
 
 7 Add full featured demo (including dump script and sample data) runs on: 
 http://www.blogchina.com/weblucene/
 
 Regards
 
 
 Che Dong
 http://www.chedong.com/tech/weblucene.html
 

Re: field boosting best practise

2003-12-16 Thread Doug Cutting
If you wish to boost the title field for every query then it would be 
easiest to boost the title clause of your query, with Query.setBoost(). 
Field.setBoost() should only be used when you want to give a field 
different boosts in different documents, but since you want to boost all 
titles by the same amount, you'll find it easier to boost at query time. 
That way you can experiment with the boost amount without re-building 
the index.  The values of Field.setBoost() are built into the index and 
are harder to change.  Thus I recommend using Query.setBoost() instead. 
Construct a query for each field to be searched (by hand, or with the 
QueryParser), boost each of these field queries separately, then build a 
BooleanQuery that combines these into a single Query that you then 
execute.  I hope that makes sense.
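
A minimal sketch of that recipe (hedged: field names and the boost factor are only examples; assumes the Lucene 1.3-era BooleanQuery.add(Query, required, prohibited) signature):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BoostedQuery {
    public static Query build(String word) {
        Query title = new TermQuery(new Term("title", word));
        title.setBoost(2.0f); // title matches count twice as much

        Query contents = new TermQuery(new Term("contents", word));

        // Optional clauses: a match in either field qualifies a document.
        BooleanQuery query = new BooleanQuery();
        query.add(title, false, false);
        query.add(contents, false, false);
        return query;
    }
}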

Doug

Maurice Coyle wrote:
hi,
i was wondering what's the best approach to take when boosting the value of
a particular field.
i'm searching over a document's title, url and contents fields and i want to
try giving certain fields a boost at times to see if it improves my results.
so for instance if i want to give the title field more weight i can use
Field.setBoost() to do this.
my question is, say for example i want to give the title field twice as much
weight as the url and contents fields.  do i set the boost value of title to
be 2.0, or should i set the boost value of the url and contents fields to be
0.25 and the boost value of the title field to be 0.5 (thereby having all
boost values add up to 1 so all the scores are normalised with respect to
the title field's boost)?
i just find the messages in the archives a little confusing regarding this,
i can't see which approach is best, or if either is best.
any help appreciated.
maurice
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: WebLucene 0.4 released: added full featured demo(dump data php scripts and demo data in Chinese)

2003-12-16 Thread Akmal Sarhan
are there any English versions of the site?

regards
Akmal
- Original Message - 
From: Che Dong [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 6:52 PM
Subject: Re: WebLucene 0.4 released: added full featured demo(dump data php
scripts and demo data in Chinese)


 sorry, demo address is:
 http://www.blogchina.com/weblucene/


 Che, Dong
 - Original Message - 
 From: Che Dong [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, December 17, 2003 1:33 AM
 Subject: WebLucene 0.4 released: added full featured demo(dump data php
scripts and demo data in Chinese)


  http://sourceforge.net/projects/weblucene/
 
  WebLucene:
  Lucene search engine XML interface, provided sax based indexing,
indexing sequence based result sorting and xml output with highlight
support.
 
  The key features:
  1 The bi-gram based CJK support:
org/apache/lucene/analysis/cjk/CJKTokenizer, The CJKTokenizer support
Chinese Japanese and Korean with Westen language simultaneously.
 
  2 DocID based result sorting:
org/apache/lucene/search/IndexOrderSearcher
 
  3 xml output: com/chedong/weblucene/search/DOMSearcher
 
  4 sax based indexing: com/chedong/weblucene/index/SAXIndexer
 
  5 token based highlighter:
  reverse StopTokenzier:
  org/apache/lucene/anlysis/HighlightAnalyzer.java
HighlightFilter.java
  with abstract:
  com/chedong/weblucene/search/WebluceneHighlighter
 
  6 A simplified query parser:
  google like syntax with term limit
  org/apache/lucene/queryParser/SimpleQueryParser
  modified from early version of Lucene :)
 
  7 Add full featured demo (including dump script and sample data) runs
on: http://www.blogchina.com/weblucene/
 
  Regards
 
 
  Che Dong
  http://www.chedong.com/tech/weblucene.html
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to get TokenStream from Field?

2003-12-16 Thread Karl Penney
Is there any way to get a TokenStream for a given Field of a Document (is that 
information even stored in the index)?  I want to use the startOffset / endOffset 
information for hit highlighting.  Do I have to tokenize the text value for the field 
again to get this information?


Re: Summarization; sentence-level and document-level filters.

2003-12-16 Thread ambiesense
Hello Gregor and Maurits,

I am not quite sure what you want to do. I think you want to search the
normal text and present the summarized text on the screen, where the user is able
to get the full text on request. Is this the case?

If this is the case, then you could create a set of summarized texts from the
full texts, create another index for them, and have an extra field holding the text
which is not summarized. You could use this field to find the summarized
version of a full text and retrieve the full text from the summarized text in
order to present it to the user. In this case you would put your summarizer
before the analyzer (in terms of workflow), which would perfectly fit into the
existing concept of Lucene.
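
A hedged sketch of that layout (assumes the Lucene 1.3-era Field API; the field names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SummaryDocuments {
    // One document per text: the summary is indexed and stored for display,
    // the identifier field links back to the full (unsummarized) text.
    static Document makeDoc(String id, String summary, String fullText) {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));           // identifier of the full text
        doc.add(Field.Text("summary", summary));    // summarized version
        doc.add(Field.UnStored("text", fullText));  // full text: searchable, not stored
        return doc;
    }
}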

I am not sure if I have caught your idea. Please educate me further if I
misunderstood something... 

Cheers,
Ralf

 Hi Gregor,
 
 So far as I know there is no summarizer in the plans. And maybe I can help
 you along the way. Have a look
 at the Classifier4J project on Sourceforge.
 
 http://classifier4j.sourceforge.net/
 
 It has a small document summarizer besides a Bayes classifier. It might
 speed
 up your coding.
 
 On the level of Lucene, I have no idea. My gut feeling says that a summary
 should be built before the
 text is tokenized! The tokenizer can of course be used when analysing a
 document, but hooking into
 the Lucene indexing is a bad idea I think.
 
 Someone else has any ideas?
 
 regards,
 
 Maurits
 
 
 
 
 - Original Message - 
 From: Gregor Heinrich [EMAIL PROTECTED]
 To: 'Lucene Users List' [EMAIL PROTECTED]
 Sent: Monday, December 15, 2003 7:41 PM
 Subject: Summarization; sentence-level and document-level filters.
 
 
  Hi,
 
  is there any possibility to do sentence-level or document level analysis
  with the current Analysis/TokenStream architecture? Or where else is the
  best place to plug in customised document-level and sentence-level
 analysis
  features? Is there any precedence case ?
 
  My technical problem:
 
  I'd like to include a summarization feature into my system, which should
 (1)
  best make use of the architecture already there in Lucene, and (2)
 should
 be
  able to trigger summarization on a per-document basis while requiring
  sentence-level information, such as full-stops and commas. To preserve
 this
  punctuation, a special Tokenizer can be used that outputs such
 landmarks
  as tokens instead of filtering them out. The actual SummaryFilter then
  filters out the punctuation for its successors in the Analyzer's filter
  chain.
 
  The other, more complex thing is the document-level information: As
 Lucene's
  architecture uses a filter concept that does not know about the document
 the
  tokens are generated from (which is good abstraction), a
 document-specific
  operation like summarization is a bit of an awkward thing with this (and
  originally not intended, I guess). On the other hand, I'd like to have
 the
  existing filter structure in place for preprocessing of the input,
 because
  my raw texts are generated by converters from other formats that output
  unwanted chars (from figures, pagenumbers, etc.), which are filtered out
  anyway by my custom Analyzer.
 
  Any idea how to solve this second problem? Is there any support for such
  document / sentence structure analysis planned?
 
  Thanks and regards,
 
  Gregor
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene and Mysql

2003-12-16 Thread Stefan Trcko
Hello

I'm new to Lucene. I want users to be able to search text which is stored in a MySQL database.
Is there any tutorial on how to implement this kind of search feature?

Best regards,
Stefan

RE: Summarization; sentence-level and document-level filters.

2003-12-16 Thread Gregor Heinrich
Yes, copying a summary from one field to an untokenized field was the plan.

I identified DocumentWriter.invertDocument() to be a possible place for an
addition of this document-level analysis. But I admit this appears way too
low-level and inflexible for the overall design.

So I'll make it two-pass indexing.

Thanks for the decision support,

gregor

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 6:57 PM
To: Lucene Users List
Subject: Re: Summarization; sentence-level and document-level filters.


It sounds like you want the value of a stored field (a summary) to be
built from the tokens of another field of the same document.  Is that
right?  This is not presently possible without tokenizing the field
twice, once to produce its summary and once again when indexing.

Doug

Gregor Heinrich wrote:
 Hi,

 is there any possibility to do sentence-level or document level analysis
 with the current Analysis/TokenStream architecture? Or where else is the
 best place to plug in customised document-level and sentence-level
analysis
 features? Is there any precedence case ?

 My technical problem:

 I'd like to include a summarization feature into my system, which should
(1)
 best make use of the architecture already there in Lucene, and (2) should
be
 able to trigger summarization on a per-document basis while requiring
 sentence-level information, such as full-stops and commas. To preserve
this
 punctuation, a special Tokenizer can be used that outputs such landmarks
 as tokens instead of filtering them out. The actual SummaryFilter then
 filters out the punctuation for its successors in the Analyzer's filter
 chain.

 The other, more complex thing is the document-level information: As
Lucene's
 architecture uses a filter concept that does not know about the document
the
 tokens are generated from (which is good abstraction), a document-specific
 operation like summarization is a bit of an awkward thing with this (and
 originally not intended, I guess). On the other hand, I'd like to have the
 existing filter structure in place for preprocessing of the input, because
 my raw texts are generated by converters from other formats that output
 unwanted chars (from figures, pagenumbers, etc.), which are filtered out
 anyway by my custom Analyzer.

 Any idea how to solve this second problem? Is there any support for such
 document / sentence structure analysis planned?

 Thanks and regards,

 Gregor



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Summarization; sentence-level and document-level filters.

2003-12-16 Thread Gregor Heinrich
Maurits: thanks for the hint to Classifier4J -- I have had a look at this
package and tried the SimpleSummarizer, and it seems to work fine. (However,
as I don't know the benchmarks for summarization, I'm not the one to judge.)

Do you have experience with it?

Gregor

-Original Message-
From: maurits van wijland [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 1:09 AM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Summarization; sentence-level and document-level filters.


Hi Gregor,

So far as I know there is no summarizer in the plans. And maybe I can help
you along the way. Have a look
at the Classifier4J project on Sourceforge.

http://classifier4j.sourceforge.net/

It has a small document summarizer besides a Bayes classifier. It might speed
up your coding.

On the level of Lucene, I have no idea. My gut feeling says that a summary
should be built before the
text is tokenized! The tokenizer can of course be used when analysing a
document, but hooking into
the Lucene indexing is a bad idea I think.

Someone else has any ideas?

regards,

Maurits




- Original Message -
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 7:41 PM
Subject: Summarization; sentence-level and document-level filters.


 Hi,

 is there any possibility to do sentence-level or document level analysis
 with the current Analysis/TokenStream architecture? Or where else is the
 best place to plug in customised document-level and sentence-level
analysis
 features? Is there any precedence case ?

 My technical problem:

 I'd like to include a summarization feature into my system, which should
(1)
 best make use of the architecture already there in Lucene, and (2) should
be
 able to trigger summarization on a per-document basis while requiring
 sentence-level information, such as full-stops and commas. To preserve
this
 punctuation, a special Tokenizer can be used that outputs such landmarks
 as tokens instead of filtering them out. The actual SummaryFilter then
 filters out the punctuation for its successors in the Analyzer's filter
 chain.

 The other, more complex thing is the document-level information: As
Lucene's
 architecture uses a filter concept that does not know about the document
the
 tokens are generated from (which is good abstraction), a document-specific
 operation like summarization is a bit of an awkward thing with this (and
 originally not intended, I guess). On the other hand, I'd like to have the
 existing filter structure in place for preprocessing of the input, because
 my raw texts are generated by converters from other formats that output
 unwanted chars (from figures, pagenumbers, etc.), which are filtered out
 anyway by my custom Analyzer.

 Any idea how to solve this second problem? Is there any support for such
 document / sentence structure analysis planned?

 Thanks and regards,

 Gregor



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene and Mysql

2003-12-16 Thread Gregor Heinrich
Hi.

You read all the relevant fields out of MySQL and assign the primary key
as an identifier of your Lucene documents.

During search, you retrieve the identifier from the Lucene searcher and
query the database to present the full text.
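
A minimal sketch of that flow (hedged: the table, column, connection details and index path are all made up; assumes the Lucene 1.3-era API and a standard JDBC driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MysqlIndexer {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        ResultSet rs = con.createStatement().executeQuery("SELECT id, body FROM articles");
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));      // primary key as identifier
            doc.add(Field.UnStored("text", rs.getString("body"))); // searchable full text
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        con.close();
    }
}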

Best regards,

Gregor



-Original Message-
From: Stefan Trcko [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:31 PM
To: [EMAIL PROTECTED]
Subject: Lucene and Mysql


Hello

I'm new to Lucene. I want users to be able to search text which is stored in a MySQL
database.
Is there any tutorial on how to implement this kind of search feature?

Best regards,
Stefan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene and Mysql

2003-12-16 Thread Pleasant, Tracy
You would just take the items from the MySQL database and create a document for each 
record. Then index all the documents.


-Original Message-
From: Stefan Trcko [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 3:31 PM
To: [EMAIL PROTECTED]
Subject: Lucene and Mysql


Hello

I'm new to Lucene. I want users to be able to search text which is stored in a MySQL database.
Is there any tutorial on how to implement this kind of search feature?

Best regards,
Stefan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: WebLucene 0.4 released: added full featured demo(dump data php scripts and demo data in Chinese)

2003-12-16 Thread Tun Lin
Hi,

I am using the downloaded WebLucene. I have started my Tomcat server and am trying
to search by clicking on the search button, but it says the search page cannot be
found. Also, I cannot find it in the package.

Can anyone help?

Am I missing anything? 

-Original Message-
From: Che Dong [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 17, 2003 1:53 AM
To: Lucene Users List
Subject: Re: WebLucene 0.4 released: added full featured demo(dump data php
scripts and demo data in Chinese)

sorry, demo address is:
http://www.blogchina.com/weblucene/


Che, Dong
- Original Message -
From: Che Dong [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:33 AM
Subject: WebLucene 0.4 released: added full featured demo(dump data php scripts
and demo data in Chinese)


 http://sourceforge.net/projects/weblucene/
 
 WebLucene: 
 Lucene search engine XML interface, provided sax based indexing, indexing
sequence based result sorting and xml output with highlight support. 
 
 The key features:
 1 The bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer,
The CJKTokenizer support Chinese Japanese and Korean with Westen language
simultaneously.
 
 2 DocID based result sorting: org/apache/lucene/search/IndexOrderSearcher
 
 3 xml output: com/chedong/weblucene/search/DOMSearcher
 
 4 sax based indexing: com/chedong/weblucene/index/SAXIndexer
 
 5 token based highlighter: 
 reverse StopTokenzier:
 org/apache/lucene/anlysis/HighlightAnalyzer.java
   HighlightFilter.java
 with abstract:
 com/chedong/weblucene/search/WebluceneHighlighter
 
 6 A simplified query parser:
 google like syntax with term limit
 org/apache/lucene/queryParser/SimpleQueryParser
 modified from early version of Lucene :)
 
 7 Add full featured demo (including dump script and sample data) runs on:
http://www.blogchina.com/weblucene/
 
 Regards
 
 
 Che Dong
 http://www.chedong.com/tech/weblucene.html
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]