Re: Viewing the contents of the index on Tomcat

2004-08-06 Thread Julien Nioche
see http://jakarta.apache.org/lucene/docs/contributions.html
LUKE is a stand-alone application for viewing and querying an index.
LIMO is a web application for monitoring the contents of an index.

- Original Message - 
From: Ian McDonnell [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, August 06, 2004 4:10 PM
Subject: Viewing the contents of the index on Tomcat


 How is this done?

 I want to verify that the indexer has added documents submitted from my
JSPs.

 ian




Re: Viewing the contents of the index on Tomcat

2004-08-06 Thread Julien Nioche
LUKE is a stand-alone application; it is not meant to run inside Tomcat.

- Original Message - 
From: Ian McDonnell [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, August 06, 2004 5:31 PM
Subject: Re: Viewing the contents of the index on Tomcat


 And how do you run Luke once it's been added to the classpath on Tomcat? I
can't seem to find any docs on the Luke site.

 Ian


 --- Julien Nioche [EMAIL PROTECTED] wrote:
 see http://jakarta.apache.org/lucene/docs/contributions.html
 LUKE is a stand-alone application for viewing and querying an index.
 LIMO is a web application for monitoring the contents of an index.

 - Original Message - 
 From: Ian McDonnell [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Friday, August 06, 2004 4:10 PM
 Subject: Viewing the contents of the index on Tomcat


  How is this done?
 
  I want to verify that the indexer has added documents submitted from my
 JSPs.
 
  ian
 



Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Julien Nioche
It is not surprising that you run out of file handles with such a large
mergeFactor.

Before trying more complex strategies involving RAMDirectories and/or
splitting your indexing across several machines, I reckon you should try
simple things first: use a low mergeFactor (e.g. 10) combined with a higher
minMergeDocs (e.g. 1000), and optimize only at the end of the process.

By setting a higher value for minMergeDocs, you index and merge in a
RAMDirectory. When the limit is reached (e.g. 1000 documents) a segment is
written to the filesystem. mergeFactor controls the number of segments to be
merged, so when you have 10 segments on the filesystem (which is already
10 x 1000 docs), the IndexWriter merges them all into a single segment. This
is equivalent to an optimize, I think. The process continues like that until
indexing is finished.

Combining these parameters should be enough to achieve good performance.
The good point of using minMergeDocs is that you make heavy use of the
RAMDirectory inside your IndexWriter (which is fast) without having to watch
your RAM too carefully (which would be the case if you managed a RAMDirectory
yourself). At the same time, keeping your mergeFactor low limits the risk of
running into the too-many-open-files problem.
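
A minimal sketch of this configuration, assuming the Lucene 1.4-era API in
which mergeFactor and minMergeDocs are public fields on IndexWriter (the
index path, analyzer choice and document-source helpers are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.mergeFactor = 10;    // low: merge at most 10 segments at a time on disk
    writer.minMergeDocs = 1000; // high: buffer 1000 docs in RAM before flushing a segment
    while (hasMoreDocuments()) {       // hasMoreDocuments()/nextDocument() are
        Document doc = nextDocument(); // hypothetical stand-ins for your own source
        writer.addDocument(doc);
    }
    writer.optimize(); // single optimize at the very end
    writer.close();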


- Original Message - 
From: Kevin A. Burton [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 7:44 AM
Subject: Most efficient way to index 14M documents (out of memory/file
handles)


 I'm trying to burn an index of 14M documents.

 I have two problems.

 1.  I have to run optimize() every 50k documents or I run out of file
 handles.  This takes TIME and of course is linear in the size of the
 index, so it just gets slower as the build progresses.  It starts to crawl
 at about 3M documents.

 2.  I eventually will run out of memory in this configuration.

 I KNOW this has been covered before but for the life of me I can't find
 it in the archives, the FAQ or the wiki.

 I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
 every 50k documents.

 Does it make sense to just create a new IndexWriter for every 50k docs
 and then do one big optimize() at the end?

 Kevin




Re: Optimizing for long queries?

2004-06-29 Thread Julien Nioche
I ran some tests changing TermInfosWriter.INDEX_INTERVAL to 16.
In my application (which does a lot on top of Lucene, including SQL
transactions and so on) it saved about 10% of the total time.
I suppose the improvement could be bigger in other applications, since the
Lucene search is not 100% of what my application does.

The index used for this test is 720 MB (FSDirectory on Fedora Core 1).
The .tii file is 3398 KB in the modified version against 488 KB in the
original (INDEX_INTERVAL = 128).

Has anyone tried changing this value? Do you get similar results?

Julien

- Original Message - 
From: Julien Nioche [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, June 28, 2004 10:04 AM
Subject: Re: Optimizing for long queries?


 Hello Drew,

 I don't think it's in the FAQ.

 1 - What you could do is to sort your query terms in ascending alphabetical
 order. In my case it improved the performance a little bit. It would be
 interesting to know how it works in your case.
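
 A hedged sketch of suggestion 1, using the Lucene 1.x BooleanQuery API
 (the field name and terms are invented for illustration):

     import java.util.Arrays;
     import org.apache.lucene.index.Term;
     import org.apache.lucene.search.BooleanQuery;
     import org.apache.lucene.search.TermQuery;

     String[] words = {"zebra", "apple", "mango"};
     Arrays.sort(words); // ascending order, so term lookups move forward in the index
     BooleanQuery query = new BooleanQuery();
     for (int i = 0; i < words.length; i++) {
         // add(clause, required, prohibited): false/false makes it an OR clause
         query.add(new TermQuery(new Term("field1", words[i])), false, false);
     }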

 2 - Another solution is to play with TermInfosWriter.INDEX_INTERVAL at
 indexing time. I quote Doug:

 ..., try reducing TermInfosWriter.INDEX_INTERVAL.  You'll
 have to re-create your indexes each time you change this constant.  You
 might try a value like 16.  This would keep the number of terms in
 memory from being too huge (1 of 16 terms), but would reduce the average
 number scanned from 64 to 8, which would be substantial.  Tell me how
 this works.  If it makes a big difference, then perhaps we should make
 this parameter more easily changeable.
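
 Note that this is not a runtime setting: INDEX_INTERVAL is a constant in
 Lucene's source, so changing it means rebuilding the jar and re-creating
 the index. A sketch of the edit (values as per the quote above):

     // in org/apache/lucene/index/TermInfosWriter.java (Lucene 1.3-era source)
     // was: static final int INDEX_INTERVAL = 128; // 1 in 128 terms kept in RAM
     static final int INDEX_INTERVAL = 16; // denser .tii, fewer terms scanned per lookup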

 Have you used a profiler on your application? This could be useful to spot
 possible improvements.


 - Original Message - 
 From: Drew Farris [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Friday, June 25, 2004 8:24 PM
 Subject: Optimizing for long queries?


  Apologies if this is a FAQ, but I didn't have much luck searching the
  list archives for answers on this subject:

  I'm using Lucene in a context where we frequently have queries that
  search for as many as 30-50 terms in a single field. Does anyone have
  any thoughts concerning ways to optimize Lucene for queries of this
  length?
 



Re: Performance: compound vs. multi-file index, indexing and searching

2004-06-10 Thread Julien Nioche
Has anyone tried comparing the performance of regular (multi-file) indexing
with and without specifying a value for minMergeDocs?
Using this parameter limits the number of files and is supposed to improve
indexing speed.

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, June 10, 2004 12:41 PM
Subject: Re: Performance: compound vs. multi-file index, indexing and
searching


 I haven't tested the two in a multi-threaded setup, but my
 single-threaded unit test now clearly shows that there _is_ a
 consistent indexing performance difference between the two index
 structures.  The multi-file structure seems to beat the compound index
 structure by about 7% in my tests, which matches Hui's report, and what
 I thought unit tests would show.

 Hui used Lucene 1.3, and I used the latest RC, and the results are
 about the same.

 Otis

 --- Doug Cutting [EMAIL PROTECTED] wrote:
  Otis Gospodnetic wrote:
   Can anyone comment on performance differences?
 
  I'd expect multi-threaded performance to be a bit worse with the
  compound format, but single-threaded performance should be nearly
  identical.


 =
 http://www.simpy.com/ - social bookmarking and personal search engine




Re: code works with 1.3-rc1 but not with 1.3-final??

2004-03-23 Thread Julien Nioche
Or set a big value for minMergeDocs on IndexWriter and keep a low
mergeFactor (e.g. 10). You'll have a small number of files on disk and
the indexing should be faster as well.

- Original Message -
From: Matt Quail [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 4:22 AM
Subject: Re: code works with 1.3-rc1 but not with 1.3-final??


 Or use IndexWriter.setUseCompoundFile(true) to reduce the number of files
 created by Lucene.

 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)
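
 A minimal sketch of that call (the index path and analyzer are placeholders):

     import org.apache.lucene.analysis.standard.StandardAnalyzer;
     import org.apache.lucene.index.IndexWriter;

     IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
     writer.setUseCompoundFile(true); // pack each segment's many files into one .cfs file
     // ... addDocument() calls ...
     writer.close();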

 =Matt

 Kevin A. Burton wrote:

  Dan wrote:
 
  I have some code that creates a lucene index. It has been working fine
  with lucene-1.3-rc1.jar but I wanted to upgrade to
  lucene-1.3-final.jar. I did this and the indexer breaks. I get the
  following error when running the index with 1.3-final:
 
  Optimizing the index
  IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many
  open files)
  Indexed 884 files in 8 directories
  Index creation took 242 seconds
  %
 
  No... it's you... ;)
 
  Read the FAQ and then run
 
  ulimit -n 100 or so...
 
  You need to increase your file handles.  Chances are you never noticed
  this before, but the problem was still present.  If you're on a Linux box
  you would be amazed to find out that you're only about 200 file handles
  away from running out of your per-user file handle quota.

  You might have to su to root to change this.  RedHat is more strict here
  because it uses the glibc resource restrictions thingy (whose name slips
  my mind at the moment).
  Debian is configured better here by default.

  Also a Google query would have solved this for you very quickly ;)..
 
  Kevin
 







Re: Similarity - position in Field[] affects scoring - how to change?

2004-03-23 Thread Julien Nioche
Joachim,

Why don't you use the explain() method of IndexSearcher?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html

It is the best way to find out why your documents score differently. I
suspect the lengthNorm method, which is used at indexing time.
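
A hedged sketch of that call (the index path, field name and query term are
invented; Hits and explain() as in the Lucene 1.3/1.4-era API):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query query = new TermQuery(new Term("ids", "s"));  // placeholder term
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        // Prints the full score breakdown, including the lengthNorm/fieldNorm factor.
        System.out.println(searcher.explain(query, hits.id(i)));
    }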

Julien


- Original Message -
From: Joachim Schreiber [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 4:05 PM
Subject: Similarity - position in Field[] affects scoring - how to change?


 Hallo,

 I ran into the following problem. Perhaps somebody can help me.

 I have an index with different ids in the same field, something like:

 s
 s45678565
 s87854546

 Situation: I have different documents with the entry s in the same index.


 document 1)

 s324235678565
 s324dssd5678565
 s45678324565
 s
 s8785454324326


 document 2)

 s324235678565
 s
 s45678324565
 s8785454324326



 When I search for s: I receive both docs, but document 1 has a better
 scoring than document 2.
 The position of s in doc 1 is Field[4] and in doc 2 it's Field[2], so this
 seems to affect the scoring.

 How can I disable this behaviour, so that doc 1 gets the same scoring as
 doc 2? Which method do I have to override in DefaultSimilarity?
 Has anybody any idea? Any help is welcome.

 Thanks

 yo










LIMO new release (v0.3)

2004-01-22 Thread Julien Nioche
There's a new release of limo available!

This new version:
- includes lucene-1.3-final.jar
- fixes a bug with index loading
- detects when the index changes and automatically refreshes the information
(as proposed by Jakob Flierl)
- uses CSS for easier customisation (as proposed by E. Hatcher)
- escapes HTML code in field values (as proposed by E. Hatcher)

Please note that the home page for limo is now http://limo.sourceforge.net/
Feel free to submit feature requests or bug reports using the sourceforge
tools at http://sourceforge.net/projects/limo/

Thank you

Julien







Re: Poor Performance when searching for 500+ terms

2003-11-13 Thread Julien Nioche
Hello,

Since there are a lot of Term objects in your Query, your application must
spend a lot of time collecting information about those Terms.

1/ Do you use RAMDirectory? Loading the whole index into memory will increase
speed - your index must not be too big, though.

2/ You are probably not using the QueryParser, so when you build the Query
you could sort the Term objects inside the BooleanQuery. Sorting the Terms
will reduce seeks on disk. I have no benchmarks for this, but logically it
should have some positive effect when using FSDirectory. Am I wrong?

3/ There was a patch submitted by Dmitry Serebrennikov
(http://www.mail-archive.com/[EMAIL PROTECTED]/msg02762.html)
which reduced garbage collection by limiting the creation of temporary Term
objects. This patch has not been included in the Lucene code (a bug in it?).

Hope it helps.
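
A minimal sketch of suggestion 1, assuming the RAMDirectory copy constructor
available in the Lucene 1.3/1.4-era API (the index path is a placeholder):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    // Copies the on-disk index into RAM; subsequent reads avoid the disk entirely.
    RAMDirectory ramDir = new RAMDirectory("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(ramDir);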

Julien

- Original Message -
From: Jie Yang [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 10:11 PM
Subject: Poor Performance when searching for 500+ terms


 I know this is rare, but I am building an application
 that submits searches having 500+ search terms. A
 general example would be

 field1:w1 OR field1:w2 OR ... OR field1:w500

 For 1 million documents, the performance is OK if
 field1 in each document has fewer than 50 terms: I can
 get results in under 1 second. But if field1 has more
 than 400 terms per document on average, the performance
 degrades to around 6 seconds.

 Is there anyway to improve this?

 And my second question is that my query often comes
 with an AND condition on another search word, for
 example:

 field2:w AND (field1:w1 OR field1:w2 ... OR field1:w500)

 field2:w will only return fewer than 1000 records out
 of 1 million, so I thought I could use a filter object:
 search on field2:w first, and limit the 500-term OR
 search to those 1000 results - somewhat like a join in
 a database. But I checked the code and see that
 IndexSearcher always performs the 500 disk lookups
 before calling the filter object. Any suggestions on
 this?

 Also, does Lucene cache results in memory? I see the
 performance tends to get better after a few runs,
 especially for searches on fields having a small number
 of terms. If so, can I manipulate the cache size
 somehow to accommodate fields with a large number of
 terms?

 Many thanks.


 



Proposition: adding minMergeDocs to IndexWriter

2003-09-23 Thread Julien Nioche
Hui,

Concerning another point on your request list: I proposed a patch this
weekend on the lucene-dev list, and I totally forgot that this feature was
requested on the user list.

This new feature should help you set the number of Documents to be merged
in memory independently of the mergeFactor.

Any comments would be appreciated.

Best regards

Julien Nioche
http://www.lingway.com

-- Beginning of original message ---

From: fp235-5 [EMAIL PROTECTED]
To: lucene-dev [EMAIL PROTECTED]
Cc:
Date: Sat, 20 Sep 2003 16:06:06 +0200
Subject: [PATCH] IndexWriter: controlling the number of Docs merged

Hello,

Someone made a suggestion yesterday about adding a variable to IndexWriter in
order to control the number of Documents merged in the RAMDirectory
independently of the mergeFactor. (I'm sorry, I don't remember exactly who -
the mail arrived at my office.)
I'm proposing a tiny modification of IndexWriter to add this functionality. A
variable minMergeDocs specifies the number of Documents to be merged in
memory before starting a new Segment. The mergeFactor still controls the
number of Segments created in the Directory, and thus it's possible to avoid
the file-handle limitation problem.

The diff file is attached.

As noticed by Dmitry and Erik, there are no true JUnit tests. I'd be OK with
writing a JUnit test for this feature. The problem is that the SegmentInfos
field is private in IndexWriter and can't be used to check the number and
size of the Segments. I ran a test using the infoStream variable of
IndexWriter - everything seems to be OK.

Any comments / suggestions are welcome.

Regards

Julien









- Original Message -
From: hui [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, September 22, 2003 3:40 PM
Subject: Re: per-field Analyzer (was Re: some requests)


 Good work, Erik.

 Hui

 - Original Message -
 From: Erik Hatcher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Saturday, September 20, 2003 4:13 AM
 Subject: per-field Analyzer (was Re: some requests)


  On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
   On Friday, September 19, 2003, at 11:15  AM, hui wrote:
   1. Move the Analyzer down to field level from document level, so some
   fields could be given a special analyzer. Other fields would still use
   the default analyzer from the document level.
   For example, I do not need to index the numbers in the content field.
   It helps me reduce the index size a lot when I have some Excel files.
   But I always need the created_date to be indexed even though it is a
   number field.
  
   I know there are some workarounds posted in the group, but I think it
   would be a good feature to have.
  
   The workaround is to write a custom analyzer and have it do the
   desired thing per-field.
  
   Hmmm just thinking out loud here without knowing if this is
   possible, but could a generic wrapper Analyzer be written that
   allows other analyzers to be used under the covers based on a field
   name/analyzer mapping?   If so, that would be quite cool and would
   save folks from having to write custom analyzers as much to handle
   this pretty typical use-case.  I'll look into this more in the very
   near future personally, but feel free to have a look at this yourself
   and see what you can come up with.
 
  What about something like this?
 
   // (assumes java.util.Map/HashMap and the Lucene Analyzer imports)
   public class PerFieldWrapperAnalyzer extends Analyzer {
      private Analyzer defaultAnalyzer;
      private Map analyzerMap = new HashMap();

      public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
      }

      // Register a dedicated analyzer for one field.
      public void addAnalyzer(String fieldName, Analyzer analyzer) {
        analyzerMap.put(fieldName, analyzer);
      }

      // Delegate to the field's analyzer, falling back to the default.
      public TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
        if (analyzer == null) {
          analyzer = defaultAnalyzer;
        }
        return analyzer.tokenStream(fieldName, reader);
      }
   }
 
   This would allow you to construct a single analyzer out of others, on a
   per-field basis, including a default one for any fields that do not
   have a special one.  Whether the constructor should take the map or the
   addAnalyzer method should be used is debatable, but I prefer the
   addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could
   chain: new PerFieldWrapperAnalyzer(new
   StandardAnalyzer()).addAnalyzer("field1", new
   WhitespaceAnalyzer()).addAnalyzer(...).  And I'm more inclined to
   call this thing PerFieldAnalyzerWrapper instead.  Any naming
   suggestions?
 
   This simple little class would seem to be the answer to a very commonly
   asked question.
 
  Thoughts?  Should this be made part of the core?
 
  Erik
 
 

Re: Luke v 0.2 - Lucene Index Browser

2003-08-14 Thread Julien Nioche
Hello,

Thanks to Andrzej for this new version! Luke is really useful.

 open the original documents with the platform-dependent mimetype viewer
???

My suggestions, in order of importance, would be:

- History of the last 5 indexes used: under the File menu item?
- Remove the empty field (see the lucene-dev mailing list, message from Otis,
4/16/2003)
- Possibility to merge different indexes
- Information about the number of segments
[- and almost impossible: recompose the unstored fields of a document]

Best regards

Julien


- Original Message -
From: Günter Kukies [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 11, 2003 5:12 PM
Subject: Re: Luke v 0.2 - Lucene Index Browser


 Hi,

 Nice tool.

 Here are some points for further development:
 - show the contents of a Reader-valued Field
 - open the original documents with the platform-dependent mimetype viewer

 regards,
 Günter


 - Original Message -
 From: Andrzej Bialecki [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Cc: Lucene Developers List [EMAIL PROTECTED]
 Sent: Monday, August 11, 2003 1:10 PM
 Subject: Luke v 0.2 - Lucene Index Browser


  Dear Lucene Users,
 
  I'm glad to announce that a new version of Luke is available for
  download, and as a Java WebStart application.
 
  Luke is a diagnostic tool for Lucene (http://jakarta.apache.org/lucene)
  indexes. It enables you to browse documents in existing indexes, perform
  queries, navigate through terms, optimize indexes and more.
 
  Please go to http://www.getopt.org/luke and give it a try. You can
  either use the Java WebStart version, or just download the JAR file with
  binaries. Full source code is available, under Apache license.
 
  Changes in v0.2:
 
  * Add Java WebStart version.
  * Add Read-Only mode.
  * Fix spinbox bug (really a bug in the Thinlet toolkit - fixed there).
   * Allow browsing hidden directories.
  * Add a combobox to choose the default field for searching.
  * Other minor code cleanups.
 
  Thanks to all who provided their comments and suggestions!
 
  --
  Best regards,
  Andrzej Bialecki
 
  -
  Software Architect, System Integration Specialist
  CEN/ISSS EC Workshop, ECIMF project chair
  EU FP6 E-Commerce Expert/Evaluator
  -
  FreeBSD developer (http://www.freebsd.org)
 
 
 
 



Re: Industry Use of Lucene?

2002-02-11 Thread Julien Nioche

Hello,

Lingway (http://www.lingway.com) is a French company that specializes in the
design, development and implementation of linguistics-based software
solutions. We are using Lucene in one of our projects, which can be seen at
http://kant.lingway.com/LGfisc/index.html.

This demo provides access to French fiscal legal texts (Code Général des
Impôts) through our linguistic technology, which analyses the user input,
retrieves the most relevant terms and adds semantically related terms. This
helps retrieve more documents related to the query. Another aspect is that
the linguistic analysis automatically produces all possible forms of a word
(singular, plural, masculine, feminine) and corrects some user mistyping
(like the missing accent in impots for impôts).

The analysis disambiguates between homographic forms (e.g. the verb to book
and the noun a book). This is why the system proposes related terms only for
the form found in the user's sentence. Finally, the boolean operators used in
the query are computed according to the weight and role of the terms in the
user's query.

Since the documentation of the demo is in French (by the way, it could be
interesting to know where Lucene users come from, and in what proportions),
I'll give you a brief overview of the functionality.

Let's suppose we typed the following query: réduction d'impôts pour les couples


1/ The number of documents found is indicated by:

24 documents trouvés sur (réduction d' impôt), couple

The second element - (réduction d' impôt), couple - indicates which terms
(and their related ones) have been sent in the query. You can try other
analyses by moving the mouse over this element, which displays a contextual
menu with all possible degradations of the original query. By default, the
system returns the results for the best-matching analysis.

2/ Information about the document

Article 200 sexies   Section V : Calcul de l'impôt
Termes pertinents : réductions impôt - couple - couples - famille - exonérés

Clicking on the document's reference (Article 200 sexies) opens the document
in a pop-up window (see below).

Relevant terms are shown in grey on the second line. These terms were sent to
Lucene in a query generated by the system and appear in that document. Note
that the colour of a term depends on its weight in the document (more
relevant terms are darker). Here we can see that the generated boolean query
contains not only the words present in the original query (réduction - impôt
- couple) but also related terms found by the linguistic analysis (famille -
exonérés) and morphological variations (singular/plural forms). We can also
see that réduction d'impôts has been recognized as a compound word.

This functionality helps the user know roughly what a document contains
without opening it.

3/ Displaying the document

A click on the document's reference opens it in a pop-up window. The system
highlights the words of the text which are present in the query. This
functionality partially uses Maik Schreiber's proposals
(http://www.iq-computing.de/lucene/highlight.htm), the difference being that
our highlighter recognizes compound words (e.g. it will highlight réduction
d'impôts as a whole, not réduction and impôts separately).

---

A set of examples (in French, of course) is available at
http://kant.lingway.com/LGfisc/about.html#exemples

Committers: we would really be happy to be mentioned on the powered-by-Lucene
page (http://jakarta.apache.org/lucene/docs/powered.html). Is it possible?

A demo of our system in English is planned. We are waiting for your
suggestions: what would you like us to show you?

Any questions or comments are welcome. You can send them to
[EMAIL PROTECTED]
Please take a look at our site (www.lingway.com) for more information about
our activities.



Thank you

Julien Nioche / www.lingway.com