Re: Lucene on Windows

2003-10-21 Thread Eric Jain
 The CVS version of Lucene has a patch that allows one to use a
 'Compound Index' instead of the traditional one.  This reduces the
 number of open files.  For more info, see/make the Javadocs for
 IndexWriter.

Interesting option. Do you have a rough idea of what the performance
impact of using this setting is?

--
Eric Jain


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: positional token info

2003-10-21 Thread Pierrick Brihaye
Hi,

Erik Hatcher a écrit:

Is anyone doing anything interesting with the Token.setPositionIncrement 
during analysis?
I think so :-) Well... my Arabic analyzer is based on this functionality.

The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the same 
word.

But it's practically impossible to formulate a Query that can take 
advantage of this.  A PhraseQuery can't, because Terms don't have 
positional info (only the transient Tokens do)
Correct!

I've made a dirty patch for the QueryParser which is able to handle 
tokens with positionIncrement equal to 0 or 1 (see bug #23307). It still 
needs some work, but it fits my needs :-)
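The position-increment mechanics behind this can be sketched without the Lucene API; the token type and the stems below are illustrative, not Lucene's:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionDemo {
    // A minimal stand-in for an analyzer token: term text plus position increment.
    record Tok(String term, int posIncr) {}

    // Resolve each token's absolute position: an increment of 0 stacks a token
    // on the same position as the previous one (several stems, one word).
    static List<String> positions(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posIncr();
            out.add(t.term() + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two candidate stems for the second word share position 1.
        List<Tok> toks = List.of(
            new Tok("kataba", 1),
            new Tok("ktb", 1),
            new Tok("katab", 0));
        System.out.println(positions(toks)); // [kataba@0, ktb@1, katab@1]
    }
}
```

A query hitting either stem then matches at the same position, which is what the QueryParser patch has to cope with.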

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?
Who knows? It may be interesting to keep track of the *presence* of 
empty words, e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional reduction 
to "sky blue" is maybe over-simplistic for some cases...

Well, just an idea.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Compound expression extraction

2003-10-21 Thread MOYSE Gilles (Cetelem)
Hi.

I'm trying to extract expressions from the terms' position information, i.e.,
if two words appear frequently side-by-side, then we can consider that the
two words form a single term. For instance, 'Object' and 'Oriented' appear
side-by-side 9 times out of 10. It allows us to define a new expression,
'Object_Oriented'.
Does anyone know a statistical method to detect such expressions?
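One common statistic for this is pointwise mutual information (PMI) over adjacent word pairs; a minimal sketch in plain Java, not tied to Lucene's term-position API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Collocations {
    // Pointwise mutual information of a bigram: log( P(a,b) / (P(a) * P(b)) ).
    // A high PMI means the pair co-occurs far more often than chance would
    // predict, suggesting a fixed expression such as "Object Oriented".
    static double pmi(List<String> tokens, String a, String b) {
        Map<String, Integer> uni = new HashMap<>();
        Map<String, Integer> bi = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            uni.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size())
                bi.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        int n = tokens.size();
        double pa = uni.getOrDefault(a, 0) / (double) n;
        double pb = uni.getOrDefault(b, 0) / (double) n;
        double pab = bi.getOrDefault(a + " " + b, 0) / (double) (n - 1);
        return Math.log(pab / (pa * pb));
    }
}
```

Pairs whose PMI exceeds some threshold would be merged into a single term at index time. On small corpora a log-likelihood ratio is usually more robust than raw PMI.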

Thanks.

Gilles Moyse

-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2003 09:24
To: Lucene Users List
Subject: Re: Lucene on Windows


 The CVS version of Lucene has a patch that allows one to use a
 'Compound Index' instead of the traditional one.  This reduces the
 number of open files.  For more info, see/make the Javadocs for
 IndexWriter.

Interesting option. Do you have a rough idea of what the performance
impact of using this setting is?

--
Eric Jain




Re: Lucene on Windows

2003-10-21 Thread Otis Gospodnetic
A very rough and simple 'add a single document to the index' test shows
that the Compound Index is marginally slower than the traditional one.
I did not test searching.

Otis

--- Eric Jain [EMAIL PROTECTED] wrote:
  The CVS version of Lucene has a patch that allows one to use a
  'Compound Index' instead of the traditional one.  This reduces the
  number of open files.  For more info, see/make the Javadocs for
  IndexWriter.
 
 Interesting option. Do you have a rough idea of what the performance
 impact of using this setting is?
 
 --
 Eric Jain
 
 
 






Re: positional token info

2003-10-21 Thread Erik Hatcher
On Tuesday, October 21, 2003, at 03:36  AM, Pierrick Brihaye wrote:
The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the 
same word.
Right.  Like I said, I recognize the benefits of using a position 
increment of 0.

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?
Who knows? It may be interesting to keep track of the *presence* of 
empty words, e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional 
reduction to "sky blue" is maybe over-simplistic for some cases...
But, how would you actually *use* an index that was indexed with the 
holes noted by > 1 position increments?

	Erik



Re: positional token info

2003-10-21 Thread Steve Rowe
Erik,

I've submitted a patch (BUG# 23730) very similar to yours, in response 
to a request to fix phrases matching where they should not:

http://mail-archive.com/[EMAIL PROTECTED]/msg04349.html

Bug #23730:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730
 But, how would you actually *use* an index that was indexed with the
 holes noted by > 1 position increments?
As the lucene-user email linked above notes, setting the position 
increment prevents false phrase matching.

Steve Rowe




Re: Hierarchical document

2003-10-21 Thread Peter Keegan
One way to implement hierarchical documents is through the use of
predefined phrases. Consider the 2 hierarchies:

1. Kids_and_Teens/Computers/Software/Games
2. Computers/Software/Freeware

When indexing a document belonging to (1), add these terms in consecutive
order (position increment 1): dir:Top dir:Kids_and_Teens dir:Computers
dir:Software dir:Games dir:Bottom

For documents belonging to (2), add: dir:Top dir:Computers dir:Software
dir:Bottom

The terms dir:Top and dir:Bottom can be used to anchor a query
to a specific portion of the hierarchy.

Thus, a query containing the phrase "dir:Computers dir:Software" would
match documents in both (1) and (2) (and perhaps others), but a query for
"dir:Top dir:Kids_and_Teens dir:Computers dir:Software" would target only
'Computers/Software' documents from the 'Kids_and_Teens' top level directory.
(The PhraseQuery 'slop factor' would be set to 0.)
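The term expansion in this scheme can be sketched as plain Java; the class and method names are illustrative, not from Lucene:

```java
import java.util.ArrayList;
import java.util.List;

public class DirTerms {
    // Expand a category path into the anchored term sequence described above:
    // dir:Top, one term per path component, dir:Bottom. Indexed one position
    // apart, an exact phrase query (slop 0) over any sub-sequence matches
    // exactly that portion of the hierarchy.
    static List<String> anchoredTerms(String path) {
        List<String> terms = new ArrayList<>();
        terms.add("dir:Top");
        for (String part : path.split("/"))
            terms.add("dir:" + part);
        terms.add("dir:Bottom");
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(anchoredTerms("Kids_and_Teens/Computers/Software/Games"));
        // [dir:Top, dir:Kids_and_Teens, dir:Computers, dir:Software, dir:Games, dir:Bottom]
    }
}
```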

Peter

- Original Message - 
From: Tatu Saloranta [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, October 20, 2003 8:24 PM
Subject: Re: Hierarchical document


 On Monday 20 October 2003 10:31, Erik Hatcher wrote:
  On Monday, October 20, 2003, at 11:06  AM, Tom Howe wrote:
  There is not a more Lucene way to do this - it's really up to you to
  be creative with this.  I'm sure there are folks that have implemented
  something along these lines on top of Lucene.  In fact, I have a
  particular interest in doing so at some point myself.  This is very
  similar to the object-relational issues surrounding relational
  databases - turning a pretty flat structure into an object graph.
  There are several ideas that could be explored by playing tricks with
  fields, such as giving them a hierarchical naming structure and
  querying at the level you like (think Field.Keyword and PrefixQuery,
  for example), and using a field to indicate type and narrowing queries
  to documents of the desired type.
 
  I'm interested to see what others have done in this area, or what ideas
  emerge about how to accomplish this.

 I'm planning to do something similar. In my case the problem is a bit simpler;
 documents have associated products, and products form a hierarchy.
 Searches should be able to match not only direct matches (searching
 product, article associated with product), but also indirect ones via
 membership (product member of a product group, matching group).
 Product hierarchy also has variable depth.

 To do searches using non-leaf hierarchy items (groups), all actual product
 items/groups associated with docs are expanded to full ids when
 indexing (ie. they contain path from root, up to and including node,
 each node component having its own unique id).
 Thus, when searching for an intermediate node (product grouping),
 match occurs since that node id is part of path to products that are in
 the group (either directly or as members of sub-groups).
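The full-id expansion described here can be sketched as plain string handling; the parent map and the '/' separator are made-up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PathIds {
    // parents maps a child node id to its parent node id (absent = root).
    // Expanding a node into its full root-to-node id path means a query for
    // any ancestor id matches documents indexed under its descendants.
    static String fullId(Map<String, String> parents, String node) {
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = parents.get(n))
            path.add(0, n);
        return String.join("/", path);
    }

    public static void main(String[] args) {
        Map<String, String> parents =
            Map.of("games", "software", "software", "computers");
        System.out.println(fullId(parents, "games")); // computers/software/games
    }
}
```

A document tagged with the leaf "games" is indexed under "computers/software/games", so searching for the intermediate group "software" still finds it.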

 Since no such path is stored (directly) in the database, this also allows
 me to do queries that would be impossible to do in the database (I could
 add similar path/full id fields for search purposes, of course). Thus, the
 Lucene index is optimized for searching purposes, and the database
 structure for editing and retrieval of data.

 Another thing to keep in mind is that at least for metadata it may make
 sense to use a specialized analyzer, one that tokenizes specific ids
 as separate tokens, instead of using some standard plain text analyzer.
 This way it is possible to separate ids from textual words (by using
 prefixes, for example, @1253 or #13945); this allows for accurate
 matching based on the identity of associated metadata selections.

 -+ Tatu +-








Expression Extractions

2003-10-21 Thread MOYSE Gilles (Cetelem)
I've found something about expression extraction (the ability, when one word
and another appear frequently side-by-side, to detect that they form an
expression): http://www.miv.t.u-tokyo.ac.jp/papers/matsuoFLAIRS03.pdf

Gilles Moyse


Re: Lucene on Windows

2003-10-21 Thread Doug Cutting
Tate Avery wrote:
You might have trouble with too many open files if you set your mergeFactor
too high.  For example, on my Win2k box, I can go up to mergeFactor=300 (or
so).  At 400 I get a 'too many open files' error.  Note: the default
mergeFactor of 10 should give no trouble.
Please note that it is never recommended that you set mergeFactor 
anywhere near this high.  I don't know why folks do this.  It really 
doesn't make indexing much faster, and it makes searching slower if you 
don't optimize.  It's a bad idea.  The default setting of 10 works 
pretty well.  I've also had good experience setting it as high as 50 on 
big batch indexing runs, but do not recommend setting it much higher 
than that.  Even then, this can cause problems if you need to use 
several indexes at once, or you have lots of fields.

Doug



Weird NPE in RAMInputStream when merging indices

2003-10-21 Thread petite_abeille
Hello,

What could cause such weird exception?

RAMInputStream.<init>: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.lucene.store.RAMInputStream.<init>(RAMDirectory.java:217)
at org.apache.lucene.store.RAMDirectory.openFile(RAMDirectory.java:182)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:378)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:298)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:313)

I don't know if this is a one-off, as I cannot reproduce this problem 
nor have I seen it before, but I thought I could as well ask.

This is triggered by merging a RAMDirectory into a FSDirectory. Looking 
at the RAMDirectory source code, this exception seems to indicate that 
the file argument to the RAMInputStream constructor is null... how 
could that ever happen?

Here is the code which triggers this weirdness:

this.writer().addIndexes( new Directory[] { aRamDirectory } );

The RAM writer is checked before invoking this code to make sure there 
is some content in the RAM directory:

aRamWriter.docCount() > 0

This has been working very reliably since the dawn of time, so I'm a 
little bit at loss as how to diagnose this weird exception...

Any ideas?

Thanks.

Cheers,

PA.



Re: positional token info

2003-10-21 Thread Tatu Saloranta
On Tuesday 21 October 2003 17:31, Otis Gospodnetic wrote:
  It does seem handy to avoid exact phrase matches on "phone boy" when
  a stop word is removed though, so patching StopFilter to put in the
  missing positions seems reasonable to me currently.  Any objections
  to that?

 So "phone boy" would match documents containing "phone the boy"?  That

Hmmh. WWGD (What Would Google Do)? :-)

 doesn't sound right to me, as it assumes what the user is trying to do.
 Wouldn't it be better to allow the user to decide what he wants?
 (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
 boy"~2 also returns documents containing "phone the boy".)

As long as phrase queries work appropriately with proximity modifiers, one
alternative (from the app standpoint) would be to:

(a) Tokenize stop words out, adding a skip value; either one per stop word,
  or one per non-empty sequence of stop words ("top of the world" might
  make sense to tokenize as top - world, "-" signifying a 'hole')
(b) With phrase queries, first do an exact match.
(c) If the number of matches is too low (whatever the definition of low is),
  use a phrase query with a slop of 2 instead.
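Step (a) can be sketched outside Lucene; the stop list and the record type below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopHoles {
    record Tok(String term, int posIncr) {}

    // Drop stop words but record the hole by growing the next surviving
    // token's position increment, instead of silently closing the gap.
    static List<Tok> dropStops(List<String> words, Set<String> stops) {
        List<Tok> out = new ArrayList<>();
        int incr = 1;
        for (String w : words) {
            if (stops.contains(w)) {
                incr++;            // widen the gap rather than emit a token
            } else {
                out.add(new Tok(w, incr));
                incr = 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = dropStops(
            List.of("top", "of", "the", "world"), Set.of("of", "the"));
        System.out.println(toks);
        // [Tok[term=top, posIncr=1], Tok[term=world, posIncr=3]]
    }
}
```

With "world" three positions past "top", an exact phrase query for "top world" (slop 0) no longer matches, while the slop-2 fallback in (c) does.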

The tricky part would be to do the same for combination queries, where it's
not easy to check matches for individual query components.

Perhaps it'd be possible to create Yet Another Query object that would,
given a threshold, do one or two searches (as described above), to allow
for self-adjusting behaviour?
Or perhaps there should be a container query that could execute an ordered
sequence of sub-queries until one returns a good enough set of matches, then
return that set (or the last result(s), if there are no good matches); the
above-mentioned 'sloppy if need be' phrase query would just be a special case?

-+ Tatu +-

