Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-25 Thread Chris Hostetter
:  Whats the desired pattern of using of TermInfosWriter.indexInterval ?
:
: There isn't one.  It is not a part of the public API.  It is an
: unsupported internal feature.

: It was never public.  It used to be static and final, but is now an
: instance variable.

: The place to put getter/setters would be IndexWriter, since that's the
: public home of all other index parameters.  Some changes to
: DocumentWriter and SegmentMerger would be required to pass this value
: through to TermInfosWriter from IndexWriter.

I don't really understand what this variable does, but from what I do
understand: changing its value can have significant performance impacts
depending on the nature of the data being indexed.  That leads me to
believe that making it configurable would be a good idea, but it raises
some questions:

 1) If making it mutable requires changes to other classes to propagate
it, then why is it now an instance variable instead of a static?
(Presumably making it an instance variable allows subclasses to
override the value, but if other classes have internal expectations
of the value, that doesn't seem safe)

 2) Should it be configurable through a get/set method, or through a
system property?
(which rehashes the instance/global question)

 3) Is it important that a writer updating an existing index use the same
value as the writer that initially created the index?  If so, should
there really be a preferredIndexInterval variable which is mutable,
and a currentIndexInterval which is set to the value of the index
currently being updated -- such that preferredIndexInterval is used when
making an index from scratch and currentIndexInterval is used when
adding segments to an existing index?  (Sketched below.)
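
For illustration only, a rough sketch of the split question 3 proposes
(these names and accessors are hypothetical; nothing like this exists in 1.4.x):

   int preferredIndexInterval = 128;  // used when building an index from scratch
   int currentIndexInterval   = -1;   // what an existing index was built with, if known

   int effective = (currentIndexInterval > 0)
       ? currentIndexInterval       // stay consistent with the existing index
       : preferredIndexInterval;    // brand new index
   // IndexWriter would then have to pass "effective" through DocumentWriter
   // and SegmentMerger down to TermInfosWriter.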



-Hoss





Re: Lucene in the Humanities

2005-02-22 Thread Chris Hostetter

:  Just curious: it would seem easier to use multiple fields for the
:  original case and lowercase searching. Is there any particular reason
:  you analyzed the documents to multiple indexes instead of multiple
:  fields?
: 
:  I considered that approach, however to expose QueryParser I'd have to
:  get tricky.  If I have title_orig and title_lc fields, how would I
:  allow freeform queries of title:something?

Why have separate fields?

Why not index the title into the title field twice: once with each term
lowercased and once with the case left alone (using an analyzer that
tokenizes "The Quick BrOwN fox" as [the] [quick] [brown] [fox] [The]
[Quick] [BrOwN] [fox]).

Then at search time, depending on the value of the checkbox, construct
your QueryParser using the appropriate Analyzer.

The only problem I can think of would be inflated scores for terms that
are naturally lowercased, because they would wind up getting added to the
index twice; but based on what I've seen of the data you are working
with, I imagine that if you used UPPERCASE instead of lowercase you
could drastically reduce the likelihood of any problems with that.
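
For reference, a rough sketch of such a dual-casing analyzer filter (untested,
written against the 1.4-era TokenStream API; it emits the original-case token
at the same position instead of appending it at the end, and the class name is
made up):

   import java.io.IOException;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenFilter;
   import org.apache.lucene.analysis.TokenStream;

   /** Emits each token twice: once lowercased, once with its original case. */
   public class DualCaseFilter extends TokenFilter {
     private Token pending;   // original-case copy waiting to be emitted

     public DualCaseFilter(TokenStream in) {
       super(in);
     }

     public Token next() throws IOException {
       if (pending != null) {
         Token t = pending;
         pending = null;
         return t;
       }
       Token t = input.next();
       if (t == null) return null;
       String text = t.termText();
       String lower = text.toLowerCase();
       if (lower.equals(text)) {
         return t;   // naturally lowercase; no point emitting it twice
       }
       // queue an original-case copy at the same position
       pending = new Token(text, t.startOffset(), t.endOffset());
       pending.setPositionIncrement(0);
       return new Token(lower, t.startOffset(), t.endOffset());
     }
   }

Wrap your tokenizer in that at index time; at search time hand QueryParser a
lowercasing Analyzer or a case-preserving one depending on the checkbox.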



-Hoss





Re: RangeQuery With Date

2005-02-07 Thread Chris Hostetter
: Your dates need to be stored in lexicographical order for the RangeQuery
: to work.
:
: Index them using this date format: YYYYMMDD.
:
: Also, I'm not sure if the QueryParser can handle range queries with only
: one end point. You may need to create this query programmatically.

and when creating them programmatically, you need to use the exact same
format they were indexed in.  Assuming I've correctly guessed what your
indexing code looks like, you probably want...

Query query = new RangeQuery(null, new Term("modified", "2004"), false);
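
For completeness, a sketch of the indexing side -- store the date as an
untokenized keyword in that same YYYYMMDD format ("modified" is just an
example field name):

   doc.add(Field.Keyword("modified", "20040216"));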




-Hoss





Re: Document numbers and ids

2005-02-06 Thread Chris Hostetter
:  care about their content. I only want to know a particular numeric
:  field from
:  document (id of document's category).
:  I also need to know how many docs in category were found, so I can't
:  index

: You should explore the use of IndexReader.  Index your documents with
: category id field, and use the methods on IndexReader to find all
: unique categories (TermEnum).

to expand on Erik's suggestion: once you know the complete list of
categories, you iterate over them and execute your search once per
category, filtering each time on the category id (to determine the number
of results from that category).
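
A rough sketch of that loop (untested; assumes the field is named "category"
and that userQuery is the already-built query):

   IndexReader reader = IndexReader.open("/path/to/index");
   IndexSearcher searcher = new IndexSearcher(reader);

   TermEnum te = reader.terms(new Term("category", ""));
   try {
     do {
       Term t = te.term();
       if (t == null || !"category".equals(t.field())) break;
       // restrict the user's query to this one category and count the hits
       Hits hits = searcher.search(userQuery, new QueryFilter(new TermQuery(t)));
       System.out.println(t.text() + ": " + hits.length() + " docs");
     } while (te.next());
   } finally {
     te.close();
   }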



-Hoss





Re: Starts With x and Ends With x Queries

2005-02-06 Thread Chris Hostetter

: book Managing Gigabytes, making *string* queries drastically more
: efficient for searching (though also impacting index size).  Take the
: term cat.  It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for *at* would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for at* as a
: PrefixQuery.  A wildcard in the middle of a string like c*t would
: become a prefix query for t$c*.

That's a pretty slick trick.

Considering how many Terms the index would wind up containing in order to
denormalize the data in that way, I wonder if it would be more practical
to index each of the characters as a separate term, with the word repeated
after the end-of-word character, making wildcard searches into phrase
searches (after doing the preprocessing and rotating as you described).

Ie, index cat as:   c a t $ c a t
  search for *at* as a phrase search for a t
  search for *at  as a phrase search for a t $
  search for c*t  as a phrase search for t $ c

...I'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but I'm not sure if it would actually be any
faster.  It depends on the algorithm/performance of PhraseQuery -- which is
something I haven't really looked into.  It could very well be
significantly slower.
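
For what it's worth, the query side of that scheme is just ordinary
PhraseQuery construction (a sketch; "word_chars" is a made-up field holding
the single-character tokens):

   // *at  ==  phrase search for [a] [t] [$]
   PhraseQuery pq = new PhraseQuery();
   pq.add(new Term("word_chars", "a"));
   pq.add(new Term("word_chars", "t"));
   pq.add(new Term("word_chars", "$"));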


-Hoss





Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-04 Thread Chris Hostetter

Another approach...

You can make a Filter that is the inverse of the output from another
filter, which means you can make a QueryFilter on the search, then wrap it
in your inverse Filter.

You can't execute a query on a filter without having a Query object, but
you can just apply the Filter directly to an IndexReader yourself, and get
back a BitSet containing the docIds of every document that does not contain
your term.

something like this should work...

   class NotFilter extends Filter {
     private Filter wrapped;
     public NotFilter(Filter w) {
       wrapped = w;
     }
     public BitSet bits(IndexReader r) throws IOException {
       BitSet b = wrapped.bits(r);
       b.flip(0, b.size());
       return b;
     }
   }
   ...
   BitSet results =
     (new NotFilter(new QueryFilter(new TermQuery(new Term("f", "x")))))
       .bits(reader);




: Date: Thu, 3 Feb 2005 19:51:36 +0100
: From: Kelvin Tan [EMAIL PROTECTED]
: Reply-To: Lucene Users List lucene-user@jakarta.apache.org
: To: Lucene Users List lucene-user@jakarta.apache.org
: Subject: Re: Parsing The Query: Every document that doesn't have a field
: containing x
:
: Alternatively, add a dummy field-value to all documents, like
doc.add(Field.Keyword("foo", "bar"))
:
: Waste of space, but allows you to perform negated queries.
:
: On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote:
:  Negating a term must be combined with at least one nonnegated
:  term to return documents; in other words, it isn't possible to
:  use a query like NOT term to find all documents that don't
:  contain a term.
: 
:  So does that mean the above example wouldn't work?
: 
:  Exactly. You cannot search for -kcfileupload:jpg, you need at
:  least one clause that actually _includes_ documents.
: 
:  Do you by chance have a field with known contents? If so, you could
:  misuse that one and include it in your query (perhaps by doing
:  range or wildcard/prefix search). If not, try IndexReader.terms()
:  for building a Query yourself, then use that one for search.
:



-Hoss





Re: Starts With x and Ends With x Queries

2005-02-04 Thread Chris Hostetter

: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery.  However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).

On the issue of ends with wildcard queries, I wanted to throw out an
idea that I've seen used to deal with matches like this in other systems.
I've never actually tried this with Lucene, but I've seen it used
effectively with other systems where the goal is to sort strings by the
least significant (ie: right most) characters first.  I think it could
apply nicely to people who have compelling needs for efficient 'ends with'
queries.



Imagine you have a field called name, which you can already do efficient
prefix matching on using the PrefixQuery class.  Your docs and query may
look something like this...

   D1 name:Adam Smith age:13 state:CA ...
   D2 name:Joe Bob age:42 state:WA ...
   D3 name:John Adams age:35 state:NV ...
   D4 name:Sue Smith age:33 state:CA ...

...and your queries may look something like...

   Query q1 = new PrefixQuery(new Term("name", "J"));
   Query q2 = new PrefixQuery(new Term("name", "Sue"));

If you want to start doing suffix queries (ie: all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcardQuery, which as Erik mentioned, will allow you to use a query Term
that starts with a *, ie...

   Query q3 = new WildcardQuery(new Term("name", "*s"));
   Query q4 = new WildcardQuery(new Term("name", "*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say you
can't.  I'll assume the docs are wrong and Erik is correct.)

The problem is that this is horrendously inefficient.  In order to find
the docs that contain Terms which match your suffix, WildcardQuery must
first identify what all of those Terms are, by iterating over every Term
in your index to see if they match the suffix.  This is much slower than a
PrefixQuery, or even a WildcardQuery that has just one initial character
before a * (ie: s*foobar), because it can then seek directly to the
first Term that starts with that character, and also stop iterating as
soon as it encounters a Term that no longer begins with that character.

Which leads me to my point: if you denormalize your data so that you store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when doing
a SuffixQuery...

   D1 name:Adam Smith rname:htimS madA age:13 state:CA ...
   D2 name:Joe Bob rname:boB oeJ age:42 state:WA ...
   D3 name:John Adams rname:smadA nhoJ age:35 state:NV ...
   D4 name:Sue Smith rname:htimS euS age:33 state:CA ...

   Query q1 = new PrefixQuery(new Term("name", "J"));
   Query q2 = new PrefixQuery(new Term("name", "Sue"));
   Query q3 = new PrefixQuery(new Term("rname", "s"));
   Query q4 = new PrefixQuery(new Term("rname", "htimS"));
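
The index side only needs to add the reversed copy of each value -- a sketch
(whether you reverse the whole value or each token depends on how the name
field is analyzed):

   String name = "Adam Smith";
   String rname = new StringBuffer(name).reverse().toString();   // "htimS madA"
   doc.add(Field.Text("name", name));
   doc.add(Field.Text("rname", rname));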


(If anyone sees a flaw in my theory, please chime in)


-Hoss





Re: How do I delete?

2005-02-01 Thread Chris Hostetter

: anywhere.  I checked the count coming back from the delete operation and
: it is zero.  I even tried to delete another unique term with similar
: results.

First off, are you absolutely certain you are closing the reader?  It's
not in the code you listed.

Second, I'd bet $1 that when your documents were indexed, your reference
field was analyzed and parsed into multiple terms.  Did you try searching
for the Term you're trying to delete by?

(I hear luke is a pretty handy tool for checking exactly which Terms are
in your index)

: Here is the delete and associated code:
: 
:   reader = IndexReader.open(database);
:
:   Term t = new Term("reference", reference);
:   try {
:     reader.delete(t);
:   } catch (Exception e) {
:     System.out.println("Delete exception: " + e);
:   }


-Hoss





Re: Reloading an index

2005-01-27 Thread Chris Hostetter

: processes ended.  If you're under linux, try running the 'lsof'
: command to see if there are any handles to files marked (deleted).

:  Searcher, the old Searcher is closed and nulled, but I
:  still see about twice the amount of memory in use well
:  after the original searcher has been closed.   Is
:  there something else I can do to get this memory
:  reclaimed?  Should I explicitly call garbarge
:  collection?  Any ideas?

In addition to the previous advice, keep in mind that depending on the
implementation of your JVM, it may never actually free memory back to
the OS.  And even the JVMs that can, only do it after a GC which results
in a ratio of unused/used memory that they deem worthy of freeing (usually
based on tuning parameters).

assuming you are using a Sun JVM, take a look at...

http://java.sun.com/docs/hotspot/gc1.4.2/index.html

...and search for MinHeapFreeRatio and MaxHeapFreeRatio


-Hoss





Re: Opening up one large index takes 940M or memory?

2005-01-21 Thread Chris Hostetter
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.


: IndexReader ir = IndexReader.open( dir );
: System.out.println( ir.getClass() );
: long after = System.currentTimeMillis();
: System.out.println( "opening...done - duration: " +
: (after-before) );
:
: System.out.println( "totalMemory: " +
: Runtime.getRuntime().totalMemory() );
: System.out.println( "freeMemory: " +
: Runtime.getRuntime().freeMemory() );
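
Something along these lines (a sketch) would show whether the footprint
settles down -- run with -verbose:gc and re-check memory after prodding the
collector a few times:

   for (int i = 0; i < 10; i++) {
     System.gc();
     try { Thread.sleep(10 * 1000); } catch (InterruptedException ignored) {}
     System.out.println("totalMemory: " + Runtime.getRuntime().totalMemory()
         + " freeMemory: " + Runtime.getRuntime().freeMemory());
   }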





-Hoss





Re: Why IndexReader.lastModified(index) is depricated?

2005-01-19 Thread Chris Hostetter

: Why IndexReader.lastModified(index) is depricated?

Did you read the javadocs?

   Synchronization of IndexReader and IndexWriter instances is no longer
   done via time stamps of the segments file since the time resolution
   depends on the hardware platform. Instead, a version number is
   maintained within the segments file, which is incremented everytime
   when the index is changed.

: It's always a good idea to know when the index changed last time, for

That's a good point, and you can still get that information using the same
underlying method IndexReader.lastModified did/does...

 directory.fileModified("segments");

...it's just no longer crucial that IndexReader have that information.



-Hoss





Re: lucene integration with relational database

2005-01-18 Thread Chris Hostetter

: Thanks for your tips. I am trying to get a more thorough understanding
: why this would be better.

1) Give serious consideration to just putting all of your data in Lucene
for the purposes of searching.  The initial example mentioned employees
and salaries, and wanted to search for employees with certain names and
salaries > $X ... Lucene can do the "salary > $X" part using a RangeFilter.

2) assuming you *must* combine your lucene query with your SQL query...

When your goal is performance, I don't think you'll ever be able to
find a truly generic solution for all situations -- the specifics matter.


For example:

  a) is your goal specifically to discount lucene results that don't meet
 a criteria specified in your DB?
  b) do you care about having an accurate number of total matches, or do
 you only care about filtering out results?

Depending on the answers, a fairly fast way to eliminate results is to
only worry about the page of results you are looking at.  Consider an
employee search application which displays 10 results per page.  First you
do a Lucene search by name, then you want to throw out any employees whose
salary is below $X.  Use the Hits object from the Lucene search to get the
unique IDs for the first 10 employees (which uses a very small, fixed
amount of memory and time, regardless of how big your index/result is),
then do a lookup in your DB using a query built from those 10 IDs, a la:

   select ... from ... where ID in (1234, 5678 ... 7890)

...(which should also be very fast assuming your DB has a primary key on
ID)

If the 10 IDs all match your SQL query then you're done.  If N don't match
your query, then you need to find the next N results from Hits that do; so
just repeat the steps above until you've gotten 10 viable results.

(Given good statistics on your data, you can virtually eliminate the need
to execute more than a few iterations ... if nothing else, you can use the
ratio of misses/hits from the first SQL query -- N of 10 didn't match --
to decide how big to make your second query to ensure you'll get N good
ones.)
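
A rough sketch of that page-at-a-time loop (the "id" field name and the
salaryOk() DB check are hypothetical):

   Hits hits = searcher.search(nameQuery);
   List page = new ArrayList();             // ids that survived the DB check
   int i = 0;
   while (page.size() < 10 && i < hits.length()) {
     String id = hits.doc(i++).get("id");
     if (salaryOk(id)) {                    // e.g. batched "SELECT ... WHERE id IN (...)"
       page.add(id);
     }
   }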


-Hoss





Re: Best way to find if a document exists, using Reader ...

2005-01-17 Thread Chris Hostetter

: 1) Adding 250K documents took half an hour for lucene.
: 2) Deleting and adding same 250K documents took more than 50 minutes. In my
: test all 250K objects are new so there is nothing to delete.
:
: Looks like there is no other way to make it fast.

I bet you can find an improvement in the specific case -- but probably not
in the general case.

Let's summarize your current process in pseudo code:

   open an existing index (which may be empty)
   foreach item in very large set of items:
  id = item.getId()
  index.delete(id)
  index.add(id)

...except that 99% of the time, that delete isn't necessary, right?

So what if you traded space for time, and kept an in-memory cache of all
IDs in your index?

   open an existing index (which may be empty)
   cache = all ids in TermDoc iterator of id field;
   foreach item in very large set of items:
  id = item.getId()
  if cache.contains(id):
 index.delete(id)
  cache.add(id)
  index.add(id)

... assuming you have enough RAM to keep a HashMap of every id in your
index around, I'm fairly confident that would be faster than doing the
delete every time.
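
A sketch of that (assumes the unique key is stored in an "id" field; like the
pseudo code above, it glosses over the IndexReader/IndexWriter juggling that
deletes require):

   Set cache = new HashSet();
   TermEnum te = reader.terms(new Term("id", ""));
   do {
     Term t = te.term();
     if (t == null || !"id".equals(t.field())) break;
     cache.add(t.text());
   } while (te.next());
   te.close();

   // later, for each item:
   String id = item.getId();
   if (cache.contains(id)) {
     reader.delete(new Term("id", id));    // only pay for the delete when needed
   }
   cache.add(id);
   writer.addDocument(makeDocument(item)); // makeDocument() is a hypothetical helper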


-Hoss





Re: How to get all field values from a Hits object?

2005-01-17 Thread Chris Hostetter

: is it possible to get all different values for a
: Field from a Hits object and how to do this?

The wording of your question suggests that the Field you are interested in
isn't a field which will have a fairly unique value for every doc (ie: not
a title, more likely an author or category field).  Starting with
that assumption, there is a fairly efficient way to get the information
you want...

Assuming the total set of values for the Field you are interested in is
small (relative to your index size), you can pre-compute a BitSet for
each value indicating which docs match that value in the Field (using a
TermFilter).  Then store those BitSets in a Map (keyed by field value).

Every time a search is performed, use a HitCollector that generates a
BitSet containing the documents in your result; AND that BitSet against (a
copy of) each BitSet in your Map.  All of the resulting BitSets with a
non-zero cardinality represent values in your results.  (As an added bonus,
the cardinality() of each BitSet is the total number of docs in your
result that contain that value.)

Two caveats:
   1) Every time you modify your index, you have to regen the
  BitSets in your Map.
   2) You have to know the set of all values for the field you are
  interested in.  In many cases, this is easy to determine from the
  source data while building the index, but it's also possible to
  get it using IndexReader.termDocs(Term).


(I'm doing something like this to provide ancillary information about which
categories of documents are most common in the user's search result, and
what the exact number of documents in those categories is.)
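
A rough sketch of the approach (untested; assumes the field is named
"category", knownValues holds every distinct value, and reader/searcher/query
already exist):

   Map valueBits = new HashMap();   // value -> BitSet of docs containing it
   for (Iterator it = knownValues.iterator(); it.hasNext();) {
     String v = (String) it.next();
     Filter f = new QueryFilter(new TermQuery(new Term("category", v)));
     valueBits.put(v, f.bits(reader));
   }

   // per search: collect the result docs into a BitSet...
   final BitSet results = new BitSet(reader.maxDoc());
   searcher.search(query, new HitCollector() {
     public void collect(int doc, float score) { results.set(doc); }
   });

   // ...then AND it against a copy of each precomputed BitSet
   for (Iterator it = valueBits.entrySet().iterator(); it.hasNext();) {
     Map.Entry e = (Map.Entry) it.next();
     BitSet b = (BitSet) ((BitSet) e.getValue()).clone();
     b.and(results);
     if (b.cardinality() > 0) {
       System.out.println(e.getKey() + ": " + b.cardinality() + " matching docs");
     }
   }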


-Hoss





Re: stop words and index size

2005-01-13 Thread Chris Hostetter


: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.

That doesn't seem horribly surprising.

consider that for every Term in the index, Lucene is keeping track of the
list of docId/freq pairs for every document that contains that term.

Assume that something has to be in at least 25% of the docs before you
decide it's worth making it a stop word.  Your URL indicates you are
dealing with 400k docs, which means that for each stop word, the space
needed to store the int pairs for docId/freq is...

(4B + 4B) * 100,000 =~ 780KB  (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to
improve the lookup of a Term.  Assuming some of those words are in more or
less than 25% of your documents, that could easily account for a
difference of 100MB.

I suspect that an interesting exercise would be to use some of the code
I've seen tossed around on this list that lets you iterate over all Terms
and find the most common ones, to help you determine your stopword list
programmatically.  Then remove/reindex any documents that have each word as
you add it to your stoplist (one word at a time) and watch your index
shrink.
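
That iteration is short enough to sketch here -- print every term that
appears in more than a quarter of the docs:

   TermEnum te = reader.terms();
   while (te.next()) {
     Term t = te.term();
     if (te.docFreq() > reader.numDocs() / 4) {
       System.out.println(t.field() + ":" + t.text() + " in " + te.docFreq() + " docs");
     }
   }
   te.close();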




-Hoss





Re: How do I unlock?

2005-01-11 Thread Chris Hostetter
: What about a shutdown hook?

Interesting idea: at the moment the file is created on disk, the
FSDirectory could add a shutdown hook that checks for the existence of
the file and, if it's still there (implying that the lock owner failed
without releasing the lock), forcibly removes it.
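
Roughly like this (a sketch only -- the real lock file name/location used by
FSDirectory differs, and this ignores the shared-lock question below):

   final File lockFile = ...;   // whatever lock file was just created
   Runtime.getRuntime().addShutdownHook(new Thread() {
     public void run() {
       if (lockFile.exists()) {
         lockFile.delete();   // the holder died without releasing the lock
       }
     }
   });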

Of course: this assumes that LockFiles are never shared between processes
-- ie: if client A is waiting on a lock that client B is holding, does the
lock A eventually gets use the same file that B's lock was using, or does
the old lock file get deleted and a new one created?

(I don't really understand a lot of Lucene's locking code)


-Hoss





Re: Use a date field for ranking

2005-01-10 Thread Chris Hostetter
:  : have to use something that boosts the scores at _search_ time.

: Yes, I know I can boost Query objects, but that is not the same as
: boosting the document score by a factor. By boosting query objects I
: _add_ values to the score. Let me show you an example:

Well, sure it is ... you have to have some way (at search time) to
indicate that you want documents that meet a certain criteria to
have their scores affected in a certain way -- that's exactly what a Query
is.  There may not be an existing Query subclass that meets your needs
exactly, but if you want the scores of documents to be influenced
conditionally at search time, a Query object is the way to indicate that.

: If I had used a boost of 3.0 per document and left the date part of the
: query out I would have:
:
: Query 1: 0.3
: Query 2: 0.03
:
: Which maintains the original proportion. Now if I want to specify a
: function (like 1/x) that calculates the boost factor of a specific
: publish date I can't emulate this by using Query boosts because the
: query boost must be adjusted to the first part of the query to achieve
: an equal distribution for any query.

Based on a recent thread about scores, I *think* you are making an
incorrect assumption about the relative scores of documents...

http://mail-archives.apache.org/eyebrowse/SearchList?listName=lucene-user%40jakarta.apache.org&searchText=%22A+question+about+scoring+function+in+Lucene%22&defaultField=subject

...but I'll be totally honest, I'm not sure exactly what your point is.
You're talking about comparing the final scores of two different queries,
but I'm not sure if you mean the score of a specific document against two
different queries, or the score of two documents against a single query in
which one document is more relevant to the term you search for.

: date but don't contain the first part of the query. So we might use a
: query like this:
:
: (a word) AND (date:20050108^3 OR date:20050107^1)
:
: But now I have to specify _all_ possible dates in the date part to reach
: all documents the index contains. This smells ;) Because it's all only
: an emulation of the real strategy.

Well, this is why I proposed finding a feasible granularity and
age that you were comfortable with to use in picking your boosts.  If
you must have at least single day granularity, and you must provide a
gradually decreasing boost for every day back to the beginning of time,
then you are correct: my suggestion was not practical.  But if you are
willing to go with week based granularity, and only boost items from the
last 6 weeks, then you can do something like...

(a word) AND (date:[20050108 TO 20050114]^7
   OR date:[20050101 TO 20050107]^6
   OR date:[20041225 TO 20041231]^5
   OR date:[20041218 TO 20041224]^4
   OR date:[20041211 TO 20041217]^3
   OR date:[20041204 TO 20041210]^2
   OR date:[null TO 20041204]^1 )

...except that I loathe doing DateRange queries (see my first post in the
archives for why I think they are a silly/inefficient way of doing things),
which is why I suggested just using special keywords to denote which week
an item was published.

:  3) I'm sure there is a very cool and efficient way to do this using a
:  custom Similarity implimentation (which somhow causes the default score
:  to be divided by the age of the document) but i've never acctualy played
:  with the SImilarity class, so i won't say for certain it can be done that
:  way (hopefully someone else can chime in)
:
: AFAIK, Similarity can only be used on term level. But as outlined above
: I need a boost factor on document level.

You're right ... I was thinking of the Scorer class ... there was a recent
discussion about creating your own Scorer to return an arbitrary value
as the score of a (new class of) Query.  I don't know how much work
is involved, but take a look at this message...

http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=2055565

...maybe it would be easy to crank out RecentDocsScorer and
RecentQuery classes which can do what you want (by returning the date
difference between a field and now as the score of the query).

-Hoss





Re: Problems...

2005-01-07 Thread Chris Hostetter

: Stored = as-is value stored in the Lucene index
:
: Tokenized = field is analyzed using the specified Analyzer - the tokens
: emitted are indexed
:
: Indexed = the text (either as-is with keyword fields, or the tokens
: from tokenized fields) is made searchable (aka inverted)
:
: Vectored = term frequency is stored in the index in an easily
: retrievable fashion.

FYI: I've FAQed this...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ


-Hoss





Re: Use a date field for ranking

2005-01-07 Thread Chris Hostetter
: we are currently implementing a search engine for a news site. Our goal
: is to have a search result that uses the publish date of the documents
: to boost the score of the documents.

: have to use something that boosts the scores at _search_ time.

1) There is a way to boost individual Query objects (which you may then
compose into a tree of BooleanQueries); see Query.setBoost(float).

2) If you are planning to rebuild your index on a regular basis (ie:
nightly) then you can easily apply boosts to your documents when you index
them.

If you want to be able to do only incremental additions...

3) I'm sure there is a very cool and efficient way to do this using a
custom Similarity implementation (which somehow causes the default score
to be divided by the age of the document) but I've never actually played
with the Similarity class, so I won't say for certain it can be done that
way (hopefully someone else can chime in).

4) I can tell you what I came up with when I was proof-of-concepting this a
while back...

In my case, I'm willing to accept that there is some finite granularity of
time at which newer documents are no longer very much more fresh than
older documents (ie: articles from the same week are equally fresh to
me).  I also have a practical cut-off of how old things can get before they
are just plain old: 52 weeks.

With those numbers in mind, I can add a special field to each document
that indicates which week the article was published (ie: 2004w1, 2004w2,
2004w3, etc...).  At search time, my query can include a BooleanQuery of
52 clauses ORed together, each one containing the magic token for one of the
52 weeks prior to when the search was executed, each with a slightly
decreasing boost from the week before.
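
In code, that clause might look something like this (a sketch; the "pubweek"
field and weekToken() helper are made up, and whether the freshness clause is
required or merely optional is up to you):

   BooleanQuery recent = new BooleanQuery();
   for (int i = 0; i < 52; i++) {
     TermQuery tq = new TermQuery(new Term("pubweek", weekToken(i)));  // e.g. "2004w51"
     tq.setBoost(52 - i);             // this week gets the biggest boost
     recent.add(tq, false, false);    // optional clause
   }
   BooleanQuery q = new BooleanQuery();
   q.add(userQuery, true, false);     // the actual search terms, required
   q.add(recent, false, false);       // freshness boost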





-Hoss




Re: Query based stemming

2005-01-07 Thread Chris Hostetter

: Is it possible to enable stem queries on a per-query basis? It doesn't
: seem to be possible since the stem tokenizing is done during the
: indexing process. Are people basically stuck with having all their
: queries stemmed or none at all?

:  From what I've read, if you want to have a choice, the easiest way is
: to index the documents twice. Once with stemming on and once with it off
: placing the results in two different indexes.  Then at query time,
: select which index you want to use based on whether you want stemming on
: or off.

As I understand it, the intended place to implement stemming is in an
Analyzer filter (not to be confused with a search Filter).  Since you
can specify an Analyzer when you call addDocument, you don't have to
actually have two separate indexes; you could just have all the docs in
one index and use a search Filter to indicate which docs to look at.

Alternately: the Analyzer's tokenStream method is given the fieldName
being analyzed, so you could write an Analyzer with a set of rules
telling it to only apply your stemming filter to certain fields, and
then instead of having twice as many documents, you can just index your
text in two separate fields (which should be a little easier than
separate docs, because you are only duplicating the fields where stemming
is relevant).  Then at search time you don't have to filter anything; just
search the field that's applicable to your current desire (stemmed or
unstemmed).
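
A sketch of that two-field approach (SnowballAnalyzer here comes from the
Lucene sandbox -- substitute whatever stemming analyzer you actually use):

   PerFieldAnalyzerWrapper analyzer =
       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer("body_stemmed", new SnowballAnalyzer("English"));

   Document doc = new Document();
   doc.add(Field.Text("body", text));           // un-stemmed copy
   doc.add(Field.Text("body_stemmed", text));   // stemmed copy of the same text
   writer.addDocument(doc, analyzer);

   // at search time, point QueryParser at "body" or "body_stemmed" as desired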

Lastly: Although it's tricky to get correct, there's no law saying you
have to use the same Analyzer when you query as when you index.  You could
index your documents using an Analyzer that does no stemming, and then at
search time (if you want stemming) use an Analyzer that does reverse
stemming to expand your query terms out to all the possible variants.


(NOTE: I've never actually tried this, but I think the theory is sound.)


-Hoss





Re: Question about Analyzer and words spelled in different languages

2005-01-06 Thread Chris Hostetter

: Is there any already written analyzer that would take that name
: (Schamp;auml;ffer or any other name that has entities) so that
: Lucene index could searched (once the field has been indexed) for the real
: version of the name, which is
:
: Schäffer
:
: and the english spelled version of the name which is
:
: Schaffer

I don't know about the un-xml-escaping part of things (there are lots
of xml escaping libraries out there, I'm sure one of them has an unescape)
but there was a recent discussion about unicode characters that look
similar and writing an analyzer that could know about them.  The last
message in the thread was from me, pointing out that it should be easy to
build the mapping table once, and then write a quick and dirty Analyzer
filter to use it ... but no one seemed to have any code handy that
already did that...

http://mail-archives.apache.org/eyebrowse/[EMAIL 
PROTECTED]by=threadfrom=962022


-Hoss





Re: multi-threaded thru-put in lucene

2005-01-06 Thread Chris Hostetter

: This is what we found:
:
:  1 thread, search takes 20 ms.
:
:   2 threads, search takes 40 ms.
:
:   5 threads, search takes 100 ms.

How big is your index?  What are the term frequencies like in your index?
How many different queries did you try?  What was the structure of your
query objects like?  Were you using a RAMDirectory or an FSDirectory?  What
hardware were you running on?

Is your test application small enough that you can post it to the list?

I haven't done a lot of PMA testing of Lucene, but from what limited
testing I have done, I'm a little surprised at those numbers -- you'd get
results just as good if you ran the queries sequentially.


-Hoss





RE: Lucene Book in UK

2005-01-06 Thread Chris Hostetter

: I ordered my from Amazon a while back and was notified yesterday that it
: shipped. Here was my price:

really??? .. those bastards.  I ordered two copies for my work on December
10th and they still haven't shipped them.

: 1Lucene In Action (In Action)   $27.17  1   $27.17

Hmm, they only charged me $26.37 each ... but Amazon has been known to
experiment with price points.  (On my browser, they're currently showing a
discounted price of $38.40.)

I can tell you that on December 10th, Amazon's list price was roughly
the same as Manning's, hence I was about to order from Manning and get the
free ebook, when I realized I was looking at the list price and not the
Amazon price.  With Amazon's free shipping it was cheaper to buy the two
paper copies from Amazon *and* give Manning the $22 for the ebook.

: Does anyone know why Amazon.com lists the list price for Lucene in
: Action as $60.95? Bookpool.com has the list price as $44.95, which is
: the price that Manning is charging. After discounting, bookpool.com has
: it on sale for $27.50.

BN agrees that the list price is $60.95 ... which may be what Manning is
citing to resellers.


-Hoss





RE: Problems...

2005-01-06 Thread Chris Hostetter

: Hoss, could you tell me what to exceptions I'm missing?  Thanks!

Anytime you have a catch block, you should be doing something with that
exception.  If possible, you can recover from an exception, but no matter
what, you should log the exception in some way so that you know it
happened.
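
At a minimum, something like this in every catch block (a sketch; use
whatever logging mechanism you have on hand):

   try {
     hits = searcher.search(query);
   } catch (IOException e) {
     System.err.println("search failed for [" + query + "]: " + e);  // or log4j, etc.
     throw e;   // or recover -- but never silently swallow it
   }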

Your code has two places where it was catching an exception and doing
absolutely nothing at all -- allowing processing to continue without even
a warning.  There was also an area of your code where, if you encountered a
parse exception from the user input, you invented your own query instead
-- again without any sort of logging to let you know what was happening in
the code.  Building your own query when the user's query is gibberish isn't
necessarily bad, but logging is your friend.

It wasn't clear from the description of your problem what you were trying
to query for, so it was very possible that there was a problem parsing your
query, and it was doing the default search in that catch block and
giving you back zero results ... hence my question about the
System.out.println calls that *were* in your code.


Logging is (again) your friend.




-Hoss





Re: Problems...

2005-01-04 Thread Chris Hostetter


To start with, there has to be more to the search side of things than
what you included.  This search function is not static, which means it's
getting called on an object, which obviously has some internal state
(paramOffset, hits, and pathToIndex are a few that jump out at me).  What
are the values of those variables when this method gets called?

Second, there are at least two places in your code where potential
exceptions get thrown away and execution continues.  As a matter of good
practice, you should add logging to these spots to make sure you aren't
ignoring errors...

Third, you said "I'm not getting anything in the log that I can point to
that says what is not working", but what about what is/isn't in the log?
There are several System.out.println calls in this code ... I'm assuming
you're logging STDOUT; what do those messages (with variables) say?
What is the value of currentOffset on the initial search?  What does the
query.toString look like?  How many total hits are being found when the
search is executed?  (Or is that line not getting logged because the
search is getting skipped because of some initial state in paramOffset?)




-Hoss





Re: Exception: cannot determine sort type

2004-12-23 Thread Chris Hostetter
: The issue occurs if the first field it accesses parses as a numeric
: value and then successive fields are String's.  If you are mixing and

:  I am wondering why this exception might occur when the server/index is
:  under load.  I do realise there are many 'variables in the equation',
:  so
:  there probably is not an easy answer to this.

Knowing what I know about stress testing environments, I'm guessing you're
using some sort of automated load generating application, which is
generating random input from a dictionary of some kind -- possibly from
access logs of an existing system?  I'm also guessing that in some
configurations your load generator picks a random sort order independent
of the search terms it picks.

I'm also guessing that the issue has nothing to do with load ... if you
picked a single search term which you have manually tested once (sorting
by title) and know for a fact it works fine, and then you tell your load
generator to hit the index as hard as it can with that one query over and
over, it would probably work fine.

I think the problem is just that when it deals with random input and
random sort orders it (frequently) gets a result set in which the
first document has a numeric title field.


PS: I could be wrong, but if I remember right, the code that AUTO uses to
determine what sort type to use will treat it as a number if it *starts*
with something that looks like a number ... so look for titles like "1000
year plan" in your data.
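
One way to sidestep the guessing entirely is to tell Lucene the sort type
instead of relying on AUTO (a sketch):

   Sort sort = new Sort(new SortField("title", SortField.STRING));
   Hits hits = searcher.search(query, sort);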


-Hoss





RE: analyzer effecting phrases?

2004-12-23 Thread Chris Hostetter
: Therefore I turned back to the standard analyzer and now do some replacing
: of the underscores in my ID string to avoid my original problem. This solved

Maybe I'm missing something, but if you've got a field in your doc that
represents an ID, why not create that field as a non-tokenized (Keyword)
field so you don't have to worry about what characters the analyzer you're
using thinks are special?
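
Ie, something like (a sketch):

   doc.add(Field.Keyword("id", idString));   // stored/indexed as-is, never analyzed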


-Hoss





Re: (Offtopic) The unicode name for a character

2004-12-23 Thread Chris Hostetter
: However, I don't think that the names are consistent enough to permit a
: generic use of regular expressions. What Daniel is trying to achieve
: looks interesting anyway,

I'm not sure that that really matters in the long run ... I think the OP
was asking if there was a way to get the name in Java because he figured
that way he could programmatically determine what the base character was in
his application.  But that doesn't mean he needs to do this
programmatically every time his indexing/searching code sees a character
outside of LATIN-1.

It would probably make more sense to write a little one-off program that
could read in this file, and then spit out all of the non-Latin-1
characters with a guess as to which Latin-1 character could act as a
substitution (if any) based on the name of the character, and a blank for
the user to override.  This program could be run once to generate a nice,
small, efficient mapping table that could be (committed to CVS and) reused
over and over.
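
A sketch of such a one-off generator, reading the standard UnicodeData.txt
file and guessing a base letter from names like "LATIN SMALL LETTER A WITH
DIAERESIS" (file names and output format are arbitrary, and it makes no
attempt to preserve upper/lower case -- that's for the human pass):

   BufferedReader in = new BufferedReader(new FileReader("UnicodeData.txt"));
   PrintWriter out = new PrintWriter(new FileWriter("charmap.txt"));
   String line;
   while ((line = in.readLine()) != null) {
     String[] f = line.split(";");
     int cp = Integer.parseInt(f[0], 16);
     String name = f[1];
     if (cp > 0xFF && name.matches(".*LATIN (SMALL|CAPITAL) LETTER [A-Z]\\b.*")) {
       String base = name.replaceAll(".*LETTER ([A-Z])\\b.*", "$1");
       // leave the guess in a file for a human to review/override
       out.println(Integer.toHexString(cp) + "=" + base + "  # " + name);
     }
   }
   out.close();
   in.close();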

-Hoss





Re: sorting on a field that can have null values

2004-12-23 Thread Chris Hostetter

: I thought of putting empty strings instead of null values but I think
: empty strings are put first in the list while sorting which is the
: reverse of what anyone would want.

Instead of adding a field with a null value, or the value of an empty string,
why not just leave the field out for that/those doc(s)?

There's no requirement that every doc in your index has to have the exact
same set of fields.

If I remember correctly (you'll have to test this) sorting on a field
which doesn't exist for every doc does what you would want (docs with
values are listed before docs without).



-Hoss





Re: To Sort or not to Sort

2004-12-16 Thread Chris Hostetter
: In my application, users search for messages with Lucene.  Typically,
: they are more interested in seeing their hits in date-order than in
: relevance-order.  In reading my ebook copy of Lucene in action (wish
: I'd had that a year ago), I find that one of the features added in 1.4
: was the ability to ask for hits in an order based on a field.  It also
: looks like adding the field necessary to get things by date order is
: straight forward.

When considering issues like this, it's important to consider what is
really important to your users: do they really want to see items strictly
ordered by date, or do they want to see results sorted by relevancy --
where the recentness of an item influences how relevant it is?

For example, when I search the Lucene users mailing list for RangeQuery I
want more recent messages to appear first, but I'd still prefer that a
slightly older message bubble up in the list if the Subject includes
RangeQuery and the body mentions RangeQuery dozens of times -- because
it's likely to be more relevant than more recent messages which only
mention RangeQuery once or twice -- but I don't want results that are
strictly sorted by term frequency, because then messages from 3 years ago
(and several Lucene revs ago) might be at the top of the list.

Depending on how you maintain your index, there are a couple of different
ways of achieving a goal like this.  If you rebuild regularly, then just
giving your more recent documents a higher boost is one way to go.
Another would be to use several FilteredQuery(RangeFilter) clauses with
increasing intervals of dates (ie: today OR the past week OR the past
month OR the past year) so that more recent documents match all of the
clauses, and older documents match fewer (or none); see the sketch below.
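
A sketch of that second idea (RangeFilter is the class mentioned elsewhere on
this list; the "date" field format and the cutoff values are made up):

   String[] cutoffs = { "20041215", "20041209", "20041116", "20040101" };  // today, week, month, year
   BooleanQuery recency = new BooleanQuery();
   for (int i = 0; i < cutoffs.length; i++) {
     Filter f = new RangeFilter("date", cutoffs[i], null, true, false);
     recency.add(new FilteredQuery(userQuery, f), false, false);   // optional clause
   }
   BooleanQuery q = new BooleanQuery();
   q.add(userQuery, true, false);   // base relevance
   q.add(recency, false, false);    // newer docs match more of the optional clauses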



-Hoss





Re: A question about scoring function in Lucene

2004-12-15 Thread Chris Hostetter
: I question whether such scores are more meaningful.  Yes, such scores
: would be guaranteed to be between zero and one, but would 0.8 really be
: meaningful?  I don't think so.  Do you have pointers to research which
: demonstrates this?  E.g., when such a scoring method is used, that
: thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions,
but I believe what the OP was referring to was not any intrinsic benefit in
having a score between 0 and 1, but of having a uniform normalization of
scores regardless of search terms.

For example, using the current scoring equation, if I do a search for
"Doug Cutting" and the results/scores I get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) documents #3 and #4 are both equally relevant to "Doug Cutting"

If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1

...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good).

However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
 relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
 "Doug Cutting"


I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x and y.

If they are correct, then I for one can see a definite benefit in that,
if for no other reason than making minimum score thresholds more
meaningful.



-Hoss





Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Chris Hostetter
: select * from MY_TABLE where MY_NUMERIC_FIELD  80
:
: as far as I know you have only the range query so you will have to say
:
: my_numeric_filed:[80 TO ??]
: but this would not work in the a/m example or am I missing something?

RangeQuery allows you to have an open-ended range -- you can tell the
QueryParser to leave your range open ended using the keyword null,
ie...

my_numeric_filed:[80 TO null]



-Hoss





Re: Customizing termFreq

2004-12-12 Thread Chris Hostetter
: H1:text in H1 font
: H2:text in H2 font

: content:all the text
:
: The problem is that query of a type
: +(H1:xyz)
: is getting scored with the termFreq of xyz in the H1 field whereas I want
: it be scored using the termFreq of xyz in the entire document (i.e.
: content field)

So why not query for +(content:xyz) ... or is the problem that you only
want to get back docs with xyz in an H1, but you want the score based on
the whole doc?

If that's the case, then construct a Filter with the requirement of
(H1:xyz) and make your query (content:xyz).
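
Ie, something like (a sketch):

   Query q = new TermQuery(new Term("content", "xyz"));
   Filter f = new QueryFilter(new TermQuery(new Term("H1", "xyz")));
   Hits hits = searcher.search(q, f);   // only docs with xyz in H1, scored on content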


-Hoss



Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-11 Thread Chris Hostetter
: I also realized they're prob not doing searches at all - instead they're
: going off a DB of query popularity - I wanted to code up something

You are correct, hence the reason "cnet banana" doesn't appear in the list
of suggestions even though it has 41K results, but "hossman trophy" does
(with less than 1K results).

They're building up the list based on search frequency, not term
frequency.


-Hoss





RE: Sorting based on calculations at search time

2004-12-10 Thread Chris Hostetter
: I believe you are talking about the boost factor for fields or documents
: while searching. That does not apply in my case - maybe I am missing a
: point here.
: The weight field I was talking about is only for the calculation

Otis is suggesting that you set the boost of the document to be your
weight value.  That way Lucene will automatically do your multiplication
calculation when determining the score.
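
Ie, at index time (a sketch):

   Document doc = new Document();
   doc.setBoost(weight);   // folded into the score of every hit on this doc
   // ... add fields, then writer.addDocument(doc) ...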

The down side of this is that I don't think there's any way to keep
it from influencing the score on every search, so it's not something you
could use on only some queries.


-Hoss





Re: Unexpected TermEnum behavior

2004-12-08 Thread Chris Hostetter
:   TermEnum terms = reader.terms(new Term(fieldName, ));
:
: I noticed that initially TermEnum is positioned at the first term. In other
: words, I don't have to call terms.next() before calling terms.term(). This
: is different from the behavior of Iterator,  Enumeration and ResultSet whose

Well, strictly speaking it's very different -- in particular, the next
method doesn't return the item, which is also very different from
Iterators and Enumerations.

I agree it's a little confusing, especially since TermDocs and TermEnum are
different.

: If it is by design, what is the defined TermEnum behavior if there are no
: terms for the field name in question? Will the call to terms.term() return
: null? Or get positioned at the first term with the field name that comes
: after the provided field name? What if there are no field names after it?

I believe that in those cases, the TermEnum object itself will be null.

: In any case, some javadoc describing the behavior would be extremely useful.

I thought it was documented in the TermEnum interface, but looking at it
now I realize that not only does the TermEnum javadoc not explain it
very well, but the class FilteredTermEnum (which implements TermEnum)
actually documents the opposite behavior...

  public Term term()

  Returns the current Term in the enumeration. Initially
  invalid, valid  after next() called for the first time.


-Hoss





Re: Filter !!!

2004-12-07 Thread Chris Hostetter

: Wait there already is a ChainedFilter in the Lucene Sandbox.

Boo-Ya! ... I was really surprised I hadn't seen one yet, but that's what
I get for assuming everything in the sandbox would be listed on the Lucene
Sandbox page.

It looks very cool, everything I ever wanted and then some.  (The
Filter[] chain is what I was planning, but the int[] logic idea is something
I hadn't considered ... I figured when I needed multiple Filters combined
with different operators I could just build a tree of Filters, but I'm
guessing this approach will come in handy.)

thanx for the tip.

-Hoss





Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Chris Hostetter
:  executes the search, i would keep a static reference to SearchIndexer
:  and then when i want to invalidate the cache, set it to null or create

: design of your system.  But, yes, you do need to keep a reference to it
: for the cache to work properly.  If you use a new IndexSearcher
: instance (I'm simplifying here, you could have an IndexReader instance
: yourself too, but I'm ignoring that possibility) then the filtering
: process occurs for each search rather than using the cache.

Assuming you have a finite number of Filters, and assuming those Filters
are expensive enough to be worth it...

Another approach you can take to share the cache among multiple
IndexReaders is to explicitly call the bits method on your filter(s) once,
and then cache the resulting BitSet anywhere you want (ie: serialize it to
disk if you so choose), and then implement a BitsFilter class that you
can construct directly from a BitSet regardless of the IndexReader.  The
down side of this approach is that it will *ONLY* work if you are certain
that the index is never being modified.  If any documents get added, or
the index gets re-optimized, you must regenerate all of the BitSets.

(That's why the CachingWrapperFilter's cache is keyed off of the
IndexReader ... as long as you're re-using the same IndexReader, it knows
that the cached BitSet must still be valid, because an IndexReader
always sees the same index as when it was opened, even if another
thread/process modifies it.)


class BitsFilter extends Filter {
   BitSet bits;
   public BitsFilter(BitSet bits) {
     this.bits = bits;
   }
   public BitSet bits(IndexReader r) {
     return (BitSet) bits.clone();
   }
}




-Hoss





Re: indexReader close method

2004-12-06 Thread Chris Hostetter

: Do you know why I can't close the IndexReader  explicitly under some
: circumstances and why, when I do manage to close it I can still call
: methods on the reader?

1) I tried to create a test case that demonstrated your bug based on the
code outline you provided, and I couldn't (see below).  That implies to me
that something else is going on.  If you can create a completely self
contained program that demonstrates your bug and mail it to the list, that
would help us help you.
2) the documentation for IndexReader.close() says...

Closes files associated with this index. Also saves any new deletions to
disk. No other methods should be called after this has been called.

...note the word "should".  It doesn't say what the other methods will do
if you try to call them, just that you shouldn't try.  In some cases they
may generate exceptions; in other cases they may just be able to return
you data based on state internal to the object, which is unaffected by the
fact that the files have all been closed.

-Hoss

public static void main(String argv[]) throws IOException {

    /* create a directory */
    String d = System.getProperty("java.io.tmpdir", "tmp")
        + System.getProperty("file.separator")
        + "index-dir-" + (new Random()).nextInt(1000);
    Directory trash = FSDirectory.getDirectory(d, true);

    /* build index */
    Document doc;
    IndexWriter w = new IndexWriter(d, new SimpleAnalyzer(), true);
    doc = new Document();
    doc.add(Field.Text("words", "apple emu"));
    w.addDocument(doc);
    w.optimize();
    w.close();

    /* search index */
    IndexReader r = IndexReader.open(d);
    IndexSearcher s = new IndexSearcher(r);
    Hits h = s.search(new TermQuery(new Term("words", "apple")));

    s.close();
    r.close();

    System.out.println("Reader? - " + r.maxDoc());
}









Re: Problem with indexing/merging indices - documents not indexed.

2004-12-06 Thread Chris Hostetter
: I would appreciate any feedback on my code and whether I'm doing
: something in a wrong way, because I'm at a total loss right now
: as to why documents are not being indexed at all.

I didn't try running your code (because I don't have a DB to test it with)
but a quick read gives me a good guess as to your problem:

I believe you need to call...
ramWriter.close();
...before you call...
fsWriter.addIndexes(new Directory[] { ramDir });

(I've never played with merging indexes, so I could be completely wrong.)

Everything I've ever read/seen/tried has indicated that until you close
your IndexWriter, nothing you do will be visible to anybody else who
opens that Directory.

I'm also guessing that when you were trying to add the docs to fsWriter
directly, you were using an IndexReader you had opened prior to calling
fsWriter.close() to check the number of docs ... that won't work for the
same reason.




-Hoss





Re: Is this a bug or a feature with addIndexes?

2004-12-06 Thread Chris Hostetter

: [EMAIL PROTECTED] tmp]# time java MemoryVsDisk 1 1 10 -r
: Docs in the RAM index: 1
: Docs in the FS index: 0
: Total time: 142 ms

I looked at the code from the article you mentioned and added the print
statements I'm guessing you added for ramWriter/fsWriter.docCount() before
and after each are closed.  I also opened the resulting indexDir with a
new IndexReader after all the writers had been closed to get its numDocs
-- and I can confirm that the index in indexDir is in fact empty.  (Using
1.4.2.)


But like I said before:  You should try closing the ramWriter before
calling fsWriter.addIndexes.  I can say with authority that it works
(because I've tried it).

The date on that article is March of 2003 -- which pre-dates the Lucene
1.3 RC, so it's likely that the internals have changed a bit, making
it necessary to close ramWriter first.

Hell, it's entirely possible that the code in Otis's article never worked
100% correctly ... that code never printed out the number of docs in the
final index, so it's entirely possible it was missing a few even when he
ran it.

-Hoss





Re: Filter !!!

2004-12-06 Thread Chris Hostetter

:  Hits hits = indexSearcher.search(searchQuery, filter)   //  here I want
: to pass multiple filter...  (DateFilter,QueryFilter)

You can write a Filter that takes in multiple filters and ANDs them
together (or ORs them, it's not clear what you want)

   Hits h = s.search(q,new AndFilter(df,qf));

...

class AndFilter extends Filter {
   final Filter a;
   final Filter b;
   public AndFilter(Filter a, Filter b) {
     this.a = a;
     this.b = b;
   }
   public BitSet bits(IndexReader r) throws IOException {
     // clone so we don't clobber a BitSet the other filter may have cached
     BitSet bits = (BitSet) a.bits(r).clone();
     bits.and(b.bits(r));
     return bits;
   }
}


(I'm planning on writing a generalized BooleanFilter class sometime
in the next few weeks)

-Hoss





Re: Date Range Search throws IndexAccessException

2004-12-03 Thread Chris Hostetter

: I'm assuming that this must have something to do with how the date field
: enumerates against the matches with 'by the second' granularity - and
: thereby exceeding the maximum number of boolean clauses (please correct me
: if I am wrong).

I'm not so certain ... if you were really exceeding the max boolean clauses
limit, you should get a TooManyClauses exception.

: Is there some way to reduce the granularity of the search to 'by the day'
: granularity? Otherwise is there some way to perform this query so that I can
: retrieve the results without error?

take a look at the RangeFilter class I recently sent to the list (which is
now in CVS) ... in exchange for giving up scoring, it doesn't suffer from any
of the boolean clause limitations of RangeQuery.
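
As far as day granularity goes: if you (re)index your dates as simple
yyyyMMdd strings, something roughly like this would work (the field name,
dates, and searcher/query variables are just placeholders for the example):

    // assumes a "date" field indexed as yyyyMMdd keyword strings
    Filter f = new RangeFilter("date", "20041101", "20041130", true, true);
    Hits hits = searcher.search(query, f);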



-Hoss





RE: Date Range Search throws IndexAccessException

2004-12-03 Thread Chris Hostetter

: The problem with using a Filter is that I want to be able to merely generate
: a text query based on the range information instead of having to modify the
: core search module which basically receives text queries. If I understand
: correctly, the Filter would actually have to be created and passed into the
: search method.

I haven't actually done this myself, but when I asked about RangeQuery vs
RangeFilter before, Erik pointed out that you can wrap a RangeFilter in a
FilteredQuery so that you can still use the simpler search API (without
explicitly passing the filter).

If you're using the QueryParser that comes with Lucene, you can probably
subclass it and write your own getRangeQuery to look like the code
below.  (Like I said, I haven't actually tried this yet.)

Truthfully, I wonder if it might not be a good idea to change the default
implementation of getRangeQuery to be something like this?




  protected Query getRangeQuery(String field,
String part1,
String part2,
boolean inclusive) throws ParseException
  {
try {
  DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, locale);
  df.setLenient(true);
  Date d1 = df.parse(part1);
  Date d2 = df.parse(part2);
  part1 = DateField.dateToString(d1);
  part2 = DateField.dateToString(d2);
}
catch (Exception e) { }

return new FilteredQuery(
  new TermQuery(new Term(field, "")), // match all docs
  new RangeFilter(
   new Term(field, part1),
   new Term(field, part2),
   inclusive,inclusive));
  }
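
To actually use that override, the wiring would look something like this
(the class name, field name, and query string here are just placeholders):

    public class RangeFilterQueryParser extends QueryParser {
      public RangeFilterQueryParser(String field, Analyzer a) {
        super(field, a);
      }
      // drop the getRangeQuery() override shown above in here
    }

    QueryParser qp = new RangeFilterQueryParser("contents", new StandardAnalyzer());
    Query q = qp.parse("date:[1/1/2004 TO 12/31/2004]");
    Hits hits = searcher.search(q);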




IndexWriter.optimize and memory usage

2004-12-02 Thread Chris Hostetter

I've been running into an interesting situation that I wanted to ask
about.

I've been doing some testing by building up indexes with code that looks
like this...

 IndexWriter writer = null;
 try {
 writer = new IndexWriter(index, new StandardAnalyzer(), true);
 writer.mergeFactor = MERGE_FACTOR;
 PooledExecutor queue = new PooledExecutor(NUM_UPDATE_THREADS);
 queue.waitWhenBlocked();

 for (int min=low; min < high; min += BATCH_SIZE) {
 int max = min + BATCH_SIZE;
 if (high < max) {
 max = high;
 }
 queue.execute(new BatchIndexer(writer, min, max));
 }
 end = new Date();
 System.out.println("Build Time: " + (end.getTime() - start.getTime()) + "ms");
 start = end;
 writer.optimize();
 } finally {
 if (null != writer) {
 try { writer.close(); } catch (Exception ignore) {/*NOOP*/; }
 }
 }
 end = new Date();
 System.out.println("Optimize Time: " + (end.getTime() - start.getTime()) + "ms");


(where BatchIndexer is a class I have that gets a DB connection, slurps
all records from my DB between min and max, builds some simple
Documents out of them, and calls writer.addDocument(doc) on each)

This was working fine with small ranges, but then I tried building up a
nice big index for doing some performance testing.  I left it running
overnight, and when I came back in the morning I discovered that after
successfully building up the whole index (~112K docs, ~1.5GB disk) it
crashed with an OutOfMemoryError while trying to optimize.

I then realized I was only running my JVM with a 256m upper limit on RAM,
and I figured that PooledExecutor was still in scope and maybe it was
maintaining some state that was using up a lot of space, so I whipped up a
quick little app to solve my problem...

public static void main(String[] args) throws Exception {
IndexWriter writer = null;
try {
writer = new IndexWriter(index, new StandardAnalyzer(), false);
writer.optimize();
} finally {
if (null != writer) {
try { writer.close(); } catch (Exception ignore) { /*NOOP*/; }
}
}
}

...but I was disappointed to discover that even this couldn't run with
only 256m of RAM.  I bumped it up to 512m and then it managed to complete
successfully (the final index was only 1.1GB of disk).


This raises a few questions in my mind:

1) Is there a rule of thumb for knowing how much memory it takes to
   optimize an index?

 2) Is there a Best Practice to follow when building up a large index
    from scratch in order to reduce the amount of memory needed to optimize
    once the whole index is built?  (ie: would spinning up a thread that
    called writer.optimize() every N minutes be a good idea?  see the
    rough sketch after this list)

 3) Given an unoptimized index that's already been built (ie: in the case
    where my builder crashed and I wanted to try and optimize it without
    having to rebuild from scratch) is there any way to get IndexWriter to
    use less RAM and more disk (trading speed for a smaller form factor --
    and apparently: greater stability, so that the app doesn't crash)?
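
For #2, to be concrete: by "spinning up a thread" I mean something roughly
like this sketch (name made up, interval arbitrary, completely untested):

    class PeriodicOptimizer extends Thread {
      private final IndexWriter writer;
      private volatile boolean done = false;
      public PeriodicOptimizer(IndexWriter writer) {
        this.writer = writer;
        setDaemon(true);
      }
      public void finish() { done = true; }
      public void run() {
        while (!done) {
          try {
            Thread.sleep(10 * 60 * 1000);  // every 10 minutes
            writer.optimize();
          } catch (Exception e) {
            break;
          }
        }
      }
    }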


I imagine that the answers to #1 and #2 are largely dependent on the
nature of the data in the index (ie: the frequency of terms), but I'm
wondering if there is a high level formula that could be used to say
"based on the nature of your data, you want to take this approach to
optimizing when you build".



-Hoss





Re: GETVALUES +SEARCH

2004-12-01 Thread Chris Hostetter

:  Having Document implement Map sounds reasonable to me though.  Any
:  reasons not to do this?
:
: Not really, except perhaps that a Lucene Document could theoretically
: have multiple identical keys... not something that anyone would want to

Assuming you want all changes to be backwards compatible, you pretty much
have to implement Map.get(Object):Object using Document.get(String):String
... otherwise you'll wind up really confusing the hell out of people.  But
if you really wanted to be mean to people, I guess you could use
Document.getField(String):Field or even
Document.getValues(String):String[] or Document.getFields(String):Field[]
if you were feeling particularly mean.

The real question in my mind is not "how should we implement 'get' given
that we allow multiple values?" -- a better question is "how should we
implement 'put'?"

do you write...
   Object put(Object k, Object v) {
   this.add((Field)v);
   return null;
   }
or...
   Object put(String k, String v) {
   this.add(Field.Text(k.toString(),v.toString()));
   return null;
   }
or...
   Object put(String k, String v) {
   throw new UnsupportedOperationException("we're not that nice");
   }


...I think it may be wiser to just let clients wrap the Doc in their own
Map, using the rules that make sense to them -- because no one's ever going
to agree 100%.

If you think you know how to satisfy 90% of the users, I would still
suggest that instead of making Document implement Map, you add
a toMap() function that returns a wrapper with the rules that you think
make sense (and leave the Document API uncluttered of the Map functions
that people who don't care about Map don't need to see).
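
A rough sketch of the kind of client-side helper I mean (the "last value
wins" rule here is just one arbitrary choice):

    public static Map toMap(Document doc) {
      Map m = new HashMap();
      for (Enumeration e = doc.fields(); e.hasMoreElements();) {
        Field f = (Field) e.nextElement();
        m.put(f.name(), f.stringValue());  // last value wins for repeated names
      }
      return m;
    }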




Re: similarity matrix - more clear

2004-11-30 Thread Chris Hostetter
: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.

A simpler approach that I can think of would be to iterate over a complete
TermEnum of the index, and for each Term, get the corresponding TermDocs
enumerator to list every document that contains that term.  Assuming that
every pair of docs initially has a similarity of 0, this would allow you
to increment the similarity of each pair every time you find a term that
multiple docs have in common.  (The amount you increment the score for
each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

   IndexReader r = ...;
   int[][] scores = new int[r.maxDoc()][r.maxDoc()];
   TermEnum enumerator = r.terms();
   TermDocs termDocs = r.termDocs();
   do {
      Term term = enumerator.term();
      if (term != null) {
         termDocs.seek(term);
         Map docs = new HashMap();
         while (termDocs.next()) {
            docs.put(new Integer(termDocs.doc()), new Integer(termDocs.freq()));
         }
         for (Iterator i = docs.keySet().iterator(); i.hasNext();) {
            int ii = ((Integer) i.next()).intValue();
            for (Iterator j = docs.keySet().iterator(); j.hasNext();) {
               int jj = ((Integer) j.next()).intValue();
               if (ii <= jj) {
                  continue; // do each pair only once
               }
               int fi = ((Integer) docs.get(new Integer(ii))).intValue();
               int fj = ((Integer) docs.get(new Integer(jj))).intValue();
               scores[jj][ii] += (fi + fj) / 2;
            }
         }
      } else {
         break;
      }
   } while (enumerator.next());





RE: fetching similar wordlist as given word

2004-11-24 Thread Chris Hostetter

:can I get the similar wordlist as output. so that I can show the end
:user in the column  ---   do you mean foam?
:How can I get similar word list in the given content?

This is a non-trivial problem, because the definition of "similar" is
subject to interpretation.  I would look into various dictionary
implementations, and see if you can find a good Java-based dictionary that
can suggest alternatives based on an input string.

Once you have that, then you should be able to use IndexSearcher.docFreq
to find out how many docs contain each alternate word, and compare that
with the number of docs that contain the initial word ... if one of the
alternates has a significantly higher number of matches, then you suggest
it.
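
Something along these lines (the field name and the source of the
alternate words are just placeholders):

    // "suggestions" is whatever list of words your dictionary produced
    int origFreq = searcher.docFreq(new Term("contents", userWord));
    String best = null;
    int bestFreq = origFreq;
    for (Iterator it = suggestions.iterator(); it.hasNext();) {
      String alt = (String) it.next();
      int altFreq = searcher.docFreq(new Term("contents", alt));
      if (altFreq > bestFreq) {   // or require some margin to count as "significant"
        best = alt;
        bestFreq = altFreq;
      }
    }
    // if best != null, show the user "do you mean <best>?"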


NOTE: The DICT protocol defines a client/server approach to providing
spell correction and definitions.  Maybe you can leverage some of the
spell correction code mentioned in the "Server Software Written in Java"
section of this doc...
http://www.dict.org/links.html
In particular, you might want to take a look at JavaDict's Database.match
function using the LevenshteinStrategy...
http://ktulu.com.ar/javadict/docs/ar/com/ktulu/dict/Database.html#match(java.lang.String,%20ar.com.ktulu.dict.strategies.Strategy)






Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Chris Hostetter

: Done.  I deprecated DateField and DateFilter, and added the RangeFilter
: class contributed by Chris.
:
: I did a little code cleanup, Chris, renaming some RangeFilter variables
: and correcting typos in the Javadocs.  Let me know if everything looks
: ok.

Wow ... that was fast.  Things look fine to me (typos in javadocs are my
specialty), but now I wish I'd included more tests.

I still feel a little confused about two things though...

First: Is there any reason Matt Quail's LongField class hasn't been
added to CVS (or has it and I'm just not seeing it?)

I haven't tested it extensively, but it strikes me as being a crucial utility
for people who want to do any serious sorting or filtering of numeric
values.

Although I would suggest a few minor tweaks:
  a) Rename to something like NumberTools (to be consistent with the new
     DateTools, and because...)
  b) Add some one-line convenience methods like intToString and
     floatToString and doubleToString (see the rough sketch below), a la:
     return longToString(Double.doubleToLongBits(d));
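
ie: something like this, assuming the class does get renamed to NumberTools:

    public static String intToString(int i) {
      return longToString(i);
    }
    public static String doubleToString(double d) {
      return longToString(Double.doubleToLongBits(d));
    }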

Second...

: RangeQuery wrapped inside a QueryFilter is more specifically what I
: said.  I'm not a fan of DateField and how the built-in date support in
: Lucene works, so this is why I don't like DateFilter personally.
:
: Your RangeFilter, however, is nicely done and well worth deprecating
: DateFilter for.
  [...]
:  and RangeQuery. [5] Based on my limited tests, using a Filter to
:  restrict
:  to a Range is a lot faster then using RangeQuery -- independent of
:  caching.
:
: And now with FilteredQuery you can have the best of both worlds :)

See, this is what I'm not getting: what is the advantage of the second
world? :) ... in what situations would using...

   s.search(q1, new QueryFilter(new RangeQuery(t1,t2,true)));

...be a better choice than...

   s.search(q1, new RangeFilter(t1.field(),t1.text(),t2.text(),true,true));


?




Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Chris Hostetter

: Note that I said FilteredQuery, not QueryFilter.

Doh ... right, sorry.  I confused myself by thinking you were still referring
to your comments from 2004-03-29 comparing DateFilter with RangeQuery wrapped
in a QueryFilter.

: I debate (with myself) on whether add-ons that can be done with other
: code is worth adding to Lucene's core.  In this case the utility
: methods are so commonly needed that it makes sense.  But it could be

In particular, having a class of utilities like that in the code base is
useful, because now the javadocs for classes like RangeQuery and
RangeFilter can reference them as being necessary to ensure that
ranges work the way you expect ... and hopefully fewer people will be
confused in the future.

: I think there needs to be some discussion on what other utility methods
: should be added.  For example, most of the numerics I index are
: positive integers and using a zero-padded is sufficient.  I'd rather
: have clearly recognizable numbers in my fields than some strange
: contortion that requires a conversion process to see.

I'm of two minds.  On one hand, I think there's no big harm in providing
every conceivable utility function known to man so people have their
choice of representation.  On the other hand, I think it would be nice if
Lucene had a much simpler API for dealing with non-strings that just did
the right thing based on simple expectations -- without the user having
to ask themselves "Will I ever need negative numbers?  Will I ever need
numbers bigger than 1000?" or to later remember that they padded this field
to 5 digits and that field to 7 digits.

Having clearly recognized values is something that can (should?) be easily
accomplished by indexing the contorted but lexically sortable value, and
storing the more readable value...

Document d = /* some doc */;
Long l = /* some value */;
Field f1 = Field.UnIndexed(field, l.toString());
Field f2 = Field.UnStored(field, NumberTools.longToString(l));
d.add(f1);
d.add(f2);

(I'm not imagining things right?  that should work, correct?)

What would really be sweet is if Lucene had an API that
transparently dealt with all of the major primitive types, both at
indexing time and at query time, so that users didn't have to pay any
attention to the stringification, or when to index a different value
than they store...

Field f = Field.Long(field, l); /* indexes one string, stores the other */
d.add(f);
...
Query q = new RangeQuery(field, l1, l2); /* knows to use the contorted 
string */
...
String s = hits.doc(i).getValue(field); /* returns pretty string */
Long l = hits.doc(i).getValue(field);   /* returns original Long */

--

---
Oh, you're a tricky one.Chris M Hostetter
 -- Trisha Weir[EMAIL PROTECTED]

