: Done. I deprecated DateField and DateFilter, and added the RangeFilter
: class contributed by Chris.
:
: I did a little code cleanup, Chris, renaming some RangeFilter variables
: and correcting typos in the Javadocs. Let me know if everything looks
: ok.
Wow ... that was fast. Things look
: Note that I said FilteredQuery, not QueryFilter.
Doh ... right, sorry. I confused myself by thinking you were still referring
to your comments 2004-03-29 comparing DateFilter with RangeQuery wrapped
in a QueryFilter.
: I debate (with myself) whether add-ons that can be done with other
: code
:Can I get a similar word list as output, so that I can show the end
:user in the column --- do you mean foam?
:How can I get a similar word list for the given content?
This is a non-trivial problem, because the definition of "similar" is
subject to interpretation. I
: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity
: Having Document implement Map sounds reasonable to me though. Any
: reasons not to do this?
:
: Not really, except perhaps that a Lucene Document could theoretically
: have multiple identical keys... not something that anyone would want to
Assuming you want all changes to be backwards
I've been running into an interesting situation that I wanted to ask
about.
I've been doing some testing by building up indexes with code that looks
like this...
IndexWriter writer = null;
try {
    writer = new IndexWriter(index, new StandardAnalyzer(), true);
    // ... add documents here ...
} finally {
    if (writer != null) writer.close();
}
: I'm assuming that this must have something to do with how the date field
: enumerates against the matches with 'by the second' granularity - and
: thereby exceeding the maximum number of boolean clauses (please correct me
: if I am wrong).
I'm not so certain .. if you were really exceeding the
: The problem with using a Filter is that I want to be able to merely generate
: a text query based on the range information instead of having to modify the
: core search module which basically receives text queries. If I understand
: correctly, the Filter would actually have to be created and
: Do you know why I can't close the IndexReader explicitly under some
: circumstances and why, when I do manage to close it I can still call
: methods on the reader?
1) I tried to create a test case that demonstrated your bug based on the
code outline you provided, and i couldn't (see below).
: I would appreciate any feedback on my code and whether I'm doing
: something in a wrong way, because I'm at a total loss right now
: as to why documents are not being indexed at all.
I didn't try running your code (because i don't have a DB to test it with)
but a quick read gives me a good
: [EMAIL PROTECTED] tmp]# time java MemoryVsDisk 1 1 10 -r
: Docs in the RAM index: 1
: Docs in the FS index: 0
: Total time: 142 ms
I looked at the code from the article you mentioned and added the print
statements i'm guessing you added for ramWriter/fsWriter.docCount() before
and after
: Hits hits = indexSearcher.search(searchQuery, filter) // here I want
: to pass multiple filter... (DateFilter,QueryFilter)
You can write a Filter that takes in multiple filters and ANDs them
together (or ORs them, it's not clear what you want)
Hits h = s.search(q,new
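In the 1.4-era Filter API a filter produces a java.util.BitSet over document ids, so AND-ing two filters boils down to BitSet intersection. A minimal stdlib-only sketch of that core step (the class name `AndBits` is mine, not a Lucene API):

```java
import java.util.BitSet;

// Sketch: combining two per-document bit sets the way a chained
// "AND" filter would (a 1.4-era Lucene Filter returns a BitSet).
public class AndBits {

    // Returns a new BitSet containing only the documents
    // present in both inputs; the inputs are left untouched.
    public static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // copy so 'a' is not modified
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet dateMatches = new BitSet();
        dateMatches.set(1); dateMatches.set(3); dateMatches.set(5);
        BitSet queryMatches = new BitSet();
        queryMatches.set(3); queryMatches.set(4); queryMatches.set(5);
        System.out.println(and(dateMatches, queryMatches)); // {3, 5}
    }
}
```

OR-ing is the same shape with `result.or(b)` instead.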
: Wait there already is a ChainedFilter in the Lucene Sandbox.
Boo-Ya! ... I was really surprised I hadn't seen one yet, but that's what
I get for assuming everything in the sandbox would be listed on the Lucene
Sandbox page.
It looks very cool, everything i ever wanted and then some. (the
: executes the search, i would keep a static reference to SearchIndexer
: and then when i want to invalidate the cache, set it to null or create
: design of your system. But, yes, you do need to keep a reference to it
: for the cache to work properly. If you use a new IndexSearcher
: instance
: TermEnum terms = reader.terms(new Term(fieldName, ""));
:
: I noticed that initially TermEnum is positioned at the first term. In other
: words, I don't have to call terms.next() before calling terms.term(). This
: is different from the behavior of Iterator, Enumeration and ResultSet whose
: I believe you are talking about the boost factor for fields or documents
: while searching. That does not apply in my case - maybe I am missing a
: point here.
: The weight field I was talking about is only for the calculation
Otis is suggesting that you set the boost of the document to be your
: I also realized they're prob not doing searches at all - instead they're
: going off a DB of query popularity - I wanted to code up something
you are correct, hence the reason cnet banana doesn't appear in the list
of suggestions even though it has 41K results, but hossman trophy does
(with
: select * from MY_TABLE where MY_NUMERIC_FIELD > 80
:
: as far as I know you have only the range query so you will have to say
:
: my_numeric_field:[80 TO ??]
: but this would not work in the above example, or am I missing something?
RangeQuery allows you to use an open-ended range -- you can tell the
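A related wrinkle with numeric ranges: 1.4-era RangeQuery compares terms as strings, so numbers need fixed-width zero padding for the range to behave numerically. A stdlib-only illustration (class and method names are mine, and the padding width is an arbitrary choice):

```java
// Sketch: lexicographic comparison only matches numeric order
// when the numbers are zero-padded to a fixed width.
public class PadDemo {

    // Left-pads a number with zeros to the given width.
    public static String pad(long n, int width) {
        String s = Long.toString(n);
        StringBuffer buf = new StringBuffer();
        while (buf.length() + s.length() < width) {
            buf.append('0');
        }
        return buf.append(s).toString();
    }

    public static void main(String[] args) {
        // Unpadded: "100" sorts before "80" -- wrong for a range query.
        System.out.println("100".compareTo("80") < 0);             // true
        // Padded: "00100" sorts after "00080" -- matches numeric order.
        System.out.println(pad(100, 5).compareTo(pad(80, 5)) > 0); // true
    }
}
```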
is significantly better than the other results
b) document #3 and #4 are both equally relevant to Doug Cutting
If I then do a search for Chris Hostetter and get back the following
results/scores...
9: 0.9
8: 0.3
7: 0.21
6: 0.21
5: 0.1
...then I can assume the same
: In my application, users search for messages with Lucene. Typically,
: they are more interested in seeing their hits in date-order than in
: relevance-order. In reading my ebook copy of Lucene in action (wish
: I'd had that a year ago), I find that one of the features added in 1.4
: was the
: H1:text in H1 font
: H2:text in H2 font
: content:all the text
:
: The problem is that query of a type
: +(H1:xyz)
: is getting scored with the termFreq of xyz in the H1 field whereas I want
: it be scored using the termFreq of xyz in the entire document (i.e.
: content field)
so why
: The issue occurs if the first field it accesses parses as a numeric
: value and then successive fields are Strings. If you are mixing and
: I am wondering why this exception might occur when the server/index is
: under load. I do realise there are many 'variables in the equation',
: so
:
: Therefore I turned back to the standard analyzer and now do some replacing
: of the underscores in my ID string to avoid my original problem. This solved
maybe i'm missing something, but if you've got a field in your doc that
represents an ID, why not create that field as NonTokenized so you
: However, I don't think that the names are consistent enough to permit a
: generic use of regular expressions. What Daniel is trying to achieve
: looks interesting anyway,
I'm not sure that that really matters in the long run ... I think the OP
was asking if there was a way to get the name in
: I thought of putting empty strings instead of null values but I think
: empty strings are put first in the list while sorting which is the
: reverse of what anyone would want.
instead of adding a field with a null value, or value of an empty string,
why not just leave the field out for
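For what it's worth, the sort position of empty strings is easy to confirm with a stdlib-only snippet (the class name is mine):

```java
import java.util.Arrays;

// Sketch: empty strings sort before everything else lexicographically,
// so padding missing values with "" pushes them to the top of a sort.
public class EmptySort {

    // Returns a sorted copy of the input.
    public static String[] sorted(String[] values) {
        String[] copy = (String[]) values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] v = sorted(new String[] { "banana", "", "apple" });
        System.out.println(Arrays.asList(v)); // [, apple, banana]
    }
}
```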
To start with, there has to be more to the search side of things than
what you included. This search function is not static, which means it's
getting called on an object, which obviously has some internal state
(paramOffset, hits, and pathToIndex are a few that jump out at me). What
are the
: Is there any already written analyzer that would take that name
: (Sch&auml;ffer or any other name that has entities) so that the
: Lucene index could be searched (once the field has been indexed) for the real
: version of the name, which is
:
: Schäffer
:
: and the English-spelled version of the
: This is what we found:
:
: 1 thread, search takes 20 ms.
:
: 2 threads, search takes 40 ms.
:
: 5 threads, search takes 100 ms.
how big is your index? What are the term frequencies like in your index?
how many different queries did you try? what was the structure of your
: I ordered my from Amazon a while back and was notified yesterday that it
: shipped. Here was my price:
really??? .. those bastards. I ordered two copies for my work on December
10th and they still haven't shipped them.
: 1  Lucene In Action (In Action)  $27.17  1  $27.17
Hmm,
: Hoss, could you tell me what exceptions I'm missing? Thanks!
anytime you have a catch block, you should be doing something with that
exception. If possible, recover from it, but no matter
what you should log the exception in some way so that you know it
happened.
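A minimal sketch of that catch-and-log advice using java.util.logging (the class and method names here are illustrative, not from any Lucene code):

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: never swallow an exception silently -- at minimum, log it
// with its stack trace so the failure is visible later.
public class CatchAndLog {
    private static final Logger LOG =
        Logger.getLogger(CatchAndLog.class.getName());

    // Returns true on success, false (after logging) on failure.
    public static boolean tryIndex(boolean failDeliberately) {
        try {
            if (failDeliberately) {
                throw new IOException("simulated index failure");
            }
            return true;
        } catch (IOException e) {
            // Passing the exception logs the full stack trace.
            LOG.log(Level.WARNING, "indexing failed", e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryIndex(true)); // false, with a logged warning
    }
}
```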
Your
: Stored = as-is value stored in the Lucene index
:
: Tokenized = field is analyzed using the specified Analyzer - the tokens
: emitted are indexed
:
: Indexed = the text (either as-is with keyword fields, or the tokens
: from tokenized fields) is made searchable (aka inverted)
:
: Vectored = the term frequency vector (terms and their counts) for the
: field is stored per document
: we are currently implementing a search engine for a news site. Our goal
: is to have a search result that uses the publish date of the documents
: to boost the score of the documents.
: have to use something that boosts the scores at _search_ time.
1) There is a way to boost individual Query
: Is it possible to enable stem queries on a per-query basis? It doesn't
: seem to be possible since the stem tokenizing is done during the
: indexing process. Are people basically stuck with having all their
: queries stemmed or none at all?
: From what I've read, if you want to have a choice,
: : have to use something that boosts the scores at _search_ time.
: Yes, I know I can boost Query objects, but that is not the same as
: boosting the document score by a factor. By boosting query objects I
: _add_ values to the score. Let me show you an example:
well, sure it is ... you have
: What about a shutdown hook?
Interesting idea: at the moment the file is created on disk, the
FSDirectory could add a shutdown hook that checks for the existence of
the file and, if it's still there (implying that the lock owner failed
without releasing the lock), forcibly removes it.
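A rough stdlib-only sketch of that shutdown-hook idea (the class and method names are mine; FSDirectory does not actually do this):

```java
import java.io.File;
import java.io.IOException;

// Sketch: register a JVM shutdown hook that removes a stale lock
// file if the process exits without releasing it.
public class LockCleanup {

    // Deletes the file if it still exists; returns true if removed.
    public static boolean removeIfPresent(File lockFile) {
        return lockFile.exists() && lockFile.delete();
    }

    // Registers a hook that cleans up the lock file at JVM exit.
    public static void registerHook(final File lockFile) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                removeIfPresent(lockFile);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        File lock = File.createTempFile("index", ".lock");
        registerHook(lock);
        // ... do work; if the JVM exits past this point without
        // releasing the lock, the hook removes the stale file.
    }
}
```

One caveat: shutdown hooks don't run if the JVM is killed hard (kill -9, power loss), so this only narrows the window, it doesn't close it.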
Of
: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.
That doesn't seem horribly surprising.
consider that for every Term in the index,
: 1) Adding 250K documents took half an hour for lucene.
: 2) Deleting and adding same 250K documents took more than 50 minutes. In my
: test all 250K objects are new so there is nothing to delete.
:
: Looks like there is no other way to make it fast.
I bet you can find an improvement in the
: is it possible to get all different values for a
: Field from a Hits object and how to do this?
The wording of your question suggests that the Field you are interested in
isn't a field which will have a fairly unique value for every doc (ie: not
a title, more likely an author or category
: Thanks for your tips. I am trying to get a more thorough understanding
: why this would be better.
1) give serious consideration to just putting all of your data in lucene
for the purposes of searching. the initial example mentioned employees,
and salaries and wanted to search for employees
: Why is IndexReader.lastModified(index) deprecated?
Did you read the javadocs?
Synchronization of IndexReader and IndexWriter instances is no longer
done via time stamps of the segments file since the time resolution
depends on the hardware platform. Instead, a version number is
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory. The VM does nothing else besides open
Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory
: processes ended. If you're under linux, try running the 'lsof'
: command to see if there are any handles to files marked (deleted).
: Searcher, the old Searcher is closed and nulled, but I
: still see about twice the amount of memory in use well
: after the original searcher has been
: anywhere. I checked the count coming back from the delete operation and
: it is zero. I even tried to delete another unique term with similar
: results.
First off, are you absolutely certain you are closing the reader? it's
not in the code you listed.
Second, I'd bet $1 that when your
Another approach...
You can make a Filter that is the inverse of the output from another
filter, which means you can make a QueryFilter on the search, then wrap it
in your inverse Filter.
you can't execute a query on a filter without having a Query object, but
you can just apply the Filter
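Since a 1.4-era Filter boils down to a java.util.BitSet over document ids, the inverse wrapper described above is essentially a bit flip over the document range. A minimal stdlib sketch (the class name is mine, not a Lucene class):

```java
import java.util.BitSet;

// Sketch: inverting another filter's BitSet -- every document the
// wrapped filter matched is cleared, every other document is set.
public class InverseBits {

    // maxDoc is the number of documents in the index; the input
    // BitSet is left untouched.
    public static BitSet invert(BitSet bits, int maxDoc) {
        BitSet result = (BitSet) bits.clone();
        result.flip(0, maxDoc); // complement within [0, maxDoc)
        return result;
    }

    public static void main(String[] args) {
        BitSet matched = new BitSet();
        matched.set(0); matched.set(2);
        System.out.println(invert(matched, 4)); // {1, 3}
    }
}
```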
: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery. However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).
On the issue of ends-with wildcard queries, I wanted to throw out and
: care about their content. I only want to know a particular numeric
: field from
: document (id of document's category).
: I also need to know how many docs in category were found, so I can't
: index
: You should explore the use of IndexReader. Index your documents with
: category id
: book Managing Gigabytes, making *string* queries drastically more
: efficient for searching (though also impacting index size). Take the
: term cat. It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for *at* would be preprocessed and
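The rotation scheme quoted above (a permuterm-style index, as described in Managing Gigabytes) is easy to sketch with plain strings; I use `$` as the end-of-word marker and the class name is mine:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the rotated-term trick: index every rotation of the
// term plus an end-of-word marker, so a leading wildcard can be
// rewritten as a prefix lookup on one of the rotations.
public class Rotations {

    // Returns all rotations of term + "$".
    public static List rotations(String term) {
        String marked = term + "$"; // end-of-word marker
        List out = new ArrayList();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // cat -> [cat$, at$c, t$ca, $cat]
        System.out.println(rotations("cat"));
        // A query like *at* can then be rewritten as the prefix
        // "at" against the rotated forms (it matches "at$c").
    }
}
```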
: Your dates need to be stored in lexicographical order for the RangeQuery
: to work.
:
: Index them using this date format: YYYYMMDD.
:
: Also, I'm not sure if the QueryParser can handle range queries with only
: one end point. You may need to create this query programmatically.
and when
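The lexicographic requirement in the quoted advice can be sanity-checked with plain string comparison; a YYYYMMDD encoding keeps string order equal to date order (class and method names are mine):

```java
// Sketch: with a YYYYMMDD encoding, string order equals date order,
// which is exactly what RangeQuery's term comparison needs.
public class DateOrder {

    // True if a sorts before b lexicographically.
    public static boolean before(String a, String b) {
        return a.compareTo(b) < 0;
    }

    public static void main(String[] args) {
        System.out.println(before("20040101", "20041231")); // true
        // With only MMDD the year is lost and ordering breaks:
        // "1231" (Dec 31, 2003) does not sort before "0101" (Jan 1, 2004).
        System.out.println(before("1231", "0101"));         // false
    }
}
```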
: Just curious: it would seem easier to use multiple fields for the
: original case and lowercase searching. Is there any particular reason
: you analyzed the documents to multiple indexes instead of multiple
: fields?
:
: I considered that approach, however to expose QueryParser I'd have
: What's the desired pattern of use of TermInfosWriter.indexInterval?
:
: There isn't one. It is not a part of the public API. It is an
: unsupported internal feature.
: It was never public. It used to be static and final, but is now an
: instance variable.
: The place to put