Strategy for date based searching and indexing

2007-08-19 Thread Berlin Brown
I am using the most basic Lucene functionality, but against a
database.  For example, I may have a message forum and will index the
message text and message subject from the database.  But I haven't
figured out a way to index the date.  Ideally, when I search I should
be able to return results that are both recent and relevant, possibly
even filtering by the last couple of days or something similar.

Does anybody have an example of how they indexed a date and/or a
record from a database?

-- 
Berlin Brown
http://www.newspiritcompany.com - newspirit technologies




Re: Deleting the result from a query or a filter and not documents specified by Term

2007-08-19 Thread Abu Abdulla alhanbali
Greatly appreciated.
It works perfectly.

On 8/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Is there a way to delete the results from a query or a filter, and not
> : documents specified by Term? I have seen some explanations here but I
> : do not know how to do it:
> :
> :
> http://www.nabble.com/Batch-deletions-of-Records-from-index-tf615674.html#a1644740
>
> the simplest approach that will work in a general case:
>   1) build your query object
>   2) call rewrite on your query
>   3) call extractTerms on the rewritten query
>   4) iterate over all those terms and delete.
>
> if you have a Filter it's even easier...
>   1) call the bits method on your filter
>   2) iterate over each bit and call the delete method that takes a docid.
>
>
>
> -Hoss
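
A minimal sketch of both recipes, assuming Lucene 2.x APIs (error
handling omitted; note Erick's follow-up later in this digest about
what step 4 actually deletes for boolean queries):

import java.util.BitSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;

public class DeleteByQuerySketch {

    // Query case: rewrite, extract the terms, delete term by term.
    public static void deleteByQuery(IndexReader reader, Query query)
            throws Exception {
        Query rewritten = query.rewrite(reader);
        Set terms = new HashSet();
        rewritten.extractTerms(terms);
        for (Iterator it = terms.iterator(); it.hasNext();) {
            reader.deleteDocuments((Term) it.next());
        }
    }

    // Filter case: iterate the set bits, deleting each doc id.
    public static void deleteByFilter(IndexReader reader, Filter filter)
            throws Exception {
        BitSet bits = filter.bits(reader);
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            reader.deleteDocument(i);
        }
    }
}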


Re: Strategy for date-based searching and indexing

2007-08-19 Thread Grant Ingersoll
Have a look at the DateTools utility class. Also, the Wiki has some
HOWTOs on dates: http://wiki.apache.org/lucene-java/HowTo

Search this archive for date handling; I believe the "Lucene in Action"
book covers dates as well, although it might be a bit dated.

Lucene also comes with sort functionality that can handle dates.
Have a look at the search API methods that take a Sort object.
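
A minimal sketch of what that can look like, assuming the Lucene 2.x
API (field and class names are illustrative, not prescribed):

import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class DateSearchSketch {

    // Indexing: encode the date as a sortable, range-searchable term.
    // DAY resolution keeps the number of unique terms small.
    public static void addDateField(Document doc, Date posted) {
        doc.add(new Field("posted",
                DateTools.dateToString(posted, DateTools.Resolution.DAY),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    // Searching: restrict to a date window (e.g. the last couple of
    // days) and sort the hits newest-first.
    public static Hits recentFirst(IndexSearcher searcher, Query query,
                                   Date from, Date to) throws Exception {
        Filter window = new RangeFilter("posted",
                DateTools.dateToString(from, DateTools.Resolution.DAY),
                DateTools.dateToString(to, DateTools.Resolution.DAY),
                true, true);
        Sort byDate = new Sort(new SortField("posted", SortField.STRING, true));
        return searcher.search(query, window, byDate);
    }
}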


-Grant

On Aug 19, 2007, at 3:23 AM, Berlin Brown wrote:


> I am using the most basic Lucene functionality, but against a
> database.  For example, I may have a message forum and will index the
> message text and message subject from the database.  But I haven't
> figured out a way to index the date.  Ideally, when I search I should
> be able to return results that are both recent and relevant, possibly
> even filtering by the last couple of days or something similar.
>
> Does anybody have an example of how they indexed a date and/or a
> record from a database?
>
> --
> Berlin Brown
> http://www.newspiritcompany.com - newspirit technologies




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: query question

2007-08-19 Thread Erick Erickson
Mohammad:

See below

On 8/19/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
>
> Erick,
> I am using WhitespaceAnalyzer, and yes, it's mixed case. In my
> application I never change the entered information to lowercase,
> for several reasons.


I've found it waay easier to index things two different ways
rather than having to endlessly worry about case and special
characters, especially since whatever you do will be wrong some
of the time. For instance, if you index with case, "ca" wouldn't
match "Ca". And if I search for "ca" I'd get a (potentially)
completely different set of responses than searching for "Ca",
which would confuse users and result in endless bug reports.

Indexing the same data twice, once for search and once for
display, isn't, I believe, any more expensive than indexing AND
storing the data. That is, say I'm indexing the text "This is some
text". I just add two fields to the doc, one stored but not indexed
and one indexed but not stored:

doc.add(new Field("field_search", "This is some text",
                  Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field_display", "This is some text",
                  Field.Store.YES, Field.Index.NO));

With the appropriate analyzer and/or pre-processing, field_search
will be transformed into a "canonical" form, but field_display
will be exactly what was entered, with capitalization, punctuation,
etc. all in place.

I believe this consumes pretty much the same resources as
indexing into a single field with Field.Store.YES,
Field.Index.TOKENIZED. And it makes your search behavior much
simpler: you "canonicalize" your searchable field by, for
instance, removing all punctuation, lowercasing it, and perhaps
folding characters (see below). My point here is that at *both*
index and search time, I "massage" the data to provide a better
user experience. Not to mention having to field fewer "Why didn't
my search return...?" questions.

But I still have my field_display which contains exactly what
was originally entered when I need it.

Of course I don't know whether this works for you, since your
problem space undoubtedly has its own constraints, but it's
something you should consider if possible.


BELOW: Folding: I have an English-based application that
nevertheless has a very few foreign-language books. By folding all the
accented characters into their low-ASCII counterparts for indexing
and searching, but *displaying* the original text in the results, users
get what they expect.
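
A sketch of such a "canonicalizing" analyzer, assuming Lucene 2.2-era
classes (the class name is made up; use the same analyzer on
field_search at index AND query time):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokens, lowercased, accents folded to low ASCII.
public class CanonicalAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        ts = new ISOLatin1AccentFilter(ts); // folds é -> e, ü -> u, etc.
        return ts;
    }
}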

> The thing that I didn't consider was the punctuation in the indexes, but
> in the query I didn't use any punctuation. Now using Luke, when I put Ca\.
> (with the dot escaped) the result is 5 documents; however, I expected many
> more. The question is: do I have to remove all dots and special characters
> from the indexed information while indexing?


See above. But I'd *start* by assuming that your searchable
fields should have all the extraneous stuff removed at *both*
index and search time. Which is pretty easy if you use the
same analyzer for the searchable fields during both operations.


>> And if you only knew how many times I've said something similar to ...
>> and been totally wrong
>
> Erick, I have to use this because we are writing an API that uses objects
> as the source of indexes, and we have to map objects to documents and vice
> versa. Would you tell me what other way we should take?


What I was recommending is NOT that you do things a
different way in your *finished* application, but rather that
you simplify your use of Lucene, perhaps in a test or pilot
project, until you get the results you expect from indexing
and searching. Only when you start getting what you expect
from the simple cases should you try to get fancy.

Until then, my experience has been that I'm never sure
whether my problems are in my code or just that
I don't understand how the tool works.

Best
Erick


On 8/18/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > I think you'll get much farther much faster if you concentrate on
> > a very simple test case for searching until you get the results you
> > expect.
> >
> > It's particularly telling that you can't get your results from Luke.
> > All the rest of your code is irrelevant until you get what you expect
> > from Luke with a simple analyzer or with a stupid-simple bit of
> > test code. Until then, the rest of your code, in which bugs may
> > lurk, just gets in your way.
> >
> > For instance you have colons in your term text. I believe you have
> > to escape these for query parsing to work correctly. You have mixed
> > case. Are you absolutely sure that the casing is consistent between
> > indexing and querying? You have other punctuation. Are you also sure
> > that it's not stripped by the query analyzers? The fragment above
> > doesn't show us what analyzer you use. I flat guarantee that if it's
> > StandardAnalyzer, lots of punctuation is stripped and the term text is
> > lowercased. Some innocent-seeming bit of code can mess you up in
> > any of these ways.
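
A tiny illustration of the escaping point, assuming Lucene's QueryParser
(field name and query text are made up):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class EscapeSketch {
    public static void main(String[] args) throws ParseException {
        QueryParser parser = new QueryParser("subject", new WhitespaceAnalyzer());
        // Escape the colon by hand so it is treated as term text,
        // not as a field separator:
        Query q1 = parser.parse("subject:foo\\:bar");
        // Or let QueryParser escape every special character:
        Query q2 = parser.parse(QueryParser.escape("foo:bar"));
        System.out.println(q1 + " / " + q2);
    }
}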

Re: Deleting the result from a query or a filter and not documents specified by Term

2007-08-19 Thread Erick Erickson
Chris:

I didn't understand how your first solution would work,
so I tried it. The terms I extracted from the rewritten
query were just the four raw terms, i.e.:

field1:query1
field1:query3
field2:query2
field2:query4

So iterating over and deleting them term by term wouldn't
preserve the sense of the original query
(field1:query1 AND field2:query2) OR (field1:query3 AND field2:query4)
and would delete (presumably) more documents than just
the documents matching the query.

So what am I missing?

Thanks
Erick



On 8/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Is there a way to delete the results from a query or a filter, and not
> : documents specified by Term? I have seen some explanations here but I
> : do not know how to do it:
> :
> :
> http://www.nabble.com/Batch-deletions-of-Records-from-index-tf615674.html#a1644740
>
> the simplest approach that will work in a general case:
>   1) build your query object
>   2) call rewrite on your query
>   3) call extractTerms on the rewritten query
>   4) iterate over all those terms and delete.
>
> if you have a Filter it's even easier...
>   1) call the bits method on your filter
>   2) iterate over each bit and call the delete method that takes a docid.
>
>
>
> -Hoss


Lockless read-only deletions in IndexReader?

2007-08-19 Thread karl wettin
I want to set documents in my IndexReader as deleted, but I will
never commit these deletions. Sort of a filter on a reader rather
than on a searcher, and with no write locks.

Can I do that out of the box?

Perhaps I can pass an IndexDeletionPolicy down to my IndexWriter that
ignores deletions from the IndexReader(s) to avoid the lock?

Would changing the directory lock factory affect the IndexWriter
locks too? If so, that would not be an option, or?

I could go hacking in IndexReader, definalizing it to decorate
deleteDocument(int), or something like that, but I would really
prefer not to.

This is for a transactional layer on top of Lucene, where I combine
the system index with a transactional index. Updated documents that
are represented in the transaction index must be filtered out of the
system index at the IndexReader level without creating write locks.
undeleteAll() would be an option if there were no locks -- more than
one transaction could be updating documents at the same time.
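
One possible shape for the decoration idea, sketched against the
Lucene 2.x FilterIndexReader API (class names are made up;
termPositions() would need the same wrapping for phrase queries, and
docFreq()/norms stay unchanged, so scores shift slightly):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;

// Hides documents listed in an in-memory BitSet from term iteration.
// Nothing is written to the index, so no write lock is ever taken.
public class HidingIndexReader extends FilterIndexReader {
    private final BitSet hidden;

    public HidingIndexReader(IndexReader in, BitSet hidden) {
        super(in);
        this.hidden = hidden;
    }

    public TermDocs termDocs() throws IOException {
        return new HidingTermDocs(in.termDocs());
    }

    private class HidingTermDocs extends FilterTermDocs {
        HidingTermDocs(TermDocs in) {
            super(in);
        }

        public boolean next() throws IOException {
            while (super.next()) {
                if (!hidden.get(doc())) {
                    return true; // skip docs marked "deleted" in memory
                }
            }
            return false;
        }

        public boolean skipTo(int target) throws IOException {
            if (!super.skipTo(target)) {
                return false;
            }
            while (hidden.get(doc())) {
                if (!super.next()) {
                    return false;
                }
            }
            return true;
        }
    }
}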



--
karl




Re: Document Similarities lucene (particularly using doc id's)

2007-08-19 Thread Lokeya

Hi,

Thanks for your reply.

I can use getTermFreqVector() on IndexReader and get it. But I am
wondering what API has to be used to find the similarity between 2 such
vectors, which would give a score (doc-doc similarity, in essence).

Thanks.
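
In other words, something like the following cosine computation over
two term frequency vectors, which Grant describes below. A sketch
against the Lucene 2.x API; there is no single built-in call, the
class name is made up, and raw counts are used with no idf weighting:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Cosine similarity between the term frequency vectors of two docs.
// Requires the field to have been indexed with term vectors enabled
// (Field.TermVector.YES).
public class DocSimilaritySketch {
    public static double cosine(IndexReader reader, int docA, int docB,
                                String field) throws Exception {
        TermFreqVector a = reader.getTermFreqVector(docA, field);
        TermFreqVector b = reader.getTermFreqVector(docB, field);
        if (a == null || b == null) {
            return 0.0; // no term vector stored for one of the docs
        }

        String[] termsA = a.getTerms();
        int[] countsA = a.getTermFrequencies();
        Map freqsA = new HashMap();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            freqsA.put(termsA[i], new Integer(countsA[i]));
            normA += (double) countsA[i] * countsA[i];
        }

        String[] termsB = b.getTerms();
        int[] countsB = b.getTermFrequencies();
        double dot = 0.0;
        double normB = 0.0;
        for (int i = 0; i < termsB.length; i++) {
            normB += (double) countsB[i] * countsB[i];
            Integer countA = (Integer) freqsA.get(termsB[i]);
            if (countA != null) {
                dot += countA.intValue() * (double) countsB[i];
            }
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}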



Grant Ingersoll-6 wrote:
> 
> Hi,
> 
> 
> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
> 
>>
>> Hi All,
>>
>> I have the following set up: a) Indexed a set of docs. b) Ran 1st query
>> and got top docs. c) Fetched the ids from that and stored them in a data
>> structure. d) Ran 2nd query, got top docs, fetched ids, and stored them
>> in a data structure.
>>
>> Now I have 2 sets of doc ids (set 1 and set 2).
>>
>> I want to find out the document content similarity between these 2 sets
>> (just using the doc id information which I have).
>>
> 
> Not sure what you mean here.  What do the doc ids have to do with the  
> content?
> 
>> Qn 1: Is it possible using any Lucene APIs? In that case, can you point
>> me to the appropriate APIs? I did a search at
>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html
>> but couldn't find anything.
>>
> 
> It is possible if you use Term Vectors (see
> IndexReader.getTermFreqVector). You will need to store the term vectors
> (when you construct your Field), load them, and then calculate the
> similarity. A common way of doing this is by calculating the cosine of
> the angle between the two vectors.
> 
> -Grant
> 
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 






Re: Document Similarities lucene (particularly using doc id's)

2007-08-19 Thread Karl Wettin


On 20 Aug 2007, at 05:19, Lokeya wrote:

> Grant Ingersoll-6 wrote:
>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>> I want to find out the document content similarity
>>
>> A common way of doing this is by calculating the cosine of the angle
>> between the two vectors.
>
> I can use getTermFreqVector() on IndexReader and get it. But I am
> wondering what API has to be used to find the similarity between 2 such
> vectors, which would give a score (doc-doc similarity, in essence).


Bob Carpenter wrote an article on the subject for "Lucene in Action".
He also works on LingPipe, a semi-free piece of software that might be
helpful if your Greek kung fu is too weak.






--
karl




