Re: duplicate fields

2006-09-08 Thread Daniel Noll

jacky wrote:
> hi, 1. Is there an effective way to check whether the same field
> (holding a unique ID) already exists when a document is added to the
> Lucene index? Should I search for this field?


One way is to create an IndexReader and IndexSearcher on your index, 
which you reopen every now and then.  But we do this task by using a 
separate database, for the sake of efficiency.



> 2. Is there an effective way to check whether duplicate fields
> (holding a unique ID) exist in the Lucene index? Two methods come to
> mind: read all documents and compare the fields, or search for each
> field value. Is there a better one?


The simplest way without using an external database is to use the 
termDocs enumeration.  For each term you can easily see which ones have 
multiple documents, so every document other than the first for each term 
is a duplicate (which you could then use to build a filter to remove 
duplicates.)
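
As a rough illustration of the term-by-term check described above, here is a plain-Java sketch. It is only a model: the class `DuplicateIdFinder` and its input list are hypothetical stand-ins for what a term/docs enumeration over the unique-ID field would expose, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of per-term document lists; hypothetical, not Lucene API. */
final class DuplicateIdFinder {

    /**
     * idByDoc.get(docNumber) is the unique-ID term stored in that document.
     * Returns, in ascending order, the document numbers that repeat an ID
     * already used by an earlier document.
     */
    static List<Integer> findDuplicates(List<String> idByDoc) {
        // Group document numbers by term, like walking the documents of each term.
        Map<String, List<Integer>> docsByTerm = new LinkedHashMap<>();
        for (int doc = 0; doc < idByDoc.size(); doc++) {
            docsByTerm.computeIfAbsent(idByDoc.get(doc), k -> new ArrayList<>()).add(doc);
        }
        List<Integer> duplicates = new ArrayList<>();
        for (List<Integer> docs : docsByTerm.values()) {
            // The first document for a term is kept; all later ones are duplicates.
            duplicates.addAll(docs.subList(1, docs.size()));
        }
        Collections.sort(duplicates);
        return duplicates;
    }
}
```

The returned document numbers are the ones a filter would exclude (or that could be deleted) to leave one document per ID.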


Daniel



--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                         Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



delete operation

2006-09-08 Thread jacky
hi,
  I have a question about the delete operation; I have not found any
documentation for it in the Lucene API javadoc:
   When delete(Term term) is called on an IndexReader and committed while an
IndexSearcher is open, the deleted document can still be found until the
IndexSearcher is reopened. I don't know how Lucene does this.
  So when the Lucene index is updated, how can the searchers be notified to
reopen, given that several applications may be searching this index?


 Best Regards.
   jacky  
   

Re: duplicate fields

2006-09-08 Thread jacky
hi Daniel,
   How do you use a separate database to check the duplicate fields?  It is 
interesting!
 
 Best Regards.
   jacky  
   

Changing the Scoring api for OR parameters

2006-09-08 Thread Marcus Falck
Hi everyone,

I want to override the default scoring for queries containing the OR
operator.

For example, suppose I have the following headlines in my index:

"Sun sues Microsoft"
"Microsoft want to buy Tiscali"
".NU domain sues Microsoft"
"The sun is shining"
"Sun brings antitrust suit against Microsoft"

These documents have been boosted in descending order ("Sun sues Microsoft"
has a higher calculated norm value than "Sun brings antitrust suit against
Microsoft").

The Similarity class I use makes the norm value exactly equal to the boost
value (I have even modified the norm to be a float so I won't lose
precision).

If I perform a search for: Microsoft OR Sun

the top-ranked results will almost certainly be:

Sun sues Microsoft
Sun brings antitrust suit against Microsoft

I just want the documents returned like this:

"Sun sues Microsoft"
"Microsoft want to buy Tiscali"
".NU domain sues Microsoft"
"The sun is shining"
"Sun brings antitrust suit against Microsoft"

I have to get this to work since I'm indexing news material and the
customers are only interested in the newest articles (so the date of
the article is used as the boost factor).

Any ideas? My ranking changes to Lucene work as expected for the AND
operator and for single-term queries.
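
A likely reason the OR case behaves differently is that the default scoring, as far as I know, rewards documents that match more of the optional clauses (a coordination factor of matched clauses over total clauses), while AND and single-term queries always match every clause. The sketch below is a deliberately simplified plain-Java model of that effect, not Lucene code: `OrRankingModel`, the formula (term frequency and idf treated as 1) and the boost values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified, hypothetical scoring model; real Lucene scoring has more factors. */
final class OrRankingModel {

    /** Number of query terms found in the headline (lower-cased whitespace tokens). */
    static int overlap(String headline, List<String> queryTerms) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(headline.toLowerCase(Locale.ROOT).split("\\s+")));
        int matched = 0;
        for (String term : queryTerms) {
            if (tokens.contains(term.toLowerCase(Locale.ROOT))) {
                matched++;
            }
        }
        return matched;
    }

    /** With useCoord: coord * (one unit per matched term) * boost. Without: boost only. */
    static double score(String headline, double boost, List<String> queryTerms, boolean useCoord) {
        int matched = overlap(headline, queryTerms);
        if (matched == 0) return 0.0;
        if (!useCoord) return boost;
        double coord = (double) matched / queryTerms.size();
        return coord * matched * boost;
    }

    /** Matching headlines ordered by descending score (ties keep index order). */
    static List<String> rank(List<String> headlines, double[] boosts,
                             List<String> queryTerms, boolean useCoord) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < headlines.size(); i++) {
            if (overlap(headlines.get(i), queryTerms) > 0) order.add(i);
        }
        order.sort((a, b) -> Double.compare(
                score(headlines.get(b), boosts[b], queryTerms, useCoord),
                score(headlines.get(a), boosts[a], queryTerms, useCoord)));
        List<String> ranked = new ArrayList<>();
        for (int i : order) ranked.add(headlines.get(i));
        return ranked;
    }
}
```

With hypothetical boosts of 1.5, 1.4, 1.3, 1.2 and 1.1 for the five headlines, the model with the coordination factor puts the two headlines containing both words first, while the boost-only variant returns them in boost order. In Lucene itself this would presumably mean neutralising the corresponding factors in a custom Similarity, which is an assumption rather than something confirmed in this thread.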

 

Regards,
Marcus Falck



Re: Indexing MS Powerpoint files with Lucene

2006-09-08 Thread Tomi NA

On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Tomi NA wrote:
> > On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
> >> >> Is there any filter available for extracting text from MS
> >> >> Powerpoint files and indexing them?
> >> >> The lucene website suggests the POI project, which, it seems does not
> >> >> support PPT files as of now.
> >> >
> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >
> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> > limited. Don't know exactly how limited, but certainly not useless
> >> > for indexing purposes.
> >>
> >> Support for editing and adding things to PowerPoint files is limited, as
> >> is getting out the finer points of fonts and positioning.
> >
> > Which brings me to another (off)topic: can lucene/nutch assign
> > different weights to tokens in the same document field? An obvious
> > example would be: "this text seems to be in large, bold, blinking
> > letters: I'll assume it's more important than the surrounding 8px
> > text."
>
> No, it can't (at least not yet). As a workaround you can extract these
> portions of text to another field (or multiple fields), and then add
> them with a higher boost. Then, expand your queries so that they also
> include this field. This way, if a query matches these special tokens,
> results will get a higher rank because of the match on this boosted field.


I thought a workaround like that would be needed. Still, it could give
useful results...though as a nutch user, the possibility is mostly
theoretical for me, as probably none of the existing parsers take into
account the formatting information. I could be completely wrong here,
so please, feel free to correct me.

t.n.a.




Re: Highligher Example

2006-09-08 Thread mark harwood
If you have a budget for this stuff then Stellent provide tools for parsing 
multiple document types and also have a viewer that can display documents with 
their original formatting, plus your highlights. See 
http://www.stellent.com/en/products/outside_in/viewer_tech/index.htm

I don't work for Stellent and haven't used it but I do know this stuff is hard 
to do and they are the only ones I'm aware of trying to provide tools to cover 
all document types which is why I mention it. If anyone has any other similar 
recommendations I would be interested to hear them.


- Original Message 
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 8 September, 2006 2:02:47 AM
Subject: Re: Highligher Example

Highlighting a PDF document, last time I looked (quite a while ago), 
involves supplying an xml file that describes offsets for highlighting. 
You can specify the file in the URL. You can also do simple highlighting 
by passing in a list of words to be highlighted, but this does not even 
catch minor differences, like singular to plural.

If someone knows more about using the Lucene highlighter to highlight
PDFs then please speak up. I think I will have to get into this soon.

- Mark

Mag Gam wrote:
> Thanks for the quick response Erik. I will be getting my LIA book back
> very soon, I forgot it at a destination :-(
>
> Let's assume there is a document called "hello.pdf" and it has the
> content "this is hello.pdf. It uses Acrobat"
>
> When I perform a search for "Acrobat", I want hello.pdf to show up,
> and also the 'It uses Acrobat'
>
> something like that.
>
> tia
>
>
>
> On 9/7/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>>
>> There are test cases in the Highlighter codebase that exercise it and
>> show its use, as well as a few examples of it in the "Lucene in
>> Action" codebase.
>>
>> These examples output plain text with some prefix and suffix
>> surrounding the highlighted terms.  Highlighting text in a PDF is
>> possible, I'm pretty sure, but I don't think the same would be easily
>> possible with Microsoft document formats.  I'm not sure if you are
>> asking for these document types to be highlighted or just a plain
>> text representation of them, though.
>>
>> Erik
>>
>> On Sep 7, 2006, at 6:37 PM, Mag Gam wrote:
>>
>> > Hey
>> >
>> > Anyone have a search result highlighter example?
>> >
>> > I have various doc, PDFs, DOC, TXT, PPT, and I would like to show a
>> > highlight, similar to how google does it...
>> >
>> > tia

Preventing short documents from being boosted

2006-09-08 Thread Wright, Tim
Hi all, 

We have an issue where around 10-20% of our documents are much shorter
(only a paragraph or so of text) than all the rest. Because Lucene
considers document length when indexing, most of the time these shorter
documents end up being scored higher than the longer ones. 

We'd prefer it if we could remove the length factor, or at least reduce
the weight of it so that we returned a mixture of long and short
documents. Is there a simple way of doing this? I've considered applying
a document boost based on length, but I'm not quite sure of the equation
I'd have to use to "counter" the innate prioritisation of short
documents.
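
To make the "equation" concrete: as far as I know, the default length normalisation is 1 / sqrt(number of terms in the field), and it is multiplied together with the document and field boosts when the norm is stored at index time. The plain-Java sketch below only works through that arithmetic; `LengthNormModel` and its method names are hypothetical, and the stored norm in a real index is lower precision than a double.

```java
/** Arithmetic sketch of length normalisation; hypothetical helper, not Lucene API. */
final class LengthNormModel {

    /** Default-style norm (assumed): 1 / sqrt(number of terms in the field). */
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Boost that would cancel the default norm: boost * defaultNorm == 1. */
    static double counterBoost(int numTerms) {
        return Math.sqrt(numTerms);
    }

    /**
     * Damped variant (an illustrative choice): documents shorter than minTerms
     * are treated as if they had minTerms, so very short documents stop winning.
     */
    static double dampedNorm(int numTerms, int minTerms) {
        return 1.0 / Math.sqrt(Math.max(numTerms, minTerms));
    }
}
```

Under these assumptions, a document boost of sqrt(numTerms) would cancel the length factor completely, while a damped norm only flattens it for short documents. The usual alternative would be a custom Similarity whose length normalisation returns a constant, followed by re-indexing, since the norm is computed when documents are indexed.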

Cheers,

Tim.


The information contained in this email message may be confidential. If you are 
not the intended recipient, any use, interference with, disclosure or copying 
of this material is unauthorised and prohibited. Although this message and any 
attachments are believed to be free of viruses, no responsibility is accepted 
by Informa for any loss or damage arising in any way from receipt or use 
thereof.  Messages to and from the company are monitored for operational 
reasons and in accordance with lawful business practices. 
If you have received this message in error, please notify us by return and 
delete the message and any attachments.  Further enquiries/returns can be sent 
to [EMAIL PROTECTED]




Re: delete operation

2006-09-08 Thread Michael McCandless
jacky wrote:
>   I have a question about the delete operation; I have not found any
> documentation for it in the Lucene API javadoc:
>    When delete(Term term) is called on an IndexReader and committed while
> an IndexSearcher is open, the deleted document can still be found until the
> IndexSearcher is reopened. I don't know how Lucene does this.
>   So when the Lucene index is updated, how can the searchers be notified to
> reopen, given that several applications may be searching this index?

Lucene doesn't actually have any builtin ability to "notify" all other
searchers that they should re-open.  So you have to do this part yourself.

However, the IndexReader class has an "isCurrent()" method, which you
could periodically call (say once every N minutes or something) to check
if it's time to re-open.
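
The periodic check could be structured roughly as follows. This is a plain-Java sketch with no Lucene dependency: `ReopeningHolder`, the `opener` supplier and the `isCurrent` predicate are hypothetical names, with the predicate standing in for a call to the reader's isCurrent() method.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Holds one shared reader/searcher and swaps it when it is no longer current. */
final class ReopeningHolder<R> {
    private final Supplier<R> opener;      // opens a fresh reader/searcher
    private final Predicate<R> isCurrent;  // stand-in for the reader's isCurrent() check
    private R current;

    ReopeningHolder(Supplier<R> opener, Predicate<R> isCurrent) {
        this.opener = opener;
        this.isCurrent = isCurrent;
        this.current = opener.get();
    }

    /** The instance searches should use right now. */
    synchronized R get() {
        return current;
    }

    /**
     * Call this from a timer every N minutes. Reopens only when the index
     * has changed; returns true if a new instance was opened.
     */
    synchronized boolean refreshIfStale() {
        if (isCurrent.test(current)) {
            return false;
        }
        current = opener.get();
        return true;
    }
}
```

Each searching application would run its own holder and timer. The sketch does not close the old instance; in a real application that has to wait until searches still using it have finished.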

Mike




Re: read past EOF

2006-09-08 Thread Bhavin Pandya

Hi Mike,

> It sounds like you're working with the index correctly, so I don't have
> any other ideas on why you're getting CFS files that are truncated.  I
> would worry about the "cp" step filling up disk, but if you're nowhere near
> filling up disk that's not the root cause here.




I have found the cause of this problem... You were right.
At a particular point in time my hard disk got full, so the index was
corrupted at that time. After that, because of some batch process, the disk
became empty enough again, so I did not see a continuous exception like
"no space left"... but when I went through all the logs I tracked it down
successfully.
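
Since the root cause was a disk that filled up, one possible safeguard is to check free space before the copy step runs. The following is a minimal sketch using only the JDK; the class `CopySpaceCheck`, its method names and the safety margin parameter are illustrative assumptions and not part of the original script.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Pre-flight check before copying an index directory; illustrative only. */
final class CopySpaceCheck {

    /** Total size in bytes of the regular files under dir. */
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    /**
     * True if the file store holding targetDir reports enough usable space
     * for a full copy of sourceDir plus the given safety margin.
     */
    static boolean hasRoomFor(Path sourceDir, Path targetDir, long safetyMarginBytes)
            throws IOException {
        FileStore store = Files.getFileStore(targetDir);
        return store.getUsableSpace() >= directorySize(sourceDir) + safetyMarginBytes;
    }
}
```

A check like this only narrows the window, because other processes can still fill the disk while the copy is running, so the logs remain worth watching for "no space left" errors.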


Thanks for your help

- Bhavin pandya

- Original Message - 
From: "Michael McCandless" <[EMAIL PROTECTED]>

To: 
Sent: Friday, September 01, 2006 5:07 PM
Subject: Re: read past EOF





> Yes, I am sure only one writer at a time is accessing the index.
>
> No, I am not getting any other exception.
>
> And there is no problem with disk space either.
>
> Right now I keep a backup copy of the indexes, so whenever one index gets
> corrupted I replace it with the backup and restart the indexer from that
> point.
>
> Here is the script I am using to move the index after it is built:
>
> - rm -rf backupindex/*
> - mv index backupindex;
> - mv newindex index;
> - mkdir newindex
> - cp -dpR index/* newindex/
> - touch index.done
> - echo "done";
>
> where "newindex" is the index I am using for indexing, "index" is the
> one I am using for search, and "backupindex" contains the previous index.

It sounds like you're working with the index correctly, so I don't have
any other ideas on why you're getting CFS files that are truncated.  I
would worry about the "cp" step filling up disk, but if you're nowhere near
filling up disk that's not the root cause here.

Does this happen intermittently?  Or did it happen once and now it's gone?
Or is it easy to reproduce?

> Is there any way I can check whether an index is corrupt? Right now,
> because of this exception (read past EOF), I have made a few changes in my
> code to check for a corrupt index. I am checking by optimizing: if I get
> an IOException while optimizing the index, I consider the index corrupted
> or assume there is a permission issue.

That's a great question.  I don't know of existing tools for doing this
(anyone else?).  Running optimize is likely a good test, so long as
there's more than 1 segment before optimize (so that it actually does
something).


Mike




Re: Preventing short documents from being boosted

2006-09-08 Thread Grant Ingersoll

http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967

-Grant


--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886







Re: delete operation

2006-09-08 Thread Simon Willnauer

Another way to avoid reopening your IndexSearcher every time you
delete a document is to use a global delete filter which excludes all
deleted documents from your search results. That won't work with
updates unless you use a buffer or something similar, but if you only
have to deal with deletes a filter would do the job, and the
IndexSearcher can remain open until you commit an update or insert
(that's what you would call it in a db context; remember Lucene is an
inverted index, not a database).
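
A minimal sketch of such a global delete filter, modelled in plain Java on lists of document IDs rather than on Lucene's own filter classes; `DeleteFilter` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Remembers IDs deleted since the searcher was opened and hides them from results. */
final class DeleteFilter {
    private final Set<String> deletedIds = ConcurrentHashMap.newKeySet();

    /** Record a delete as soon as it is issued against the index. */
    void markDeleted(String id) {
        deletedIds.add(id);
    }

    /** Drop hits whose ID has been deleted; the order of the remaining hits is kept. */
    List<String> apply(List<String> hitIds) {
        List<String> visible = new ArrayList<>();
        for (String id : hitIds) {
            if (!deletedIds.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }

    /** Call once the searcher has been reopened on an index containing the deletes. */
    void clear() {
        deletedIds.clear();
    }
}
```

The set only has to cover the deletes made since the searcher was last opened, which keeps it small. Across several applications the set would have to be shared somehow, which this sketch does not address.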

best regards simon



Re: read past EOF

2006-09-08 Thread Michael McCandless

Bhavin Pandya wrote:

>> It sounds like you're working with the index correctly, so I don't
>> have any other ideas on why you're getting CFS files that are
>> truncated.  I would worry about the "cp" step filling up disk, but if
>> you're nowhere near filling up disk that's not the root cause here.
>
> I have found the cause of this problem... You were right.
> At a particular point in time my hard disk got full, so the index was
> corrupted at that time. After that, because of some batch process, the
> disk became empty enough again, so I did not see a continuous exception
> like "no space left"... but when I went through all the logs I tracked it
> down successfully.


Phew!  Glad to hear you got down to the root cause and that in fact that 
root cause was "outside" of Lucene :)


Mike




SV: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Marcus Falck
I can't understand why you are interested in storing the directory in a database
using Hibernate. It seems to me like you are trying to mix two good techniques in
a destructive way.




-Original Message-
From: Néstor Boscán [mailto:[EMAIL PROTECTED]
Sent: 8 September 2006 01:49
To: java-user@lucene.apache.org
Subject: Using Hibernate to store Lucene Indexes in a Database

Hi

Has anybody seen a solution that will store Lucene indexes in a database
using Hibernate? Basically a HibernateDirectory, so I can store and retrieve
the indexes from a database?

Regards,

Néstor Boscán







RE: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Ramana Jelda
Hi Marcus,
Somehow I like your wording.
I couldn't resist replying.
Jelda



RE: Highligher Example

2006-09-08 Thread Dejan Nenov
Second that - I was a client of Stellent - the libs work great but are
expensive. To see Stellent in action - get a copy of the free X1 desktop
search or the X1 server (Lucene based).
Another alternative is KeyView from Verity - now Autonomy.



RE: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Néstor Boscán
To reduce administration tasks. If you want to move your application from
server to server you'll have to move the index files. I want to be able to
move my application by just moving my database schema and deploying an ear.

Regards,

Néstor Boscán




RE: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Néstor Boscán
Also if you want to backup your application you just backup the database.

Regards,

Néstor Boscán




Re: SpanRegexQuery causes error

2006-09-08 Thread Erik Hatcher


On Sep 7, 2006, at 9:26 PM, Luke Tan wrote:

spanFirst(spanRegexQuery(monthly:day * of every * months), 10)


What analyzer did you use for your text?   Again, that is not a valid
regular expression.   But also, you're using a single long string of
several words within your SpanRegexQuery term.  What you probably
want is a SpanNearQuery of those fixed terms along with a ".*"
SpanRegexQuery in the wildcarded spots, and then you could nest that
inside a SpanFirstQuery.
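
To show what that nested query is meant to match, here is a plain-Java model of the matching logic only. It builds no Lucene query objects: `SpanFirstModel` is a hypothetical name, and the model assumes that the terms must appear in order with no gaps and that the whole span must end within the first `end` token positions.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Model of "ordered terms with wildcard slots, near the start of the field". */
final class SpanFirstModel {

    /**
     * tokens:      the analyzed field, one entry per position.
     * termRegexes: one regular expression per position; fixed words are literal
     *              regexes and ".*" marks a wildcarded spot.
     * end:         the match must end at or before this position.
     */
    static boolean matches(List<String> tokens, List<String> termRegexes, int end) {
        int n = termRegexes.size();
        for (int start = 0; start + n <= tokens.size() && start + n <= end; start++) {
            boolean ok = true;
            for (int i = 0; i < n && ok; i++) {
                // Each regex must match one whole token, never a run of several words.
                ok = Pattern.matches(termRegexes.get(i), tokens.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

For a field tokenized as "on day 5 of every 3 months we meet", the pattern list "day", ".*", "of", "every", ".*", "months" with an end of 10 matches in this model, because each regular expression is applied to a single token. This is also why one regular expression spanning several words cannot match against an analyzed field.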



java.lang.NullPointerException
java.lang.NullPointerException
   at java.util.Hashtable.get(Hashtable.java:336)
   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:163)
   at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:70)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:129)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
   at org.apache.lucene.search.Hits.<init>(Hits.java:52)
   at org.apache.lucene.search.Searcher.search(Searcher.java:53)



It's likely a bug that you get this particular error, but I think  
you'll get around it by solving the above mentioned issues with your  
query.


Erik







FWD: Re: parser question

2006-09-08 Thread Chris Salem
any help with this?



Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED] 

- Forwarded Message - 
To: Mark Miller <[EMAIL PROTECTED]>
From: Chris Salem <[EMAIL PROTECTED]>
Sent: Wed 9/6/2006 3:58:49 PM
Subject: Re: parser question


its an index of 10 fields and about 10,000 records.



Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED] 

- Original Message - 
To: Chris Salem <[EMAIL PROTECTED]>
From: Mark Miller <[EMAIL PROTECTED]>
Sent: Wed 9/6/2006 2:32:24 PM
Subject: Re: parser question


What are you using as a test index?

- Mark

Chris Salem wrote:
> yes its ANDing them. Doing the query 'software engineer', 'software 
> OR engineer', 'software AND engineer' all return the same results. 
> the generated queries for them respectively are '(field:software 
> field:engineer)', '(field:software field:engineer)' and 
> '(+field:software +field:engineer)'. I do set the default operator to 
> AND and i'm using the MultiFieldQueryParser if that makes a difference 
> (it was doing the same thing with the QueryParser as well).
>
>
> Chris Salem
> 440.946.5214 x5458
> [EMAIL PROTECTED] 
> 
>
> - Original Message -
> *To:* java-user@lucene.apache.org
> *From:* Mark Miller <[EMAIL PROTECTED]>
> *Sent:* Wed 9/6/2006 12:57:44 PM
> *Subject:* Re: parser question
>
> Are you sure it is anding them?
>
> field:software field:engineer
>
> indicates an OR operation.
>
> +field:software +field:engineer
>
> indicates an AND operation.
>
> - Mark
>
>
>
>
>
> Chris Salem wrote:
> > i set the default operator to AND, but if i have a query with an
> > OR in it it doesn't work, for example, if i have the query
> > 'software OR engineer' the parser interprets it as 'field:software
> > field:engineer' and AND's them. how would i fix this?
> >
> >
> > Chris Salem
> > 440.946.5214 x5458
> > [EMAIL PROTECTED]
> >
> > - Original Message -
> > To: java-user@lucene.apache.org
> > From: Mark Miller <[EMAIL PROTECTED]>
> > Sent: Tue 9/5/2006 5:38:50 PM
> > Subject: Re: parser question
> >
> >
> > QueryParser.setDefaultOperator(Operator op)
> >
> > Chris Salem wrote:
> >
> >> With all the parsers I have tried, a space in a query, such as
> >> in a search for "sales manager", is interpreted as an OR.
> >> Is there a way to change it so that a space is interpreted as an AND?
> >>
> >>
> >> Chris Salem
> >> 440.946.5214 x5458
> >> [EMAIL PROTECTED]
> >>



Re: FWD: Re: parser question

2006-09-08 Thread Michael D. Curtin

If your question is why are the queries
'(field:software field:engineer)'
and
'(+field:software +field:engineer)'
returning the same results, it could be because none of your documents have 
*only* "software" *or* "engineer", i.e. they all have both words or neither. 
You could test this theory with a db containing only the 4 documents below 
(one per line):


other unrelated words
the software
the engineer
the software engineer
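
A minimal sketch of this test in plain Java, assuming simple whitespace tokenization and no Lucene at all. The class and method names (BooleanClauseDemo, matchAny, matchAll) are made up for illustration; matchAny stands in for the optional-clause query '(field:software field:engineer)' and matchAll for the required-clause query '(+field:software +field:engineer)':

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java simulation (not Lucene) of optional vs. required clauses. */
final class BooleanClauseDemo {

    /** Splits a document into lower-case, whitespace-separated tokens. */
    static Set<String> tokens(String doc) {
        return new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
    }

    /** OR semantics: a document matches if it contains at least one term. */
    static List<String> matchAny(List<String> docs, String... terms) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            Set<String> docTokens = tokens(doc);
            for (String term : terms) {
                if (docTokens.contains(term)) {
                    hits.add(doc);
                    break; // one matching term is enough
                }
            }
        }
        return hits;
    }

    /** AND semantics: a document matches only if it contains every term. */
    static List<String> matchAll(List<String> docs, String... terms) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (tokens(doc).containsAll(Arrays.asList(terms))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "other unrelated words",
                "the software",
                "the engineer",
                "the software engineer");
        System.out.println("OR : " + matchAny(docs, "software", "engineer"));
        System.out.println("AND: " + matchAll(docs, "software", "engineer"));
    }
}
```

With these four documents the OR form matches three of them and the AND form matches only the last one, so if both of your real queries return identical hits on such an index, the parser is the likelier culprit.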

Good luck!

--MDC




Re: how to index rdf/owl file using lucene

2006-09-08 Thread Simon Willnauer

Just curious!
RDF/OWL is XML, right? So just pick any XML API, or use Java's
built-in DOM/SAX support, extract the content you want to index,
create your fields, and pass the created document to the
IndexWriter. There you go.
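
A rough sketch of the extraction step, using only the JDK's built-in DOM parser. It assumes the simplest RDF/XML shape (an rdf:Description element whose children are literal properties); the class name RdfFieldExtractor, the "about" key, and the sample data are made up for illustration. Turning each map entry into a Lucene Field and handing the document to the IndexWriter is left out here:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: pulls name/value pairs out of simple RDF/XML. */
final class RdfFieldExtractor {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns property local name -> text for the first rdf:Description. */
    static Map<String, String> extractFields(String rdfXml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to resolve the rdf: prefix
            Document dom = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList descriptions = dom.getElementsByTagNameNS(RDF_NS, "Description");
            if (descriptions.getLength() == 0) {
                return fields; // nothing to index
            }
            Element description = (Element) descriptions.item(0);
            // The resource URI makes a natural unique-ID field.
            fields.put("about", description.getAttributeNS(RDF_NS, "about"));
            NodeList children = description.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getLocalName(), child.getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String rdf =
                "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<rdf:Description rdf:about=\"http://example.org/doc1\">"
                + "<dc:title>Sample title</dc:title>"
                + "<dc:creator>Sample Author</dc:creator>"
                + "</rdf:Description></rdf:RDF>";
        System.out.println(extractFields(rdf));
    }
}
```

Nested resources, blank nodes, and OWL class axioms would need more than this flat walk; for those a dedicated RDF parser is probably the better starting point.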

best regards Simon

On 9/7/06, khgcutg hsowhj <[EMAIL PROTECTED]> wrote:

Hi All,

How can we index an RDF/OWL file? Can anyone provide a small example,
related papers, or any kind of literature on indexing and searching
RDF/OWL files using Lucene? Any kind of help is appreciated.

  Regards,
  phani.







Re: delete operation

2006-09-08 Thread karl wettin
On Fri, 2006-09-08 at 15:27 +0800, jacky wrote:
> 
>   So  when the lucene database is updated, how to notify to reopen the
> IndexSearcher since there may be several applications to search this
> lucene database? 

Jira issue 550 contains easy-to-use decorated notification code that
will do all that, given that all listeners are running in the same JVM.
It does, however, require minor patching of the head.





Re: Preventing short documents from being boosted

2006-09-08 Thread Daniel Naber
On Freitag 08 September 2006 13:30, Grant Ingersoll wrote:

> http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967

I'd be happy about feedback about that similarity class, i.e. whether 
someone has used it successfully. If so, we could add it to the Lucene 
core (the old similarity would stay the default though).

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Changing the Scoring api for OR parameters

2006-09-08 Thread Chris Hostetter

If you are already setting the document boost based on the "date" of the
Document, then the next thing you should familiarize yourself with is the
Similarity.coord function.  Its specific purpose is dealing with
Queries which "aggregate" other queries (like a BooleanQuery does with
its clauses), to let you determine how much "penalty" documents receive
for only matching on a subset of the clauses.

(the conventional wisdom being that a document matching more clauses is
"better" than a document matching few clauses, even if the aggregate
score of the second document is a little higher than the first)

If you don't like that conventional wisdom, you can make the coord method
of your Similarity return a constant value (like "1") to eliminate its
impact completely, or define some function that has a lower impact than
the overlap/maxOverlap algorithm used in the DefaultSimilarity.
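
To make the arithmetic concrete, here is a plain-Java sketch (no Lucene classes) of the three options. defaultCoord mirrors the overlap/maxOverlap ratio described above; constantCoord and softCoord are hypothetical alternatives, and the square root in softCoord is just one arbitrary choice of a gentler curve. In a real Similarity subclass the same bodies would go into an overridden coord(int overlap, int maxOverlap):

```java
/** Plain-Java illustration of coord factors; names are made up. */
final class CoordDemo {

    /** Fraction of the query's clauses that the document matched. */
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Constant factor: matching fewer clauses carries no penalty. */
    static float constantCoord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    /** Hypothetical softer penalty: square root of the default ratio. */
    static float softCoord(int overlap, int maxOverlap) {
        return (float) Math.sqrt(overlap / (double) maxOverlap);
    }

    public static void main(String[] args) {
        // "Microsoft OR Sun": a headline matching one of the two clauses.
        System.out.println("default : " + defaultCoord(1, 2));
        System.out.println("constant: " + constantCoord(1, 2));
        System.out.println("soft    : " + softCoord(1, 2));
    }
}
```

With the default ratio a one-of-two match has its score halved, which is what lets the two-term headlines outrank newer one-term ones regardless of the date boost.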

Since eliminating the impact completely is a common need among
"artificially" constructed queries (like Prefix queries and Wildcard
queries) there is actually a constructor for BooleanQuery that takes in a
boolean to do this.



Lastly: if you want more exact control over the way the dates of your
documents influence the score (without mucking with norms to make them
floats instead of bytes) consider using FunctionQuery ... searching the
archives will pop up several examples of how it can be used, and you can
find it in the Solr code base...

http://incubator.apache.org/solr/
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html


: Date: Fri, 8 Sep 2006 09:45:13 +0200
: From: Marcus Falck <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Changing the Scoring api for OR parameters
:
: Hi everyone,
:
:
:
: I want to override the default scoring when it comes to queries
: containing the OR operator.
:
:
:
: For example, if I have the following headlines in my index:
:
: "Sun sues Microsoft"
:
: "Microsoft want to buy Tiscali"
:
: ".NU domain sues Microsoft"
:
: "The sun is shining"
:
: "Sun brings antitrust suit against Microsoft"
:
:
:
: Those documents have been boosted in descending fashion ("Sun sues Microsoft"
: has a higher calculated norm value than "Sun brings antitrust suit against
: Microsoft").
:
: The similarity class that has been used makes the norm values exactly
: equal to the boost value (I have even modified the norm to be a float
: so I won't lose precision).
:
:
:
: If I perform a search for: Microsoft OR Sun
:
:
:
: The top-ranked results will almost certainly be:
:
: Sun sues Microsoft
:
: Sun Brings antitrust suit against Microsoft
:
: 
:
:
:
: I just want the documents returned like this:
:
: "Sun sues Microsoft"
:
: "Microsoft want to buy Tiscali"
:
: ".NU domain sues Microsoft"
:
: "The sun is shining"
:
: "Sun brings antitrust suit against Microsoft"
:
:
:
: I have to get this to work since I'm indexing news material and the
: customers are only interested in the newest articles ( so the date of
: the article is being used as a boost factor).
:
:
:
: Any ideas? My ranking changes to Lucene work as expected when it comes to
: the AND operator and single-term queries.
:
:
:
: /
:
: Regards
:
:
:
: Marcus Falck
:
:
:
:



-Hoss





Re: SpanRegexQuery causes error

2006-09-08 Thread Luke Tan

I tried .* too, but it gave the same error. I think it's a bug.

I solved it using SpanTermQuery, with the search phrase broken into the terms
day
of
every
months

and I nested these SpanTermQuery instances in a SpanNearQuery with slop > 1.

Thanks.

On 9/9/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:



On Sep 7, 2006, at 9:26 PM, Luke Tan wrote:
> spanFirst(spanRegexQuery(monthly:day * of every * months), 10)

What analyzer did you use for your text?   Again, that is not a valid
regular expression.   But also, you're using a single long string of
several words within your SpanRegexQuery term.  What you probably
want is a SpanNearQuery of those fixed terms along with a ".*"
SpanRegexQuery in the wildcarded spots, and then you could nest that
inside a SpanFirstQuery.

> java.lang.NullPointerException
> java.lang.NullPointerException
>    at java.util.Hashtable.get(Hashtable.java:336)
>    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:163)
>    at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:70)
>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:129)
>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
>    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
>    at org.apache.lucene.search.Hits.<init>(Hits.java:52)
>    at org.apache.lucene.search.Searcher.search(Searcher.java:53)


It's likely a bug that you get this particular error, but I think
you'll get around it by solving the above mentioned issues with your
query.

Erik








Re: SpanRegexQuery causes error

2006-09-08 Thread Luke Tan

I use an analyzer with only LowerCaseTokenizer (no stop words or any other
special treatment). The phrase is tokenized.

On 9/9/06, Luke Tan wrote:


I tried .* too but it gave the same error. I think it's a bug.

I solve it using SpanTermQuery where the search phrase is broken into
day
of
every
months

and I nest these SpanTermQuery into SpanNearQuery with slop > 1.

Thanks.


On 9/9/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On Sep 7, 2006, at 9:26 PM, Luke Tan wrote:
> > spanFirst(spanRegexQuery(monthly:day * of every * months), 10)
>
> What analyzer did you use for your text?   Again, that is not a valid
> regular expression.   But also, you're using a single long string of
> several words within your SpanRegexQuery term.  What you probably
> want is a SpanNearQuery of those fixed terms along with a ".*"
> SpanRegexQuery in the wildcarded spots, and then you could nest that
> inside a SpanFirstQuery.
>
> > java.lang.NullPointerException
> > java.lang.NullPointerException
> >    at java.util.Hashtable.get(Hashtable.java:336)
> >    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:163)
> >    at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:70)
> >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:129)
> >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
> >    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
> >    at org.apache.lucene.search.Hits.<init>(Hits.java:52)
> >    at org.apache.lucene.search.Searcher.search(Searcher.java:53)
>
>
> It's likely a bug that you get this particular error, but I think
> you'll get around it by solving the above mentioned issues with your
> query.
>
> Erik
>
>
>
>
>
>



Re: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Tomi NA

On 9/8/06, Néstor Boscán <[EMAIL PROTECTED]> wrote:

To reduce administration tasks. If you want to move your application from
server to server you'll have to move the index files. I want to be able to
move my application by just moving my database schema and deploying an ear.

Regards,

Néstor Boscán


Funny, I felt the same way about file-based storage: you simply pack
it up using any of the numerous file transfer tools available and you
don't have to worry about any of the database issues (possible
uncompressed large dump over the network, is the database server
running etc.).
On the other hand, if your application utilizes a database anyway, it
might be doable, assuming the app can take the performance penalty.
I'd be hard pressed to come up with a scenario where the gains
(simpler backup) would outweigh the losses (having to learn to store
the index into the database, performance, database bloat), though.
Still, it might only be my lack of imagination that's the problem. :)

t.n.a.




Using SpanRegexQuery to search year like 200?

2006-09-08 Thread Luke Tan

Hi,

Can this be used to search for the years 2000, 2001, 2002, ... 2009?

SpanFirstQuery snq = new SpanFirstQuery(new SpanRegexQuery(new Term("year",
"200?")), 1);


I need to use it to search something like

Who is born in 200?

Thanks


RE: Using Hibernate to store Lucene Indexes in a Database

2006-09-08 Thread Néstor Boscán
Tomi, thanks for your thoughts. I'm new to Lucene, so, coming from an Oracle
background, my mindset is that everything goes inside the database. Now that
I know some of the losses, I have a better picture.

Regards,

Néstor Boscán

-Original Message-
From: Tomi NA [mailto:[EMAIL PROTECTED]]
Sent: Friday, 8 September 2006 05:21 p.m.
To: java-user@lucene.apache.org
Subject: Re: Using Hibernate to store Lucene Indexes in a Database

On 9/8/06, Néstor Boscán <[EMAIL PROTECTED]> wrote:
> To reduce administration tasks. If you want to move your application from
> server to server you'll have to move the index files. I want to be able to
> move my application by just moving my database schema and deploying an ear.
>
> Regards,
>
> Néstor Boscán

Funny, I felt the same way about file-based storage: you simply pack
it up using any of the numerous file transfer tools available and you
don't have to worry about any of the database issues (possible
uncompressed large dump over the network, is the database server
running etc.).
On the other hand, if your application utilizes a database anyway, it
might be doable, assuming the app can take the performance penalty.
I'd be hard pressed to come up with a scenario where the gains
(simpler backup) would outweigh the losses (having to learn to store
the index into the database, performance, database bloat), though.
Still, it might only be my lack of imagination that's the problem. :)

t.n.a.




Re: SpanRegexQuery causes error

2006-09-08 Thread Erik Hatcher
We welcome you to package up this issue into a JUnit test case to
demonstrate the bug, so that we can add it to our suite and fix the
issue.  I can't say for certain it's a bug just yet, but it seems
suspicious.  A simple JUnit test that could replicate this would be
most helpful!


Thanks,
Erik


On Sep 8, 2006, at 7:45 PM, Luke Tan wrote:


I tried .* too but it gave the same error. I think it's a bug.

I solve it using SpanTermQuery where the search phrase is broken into
day
of
every
months

and I nest these SpanTermQuery into SpanNearQuery with slop > 1.

Thanks.

On 9/9/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:



On Sep 7, 2006, at 9:26 PM, Luke Tan wrote:
> spanFirst(spanRegexQuery(monthly:day * of every * months), 10)

What analyzer did you use for your text?   Again, that is not a valid
regular expression.   But also, you're using a single long string of
several words within your SpanRegexQuery term.  What you probably
want is a SpanNearQuery of those fixed terms along with a ".*"
SpanRegexQuery in the wildcarded spots, and then you could nest that
inside a SpanFirstQuery.

> java.lang.NullPointerException
> java.lang.NullPointerException
>    at java.util.Hashtable.get(Hashtable.java:336)
>    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:163)
>    at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:70)
>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:129)
>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
>    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
>    at org.apache.lucene.search.Hits.<init>(Hits.java:52)
>    at org.apache.lucene.search.Searcher.search(Searcher.java:53)


It's likely a bug that you get this particular error, but I think
you'll get around it by solving the above mentioned issues with your
query.

Erik







Re: Using SpanRegexQuery to search year like 200?

2006-09-08 Thread Erik Hatcher
To use SpanRegexQuery, you need to understand regular expressions.
The WildcardQuery syntax is _NOT_ the same as SpanRegexQuery syntax.
WildcardQuery supports a ? for single character match and * for
multiple characters.  SpanRegexQuery uses standard regular expression
syntax.


"200?" matches 20 and 200, but not 2001 (using java.util.regex, that  
is).




"X? matches X, once or not at all"

Use "200.?" perhaps, or, more appropriately for matching any year from
2000 to 2009, "200\d".
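
You can check these patterns outside Lucene with a few lines of java.util.regex. This is only a sketch: the class name YearRegexDemo is made up, and it assumes the pattern has to match the whole term, which is what Pattern.matches does. Note that in Java source the backslash must be doubled, so the pattern is written "200\\d":

```java
import java.util.regex.Pattern;

/** Tries candidate year patterns against sample terms (no Lucene). */
final class YearRegexDemo {

    /** True if the regex matches the entire term. */
    static boolean matches(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    public static void main(String[] args) {
        String[] patterns = {"200?", "200.?", "200\\d"};
        String[] terms = {"20", "200", "2001", "2009", "200x"};
        for (String pattern : patterns) {
            StringBuilder hits = new StringBuilder();
            for (String term : terms) {
                if (matches(pattern, term)) {
                    hits.append(term).append(' ');
                }
            }
            System.out.println(pattern + " matches: " + hits.toString().trim());
        }
    }
}
```

"200.?" also accepts non-digits such as "200x", which is why "200\d" is the tighter choice for a year field.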


Erik




On Sep 8, 2006, at 8:50 PM, Luke Tan wrote:


Hi,

Can this be use to search year 2000, 2001, 2002, ... 2009?

SpanFirstQuery snq = new SpanFirstQuery(new SpanRegexQuery(new Term("year",
"200?")), 1);


I need to use it to search something like

Who is born in 200?

Thanks


