Opposite to StopFilter. Anything already implemented out there?

2008-07-22 Thread mpermar

Hi All, 

I want to index some incoming text. In this case, what I want to do is just
detect keywords in that text, so I want to discard everything that is not in
the keyword set. This sounds pretty much like the reverse of using stop
words; that is, I want to use a set of "accepted" words.

So I plan to create a new filter that checks whether incoming words are in
the "accepted set" and discards them otherwise. Are you aware of any
analyzer/filter out there that uses this approach? Is there any better way
to do this?

Best Regards,
Martin
-- 
View this message in context: 
http://www.nabble.com/Opposite-to-StopFilter.-Anything-already-implemented-out-there--tp18585878p18585878.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Memory leaks during indexing.

2008-07-22 Thread Antony Joseph
Hi all,

I am using Lucene 2.3.1 and JCC 1.6 to create an index for my Python-based
application (for searching). Everything is working fine, but after some time
(about 3 hours) I found that my Python process's memory consumption had grown
very high: when I started the application (indexing), the Python process was
using 40 MB; after 3 hours it shows 140 MB. Indexing performance becomes poor
and memory leaks.

Please help me solve this problem.

Thanks
Antony







-- 
Antony Joseph A
DigitalGlue
[EMAIL PROTECTED]
T: +91 22 30601091


escaping logical operators such as OR AND

2008-07-22 Thread Aravind . Yarram
Hello all,

In my project we are indexing the US states. When we try to search on
Oregon with state:OR, the search on OR throws an error. I know OR is a
logical operator in Lucene; is there a way to escape such keywords?

Thanks!

Regards, 
Aravind R Yarram
Enabling Technologies
Equifax Information Services LLC
1525 Windward Concourse, J42E
Alpharetta, GA 30005
desk: 770 740 6951
email: [EMAIL PROTECTED] 
This message contains information from Equifax Inc. which may be confidential 
and privileged.  If you are not an intended recipient, please refrain from any 
disclosure, copying, distribution or use of this information and note that such 
actions are prohibited.  If you have received this transmission in error, 
please notify by e-mail [EMAIL PROTECTED]



Re: How to avoid duplicate records in lucene

2008-07-22 Thread Erick Erickson
Well, the point of my question was to ensure that we were all using common
terms. For all we know, the original questioner considered "duplicate"
records to be ones that had identical, or even similar, text. Nothing in the
original question indicated any de-duping happening.

I've often found that assumptions that we are all talking about the same
thing are...er...incorrect. And I don't want to waste my time answering
questions that weren't what was asked.

Best
Erick

On Mon, Jul 21, 2008 at 2:44 PM, markharw00d <[EMAIL PROTECTED]>
wrote:

> >>could you define duplicate?
>
> That's your choice of field that you want to de-dup on.
> That could be a field such as "DatabasePrimaryKey" or perhaps a field
> containing an MD5 hash of document content.
> The DuplicateFilter ensures only one document can exist in results for each
> unique value for the choice of field.
>
> Cheers
> Mark
>
> Erick Erickson wrote:
>
>> could you define duplicate? As far as I know, you don't
>> get the same (internal) doc id back more than once, so what
>> is a duplicate?
>>
>> Best
>> Erick
>>
>> On Mon, Jul 21, 2008 at 9:40 AM, Sebastin <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>>> at the time search , while querying the data
>>> markrmiller wrote:
>>>
>>>
 Sebastin wrote:


> Hi All,
>
> Is there any possibility to avoid duplicate records in lucene  2.3.1?
>
>
>
 I don't believe that there is a very high performance way to do this.
 You are basically going to have to query the index for an id before
 adding a new doc. The best way I can think of off the top of my head is
 to batch - first check that ids in the batch are unique, then check all
 ids in the batch against the IndexReader, then add the ones that are not
 dupes. Of course all of your docs would have to be added through this
 single choke point so that you knew other threads had not added that id
 after the first thread had looked but before it added the doc.

 I think Mark H has you covered if getting the dupes out after are okay.

 - Mark






>>> --
>>> View this message in context:
>>>
>>> http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>  
>>
>> No virus found in this incoming message.
>> Checked by AVG. Version: 7.5.526 / Virus Database: 270.5.3/1563 - Release
>> Date: 20/07/2008 12:59
>>
>>
>
>
>
>
>
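The batch approach Mark Miller describes in the quoted thread can be sketched roughly as follows. This is a plain-Java sketch of the idea only: the indexedIds set is a hypothetical stand-in for looking ids up through an IndexReader, and the class name is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BatchDedupDemo {
    // Hypothetical stand-in for "is this id already in the index?";
    // a real version would do a term lookup via an IndexReader.
    static Set<String> indexedIds = new HashSet<String>();

    // Step 1: drop duplicates within the batch; step 2: drop ids already
    // present in the index. Must run through a single choke point so no
    // other thread can add an id between the check and the add.
    public static List<String> idsToAdd(List<String> batch) {
        Set<String> unique = new LinkedHashSet<String>(batch); // de-dup within batch
        unique.removeAll(indexedIds);                          // drop already-indexed ids
        return new ArrayList<String>(unique);
    }

    public static void main(String[] args) {
        indexedIds.add("doc1");
        System.out.println(idsToAdd(Arrays.asList("doc1", "doc2", "doc2", "doc3")));
        // prints [doc2, doc3]
    }
}
```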


Re: escaping logical operators such as OR AND

2008-07-22 Thread Erick Erickson
Have you tried lower-casing them? To be treated as operators, they
must be upper-case.

But be careful that, when you lower-case them, your query analyzer doesn't
treat them as stop words.

Best
Erick
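The lower-casing workaround can be illustrated with a tiny sketch. The class and helper names here are hypothetical; the point is only that the value is lower-cased before the query string ever reaches the QueryParser, so no upper-case OR/AND/NOT operator is seen.

```java
public class OperatorEscapeDemo {
    // Lower-case a user-supplied value before building the query string,
    // so the parser sees the term "or" rather than the operator "OR".
    public static String stateClause(String field, String value) {
        return field + ":" + value.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(stateClause("state", "OR")); // prints state:or
    }
}
```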

On Tue, Jul 22, 2008 at 9:28 AM, <[EMAIL PROTECTED]> wrote:

> helo all,
>
> In my project, we are indexing the US states...when we try to search on
> oregon ; state:OR, search on OR is throwing err...i know OR is a logical
> op in lucene...is there a way to escape such keywords?
>
> tx!
>
> Regards,
> Aravind R Yarram
> Enabling Technologies
> Equifax Information Services LLC
> 1525 Windward Concourse, J42E
> Alpharetta, GA 30005
> desk: 770 740 6951
> email: [EMAIL PROTECTED]
> This message contains information from Equifax Inc. which may be
> confidential and privileged.  If you are not an intended recipient, please
> refrain from any disclosure, copying, distribution or use of this
> information and note that such actions are prohibited.  If you have received
> this transmission in error, please notify by e-mail [EMAIL PROTECTED]
> .
>


Parametric/faceted Searching

2008-07-22 Thread WY-LAC

I am looking for sample code that would do the following:

On the first page, display parametric fields:

Topics
ALL
Births, Marriages and Death (1200)   - major category
- Divorces in Canada (750)           - sub category
- Deaths (450)                       - sub category

Clicking on the major category would display:

Topics

Births, Marriages and Death (1200)   - major category
- Divorces in Canada (750)           - sub category
- Deaths (450)                       - sub category

Clicking on a sub category (Divorces in Canada) would display:

Topics

Births, Marriages and Death          - major category
- Divorces in Canada (750)           - sub category





 
-- 
View this message in context: 
http://www.nabble.com/Parametric-faceted-Searching-tp18587632p18587632.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
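A crude sketch of how such facet counts could be computed, assuming each document carries a "Major/Sub" category path (a hypothetical data model; a real Lucene implementation would count via cached filters or term enumeration over the index):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetCountDemo {
    // Count documents per category, rolling each sub-category path up into
    // its major category as well.
    public static Map<String, Integer> countFacets(List<String> docCategories) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String path : docCategories) {
            String major = path.substring(0, path.indexOf('/'));
            increment(counts, path);   // count the sub category
            increment(counts, major);  // and its major category
        }
        return counts;
    }

    private static void increment(Map<String, Integer> counts, String key) {
        Integer current = counts.get(key);
        counts.put(key, current == null ? 1 : current + 1);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Births, Marriages and Death/Divorces in Canada",
                "Births, Marriages and Death/Deaths",
                "Births, Marriages and Death/Divorces in Canada");
        System.out.println(countFacets(docs));
    }
}
```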


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: escaping logical operators such as OR AND

2008-07-22 Thread Aravind . Yarram
Lower-casing worked, thanks. But is there a way of escaping them, like we
use escape characters in Java?

Regards, 
Aravind R Yarram
Enabling Technologies
Equifax Information Services LLC
1525 Windward Concourse, J42E
Alpharetta, GA 30005
desk: 770 740 6951
email: [EMAIL PROTECTED] 



"Erick Erickson" <[EMAIL PROTECTED]> 
07/22/2008 09:40 AM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: escaping logical operators such as OR AND






Have you tried lower-casing them? To be treated as an operator, they
must be upper cased.

But be careful that, when you lower-case them, your query analyzer doesn't
treat them as stop words

Best
Erick

On Tue, Jul 22, 2008 at 9:28 AM, <[EMAIL PROTECTED]> wrote:

> helo all,
>
> In my project, we are indexing the US states...when we try to search on
> oregon ; state:OR, search on OR is throwing err...i know OR is a logical
> op in lucene...is there a way to escape such keywords?
>
> tx!
>
> Regards,
> Aravind R Yarram
> Enabling Technologies
> Equifax Information Services LLC
> 1525 Windward Concourse, J42E
> Alpharetta, GA 30005
> desk: 770 740 6951
> email: [EMAIL PROTECTED]
> This message contains information from Equifax Inc. which may be
> confidential and privileged.  If you are not an intended recipient, 
please
> refrain from any disclosure, copying, distribution or use of this
> information and note that such actions are prohibited.  If you have 
received
> this transmission in error, please notify by e-mail 
[EMAIL PROTECTED]
> .
>


This message contains information from Equifax Inc. which may be confidential 
and privileged.  If you are not an intended recipient, please refrain from any 
disclosure, copying, distribution or use of this information and note that such 
actions are prohibited.  If you have received this transmission in error, 
please notify by e-mail [EMAIL PROTECTED]


RE: Opposite to StopFilter. Anything already implemented out there?

2008-07-22 Thread Steven A Rowe
Hi Martin,

On 07/22/2008 at 5:48 AM, mpermar wrote:
> I want to index some incoming text. In this case what I want
> to do is just detect keywords in that text. Therefore I want
> to discard everything that is not in the keywords set. This
> sounds to me pretty much like the reverse of using stop words,
> that is it I want to use a set of "accepted" words.
> 
> So I planned to create a new filter that just checks that
> incoming words are in the "acceptable set" and discards them
> otherwise. Are you aware of any analyzer/filter out there that
> uses this approach? Is there any other better way to do this?

Solr has KeepWordFilter - it sounds exactly like what you want: 

Javadoc: 

Source: 


Depending on your requirements and the nature of your keywords list, you might 
consider applying this filter only to queries, rather than at index time.  That 
way, the keyword list can change without having to re-index.
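The core of such a keep-word filter is just the inverse of a stop-word check. A minimal plain-Java sketch of the idea (not the actual Solr class, and without the Lucene TokenFilter plumbing; class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeepWordDemo {
    // Inverse of a StopFilter: keep a token only if it IS in the accepted set.
    public static List<String> keepOnly(List<String> tokens, Set<String> accepted) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (accepted.contains(token)) {
                kept.add(token);  // token is a keyword: pass it through
            }
            // otherwise drop it, the opposite of the stop-word case
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<String>(Arrays.asList("lucene", "index"));
        List<String> tokens = Arrays.asList("the", "lucene", "index", "rocks");
        System.out.println(keepOnly(tokens, keywords)); // prints [lucene, index]
    }
}
```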

Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory leaks during indexing.

2008-07-22 Thread Michael McCandless


Can you post the Python sources of the Lucene part of your application?

One thing to check is how the JRE is being instantiated from Python,
i.e., what the equivalent setting is for -Xmx (the max heap size). It's
possible the 140 MB consumption is actually "OK" as far as the JRE is
concerned, if it was told it's allowed to use that much heap (or more).


Try letting it continue to run for a very long time to see how the
size grows... if it's really a leak, it will truly keep growing and
eventually hit an OutOfMemory exception in Java or Python; if instead
it's just the JRE thinking it's allowed to use that much heap, it
should at some point level off, though possibly at a high value.


Are you able to get the same memory leak to happen with a Java-only
version of the Lucene part of your application?


Mike

Antony Joseph wrote:


Hi all,

I am using Lucene 2.3.1 and JCC 1.6 to create an index for my Python-based
application (for searching). Everything is working fine, but after some time
(about 3 hours) I found that my Python process's memory consumption had grown
very high: when I started the application (indexing), the Python process was
using 40 MB; after 3 hours it shows 140 MB. Indexing performance becomes poor
and memory leaks.

Please help me solve this problem.

Thanks
Antony







--
Antony Joseph A
DigitalGlue
[EMAIL PROTECTED]
T: +91 22 30601091



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: escaping logical operators such as OR AND

2008-07-22 Thread Erick Erickson
I haven't ever tried, so I don't know... but my poor memory doesn't bring
any to mind.

Best
Erick

On Tue, Jul 22, 2008 at 9:53 AM, <[EMAIL PROTECTED]> wrote:

> lower-casing worked...tx...but is there a way of escaping them like we use
> escape characters in java!
>
> Regards,
> Aravind R Yarram
> Enabling Technologies
> Equifax Information Services LLC
> 1525 Windward Concourse, J42E
> Alpharetta, GA 30005
> desk: 770 740 6951
> email: [EMAIL PROTECTED]
>
>
>
> "Erick Erickson" <[EMAIL PROTECTED]>
> 07/22/2008 09:40 AM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: escaping logical operators such as OR AND
>
>
>
>
>
>
> Have you tried lower-casing them? To be treated as an operator, they
> must be upper cased.
>
> But be careful that, when you lower-case them, your query analyzer doesn't
> treat them as stop words
>
> Best
> Erick
>
> On Tue, Jul 22, 2008 at 9:28 AM, <[EMAIL PROTECTED]> wrote:
>
> > helo all,
> >
> > In my project, we are indexing the US states...when we try to search on
> > oregon ; state:OR, search on OR is throwing err...i know OR is a logical
> > op in lucene...is there a way to escape such keywords?
> >
> > tx!
> >
> > Regards,
> > Aravind R Yarram
> > Enabling Technologies
> > Equifax Information Services LLC
> > 1525 Windward Concourse, J42E
> > Alpharetta, GA 30005
> > desk: 770 740 6951
> > email: [EMAIL PROTECTED]
> > This message contains information from Equifax Inc. which may be
> > confidential and privileged.  If you are not an intended recipient,
> please
> > refrain from any disclosure, copying, distribution or use of this
> > information and note that such actions are prohibited.  If you have
> received
> > this transmission in error, please notify by e-mail
> [EMAIL PROTECTED]
> > .
> >
>
>
> This message contains information from Equifax Inc. which may be
> confidential and privileged.  If you are not an intended recipient, please
> refrain from any disclosure, copying, distribution or use of this
> information and note that such actions are prohibited.  If you have received
> this transmission in error, please notify by e-mail [EMAIL PROTECTED]
> .
>


RE: Opposite to StopFilter. Anything already implemented out there?

2008-07-22 Thread mpermar

Absolutely!

Thanks Steven. 

Best Regards,
Martin


Steven A Rowe wrote:
> 
> Hi Martin,
> 
> On 07/22/2008 at 5:48 AM, mpermar wrote:
>> I want to index some incoming text. In this case what I want
>> to do is just detect keywords in that text. Therefore I want
>> to discard everything that is not in the keywords set. This
>> sounds to me pretty much like the reverse of using stop words,
>> that is it I want to use a set of "accepted" words.
>> 
>> So I planned to create a new filter that just checks that
>> incoming words are in the "acceptable set" and discards them
>> otherwise. Are you aware of any analyzer/filter out there that
>> uses this approach? Is there any other better way to do this?
> 
> Solr has KeepWordFilter - it sounds exactly like what you want: 
> 
> Javadoc:
> 
> Source:
> 
> 
> Depending on your requirements and the nature of your keywords list, you
> might consider applying this filter only to queries, rather than at index
> time.  That way, the keyword list can change without having to re-index.
> 
> Steve
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Opposite-to-StopFilter.-Anything-already-implemented-out-there--tp18585878p18591960.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Storing information

2008-07-22 Thread Grant Ingersoll
You may also want a Document cache and/or even a Query cache,
depending on your situation.


-Grant
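The load-once caching Yonik points at below boils down to the following pattern: synchronize so that concurrent searches share a single cached value instead of each building its own. This is a hand-rolled sketch of the idea, not Lucene's actual FieldCache code; the class name and loader are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleQueryCache {
    private final Map<String, Object> cache = new HashMap<String, Object>();

    // Synchronized get-or-load: the expensive load happens at most once per
    // key, even when multiple search threads ask for it at the same time.
    public synchronized Object get(String key) {
        Object value = cache.get(key);
        if (value == null) {
            value = load(key);
            cache.put(key, value);
        }
        return value;
    }

    // Hypothetical loader; a real one would read the index or a database.
    protected Object load(String key) {
        return "value-for-" + key;
    }
}
```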

On Jul 21, 2008, at 11:49 PM, Yonik Seeley wrote:

On Mon, Jul 21, 2008 at 11:27 PM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
I am using Lucene to perform searching. I have certain information that will
be loaded every time a search is run. This means, if there are multiple users
running the search at the same time, the information will be loaded multiple
times.

Is this a Lucene question?
Anyway, refer to the Lucene FieldCache for an example of how to avoid
multiple threads from creating more than one cached object.

-Yonik




--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to avoid duplicate records in lucene

2008-07-22 Thread mark harwood
>>Well, the point of my question was to insure that we were all using common 
>>terms.

Sorry, Erick. I thought your "define duplicate" question was asking me about 
DuplicateFilter's concept of duplicates rather than asking the original poster 
about his notion of what a duplicate document meant to him. You're right it 
would be useful to understand more about the intention of the original message.

Cheers
Mark





- Original Message 
From: Erick Erickson <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 22 July, 2008 2:37:50 PM
Subject: Re: How to avoid duplicate records in lucene

Well, the point of my question was to insure that we were all using common
terms. For all we know, the original questioner considered "duplicate"
records ones that had identical, or even similar text. Nothing in the
original question indicated any de-dup happening.

I've often found that assumptions that we are all talking about the same
thing are...er...incorrect. And I don't want to waste my time answering
questions that weren't what was asked..

Best
Erick

On Mon, Jul 21, 2008 at 2:44 PM, markharw00d <[EMAIL PROTECTED]>
wrote:

> >>could you define duplicate?
>
> That's your choice of field that you want to de-dup on.
> That could be a field such as "DatabasePrimaryKey" or perhaps a field
> containing an MD5 hash of document content.
> The DuplicateFilter ensures only one document can exist in results for each
> unique value for the choice of field.
>
> Cheers
> Mark
>
> Erick Erickson wrote:
>
>> could you define duplicate? As far as I know, you don't
>> get the same (internal) doc id back more than once, so what
>> is a duplicate?
>>
>> Best
>> Erick
>>
>> On Mon, Jul 21, 2008 at 9:40 AM, Sebastin <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>>> at the time search , while querying the data
>>> markrmiller wrote:
>>>
>>>
 Sebastin wrote:


> Hi All,
>
> Is there any possibility to avoid duplicate records in lucene  2.3.1?
>
>
>
 I don't believe that there is a very high performance way to do this.
 You are basically going to have to query the index for an id before
 adding a new doc. The best way I can think of off the top of my head is
 to batch - first check that ids in the batch are unique, then check all
 ids in the batch against the IndexReader, then add the ones that are not
 dupes. Of course all of your docs would have to be added through this
 single choke point so that you knew other threads had not added that id
 after the first thread had looked but before it added the doc.

 I think Mark H has you covered if getting the dupes out after are okay.

 - Mark






>>> --
>>> View this message in context:
>>>
>>> http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>  
>>
>>
>>
>
>
>
>
>



  __
Not happy with your email address?.
Get the one you really want - millions of new email addresses available now at 
Yahoo! http://uk.docs.yahoo.com/ymail/new.html

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to avoid duplicate records in lucene

2008-07-22 Thread Erick Erickson
NP, if my original reply had included my second one, then you'd have
known what I was talking about...

I *love* it when I unknowingly demonstrate the issue I'm trying to clarify.

Best
Erick

On Tue, Jul 22, 2008 at 2:09 PM, mark harwood <[EMAIL PROTECTED]>
wrote:

> >>Well, the point of my question was to insure that we were all using
> common terms.
>
> Sorry, Erick. I thought your "define duplicate" question was asking me
> about DuplicateFilter's concept of duplicates rather than asking the
> original poster about his notion of what a duplicate document meant to him.
> You're right it would be useful to understand more about the intention of
> the original message.
>
> Cheers
> Mark
>
>
>
>
>
> - Original Message 
> From: Erick Erickson <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 22 July, 2008 2:37:50 PM
> Subject: Re: How to avoid duplicate records in lucene
>
> Well, the point of my question was to insure that we were all using common
> terms. For all we know, the original questioner considered "duplicate"
> records ones that had identical, or even similar text. Nothing in the
> original question indicated any de-dup happening.
>
> I've often found that assumptions that we are all talking about the same
> thing are...er...incorrect. And I don't want to waste my time answering
> questions that weren't what was asked..
>
> Best
> Erick
>
> On Mon, Jul 21, 2008 at 2:44 PM, markharw00d <[EMAIL PROTECTED]>
> wrote:
>
> > >>could you define duplicate?
> >
> > That's your choice of field that you want to de-dup on.
> > That could be a field such as "DatabasePrimaryKey" or perhaps a field
> > containing an MD5 hash of document content.
> > The DuplicateFilter ensures only one document can exist in results for
> each
> > unique value for the choice of field.
> >
> > Cheers
> > Mark
> >
> > Erick Erickson wrote:
> >
> >> could you define duplicate? As far as I know, you don't
> >> get the same (internal) doc id back more than once, so what
> >> is a duplicate?
> >>
> >> Best
> >> Erick
> >>
> >> On Mon, Jul 21, 2008 at 9:40 AM, Sebastin <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >>
> >>> at the time search , while querying the data
> >>> markrmiller wrote:
> >>>
> >>>
>  Sebastin wrote:
> 
> 
> > Hi All,
> >
> > Is there any possibility to avoid duplicate records in lucene  2.3.1?
> >
> >
> >
>  I don't believe that there is a very high performance way to do this.
>  You are basically going to have to query the index for an id before
>  adding a new doc. The best way I can think of off the top of my head
> is
>  to batch - first check that ids in the batch are unique, then check
> all
>  ids in the batch against the IndexReader, then add the ones that are
> not
>  dupes. Of course all of your docs would have to be added through this
>  single choke point so that you knew other threads had not added that
> id
>  after the first thread had looked but before it added the doc.
> 
>  I think Mark H has you covered if getting the dupes out after are
> okay.
> 
>  - Mark
> 
> 
> 
> 
> 
> 
> >>> --
> >>> View this message in context:
> >>>
> >>>
> http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
> >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>  
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


storing the contents of a document in the lucene index

2008-07-22 Thread starz10de

Could anyone please tell me how to print the content of a document after
reading the index? For example, if I want to print the index terms, I do:

IndexReader ir = IndexReader.open(index);
TermEnum termEnum = ir.terms();
while (termEnum.next()) {
    TermDocs dok = ir.termDocs();
    dok.seek(termEnum);
    while (dok.next()) {
        System.out.println(termEnum.term().text().trim());
    }
}

I can print the text files before indexing them, but because of encoding
issues I would like to print them from the index. As I understand it, the
content of the document (the whole text) is also stored in the index; my
question is how to print this content.

So at the end I will print the path of the current document, the index
terms, and the content of the document.


thanks in advance
-- 
View this message in context: 
http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18595855.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Interrupting a query

2008-07-22 Thread Paul J. Lucas

If I'm calling:

IndexSearcher.search( query, sortOrder );

how, exactly, can I do what you suggest?  *That* call is what I want  
to interrupt.


- Paul


On Jul 18, 2008, at 3:51 AM, Grant Ingersoll wrote:

True, but I think the approach is similar, in that you need to have  
the hit collector check to see if your interrupt flag has been set  
and then exit out.


On Jul 16, 2008, at 9:54 AM, Paul J. Lucas wrote:

That has nothing to do with interrupting a query at some arbitrary  
time.


On Jul 16, 2008, at 5:14 AM, Grant Ingersoll wrote:


See https://issues.apache.org/jira/browse/LUCENE-997

On Jul 16, 2008, at 12:22 AM, Paul J. Lucas wrote:

If a complicated query is running in a Thread, how does Lucene  
respond to Thread.interrupt()?  I want to be able to interrupt an  
in-progress query.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fastest way to get just the "bits" of matching documents

2008-07-22 Thread Robert Stewart
I need to execute a boolean query and get back just the bits of all the
matching documents. I do additional filtering (date ranges and entitlements)
and then do my own sorting later on. I know that using QueryFilter.Bits()
will still compute scores for all matching documents, and I do not want to
compute any scores.

For queries with large results (over 5 million matches), it seems somewhat
slow, and maybe computing scores is taking some of that time. I have a
10-million-document index, and for some very broad queries (4-5 million
matching documents) getting the bits is slow (1.5 seconds). I can do my own
sorting of results for the requested page in under 30 ms, since I have
efficient cached permutations of sorting by various fields.

Is there a way, given a BooleanQuery, to get the matching bits without
computing any scores internally? I looked at ConstantScoreQuery, but I
believe it still computes scores since it gets bits from the underlying
query anyway. In fact, I tested it, and it is actually slower to use
ConstantScoreQuery than not to.

Is it possible to use a custom Similarity class to make scoring faster (by
returning 0 values, etc.)?




Thanks,
Bob


Re: Fastest way to get just the "bits" of matching documents

2008-07-22 Thread eks dev
No, at the moment you cannot make pure boolean queries. But 1.5 seconds on a
10M-document index sounds a bit too much (we are well under 200 ms on a
150M-document collection). What you can do:

1. Use a Filter for high-frequency terms, e.g. via ConstantScoreQuery, as
much as you can, but you have to cache them (CachingWrapperFilter or
something like that). SortedVIntList can help a lot in reducing memory
requirements for filter caching.
2. Use a RAM-based directory if the index fits in RAM, or an MMAP directory.
3. Provide more details: what is the structure of the query that takes so
long, what is the data in the index... so someone can really help you. Your
question is just too abstract now.
4. Try to sort your index so that things you expect in results end up close
together, e.g. if you search predominantly on some number, sort on it if you
can... this helps reduce I/O stress due to locality.
5. Try https://issues.apache.org/jira/browse/LUCENE-1340 as you do not need
term frequencies for scoring.
6. Try using your own HitCollector instead of QueryFilter.Bits() to get your
bits.


If you have tried all these options, it still is not fast enough, and you
really have a bottleneck in scoring (I doubt it), then you have two options:
- Wait for Paul to come back from holidays; he wanted to make "pure boolean"
queries, without scoring, possible :)
- Invest in faster CPU/memory


have fun
eks
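Suggestion 6 above, collecting the bits yourself, looks roughly like this. It is a plain-Java analog for illustration: a real implementation would subclass Lucene's HitCollector and pass it to IndexSearcher.search, but the essential point is the same: record the doc id and ignore the score entirely.

```java
import java.util.BitSet;

public class BitsCollectorDemo {
    static BitSet bits = new BitSet();

    // Analog of HitCollector.collect(int doc, float score): set the bit for
    // the matching document and discard the score.
    public static void collect(int doc, float score) {
        bits.set(doc);
    }

    public static void main(String[] args) {
        collect(3, 0.9f);
        collect(7, 0.1f);
        System.out.println(bits); // prints {3, 7}
    }
}
```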



- Original Message 
> From: Robert Stewart <[EMAIL PROTECTED]>
> To: "java-user@lucene.apache.org" 
> Sent: Tuesday, 22 July, 2008 9:37:26 PM
> Subject: Fastest way to get just the "bits" of matching documents
> 
> I need to execute a boolean query and get back just the bits of all the 
> matching 
> documents.  I do additional filtering (date ranges and entitlements) and then 
> do 
> my own sorting later on.  I know that using QueryFilter.Bits() will still 
> compute scores for all matching documents.  I do not want to compute any 
> scores.  For queries with large results (over 5 million), seems like it is 
> somewhat slow , and maybe computing scores is taking some time.  I have 
> 10million document index, and for some very broad queries (4-5 million 
> matching 
> documents), seems like getting bits is slow (1.5 seconds).  I can do my own 
> sorting of results for requested page in under 30 ms, since I have efficient 
> cached permutations of sorting by various fields.  Is there a way given a 
> BooleanQuery, to get matching bits without computing any scores internally?  
> I 
> looked at ConstantScoreQuery but I believe it actually still computes scores 
> since it gets bits from the underlying query anyway.  In fact I tested it and 
> it 
> is actually slower to use ConstantScoreQuery than not to.
> 
> Is it possible to use a custom similarity class to make scoring faster (by 
> returning 0 values, etc)?
> 
> 
> 
> 
> Thanks,
> Bob




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Interrupting a query

2008-07-22 Thread Grant Ingersoll
You can't with that call.  You have to make one that uses a  
HitCollector, and your hit collector needs to be interruptible and it  
probably needs to handle your sorting.  Sounds like a nice  
contribution/patch.


Sorry, I can't offer a better solution.

-Grant
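A sketch of the interruptible-collector idea: poll the thread's interrupt flag on every hit and abort the search by throwing. The HitCollector base class and StopSearchException here are stand-ins (the first for the 2.3-era org.apache.lucene.search.HitCollector, the second an illustrative name), included so the sketch is self-contained:

```java
public class InterruptibleSearchExample {
    // Stand-in for org.apache.lucene.search.HitCollector (2.3-era API).
    static abstract class HitCollector {
        public abstract void collect(int doc, float score);
    }

    // Unchecked so it can escape collect(); caught around the search call.
    static class StopSearchException extends RuntimeException {}

    static class InterruptibleCollector extends HitCollector {
        int hits;
        public void collect(int doc, float score) {
            // Observe the flag set by Thread.interrupt() from another thread.
            if (Thread.currentThread().isInterrupted()) {
                throw new StopSearchException();
            }
            hits++;  // real code would also record the doc for later sorting
        }
    }

    public static void main(String[] args) {
        InterruptibleCollector c = new InterruptibleCollector();
        c.collect(0, 1.0f);                 // a normal hit
        Thread.currentThread().interrupt(); // simulate an external interrupt
        try {
            c.collect(1, 1.0f);
        } catch (StopSearchException e) {
            System.out.println("search aborted after " + c.hits + " hits");
        }
        Thread.interrupted(); // clear the flag again
    }
}
```

Wrapping the IndexSearcher.search(query, collector) call in a try/catch for the exception is what actually ends the in-progress query.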

On Jul 22, 2008, at 2:48 PM, Paul J. Lucas wrote:


If I'm calling:

IndexSearcher.search( query, sortOrder );

how, exactly, can I do what you suggest?  *That* call is what I want  
to interrupt.


- Paul


On Jul 18, 2008, at 3:51 AM, Grant Ingersoll wrote:

True, but I think the approach is similar, in that you need to have  
the hit collector check to see if your interrupt flag has been set  
and then exit out.


On Jul 16, 2008, at 9:54 AM, Paul J. Lucas wrote:

That has nothing to do with interrupting a query at some arbitrary  
time.


On Jul 16, 2008, at 5:14 AM, Grant Ingersoll wrote:


See https://issues.apache.org/jira/browse/LUCENE-997

On Jul 16, 2008, at 12:22 AM, Paul J. Lucas wrote:

If a complicated query is running in a Thread, how does Lucene  
respond to Thread.interrupt()?  I want to be able to interrupt  
an in-progress query.







--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: storing the contents of a document in the lucene index

2008-07-22 Thread Erick Erickson
<<>>

This is not strictly true. For instance, stop words aren't even indexed.
Reconstructing a document from the index is very expensive
(see Luke for examples of how this is done).

You can get the text back verbatim if you store it in your index. See
Field.Store.YES (or Field.Store.COMPRESS). Storage is orthogonal
to indexing, so you can index the tokens in a field but not store them,
store them but not index them, or do both. Not storing and not indexing
is, I guess, theoretically possible but I sure can't see why you'd try it.

But if you store the field, you can get it back very easily with
Document.get("field").
Storing the fields will make your index larger, but it shouldn't have a great
effect on your search times, I think.

Best
Erick
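The orthogonality Erick describes can be sketched with a toy model (plain Java, no Lucene classes; all names here are illustrative): stored fields keep the verbatim text for retrieval, while the inverted index keeps only tokens, and either can exist without the other.

```java
import java.util.*;

// Toy model of "storage is orthogonal to indexing".
public class StoreVsIndex {
    Map<Integer, Map<String, String>> stored = new HashMap<>(); // docId -> field -> verbatim text
    Map<String, Set<Integer>> inverted = new HashMap<>();       // token -> docIds

    void add(int docId, String field, String text, boolean store, boolean index) {
        if (store) {   // analogue of Field.Store.YES: keep the text verbatim
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, text);
        }
        if (index) {   // analogue of Field.Index.TOKENIZED: keep only tokens
            for (String token : text.toLowerCase().split("\\s+")) {
                inverted.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Analogue of Document.get("field"): only works if the field was stored.
    String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        StoreVsIndex idx = new StoreVsIndex();
        idx.add(1, "contents", "Lucene in action", true, true);
        System.out.println(idx.get(1, "contents")); // Lucene in action
    }
}
```

An indexed-but-unstored field is searchable via `inverted`, yet `get` returns null for it — which is exactly why starz10de cannot print document contents unless they were stored at indexing time.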



On Tue, Jul 22, 2008 at 2:53 PM, starz10de <[EMAIL PROTECTED]> wrote:

>
>  Could anyone please tell me how to print the content of the document
> after reading the index?
> For example, if I'd like to print the index terms, I do:
>
> IndexReader ir = IndexReader.open(index);
> TermEnum termEnum = ir.terms();
> while (termEnum.next()) {
>     TermDocs dok = ir.termDocs();
>     dok.seek(termEnum);
>     while (dok.next()) {
>         System.out.println(termEnum.term().text().trim());
>     }
> }
>
> I can print the text files before indexing them, but because of encoding
> issues I'd like to print them from the index.
> As I understand it, the content of the document (the whole text) is also
> stored in the index; my question is how to print this content.
>
> so at the end i will print the path of the current document , index terms
> and the content of the document
>
>
> thanks in advance
> --
> View this message in context:
> http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18595855.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>


Re: How to avoid duplicate records in lucene

2008-07-22 Thread Sebastin

Erick,

example,

IndexWriter writer = new IndexWriter("C:/index", new StandardAnalyzer(), true);

String records = "Lucene" + " " + "action" + " " + "book";

Document doc = new Document();

doc.add(new Field("contents", records, Field.Store.YES, Field.Index.TOKENIZED));

writer.addDocument(doc);
writer.optimize();
writer.close();


When the record is inserted twice, querying for "Lucene" will display the
same record twice.








mark harwood wrote:
> 
> >>Well, the point of my question was to ensure that we were all using
> common terms.
> 
> Sorry, Erick. I thought your "define duplicate" question was asking me
> about DuplicateFilter's concept of duplicates rather than asking the
> original poster about his notion of what a duplicate document meant to
> him. You're right it would be useful to understand more about the
> intention of the original message.
> 
> Cheers
> Mark
> 
> 
> 
> 
> 
> - Original Message 
> From: Erick Erickson <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 22 July, 2008 2:37:50 PM
> Subject: Re: How to avoid duplicate records in lucene
> 
> Well, the point of my question was to ensure that we were all using common
> terms. For all we know, the original questioner considered "duplicate"
> records ones that had identical, or even similar text. Nothing in the
> original question indicated any de-dup happening.
> 
> I've often found that assumptions that we are all talking about the same
> thing are...er...incorrect. And I don't want to waste my time answering
> questions that weren't what was asked.
> 
> Best
> Erick
> 
> On Mon, Jul 21, 2008 at 2:44 PM, markharw00d <[EMAIL PROTECTED]>
> wrote:
> 
>> >>could you define duplicate?
>>
>> That's your choice of field that you want to de-dup on.
>> That could be a field such as "DatabasePrimaryKey" or perhaps a field
>> containing an MD5 hash of document content.
>> The DuplicateFilter ensures only one document can exist in results for
>> each
>> unique value for the choice of field.
>>
>> Cheers
>> Mark
>>
>> Erick Erickson wrote:
>>
>>> could you define duplicate? As far as I know, you don't
>>> get the same (internal) doc id back more than once, so what
>>> is a duplicate?
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, Jul 21, 2008 at 9:40 AM, Sebastin <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>
 at the time of search, while querying the data
 markrmiller wrote:


> Sebastin wrote:
>
>
>> Hi All,
>>
>> Is there any possibility to avoid duplicate records in lucene  2.3.1?
>>
>>
>>
> I don't believe that there is a very high performance way to do this.
> You are basically going to have to query the index for an id before
> adding a new doc. The best way I can think of off the top of my head is
> to batch - first check that ids in the batch are unique, then check all
> ids in the batch against the IndexReader, then add the ones that are not
> dupes. Of course all of your docs would have to be added through this
> single choke point so that you knew other threads had not added that id
> after the first thread had looked but before it added the doc.
>
> I think Mark H has you covered if getting the dupes out after is
> okay.
>
> - Mark
>
>
>
>
>
>
 --
 View this message in context:

 http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.






>>>
>>> 
>>> 
>>>
>>>
>>>
>>
>>
>>
>>
>>
> 
> 
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18603752.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
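The batch de-dup approach Mark Miller describes in the quoted thread above can be sketched as follows (plain Java; the Set of already-indexed ids stands in for an IndexReader id lookup, which is an assumption of this sketch):

```java
import java.util.*;

public class BatchDedup {
    // Returns the ids from the batch that should actually be added:
    // first de-dup within the batch, then skip ids already in the index.
    static List<String> idsToAdd(List<String> batchIds, Set<String> indexedIds) {
        Set<String> seenInBatch = new HashSet<>();
        List<String> toAdd = new ArrayList<>();
        for (String id : batchIds) {
            // keep the id only if it is new to both the batch and the index
            if (seenInBatch.add(id) && !indexedIds.contains(id)) {
                toAdd.add(id);
            }
        }
        return toAdd;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("doc1"));
        List<String> batch = Arrays.asList("doc1", "doc2", "doc2", "doc3");
        System.out.println(idsToAdd(batch, indexed)); // [doc2, doc3]
    }
}
```

As Mark notes, this only stays correct if all additions go through one choke point, so no other thread can insert an id between the lookup and the add.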

Lucene write locks

2008-07-22 Thread Sandeep K

Hi all..
I had a question related to the write locks created by Lucene.
I use Lucene 2.3.2. Will this newer version create locks while indexing as
the older ones did?
Or does Lucene handle its operations in some other way?

Another doubt: I use JMS for Lucene indexing.
My app server will not do indexing but will pass the needed data for
indexing to the JMS server.
Will there be any problem with indexing since it is asynchronous?
Please help me.

thanks and regards,
Sand.



-- 
View this message in context: 
http://www.nabble.com/Lucene-write-locks-tp18604932p18604932.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

