Re: search on colon ":" ending words

2007-02-23 Thread Erick Erickson

I'd *strongly* advise doing it the simple way, that is, your replace.

1> it's simple and understandable.
2> next time you upgrade Lucene you, or the next poor programmer, will have
to remember/reimplement your change to the parser.
3> How will you ensure that others in your organization (and you 6 months
from now) won't spend lots of time wondering why ':' didn't work as a field
separator in the query parser? I flat guarantee this will cause you grief...
4> I don't want Otis, Erik and Yonik to have to spend time answering the
question "Why isn't ':' working as a field separator?" .

Best
Erick

On 2/22/07, Felix Litman <[EMAIL PROTECTED]> wrote:


OK. Thank you.  We'll have to consider using this approach.

  I guess the drawback here is that ":" will no longer work as a field
operator. :-(

  We were also considering using the following approach.

  String newquery = query.replace(": ", " ");

  It seems this way a colon should still work as a field operator if
followed by a query term with no space in between

  Thanks,
  Felix.

Antony Bowesman <[EMAIL PROTECTED]> wrote:
  Felix Litman wrote:
> Yes. thank you. How did you make that modification not to treat ":" as a
field-name terminator?
>
> Is it using this Or some other way?

I removed the : handling stuff from QueryParser.jj in the method:

Query Clause(String field) :

I removed this section
---
[
LOOKAHEAD(2)
(
fieldToken=<TERM> <COLON> {field=discardEscapeChar(fieldToken.image);}
| <STAR> <COLON> {field="*";}
)
]
---

and you can also remove the COLON and : related bits to do with start terms
and escaped chars if you want to exclude treating : as a separator, but from
memory, it's the above section that does the field recognition.

Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Index maintainance

2007-02-23 Thread Kainth, Sachin
Hi all,

Just wondering how one would perform index maintenance.  I know how to
add new documents:

writer = new IndexWriter(IndexDirectory, new PorterAnalyzer(), false);

(incidentally, I wrote PorterAnalyzer myself for the PorterStemFilter
since I couldn't find an analyzer using it)

But what I don't know is how do we delete documents from the index and
how we replace documents in the index where those documents have
changed.

Cheers

Sachin


This email and any attached files are confidential and copyright protected. If 
you are not the addressee, any dissemination of this communication is strictly 
prohibited. Unless otherwise expressly agreed in writing, nothing stated in 
this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered 
in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, 
Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need 
to. 


Re: Index maintainance

2007-02-23 Thread Erick Erickson

If you're using 2.1, see IndexModifier. If you're on a version before 2.1,
IndexModifier is (I think) hanging around in the contrib area. You have to
delete a document and re-add it; there's no such thing as "modify in place"
in Lucene currently.

Erick

On 2/23/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:


Hi all,

Just wondering how one would perform index maintainance.  I know how to
add new documents:

writer = new IndexWriter(IndexDirectory, new PorterAnalyzer(), false);

(incidently, I wrote PorterAnalyzer myself for the PorterStemFilter
since I couldn't find an analyzer using it)

But what I don't know is how do we delete documents from the index and
how we replace documents in the index where those documents have
changed.

Cheers

Sachin





Re: Index maintainance

2007-02-23 Thread Michael McCandless

Erick Erickson wrote:

If you're using 2.1, see IndexModifier. If you're previous to 2.1, the
IndexModifier is (I think), hanging around in the contrib area. You have to
delete a document and re-add it, there's no such thing as "modify inplace"
in lucene currently.


Actually as of 2.1 you can now delete and update documents through
IndexWriter.  I would recommend using IndexWriter instead of
IndexModifier.

Mike




RE: Index maintainance

2007-02-23 Thread Kainth, Sachin
I've just been looking at IndexReader and it seems you can do it using
that, but I don't know which concrete implementation of IndexReader to
use. 

-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: 23 February 2007 15:07
To: java-user@lucene.apache.org
Subject: Re: Index maintainance

Erick Erickson wrote:
> If you're using 2.1, see IndexModifier. If you're previous to 2.1, the
> IndexModifier is (I think), hanging around in the contrib area. You
> have to delete a document and re-add it, there's no such thing as
> "modify inplace" in lucene currently.

Actually as of 2.1 you can now delete and update documents through
IndexWriter.  I would recommend using IndexWriter instead of
IndexModifier.

Mike




Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Hi, I have a Java Swing application with a table, and I was considering using 
Lucene to index the data in the table. One task I'd like to support is for the 
user to select 'Find Duplicate records for Column X'; I would then 
filter the table to show only records where more than one record has 
the same value, i.e. a duplicate for that column. Is there a way to return 
all the duplicates from a Lucene index?


thanks paul Taylor




Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Erick Erickson

Sure, you can use the TermDocs/TermEnum classes. Basically, for a term
(probably column value in your app) these let you quickly answer the
question "which (and how many) documents does this term appear in". What you
get is the Lucene doc id, which lets you fetch all the information about
the documents you want.

Erick

On 2/23/07, Paul Taylor <[EMAIL PROTECTED]> wrote:


Hi I have Java Swing application with a table, I was considering using
Lucene to index the data in the table. One task Id like to do is for the
user to select 'Find Duplicate records for Column X', then I would
filter the table to show only records where there is more than one with
the same value i.e duplicate for that column. Is there a way to return
all the duplicates from a Lucene index.

thanks paul Taylor





Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Erik Hatcher


On Feb 23, 2007, at 10:16 AM, Paul Taylor wrote:

Hi I have Java Swing application with a table, I was considering  
using Lucene to index the data in the table. One task Id like to do  
is for the user to select 'Find Duplicate records for Column X',  
then I would filter the table to show only records where there is  
more than one with the same value i.e duplicate for that column. Is  
there a way to return all the duplicates from a Lucene index.


Check this stuff out:  - might be of help.


Erik





Index modification

2007-02-23 Thread Kainth, Sachin
Hi all,

I am using the IndexModifier class to perform index modification.  I
have deleted 1 document from an index and the output indicates that 1
document does indeed get deleted.  However, running the program again
reveals that the document deleted has appeared again in the index.  This
despite the fact that I close the IndexModifier after the deletion.
Does anyone know what I'm missing?  Here is my code:

Analyzer analyzer = new PorterAnalyzer();
IndexModifier indexModifier = new IndexModifier("D:\\Index",
analyzer, false);
Console.WriteLine(indexModifier.DocCount() + " docs in
index");
//indexModifier.Delete(1);
indexModifier.Delete(new Term("artist", "hot"));
Console.WriteLine("Deleted a document");
Console.WriteLine(indexModifier.DocCount() + " docs in
index");
indexModifier.Close();

Cheers






Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Thanks this might do it, but do I need to know the terms beforehand, I 
just want to return any terms with frequency more than one?


Erick Erickson wrote:
Sure, you can use the TermDocs/TermEnum classes. Basically, for a term 
(probably column value in your app) these let you quickly answer the 
question "which (and how many) documents does this term appear in". 
What you get is the Lucene doc id, which let's you fetch all the 
information about the documents you want.


Erick

On 2/23/07, Paul Taylor <[EMAIL PROTECTED]> wrote:


Hi I have Java Swing application with a table, I was considering using
Lucene to index the data in the table. One task Id like to do is for the
user to select 'Find Duplicate records for Column X', then I would
filter the table to show only records where there is more than one with
the same value i.e duplicate for that column. Is there a way to return
all the duplicates from a Lucene index.

thanks paul Taylor




Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Paul Taylor
Yes, I've seen this before, thanks. It was an article referring to this 
that pointed me towards Lucene in the first place :)

Erik Hatcher wrote:


On Feb 23, 2007, at 10:16 AM, Paul Taylor wrote:

Hi I have Java Swing application with a table, I was considering 
using Lucene to index the data in the table. One task Id like to do 
is for the user to select 'Find Duplicate records for Column X', then 
I would filter the table to show only records where there is more 
than one with the same value i.e duplicate for that column. Is there 
a way to return all the duplicates from a Lucene index.


Check this stuff out:  - might be of help.


Erik





RE: Leaking org.apache.lucene.index.* objects

2007-02-23 Thread Halsey, Stephen
Great, thanks a lot for that Hoss.  Glad to hear it has been fixed.

Steve. 

>-Original Message-
>From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
>Sent: 10 February 2007 06:14
>To: java-user@lucene.apache.org
>Cc: Otis Gospodnetic
>Subject: RE: Leaking org.apache.lucene.index.* objects
>
>
>: Its funny, but I'm having a memory leak with Hibernate that 
>I spent the
>: whole of yesterday banging my head against a wall about and so when
>: searching for emails with Leak in the title came across your message.
>: I'm probably going to hit the same problem as you for long running
>: multi-threaded lucene index updates so am very interested in this.  I
>
>: Just noticed this is from December, so may well be a bit late for
>: helping out?
>
>FYI: After his original post to java-user Otis then replied on 
>java-dev where a lively discussion ensued, resulting in a 
>FieldCache bug being discovered and fixed...
>
>http://www.nabble.com/ThreadLocal-leak-%28was-Re%3A-Leaking-org
>.apache.lucene.index.*-objects%29-tf2831090.html#a7953064
>
>
>
>
>-Hoss
>
>



Re: Leaking org.apache.lucene.index.* objects

2007-02-23 Thread Mark Miller

Are you flushing the session every so often with hibernate? If not you might
not have been experiencing Otis's bug -- if so, never mind 

- Mark

On 2/23/07, Halsey, Stephen <[EMAIL PROTECTED]> wrote:


Great, thanks a lot for that Hoss.  Glad to hear it has been fixed.

Steve.

>-Original Message-
>From: Chris Hostetter [mailto:[EMAIL PROTECTED]
>Sent: 10 February 2007 06:14
>To: java-user@lucene.apache.org
>Cc: Otis Gospodnetic
>Subject: RE: Leaking org.apache.lucene.index.* objects
>
>
>: Its funny, but I'm having a memory leak with Hibernate that
>I spent the
>: whole of yesterday banging my head against a wall about and so when
>: searching for emails with Leak in the title came across your message.
>: I'm probably going to hit the same problem as you for long running
>: multi-threaded lucene index updates so am very interested in this.  I
>
>: Just noticed this is from December, so may well be a bit late for
>: helping out?
>
>FYI: After his original post to java-user Otis then replied on
>java-dev where a lively discussion ensued, resulting in a
>FieldCache bug being discovered and fixed...
>
>http://www.nabble.com/ThreadLocal-leak-%28was-Re%3A-Leaking-org
>.apache.lucene.index.*-objects%29-tf2831090.html#a7953064
>
>
>
>
>-Hoss
>
>




Determining if index exists Lucene 2.1

2007-02-23 Thread Shane

Hi,

Prior to Lucene 2.1, I was using FSDirectory.getDirectory(String path, 
boolean create) inside of a try block to determine whether or not a 
directory existed.


With the deprecation of the above method call in Lucene 2.1, I need a 
new method for determining the existence of an index.  I can just check 
to see if either of the files INDEX_PATH/segments or 
INDEX_PATH/segments.gen exist, but that doesn't seem like the best route.


Is there a function call to determine whether or not an index already 
exists?


Thanks,

Shane




Re: TextMining.org Word extractor

2007-02-23 Thread Chris Hostetter


googling...
TextMining.org licence
...turns up lots of useful info, some from the archive of this list.


: Date: Fri, 23 Feb 2007 16:04:53 +1100
: From: Antony Bowesman <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: TextMining.org Word extractor
:
: I'm extracting text from Word using TextMining.org extractors - it works better
: than POI because it extracts Word 6/95 as well as 97-2002, which POI cannot do.
: However, I'm trying to find out about licence issues with the TM jar. The TM
: website seems to be permanently hacked these days.
:
: Anyone know?
:
: Also, has anyone come up with a good solution for extracting data from
: fast-saved files, something that neither TM nor POI can do.
:
: Antony
:
:
:
:



-Hoss





Re: Determining if index exists Lucene 2.1

2007-02-23 Thread karl wettin


On 23 Feb 2007, at 19:53, Shane wrote:


Is there a function call to determine whether or not an index  
already exists?





HTH

--
karl




Re: Determining if index exists Lucene 2.1

2007-02-23 Thread Michael McCandless

Shane wrote:

Prior to Lucene 2.1, I was using FSDirectory.getDirectory(String path, 
boolean create) inside of a try block to determine whether or not a 
directory existed.


With the deprecation of the above class call in Lucene 2.1,  I need a 
new method for determining the existence of an index.  I can just check 
to see if either of the files INDEX_PATH/segments or 
INDEX_PATH/segments.gen exist, but that doesn't seem like the best route.


Is there a function call to determine whether or not an index already 
exists?


You want IndexReader.indexExists(...).

Also you may want to use the new constructors to IndexWriter that create
a new index if there isn't one already, else open the existing one.

Mike




Re: Multy Language documents indexing

2007-02-23 Thread Ivan Vasilev

Thanks Erick,
Here I describe my research on this problem. It might be helpful 
for someone :)

I will divide the problem of multiple-language docs into several subproblems:
*1. Determining the language of text documents.
1.1. Determining the language of a document when the whole text is in a single 
language.*
In the Lucene forum I found the following links to sites which provide 
tools for this task.

http://odur.let.rug.nl/~vannoord/TextCat/
http://frank.spieleck.de/ngram/
I have made some tests with the first one. The results for English 
and German were 100% correct guesses, but for Russian 0% (I used the 
windows-1251 encoding, which is claimed to be supported for Russian 
text recognition).
Link to this demo: 
http://odur.let.rug.nl/~vannoord/TextCat/Demo/ 

Other similar sites are listed at: 
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

Note that it is important to choose the proper Analyzer when indexing, 
because searching with a searcher that uses a different analyzer 
might give incorrect results. Example: 
if we index a German document using Lucene's GermanAnalyzer and then 
search for some German word in the doc using StandardAnalyzer, it is 
possible that the word is not found.


*1.2. Determining the language in a document when one part of the 
text is in one language and another part is in a different language.*

I did not find any tools for this, either free or commercial.

*2. How to keep the terms for documents when each document is in a 
different language.*
There was a discussion about this in this forum and the approach that I 
like best is the one suggested in the mail 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200211.mbox/[EMAIL PROTECTED] 

It suggests indexing all the docs in one index no matter which analyzer 
we use.
I have done some tests with indexing in this way (see the attached 
source files and sample text files; the names of the sample text files are 
important, as they are hard-coded in the sources to avoid using a language 
recognizer, which I did not have enough time to use).


*3. The encoding of the documents.*
There is another thing that is important: the encoding of plain text 
documents written in different languages. When indexing, it is important 
to know the encoding of the plain text documents, otherwise the results 
are incorrect. For example, when some document is encoded in "ISO-8859-1" 
and when creating the index we wrongly decide it is in "UTF-16", then the 
results of searching are wrong.
So maybe we also have to write some class(es) that will determine the 
right encoding of a document based on its BOM or, if that is missing, on the 
text contained in it.
Just some fun: MS Notepad has a bug in this sense. When a file 
created by Notepad contains exactly one of these texts: "this app can 
break" OR "tuka ima golem bug" (without a line separator), then the same 
Notepad cannot read it (unlike Wordpad or other programs) :). The 
second means "here is a big bug" in Bulgarian.
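
A minimal sketch of the BOM half of the encoding check from point 3, in plain Java with no Lucene involved. The class name BomSniffer and the method detect are made up for illustration. It only recognises the UTF-8 and UTF-16 byte order marks and returns null otherwise, so the "guess from the text" fallback is left out entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a leading byte order mark to a charset.
final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or null if none is found. */
    static Charset detect(byte[] head) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little-endian BOM: FF FE
        // (a UTF-32LE file also starts with FF FE and would be misreported here)
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the caller has to fall back to some other guess
    }
}
```

A caller would read the first few bytes of the file, pass them to detect, and only fall back to a content-based guess (or a configured default) when it returns null.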


Best Regards,
Ivan Vasilev


Erick Erickson wrote:

I know this has been discussed several times, but sure don't remember the
answers. Search the mail archive for "multiple languages" and you'll find
some good suggestions. But as I remember, it's not a trivial issue.

But I don't see why the "three different documents" approach wouldn't 
work.

You could also index the same text in three different fields in a single
document, using different language analyzers for each (See
PerFieldAnalyzerWrapper).

Erick

On 2/22/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,

Our application that uses Lucene for indexing will be used to index
documents that each of which contains parts written in different
languages. For example some document could contain English, Chinese and
Brazilian text. So how to index such document? Is there some best
practice to do this?

What comes in my mind is to index 3 different Lucene Documents for the
real document and keep in a database the meta info that these 3
Documents are related to our real doc. For example for the myDoc.doc we
will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc and when
making search when the searched word is found in myDocCn.doc we will
visualize to user myDoc.doc. Disadvantage here is that in this case the
occurrences of the searched item will have to be recalculated. It is
important for queries like "Red NEAR/10 fox". So if someone knows better
practice than this, please let me know.

Thanks in advance,
Ivan



filtering by first letter

2007-02-23 Thread Paul Sundling (Webdaddy)
I have a requirement to support filtering search results by first letter. 

This is relatively simple by adding a field to each index that 
represents the first letter for that relevant index and then adding a 
filter to the search.


The hard part is that I need to list all the letters you can filter BY.  
So if there are no names that start with S, it shouldn't appear as an 
option. 

Is there a simple and performant way to get a set of all the unique 
values for a Field in the Hits returned?  There would probably only be a 
low number of unique values.


So let's say I have the following in my index:

letter, personName
m, mike smith
p, paul smith
g, george smith
g, glenda smith

I need to be able to display to the user that they can filter based on  
M, P or G within their search for George.


I could do a compromise and for search results above a certain level, 
show all letters and numbers, but it won't always give correct values.  
Imagine this edge case: A search for george has 50,000 results, but only 
a couple people had george as their last name.  Not many of the letters 
would be valid filters.


Thanks for any ideas or approaches I overlooked.

Paul Sundling





Re: filtering by first letter

2007-02-23 Thread Erick Erickson

See TermEnum (I don't think you need TermDocs for this). If you get a
TermEnum from IndexReader.terms(new Term("firstletterfield", "")), it'll
enumerate the terms starting at your 'firstletterfield' field and you can just
collect them (stopping when the field changes) and go...

For that matter, and assuming that your names are UN_TOKENIZED, you could do
something like this without a special field by iterating over your
personName field. This might be reasonable if your index is fairly static
and you could create this list at IndexReader open time, especially since
you can use TermEnum.skipTo(new Term("personName", "a")) etc.
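
What the enumeration boils down to is collecting a small sorted set of distinct letters. Here is a plain-Java model of that collection step; it is not Lucene code, and the class FirstLetters and method availableLetters are hypothetical names. The list of names stands in for whatever source you enumerate, whether the terms of the field or the names in the current hits.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: derives the letters a user could filter by.
final class FirstLetters {

    /** Collects the distinct, upper-cased first characters of the given names. */
    static SortedSet<Character> availableLetters(List<String> names) {
        SortedSet<Character> letters = new TreeSet<>();
        for (String name : names) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                // Only letters that actually occur end up in the set.
                letters.add(Character.toUpperCase(trimmed.charAt(0)));
            }
        }
        return letters;
    }
}
```

With the four sample names from the question this yields G, M and P, and nothing for S. Running it over the hits rather than the whole field is what restricts the letters to the current search, at the cost of touching every hit.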

Best
Erick

On 2/23/07, Paul Sundling (Webdaddy) <[EMAIL PROTECTED]> wrote:


I have a requirement to support filtering search results by first letter.

This is relatively simple by adding a field to each index that
represents the first letter for that relevant index and then adding a
filter to the search.

The hard part is that I need to list all the letters you can filter BY.
So if there are no names that start with S, it shouldn't appear as an
option.

Is there a simple and performant way to get a set of all the unique
values for a Field in the Hits returned?  There would probably only be
low number of unique values.

So let's say I have the following in my index:

letter, personName
m, mike smith
p, paul smith
g, george smith
g, glenda smith

I need to be able to display to the user that they can filter based on
M, P or G within their search for George.

I could do a compromise and for search results above a certain level,
show all letters and numbers, but it won't always give correct values.
Imagine this edge case: A search for george has 50,000 results, but only
a couple people had george as their last name.  Not many of the letters
would be valid filters.

Thanks for any ideas or approaches I overlooked.

Paul Sundling






Re: Can I use Lucene to retrieve a list of duplicates

2007-02-23 Thread Chris Hostetter

: Thanks this might do it, but do I need to know the terms beforehand, I
: just want to return any terms with frequency more than one?

no, TermEnum will let you iterate over all the terms ... you don't even
need TermDocs if you just want the docFreq for each term (which would be 1
if there are no duplicates)
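
To make the "docFreq greater than 1" idea concrete, here is a plain-Java model of the counting. It is not Lucene code: the class DuplicateValues and its method are hypothetical, and the map plays the role that the per-term docFreq would play for an untokenized field holding the column value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: finds column values shared by more than one row.
final class DuplicateValues {

    /** Returns, in sorted order, every value that occurs in more than one row. */
    static List<String> duplicatedValues(List<String> columnValues) {
        // value -> number of rows containing it (the analogue of docFreq per term)
        Map<String, Integer> rowCount = new TreeMap<>();
        for (String value : columnValues) {
            rowCount.merge(value, 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : rowCount.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }
}
```

In the Lucene version the terms are not known beforehand either: the enumeration supplies each term together with its count, and only the terms with a count above one are kept.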

: Erick Erickson wrote:
: > Sure, you can use the TermDocs/TermEnum classes. Basically, for a term
: > (probably column value in your app) these let you quickly answer the
: > question "which (and how many) documents does this term appear in".
: > What you get is the Lucene doc id, which let's you fetch all the
: > information about the documents you want.
: >
: > Erick
: >
: > On 2/23/07, Paul Taylor <[EMAIL PROTECTED]> wrote:
: >
: > Hi I have Java Swing application with a table, I was considering using
: > Lucene to index the data in the table. One task Id like to do is for the
: > user to select 'Find Duplicate records for Column X', then I would
: > filter the table to show only records where there is more than one with
: > the same value i.e duplicate for that column. Is there a way to return
: > all the duplicates from a Lucene index.
: >
: > thanks paul Taylor
: >
: >
:
:
:



-Hoss





Re: search on colon ":" ending words

2007-02-23 Thread Chris Hostetter

:   String newquery = query.replace(": ", " ");

you should be able to use a regex like so...

String newquery = query.replaceAll(":\\b", ":");

...(i may have some extra/missing backslashes) to ensure that literal ":"
in your input which are followed by word boundaries are "escaped" from the
query parser ... that way if your analyzer doesn't strip out the ":"
things will still work, and ":" at the end of your input will be properly
escaped (your current string replace will fail in this case)
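
A hedged sketch of that escaping idea in plain Java. The replacement in the archived snippet appears to have lost its backslashes, and "\b" after a colon matches when a word character follows, so this sketch swaps in a negative lookahead: it escapes a colon only when no word character comes right after it (whitespace, punctuation, or end of input). The class ColonEscaper and its method are hypothetical names, and the exact pattern is an assumption about the intent, not the regex from the original mail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes colons that cannot be field separators.
final class ColonEscaper {

    // A colon that is not already backslash-escaped and is not
    // immediately followed by a word character.
    private static final Pattern TRAILING_COLON = Pattern.compile("(?<!\\\\):(?!\\w)");

    /** Backslash-escapes colons followed by whitespace, punctuation or end of input. */
    static String escapeTrailingColons(String query) {
        // quoteReplacement keeps the backslash literal in the replacement string.
        return TRAILING_COLON.matcher(query).replaceAll(Matcher.quoteReplacement("\\:"));
    }

    public static void main(String[] args) {
        System.out.println(escapeTrailingColons("title:lucene note: colons"));
        System.out.println(escapeTrailingColons("ends with colon:"));
    }
}
```

A colon directly followed by a term, as in title:lucene, is left alone and so still works as a field operator, while "note: " and a colon at the very end of the input are escaped.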



-Hoss





RE: Index maintainance

2007-02-23 Thread Chris Hostetter

: I've just been looking at IndexReader and it seems you can do it using
: that, but I don't know which concrete implementation of IndexReader to
: use.

there is a static factory method for opening an IndexReader in the
IndexReader class (you can't call the constructors directly)

please go through the Lucene-Java tutorial and read the corresponding demo
code ... it covers topics like these.

beyond that i don't know anyone who won't recommend you get a copy of LIA
and read it cover to cover...

http://lucene.apache.org/java/docs/gettingstarted.html
http://lucenebook.com/



-Hoss





Re: Index modification

2007-02-23 Thread Chris Hostetter

you should check the return count from deleteDocuments ... if i had to
guess i would say that your analyzer is stemming the input in such a way
that your indexed terms don't match the Term you are trying to delete on.


: Date: Fri, 23 Feb 2007 16:48:27 -
: From: "Kainth, Sachin" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Index modification
:
: Hi all,
:
: I am using the IndexModifier class to perform index modification.  I
: have deleted 1 document from an index and the output indicates that 1
: document does indeed get deleted.  However, running the program again
: reveals that the document deleted has appeared again in the index.  This
: despite the fact that I close the IndexModifier after the deletion.
: Does anyone know what I'm missing?  Here is my code:
:
:   Analyzer analyzer = new PorterAnalyzer();
: IndexModifier indexModifier = new IndexModifier("D:\\Index",
: analyzer, false);
: Console.WriteLine(indexModifier.DocCount() + " docs in
: index");
: //indexModifier.Delete(1);
: indexModifier.Delete(new Term("artist", "hot"));
: Console.WriteLine("Deleted a document");
: Console.WriteLine(indexModifier.DocCount() + " docs in
: index");
: indexModifier.Close();
:
: Cheers
:
:
:
:
:



-Hoss





Re: ConstantScoreQuery and MatchAllDocsQuery

2007-02-23 Thread Chris Hostetter


: I ask this because I need to return the frequency of the search terms
: with each of my results, I tried using the TermFreqVector object but
: unfortunately it was not fast enough, so I decided to modify lucene to
: be able to return the frequency the same way the score is returned by
: org.apache.lucene.search.Hits.
...
: I started by adding public abstract int freq(); in package
: org.apache.lucene.search.Scorer abstract class, and then modified
: every implementation of Scorer to be able to get the frequency.

can you elaborate on:
 * how you were trying to use TermFreqVector
 * how you define "fast enough"
 * how you are now getting the freq() value in all of the Scorer classes?

If all you need to know is the frequency of each term in your query (and
not the frequency of all terms in the document) did you try using the
freq() method in the TermDocs iterator instead of the TermFreqVector
class?

using Query.extractTerms, and then getting a TermDocs instance
and iterating over those terms using seek and over the docids from your
results using skipTo should be an extremely fast way to get the freq()
info.

: It works well and fast, the only problem I have is that I did not find a
: way to compute the frequency in both ConstantScoreQuery.java and
: MatchAllDocsQuery.java internal scorers.

neither of those queries involve any terms, so i'm not sure what freq()
would even make sense ... "1" or "0" i would imagine.




-Hoss





Re: QueryParser bug?

2007-02-23 Thread Doron Cohen
Hi Antony,

Could you try the patch in
http://issues.apache.org/jira/browse/LUCENE-813

Thanks,
Doron

Chris Hostetter <[EMAIL PROTECTED]> wrote on 22/02/2007 22:01:00:

>
> : than just on/off), but the original QP shows the problem with
> : setAllowLeadingWildcard(true).  The compiled JavaCC code will always create a
> : PrefixQuery if the last character is *, regardless of any other wildcard
> : characters before it.  Therefore the query is based on the Term:
>
> Yep, definitely a bug...
>
> http://issues.apache.org/jira/browse/LUCENE-813
>

