Re: search question

2004-12-23 Thread roy-lucene-user
Erik,

They both use the StandardAnalyzer; however, looking at the toString() output
makes everything clearer.  With 1.2, a string containing an email address such
as [EMAIL PROTECTED] gets split like so: first.last domain.com

However in 1.4 it does not get split.

So now we just check to see if an index was built using 1.2 or 1.4 and have
some checks thrown in.
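
A minimal sketch of that kind of check, in plain Java with no Lucene dependency. The class and method names are hypothetical, first.last@domain.com is a made-up address, and the legacySplit flag stands in for however the application records which version built the index:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: shapes an email query term to match how the index tokenized it. */
final class EmailQueryTerms {

    /**
     * Returns the tokens to search for.
     * legacySplit = true models the split described above (local part and
     * domain as two tokens); false keeps the address as a single token.
     */
    static List<String> tokensFor(String email, boolean legacySplit) {
        int at = email.indexOf('@');
        if (!legacySplit || at < 0) {
            return Arrays.asList(email);
        }
        // Drop the '@' and search for the two halves separately.
        return Arrays.asList(email.substring(0, at), email.substring(at + 1));
    }
}
```

Each returned token would then be turned into its own query clause on the email field.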

Thanks for the guidance.

Roy.

On Wed, 22 Dec 2004 18:41:44 -0500, Erik Hatcher wrote:
> What does toString() return for each of those queries?  Are you
> using the same analyzer in both cases?
>
>    Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



search question

2004-12-22 Thread roy-lucene-user
Hi guys,

We have an index with some fields containing email addresses.  Doing a search 
for an email address with this format: [EMAIL PROTECTED], does not bring up any 
results with lucene 1.4.

The query: Field1:[EMAIL PROTECTED]

However it returns results with 1.2.  Any ideas?

Roy.




lock file paths

2004-11-15 Thread roy-lucene-user
Hey guys,

Quick question... is there a way to get the file paths to the lock files?  Or 
do I have to modify the src?  Currently I can't find any methods that will 
return a lock's file path.

Roy.




Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
Hi Fred,

We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, it's only a demo.  I think it's threaded to save time, allowing
the calling thread to grab the title or top message even though it's not done
parsing the entire html document.  That's just a guess; I would love to hear
from others about this.  Anyway, since it is a separate thread, a token error
could kill it and there is no way for the calling thread to know about it.

We had to create our own html parser since we only cared about grabbing the
entire text from the html document and also we wanted to avoid the extra
thread.  We also do a lot of SKIPping for minimal EOF errors (html documents
in email almost never follow standards).  For your html needs, you might want
to check out other JavaCC HTML parsers from the JavaCC web site.
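
One way to let the calling thread learn about a failure in a parsing thread is to run the parse as a task and collect its result through a Future. This is a sketch under assumptions, not what the demo parser does: it needs java.util.concurrent (Java 5 or later), and BackgroundParse and parseOrFail are made-up names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper: runs a parse on a worker thread and reports failures to the caller. */
final class BackgroundParse {

    static String parseOrFail(Callable<String> parseTask) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = pool.submit(parseTask);
            // get() blocks until the worker finishes; a worker failure
            // arrives here wrapped in an ExecutionException.
            return result.get();
        } catch (ExecutionException e) {
            throw new IllegalStateException("parse failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for parse", e);
        } finally {
            // Stop the worker so it cannot keep the JVM alive.
            pool.shutdown();
        }
    }
}
```

The shutdown in the finally block is also relevant to the hang described in the quoted message below: a JVM does not exit on its own while non-daemon threads are still alive.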

Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote:
> Hi,
>
> I've been working with the HTML parser demo that comes with
> Lucene and I'm trying to understand why it's multi-threaded,
> and, more importantly, how to exit gracefully on errors.
>
> I've discovered if I throw an exception in the front-end static
> code (main(), etc.), the JVM hangs instead of exiting. Presumably
> this is because there are threads hanging around doing something.
> But I'm not sure what!
>
> Any pointers? I just want to exit gracefully on an error such as
> a required meta tag is missing or similar.
>
> Thanks,
>
> Fred
 



compiling 1.4 source

2004-09-23 Thread roy-lucene-user
Hi guys,

So we started upgrading to 1.4 and we need to add some of our own custom code.
 After compiling with ant, I noticed that the 1.4 ant script builds a jar
called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar.  I'm pretty sure I
did not download the wrong source.  Is this just a wrong name in the
properties or does the source code actually contain lucene 1.5 rc1 code?

Roy.




Hits.doc(x) and range queries

2004-09-14 Thread roy-lucene-user
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  Turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is 0, which is what we get from
DateField.dateToString( new Date( 0 ) ).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It usually happens each time the Hits object attempts to get
more results from an IndexSearcher (i.e., every 100 or so?).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?
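
To get a feel for why an open-ended date range could be expensive, it helps to count how many distinct indexed terms fall inside the range. This assumes the range is expanded into one clause per matching term, which is an assumption and not something confirmed in this thread. The sketch uses a TreeSet of made-up, sortable date strings as a stand-in for the index's term dictionary; RangeExpansion and clausesFor are hypothetical names.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical estimate of how many terms a range would touch. */
final class RangeExpansion {

    /** Counts the indexed terms in [lower, upper], both ends inclusive. */
    static int clausesFor(SortedSet<String> indexedTerms, String lower, String upper) {
        // subSet is half-open, so extend the upper bound by the smallest
        // possible character to include `upper` itself.
        return indexedTerms.subSet(lower, upper + "\0").size();
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("20040101");
        terms.add("20040102");
        terms.add("20040103");
        terms.add("20040910");
        // A narrow range touches few terms; a range starting at "0" touches all of them.
        System.out.println(clausesFor(terms, "20040102", "20040103"));
        System.out.println(clausesFor(terms, "0", "20040910"));
    }
}
```

With a begin date of 0 the range covers every date term in the index, so the count grows with the number of distinct dates rather than with the 350 or so results returned.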

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.




Re: Custom filter

2004-08-24 Thread roy-lucene-user
On Fri, 20 Aug 2004 20:01:36 -0400, Erik Hatcher wrote:
>
> On Aug 20, 2004, at 6:48 PM, [EMAIL PROTECTED] wrote:
> > We're currently in lucene 1.2... haven't moved to 1.3 yet.
>
> Skip 1.3 and go straight to 1.4.1 :)
>
> Upgrade - why not?

Well we have some MASSIVE indexes so updating needs to be planned out.  In the
meantime we continue with 1.2.  So, just for curiosity's sake... any clue on
the filter?  Or perhaps someone could clue me in on what kind of terms the
query parser creates ( and what the searcher class does with them ) when it
has something like (From:(blah OR blah2) OR To:(blah OR blah2)).  Tried to
look at the QueryParser.jj file but javacc makes my head hurt...

Roy.




Custom filter

2004-08-20 Thread roy-lucene-user
Hi guys!

I was hoping someone here could help me out with a custom filter.

We have an index of emails and do some searches on the text of an email message and 
also searches based on the email addresses in a To, From or CC.

Since we also do searches on a bunch of emails, we created a custom filter for 
searches on an array of fields for an array of values.  [code included below]

The problem we're having is that creating a query string like so:
Message:viagra AND (From:(email1 OR email2) OR To:(email1 OR email2) OR CC:(email1 OR 
email2))
would return results, but our filter combined with a query string of Message:viagra 
sometimes wouldn't.

One thing I noticed is that when the results do return with the filter, the email has 
the format of [EMAIL PROTECTED], but the one that doesn't has something like [EMAIL 
PROTECTED]

Also it might have something to do with the storage of the From or To or CC.  We don't 
parse out the email addresses before storing them.  So sometimes the value of a 
From/To/CC field might be [EMAIL PROTECTED] or local [EMAIL PROTECTED] or even 
[EMAIL PROTECTED].  Could the angle brackets be throwing off my filter?

I also wouldn't mind any suggestions for doing this filter better.

Here is the bits method from our custom filter:
-
final public BitSet bits( IndexReader reader ) throws IOException {
    // One bit per document number in the index.
    BitSet bits = new BitSet( reader.maxDoc() );

    // Set the bit for every document containing any value in any field.
    for ( int x = 0; x < fields.length; x++ ) {
        for ( int y = 0; y < values.length; y++ ) {
            TermDocs termDocs = reader.termDocs( new Term( fields[x], values[y] ) );
            try {
                while ( termDocs.next() ) {
                    bits.set( termDocs.doc() );
                }
            }
            finally {
                termDocs.close();
            }
        }
    }
    return bits;
}
-
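
The lookup in bits is an exact term match, so a value only hits if it equals the token stored in the index character for character. The sketch below mimics that with a plain HashMap instead of an IndexReader; the class, the index helper, and the example.com addresses are made up for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the filter's term lookups. */
final class TermUnionSketch {

    /** "field:term" -> ids of the documents containing exactly that term. */
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void index(String field, String term, Integer... docIds) {
        postings.put(field + ":" + term, Arrays.asList(docIds));
    }

    BitSet bits(String[] fields, String[] values, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String field : fields) {
            for (String value : values) {
                // Exact lookup: a value that differs from the indexed token
                // in any way (brackets, case, a display name) matches nothing.
                List<Integer> docs =
                        postings.getOrDefault(field + ":" + value, Collections.emptyList());
                for (int doc : docs) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }
}
```

If the From/To/CC fields are run through an analyzer at index time, the stored tokens may not equal the raw values passed to the filter, which would be one possible explanation for the missing hits.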

Thanks in advance,

Roy.




Re: addIndexes vs addDocument

2004-07-07 Thread roy-lucene-user
Otis,

Okay, got it... however we weren't creating new document objects... just
grabbing a document through an IndexReader and calling addDocument on another
index.  Would that still work with unstored fields (well, it's working for us
since we don't have any unstored fields)?

Thanks a lot!

Roy.

On Tue, 6 Jul 2004 19:46:30 -0700 (PDT), Otis Gospodnetic wrote:
> Quick example.
> Index A has fields 'title' and 'contents'.
> Field 'contents' is stored in A as Field.UnStored.
> This means that you cannot retrieve the original content of the
> 'contents' field, since that value was not stored verbatim in the
> index.
> Therefore, you cannot create a new Document instance, pull out String
> value of the 'contents' field from A, use it to create another field,
> add it to the new Document instance, and add that Document to a new
> index B using addDocument method.
>
> addIndexes method does not need to pull out the original String field
> values from Documents, so it will work.





addIndexes and optimize

2004-07-07 Thread roy-lucene-user
Hey y'all again,

Just wondering why the IndexWriter.addIndexes method calls optimize before and after 
it starts merging segments together.

We would like to create an addIndexes method that doesn't optimize and call optimize 
on the IndexWriter later.

Roy.




moving 1.2 index to 1.4

2004-07-02 Thread roy-lucene-user
Hey guys,

We have a couple of giant indexes that were done in lucene 1.2.  We would like to move 
to lucene 1.4 at some point.

We have heard that we would probably need to re-index our indexes to take advantage of 
certain new features/optimizations of lucene 1.3/1.4.

We were wondering if it was possible to open our old 1.2 index with an IndexReader, 
get each Document object, and add it to a new 1.4 index?  Would it be the same as 
re-building an index from scratch?

Thanks!

Roy.




RE: Stress/scalability testing Lucene

2002-11-20 Thread roy-lucene-user
Ah, for some reason I thought none of the Lucene methods were thread safe,
or is this only in the case of reading and writing at the same time?  I
thought I read this in the FAQ.

Roy.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 20, 2002 5:04 PM
To: Lucene Users List
Subject: Re: Stress/scalability testing Lucene



Justin Greene wrote:
> We created a thread pool to read and parse the email
> messages.  10 threads seems to be the magic number here for us.  We then
> created a queue of messages to be indexed onto which we push the parsed
> messages and have a single thread adding messages to the index.

IndexWriter.addDocument(Document) is thread safe, so you don't need a 
separate indexing thread.  So long as your analyzer is thread safe, you 
can index each message in the thread that parses it, for even greater 
parallelism.
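
A sketch of that arrangement, where each worker parses a message and adds it to the shared writer itself. SharedWriter is a synchronized stand-in written for this example, not Lucene's IndexWriter, and the "parsing" is just a trim and lowercase. It assumes Java 8 or later for the lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ParallelIndexingSketch {

    /** Stand-in for a thread-safe writer shared by all workers. */
    static final class SharedWriter {
        private final List<String> docs = new ArrayList<>();
        synchronized void addDocument(String doc) { docs.add(doc); }
        synchronized int size() { return docs.size(); }
    }

    /** Parses and indexes every message on a pool of worker threads; returns the document count. */
    static int indexAll(List<String> rawMessages, int threads) {
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawMessages) {
            // Parse and add in the same worker; no hand-off queue, no dedicated indexing thread.
            pool.execute(() -> writer.addDocument(raw.trim().toLowerCase()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.size();
    }
}
```

The one-minute timeout is arbitrary. As the paragraph above notes, this only holds if the analyzer used by each worker is itself thread safe.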

Doug




This email and any attachments are confidential and may be 
legally privileged. No confidentiality or privilege is waived 
or lost by any transmission in error.  If you are not the 
intended recipient you are hereby notified that any use, 
printing, copying or disclosure is strictly prohibited.  
Please delete this email and any attachments, without 
printing, copying, forwarding or saving them and notify the 
sender immediately by reply e-mail.  Zurich Capital Markets 
and its affiliates reserve the right to monitor all e-mail 
communications through its networks.  Unless otherwise 
stated, any pricing information in this e-mail is indicative 
only, is subject to change and does not constitute an offer 
to enter into any transaction at such price and any terms in 
relation to any proposed transaction are indicative only and 
subject to express final confirmation.



the order of fields in Document.fields()

2002-11-13 Thread roy-lucene-user
Quick question about Document.fields().

Lucene provides you with a method to retrieve the value of a field or grab
all fields as an Enumeration.  It does not, however, allow you to grab all
values of one field for a document; it will only return the last value added
for that field.

For example, I am indexing email messages that might have multiple To/CC/BCC
fields in the message header.  Currently to grab all the values when I
display an email that has been indexed, I must use the fields() method to
grab an Enumeration of all fields in a document.  I then separate them into
different arrays based on the field names.  However I am concerned about the
order of the fields since I consider the first To or CC or BCC to be the
main value for each field.  

Is the order of the fields returned in the order that they are added?  Or is
there no order?  If there is no order, can someone suggest a solution?

Thanks!

Roy.





RE: the order of fields in Document.fields()

2002-11-13 Thread roy-lucene-user
Shouldn't there be at least one method that returns an array of fields in
the correct order?

Roy.

-Original Message-
The order is preserved (or reversed actually), so it's not random.
It's the reverse of the order in which the fields were added
to the document.

This would be easy to test...
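
If the fields really do come back in reverse insertion order, grouping them by name and reversing each group restores the order in which they were added. A sketch in plain Java, with each field modelled as a {name, value} string pair rather than a Lucene Field; FieldGrouping and group are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldGrouping {

    /**
     * Groups {name, value} pairs by name. Pass reversed = true when the pairs
     * arrive in reverse insertion order, to restore the original order
     * inside each group.
     */
    static Map<String, List<String>> group(List<String[]> fields, boolean reversed) {
        Map<String, List<String>> byName = new LinkedHashMap<>();
        for (String[] field : fields) {
            byName.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        if (reversed) {
            for (List<String> values : byName.values()) {
                Collections.reverse(values);
            }
        }
        return byName;
    }
}
```

The first element of each list is then the first value added for that field, for example the main To address.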





Deleting a document found in a search

2002-10-09 Thread lucene . user

I am just getting started with Lucene and I think I have a problem
understanding  some basic concepts.

I am using two-part identifiers to uniquely identify a document in the
index.  So whenever I want to index a document, I first want to find and
delete the old form.

To find it, I intend to use:

BooleanQuery findOurs = new BooleanQuery();
findOurs.add(new TermQuery(new Term("Id", id)), true, false);
findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);

System.out.println("Deleting document matching: \"" +
   findOurs.toString() + '"');

Searcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(findOurs);

// Assert: hits.length() <= 1

for (int i = 0 ; i < hits.length() && i < 10; i++) {
  Document d = hits.doc(i);

  // Now what can I do to find document id?

  int id = ??
  searcher.delete(id);
}

But I can't discover how to convert a search result into a document id.  It
is recorded in the private HitDoc class, but since it is not publicly
accessible, there must be a reason why it would not work to add a public
getter for it.

Is there an alternative way that I can do this?  My first thought is to
define a Field.Keyword(composite-key, domain + \u + id).  This
would allow me to use the delete(Term) interface to delete the key.

-- 
Thanks, Adrian.





Enumerating all Terms

2002-10-09 Thread lucene . user

Is there a way of getting a list of all Terms that have been indexed?  I
guess it would approximate a wildcard query of the form *:* if that were
valid, and instead of returning matching documents, just returning the
fields and values.
-- 
Thanks, Adrian.





Re: Deleting a document found in a search

2002-10-09 Thread lucene . user

No, I mean HitDoc.id, the document number field stored in
the HitDoc class.  This number is needed when calling
IndexReader.delete(int docnum) but it is not publicly
accessible.

-- 
Adrian


At 06:32 09/10/2002 -0700, Otis Gospodnetic wrote:
> You mean d.get("Id"); ?


