how to query range of Date by given date string?

2007-02-26 Thread 李寻欢晕菜了

hello:
I have stored dates in the index; how can I query results for a given range
of dates?
for example:
I would like to find matching results in the range 2007-02-24 to 2007-02-25.


--
--
WoCal: life at your fingertips!
http://kofwang.wocal.cn
--


Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Paul Taylor

Hi,

Sorry, I don't see how I get access to TermEnums. So far I've created a
document per row; the first field holds the row id, then I have one
field per column, and I've checked that the index has been created OK with
some search queries.
I now want to pass a column to check, and receive a list of all the
documents that contain a term in that column which is used by at
least one other document for that column (a duplicate term).


thanks paul

Chris Hostetter wrote:

: Thanks this might do it, but do I need to know the terms beforehand, I
: just want to return any terms with frequency more than one?

no, TermEnum will let you iterate over all the terms ... you don't even
need TermDocs if you just want the docFreq for each term (which would be 1
if there are no duplicates)
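The docFreq test described above can be modeled without touching the index API: count how many documents contain each term of the column, then keep the terms whose count exceeds one. A stdlib-only sketch of that logic (the DuplicateTerms class and findDuplicateTerms name are illustrative; against a real index you would read the counts from TermEnum/docFreq instead):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DuplicateTerms {

    // For one column, count how many documents contain each term, then
    // keep the terms whose document frequency is greater than one -- the
    // same test a TermEnum/docFreq loop performs against a real index.
    public static Set<String> findDuplicateTerms(String[][] docsTerms) {
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (String[] docTerms : docsTerms) {
            // De-duplicate within a document so a term counts once per doc.
            for (String term : new HashSet<String>(Arrays.asList(docTerms))) {
                Integer n = docFreq.get(term);
                docFreq.put(term, n == null ? 1 : n + 1);
            }
        }
        Set<String> duplicates = new TreeSet<String>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Three "rows"; the value "smith" appears in two of them.
        String[][] docs = { { "smith" }, { "jones" }, { "smith" } };
        System.out.println(findDuplicateTerms(docs)); // prints [smith]
    }
}
```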

: Erick Erickson wrote:
: > Sure, you can use the TermDocs/TermEnum classes. Basically, for a term
: > (probably column value in your app) these let you quickly answer the
: > question "which (and how many) documents does this term appear in".
: > What you get is the Lucene doc id, which let's you fetch all the
: > information about the documents you want.
: >
: > Erick
: >
: > On 2/23/07, *Paul Taylor* <[EMAIL PROTECTED]
: > > wrote:
: >
: > Hi I have Java Swing application with a table, I was considering using
: > Lucene to index the data in the table. One task Id like to do is
: > for the
: > user to select 'Find Duplicate records for Column X', then I would
: > filter the table to show only records where there is more than one
: > with
: > the same value i.e duplicate for that column. Is there a way to return
: > all the duplicates from a Lucene index.
: >
: > thanks paul Taylor
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > 
: > For additional commands, e-mail: [EMAIL PROTECTED]
: > 
: >
: >
: > 
: >
: >
:
:
:



-Hoss



  






Re: how to define a pool for Searcher?

2007-02-26 Thread Mohammad Norouzi

No, actually I don't close the searcher; I just set a flag to true or false.
My considerations are:
[please note I provide a ResultSet, so I display the results page by page and
don't load all the results into a list]
1- should I open a searcher for each user, and one reader for all user
sessions?
2- or create a pool of searchers, and one reader for all searchers?
3- or a pool of searchers, each with its own reader, never closing them,
just setting a flag?

which one?


On 2/25/07, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:


On Sunday 25 February 2007 16:55, Mohammad Norouzi wrote:
> so, you mean, I open one reader for each session (each user) and never
> close it until the session has expired? if I do this, does that affect the
> performance?

The searcher/reader is a view on the index, and each time you open a new one
it costs some time. The only reason to open a new view on the index is that
you have modified it.
As you never modify it, as Mark said, never close your searcher/reader. If
your several users are reading the same index, you might share the searcher
instance across the different user sessions.

Nicolas
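The "share the instance" advice above boils down to one lazily-created, application-wide object instead of one per session. A generic sketch of that sharing pattern (SharedResource is a stand-in; with Lucene, the single IndexSearcher would be opened here once and reused until the index changes):

```java
public class SharedResource {

    // One application-wide instance, created on first use and shared by
    // every session; with Lucene this is where the IndexSearcher would
    // be opened once and then reused until the index is modified.
    private static volatile SharedResource instance;

    public static SharedResource getInstance() {
        if (instance == null) {
            synchronized (SharedResource.class) {
                if (instance == null) {
                    instance = new SharedResource();
                }
            }
        }
        return instance;
    }

    private SharedResource() {
        // With Lucene: open the IndexSearcher here.
    }

    public static void main(String[] args) {
        // Two "sessions" see the very same object.
        System.out.println(getInstance() == getInstance()); // prints true
    }
}
```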

>
> On 2/25/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> > If you never modify your index you should never need to close your
> > reader (or searcher). Doing so would just slow you down.
> >
> > Mohammad Norouzi wrote:
> > > Hi
> > > actually I don't have any writer or writing reader. I just have a
> > > reader. When a reader is created by the user, because the documents
> > > returned by hits are very many (for example 20,000), I display the
> > > results page by page. Whenever the user clicks to the next page, the
> > > hits will use the reader to load the next 20 records.
> > > Besides, I don't have one directory; there is more than one directory
> > > and index on the server, and each user may request any one of them.
> > > The problem is, a user may close his browser window and the reader
> > > will stay open because I can't detect it. And even if his session
> > > expires, my destroy method will be called and the searcher will close,
> > > but among the cached searchers I cannot detect which one is closed and
> > > ready for the next user. If the searcher had an isClosed() method it
> > > would be easy to determine, but unfortunately it hasn't.
> > >
> > > any idea?
> > > thanks again
> > >
> > >
> > > On 2/25/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> > >> I am a bit confused about what you are asking. Why do you need the
> > >> Searcher to time out? That code should release your searchers at the
> > >> appropriate times... when the index has been modified. The way that I
> > >> use it is to make a synchronized map that keeps around an index
> > >> accessor for each index that I open... from there the code should do
> > >> the rest... when a writer or a writing reader is released, the code
> > >> waits for all searchers to be released, then clears the cache of
> > >> searchers, and new searchers are created when requested until another
> > >> writer or writing reader is released...
> > >>
> > >> Mohammad Norouzi wrote:
> > >> > Thank you Mark for your useful help. The code you introduced was
> > >> > very helpful for me.
> > >> >
> > >> > But my only question is that I need to place an idle time on each
> > >> > open searcher, so if it exceeds the specified time, then release
> > >> > that searcher and get it ready for another thread.
> > >> >
> > >> > How can I add such a feature? I was thinking of a timeout listener,
> > >> > but don't know where to put it. I have a SingleSearcher that wraps
> > >> > Lucene's Searcher and returns a ResultSet in which I put a Hits
> > >> > object. Do I have to put the timer in my ResultSet or my
> > >> > SingleSearcher?
> > >> >
> > >> > Still I don't know whether the reader matters to the Hits or to the
> > >> > Searcher. Consider I passed a Hits to my ResultSet; now, if I close
> > >> > the searcher, will the Reader get closed? Another vague thing: can
> > >> > a Reader work thread-safely for every Searcher with different
> > >> > queries?
> > >> >
> > >> > Thank you very much again.
> > >> >
> > >> > On 2/22/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> > >> >> I would not do this from scratch...if you are interested in Solr
go
> > >>
> > >> that
> > >>
> > >> >> route else I would build off
> > >> >> http://issues.apache.org/jira/browse/LUCENE-390
> > >> >>
> > >> >> - Mark
> > >> >>
> > >> >> Mohammad Norouzi wrote:
> > >> >> > Hi all,
> > >> >> > I am going to build a Searcher pooling. if any one has
> > >>
> > >> experience on
> > >>
> > >> >> > this, I
> > >> >> > would be glad to hear his/her recommendation and suggestion. I
> >
> > want
> >
> > >> to
> > >>
> > >> >> > know
> > >> >> > what issues I should be apply. considering I am going to use
> > >>
> > >> this on
> > >> a
> > >>
> > >> >>

Re: a question about indexing database tables

2007-02-26 Thread Erick Erickson

No, that's not what I was talking about. Remember that there's no
requirement in Lucene that each document have the same fields. So, you have
something like...

Document doc = new Document();
doc.add(new Field("table1_id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("table1_name", name, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);


doc = new Document();
doc.add(new Field("table2_id_emp", employeeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("table2_address", address, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);


You now have two documents in the index, one with fields "table1_id",
"table1_name" and the other has fields "table2_id_emp", "table2_address".

These two documents are entirely orthogonal. That is, they have no fields in
common. Even if the values for these fields are the same (say, for some
strange reason, your name from table1 has a value of "nelson" and the address
from table2 also has a value of "nelson"), they don't interfere with each
other, since searching for "table1_name:nelson" would never look in the field
"table2_address".

So, all your tables can be stored in the same *index* (not the same
document, and most certainly not in the same field). They are all separate
because no two fields are the same for rows (documents) from different
tables.

The basic idea is that you index one lucene document for each row.

That said, I can't imagine that this is all you want to do. A one-for-one
mapping of table rows to documents is almost sure to be not the best design.
You'll probably want to de-normalize your data for easy lookup etc. There'll
be some up-front design work to get optimal performance. Especially, there's
no sense of performing joins in Lucene, and you shouldn't try.

Overall, use Lucene for searching/sorting text, use your RDBMS for
relational things.

Best
Erick

On 2/26/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:


Hi Erick
thank you, and sorry for taking so long to reply. I am involved in a project.

I was thinking about your idea of storing all tables in the same field. It
seems to me a good idea, but there are some vague issues.
First, how to create a Lucene document: did you mean storing all tables
by joining all tables?
If not, how do you determine each row to be inserted in the index file?

Second, let's say we indexed all tables as you said; how do we find out
the data related in a hierarchy of tables?

Please have a look at the following structure; I want to know whether I
understood you or not:

table1_id  table1_name  table1_family  table2_id_emp  table2_address
1          x1           f1             1              street1
2          x2           f2             1              street2
3          x3           f3             2              street6


On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> OK, I was off on a tangent. We've had several discussions where people
> were effectively trying to replace a RDBMS with Lucene and finding out
> that RDBMSs are very good at what they do ...
>
> But in general, I'd probably approach it by doing the RDBMS work first
> and indexing the result. I think this is your option (2). Yes, this will
> de-normalize a bunch of your data and you'll chew up some space, but
> disk space is cheap. Very cheap.
>
> One thing to remember, though, that took me a while to get used to,
> especially when I had my database hat on: there's no requirement that
> every document in a Lucene index have the same fields. Conceptually, you
> can store *all* your tables in the same index. So a document for table
> one has fields table_1_field1 table_1_field2 table_1_field3; "documents"
> for table two have fields table_2_field1 table_2_field2 etc.
>
> These documents will never interfere with each other during searches
> because they share no fields (and each query goes against a particular
> field).
>
> I mention this because your maintenance will be much easier if you only
> have one index 
>
> Best
> Erick
>
> On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> >
> > Thanks Erick
> > but we have to, because we need to execute very big queries that
> > create network traffic and are very, very slow. But with Lucene we do
> > it in some milliseconds. And now we have indexed our needed
> > information by joining tables. It works fine; besides, it returns the
> > exact results we can get from the database. We indexed about one
> > million records.
> > But let me say, we are not using it instead of the database; we use it
> > to generate some dynamic reports that, if we did them with SQL
> > queries, would take about 15 minutes.
> >
> > On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> > >
> > > don't do either one  Search this mail archive for discussions
> > > of databases; there are several long threads discussing this along
> > > with various options on how to make this work. See particularly a
> > > mail entitled *Oracle/Lucene integration -status-* and any
> > > discussions participated in by Marcelo Ochoa.
> > >
> > > But, in general, Lucene is a text search engine, NOT a RDBMS. When
> > > you start
> > > 

Re: how to query range of Date by given date string?

2007-02-26 Thread Erick Erickson

Well, you need to do two things. First, make sure your dates are indexed so
they can be sorted lexically, which the format you're showing already allows.
You might want to look at the DateTools class for handy methods of
transforming dates into a Lucene-friendly format.

Then use the RangeQuery or RangeFilter classes to search over a range.

Best
Erick
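The "sorted lexically" requirement works because zero-padded date strings such as yyyyMMdd compare in the same order as the dates they encode, which is what lets a term range scan act as a date range. A stdlib-only illustration of that property (the inRange helper is for demonstration; in Lucene you would combine DateTools with RangeQuery or RangeFilter):

```java
public class LexicalDates {

    // A zero-padded yyyyMMdd string compares lexicographically in the
    // same order as the date it encodes -- the property a term range
    // query relies on.
    public static boolean inRange(String date, String from, String to) {
        return date.compareTo(from) >= 0 && date.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("20070224", "20070224", "20070225")); // prints true
        System.out.println(inRange("20070226", "20070224", "20070225")); // prints false
    }
}
```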

On 2/26/07, 李寻欢晕菜了 <[EMAIL PROTECTED]> wrote:


hello:
I have stored dates in the index; how can I query results for a given
range of dates?
for example:
I would like to find matching results in the range 2007-02-24 to
2007-02-25.





Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Erick Erickson

Here's an excerpt from something I wrote to enumerate all the terms for a
field. I hacked out some of my tracing, so it may not even compile.

Basically, change the line "if (td.next())" to "while (td.next())", and every
time you stay in that loop for more than one cycle, you'll have duplicates
for that particular term.

 private void enumField(String field) throws Exception
   {
   long start = System.currentTimeMillis();
   // Position the enumeration at the first term of the given field.
   TermEnum termEnum = this.reader.getIndexReader().terms(new Term(field, ""));

   this.writer.println("Values for field " + field);

   TermDocs td = this.reader.getIndexReader().termDocs();
   Term term = termEnum.term();
   int termCount = 0;
   int docCount = 0;

   while ((term != null) && term.field().equals(field)) {
       td.seek(term);

       // Change this "if" to a "while" to visit every document that
       // contains the term; looping more than once means duplicates.
       if (td.next()) {
           ++docCount;
       }

       term = termEnum.next() ? termEnum.term() : null;
       ++termCount;
   }
   this.writer.println(termCount + " terms scanned in "
           + (System.currentTimeMillis() - start) + " ms");
   }


Erick

On 2/26/07, Paul Taylor <[EMAIL PROTECTED]> wrote:


Hi,

Sorry, I don't see how I get access to TermEnums. So far I've created a
document per row; the first field holds the row id, then I have one
field per column, and I've checked that the index has been created OK with
some search queries.
I now want to pass a column to check, and receive a list of all the
documents that contain a term in that column which is used by at
least one other document for that column (a duplicate term).

thanks paul

Chris Hostetter wrote:
> : Thanks this might do it, but do I need to know the terms beforehand, I
> : just want to return any terms with frequency more than one?
>
> no, TermEnum will let you iterate over all the terms ... you don't even
> need TermDocs if you just want the docFreq for each term (which would be
1
> if there are no duplicates)
>
> : Erick Erickson wrote:
> : > Sure, you can use the TermDocs/TermEnum classes. Basically, for a
term
> : > (probably column value in your app) these let you quickly answer the
> : > question "which (and how many) documents does this term appear in".
> : > What you get is the Lucene doc id, which let's you fetch all the
> : > information about the documents you want.
> : >
> : > Erick
> : >
> : > On 2/23/07, *Paul Taylor* <[EMAIL PROTECTED]
> : > > wrote:
> : >
> : > Hi I have Java Swing application with a table, I was considering
using
> : > Lucene to index the data in the table. One task Id like to do is
> : > for the
> : > user to select 'Find Duplicate records for Column X', then I
would
> : > filter the table to show only records where there is more than
one
> : > with
> : > the same value i.e duplicate for that column. Is there a way to
return
> : > all the duplicates from a Lucene index.
> : >
> : > thanks paul Taylor
> : >
> : >
> : > 
> : >
> : >
> : >

> : >
> : >
> :
> :
> :
>
>
>
> -Hoss
>
>
>
>






Re: a question about indexing database tables

2007-02-26 Thread Mohammad Norouzi

Thank you very much, Erick.
Really, I didn't know about this. I had just thought we could not add two
different kinds of documents.
Now my question is: if I index the data in the way you said, and I have a
query like "(table1_name:john) AND (table2_address:Adam street)"
using MultiFieldQueryParser, what is the result?
Does the result come from both documents, and if so, which document will
Hits return?
Is the result a union of the two documents or an intersection?

thank you very much indeed

On 2/26/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


No, that's not what I was talking about. Remember that there's no
requirement in Lucene that each document have the same fields. So, you
have
something like...

Document doc = new Document();
doc.add(new Field("table1_id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("table1_name", name, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);


doc = new Document();
doc.add(new Field("table2_id_emp", employeeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("table2_address", address, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);


You now have two documents in the index, one with fields "table1_id",
"table1_name" and the other has fields "table2_id_emp", "table2_address".

These two documents are entirely orthogonal. That is, they have no fields
in common. Even if the values for these fields are the same (say, for some
strange reason, your name from table1 has a value of "nelson" and the
address from table2 also has a value of "nelson"), they don't interfere
with each other, since searching for "table1_name:nelson" would never look
in the field "table2_address".

So, all your tables can be stored in the same *index* (not the same
document, and most certainly not in the same field). They are all separate
because no two fields are the same for rows (documents) from different
tables.

The basic idea is that you index one lucene document for each row.

That said, I can't imagine that this is all you want to do. A one-for-one
mapping of table rows to documents is almost sure to be not the best
design.
You'll probably want to de-normalize your data for easy lookup etc.
There'll
be some up-front design work to get optimal performance. Especially,
there's
no sense of performing joins in Lucene, and you shouldn't try.

Overall, use Lucene for searching/sorting text, use your RDBMS for
relational things.

Best
Erick

On 2/26/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
>
> Hi Erick
> thank you and sorry for taking long my reply. I am involving in a
project.
>
> I was thinking of your idea about storing all tables in the same field.
it
> seems to me a good idea, but some vague issues.
> first, how to create a lucene's document. did you mean, storing all
tables
> by joining all tables?
> if no, how to determine each row to be inserted in the index file?
>
> second, let's consider we indexed all tables as you said, how to find
out
> the data related in hierarchy of tables.
>
> please have a look at following structure, I want to know whether I
> understand you or not?
>
> table1_id  table1_name  table1_family  table2_id_emp  table2_address
> 1          x1           f1             1              street1
> 2          x2           f2             1              street2
> 3          x3           f3             2              street6
>
>
> On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > OK, I was off on a tangent. We've had several discussions where people
> > were
> > effectively trying to replace a RDBMS with Lucene and finding out it
> that
> > RDBMSs are very good at what they do ...
> >
> > But in general, I'd probably approach it by doing the RDBMS work first
> and
> > indexing the result. I think this is your option (2). Yes, this will
> > de-normalize a bunch of your data and you'll chew up some space, but
> disk
> > space is cheap. Very cheap .
> >
> > One thing to remember, though, that took me a while to get used to,
> > especially when I had my database hat on. There's no requirement that
> > every
> > document in a Lucene index have the same fields. Conceptually, you can
> > store
> > *all* your tables in the same index. So a document for table one has
> > fields
> > table_1_field1 table_1_field2 table_1_field3. "documents" for table
two
> > have
> > fields table_2_field1 table_2_field2 etc.
> >
> > These documents will never interfere with each other during searches
> > because
> > they share no fields (and each query goes against a particular field).
> >
> > I mention this because your maintenance will be much easier if you
only
> > have
> > one index 
> >
> > Best
> > Erick
> >
> > On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> > >
> > > Thanks Erick
> > > but we have to because we need to execute very big queries that
create
> > > traffik network and are very very slow. but with lucene we do it in
> some
> > > milliseconds. and now we indexed our needed information by joining
> > tables.
> > > it works fine, besides, it returns the exact result as we can get
from
> > > database. we indexed about one million records.
> > > but let me say, we ar

Date searches

2007-02-26 Thread Kainth, Sachin
Hi all,

I have an index in which dates are represented as ranges of two integers
(there are two fields one foreach integer).  The two integers are years.
AD dates are represented as a positive integer and BC dates as a
negative one  There are three possible types of ranges.  These are
listed below with example dates:

*   BC - BC (-2000 - -1000)
*   BC - AD (-1000 - 1000)
*   AD - AD (1000 - 1200)

What I want is to have a textbox in which the user enters a year (eg
1990) and all records for which that date falls within the record's date
range are returned.

What would be the query syntax for this?

Cheers 
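One way to frame the query above: a record matches when its start year is at most the entered year and its end year is at least the entered year; with BC years stored as negative integers, the same comparison covers all three range types. A sketch of that predicate (YearRangeMatch is illustrative only; an actual Lucene range query would additionally need the integers encoded so they sort correctly, e.g. offset and zero-padded):

```java
public class YearRangeMatch {

    // A record's [startYear, endYear] range contains the queried year when
    // startYear <= year <= endYear; BC years are simply negative integers,
    // so the same test handles BC-BC, BC-AD and AD-AD ranges.
    public static boolean contains(int startYear, int endYear, int year) {
        return startYear <= year && year <= endYear;
    }

    public static void main(String[] args) {
        System.out.println(contains(-2000, -1000, -1500)); // BC-BC range: prints true
        System.out.println(contains(-1000, 1000, 0));      // BC-AD range: prints true
        System.out.println(contains(1000, 1200, 1990));    // AD-AD range: prints false
    }
}
```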


This email and any attached files are confidential and copyright protected. If 
you are not the addressee, any dissemination of this communication is strictly 
prohibited. Unless otherwise expressly agreed in writing, nothing stated in 
this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered 
in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, 
Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need 
to. 


RE: how to query range of Date by given date string?

2007-02-26 Thread WATHELET Thomas
Parse your date with the DateTools class: DateTools.stringToDate() when
searching, and DateTools.dateToString() when storing into the index.

-Original Message-
From: 李寻欢晕菜了 [mailto:[EMAIL PROTECTED] 
Sent: 26 February 2007 11:17
To: java-user@lucene.apache.org
Subject: how to query range of Date by given date string?

hello:
I have stored dates in the index; how can I query results for a given range
of dates?
for example:
I would like to find matching results in the range 2007-02-24 to 2007-02-25.








Re: a question about indexing database tables

2007-02-26 Thread Erick Erickson

You'll get no hits since there is no document in the index that has both
fields.

On 2/26/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:


Thank you very much Erick.
Really I didnt know about this. I've just thought we can not add two
different documents.
now my question is, if I index the data in the way you said and now I have
a
query say, "(table1_name:john) AND (table2_address:Adam street)"
using MultiFieldQueryParser, what is the result?
is the result from both documents and if yes, which document will the Hits
return?
is the result a union of two documents or intersection?

thank you very much indeed

On 2/26/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> No, that's not what I was talking about. Remember that there's no
> requirement in Lucene that each document have the same fields. So, you
> have
> something like...
>
> Document doc = new Document();
> doc.add(new Field("table1_id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
> doc.add(new Field("table1_name", name, Field.Store.YES, Field.Index.TOKENIZED));
> writer.addDocument(doc);
>
>
> doc = new Document();
> doc.add(new Field("table2_id_emp", employeeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
> doc.add(new Field("table2_address", address, Field.Store.YES, Field.Index.TOKENIZED));
> writer.addDocument(doc);
>
>
> You now have two documents in the index, one with fields "table1_id",
> "table1_name" and the other has fields "table2_id_emp",
"table2_address".
>
> These two documents are entirely orthogonal. That is, they have no
fields
> in
> common. Even if the values for these fields are the same (say for some
> strange reason your name from table1 has a value of "nelson" and the
> address
> from table2 also has a value of "nelson". These don't interfere with
each
> other since searching for "table1_name:nelson" would never look in the
> field
> "table2_address".
>
> So, all your tables can be stored in the same *index* (not the same
> document, and most certainly not in the same field). They are all
separate
> because no two fields are the same for rows (documents) from different
> tables.
>
> The basic idea is that you index one lucene document for each row.
>
> That said, I can't imagine that this is all you want to do. A
one-for-one
> mapping of table rows to documents is almost sure to be not the best
> design.
> You'll probably want to de-normalize your data for easy lookup etc.
> There'll
> be some up-front design work to get optimal performance. Especially,
> there's
> no sense of performing joins in Lucene, and you shouldn't try.
>
> Overall, use Lucene for searching/sorting text, use your RDBMS for
> relational things.
>
> Best
> Erick
>
> On 2/26/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> >
> > Hi Erick
> > thank you and sorry for taking long my reply. I am involving in a
> project.
> >
> > I was thinking of your idea about storing all tables in the same
field.
> it
> > seems to me a good idea, but some vague issues.
> > first, how to create a lucene's document. did you mean, storing all
> tables
> > by joining all tables?
> > if no, how to determine each row to be inserted in the index file?
> >
> > second, let's consider we indexed all tables as you said, how to find
> out
> > the data related in hierarchy of tables.
> >
> > please have a look at following structure, I want to know whether I
> > understand you or not?
> >
> > table1_id  table1_name  table1_family  table2_id_emp  table2_address
> > 1          x1           f1             1              street1
> > 2          x2           f2             1              street2
> > 3          x3           f3             2              street6
> >
> >
> > On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> > >
> > > OK, I was off on a tangent. We've had several discussions where
people
> > > were
> > > effectively trying to replace a RDBMS with Lucene and finding out it
> > that
> > > RDBMSs are very good at what they do ...
> > >
> > > But in general, I'd probably approach it by doing the RDBMS work
first
> > and
> > > indexing the result. I think this is your option (2). Yes, this will
> > > de-normalize a bunch of your data and you'll chew up some space, but
> > disk
> > > space is cheap. Very cheap .
> > >
> > > One thing to remember, though, that took me a while to get used to,
> > > especially when I had my database hat on. There's no requirement
that
> > > every
> > > document in a Lucene index have the same fields. Conceptually, you
can
> > > store
> > > *all* your tables in the same index. So a document for table one has
> > > fields
> > > table_1_field1 table_1_field2 table_1_field3. "documents" for table
> two
> > > have
> > > fields table_2_field1 table_2_field2 etc.
> > >
> > > These documents will never interfere with each other during searches
> > > because
> > > they share no fields (and each query goes against a particular
field).
> > >
> > > I mention this because your maintenance will be much easier if you
> only
> > > have
> > > one index 
> > >
> > > Best
> > > Erick
> > >
> > > On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Thanks Erick
> > > > but we have to beca

Date Searches

2007-02-26 Thread Kainth, Sachin
Anybody?

> __ 
> From: Kainth, Sachin  
> Sent: 26 February 2007 13:36
> To:   'java-user@lucene.apache.org'
> Subject:  Date searches
> 
> Hi all,
> 
> I have an index in which dates are represented as ranges of two
> integers (there are two fields one foreach integer).  The two integers
> are years.  AD dates are represented as a positive integer and BC
> dates as a negative one  There are three possible types of ranges.
> These are listed below with example dates:
> 
> * BC - BC (-2000 - -1000)
> * BC - AD (-1000 - 1000)
> * AD - AD (1000 - 1200)
> 
> What I want is to have a textbox in which the user enters a year (eg
> 1990) and all records for which that date falls within the record's
> date range are returned.
> 
> What would be the query syntax for this?
> 
> Cheers 




RE: Date Searches

2007-02-26 Thread Seeta Somagani

This might help.
http://www.catb.org/~esr/faqs/smart-questions.html


-Original Message-
From: Kainth, Sachin [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 26, 2007 10:17 AM
To: java-user@lucene.apache.org
Subject: Date Searches

Anybody?

> __ 
> From: Kainth, Sachin  
> Sent: 26 February 2007 13:36
> To:   'java-user@lucene.apache.org'
> Subject:  Date searches
> 
> Hi all,
> 
> I have an index in which dates are represented as ranges of two
> integers (there are two fields one foreach integer).  The two integers
> are years.  AD dates are represented as a positive integer and BC
> dates as a negative one  There are three possible types of ranges.
> These are listed below with example dates:
> 
> * BC - BC (-2000 - -1000)
> * BC - AD (-1000 - 1000)
> * AD - AD (1000 - 1200)
> 
> What I want is to have a textbox in which the user enters a year (eg
> 1990) and all records for which that date falls within the record's
> date range are returned.
> 
> What would be the query syntax for this?
> 
> Cheers 







One index per user or one index per day?

2007-02-26 Thread ariel goldberg

Greetings,

I'm creating an application that requires the indexing of millions of
documents on behalf of a large group of users, and was hoping to get an
opinion on whether I should use one index per user or one index per day.

My application will have to handle the following:

- the indexing of about 1 million 5K documents per day, with each document
containing about 5 fields

- expiration of documents, since after a while my hard drive would run out
of room

- queries that consist of boolean expressions (e.g., the body field contains
"a" AND "b", and the title field contains "c"), as well as ranges (e.g., the
document needs to have been indexed between 2/25/07 10:00 am and 2/28/07
9:00 pm)

- permissions; in other words, user A might be able to search on documents X
and Y, but user B might be able to search on documents Y and Z

- up to 1,000 users

So, I was considering the following:

1) Using one index per user

This would entail creating and using up to 1,000 indices. Document Y in the
example above would have to be duplicated. Expiration is performed via
IndexWriter.deleteDocuments. The advantage here is that querying should be
reasonably quick, because each index would only contain tens of thousands of
documents, instead of millions. The disadvantages: I'm concerned about the
"too many open files" error, and I'm also concerned about the performance of
deleteDocuments.

2) Using one index per day

Each day, I create a new index. Again, document Y in the example above would
have to be duplicated (is there any way around this?). The advantage here is
that expiring documents means simply deleting the index corresponding to a
particular day. The disadvantage is query performance, since the queries,
which are already very complex, would have to be performed using
MultiSearcher (if expiration is after 10 days, that's 10 indices to search
across).

Tough to know for sure which option is better without testing, but does
anyone have a gut reaction? Any advice would be greatly appreciated!

Thanks,

Ariel






 

Need Mail bonding?
Go to the Yahoo! Mail Q&A for great tips from Yahoo! Answers users.
http://answers.yahoo.com/dir/?link=list&sid=396546091

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



NPE in RAMDirectory after upgrade to 2.1

2007-02-26 Thread jm

Hello all,

I have two processes running in parallel, each one adding and deleting
to its own set of indexes. Since I upgraded to 2.1 I am getting a NPE
at RAMDirectory.java line 207 in one of the processes.

Line 207 is:
 RAMFile existing = (RAMFile)fileMap.get(name);
the stack trace is:
java.lang.NullPointerException
org.apache.lucene.store.RAMDirectory.createOutput(RAMDirectory.java:207)
org.apache.lucene.index.FieldInfos.write(FieldInfos.java:256)
org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:75)
org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:706)
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:694)
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:680)


I understand this RAMDirectory is something used internally by the
FSDirectories.

I have been double-checking my code and I cannot see anything wrong.
Besides upgrading to Lucene 2.1, I made some changes to take advantage
of new features (mainly setting the locking factory of my indexes to be
the native one, and letting IndexWriter delete stuff).

Does someone have a clue? I tried to reproduce it on my workstation with no
success, but it happens consistently in my prod environment.

thanks
javi

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Paul Taylor

Hi

I got it working before I saw your latest mail; the only problem is that it
doesn't look very efficient. This is my duplicate method, and the problem is
that I have to enumerate through *every* term. This was worse before,
because I was only interested in terms that matched a particular field
(column) but had to enumerate through every term no matter which field it
belonged to, so I recreated my index so that each document only contains a
row number field and a second field for the value of the column. However,
this means I am going to end up with a number of different indexes, each
solving a particular problem.


paul

public List getDuplicates()
{
    List matches = new ArrayList();
    try
    {
        IndexReader ir = IndexReader.open(directory);
        TermEnum terms = ir.terms();
        while (terms.next())
        {
            if (terms.docFreq() > 1)
            {
                TermDocs termDocs = ir.termDocs(terms.term());
                while (termDocs.next())
                {
                    Document d = ir.document(termDocs.doc());
                    matches.add(new Integer(d.getField(ROW_NUMBER).stringValue()));
                }
            }
        }
    }
    catch (IOException ioe)
    {
        ioe.printStackTrace();
    }
    return matches;
}

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Date Searches

2007-02-26 Thread Kainth, Sachin
Ok, here's some more information on this. I have tried this search:

"date1:{-99 TO " + Date + "} AND date2:{" + Date + " TO 99}";

But the problem is that the range search here uses lexicographic and not
numeric ranges.  Is there a way to use numeric ranges?
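A common workaround (a sketch, not something from this thread) is to shift
and zero-pad years at index time so that lexicographic order coincides with
numeric order; the class name, the offset, and the width of 5 below are
arbitrary assumptions, chosen only to cover the year range in the example:

```java
// Encode a year (possibly negative, i.e. BC) as a fixed-width string whose
// lexicographic order equals its numeric order, so a lexicographic range
// query behaves numerically.  OFFSET shifts BC years to non-negative values.
public class YearPadder {
    private static final int OFFSET = 10000;

    public static String encode(int year) {
        return String.format("%05d", year + OFFSET);
    }
}
```

At query time the user's year is encoded the same way before being placed in
the range; for instance `encode(-2000)` yields `"08000"` and `encode(1990)`
yields `"11990"`, and `"08000"` sorts before `"11990"` lexicographically.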

-Original Message-
From: Seeta Somagani [mailto:[EMAIL PROTECTED] 
Sent: 26 February 2007 15:23
To: java-user@lucene.apache.org
Subject: RE: Date Searches


This might help.
http://www.catb.org/~esr/faqs/smart-questions.html


-Original Message-
From: Kainth, Sachin [mailto:[EMAIL PROTECTED]
Sent: Monday, February 26, 2007 10:17 AM
To: java-user@lucene.apache.org
Subject: Date Searches

Anybody?

> __ 
> From: Kainth, Sachin  
> Sent: 26 February 2007 13:36
> To:   'java-user@lucene.apache.org'
> Subject:  Date searches
> 
> Hi all,
> 
> I have an index in which dates are represented as ranges of two 
> integers (there are two fields one foreach integer).  The two integers

> are years.  AD dates are represented as a positive integer and BC 
> dates as a negative one.  There are three possible types of ranges.
> These are listed below with example dates:
> 
> * BC - BC (-2000 - -1000)
> * BC - AD (-1000 - 1000)
> * AD - AD (1000 - 1200)
> 
> What I want is to have a textbox in which the user enters a year (eg
> 1990) and all records for which that date falls within the record's 
> date range are returned.
> 
> What would be the query syntax for this?
> 
> Cheers




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NPE in RAMDirectory after upgrade to 2.1

2007-02-26 Thread Michael McCandless

"jm" <[EMAIL PROTECTED]> wrote:

> I have two processes running in parallel, each one adding and deleting
> to its own set of indexes. Since I upgraded to 2.1 I am getting a NPE
> at RAMDirectory.java line 207 in one of the processes.
> 
> Line 207 is:
>   RAMFile existing = (RAMFile)fileMap.get(name);
> the stack trace is:
> java.lang.NullPointerException
> org.apache.lucene.store.RAMDirectory.createOutput(RAMDirectory.java:207)
> org.apache.lucene.index.FieldInfos.write(FieldInfos.java:256)
> org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:75)
> org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:706)
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:694)
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:680)

Hmm.  One thing that changed in 2.1 is when a RAMDirectory is closed
it now sets fileMap to null (which it did not pre-2.1).

Is it possible you are accidentally closing a writer but then calling
its addDocument method?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: One index per user or one index per day?

2007-02-26 Thread Shane
If you can categorize the documents based on user permissions, that is 
the route I would go. 

For example, users 1, 2, and 3 are allowed to search documents a and b.
In addition, user 1 can search documents c and d, while users 2 and 3 can
search documents e and f.  I would create 3 indexes: one for docs a and b,
one for docs c and d, and finally one for docs e and f.  Then, using your
method of choice, you can restrict documents based on the user's
permissions.


I realize scaling may cause an issue, but this route would allow you to 
normalize your data and reduce duplication in the system.
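Shane's grouping can be sketched in plain Java (the Lucene indexing itself is
omitted): documents that share the exact same set of permitted users end up
in the same group, so a shared document like Y is stored only once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Group document ids by their full permission set; each distinct set of
// permitted users becomes one index, so no document is duplicated.
public class PermissionPartitioner {
    public static Map<Set<String>, List<String>> partition(
            Map<String, Set<String>> docPermissions) {
        Map<Set<String>, List<String>> indexes =
                new HashMap<Set<String>, List<String>>();
        for (Map.Entry<String, Set<String>> e : docPermissions.entrySet()) {
            List<String> docs = indexes.get(e.getValue());
            if (docs == null) {
                docs = new ArrayList<String>();
                indexes.put(e.getValue(), docs);
            }
            docs.add(e.getKey()); // the doc joins the index for its perm set
        }
        return indexes;
    }
}
```

With the example above (a, b readable by 1/2/3; c, d by 1; e, f by 2/3) this
yields exactly three groups, i.e. three indexes.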


Shane

ariel goldberg wrote:

> I'm creating an application that requires the indexing of millions of
> documents on behalf of a large group of users, and was hoping to get an
> opinion on whether I should use one index per user or one index per day.
> [...]
> Tough to know for sure which option is better without testing, but does
> anyone have a gut reaction?  Any advice would be greatly appreciated!


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to define a pool for Searcher?

2007-02-26 Thread Nicolas Lalevée
On Monday, 26 February 2007 12:38, Mohammad Norouzi wrote:
> No. actually I dont close the searcher. I just set a flag to true or false.
> my considerations are:
> [please note I provided a ResultSet so I display the result page by page
> and dont load all the result in a list]
> 1- should I open a searcher for each user? and one reader for all user
> session?
> 2- or create a pool of searcher and one reader for all searcher?
> 3- or a pool of searcher and each has its own reader. but never close them
> just set a flag to true.
>
> which one?

I am probably missing something here, because I will keep giving the same
answer: have only one instance of an index searcher in your entire web
application.

Nicolas
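Nicolas's advice can be sketched without any Lucene classes; `Searcher`
below is a stand-in for `org.apache.lucene.search.IndexSearcher`, created
lazily once and then handed to every session of the web application:

```java
// One shared searcher for the whole application: created on first use,
// returned to all callers, and never closed while the index is read-only.
public class SearcherHolder {
    /** Stand-in for org.apache.lucene.search.IndexSearcher. */
    public static class Searcher { }

    private static volatile Searcher instance;

    public static Searcher get() {
        if (instance == null) {
            synchronized (SearcherHolder.class) {
                if (instance == null) {
                    instance = new Searcher(); // open the real index once here
                }
            }
        }
        return instance;
    }
}
```

Every session calling `SearcherHolder.get()` receives the same instance; a
new one would only be needed after the index is modified.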

>
> On 2/25/07, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:
> > On Sunday, 25 February 2007 16:55, Mohammad Norouzi wrote:
> > > so, you mean, I open one reader for each session (each user) and never
> > > close it until the session has expired? if I do this, will that affect
> > > the performance?
> >
> > The searcher/reader is a view on the index, and each time you open a new
> > one, it costs some time. And the only reason to have a new view on the
> > index is that you modified it. As you never modify it, as Mark said,
> > never close your searcher/reader. If your several users are reading the
> > same index, you might share the instance of the searcher across the
> > different user sessions.
> >
> > Nicolas
> >
> > > On 2/25/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> > > > If you never modify your index you should never need to close your
> > > > reader (or searcher). Doing so would just slow you down.
> > > >
> > > > Mohammad Norouzi wrote:
> > > > > Hi
> > > > > actually I dont have any writer or writing reader. I just have
> > > > > reader. when a reader is created by the user, because the document
> > > > > returned by hits is very much, for example 20,000, I display the
> > > > > result page by page. whenever the user clicks to the next page the
> > > > > hits will use the reader to load the next 20 records. besides, I
> > > > > dont have one directory, there are more than one directory and
> > > > > index on the server and each user may request for one of them.
> > > > > the problem is, a user may close his browser window and the reader
> > > > > will stay open because I cant detect it. and if his session
> > > > > expires my destroy method will be called and the searcher will
> > > > > close, but in the cached searchers I can not detect which one is
> > > > > closed and ready it for the next user. if the searcher had an
> > > > > isClosed() method it was easy to determine but unfortunately it
> > > > > hasn't
> > > > >
> > > > > any idea?
> > > > > thanks again
> > > > >
> > > > > On 2/25/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> > > > >> I am a bit confused about what you are asking. Why do you need the
> > > > >> Searcher to time out? That code should release your searchers at
> > > > >> the appropriate times...when the index has been modified. The way
> > > > >> that I use it is to make a synchronized map that keeps around an
> > > > >> index accessor for each index that I open...from there the code
> > > > >> should do the rest...when a writer or a writing reader is released
> > > > >> the code waits for all searchers to be released and then clears
> > > > >> the cache of searchers, and new searchers are created when
> > > > >> requested until another writer or writing reader is released...
> > > > >>
> > > > >> Mohammad Norouzi wrote:
> > > > >> > Thank you Mark for your useful help. the code you introduced was
> > > > >> > very helpful for me
> > > > >> >
> > > > >> > but my only question is that I need to place an idle time for
> > > > >> > each open searcher, so if it exceeds the specified time then
> > > > >> > release that searcher and get ready for another thread.
> > > > >> >
> > > > >> > how can I put in such a feature? I was thinking of a timeout
> > > > >> > listener, but dont know where to put it. I have a SingleSearcher
> > > > >> > that wraps lucene's Searcher and it returns a ResultSet in which
> > > > >> > I put a Hits object. do I have to put the time in my ResultSet
> > > > >> > or my SingleSearcher?
> > > > >> >
> > > > >> > still I dont know whether the reader is important for Hits or
> > > > >> > Searcher? consider I passed a hits to my ResultSet, now, if I
> > > > >> > close the searcher, will the Reader get closed?  or another
> > > > >> > vague thing is can a Reader work thread safely for every
> > > > >> > Searcher with different qu
Re: NPE in RAMDirectory after upgrade to 2.1

2007-02-26 Thread jm

Mike,

You were right. As I have many indexes I keep a cache of the
IndexWriters, and in some specific case (that cannot happen in my dev
env) I was closing them without removing them from the cache. Somehow
it was working before 2.1, and upgrading made the error clear.

thanks
javi

On 2/26/07, Michael McCandless <[EMAIL PROTECTED]> wrote:


"jm" <[EMAIL PROTECTED]> wrote:

> I have two processes running in parallel, each one adding and deleting
> to its own set of indexes. Since I upgraded to 2.1 I am getting a NPE
> at RAMDirectory.java line 207 in one of the processes.
>
> Line 207 is:
>   RAMFile existing = (RAMFile)fileMap.get(name);
> the stack trace is:
> java.lang.NullPointerException
> org.apache.lucene.store.RAMDirectory.createOutput(RAMDirectory.java:207)
> org.apache.lucene.index.FieldInfos.write(FieldInfos.java:256)
> org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:75)
> 
org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:706)
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:694)
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:680)

Hmm.  One thing that changed in 2.1 is when a RAMDirectory is closed
it now sets fileMap to null (which it did not pre-2.1).

Is it possible you are accidentally closing a writer but then calling
its addDocument method?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to define a pool for Searcher?

2007-02-26 Thread Chris Lu

Hi, Nicolas,

Just a note: having one searcher is usually enough for ordinary usage,
even on some production sites. But I do see some throughput gain from
increasing the number of searchers.

Those searchers should either be opened on a static index, or be
synchronized to open/close cleanly, to avoid any exceptions.

--
Chris Lu
-
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com


On 2/26/07, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:

On Monday, 26 February 2007 12:38, Mohammad Norouzi wrote:
> No. actually I dont close the searcher. I just set a flag to true or false.
> my considerations are:
> [please note I provided a ResultSet so I display the result page by page
> and dont load all the result in a list]
> 1- should I open a searcher for each user? and one reader for all user
> session?
> 2- or create a pool of searcher and one reader for all searcher?
> 3- or a pool of searcher and each has its own reader. but never close them
> just set a flag to true.
>
> which one?

I am probably missing something here, because I will keep giving the same
answer: have only one instance of an index searcher in your entire web
application.

Nicolas

>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: NPE in RAMDirectory after upgrade to 2.1

2007-02-26 Thread Michael McCandless
"jm" <[EMAIL PROTECTED]> wrote:
> You were right. As I have many indexes I keep a cache of the
> IndexWriters, and in some specific case (that cannot happen in my dev
> env) I was closing them without removing them from the cache. Somehow
> it was working before 2.1, and upgrading made the error clear.

OK, glad you got to the bottom of it!

But, I don't think this error (NPE) is particularly clear (though it
is better than pre-2.1 which, I think, would in fact let the writer
run without holding the write lock).  I will open a Jira issue for
this.  I think we should explicitly detect when the IndexWriter is
used after being closed.
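The check Mike proposes might look like the following plain-Java sketch
(this is not Lucene's actual code): once close() nulls the internal state,
every entry point verifies the instance is still open and fails with a
descriptive exception instead of a puzzling NPE.

```java
import java.util.HashMap;
import java.util.Map;

// Fail-fast guard against use-after-close, mirroring how RAMDirectory's
// fileMap is set to null on close in 2.1: add() raises a clear
// IllegalStateException instead of an NPE from the nulled map.
public class ClosableStore {
    private Map<String, byte[]> files = new HashMap<String, byte[]>();

    public void close() {
        files = null; // like RAMDirectory nulling fileMap
    }

    private void ensureOpen() {
        if (files == null) {
            throw new IllegalStateException("this store is already closed");
        }
    }

    public void add(String name, byte[] data) {
        ensureOpen(); // explicit detection of use after close
        files.put(name, data);
    }
}
```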

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to define a pool for Searcher?

2007-02-26 Thread Nicolas Lalevée
On Monday, 26 February 2007 20:12, Chris Lu wrote:
> Hi, Nicolas,
>
> Just a note: Having one searcher is far enough for ordinary usage,
> even some production sites. But I do see some throughput gain through
> increasing the number of searchers.
>
> Those searchers should either be opened on an static index, or be
> synchronized to open/close cleanly, to avoid any exceptions.

ok, I have found what I missed. I looked at the code in SegmentReader, and
there are some synchronized methods; hence the performance improvement you
are talking about. Too bad, because it only comes from the possibility of
deleting documents, which seems useless for the read-only-index use case.
Nicolas

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: filtering by first letter

2007-02-26 Thread Chris Hostetter
: OK I'm not sure I understand your answer.  I thought TermEnum gave you
: all the terms in an index, not from a search result.
:
: Let me clarify what I need.  I'm looking for a way to find out all the
: values of the FIELD_FILTER_LETTER field for any given search.
:
: INDEX TIME:   (done for each indexed person, stores the first letter of
: their name as a field)

what you are describing is basically a faceted search problem (see list
archive for copious discussion)

step #1: get a list of all possible "first letters" (that's where a
TermEnum comes into play ... iterate over all the terms for that field)

step #2: for each first letter, get the BitSet of documents corresponding
to that letter with a filter (this can be cached for the life of your
IndexReader) and intersect it with a BitSet from a Filter made from your
user's search criteria

...there are other approaches to doing faceted searching that involve
iterating over the results to get the list of possible values -- but i'm
guessing the set of letters a name might start with is typically going to
be smaller than the set of results from a search ... so this approach is
probably a safe bet.
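The two steps above can be sketched with plain java.util.BitSet (the Lucene
TermEnum/Filter plumbing is omitted): one cached bit set of documents per
first letter, intersected with the bit set of the user's search results.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// For each candidate first letter, intersect its cached document BitSet
// with the documents matching the user's query, and report how many
// results start with that letter.
public class FirstLetterFacets {
    public static Map<Character, Integer> facetCounts(
            Map<Character, BitSet> letterFilters, BitSet searchResults) {
        Map<Character, Integer> counts = new TreeMap<Character, Integer>();
        for (Map.Entry<Character, BitSet> e : letterFilters.entrySet()) {
            BitSet both = (BitSet) e.getValue().clone();
            both.and(searchResults); // docs matching letter AND query
            if (both.cardinality() > 0) {
                counts.put(e.getKey(), both.cardinality());
            }
        }
        return counts;
    }
}
```

The per-letter bit sets are computed once per IndexReader; only the
intersection runs per query.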




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser oddity with PrefixQuery

2007-02-26 Thread Doron Cohen
I agree, it might be a redundant check now. This test was added when the
query parser was enhanced to optionally allow a leading wild card (revision
468291), but this case calls getWildCardQuery(), not getPrefixQuery().

Still, the check seems harmless - sort of defensive - protecting against
the case that a bug in the query parser would identify such a query as
prefix rather than wildcard (although this is not very likely.)

Doron

Antony Bowesman wrote:

> In certain cases, I use a modified QueryParser which does not allow
> "field:" syntax.  While testing variants of Prefix and Wildcard Query, I
> came across an oddity with getPrefixQuery().  The standard
> getPrefixQuery() (2.1) is given termStr without the trailing *, so the
> check
>
> if (!allowLeadingWildcard && termStr.startsWith("*"))
>     throw new ParseException("'*' not allowed as first character in
> PrefixQuery");
>
> only triggers if you parse the String "**", in which case the PrefixQuery
> is then created with Term(field, "*"); whereas if you parse "*",
> getWildcardQuery() is called.
>
> So it's not obvious that the test "termStr.startsWith("*")" can ever
> occur, so this code seems redundant.
>
> After Doron's latest patch
> http://issues.apache.org/jira/browse/LUCENE-813, if you parse "**" it
> now calls getWildcardQuery().
>
> Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Date Searches

2007-02-26 Thread Chris Hostetter

: Date: Mon, 26 Feb 2007 15:17:15 -

: Anybody?

: > Sent:   26 February 2007 13:36


the java-user list is good -- but amazingly enough it's not unheard of for
two *whole* hours to go by without getting a direct response to a
question -- particularly when the last question/answer posted to the list
prior to your question was directly on topic: searching by dates, and
how to format the dates so that they would be in lexicographic order...

http://www.nabble.com/how-to-query-range-of-Date-by-given-date-string--tf3291956.html

...patience is a must when asking for help from others.

in general, if you'd searched the list archive before posting, you
probably would have had your answer immediately.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can I use Lucene to retrieve a list of duplicates

2007-02-26 Thread Chris Hostetter

what you are doing below is iterating over every term in your index, and
for each Term, recording if that term appears in more than one doc (using
IndexReader.document, which is a really bad idea in general in a loop like
this)

your original problem description was "'Find Duplicate records for
Column X' then I would filter the table to show only records where there
is more than one with the same value i.e duplicate for that column." ...

if we equate columns with fields, then what you are doing is certainly
overkill -- you can request a TermEnum that starts at a particular field
(or use the full TermEnum and "skipTo" the first possible term in your
field) and then stop iterating when you get to a new field -- you can also
get a lot better performance when compiling the list of ROW_NUMBER field
values if you use a FieldCache instead of fetching each document.
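The field-restricted enumeration described above can be sketched in
self-contained Java. `TermInfo` is a stand-in for Lucene's Term plus its
docFreq (with a real index you would seed the enumeration via something like
`IndexReader.terms(new Term(field, ""))` -- an assumption to check against
your Lucene version), so only the loop logic is shown:

```java
import java.util.ArrayList;
import java.util.List;

// Walk an index-ordered list of (field, text, docFreq) triples -- the order
// a TermEnum yields -- and collect the terms of one field that occur in
// more than one document, stopping as soon as the enumeration leaves that
// field instead of scanning every term in the index.
public class FieldDuplicates {
    public static class TermInfo {
        public final String field;
        public final String text;
        public final int docFreq;

        public TermInfo(String field, String text, int docFreq) {
            this.field = field;
            this.text = text;
            this.docFreq = docFreq;
        }
    }

    public static List<String> duplicateTerms(List<TermInfo> terms, String field) {
        List<String> dups = new ArrayList<String>();
        for (TermInfo t : terms) {
            int cmp = t.field.compareTo(field);
            if (cmp < 0) continue;  // haven't reached the field yet
            if (cmp > 0) break;     // past the field: stop iterating
            if (t.docFreq > 1) {
                dups.add(t.text);   // term used by at least two documents
            }
        }
        return dups;
    }
}
```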


: Date: Mon, 26 Feb 2007 16:25:11 +
: From: Paul Taylor <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
: To: Erick Erickson <[EMAIL PROTECTED]>
: Cc: java-user@lucene.apache.org
: Subject: Re: Can I use Lucene to retrieve a list of duplicates
:
: Hi
:
: I  got it working before I saw your latest mail, the only problem is
: that it doesn't look very efficient. This is my duplicate method, the
: problem is that I have to enumerate through *every* term. This was worse
: before because I was only interested
: in terms that matched a particular field (column) but had enumerate
: through every term whatever field it was part of, so I recreated my
: index so that each document only contained a row number field, and a
: second field for the value of the column, however this means I am going
: to end up with a number of different indexes each solving a particular
: problem.
:
: paul
:
:  public List getDuplicates()
: {
: List matches = new ArrayList();
: try
: {
: IndexReader ir = IndexReader.open(directory);
: TermEnum terms = ir.terms();
: while (terms.next())
: {
: if (terms.docFreq() > 1)
: {
: TermDocs termDocs = ir.termDocs(terms.term());
: while (termDocs.next())
: {
: Document d = ir.document(termDocs.doc());
: matches.add(new
: Integer(d.getField(ROW_NUMBER).stringValue()));
: }
: }
: }
:
: }
: catch (IOException ioe)
: {
: ioe.printStackTrace();
: }
: return matches;
: }
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Indexing-Error: Cannot delete

2007-02-26 Thread robisbob

Hi all,

I hope someone can help me. When I index a file directory I get the error
you see below.

 caught a class java.io.IOException
 with message: Cannot delete _57e.tis
Exception in thread "main" java.io.IOException: Cannot delete _57e.tis
at 
org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
at 
org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)


So Lucene tries to delete this file, which exists and can be deleted
manually, but somehow Lucene can't.
The weird thing is that if I index the same directory (same content, not the
same network) on my local machine, there is no such error.
I have searched the Internet for this kind of error and found only advice
about closing the FileWriter, which I do.

I use lucene-1.5-rc1.

Thanx
Rob




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing-Error: Cannot delete

2007-02-26 Thread Michael McCandless
"robisbob" <[EMAIL PROTECTED]> wrote:

> i hope someone can help me. If I index a file directory I get the error 
> you see here.
> >  caught a class java.io.IOException
> >  with message: Cannot delete _57e.tis
> > Exception in thread "main" java.io.IOException: Cannot delete _57e.tis
> > at 
> > org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
> > at 
> > org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
> > at 
> > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
> > at 
> > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
> > at 
> > org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
> > at 
> > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
> 
> So Lucene tries to delete this file, which exists and can be deleted 
> manually. But somehow Lucene can't.
> The weird thing is, if I index the same directory (same content, not 
> same network) on my local machine, there is no such error.
> I have searched the Internet for these kind of error and found only 
> advises about closing the FileWriter, which I do.
> I use lucene-1.5-rc1.

There are several sneaky IO-related issues in Lucene, now fixed in Lucene
2.1, that could explain this.  Is it possible to test Lucene 2.1 to see if
this issue is still happening?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-26 Thread Steven Parkes
The easiest way to pin this down is to get the backtrace from the
exception, e.g., e.printStackTrace(). That would tell a lot.

That said, prior to 2.1, lucene would put lock files outside the index
directory. I don't know if that's what you're hitting, though, because I
think the writer should have taken the lock when you created it, not
when you tried to do the addDocument.
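As a side note on the pre-2.1 lock location: if the lock directory is the
problem, it can be redirected. To the best of my recollection the location
is controlled by the `org.apache.lucene.lockDir` system property
(defaulting to `java.io.tmpdir`) -- treat the property name and default as
assumptions to verify against the Lucene version in use.

```java
// Point pre-2.1 Lucene's lock files at a directory the process can write
// to.  The property name is an assumption to verify; it must be set before
// any FSDirectory is first opened.
public class LockDirConfig {
    public static void configure(String writableDir) {
        System.setProperty("org.apache.lucene.lockDir", writableDir);
    }
}
```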

-Original Message-
From: Ridzwan Aminuddin [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 8:48 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run
on OS requiring permissions

Hi Guys.

Ok, thanks for the replies. You guys are right that it has to do with the
system and not with Lucene. However, what I'm trying to do is pinpoint the
exact place that causes the system to fail, and then from there try to
remedy the problem.

The odd thing is that the program is still able to write other files to
the subdirectories that the program itself creates. It is only when it goes
through this indexing process that the program halts due to insufficient
permissions, even though the directory I provided to store the target index
files has been given read/write/execute (drwxrwxrwx) permissions for
everyone.

In any case, I suspect that it is due to this portion of code.


:   FieldsWriter(Directory d, String segment, FieldInfos fn)
:throws IOException {
: fieldInfos = fn;
: fieldsStream = d.createFile(segment + ".fdt");
: indexStream = d.createFile(segment + ".fdx");
:   }


Are these two files created in the Directory d?
And is this Directory d the ramDirectory that I provided when I called
the method dw = new DocumentWriter(ramDirectory, analyzer, similarity,
maxFieldLength);
in IndexWriter.java?

If yes, then where exactly does this ramDirectory point to? As far as I
can see from the code, ramDirectory is initialised as

RAMDirectory ramDirectory = new RAMDirectory()

I think the program somehow fails when it tries to create these two
files.

Please help to enlighten me... Maybe knowing the exact path to this
ramDirectory would help me find out which folder I need to provide
access to.

Also, does Lucene ever write any temp data to /tmp ?



> -Original Message-
> From: [EMAIL PROTECTED]
> Sent: Thu, 22 Feb 2007 11:43:38 -0800 (PST)
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when
run
> on OS requiring permissions
> 
> 
> This sounds like it has absolutely nothing to do with Lucene, and
> everything to do with good security permissions -- your Zope/python
> front end is most likely running as a user that does not have write
> permissions to the directory where your index lives.  You'll need to
> remedy that.
> 
> You can write a simple java app that doesn't use Lucene at all -- just
> creates a file and writes "hello world" to it -- and you will most
> likely see this exact same behavior; dealing with the file permissions
> is totally outside the scope of Lucene.
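A sketch of that no-Lucene permission test (run it as the same user the web server runs as, and pass your index directory as the argument):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class PermissionTest {
    // Try to create a file, write "hello world" into it, and remove it;
    // returns true only if the whole round trip succeeds.
    static boolean probe(File dir) {
        File f = new File(dir, "lucene-permission-probe.txt");
        try {
            FileWriter out = new FileWriter(f);
            out.write("hello world");
            out.close();
            return f.delete();
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        // Point this at the directory your index lives in.
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(dir.getAbsolutePath() + " writable: " + probe(dir));
    }
}
```

If this fails under Zope/Plone but works from the command line, the problem is purely the effective user's permissions, as suggested above.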
> 
> 
> : Date: Thu, 22 Feb 2007 00:20:12 -0800
> : From: Ridzwan Aminuddin <[EMAIL PROTECTED]>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run
on
> OS
> :  requiring permissions
> :
> : Hi!
> :
> : I'm writing a Java program that uses Lucene 1.4.3 to index and
> : create a vector file of words found in text files. The purpose is
> : for text mining.
> :
> : I created a Java .jar file from my program, and my Python script
> : calls the Java .jar executable. This is all triggered by my DTML
> : code.
> :
> : I'm running on Linux and I have no problem executing the script
> : when I execute it via the command line. But once I trigger the
> : script via the web (using Zope/Plone external methods) it doesn't
> : work anymore. This is because of the strict permissions that Linux
> : has over its files and folders.
> :
> : I've narrowed down the problem to the IndexWriter.addDocument(doc)
> : method in Lucene 1.4.3, and as you can see below my code fails
> : specifically when a new FieldsWriter object is being initialised.
> :
> : I strongly suspect that it fails at this point but have no idea how
> : to overcome this problem. I know that it has to do with the
> : permissions because the program works like a miracle when it is
> : called via the command line by the super user (sudo).
> :
> : Could anyone give me any pointers or ideas on how I could overcome
> : this?
> :
> : The final statement which is printed before the program hangs is:
> : "Entering DocumentWriter.AddDocument (4)"
> :
> : Here is the portions of my relevant code :
> :
> :
> :
> :
> : //----------------------------------------------------------------
> : //  Indexer.Java // This is my own method and class
> : //----------------------------------------------------------------
> : // continued

Re: ClassCastException/DocumentWriter and NullPointerException/RAMInputStream

2007-02-26 Thread Antony Bowesman
Looks like this was caused by a corrupt Java installation.  I was half expecting 
to see a comment in the code that said


// Impossible event occurred

Antony


Antony Bowesman wrote:

When adding documents to an index has anyone seen either

java.lang.ClassCastException: org.apache.lucene.analysis.Token cannot be 
cast to org.apache.lucene.index.Posting
  at 
org.apache.lucene.index.DocumentWriter.sortPostingTable(DocumentWriter.java:238) 

  at 
org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:96)

  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:476)

or

java.lang.NullPointerException
  at org.apache.lucene.store.RAMInputStream.<init>(RAMInputStream.java:32)
  at org.apache.lucene.store.RAMDirectory.openInput(RAMDirectory.java:171)
  at 
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:155)

  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
  at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
  at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:702)
  at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686)
  at 
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:656)

  at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:402)

This data has been indexed many times, but I've never seen this before.  
It's Java 6 and Lucene 2.0.


Thanks
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing-Error: Cannot delete

2007-02-26 Thread robisbob

Thanks for your answer.
I will use the latest version to check this. Unfortunately I only have
access to the computer where the application will run once a week, and
I can't reproduce the error on my local machine or any other computer
I have access to.
So if someone has had the same error, now is the time to go public!


Rob


  
I hope someone can help me. If I index a file directory I get the
error you see here.


 caught a class java.io.IOException
 with message: Cannot delete _57e.tis
Exception in thread "main" java.io.IOException: Cannot delete _57e.tis
at 
org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
at 
org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
  
So Lucene tries to delete this file, which exists and can be deleted 
manually. But somehow Lucene can't.
The weird thing is, if I index the same directory (same content, not 
same network) on my local machine, there is no such error.
I have searched the Internet for this kind of error and found only
advice about closing the FileWriter, which I do.

I use lucene-1.5-rc1.



There are several sneaky IO-related issues in Lucene, now fixed in
Lucene 2.1, that could explain this.  Is it possible to test Lucene 2.1
to see if this issue is still happening?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
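One quick diagnostic for the machine where this does happen: try the same delete from plain Java, running as the same user as the indexing application (the segment file name below is just the one from the stack trace):

```java
import java.io.File;

public class DeleteCheck {
    // Report whether this JVM's user can delete the given file.
    static boolean tryDelete(File f) {
        return f.exists() && f.delete();
    }

    public static void main(String[] args) {
        // args[0] = your index directory; "_57e.tis" is the file named
        // in the "Cannot delete" exception
        File dir = new File(args.length > 0 ? args[0] : ".");
        File f = new File(dir, "_57e.tis");
        System.out.println(f + " deleted: " + tryDelete(f));
    }
}
```

If plain Java can delete the file but Lucene can't, that points at something like another process holding it open rather than permissions.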


  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Benchmarker

2007-02-26 Thread Doron Cohen
Hi Karl,

Seems I missed this email...
What is the status of this, have you solved it?

Doron

karl wettin <[EMAIL PROTECTED]> wrote on 13/02/2007 03:24:44:

>
> 13 feb 2007 kl. 04.33 skrev Doron Cohen:
>
> >> Running (once) "ant jar" from the trunk directory should do it.
> >
> > Did it solve the problem?
>
> Indeed.
>
> Do you have way too much time to spare? I patched up the code to run
> on the index in LUCENE-550. Unfortunately it seems as if I messed
> something up or missed something I have to do prior to running. ant
> run-standard passes in about 4 seconds. I get lots of output with
> info that clearly tells me that the Reuters set was loaded, but 4
> seconds? Is my new laptop really that fast?
>
> What I did was very simple. I replaced the Directory in TestData (and
> that second similar class that had a reference to it) with my Index
> interface that contains writer and reader factory methods. Then some
> minor things like going from IndexWriter to IndexWriterInterface in
> method parameters.
>
> Would you mind taking a look at my patch of the benchmarker to see if
> it actually works as it should or not? Let me know and I'll post a
> new trunk.diff.bz2 to 550 containing the changes.
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: updating index

2007-02-26 Thread Doron Cohen
"Erick Erickson" <[EMAIL PROTECTED]> wrote on 25/02/2007 07:05:21:

> Yes, I'm pretty sure you have to index the field (UN_TOKENIZED) to be
able
> to fetch it with TermDocs/TermEnum! The loop I posted works like this

Once the database_id field is indexed this way, the newly added
IndexWriter.updateDocument() API may also be useful.
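A sketch of how that could look with Lucene 2.1 (the index path and field values are made up; this assumes database_id is indexed UN_TOKENIZED as discussed):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateRow {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        // the key field: not tokenized, so Term("database_id", "42")
        // matches the value exactly
        doc.add(new Field("database_id", "42",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "updated row text",
                Field.Store.NO, Field.Index.TOKENIZED));

        IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        // deletes any document(s) matching the term, then adds the new one
        writer.updateDocument(new Term("database_id", "42"), doc);
        writer.close();
    }
}
```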


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]