Re: QueryParser Behavior and Token.setPositionIncrement

2004-04-27 Thread Erik Hatcher
On Apr 26, 2004, at 5:16 PM, Norton, James wrote:
Thanks for the reply.  I had reached the same conclusion as you 
regarding the analyzer for
queries (no multiple tokens per position), but I would still regard 
the behaviour of
QueryParser as incorrect.
I agree that it is odd, but given that PhraseQuery doesn't support 
token positions either, what would be the correct behavior of 
QueryParser?

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: need info for database based Lucene but not flat file

2004-04-27 Thread Erik Hatcher
There is a Berkeley DB implementation of Lucene's Directory in the 
jakarta-lucene-sandbox repository.

	Erik

On Apr 26, 2004, at 8:35 PM, Yukun Song wrote:

As is known, Lucene currently uses flat files to store information for
indexing.
Does anyone have ideas or resources for combining a database (like MySQL or
PostgreSQL) with Lucene instead of the current flat index file formats?
Regards,

Yukun Song





RE: need info for database based Lucene but not flat file

2004-04-27 Thread Cocula Remi
Because Lucene implements its own concept of a document, it is not tied to any 
particular type of data source.
It is up to you to write a tool that reads your database and submits 
the data as Lucene documents to the Lucene indexer.

For example, if your database contains a customer entity and you want to index all 
the information about these customers, you can write a module that runs a select 
on the customer table and, for each row returned, creates a Lucene Document and adds 
it to the IndexWriter.
It is recommended that each Lucene Document contain a keyword Field that holds 
the unique id of the customer in the database.

As a first step, you should be familiar with the concepts of Document and Field. See 
the Lucene short intro documentation.


-Original Message-
From: Yukun Song [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 02:35
To: [EMAIL PROTECTED]
Subject: need info for database based Lucene but not flat file


As is known, Lucene currently uses flat files to store information for
indexing.

Does anyone have ideas or resources for combining a database (like MySQL or
PostgreSQL) with Lucene instead of the current flat index file formats?

Regards,

Yukun Song






Re: what web crawler work best with Lucene?

2004-04-27 Thread Michael Wechner
Tuan Jean Tee wrote:

Has anyone used an open source web crawler with Lucene? I have
a dynamic website and am looking at adding a search tool. Your
advice is very much appreciated.
 

there is a crawler included within Apache Lenya 
http://cocoon.apache.org/lenya/

src/java/org/apache/lenya/search/crawler/*

or you might try LARM

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

HTH

Michi


Thank you.

 



--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]


sorting by date (XML)

2004-04-27 Thread Michael Wechner
my XML files contain something like

<date>
 <year>2004</year><month>04</month><day>27</day>...
</date>
and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like
a millisecond field and then sort by this, correct?
Has anyone done something like this yet?

Thanks

Michi

--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]


searching only part of an index

2004-04-27 Thread Alan Smith
Hi

I wondered if anyone knows whether it is possible to search ONLY the 100 (or 
whatever) most recently added documents to a lucene index? I know that once 
I have all my results ordered by ID number in Hits I could then just display 
the required amount, but I wondered if there is a way to avoid searching all 
documents in the index in the first place?

Many thanks

Alan



RE: sorting by date (XML)

2004-04-27 Thread Nader S. Henein

Here's my two cents on this:
Either way you will need to combine the date into one field, but if you use a
millisecond representation you will not be able to use the FLOAT sort type
and you'll have to use STRING sort (slower) because the millisecond
representation is longer than FLOAT allows, so you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing then you can
just sort by DOC type (which is the DOC ID) and save yourself the pain
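
As a rough sketch of the first option, the year, month, and day values can be zero-padded into a single key whose string order matches date order. The class and method names are made up for illustration, and a YYYYMMDD layout is assumed. One caveat that seems worth checking before settling on a float-based sort: a Java float has only 24 significant bits, so eight-digit values such as 20040427 and 20040428 can collapse to the same float. Sorting the key as a string, or as an integer if the Lucene version in use offers that, would avoid the collision.

```java
import java.util.Arrays;
import java.util.Locale;

class DateKeySketch {
    // Hypothetical helper: zero-pad year, month, and day into an 8-character key.
    static String toDateKey(int year, int month, int day) {
        return String.format(Locale.ROOT, "%04d%02d%02d", year, month, day);
    }

    // True when two keys become indistinguishable after parsing them as floats.
    static boolean collideAsFloat(String a, String b) {
        return Float.parseFloat(a) == Float.parseFloat(b);
    }

    public static void main(String[] args) {
        String[] keys = {
            toDateKey(2004, 4, 28),
            toDateKey(2003, 12, 31),
            toDateKey(2004, 4, 27)
        };
        // Plain string sorting puts the keys in chronological order.
        Arrays.sort(keys);
        System.out.println(Arrays.toString(keys));
        // Adjacent days can map to the same float value.
        System.out.println("adjacent days equal as float: "
                + collideAsFloat(keys[1], keys[2]));
    }
}
```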

Hope this helps.

Nader Henein

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)


my XML files contain something like

<date>
  <year>2004</year><month>04</month><day>27</day>...
</date>

and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like a
millisecond field and then sort by this, correct?

Has anyone done something like this yet?

Thanks

Michi

-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]





RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
You may be able to jimmy the bi filter to produce the most recent 100, but
really keeping your fetch count at 100 and ordering by DOC should be
sufficient.

-Original Message-
From: Alan Smith [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 4:03 PM
To: [EMAIL PROTECTED]
Subject: searching only part of an index


Hi

I wondered if anyone knows whether it is possible to search ONLY the 100 (or
whatever) most recently added documents to a lucene index? I know that once 
I have all my results ordered by ID number in Hits I could then just display
the required amount, but I wondered if there is a way to avoid searching all
documents in the index in the first place?

Many thanks

Alan




Re: searching only part of an index

2004-04-27 Thread Ioan Miftode


If you know the id of the last document in the index
(I don't know the best way to get it),
you could probably use a range query,
something like: find all docs with the id in [lastId-100 TO lastId].
You should make sure that the lower limit is non-negative, though.
Just a thought.

ioan

At 08:02 AM 4/27/2004, you wrote:
Hi

I wondered if anyone knows whether it is possible to search ONLY the 100 
(or whatever) most recently added documents to a lucene index? I know that 
once I have all my results ordered by ID number in Hits I could then just 
display the required amount, but I wondered if there is a way to avoid 
searching all documents in the index in the first place?

Many thanks

Alan



Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Nader S. Henein wrote:

Here's my two cents on this:
Either way you will need to combine the date into one field, but if you use a
millisecond representation you will not be able to use the FLOAT sort type
and you'll have to use STRING sort (slower) because the millisecond
representation is longer than FLOAT allows, so you have three options:
1) Use YYYYMMDD and sort by FLOAT type

OK, I guess I will then take the FLOAT type.

2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing then you can
just sort by DOC type (which is the DOC ID) and save yourself the pain
 

unfortunately this isn't possible.

Thanks a lot for your help

Michi

Hope this helps.

Nader Henein

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)

my XML files contain something like

<date>
 <year>2004</year><month>04</month><day>27</day>...
</date>
and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like a
millisecond field and then sort by this, correct?
Has anyone done something like this yet?

Thanks

Michi

 



--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]


RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
Are the DOC ids sequential? Or just unique and ascending, I'm thinking like
a good little Oracle boy, so does anyone know?

-Original Message-
From: Ioan Miftode [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 4:55 PM
To: Lucene Users List
Subject: Re: searching only part of an index




If you know the id of the last document in the index.
(I don't know what's the best way to get it)
you could probably use a range query.
something like find all docs with the id in [lastId-100 TO lastID]. maybe
you should make sure that the first limit is non negative, though.

just a thought

ioan

At 08:02 AM 4/27/2004, you wrote:
Hi

I wondered if anyone knows whether it is possible to search ONLY the 
100
(or whatever) most recently added documents to a lucene index? I know that 
once I have all my results ordered by ID number in Hits I could then just 
display the required amount, but I wondered if there is a way to avoid 
searching all documents in the index in the first place?

Many thanks

Alan




Re: searching only part of an index

2004-04-27 Thread Terry Steichen
I think that if you include the indexing timestamp in the Document you
create when indexing, you could sort on this and only pick the first 100.

Regards,

Terry
- Original Message - 
From: Alan Smith [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 8:02 AM
Subject: searching only part of an index


 Hi

 I wondered if anyone knows whether it is possible to search ONLY the 100 (or
 whatever) most recently added documents to a lucene index? I know that once
 I have all my results ordered by ID number in Hits I could then just display
 the required amount, but I wondered if there is a way to avoid searching all
 documents in the index in the first place?

 Many thanks

 Alan




Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote:
Are the DOC ids sequential? Or just unique and ascending, I'm thinking like
a good little Oracle boy, so does anyone know?

They are unique and ascending.

Gaps in ids exist when documents are removed, and the ids are then 
squeezed back to completely sequential, with no holes, during an 
optimize.

	Erik



RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
So if Alan wants to limit it to the first 100 he can't really use a range
search unless he can guarantee that the index is optimized after deletes,
but then if his deletion rounds are anything like mine (every 2 mins) then
optimizing after each delete will make searching the index really slow.
Right?

Nader

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 27, 2004 5:15 PM
To: Lucene Users List
Subject: Re: searching only part of an index


On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote:
 Are the DOC ids sequential? Or just unique and ascending, I'm thinking
 like
 a good little Oracle boy, so does anyone know?

They are unique and ascending.

Gaps in id's exist when documents are removed, and then the id's are 
squeezed back to completely sequential with no holes during an 
optimize.

Erik





Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote:
So if Alan wants to limit it to the first 100 he can't really use a 
range
search unless he can guarantee that the index is optimized after 
deletes,
but then if his deletion rounds are anything like mine ( every 2 mins) 
then
optimizing it at each delete will make searching the index really slow.
Right?
Well, if you know how many you've deleted, then a range would work :)  
(number of docs in index minus 100 minus number deleted = starting 
range for doc id)



BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery
Hello,

I am using Lucene 1.3 and I ran into the following exception:

java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.
at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98)

Is there any easy way to fix/adjust this (like the
BooleanQuery.maxClauseCount, for example)?
Strangely, I couldn't find mention of the BooleanScorer class in my javadoc.


Thank you for any tips.

Tate

p.s.  Yes, I am intentionally generating some rather long boolean queries.
:)





Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 10:24 AM, Erik Hatcher wrote:
On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote:
So if Alan wants to limit it to the first 100 he can't really use a 
range
search unless he can guarantee that the index is optimized after 
deletes,
but then if his deletion rounds are anything like mine ( every 2 
mins) then
optimizing it at each delete will make searching the index really 
slow.
Right?
Well, if you know how many you've deleted, then a range would work :)  
(number of docs in index minus 100 minus number deleted = starting 
range for doc id)
On second thought - this is incorrect - my apologies.  To be clever, 
you'd have to know which positions the deleted documents were in and 
account for them accordingly.

	Erik
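
The correction above can be pictured as a backwards walk over the doc id space that skips deleted slots. The sketch below is only an illustration using a plain java.util.BitSet for the deleted ids; the class and method names are hypothetical. In Lucene the inputs would presumably come from IndexReader.maxDoc() and IndexReader.isDeleted(int), and the result would only stay valid until the next optimize renumbers the ids.

```java
import java.util.BitSet;

class RecentDocsSketch {
    /**
     * Returns the smallest doc id among the `wanted` most recent live docs.
     * `maxDoc` is one greater than the highest id; `deleted` marks removed ids.
     * If fewer than `wanted` live docs exist, the smallest live id is returned;
     * if there are none, `maxDoc` is returned (an empty range).
     */
    static int firstIdOfMostRecent(int maxDoc, BitSet deleted, int wanted) {
        int live = 0;
        int first = maxDoc;
        for (int id = maxDoc - 1; id >= 0 && live < wanted; id--) {
            if (!deleted.get(id)) {
                live++;
                first = id;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(7);
        deleted.set(9);
        // Ids 0..9 with 7 and 9 deleted: the three most recent live docs are 8, 6, 5.
        System.out.println(firstIdOfMostRecent(10, deleted, 3));
    }
}
```

Simply subtracting a count of deletions from the total would give a different starting id here, which is why the positions of the deleted documents matter.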



Lucene and MS SQL

2004-04-27 Thread hgadm
Dear all,

Has anyone had experience using Lucene with data stored
in MS SQL Server 2000?

How do indexing and searching work in that case?

Thanks,

Holger





Re: phrase search AND term

2004-04-27 Thread Erik Hatcher
Can you provide a simple test case that shows this problem?

Did you reindex when upgrading?

On Apr 27, 2004, at 11:31 AM, Ioan Miftode wrote:



I recently upgraded to Lucene 1.4 RC2 because I needed some
sorting capabilities. However, some phrase searches don't
work anymore (the hits don't even contain the terms I'm searching on).
They were fine when using 1.3 final.
I noticed it happens when I combine
a phrase search with a simple term like this:
field1:"some phrase search" AND field2:term

Has anyone experienced anything similar?
Any thoughts?
thanks

ioan



Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Yukun Song wrote:
As is known, Lucene currently uses flat files to store information for
indexing.

Does anyone have ideas or resources for combining a database (like MySQL or
PostgreSQL) with Lucene instead of the current flat index file formats?

A few folks have implemented an SQL-based Lucene Directory, but none has 
yet been contributed to Lucene.  Hopefully one will be soon.

For some discussion of this, see messages on SQLDirectory in the mail 
archives:

http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-user%40jakarta.apache.org&searchText=SQLDirectory&defaultField=subject&Search=Search

Doug



Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Beware of storing timestamps (DateFields, I guess) in Lucene, if you
intend to use range queries (xxx TO yyy).

Otis

--- Michael Wechner [EMAIL PROTECTED] wrote:
 my XML files contain something like
 
 <date>
   <year>2004</year><month>04</month><day>27</day>...
 </date>
 
 and I would like to sort by this date.
 
 So I guess I need to modify the Documentparser and generate something
 like
 a millisecond field and then sort by this, correct?
 
 Has anyone done something like this yet?
 
 Thanks
 
 Michi
 
 -- 
 Michael Wechner
 Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
 http://www.wyona.com  http://cocoon.apache.org/lenya/
 [EMAIL PROTECTED][EMAIL PROTECTED]
 
 



Re-associate a token with its source

2004-04-27 Thread Olaia Vázquez Sánchez
Hello

 

I have XML documents in which each word has 4 positions (top,
bottom, left and right) that would let me highlight the word in a JPG
image. I want to index these XML documents and highlight the results of
the queries in the image, so I need to store these positions for each word
inside the index.

 

I was looking into how I can use the Token fields to store these
attributes, but I didn't find any example where these fields were used.

 

Thanks,

 

Olaia Vázquez



Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote:

Beware of storing timestamps (DateFields, I guess) in Lucene, if you
intend to use range queries (xxx TO yyy).
Why?

We have attributes that contain ISO 8601 date strings, and when indexing:

    Date date = isoConv.parse(value, new ParsePosition(0));
    String dateString = DateField.dateToString(date);
    doc.add(Field.Keyword(name, dateString));

then when searching:

    String from = DateField.timeToString(searchFromDate);
    String to = DateField.timeToString(searchToDate);
    RangeQuery rq = new RangeQuery(new Term(searchKey, from),
                                   new Term(searchKey, to), true);

Is this not correct?

bst,
-Rob

Otis

--- Michael Wechner [EMAIL PROTECTED] wrote:

my XML files contain something like

<date>
 <year>2004</year><month>04</month><day>27</day>...
</date>
and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something
like
a millisecond field and then sort by this, correct?
Has anyone done something like this yet?

Thanks

Michi

--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]


Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Because small time units like milliseconds cause a range query to
expand into a large number of boolean clauses (one per unique term in
the range) if you have a lot of documents with unique timestamps.
Rounding the timestamp to minutes, hours, or days can drastically
reduce the number of unique timestamps, and hence the number of clauses.

Otis
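
To make the effect of rounding concrete, here is a small sketch that counts how many distinct values a set of timestamps produces before and after truncating them to the day. It uses only the JDK; the class name, the sample data, and the choice of UTC day boundaries are assumptions for illustration. Each distinct value would correspond to one indexed term that a range query might have to expand.

```java
import java.util.HashSet;
import java.util.Set;

class TimestampRoundingSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Truncate a millisecond timestamp to the start of its UTC day.
    static long roundToDay(long millis) {
        return Math.floorDiv(millis, MILLIS_PER_DAY) * MILLIS_PER_DAY;
    }

    // Count distinct values, optionally after rounding each one to the day.
    static int distinctTerms(long[] timestamps, boolean rounded) {
        Set<Long> seen = new HashSet<>();
        for (long t : timestamps) {
            seen.add(rounded ? roundToDay(t) : t);
        }
        return seen.size();
    }

    // Hypothetical sample data: `count` timestamps spaced `stepMillis` apart.
    static long[] sample(long start, long stepMillis, int count) {
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            out[i] = start + i * stepMillis;
        }
        return out;
    }

    public static void main(String[] args) {
        long start = 1083024000000L; // 2004-04-27T00:00:00Z
        long[] stamps = sample(start, 250_000L, 1000);
        System.out.println("unique millisecond terms: " + distinctTerms(stamps, false));
        System.out.println("unique day terms: " + distinctTerms(stamps, true));
    }
}
```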

--- Robert Koberg [EMAIL PROTECTED] wrote:
 Otis Gospodnetic wrote:
 
  Beware of storing timestamps (DateFields, I guess) in Lucene, if
 you
  intend to use range queries (xxx TO yyy).
 
 Why?
 
 We have attributes that contain iso8601 date strings and when
 indexing:
 
 Date date = isoConv.parse(value, new ParsePosition(0));
 String dateString = DateField.dateToString(date);
 doc.add(Field.Keyword(name, dateString));
 
 then when searching:
 
 String from = DateField.timeToString(searchFromDate);
 String to = DateField.timeToString(searchToDate);
 RangeQuery rq = new RangeQuery(new Term(searchKey, from),
 new Term(searchKey, to), true);
 
 Is this not correct?
 
 bst,
 -Rob
 
 
  
  Otis
  
  --- Michael Wechner [EMAIL PROTECTED] wrote:
  
 my XML files contain something like
 
 <date>
   <year>2004</year><month>04</month><day>27</day>...
 </date>
 
 and I would like to sort by this date.
 
 So I guess I need to modify the Documentparser and generate
 something
 like
 a millisecond field and then sort by this, correct?
 
 Has anyone done something like this yet?
 
 Thanks
 
 Michi
 
 -- 
 Michael Wechner
 Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
 http://www.wyona.com  http://cocoon.apache.org/lenya/
 [EMAIL PROTECTED][EMAIL PROTECTED]
 
 




Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote:

Because having small time units like milliseconds will result in Range
query expanding to a large number of BooleanQueries, if you have a lot
of documents with unique time stamps.  Rounding the timestamp to
minutes, hours, or days, can drastically reduce the number of unique
time stamps, hence resulting in less BooleanQueries.
Cool, thanks. So DateField.dateToString is the best, most efficient way, 
correct?

Otis

--- Robert Koberg [EMAIL PROTECTED] wrote:

Otis Gospodnetic wrote:


Beware of storing timestamps (DateFields, I guess) in Lucene, if
you

intend to use range queries (xxx TO yyy).
Why?

We have attributes that contain iso8601 date strings and when
indexing:
Date date = isoConv.parse(value, new ParsePosition(0));
String dateString = DateField.dateToString(date);
doc.add(Field.Keyword(name, dateString));
then when searching:

String from = DateField.timeToString(searchFromDate);
String to = DateField.timeToString(searchToDate);
RangeQuery rq = new RangeQuery(new Term(searchKey, from),
   new Term(searchKey, to), true);
Is this not correct?

bst,
-Rob




Read past EOF and negative bufferLength problem (1.4 rc2)

2004-04-27 Thread Joe Berkovitz
Using Lucene 1.4 rc2 I've run into a fatal problem: certain 
PhraseQueries cause a Read Past EOF exception (see below), while other 
PhraseQueries enter an infinite loop due to a negative bufferLength 
field in CSInputStream.  Environment is WinXP, JDK 1.4.2.  The index is 
large, incorporating 1,000,000 documents each of which has 3 stored, 
indexed fields of 10-100 chars.

The problem does not occur with Lucene 1.3 indexing the exact same set 
of Documents.  Nor does it occur with 1.4 rc2 using various smaller sets 
of documents.  Right now my workaround is to use Lucene 1.3.

For the PhraseQuery "a y" (that's right, two single-letter terms), the 
read-past-EOF exception is as follows:

java.io.IOException: read past EOF
   at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
   at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
   at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
   at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:59)
   at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:187)
   at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
   at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:69)
   at org.apache.lucene.search.Scorer.score(Scorer.java:37)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:81)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
   at org.apache.lucene.search.Hits.<init>(Hits.java:43)
   at org.apache.lucene.search.Searcher.search(Searcher.java:33)
   at org.apache.lucene.search.Searcher.search(Searcher.java:27)
   at...

For the phrase query "z y", an infinite loop is entered.  The loop 
occurs due to a similar condition to read-past-EOF: at line 153 of 
org.apache.lucene.store.InputStream, the value of bufferLength goes 
negative due to the value of start exceeding the value of end.  This in 
turn seems to be a consequence of a seek to a position past the end of 
the stream.

Something is clearly corrupt somewhere in the index structure.  I'd love 
to post the files that reproduce the problem, but it's about 100 MB of 
data.  If someone on the Lucene dev team wants to give me an upload 
destination, I can post the index somewhere and you can play with the 
problem.

regards and thanks for any assistance,

Joe Berkovitz
Chief Architect
Ruckus Network, Inc.


Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:
Otis Gospodnetic wrote:

Because having small time units like milliseconds will result in Range
query expanding to a large number of BooleanQueries, if you have a lot
of documents with unique time stamps.  Rounding the timestamp to
minutes, hours, or days, can drastically reduce the number of unique
time stamps, hence resulting in less BooleanQueries.
Cool, thanks. So DateField.dateToString is the best, most efficient 
way, correct?
It all depends.  But if all you care about is year, month, day, it is 
_not_ the most efficient.  DateField converts down to milliseconds, and 
is what Otis was referring to.

	Erik



Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote:

On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:

Otis Gospodnetic wrote:

Because having small time units like milliseconds will result in Range
query expanding to a large number of BooleanQueries, if you have a lot
of documents with unique time stamps.  Rounding the timestamp to
minutes, hours, or days, can drastically reduce the number of unique
time stamps, hence resulting in less BooleanQueries.


Cool, thanks. So DateField.dateToString is the best, most efficient 
way, correct?


It all depends.  But if all you care about is year, month, day, it is 
_not_ the most efficient.  DateField converts down to milliseconds, and 
is what Otis was referring to.
Oops, I meant to write DateField.timeToString which I use when querying. 
If I use DateField.dateToString when indexing but timeToString when 
searching is that a bad practice? I do only need month, day and year. So 
should I be indexing with timeToString?

How would you do it if the above is still a bad practice?

Sorry for the basic questions...

best,
-Rob

Erik



RE: BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery

Or if I overlooked some previous post or thread that covers this please help
me track it down.

Thank you,
Tate

-Original Message-
From: Tate Avery [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 10:20 AM
To: [EMAIL PROTECTED]
Subject: BooleanScorer - 32 required/prohibited clause limit


Hello,

I am using Lucene 1.3 and I ran into the following exception:

java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.
at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98)

Is there any easy way to fix/adjust this (like the
BooleanQuery.maxClauseCount, for example)?
Strangely, I couldn't find mention of the BooleanScorer class in my javadoc.


Thank you for any tips.

Tate

p.s.  Yes, I am intentionally generating some rather long boolean queries.
:)





Re: phrase search AND term

2004-04-27 Thread Ioan Miftode


Thank you Doug, the latest CVS works fine.

ioan

At 12:23 PM 4/27/2004, you wrote:
Ioan Miftode wrote:
I recently upgraded to lucene 1.4 RC2 because I needed some
sorting capabilities. However some phrase searches don't
work anymore (the hits don't even have the term's I'm searching on).
Try the latest CVS.  There were some bugs in 1.4RC2 that have been fixed.

(We'll probably do an RC3 release soon.  There are currently some bugs in 
span search that would be good to get fixed in RC3, but perhaps these will 
have to wait until RC4...)

Doug


Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:
Oops, I meant to write DateField.timeToString which I use when 
querying. If I use DateField.dateToString when indexing but 
timeToString when searching is that a bad practice? I do only need 
month, day and year. So should I be indexing with timeToString?

How would you do it if the above is still a bad practice?

Sorry for the basic questions...
No worries.  This is the type of thing that is a gotcha with dates, 
and is a prime candidate for a wiki page (nudge, nudge)...

You should represent dates (at index and search time) using YYYYMMDD 
format - it needs to be lexicographically ordered.  Forget DateField 
and Field.Keyword(String,Date) altogether.

Some tricks are needed if you need to use QueryParser to translate 
mm/dd/yyyy format to how you represent it, but it is quite simple 
(subclass QueryParser, override getRangeQuery).
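
A minimal sketch of that date handling, using only java.text.SimpleDateFormat:
dates are stored as an eight-digit year-month-day string (yyyyMMdd in
SimpleDateFormat terms), and month/day/year user input is translated into the
same form. The class name DateKeys and its method names are made up for
illustration; the translation call would presumably sit inside an overridden
getRangeQuery, which is not shown here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Lucene: converts dates to a sortable key.
final class DateKeys {
    // A fresh formatter per call, because SimpleDateFormat is not thread-safe.
    private static SimpleDateFormat format(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone keeps keys stable
        f.setLenient(false);                        // reject input like 13/45/2004
        return f;
    }

    /** Key to index: year first, so string order equals date order. */
    static String toKey(Date date) {
        return format("yyyyMMdd").format(date);
    }

    /** Translates user input such as 04/27/2004 into the indexed form. */
    static String userInputToKey(String monthDayYear) {
        try {
            return toKey(format("MM/dd/yyyy").parse(monthDayYear));
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a valid date: " + monthDayYear);
        }
    }

    public static void main(String[] args) {
        String from = userInputToKey("12/31/2003");
        String to = userInputToKey("04/27/2004");
        // Plain string comparison now orders the two dates correctly.
        System.out.println(from + " < " + to + " : " + (from.compareTo(to) < 0));
    }
}
```

The same key must be produced at index time and at search time, otherwise range
queries compare strings in two different formats.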

	Erik


Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote:

On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:

Oops, I meant to write DateField.timeToString which I use when 
querying. If I use DateField.dateToString when indexing but 
timeToString when searching is that a bad practice? I do only need 
month, day and year. So should I be indexing with timeToString?

How would you do it if the above is still a bad practice?

Sorry for the basic questions...


No worries.  This is the type of thing that is a gotcha with dates, 
and is a prime candidate for a wiki page (nudge, nudge)...

You should represent dates (at index and search time) using YYYYMMDD 
format - it needs to be lexicographically ordered.  Forget DateField and 
Field.Keyword(String,Date) altogether.

Some tricks are needed if you need to use QueryParser to translate 
mm/dd/yyyy format to how you represent it, but it is quite simple 
(subclass QueryParser, override getRangeQuery).
Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

This is perfect in my case since iso8601 is in the format:

2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I 
guess I should have been doing this from the beginning, don't know why I 
didn't...
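
As a small illustration of the ISO 8601 point: because the fields run from most
to least significant, such timestamps already sort correctly as plain strings,
and a day-level key falls out of the first ten characters. The names IsoKeys and
dayKey are assumptions made for this sketch.

```java
import java.util.Arrays;

// Hypothetical helper: derives a yyyyMMdd key from an ISO 8601 timestamp.
final class IsoKeys {
    /** "2004-04-27T01:23:33" -> "20040427" (keeps year, month and day only). */
    static String dayKey(String iso8601) {
        if (iso8601.length() < 10) {
            throw new IllegalArgumentException("not an ISO 8601 date: " + iso8601);
        }
        return iso8601.substring(0, 10).replace("-", "");
    }

    public static void main(String[] args) {
        String[] stamps = {
            "2004-04-27T01:23:33", "2003-12-31T23:59:59", "2004-04-02T08:00:00"
        };
        Arrays.sort(stamps); // lexicographic sort is also chronological here
        for (String s : stamps) {
            System.out.println(s + " -> " + dayKey(s));
        }
    }
}
```

This holds only while every timestamp uses the same time zone and the same
fixed-width layout.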

best,
-Rob

Erik


Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Robert Koberg wrote:

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)


I guess you mean

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Thanks as well

Michi


This is perfect in my case since iso8601 is in the format:

2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I 
guess I should have been doing this from the beginning, don't know why 
I didn't...

best,
-Rob

Erik



--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]

Index directory name

2004-04-27 Thread Narayan, Anand
I am having a problem with using a network path for the index directory.
If I use a path of the form //server/indexdir, the IndexWriter finds it and
indexes documents, but the IndexSearcher throws an exception saying it is
not a valid path.
I cannot use a local path as I need to be able to support a common index
directory for a clustered environment.
What is the best solution in this case?

Thanks
Anand



Re: need info for database based Lucene but not flat file

2004-04-27 Thread Incze Lajos
On Tue, Apr 27, 2004 at 09:15:05AM -0700, Doug Cutting wrote:
 Yukun Song wrote:
 As is known, Lucene currently uses flat files to store information for
 indexing. 
 
 Does anyone have ideas or resources for combining a database (like MySQL or
 PostgreSQL) with Lucene instead of the current flat index file formats?
 
 A few folks have implemented an SQL-based Lucene Directory, but none has 
 yet been contributed to Lucene.  Hopefully one will be soon.
 
 For some discussion of this, see messages on SQLDirectory in the mail 
 archives:
 
 http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-user%40jakarta.apache.org&searchText=SQLDirectory&defaultField=subject&Search=Search
 
 Doug

Could anybody summarize what would be the technical pros/cons of a DB-based
directory over the flat files? (What I see at the moment is that for some
- significant? - performance penalty you'll get an index available over the
network for multiple Lucene engines -- if I'm right.)

incze



Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Incze Lajos wrote:
Could anybody summarize what would be the technical pros/cons of a DB-based
directory over the flat files? (What I see at the moment is that for some
- significant? - performance penalty you'll get an index available over the
network for multiple Lucene engines -- if I'm right.)
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1344168

Doug


RE: languages supported by lucene 1.2.1 in eclipse help system

2004-04-27 Thread Eric Isakson
I'm assuming what you have is an Eclipse plugin that makes use of the Eclipse help 
system. If you are relying on the Lucene Eclipse plugin, you may want to 
look at the help system anyway, since it gives you an example of an Eclipse plugin 
that uses the Lucene plugin.

The Eclipse help system uses Lucene, but it has its own Analyzer class that uses 
BreakIterator to identify tokens for languages other than English and German. The 
Lucene Eclipse plugin just exports the Lucene jar and the HTML parser so that any 
plugin that depends on the Lucene plugin (like the help system) will have those jars 
on its classpath.

For English they use the PorterStemFilter with a StopAnalyzer and a stopword list. For 
German, they use the GermanAnalyzer supplied by the Lucene jar.

In the latest CVS at :pserver:[EMAIL PROTECTED]:/home/eclipse, see the project in
org.eclipse.help.base/src/org/eclipse/help/internal/search. In older Eclipse versions,
see the R2_1_maintenance branch of 
org.eclipse.help/src/org/eclipse/help/internal/search.

The class DefaultAnalyzer is the analyzer implementation for languages other than 
English and German, and WordTokenStream is where they use BreakIterator to break the 
content from the reader into individual tokens.
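
To give a feel for the BreakIterator approach, here is a rough sketch that splits
text into word tokens with java.text.BreakIterator. It is not the Eclipse
WordTokenStream source; the class name BreakWords and the rule of keeping only
segments that contain a letter or digit are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative tokenizer built on BreakIterator; not the Eclipse implementation.
final class BreakWords {
    /** Returns the lower-cased word segments of text for the given locale. */
    static List<String> tokens(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> result = new ArrayList<String>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Word boundaries also delimit spaces and punctuation; skip those.
            if (containsLetterOrDigit(segment)) {
                result.add(segment.toLowerCase(locale));
            }
        }
        return result;
    }

    private static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Le moteur de recherche", Locale.FRENCH));
    }
}
```

A tokenizer like this only finds word boundaries; it does no stemming or stop-word
removal, which is one reason search quality can differ between languages.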

The default Eclipse help system sets these extensions in the org.eclipse.help.base 
plugin:

<!-- Text Analyzers for search -->
   <extension
         id="org.eclipse.help.base.Analyzer_en"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="en"
            class="org.eclipse.help.internal.search.Analyzer_en">
      </analyzer>
   </extension>
   <extension
         id="org.eclipse.help.base.Analyzer_de"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="de"
            class="org.apache.lucene.analysis.de.GermanAnalyzer">
      </analyzer>
   </extension>

Look at the extension point schema in 
http://dev.eclipse.org/viewcvs/index.cgi/~checkout~/org.eclipse.help.base/schema/luceneAnalyzer.exsd?rev=HEAD&content-type=text/plain
for how to declare your own analyzer extensions. Beware, though: I read that this 
affects all help searches in that language, not just the ones for your plugin.

Also, since WordTokenStream is in a package with "internal" in its path, you 
aren't supposed to ever use that class from other plugins. If you wanted 
your own analyzer based on that class and a stop list, you shouldn't use it 
without talking the Eclipse help developers into moving it outside of an internal 
package.

Most of this has been around for a while, so it is probably the same or very similar 
in previous Eclipse versions. You may need to poke around at the extension point 
schema in your Eclipse plugins directory to verify that the extension point works the 
same way in your version of Eclipse. I haven't used it in versions prior to 3.0M8.

Hope this is useful to you,
Eric

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Saturday, April 24, 2004 10:18 AM
To: Lucene Users List
Subject: Re: languages supported by lucene 1.2.1 in eclipse help system


That's no myth :)
Core Lucene (even the current version) does not include classes that know how to 
analyze/tokenize text in languages other than English, Russian, and German.  However, 
take a look at the Snowball contributions in the Lucene Sandbox, where a few more 
analyzers are available, including those for the CJK group of languages.

Otis


--- Jason Elliott [EMAIL PROTECTED] wrote:
 We have a plugin in our eclipse project named org.apache.lucene_1.2.1.
 It works quite well in that help system.
  
 I've been notified that this particular version of the lucene search 
 analyzer searches well in German and English (GE), but not so well in 
 the rest of the languages on this planet.
  
 I have several questions:
 1. If it does not search very well in French, Italian and Japanese
 (FIJ), what does that really mean to a user conducting searches?
 a. If this is a myth and the searches work the same in EFIG-J, please
 let me know that.
 b. If this is not a myth, are there plugins that enable the search
 to work well in FIJ?
  
 Thanks
 jason
  
 




Re: need info for database based Lucene but not flat file

2004-04-27 Thread Incze Lajos
On Tue, Apr 27, 2004 at 02:46:22PM -0700, Doug Cutting wrote:
 Incze Lajos wrote:
 Could anybody summarize what would be the technical pros/cons of a DB-based
 directory over the flat files? (What I see at the moment is that for some
 - significant? - performance penalty you'll get an index available over the
 network for multiple Lucene engines -- if I'm right.)
 
 http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1344168
 
 Doug

Thanks.

incze



Re: status of LARM project

2004-04-27 Thread Kelvin Tan
As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that
Clemens got a job which wasn't supportive of his continued development on LARM.
AFAIK there aren't any other active developers of LARM (at least at the time it
branched off to SF).

Otis recently posted a suggestion to use Nutch instead of LARM.

Kelvin

On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
 Hi

 I have looked at the LARM websites and I get different results:

 http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
 It says that development has stopped for this project.

 LARM hosted on sourceforge.
 The last message was dated 2003 in the mailing list. Is it still
 supported and active?

 LARM hosted on apache.
 It says the project is moved to sourceforge.

 Can anyone here who is active in LARM comment on the status?

 Regards

 Sebastian Ho





Re: Index directory name

2004-04-27 Thread Gabriela D
I assume you are using a Wintel platform. You may map the directory where your 
indexes are kept using a persistent connection (this can be done with the NET USE 
command at the command prompt). This keeps the network connection always open; otherwise 
Windows will close the connection after some time (though it remains manually accessible). You 
can notice this in an Explorer window, where you will find a red cross mark against the 
mapped network drive.
Harsha.
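
One way to narrow this down might be to check, from the same JVM that runs the
search, what java.io.File makes of the path before it is handed to Lucene. The
sketch below uses only the standard library; PathCheck and describe are
hypothetical names, and the paths in main are placeholders (the share itself and
a drive letter assumed to have been mapped to it).

```java
import java.io.File;
import java.io.IOException;

// Hypothetical diagnostic, independent of Lucene.
final class PathCheck {
    /** Summarizes how this JVM resolves the given index directory path. */
    static String describe(String path) {
        File dir = new File(path);
        String canonical;
        try {
            canonical = dir.getCanonicalPath();
        } catch (IOException e) {
            canonical = "(unresolvable: " + e.getMessage() + ")";
        }
        return path + " -> " + canonical
            + " exists=" + dir.exists()
            + " directory=" + dir.isDirectory()
            + " readable=" + dir.canRead();
    }

    public static void main(String[] args) {
        // Placeholder paths: the share itself and a mapped drive letter.
        System.out.println(describe("//server/indexdir"));
        System.out.println(describe("Z:/indexdir"));
    }
}
```

Running it on both the indexing and the searching machine shows whether the two
actually see the same directory.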

Narayan, Anand [EMAIL PROTECTED] wrote:
I am having a problem with using a network path for the index directory.
If I use a path of the form //server/indexdir, the IndexWriter finds it and
indexes documents, but the IndexSearcher throws an exception saying it is
not a valid path.
I cannot use a local path as I need to be able to support a common index
directory for a clustered environment.
What is the best solution in this case?

Thanks
Anand

Re: status of LARM project

2004-04-27 Thread Stephane James Vaucher
I suggest you look at:
http://www.manageability.org/blog/stuff/open-source-web-crawlers-java

From what I know of Nutch, it's meant as the basis for a competitor to the
big search engines (i.e. Google). For a small web site, it might be
overkill, especially if it requires you to build from CVS (unless there are
distributions).

Note:
I've got the book Programming Spiders, Bots and Aggregators in Java. It
describes spiders using a project called j-spider:
http://sourceforge.net/projects/j-spider/
It could probably be adapted for your needs.

HTH,
sv

On Wed, 28 Apr 2004, Kelvin Tan wrote:

 As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that
 Clemens got a job which wasn't supportive of his continued development on LARM.
 AFAIK there aren't any other active developers of LARM (at least at the time it
 branched off to SF).

 Otis recently posted a suggestion to use Nutch instead of LARM.

 Kelvin

 On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
  Hi
 
  I have looked at the LARM websites and I get different results:
 
  http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
  It says that development has stopped for this project.
 
  LARM hosted on sourceforge.
  The last message was dated 2003 in the mailing list. Is it still
  supported and active?
 
  LARM hosted on apache.
  It says the project is moved to sourceforge.
 
  Can anyone here who is active in LARM comment on the status?
 
  Regards
 
  Sebastian Ho
 
 