Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-26 Thread iouli . golovatyi
Many thanks to everybody for the interesting info.

Regards and have a nice day
J.




sergiu gordea [EMAIL PROTECTED]
25.10.2004 17:05
Please respond to Lucene Users List

 
To: Lucene Users List [EMAIL PROTECTED]
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject:Re: Need advice: what Word/Excel/PowerPoint lib to use?
Category: 




POI, of course, if you want open source.
There are also some commercial products based on POI.

For Word, consider textmining.org.
For XLS, POI does everything you need.
For PowerPoint there is one commercial product (about $1000), but you can
also find some source code in the archives.

 All the best,

  Sergiu
 

[EMAIL PROTECTED] wrote:

Hello all,

I need a piece of advice/experience again..

What MS Word/Excel/PowerPoint parsers (written in Java) would you recommend?

Thanks in advance
J.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 








Re: Need advice: what pdf lib to use?

2004-10-26 Thread iouli . golovatyi
OK, but even in this case parsing the doc would not be a violation,
because what we actually need for Lucene is just a collection of terms. It has
nothing to do with printing or copying pieces of _text_.
As long as you provide a method that returns just a Document (I mean a Lucene
Document), the permissions specified by the author of the PDF document are respected.





Ben Litchfield [EMAIL PROTECTED]
25.10.2004 17:59
Please respond to Lucene Users List

 
To: Lucene Users List [EMAIL PROTECTED]
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject:Re: Need advice: what pdf lib to use?
Category: 




In order to write software that consumes PDF documents you must agree to a
list of conditions.  One of those conditions is that permissions specified
by the author of the PDF document are respected.

PDFBox complies with this statement, if there is software that does not
then they are in violation of copyright law.

That being said, PDFBox is open source so a user could make modifications
to the source code, or as a PDF library could change permissions on a
document.

Ben

On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:

 Yes Ben, You are right.

 This would be correct functionality from a technical perspective. But look
 at it my way, through the eyes of an application programmer reporting to the
 big boss that c. 30% of the docs we cope with could not be indexed because of
 this stupid limitation. Neither he nor I have any influence on the PDF owners,
 or any idea what made them create files with document security set.

 In short, if you could also implement this incorrect functionality that the
 closed source guys did, it would be really great!

 As far as sponsoring is concerned, I would be ready to hack it (or at least
 try to) even for 1/3 of that fortune :)))

 J.





 Ben Litchfield [EMAIL PROTECTED]
 25.10.2004 14:02
 Please respond to Lucene Users List


 To: Lucene Users List [EMAIL PROTECTED]
 cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
 Subject:Re: Need advice: what pdf lib to use?
 Category:




 PDFBox does not 'stumble' when it gives that message, that is correct
 functionality if that permission is not allowed.

 If your company is willing to pay a 'fortune', why not sponsor a change to
 an open source project for half a fortune?

 Ben
 http://www.pdfbox.org

 On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:

  PDFBox also stumbles, throwing java.io.IOException with the message
  "You do not have permission to extract text", when the doc is copy/print
  protected.
  I have now tested the Snowtide commercial product and it looks like it could
  process these files as well. Performance was also not so bad. Unfortunately
  the test result cannot be considered 100% conclusive, because the free
  version processed just the first 8 pages. After all, this product costs a
  fortune (as long as the company is ready to pay I don't really mind :))
 
  J.
 
 
 
 
 
  Robert Newson [EMAIL PROTECTED]
  Sent by: news [EMAIL PROTECTED]
  24.10.2004 17:44
  Please respond to Lucene Users List
 
 
  To: [EMAIL PROTECTED]
  cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
  Subject:Re: Need advice: what pdf lib to use?
  Category:
 
 
 
  [EMAIL PROTECTED] wrote:
   Hello all,
  
   I need a piece of advice/experience..
  
   What PDF parser (written in Java) would you recommend?
  
   I have now played with PDFBox-0.6.7a and would not say I was too satisfied
   with it.
  
   On certain PDFs (not well formatted, but still readable with Acrobat) it
   ran into a dead loop (this I could fix in code), and on one file it produced
   an out-of-memory error and killed the JVM :( (this problem I could not
   identify yet).
  
   After all, the performance was not too great either: it took c. 19 h to
   index 13000 files (c. 3.5 GB).
  
   Regards,
   J.
  
  
  
 
  On the specific problem of the dead loop, I reported an instance of
  this to Ben a week or so ago and he has fixed it in the latest
  nightlies.  I expect an official release will include this bugfix soon.
  The file in question was unreadable with any PDF software I have, but
  someone managed to create it somehow...
 
  http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832
 
  I've found pdfbox to be pretty good. The only time I get problems is
  with corrupted or egregiously bad PDF files.
 
  B.
 
 
 
 
 
 











Large number of documents

2004-10-26 Thread Gard Arneson Haugen
Hi,
I have just started looking at Lucene and am not an experienced user of
Java, but from what I've been reading this search tool should manage
large amounts of documents.

I'm wondering if anyone has experience using Lucene on a large
number of documents. I need to be able to index and search through
20-30 million documents of around 8 KB each. They are all simple text
documents with some attributes to restrict the search results on.

Any feedback would be appreciated.
Best regards,
Gard Arneson Haugen



Re: BooleanQuery - TooManyClauses

2004-10-26 Thread Erik Hatcher
On Oct 25, 2004, at 6:35 PM, Angelov, Rossen wrote:
 Why there is a limit on the number of clauses? and is there any harm in
 setting MaxClauseCount to Integer.MAX_VALUE?

The harm is in performance and resource utilization.  Rather than do
this, though, read on...

 I'm using a Range Query on a field that represents dates and getting
 BooleanQuery$TooManyClauses exception.
 This is the query -  +/article/createddateiso8601:[2003010100 TO
 2003123199]

Do you really need to do ranges down to that time level?  Or are you
really just concerned with the date?  If you indexed using YYYYMMDD
instead, there would only be a maximum of 365 terms in that range,
whereas you've got zillions (ok, I was too lazy to do the math!  But
far more than 1,024).

I recommend changing how you index dates, or at least use a different 
field for queries that do not need to concern themselves with the 
timestamp aspect.

Erik
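
A rough illustration of the arithmetic behind this advice. This is only a sketch and not code from the thread: the class and method names are invented for illustration, it uses the java.time API from later Java versions, and it computes upper bounds only (the real number of terms can never exceed the number of distinct values actually indexed).

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class RangeTermCount {
    // Upper bound on distinct day-resolution terms in [start, end], both inclusive.
    static long maxDayTerms(LocalDate start, LocalDate end) {
        return ChronoUnit.DAYS.between(start, end) + 1;
    }

    // Upper bound on distinct second-resolution terms over the same span.
    static long maxSecondTerms(LocalDate start, LocalDate end) {
        return maxDayTerms(start, end) * 24L * 60L * 60L;
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2003, 1, 1);
        LocalDate end = LocalDate.of(2003, 12, 31);
        // The limit mentioned in the thread is 1,024 clauses.
        System.out.println("day terms:    " + maxDayTerms(start, end));
        System.out.println("second terms: " + maxSecondTerms(start, end));
    }
}
```

For the year 2003 this gives at most 365 day-level terms, against a ceiling of 31,536,000 second-level terms.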


Re: Large number of documents

2004-10-26 Thread Otis Gospodnetic
Hello Gard,

This is certainly doable, it just depends on your hardware, complexity
of queries, frequency of queries, and such.  There is a benchmark page
on the Lucene site that you may want to check to get some ideas.

Otis






RE: BooleanQuery - TooManyClauses

2004-10-26 Thread Angelov, Rossen

 On Oct 25, 2004, at 6:35 PM, Angelov, Rossen wrote:
  Why there is a limit on the number of clauses? and is there any harm in
  setting MaxClauseCount to Integer.MAX_VALUE?

 The harm is in performance and resource utilization.  Rather than do
 this, though, read on...

  I'm using a Range Query on a field that represents dates and getting
  BooleanQuery$TooManyClauses exception.
  This is the query -  +/article/createddateiso8601:[2003010100 TO
  2003123199]

 Do you really need to do ranges down to that time level?  Or are you
 really just concerned with date?  If you indexed using YYYYMMDD
 instead, there would only be a maximum of 365 terms in that range,
 whereas you've got zillions (ok, I was too lazy to do the math!  But
 far more than 1,024).

I need to do range searches. They are part of the requirements and, even
worse, the range can be as big as 10 years for now. It will get
bigger. I'm indexing using the YYYYMMDDHHmmssZ format and, as you said, there
will be more than just 365 terms per year. This number changes every day as
new documents are indexed daily. The only limit I can see is the number of
documents that were indexed. I guess maxClauseCount can't be more than the
number of indexed documents.

 I recommend changing how you index dates, or at least use a different
 field for queries that do not need to concern themselves with the
 timestamp aspect.

What do you mean by changing how the dates are indexed? By the way, this
field is indexed as a string.


 Erik



Ross


This communication is intended solely for the addressee and is
confidential and not for third party unauthorized distribution.



Re: BooleanQuery - TooManyClauses

2004-10-26 Thread Terry Steichen
I think what Erik's asking is whether you can live with expressing your indexed date
in the form of YYYYMMDD, without the hour and minute extension.  That will sharply
reduce the number of range query expansion terms.  If you're using the timestamp as a
unique identifier, you might consider creating two fields, one for the unique
identifier (YYYYMMDDHHmmssZ) and one for the date (YYYYMMDD), and only use the range
on the date field (not on the timestamp field).

Regards,

Terry



Re: Exception in thread main java.lang.NoClassDefFoundError

2004-10-26 Thread chandrakant gopalan
Hi Rob,
I noticed that you are using org.apache.lucene.demos, whereas the package is
just org.apache.lucene.demo (no trailing "s").

Regards
CG

On Mon, 25 Oct 2004 21:54:38 +0100, Rob Hailey [EMAIL PROTECTED] wrote:
 I am using lucene version 1.4.2 but am consistently getting an error
 when I run this:
 
 java -verbose -classpath
 /Users/rob/Desktop/lucene/lucene.jar:/Users/rob/Desktop/lucene/lucene-demos.jar:.
 org.apache.lucene.demos.IndexFiles /Users/rob/Desktop/lucene/src/
 
 The error I get is:
 
 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/lucene/demos/IndexFiles
 
 Can someone please help? I have tried on both Mac OS X (Panther) and
 Windows XP - both with the latest JVM - but I get the same error
 message. Thanks.
 
 The JVM version is:
 
 Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-141.3)
 Java HotSpot(TM) Client VM (build 1.4.2-38, mixed mode)
 
 The verbose error message is:
 
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/classes.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/ui.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/laf.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/sunrsasign.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/jsse.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/jce.jar]
 [Opened /System/Library/Frameworks/JavaVM.framework/Versions/1.4.2/Classes/charsets.jar]
 [Loaded java.lang.Object from shared objects file]
 [Loaded java.io.Serializable from shared objects file]
 [Loaded java.lang.Comparable from shared objects file]
 [Loaded java.lang.CharSequence from shared objects file]
 [Loaded java.lang.String from shared objects file]
 [Loaded java.lang.Class from shared objects file]
 [Loaded java.lang.Cloneable from shared objects file]
 [Loaded java.lang.ClassLoader from shared objects file]
 [Loaded java.lang.System from shared objects file]
 [Loaded java.lang.Throwable from shared objects file]
 [Loaded java.lang.Error from shared objects file]
 [Loaded java.lang.ThreadDeath from shared objects file]
 [Loaded java.lang.Exception from shared objects file]
 [Loaded java.lang.RuntimeException from shared objects file]
 [Loaded java.security.ProtectionDomain from shared objects file]
 [Loaded java.security.AccessControlContext from shared objects file]
 [Loaded java.lang.ClassNotFoundException from shared objects file]
 [Loaded java.lang.LinkageError from shared objects file]
 [Loaded java.lang.NoClassDefFoundError from shared objects file]
 [Loaded java.lang.ClassCastException from shared objects file]
 [Loaded java.lang.ArrayStoreException from shared objects file]
 [Loaded java.lang.VirtualMachineError from shared objects file]
 [Loaded java.lang.OutOfMemoryError from shared objects file]
 [Loaded java.lang.StackOverflowError from shared objects file]
 [Loaded java.lang.ref.Reference from shared objects file]
 [Loaded java.lang.ref.SoftReference from shared objects file]
 [Loaded java.lang.ref.WeakReference from shared objects file]
 [Loaded java.lang.ref.FinalReference from shared objects file]
 [Loaded java.lang.ref.PhantomReference from shared objects file]
 [Loaded java.lang.ref.Finalizer from shared objects file]
 [Loaded java.lang.Runnable from shared objects file]
 [Loaded java.lang.Thread from shared objects file]
 [Loaded java.lang.ThreadGroup from shared objects file]
 [Loaded java.util.Dictionary from shared objects file]
 [Loaded java.util.Map from shared objects file]
 [Loaded java.util.Hashtable from shared objects file]
 [Loaded java.util.Properties from shared objects file]
 [Loaded java.lang.reflect.AccessibleObject from shared objects file]
 [Loaded java.lang.reflect.Member from shared objects file]
 [Loaded java.lang.reflect.Field from shared objects file]
 [Loaded java.lang.reflect.Method from shared objects file]
 [Loaded java.lang.reflect.Constructor from shared objects file]
 [Loaded sun.reflect.MagicAccessorImpl from shared objects file]
 [Loaded sun.reflect.MethodAccessor from shared objects file]
 [Loaded sun.reflect.MethodAccessorImpl from shared objects file]
 [Loaded sun.reflect.ConstructorAccessor from shared objects file]
 [Loaded sun.reflect.ConstructorAccessorImpl from shared objects file]
 [Loaded sun.reflect.DelegatingClassLoader from shared objects file]
 [Loaded java.util.Collection from shared objects file]
 [Loaded java.util.AbstractCollection from shared objects file]
 [Loaded java.util.List from shared objects file]
 [Loaded java.util.AbstractList from shared objects file]
 [Loaded java.util.RandomAccess from shared objects file]
 [Loaded java.util.Vector from shared objects file]
 [Loaded java.lang.StringBuffer from shared objects file]
 [Loaded java.nio.Buffer from shared objects file]
 [Loaded sun.misc.AtomicLong from shared objects file]
 [Loaded sun.misc.AtomicLongCSImpl from shared objects file]
 [Loaded 

RE: BooleanQuery - TooManyClauses

2004-10-26 Thread Angelov, Rossen
OK, I got that part: to limit the clause count, limit the range. In my case,
replace the timestamp with the date, and if it gets too big again, replace
YYYYMMDD with YYYYMM and later with YYYY. And that of course includes fixing
the old files every time so they have the new field.
I was actually looking for a more robust solution, but this should do for now.

Thanks,
Ross




RE: Aliasing problem

2004-10-26 Thread Chuck Williams
Looks like you produced a PhraseQuery rather than a BooleanQuery.  You want

+GAME:(doom3 3 doom)

Chuck
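
A minimal sketch of building that grouped form from the alias tokens. It is not code from the thread: the class and method names are hypothetical, and it only assembles the query text shown above, with no escaping of QueryParser special characters.

```java
import java.util.Arrays;
import java.util.List;

final class AliasQuery {
    // Builds "+FIELD:(t1 t2 ...)": one required group whose terms are
    // separate clauses rather than a single phrase.
    static String requiredAnyOf(String field, List<String> tokens) {
        return "+" + field + ":(" + String.join(" ", tokens) + ")";
    }

    public static void main(String[] args) {
        // The original term plus the two aliases produced by the alias filter.
        List<String> tokens = Arrays.asList("doom3", "3", "doom");
        System.out.println(requiredAnyOf("GAME", tokens));
    }
}
```

Run as is, this prints +GAME:(doom3 3 doom).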

   -Original Message-
   From: Abhay Saswade [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, October 26, 2004 10:22 AM
   To: [EMAIL PROTECTED]
   Subject: Aliasing problem
   
   Hi,
   
   One document in my index contains the term 'doom 3' (indexed, tokenized,
   stored).
   How can I match the term doom3 with that document?
   
   I tried the following, but no luck:
   I have written an alias filter which returns 2 more tokens for doom3: 3
   and doom.
   
   I construct the query +GAME:doom3
   QueryParser returns +GAME:doom3 3 doom
   
   I am using StandardTokenizer.
   
   Is my approach correct? Or am I missing something? Any help is highly
   appreciated.
   
   Thanks in advance,
   Abhay
   
   
   
  





Re: Aliasing problem

2004-10-26 Thread Daniel Naber
On Tuesday 26 October 2004 19:22, Abhay Saswade wrote:

 I tried following but no luck
 I have written alias filter which returns 2 more tokens for doom3 as 3
 and doom

 I construct query +GAME:doom3
 QueryParser returns +GAME:doom3 3 doom

Your approach is correct, but QueryParser doesn't yet support analyzers 
which return more than one token at a position. There's already a patch 
about this in the bug tracking system.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: BooleanQuery - TooManyClauses

2004-10-26 Thread Erik Hatcher
On Oct 26, 2004, at 1:55 PM, Angelov, Rossen wrote:
 OK, I got that part - to limit the clause counts limit the range. In my case
 replace the timestamp with date and if it gets too big again replace the
 YYYYMMDD with YYYYMM and later with YYYY. And that of course includes fixing
 the old files every time so they have new field.
 I was actually looking for more robust solution but this should do for now.

More robust, as in does not require re-indexing?

This is one of the tricky things about making a search engine: having
fast searches, yet reserving the right to change how you query at a
later time without re-indexing.  Unfortunately it doesn't work that
way.  You have to consider the types of queries that will be made in
order to index appropriately.  Changes in the types of queries may
necessitate a re-index to accommodate them.

You may want to go ahead and index one field as YYYYMMDD, another
as YYYY, and possibly another as YYYYMM.

You could also utilize a Filter for constraining searches based on a 
date range.  QueryFilter is one option, or writing a custom one that 
selects the appropriate documents.

Erik
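
One way to derive such per-granularity values from a single timestamp at indexing time, as a sketch only. It is not code from the thread: the class name and the field names (year, month, day, timestamp) are hypothetical, the time zone suffix is left out, and the java.time API is from later Java versions. Each value would then be added to the Lucene document as its own untokenized field.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

final class DateFields {
    private static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("yyyy");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Returns field name -> value, coarsest granularity first.
    static Map<String, String> fieldsFor(LocalDateTime created) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("year", created.format(YEAR));
        fields.put("month", created.format(MONTH));
        fields.put("day", created.format(DAY));
        fields.put("timestamp", created.format(STAMP));
        return fields;
    }

    public static void main(String[] args) {
        // Prints one "name=value" line per field.
        fieldsFor(LocalDateTime.of(2004, 10, 26, 13, 55, 0))
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

A range query can then target the coarsest field that still satisfies the request, which keeps the number of expanded terms small.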



RE: BooleanQuery - TooManyClauses

2004-10-26 Thread Vanlerberghe, Luc
Even if you need to be able to search on ranges that include the time,
you could benefit from adding a few extra fields to your documents.

For example: add a year field and an hour field:

If the user then specifies a range between 2001-08-10 11:00 and
2004-10-11 13:00, you break it up behind the scenes into three parts as
follows:
- a query on the date field alone, testing on the range 2001-08-11 to
2004-10-10 (i.e. all dates fully within the date range) => max number
of clauses = max number of dates in your documents
- a query on the hour field for the first date => max number of
clauses = 24
- a query on the hour field for the last date => max number of
clauses = 24
(You'll need a special case if the start and end happen to be on the
same date of course)

I'm not that familiar with the QueryParser syntax yet, but it should
look something like this (note the use of curly brackets for the
exclusive date ranges):
(date:{20010810 TO 20041011}) OR (+date:20010810 +time:[11 TO ]) OR
(+date:20041011 +time:{ TO 13})

If you need even more fine-grained ranges, you can extend this idea by
adding more fields (at the cost of making the generated query even more
complex)

You can already add the separate fields to your documents even if you
don't use them yet...

Regards,

Luc
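
The three-part split can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the class name and the field names date (YYYYMMDD) and hour (two digits, 00 to 23) are hypothetical, the range is treated as start-inclusive and end-exclusive at whole-hour resolution, and the generated text follows the query syntax used above without having been checked against a particular QueryParser version.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class SplitRange {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Two-digit hour so that string order matches numeric order.
    private static String hh(int hour) {
        return (hour < 10 ? "0" : "") + hour;
    }

    // Query text for start <= t < end, at whole-hour resolution.
    static String query(LocalDateTime start, LocalDateTime end) {
        if (!start.isBefore(end)) {
            throw new IllegalArgumentException("start must be before end");
        }
        String d1 = start.format(DAY);
        String d2 = end.format(DAY);
        int h1 = start.getHour();
        int h2 = end.getHour();
        if (d1.equals(d2)) {
            // Special case: both ends fall on the same date.
            if (h2 <= h1) {
                throw new IllegalArgumentException("range shorter than one hour");
            }
            return "+date:" + d1 + " +hour:[" + hh(h1) + " TO " + hh(h2 - 1) + "]";
        }
        StringBuilder sb = new StringBuilder();
        // Whole days strictly between the two end dates (exclusive range).
        sb.append("date:{").append(d1).append(" TO ").append(d2).append("}");
        // Remaining hours of the first day.
        sb.append(" OR (+date:").append(d1).append(" +hour:[").append(hh(h1)).append(" TO 23])");
        // Hours of the last day before the end hour, if there are any.
        if (h2 > 0) {
            sb.append(" OR (+date:").append(d2).append(" +hour:[00 TO ").append(hh(h2 - 1)).append("])");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(query(LocalDateTime.of(2001, 8, 10, 11, 0),
                                 LocalDateTime.of(2004, 10, 11, 13, 0)));
    }
}
```

For the range in the example above this prints:
date:{20010810 TO 20041011} OR (+date:20010810 +hour:[11 TO 23]) OR (+date:20041011 +hour:[00 TO 12])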

