Re: sanity check - index size

2002-05-20 Thread Peter Carlson

Whether this seems big depends on what you are storing.

For example, I have a set of data with 457MB and my Lucene index is 115MB.
However, I don't store much.

If you are storing the complete text (even if you don't index it), then the
index will be about the same size as your original data set, and probably
bigger.
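
To illustrate, here is a minimal sketch using the Field factory methods (the
field names and text are just examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

String bodyText = "...the full text of one source document...";
Document doc = new Document();
// stored AND indexed: the full text is copied into the index files
doc.add(Field.Text("contents", bodyText));
// indexed but NOT stored: only inverted-index entries are kept, much smaller
doc.add(Field.UnStored("body", bodyText));
// stored but NOT indexed: not searchable, yet still costs roughly the full text size
doc.add(Field.UnIndexed("raw", bodyText));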

--Peter

On 5/20/02 4:16 PM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> These are text files and HTML files.  I only index them into a few fields
> (title, content, filename).  My index (specifically _sd.fdt) is 20MB.  The
> bulk of the HTML files are Javadoc files (Ant's own documentation,
> actually).
> 
> Does that seem at all close to being reasonable/normal?  I am calling
> optimize() before closing the index.
> 
> Thanks for the sanity check.
> 
>   Erik
> 
> 
> 






Re: sanity check - index size

2002-05-20 Thread James Cooper

On Mon, 20 May 2002, Erik Hatcher wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> My index (specifically _sd.fdt) is 20MB.
> 
> Does that seem at all close to being reasonable/normal?  I am calling
> optimize() before closing the index.

hi,

I've wondered the same thing.  The indexes I build with Lucene are
generally around the same size as the corpus.  That was larger than I
thought it would be, but it doesn't really matter since disk is pretty
cheap (and my corpus isn't very big).

-- James






sanity check - index size

2002-05-20 Thread Erik Hatcher

I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
These are text files and HTML files.  I only index them into a few fields
(title, content, filename).  My index (specifically _sd.fdt) is 20MB.  The
bulk of the HTML files are Javadoc files (Ant's own documentation,
actually).

Does that seem at all close to being reasonable/normal?  I am calling
optimize() before closing the index.
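
For reference, the per-file code boils down to something like this (a rough
sketch; the strings and paths below are placeholders, not my actual parsing
code):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

File file = new File("docs/manual/index.html");    // one of the ~900 files
String title = "...parsed from the HTML...";
String contents = "...full text parsed from the file...";

Document doc = new Document();
doc.add(Field.Keyword("filename", file.getPath()));
doc.add(Field.Text("title", title));
doc.add(Field.Text("contents", contents));   // stored + indexed, so it ends up in the .fdt file
writer.addDocument(doc);

// ...same for the remaining files...
writer.optimize();
writer.close();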

Thanks for the sanity check.

Erik







deleting question

2002-05-20 Thread Armbrust, Daniel C.

I did a batch deletion on an index.  Then, after searching the archives for
something else, I came across this

> I understand there are three modes for using IndexReader and 
> IndexWriter:
> 
> A- IndexReader for reading only, not deleting
> B- IndexReader for deleting (and reading)
> C- IndexWriter (for adding and optimizing)

> What matters is that only one of B or C can be done at once.  
> That's to say, only a single process/thread may
> modify an index at once.  Modification should be single threaded.

In looking back at the code I ran, I had an IndexWriter open, and then I
opened an IndexReader, and did the deletions.  Every 300,000 deletions, I
called the optimize method of the writer.  When I was done with the
deletions, I closed the reader, then called optimize again on the writer,
then closed the writer.

My question is, does anyone know offhand whether having both of those open at
once would have done anything bad to my index (like corrupting it)?
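
For reference, the ordering I assume I should have used looks roughly like
this (a sketch only; the path and term are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// 1. With no IndexWriter open, delete through a reader.
IndexReader reader = IndexReader.open("/path/to/index");
reader.delete(new Term("id", "doc-to-remove"));   // repeated for each batch
reader.close();

// 2. Only after the reader is closed, open a writer and optimize.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.optimize();
writer.close();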

Thanks, 

Dan






RE: Lucene's scalability

2002-05-20 Thread Armbrust, Daniel C.

In my experience the time it takes depends much more on the complexity of
the query than on the number of results returned.  If I am making a query
with 50-60 terms, I am usually getting down to a couple thousand results or
fewer.
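
To make that concrete, the kind of query I am describing is essentially a
large BooleanQuery of optional term clauses, timed along these lines (a rough
sketch; the field name and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

String[] terms = { "heart", "failure", "congestive" /* ...50-60 in practice... */ };
BooleanQuery query = new BooleanQuery();
for (int i = 0; i < terms.length; i++) {
    // optional clauses (not required, not prohibited), effectively OR'ed together
    query.add(new TermQuery(new Term("contents", terms[i])), false, false);
}

IndexSearcher searcher = new IndexSearcher("/path/to/index");
long start = System.currentTimeMillis();
Hits hits = searcher.search(query);
System.out.println(hits.length() + " hits in "
    + (System.currentTimeMillis() - start) + " ms");
searcher.close();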

Dan


-Original Message-
From: CNew [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 20, 2002 8:46 AM
To: Lucene Users List
Subject: Re: Lucene's scalability


you didn't mention the hit count on the query with 50-60 terms.
just wondering if the time was linear.

- Original Message -
From: Armbrust, Daniel C. <[EMAIL PROTECTED]>
To: 'Lucene Users List' <[EMAIL PROTECTED]>
Sent: Friday, May 17, 2002 6:57 AM
Subject: RE: Lucene's scalability


> Currently, we are using a Ultra-80 Sparc Solaris with 4 processors and 4 GB
> of Ram.
>
> However, we are only making use of one of those processors with the index.
> Our biggest speed restriction is the fact that our entire index resides on a
> single disk drive.  We have a raid array coming soon.
>
> The performance has been very impressive, but as you can imagine, the speed
> depends highly on the complexity of the query.  If you run a query with a
> 1/2 a dozen terms and fields, which returns ~30,000 results, it usually
> takes on the order of a second or two.  If you run a query with 50-60 terms,
> it may take 5-6 seconds.
>
> I don't have any better performance stats than this currently.
>
> Dan
>
>
> -Original Message-
> From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 17, 2002 7:23 AM
> To: Lucene Users List
> Subject: Re: Lucene's scalability
>
>
> Hi ,
>
> I am also trying to do a similar thing . I am very eager to know what kind
> of hardware u are using to maintain such a big index.
> In my case it is very important that the search happens very fast . so does
> such a big index of 10 gb pose any problems in this direction
>
> TIA
>
> Regards
> Harpreet
>
>
>
> - Original Message -
> From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Tuesday, April 30, 2002 12:07 AM
> Subject: RE: Lucene's scalability
>
>
> > I currently have an index of ~ 12 million documents, which are each about
> > that size (but in xml form).
> >
> > When they are transformed for lucene to index, there are upwards of 50
> > searchable fields.
> >
> > The index is about 10 GB right now.
> >
> > I have not yet had any problems with "pushing the limits" of lucene.
> >
> > In the next few weeks, I will be pushing my number of indexed documents up
> > into the 15-20 million range.  I can let you know if any problems arise.
> >
> > Dan
> >
> >
> >
> > -Original Message-
> > From: Joel Bernstein [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, April 29, 2002 1:32 PM
> > To: [EMAIL PROTECTED]
> > Subject: Lucene's scalability
> >
> >
> > Is there a known limit to the number of documents that Lucene can handle
> > efficiently?  I'm looking to index around 15 million, 2K docs which contain
> > 7-10 searchable fields. Should I be attempting this with Lucene?
> >
> > Thanks,
> >
> > Joel
> >
> >







Re: setting encoding

2002-05-20 Thread redpineseed

> The biggest problem is some cp1252 characters are "private" in the unicode
> byte set.

Those characters may not be in the Unicode character set at all, and that is
the major trouble with processing Chinese.

Convert your native encoding to Unicode (UTF-16) with something like the following lines:

import java.io.*;

// read the cp1252-encoded file and decode it into Java's internal UTF-16 strings
File f = new File("cp1252_input");
FileInputStream tmp = new FileInputStream(f);
BufferedReader brin = new BufferedReader(new InputStreamReader(tmp, "Cp1252"));
String inputString = brin.readLine();

I am not sure your encoding designator is "Cp1252"; check that out in the Java docs.
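
Once the text is in a Java String it can be handed to Lucene like any other
field value, for example (assuming the usual Document/Field classes):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(Field.Text("contents", inputString));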


redpineseed



Re: PDF4J Project: Gathering Feature Requests

2002-05-20 Thread Andrew C. Oliver

But POI is only for formats based on the OLE 2 Compound Document format.

CNew wrote:

>I can't help but notice the similarities between this effort and the new POI
>project at apache (jakharta).
>
>A lot of the language, vision, etc. maps directly... just substitute "Adobe
>PDF" for "Microsoft Excel." and you have a lot of your requirements and
>perhaps your solution architecture.
>
>POI seeks to both read and write.  But on the read-side they use an
>event-callback technique - similar in concept to SAX for XML.  Sounds like a
>nice way to do this sort of thing.
>
>And of course, there is the Lucene connection.  They mention Lucene
>specifically as a project that needs binary level read access to MS Office
>documents.
>
>but I think that PDF is even more mentioned in this list.
>
>- Original Message -
>From: W. Eliot Kimber <[EMAIL PROTECTED]>
>To: Lucene Users List <[EMAIL PROTECTED]>
>Sent: Monday, May 06, 2002 3:58 PM
>Subject: Re: PDF4J Project: Gathering Feature Requests
>
>
>  
>
>>Peter Carlson wrote:
>>
>>
>>>This is very exciting.
>>>
>>>Are you planning on basing the code on other pdf readers / writers?
>>>  
>>>
>>At this point I haven't found any Java PDF reader that meets my
>>requirements. One of the motivations for doing this is the problems we
>>had using Etymon's PJ library: both the license (GPL, not LGPL) and the
>>quality of the code itself, which does not meet our engineering
>>standards. I want to use an LGPL library so that people can use the code
>>in projects that are not themselves open sourced but I want the library
>>itself to be protected.
>>
>>For writing, may or may not be able to leverage existing code, don't
>>know yet.
>>
>>Note too that there are two aspects of writing: creating a valid PDF
>>data stream and creating meaningful page layouts--we are not addressing
>>the second of these (there are lots of libraries that will create useful
>>PDF output from various non-PDF inputs). Our main writing usecase is the
>>rewriting of existing PDFs following some amount of manipulation through
>>our API.
>>
>>A caution: I am still waiting to get approval from my employers to do
>>this work as open source--it may be a while before I can even start on
>>the coding.
>>
>>Cheers,
>>
>>Eliot
>>--
>>W. Eliot Kimber, [EMAIL PROTECTED]
>>Consultant, ISOGEN International
>>
>>1016 La Posada Dr., Suite 240
>>Austin, TX  78752 Phone: 512.656.4139
>>








Re: PDF4J Project: Gathering Feature Requests

2002-05-20 Thread CNew

I can't help but notice the similarities between this effort and the new POI
project at Apache (Jakarta).

A lot of the language, vision, etc. maps directly... just substitute "Adobe
PDF" for "Microsoft Excel." and you have a lot of your requirements and
perhaps your solution architecture.

POI seeks to both read and write.  But on the read-side they use an
event-callback technique - similar in concept to SAX for XML.  Sounds like a
nice way to do this sort of thing.

And of course, there is the Lucene connection.  They mention Lucene
specifically as a project that needs binary level read access to MS Office
documents.

But I think that PDF is mentioned even more often on this list.

- Original Message -
From: W. Eliot Kimber <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Sent: Monday, May 06, 2002 3:58 PM
Subject: Re: PDF4J Project: Gathering Feature Requests


> Peter Carlson wrote:
> >
> > This is very exciting.
> >
> > Are you planning on basing the code on other pdf readers / writers?
>
> At this point I haven't found any Java PDF reader that meets my
> requirements. One of the motivations for doing this is the problems we
> had using Etymon's PJ library: both the license (GPL, not LGPL) and the
> quality of the code itself, which does not meet our engineering
> standards. I want to use an LGPL library so that people can use the code
> in projects that are not themselves open sourced but I want the library
> itself to be protected.
>
> For writing, may or may not be able to leverage existing code, don't
> know yet.
>
> Note too that there are two aspects of writing: creating a valid PDF
> data stream and creating meaningful page layouts--we are not addressing
> the second of these (there are lots of libraries that will create useful
> PDF output from various non-PDF inputs). Our main writing usecase is the
> rewriting of existing PDFs following some amount of manipulation through
> our API.
>
> A caution: I am still waiting to get approval from my employers to do
> this work as open source--it may be a while before I can even start on
> the coding.
>
> Cheers,
>
> Eliot
> --
> W. Eliot Kimber, [EMAIL PROTECTED]
> Consultant, ISOGEN International
>
> 1016 La Posada Dr., Suite 240
> Austin, TX  78752 Phone: 512.656.4139
>







Re: Lucene's scalability

2002-05-20 Thread CNew

you didn't mention the hit count on the query with 50-60 terms.
just wondering if the time was linear.

- Original Message -
From: Armbrust, Daniel C. <[EMAIL PROTECTED]>
To: 'Lucene Users List' <[EMAIL PROTECTED]>
Sent: Friday, May 17, 2002 6:57 AM
Subject: RE: Lucene's scalability


> Currently, we are using a Ultra-80 Sparc Solaris with 4 processors and 4 GB
> of Ram.
>
> However, we are only making use of one of those processors with the index.
> Our biggest speed restriction is the fact that our entire index resides on a
> single disk drive.  We have a raid array coming soon.
>
> The performance has been very impressive, but as you can imagine, the speed
> depends highly on the complexity of the query.  If you run a query with a
> 1/2 a dozen terms and fields, which returns ~30,000 results, it usually
> takes on the order of a second or two.  If you run a query with 50-60 terms,
> it may take 5-6 seconds.
>
> I don't have any better performance stats than this currently.
>
> Dan
>
>
> -Original Message-
> From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 17, 2002 7:23 AM
> To: Lucene Users List
> Subject: Re: Lucene's scalability
>
>
> Hi ,
>
> I am also trying to do a similar thing . I am very eager to know what kind
> of hardware u are using to maintain such a big index.
> In my case it is very important that the search happens very fast . so does
> such a big index of 10 gb pose any problems in this direction
>
> TIA
>
> Regards
> Harpreet
>
>
>
> - Original Message -
> From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Tuesday, April 30, 2002 12:07 AM
> Subject: RE: Lucene's scalability
>
>
> > I currently have an index of ~ 12 million documents, which are each about
> > that size (but in xml form).
> >
> > When they are transformed for lucene to index, there are upwards of 50
> > searchable fields.
> >
> > The index is about 10 GB right now.
> >
> > I have not yet had any problems with "pushing the limits" of lucene.
> >
> > In the next few weeks, I will be pushing my number of indexed documents up
> > into the 15-20 million range.  I can let you know if any problems arise.
> >
> > Dan
> >
> >
> >
> > -Original Message-
> > From: Joel Bernstein [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, April 29, 2002 1:32 PM
> > To: [EMAIL PROTECTED]
> > Subject: Lucene's scalability
> >
> >
> > Is there a known limit to the number of documents that Lucene can handle
> > efficiently?  I'm looking to index around 15 million, 2K docs which contain
> > 7-10 searchable fields. Should I be attempting this with Lucene?
> >
> > Thanks,
> >
> > Joel
> >
> >







RE: identical field names

2002-05-20 Thread Nader S. Henein

Yes it does, but there is a problem with retrieving the values.
I wrote an entry about that subject at this link:
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg01580.html
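
For adding them, something like this works (a minimal sketch; the field name
"folder" and the paths are just examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(Field.Text("contents", "the document body"));
// two values under the same field name; both get indexed, so the document
// can be found under either folder
doc.add(Field.Keyword("folder", "/dir1/dir2/dir3"));
doc.add(Field.Keyword("folder", "/dir1/other"));
// the catch described in the link above is in getting the values back out
// of a retrieved Document, not in indexing them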


-Original Message-
From: Herman Chen [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 20, 2002 2:08 PM
To: Lucene Users List
Subject: identical field names


Hi,

Does Lucene allow Document with more than one Field of the same name but
different values?
I'm wondering about this because I'm trying to implement hierarchical search
as suggested by
the FAQ, i.e. /dir1/dir2/dir3.  But that seems to work only for one folder
assignment for each
Document, I need to support multiple folder assignment to a Document.

Thanks

--
Herman







identical field names

2002-05-20 Thread Herman Chen

Hi,

Does Lucene allow a Document with more than one Field of the same name but
different values?  I'm wondering about this because I'm trying to implement
hierarchical search as suggested by the FAQ, i.e. /dir1/dir2/dir3.  But that
seems to work only for one folder assignment for each Document; I need to
support multiple folder assignments per Document.

Thanks

--
Herman