RE: SpanXXQuery Usage
Terry,

With regular (non-span) queries you cannot require that the results of OR / AND / NOT operations be near one another (i.e. (A OR B) NEAR (C OR D)). The span queries solve that problem by allowing any span query to be used inside a SpanNearQuery (and vice versa). There are other applications for this as well, but this is one of them. Hope that helps to get you started. Examples of their use can be found in the unit tests (TestBasics.java, I believe).

Cheers,
Jochen

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Monday, March 22, 2004 3:37 AM
To: Lucene Users List
Subject: Re: SpanXXQuery Usage

Otis,

Can you give me/us a rough idea of what these are supposed to do? It's hard to extrapolate the terse unit test code into much of a general notion. I searched the archives with little success.

Regards,
Terry

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 22, 2004 2:46 AM
Subject: Re: SpanXXQuery Usage

Only in unit tests, so far.

Otis

--- Terry Steichen [EMAIL PROTECTED] wrote:

Is there any documentation (other than that in the source) on how to use the new SpanXXQuery features? Specifically: SpanNearQuery, SpanNotQuery, SpanFirstQuery and SpanOrQuery?

Regards,
Terry

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
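For what it's worth, the "(A OR B) NEAR (C OR D)" combination described above can be sketched as follows. This is a sketch against the span-query API of that era; the field name, terms, and slop value are made up for illustration.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanComboSketch {
    // (A OR B) NEAR (C OR D): any span query can be nested in a SpanNearQuery.
    public static SpanQuery build(String field) {
        SpanQuery aOrB = new SpanOrQuery(new SpanQuery[] {
                new SpanTermQuery(new Term(field, "a")),
                new SpanTermQuery(new Term(field, "b")) });
        SpanQuery cOrD = new SpanOrQuery(new SpanQuery[] {
                new SpanTermQuery(new Term(field, "c")),
                new SpanTermQuery(new Term(field, "d")) });
        // Match when the two groups occur within 5 positions of each other,
        // in any order (inOrder = false).
        return new SpanNearQuery(new SpanQuery[] { aOrB, cOrD }, 5, false);
    }
}
```

Since SpanNearQuery clauses are themselves SpanQuery instances, the result can be nested again inside another span query, which is the "vice versa" part above.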
Query: A ? B
Hi Everyone,

I am trying to figure out how to create a query that matches: A ? B, where ? is exactly one token. Can anyone tell me how to do that?

Obviously it's easy to match 'A * B' where '*' is 0 or 1 tokens (just use a PhraseQuery and set the slop to 1). But what if I require exactly one word/token between 'A' and 'B'?

BTW, I know a very clumsy way of doing this, but I really don't like it: for each indexed token, insert a marker token (for example 'X') at the same token position. Then the query would be: A X B, and everybody (except the indexing performance, as well as the size on disk) would be happy.

There's got to be an easier way. Right?

Thanks in advance!
Jochen
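The easy 'A * B' case mentioned above, for reference. This is a sketch only (the field name is illustrative), and the comment spells out why it does not solve the "exactly one token" problem.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SloppyPhraseSketch {
    // Matches "a b" as well as "a <one token> b" - note that it cannot
    // require *exactly* one intervening token, which is the problem
    // raised in this thread.
    public static PhraseQuery build(String field) {
        PhraseQuery q = new PhraseQuery();
        q.add(new Term(field, "a"));
        q.add(new Term(field, "b"));
        q.setSlop(1); // slop is an upper bound on the gap, not an exact distance
        return q;
    }
}
```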
RE: Query: A ? B
Otis,

Maybe I don't understand this right, but I *think* I am looking for something different. I am trying to write a query like this: my * house, which should match "my own house", "my red house", "my small house", but should not match "my house" ... you get the idea.

If I am not mistaken, a wildcard query only works if the wildcard is within a word (or token): it would allow me to do things like g* matching green, great, etc. I don't know how to make that work for multi-word scenarios. Here is what I tried with WildcardQuery in the unit test (TestBasics):

    Query query = new WildcardQuery(new Term(field, "six hundred * five"));

Thanks!
Jochen

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 04, 2004 12:00 PM
To: Lucene Users List
Subject: Re: Query: A ? B

Use WildcardQuery: A?B

Otis
RE: Query: A ? B
I think I know my way around the Span feature reasonably well ... and I don't think it can be used for what I want to do. But I would love to be proven wrong on this one. :)

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 04, 2004 1:52 PM
To: Lucene Users List
Subject: Re: Query: A ? B

Right, Otis was confused by what you were asking. Google supports what you are asking for, I believe, although I don't recall if an '*' indicates one or more tokens or just one. As far as I know, there is no easy way to do the exact-distance matching you desire. You could always clone the PhraseQuery stuff into a custom Query that uses an == instead of a <= for the slop. You'll also need to tweak it to disallow reversed terms, since slop permits out-of-order terms as well. Maybe the new span feature can do this?

Erik
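The marker-token workaround Jochen describes could be sketched as a TokenFilter. This is a sketch against the old Lucene 1.x TokenStream API (where next() returns one Token at a time); the class name and the "X" marker text are made up for illustration.

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// After every real token, emit a marker token "X" at the same position
// (position increment 0), so the phrase query "A X B" can only match when
// exactly one real token sits between A and B.
public class MarkerTokenFilter extends TokenFilter {
    private Token pendingMarker = null;

    public MarkerTokenFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pendingMarker != null) {           // flush the buffered marker first
            Token m = pendingMarker;
            pendingMarker = null;
            return m;
        }
        Token t = input.next();
        if (t == null) return null;            // end of stream
        Token marker = new Token("X", t.startOffset(), t.endOffset());
        marker.setPositionIncrement(0);        // co-located with the real token
        pendingMarker = marker;
        return t;
    }
}
```

As the thread notes, this roughly doubles the number of postings, which is why it hurts indexing speed and index size.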
RE: Lucene scalability/clustering
Anson,

One way of doing it is to keep subsets of your indexes/data on different machines. Each machine indexes its own data. You then implement a layer that distributes queries to the various machines and merges the results back together. How well this works depends entirely on your implementation of the distributed search. I believe there was a discussion about implementing this with a MultiSearcher somewhere as well.

Cheers!
Jochen

-----Original Message-----
From: Anson Lau [mailto:[EMAIL PROTECTED]
Sent: Sunday, February 22, 2004 2:17 PM
To: 'Lucene Users List'
Subject: RE: Lucene scalability/clustering

Further on this topic - has anyone tried implementing a distributed search with Lucene? How does it work, and does it work well?

Anson

-----Original Message-----
From: Hamish Carpenter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 23, 2004 5:24 AM
To: Lucene Users List
Subject: Re: Lucene scalability/clustering

Hi All,

I'm Hamish Carpenter, who contributed the benchmarks with the comment about the IndexSearcherCache. Using this solved our issues with too many open files under Linux. The original IndexSearcherCache email is here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01967.html

See here for a copy of the above message and a download link (the mailing list doesn't like attachments; the source is 10K in size):
http://www.geocities.com/haytona/lucene/

HTH,
Hamish Carpenter

[EMAIL PROTECTED] wrote:
BTW, where can I get Peter Halacsy's IndexSearcherCache?
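A minimal sketch of the MultiSearcher approach mentioned above, assuming one local IndexSearcher per index subset. The index paths and field name are hypothetical; in a real deployment the shards could instead be RMI proxies pointing at other machines.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class DistributedSearchSketch {
    public static Hits search(String[] indexPaths, String queryText) throws Exception {
        // One searcher per index subset.
        Searchable[] shards = new Searchable[indexPaths.length];
        for (int i = 0; i < indexPaths.length; i++) {
            shards[i] = new IndexSearcher(indexPaths[i]);
        }
        // MultiSearcher fans the query out and merges the hits
        // (renumbering document ids) across the shards.
        MultiSearcher searcher = new MultiSearcher(shards);
        Query q = QueryParser.parse(queryText, "body", new StandardAnalyzer());
        return searcher.search(q);
    }
}
```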
Benchmark (WAS: Indexing Speed: Documents vs. Sentences)
Hello,

Here is a benchmark. I am not sure if that is proper etiquette, but I will just paste it into this mail and hope that it gets funneled into the right channels.

Cheers!
Jochen

Hardware Environment
- Dedicated machine for indexing: no, some other work performed on it; shouldn't influence results much since it's a multiple-processor machine
- CPU: 2x Intel Xeon 3.05GHz
- RAM: 4GB
- Drive configuration: SCSI

Software environment
- Java Version: 1.4.2-b28
- Java VM: Java HotSpot Client VM 1.4.2
- OS Version: Redhat 8
- Location of index: local

Lucene indexing variables
- Number of source documents: 5,000,000
- Total filesize of source documents: 40GB
- Average filesize of source documents: 8kB
- Source documents storage location: DB on remote server
- File type of source documents: pre-parsed HTML
- Parser(s) used, if any: n/a
- Analyzer(s) used: StandardAnalyzer
- Number of fields per document: 5
- Type of fields: actual text is indexed but not stored in the Lucene index
- Index persistence: Where the index is stored, e.g. FSDirectory, SqlDirectory, etc.

Figures
- Time taken (in ms/s as an average of at least 3 indexing runs): 332 minutes
- Time taken / 1000 docs indexed: 4 sec
- Memory consumption: about 100MB

Notes
- With the above configuration we pretty consistently achieve an indexing rate of 250 docs/sec. The actual text cannot be retrieved from the index; this keeps the index size down (6.1GB) and increases indexing speed. When the actual documents are stored in the index, the rate drops by about 30% to 160 docs/sec.
FW: Indexing Speed: Documents vs. Sentences
Stephane,

The indexing is actually less glamorous than it sounds. When you index 1TB across 10 machines you end up with 100GB on each machine. We do not merge the indexes either, since we get better speed on indexing as well as querying when we keep the indexes smaller and distributed across different machines. (But somehow I think that I'll sit down, merge all of them together, and play with it when I get a chance ... 'cause it's cool :-) I'll keep you posted when it happens.) The test set that I am playing with is 40GB, and I just posted a benchmark.

Best,
Jochen

-----Original Message-----
From: Stephane Vaucher [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 18, 2003 9:01 AM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: Indexing Speed: Documents vs. Sentences

Jochen,

If you have a bit of time, could you post some metrics? (As an example, you can look at http://jakarta.apache.org/lucene/docs/benchmarks.html.) I haven't heard of anyone indexing 1TB yet. I'm sure everyone is interested in problems you could be facing, and we could probably give you some ideas. I know (oddly enough) I sometimes wish I had a dataset greater than a few M docs to experiment with.

cheers,
sv
Sentence Endings: IndexWriter.maxFieldLength and Token.setPositionIncrement()
Hi!

I hope this is the right forum for this post. I was wondering if other people would consider this a bug (it might be a feature and I am missing the point of it):

- The default IndexWriter.maxFieldLength is 10,000.
- The point of maxFieldLength is to limit memory usage.
- The current position (which is compared against maxFieldLength) is essentially determined by the sum of the position increments of all Tokens added to the index.

Why does this matter? If you use setPositionIncrement(1000) for sentence-ending tokens, only the first 10 sentences of your document will be indexed; the rest will not be searchable (since the position will be greater than 10,000).

Why I think this is a bug: if you skip 1000 positions, no memory is required by the DocumentWriter for the 999 empty positions, so maxFieldLength ends up limiting the available positions rather than the memory usage.

I suggest adding a counter to DocumentWriter that counts the actual number of tokens in the postingTable (probably in DocumentWriter.addPosition), so that maxFieldLength is compared against the number of actual entries, not the number of actual entries plus the number of skipped entries.

Best,
Jochen

PS: Please let me know if this is the wrong forum for this so I'll post to the right one next time.
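The sentence-ending trick referred to above could look roughly like this. It is a sketch against the old Lucene 1.x Token API; the marker text and factory method are made up for illustration, and only the increment value (1000) comes from the post.

```java
import org.apache.lucene.analysis.Token;

public class SentenceGapSketch {
    // Build a sentence-boundary token that leaves a large positional gap,
    // so phrase/span matches cannot cross sentence boundaries.
    public static Token sentenceBoundary(int start, int end) {
        Token t = new Token("_SENT_", start, end);
        // The position jumps by 1000 at each sentence boundary. With ten
        // such jumps, the position exceeds the default maxFieldLength of
        // 10,000 - the interaction described above.
        t.setPositionIncrement(1000);
        return t;
    }
}
```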
RE: Indexing Speed: Documents vs. Sentences
Hi,

Yes, this is correct, I am dealing with a few 100GB (close to 1TB). I am, however, distributing the data across several machines and then merging the results from all the machines together (until I find a better, faster solution).

Cheers!

-----Original Message-----
From: Victor Hadianto [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 10:50 PM
To: Lucene Users List
Subject: Re: Indexing Speed: Documents vs. Sentences

> Hi, I am using Lucene to index a large number of web pages (a few 100GB)
> and the indexing speed is great. Jochen

.. a few 100 GB? Is this correct?

/victor
Indexing Speed: Documents vs. Sentences
Hi,

I am using Lucene to index a large number of web pages (a few 100GB) and the indexing speed is great. Lately I have been trying to index on the sentence level, not the document level. My problem is that the indexing speed has gone down dramatically, and I am wondering if there is any way for me to improve on that.

Indexing on the sentence level, the overall amount of data stays the same while the number of records increases substantially (since there are usually many sentences to one web page). It seems to me that the indexing speed (everything else being the same) depends largely on the number of Documents inserted into the index, and not so much on the size of the data within the documents (correct?).

I have played with the merge factor, using RAMDirectory, etc., and I am quite comfortable with our overall configuration, so my guess is that that is not the issue (and I am QUITE happy with the indexing speed as long as I use complete pages and not sentences). Maybe there is a different way of attacking this?

My goal is to be able to execute a query and get the sentences that match the query in the most efficient way, while maintaining good/great indexing speed. I would prefer not having to search the complete document for the sentence in question.

My current solution is to have one Lucene Document for each page (containing the URL and other information I require) that does NOT contain the text of the page. Then I have one Lucene Document for each sentence within that document, which contains the text of this particular sentence in addition to some identifying information that references the entry of the page itself.

Any and all suggestions are welcome. Thanks!
Jochen
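The page-plus-sentence layout described above could be sketched as follows. The field names and the pageId scheme are made up for illustration; Field.UnStored matches the "indexed but not stored" choice from the benchmark post.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SentenceIndexSketch {
    // One Document per page (no text), plus one Document per sentence that
    // points back to the page. Because each sentence is its own Document,
    // a phrase query can never match across a sentence boundary.
    public static void indexPage(IndexWriter writer, String pageId,
                                 String url, String[] sentences) throws Exception {
        Document page = new Document();
        page.add(Field.Keyword("pageId", pageId));
        page.add(Field.Keyword("url", url));
        writer.addDocument(page);

        for (int i = 0; i < sentences.length; i++) {
            Document sent = new Document();
            sent.add(Field.Keyword("pageId", pageId));          // back-reference
            sent.add(Field.Keyword("pos", String.valueOf(i)));  // sentence order
            sent.add(Field.UnStored("text", sentences[i]));     // indexed, not stored
            writer.addDocument(sent);
        }
    }
}
```

The trade-off discussed in the thread is visible here: the number of addDocument calls grows from one per page to one per sentence, which is where the indexing slowdown comes from.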
RE: Indexing Speed: Documents vs. Sentences
Hi!

In essence:

1) I don't care about the whole page.
2) I only care about the actual sentence that matches the query.
3) I want the matching for the query to happen only within one sentence, and not over sentence boundaries (even when I do a PhraseQuery with some slop). The query "i like the beach"~20 should not match: "And we go to the restaurant and i really like it. the beach was wonderful as well."
4) I would much prefer not to parse the actual page to find the sentence that matches the query (though I obviously will, if I have to).

Does that answer your question?

Thanks!
Jochen

-----Original Message-----
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:19 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

I'm confused about something - what's the point of creating a document for every sentence?
RE: Indexing Speed: Documents vs. Sentences
Dan,

I will send you a separate e-mail directly to your address. In the meanwhile, I hope to get input from other people. Maybe someone else knows how to solve my original problem.

Thanks!
Jochen

-----Original Message-----
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:36 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

When you parse the page you can prevent sentence-boundary hits from matching your criteria.

-----Original Message-----
From: Jochen Frey [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 4:34 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

Right. However, even if I do that, my problem #3 remains unsolved: I do not wish to match phrases across sentence boundaries. Anyone have a neat solution (or pointers to one)?

Thanks again!
Jochen

-----Original Message-----
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:29 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

Yeah. I'd suggest parsing the page, unfortunately. :)