Re: pdfbox performance.

2004-07-28 Thread Tatu Saloranta
On Wednesday 28 July 2004 15:44, Paul Smith wrote:
 The first thing that I would do is wrap the FileInputStream with a
 BufferedInputStream.

 You get a significant boost reading in from a buffer, particularly as the
 size of the file grows.

Benchmarking is good; whether there's any significant performance difference 
depends on how the app reads data from the stream. Most high-performance apps 
read straight into a local buffer, in which case BufferedInputStream offers 
nothing but buffer overhead... :-)
You only get a nice performance improvement if most reads are done using the 
single-byte read() method, though, so there's always a chance I guess.
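The difference between the two read patterns can be sketched as follows; this is an illustrative JDK-only snippet (class and file names are invented), not PDFBox code. With a plain FileInputStream every read() is a system call, while BufferedInputStream serves most single-byte reads from its internal buffer:

```java
import java.io.*;

public class BufferedReadDemo {
    // Count bytes using single-byte read() calls. Unbuffered, each read()
    // goes to the OS; buffered, most reads come from the in-memory buffer.
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".bin");
        tmp.deleteOnExit();
        try (OutputStream out = new FileOutputStream(tmp)) {
            out.write(new byte[10_000]);
        }
        // Unbuffered: every read() hits the OS.
        try (InputStream in = new FileInputStream(tmp)) {
            System.out.println(countBytes(in)); // 10000
        }
        // Buffered: same result, far fewer system calls.
        try (InputStream in = new BufferedInputStream(new FileInputStream(tmp))) {
            System.out.println(countBytes(in)); // 10000
        }
    }
}
```

If the application instead reads into its own byte[] with read(byte[]), the extra buffer buys little, which is the point made above.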

-+ Tatu +-

 Try that first, and then rebenchmark.
 Cheers
 Paul Smith

  -Original Message-
  From: Miroslaw Milewski [mailto:[EMAIL PROTECTED]
  Sent: Thursday, July 29, 2004 7:24 AM
  To: [EMAIL PROTECTED]
  Subject: pdfbox performance.
 
 
Hi,
 
I have a serious performance problem while extracting text from pdf.
 
Here is the code (w/o try/catch blocks):
 
File file = new File("test.pdf");
FileInputStream reader = new FileInputStream(file);
 
PDFParser parser = new PDFParser(reader);
parser.parse();
PDDocument pdDoc = parser.getPDDocument();
 
PDFTextStripper stripper = new PDFTextStripper();
String pdftext = stripper.getText(pdDoc);
 
pdDoc.close();
 
Now, the whole process takes:
- 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
- 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
- 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
- 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
 
Now, I can't really get the point here. Is this performance standard
  for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
  or maybe the pdf docs (text only, the last one with some UML diags.)
 
I am writing a knowledge base system at the moment, and planned to do
  real-time text extraction and indexing (using Lucene.) But this is not
  realistic, considering the extraction time.
Then maybe it is a better idea to run the extraction and indexing once
  every 24 h, processing all the documents added during that period.
 
TIA for any comments/suggestions.
 
  --
  Miroslaw Milewski
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]






Re: Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Tatu Saloranta
On Sunday 11 July 2004 10:03, Doug Cutting wrote:
 Doug Cutting wrote:
  The calls would look like:
 
  new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);
 
.
 Actually, while we're at it, Indexed and Tokenized are confounded.  A
 single entry would be better, something like:
...
 then calls would look like just:

 new Field(name, value, Store.YES, Index.TOKENIZED);
...
 and adding a boolean clause would look like:

 booleanQuery.add(new TermQuery(...), Occur.MUST);

 Then we can deprecate the old methods.

 Comments?

I was about to suggest this, instead of int/boolean constants, since it is a 
recommended good practice and allows better type safety (until JDK 1.5's 
real enums, at least). I would prefer this over un-typesafe constants, although 
even just defining and using simple constants would in itself be an improvement 
over the existing situation.

Another possibility (or maybe complementary approach) would be to just 
completely do away with constructor access; make the constructors private or 
protected, and only allow factory methods to be used externally. This would 
have the benefit of even better readability: minimum number of arguments 
(method name would replace one or two args) and full type checking. Plus it'd 
be easier to modify implementations should that become necessary. Factory 
methods are especially useful for classes like Field, that are not designed 
to be sub-classed.
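The combination suggested above, the pre-1.5 "typesafe enum" idiom plus private constructors with named factory methods, can be sketched like this. All class and method names here are invented for illustration; this is not Lucene's actual Field API:

```java
// Hypothetical sketch: typesafe constants plus factory methods.
public final class FieldSketch {
    // Typesafe constant: only the instances defined here can ever exist,
    // so the compiler rejects a boolean or int passed by mistake.
    public static final class Store {
        public static final Store YES = new Store("YES");
        public static final Store NO = new Store("NO");
        private final String name;
        private Store(String name) { this.name = name; }
        public String toString() { return name; }
    }

    private final String name;
    private final String value;
    private final Store store;

    // Constructor is private: callers must go through factories,
    // whose names document intent better than a list of flags.
    private FieldSketch(String name, String value, Store store) {
        this.name = name;
        this.value = value;
        this.store = store;
    }

    public static FieldSketch stored(String name, String value) {
        return new FieldSketch(name, value, Store.YES);
    }

    public static FieldSketch unstored(String name, String value) {
        return new FieldSketch(name, value, Store.NO);
    }

    public boolean isStored() { return store == Store.YES; }

    public static void main(String[] args) {
        FieldSketch f = FieldSketch.stored("title", "Lucene in Action");
        System.out.println(f.isStored()); // true
    }
}
```

The factory names (stored/unstored) replace one or two arguments each, which is the readability benefit described above.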

-+ Tatu +-





Re: Bridge with OpenOffice

2004-04-19 Thread Tatu Saloranta
On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
 Stephane James Vaucher wrote:
  Anyone try what Joerg suggested here?
  http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
 pache.orgmsgNo=6231

 Dont know what you would like to do, but if you simply would like to
 extract text, you could simply try this sniplet:

This leads to a question I have been thinking about; it seems this thread 
originally started with someone pointing out that OO can be used as a converter 
from other formats... but how about a tokenizer for native OO documents? I have 
written full-featured converters from OO to (simplified) DocBook and HTML, and 
creating one that just tokenizes for use by Lucene would be much easier. 
Even if it were to tokenize into separate fields (document metadata, content, 
maybe bibliography separately, etc.), it'd be easy to do.

Would anyone find a full-featured, customizable OpenOffice document tokenizer 
useful?

-+ Tatu +-






Re: Suggestion for Token.java

2004-04-13 Thread Tatu Saloranta
On Tuesday 13 April 2004 15:31, Holger Klawitter wrote:
 Hi Erik,

  What is wrong with simply creating a new token that replaces an
  incoming one for synonyms?
  I'm just playing devil's advocate here since you can already get
  the termText() through the public _method_.

 Well, you're right; I forgot about cloning, but ... (Lords advocate :-)

 1.) Cloning implies the need to change filters whenever the fields in Token
 change.

On the other hand, one needs to be sure that no other code assumes Tokens are 
immutable. For example, if they weren't, one couldn't reliably use tokens in 
Sets or Maps (not sure if it's useful to do that; just an example).

I guess it's really a matter of whether tokens were designed to be immutable 
(which often makes sense for objects like this), or whether they just happen to 
be, due to the lack of modifier method(s).
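The Set/Map hazard mentioned above can be shown concretely. FakeToken here is an invented, simplified stand-in (not Lucene's Token): once a field that feeds hashCode() changes after insertion, the HashSet can no longer find the object it still contains:

```java
import java.util.HashSet;
import java.util.Set;

public class MutableKeyDemo {
    // Hypothetical mutable token-like class; hashCode depends on
    // mutable state, which is exactly the problem.
    static final class FakeToken {
        int position;
        FakeToken(int position) { this.position = position; }
        public boolean equals(Object o) {
            return o instanceof FakeToken && ((FakeToken) o).position == position;
        }
        public int hashCode() { return position; }
    }

    // Returns true when the set loses track of a key mutated after insertion.
    static boolean lookupFailsAfterMutation() {
        Set<FakeToken> set = new HashSet<>();
        FakeToken t = new FakeToken(0);
        set.add(t);
        boolean foundBefore = set.contains(t); // looked up in the bucket for 0
        t.position = 1;                        // mutate after insertion
        boolean foundAfter = set.contains(t);  // now hashes to a different bucket
        return foundBefore && !foundAfter;
    }

    public static void main(String[] args) {
        System.out.println(lookupFailsAfterMutation()); // true
    }
}
```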

-+ Tatu +-





Re: Zero hits for queries ending with a number

2004-04-03 Thread Tatu Saloranta
On Saturday 03 April 2004 08:34, [EMAIL PROTECTED] wrote:
 On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
  No objections that error messages and such could be made clearer.
  Patches welcome!  Care to submit better error message handling in this
  case?  Or perhaps allow lower-case to?

 I think the best would be if Lucene would simply have a
 setCaseSensitive(boolean).

 IMHO it's in any case a bad idea to make searches case-sensitive (per
 default).

I'd have to disagree. I think that the search engine core should not have to 
bother with details of character sets, such as lower-casing. Rules for 
lower/upper/initial/mixed case across all Unicode languages are rather 
involved... and if you tried to handle that, the next question would be whether 
accentuation and umlaut marks should matter or not (which is 
language-dependent). That's why, to me, the natural way to go is for the core 
to do direct comparison when executing queries, without any case folding. This 
does not prevent anyone from implementing such functionality (see below).
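A small JDK-only illustration of why case folding is language-dependent: Turkish distinguishes dotted and dotless 'i', so even plain lowercasing of ASCII gives different results under different locales.

```java
import java.util.Locale;

public class CaseFoldingDemo {
    public static void main(String[] args) {
        String s = "TITLE";
        // English: plain dotted lowercase i.
        System.out.println(s.toLowerCase(Locale.ENGLISH));         // title
        // Turkish: 'I' lowercases to dotless \u0131, a different word.
        System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // t\u0131tle
    }
}
```

This is the kind of rule an analyzer chosen per-language can get right, and a one-size-fits-all core cannot.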

I think the architecture and design of the Lucene core is delightfully simple. 
One can easily create case-insensitive functionality by using the proper 
analyzers and (for the most part) configuring QueryParser. I would agree, 
however, that QueryParser is a victim of its own success; it's too often used 
in situations where one really should create a proper GUI that builds the 
query. Backend code can then mangle input as it sees fit and build the query 
objects.
QueryParser is more natural for quick-n-dirty scenarios, where one just has to 
slap something together quickly, or when one only has a textual interface to 
deal with. It's a nice thing to have, but it has its limitations; there's no 
way to create one parser that's perfect for every use(r).

What could be done would be to make sure all examples / demo web apps 
implement case-insensitive indexing and searching, since that is often what 
is needed?

-+ Tatu +-


  But, also, folks need to really step back and practice basic
  troubleshooting skills.  I asked you if that string was what you passed
  to the QueryParser and you said yes, when in fact it was not.  And you

 I forgot that I did lower-case it. I fact I even output it in it's original
 state but lower-case it just before I pass it to lucene. That lower-casing
 is what I would call a hack and hence it's no surprise that I forgot it :-)

 Timo






Re: Performing exact search with Lucene

2004-04-02 Thread Tatu Saloranta
On Friday 02 April 2004 08:12, Phil brunet wrote:
 Hi all.

 I'm migrating a part of an application from Oracle intermedia to Lucene
 (1.3) to perform full text searches.

Congratulations! :-)

 I'd like to know if there is a way to perform exact queries. By exact
 query, i mean beeing able to match ONLY document that are exactely equals
 to the terms of the query.

I believe plain old PhraseQuery does exactly that? You can build one yourself 
or, using QueryParser, use something like

+"this is an example"

(making sure you use the correct analyzer, depending on whether you want 'an' 
to be a significant token in there).
Note, too, that the '+' prefix is not strictly needed if you don't have 
multiple parts to the query; even without it, only documents containing that 
exact phrase would be considered.

-+ Tatu +-


 Exemple:

 document 1 = "this is an example"
 document 2 = "this is an example of document"
 document 3 = "this is an other example"

 Is it possible to match ONLY document 1 if i search for "this is an
 exemple" ?

 Currently, i'm trying to override the DefaultSimilarity class in order to
 be be able to deduce an exact match from the score.

 My query consists in a BooleanQuery composed by n TermQuery.

 I know i can develop by myself a post filter that could count compare the
 number of tokens of the query and the number of tokens of the indexed
 document. But i would like to know if there is a proper way to do this :
 - directly with Lucene (i.e. a Lucene query that would match only document
 1 in my example)
 - by redefining the Similarity and so by interpreting the scores
 - any idea 

 Thanks.

 Philippe








Re: Caching and paging search results

2004-03-08 Thread Tatu Saloranta
On Monday 08 March 2004 12:34, Erik Hatcher wrote:
 In the RealWorld... many applications actually just re-run a search and
 jump to the appropriate page within the hits searching is generally
 plenty fast enough to alleviate concerns of caching.

 However, if you need to cache Hits, you need to be sure to keep around
 the originating IndexSearcher as well.

Further, oftentimes the search index only contains a key to the actual content 
indexed (which itself is stored as a file, in a database, or similar)... so 
it's enough to cache just the set of such ids, not the actual search result 
objects. And assuming the ids are simple (an int id, a short String), such 
information can be stored in, say, the user session.
In the system I'm working on, we store up to 500 hits, keeping only the 
document id (int) and hit quality (byte) in the session.
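A sketch of that kind of compact, session-friendly cache; the class and method names are invented for illustration. Two parallel primitive arrays are far lighter than keeping the Hits object (and its IndexSearcher) alive:

```java
// Hypothetical cache of search results: ids and quality only.
public class HitCache {
    private final int[] docIds;
    private final byte[] quality;

    public HitCache(int[] docIds, byte[] quality) {
        if (docIds.length != quality.length) {
            throw new IllegalArgumentException("parallel arrays must match");
        }
        this.docIds = docIds;
        this.quality = quality;
    }

    public int size() { return docIds.length; }
    public int docId(int rank) { return docIds[rank]; }
    public byte quality(int rank) { return quality[rank]; }

    // Return the ids for one page of results, for paging through hits.
    public int[] page(int pageNo, int pageSize) {
        int from = pageNo * pageSize;
        int to = Math.min(from + pageSize, docIds.length);
        int[] out = new int[Math.max(0, to - from)];
        System.arraycopy(docIds, from, out, 0, out.length);
        return out;
    }
}
```

For 500 hits this is roughly 500 * 5 bytes of payload, cheap enough to keep per user session.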

-+ Tatu +-


 A stateful session bean could be used, but I'd opt for a much simpler
 solution as a first pass, such as the first point of just re-running a
 search from scratch.

   Erik

 On Mar 8, 2004, at 2:14 PM, Clandes Tino wrote:
  Hi all,
  could someone describe his expirience in
  implementation of caching, sorting and paging search
  results.
  Is Stateful Session bean appropriate for this?
  My wish is to obtain all search hits only in first
  call, and after that, to iterate through Hit
  Collection and display cached results.
  I have checked SearchBean in contribution section, but
  it does not provide real caching and paging.
 
  Regards and thanx in advance!
  Milan
 
 
 
 
 
 
 






Re: Vector - LinkedList for performance reasons...

2004-01-21 Thread Tatu Saloranta
On Wednesday 21 January 2004 08:38, Doug Cutting wrote:
 Francesco Bellomi wrote:
  I agree that synchronization in Vector is a waste of time if it isn't
  required,

 It would be interesting to see if such synchronization actually impairs
 overall performance significantly.  This would be fairly simple to test.

True. At the same time, it's questionable whether there's any benefit to not 
changing it to ArrayList. However:


  but I'm not sure if LinkedList is a better (faster) choice than
  ArrayList.

 Correct.  ArrayList is the substitute for Vector.  One could also try
 replacing Hashtable with HashMap in many places.

Yes, LinkedList is pretty much never more efficient, or even as efficient (in 
either memory or performance), as ArrayList. The arraycopy needed when doubling 
the size (which happens seldom enough as the list grows) is negligible compared 
to the increased GC activity and memory usage of the entries in LinkedList (an 
object overhead of roughly 24 bytes for each entry, plus alloc/GC).
And obviously indexed access is hideously slow, if that's needed. I've yet to 
find any use for LinkedList; it'd make sense to have some sort of combination 
(a segmented array list, i.e. a linked list of arrays) for huge lists... but 
LinkedList just isn't useful even there.
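The indexed-access point is easy to demonstrate with plain JDK collections: the sums below are identical, but ArrayList.get(i) is O(1) while LinkedList.get(i) walks the chain from the nearest end, making the loop quadratic overall.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListAccessDemo {
    // Sum a list via indexed access: O(n) total for ArrayList,
    // O(n^2) total for LinkedList.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < 10_000; i++) {
            array.add(i);
            linked.add(i);
        }
        long t0 = System.nanoTime();
        long a = sumByIndex(array);
        long t1 = System.nanoTime();
        long b = sumByIndex(linked);
        long t2 = System.nanoTime();
        System.out.println(a == b); // true: same contents, only cost differs
        System.out.printf("ArrayList %d us, LinkedList %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```

(Timings vary by JVM, so the example asserts only on the results, not the clock.)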

...
 My hunch is that the speedup will not be significant.  Synchronization
 costs in modern JVMs are very small when there is no contention.  But
 only measurement can say for sure.

Apparently JDK 1.4 specifically brought significant improvements there, 
reducing the cost of synchronization.

-+ Tatu +-


 Doug







Re: Performance question

2004-01-08 Thread Tatu Saloranta
On Wednesday 07 January 2004 20:48, Dror Matalon wrote:
 On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
...
  Thanks for the suggestions.  I wonder how much faster I can go if I
  implement some of those?

 25 msecs to insert a document is on the high side, but it depends of
 course on the size of your document. You're probably spending 90% of
 your time in the XML parsing. I believe that there are other parsers
 that are faster than xerces, you might want to look at these. You might
 want to look at http://dom4j.org/.

I think more significant than the choice between DOM and some other 
full-document in-memory parser is whether to use a streaming (usually 
event-based) parser, such as one based on SAX. These are generally an order of 
magnitude faster, at least for bigger documents. Fortunately many standard XML 
parsers can work as both DOM and SAX parsers (I believe Xerces does, in any 
case).

It's a bit more cumbersome to use event-based parsers (push vs. pull; you need 
to explicitly keep track of the current subtree if parent tag order matters), 
but from a performance perspective (memory usage, speed) it may be worth it.
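A minimal SAX sketch using only the standard javax.xml/org.xml.sax APIs: character content is collected as events stream by, so the whole document never needs to be materialized as a tree.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextExtractor {
    // Collect all character content from an XML document in one pass.
    public static String extractText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                extractText("<doc><title>Hi</title><body>there</body></doc>"));
        // Hithere
    }
}
```

For indexing, the handler would feed each characters() chunk straight to the analyzer instead of buffering it.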

-+ Tatu +-





Re: Lock obtain timed out

2003-12-16 Thread Tatu Saloranta
On Tuesday 16 December 2003 03:37, Hohwiller, Joerg wrote:
 Hi there,

 I have not yet got any response about my problem.

 While debugging into the depth of lucene (really hard to read deep insde) I
 discovered that it is possible to disable the Locks using a System
 property.
...
 Am I safe disabling the locking???
 Can anybody tell me where to get documentation about the Locking
 strategy (I still would like to know why I have that problem) ???

 Or does anybody know where to get an official example of how to
 handle concurrent index modification and searches?

One problem I have seen, and am still trying to solve, is that if my web app
is terminated (running from the console during development, ctrl+c on unix),
sometimes a commit.lock file is left behind. The problem is that, apparently, 
the method that checks whether there is a lock (so it can subsequently be 
removed via the API) doesn't consider that file to be the lock (sorry for not 
having details; I'm writing this from home without the source). So I'll 
probably see if disabling locks gets rid of this lock file (as I never have 
multiple writers, or even a writer and a reader, working on the same index... 
I always make a full file copy of the index before doing incremental updates), 
or physically delete commit.lock when starting the app, if necessary.

The problem I describe above happens fairly infrequently, but that's actually 
what makes it worse... our QA people (on a different continent) have been 
bitten by it a couple of times. :-/
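The "delete the stale lock at startup" workaround can be sketched with plain java.io (class name invented). This is only safe under the assumption stated in the mail: no other process, writer, or reader is using the index when the app starts.

```java
import java.io.File;

public class StaleLockCleaner {
    // Remove a leftover commit.lock from the index directory, before
    // any writer or reader is opened. Returns true if a lock was deleted.
    public static boolean removeStaleLock(File indexDir) {
        File lock = new File(indexDir, "commit.lock");
        return lock.exists() && lock.delete();
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(System.getProperty("java.io.tmpdir"), "index-demo");
        dir.mkdirs();
        File lock = new File(dir, "commit.lock");
        lock.createNewFile(); // simulate a crash that left the lock behind
        System.out.println(removeStaleLock(dir)); // true
        System.out.println(lock.exists());        // false
    }
}
```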

-+ Tatu +-





Re: Index and Field.Text

2003-12-05 Thread Tatu Saloranta
On Friday 05 December 2003 10:45, Doug Cutting wrote:
 Tatu Saloranta wrote:
  Also, shouldn't there be at least 3 methods that take Readers; one for
  Text-like handling, another for UnStored, and last for UnIndexed.

 How do you store the contents of a Reader?  You'd have to double-buffer
 it, first reading it into a String to store, and then tokenizing the
 StringReader.  A key feature of Reader values is that they're streamed:

Not really; you can pass the Reader to the tokenizer, which then reads and 
tokenizes directly (I think that's how the code works, too). Internally a 
String is read using a StringReader anyway, so passing a String looks more 
like a convenience feature?

 the entire value is never in RAM.  Storing a Reader value would remove
 that advantage.  The current API makes this explicit: when you want
 something streamed, you pass in a Reader, when you're willing to have
 the entire value in memory, pass in a String.

I guess for things that are both tokenized and stored, passing a Reader can't 
really help a lot; to reduce memory usage, the text would need to be read 
twice, or the analyzer would need to help in writing the output; otherwise the 
text needs to be read into memory much as happens now. It'd simplify 
application code a bit, but wouldn't do much more.

So I guess I need to downgrade my suggestion to require just 2 
Reader-taking factory methods? :-)
I still think that index-only and store-only versions would both make sense. 
In the latter case, storing could be done in fully streaming fashion; in the 
former, tokenization can be.
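The double-buffering case described above amounts to this JDK-only sketch: drain the Reader into a String once, keep the String for storage, and hand the tokenizer a fresh StringReader over the same text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderBuffering {
    // Read a Reader fully into memory; this is the buffering step a
    // stored-and-tokenized field would force on the indexer.
    public static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String value = readFully(new StringReader("streamed field value"));
        // Second pass for tokenization happens from memory, not the source.
        Reader forTokenizer = new StringReader(value);
        System.out.println(readFully(forTokenizer).equals(value)); // true
    }
}
```

A store-only field could skip the String entirely and stream bytes to the index, which is the streaming advantage being discussed.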

 Yes, it is a bit confusing that Text(String, String) stores its value,
 while Text(String, Reader) does not, but it is at least well documented.
   And we cannot change it: that would break too many applications.  But
 we can put this on the list for Lucene 2.0 cleanups.

Yes, I understand that. It wouldn't be reasonable to make such a change. But 
how about adding a more intuitive factory method (UnStored(String, Reader))?

 When I first wrote these static methods I meant for them to be
 constructor-like.  I wanted to have multiple Field(String, String)
 constructors, but that's not possible, so I used capitalized static
 methods instead.  I've never seen anyone else do this (capitalize any
 method but a real constructor) so I guess I didn't start a fad!  This

:-)

 should someday too be cleaned up.  Lucene was the first Java program
 that I ever wrote, and thus its style is in places non-standard.  Sorry.

Best standards are created by people doing things others use, follow or 
imitate... so it was worth a try! :-)

-+ Tatu +-





Re: SearchBlox J2EE Search Component Version 1.1 released

2003-12-03 Thread Tatu Saloranta
On Tuesday 02 December 2003 09:51, Tun Lin wrote:
 Anyone knows a search engine that supports xml formats?

There's no way to generally support "xml formats", as XML is just a 
meta-language. However, building a specific search engine on the Lucene core, 
it should be reasonably straightforward to implement more accurate 
xml-structure-aware tokenization for specific XML applications like DocBook 
or other domain-specific formats.
So, if any search engine advertises "indexing xml content", one had better 
read the fine print to learn what they really claim.

It might be interesting to create a Lucene plug-in that, given a specification 
of how subtrees under specific elements should be handled, would tokenize and 
index content into separate fields. The implementation shouldn't be very 
difficult -- just use a standard XML parser (SAX, DOM), match xpaths, feed the 
content to an analyzer, and then add it to the index. This could also be used 
for HTML (pre-filtering with JTidy or similar first to get XML-compliant HTML).
I wouldn't be surprised if someone on the list has already done this?
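The core of such a plug-in is just path tracking in a SAX handler, which can be sketched with standard JDK APIs. The slash-separated path syntax is invented for illustration; each per-path buffer below is what would become a separate Lucene field:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathFieldSplitter extends DefaultHandler {
    private final StringBuilder path = new StringBuilder();
    private final Map<String, StringBuilder> fields = new HashMap<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        path.append('/').append(qName); // descend: extend the current path
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        path.setLength(path.length() - qName.length() - 1); // ascend
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route content to the buffer for the current element path.
        fields.computeIfAbsent(path.toString(), k -> new StringBuilder())
              .append(ch, start, length);
    }

    public static Map<String, StringBuilder> split(String xml) throws Exception {
        PathFieldSplitter handler = new PathFieldSplitter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        Map<String, StringBuilder> f = split(
                "<book><title>DocBook</title><para>Some text</para></book>");
        System.out.println(f.get("/book/title")); // DocBook
        System.out.println(f.get("/book/para"));  // Some text
    }
}
```

A real plug-in would match these paths against the user-supplied specification and drop or merge the rest.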

-+ Tatu +-






Re: Dates and others

2003-12-01 Thread Tatu Saloranta
On Monday 01 December 2003 15:13, Dion Almaer wrote:
...
 Interesting.  I implemented an approach which boosted based on the number
 of months in the past, and after tweaking the boost amounts, it seems to do
 the job. I do a fresh reindex every night (since the indexing process takes
 no time at all... unlike our old search solution!)

This sounds interesting, as I have been thinking about the best way
to boost newer documents. Can you share some of your experience regarding 
boost values that seemed to make sense? In my case, the CMS I'm working on 
stores support documentation for software/hardware, meaning that content is 
highly time-sensitive (i.e. documents decay pretty quickly).

Since the system already does both incremental reindexing and nightly full 
reindexing (the latter to make sure that even if some changed content was 
temporarily not [fully] reindexed, it eventually gets indexed properly), I 
think I can fairly easily add boosting.
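One shape such a boost could take; the decay rate and floor here are invented numbers for illustration, not values from the mail (halve the boost every six months of age, with a floor so old documents still rank at all):

```java
public class AgeBoost {
    // Hypothetical decay: boost = 0.5^(months/6), floored at 0.1.
    public static float boostForAgeInMonths(int months) {
        float boost = (float) Math.pow(0.5, months / 6.0);
        return Math.max(boost, 0.1f);
    }

    public static void main(String[] args) {
        System.out.println(boostForAgeInMonths(0));  // 1.0
        System.out.println(boostForAgeInMonths(6));  // 0.5
        System.out.println(boostForAgeInMonths(240)); // floored at 0.1
    }
}
```

At nightly reindex time, each document's boost would be computed from its date and set before it is added to the index.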

On a related note, it would also be nice to start categorizing general "hot 
topics" for Lucene developers; it seems there are about half a dozen areas 
where there's lots of interest in improvements (most of them related to 
ranking). If so, perhaps there could be more specific discussion groups, and 
also web pages summarizing some of the discussions and any consensus achieved, 
even if there's no code to show for it?

-+ Tatu +-


 I read content for the index from different sources. Sometimes the source
 gives me documents loosely in date order, but not all of them. So, it seems
 that one of the other approaches should be taken (adding a month/week field
 etc).  I should look more into the HitCollector and see how it can help me.

 The other issue I have is that I would like to prioritize the title field. 
 At the moment I am lazy and add the title to the body (contents = title +
 body) which seems to be OK... however sometimes something that mentions the
 search term in the title should appear higher up in the pecking order.

 I am using the QueryParser (subclassed to disallow wildcards etc) to do the
 dirty work for me. Should I get away from this and manage the queries
 myself (and run a Multi against the title field as well as the contents?

 Thanks for the great feedback,

 Dion







Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 07:40, Chong, Herb wrote:
 i don't know what the Java implementation is like but the C++ one is very
 fast.
...
 I personally do not have any experience with the BreakIterator in Java. Has
 anyone used it in any production environment? I'd be very interested to
 learn more about it's efficiency.

Even if that implementation weren't fast (which it should be), it should be 
fairly easy to implement one that is pretty much as efficient as any of the 
basic tokenizers; i.e. not much slower than the raw scanning speed over the 
text data plus token creation overhead.
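For reference, the JDK facility under discussion is java.text.BreakIterator; a minimal sentence-splitting example looks like this:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {
    // Split text into sentences using the locale-aware BreakIterator.
    public static List<String> sentences(String text, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> s = sentences("Lucene is fast. It scales well.", Locale.US);
        System.out.println(s); // [Lucene is fast., It scales well.]
    }
}
```

A sentence-aware tokenizer would run this over the input once and emit boundary information alongside the tokens.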

-+ Tatu +-






Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 08:39, Chong, Herb wrote:
 the core of the search engine has to have certain capabilities, however,
 because they are next to impossible to add as a layer on top with any
 efficiency. detecting sentence boundaries outside the core search engine is
 really hard to do without building another search engine index. if i have
 to do that, there is no point in using Lucene.

It's also good to know what exactly constitutes the core; I would assume that 
analyzer implementations are not part of it per se, as long as the core knows 
how to use analyzers. And as long as the index structure has some way to store 
the information needed (perhaps by using the existing notion of distances 
between tokens, which allows both overlapping tokens and gaps, as someone 
suggested?), the core need not know the specifics of how analyzers determine 
structural (sentence etc.) boundaries.

To me this seems like one of many issues where it's possible to retain the 
distinction between the Lucene kernel (a lean, mean core) and more specialized 
functionality; highlighting was another one.

-+ Tatu +-





Re: positional token info

2003-10-21 Thread Tatu Saloranta
On Tuesday 21 October 2003 17:31, Otis Gospodnetic wrote:
  It does seem handy to avoid exact phrase matches on "phone boy" when
  a
  stop word is removed though, so patching StopFilter to put in the
  missing positions seems reasonable to me currently.  Any objections
  to that?

 So "phone boy" would match documents containing "phone the boy"?  That

Hmmh. WWGD (What Would Google Do)? :-)

 doesn't sound right to me, as it assumes what the user is trying to do.
  Wouldn't it be better to allow the user to decide what he wants?
 (i.e. "phone boy" returns documents with that _exact_ phrase.  "phone
 boy"~2 also returns documents containing "phone the boy").

As long as phrase queries work appropriately with proximity modifiers, one
alternative (from the app standpoint) would be to:

(a) Tokenize stopwords out, adding a skip value; either one per stop word,
  or one per non-empty sequence of stop words ("top of the world" might
  make sense to tokenize as top - world, "-" signifying the 'hole')
(b) With phrase queries, first do an exact match.
(c) If the number of matches is too low (whatever the definition of "low" is),
  use a phrase query match with a slop of 2 instead.

Tricky part would be to do the same for combination queries, where it's
not easy to check matches for individual query components.

Perhaps it'd be possible to create Yet Another Query object that would,
given a threshold, do one or two searches (as described above), to allow
for self-adjusting behaviour?
Or perhaps there should be a container query that executes an ordered
sequence of sub-queries until one returns a good enough set of matches, then
returns that set (or the last result(s), if there are no good matches); the 
above-mentioned "sloppy if need be" phrase query would just be a special case.
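The container-query fallback logic can be sketched independently of Lucene; the types here are invented stand-ins, with the searcher supplied as a plain function:

```java
import java.util.List;
import java.util.function.Function;

public class FallbackSearch {
    // Run queries in order; stop at the first that yields enough hits,
    // otherwise return the last result set.
    public static <Q, R> List<R> searchWithFallback(
            List<Q> queries, Function<Q, List<R>> searcher, int minHits) {
        List<R> last = List.of();
        for (Q q : queries) {
            last = searcher.apply(q);
            if (last.size() >= minHits) {
                return last; // good enough, stop here
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Exact phrase first, then the slop-2 version as fallback.
        List<String> queries = List.of("\"phone boy\"", "\"phone boy\"~2");
        List<String> hits = searchWithFallback(
                queries,
                q -> q.endsWith("~2") ? List.of("doc1", "doc2")
                                      : List.<String>of(),
                1);
        System.out.println(hits); // [doc1, doc2]
    }
}
```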

-+ Tatu +-





Re: Hierarchical document

2003-10-20 Thread Tatu Saloranta
On Monday 20 October 2003 16:41, Erik Hatcher wrote:
 One more thought related to this subject - once a nice scheme for
 representing hierarchies within a Lucene index emerges, having XPath as
 a query language would rock!  Has anyone implemented O/R or XPath-like
 query expressions on top of Lucene?

Not me... but at some point I think I briefly mentioned that someone with 
extra time might want to do a very simple JDBC driver to be used with
Lucene. Obviously it would be very minimal for queries (and might need
to invent new SQL operators for some searches), but it could also expose
metadata about index. Should be an interesting exercise at least. :-)
Plus, if done properly, tools like DBVis could be used for simple Lucene
testing as well.

If so, who knows; perhaps that would make it even easier to do prototype
implementations of Lucene replacing home-grown SQL-bound search
functionalities of apps.

Most of all above would just be a nice little hack, though. :-)

-+ Tatu +-






Re: Struts logic iterate

2003-10-06 Thread Tatu Saloranta
On Monday 06 October 2003 08:35, Lars Hammer wrote:
...
 to iterate the Hits. I thought that Hits was an array of pointers to docs,

   ^^^
Actually, Hits contains a Vector (it could be an array as well), but is not a 
Collection itself (one cannot extend array classes in Java, so no object other 
than a basic array can be an array or be treated as one).
Hits could be made a Collection, though.
In fact, I think it would be a reasonable thing to do to make Hits a simple 
Collection (or perhaps a List, since it is an ordered collection).
You could file an RFE for this, or better yet, implement it. :-)
I'd think including such a patch in Lucene would make sense as well.

 Has anyone any experience in using the logic:iterate tag or is it
 necessary to write a custom JSP tag which does the iteration??

No, it should be enough to write a simple wrapper that implements Collection 
and accesses the Hits instance via its accessor methods.
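Such a wrapper can be sketched with java.util.AbstractList. HitsLike below is an invented stand-in for Lucene's Hits API, so the example stays self-contained; the real wrapper would delegate to Hits.length() and Hits.doc(n):

```java
import java.util.AbstractList;
import java.util.List;

public class HitsListWrapper {
    // Invented stand-in interface for the Hits API.
    public interface HitsLike {
        int length();
        String doc(int n); // stand-in for Hits.doc(n) returning a Document
    }

    // Expose the hits as a read-only List, so tags like logic:iterate
    // (or any Iterator-based code) can walk it directly.
    public static List<String> asList(final HitsLike hits) {
        return new AbstractList<String>() {
            @Override
            public String get(int index) { return hits.doc(index); }
            @Override
            public int size() { return hits.length(); }
        };
    }

    public static void main(String[] args) {
        HitsLike fake = new HitsLike() {
            public int length() { return 2; }
            public String doc(int n) { return "doc" + n; }
        };
        for (String d : asList(fake)) {
            System.out.println(d); // doc0 then doc1
        }
    }
}
```

AbstractList supplies iterator(), contains(), and the rest of the Collection contract for free once get() and size() are defined.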

-+ Tatu +-





Re: HTML Parsing problems...

2003-09-18 Thread Tatu Saloranta
On Thursday 18 September 2003 14:50, Michael Giles wrote:
 I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
 I also know that it is updated from time to time and performs much better
 than the other ones that I have tested.  Frustratingly, the very first page
 I tried to parse failed
  (http://www.theregister.co.uk/content/54/32593.html). It seems to be choking
  on tags that are being written inside of JavaScript code (i.e.
  document.write('</scr' + 'ipt>');).
 Obviously, the simple solution (that I am using with another parser) is to
 just ignore everything inside of script tags.  It appears that the parser
 is ignoring text inside script tags, but it seems like it needs to be a bit
 smarter (or maybe dumber) about how it deals with this (so it doesn't get

I would guess that ignoring stuff in scripts (for indexing purposes) often 
makes sense; the exception being if someone wants to create an HTML 
site-creation IDE (and specifically wants to search for stuff in javascript 
sections?).
Nonetheless, an HTML parser has to be able to handle these, I think.
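The "ignore everything inside script tags" workaround can be sketched as a simple pre-filter (class name invented); note how it survives the document.write('</scr' + 'ipt>') trick from the page above, because the split string never contains a literal closing tag:

```java
public class ScriptStripper {
    // Drop everything between <script ...> and </script> before handing
    // the page to a parser or indexer. A pre-filter only, not a parser.
    public static String stripScripts(String html) {
        StringBuilder out = new StringBuilder();
        String lower = html.toLowerCase();
        int pos = 0;
        while (true) {
            int open = lower.indexOf("<script", pos);
            if (open < 0) {
                out.append(html, pos, html.length());
                return out.toString();
            }
            out.append(html, pos, open);
            int close = lower.indexOf("</script", open);
            if (close < 0) {
                return out.toString(); // unterminated script: drop the rest
            }
            pos = lower.indexOf('>', close);
            if (pos < 0) {
                return out.toString();
            }
            pos++; // skip past the closing tag's '>'
        }
    }

    public static void main(String[] args) {
        String page = "<p>Hi</p><script>document.write('</scr' + 'ipt>');"
                + "</script><p>Bye</p>";
        System.out.println(stripScripts(page)); // <p>Hi</p><p>Bye</p>
    }
}
```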

 confused by such occurrences).  I see a bug has been filed regarding
 trouble parsing JavaScript, has anyone given it thought?

I implemented a rather robust quick-and-dirty (X)HTML parser that was able to work
through many such issues (script tags; unquoted single '<' and '>' chars
in attribute values and elements; a simplistic approach to optional end tags). Since 
it was heavily optimized for speed (everything fully in memory in a char array, 
with optimizations based on that), I thought it might be useful for indexing (even 
more so than for its original purpose, which was to be a very fast utility for 
filtering [adding and/or removing stuff in] HTML pages).

If anyone is interested, I could share the source code and/or (if I have 
time) implement an efficient fault-tolerant indexer.
Like I said, this also works equally well for well-formed XML, but that's 
nothing special.

-+ Tatu +-





Re: Lucene demo ideas?

2003-09-17 Thread Tatu Saloranta
On Wednesday 17 September 2003 07:07, Erik Hatcher wrote:
 On Wednesday, September 17, 2003, at 08:43  AM, Killeen, Tom wrote:
  I would suggest XML as well.

 Again, I'd like to hear more about how you'd do this generically.  Tell
 me what the field names and values would correspond to when presented
 with an XML file.

Perhaps just one generic content field, which would contain tokenized
content from all XML segments. That could be done easily & efficiently
with just SAX event handling. Since it's a simple demo, you can't get much
simpler than that, but it should still be fairly useful.
Attributes could/should be ignored by default; common practice for XML markup
seems to be that attributes do not contain any content that would make sense to 
index.

So I'd think just stripping out all tags (and comments, PIs etc.) might be a 
reasonably simple approach for the demo app.
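A minimal sketch of that approach using the JDK's built-in SAX parser: collect all character content into one string, ignoring tags, attributes, comments and PIs (attribute values never reach the characters() callback):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Extracts the plain text content of an XML document for a single
// generic "contents" field, as a demo indexer might.
public class XmlTextExtractor {
    public static String extractText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Element text only; attributes, tags and PIs are skipped.
                sb.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return sb.toString().trim();
    }
}
```

The resulting string would then be fed to whatever Analyzer the demo uses for its content field.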

-+ Tatu +-





Re: Keyword search with space and wildcard

2003-08-30 Thread Tatu Saloranta
On Friday 29 August 2003 10:02, Terry Steichen wrote:
 I agree.  One problem, however, that new (and not-so-new) Lucene users face
 is a learning curve when they want to get past the simplest and most
 obvious uses of Lucene.  For example, I don't think any of the docs mention
 the fact that you can't combine a phrase and a wildcard query.  Other
 things that are obviously quite well understood by many members of the
 list, are still less-than-clear to others.  For example, I found (and still
 find) it a bit difficult to find concrete examples/advice of how to get
 good benefit from filters.

 My whole point is that this is a *very* powerful and flexible technology.
 But I think it's often very difficult for those most experienced in using
 Lucene to fully appreciate how it looks from the newbie point of view.

I agree completely. Perhaps I worded my reply badly; I didn't mean to sound 
hostile towards new users at all -- after all, I consider myself to be one (I 
just happened to work on simple improvements to QueryParser and learnt how it 
works). I wish the documentation were more complete; perhaps some section could 
list common workarounds and insights. And perhaps the incompatibility of phrase 
and wildcard queries could be added to the document that lists current 
limitations.

I guess the reason I think it's valuable to document the flexibility of query 
construction is that I have been working on something similar (although 
working with database queries) in a system I'm working on, and I have also 
seen systems that have query syntax that's too intertwined with backend 
implementation (for example, while Hibernate is a good ORM, its queries don't 
seem to have backend independent intermediate representation... which makes 
it hard to develop different kinds of backends). So, it's useful to know that 
there are 2 levels of interfaces to Lucene's query functionality.

-+ Tatu +-





Re: 2,147,483,647 max documents?

2003-08-11 Thread Tatu Saloranta
On Monday 11 August 2003 01:07, Kevin A. Burton wrote:
 Why was an int chosen to represent document handles?  Is there a reason
 for this?  Why wasn't a long chosen to represent document handles?  64
 bits seems like the obvious choice here except for a potentially bloated
 datastore (32 extra bits)

I can't speak for actual reasons (not being core Lucene developer), but the
general benefits of 32-bit ints vs. longs are:

- Better performance on pretty much any current architecture (even so-called
  64-bit CPUs often prefer 32-bit data access, and 64-bit representations are
  more important for addressing).
  Also, smaller data set size is usually also good for performance (caching).
- Atomicity of access (read access can often be done without synchronizing);
  longs cannot be atomically accessed in Java.

Another question is whether limited address space presents a real problem. 
Since Lucene can reuse doc ids (or rather, there is not persistent id per se? 
doc id is just an index, and holes left by removed docs can be reused?), 
perhaps this is usually not much of an issue?

-+ Tatu +-





Re: interesting phrase query issue

2003-07-17 Thread Tatu Saloranta
On Thursday 17 July 2003 07:20, greg wrote:
 I have several document sections that are being indexed via the
 StandardAnalyzer.  One of these documents has the line "access, the
 manager".  When searching for the phrase "access manager", this document is
 being returned.  I understand why (at least I think I do): "the" is a stop
 word, and the "," is being removed by the tokenizer. My question is,
 is there any way I can avoid having this returned in the results?  My
 thoughts were to create a new analyzer that indexes the word "the" (blech,
 too many of those), or to index the "," in some way (also not good).  Any
 suggestions?

You can also replace all stop words with a dummy token. That would be similar 
to indexing "the" (which probably is a better idea than indexing ",").

I'm planning to do something similar for paragraph breaks (in case of plain 
text, a double linefeed; for HTML, <p> etc.), to prevent similar problems.
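A rough sketch of the replacement idea over a token list; the stop-word set and the placeholder token are arbitrary choices for illustration. The point is that positions are preserved, so "access manager" no longer lines up as an adjacent phrase in "access, the manager":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Instead of dropping stop words (which collapses token positions),
// substitute a dummy placeholder token so phrase positions stay apart.
public class StopWordPlaceholder {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "an"));
    static final String DUMMY = "_";  // arbitrary placeholder for this sketch

    public static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(STOP.contains(t.toLowerCase()) ? DUMMY : t);
        }
        return out;
    }
}
```

In a real Analyzer this logic would live in a TokenFilter rather than a list-to-list method.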

-+ Tatu +-






Re: Multiuser environments

2003-07-14 Thread Tatu Saloranta
On Monday 14 July 2003 08:52, Guilherme Barile wrote:
 Hi
 I'm writing a web application which will index files using
 textmining to extract text and lucene to store it. I do have the
 following implementation questions:

 1) Only one user can write to an index at each time. How are you people
 dealing with this ? Maybe some kind of connection pooling ?

Two obvious candidates are locking the bottleneck methods and doing index
writing in a critical section, or having a background thread that does
reindexing while other threads add requests to a queue. In the CMS I'm working on we 
are doing the latter (so as not to block actual request threads, which could
happen with the first approach; adding/deleting documents is done as 
post-processing when documents are created/edited/deleted).

In either case you usually have a singleton instance that represents the 
search engine functionality (assuming a single index), and from there on it's 
reasonably easy to reuse the IndexReader as necessary.
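A minimal sketch of the queue-plus-background-thread approach, independent of any Lucene API; in practice each Runnable would wrap the actual IndexWriter add/delete calls:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Request threads enqueue index work; a single background worker drains
// the queue, so only one thread ever writes to the index at a time.
public class IndexQueue {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public IndexQueue() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run();  // single-threaded index writing
                }
            } catch (InterruptedException e) {
                // shutdown requested; fall through and exit
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called from request threads: returns immediately, work happens later.
    public void submit(Runnable indexTask) {
        queue.add(indexTask);
    }
}
```

Request threads never block on the index writer; they only pay the cost of an enqueue.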

-+ Tatu +-





Re: commercial websites powered by Lucene?

2003-06-25 Thread Tatu Saloranta
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote:
 John Takacs wrote:
  I'd love to try Lucene with the above, but the Lucene install fails
  because of JavaCC issues.  Surprised more people haven't encountered this
  problem, as the install instructions are out of date.

 Well, what do you need JavaCC for? Isn't it just the technology for
 building the supplied HTML-Parser? There are much better HTML parsers
 out there, which you can use.

On a related note; has anyone done performance measurements for various
HTML parsers used for indexing?

I have written a couple of XML/HTML parsers that were optimized for speed 
(and/or leniency, to be able to handle/fix non-valid documents), and was 
wondering if they might be useful for indexing purposes for other people (one 
is in general pretty optimal if document contents are fully in memory 
already, like when fetching from a DB; another uses very little memory, while 
being only slightly slower). However, using those as opposed to more standard 
ones would only make sense if there are significant speed improvements.
And to do that, it would be good to have baseline measurements, and/or to know 
what the current best candidates are from a performance perspective.

The thing is that creating a parser that only cares about textual content (and 
perhaps in some cases about surrounding element, but not about attributes, or 
structure, or DTD/Schema, validity etc) is fairly easy, and since indexing is 
often the most CPU-intensive part of search engine, it may make sense to try 
to optimize this part heavily, up to and including using specialized parsers.

-+ Tatu +-





Re: Weighted Search by Field using MultiFieldQueryParser

2003-06-17 Thread Tatu Saloranta
On Tuesday 17 June 2003 05:43, Kevin L. Cobb wrote:
 I have an index that has three fields in it. When I do a search using
 MultiFieldQueryParser, the search applies the same importance (weight)
 to each of the fields. BUT, what if I want to apply a different weight
 to each field, i.e. I want to consider found terms from certain fields
 as less important than others. I have applied an algorithm to help me
 do this, which involves searching each field separately and then
 recombining the results into a single collection, but hate to reinvent
 the wheel if I don't have to.

Have you looked at the MultiFieldQueryParser source? It's a very simple class, and
modifying it (making a new class) should be easy; pass in not only field names 
but also weights to apply.
(As a sidenote, MultiFieldQueryParser does some unnecessary work as is... it 
seems to re-parse the same query once for each field; it could just clone it.)

-+ Tatu +-






Re: Lowercasing wildcards - why?

2003-05-31 Thread Tatu Saloranta
On Friday 30 May 2003 09:55, Leo Galambos wrote:
 Ah, I got it. THX. In the good old days, the wildcards were used as a
 fix for missing stemming module. I am not sure if you can combine these
 two opposite approaches successfully. I see the following drawbacks of
 your solution.

 Example:
 built* (-built) could be changed to build* (no built, but -builder,
 building, etc.), and precision will go down drastically.

 You probably use a stemmer with one important bug (a.k.a. feature) -
 overstemming, so here is another example:
 political* (-political, politically) is transformed to polic*
 (-policer, policy, policies, policement etc.) by Porter alg., and the
 precision is again affected drastically

Yes, this is the exact problem that was brought up the last time this was 
discussed. It may not be a very common problem (most of the time stemming a 
wildcard part probably works ok; somebody had tried that), but it is still a 
potential one. And that's why default lower casing was added: it solved 
one of the FAQs. It is much more common for the analyzer used for a non-wildcard 
query to do lower casing than not, and thus the default setting (which means
some users have to turn the feature off) seems to make sense.

A more general problem, then, is that there's no real way to stem "foo?ar", or any 
non-prefix wildcard query, but that could be figured out by QueryParser if 
necessary.

-+ Tatu +-





Re: Wildcard workaround

2003-05-30 Thread Tatu Saloranta
On Wednesday 28 May 2003 05:43, David Medinets wrote:
 - Original Message -
 From: Andrei Melis [EMAIL PROTECTED]

  As far as I have understood, lucene does not allow search queries
  starting with wildcards. I have a file database indexed by content
  and also by filename. It would be nice if the user could perform a
  usual search like *.ext.

 Does anyone know if Oracle patented the technique that they use for *ext
 searching in the Oracle Text product. If not, I'm sure the technique can be
 borrowed.

 On the other hand, the slow technique of comparing each term to *.ext can
 certainly be implemented with a minimum of effort, I think.

[apologies if somebody else already pointed this out... I missed some mails to 
the list from yesterday]

One of the most interesting solutions somebody posted earlier was to use
2 indexes: one for 'normal' searches, with a normal analyzer etc., and a second
one that uses reversed words, i.e. an analyzer that reverses the words tokenized by
the standard analyzer. This second index would then allow such searches
to be done as prefix matches; in this case the query would be something like

reverse_field:txe.*

This would work efficiently, although it pretty much doubles the size of the index for 
content that has to be suffix-searchable. Still, this solution somehow 
appeals to my hacker side. :-)

In this specific case, though, what others have suggested (add file prefix as 
separate field to search on), is probably more practical.
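The reversed-index trick can be sketched in plain Java; suffixMatch mimics what a prefix query against the reversed field would do for a "*suffix" search:

```java
// Index each token reversed, so a suffix query like "*.ext" becomes an
// ordinary prefix query "txe.*" against the reversed field.
public class ReversedField {
    public static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // Would the reversed token match the reversed prefix? This is what a
    // prefix query on the reversed field effectively tests per term.
    public static boolean suffixMatch(String token, String suffix) {
        return reverse(token).startsWith(reverse(suffix));
    }
}
```

In Lucene terms, reverse() would sit in a custom TokenFilter used only when building the second index.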

-+ Tatu +-





Re: Analyzer Incorrect?

2003-04-04 Thread Tatu Saloranta
On Friday 04 April 2003 05:24, Rob Outar wrote:
 Hi all,

   Sorry for the flood of questions this week, clients finally started using
 the search engine I wrote which uses Lucene.  When I first started

Yup... that's the root of all evil. :-)
(I'm in similar situation, going through user acceptance test as we speak... 
and getting ready to do second version that'll have more advanced metadata
based search using Lucene).

 developing with Lucene the Analyzers it came with did some odd things so I
 decided to implement my own but it is not working the way I expect it to.
 First and foremost I would like to like to have case insensitive searches
 and I do not want to tokenize the fields.  No field will ever have a space

If you don't need to tokenize a field, you don't need an analyzer either. 
However, to get case-insensitive search, you should lower-case field contents 
before adding them to the document. QueryParser will do lower casing for search 
terms automatically (if you are using it), so matching should work fine then.

-+ Tatu +-





Re: Wildcard searching - Case sensitiv?

2003-03-28 Thread Tatu Saloranta
On Friday 28 March 2003 08:37, [EMAIL PROTECTED] wrote:
 Ok, thanks Otis,

 you have to write the terms lowercase when you're searching with wildcards.

Or use the set method in QueryParser to ask it to automatically lower case
those terms. Patch for that was added before 1.3RC1 (check javadocs or source 
for exact method to call). I think default was not to enable this feature, 
for backwards compatibility (unless Otis changed it as was suggested?).

-+ Tatu +-






Re: Alternate Boolean Query Parser?

2003-03-28 Thread Tatu Saloranta
On Friday 28 March 2003 15:48, Shah, Vineel wrote:
 One of my clients is asking for an old-style boolean query search on my
 keywords fields. A string might look like this:

    "oracle admin*" and java and oracle and (8.1.6 or 8.1.7) and
  (solaris or unix or linux)

 There would probably be need for nested parenthesis, although I can't think
 of an example. Is there a parser I can plug into lucene to make this
 happen? It doesn't seem like the normal QueryParser class would like this
 string, or would it? Any ideas or comments would be appreciated. Making my

Actually, I think it should, as long as you change 'and' to 'AND' and 'or' to 
'OR' (upper case versions are used, I think, to make it less likely the user 
meant to match the words 'and' and 'or'?).

 own grammar and parser class is too expensive a proposition.

Well, writing simple grammar and parser is fairly easy to do, if you've ever 
used java_cup or javacc (or just (b)yacc / bison), shouldn't take all that 
long since all actual query classes already exist. But I don't think you need 
to do even that. :-)

The only feature that might need some additional work is matching "oracle 
admin*"; PhrasePrefixQuery allows doing something like that, but it's not 
integrated with QueryParser (I think it probably should be, and that might be 
quite easy to do).

-+ Tatu +-





Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-25 Thread Tatu Saloranta
On Monday 24 March 2003 18:03, Michael Wechner wrote:
 John Bresnik wrote:
 anyone know of a quick and easy way to get this demo
 [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
 crawler to create a local [static] version of the site [i.e. they are no
 longer JSP files, just the html output from the original JSP file] - but
  in the interest of keeping the URL intact, I need to parse the JSP
  extensions - the short question is, does anyone know of a way to *not*
  ignore the *.jsp files?

 just modify IndexHTML: there is one line in there which decides what
 extension it will index.

There is another question I was wondering about; since JSP is not XML (i.e. it can not 
be reliably parsed using an XML or even HTML parser [or, for that matter, even 
with the simplest XML markup tokenizer that ignores nesting], and needs a lower-level 
scanner), has anyone tried connecting an actual JSP processor to Lucene? Or 
writing a simple one just meant for indexing, without having to execute the 
embedded code?
[the problem with JSP compared to XML is that it need not nest properly with 
the HTML content around it; one can use JSP inside attribute values, for example; 
thus, first the JSP has to be processed to HTML, and then the HTML needs to be 
further tokenized]

Jakarta has to have at least one such processor (haven't looked at whether 
there's a separate component or if Tomcat just has one embedded?). Of course 
parsing JSP is problematic in many ways, not just getting jsp tagging out; 
dynamic portions probably just have to be ignored, and all text inside 
included (except for things inside comments).

-+ Tatu +-





Re: Create my own Analyzer...

2003-03-21 Thread Tatu Saloranta
On Friday 21 March 2003 03:55, Pierre Lacchini wrote:
 Heya,

 as u can see, I want to create my own french Analyzer, using the snowball's
 FrenchStemmer...

 But i don't really know how to proceed...

 Does anyone know where I can find a tutorial, or a clear example of How to
 create an analyzer ??

 Sorry for all those noob questions, but as i said, i'm kinda noob in java
 ;)

Well, analyzer classes are about as simple as it gets, so perhaps look at the 
default analyzers the Lucene core comes with (under org.apache.lucene.analysis)?
(The only slightly more advanced one is StandardAnalyzer, as it uses javacc.)

-+ Tatu +-





Re: multiple collections indexing

2003-03-19 Thread Tatu Saloranta
On Wednesday 19 March 2003 01:44, Morus Walter wrote:
...
 Searches must be able on any combination of collections.
 A typical search includes ~ 40 collections.

 Now the question is, how to implement this in lucene best.

 Currently I see basically three possibilities:
 - create a data field containing the collection name for each document
   and extend the query by an or-combined list of queries on this name field.
 - create an index per collection and use a MultiSearcher to search all
   interesting indexes.
 - (a third one I just discovered): create a data field containing a
   marker for each collection
   x10... for the first collection
   x01... for the second
   x001000... for the third
   and so on.
   The query might use a wildcard search on this field using x?0?0...
   specifying '?' for each collection that should be searched on, and '0'
   for the others.
   The marker would be very long though (the number of collections is
   growing, so we have to keep space for new one also).

This might still be a feasible thing to do, unless the number of collections 
changes very frequently (as you then need to reindex all docs, not just 
incrementally).

Another possibility would be to have a new kind of Query; one to use with 
numeric field values (probably would be easiest to use hex numbers). In a way 
it'd be a specialized/optimized version of WildcardQuery.

For example, one could define required bit pattern after ORing field value 
with mask (in your case you'd use one bit per type, and require 
non-interesting type flags to be zeroes, knowing that then at least one other 
bit, matching interesting type, is one).
Implementing this would be fairly easy; first find the range (like RangeQuery 
does), and iterate over all existing terms in that range, and for each match 
against bit pattern, and add term if it matches the pattern.

Actual search would then search pretty much like prefix, wildcard or range 
query, as Terms at that point have been expanded and search part need not 
care how they were obtained.

This would make the representation more compact (4 bits per char instead of one), 
potentially making the index a bit smaller (which usually also means faster). And 
of course if you really want to push the limit, you could use an even more 
efficient encoding (although, assuming indexes use UTF-8, base64 might be 
almost as efficient as it gets, as ASCII chars only take one byte whereas 
upper chars take anywhere from 2 to 7 bytes [4 for UCS-2]).
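A rough sketch of the marker idea using hex encoding (4 bits per char); matches() is the per-term test a custom mask query would apply while iterating candidate terms, and the mask layout (one bit per collection) is an assumption carried over from the scheme above:

```java
// Encode collection membership as a fixed-width hex term, and test terms
// against a bit mask of "interesting" collections.
public class CollectionMask {
    // Encode a membership bit set as a zero-padded hex string of given width.
    public static String encode(int bits, int width) {
        String hex = Integer.toHexString(bits);
        StringBuilder sb = new StringBuilder();
        for (int i = hex.length(); i < width; i++) sb.append('0');
        return sb.append(hex).toString();
    }

    // True if the term has at least one bit set among the interesting ones --
    // i.e. the document belongs to at least one searched collection.
    public static boolean matches(String term, int interestingMask) {
        int bits = Integer.parseInt(term, 16);
        return (bits & interestingMask) != 0;
    }
}
```

Matching terms found this way would then be added to a BooleanQuery, just as a wildcard or range query expands its terms.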

Adding such a query would need to be done outside QueryParser (as the length of 
the bitfield would be variable), but in your case that probably shouldn't 
be a problem?

Anyway, just an idea I thought might be worth sharing,

-+ Tatu +-





Re: QueryParser and compound words

2003-03-13 Thread Tatu Saloranta
On Thursday 13 March 2003 00:52, Magnus Johansson wrote:
 Tatu Saloranta wrote:
...
 But same happens during indexing; fotbollsmatch should be properly
 split and stemmed to fotboll and match terms, right?

 Yes but the word fotbollsmatch was never indexed in this example. Only
 the word fotboll.
 I want a query for fotbollsmatch to match a document containing the word
 fotboll.

Ok I think I finally understand what you meant. :-)

So, basically, in your case you would prefer the query:

fotbollsmatch

to expand to (after stemming etc.):

fotboll OR match

and not to the phrase

"fotboll match"

so that matching just one of the words would be enough for a hit (either of 
them, or just the first or last word).
It would be possible to implement this functionality by overriding the default
QueryParser and modifying its functionality slightly.

In QueryParser you should be able to override the default handling for terms,
so that whenever you get just a single token (in this case fotbollsmatch)
that expands to multiple Terms, you do not construct a phrase query, but
just a BooleanQuery with TermQueries (look at getFieldQuery(); it handles
basic search terms). You may need to use simple heuristics for figuring out
when you have white space(s) that indicate normal phrases, which probably
should still be handled using PhraseQuery.
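A sketch of the expansion itself, outside any QueryParser machinery: given the terms a compound token splits into, build an OR'ed query string instead of a phrase. The field name and the splitting of the compound are assumed inputs here (the actual split would come from the stemmer/analyzer):

```java
import java.util.List;

// When a single query token (e.g. a compound word) stems into several
// terms, OR them together instead of forming a phrase query, so a
// document containing just one part still matches.
public class CompoundQueryExpansion {
    public static String expand(String field, List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append(field).append(':').append(p);
        }
        return sb.toString();
    }
}
```

In an overridden getFieldQuery() one would build the BooleanQuery of TermQueries directly rather than going back through a query string.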

Of course this is all assuming you still want that functionality. :-)
And if you do, it would be a good idea to contribute the patch back in case someone 
else finds it useful later on (I think many non-English languages have the concept
of compound words; German and Finnish at least do).

-+ Tatu +-





Re: Regarding Setup Lucine for my site

2003-03-05 Thread Tatu Saloranta
On Wednesday 05 March 2003 13:35, Leo Galambos wrote:
  I'm all eyes and I'm a serious grown-up with good manners :)
  Constructive suggestions for improvement are always welcome.


First a disclaimer: I don't mean to sound too negative. I'm genuinely curious 
about many of the issues you mention. But I'm not sure I really understand 
them. :-)

 1. 2 threads per request may improve speed up to 50%

Hmm? Could you clarify? During indexing, multithreading may speed things
up (splitting docs to index into 2 or more sets, indexing them separately, and merging
the indexes). But... isn't that a good thing? Or are you saying that it'd be good 
to have multi-threaded search functionality for a single search? (In my 
experience searching is seldom the slow part.)

 2. Merger is hard coded

In a way that is bad because... ?
(ie. what is the specific problem... I assume you mean index merging
functionality?)

...
 4. you cannot implement dissemination + wrappers for internet servers
 which would serve as static barrels.

Could you explain this bit more thoroughly (or pointers on longer 
explanation)?

 5. Document metadata cannot be stored as a programmer wants, he must
 translate the object to a set of fields

Yes? I'd think that the possibility of using separate fields is a good thing; 
after all, all a plain-text search engine needs to provide (to be considered 
one) is indexing of plain text data, right?
Plus, Lucene is not a Content Management System (or database), but a
content indexing system. As such I'm not sure why storage should not be 
optimized to allow for fast searches (which means flattening contents, 
amongst other things).

That is not to say that things couldn't be improved; it might be a good idea 
to define a small set of base interfaces/classes to make it easier to convert 
from 'objectified' textual data to straight-forward indexing.

FWIW I am actually using Lucene for storing documents that have extensive 
metadata associated, and I don't find restrictions too bad... but that's 
certainly matter of taste. :-)

 6. Lucene cannot implement your own dynamization

(sorry, I must sound real thick here).
Could you elaborate on this... what do you mean by dynamization?

-+ Tatu +-






Re: AW: How is that possible ?

2003-02-28 Thread Tatu Saloranta
On Friday 28 February 2003 05:15, Alain Lauzon wrote:
 At 07:16 2003-02-28 +0100, you wrote:
 May it be, that microsoft is found, because the search is not case
 sensitive (text) and ct is not found because there the search is case
 sensitive (Keyword)
 
 Did you try
 +state:CT +company:microsoft~10
 ^^
 ?

 I don't think so because the StandardAnalyzer will put everything in
 lowercase.  I will try without the StandardAnalyzer.

Yes, but only for fields that are tokenized. Keywords are not touched; they are 
indexed as is. So if the 'state' field is a keyword field, it would be stored in 
upper case (this is explained in the FAQ).

-+ Tatu +-






Re: IndexWriter addDocument NullPointerException

2003-02-22 Thread Tatu Saloranta
On Friday 21 February 2003 13:22, Günter Kukies wrote:
 Hello,

 I don't have any line number.

You unfortunately do need to know the line number, if you do get an exception 
and try to see where it occurs.
Another less frequent problem is that you actually get the exception as an 
object and print out that exception; in that case you would just see
java.lang.NullPointerException, and nothing else?
Otherwise, based on your code, you should see a stack trace, with or without
line numbers. But you would at least see the method call stack, which would
help in figuring out where the problem occurred.

However, if you do catch an exception, and stack trace doesn't have line 
numbers (it seems that some JVMs do not have line number info available when 
running JIT'ed code) there are basically two ways to figure out exact 
location:

(1) Try to make the JVM include line number info (e.g. by running in interpreted 
  mode; I think there was an option, something like '-Djava.compiler= ', to 
  disable the JIT?)
(2) Run the code in a debugger. One nice free debugger (if you are not using an
  IDE that has one) is JSwat:
  http://www.bluemarsh.com/java/jswat/

Hope this helps,

-+ Tatu +-


 this is the code snippet:

 Document doc;
 IndexWriter writer;

 .

 try{
 writer.addDocument(doc);
 }
 catch(Exception ex){
 ex.printStackTrace();
 }

 this is the output on Standard.out:

 java.lang.NullPointerException


 and nothing more.

 The doc is not null and System.out.println(doc) seems to be ok. There is
 no difference between the working 80% and the not working 20% doc's.
 Thanks,

 Günter

  On Friday 21 February 2003 05:33, Günter Kukies wrote:
  Hello,
 
  writer.addDocument(doc) is throwing an NullPointerException. The
  stacktrace from the catched Exception is only one line
  NullPointerException without anything else. I open the IndexWriter
  with create true. Run over the files in a Directory and add all found
  documents. After that i close the indexwriter. 80% of the documents
  were added without problems. The rest gets that NullPointerException.
 
  Any Ideas?
 
  Perhaps look at the line where the null pointer exception is thrown and
  see  what happens? NullPointerException is thrown when a null reference
  is being  de-referenced. Seeing the immediate cause should be easy,
  given line number.
 
  Perhaps you have added a field with null value? (just a guess, I don't
  know if  that's even illegal).
 
  -+ Tatu +-
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED] For
  additional commands, e-mail: [EMAIL PROTECTED]

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





Re: Number range search through Query subclass

2003-02-15 Thread Tatu Saloranta
On Friday 14 February 2003 02:58, Volker Luedeling wrote:
 Hi,
  
 I am writing an application that constructs Lucene searches from XML
 queries. Each item from the XML is represented by a Query of the
 corresponding type. I have a problem when I try to search for number
 ranges, since RangeQuery compares strings, not numbers, so 15 < 155 < 20.
 What I need is a subclass of Query that evaluates numbers correctly. I have
 tried subclassing RangeQuery, MultiTermQuery or Query directly, but each
 time I have run into problems with inheritance and access rights to various
 methods or inner classes. 
 Does anyone know of a solution to this problem? If there is none, the only
 way I can think of would be indexing numbers as something like #15#. But
 it's not a very elegant solution when all I need is a slight variation of
 one existing class. 
 Thanks for any help you can offer,

Actually the problem is not (just) the query; it's the tokenizer/analyzer/indexer 
as well. For a range query to work, tokens have to be correctly ordered 
lexically (~= in alphabetic order). I don't think using #s as markers would 
work, as they do not make tokens get ordered properly (plus, most analyzers 
would just remove those chars).

The usual way to do this is to use a suitable numeric format for indexed data; 
for dates a format like YYYY-MM-DD works ok (i.e. dates are correctly ordered 
when ordering date tokens alphabetically); for other numbers (like 
timestamps) what is usually done is padding, so that the numbers in your case
would become 015, 155 and 020 (instead of a leading 0, any other letter that 
sorts before '1' would do). So, you need to know the biggest number 
you'd need to index and use appropriate zero padding.
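The padding idea in a nutshell: zero-padding to a fixed width makes lexicographic (term) order agree with numeric order, which is exactly what a string-comparing range query needs:

```java
// Zero-pad numbers to a fixed width so string comparison orders them
// the same way numeric comparison would.
public class PaddedNumbers {
    public static String pad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }
}
```

The width must be chosen from the largest value ever indexed; widening it later means reindexing.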

Now, if you store these numbers as single values in a separate index, padding is
easy to do. If you are trying to extract arbitrary numeric data embedded in
otherwise plain text content, things are a bit more complicated.
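The padding idea can be sketched like this (a hypothetical helper, not part of Lucene; the pad width is chosen from the largest value you expect to index):

```java
public class NumericPad {
    // Zero-pad a number before indexing so that lexicographic term order
    // matches numeric order. Width must cover the largest expected value.
    static String pad(int value, int width) {
        StringBuilder sb = new StringBuilder(Integer.toString(value));
        while (sb.length() < width) {
            sb.insert(0, '0'); // prepend zeroes up to the fixed width
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpadded, "155" sorts before "20"; padded, string order is numeric.
        System.out.println(NumericPad.pad(15, 3));  // 015
        System.out.println(NumericPad.pad(155, 3)); // 155
        System.out.println(NumericPad.pad(20, 3));  // 020
    }
}
```

With this, a range query over the padded terms includes exactly the numerically in-between values.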

Hope this helps,

-+ Tatu +-






Re: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Tatu Saloranta
On Friday 14 February 2003 07:27, Aaron Galea wrote:
 I had this problem when using xerces to parse xml documents. The problem I
 think lies in the Java garbage collector. The way I solved it was to create

It's unlikely that GC is the culprit. Current collectors are good at purging
unreachable objects, and only throw an OutOfMemory error when they really have
no other choice.
Usually it's the app that holds dangling references which prevent the GC from
collecting objects that are no longer useful.

However, it's worth noting that Xerces (and DOM parsers in general) use
considerably more memory than the input XML files they process; this is because
they usually have to keep the whole document structure in memory, and there is
overhead on top of the text segments. So expect at least 2 * input
file size (files are usually UTF-8, which most of the time uses 1 byte per
char; in memory, 16-bit UTF-16 chars are used for performance), plus some
additional overhead for storing element structure information and the like.

And since default max. java heap size is 64 megs, big XML files can cause 
problems.

More likely, however, references to already-processed DOM trees are not being
nulled out in a loop or something like that, especially if spawning one JVM
process per item solves the problem.

 a shell script that invokes a java program for each xml file that adds it
 to the index.
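The loop-scoping point can be sketched with plain JAXP (no Xerces-specific API; the inline XML strings stand in for files):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IndexLoop {
    // Parse one document and return just the text to be indexed;
    // the DOM tree itself never escapes this method.
    static String extractText(DocumentBuilder builder, String xml) throws Exception {
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String[] docs = { "<doc>one</doc>", "<doc>two</doc>" };
        for (String xml : docs) {
            // No reference to a previous Document survives an iteration,
            // so the GC is free to reclaim each tree before the next parse.
            System.out.println(extractText(builder, xml));
        }
    }
}
```

Keeping the Document reference scoped inside the loop body is what lets one JVM handle many files without accumulating trees on the heap.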

-+ Tatu +-






Re: % of Relevance

2003-02-11 Thread Tatu Saloranta
On Tuesday 11 February 2003 07:48, Nellai wrote:
 Hi!

 can anyone tell me how to calculate the % of relevance using Lucene.

Lucene's hit score is a normalized float in ]0.0, 1.0] (scores of 0.0 are never
included). From there it's basic arithmetic (perhaps this could be included
in the FAQ, even though it is fairly trivial). The simplest way would be:

... // get the search results
float score = hits.score(docNr); // in ]0.0, 1.0], i.e. 1.0 inclusive
int pctScore = (int) (100.0f * score);

Also note that it's not guaranteed that a search has any 100%-matching
docs; for example, when none of the docs matches all clauses and the clauses
are combined with an OR query. The same may also happen (I think?) if the best
match for different sub-clauses comes from different docs.

You may also want to normalize the score if you always want your top match to 
be 100% (or have some range that gets rounded up)... users are known to want 
silly features like that. :-)
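That normalization step could be sketched like so (a hypothetical helper, not a Lucene API; it rescales so the top hit always reads 100%):

```java
public class ScorePercent {
    // Rescale raw hit scores so the best match maps to 100%.
    // An empty or all-zero result list yields all-zero percentages.
    static int[] toPercent(float[] scores) {
        float top = 0.0f;
        for (float s : scores) {
            top = Math.max(top, s);
        }
        int[] pct = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            pct[i] = (top > 0.0f) ? Math.round(100.0f * scores[i] / top) : 0;
        }
        return pct;
    }
}
```

Whether to round up a range of near-top scores to 100% as well is purely a presentation choice.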

-+ Tatu +-






Re: '-' character not interpreted correctly in field names

2003-02-03 Thread Tatu Saloranta
On Monday 03 February 2003 07:19, Terry Steichen wrote:
 I believe that the tokenizer treats a dash as a token separator.  Hence,
 the only way, as I recall, to eliminate this behavior is to modify
 QueryParser.jj so it doesn't do this.  However, doing this can cause some
 other problems, like hyphenated words at a line break and the like.

It might be enough to just replace the analyzer passed in to QueryParser to get
this behavior; that works if QueryParser only handles modifiers outside terms
and passes the terms themselves to the analyzer.
I think this is the case (QueryParser does call the analyzer in a couple of
places, and one word may actually expand to a phrase or vice versa)?

Still, using a hyphen as a separator shouldn't necessarily cause big problems
as long as the indexer does the same; a query for "2 - 5" would become a
phrase query for "2 5", which is still reasonably specific (and should
match the content).
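The symmetry is easy to see without any Lucene dependency (illustrative only; a real analyzer does more, but the hyphen handling is the point):

```java
public class HyphenDemo {
    // Mimic an analyzer that treats '-' like whitespace: "2-5" becomes
    // the tokens "2" and "5" -- the same tokens on the indexing side,
    // so a phrase query for them still matches.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[\\s\\-]+");
    }

    public static void main(String[] args) {
        for (String t : tokenize("items 2-5 inclusive")) {
            System.out.println(t);
        }
    }
}
```

As long as both sides split identically, hyphenated input degrades into a phrase query rather than into a miss.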

On the other hand, SimpleAnalyzer and StandardAnalyzer have pretty different
tokenization rules, so it's important to make sure the same analyzer is used for
both indexing and searching (a mismatch there can easily prevent matches).

-+ Tatu +-







Re: Wildchar based search?? |

2003-02-01 Thread Tatu Saloranta
On Saturday 01 February 2003 00:19, Otis Gospodnetic wrote:
  1) to what extent are wildcards supported by lucenes?

 You can use * and ? the way they usually are used.

I think there was one exception: the first character of a simple term
cannot be a wildcard (this is from the query syntax page).

-+ Tatu +-






Re: Range queries

2003-01-22 Thread Tatu Saloranta
On Wednesday 22 January 2003 07:49, Erik Hatcher wrote:
 Unfortunately I don't believe date field range queries work with
 QueryParser, or at least not human-readable dates.

 Is that correct?

 I think it supports date ranges if they are turned into a numeric
 format, but no human would type that kind of query in.  I'm sure
 supporting true date range queries gets tricky with locale issues and
 such too.

Right. In my case that's OK -- the documents I'll be indexing are hybrid
documents, with some structured/plain text content plus additional metadata
(in DB-normalized form). Thus the dates (from the normalized metadata fields)
can easily be converted to numeric form and indexed (for things like
last-modified that would normally be searched via the DB).

The other part (the UI) needs more work... I either need to add a new quoting
mechanism for dates (or do that only when a certain field prefix is used), or
(more likely) the UI will use simple web forms for constructing the query.

Thanks to everyone for quick replies,

-+ Tatu +-






Re: Range queries

2003-01-22 Thread Tatu Saloranta
On Wednesday 22 January 2003 08:27, Michael Barry wrote:
 I utilize the earlier version and queries such as this work fine with
 QueryParser:

 field:[ 20030120 - 20030125 ]

 of course, the back-end indexer canonicalizes all date fields to YYYYMMDD.
 The front-end search code is responsible for canonicalizing the user-input
 dates to YYYYMMDD. I think the key here is either to not allow users to
 enter free-form dates (provide some type of UI element to enter year,
 month and day separately) or to give some copy stating that dates should
 be in YYYYMMDD format.

Thanks, this is along the lines I was thinking too.
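That canonicalization step might look like this (a hypothetical helper; the input pattern is whatever format the UI accepts):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class DateCanon {
    // Reformat a user-entered date into the yyyyMMdd form used in the index.
    static String canon(String input, String inputPattern) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat(inputPattern);
        in.setLenient(false); // reject impossible dates like 2003-13-45
        SimpleDateFormat out = new SimpleDateFormat("yyyyMMdd");
        return out.format(in.parse(input));
    }
}
```

The ParseException doubles as the validation hook: a date the UI can't canonicalize never reaches the query string.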

-+ Tatu +-







Range queries

2003-01-21 Thread Tatu Saloranta
My apologies if this is a FAQ (which is possible as I am new to Lucene;
I did try checking the web page for the answer).

I read through the query syntax web page first, and then checked the
matching query classes. It seems the query syntax page is missing some
details; the one I was wondering about is the range query. Since the query
parser seems to construct these queries, I guess they have been implemented,
even though the syntax page didn't explain them. Is that correct?

Looking at QueryParser, it seems that an inclusive range query uses [ and ],
and an exclusive one { and }? Is that right? And does it expect exactly two
arguments?
Also, am I right in assuming that the range uses lexicographic ordering, so
that it basically includes all possible words (terms) between the specified
terms (which will work OK with numbers/dates as long as they have been padded
with zeroes or such)?

Another question I have is regarding wildcard search. The page mentions a
restriction that a search term cannot start with a wildcard (as that would
render the index useless, I guess... it would need a full scan?). However, it
doesn't mention whether multiple wildcards are allowed; all the example cases
have just a single wildcard.

Sorry for the newbie questions,

-+ Tatu +-

ps. Thanks to the developers for the neat indexing engine. I am currently
evaluating it for use in a large-scale enterprise content management system.

