RE: Does the Lucene search engine work with PDF's?

2003-10-20 Thread MOYSE Gilles (Cetelem)
You can also use the TextMining.org toolbox, which provides classes to
extract text from PDF and DOC files, using the Jakarta POI project. They are
all free, under the Apache License.

The URL:
http://www.textmining.org/modules.php?op=modload&name=News&file=article&sid=6&mode=thread&order=0&thold=0
(URL tested today)

You can also try the JGuru page: http://www.jguru.com/faq/view.jsp?EID=1074237
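
For what it's worth, once you have the plain text out of the PDF or DOC file,
feeding it to Lucene is straightforward. A minimal sketch (extractText() below
is a hypothetical stand-in for whichever extractor you use, not a
TextMining.org API; the index path and field names are just examples):

  import java.io.File;
  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class DocIndexer {

      // Hypothetical placeholder: plug in your PDF/DOC extractor of choice here.
      static String extractText(File f) throws IOException {
          return "";
      }

      public static void main(String[] args) throws IOException {
          IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
          File file = new File(args[0]);

          Document doc = new Document();
          doc.add(Field.Keyword("path", file.getPath()));      // stored as-is, not tokenized
          doc.add(Field.Text("contents", extractText(file)));  // tokenized and indexed
          writer.addDocument(doc);

          writer.optimize();
          writer.close();
      }
  }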

Gilles Moyse


-----Original Message-----
From: Andre Hughes [mailto:[EMAIL PROTECTED]
Sent: Saturday, October 18, 2003 00:05
To: [EMAIL PROTECTED]
Subject: Does the Lucene search engine work with PDF's?


Hello,
Can the Lucene search engine index and search through PDF documents?
What are the file format limits for the Lucene search engine?
 
Thanks in Advance,
 
Andre'


[OT] Open Source Goes to COMDEX

2003-10-20 Thread petite_abeille
Hello,

This is pretty much off topic, but...

ZOE has been nominated as one of the candidate projects to go to the Open
Source Innovation Area on the COMDEX Exhibit Floor.

http://www.oreillynet.com/contest/comdex/

ZOE is one of the few Java projects shortlisted, and it uses Lucene
quite extensively.

Show your support by voting for ZOE :)

Cheers,

PA.

--
http://zoe.nu/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Hierarchical document

2003-10-20 Thread Tom Howe
Hi, 
I have a very hierarchical document structure where each level of the
hierarchy contains indexable information.  It looks like this:

Study -> Section -> DataFile -> Variable

The goal is to let a user execute a search at any level; the search
would include all of the information below that level in the hierarchy
and retrieve the proper aggregated document.  In other words, someone
could search for a Study using a word that appears in several DataFiles
in that study, and a single Study document would be returned.  At the
same time, someone could search for a DataFile and each of the matching
DataFile documents would be returned.  Is there a good way to do this
other than using multiple indexes?

Thanks,
Tom


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene on Windows

2003-10-20 Thread Steve Jenkins
Hi,

Wonder if anyone can help. Has anyone used Lucene on a Windows environment?
Anyone know of any documentation specifically focused on doing that? 
Or anyone know of any gotchas to avoid?

Thanks for any help,
Cheers Steve.



Re: Lucene on Windows

2003-10-20 Thread Erik Hatcher
On Monday, October 20, 2003, at 12:00  PM, Steve Jenkins wrote:
Hi,

Wonder if anyone can help. Has anyone used Lucene on a Windows 
environment?
Anyone know of any documentation specifically focused on doing that?
Or anyone know of any gotchas to avoid?
Yup, used Lucene on Windows lots.  Is there a specific issue you feel
is Windows-related?  It's pure Java and works the same on all supported
platforms, so no real gotchas with respect to Windows.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Does the Lucene search engine work with PDF's?

2003-10-20 Thread Konrad Kolosowski

Return Receipt

Your document "Does the Lucene search engine work with PDF's?" was received
by Konrad Kolosowski/Toronto/IBM at 10/20/2003 12:15:25.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene on Windows

2003-10-20 Thread Otis Gospodnetic
The CVS version of Lucene has a patch that allows one to use a
'Compound Index' instead of the traditional one.  This reduces the
number of open files.  For more info, see/make the Javadocs for
IndexWriter.
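
For illustration, assuming the patch exposes the switch the way the Javadocs
describe (a setUseCompoundFile() method on IndexWriter, plus the public
mergeFactor field from the released versions - check your build), turning it
on looks roughly like this; mergeFactor, which Tate mentions below, is the
other knob that controls how many files stay open:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
  writer.setUseCompoundFile(true);  // pack each segment's files into one compound file,
                                    // cutting down the number of open file handles
  writer.mergeFactor = 10;          // the default; raising it speeds indexing but
                                    // multiplies the files held open during merges
  // ... addDocument() calls ...
  writer.close();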

Otis

--- Tate Avery [EMAIL PROTECTED] wrote:
 
 You might have trouble with too many open files if you set your
 mergeFactor too high.  For example, on my Win2k box I can go up to
 mergeFactor=300 (or so); at 400 I get a "too many open files" error.
 Note: the default mergeFactor of 10 should give no trouble.
 
 FYI - on my Linux box I got the "too many open files" error at
 mergeFactor=300 (and 200), so I am using 100.
 
 
 Tate
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: October 20, 2003 12:11 PM
 To: Lucene Users List
 Subject: Re: Lucene on Windows
 
 
 
 On Monday, October 20, 2003, at 12:00  PM, Steve Jenkins wrote:
  Hi,
 
  Wonder if anyone can help. Has anyone used Lucene on a Windows 
  environment?
  Anyone know of any documentation specifically focused on doing
 that?
  Or anyone know of any gotchas to avoid?
 
 Yup, used Lucene on Windows lots.  Is there a specific issue you feel
 is Windows-related?  It's pure Java and works the same on all supported
 platforms, so no real gotchas with respect to Windows.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hierarchical document

2003-10-20 Thread Erik Hatcher
On Monday, October 20, 2003, at 11:06  AM, Tom Howe wrote:
contain Section and Study information and then, if a user wants a set of
Study documents, just aggregate them after the search by hand or is
there a more lucene way of doing this?  I'm trying to avoid storing
too much redundant information to implement this kind of hierarchical
structure, but that may not be possible.  I hope I'm being somewhat
clear with my question.
There is not a more "Lucene way" to do this - it's really up to you to
be creative with this.  I'm sure there are folks who have implemented
something along these lines on top of Lucene.  In fact, I have a
particular interest in doing so at some point myself.  This is very
similar to the object-relational issues surrounding relational
databases - turning a pretty flat structure into an object graph.
There are several ideas that could be explored by playing tricks with
fields, such as giving them a hierarchical naming structure and
querying at the level you like (think Field.Keyword and PrefixQuery,
for example), or using a field to indicate type and narrowing queries
to documents of the desired type - see the sketch below.
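
To make that concrete, here is one rough sketch of the "hierarchical path
plus type field" trick - the path scheme and field names are made up for the
example, and the writer/analyzer/searcher objects are assumed to exist already:

  // Indexing: give every document its full path in a Keyword field plus its level.
  Document doc = new Document();
  doc.add(Field.Keyword("path", "study1/section2/datafile7"));  // untokenized hierarchy key
  doc.add(Field.Keyword("type", "datafile"));                   // level of this document
  doc.add(Field.Text("contents", "... variable descriptions ..."));
  writer.addDocument(doc);

  // Searching: restrict a content query to one subtree and one level.
  BooleanQuery query = new BooleanQuery();
  query.add(QueryParser.parse("blood pressure", "contents", analyzer), true, false);
  query.add(new PrefixQuery(new Term("path", "study1/")), true, false);  // required: subtree
  query.add(new TermQuery(new Term("type", "datafile")), true, false);   // required: level
  Hits hits = searcher.search(query);

Aggregating up a level (e.g. returning one Study hit for matches spread across
its DataFiles) would still be post-processing on your side, as described above.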

I'm interested to see what others have done in this area, or what ideas 
emerge about how to accomplish this.

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-20 Thread Erik Hatcher
On Wednesday, October 15, 2003, at 10:24  AM, Michael Giles wrote:
So how do we move this issue forward?  I can't think of a single case
where a - with no whitespace on either side (e.g. t-shirt, Wal-Mart)
should be interpreted as a NOT command.  Is there a feeling that
changing the interpretation of such cases is a break in compatibility?
I agree that it will change behavior, but I think that it will change
it for the better (i.e. fix it).  The current behavior is really
broken (and very frustrating for a user trying to search).
I looked at the patch here:

	http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23838

I'm not entirely satisfied with it.  I'm of the opinion that we should 
only change QueryParser to fix the behavior of operators nestled within 
text with no surrounding whitespace.  The provided patch only works 
with the - character, but what about Wal+Mart?  Shouldn't we keep 
that together also and hand it to the analyzer?

I'm not convinced at all that we should change the StandardTokenizer to
stop splitting on the dash.  If only QueryParser were fixed and handed
Wal-Mart to the StandardAnalyzer, the term would be split the same way as
during indexing, and searches would return the expected hits.
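
For anyone who wants to see the mismatch concretely, here is a quick sketch
(plain released API, nothing from the patch) that prints what QueryParser
builds for Wal-Mart next to what StandardAnalyzer alone does with the same
string:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class DashCheck {
      public static void main(String[] args) throws Exception {
          StandardAnalyzer analyzer = new StandardAnalyzer();

          // What the query parser makes of the raw string (the dash becomes an operator).
          Query q = QueryParser.parse("Wal-Mart", "contents", analyzer);
          System.out.println("QueryParser: " + q.toString("contents"));

          // What the analyzer alone does with it - i.e. what indexing saw.
          TokenStream stream = analyzer.tokenStream("contents", new StringReader("Wal-Mart"));
          for (Token t = stream.next(); t != null; t = stream.next()) {
              System.out.println("Analyzer token: " + t.termText());
          }
      }
  }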

Thoughts?  I'd like to see this fixed, but in a way that makes the most 
general sense.

Thanks,
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


positional token info

2003-10-20 Thread Erik Hatcher
Is anyone doing anything interesting with the 
Token.setPositionIncrement during analysis?

Just for fun, I've written a simple stop filter that bumps the position 
increments to account for the stop words removed:

  public final Token next() throws IOException {
    int increment = 0;   // number of stop words skipped since the last emitted token

    for (Token token = input.next(); token != null; token = input.next()) {

      // Not a stop word: widen its position increment by the number of
      // dropped tokens so the gap they left behind is preserved.
      if (table.get(token.termText()) == null) {
        token.setPositionIncrement(token.getPositionIncrement() + increment);
        return token;
      }

      // Stop word: swallow it, but remember the hole it leaves.
      increment++;
    }
    return null;
  }
But it's practically impossible to formulate a Query that can take
advantage of this.  A PhraseQuery only works with a slop factor, because
Terms don't carry positional info (only the transient tokens do), and
slop doesn't guarantee the exact match I'm after.  A PhrasePrefixQuery
won't work any better, as there is no way to add a blank term to
indicate a missing position.
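
For reference, the closest I can get with the stock API is a sloppy
PhraseQuery; a minimal sketch (field name and terms are only examples, and
the searcher is assumed to exist):

  // Looking for "quick fox" across a position where a stop word was removed.
  PhraseQuery query = new PhraseQuery();
  query.add(new Term("contents", "quick"));
  query.add(new Term("contents", "fox"));
  query.setSlop(1);  // tolerates the gap left by the dropped stop word, but any
                     // match within the slop qualifies, so it cannot express
                     // "exactly one position apart"
  Hits hits = searcher.search(query);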

I certainly see the benefit of putting tokens into zero-increment
positions, but are increments of 2 or more at all useful?  If so, how
are folks using them?

Thanks,
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Hierarchical document

2003-10-20 Thread Tatu Saloranta
On Monday 20 October 2003 16:41, Erik Hatcher wrote:
 One more thought related to this subject - once a nice scheme for
 representing hierarchies within a Lucene index emerges, having XPath as
 a query language would rock!  Has anyone implemented O/R or XPath-like
 query expressions on top of Lucene?

Not me... but at some point I think I briefly mentioned that someone with
extra time might want to write a very simple JDBC driver to be used with
Lucene. Obviously it would be very minimal for queries (and might need
to invent new SQL operators for some searches), but it could also expose
metadata about the index. It should be an interesting exercise at least. :-)
Plus, if done properly, tools like DBVis could be used for simple Lucene
testing as well.

If so, who knows; perhaps that would make it even easier to do prototype
implementations of Lucene replacing home-grown SQL-bound search
functionality in apps.

Most of the above would just be a nice little hack, though. :-)

-+ Tatu +-



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]