RE: Summarization; sentence-level and document-level filters.

2003-12-16 Thread Gregor Heinrich
Yes, copying a summary from one field to an untokenized field was the plan.

I identified DocumentWriter.invertDocument() as a possible place to add
this document-level analysis. But I admit this appears far too low-level and
inflexible for the overall design.

So I'll make it two-pass indexing.
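
For the record, here is a minimal sketch of what I mean by two-pass indexing:
pass one runs the body text through the Analyzer once to build the summary,
pass two adds the document with that summary in a stored, untokenized field.
This assumes an open IndexWriter and Analyzer; the helper buildSummary() and
the field names are placeholders, not existing Lucene API:

// Pass 1: tokenize the body once and build the summary from the token stream.
String body = ...; // raw document text
String summary = buildSummary(analyzer.tokenStream("body", new StringReader(body)));

// Pass 2: index the document; the summary is stored as-is, neither tokenized nor indexed.
Document doc = new Document();
doc.add(Field.Text("body", body));             // tokenized, indexed, stored
doc.add(Field.UnIndexed("summary", summary));  // stored only
writer.addDocument(doc);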

Thanks for the decision support,

gregor

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 6:57 PM
To: Lucene Users List
Subject: Re: Summarization; sentence-level and document-level filters.


It sounds like you want the value of a stored field (a summary) to be
built from the tokens of another field of the same document.  Is that
right?  This is not presently possible without tokenizing the field
twice, once to produce its summary and once again when indexing.

Doug

Gregor Heinrich wrote:
 Hi,

 is there any possibility to do sentence-level or document-level analysis
 with the current Analysis/TokenStream architecture? Or where else is the
 best place to plug in customised document-level and sentence-level analysis
 features? Is there any precedent for this?

 My technical problem:

 I'd like to include a summarization feature into my system, which should (1)
 best make use of the architecture already there in Lucene, and (2) should be
 able to trigger summarization on a per-document basis while requiring
 sentence-level information, such as full stops and commas. To preserve this
 punctuation, a special Tokenizer can be used that outputs such landmarks
 as tokens instead of filtering them out. The actual SummaryFilter then
 filters out the punctuation for its successors in the Analyzer's filter
 chain.

 The other, more complex thing is the document-level information: As Lucene's
 architecture uses a filter concept that does not know about the document the
 tokens are generated from (which is good abstraction), a document-specific
 operation like summarization is a bit of an awkward thing with this (and
 originally not intended, I guess). On the other hand, I'd like to have the
 existing filter structure in place for preprocessing of the input, because
 my raw texts are generated by converters from other formats that output
 unwanted chars (from figures, page numbers, etc.), which are filtered out
 anyway by my custom Analyzer.

 Any idea how to solve this second problem? Is there any support for such
 document / sentence structure analysis planned?

 Thanks and regards,

 Gregor






RE: Summarization; sentence-level and document-level filters.

2003-12-16 Thread Gregor Heinrich
Maurits: thanks for the hint about Classifier4J -- I have had a look at this
package and tried the SimpleSummarizer, and it seems to work fine. (However,
as I don't know the benchmarks for summarization, I'm not the one to judge.)

Do you have experience with it?

Gregor

-Original Message-
From: maurits van wijland [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 1:09 AM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Summarization; sentence-level and document-level filters.


Hi Gregor,

So far as I know there is no summarizer in the plans. But maybe I can help
you along the way: have a look
at the Classifier4J project on SourceForge.

http://classifier4j.sourceforge.net/

It has a small document summarizer besides a Bayes classifier. It might speed
up your coding.
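
For what it's worth, usage is roughly like the following. I'm writing the
class and method names (SimpleSummariser, summarise) from memory, so treat
them as assumptions and check them against the Classifier4J docs:

// Summarize a document body down to its 3 most relevant sentences (API names assumed).
SimpleSummariser summariser = new SimpleSummariser();
String summary = summariser.summarise(documentText, 3);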

On the level of Lucene, I have no idea. My gut feeling says that a summary
should be built before the
text is tokenized! The tokenizer can of course be used when analysing a
document, but hooking into
the Lucene indexing is a bad idea, I think.

Does anyone else have any ideas?

regards,

Maurits




- Original Message -
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 7:41 PM
Subject: Summarization; sentence-level and document-level filters.


 Hi,

 is there any possibility to do sentence-level or document-level analysis
 with the current Analysis/TokenStream architecture? Or where else is the
 best place to plug in customised document-level and sentence-level analysis
 features? Is there any precedent for this?

 My technical problem:

 I'd like to include a summarization feature into my system, which should (1)
 best make use of the architecture already there in Lucene, and (2) should be
 able to trigger summarization on a per-document basis while requiring
 sentence-level information, such as full stops and commas. To preserve this
 punctuation, a special Tokenizer can be used that outputs such landmarks
 as tokens instead of filtering them out. The actual SummaryFilter then
 filters out the punctuation for its successors in the Analyzer's filter
 chain.

 The other, more complex thing is the document-level information: As Lucene's
 architecture uses a filter concept that does not know about the document the
 tokens are generated from (which is good abstraction), a document-specific
 operation like summarization is a bit of an awkward thing with this (and
 originally not intended, I guess). On the other hand, I'd like to have the
 existing filter structure in place for preprocessing of the input, because
 my raw texts are generated by converters from other formats that output
 unwanted chars (from figures, page numbers, etc.), which are filtered out
 anyway by my custom Analyzer.

 Any idea how to solve this second problem? Is there any support for such
 document / sentence structure analysis planned?

 Thanks and regards,

 Gregor






RE: Lucene and Mysql

2003-12-16 Thread Gregor Heinrich
Hi.

You read all the relevant fields out of MySQL and assign the primary key
as the identifier of your Lucene documents.

During search, you retrieve the identifier from the Lucene searcher and
query the database to present the full text.
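
Roughly, with the Lucene 1.3-era API (the table, column and field names here
are just placeholders, and error handling is omitted):

// Indexing: one Lucene document per MySQL row, keyed by the primary key.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
while (rs.next()) {
    Document doc = new Document();
    doc.add(Field.Keyword("id", rs.getString("id")));  // stored, not tokenized
    doc.add(Field.UnStored("contents",
            rs.getString("title") + " " + rs.getString("body"))); // indexed, not stored
    writer.addDocument(doc);
}
writer.optimize();
writer.close();

// Searching: get the ids from Lucene, then fetch the full rows from MySQL.
IndexSearcher searcher = new IndexSearcher("/path/to/index");
Hits hits = searcher.search(QueryParser.parse(userInput, "contents", new StandardAnalyzer()));
for (int i = 0; i < hits.length(); i++) {
    String id = hits.doc(i).get("id");
    // SELECT ... FROM articles WHERE id = ? and display the full text
}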

Best regards,

Gregor



-Original Message-
From: Stefan Trcko [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:31 PM
To: [EMAIL PROTECTED]
Subject: Lucene and Mysql


Hello

I'm new to Lucene. I want users to be able to search text that is stored in a
MySQL database.
Is there any tutorial on how to implement this kind of search feature?

Best regards,
Stefan





RE: Word Documents

2003-12-15 Thread Gregor Heinrich
Hi,

that's great info. In fact, I didn't check for fast-saving yet. So I'll
probably go ahead and have a try later...

Good luck for POI,

Gregor



-Original Message-
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 3:35 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Word Documents


I have written a library located at http://textmining.org that will extract
text from Word documents. I am the author of the Word library in POI, btw.
This is just a lightweight version, because I got sick of everyone asking how
to extract text from a Word document. If it doesn't work, it's because the
document is *not* from Word 97 or later, or the file was fast-saved.
Every time somebody has problems they send me their files and they turn out
to be RTF or Word 95 documents. You can check the format by opening the file
in Word and then going to Save As. The format of the document will be shown
in the "Save as type" dropdown, at least in my version of Word.

-Ryan

- Original Message -
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 9:19 AM
Subject: RE: Word Documents


 Hi,

 we had some problems using the POI Word filter. In one document set,
 everything would work fine, in another more than 50% of the documents refused
 to work with it (did not index). I am not an OLE2 pro and cannot see any
 apparent difference in the documents between the different sets. The version
 used was Word 97 in almost all the docs. For the moment, I switched to a
 native converter (that does not process metadata and must be run using
 Runtime.exec(), though) until I have time to revisit the problem.

 I do not want to advise against the POI filters, it's a very cool idea. Please
 do try your particular document set with it. For a quick test, you can use
 the Docco personal search tool by Peter Becker and colleagues (available
 from SourceForge). It has a current version of POI included as a plugin and
 Lucene running as indexing backend. So you don't have to write code to get
 answers...

 Cheers, gregor

 -Original Message-
 From: Pleasant, Tracy [mailto:[EMAIL PROTECTED]
 Sent: Monday, December 15, 2003 2:58 PM
 To: Lucene Users List
 Subject: Word Documents


 As a spinoff, I was wondering if anyone has been happy with indexing and
 searching Word docs. What about reading the contents? Any problems?


 -Original Message-
 From: Ryan Ackley [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 12, 2003 5:59 PM
 To: Zhou, Oliver; Lucene Users List
 Subject: Re: textmining: document title


 Check out Jakarta POI (http://jakarta.apache.org/poi), particularly the HPSF
 API. It allows you to extract metadata like Title, Author, etc. from OLE
 documents.

 -Ryan

 - Original Message -
 From: Zhou, Oliver [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Friday, December 12, 2003 5:26 PM
 Subject: textmining: document title


  Ryan,
 
  I'm using textmining and Lucene to index Word documents but don't know how
  to get the Word document title.  Your advice on this matter is appreciated.
 
  Thanks,
  Oliver Zhou
 
 













Summarization; sentence-level and document-level filters.

2003-12-15 Thread Gregor Heinrich
Hi,

is there any possibility to do sentence-level or document-level analysis
with the current Analysis/TokenStream architecture? Or where else is the
best place to plug in customised document-level and sentence-level analysis
features? Is there any precedent for this?

My technical problem:

I'd like to include a summarization feature into my system, which should (1)
best make use of the architecture already there in Lucene, and (2) should be
able to trigger summarization on a per-document basis while requiring
sentence-level information, such as full stops and commas. To preserve this
punctuation, a special Tokenizer can be used that outputs such landmarks
as tokens instead of filtering them out. The actual SummaryFilter then
filters out the punctuation for its successors in the Analyzer's filter
chain.

The other, more complex thing is the document-level information: As Lucene's
architecture uses a filter concept that does not know about the document the
tokens are generated from (which is good abstraction), a document-specific
operation like summarization is a bit of an awkward thing with this (and
originally not intended, I guess). On the other hand, I'd like to have the
existing filter structure in place for preprocessing of the input, because
my raw texts are generated by converters from other formats that output
unwanted chars (from figures, page numbers, etc.), which are filtered out
anyway by my custom Analyzer.

Any idea how to solve this second problem? Is there any support for such
document / sentence structure analysis planned?

Thanks and regards,

Gregor






RE: Disabling modifiers?

2003-12-15 Thread Gregor Heinrich
If you don't want to fiddle with the JavaCC source of QueryParser.jj, you
could work with regular expressions that run in front of the actual query
parser. I just did something similar because I feed Lucene's query strings
into a latent semantic analysis algorithm and remove words with + and ?
wildcards, boosting modifiers, as well as NOT and - clauses and groupings.
Such as:

/**
 *  exclude words that have these modifiers
 */
public final String excludeWildcards = "\\w+\\+|\\w+\\?";
/**
 *  remove these operators
 */
public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";
/**
 *  remove these modifiers
 */
public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";
/**
 *  exclude phrases that have these modifiers
 */
public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\"";

/**
 * remove any groupings
 */
public final String removeGrouping = "[\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and
can re-use the compiled patterns:

excludeWildcardsPattern = Pattern.compile(excludeWildcards);

lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20-minute approach does not recognise
nested parentheses with NOT or -, i.e., for the term
"NOT ((a OR b) AND (c OR d))" it will only remove "NOT ((a OR b)",
and "c d" will still be in the output
query.
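
For completeness, the patterns are applied one after another, roughly like
this (sketched; the exact order isn't critical, except that the NOT/phrase
exclusion has to run before groupings and operators are stripped):

// Strip Lucene query syntax before handing the string to the LSA side.
String lsaQ = q;
lsaQ = Pattern.compile(excludeNot).matcher(lsaQ).replaceAll("");
lsaQ = Pattern.compile(excludeWildcards).matcher(lsaQ).replaceAll("");
lsaQ = Pattern.compile(removeOperators).matcher(lsaQ).replaceAll("");
lsaQ = Pattern.compile(removeModifiers).matcher(lsaQ).replaceAll("");
lsaQ = Pattern.compile(removeGrouping).matcher(lsaQ).replaceAll("");
lsaQ = lsaQ.trim();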

Best regards,

Gregor

-Original Message-
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 6:13 PM
To: Lucene mailing list (E-mail)
Subject: Disabling modifiers?


A quick question. Is there any way to disable the - and + modifiers in the
QueryParser? I'm trying to use Lucene to provide indexing of COBOL source
code, and allow me to highlight matches when the code is displayed. In COBOL
you can have variable names such as DISP-NAME and WS-DATE-1 for example.
Unfortunately the query parser interprets the - signs as modifiers and so
the query does not do what is required.

I've had a bit of success by putting quotes around the offending names (as
suggested on this list), but the results are still less than satisfactory
(it removes the NOT from the query, but still treats DISP and NAME as two
separate words rather than one word, and so the results are not quite
correct).

Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain





RE: Docco 0.2 / contribution offer

2003-09-02 Thread Gregor Heinrich
Hi Peter.

Docco is a great tool which I have been using since you posted your first
announcement (version 1.0, that is). Besides the things you mention in your
mail, I also generally think it's a great idea to use formal concept
analysis with Lucene. I would be interested to explore the idea also for
more structured data (maybe including fields and even hierarchies).

Apart from this, if I had an idea of the time commitment involved, I would
definitely consider joining.

Best,

Gregor



-Original Message-
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 1:52 PM
To: Lucene Users List
Subject: ANN: Docco 0.2 / contribution offer


Hi all,

we finally finished the 0.2 release of our little personal document
management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source
contains some infrastructure for document handlers and index management.
The document handlers are written with a very simple API, which just
asks the implementation to fill a structure with the information
retrieved from a URL. It is similar to the Ant task in the Lucene
sandbox, but it separates the information collection from the actual
indexing, i.e., all the decisions about what should be stored and what shouldn't.

The program comes with implementations for plain text, HTML (based on
Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
wrote plugins for POI, PDFbox and Multivalent. The latter is
unfortunately a wild hack since Multivalent is the worst Java code I've
seen. Literally. Bad C written in Java. The tool would be nice to use,
but catching exceptions in little helper classes to do a System.exit is
just insane. And that is just one of the problems -- we had to do some
bad hacks to fix these issues. The other implementations should be fine,
although they need some more testing.

The source (including all required libs) of the program is available via
Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called docco. A current snapshot of only the
source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)


The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and
implementations
  org.tockit.docco.filefilter: some code to pick document handlers via
file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management:
runnable, framework for handlers

The index management is probably not optimal; I strongly suspect that an
expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there
is interest, or to turn it into a project of its own; we don't think it
should be hidden in our more specific program. It should be easy to
merge it with the Ant task, and we are happy to give a hand if wanted.
Adding some documentation would be easy, too -- at the moment the code
is still more for ourselves, but it should be very readable by itself. We
require JDK 1.4, but this requirement can be reduced by moving some more
document handlers into plugins.

Is anyone interested in joining in to maintain this code? Any feedback is
welcome.

Cheers,
   Peter





RE: Newbie Questions

2003-08-26 Thread Gregor Heinrich
Hi Mark,

short answers to your questions:

ad 1: MultiFieldQueryParser is what you might want: you can specify the
fields to run the query on. Alternatively, the practice of duplicating the
contents of all separate fields in question into one additional merged field
has been suggested, which enables you to use QueryParser itself.
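
For instance (a sketch; replace the field names with the ones in your index):

// Run the same query string against several fields at once.
String[] fields = { "title", "abstract", "contents" };
Query query = MultiFieldQueryParser.parse("cancer", fields, new StandardAnalyzer());
Hits hits = searcher.search(query);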

ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
stemmed (suffixes are removed from words) and stopword-filtered (highly
frequent words are removed). Have a look at StandardAnalyzer.tokenStream(...)
to see how the different filters work. In the analysis package, the 1.3rc2
Lucene distribution has a Porter stemming algorithm: PorterStemmer.
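
A minimal sketch of a custom Analyzer that adds Porter stemming (via the
PorterStemFilter that wraps PorterStemmer); note that the same analyzer must
then be used for both indexing and query parsing so the stemmed terms match:

// Lowercases, removes English stopwords and stems with the Porter algorithm.
public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        result = new PorterStemFilter(result);
        return result;
    }
}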

Have fun,

Gregor

-Original Message-
From: Mark Woon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 26, 2003 6:54 AM
To: [EMAIL PROTECTED]
Subject: Newbie Questions


Hi all...

I've been playing with Lucene for a couple days now and I have a couple
questions I'm hoping some one can help me with.  I've created a Lucene
index with data from a database that's in several different fields, and
I want to set up a web page where users can search the index.  Ideally,
all searches should be as google-like as possible.  In Lucene terms, I
guess this means the query should be fuzzy.  For example, if someone
searches for "cancer" then I'd like to get back all results with any form
of the word cancer in the term ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time?  The QueryParser seems
to only search one specific field.

2) How can I automatically default all searches into fuzzy mode?  I
don't want my users to have to know that they must add a ~ at the end
of all their terms.

Thanks,
-Mark







RE: Similar Document Search

2003-08-26 Thread Gregor Heinrich
Hi Terry,

the suggestion of Haystack's modified Lucene was a hint to give you an
additional alternative for reaching your goal.

Depending on your definition of the notion "similar document", this solution
does or does not make sense. My definition of similar document (and term) is
maybe more general than yours: it supports rather generic similarity metrics
and needs to cover cosine similarity according to the vector-space model (VSM;
can be achieved using unmodified Lucene code), semantic similarity according
to a generative model like latent semantic indexing or Bayesian approaches,
and even semantic similarity according to a taxonomy. If you want such
flexibility (like I do for my research), you should consider this approach,
because you can relatively easily work on the forward document vectors.

If all you need is vanilla VSM cosine similarity, you are probably best off
with the suggestion that was sent on this list: submit the document
content as the query and run it through the same Analyzer that was used to
create the index, thus finding best matches using Lucene's standard matching
scheme.
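
A rough sketch of that last approach -- a naive "more documents like this
one" -- assuming an open IndexSearcher and Analyzer and a stored field called
"contents" (field name and clause limit are placeholders):

// Build an optional-terms query from the stored text of a selected document.
Document seed = searcher.doc(seedDocId);
TokenStream tokens = analyzer.tokenStream("contents", new StringReader(seed.get("contents")));
BooleanQuery query = new BooleanQuery();
Token token;
int clauses = 0;
while ((token = tokens.next()) != null && clauses < 100) {
    // every term is optional (not required, not prohibited): documents sharing
    // many terms with the seed document score highest
    query.add(new TermQuery(new Term("contents", token.termText())), false, false);
    clauses++;
}
Hits similar = searcher.search(query);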

Good luck,

Gregor





-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 2:54 PM
To: Lucene Users List
Subject: Re: Similar Document Search


Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are similar to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

- Original Message -
From: Peter Becker [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search


 Hi all,

 it seems there are quite a few people looking for similar features, i.e.
 (a) document identity and (b) forward indexing. So far we hacked (a) by
 using a wrapper implementing equals/hashcode based on a unique field,
 but of course that assumes maintaining a unique field in the index. (b)
 is something we haven't tackled yet, but plan to.

 The source code for Mark's thesis seems to be part of the Haystack
 distribution. The comments in the files put it under the Apache license. This
 seems to make it a good candidate to be included at least in the Lucene
 sandbox -- although I haven't tried it myself yet. But it sounds like a
 good candidate for us to use.

 Since the haystack source is a bit larger and I actually couldn't get
 the download at the moment, here is a copy of the relevant bit grabbed
 from one of my colleague's machines:

   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)

 Note that this is just a tarball of src/org/apache/lucene out of some
 Haystack source. Untested, unmodified.

 I'd love to see something like this supported in the Lucene context where
 people might actually find it :-)

   Peter


 Gregor Heinrich wrote:

 Hello Terry,
 
 Lucene can do forward indexing, as Mark Rosen outlines in his Master's
 thesis: http://citeseer.nj.nec.com/rosen03email.html.
 
 We use a similar approach for (probabilistic) latent semantic analysis and
 vector space searches. However, the solution is not really completely fixed
 yet, therefore no code at this time...
 
 Best regards,
 
 Gregor
 
 
 
 
 -Original Message-
 From: Peter Becker [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, August 19, 2003 3:06 AM
 To: Lucene Users List
 Subject: Re: Similar Document Search
 
 
 Hi Terry,
 
 we have been thinking about the same problem and in the end we decided
 that most likely the only good solution to this is to keep a
 non-inverted index, i.e. a map from the documents to the terms. Then you
 can query the most terms for the documents and query other documents
 matching parts of this (where you get the usual question of what is
 actually interesting: high frequency, low frequency or the mid range).
 
 Indexing would probably be quite expensive since Lucene doesn't seem to
 support changes in the index, and the index for the terms would change
 all the time. We haven't implemented it yet, but it shouldn't be hard to
 code. I just wouldn't expect good performance when indexing large
 collections.
 
   Peter
 
 
 Terry Steichen wrote:
 
 
 
  Is it possible without extensive additional coding to use Lucene to conduct
  a search based on a document rather

RE: Newbie Questions

2003-08-26 Thread Gregor Heinrich
Hi Mark.

Sorry, it's really rc1 that is out. But if you go to the CVS server, you'll
find the rc2-dev version.

Multiple calls to Document.add() with the same field name result in their
text being treated as though appended, for the purposes of search (API doc).

Can you try out whether there's a difference between the cases you mention? I
don't know, but I'd be interested as well ;-).

Gregor




-Original Message-
From: Mark Woon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 26, 2003 8:52 PM
To: Lucene Users List
Subject: Re: Newbie Questions


Gregor Heinrich wrote:

 ad 1: MultiFieldQueryParser is what you might want: you can specify the
 fields to run the query on. Alternatively, the practice of duplicating
 the
 contents of all separate fields in question into one additional merged
 field
 has been suggested, which enables you to use QueryParser itself.


Ah, I've been testing out something similar to the latter.  I've been
adding multiple values under the same key.  Won't this have the same
effect?  I've been assuming that if I do

doc.add(Field.Keyword("content", "value1"));
doc.add(Field.Keyword("content", "value2"));

and then search the "content" field for either value, I'd get a hit,
and it seems to work.  This way, I figure I'd be able to differentiate
between values that I want tokenized and values that I don't.

Is there a difference between this and building a StringBuffer
containing all the values and storing that as a single field value?


 ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
 stemmed (remove suffices from words) and stopword-filtered (remove highly
 frequent words). Have a look at StandardAnalyzer.tokenStream(...) to
 see how
 the different filters work. In the analysis package the 1.3rc2 Lucene
 distribution has a Porter stemming algorithm: PorterStemmer.


There's an rc2 out?  Where??  I just checked the Lucene website and only
see rc1.


Thanks everyone for all the quick responses!

-Mark






RE: Similar Document Search

2003-08-20 Thread Gregor Heinrich
Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's
thesis: http://citeseer.nj.nec.com/rosen03email.html.

We use a similar approach for (probabilistic) latent semantic analysis and
vector space searches. However, the solution is not really completely fixed
yet, therefore no code at this time...

Best regards,

Gregor




-Original Message-
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search


Hi Terry,

we have been thinking about the same problem and in the end we decided
that most likely the only good solution to this is to keep a
non-inverted index, i.e. a map from the documents to the terms. Then you
can query the most terms for the documents and query other documents
matching parts of this (where you get the usual question of what is
actually interesting: high frequency, low frequency or the mid range).

Indexing would probably be quite expensive since Lucene doesn't seem to
support changes in the index, and the index for the terms would change
all the time. We haven't implemented it yet, but it shouldn't be hard to
code. I just wouldn't expect good performance when indexing large
collections.

  Peter


Terry Steichen wrote:

Is it possible without extensive additional coding to use Lucene to conduct
a search based on a document rather than a query?  (One use of this would be
to refine a search by selecting one of the hits returned from the initial
query and subsequently retrieving other documents like the selected one.)

Regards,

Terry









RE: Lucene as a high-performance RDF database.

2003-08-11 Thread Gregor Heinrich
Hi Kevin,

your idea could work for higher megabyte ranges, I guess; I don't know about
several TBytes.

We have been considering a concept for using Lucene as an RDF backend for a
semantic search engine, because of its reported excellent scalability, on
the order of tens of megs. The idea was similar to yours, but we thought
of using some index extension to introduce the class / property hierarchies
(i.e., RDF Schema) and make them searchable via cascaded index lookups.

Didn't have the time, though, to test it, but would be grateful if you could
comment.

Here are the fields; in a draft with three index parts it's something like:

node (unique)
clss (class in schema)
prop (position-ordered)
prwt (a scalar value, weighting the relation or 1, position-ordered)
rsrc (resource, position-ordered)

and for the ontology itself:

clss
spcl (superclass, multi-inheritance)

and

prop (property)
sprp (super-property, multi-inheritance)
domn (domain)
rnge (range)

Best regards,

gregor



-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Monday, August 11, 2003 12:33 AM
To: [EMAIL PROTECTED]
Subject: Lucene as a high-performance RDF database.


I have been giving some thought to using Lucene as an RDF database.
I'm specifically thinking about the RDF model and not the RDF syntax.

Essentially this would just comprise triples encoded in a document as
fields.

So for example we would have subject predicate and object relationships
as document fields.  Subject and predicates would be Tokens and then the
object field would be indexed.

For example a triple (document) would be:

http://jakarta.apache.org - title - A great Java developer's website

This would be just one document in the index.
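
In Lucene terms I imagine something like this (just a sketch, assuming an
open IndexWriter/IndexSearcher; the field names are arbitrary):

// One RDF statement stored as one small Lucene document.
Document triple = new Document();
triple.add(Field.Keyword("subject", "http://jakarta.apache.org"));      // not tokenized
triple.add(Field.Keyword("predicate", "title"));                        // not tokenized
triple.add(Field.Text("object", "A great Java developer's website"));   // tokenized, full-text searchable
writer.addDocument(triple);

// Later: full-text query on the object field, then read back the subjects.
Hits hits = searcher.search(QueryParser.parse("Java", "object", new StandardAnalyzer()));
String subject = hits.doc(0).get("subject");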

This would have a lot of advantages most importantly speed and the
reliability of Lucene and the ability to run a full text query on objects.

For example we could query on "Java" and get back
"http://jakarta.apache.org".

The major downside I could see is that this would mean that we would be
indexing a LOT of small documents with a LOT of index updates.

Can anyone see any problems here?  This database will eventually grow to
around 2TB in the next month so performance issues are non-trivial.

Most people have deployed Lucene with large document sizes and the fact
that most people are citing document COUNT makes me nervous.

Kevin






Multiple fields identical terms.

2003-07-30 Thread Gregor Heinrich
Hi everyone,

my index has a title and an abstract field, both inverted and tokenized.

I would like to have unique term texts in my term enumeration. That is,
across all fields there should be no duplicate term text.

An easy solution would be to only use one field.

But does someone know an alternative way with multiple fields?

Best regards,

Gregor





RE: Multiple fields identical terms.

2003-07-30 Thread Gregor Heinrich
Hi.

Thanks for your suggestion; I think the storage overhead is bearable.

Actually I am doing some sort of forward indexing in addition to the
inverted index, i.e., the result will be a meta-search engine that combines
the Lucene IR process proper with an aspect model similar to Latent Semantic
Analysis. To store the forward index, it's necessary to create a
term-document matrix where the terms should all be unique regardless of the
field. This kind of vector-space indexing could also be useful for other
purposes such as document classification.

One idea is to run an additional Hashtable that checks for uniqueness and
attaches additional information to a term, such as its phonetic encoding or
its catalogization key. But I wanted to use as much of the existing
infrastructure as possible and stay compatible.
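
For what it's worth, a sketch of what that uniqueness check amounts to when
run over an existing index: walk the term dictionary and keep each term text
only once, regardless of field (any additional per-term data would hang off
the map entries):

// Collect unique term texts across all fields of an index.
IndexReader reader = IndexReader.open("/path/to/index");
Set uniqueTerms = new HashSet();
TermEnum terms = reader.terms();
while (terms.next()) {
    Term t = terms.term();
    uniqueTerms.add(t.text());   // drop the field; t.field() would tell which one it came from
}
terms.close();
reader.close();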

I also thought of changing the way fields and terms are allocated to
each other, i.e., allowing a list of fields in each Term object and thus
making term texts unique. But this would cause a substantial re-design of the
index file and access structure...

Gregor



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 30, 2003 2:40 PM
To: Lucene Users List
Subject: Re: Multiple fields identical terms.


On Wednesday, July 30, 2003, at 06:16  AM, Gregor Heinrich wrote:
 I would like to have unique term texts in my term enumeration. That is,
 across all fields there should be no duplicate term text.

 An easy solution would be to only use one field.

 But does someone know an alternative way with multiple fields?

What about putting both abstract and title together into a single new
field called "keywords"?  Leave title and abstract there as well, but
just append the two strings together (with a space in the middle to
tokenize properly! :).

Is that a reasonable alternative?  What are you trying to accomplish?

Erik





RE: Different Analyzer for each Field

2003-07-28 Thread Gregor Heinrich

Hi Claude,

one solution is to make the tokenStream() method in your Analyzer subclass
dispatch on the field name. Example:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);

    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stoptable);
    if (fieldName.startsWith("phonetic_") && phon != null) {
        result = new PhoneticFilter(result, phon);
        return result;
    }
    result = new SnowballFilter(result, "German");
    return result;
}

(In my index I have phonetically encoded fields that are filtered
differently.)

Ciao, Gregor

