RE: Summarization; sentence-level and document-level filters.
Yes, copying a summary from one field to an untokenized field was the plan. I had identified DocumentWriter.invertDocument() as a possible place to add this document-level analysis, but I admit it seems far too low-level and inflexible for the overall design. So I'll make it two-pass indexing. Thanks for the decision support,

Gregor

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 6:57 PM
To: Lucene Users List
Subject: Re: Summarization; sentence-level and document-level filters.

It sounds like you want the value of a stored field (a summary) to be built from the tokens of another field of the same document. Is that right? This is not presently possible without tokenizing the field twice: once to produce its summary and once again when indexing.

Doug

Gregor Heinrich wrote:

Hi, is there any possibility to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? Or where else is the best place to plug in customised document-level and sentence-level analysis features? Is there a precedent?

My technical problem: I'd like to include a summarization feature in my system, which should (1) make the best use of the architecture already present in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas. To preserve this punctuation, a special Tokenizer can be used that outputs such landmarks as tokens instead of filtering them out. The actual SummaryFilter then filters out the punctuation for its successors in the Analyzer's filter chain.

The other, more complex thing is the document-level information: as Lucene's architecture uses a filter concept that does not know about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward here (and originally not intended, I guess). On the other hand, I'd like to keep the existing filter structure in place for preprocessing of the input, because my raw texts are generated by converters from other formats that output unwanted characters (from figures, page numbers, etc.), which are filtered out anyway by my custom Analyzer.

Any idea how to solve this second problem? Is there any support for such document / sentence structure analysis planned?

Thanks and regards,
Gregor

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
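Doug's constraint above (a stored summary field cannot be built from another field's tokens without tokenizing the text twice) can be sketched outside Lucene. The following is a minimal, Lucene-free illustration of the two-pass idea: pass 1 runs a sentence-aware split to build a summary, pass 2 would hand the same raw text to the normal indexing analyzer. The naive "first N sentences" summarizer is a stand-in for a real one such as Classifier4J's SimpleSummariser, and the commented Field calls only indicate where the 1.3-era Lucene API would come in.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPassSummary {

    /** Pass 1: split on sentence-final punctuation, keep the first n sentences. */
    public static String summarize(String text, int n) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) sentences.add(s.trim());
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, sentences.size()); i++) {
            if (i > 0) sb.append(' ');
            sb.append(sentences.get(i));
        }
        return sb.toString();
    }

    /** Pass 2 (sketch): the summary would be stored untokenized, the body tokenized. */
    public static String[] buildDocumentFields(String body) {
        String summary = summarize(body, 1);
        // With Lucene this would be roughly:
        //   doc.add(Field.UnIndexed("summary", summary)); // stored, not tokenized
        //   doc.add(Field.Text("body", body));            // tokenized and indexed
        return new String[] { summary, body };
    }
}
```

The point of the two passes is exactly Doug's: the text is tokenized once by the summarizer and once again by the indexing analyzer, instead of trying to hook document-level logic into the token filter chain.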
RE: Summarization; sentence-level and document-level filters.
Maurits: thanks for the hint to Classifier4J -- I have had a look at the package and tried the SimpleSummarizer, and it seems to work fine. (However, as I don't know the benchmarks for summarization, I'm not the one to judge.) Do you have experience with it?

Gregor

-----Original Message-----
From: maurits van wijland [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 1:09 AM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Summarization; sentence-level and document-level filters.

Hi Gregor,

So far as I know, there is no summarizer in the plans. But maybe I can help you along the way. Have a look at the Classifier4J project on SourceForge: http://classifier4j.sourceforge.net/

It has a small document summarizer besides a Bayes classifier. It might speed up your coding. On the level of Lucene, I have no idea. My gut feeling says that a summary should be built before the text is tokenized! The tokenizer can of course be used when analysing a document, but hooking into the Lucene indexing is a bad idea, I think. Does anyone else have ideas?

regards,
Maurits

----- Original Message -----
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 7:41 PM
Subject: Summarization; sentence-level and document-level filters.

Hi, is there any possibility to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? Or where else is the best place to plug in customised document-level and sentence-level analysis features? Is there a precedent?

My technical problem: I'd like to include a summarization feature in my system, which should (1) make the best use of the architecture already present in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas. To preserve this punctuation, a special Tokenizer can be used that outputs such landmarks as tokens instead of filtering them out. The actual SummaryFilter then filters out the punctuation for its successors in the Analyzer's filter chain.

The other, more complex thing is the document-level information: as Lucene's architecture uses a filter concept that does not know about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward here (and originally not intended, I guess). On the other hand, I'd like to keep the existing filter structure in place for preprocessing of the input, because my raw texts are generated by converters from other formats that output unwanted characters (from figures, page numbers, etc.), which are filtered out anyway by my custom Analyzer.

Any idea how to solve this second problem? Is there any support for such document / sentence structure analysis planned?

Thanks and regards,
Gregor
RE: Lucene and Mysql
Hi. You read all the relevant fields out of MySQL and assign the primary key as an identifier of your Lucene documents. During search, you retrieve the identifier from the Lucene searcher and query the database to present the full text.

Best regards,
Gregor

-----Original Message-----
From: Stefan Trcko [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:31 PM
To: [EMAIL PROTECTED]
Subject: Lucene and Mysql

Hello,

I'm new to Lucene. I want users to be able to search text that is stored in a MySQL database. Is there a tutorial on how to implement this kind of search feature?

Best regards,
Stefan
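The pattern Gregor describes can be sketched end to end. The class below is a toy illustration, not real Lucene or JDBC code: a HashMap stands in for the MySQL table, a linear substring scan stands in for the Lucene query, and all names are invented. The structure is the point: the "index" knows only the primary key plus searchable text, and the full row is fetched from the "database" at display time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DbSearchSketch {
    // primary key -> full row text (stand-in for the MySQL table)
    private final Map<Integer, String> database = new HashMap<>();
    // primary key -> indexed text (stand-in for the Lucene index; with Lucene
    // this would be Field.Keyword("id", ...) plus Field.Text("body", ...))
    private final Map<Integer, String> index = new HashMap<>();

    public void addRow(int id, String fullText) {
        database.put(id, fullText);
        index.put(id, fullText.toLowerCase());
    }

    /** "Search": return the ids of matching documents, as a Lucene query would. */
    public List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : index.entrySet()) {
            if (e.getValue().contains(term.toLowerCase())) hits.add(e.getKey());
        }
        return hits;
    }

    /** Resolve a hit back to the full row, as the SELECT-by-primary-key step would. */
    public String fetch(int id) {
        return database.get(id);
    }
}
```

In a real setup, search() is replaced by an IndexSearcher query and fetch() by a prepared SELECT on the primary key.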
RE: Word Documents
Hi, that's great info. In fact, I didn't check for fast-saving yet. So I'll probably go ahead and have a try later...

Good luck with POI,
Gregor

-----Original Message-----
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 3:35 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Word Documents

I have written a library, located at http://textmining.org, that will extract text from Word documents. I am the author of the Word library in POI, btw. This is just a lightweight version, because I got sick of everyone asking how to extract text from a Word document. If it doesn't work, it's because the document is *not* from Word 97 or later, or the file was fast-saved. Every time somebody has problems, they send me their files, and the files turn out to be RTF or Word 95 documents. You can check the format by opening the file in Word and then going to Save As: the format of the document will be shown in the Save as Type dropdown. At least it is in my version of Word.

-Ryan

----- Original Message -----
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 9:19 AM
Subject: RE: Word Documents

Hi, we had some problems using the POI Word filter. In one document set everything would work fine; in another, more than 50% of the documents refused to work with it (they did not index). I am not an OLE2 pro and cannot see any apparent difference between the documents in the different sets. The version used was Word 97 for almost all the docs. For the moment, I have switched to a native converter (which does not process metadata and must be run using Runtime.exec(), though) until I have time to revisit the problem. I do not want to recommend against the POI filters; it's a very cool idea. Please do try your particular document set with it. For a quick test, you can use the Docco personal search tool by Peter Becker and colleagues (available from SourceForge). It has a current version of POI included as a plugin and Lucene running as the indexing backend. So you don't have to write code to get answers...

Cheers,
gregor

-----Original Message-----
From: Pleasant, Tracy [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 2:58 PM
To: Lucene Users List
Subject: Word Documents

As a spinoff, I was wondering if anyone has been happy with indexing and searching Word docs. What about reading the contents? Any problems?

-----Original Message-----
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:59 PM
To: Zhou, Oliver; Lucene Users List
Subject: Re: textmining: document title

Check out Jakarta POI (http://jakarta.apache.org/poi), particularly the HPSF API. It allows you to extract metadata like Title, Author, etc. from OLE documents.

-Ryan

----- Original Message -----
From: Zhou, Oliver [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:26 PM
Subject: textmining: document title

Ryan, I'm using textmining and Lucene to index Word documents but don't know how to get the Word document title. Your advice on this matter is appreciated.

Thanks,
Oliver Zhou
Summarization; sentence-level and document-level filters.
Hi, is there any possibility to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? Or where else is the best place to plug in customised document-level and sentence-level analysis features? Is there a precedent?

My technical problem: I'd like to include a summarization feature in my system, which should (1) make the best use of the architecture already present in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas. To preserve this punctuation, a special Tokenizer can be used that outputs such landmarks as tokens instead of filtering them out. The actual SummaryFilter then filters out the punctuation for its successors in the Analyzer's filter chain.

The other, more complex thing is the document-level information: as Lucene's architecture uses a filter concept that does not know about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward here (and originally not intended, I guess). On the other hand, I'd like to keep the existing filter structure in place for preprocessing of the input, because my raw texts are generated by converters from other formats that output unwanted characters (from figures, page numbers, etc.), which are filtered out anyway by my custom Analyzer.

Any idea how to solve this second problem? Is there any support for such document / sentence structure analysis planned?

Thanks and regards,
Gregor
RE: Disabling modifiers?
If you don't want to fiddle with the JavaCC source of QueryParser.jj, you could work with regular expressions applied in front of the actual query parser. I just did something similar, because I feed Lucene's query strings into a latent semantic analysis algorithm and remove words with + and ? wildcards, boosting modifiers, as well as NOT and - clauses and groupings. Such as:

/** exclude words that have these modifiers */
public final String excludeWildcards = "\\w+\\+|\\w+\\?";

/** remove these operators */
public final String removeOperators = "AND|OR|UND|ODER|&&|\\|\\|";

/** remove these modifiers */
public final String removeModifiers = "~[0-9\\.]*|~|\\^[0-9\\.]*|\\*";

/** exclude phrases that have these modifiers */
public final String excludeNot = "(NOT |\\-) *\\w+|(NOT|\\-) *\\([^\\)]+\\)|(NOT |\\-) *\\\"[^\\\"]+\\\"";

/** remove any groupings */
public final String removeGrouping = "[\\(\\)]";

You then create Pattern objects from the strings using Pattern.compile() and can use and re-use the compiled patterns:

excludeWildcardsPattern = Pattern.compile(excludeWildcards);
lsaQ = excludeWildcardsPattern.matcher(q).replaceAll("");

This works fine for me. However, this 20-minute approach does not recognise nested parentheses with NOT or -, i.e., the term "NOT ((a OR b) AND (c OR d))" will only result in the removal of "NOT ((a OR b)", and "c d" will still be in the output query.

Best regards,
Gregor

-----Original Message-----
From: Iain Young [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 6:13 PM
To: Lucene mailing list (E-mail)
Subject: Disabling modifiers?

A quick question: is there any way to disable the - and + modifiers in the QueryParser? I'm trying to use Lucene to provide indexing of COBOL source code and to highlight matches when the code is displayed. In COBOL you can have variable names such as DISP-NAME and WS-DATE-1, for example. Unfortunately the query parser interprets the - signs as modifiers, and so the query does not do what is required. I've had a bit of success by putting quotes around the offending names (as suggested on this list), but the results are still less than satisfactory (it removes the NOT from the query, but still treats DISP and NAME as two separate words rather than one, and so the results are not quite correct). Any ideas, or am I going to have to try and write my own query parser?

Thanks,
Iain
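A runnable, simplified version of the regex pre-filter idea sketched above: strip NOT/- clauses, boolean operators, and boost/fuzzy modifiers from a Lucene query string before handing the bare words on. The patterns are condensed and slightly reordered relative to those in the mail, and they inherit the same stated limitation with nested parentheses after NOT.

```java
import java.util.regex.Pattern;

public class QueryStripper {
    // NOT/- followed by a group, a phrase, or a single word
    private static final Pattern EXCLUDE_NOT =
        Pattern.compile("(NOT |-) *\\([^)]+\\)|(NOT |-) *\"[^\"]+\"|(NOT |-) *\\w+");
    // boolean operators (English and German, as in the mail)
    private static final Pattern REMOVE_OPERATORS =
        Pattern.compile("\\b(AND|OR|UND|ODER)\\b|&&|\\|\\|");
    // fuzzy (~), boost (^n) and wildcard (*) modifiers
    private static final Pattern REMOVE_MODIFIERS =
        Pattern.compile("~[0-9.]*|\\^[0-9.]*|\\*");
    private static final Pattern REMOVE_GROUPING =
        Pattern.compile("[()]");

    public static String strip(String query) {
        String q = EXCLUDE_NOT.matcher(query).replaceAll("");
        q = REMOVE_OPERATORS.matcher(q).replaceAll("");
        q = REMOVE_MODIFIERS.matcher(q).replaceAll("");
        q = REMOVE_GROUPING.matcher(q).replaceAll("");
        return q.trim().replaceAll("\\s+", " ");
    }
}
```

Note that the `(NOT |-) *\w+` alternative would also eat the tail of hyphenated identifiers like DISP-NAME, which is exactly the behaviour Iain is fighting; for his use case the patterns would need a word-boundary guard or a custom tokenizer instead.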
RE: Docco 0.2 / contribution offer
Hi Peter. Docco is a great tool, which I have been using since you posted your first announcement. Besides the things you mention in your mail, I also think it's generally a great idea to use formal concept analysis with Lucene. I would be interested in exploring the idea for more structured data as well (maybe including fields and even hierarchies). Apart from this, if I had an idea of the time commitments involved, I would definitely consider joining.

Best,
Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 1:52 PM
To: Lucene Users List
Subject: ANN: Docco 0.2 / contribution offer

Hi all,

we finally finished the 0.2 release of our little personal document management tool based on Lucene: http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list, since its source contains some infrastructure for document handlers and index management. The document handlers are written against a very simple API, which just asks the implementation to fill a structure with the information retrieved from a URL. It is similar to the Ant task in the Lucene sandbox, but it separates the information collection from the actual indexing, i.e., all the decisions about what should be stored and what shouldn't. The program comes with implementations for plain text, HTML (based on Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We wrote plugins for POI, PDFbox and Multivalent. The latter is unfortunately a wild hack, since Multivalent is the worst Java code I've seen. Literally. Bad C written in Java. The tool would be nice to use, but catching exceptions in little helper classes to do a System.exit is just insane. And that is just one of the problems -- we had to do some bad hacks to fix these issues. The other implementations should be fine, although they need some more testing.

The source (including all required libs) of the program is available via SourceForge's CVS: http://sourceforge.net/cvs/?group_id=37081 -- the module in question is called "docco". A current snapshot of only the source is here: http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)

The relevant packages are:
  org.tockit.docco.documenthandler: the document handler interface and implementations
  org.tockit.docco.filefilter: some code to pick document handlers via file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management: runnable, framework for handlers

The index management is probably not optimal; I strongly suspect that an expert could tweak it. But the structure should be ok. We would be happy to contribute this code to the Lucene sandbox if there is interest, or to turn it into a project of its own; we don't think it should be hidden in our more specific program. It should be easy to merge it with the Ant task, and we are happy to give a hand if wanted. Adding some documentation would be easy, too -- at the moment the code is still more for ourselves, but it should be very readable by itself. We require JDK 1.4, but this requirement can be reduced by moving some more document handlers into plugins.

Anyone interested in joining in maintaining this code? Any feedback is welcome.

Cheers,
Peter
RE: Newbie Questions
Hi Mark, short answers to your questions:

ad 1: MultiFieldQueryParser is what you might want: you can specify the fields to run the query on. Alternatively, the practice of duplicating the contents of all the separate fields in question into one additional merged field has been suggested, which enables you to use QueryParser itself.

ad 2: Depending on the Analyzer you use, the query is normalised, i.e., stemmed (suffixes removed from words) and stopword-filtered (highly frequent words removed). Have a look at StandardAnalyzer.tokenStream(...) to see how the different filters work. In the analysis package, the 1.3rc2 Lucene distribution has a Porter stemming implementation: PorterStemmer.

Have fun,
Gregor

-----Original Message-----
From: Mark Woon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 26, 2003 6:54 AM
To: [EMAIL PROTECTED]
Subject: Newbie Questions

Hi all... I've been playing with Lucene for a couple of days now, and I have a couple of questions I'm hoping someone can help me with. I've created a Lucene index with data from a database that's in several different fields, and I want to set up a web page where users can search the index. Ideally, all searches should be as Google-like as possible. In Lucene terms, I guess this means the query should be fuzzy. For example, if someone searches for "cancer", then I'd like to get back all results with any form of the word cancer in the term ("cancerous", "breast cancer", etc.). So far, I seem to be having two problems:

1) How can I search all fields at the same time? The QueryParser seems to search only one specific field.

2) How can I automatically default all searches to fuzzy mode? I don't want my users to have to know that they must add a ~ at the end of all their terms.

Thanks,
-Mark
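Both answers can be illustrated without Lucene. The sketch below shows (1) merging field values into the text of one catch-all field, as suggested for use with plain QueryParser, and (2) appending Lucene's ~ fuzzy modifier to each bare query term so users never have to type it. Class and field names are invented for the example; a real implementation would hand the rewritten string to QueryParser.

```java
import java.util.Map;

public class NewbieHelpers {

    /** (1) Concatenate field values into the text for one merged field. */
    public static String mergedField(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String value : fields.values()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(value);
        }
        return sb.toString();
    }

    /** (2) Append ~ to every plain term that carries no modifier yet. */
    public static String defaultFuzzy(String query) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(term.matches("\\w+") ? term + "~" : term);
        }
        return sb.toString();
    }
}
```

The merged-field trick trades index size for the convenience of single-field queries; MultiFieldQueryParser avoids the duplication at the cost of building one clause per field.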
RE: Similar Document Search
Hi Terry,

the Haystack suggestion was a hint meant to give you an additional alternative for reaching your goal. Depending on your definition of the notion "similar document", this solution does or does not make sense. My definition of "similar document" (and "similar term") is maybe more general than yours: it supports rather generic similarity metrics and needs to cover cosine similarity according to the vector space model (VSM; can be achieved using unmodified Lucene code), semantic similarity according to a generative model like latent semantic indexing or Bayesian approaches etc., and even semantic similarity according to a taxonomy. If you want such flexibility (as I do for my research), you should consider this approach, because you can work relatively easily on the forward document vectors. If all you need is vanilla VSM cosine similarity, you are probably best off with the suggestion that was sent to this list: submit the document content in the query and put it through the same Analyzer that was used to create the index, thus finding the best matches using Lucene's standard matching scheme.

Good luck,
Gregor

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 2:54 PM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Peter,

I took a look at Mark's thesis and briefly at some of his code. It appears to me that what he's done with the so-called forward indexing is to (a) include a unique id with each document (allowing retrieval by id rather than by a standard query), and (b) include a frequency map class with each document (allowing easier retrieval of term frequency information). Now I may be missing something very obvious, but it seems to me that both of these functions can be done rather easily with the standard (unmodified) version of Lucene. Moreover, I don't understand how use of these functions will facilitate retrieval of documents that are similar to a selected document, as outlined in my original question on this topic. Could you (or anyone else, of course) perhaps elaborate just a bit on how this approach will help achieve that end?

Regards,
Terry

----- Original Message -----
From: Peter Becker [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search

Hi all,

it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we hacked (a) by using a wrapper implementing equals/hashCode based on a unique field, but of course that assumes maintaining a unique field in the index. (b) is something we haven't tackled yet, but plan to. The source code for Mark's thesis seems to be part of the Haystack distribution. The comments in the files put it under the Apache license. This seems to make it a good candidate to be included at least in the Lucene sandbox -- although I haven't tried it myself yet. But it sounds like a good candidate for us to use. Since the Haystack source is a bit larger and I actually couldn't get the download at the moment, here is a copy of the relevant bit grabbed from one of my colleagues' machines: http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb). Note that this is just a tarball of src/org/apache/lucene out of some Haystack source. Untested, unmodified. I'd love to see something like this supported in the Lucene context where people might actually find it :-)

Peter

Gregor Heinrich wrote:

Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely finished yet, therefore no code at this time...

Best regards,
Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can look up the terms for a document and query for other documents matching parts of them (with the usual question of what is actually interesting: high frequency, low frequency or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support changes in the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:

Is it possible, without extensive additional coding, to use Lucene to conduct a search based on a document rather than a query?
RE: Newbie Questions
Hi Mark. Sorry, it's actually rc1 that is out. But if you go to the CVS server, you'll find the rc2-dev version.

"Multiple calls to Document.add with the same field name result in their text being treated as though appended, for the purposes of search" (paraphrasing the API doc). Can you try out whether there's a difference between the cases you mention? I don't know, but I'd be interested as well ;-)

Gregor

-----Original Message-----
From: Mark Woon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 26, 2003 8:52 PM
To: Lucene Users List
Subject: Re: Newbie Questions

Gregor Heinrich wrote:

> ad 1: MultiFieldQueryParser is what you might want: you can specify the fields to run the query on. Alternatively, the practice of duplicating the contents of all separate fields in question into one additional merged field has been suggested, which enables you to use QueryParser itself.

Ah, I've been testing out something similar to the latter. I've been adding multiple values under the same key. Won't this have the same effect? I've been assuming that if I do

doc.add(Field.Keyword("content", value1));
doc.add(Field.Keyword("content", value2));

and did a search on the "content" field for either value, I'd get a hit, and it seems to work. This way, I figure I'd be able to differentiate between values that I want tokenized and values that I don't. Is there a difference between this and building a StringBuffer containing all the values and storing that as a single field value?

> ad 2: Depending on the Analyzer you use, the query is normalised, i.e., stemmed (suffixes removed from words) and stopword-filtered (highly frequent words removed). Have a look at StandardAnalyzer.tokenStream(...) to see how the different filters work. In the analysis package, the 1.3rc2 Lucene distribution has a Porter stemming implementation: PorterStemmer.

There's an rc2 out? Where?? I just checked the Lucene website and only see rc1.

Thanks everyone for all the quick responses!
-Mark
RE: Similar Document Search
Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely finished yet, therefore no code at this time...

Best regards,
Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can look up the terms for a document and query for other documents matching parts of them (with the usual question of what is actually interesting: high frequency, low frequency or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support changes in the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:

Is it possible, without extensive additional coding, to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents like the selected one.)

Regards,
Terry
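The forward-indexing idea running through this thread can be sketched in a few lines: keep a map from each document to its term frequencies (the non-inverted index Peter describes), then compare documents with the cosine measure of the vector space model. This is an illustration of the concept only, with no idf weighting or length normalization, and is not Mark Rosen's actual Haystack code.

```java
import java.util.HashMap;
import java.util.Map;

public class ForwardIndex {

    /** Build the term-frequency map (the forward vector) of one document. */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity between two term-frequency vectors. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

"Documents similar to this one" then reduces to ranking all other forward vectors by cosine(selected, other), which is exactly the operation an inverted index alone does not give you cheaply.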
RE: Lucene as a high-performance RDF database.
Hi Kevin,

your idea could work in the higher megabyte range, I guess; I don't know about several terabytes. We have been considering a concept for using Lucene as an RDF backend for a semantic search engine, because of its reported excellent scalability, on the order of tens of Megs. The idea was similar to yours, but we thought of using some index extension to introduce the class/property hierarchy (i.e., RDF Schema) and make it searchable via cascaded index lookups. We didn't have the time, though, to test it, but I would be grateful if you could comment. Here are the fields; in a draft with three index parts it's something like:

  node (unique)
  clss (class in schema)
  prop (position-ordered)
  prwt (a scalar value, weighting the relation, or 1; position-ordered)
  rsrc (resource, position-ordered)

and for the ontology itself:

  clss
  spcl (superclass, multi-inheritance)

and

  prop (property)
  sprp (super-property, multi-inheritance)
  domn (domain)
  rnge (range)

Best regards,
gregor

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Monday, August 11, 2003 12:33 AM
To: [EMAIL PROTECTED]
Subject: Lucene as a high-performance RDF database.

I have been giving some thought to using Lucene as an RDF database. I'm specifically thinking about the RDF model, not the RDF syntax. Essentially this would just comprise triples encoded in a document as fields. So, for example, we would have subject, predicate and object relationships as document fields. Subject and predicate would be Tokens, and the object field would be indexed. For example, a triple (document) would be:

  http://jakarta.apache.org - title - A great Java developer's website

This would be just one document in the index. This would have a lot of advantages, most importantly speed, the reliability of Lucene, and the ability to run a full-text query on objects. For example, we could query on "Java" and get back "http://jakarta.apache.org". The major downside I can see is that this would mean indexing a LOT of small documents with a LOT of index updates. Can anyone see any problems here? This database will eventually grow to around 2TB in the next month, so performance issues are non-trivial. Most people have deployed Lucene with large document sizes, and the fact that most people cite document COUNT makes me nervous.

Kevin
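Kevin's triple-per-document layout can be mocked up as follows. A plain list with substring matching stands in for the index; with Lucene, each Triple would become one Document whose subject and predicate are untokenized keyword fields and whose object is indexed full text. This is a sketch of the data layout only, and says nothing about how Lucene would behave at the 2TB scale he asks about.

```java
import java.util.ArrayList;
import java.util.List;

public class TripleStoreSketch {
    /** One RDF triple; with Lucene this would be one Document with three Fields. */
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    private final List<Triple> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    /** Full-text query on the object field, returning the matching subjects. */
    public List<String> subjectsWithObjectMatching(String word) {
        List<String> out = new ArrayList<>();
        for (Triple t : triples) {
            if (t.object.toLowerCase().contains(word.toLowerCase())) {
                out.add(t.subject);
            }
        }
        return out;
    }
}
```

Gregor's schema extension above amounts to adding more keyword fields per document (clss, spcl, sprp, ...) so that class and property hierarchies can be resolved by cascaded lookups over the same kind of index.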
Multiple fields, identical terms.
Hi everyone,

my index has a "title" and an "abstract" field, both inverted and tokenized. I would like to have unique term texts in my term enumeration; that is, across all fields there should be no duplicate term text. An easy solution would be to use only one field. But does someone know an alternative way that keeps multiple fields?

Best regards,
Gregor
RE: Multiple fields, identical terms.
Hi. Thanks for your suggestion; I think the storage overhead is bearable. Actually, I am doing some sort of forward indexing in addition to the inverted index, i.e., the result will be a meta-search engine that combines the Lucene IR process proper with an aspect model similar to latent semantic analysis. To store the forward index, it's necessary to create a term-document matrix whose terms should all be unique, regardless of the field. This kind of vector space indexing could as well be useful for other purposes, such as document classification.

One idea is to run an additional Hashtable that checks for uniqueness and attaches additional information to a term, such as its phonetic encoding or its catalogue key. But I wanted to use as much of the existing infrastructure as possible and stay compatible. I also thought of changing the way fields and terms are allocated to each other, i.e., allowing a list of fields in each Term object and thus making term texts unique. But this would require a substantial redesign of the index file and access structure...

Gregor

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 30, 2003 2:40 PM
To: Lucene Users List
Subject: Re: Multiple fields, identical terms.

On Wednesday, July 30, 2003, at 06:16 AM, Gregor Heinrich wrote:

> I would like to have unique term texts in my term enumeration. That is, across all fields there should be no duplicate term text. An easy solution would be to only use one field. But does someone know an alternative way with multiple fields?

What about putting both abstract and title together into a single new field called "keywords"? Leave title and abstract there as well, but just append the two strings together (with a space in the middle to tokenize properly! :). Is that a reasonable alternative? What are you trying to accomplish?

Erik
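The additional-Hashtable idea Gregor mentions might look like the minimal sketch below: one map keyed on the bare term text, independent of field, with the value remembering which fields the term occurred in. The per-term entry is also the natural place to attach the extras he lists (phonetic encoding, catalogue key). All names are invented for the example; this runs beside Lucene rather than changing its Term structure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UniqueTerms {
    // term text -> the fields it occurred in
    private final Map<String, Set<String>> termToFields = new HashMap<>();

    public void addField(String fieldName, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            termToFields.computeIfAbsent(token, k -> new HashSet<>()).add(fieldName);
        }
    }

    /** The unique, field-independent term texts, sorted. */
    public Set<String> uniqueTerms() {
        return new TreeSet<>(termToFields.keySet());
    }

    public Set<String> fieldsOf(String term) {
        return termToFields.getOrDefault(term, Set.of());
    }
}
```

uniqueTerms() then supplies the row labels of the term-document matrix, one row per term text regardless of how many fields it appears in.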
RE: Different Analyzer for each Field
Hi Claude,

one solution is to make the tokenStream method in your Analyzer subclass dispatch on the field name. Example:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stoptable);
    if (fieldName.startsWith("phonetic_") && phon != null) {
        result = new PhoneticFilter(result, phon);
        return result;
    }
    result = new SnowballFilter(result, "German");
    return result;
}

(In my index I have phonetically encoded fields that are filtered differently.)

Ciao,
Gregor