Re: Not entire document being indexed?

2005-02-25 Thread Andrzej Bialecki
, and check if they are present in the index. There are really only a few reasons why this might be happening: * your extractor has a bug, or * the max token limit is wrongly set, or * the indexing process doesn't close the IndexWriter properly. -- Best regards, Andrzej Bialecki
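The second cause is easy to overlook. Lucene of this era caps the number of terms indexed per field (IndexWriter.maxFieldLength, 10,000 by default in the 1.4 line, as far as I know), and anything past the cap is silently dropped. The sketch below is a stand-alone illustration of that effect, not Lucene code: TokenLimitDemo and indexedTokens are hypothetical names, and whitespace splitting stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

class TokenLimitDemo {
    // Returns the tokens that would survive a per-field cap of maxTokens.
    // Whitespace splitting is an assumed stand-in for a real analyzer.
    static List<String> indexedTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (kept.size() >= maxTokens) {
                break; // everything after the cap never reaches the index
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // With a cap of 3, the fourth token is not searchable.
        System.out.println(indexedTokens("alpha beta gamma delta", 3));
    }
}
```

If the tail of a long document is missing from the index, comparing the document's token count against the configured limit is a quick first check.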

ANN: Luke 0.6 - Lucene Index Toolbox

2005-02-20 Thread Andrzej Bialecki
in this release, please keep nagging... ;-) -- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web - Embedded Unix, System Integration - http://www.sigram.com - Contact: info at sigram dot com

Re: Lucene Unicode Usage

2005-02-10 Thread Andrzej Bialecki
uses the standard platform-specific font dialog. On Windows this font doesn't support Unicode glyphs, so you will see just blanks (or rectangles). In the upcoming release you will be able to select the display font. -- Best regards, Andrzej Bialecki

Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread Andrzej Bialecki
, if anyone wants to rewrite Luke in Swing, SwiXML or something else, he's more than welcome - but this won't be me, because I hate Swing programming... -- Best regards, Andrzej Bialecki

Re: Retrieve all documents - possible?

2005-02-07 Thread Andrzej Bialecki
Karl Koch wrote: Hi, is it possible to retrieve ALL documents from a Lucene index? This should then actually not be a search... You are right. Just use the IndexReader.document(int). -- Best regards, Andrzej Bialecki
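A minimal sketch of that loop follows. Lucene's IndexReader offers maxDoc(), isDeleted(int) and document(int); since this example has to run without the Lucene JAR, DocSource below is a hypothetical stand-in interface with the same three method names, and documents are plain strings rather than Document objects.

```java
import java.util.ArrayList;
import java.util.List;

class AllDocsDemo {
    // Hypothetical stand-in for the three IndexReader methods the loop needs.
    interface DocSource {
        int maxDoc();
        boolean isDeleted(int n);
        String document(int n);
    }

    // Walks every document number and collects the documents that are not deleted.
    static List<String> allDocuments(DocSource reader) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                continue; // deleted slots still count toward maxDoc()
            }
            docs.add(reader.document(i));
        }
        return docs;
    }

    // Tiny in-memory source for the demo: a null entry marks a deleted slot.
    static DocSource of(String... slots) {
        return new DocSource() {
            public int maxDoc() { return slots.length; }
            public boolean isDeleted(int n) { return slots[n] == null; }
            public String document(int n) { return slots[n]; }
        };
    }
}
```

The isDeleted check matters because document numbers of deleted documents are not reused immediately.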

Re: Similarity coord,lengthNorm

2005-02-07 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: PHP-Lucene Integration

2005-02-06 Thread Andrzej Bialecki
and build a fully native PHPLucene module using gcj. -- Best regards, Andrzej Bialecki

Re: Synonyms Not Showing In The Index

2005-02-03 Thread Andrzej Bialecki
can't find the reason you could send me a small test index... -- Best regards, Andrzej Bialecki

Re: Synonyms Not Showing In The Index

2005-02-03 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Luke Shannon wrote: Hello; It seems my Synonym analyzer is working (based on some successful queries). But I can't see the synonyms in the index using Luke. Is this correct? Did you use the combined JAR to run? It contains an oldish version of Lucene... Other than

Re: Numbers in the Query String

2005-02-03 Thread Andrzej Bialecki
the documents. Would the use of two different analyzers cause any trouble for the results? Yes. StopAnalyzer eats all numbers for breakfast. ;-) You need to use another analyzer, one that doesn't discard numbers. -- Best regards, Andrzej Bialecki
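The reason is that StopAnalyzer tokenizes on letters only, so digits act as separators and never become terms. The sketch below imitates that behaviour in plain Java so the difference is visible without Lucene; NumberTokensDemo and both method names are hypothetical, and stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;

class NumberTokensDemo {
    // Letter-based tokenizing: a token is a maximal run of letters, lowercased.
    // Digits and punctuation only separate tokens, so numbers disappear.
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    // Whitespace-based tokenizing keeps numbers intact.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Whichever analyzer is chosen, the same one should be used for indexing and for parsing queries, otherwise the query terms will not match the indexed terms.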

Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki
to lemmas than in e.g. Porter's stemmer, but there is a significant amount of stems like in the example above. -- Best regards, Andrzej Bialecki

Re: Aramorph Analyzer

2004-12-16 Thread Andrzej Bialecki
also view the final query terms. -- Best regards, Andrzej Bialecki

Re: updating documents in the index

2004-11-04 Thread Andrzej Bialecki
your index with term vectors. -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD

Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Andrzej Bialecki
, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org

Re: your mail

2004-10-15 Thread Andrzej Bialecki
, and then only their postings (occurrences) are stored. -- Best regards, Andrzej Bialecki

Re: Arabic analyzer

2004-10-07 Thread Andrzej Bialecki
can certainly help someone to get started with testing... -- Best regards, Andrzej Bialecki

Re: Clustering lucene's results

2004-09-23 Thread Andrzej Bialecki
, usually using QueryParser, and finally you search using IndexSearcher. You get a list of Hits, which you can use to get scores, and the contents of the documents. Take a look at the IndexFiles and SearchFiles classes in org.apache.lucene.demo package (under /src/demo). -- Best regards, Andrzej

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Andrzej Bialecki
that these could easily be found, with the heuristic that a frequent way of misspelling words is to transpose two adjacent letters. Yes, sounds like a good idea. Even though it increases the size of the lookup index, it still beats using the linear search... -- Best regards, Andrzej Bialecki
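The transposition heuristic can be sketched as follows: for each dictionary word, generate every variant obtained by swapping two adjacent letters, so those variants can be stored in the lookup index next to the correct spelling. This is only my reading of the truncated thread; TranspositionDemo and transpositions are hypothetical names, not part of the NGramSpeller contribution.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TranspositionDemo {
    // Returns all distinct strings obtained by swapping one pair of adjacent letters.
    static Set<String> transpositions(String word) {
        Set<String> variants = new LinkedHashSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping equal letters reproduces the word itself
            }
            swap(chars, i, i + 1);
            variants.add(new String(chars));
            swap(chars, i, i + 1); // restore the original order
        }
        return variants;
    }

    private static void swap(char[] a, int i, int j) {
        char tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

A word of n letters yields at most n - 1 variants, which is the index growth mentioned above.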

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Andrzej Bialecki
in as a committer to the sandbox then I can Well, someone needs to maintain the code after all... ;-) -- Best regards, Andrzej Bialecki

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Andrzej Bialecki
terms. This should be fast, and you could provide a "did you mean" function too... -- Best regards, Andrzej Bialecki

Re: Compound File Format question

2004-09-08 Thread Andrzej Bialecki
the other way? In my experience it's safe. I've been doing this in a couple of real applications, and also in Luke there is an option to re-pack the index with or without the compound format. -- Best regards, Andrzej Bialecki

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Andrzej Bialecki
quite well - I use it myself in some applications, both with Lucene 1.3 and 1.4. The disadvantage is of course that the memory consumption goes up, so you have to be careful to cap the max size of RAMDirectory according to your max heap size limits. -- Best regards, Andrzej Bialecki

Re: Underscore character and case issue

2004-07-05 Thread Andrzej Bialecki
Robert Brown wrote: F:\Apache\Lucene\AddOns\Luke\v0.5>java -fullversion java full version 1.3.1_10-b03 F:\Lucene\AddOns\Luke\v0.5> I never tested it with anything below 1.4 ... -- Best regards, Andrzej Bialecki

Re: PDF Indexing Issue

2004-06-28 Thread Andrzej Bialecki
a PDF parser (e.g. PDFBox), and then extract plain-text content (such as body, title, author, etc), and only then add that plaintext content to the index. -- Best regards, Andrzej Bialecki

Re: To anyone who has used Luke

2004-06-25 Thread Andrzej Bialecki
, but so far no one provided any patches... -- Best regards, Andrzej Bialecki

Re: ANN: Luke v. 0.5 released

2004-06-24 Thread Andrzej Bialecki
on the Search tab to see what is the result of your query, or paste your query into the text area on the AnalyzerTool plugin (Plugins), and see what tokens you get using RussianAnalyzer. I just did it, and the result for * was - clearly not what you wanted. -- Best regards, Andrzej Bialecki

Re: pylucene

2004-06-24 Thread Andrzej Bialecki
on it... -- Best regards, Andrzej Bialecki

Re: score and frequency

2004-06-24 Thread Andrzej Bialecki
you send with every reply made to the list... -- Best regards, Andrzej Bialecki

Re: Rebuild a part of an indexed document

2004-06-24 Thread Andrzej Bialecki
package from the sandbox to produce snippets. -- Best regards, Andrzej Bialecki

Re: ANN: Luke v. 0.5 released

2004-06-23 Thread Andrzej Bialecki
; lucene.jar org.getopt.luke.Luke + Remember to put both JARs on your classpath, e.g.: java -classpath luke.jar:lucene.jar org.getopt.luke.Luke Well, both versions are correct - just the platform is different :-). I'll make a clarification. Thank you! -- Best regards, Andrzej Bialecki
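The platform difference is the classpath separator: a semicolon on Windows, a colon on Unix-like systems. Java exposes the right one as File.pathSeparator, so a launcher can build the command portably. The sketch below is only an illustration; ClasspathDemo and classpath are hypothetical names.

```java
import java.io.File;

class ClasspathDemo {
    // Joins JAR names with the given classpath separator.
    static String classpath(String separator, String... jars) {
        return String.join(separator, jars);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        String cp = classpath(File.pathSeparator, "luke.jar", "lucene.jar");
        System.out.println("java -classpath " + cp + " org.getopt.luke.Luke");
    }
}
```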

ANN: Luke v. 0.5 released

2004-06-22 Thread Andrzej Bialecki
or bugfixes are welcome! If you want to provide a patch, please use diff -bdruN - this will help me to integrate it. Thank you! -- Best regards, Andrzej Bialecki

Re: Writing a stemmer

2004-06-05 Thread Andrzej Bialecki
% of correct stems, and ~70% of correct lemmas. Which is a _very_ good result! -- Best regards, Andrzej Bialecki

Re: Writing a stemmer

2004-06-04 Thread Andrzej Bialecki
), and it works exceptionally well indeed. Highly recommended! -- Best regards, Andrzej Bialecki

Re: Stemmer Benefits/Costs

2004-04-22 Thread Andrzej Bialecki
project (http://www.egothor.org) - much more sophisticated than simple rule-based stemmers like Snowball or Porter. In fact, after proper training on a large corpus I was getting ~70% of correct lemmas for previously unseen words, and over 90% of correct (unique) stems. -- Best regards, Andrzej

Re: PrefixQuery and hieracical queries problem

2004-03-19 Thread Andrzej Bialecki
on... -- Best regards, Andrzej Bialecki

Re: PrefixQuery and hieracical queries problem

2004-03-19 Thread Andrzej Bialecki
Dennis Thrysøe wrote: Andrzej Bialecki wrote: What about using PhraseQuery, and store the path with all but first path separator replaced by whitespace (i.e. /foo bar baz one two three). Then you could query for /foo bar, /foo bar baz, and so on... Hi, It doesn't seem to work though
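The path rewriting suggested in the quoted text can be sketched like this: keep everything up to and including the first path separator, and turn every later separator into a space, so that a whitespace-tokenized field yields one term per path segment. PathPhraseDemo and toPhraseText are hypothetical names used only for this illustration.

```java
class PathPhraseDemo {
    // Replaces all but the first '/' with a space: "/foo/bar/baz" becomes "/foo bar baz".
    static String toPhraseText(String path) {
        int first = path.indexOf('/');
        if (first < 0) {
            return path; // no separator at all, nothing to rewrite
        }
        String head = path.substring(0, first + 1);
        String tail = path.substring(first + 1).replace('/', ' ');
        return head + tail;
    }

    public static void main(String[] args) {
        System.out.println(toPhraseText("/foo/bar/baz/one/two/three"));
    }
}
```

The same rewriting has to be applied to the query side, so that a lookup for /foo/bar becomes the phrase "/foo bar".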

Re: PrefixQuery and hieracical queries problem

2004-03-19 Thread Andrzej Bialecki
Dennis Thrysøe wrote: Andrzej Bialecki wrote: Anyway.. I should've added that for Phrase Queries to work the text must be tokenized. So, the best way in this case would be to use WhitespaceAnalyzer for the uri field, I've figured out how to use the WhitespaceAnalyzer for creating

Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Andrzej Bialecki
in the indexing/inserting process between the runs. Luke also provides you with a simple time measurement for query execution. Just FYI. -- Best regards, Andrzej Bialecki

Re: index: how to store binary data or objects ?

2004-02-10 Thread Andrzej Bialecki
(1.4.2) seem to be very stable and performing well, so that could also be an option. After all, a filesystem _is_ a kind of very specialized database... ;-) -- Best regards, Andrzej Bialecki

Re: arrays of values in a field

2004-01-28 Thread Andrzej Bialecki
to a field. I ended up encoding the keywords like "10.0 keyword" and then writing an analyzer which skips the initial numbers when processing this particular field (which was stored, indexed and tokenized). -- Best regards, Andrzej Bialecki
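A rough sketch of that encoding, assuming the leading number is a per-keyword weight (the truncated post does not say so explicitly). WeightedKeywordDemo and its methods are hypothetical names, and the code only shows the string handling, not a Lucene Analyzer.

```java
class WeightedKeywordDemo {
    // A plain decimal number such as "10" or "10.0".
    private static final String NUMBER = "\\d+(\\.\\d+)?";

    // Builds the stored form, e.g. encode(10.0, "keyword") gives "10.0 keyword".
    static String encode(double weight, String keyword) {
        return weight + " " + keyword;
    }

    // What the analyzer would index: the text with the numeric prefix skipped.
    static String keywordOf(String encoded) {
        int space = encoded.indexOf(' ');
        if (space > 0 && encoded.substring(0, space).matches(NUMBER)) {
            return encoded.substring(space + 1);
        }
        return encoded; // no numeric prefix, index the value unchanged
    }

    // Recovers the number from the stored value; 1.0 is an assumed default.
    static double weightOf(String encoded) {
        int space = encoded.indexOf(' ');
        if (space > 0 && encoded.substring(0, space).matches(NUMBER)) {
            return Double.parseDouble(encoded.substring(0, space));
        }
        return 1.0;
    }
}
```

Because the field is stored as well as indexed, the full encoded string comes back with the document while only the keyword part is searchable.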

Re: umlaut normalisation

2004-01-27 Thread Andrzej Bialecki
huhnerstall out of it in the query (Why?). But there is no huhnerstall indexed. Please check which Analyzer you're using in each case. -- Best regards, Andrzej Bialecki

Re: Using Russian analyzer in Luke

2004-01-26 Thread Andrzej Bialecki
will be another couple of days), but in the meantime you can just rebuild Luke from sources, using the latest Lucene. -- Best regards, Andrzej Bialecki

Re: setMaxClauseCount ??

2004-01-21 Thread Andrzej Bialecki
of the Lucene index? You should try to reduce the dimensionality by reducing the number of unique features. In this case, you could for example use only keywords (or key phrases) instead of the full content of documents. -- Best regards, Andrzej Bialecki

ANN: Luke 0.45 released

2004-01-17 Thread Andrzej Bialecki
pressing Search. * Fix the JNLP file to require J2SE 1.3+. * By popular demand, add a single self-contained JAR to the binary distribution. * Minor restructuring to increase reuse. Screenshots have been updated, too. Enjoy! -- Best regards, Andrzej Bialecki

Re: Term weighting and Term boost

2004-01-16 Thread Andrzej Bialecki
that. Could you please turn on the Java console, and see what kind of exception is thrown, and where? -- Best regards, Andrzej Bialecki

ANN: Luke 0.4 released

2004-01-11 Thread Andrzej Bialecki
on the search page. Spotted by Erik Hatcher. Thank you for your comments and contributions! -- Best regards, Andrzej Bialecki

Re: Performance question

2004-01-08 Thread Andrzej Bialecki
xerces, you might want to look at these. You might want to look at http://dom4j.org/. Dror You may want to check the XML Pull Parser - it offers something between SAX and DOM, with performance similar to SAX. (http://www.extreme.indiana.edu/xgws/xsoap/xpp) -- Best regards, Andrzej Bialecki

Fields with same name but different boosts

2003-11-24 Thread Andrzej Bialecki
value for the fields when I search? Is it equivalent to value1^10.0 value2^20.0 (which is my intention), or rather value1^20.0 value2^20.0? If the latter, do you have any suggestions how to achieve the original effect? Thanks in advance! -- Best regards, Andrzej Bialecki

Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Vector Space Model in Lucene?

2003-11-14 Thread Andrzej Bialecki
using a bastardized version of Markov chains, but it's more of a hack... -- Best regards, Andrzej Bialecki

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Andrzej Bialecki
Well ... Sure, nothing can replace a human mind. But believe it or not, there are studies which show that even human experts can significantly differ in their opinions on what are key-phrases for a given text. So, the results are never clear cut with humans either... So, in this sense a

Re: Index entire filesystem

2003-11-05 Thread Andrzej Bialecki
(including a Java implementation of the strings(1) command as the last resort). -- Best regards, Andrzej Bialecki
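For readers unfamiliar with strings(1): it scans arbitrary bytes and prints every run of printable characters at least four long, which gives some indexable text even for unknown binary formats. The sketch below is my own minimal version, not the code referred to in the post; StringsDemo and printableRuns are hypothetical names, and only printable ASCII (0x20 to 0x7E) is treated as text.

```java
import java.util.ArrayList;
import java.util.List;

class StringsDemo {
    // Collects runs of printable ASCII bytes that are at least minLength long.
    static List<String> printableRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                run.append((char) c);
            } else {
                if (run.length() >= minLength) {
                    runs.add(run.toString());
                }
                run.setLength(0); // a non-printable byte ends the current run
            }
        }
        if (run.length() >= minLength) {
            runs.add(run.toString());
        }
        return runs;
    }
}
```

Passing 4 as minLength matches the default of the Unix command; short runs are mostly noise and are better left out of the index.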

Luke v. 0.3 release - Lucene Index Browser

2003-09-24 Thread Andrzej Bialecki
please visit the link above to get both binaries and source code. -- Best regards, Andrzej Bialecki

Re: HTML Parsing problems...

2003-09-22 Thread Andrzej Bialecki
also very actively developed. -- Best regards, Andrzej Bialecki

Re: Lucene demo ideas?

2003-09-17 Thread Andrzej Bialecki
), and then got stuck in an infinite wait somewhere... So I came up with a workaround: I run the parser in a separate thread, while waiting in the main thread, and then after a certain timeout I kill the processing thread and return. -- Best regards, Andrzej Bialecki
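A present-day sketch of the same workaround, assuming java.util.concurrent is available (it was not part of the JDK when this was written). Note the swap: instead of killing the worker thread, this version waits on a Future with a timeout and then interrupts the worker via cancel(true), so a parser that ignores interrupts keeps running in the background as a daemon thread. TimedParseDemo and parseWithTimeout are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParseDemo {
    // Runs the task on a worker thread and returns its result,
    // or the fallback value if it times out or fails.
    static String parseWithTimeout(Callable<String> task, long timeoutMillis, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<String> result = pool.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; it is not forcibly killed
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parser threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would wrap the real parser invocation in the Callable and index the fallback (for example an empty string) when the parser hangs.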

Re: Lucene features

2003-09-05 Thread Andrzej Bialecki
for query expansion and for finding associated words (synsets?), or hypernyms like in your example. -- Best regards, Andrzej Bialecki

TermVector again (Re: Luke v 0.2 - Lucene Index Browser)

2003-08-14 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Julien Nioche wrote: [- and almost impossible : recompose the unstored fields of a document] It's not impossible, just time-consuming - all information (except the parts removed by analyzer) is already there. This functionality has a high cool-ness factor, which

Luke v 0.2 - Lucene Index Browser

2003-08-14 Thread Andrzej Bialecki
. * Add Read-Only mode. * Fix spinbox bug (really a bug in the Thinlet toolkit - fixed there). * Allow browsing hidden directories. * Add a combobox to choose the default field for searching. * Other minor code cleanups. Thanks to all who provided their comments and suggestions! -- Best regards, Andrzej

Re: Luke - Lucene Index Browser

2003-07-16 Thread Andrzej Bialecki
of adding some GUI and logic. Thinlet-based applications are easy to modify in the View layer, so it's up to the Controller part, if it can be coded at all... -- Best regards, Andrzej Bialecki

Re: Luke - Lucene Index Browser

2003-07-14 Thread Andrzej Bialecki
this as well... In any case, if you're referring to the Search panel, then you can always double-click on one of the search results, and it will be displayed in the Documents panel, where you can not only see all the fields, but also copy them to clipboard... -- Best regards, Andrzej Bialecki

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
objects, but this information can be found at msdn.microsoft.com. Obviously I'd love to learn about an alternative, because then I could free my clients from dependence on Office... I already use POI to convert XLS and DOC files, and it works _very_ well. -- Best regards, Andrzej Bialecki

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
an extensible marshaller/de-marshaller, so if you know COM pretty well you can extend it to handle any conceivable parameter types. -- Best regards, Andrzej Bialecki

Re: RE : Parsers

2003-05-29 Thread Andrzej Bialecki
document bean that allows you to work with a document editor in JComponent. -- Best regards, Andrzej Bialecki

Re: Potential Lucene drawbacks

2003-03-07 Thread Andrzej Bialecki
than we think :). I believe there are tools out there that will analyze Java sources and create UML class diagrams from that. I believe TogetherJ or one of those 'all in one' tools can do that. I can do it for you, if you want - it takes ~10 minutes. -- -- Best regards, Andrzej Bialecki

Re: Word doc parser

2003-03-02 Thread Andrzej Bialecki
wrote: COM based parser: http://www.intrinsyc.com/products/enterprise_applications.asp convert word to text: http://www.winfield.demon.nl/index.html That's a bit expensive... I found a free alternative - Jawin, plus OLE Automation. -- Best regards, Andrzej Bialecki

Re: Indexing Tips and Hints

2003-02-25 Thread Andrzej Bialecki
Multivalent browser, and is subject to a BSD-equivalent license - which means you can use it for whatever purpose, and if it turns out to be useful, it can be included in the Lucene distribution. -- Best regards, Andrzej Bialecki

Re: Indexing Tips and Hints

2003-02-25 Thread Andrzej Bialecki
it only with small doc. collections that I use for functionality testing... Everything appears to work as expected, but my test collection is just ~100 documents, so the searching is blazingly fast no matter what I do.. :-) -- Best regards, Andrzej Bialecki

Re: Indexing Tips and Hints

2003-02-25 Thread Andrzej Bialecki
petite_abeille wrote: On Tuesday, Feb 25, 2003, at 09:43 Europe/Zurich, Andrzej Bialecki wrote: No, I'm not - this is clearly stated in the class javadoc. I meant to try it out in my application, but haven't got to it yet - I need to address first the base functionality, not performance; so, I

Score per Term

2003-02-24 Thread Andrzej Bialecki
it does against a particular query... Any suggestions? -- Best regards, Andrzej Bialecki

Re: Score per Term

2003-02-24 Thread Andrzej Bialecki
the query? Should I expect a similar cost for that as creating Explanations separately? BTW: I tried to contact you regarding some help in a commercial project. Is [EMAIL PROTECTED] the right way to do it? Thanks! Andrzej Bialecki wrote: Hello, Is there any simple way to get the information from

Re: Indexing Tips and Hints

2003-02-24 Thread Andrzej Bialecki
-- Best regards, Andrzej Bialecki