, and check if they are present in the index.
There are really only a few reasons why this might be happening:
* your extractor has a bug, or
* the max token limit is wrongly set, or
* the indexing process doesn't close the IndexWriter properly.
--
Best regards,
Andrzej Bialecki
in this release,
please keep nagging... ;-)
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
uses the standard platform-specific font dialog. On
Windows this font doesn't support Unicode glyphs, so you will see just
blanks (or rectangles). In the upcoming release you will be able to
select the display font.
--
Best regards,
Andrzej Bialecki
, if anyone wants to rewrite Luke in Swing, SwiXML or something else,
he's more than welcome - but this won't be me, because I hate Swing
programming...
--
Best regards,
Andrzej Bialecki
Karl Koch wrote:
Hi,
is it possible to retrieve ALL documents from a Lucene index? This should
then actually not be a search...
You are right. Just use the IndexReader.document(int).
--
Best regards,
Andrzej Bialecki
and build
a fully native PHPLucene module using gcj.
--
Best regards,
Andrzej Bialecki
can't find the reason
you could send me a small test index...
--
Best regards,
Andrzej Bialecki
Andrzej Bialecki wrote:
Luke Shannon wrote:
Hello;
It seems my Synonym analyzer is working (based on some successful
queries).
But I can't see the synonyms in the index using Luke. Is this correct?
Did you use the combined JAR to run? It contains an oldish version of
Lucene... Other than
the documents. Would the use of two different analyzers cause any
trouble for the results?
Yes. StopAnalyzer eats all numbers for breakfast. ;-) You need to use
another analyzer, one that doesn't discard numbers.
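
To see why the numbers vanish: as far as I know, the StopAnalyzer of that era tokenizes on letters only, so digits are treated as separators before the stop-word filter even runs. The sketch below is a stdlib-only emulation of that letter-based tokenization, not Lucene code; the class and method names are hypothetical, and stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only emulation of letter-based tokenization; this is NOT Lucene code.
final class LetterOnlyTokens {

    // Splits text into lowercase runs of letters. Anything that is not a
    // letter (digits included) ends the current token and is discarded.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "66" and "2004" never become tokens, so they cannot be searched.
        System.out.println(tokenize("Order 66 shipped in 2004"));
    }
}
```

An analyzer that keeps digits inside tokens avoids the problem, provided the same analyzer is used for both indexing and searching.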
--
Best regards,
Andrzej Bialecki
to lemmas than in e.g.
Porter's stemmer, but there is a significant amount of stems like in the
example above.
--
Best regards,
Andrzej Bialecki
also view the final
query terms.
--
Best regards,
Andrzej Bialecki
your index with term vectors.
--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD
, and then
only their postings (occurrences) are stored.
--
Best regards,
Andrzej Bialecki
can certainly help someone to get
started with testing...
--
Best regards,
Andrzej Bialecki
, usually using QueryParser, and finally you search using
IndexSearcher. You get a list of Hits, which you can use to get scores,
and the contents of the documents. Take a look at the IndexFiles and
SearchFiles classes in org.apache.lucene.demo package (under /src/demo).
--
Best regards,
Andrzej
that these could easily be found, with the
heuristic that a frequent way of misspelling words is to transpose two
adjacent letters.
Yes, sounds like a good idea. Even though it increases the size of the
lookup index, it still beats using the linear search...
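
A minimal sketch of the transposition heuristic, assuming each variant would be stored in the lookup index as a key pointing back to the correctly spelled word. The class and method names are hypothetical, and nothing here is Lucene code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class Transpositions {

    // Returns every distinct string obtained by swapping exactly one pair
    // of adjacent characters in the given word.
    static Set<String> variants(String word) {
        Set<String> out = new LinkedHashSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping equal letters gives the word itself
            }
            swapAdjacent(chars, i);
            out.add(new String(chars));
            swapAdjacent(chars, i); // restore the original order
        }
        return out;
    }

    private static void swapAdjacent(char[] chars, int i) {
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
    }

    public static void main(String[] args) {
        System.out.println(variants("lucene"));
    }
}
```

A word of n letters yields at most n - 1 extra keys, which is the index growth mentioned above; a misspelled query term then becomes a single exact lookup instead of a linear scan.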
--
Best regards,
Andrzej Bialecki
in as a committer to the sandbox then I can
Well, someone needs to maintain the code after all... ;-)
--
Best regards,
Andrzej Bialecki
terms. This should be fast, and
you could provide a "did you mean" function too...
--
Best regards,
Andrzej Bialecki
the other way?
In my experience it's safe. I've been doing this in a couple of real
applications, and Luke also has an option to re-pack the index in
compound or non-compound format.
--
Best regards,
Andrzej Bialecki
quite well - I use it myself in
some applications, both with Lucene 1.3 and 1.4. The disadvantage is of
course that the memory consumption goes up, so you have to be careful to
cap the max size of RAMDirectory according to your max heap size limits.
--
Best regards,
Andrzej Bialecki
Robert Brown wrote:
F:\Apache\Lucene\AddOns\Luke\v0.5>java -fullversion
java full version 1.3.1_10-b03
F:\Lucene\AddOns\Luke\v0.5
I never tested it with anything below 1.4 ...
--
Best regards,
Andrzej Bialecki
a PDF parser (e.g.
PDFBox), and then extract plain-text content (such as body, title,
author, etc), and only then add that plaintext content to the index.
--
Best regards,
Andrzej Bialecki
, but so far no one has provided any patches...
--
Best regards,
Andrzej Bialecki
on the Search tab to see what is
the result of your query, or paste your query into the text area on the
AnalyzerTool plugin (Plugins), and see what tokens you get using
RussianAnalyzer.
I just did it, and the result for * was - clearly
not what you wanted.
--
Best regards,
Andrzej Bialecki
on it...
--
Best regards,
Andrzej Bialecki
you send with every reply made to the
list...
--
Best regards,
Andrzej Bialecki
package from the sandbox to
produce snippets.
--
Best regards,
Andrzej Bialecki
;lucene.jar org.getopt.luke.Luke
+ Remember to put both JARs on your classpath, e.g.: java -classpath
luke.jar:lucene.jar org.getopt.luke.Luke
Well, both versions are correct - just the platform is different :-).
I'll make a clarification. Thank you!
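
The two versions differ only in the classpath separator: ';' on Windows and ':' on Unix-like systems. A small sketch of building the command line portably with java.io.File.pathSeparator follows; the helper class is hypothetical, while the JAR names and main class are the ones quoted above.

```java
import java.io.File;

final class LukeCommandLine {

    // Joins the JARs with the separator of the platform this code runs on
    // and appends Luke's main class.
    static String build(String... jars) {
        return "java -classpath "
                + String.join(File.pathSeparator, jars)
                + " org.getopt.luke.Luke";
    }

    public static void main(String[] args) {
        System.out.println(build("luke.jar", "lucene.jar"));
    }
}
```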
--
Best regards,
Andrzej Bialecki
or bugfixes are welcome! If you
want to provide a patch, please use diff -bdruN - this will help me to
integrate it. Thank you!
--
Best regards,
Andrzej Bialecki
% of correct stems, and ~70% of correct lemmas.
Which is a _very_ good result!
--
Best regards,
Andrzej Bialecki
), and it works exceptionally well indeed. Highly recommended!
--
Best regards,
Andrzej Bialecki
project
(http://www.egothor.org) - much more sophisticated than simple
rule-based stemmers like Snowball or Porter. In fact, after proper
training on a large corpus I was getting ~70% of correct lemmas for
previously unseen words, and over 90% of correct (unique) stems.
--
Best regards,
Andrzej
on...
--
Best regards,
Andrzej Bialecki
Dennis Thrysøe wrote:
Andrzej Bialecki wrote:
What about using PhraseQuery, and store the path with all but first
path separator replaced by whitespace (i.e. /foo bar baz one two
three). Then you could query for /foo bar, /foo bar baz, and so
on...
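
The path rewriting suggested here can be sketched as a plain string transformation. This is only an illustration with hypothetical names; the resulting text would still have to be indexed in a tokenized field for phrase queries to match it.

```java
final class PathPhrase {

    // Keeps everything up to and including the first '/' and replaces
    // every later '/' with a single space.
    static String toPhraseText(String path) {
        int first = path.indexOf('/');
        if (first < 0) {
            return path; // no separator at all, nothing to rewrite
        }
        return path.substring(0, first + 1)
                + path.substring(first + 1).replace('/', ' ');
    }

    public static void main(String[] args) {
        System.out.println(toPhraseText("/foo/bar/baz/one/two/three"));
    }
}
```

A phrase query for "/foo bar" would then match every document stored under /foo/bar, since those two tokens are adjacent in the rewritten text.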
Hi,
It doesn't seem to work though
Dennis Thrysøe wrote:
Andrzej Bialecki wrote:
Anyway.. I should've added that for Phrase Queries to work the text
must be tokenized. So, the best way in this case would be to use
WhitespaceAnalyzer for the uri field,
I've figured out how to use the WhitespaceAnalyzer for creating
in the
indexing/inserting process between the runs. Luke also provides
a simple time measurement for query execution. Just FYI.
--
Best regards,
Andrzej Bialecki
(1.4.2) seem to be very stable and performing well, so that could also
be an option.
After all, a filesystem _is_ a kind of very specialized database... ;-)
--
Best regards,
Andrzej Bialecki
to a field. I
ended up encoding the keywords like 10.0 keyword and then writing an
analyzer which skips the initial numbers when processing this particular
field (which was stored, indexed and tokenized).
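
A rough sketch of that encoding, assuming the stored value is a leading number followed by a space and the keyword text. The snippet is truncated, so the exact meaning of the number is not known here, and the class below is hypothetical rather than the analyzer described above.

```java
final class WeightedKeyword {
    final double number;   // the leading number, e.g. 10.0
    final String keyword;  // the text an analyzer would actually index

    private WeightedKeyword(double number, String keyword) {
        this.number = number;
        this.keyword = keyword;
    }

    // Splits "10.0 keyword" into its numeric prefix and the keyword text.
    static WeightedKeyword parse(String encoded) {
        String trimmed = encoded.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected '<number> <keyword>': " + encoded);
        }
        double number = Double.parseDouble(trimmed.substring(0, space));
        String keyword = trimmed.substring(space + 1).trim();
        return new WeightedKeyword(number, keyword);
    }

    public static void main(String[] args) {
        WeightedKeyword k = parse("10.0 keyword");
        System.out.println(k.number + " -> " + k.keyword);
    }
}
```

Because the field is stored, the number can be read back from the stored value at search time, while the analyzer indexes only the keyword part.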
--
Best regards,
Andrzej Bialecki
huhnerstall out of it in the query (Why?). But there is no huhnerstall
indexed.
Please check which Analyzer you're using in each case.
--
Best regards,
Andrzej Bialecki
will be another couple of days), but in the
meantime you can just rebuild Luke from sources, using the latest Lucene.
--
Best regards,
Andrzej Bialecki
of the Lucene
index?
You should try to reduce the dimensionality by reducing the number of
unique features. In this case, you could for example use only keywords
(or key phrases) instead of the full content of documents.
--
Best regards,
Andrzej Bialecki
pressing Search.
* Fix the JNLP file to require J2SE 1.3+.
* By popular demand, add a single self-contained JAR to the binary
distribution.
* Minor restructuring to increase reuse.
Screenshots have been updated, too. Enjoy!
--
Best regards,
Andrzej Bialecki
that. Could you please turn
on the Java console, and see what kind of exception is thrown, and where?
--
Best regards,
Andrzej Bialecki
on the search page. Spotted by Erik Hatcher.
Thank you for your comments and contributions!
--
Best regards,
Andrzej Bialecki
xerces, you might want to look at these. You might
want to look at http://dom4j.org/.
Dror
You may want to check the XML Pull Parser - it offers something between
SAX and DOM, with performance similar to SAX.
(http://www.extreme.indiana.edu/xgws/xsoap/xpp)
--
Best regards,
Andrzej Bialecki
value for the fields when I
search? Is it equivalent to value1^10.0 value2^20.0 (which is my
intention), or rather value1^20.0 value2^20.0?
If the latter, do you have any suggestions how to achieve the original
effect?
Thanks in advance!
--
Best regards,
Andrzej Bialecki
using a bastardized version of Markov
chains, but it's more of a hack...
--
Best regards,
Andrzej Bialecki
Well ... Sure, nothing can replace a human mind. But believe it or not,
there are studies which show that even human experts can differ
significantly in their opinions on what the key phrases of a given text are. So,
the results are never clear cut with humans either...
So, in this sense a
(including Java
implementation of strings(1) command as the last resort).
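
For illustration, a simplified take on what a Java version of strings(1) might look like: it collects runs of printable ASCII bytes of at least a given length. This is a sketch under those assumptions, not the implementation referred to above; the real command also handles other encodings and options.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PrintableStrings {

    // Returns every run of printable ASCII (0x20..0x7E) that is at least
    // minLength bytes long, in the order the runs appear in the data.
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                run.append((char) c);
            } else {
                flush(run, minLength, found);
            }
        }
        flush(run, minLength, found); // the data may end inside a run
        return found;
    }

    private static void flush(StringBuilder run, int minLength, List<String> found) {
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        byte[] data = "\u0000\u0001Hello\u0000ab\u0000World!"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extract(data, 4));
    }
}
```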
--
Best regards,
Andrzej Bialecki
please visit the link above to get both
binaries and source code.
--
Best regards,
Andrzej Bialecki
also very actively developed.
--
Best regards,
Andrzej Bialecki
), and
then got stuck in an infinite wait somewhere... So I came up with a
workaround: I run the parser in a separate thread, while waiting in the
main thread, and then after a certain timeout I kill the processing
thread and return.
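
A minimal sketch of this workaround using java.util.concurrent, which is a swapped-in technique: it interrupts the worker instead of killing it outright, so a parser that ignores interrupts keeps running in the background (as a daemon thread, so it cannot block JVM exit). The class name and the use of null as the "timed out" result are assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedParse {

    // Runs the parser in its own thread and waits at most timeoutMillis.
    // Returns the parser's result, or null if it did not finish in time.
    static String parseWithTimeout(Callable<String> parser, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<String> result = executor.submit(parser);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the worker and give up on it
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("parser failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithTimeout(() -> "parsed text", 1000));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a parser that hangs
            return "too late";
        }, 200));
    }
}
```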
--
Best regards,
Andrzej Bialecki
for
query expansion and for finding associated words (synsets?), or
hypernyms like in your example.
--
Best regards,
Andrzej Bialecki
Andrzej Bialecki wrote:
Julien Nioche wrote:
[- and almost impossible : recompose the unstored fields of a document]
It's not impossible, just time-consuming - all information (except the
parts removed by analyzer) is already there. This functionality has a
high cool-ness factor, which
.
* Add Read-Only mode.
* Fix spinbox bug (really a bug in the Thinlet toolkit - fixed there).
* Allow browsing hidden directories.
* Add a combobox to choose the default field for searching.
* Other minor code cleanups.
Thanks to all who provided their comments and suggestions!
--
Best regards,
Andrzej
of
adding some GUI and logic. Thinlet-based applications are easy to modify
in the View layer, so it's up to the Controller part, if it can be coded
at all...
--
Best regards,
Andrzej Bialecki
this as well...
In any case, if you're referring to the Search panel, then you can
always double-click on one of the search results, and it will be
displayed in the Documents panel, where you can not only see all the
fields, but also copy them to clipboard...
--
Best regards,
Andrzej Bialecki
objects, but this information can be found at msdn.microsoft.com.
Obviously I'd love to learn about an alternative, because then I could
free my clients from dependence on Office... I already use POI to
convert XLS and DOC files, and it works _very_ well.
--
Best regards,
Andrzej Bialecki
an extensible marshaller/de-marshaller, so if you know COM pretty
well you can extend it to handle any conceivable parameter types.
--
Best regards,
Andrzej Bialecki
document bean that allows you to work with a document editor in
JComponent.
--
Best regards,
Andrzej Bialecki
than we think
:).
I believe there are tools out there that will analyze Java sources and
create UML class diagrams from that. I believe TogetherJ or one of
those 'all in one' tools can do that.
I can do it for you, if you want - it takes ~10 minutes.
--
Best regards,
Andrzej Bialecki
wrote:
COM based parser:
http://www.intrinsyc.com/products/enterprise_applications.asp
convert word to text: http://www.winfield.demon.nl/index.html
That's a bit expensive... I found a free alternative - Jawin, plus OLE
Automation.
--
Best regards,
Andrzej Bialecki
Multivalent
browser, and is subject to BSD-equivalent license - which means you can
use it for whatever purpose, and if it turns out to be useful, it can be
included in Lucene distribution.
--
Best regards,
Andrzej Bialecki
it
only with small doc. collections that I use for functionality testing...
Everything appears to work as expected, but my test collection is just
~100 documents, so the searching is blazingly fast no matter what I do.. :-)
--
Best regards,
Andrzej Bialecki
petite_abeille wrote:
On Tuesday, Feb 25, 2003, at 09:43 Europe/Zurich, Andrzej Bialecki wrote:
No, I'm not - this is clearly stated in the class javadoc. I meant to
try it out in my application, but haven't got to it yet - I need to
address the base functionality first, not performance; so, I
it does against particular query...
Any suggestions?
--
Best regards,
Andrzej Bialecki
the query? Should I expect a similar cost for that as creating
Explanations separately?
BTW: I tried to contact you regarding some help in a commercial project.
Is [EMAIL PROTECTED] the right way to do it?
Thanks!
Andrzej Bialecki wrote:
Hello,
Is there any simple way to get the information from
--
Best regards,
Andrzej Bialecki