Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Andrzej Bialecki
AndrzejBialecki to the ContributorsGroup. Thanks! -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog ___.,___,___,___,_._. __ [___||.__|__/|__||\/|: Information Retrieval, System Integration ___|||__||..\|..||..|: Contact

Lucene 4 architecture - paper available

2012-10-09 Thread Andrzej Bialecki
the reading. :)

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Andrzej Bialecki
positions and offsets if available (or blanks if not available).

Re: Getting terms from unstored fields, doc-wise

2012-07-27 Thread Andrzej Bialecki
fields or in an external system.

Re: Lucene 4.0 .FDT

2012-07-19 Thread Andrzej Bialecki
is whether the space savings would be worth the complication?

[ANNOUNCE] Luke 4.0.0-ALPHA released

2012-07-17 Thread Andrzej Bialecki
implementation must be inherited from FSDirectory (mitja.lenic) * Issue 21: luke tarball needs to extract to a luke directory (bevan.koopman, Photodeus) * Issue 27: Cannot add or edit documents using StandardAnalyzer (dean.thrasher) Thanks to all contributors. Enjoy!

Re: Index pruning

2012-06-13 Thread Andrzej Bialecki
pairs.

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-26 Thread Andrzej Bialecki
. LUCENE-3837, to be specific. But as you said, it's still early and there is no code yet to speak of... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded

Re: lucene algorithm ?

2012-04-26 Thread Andrzej Bialecki
is lower than the current lowest score.
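The cut-off mentioned in this snippet (a hit is dropped when its score is lower than the current lowest score) matches the usual bounded priority-queue approach to collecting the top N hits. The rest of the message is not shown here, so the following is only a sketch of that general technique in Python, not Lucene code; `collect_top_n` and the `(doc_id, score)` input shape are hypothetical.

```python
import heapq

def collect_top_n(hits, n):
    """Keep the n best (doc_id, score) pairs from an iterable of hits."""
    if n <= 0:
        return []
    heap = []  # min-heap of (score, doc_id); heap[0] holds the current lowest score
    for doc_id, score in hits:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Better than the weakest hit kept so far: replace it.
            heapq.heapreplace(heap, (score, doc_id))
        # Otherwise the hit does not beat the current lowest score and is skipped.
    # Return best-first.
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]
```

The point of the design is that memory stays proportional to N rather than to the number of matching documents, and most low-scoring hits cost a single comparison.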

Re: delete entries from posting list Lucene 4.0

2012-04-02 Thread Andrzej Bialecki
On 29/03/2012 11:14, Andrzej Bialecki wrote: The problem in our implementation is that we use a within-document term frequency (the number of occurrences of t in the current document) and not a collection-wide term frequency... so, it looks to me that the fix would be to first fully traverse

Re: delete entries from posting list Lucene 4.0

2012-03-29 Thread Andrzej Bialecki
enumeration and calculate the total number of term occurrences in all documents (e.g. in RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the formula in place of termPositions.freq().
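The fix described in this thread amounts to replacing a within-document frequency with a collection-wide total obtained by one full pass over the term's postings. A minimal Python sketch of that pass follows; it is not the pruning code itself, the formula it feeds is not shown in these snippets and is not reproduced, and `postings` is a hypothetical stand-in for the positions enumeration.

```python
def collection_term_frequency(postings):
    """Sum within-document frequencies over every document containing the term.

    `postings` is a hypothetical iterable of (doc_id, freq_in_doc) pairs,
    standing in for a full traversal of the term's positions enumeration.
    """
    return sum(freq for _doc_id, freq in postings)

# The term occurs 3 times in doc 0, once in doc 4 and 5 times in doc 9.
total_occurrences = collection_term_frequency([(0, 3), (4, 1), (9, 5)])
```

The total would be computed once per term, before any per-document decision is made, and then used wherever the per-document `freq()` value was used before.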

Re: delete entries from posting list Lucene 4.0

2012-03-19 Thread Andrzej Bialecki

Re: Tamper resistant index

2012-01-09 Thread Andrzej Bialecki
, and then see if it's good enough.

[ANN] Luke 3.5.0 released

2011-12-28 Thread Andrzej Bialecki
and a happy New Year to you all! :)

Re: luke and chinese text

2011-12-22 Thread Andrzej Bialecki
that supports Unicode characters, the default platform font often doesn't support them, which results in '?' or other strange characters.

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
bit-level distance in their hashes from the query hash. The solution is described in SOLR-1918 - Bit-wise scoring field type.

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
On 31/10/2011 21:42, Petite Abeille wrote: On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote: similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar
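The thread does not say which similarity-preserving hash was used. One well-known member of that family is SimHash, so the sketch below uses it purely as an illustration, together with the bit-level (Hamming) distance that the related message mentions. The function names and the choice of SHA-256 as the per-token hash are assumptions made for this example.

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash-style fingerprint: each token votes on every bit position."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 bits of the token hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the majority of tokens had it set.
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

query = simhash("the quick brown fox jumps".split())
candidate = simhash("the quick brown fox leaps".split())
distance = hamming_distance(query, candidate)  # smaller means more similar
```

Sentences that share most of their tokens cast mostly the same votes, so their fingerprints tend to differ in only a few bits; ranking by that distance is the idea behind scoring on hash proximity.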

[ANN] Luke 3.4.0 release

2011-10-03 Thread Andrzej Bialecki
. * Rearranged field flags so that they are more logical and cover index options added in 3.4.0. E.g. omitNorms is represented as with Norms and marked by N, IndexOptions are expanded to Idfp to mark indexed fields with docs, freqs and positions. Enjoy!

Re: [ANN] Luke 3.4.0 release

2011-10-03 Thread Andrzej Bialecki
. There's probably some lesson to learn from this situation... I committed a fix, and the updated release is marked as 3.4.0_1. Sorry!

[ANN] Luke 3.3.0 released.

2011-07-06 Thread Andrzej Bialecki
Hi all, Luke 3.3.0 has been released and is available for download here: http://code.google.com/p/luke/ Apart from the updated Lucene libraries there were no changes in functionality.

Re: Changing Boosting that was set at indexing time

2011-06-16 Thread Andrzej Bialecki
modify norms directly using IndexReader.setNorm(...) but you need to remember that this method uses raw byte values, that is the result of encoding a floating point value with Similarity.encodeNormValue(..).
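To illustrate why the raw byte matters, here is a Python sketch of a one-byte float encoding with 3 mantissa bits and 5 exponent bits. It is written from memory of the scheme the default similarity used in the Lucene 3.x line, so treat both that attribution and the details as assumptions; `encode_norm` and `decode_norm` are hypothetical names, not Lucene API.

```python
import struct

def encode_norm(f):
    """Squeeze a float into one byte: 3 mantissa bits, 5 exponent bits."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # float32 bit pattern
    small = bits >> 21                 # sign, exponent and top 3 mantissa bits
    base = (63 - 15) << 3              # exponent zero point
    if small <= base:
        return 0 if bits <= 0 else 1   # zero or negative -> 0, tiny positive -> 1
    if small >= base + 0x100:
        return 255                     # too large: saturate at the biggest byte
    return small - base

def decode_norm(b):
    """Expand the byte back to the (coarser) float it stands for."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Under this scheme 1.0 and 1.1 land on the same byte while 1.25 gets the next one, which is why small differences in a boost can vanish once stored, and why a value written with a raw-byte setter has to be encoded first.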

Re: Coloring search results based on score?

2011-06-16 Thread Andrzej Bialecki
://people.ischool.berkeley.edu/~hearst/research/tilebars.html

[ANN] Luke 3.1.0 released

2011-04-29 Thread Andrzej Bialecki
, patches and comments. Happy Luke-ing!

Re: Search one index but use IDF from another?

2011-03-10 Thread Andrzej Bialecki
obtained from the full index, and then you use this map to calculate IDF.
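The approach in this snippet (build a map from the full index, then compute IDF from the map) can be sketched in Python as follows. This is an illustration rather than Lucene code: the helper names and the token-list input are hypothetical, and the formula is the classic `1 + ln(numDocs / (docFreq + 1))` shape, assumed here to match the default similarity of that era.

```python
import math
from collections import Counter

def build_doc_freq(full_index_docs):
    """Map each term to the number of documents in the full index containing it."""
    doc_freq = Counter()
    for tokens in full_index_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return doc_freq

def idf(term, doc_freq, num_docs):
    """IDF computed from the full-index statistics, not from the searched index."""
    return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

full_index = [["lucene", "index"], ["lucene"], ["lucene", "luke", "luke"]]
df_map = build_doc_freq(full_index)
rare = idf("luke", df_map, len(full_index))      # term in 1 of 3 documents
common = idf("lucene", df_map, len(full_index))  # term in all 3 documents
```

The smaller index is then searched as usual, with the scoring component looking up document frequencies in `df_map` instead of asking the index being searched.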

Re: Adding a new field to existing Index

2010-07-07 Thread Andrzej Bialecki
On 2010-07-07 14:49, Naveen Kumar wrote: Hi Andrzej Bialecki When you suggested - There are some other low-level ways to do this, but the easiest is to use a FilterIndexReader, especially since you just want to add a stored field - implement a subclass of FilterIndexReader

Re: Document Order in IndexWriter.addIndexes

2010-06-30 Thread Andrzej Bialecki
. You have been warned :)

Re: Adding a new field to existing Index

2010-06-30 Thread Andrzej Bialecki
the Reconstruct Edit functionality in Luke (http://www.getopt.org/luke).

Re: Document Order in IndexWriter.addIndexes

2010-06-30 Thread Andrzej Bialecki
in the output index.

Re: Question about Field.setOmitTermFreqAndPositions(true)

2010-05-31 Thread Andrzej Bialecki
On 2010-05-31 10:54, Uwe Schindler wrote: No. See also LUCENE-2048 (nice round number ;) ).

Re: Access indexed terms

2010-05-14 Thread Andrzej Bialecki
need such kind of access in your application then add your documents with term vectors with offsets and positions. Even then, depending on the Analyzer you used, the process is lossy - some input data that was discarded by Analyzer is simply no longer available.

Re: Access indexed terms

2010-05-14 Thread Andrzej Bialecki
that? Yes, see the discussion here: https://issues.apache.org/jira/browse/LUCENE-2393

Re: How to avoid sharing docStore files?

2010-05-12 Thread Andrzej Bialecki
will need twice as much space. But in this case perhaps you could put the original index on a network FS, and split it into the target partition - the data would be read just once.

[ANN] Luke - The Lucene Index Toolbox - 1.0.1 release

2010-04-01 Thread Andrzej Bialecki
plugin (and analyzers) don't work. * Issue 4 : Compress flag no longer available. * Issue 14 : Error while using custom similarity. Enjoy!

Re: SpanQueries in Luke

2010-03-05 Thread Andrzej Bialecki
. I'll commit the current mostly-working state today, you can take a look - you've written some cool Luke plugins before .. ;)

Re: SpanQueries in Luke

2010-03-05 Thread Andrzej Bialecki
could store such information in IndexCommit.getUserData(). The lack of standardized metadata is an issue, of course - we could start experimenting with this in Luke, to see whether we can squeeze a subset of Solr schema there.

Re: SpanQueries in Luke

2010-03-04 Thread Andrzej Bialecki
this parser out of the box. I expect to make a release within a few days. Watch the commits on the Google code project ...

Re: SpanQueries in Luke

2010-03-04 Thread Andrzej Bialecki

Re: Do deleted documents affect scores?

2010-02-11 Thread Andrzej Bialecki
).

Re: Field creation with TokenStream and stored value

2010-01-13 Thread Andrzej Bialecki
implement your own Fieldable, and return what you want from its methods. You can also use Field constructor that takes the stored value, and then use Field.setTokenStream(TokenStream) - it doesn't override the stored value.

[ANN] Luke 1.0.0 for Lucene 3.0

2009-12-26 Thread Andrzej Bialecki
Lucene 2.9.1 and 3.0. Your feedback is welcome - please use the Google Issue tracker to report issues. Merry Christmas!

Re: [ANN] Luke 1.0.0 for Lucene 3.0

2009-12-26 Thread Andrzej Bialecki
and if not existent fall back to the zero-arg ctor. I'll open an issue. Indeed - thanks!

Re: document with different index time boost returns same score

2009-12-18 Thread Andrzej Bialecki
that this encoding causes (and what input values effectively come out the same, once encoded).

[ANN] Luke 0.9.9.1 release

2009-11-20 Thread Andrzej Bialecki
to edit per-commit user data Map Bug fixes - * Term frequency vectors were not displayed for selected field. Enjoy!

Re: Split single string into several fields?

2009-10-28 Thread Andrzej Bialecki
create other fields in the document (or split this token stream into several fields).

Re: [ANN] Luke 0.9.9 release

2009-10-23 Thread Andrzej Bialecki
are hardcoded somewhere deep in Thinlet, but likely they could be made configurable. You can find an EPS version of the Lucene logo here: http://lucene.apache.org/images/logo.eps

Re: Question about how to speed up custom scoring

2009-10-08 Thread Andrzej Bialecki
) that if the terms you load are indexed that'll help. But this is mostly a guess. Just to clarify: IndexReader.document(doc) and .document(doc, selector) load _only_ stored fields, they don't interact at all with the terms-related part of Lucene..

Re: [ANN] Luke 0.9.9 release

2009-10-01 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Hi all, I'm happy to announce the new release of Luke - the Lucene Index Toolbox. There's a bug in this version in that it doesn't show TermVectors for a field. I'll fix it in a few days - I'm waiting for other potential bugs to show up. So if you find something

[ANN] Luke 0.9.9 release

2009-09-29 Thread Andrzej Bialecki
! :)

Re: Lucene gobbling file descriptors

2009-08-27 Thread Andrzej Bialecki
/Downloads/LucidGaze-for-Lucene

Lucene Search Performance Analysis Workshop

2009-08-26 Thread Andrzej Bialecki
, September 3rd 2009 11:00-11:30AM PDT / 14:00-14:30 EDT Follow this link to sign up: http://www2.eventsvc.com/lucidimagination/event/ff97623d-3fd5-43ba-a69d-650dcb1d6bbc?trk=WR-SEP2009-AP About: Lucene Performance Workshop: Understanding Lucene Search Performance with Andrzej Bialecki Experienced

Re: Why does this search succeed with web app, but not Luke?

2009-08-07 Thread Andrzej Bialecki
. analyzed tokens in the field should become apparent.

[ANN] Luke + Hadoop, alpha version

2009-07-10 Thread Andrzej Bialecki
that this is an early preview. Also, various UI glitches are probably related to the Thinlet toolkit - again, one day I may re-write Luke using something else, but for now I don't have the strength to do it. :)

Re: Lucene Index Encryption

2009-05-11 Thread Andrzej Bialecki

Re: Index in text format

2009-04-24 Thread Andrzej Bialecki
://www.getopt.org/luke) can export all stored fields from all documents into an XML file.

Re: Help to determine why an optimized index is proportionaly too big.

2009-04-10 Thread Andrzej Bialecki
. (Actually: does CheckIndex warn about unused files in the index directory so people can clean them up? i'm not sure) It doesn't. But Luke has a function to do this.

Re: [ANN] Luke 0.9.2 release

2009-03-20 Thread Andrzej Bialecki
Andrzej Bialecki wrote: (sorry for cross-posting) Hi all, I'm happy to announce a new release of Luke, the Lucene Index Toolbox. As usual, you can obtain it from here: http://www.getopt.org/luke If you tried to access this URL during the last couple of hours, the site was down. It should

Re: boosting query

2009-03-19 Thread Andrzej Bialecki
an arbitrary re-sorting of top-N results, according to your rules of preference (business rules, or heuristics). This way you can avoid the overfitting or doing endless tweaking, and still get the ranking that makes sense to your users.
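The re-sorting of top-N results suggested in this snippet can be done entirely outside the scoring machinery. A minimal Python sketch, assuming hits arrive as a relevance-ordered list; `rerank_top_n` and the `in_stock` field are hypothetical names chosen for the example.

```python
def rerank_top_n(hits, n, preference):
    """Re-sort only the first n hits by a business-rule key; keep the tail as is."""
    head, tail = hits[:n], hits[n:]
    # sorted() is stable, so hits with equal preference keep their relevance order.
    return sorted(head, key=preference, reverse=True) + tail

hits = [
    {"id": 1, "in_stock": False},
    {"id": 2, "in_stock": True},
    {"id": 3, "in_stock": True},
    {"id": 4, "in_stock": True},
]
# Prefer in-stock items, but only within the first three results.
reranked = rerank_top_n(hits, 3, lambda hit: hit["in_stock"])
```

Because only the head of the list is touched, the search engine's own ranking still decides which documents are candidates at all, and the rules only decide their final order.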

[ANN] Luke 0.9.2 release

2009-03-19 Thread Andrzej Bialecki
per field in Overview - contributed by Mark Harwood. o Improved the Analysis plugin to show all token information, and highlight whenever a token is selected from the list. * Bug fixes: o (None)

Re: IndexSearcher

2009-03-09 Thread Andrzej Bialecki
to the classpath when you start Luke.

Re: IndexSearcher

2009-03-08 Thread Andrzej Bialecki
liat oren wrote: Ok, thanks. I will have to edit the code of Luke in order to add another analyzer, right? No - if your analyzer is already on the classpath, then it's enough to type in the fully qualified class name in the drop down box (it's editable).

Re: Luke site is down?

2009-03-04 Thread Andrzej Bialecki
Hi all, I apologize for the inconvenience - the site went down without any prior notice from the ISP. I'm investigating the issue ...

Re: Determining index term count

2009-01-07 Thread Andrzej Bialecki
be messy - it's better to propose that this information should be added to API.

Re: Document.getBinaryValue returning null after upgrading to 2.4 for the data which was indexed using 2.3.1

2008-12-16 Thread Andrzej Bialecki
, the search worked fine using 2.4. Any ideas why this is happening. No idea - but perhaps this is somehow related: https://issues.apache.org/jira/browse/LUCENE-1452

Re: Document.getBinaryValue returning null after upgrading to 2.4 for the data which was indexed using 2.3.1

2008-12-16 Thread Andrzej Bialecki
of Lucene involved.

Re: [ot] a reverse lucene

2008-11-23 Thread Andrzej Bialecki
it. (with a score etc) I can see the case for this would be a news-article and several people writing queries to get alerted if it matched a certain condition. http://www.seas.upenn.edu/~svilen/publications/subscribe.pdf

[ANN] Luke 0.9.1 - bugfix release

2008-11-23 Thread Andrzej Bialecki
commits option was specified. Reported by Mark Harwood. o Empty index with no fields was reported as invalid. Discovered by Andrew Zhang and Michael McCandless (LUCENE-1454). Thank you!

Re: [ANN] Luke 0.9 released

2008-11-14 Thread Andrzej Bialecki
other places ... I forgot about the use of IndexFileDeleter - and indeed passing the read-only flag here can solve this, because then I can always use KeepAllDeletionPolicy when opening read-only. Thanks for the report!

Re: Read all the data from an index

2008-10-31 Thread Andrzej Bialecki
IndexReader.undeleteAll().

Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
format, incompatible with earlier versions of Lucene (including 2.4 release).

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
Andrzej Bialecki wrote: 1) Luke 2.4 release. This has the advantage of being an official stable [...] 2) Luke 2.9-dev snapshot. This has the advantage that you get the [...] Of course I meant Lucene 2.4 and Lucene 2.9-dev ... sorry for the confusion.

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
else does it it's simply not going to happen. All code in Luke except for the Thinlet class is under Apache License, so feel free to start coding :)

Re: Luke is coming .. not there yet.

2008-10-30 Thread Andrzej Bialecki
this in the proposals for the next summer.

Re: Sorting posting lists before intersection

2008-10-13 Thread Andrzej Bialecki
it needs to read this info from the .ti file.

Re: Case Sensitivity

2008-09-19 Thread Andrzej Bialecki
methods on Fieldable that test the validity of flag combinations with particular version of Lucene?

Re: Sorting posting lists before intersection

2008-09-17 Thread Andrzej Bialecki
: ConjunctionScorer, lines 85-103 - pay attention to the comments there, it's not strictly a sort by frequency, rather by the sampled sparseness.

Re: Pre-filtering for expensive query

2008-08-30 Thread Andrzej Bialecki
scoring and not afterwards. FilteredQuery internally makes use of skipTo(), which should help to limit the number of evaluated docs.
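The benefit of skipTo() mentioned here can be seen in a toy leapfrog intersection: each side jumps straight to the other side's current document instead of visiting every ID in between. The Python sketch below only illustrates that idea; `Postings`, `skip_to` and `filtered` are hypothetical names and do not mirror the Lucene classes.

```python
from bisect import bisect_left

class Postings:
    """Sorted doc-ID list with a skip_to() that jumps instead of stepping."""

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = 0

    def skip_to(self, target):
        """Advance to the first doc ID >= target; return it, or None if exhausted."""
        self.pos = bisect_left(self.doc_ids, target, self.pos)
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

def filtered(query_postings, filter_postings):
    """Doc IDs present in both lists, found by leapfrogging with skip_to()."""
    result = []
    doc = query_postings.skip_to(0)
    while doc is not None:
        allowed = filter_postings.skip_to(doc)
        if allowed is None:
            break  # the filter has no more documents, so nothing else can match
        if allowed == doc:
            result.append(doc)  # only these documents would need to be scored
            doc = query_postings.skip_to(doc + 1)
        else:
            doc = query_postings.skip_to(allowed)  # jump over rejected documents
    return result
```

With a selective filter, most of the query's postings are never examined, which is what makes filtering during scoring cheaper than discarding hits afterwards.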

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
are set now like this: isIndexed = true; isTokenized = true; omitNorms = true; The end result of processing such a field is (I believe) conceptually equivalent to adding as many Fields as there are tokens, each with omitNorms=true.

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
Otis Gospodnetic wrote: So in other words, it *is* possible to have the field both tokenized and its norms omitted? Yes. Probably this is an unintended side-effect of adding setOmitNorms, but I think it's useful and IMHO we should keep it.

Re: boost freshness instead of sorting

2008-08-28 Thread Andrzej Bialecki
the same thing in Case sensitivity thread - it's possible to have a tokenized field and omit its norms.

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki

Re: Stop search process when a given number of hits is reached

2008-08-07 Thread Andrzej Bialecki
(org.apache.nutch.indexer.IndexSorter).

Re: updating existing field values

2008-08-07 Thread Andrzej Bialecki
, which requires overriding other methods.

Re: Copying a part of index and index structure

2008-06-20 Thread Andrzej Bialecki
by doc.

Re: Copying a part of index and index structure

2008-06-20 Thread Andrzej Bialecki
Indices ... and quite a few other papers that I don't remember now ... please do a search for distributed IR on ACM or Citeseer.

Re: Rebuilding parallel indexes

2008-06-09 Thread Andrzej Bialecki
;) Perhaps you could use a FilterIndexReader to maintain a map between new IDs and old IDs, and remap on the fly. Although I think that some parts of Lucene depend on the fact that in a normal index the IDs are monotonically increasing ... this would complicate the issue.

Re: text extraction from pdf

2008-05-14 Thread Andrzej Bialecki
something better, AFAIK, PDFBox has a lower-level API that allows you to get hold of text positions.

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-13 Thread Andrzej Bialecki
_and_ formatting from any documents that could be normally opened with MS Office - however, performance was an issue, ie. it was slow, CPU/memory hog, and occasionally it would get stuck in a weird state when only complete reboot would help.

Re: Does lucene support distributed indexing?

2008-04-29 Thread Andrzej Bialecki
that executes in a distributed fashion (not sure if map-reduce is the best model here), but first copies the indexes to LocalFileSystem.

Re: How to reconstruct field value from index ?

2008-04-02 Thread Andrzej Bialecki
there is to it :)

Re: Stored Field vs offset plus external file?

2008-02-13 Thread Andrzej Bialecki
/ ram / fs for specific index file types (e.g. tis, tii, fdt, prx and so on) - you should be able to cut paste large chunks of each directory code to start the implementation.

Re: Lukes document hitlist display

2008-02-12 Thread Andrzej Bialecki
is not available ... Luke populates this screen using Document.getFields(). If a field is unstored then it's not returned in this list, so it's not possible to get its flags.

Re: TermPositionVector

2008-02-12 Thread Andrzej Bialecki
I'll include it in a minor update.

Re: appending field to an existing index

2008-02-04 Thread Andrzej Bialecki

[ANN] Luke 0.8 released

2008-02-04 Thread Andrzej Bialecki
this column now reads Norms and shows the fieldNorm value of a field. Have fun!

Re: Performance guarantees and index format

2008-01-31 Thread Andrzej Bialecki

FYI: parallel corpus in 22 languages

2008-01-24 Thread Andrzej Bialecki
://wt.jrc.it/lt/Acquis/

Re: SV: Integrating dynamic data into Lucene search/ranking

2008-01-17 Thread Andrzej Bialecki
to the on-disk index), and start using the new IndexSearcher. And again, start accumulating new docs in the RAMDirectory, etc, etc ...
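The cycle sketched in this snippet (accumulate new docs in memory, merge them into the on-disk index, switch to a new searcher, repeat) is a two-tier buffering pattern. Only the tail of the message is visible, so the toy Python model below fills in details by assumption, in particular the fixed flush threshold and searching both tiers at once; plain lists stand in for the real index structures.

```python
class TwoTierIndex:
    """Toy model: a small in-memory buffer in front of a large main index."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.main = []    # stand-in for the on-disk index
        self.buffer = []  # stand-in for the in-memory (RAMDirectory) index

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge the buffer into the main index, then start a fresh buffer.
        # In the real setup this is also the moment to open a new searcher.
        self.main = self.main + self.buffer
        self.buffer = []

    def search(self, predicate):
        # Consult both tiers so new docs are visible before they are flushed.
        return [doc for doc in self.main + self.buffer if predicate(doc)]
```

The expensive step (merging into the large index and reopening it) then happens once per batch rather than once per document, while searches still see recent additions.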

Re: Lucene + Hadoop

2008-01-16 Thread Andrzej Bialecki
it to the local filesystem first... Yes - see org.apache.nutch.indexer.FsDirectory. However, you will not like the performance, it's much slower than using the index locally.

Re: Bucketing (was Re: Wikia search goes live today)

2008-01-09 Thread Andrzej Bialecki
to implement, yet produces useful results difficult to obtain through the usual means (similarity, boosting, even function query).

Re: Wikia search goes live today

2008-01-08 Thread Andrzej Bialecki
? (I'm not involved in Wikia development). There are some ways to go about it even in the pure Lucene-land, so that the updates are fast without reindexing the main content. Hint: ParallelReader.
