storing pre-analyzed fields

2012-07-10 Thread Michael Sokolov
I have a question about the API for storing and indexing lucene documents (in 3.x). If I want to index a document by providing a TokenStream, I can do that by calling document.add (field) where field is something I write deriving from AbstractField that returns the TokenStream for tokenStream
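
As a minimal sketch of the approach discussed in this thread (Lucene 3.x), the stock Field class already accepts a TokenStream directly, so subclassing AbstractField is not strictly required; the IndexWriter ("writer") and the pre-built stream ("preAnalyzed") are assumed to exist elsewhere:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    TokenStream preAnalyzed = ...;            // your pre-analyzed tokens (assumed)
    doc.add(new Field("body", preAnalyzed));  // indexed from the stream; such a field cannot also be stored
    writer.addDocument(doc);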

Re: storing pre-analyzed fields

2012-07-11 Thread Michael Sokolov
d instances for each type as you like. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, July 11, 2012 2:54 AM To: java-user@lucene.apache.org

Re: Highlighting html pages

2012-10-23 Thread Michael Sokolov
If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, and remembering the word positions so that highlighting works properly. Should do exactly what you want out of the box... On 10/23/2012 8:00 PM, Scott Smith wrote: I need to take an html page that I retrieve from m
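
A hedged sketch of wiring this up (Lucene 4.x API; class and field names are illustrative): HTMLStripCharFilter wraps the incoming Reader before tokenization, and because it is a CharFilter it corrects offsets so highlights land on the original HTML.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    class HtmlAnalyzer extends Analyzer {
        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new HTMLStripCharFilter(reader);  // strip tags, keep offsets into the original HTML
        }
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            StandardTokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
            TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
            return new TokenStreamComponents(source, result);
        }
    }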

Re: Highlighting html pages

2012-11-05 Thread Michael Sokolov
ve me after I've stripped the HTML. Suggestions? Scott -Original Message----- From: Michael Sokolov [mailto:soko...@ifactory.com] Sent: Tuesday, October 23, 2012 9:04 PM To: java-user@lucene.apache.org Cc: Scott Smith Subject: Re: Highlighting html pages If you use HTMLStripCharFilter, i

Re: Highlighting html pages

2012-11-06 Thread Michael Sokolov
On 11/6/2012 3:29 AM, Steve Rowe wrote: Hi Scott, HTMLStripCharFilter doesn't require that its input be valid HTML - there is no assumption of balanced tags. Also, highlighted sections could span tags, e.g. if you highlight "this phrase", and the original HTML looks like: … thisphras

Re: Which stemmer?

2012-11-14 Thread Michael Sokolov
Does anyone have any experience with the stemmers? I know that Porter is what "everyone" uses. Am I better off with KStemFilter (better performance) or ?? Does anyone understand the differences between the various stemmers and how to choose one over another? We started off using Porter, t
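
For comparison's sake, a minimal sketch (Lucene 4.x analyzers-common; the chain shown is illustrative): the two stemmers are drop-in replacements at the end of an analysis chain, so trying both on your own data is cheap. KStem (Krovetz) is dictionary-based and lighter; Porter is rule-based and tends to stem more aggressively.

    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.en.KStemFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    TokenStream ts = new StandardTokenizer(Version.LUCENE_40, reader);  // "reader" assumed supplied
    ts = new LowerCaseFilter(Version.LUCENE_40, ts);
    ts = new PorterStemFilter(ts);   // rule-based, more aggressive
    // ... or, leaving everything else unchanged:
    // ts = new KStemFilter(ts);     // dictionary-based, lighter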

Re: Which stemmer?

2012-11-15 Thread Michael Sokolov
On 11/15/2012 1:06 PM, Tom Burton-West wrote: This paper on the Kstem stemmer lists cases where the Porter stemmer understems or overstems and explains the logic of Kstem: "Viewing Morphology as an Inference Process" (*Krovetz*, R., Proceedings of the Sixteenth Annual International ACM SIGIR Con

Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Michael Sokolov
On 11/20/2012 6:49 AM, Michael McCandless wrote: On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan wrote: Also, for a TopN query sorted by doc-id will the query terminate early? Actually, it won't! But it really should ... you could make a Collector that throws an exception once the N
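
A hedged sketch of the "throw an exception once you have N" idea (Lucene 4.x Collector API; the marker exception and class name are made up for illustration). It collects the first N hits in index order and aborts the rest of the search; the caller catches the marker around searcher.search().

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    final class FirstNCollector extends Collector {
        static final class Done extends RuntimeException {}  // marker used to stop the search (hypothetical)
        private final int[] hits;
        private int count = 0;
        private int docBase = 0;

        FirstNCollector(int n) { hits = new int[n]; }

        @Override public void setScorer(Scorer scorer) {}
        @Override public boolean acceptsDocsOutOfOrder() { return false; }
        @Override public void setNextReader(AtomicReaderContext context) { docBase = context.docBase; }
        @Override public void collect(int doc) throws IOException {
            hits[count++] = docBase + doc;             // convert the per-segment docid to an absolute one
            if (count == hits.length) throw new Done();
        }
        int[] hits() { return Arrays.copyOf(hits, count); }
    }

    // usage:
    // FirstNCollector c = new FirstNCollector(10);
    // try { searcher.search(query, c); } catch (FirstNCollector.Done stopped) { /* expected */ }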

Re: adding attributes to TokenStream

2012-12-31 Thread Michael Sokolov
On 12/31/2012 11:39 AM, Itai Peleg wrote: Hi all, Can someone please post a simple example showing how to add additional attributes to token in a TokenStream (inside IncrementToken for example?). I'm working on entity extraction and want to flag specific tokens an entities, but I'm having probl
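
A minimal sketch of one way to do it, using the built-in TypeAttribute to tag tokens instead of defining a new Attribute interface; the entity test here is a throwaway placeholder, not real NER. Note the type is only visible to downstream filters and consumers; to persist it in the index, copy it into a payload (e.g. with TypeAsPayloadTokenFilter).

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    final class EntityMarkingFilter extends TokenFilter {
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
        private final TypeAttribute type = addAttribute(TypeAttribute.class);

        EntityMarkingFilter(TokenStream in) { super(in); }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            if (isEntity(term.toString())) {  // placeholder test standing in for real entity extraction
                type.setType("ENTITY");       // downstream filters and the consumer can read this
            }
            return true;
        }

        private boolean isEntity(String s) {
            return !s.isEmpty() && Character.isUpperCase(s.charAt(0));
        }
    }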

Re: adding attributes to TokenStream

2013-01-01 Thread Michael Sokolov
r one. Thanks in advance, Itai 2012/12/31 Michael Sokolov On 12/31/2012 11:39 AM, Itai Peleg wrote: Hi all, Can someone please post a simple example showing how to add additional attributes to token in a TokenStream (inside IncrementToken for example?). I'm working on entity extraction an

Re: More about storing NLP-type stuff in the index

2013-01-03 Thread Michael Sokolov
On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote: I think we've been saying that if we put something in a Payload, it will be indexed. From what I understand of the indexing format, that means that what you put in the Payload will be stored in the Lucene index... But it won't *itself* be indexed

adding field values with count

2013-01-04 Thread Michael Sokolov
I have an indexer that already collapses field values into a Map of (value, count) before indexing, and I would like to specify an increment to frequency (docFreq?) when adding a field value to a Lucene Document. Should I just add the same value multiple times? -Mike

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward compatible)

2013-03-01 Thread Michael Sokolov
On 2/28/2013 5:05 PM, Uwe Schindler wrote: ... Collector instead of HitCollector (like your ancient Lucene from 2.4), you have to respect the new semantics that are *different* to old HitCollector. Collector works with low-level atomic readers (also in Lucene 3.x), the calls to the "collect(in

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward compatible)

2013-03-01 Thread Michael Sokolov
On 03/01/2013 07:56 AM, Uwe Schindler wrote: The slowdown happens not on making the doc ids absolute (it is just an addition), the slowdown appears when you retrieve the stored fields on the top-level reader (because the composite top-level reader has to do a binary search in the reader tree t
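
A hedged sketch of the point being made (Lucene 4.x): inside a Collector the docids are per-segment, so stored fields can be loaded from that segment's own reader, avoiding the binary search the composite top-level reader has to do for searcher.doc(docBase + doc). Field and class names are illustrative.

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    final class TitlePrintingCollector extends Collector {
        private AtomicReaderContext ctx;

        @Override public void setScorer(Scorer scorer) {}
        @Override public boolean acceptsDocsOutOfOrder() { return true; }
        @Override public void setNextReader(AtomicReaderContext context) { ctx = context; }
        @Override public void collect(int doc) throws IOException {
            // 'doc' is relative to the current segment; ctx.reader() is that segment's reader
            String title = ctx.reader().document(doc).get("title");  // assumes a stored "title" field
            System.out.println((ctx.docBase + doc) + ": " + title);  // docBase + doc = absolute docid
        }
    }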

Re: Rewrite for RegexpQuery

2013-03-11 Thread Michael Sokolov
On 03/11/2013 01:22 PM, Michael McCandless wrote: On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober wrote: Am 11.03.2013 13:38, schrieb Michael McCandless: On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE,
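
For concreteness, a minimal sketch of the suggestion (Lucene 4.x; the field name and pattern are made up): setting the rewrite method expands the multi-term query into a scoring BooleanQuery over the matched terms, instead of the default constant-score rewrite.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.RegexpQuery;

    RegexpQuery q = new RegexpQuery(new Term("body", "abc[0-9]+"));
    q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

Note that scoring rewrites are subject to BooleanQuery.maxClauseCount if the pattern matches many terms.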

Re: Subclassing QueryScorer

2013-05-07 Thread Michael Sokolov
On 5/7/2013 6:26 PM, Colin Pollock wrote: Hi, I want to modify how the QueryScorer selects fragments for snippeting. I want to add a small boost for fragments that contain certain terms (e.g. "great", "amazing") to the unique term occurrence score. But I don't want these words to actually be high

Re: Seemingly very difficult to wrap an Analyzer with CharFilter

2013-06-12 Thread Michael Sokolov
You may not have noticed that CharFilter extends Reader. The expected pattern here is that you chain instances together -- your CharFilter should act as *input* to the Analyzer, I think. Don't think in terms of extending these analysis classes (except the base ones designed for it): compose t

Re: Seemingly very difficult to wrap an Analyzer with CharFilter

2013-06-12 Thread Michael Sokolov
On 6/12/2013 7:02 PM, Steven Schlansker wrote: On Jun 12, 2013, at 3:44 PM, Michael Sokolov wrote: You may not have noticed that CharFilter extends Reader. The expected pattern here is that you chain instances together -- your CharFilter should act as *input* to the Analyzer, I think

[ANN] Lux XML search engine

2013-06-18 Thread Michael Sokolov
I'm pleased to announce the first public release of Lux (version 0.9.1), an XML search engine embedding Saxon 9 and Lucene/Solr 4. Lux offers many features found in XML databases: persistent XML storage, index-optimized querying, an interactive query window, and some application support feature

Re: Payload Matching Query

2013-06-23 Thread Michael Sokolov
On 6/21/13 11:18 AM, Uwe Schindler wrote: You may also be interested in this talk @ BerlinBuzzwords2013: http://intrafind.de/tl_files/documents/INTRAFIND_BerlinBuzzwords2013_The-Typed-Index.pdf Unfortunately the slides are not available. Uwe I've been wondering why we seem to handle case- and

supply term frequency directly

2013-07-02 Thread Michael Sokolov
by repeating the same term many times (I don't care about positions or highlighting in this case, either, just scoring), but that seems a bit perverse (and probably slower than just supplying the counts directly). -- Michael Sokolov Senior Architect Safari Books O

Re: Query serialization/deserialization

2013-08-04 Thread Michael Sokolov
On 07/28/2013 07:32 PM, Denis Bazhenov wrote: A full JSON query ser/deser would be an especially nice additionto Solr, allowing direct access to all Lucene Query features even if they haven't been integrated into the higher level query parsers. There is nothing we could do, so we wrote one, in

Re: Document boosting and native ordering of results

2013-08-26 Thread Michael Sokolov
I had been planning something similar to what Michael was used to: creating a regular numeric field (call it "weight", say) with a rank value, applying a field boost to that field that is equal to the rank value, and then querying with weight:[* TO *] as a term, thinking that would end up bring

Re: proposed change to CharTokenizer

2010-10-17 Thread Michael Sokolov
rom comprehensive. Does this seem like a reasonable patch? -Mike Michael Sokolov Engineering Director www.ifactory.com @iFactoryBoston PubFactory: the revolutionary e-publishing platform from iFactory - To unsubscri

Re: QueryValidator

2011-05-05 Thread Michael Sokolov
In our applications, we catch ParseException and then take one of the following actions: 1) report an error to the user 2) rewrite the query, stripping all punctuation, and try again 3) rewrite the query, quoting all punctuation, and try again would that work for you? On 5/5/2011 3:26 AM, Bern
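
A minimal sketch of option 3 (quoting/escaping the punctuation and retrying), assuming Lucene 4.x package names and a QueryParser constructed elsewhere; QueryParser.escape() backslash-escapes all query-syntax characters.

    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    Query parseLeniently(QueryParser parser, String input) throws ParseException {
        try {
            return parser.parse(input);
        } catch (ParseException e) {
            // retry with special characters escaped; if this also fails, report the error to the user
            return parser.parse(QueryParser.escape(input));
        }
    }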

Re: new to lucene, non standard index

2011-05-06 Thread Michael Sokolov
I believe creating a large number of fields is not a good match w/the underlying architecture, and you'd be better off w/a large number of documents/small number of fields, where the same field occurs in every document. There is some discussion here: http://markmail.org/message/hcmt5syca7zdeac

Re: ranged query didn't work, got exception...

2011-06-19 Thread Michael Sokolov
I think you need field:[20020101 TO *] although the "*" option isn't available in some versions (pre 3.1?) and you just have to supply a big value: field:[20020101 TO ] -Mike On 6/19/2011 6:18 PM, Hiller, Dean x66079 wrote: "here you can simply go for field:[20020101 TO ] and leave
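
As a hedged aside: if the date were indexed as a numeric (int) field rather than a plain text term, the open upper bound is expressed with null and no "big value" is needed (the field name "date" is assumed):

    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    Query fromJan2002 = NumericRangeQuery.newIntRange("date", 20020101, null, true, true);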

Re: ranged query didn't work

2011-06-19 Thread Michael Sokolov
On 6/19/2011 8:11 PM, Hiller, Dean x66079 wrote: Oddly, enough, this seems to work and I get one result calling Collector.collect(int docIt)...(I found out AND has to be caps)... author:dean AND date:20110623 but this does not seem to work... author:dean AND date:[ 20110623 TO * ] I'm not sur

Re: highlighting performance

2011-06-20 Thread Michael Sokolov
Koji- I'm not familiar with the benchmarking system, but maybe I'll see if I can run that benchmark on my test data as a point of comparison - thanks for the pointer! -Mike On 6/20/2011 8:21 PM, Koji Sekiguchi wrote: Mike, FVH used to be faster for large docs. I wrote FVH section for Lucene

Re: highlighting performance

2011-06-21 Thread Michael Sokolov
e PhraseQueries - I added those and it did make FVH slightly slower, but not all that much. I'll keep digging. -Mike On 6/20/2011 10:54 PM, Michael Sokolov wrote: Koji- I'm not familiar with the benchmarking system, but maybe I'll see if I can run that benchmark on my test

Re: highlighting performance

2011-06-21 Thread Michael Sokolov
e" within which the term may occur - I think?) so there is an n^2 growth factor in the number of occurrences of a term in a document. Does that seem possible? -Mike On 6/21/2011 8:48 PM, Michael Sokolov wrote: I did that, and the benchmark indicates FVH is 10x faster than Highlighter no

Re: Lucene sort performance roots?

2011-06-24 Thread Michael Sokolov
Because of this top-n behavior, its generally slow with Lucene to scan deeply into the result set. If you want to go on page 100 of your search results, the priority queue must at least have a size of n=docsPerPage*100. Because of this, most full text search engines (e.g. Google does this, too)
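
A minimal sketch of the consequence being described ("searcher" and "query" are assumed): rendering page 100 at 10 hits per page forces a 1000-entry priority queue, and all 1000 hits are re-collected.

    import java.util.Arrays;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    int docsPerPage = 10, page = 100;
    TopDocs top = searcher.search(query, docsPerPage * page);   // queue holds 1000 entries
    int from = Math.min(top.scoreDocs.length, docsPerPage * (page - 1));
    int to = Math.min(top.scoreDocs.length, docsPerPage * page);
    ScoreDoc[] pageHits = Arrays.copyOfRange(top.scoreDocs, from, to);

For sequential paging, Lucene 4.x also offers IndexSearcher.searchAfter(lastHitOfPreviousPage, query, docsPerPage), which avoids growing the queue with the page number.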

Re: WELCOME to java-user@lucene.apache.org

2011-07-03 Thread Michael Sokolov
You should take a look at org.apache.solr.analysis.MappingCharFilter, which provides a generic table-based approach for use with solr. There are also a lot of other interesting CharFilters in the same package. For lucene-only use, there's org.apache.lucene.analysis.icu.ICUFoldingFilter, which
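
A minimal sketch of the table-driven approach (Lucene 4.x, where the map is built with a Builder; the mappings shown are just examples): build a NormalizeCharMap and wrap the field's Reader before it reaches the tokenizer, for instance from Analyzer.initReader.

    import java.io.Reader;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("é", "e");    // fold accented characters
    builder.add("œ", "oe");   // expand ligatures
    NormalizeCharMap map = builder.build();

    Reader folded = new MappingCharFilter(map, reader);  // "reader" assumed supplied by the analyzer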

Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-06 Thread Michael Sokolov
I tried something similar, and failed - I think the API is lacking there? My only advice is to vote for this: https://issues.apache.org/jira/browse/LUCENE-2878 which should provide an alternative better API, but it's not near completion. -Mike On 7/6/2011 5:34 PM, Jahangir Anwari wrote: I h

Re: Rewriting other query types into span queries and two questions about this

2011-08-07 Thread Michael Sokolov
On 8/4/2011 9:06 PM, Trejkaz wrote:... For AND (and for any "default boolean" queries which aren't equivalent to OR) queries, I have problems. For instance, you can't do this: within(5, 'my', and('cat', 'dog')) -> and( within(5, 'my', 'cat'), within(5, 'my', 'dog') ) The problem is that

Re: Search highlighter for custom Query implementations - how to?

2011-09-09 Thread Michael Sokolov
Lukas there really isn't any support for custom Query types in Highlighter, as you've found. If you inherit from one of the types it does support, or rewrite your query to one of them, that should work, but the Query class just doesn't provide enough support for Highlighter to work with in the

Re: Using Lucene to index Wikipedia

2011-10-23 Thread Michael Sokolov
Daniel, since no one knowledgeable has answered I'll take a stab - there are a number of ant targets you can run, most of which incorporate some indexing step(s). Basically you can run: ant -Dtask.alg= it looks as if the ant build.xml is set up to run conf/micro-standard.alg by default, but

Re: Best document format / markup for text indexing?

2011-11-23 Thread Michael Sokolov
In my experience, books and other semi-structured text documents are best handled as XML. There are many many different XML "vocabularies" for doing this, each of which has benefits for different kinds of documents. You probably should look at TEI, NLM Book, and DocBook though - these are som

Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

2011-11-23 Thread Michael Sokolov
could use simply index every term with a namespace prefix like: Q::term where Q is the namespace and term the term? Then when you do spell corrections, submit each candidate term with the namespace prefix prepended -Mike On 11/23/2011 9:28 AM, E. van Chastelet wrote: I currently have an id

Re: Associated values for a field and its value

2013-10-03 Thread Michael Sokolov
On 10/02/2013 07:12 PM, Alice Wong wrote: Hello, We would like to index some documents. Each field of a document may have multiple values. And for each (field,value) pair there are some associated values. These associated values are just for retrieving, not searching. For example, a document D

Re: Associated values for a field and its value

2013-10-04 Thread Michael Sokolov
,2", "a2:2,3". I think that's what Aditya suggested. You still have to parse these though, so why not use a prebuilt flexible parsing infrastructure? Thanks. On Thu, Oct 3, 2013 at 1:49 PM, Michael Sokolov <mailto:msoko...@safaribooksonline.com>> wrote: On 10

Re: Analyzer classes versus the constituent components

2013-10-08 Thread Michael Sokolov
There are some Analyzer methods you might want to override (initReader for inserting a CharFilter, stuff about gaps), but if you don't need that, it seems to be mostly about packaging neatly, as you say. -Mike On 10/8/13 10:30 AM, Benson Margulies wrote: Is there some advice around about when

external file stored field codec

2013-10-11 Thread Michael Sokolov
I've been running some tests comparing storing large fields (documents, say 100K .. 10M) as files vs. storing them in Lucene as stored fields. Initial results seem to indicate storing them externally is a win (at least for binary docs which don't compress, and presumably we can compress the ex

Re: external file stored field codec

2013-10-11 Thread Michael Sokolov
On 10/11/2013 03:04 PM, Adrien Grand wrote: On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov wrote: I've been running some tests comparing storing large fields (documents, say 100K .. 10M) as files vs. storing them in Lucene as stored fields. Initial results seem to indicate storing

Re: external file stored field codec

2013-10-11 Thread Michael Sokolov
On 10/11/2013 03:19 PM, Michael Sokolov wrote: On 10/11/2013 03:04 PM, Adrien Grand wrote: On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov wrote: I've been running some tests comparing storing large fields (documents, say 100K .. 10M) as files vs. storing them in Lucene as stored f

Re: external file stored field codec

2013-10-13 Thread Michael Sokolov
On 10/13/2013 1:52 PM, Adrien Grand wrote: Hi Michael, I'm not aware enough of operating system internals to know what exactly happens when a file is open but it sounds to be like having separate files per document or field adds levels of indirection when loading stored fields, so I would be sur

Re: external file stored field codec

2013-10-17 Thread Michael Sokolov
On 10/13/13 8:09 PM, Michael Sokolov wrote: On 10/13/2013 1:52 PM, Adrien Grand wrote: Hi Michael, I'm not aware enough of operating system internals to know what exactly happens when a file is open but it sounds to be like having separate files per document or field adds levels of indire

Re: external file stored field codec

2013-10-18 Thread Michael Sokolov
On 10/18/2013 1:08 AM, Shai Erera wrote: The codec intercepts merges in order to clean up files that are no longer referenced What happens if a document is deleted while there's a reader open on the index, and the segments are merged? Maybe I misunderstand what you meant by this statement, but

Re: default or cascaded fallback query

2013-11-13 Thread Michael Sokolov
It sounds as if you want to create a new Query type. I would start by having a look at BooleanQuery and trying to write an analogous object that does what you want instead. -Mike On 11/13/2013 10:03 AM, Harald Kirsch wrote: Hello all, I wonder if a query according to the following rules is

Revolution writeup

2013-11-25 Thread Michael Sokolov
I just posted a writeup of the Lucene/Solr Revolution Dublin conference. I've been waiting for videos to become available, but I got impatient. Slides are there, mostly though. Sorry if I missed your talk -- I'm hoping to catch up when the videos are posted... http://blog.safariflow.com/201

Re: Number of Times 1 Field has occurred in a document within a Given TimeRange.

2013-12-06 Thread Michael Sokolov
Have you read about numeric range faceting? http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html On 12/6/2013 5:34 AM, Ankit Murarka wrote: Well a bit strange as this is the 1st time, I am not receiving any reply to the question even after sending it again. Would be very h

Re: Lucene Newbie Question

2013-12-08 Thread Michael Sokolov
On 12/8/2013 12:03 PM, Ted Goldstein wrote: I am new to Lucene and have begun experimenting. I've loaded both the example books.csv and the various example electronic components documents. I then do a variety of queries. Quering http://su2c-dev.ucsc.edu:8983/solr/select?q=name:A* returns bot

Re: BytesRef equals() method

2014-01-21 Thread Michael Sokolov
Note the comments in the source: /** Length of used bytes. */ public int length; length is not the same as the size of the internal buffer. It is the number of used bytes, so the length of the "logical" value as you call it. -Mike On 1/21/2014 10:32 AM, Yann-Erwan Perio wrote: Hello,

Re: (DocIds satisfying a query) -> (branch of the boolean query as a tree)

2014-01-25 Thread Michael Sokolov
The highlighters are the only thing I know of (in trunk) that do something like that. Work on this branch (https://issues.apache.org/jira/browse/LUCENE/fixforversion/12317158) is an attempt to make that more efficient. In general the problem with doing this during scoring (the filtering doc

Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-04 Thread Michael Sokolov
On 2/4/14 12:16 PM, Earl Hood wrote: On Tue, Feb 4, 2014 at 12:20 AM, Trejkaz wrote: I'm trying to find a precise and reasonably efficient way to highlight all occurrences of terms in the query, only highlighting fields which ... [snip] I am in a similiar situation with a web-based applica

Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-04 Thread Michael Sokolov
On 2/4/2014 2:50 PM, Earl Hood wrote: On Tue, Feb 4, 2014 at 1:16 PM, Michael Sokolov wrote: You might be interested in looking at Lux, which layers XML services like XQuery on top of Lucene and Solr, and includes an XML-aware highlighter: https://github.com/msokolov/lux/blob/master/src/main

Re: question about using lucene on large documents

2014-02-04 Thread Michael Sokolov
Ideally you would chunk a document at logical boundaries that will make sense as units of both search and presentation. For some content, these boundaries don't align; for example you might want to search for matches within a paragraph scope, or within a section, chapter, or part of a book, bu

Re: question about using lucene on large documents

2014-02-05 Thread Michael Sokolov
No, not really. What would you do if you had a match contained entirely within the overlapping region? You'd probably need a way to distinguish that from a term that matched in two adjacent chunks, but *not* in the overlap. Sounds very tricky to me. -Mike On 2/5/2014 2:21 AM, mrodent wrote:

Re: Wildcard searches

2014-02-05 Thread Michael Sokolov
On 2/5/2014 6:30 PM, raghavendra.k@barclays.com wrote: Hi, Can Lucene support wildcard searches such as the ones shown below? Indexed value is "XYZ CORPORATION LIMITED". If you index the value as a single token (KeywordTokenizer), there is nothing really special about the examples you gav

Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-06 Thread Michael Sokolov
On 2/6/2014 12:53 AM, Earl Hood wrote: On Tue, Feb 4, 2014 at 6:05 PM, Michael Sokolov wrote: Thanks for the feedback. I think it's difficult to know what to do about attribute value highlighting in the general case - do you have any suggestions? That is a challenging one since one h

grouped scoring

2014-04-07 Thread Michael Sokolov
I have an idea for something I'm calling grouped scoring, and I want to know if anybody has already done anything like this. The idea comes from the problem that in your search results you'd like to show only one or a small number of items from each group: for example on google.com, multiple r

Re: Proximity Search for SENTENCE and PARAGRAPH

2014-04-07 Thread Michael Sokolov
You could insert a large position gap between sentences (say 100; something larger than the largest sentence in #words), and a still larger position gap between paragraphs (1000; larger than the largest para). Then within-sentence search is just (A B)~100 and within-paragraph search (A B)~1000
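
A hedged sketch of the gap idea (Lucene 4.x token attributes; the "<EOS>" boundary marker is hypothetical and would come from an upstream sentence splitter): the filter swallows the marker and adds 100 positions to the next real token, so "a b"~100 stays within a sentence while "a b"~1000 stays within a paragraph.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    final class SentenceGapFilter extends TokenFilter {
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
        private int pendingGap = 0;

        SentenceGapFilter(TokenStream in) { super(in); }

        @Override
        public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
                if ("<EOS>".equals(term.toString())) {  // swallow the boundary marker, remember the gap
                    pendingGap += 100;                  // a paragraph marker would add 1000 analogously
                    continue;
                }
                if (pendingGap > 0) {
                    posIncr.setPositionIncrement(posIncr.getPositionIncrement() + pendingGap);
                    pendingGap = 0;
                }
                return true;
            }
            return false;
        }
    }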

multi-field suggestions

2014-04-18 Thread Michael Sokolov
I've been working on getting AnalyzingInfixSuggester to make suggestions using tokens drawn from multiple fields. I've done this by copying tokens from each of those fields into a destination field, and building suggestions using that destination field. This allows me to use different analysi

Re: Getting multi-values to use in filter?

2014-04-23 Thread Michael Sokolov
This isn't really a good use case for an index like Lucene. The most essential property of an index is that it lets you look up documents very quickly based on *precomputed* values. -Mike On 04/23/2014 06:56 AM, Rob Audenaerde wrote: Hi all, I'm looking for a way to use multi-values in a f

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-13 Thread Michael Sokolov
ShingleFilter can help with this; it concatenates neighboring tokens. So a search for "good morning john" becomes a search for "goodmorning john" OR "good morningjohn" OR "good morning john" it makes your index much bigger because of all the terms, but you may find it's worth the cost -Mike

Re: MultiReader docid reliability

2014-05-30 Thread Michael Sokolov
There is a Solr document cache that holds field values too, see: http://wiki.apache.org/solr/SolrCaching Maybe take this question over to the solr mailing list? -Mike On 5/30/2014 10:32 AM, Alan Woodward wrote: Solr caches hold lucene docids, which are invalidated every time a new searcher i

Re: How to approach indexing source code?

2014-06-04 Thread Michael Sokolov
Probably the simplest thing is to define a field for each of the contexts you are interested in, but you might want to consider using a tagged-token approach. I spent a while figuring out how to index tagged tree-structured data and came up with Lux (http://luxdb.org) - basically it accepts XM

Re: How to approach indexing source code?

2014-06-05 Thread Michael Sokolov
If you already have a parser for the language, you could use it to create a TokenStream that you can feed to Lucene. That way you won't be trying to reinvent a parser using tools designed for natural language. -Mike On 6/5/2014 6:42 AM, Johan Tibell wrote: I will definitely try a prototype.

Re: Indexing integer ranges for point search

2014-06-05 Thread Michael Sokolov
It all depends on the statistics: how the ranges are correlated. If the integer range is small: from 1-2, for example, you might consider indexing every integer in each range as a separate value, especially if most documents will only have a small number of small ranges. If there are too

AnalyzingInfixSuggester questions

2014-08-14 Thread Michael Sokolov
I've been using AIS, and I see that it now has support for incremental updates, which is great! I'm looking forward to getting suggestions from newly-added documents without the need to rebuild the entire suggester index. I've run into a few problems though, and I want to see if there is a bett

Re: AnalyzingInfixSuggester questions

2014-08-15 Thread Michael Sokolov
this in a subclass. With that in place, there is no issue about raising exceptions - the index is always available. Mike McCandless http://blog.mikemccandless.com On Thu, Aug 14, 2014 at 10:50 AM, Michael Sokolov wrote: I've been using AIS, and I see that it now has support for incr

Re: AnalyzingInfixSuggester questions

2014-08-15 Thread Michael Sokolov
can open a ticket for that too. -Mike S On 08/15/2014 07:33 AM, Michael Sokolov wrote: On 8/14/2014 5:48 PM, Michael McCandless wrote: I think we should expose commit? Can you open an issue? I will - To unsubscribe, e-mail:

Re: Calculate Term Frequency

2014-08-19 Thread Michael Sokolov
Have you looked into term vectors? I think they should fit your bill pretty neatly. Here's a nice blog post with helpful background info: http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet -Mike On 8/19/2014 10:04 AM, Bianca Pereira wrote: Hi everybody, I would like
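
A minimal sketch of reading per-document term frequencies out of a term vector (Lucene 4.x; "reader", "docId" and the field name "body" are assumed, and the field must have been indexed with term vectors enabled):

    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    Terms vector = reader.getTermVector(docId, "body");
    if (vector != null) {
        TermsEnum termsEnum = vector.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            // within a term vector, totalTermFreq() is the term's frequency in this document
            System.out.println(term.utf8ToString() + " : " + termsEnum.totalTermFreq());
        }
    }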

Re: Why does this search fail?

2014-08-27 Thread Michael Sokolov
Tokenization is tricky. You might consider using whitespace tokenizer followed by word delimiter filter (instead of standard tokenizer); it does a kind of secondary tokenization pass that can preserve the original token in addition to its component parts. There are some weird side effects to

Re: How to not span fields with phrase query?

2014-08-28 Thread Michael Sokolov
Usually that's referred to as multiple "values" for the same field; in the index there is no distinction between title:C and title:X as far as which field they are in -- they're in the same field. If you want to prevent phrase queries from matching B C X, insert a position gap between C and X;
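
A minimal sketch of the gap suggestion (Lucene 4.x; the tokenizer choice and the gap of 1000 are illustrative): overriding getPositionIncrementGap keeps phrase and slop queries from matching across two values of the same multi-valued field.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    Analyzer gapAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return new TokenStreamComponents(new WhitespaceTokenizer(Version.LUCENE_40, reader));
        }
        @Override
        public int getPositionIncrementGap(String fieldName) {
            return 1000;  // no phrase with slop below 1000 can span two field values
        }
    };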

Re: indexing json

2014-09-04 Thread Michael Sokolov
On 9/4/2014 6:46 AM, Larry White wrote: Hi, Is there a way to index an entire json document automatically as one can do with the new PostgreSQL json support? By automatically, I mean to create an inverted index entry (path: value) for each element in the document without having to specify in adv

Re: Case sensitivity

2014-09-21 Thread Michael Sokolov
On 9/19/2014 9:07 AM, John Cecere wrote: Is there a way to set up Lucene so that both case-sensitive and case-insensitive searches can be done without having to generate two indexes? You might be interested in the discussion here: https://issues.apache.org/jira/browse/LUCENE-5620 which addres

Re: real infix suggester, not AnalyzingInfixSuggester

2014-10-27 Thread Michael Sokolov
Have you considered combining the AnalyzingInfixSuggester with a German decompounding filter? If you break compound words into their constituent parts during analysis, then the suggester will be able to do what you want (prefix matches on the word-parts). I found this project with a quick goo

Re: Query with many clauses

2014-10-29 Thread Michael Sokolov
I'm curious to know more about your use case, because I have an idea for something that addresses this, but haven't found the opportunity to develop it yet - maybe somebody else wants to :). The basic idea is to reduce the number of terms needed to be looked up by collapsing commonly-occurring

Re: Query with many clauses

2014-10-29 Thread Michael Sokolov
s filter into ConstantScoreQuery and in other test I used FilteredQuery with MatchAllDocsQuery and BooleanFilter. Both cases seems to work quite similar in terms of performance to simple BooleanQuery. But of course I'll also try to use TermsFilter. Maybe it will speedUp filters. Michael

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Michael Sokolov
That's a lot of code to eyeball. Have you tried printing out the input data as you are indexing it (just at doc.add)? I am guessing there is some simple variable aliasing issue that I don't see at a glance ... -Mike On 10/30/14 2:03 PM, Ralf Bierig wrote: I want to implement a Lucene Indexer

Re: Case Insensitive Matching in Solr/Lucene

2014-11-25 Thread Michael Sokolov
The index size will not increase as quickly as you might think, and is not an issue in most cases. An alternative to two fields, though, is to index both upper- and lower-case tokens at the same position in a single field, and then to perform no case folding at query time. There is no standar
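
A hedged sketch of the "both cases at the same position" idea (Lucene 4.x attributes; simplified, with no special handling of offsets or graph cases): for each token whose lower-cased form differs, the filter emits that form as a synonym with a position increment of 0, and no case folding is done at query time.

    import java.io.IOException;
    import java.util.Locale;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    final class DualCaseFilter extends TokenFilter {
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
        private String pendingLowerCase;  // lower-cased copy waiting to be emitted

        DualCaseFilter(TokenStream in) { super(in); }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingLowerCase != null) {
                term.setEmpty().append(pendingLowerCase);
                posIncr.setPositionIncrement(0);  // stacked on the same position as the original
                pendingLowerCase = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String original = term.toString();
            String lower = original.toLowerCase(Locale.ROOT);
            if (!lower.equals(original)) {
                pendingLowerCase = lower;         // emit the folded form on the next call
            }
            return true;
        }
    }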

Re: Retrieve found terms

2014-11-25 Thread Michael Sokolov
Why don't you want to use a highlighter? That's what they're for. -Mike On 11/25/2014 09:12 AM, John Cecere wrote: I've done a bunch of searching, but I still can't seem to figure out how to do this. Given a WildcardQuery or PrefixQuery (something with a wildcard in it), is there a way to r

Re: ControlledRealTimeReopenThread

2014-12-01 Thread Michael Sokolov
It's impossible to tell since you didn't include the code for it, but my advice would be to look at how the documents are being marked for deletion. What are the terms being used to delete them? Are you trying to use lucene docids? -Mike On 12/1/2014 4:22 PM, Badano Andrea wrote: Hello, M

Re: ControlledRealTimeReopenThread

2014-12-01 Thread Michael Sokolov
On 1 Dec 2014, at 23:23, Michael Sokolov wrote: It's impossible to tell since you didn't include the code for it, but my advice would be to look at how the documents are being marked for deletion. What are the terms being used to delete them? Are you trying to use lucene docids?

Re: Total Freq for Bigrams, Trigrams, etc.

2014-12-02 Thread Michael Sokolov
If you index the n-grams in their own field using ShingleFilter, you can get statistics using the same term api on that field, in which the terms *are* n-grams, and similarly for queries. -Mike On 12/02/2014 03:38 PM, Peter Organisciak wrote: It is possible to get a total corpus frequency for
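
A minimal sketch of the suggestion (Lucene 4.x; the field name "shingles" and the example bigram are assumed): index the text a second time into a field whose analysis chain ends in a ShingleFilter, and corpus-wide n-gram counts then fall out of the ordinary term-statistics API.

    import org.apache.lucene.index.Term;

    // index time: end the analysis chain for the "shingles" field with
    //     new ShingleFilter(tokenStream, 3)   // emits bigrams and trigrams, space-separated
    // statistics time:
    long corpusFreq = reader.totalTermFreq(new Term("shingles", "new york"));  // total occurrences
    int docFreq = reader.docFreq(new Term("shingles", "new york"));            // documents containing it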

Re: Index replication strategy

2014-12-04 Thread Michael Sokolov
There are also Solr replication options - older snapshot-style replication, and newer Solr Cloud, but if you are not using solr now, you will incur some transitional costs since you would need to alter your indexing and possibly querying code to use it -Mike On 12/04/2014 09:38 AM, Shai Erera

Re: multiterm numbers regexp search

2014-12-15 Thread Michael Sokolov
You probably don't want to use StandardAnalyzer: maybe try WhitespaceAnalyzer, but you'll need to enhance your regex a little to deal with punctuation since WA may give you tokens like: 5106-7922-9469-8422. "5106-7922-9469-8422" etc -Mike On 12/15/14 3:45 AM, Valentin Popov wrote: I have

including self-joins in parent/child queries

2014-12-16 Thread Michael Sokolov
I see in the docs of ToParentBlockJoinQuery that: * The child documents must be orthogonal to the parent * documents: the wrapped child query must never * return a parent document. First, it would be helpful if the docs explained what would happen if that assumption were violated. Second,

Re: including self-joins in parent/child queries

2014-12-16 Thread Michael Sokolov
OK - I see looking at the code that an exception is thrown if a parent doc matches the subquery -- so that explains what will happen, but I guess my further question is -- is that necessary? Could we just not throw an exception there? -Mike On 12/16/2014 10:38 AM, Michael Sokolov wrote: I

Re: including self-joins in parent/child queries

2014-12-16 Thread Michael Sokolov
able fields; its children could include both 'Chapter' child docs and also a 'BookMetadata' child doc. -Greg On Tue, Dec 16, 2014 at 10:42 AM, Michael Sokolov wrote: OK - I see looking at the code that an exception is thrown if a parent doc matches the subquery -- so that

Re: including self-joins in parent/child queries

2014-12-18 Thread Michael Sokolov
pendix had the word 'apple'. :) It's equally possible to accidentally create a 'ToUncleJoin' or 'ToCousinJoin'. Just my two cents, Greg On Tue, Dec 16, 2014 at 8:42 PM, Michael Sokolov < msoko...@safaribooksonline.com> wrote: Looking at the

Re: AW: howto: handle temporal visibility of a document?

2015-01-12 Thread Michael Sokolov
The basic idea seems sound, but I think you can simplify that query a bit. For one thing, the *:* clauses can be removed in a few places: also if you index an explicit null value you won't need them at all; for visiblefrom, if you don't have a from time, use 0, for visibleto, if you don't have

Re: howto: handle temporal visibility of a document?

2015-01-13 Thread Michael Sokolov
On 1/13/2015 2:07 AM, Clemens Wyss DEV wrote: reduced to: ( ( *:* -visiblefrom:[* TO *] AND -visibleto:[* TO *] ) OR (-visiblefrom:[* TO *] AND visibleto:[ TO ]) OR (-visibleto:[ * TO *] AND visiblefrom:[0 TO ]) OR ( visiblefrom:[0 TO ] AND visibleto:[ TO ]) ) also if y
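
A hedged sketch of the simplification being discussed (Lucene NumericRangeQuery; field names follow the thread, and it assumes defaults of 0 and Long.MAX_VALUE are indexed whenever a bound is absent): the whole visibility test then collapses to two range clauses ANDed with the user's query.

    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.NumericRangeQuery;

    long now = System.currentTimeMillis();
    BooleanQuery visible = new BooleanQuery();
    visible.add(NumericRangeQuery.newLongRange("visiblefrom", 0L, now, true, true), Occur.MUST);
    visible.add(NumericRangeQuery.newLongRange("visibleto", now, Long.MAX_VALUE, true, true), Occur.MUST);
    // combine: userQuery MUST + visible MUST (or use "visible" as a filter)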

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-14 Thread Michael Sokolov
In practice, normalization by field length proves to be more useful than normalization by the sum of the lengths of all fields (document length), which I think is what you seem to be after. Think of a book chapter document with two fields: title and full text. It makes little sense to weight

Re: Multi-valued field and numTerms

2015-01-15 Thread Michael Sokolov
On 1/15/15 4:34 AM, rama44ster wrote: Hi, I am using lucene to index documents that have a multivalued text field named ‘city’. Each document might have multiple values for this field, like la, los angeles etc. Assuming document d1 contains city = la ; city = los angeles document d2 contains cit

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Michael Sokolov
On 1/15/15 11:23 AM, danield wrote: Hi Mike, Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem, and this is because the Term Frequency and therefore the results will still be wrong, as prepending or appending a string to the term will still make it a di

Re: ToChildBlockJoinQuery question

2015-01-21 Thread Michael Sokolov
On 1/21/2015 6:59 PM, Gregory Dearing wrote: Jim, I think you hit the nail on the head... that's not what BlockJoinQueries do. If you're wanting to search for children and join to their parents... then use ToParentBlockJoinQuery, with a query that matches the set of children and a filter that m

Re: ToChildBlockJoinQuery question

2015-01-22 Thread Michael Sokolov
e.org/mod_mbox/lucene-java-user/201412.mbox/%3ccaasl1-_ppmcnq3apjjfbt3adb4pgaspve-8o5r9gv5kldpf...@mail.gmail.com%3E> -Greg On Wed, Jan 21, 2015 at 7:59 PM, Michael Sokolov < msoko...@safaribooksonline.com> wrote: On 1/21/2015 6:59 PM, Gregory Dearing wrote: Jim, I think
