Re: Custom indexing

2016-04-18 Thread Jack Krupansky
You failed to disclose up front that you are using such an old release of Lucene. Lucene is now on 6.0. I'll defer to others if they wish to provide support for such an old release. -- Jack Krupansky On Mon, Apr 18, 2016 at 8:01 AM, PK C <tech.kumar...@gmail.com> wrote: > Hi, > >

Re: Custom indexing

2016-04-12 Thread Jack Krupansky
The standard analyzer/tokenizer should do a decent job of splitting on dot, hyphen, and underscore, in addition to whitespace and other punctuation. Can you post some specific test cases you are concerned with? (You should always run some test cases.) -- Jack Krupansky On Tue, Apr 12, 2016

Re: Subset Matching

2016-03-25 Thread Jack Krupansky
There is no simple, direct way to do this "Boolean Reverse Query" in Lucene, but I suggest filing a Jira to request this as a feature improvement/new feature. -- Jack Krupansky On Fri, Mar 25, 2016 at 11:43 AM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote: > Hi Otmar, >

Re: Query regarding Lucene

2016-03-10 Thread Jack Krupansky
Are you calling the IndexSearcher#explain method to get the details of the score calculation? How exactly are your results not what you expect? What Similarity are you using? Scores will be the product of the underlying calculated scores and your term boost values. -- Jack Krupansky On Thu, Mar

Re: Creating composite query in lucene

2016-03-08 Thread Jack Krupansky
BooleanQuery can be nested, so you do a top-level BQ that has two clauses, the first a TQ for a:x and the second another BQ that itself has two clauses, both SHOULD. -- Jack Krupansky On Tue, Mar 8, 2016 at 4:38 AM, sandeep das <yarnhad...@gmail.com> wrote: > Hi, > > I'm usi

Field name syntax for Lucene Expressions

2016-02-29 Thread Jack Krupansky
LE binding. -- Jack Krupansky

Re: Spaces in regular expressions

2016-02-15 Thread Jack Krupansky
source line. And then there is the issue of code sequences that span source lines. -- Jack Krupansky On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz <kudret...@gmail.com> wrote: > Since documents are source code, I am considering matching on operators > too. > > Using whitespa

Re: Spaces in regular expressions

2016-02-13 Thread Jack Krupansky
ate string (not tokenized text) field and then you can do a complex regex that spans terms (and only do that if normal span queries don't do what you need.) What does your typical cross-term regex actually look like? -- Jack Krupansky On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <u...@theta

Re: boolean query for multiple values on a specific field

2016-01-27 Thread Jack Krupansky
no code that would tell the analyzer that "tag" is a defined field. Also, I see no value to having the single-clause BooleanQuery wrapped around the actual query. -- Jack Krupansky On Wed, Jan 27, 2016 at 12:52 PM, G.Long <jde...@gmail.com> wrote: > Hi :) > > I would like to

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Jack Krupansky
Be sure to check and see if your app is compute or I/O bound during this process - whether too little of your index is cached in system memory and each query requires I/O, lots of it. -- Jack Krupansky On Thu, Jan 21, 2016 at 1:52 PM, Doug Turnbull < dturnb...@opensourceconnections.com>

Re: How to escape URL at indexing time

2015-12-27 Thread Jack Krupansky
It looks like you attempted to quote the URL in your query using apostrophes (sometimes referred to as single quotes), but you need to use quotation marks (sometimes referred to as double quotes). Change: id:'http://www.yahoo.com' to: id:"http://www.yahoo.com" -- Jack Krupansky On Sun, Dec 27, 2015

Re: Any lucene query sorts docs by Hamming distance?

2015-12-24 Thread Jack Krupansky
, but is deprecated and has been relegated to the sand box, so it is not really usable going forward: http://lucene.apache.org/core/5_4_0/sandbox/index.html?org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html -- Jack Krupansky On Tue, Dec 22, 2015 at 4:02 AM, Yonghui Zhao <zhaoyong...@gmail.

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
The standard answer is that you need to reindex all of your data. -- Jack Krupansky On Thu, Dec 17, 2015 at 6:10 AM, Kumaran Ramasubramanian <kums@gmail.com > wrote: > Dear All > > i am using lucene 4.10.4. Is there any more information i missed to > provide?

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
Delete the full index and create from scratch with the correct field type, re-adding all documents. Any remnants of the old field must be removed. -- Jack Krupansky On Thu, Dec 17, 2015 at 11:48 AM, Kumaran R <kums@gmail.com> wrote: > While Reindexing only am facing this problem.

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
You could certainly read your stored values from your current index and then write new documents to a new index and then use the new index. That's if all of the indexed field values are stored. -- Jack Krupansky On Thu, Dec 17, 2015 at 2:10 PM, Kumaran Ramasubramanian <kums@gmail.com >

Re: Searching for "iso surface", and looking for "isosurface"

2015-12-17 Thread Jack Krupansky
/DictionaryCompoundWordTokenFilterFactory.html The doc is weak. I do have some examples in my old Solr 4.x Deep Dive e-book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html You might also be able to achieve a similar effect with synonyms, but again only

Re: Jensen–Shannon divergence

2015-12-14 Thread Jack Krupansky
/similarities/TFIDFSimilarity.html https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/BM25Similarity.html -- Jack Krupansky On Sun, Dec 13, 2015 at 8:30 AM, Shay Hummel <shay.hum...@gmail.com> wrote: > Hi > > I need help to implement similarity between query mod

Re: Wildcard Terms and total word or phrase count

2015-11-29 Thread Jack Krupansky
You didn't post your code that creates the index. Make sure you are using a tokenized TextField rather than a single-token StringField. -- Jack Krupansky On Fri, Nov 27, 2015 at 4:06 PM, Kunzman, Douglas * < douglas.kunz...@fda.hhs.gov> wrote: > Hi - > > This is my first Luc

Re: lucene query complexity

2015-11-20 Thread Jack Krupansky
for a significant sample of realistic data and then you can empirically deduce what the big-O function is for your particular application data and data model. -- Jack Krupansky On Fri, Nov 20, 2015 at 4:38 AM, Adrien Grand <jpou...@gmail.com> wrote: > I don't think the big-O notation is ap

Re: need help in search

2015-10-05 Thread Jack Krupansky
, so if you need to keep that entire string as one term, use the whitespace tokenizer. That said, treating hyphen as a word break is usually not a problem as long as you enable auto phrase generation for the query parser. -- Jack Krupansky On Mon, Oct 5, 2015 at 4:06 AM, Bhaskar <bhask

Re: Need help in alphanumeric search

2015-10-01 Thread Jack Krupansky
Technically, there is no such thing as a "sentence search" in Lucene. Please provide an example of how you wish to search, and then we can determine whether a phrase query or a span query might accomplish the task. -- Jack Krupansky On Thu, Oct 1, 2015 at 11:53 AM, Bhaskar <bhaskar1

Re: Need help in alphanumeric search

2015-10-01 Thread Jack Krupansky
Phrase query for a tokenized text field should do it. -- Jack Krupansky On Thu, Oct 1, 2015 at 10:04 PM, Bhaskar <bhaskar1...@gmail.com> wrote: > Hi Jack, > > my searching is working like this. > > if i give input as "SD RAM Bhaskar" then which ever strings are

Re: How to use case in-sentive search

2015-08-14 Thread Jack Krupansky
was really how to get case-sensitive query, simply create your own analyzer without the lower case filter. -- Jack Krupansky On Fri, Aug 14, 2015 at 10:07 AM, Erick Erickson erickerick...@gmail.com wrote: Add LowercaseFilterFactory to your analysis chain for the fieldType both at query and index time

Re: ignore score and weight in lucene search

2015-07-29 Thread Jack Krupansky
ConstantScoreQuery is the proper approach. What specific failure did you encounter? -- Jack Krupansky On Wed, Jul 29, 2015 at 7:09 AM, 丁儒 bfore...@126.com wrote: Hi, all Currently i'm using lucene. But i don't care the score and weight, i just need the documents meets the query. I tried

Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Jack Krupansky
/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti socac...@gmail.com wrote: Hi all, i'm new to lucene and tried to write my own analyzer to support hyphenated words like wi-fi, jean-pierre, etc. For our customer it is important

Re: Using lucene queries to search StringFields

2015-06-21 Thread Jack Krupansky
://lucene.apache.org/core/5_2_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html You can also simply escape the spaces with a backslash rather than quote the entire term, but you still need to use the keyword analyzer. -- Jack Krupansky On Fri, Jun 19, 2015 at 2:31 AM, Gimantha
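The backslash alternative can be sketched in plain Java. This is only an illustration: `TermEscaper`, `escapeSpaces`, and the `city` field are hypothetical names, not part of Lucene, and the helper handles spaces only. Any other query-syntax characters in the value would still need escaping separately, and the keyword analyzer is still required as noted above.

```java
final class TermEscaper {
    private TermEscaper() {}

    /** Prefix every space with a backslash so the query parser sees one term. */
    static String escapeSpaces(String term) {
        return term.replace(" ", "\\ ");
    }

    public static void main(String[] args) {
        // Builds a query string in which the space is backslash-escaped
        // instead of wrapping the whole value in quotation marks.
        System.out.println("city:" + escapeSpaces("New York"));
    }
}
```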

Re: Text dependent analyzer

2015-04-15 Thread Jack Krupansky
the sentence boundaries are? Be specific, because that determines what your queries should look like, which determines what the indexed text should look like, which determines how the text should be analyzed. -- Jack Krupansky On Wed, Apr 15, 2015 at 8:12 AM, Shay Hummel shay.hum...@gmail.com wrote: Hi

Re: Calculate the score of an arbitrary string vs a query?

2015-04-10 Thread Jack Krupansky
/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int) -- Jack Krupansky On Fri, Apr 10, 2015 at 4:15 PM, Gregory Dearing gregdear...@gmail.com wrote: Hi Ali, The short answer to your question is... there's no good way to create a score from your result

Re: Lucene and accumulo

2015-04-09 Thread Jack Krupansky
/browse/ACCUMULO-3698 The SQRRL commercial product has (or at least had before the company shifted its corporate strategy) Lucene indexing of Accumulo data, but that's a proprietary product: http://sqrrl.com/product/search/ -- Jack Krupansky On Thu, Apr 9, 2015 at 6:33 AM, madhvi madhvi.gu

Re: Would Like to contribute to Lucene

2015-03-27 Thread Jack Krupansky
is always a great contribution. -- Jack Krupansky On Thu, Mar 26, 2015 at 8:15 PM, Erick Erickson erickerick...@gmail.com wrote: You really have to just pick a problem, dive into the code and learn it bit by bit through exploration. The code base changes fast enough that anything published

Re: how to reasonably estimate the disk size for Lucene 4.x

2015-03-24 Thread Jack Krupansky
, everything runs great on commodity hardware! Kool-Aid. IOW, running a 32GB index on a 16 GB box is probably not a great idea if you need low latency. -- Jack Krupansky On Tue, Mar 24, 2015 at 8:37 AM, Gaurav gupta gupta.gaurav0...@gmail.com wrote: Erick, When further testing the index sizes using Lucene

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Jack Krupansky
This is the first mention that I have seen for that corpus on this list. There seem to be more than a few references when I google for brown corpus lucene, such as: https://github.com/INL/BlackLab/wiki/Blacklab-query-tool -- Jack Krupansky On Tue, Feb 24, 2015 at 1:40 AM, Koji Sekiguchi

Re: Indexing Query

2015-02-18 Thread Jack Krupansky
You could store the length of the field (in terms) in a second field and then add a MUST term to the BooleanQuery which is a RangeQuery with an upper bound that is the maximum length that can match. -- Jack Krupansky On Wed, Feb 18, 2015 at 4:54 AM, Ian Lea ian@gmail.com wrote: You mean
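A rough plain-Java model of that idea, assuming whitespace-separated terms: the count is computed once per document (the "second field"), and the mandatory range clause becomes a simple upper-bound check. `TermCountFilter` and its methods are made-up names for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermCountFilter {
    private TermCountFilter() {}

    /** Number of whitespace-separated terms; computed at index time and stored alongside the text. */
    static int termCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Stand-in for the mandatory range clause: keep documents with at most maxTerms terms. */
    static List<String> atMostTerms(Map<String, Integer> storedCounts, int maxTerms) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : storedCounts.entrySet()) {
            if (entry.getValue() <= maxTerms) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> storedCounts = new LinkedHashMap<>();
        for (String doc : new String[] {"sd ram", "sd ram card reader"}) {
            storedCounts.put(doc, termCount(doc));
        }
        // With an upper bound of 3 terms, only the two-term document survives.
        System.out.println(atMostTerms(storedCounts, 3));
    }
}
```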

Re: Boolean Search Query is not workng

2015-01-24 Thread Jack Krupansky
documents have different capitalization of Java/java. -- Jack Krupansky On Fri, Jan 23, 2015 at 9:54 AM, Rajendra Rao rajendra@launchship.com wrote: Hello Reply to the mail, sent by Nitin We tried and this is what we got : My query was dotNet^10.0 Resume:jdbc Resume:C# Resume:MVC Documents

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Jack Krupansky
. -- Jack Krupansky On Thu, Jan 15, 2015 at 11:23 AM, danield danield...@gmail.com wrote: Hi Mike, Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem, and this is because the Term Frequency and therefore the results will still be wrong, as prepending

Re: Questions regarding Lucene 5

2015-01-10 Thread Jack Krupansky
/lucene/facet/FacetsCollector.java?revision=1634013&view=markup Any other features of Lucene 5 that you are particularly interested in? -- Jack Krupansky On Sat, Jan 10, 2015 at 3:01 PM, Elad Margalit eladm...@gmail.com wrote: Hi, I would like to ask regarding Lucene 5, Do you

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Jack Krupansky
Oops... I take that back! After I clicked Send I realized that this is the Lucene list - what I said is true for Solr queries, but that is because Solr added a hack to do things properly, but the Lucene query parser doesn't have that hack, so Erick is correct. -- Jack Krupansky On Wed, Jan 7

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Jack Krupansky
that the above strategy would be reasonable, or do you need to process large numbers of large documents? -- Jack Krupansky -Original Message- From: ryanb Sent: Tuesday, November 25, 2014 7:39 PM To: java-user@lucene.apache.org Subject: OutOfMemoryError indexing large documents Hello, We

Re: Exceptions during batch indexing

2014-11-08 Thread Jack Krupansky
Oops... you sent this to the wrong list - this is the Lucene user list, send it to the Solr user list. -- Jack Krupansky -Original Message- From: Peter Keegan Sent: Thursday, November 6, 2014 3:21 PM To: java-user Subject: Exceptions during batch indexing How are folks handling Solr

Re: Questions about the Lucene query language

2014-10-27 Thread Jack Krupansky
Pure negative queries are not supported, but all you need to do is include *:*, which translates into MatchAllDocsQuery. "hello dolly" is the same as "hello dolly"~0 -- Jack Krupansky -Original Message- From: Prad Nelluru Sent: Monday, October 27, 2014 8:57 PM To: java-user

Re: How to properly use Levenstein distance with ~ in Java

2014-10-18 Thread Jack Krupansky
What is the value of the qf parameter? You don't have an explicit field name such as title in your query string, q. -- Jack Krupansky -Original Message- From: Aleksander Sadecki Sent: Thursday, October 16, 2014 11:46 AM To: java-user@lucene.apache.org Subject: How to properly use

Re: How to properly use Levenstein distance with ~ in Java

2014-10-18 Thread Jack Krupansky
Oops... for future reference, please post Solr questions to the *Solr* user list, not the *Lucene* (java) user list! -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Saturday, October 18, 2014 7:50 AM To: java-user@lucene.apache.org Subject: Re: How to properly use

Re: Term vectors

2014-09-30 Thread Jack Krupansky
-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html The free Solr Reference Guide has a short section on the Solr Term Vector component. You could check it out before buying my $10 e-book. See: https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component -- Jack

Re: NOTICE: Seeking Moderators for java-user@lucene

2014-09-30 Thread Jack Krupansky
Yeah, I can be a moderator, for both Lucene and Solr. -- Jack Krupansky -Original Message- From: Chris Hostetter Sent: Tuesday, September 30, 2014 12:51 PM To: java-user@lucene.apache.org Cc: java-user-ow...@lucene.apache.org Subject: NOTICE: Seeking Moderators for java-user@lucene

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
Yes, most special characters are treated as term delimiters, except that underscores, dots, and commas have some special rules. See the details under Standard Tokenizer in my Solr e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product

Re: Migrating lucene index to Elastic Search

2014-09-26 Thread Jack Krupansky
since ES has some special things they do so that a raw Lucene index is unlikely to be compatible with ES, and to simply reindex your source data directly into ES to take full advantage of ES. -- Jack Krupansky -Original Message- From: Aditya Sent: Friday, September 26, 2014 3:55 AM

Re: How to properly correlate relevance in a search across multiple collections

2014-09-06 Thread Jack Krupansky
on a refine results button to re-do the search with the more expensive cross-corpus df-based scoring. Thoughts? -- Jack Krupansky -Original Message- From: Baldwin, David Sent: Friday, September 5, 2014 8:05 PM To: java-user@lucene.apache.org Subject: How to properly correlate relevance

Re: Question regarding complex queries and long tail suggestions

2014-09-03 Thread Jack Krupansky
/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java?revision=1622067&view=markup -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, September 3, 2014 7:14 PM To: java-user Subject: Re: Question regarding complex queries and long tail suggestions Take a look

Re: indexing all suffixes to support leading wildcard?

2014-08-28 Thread Jack Krupansky
Use the ngram token filter, and then a query of 512 would match by itself: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, August 28, 2014 11:52 PM
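To see why n-grams make an inner substring such as 512 searchable without a leading wildcard, here is a small plain-Java sketch of n-gram generation. It does not use the Lucene filter itself; `NGramSketch` and the sample token `AB512` are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramSketch {
    private NGramSketch() {}

    /** All substrings of token whose length is between minGram and maxGram, in order of start offset. */
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Each gram is indexed as its own term, so a plain term query
        // for "512" can find the document that contained "AB512".
        System.out.println(ngrams("AB512", 3, 3));
    }
}
```

The trade-off is index size: every token expands into several terms, so the gram lengths are worth keeping as narrow as the queries allow.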

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html -- Jack Krupansky -Original Message- From: Michael Sokolov Sent: Wednesday, August 27, 2014 10:26 AM To: java-user@lucene.apache.org Subject: Re: Why does this search fail? Tokenization is tricky. You might

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
://support.google.com/websearch/answer/136861?hl=en It also seems to support ** in a quoted phrase to mean one or more arbitrary terms. This isn't documented, but seems to work. -- Jack Krupansky -Original Message- From: Milind Sent: Wednesday, August 27, 2014 10:51 AM To: java-user

Re: Why does this search fail?

2014-08-26 Thread Jack Krupansky
are defined as multi-term, so they will be performed, but the standard tokenizer is not being called, so the dot remains and this whole term is treated as one term, unlike the index analysis. -- Jack Krupansky -Original Message- From: Milind Sent: Tuesday, August 26, 2014 12:24 PM

Re: WhiteSpaceTokenizer

2014-08-15 Thread Jack Krupansky
on it: https://issues.apache.org/jira/browse/LUCENE-5785 -- Jack Krupansky -Original Message- From: Sheng Sent: Thursday, August 14, 2014 11:38 PM To: java-user@lucene.apache.org Subject: WhiteSpaceTokenizer The length of token has to be shorter than 255, otherwise there will be unpredictable

Re: WhiteSpaceTokenizer

2014-08-15 Thread Jack Krupansky
Sure, that should be a configurable option. Oh, and I neglected to mention a workaround: use the pattern tokenizer, which doesn't have a limit (yet.) But it might be slower. -- Jack Krupansky -Original Message- From: Sheng Sent: Friday, August 15, 2014 8:13 AM To: java-user

Re: Searching with String that Represents a Signature

2014-08-14 Thread Jack Krupansky
The standard analyzer will discard most special characters as punctuation. What analyzer are you using? -- Jack Krupansky -Original Message- From: Scott Selvia Sent: Thursday, August 14, 2014 7:42 PM To: java-user@lucene.apache.org Subject: Searching with String that Represents

Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Jack Krupansky
And unfiltered. So even if you use the keyword tokenizer that only generates a single token, you still want token filtering, such as lower case. -- Jack Krupansky -Original Message- From: Christoph Kaser Sent: Tuesday, August 12, 2014 3:07 AM To: java-user@lucene.apache.org Subject

Re: escaping characters

2014-08-12 Thread Jack Krupansky
The default changed to false in Lucene 3.1. Before that it was true. -- Jack Krupansky -Original Message- From: Chris Salem Sent: Tuesday, August 12, 2014 8:34 AM To: java-user@lucene.apache.org Subject: RE: escaping characters Thanks! That worked. We recently upgraded from 2.9

Re: escaping characters

2014-08-11 Thread Jack Krupansky
#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky -Original Message- From: Chris Salem Sent: Monday, August 11, 2014 1:03 PM To: java-user@lucene.apache.org Subject: RE: escaping characters I'm not using Solr. Here's my code: FSDirectory fsd = FSDirectory.open(new File(C

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

2014-08-07 Thread Jack Krupansky
need to manually filter your query terms. Sounds like maybe a term got stemmed. -- Jack Krupansky -Original Message- From: Bianca Pereira Sent: Thursday, August 7, 2014 7:28 AM To: java-user@lucene.apache.org Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency Hi

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

2014-08-07 Thread Jack Krupansky
Also, usually query-time analysis is done by a query parser, so if you aren't going through a query parser, you have to call the analyzer yourself. The stemming is very likely the culprit here. -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Thursday, August 7, 2014

Re: Lucene Query Wrong Result for phrase.

2014-07-18 Thread Jack Krupansky
The standard tokenizer will strip off those escaped quotes at query time. Ditto for the hyphen at index time. Try constructing your own analyzer using the white space tokenizer instead of the standard tokenizer. -- Jack Krupansky -Original Message- From: itisismail Sent: Friday

Re: How to handle words that stem to stop words

2014-07-07 Thread Jack Krupansky
of your stop words, or possibly a pattern that matches stop words plus a short suffix that might get stemmed. -- Jack Krupansky -Original Message- From: Arjen van der Meijden Sent: Sunday, July 6, 2014 2:47 PM To: java-user@lucene.apache.org Subject: How to handle words that stem to stop

Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Jack Krupansky
I'll defer to the hard-core Lucene committers for the technical details, but I would suggest that a very large term with dozens of wildcards is a known limitation (albeit not well-documented.) IOW, to use wildcards in Lucene in a performant manner, they need to be brief. -- Jack Krupansky

Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky
introduces a regex query term. It is added by the escape method you call, but the escaping will be gone by the time your analyzer is called. So, just try a simple, unescaped slash in your char mapping table. -- Jack Krupansky -Original Message- From: Luis Pureza Sent: Tuesday, June 17

Re: searching with stemming

2014-06-09 Thread Jack Krupansky
. -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 9, 2014 6:56 AM To: java-user@lucene.apache.org Subject: Re: searching with stemming To me, it seems strange that these default analyzers, don't provide constructors that enable one to override stemming, etc? On 2014/06

Re: searching with stemming

2014-06-09 Thread Jack Krupansky
Please do file a Jira. I'm sure the discussion will be interesting. -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 9, 2014 9:33 AM To: java-user@lucene.apache.org Subject: Re: searching with stemming Jack Thanks. I figured as much. I'm modifying each analyzer

Re: How to approach indexing source code?

2014-06-03 Thread Jack Krupansky
to be indexed. -- Jack Krupansky -Original Message- From: Johan Tibell Sent: Tuesday, June 3, 2014 9:32 PM To: java-user@lucene.apache.org Subject: How to approach indexing source code? Hi, I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich

Re: search performance

2014-06-02 Thread Jack Krupansky
a 256GB machine? How frequent are your commits for updates while doing queries? -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 2, 2014 2:51 AM To: java-user@lucene.apache.org Subject: search performance Greetings Despite following all the recommended optimizations

Re: Multi-thread indexing, should the commit be called from each thread?

2014-05-21 Thread Jack Krupansky
(Was this supposed to be a java-user/Lucene question or a Solr question?!) -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, May 21, 2014 10:58 AM To: java-user Subject: Re: Multi-thread indexing, should the commit be called from each thread? I'll be more

Re: writer.updateDocument() not working (possible bug?)

2014-05-19 Thread Jack Krupansky
for a batch update model as opposed to a true real-time database (it's a search engine, not a database!), but... the original goals and requirements might give us some insight. Thanks. -- Jack Krupansky -Original Message- From: Michael McCandless Sent: Monday, May 19, 2014 6:10 AM To: Lucene

Re: Performance issue when using multiple PhraseQueries against a 1+ million entries index

2014-05-19 Thread Jack Krupansky
Does your index fit fully in system memory - the OS file cache? If not, there could be a lot of thrashing (I/O) as Lucene accesses the index. -- Jack Krupansky -Original Message- From: Liviu Matei Sent: Monday, May 19, 2014 4:21 PM To: java-user@lucene.apache.org Subject: Performance

Re: A work around to get matching terms from document - Stemmed and Synonyms

2014-05-17 Thread Jack Krupansky
The explain section of the debug response when you set the debugQuery=true parameter will give you the final terms that were matched for each document. -- Jack Krupansky -Original Message- From: venkatesham.gu...@igate.com Sent: Saturday, May 17, 2014 2:28 AM To: java-user

Re: A work around to get matching terms from document - Stemmed and Synonyms

2014-05-17 Thread Jack Krupansky
Oops... I just noticed that you sent this request to the java-user list, which is primarily for developers using the Lucene library directly. Try sending it to the solr-user list, which is for users and developers working with Solr. -- Jack Krupansky -Original Message- From

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread Jack Krupansky
by having a tokenizer that simply ignored punctuation and whitespace and generated one big original token and then n-grammed it based on some maximal query phrase size. And... the original requirement spec didn't list that as a use case anyway. -- Jack Krupansky -Original Message- From

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-11 Thread Jack Krupansky
. In truth, Lucene/Solr doesn't have a good out of the box solution for this use case. -- Jack Krupansky -Original Message- From: teko Sent: Thursday, May 8, 2014 9:03 AM To: java-user@lucene.apache.org Subject: How to locate a Phrase inside text (like a Browser text searcher) Hi

Re: is there a historical reason why default conjunction operator is OR?

2014-04-16 Thread Jack Krupansky
. Average users just get annoyed when the search engine is being so picky. -- Jack Krupansky -Original Message- From: Jose Carlos Canova Sent: Wednesday, April 16, 2014 12:53 PM To: java-user@lucene.apache.org Subject: Re: is there a historical reason why default conjunction operator

Re: Stored fields and OS file caching

2014-04-05 Thread Jack Krupansky
it. -- Jack Krupansky -Original Message- From: Adrien Grand Sent: Friday, April 4, 2014 4:50 PM To: java-user@lucene.apache.org Subject: Re: Stored fields and OS file caching Hi Vitaly, Doc values are indeed well-suited for grouping and sorting. However stored fields remain better at returning

Re: Lucene Wildcard for zero or one character

2014-03-25 Thread Jack Krupansky
/houses?/ -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Tuesday, March 25, 2014 11:34 AM To: java-user@lucene.apache.org Subject: RE: Lucene Wildcard for zero or one character The default WildcardQuery only supports: '*' (star) is the wildcard in WildcardQuery
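In the regex form, the trailing `?` makes the preceding character optional, which gives the "zero or one character" behavior the wildcard syntax lacks. The sketch below checks this with `java.util.regex` rather than Lucene's own regular expression engine; I am assuming the two agree for a pattern this simple, and `OptionalCharRegex` is a made-up class name.

```java
import java.util.regex.Pattern;

final class OptionalCharRegex {
    private OptionalCharRegex() {}

    // The same pattern as in the query, without the surrounding slashes.
    private static final Pattern HOUSES = Pattern.compile("houses?");

    /** True when the whole term matches, mirroring how a term-level regex query tests complete terms. */
    static boolean matchesWholeTerm(String term) {
        return HOUSES.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"house", "houses", "housess"}) {
            System.out.println(term + " -> " + matchesWholeTerm(term));
        }
    }
}
```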

Re: maxDoc/numDocs int fields

2014-03-21 Thread Jack Krupansky
now, but otherwise, that's the limit for now. -- Jack Krupansky -Original Message- From: Artem Gayardo-Matrosov Sent: Friday, March 21, 2014 12:41 PM To: java-user@lucene.apache.org Subject: Re: maxDoc/numDocs int fields Hi Oli, Thanks for your reply, I thought about this, but it feels

Re: How to search for terms containing negation

2014-03-18 Thread Jack Krupansky
Of course - you need to use the same analyzer for both indexing and query. So, just reindex your data with this new analyzer. -- Jack Krupansky -Original Message- From: Natalia Connolly Sent: Tuesday, March 18, 2014 10:37 AM To: java-user@lucene.apache.org Subject: Re: How to search

Re: tf/idf similarity with modified document similarity

2014-03-07 Thread Jack Krupansky
of that info is hanging around as part of the query matching process. Still, that is a reasonable feature to want and it has been requested before. Worth a Jira. -- Jack Krupansky -Original Message- From: Christian Reuschling Sent: Thursday, March 6, 2014 1:34 PM To: java-user

Re: encoding problem when retrieving document field value

2014-03-03 Thread Jack Krupansky
picking a PU Unicode character? -- Jack Krupansky -Original Message- From: G.Long Sent: Monday, March 3, 2014 12:09 PM To: java-user@lucene.apache.org Subject: encoding problem when retrieving document field value Hi :) My index (Lucene 3.5) contains a field called title. Its value

Re: query regarding Lucene Indexing and searching

2014-03-02 Thread Jack Krupansky
Please elaborate on what you expect will be in this payload. Is it information derived from the indexing process itself or is it external information to be added to the indexed terms? -- Jack Krupansky -Original Message- From: Mrugendra Sent: Sunday, March 2, 2014 5:15 AM To: java

Re: Fuzzy query on capital letters does not match documents

2014-02-27 Thread Jack Krupansky
Be careful with very short terms and fuzzy query. The rounding when converting from a fraction to an edit distance can make the match exact rather than fuzzy. What terms does your index have? XV, Xv, xV, xv? XV~0.7 may only match XV. -- Jack Krupansky -Original Message- From: G.Long
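The rounding effect can be worked through with a little arithmetic. The sketch below is modeled on how I understand Lucene 4.x to turn a fractional similarity into a whole number of edits (truncate `(1 - similarity) * termLength`, then cap it); the exact formula and the cap of 2 are assumptions here, and `FuzzyEdits` is not a Lucene class.

```java
final class FuzzyEdits {
    private FuzzyEdits() {}

    /** Assumed upper limit on the edit distance. */
    static final int MAX_EDITS = 2;

    /** Truncate the fractional edit budget to a whole number of edits, then apply the cap. */
    static int editsFor(float minimumSimilarity, int termLength) {
        return Math.min((int) ((1.0 - minimumSimilarity) * termLength), MAX_EDITS);
    }

    public static void main(String[] args) {
        // Two-character term at 0.7: 0.3 * 2 = 0.6, which truncates to 0 edits (exact match only).
        System.out.println("XV~0.7    -> " + editsFor(0.7f, 2) + " edits");
        // Five-character term at 0.7: 0.3 * 5 = 1.5, which truncates to 1 edit.
        System.out.println("abcde~0.7 -> " + editsFor(0.7f, 5) + " edits");
    }
}
```

Under these assumptions, a two-character term needs a similarity of 0.5 or lower before even a single edit is allowed.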

Re: How to delete a token that comes exactly after a token

2014-02-26 Thread Jack Krupansky
Sounds like a custom filter. Or maybe an option for stop filter or a specialization of stop filter. Or maybe it could be even more generalized. What are some practical example token sequences? -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, February 26

Re: How to delete a token that comes exactly after a token

2014-02-26 Thread Jack Krupansky
If this is primarily an issue with the document input, as opposed to queries, you might be better off simply preprocessing the text before it is given to Lucene to be indexed. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, February 26, 2014 1:37 PM

Re: codec mismatch

2014-02-17 Thread Jack Krupansky
in the native file system for greater performance. Solrandra stored the Lucene indexes in Cassandra, but the performance penalty was too high. -- Jack Krupansky -Original Message- From: Jason Wee Sent: Friday, February 14, 2014 3:13 AM To: java-user@lucene.apache.org Subject: codec mismatch

Re: char mapping in lucene-icu

2014-02-14 Thread Jack Krupansky
be as simple as whether the data file should have DOS or UNIX or Mac line endings (CRLF vs. NL vs. CR.) Be sure to use an editor that satisfies the requirements of ICU. To be clear, Lucene itself does not have a published API for modifying the mappings of ICU. -- Jack Krupansky -Original Message

Re: Wildcard searches

2014-02-05 Thread Jack Krupansky
Take a look at the complex phrase query parser. See: http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html See also: https://issues.apache.org/jira/browse/LUCENE-1486 -- Jack Krupansky -Original Message- From

Re: Why PhraseQuery translate stopwords to ?

2013-12-10 Thread Jack Krupansky
In theory, the query with holes (position increments) for stop words should work... unless you originally indexed the data without the stop word filter. Any time you change the filters, you typically need to reindex the data. -- Jack Krupansky -Original Message- From: Jean-Claude

Re: Why PhraseQuery translate stopwords to ?

2013-12-09 Thread Jack Krupansky
The analyzer is generating holes for the stop words - the position of the subsequent term is incremented an extra time for each stop word so that their positions are maintained. -- Jack Krupansky -Original Message- From: Jean-Claude Dauphin Sent: Monday, December 09, 2013 4:15 PM
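
A rough simulation of how those holes come about, without using Lucene at all: the sketch below assumes a whitespace split plus lowercasing and a small stop set, and the class and method names are hypothetical. Each removed stop word adds one to the position increment of the next surviving term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: StopWordHolesSketch is a hypothetical stand-in for an
// analyzer with a stop filter that preserves position increments.
class StopWordHolesSketch {
    static List<String> termsWithPositions(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int position = -1;        // position of the last emitted term
        int pendingIncrement = 1; // grows by one for every stop word skipped
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (stopWords.contains(token)) {
                pendingIncrement++; // leave a hole instead of closing the gap
                continue;
            }
            position += pendingIncrement;
            pendingIncrement = 1;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of"));
        // Prints [quick@1, fox@2, forest@5]: positions 0, 3 and 4 are holes.
        System.out.println(termsWithPositions("the quick fox of the forest", stops));
    }
}
```

A phrase query built from the same text would need the same gaps to line up with the indexed positions.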

Re: tokenizer to strip a set of characters

2013-11-21 Thread Jack Krupansky
at the start or end. -- Jack Krupansky -Original Message- From: Stephane Nicoll Sent: Thursday, November 21, 2013 9:42 AM To: java-user@lucene.apache.org Subject: tokenizer to strip a set of characters Hi, I am using Lucene 3.6 and I am looking for a tokenizer that would remove certain

Re: How to perform Wildcard search when using WhitespaceAnalyzer?

2013-11-18 Thread Jack Krupansky
As I indicated in my previous message, we need actual queries and the actual indexed data where matches are failing. Note that *NALYZE will not match ANALYZER. So, it might be that you have composed queries in which some of the terms match properly and only some fail. -- Jack Krupansky
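
The reason *NALYZE misses ANALYZER is that a wildcard pattern has to cover the whole indexed term, not just part of it. A small sketch of that rule using plain java.util.regex rather than Lucene (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Illustrative only: WildcardSketch is a hypothetical helper that imitates
// whole-term wildcard matching with a regular expression.
class WildcardSketch {
    static boolean matchesTerm(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any run of characters, including none
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Pattern.matches requires the entire term to match the pattern.
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("*NALYZE", "ANALYZER"));  // false: trailing R is not covered
        System.out.println(matchesTerm("*NALYZE*", "ANALYZER")); // true
        System.out.println(matchesTerm("*NALYZE?", "ANALYZER")); // true
    }
}
```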

Re: How to perform Wildcard search when using WhitespaceAnalyzer?

2013-11-17 Thread Jack Krupansky
, and what does the indexed data look like? The simple answer to your question is that wildcards don't behave any differently between the two analyzers - simply because they are not used at all for the wildcard terms. -- Jack Krupansky -Original Message- From: raghavendra.k@barclays.com

Re: Twitter analyser

2013-11-05 Thread Jack Krupansky
protWords) See: http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html -- Jack Krupansky -Original Message- From: Stéphane Nicoll Sent: Tuesday, November 05, 2013 2:40 AM To: java-user@lucene.apache.org Subject: Twitter analyser

Re: DateQuery with comparison operators

2013-10-29 Thread Jack Krupansky
the use of curly braces for exclusive end points. -- Jack Krupansky -Original Message- From: Umashanker, Srividhya Sent: Tuesday, October 29, 2013 3:57 AM To: java-user@lucene.apache.org Subject: DateQuery with comparison operators Hi - I am using Lucene 4.5 and want to support date

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
using at query time? -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, October 20, 2013 12:47 PM To: java-user@lucene.apache.org Subject: Handling special characters in Lucene 4.0 I have created strings like the below searchtext +sampletext and when I try to search

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
characters with a backslash, and then leave the asterisk unescaped to perform a wildcard query. -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, October 20, 2013 6:02 PM To: java-user@lucene.apache.org Subject: Re: Handling special characters in Lucene 4.0
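
A minimal sketch of that kind of escaping, done by hand so that the asterisk stays live. The list of special characters is my assumption of what the classic query parser treats as syntax, with the asterisk deliberately left out; the class and method names are hypothetical and this is not a Lucene API.

```java
// Illustrative only: WildcardEscapeSketch is a hypothetical helper, not part
// of Lucene. It backslash-escapes query syntax characters but leaves '*' alone.
class WildcardEscapeSketch {
    // Assumed set of query-parser special characters, minus the asterisk.
    private static final String SPECIAL = "\\+-!():^[]\"{}~?|&/";

    static String escapeExceptAsterisk(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // escape so the parser treats it as a literal
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints \+sampletext* : the plus is literal, the asterisk is still a wildcard.
        System.out.println(escapeExceptAsterisk("+sampletext*"));
    }
}
```

Whether the escaped characters then survive depends on the analyzer used for the field, so it is worth testing against the actual index.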

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
characters that you don't want to keep (e.g., period, comma, semicolon, parentheses, etc.) The query parser itself knows nothing about what your chosen analyzer does. But the query parser does specially interpret the special characters that the escape method mentions. -- Jack Krupansky -Original

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
. -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, October 20, 2013 7:43 PM To: java-user@lucene.apache.org Subject: Re: Handling special characters in Lucene 4.0 what about other characters like ','( quote) characters. We have a requirement that a text can start
