Re: Subset Matching

2016-03-25 Thread Sujit Pal
Hi Otmar, Shouldn't Occur.SHOULD alone do what you ask? Documents that match all terms in the query would be scored higher than documents that match fewer than all terms. -sujit On Fri, Mar 25, 2016 at 2:20 AM, Otmar Caduff wrote: > Hi all > In Lucene, I know of the

Re: Calculate the score of an arbitrary string vs a query?

2015-04-11 Thread Sujit Pal
Hi Ali, I agree with the others that there is no good way to do what you are looking for if you want to assign Lucene-like scores to your external results, but if you have some objective measure of goodness that doesn't depend on your Lucene scores, you can apply it to both result sets and merge

Re: Proximity query

2015-02-12 Thread Sujit Pal
I did something like this some time back. The objective was to find patterns surrounding some keywords of interest so I could find keywords similar to the ones I was looking for, sort of like a poor man's word2vec. It uses SpanQuery as Jigar said, and you can find the code here (I believe it was

Re: Case sensitivity

2014-09-19 Thread Sujit Pal
Hi John, Take a look at the PerFieldAnalyzerWrapper. As the name suggests, it allows you to create different analyzers per field. -sujit On Fri, Sep 19, 2014 at 6:50 AM, John Cecere john.cec...@oracle.com wrote: I've considered this, but there are two problems with it. First of all, it

Re: Quickest way to collect one field from the searched docs....

2014-09-19 Thread Sujit Pal
Hi Shouvik, not sure if you have already considered this, but you could put the database primary key for the record into the index - i.e., reverse your insert order to do the DB insert first, get the record_id, and then add it to the Lucene index as a record_id field. During retrieval you can minimize the network

Re: How to handle words that stem to stop words

2014-07-10 Thread Sujit Pal
Hi Arjen, This is kind of a spin on your last observation that your list of stop words doesn't change frequently. You could have a custom filter that attempts to stem the incoming token and, only if the stem matches a stop word, sets the keyword attribute on the original token. That way
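A minimal sketch of the decision logic described above, with a deliberately trivial, hypothetical stem() standing in for a real stemmer (the KeywordAttribute wiring inside a TokenFilter is Lucene-specific and omitted here):

```java
import java.util.Set;

// Sketch: if the stem of the incoming token would collide with a stop word,
// mark the original token as a keyword so downstream stages pass it through.
// stem() below is a toy placeholder, NOT a real Porter stemmer.
public class StemGuard {
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and");

    // hypothetical, trivially simplified stemmer for illustration only
    static String stem(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    public static boolean shouldKeepAsKeyword(String term) {
        return STOP_WORDS.contains(stem(term));
    }
}
```

In a real filter, shouldKeepAsKeyword() would drive KeywordAttribute.setKeyword(true) inside incrementToken().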

Re: How to handle words that stem to stop words

2014-07-07 Thread Sujit Pal
Hi Arjen, You could also mark a token as keyword so the stemmer passes it through unchanged. For example, per the Javadocs for PorterStemFilter: http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html Note: This filter is aware of the

Re: Securing stored data using Lucene

2013-06-25 Thread SUJIT PAL
Hi Rafaela, I built something along these lines as a proof of concept. All data in the index was unstored and only fields which were searchable (tokenized and indexed) were kept in the index. The full record was encrypted and stored in a MongoDB database. A custom Solr component did the search

Re: Payload Matching Query

2013-06-21 Thread SUJIT PAL
Hi Michael, Instead of putting the annotations in payloads, why not put them in as synonyms, i.e., at the same position as the original string (see SynonymFilter in the LIA book). So your string would look like (to the index): W. A. Mozart was born in Salzburg artist city so you can

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
SUJIT PAL wrote: Hi Carsten, Why not use your idea of the BooleanQuery but wrap it in a Filter instead? Since you are not doing any scoring (only filtering), the max boolean clauses limit should not apply to a filter. Hi Sujit, thanks for your suggestion! I wasn't aware that the max

Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
Hi Uwe, I see, makes sense, thanks very much for the info. Sorry about giving you wrong info Carsten. -sujit On Apr 15, 2013, at 1:06 PM, Uwe Schindler wrote: Hi, Original Message- From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL Sent: Monday, April 15

Re: Statically store sub-collections for search (faceted search?)

2013-04-12 Thread SUJIT PAL
Hi Carsten, Why not use your idea of the BooleanQuery but wrap it in a Filter instead? Since you are not doing any scoring (only filtering), the max boolean clauses limit should not apply to a filter. -sujit On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote: Dear list, I would like to

Re: Accent insensitive analyzer

2013-03-22 Thread SUJIT PAL
Hi Jerome, How about this one? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory Regards, Sujit On Mar 22, 2013, at 9:22 AM, Jerome Blouin wrote: Hello, I'm looking for an analyzer that allows performing accent insensitive search in latin

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread SUJIT PAL
Hi Glen, I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single _ delimited single token and attached the payload to it. For example: The Big Bad Wolf huffed and puffed and blew the house of the Three
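The token-combining workaround described above can be sketched in a couple of lines; attaching the payload itself (via PayloadAttribute) is Lucene-specific and omitted, so this only shows the collapse of a multi-word phrase into one underscore-delimited token:

```java
// Sketch: collapse a multi-token phrase into a single token so that a
// single payload can be attached to it, e.g. "Big Bad Wolf" -> "Big_Bad_Wolf".
public class PhraseToken {
    public static String combine(String... tokens) {
        return String.join("_", tokens);
    }
}
```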

Re: Scoring a document using LDA topics

2011-11-29 Thread Sujit Pal
Is this the best way? I can't see a way to compute the Sim() metric at any other time, because in scorePayload(), we don't have access to the full payload, nor to the query. Thanks again, Steve On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal sujit@comcast.net wrote: Hi Stephen, We

Re: Scoring a document using LDA topics

2011-11-28 Thread Sujit Pal
Hi Stephen, We are doing something similar, and we store as a multifield with each document as (d,z) pairs where we store the z's (scores) as payloads for each d (topic). We have had to build a custom similarity which implements the scorePayload function. So to find docs for a given d (topic), we

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Sujit Pal
Hi Grant, Not sure if this qualifies as a bet you didn't know, but one could use Lucene term vectors to construct document vectors for similarity, clustering and classification tasks. I found this out recently (although I am probably not the first one), and I think this could be quite useful.

Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Sujit Pal
Hi Paul, Since you have modified the StandardAnalyzer (I presume you mean StandardFilter), why not do a check on the term.text() and if it's all punctuation, skip the analysis for that term? Something like this in your StandardFilter: public final boolean incrementToken() throws IOException {
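The check suggested above can be isolated as a small predicate; the surrounding TokenFilter plumbing (CharTermAttribute, incrementToken()) is Lucene-specific and omitted here:

```java
// Sketch: return true for terms made up entirely of punctuation, so the
// filter can skip further analysis for them. An empty term is not treated
// as punctuation.
public class PunctCheck {
    public static boolean isAllPunctuation(String term) {
        if (term.isEmpty()) return false;
        for (char c : term.toCharArray()) {
            if (Character.isLetterOrDigit(c)) return false;
        }
        return true;
    }
}
```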

Re: Is there any Query in Lucene can search the term, which is similar as SQL-LIKE?

2011-10-17 Thread Sujit Pal
Hi Mead, You may want to check out the permuterm index idea. http://www-nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html Basically you write a custom filter that takes a term and generates all rotations of it. On the query side, you convert your query so it's always a
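A sketch of the permuterm expansion linked above: append a terminator "$" to each term and index every rotation, then rewrite a wildcard query like mo*zart into the prefix query zart$mo*. Class and method names here are illustrative, not from Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a permuterm expansion: index every rotation of term + "$",
// then any single-wildcard query can be answered as a prefix query.
public class Permuterm {
    public static List<String> rotations(String term) {
        String s = term + "$";                 // end-of-word marker
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i) + s.substring(0, i));
        }
        return out;
    }

    // Rewrite "pre*post" into the single prefix query "post$pre*"
    public static String rewriteWildcard(String query) {
        int star = query.indexOf('*');
        return query.substring(star + 1) + "$" + query.substring(0, star) + "*";
    }
}
```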

Payload Query and Document Boosts

2011-10-12 Thread Sujit Pal
Hi, Question about Payload Query and Document Boosts. We are using Lucene 3.2 and Payload queries, with our own PayloadSimilarity class which overrides the scorePayload method like so: {code} @Override public float scorePayload(int docId, String fieldName, int start, int end, byte[]

Re: How can i index a Java Bean into Lucene application ?

2011-08-07 Thread Sujit Pal
Depending on what you want to do with the Javabean (I assume you want to make some or all of its fields searchable since you are writing to Lucene), you could use reflection to break it up into field name/value pairs and write them out to the IndexWriter using something like this: Document d = new
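The reflection step described above can be sketched with plain stdlib reflection; in the Lucene version each (name, value) pair would become a Field added to a Document before writing to the IndexWriter, so a Map stands in for the Document here, and the Book bean is a hypothetical example class:

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: walk a bean's public getters and collect (fieldName, value) pairs.
public class BeanFields {
    public static Map<String, String> toFields(Object bean) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            if (name.startsWith("get") && m.getParameterCount() == 0
                    && !name.equals("getClass")) {
                // decapitalize "getTitle" -> "title"
                String field = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                Object value = m.invoke(bean);
                if (value != null) {
                    fields.put(field, value.toString());
                }
            }
        }
        return fields;
    }

    // Hypothetical bean used only for illustration
    public static class Book {
        public String getTitle() { return "Lucene in Action"; }
        public int getYear() { return 2010; }
    }
}
```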

Re: Suggestion: make some more TokenFilters KeywordAttribute aware

2011-06-23 Thread Sujit Pal
Simon Willnauer wrote: On Wed, Jun 22, 2011 at 8:53 PM, Sujit Pal s...@healthline.com wrote: Hello, I am currently in need of a LowerCaseFilter and StopFilter that will recognize KeywordAttribute, similar to the way PorterStemFilter currently does (on trunk). Specifically, in case

Suggestion: make some more TokenFilters KeywordAttribute aware

2011-06-22 Thread Sujit Pal
Hello, I am currently in need of a LowerCaseFilter and StopFilter that will recognize KeywordAttribute, similar to the way PorterStemFilter currently does (on trunk). Specifically, in case the term is a KeywordAttribute.isKeyword(), it should not be lowercased or removed, respectively. This can be

Re: Passage retrieval with Lucene-based application

2011-05-25 Thread Sujit Pal
Hi Leroy, Would it make sense to index the unit to be searched as separate Lucene documents? So if you want paragraphs to be shown in search results, you could parse the source document into paragraphs during indexing and index them as separate Lucene documents. -sujit On Wed, 2011-05-25 at 15:46 -0400,
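The indexing-time split suggested above can be sketched without any Lucene dependency: break the source text on blank lines and treat each paragraph as its own unit. In the real version each paragraph would become a separate Document (typically with a field pointing back to the parent document); a List of strings stands in for that here:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: split a source document on blank lines into paragraph units,
// dropping empty fragments, so each paragraph can be indexed separately.
public class ParagraphSplitter {
    public static List<String> paragraphs(String text) {
        return Arrays.stream(text.split("\\n\\s*\\n"))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }
}
```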

Re: FastVectorHighlighter - can FieldFragList expose fragInfo?

2011-05-24 Thread Sujit Pal
Thank you Koji. I opened LUCENE-3141 for this. https://issues.apache.org/jira/browse/LUCENE-3141 -sujit On Tue, 2011-05-24 at 22:33 +0900, Koji Sekiguchi wrote: (11/05/24 3:28), Sujit Pal wrote: Hello, My version: Lucene 3.1.0 I've had to customize the snippet for highlighting

FastVectorHighlighter - can FieldFragList expose fragInfo?

2011-05-23 Thread Sujit Pal
Hello, My version: Lucene 3.1.0 I've had to customize the snippet for highlighting based on our application requirements. Specifically, instead of the snippet being a set of relevant fragments in the text, I need it to be the first sentence where a match occurs, with a fixed size from the

Re: Reg: Query behavior

2011-04-26 Thread Sujit Pal
Hi Deepak, Would something like this work in your case? Arcos Bioscience^2.0 Arcos Bioscience ie, a BooleanQuery with the full phrase boosted OR'd with a query on each word? -sujit On Tue, 2011-04-26 at 14:46 -0400, Deepak Konidena wrote: Hi, Currently when I type in Arcos Bioscience in
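The query rewrite suggested above can be sketched as query-string construction (the same thing can be built programmatically with BooleanQuery). Quoting the phrase as a phrase query is an assumption here; the archived snippet shows the boost but may have lost the quotes:

```java
// Sketch: boost the exact phrase and OR it with the individual words,
// e.g. "Arcos Bioscience" -> "\"Arcos Bioscience\"^2.0 Arcos Bioscience".
public class PhraseBoost {
    public static String rewrite(String phrase, float boost) {
        return "\"" + phrase + "\"^" + boost + " " + phrase;
    }
}
```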

Re: Searching partial names using Lucene

2011-03-24 Thread Sujit Pal
I don't know if there is already an analyzer available for this, but you could use GATE or UIMA for Named Entity Extraction against names and expand the query to include the extra names that are used synonymously. You could do this outside Lucene or inline using a custom Lucene tokenizer that

Re: How to define different similarity scores per field ?

2011-03-01 Thread Sujit Pal
One way to do this currently is to build a per field similarity wrapper (that triggers off the field name). I believe there is some work going on with Lucene Similarity that would make it pluggable for this sort of stuff, but in the meantime, this is what I did: public class

Re: How to define different similarity scores per field ?

2011-03-01 Thread Sujit Pal
the other methods that are calculating the similarity scores... those methods are called and they have the implementation you have in DefaultSimilarityClass.. right ? On 1 March 2011 21:12, Sujit Pal sujit@comcast.net wrote: One way to do this currently is to build a per field