Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-07-02 Thread David Smiley
If there are no filters, then LatLonDocValuesField is going to be asked to sort all of your docs, which is obviously going to take a while. Can you simply add a filter? Like a distance filter using LatLonPoint? On Thu, Jun 29, 2017 at 11:49 AM sc wrote: > Hi, > >I have similar requirement o
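David's suggestion can be sketched as follows. This is a minimal, hedged sketch against the Lucene 6.x point/doc-values API, not code from the thread; the field name "location" and the coordinates are hypothetical, and it assumes the field was indexed as both a LatLonPoint and a LatLonDocValuesField:

```java
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class GeoFilterSketch {
    // Restrict candidates to a 50 m radius first, so the distance sort
    // only ever sees the few documents that survive the filter.
    public static Query distanceFilter(double lat, double lon) {
        return LatLonPoint.newDistanceQuery("location", lat, lon, 50.0);
    }

    // Sort the surviving documents by distance from the same point.
    public static Sort distanceSort(double lat, double lon) {
        SortField sf = LatLonDocValuesField.newDistanceSort("location", lat, lon);
        return new Sort(sf);
    }
}
```

With `searcher.search(distanceFilter(lat, lon), 5, distanceSort(lat, lon))` the doc-values sort then touches only the filtered candidates instead of every document in the index.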

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-29 Thread sc
Hi, I have similar requirement of searching points within a radius of 50m. Loaded 100M latlon, indexed/searching with LatLonDocValuesField. I am testing it on my macbook pro. I have used all Directory(RAM/FS/MMap) types but it takes 3-4 secs to do search/sort to return of 5 points with in rad

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-14 Thread David Smiley
Nice! On Tue, Jun 13, 2017 at 11:12 PM Tom Hirschfeld wrote: > Hey All, > > I was able to solve my problem a few weeks ago and wanted to update you > all. The root issue was with the caching mechanism in > "makedistancevaluesource" method in the lucene spatial module, it appears > that documents

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-13 Thread Tom Hirschfeld
Hey All, I was able to solve my problem a few weeks ago and wanted to update you all. The root issue was with the caching mechanism in "makedistancevaluesource" method in the lucene spatial module, it appears that documents were being pulled into the cache and not expired. To address this issue, w

Re: Term Dictionary taking up lots of heap memory, looking for solutions, lucene 5.3.1

2017-06-06 Thread David Smiley
I know I'm late to this thread, but I saw this and specifically "reverse geocoding" and it caught my attention. I recently did this on a public project with Solr, which you may find of interest: https://github.com/cga-harvard/hhypermap-bop/tree/master/enrich/solr-geo-admin I'm super pleased with t

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-18 Thread Uwe Schindler
Hi, Are you sure that the term index is the problem? Even with huge indexes you never need 65 gigs of heap! That's impossible. Are you sure that your problem is not something else?: - too large heap? Heaps greater than 31 gigs are bad by default. Lucene needs only a little heap, although you have lar

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-18 Thread Michael McCandless
That sounds like a fun amount of terms! Note that Lucene does not load all terms into memory; only the "prefix trie", stored as an FST ( http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html), mapping term prefixes to on-disk blocks of terms. FSTs are very compact data str

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Adrien Grand
Is upgrading to Lucene 6 and using points rather than terms an option? Points typically have lower memory usage (see GeoPoint which is based on terms vs LatLonPoint which is based on points at http://people.apache.org/~mikemccand/geobench.html#reader-heap). Le jeu. 18 mai 2017 à 02:35, Tom Hirschf

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-04 Thread Markus Jelsma
ent: Monday 1st May 2017 12:33 > To: java-user@lucene.apache.org; solr-user > Subject: RE: Term no longer matches if PositionLengthAttr is set to two > > Hello again, apologies for cross-posting and having to get back to this > unsolved problem. > > Initially i thought this

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-01 Thread Markus Jelsma
Hello again, apologies for cross-posting and having to get back to this unsolved problem. Initially i thought this is a problem i have with, or in Lucene. Maybe not, so is this problem in Solr? Is here anyone who has seen this problem before? Many thanks, Markus -Original message- > Fr

Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
Hi, I guess you are working with default techproducts. can you try using the terms request handler: query.setRequestHandler("terms") Ahmet On Friday, January 6, 2017 1:19 AM, huda barakat wrote: Thank you for fast reply, I add the query in the code but still not working:
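Ahmet's terms-handler suggestion, sketched with SolrJ. This is a hedged sketch, not code from the thread: it assumes a local Solr with the techproducts example and its stock /terms handler, and the field "name" is just an example:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsExample {
    // Build a request for the TermsComponent: per-term document frequencies
    // straight from the index, no main query needed.
    public static SolrQuery buildTermsQuery(String field) {
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/terms"); // route to the terms request handler
        query.setTerms(true);              // enable the terms component
        query.addTermsField(field);        // field whose terms we want
        query.setTermsLimit(10);
        return query;
    }

    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();
        QueryResponse response = client.query(buildTermsQuery("name"));
        TermsResponse terms = response.getTermsResponse();
        for (TermsResponse.Term t : terms.getTerms("name")) {
            System.out.println(t.getTerm() + " -> " + t.getFrequency());
        }
        client.close();
    }
}
```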

Re: term frequency in solr

2017-01-05 Thread huda barakat
Thank you for the fast reply, I added the query to the code but it is still not working: import java.util.List; import org.apache.solr.client.solrj.SolrClient; import org.apache.solr.client.solrj.SolrQuery; import org.apache.solr.client.solrj.SolrR

Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
Hi, I think you are missing the main query parameter? q=*:* By the way you may get more responses on the solr-user mailing list. Ahmet On Wednesday, January 4, 2017 4:59 PM, huda barakat wrote: Please help me with this: I have this code which returns term frequency from techproducts example:

Re: term frequency

2016-11-28 Thread huda barakat
This is the error I get; it is the same: Exception in thread "main" java.lang.NullPointerException at solr_test.solr.SolrJTermsApplication.main(SolrJTermsApplication.java:30) I know the object is null but I don't know why it is null?? when I change the query to this: SolrQuery query = new SolrQue

Re: term frequency

2016-11-24 Thread Jason Wee
the exception line does not match the code you pasted, but do make sure your object actually not null before accessing its method. On Thu, Nov 24, 2016 at 5:42 PM, huda barakat wrote: > I'm using SOLRJ to find term frequency for each term in a field, I wrote > this code but it is not working: > >

Re: Term query equivalent in Dimensional fields?

2015-12-27 Thread Michael McCandless
On Sun, Dec 27, 2015 at 1:31 AM, Ishan Chattopadhyaya wrote: > I'm trying: DimensionalRangeQuery.new1DIntRange(fname, 1, true, 1, true); Yes, that is the best way! Remember that dimensional values are trunk only (to be Lucene 6.0, hopefully soonish), and index file format is free to change on t

Re: Term vectors

2014-09-30 Thread Jack Krupansky
My Solr Deep Dive e-book has a whole chapter on the Solr term vector search component, which is based on the Lucene term vector support. It won't help you directly for Java coding, but the examples may help illustrate what this feature can do. See: http://www.lulu.com/us/en/shop/jack-krupansk

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha wrote: > Hi Adrien, > Thank you very much for the reply. > > I have two other small question about this: > 1) Is "final int freq = docsAndPositions.freq();" the same with > "iterator.totalTermFreq()" ? In my tests it returns the same result and from >

RE: Term vector Lucene 4.2

2013-04-02 Thread andi rexha
u...@gmail.com > Date: Tue, 2 Apr 2013 12:05:12 +0200 > Subject: Re: Term vector Lucene 4.2 > To: java-user@lucene.apache.org > > Hi Andi, > > Here is how you could retrieve positions from your document: > > Terms termVector = indexReader.getTermVector(docId, fieldN

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
Hi Andi, Here is how you could retrieve positions from your document: Terms termVector = indexReader.getTermVector(docId, fieldName); TermsEnum reuse = null; TermsEnum iterator = termVector.iterator(reuse); BytesRef ref = null; DocsAndPositionsEnum docsAndPositions = null;
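The snippet above can be completed along these lines. This is a hedged reconstruction of the same pattern against the Lucene 4.x term-vector API, not necessarily Adrien's exact code:

```java
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermVectorPositions {
    // Walk a stored term vector and print each term with its positions and offsets.
    public static void dump(IndexReader indexReader, int docId, String fieldName)
            throws Exception {
        Terms termVector = indexReader.getTermVector(docId, fieldName);
        if (termVector == null) return;                 // field was not indexed with term vectors
        TermsEnum iterator = termVector.iterator(null); // 4.x signature takes a reuse argument
        BytesRef ref;
        DocsAndPositionsEnum docsAndPositions = null;
        while ((ref = iterator.next()) != null) {
            docsAndPositions = iterator.docsAndPositions(null, docsAndPositions);
            docsAndPositions.nextDoc();                 // a term vector is a single-doc "index"
            int freq = docsAndPositions.freq();         // within-document frequency of this term
            for (int i = 0; i < freq; i++) {
                int position = docsAndPositions.nextPosition();
                System.out.println(ref.utf8ToString() + " @ pos " + position
                        + " [" + docsAndPositions.startOffset()
                        + "," + docsAndPositions.endOffset() + "]");
            }
        }
    }
}
```

Note that freq() here is the within-document frequency for the single document the vector describes, which is why it matches iterator.totalTermFreq() in Andi's follow-up question.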

Re: Term Positions added to one document forward

2012-10-30 Thread Ivan Vasilev
Thanks Simon! On 29.10.2012 г. 21:38, Simon Willnauer wrote: you should call currDocsAndPositions.nextPosition() before you call currDocsAndPositions.getPayload() payloads are per positions so you need to advance the pos first! simon On Mon, Oct 29, 2012 at 6:44 PM, Ivan Vasilev wrote: Hi G

Re: Term Positions added to one document forward

2012-10-29 Thread Simon Willnauer
you should call currDocsAndPositions.nextPosition() before you call currDocsAndPositions.getPayload() payloads are per positions so you need to advance the pos first! simon On Mon, Oct 29, 2012 at 6:44 PM, Ivan Vasilev wrote: > Hi Guys, > > I use the following code to index documents and set Pa
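Simon's point as a tiny hedged sketch (variable names are hypothetical): the payload is attached to the current position, so nextPosition() has to come first:

```java
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.util.BytesRef;

public class PayloadAccess {
    // Payloads are stored per position, so advance to a position before asking for one.
    public static BytesRef firstPayload(DocsAndPositionsEnum positions) throws Exception {
        positions.nextDoc();           // land on a document first
        positions.nextPosition();      // REQUIRED: getPayload() is relative to the current position
        return positions.getPayload(); // may be null if this position carries no payload
    }
}
```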

Re: term frequency on a particular query

2011-06-07 Thread Ian Lea
http://www.gossamer-threads.com/lists/lucene/java-user/86299 looks relevant. -- Ian. On Tue, Jun 7, 2011 at 10:05 AM, G.Long wrote: > Hi :) > > In my index, there are documents like : > > doc { question: 1, response: 1, word: excellent } > doc { question 1, response: 1, word: great } > doc { q

Re: term vector - WITH_POSITIONS_OFFSETS vs YES in terms of search performance

2010-11-30 Thread Michael McCandless
The performance impact should only be at indexing time, unless you actually retrieve the vectors for some number of hits at search time. Mike On Tue, Nov 30, 2010 at 2:28 PM, Maricris Villareal wrote: > Hi, > > Could someone tell me the effect (if any) of having term vectors set to > WITH_POSITI

RE: Term browsing much slower in Lucene 3.x.x

2010-07-30 Thread Nader, John P
sary in other API calls. BTW, that environment is Java 1.6.0_12 on 64-bit SUSE Linux with 32G of RAM and using MMapDirectory. Thanks. -John -Original Message- From: Nader, John P [mailto:john.na...@cengage.com] Sent: Thursday, July 29, 2010 5:49 PM To: java-user@lucene.apache.org Sub

RE: Term browsing much slower in Lucene 3.x.x

2010-07-29 Thread Chris Hostetter
: > My other question is whether there are planned performance : > enhancements to address this loss of performance? : : These APIs are very different in the next major release (4.0) of : Lucene, so except for problems spotted by users like you, there's not : much more dev happening against them

RE: Term browsing much slower in Lucene 3.x.x

2010-07-29 Thread Nader, John P
the added synchronization. I don't think is waiting on locks, but rather the memory flush and loading that goes on. -John -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, July 29, 2010 5:55 AM To: java-user@lucene.apache.org Subject: Re:

Re: Term browsing much slower in Lucene 3.x.x

2010-07-29 Thread Michael McCandless
On Wed, Jul 28, 2010 at 2:39 PM, Nader, John P wrote: > We recently upgraded from lucene 2.4.0 to lucene 3.0.2.  Our load testing > revealed a serious performance drop specific to traversing the list of terms > and their associated documents for a given indexed field.  Our code looks > somethin

Re: Term/Phrase frequencies

2010-05-07 Thread Erick Erickson
Well, counting frequency isn't the best approach. For instance, if a field has 1,000 terms and 10 occurrences of your target, is that a better match than a field with 10 terms and 5 occurrences of your target? This kind of thing is already taken into account with Lucene scoring, you might want to

Re: Term/Phrase frequencies

2010-05-06 Thread manjula wijewickrema
Hi Erik, Thanks for the reply. What I want to do is, to identify key terms and key phrases of a document according to their number of occurrences in the document. Output should be the highest frequency words and (two or three word) phrases. For this purpose can I use Lucene? Thanks Manjula On Th

Re: Term/Phrase frequencies

2010-05-06 Thread Erick Erickson
Terms are relatively easy, see TermFreqVector in the JavaDocs. Phrases aren't as easy, before you go there, though, what is the high-level problem you're trying to solve? Possibly this is an XY problem (see http://people.apache.org/~hossman/#xyproblem). Best Erick On Thu, May 6, 2010 at 6:39 AM,
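For the terms side, a minimal sketch of the TermFreqVector approach Erick mentions. This assumes the Lucene 2.x/3.x API of the time and that the field was indexed with term vectors enabled; it is illustrative, not code from the thread:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TopTerms {
    // Print each term of a document's field with its in-document frequency.
    // Requires the field to have been indexed with Field.TermVector.YES.
    public static void print(IndexReader reader, int docId, String field) throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) return;                // no term vector stored for this field
        String[] terms = tfv.getTerms();        // the field's terms, sorted
        int[] freqs = tfv.getTermFrequencies(); // parallel array of counts
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " : " + freqs[i]);
        }
    }
}
```

Sorting the parallel arrays by frequency then yields the "highest frequency words" Manjula asked about; phrases would still need shingles or a PhraseQuery per candidate.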

RE: Term offsets for highlighting

2010-04-27 Thread Stephen Greene
: Monday, April 26, 2010 10:55 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > Thank you. I implemented a solution based on the FieldTermStackTest.java > and if I do a search like "iron ore" it matches iron or o

Re: Term offsets for highlighting

2010-04-26 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, Thank you. I implemented a solution based on the FieldTermStackTest.java and if I do a search like "iron ore" it matches iron or ore. The same is true if I specify iron AND ore. The termSetMap[0].value[0] = ore and termSetMap[0].value[1] = iron. What am I missing

RE: Term offsets for highlighting

2010-04-26 Thread Stephen Greene
tIndexReader(), pintDocId, fieldName); -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Saturday, April 24, 2010 5:18 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Hi Steve, > is there a way to access a TermVector containin

Re: Term offsets for highlighting

2010-04-24 Thread Koji Sekiguchi
Hi Steve, > is there a way to access a TermVector containing only matched terms, > or is my previous approach still the So you want to access FieldTermStack, I understand. The way to access it, I wrote it at previous mail: You cannot access FieldTermStack from FVH, but I think you can create i

RE: Term offsets for highlighting

2010-04-22 Thread Stephen Greene
ay, April 19, 2010 9:02 PM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > An additional question. Is it possible to access the FieldTermStack from > the FastVectorHighlighter after the it has been populated with matching

Re: Term offsets for highlighting

2010-04-19 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, An additional question. Is it possible to access the FieldTermStack from the FastVectorHighlighter after the it has been populated with matching terms from the field? I think this would provide an ideal solution for this problem, as ultimately I am only concerned

RE: Term offsets for highlighting

2010-04-19 Thread Stephen Greene
positional offsets to have highlighting tags applied to them in a separate process. Thank you for your insight, Steve -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Sunday, April 18, 2010 10:42 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for

RE: Term offsets for highlighting

2010-04-19 Thread Stephen Greene
Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > Thank you for your reply. I did try the QueryScorer without success, but > I was using Lucene 2.4.x > Hi Steve, I thought you were using 2.9 or later because you mentioned FastVectorHighlighter in you

Re: Term offsets for highlighting

2010-04-18 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, Thank you for your reply. I did try the QueryScorer without success, but I was using Lucene 2.4.x Hi Steve, I thought you were using 2.9 or later because you mentioned FastVectorHighlighter in your previous mail (FVH was first introduced in 2.9). If I remembe

RE: Term offsets for highlighting

2010-04-18 Thread Stephen Greene
...@r.email.ne.jp] Sent: Friday, April 16, 2010 9:49 PM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hello, > > > > I am trying to determine begin and end offsets for terms and phrases > matching a query. > > Is there a way usin

Re: Term offsets for highlighting

2010-04-16 Thread Koji Sekiguchi
Stephen Greene wrote: Hello, I am trying to determine begin and end offsets for terms and phrases matching a query. Is there a way using either the highlighter or fast vector highlighter in contrib? I have already attempted extending the highlighter which would match terms but would not

Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
What are the associated Analyzers for your Gene and Token? Because if they're NOT something akin to KeywordAnalyzer, you have a problem. Specifically, most of the "regular" tokenizers will break this stream up into three separate terms, "brain", "natriuetic", and "peptide". If that's the case, the

Re: Term Frequency for phrases

2010-01-08 Thread Jason Rutherglen
I'm not going to go into too much code level detail, however I'd index the phrases using tri-gram shingles, and as uni-grams. I think this'll give you the results you're looking for. You'll be able to quickly recall the count of a given phrase aka tri-gram such as "blue_shorts_burough" On Fri, J

Re: Term Frequency for phrases

2010-01-08 Thread hrishim
@All : Elaborating the problem The phrase is being indexed as a single token ... I have a Gene tag in the xml document which is like brain natriuretic peptide This phrase is present in the abstract text for the given document . Code is as : doc.add(new Field("Gene", geneName, Field.Store.YES

Re: Term Frequency for phrases

2010-01-08 Thread Grant Ingersoll
When do you detect that they are phrases? During indexing or during search? On Jan 8, 2010, at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single token > using Lucene. > When I calculate the term frequency for the same the count is 0 since the

Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
On a quick read, your statements are contradictory. Either "brain natriuretic peptide" is a single token/term or it's not. Are you sure you're not confusing indexing and storing? What analyzer are you using at index time? Erick On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:

Re: Term Frequency for phrases

2010-01-08 Thread Michael McCandless
Issue a PhraseQuery and count how many hits came back? Is that too slow? If so, you could detect all phrases during indexing and add them as tokens to the index? Mike On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single tok
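Mike's first suggestion, sketched as a hedged example against the pre-5.x mutable PhraseQuery API (the field name is hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;

public class PhraseCount {
    // Build an exact-adjacency phrase query for "brain natriuretic peptide".
    public static PhraseQuery buildQuery(String field) {
        PhraseQuery query = new PhraseQuery(); // slop defaults to 0: terms must be adjacent
        query.add(new Term(field, "brain"));
        query.add(new Term(field, "natriuretic"));
        query.add(new Term(field, "peptide"));
        return query;
    }

    // totalHits counts every matching document even though we only fetch one.
    public static int count(IndexSearcher searcher, String field) throws Exception {
        return searcher.search(buildQuery(field), 1).totalHits;
    }
}
```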

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch wrote: > Well already, without doing any boosting, documents matching more of the > > terms > > in your query will score higher. If you really want to make this effect > > more > > pronounced, yes, you can boost the more important query terms higher. >

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
Well already, without doing any boosting, documents matching more of the > terms > in your query will score higher. If you really want to make this effect > more > pronounced, yes, you can boost the more important query terms higher. > > -jake > But there isn't a way to determine exactly what bo

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch wrote: > > > Now, I would like to know exactly what term was found. For example, if > a > > > result comes back from the query above, how do I know whether John > Smith > > > was > > > found, or both John Smith and his company, or just John Smith > > Ma

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> > Now, I would like to know exactly what term was found. For example, if a > > result comes back from the query above, how do I know whether John Smith > > was > > found, or both John Smith and his company, or just John Smith > Manufacturing > > was found? > > > In general, this is actually very

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch wrote: > > query: "San Francisco" "California" +("John Smith" "John Smith > > Manufacturing") > > > > Here the San Fran and CA clauses are optional, and the ("John Smith" OR > > "John Smith Manufacturing") is required. > > > > Thanks Jake, that works nic

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> query: "San Francisco" "California" +("John Smith" "John Smith > Manufacturing") > > Here the San Fran and CA clauses are optional, and the ("John Smith" OR > "John Smith Manufacturing") is required. > Thanks Jake, that works nicely. Now, I would like to know exactly what term was found. For e

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Did I do that wrong? I always mess up the AND/OR human-readable form of this - it's clearer when you use +/- unary operators instead: query: "San Francisco" "California" +("John Smith" "John Smith Manufacturing") Here the San Fran and CA clauses are optional, and the ("John Smith" OR "John Smith

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> You want a query like > > ("San Francisco" OR "California") AND ("John Smith" OR "John Smith > Manufacturing") > Won't his require San Francisco or California to be present? I do not require them to be, I only require "John Smith" OR "John Smith Manufacturing", but I want to get a bigger scor

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Hi Max, You want a query like ("San Francisco" OR "California") AND ("John Smith" OR "John Smith Manufacturing") essentially? You can give Lucene exactly this query and it will require that either "John Smith" or "John Smith Manufacturing" be present, but will score results which have these
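Jake's query can also be built programmatically rather than through the query parser. A sketch against the classic (pre-5.x, mutable) BooleanQuery API, with a hypothetical field name:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class RequiredDisjunction {
    static PhraseQuery phrase(String field, String... words) {
        PhraseQuery pq = new PhraseQuery();
        for (String w : words) pq.add(new Term(field, w));
        return pq;
    }

    // Location clauses are optional and only raise the score;
    // the nested ("John Smith" OR "John Smith Manufacturing") clause is required.
    public static BooleanQuery build(String field) {
        BooleanQuery names = new BooleanQuery();
        names.add(phrase(field, "john", "smith"), Occur.SHOULD);
        names.add(phrase(field, "john", "smith", "manufacturing"), Occur.SHOULD);

        BooleanQuery query = new BooleanQuery();
        query.add(phrase(field, "san", "francisco"), Occur.SHOULD); // optional booster
        query.add(phrase(field, "california"), Occur.SHOULD);       // optional booster
        query.add(names, Occur.MUST);                               // at least one name phrase
        return query;
    }
}
```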

Re: Term Extraction

2009-08-13 Thread Grant Ingersoll
I would just throw your doc into a MemoryIndex (lives in contrib/ memory, I think; it only holds one doc), get the Vector and do what you need to do. So you would kind of be doing indexing, but not really. On Aug 13, 2009, at 8:43 AM, joe_coder wrote: Grant, thanks for responding. My i

Re: Term Extraction

2009-08-13 Thread joe_coder
For example, I am able to do Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here")); Token t = ts.next(); while (t != null) { System.out.println("token: " + t); t
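The Token-based loop above was already deprecated at the time; the same tally can be written with the 2.9 attribute API. A hedged sketch (TermAttribute was later replaced by CharTermAttribute, and the field name "myfield" is only a placeholder):

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TermCounter {
    // Tokenize a string and tally term frequencies without building an index.
    public static Map<String, Integer> count(String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
        TokenStream ts = analyzer.tokenStream("myfield", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        while (ts.incrementToken()) { // pull-style iteration replaces ts.next()
            String t = term.term();
            Integer n = freqs.get(t);
            freqs.put(t, n == null ? 1 : n + 1);
        }
        ts.close();
        return freqs;
    }
}
```

This handles stemming/stopwords via whatever Analyzer is plugged in, which is the advantage over a simple split on spaces.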

Re: Term Extraction

2009-08-13 Thread joe_coder
Grant, thanks for responding. My issue is that I am not planning to use lucene (as I don't need any search capability, at least yet). All I have is a text document and I need to extract keywords and their frequency (which could be a simple split on space and tracking the count). But I realize th

Re: Term Extraction

2009-08-13 Thread Grant Ingersoll
On Aug 13, 2009, at 7:40 AM, joe_coder wrote: I was wondering if there is any way to directly use Lucene API to extract terms from a given string. My requirement is that I have a text document for which I need a term frequency vector ( after stemming, removing stopwords and synonyms che

Re: term query boost problem

2009-08-13 Thread Simon Willnauer
Christian, if you haven't done so you might find Luke (http://www.getopt.org/luke/) very helpful to see what has been indexed and how. simon On Thu, Aug 13, 2009 at 6:10 AM, Christian Bongiorno wrote: > turns out the index is being built with lower-case terms which is why we > aren't getting hits

Re: term query boost problem

2009-08-12 Thread Christian Bongiorno
turns out the index is being built with lower-case terms which is why we aren't getting hits the way we expect. When I change my search terms to lower I see more of what I expect. Gonna keep working on this and post updates. On Wed, Aug 12, 2009 at 12:46 PM, Christian Bongiorno < christ...@bongio

Re: term query boost problem

2009-08-12 Thread Grant Ingersoll
You have a bunch of log statements in there, what are they printing out? Also, IndexSearcher.explain() is your friend for understanding why a doc matched the way it did. On Aug 12, 2009, at 3:46 PM, Christian Bongiorno wrote: I have a situation where I have a series of terms queries as par

Re: Term Frequency vector consumes memory

2009-07-02 Thread Grant Ingersoll
ant Ingersoll" To: Sent: Tuesday, June 30, 2009 9:48 PM Subject: Re: Term Frequency vector consumes memory In Lucene, a Term Vector is a specific thing that is stored on disk when creating a Document and Field. It is optional and off by default. It is separate from being able to get th

Re: Term Frequency vector consumes memory

2009-06-30 Thread Ganesh
er to load term vector. I want to switch off this feature? Is that possible without re-indexing? Regards Ganesh - Original Message - From: "Grant Ingersoll" To: Sent: Tuesday, June 30, 2009 9:48 PM Subject: Re: Term Frequency vector consumes memory > In Lucene, a Term Ve

Re: Term Frequency vector consumes memory

2009-06-30 Thread Grant Ingersoll
In Lucene, a Term Vector is a specific thing that is stored on disk when creating a Document and Field. It is optional and off by default. It is separate from being able to get the term frequencies for all the docs in a specific field. The former is decided at indexing time and there is

Re: Term frequencies within a search

2009-05-22 Thread Robert Young
For all the docs, and in fact, I think it might be the document frequency. Basically I need to be able to do a query and get a list of terms with how many documents in the result set contain that term. I'm not so worried about how often the term appears in each document. Thanks Rob On Thu, May 21

Re: Term frequencies within a search

2009-05-21 Thread Michael McCandless
This is often requested, but Lucene doesn't make it easy. I'd love for someone to come up and build this feature :) Do you need term freqs for just the top N that were collected? Or for all docs that matched the query? Mike On Thu, May 21, 2009 at 6:34 AM, Robert Young wrote: > Hi, > I would

Re: Term Limit?

2009-04-04 Thread Michael McCandless
OK I opened https://issues.apache.org/jira/browse/LUCENE-1586 to track this. Thanks deminix! Mike - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.or

Re: Term Limit?

2009-04-04 Thread deminix
Ah yes. I'd be happy with the ability to monitor it for now. Assuming it is too involved to remove the limitation. For all practical purposes we should only be using, worst case, 10% of the term space today. That happens to make it risky enough that it needs an eye kept on it, as this will be o

Re: Term Limit?

2009-04-04 Thread Michael McCandless
On Sat, Apr 4, 2009 at 11:57 AM, deminix wrote: > Yea.  That is all that matters anyway right, is the limit at the segment > level? Well... the problem is when merges kick off. You could have N segments that each are below the limit, but when a merge runs the merged segment would try to exceed t

Re: Term Limit?

2009-04-04 Thread Michael McCandless
On Sat, Apr 4, 2009 at 11:52 AM, deminix wrote: > My crude regex'ing of the code has me thinking it is only term vectors that > are limited to 32 bits, since they allocate arrays.  Otherwise it seems > good.  Does that sound right? Not quite... SegmentTermEnum.seek takes "int p". TermInfosReader

Re: Term Limit?

2009-04-04 Thread deminix
Yea. That is all that matters anyway right, is the limit at the segment level? On Sat, Apr 4, 2009 at 8:44 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Sat, Apr 4, 2009 at 10:25 AM, deminix wrote: > > > AFAIK there isn't an api that returns the current number of terms, > cor

Re: Term Limit?

2009-04-04 Thread deminix
My crude regex'ing of the code has me thinking it is only term vectors that are limited to 32 bits, since they allocate arrays. Otherwise it seems good. Does that sound right? On Sat, Apr 4, 2009 at 7:25 AM, deminix wrote: > Thanks for the clarification. > > I'm partitioning the document spac

Re: Term Limit?

2009-04-04 Thread Michael McCandless
On Sat, Apr 4, 2009 at 10:25 AM, deminix wrote: > AFAIK there isn't an api that returns the current number of terms, correct? Alas, no. This limitation has been talked about before... maybe we should add it. But: it's not actually simple to compute, at the MultiSegmentReader level. Each Segme

Re: Term Limit?

2009-04-04 Thread deminix
Thanks for the clarification. I'm partitioning the document space, so I'm not really concerned about the fact documents are ints. Some fields have very unique value spaces though (and many values per document), and they don't align to the same way the documents are partitioned so may have a very

Re: Term Limit?

2009-04-04 Thread Michael McCandless
Correct, and, not that I know of. Mike On Sat, Apr 4, 2009 at 7:55 AM, Murat Yakici wrote: > > I assume the total number of documents that you can index is also limited > by Java max int. Is this correct? Is there any way to index documents > beyond this number in a single index? > > Murat > > >

Re: Term Limit?

2009-04-04 Thread Murat Yakici
I assume the total number of documents that you can index is also limited by Java max int. Is this correct? Is there any way to index documents beyond this number in a single index? Murat > I tentatively think you are correct: the file format itself does not > impose this limitation. > > But in

Re: Term Limit?

2009-04-04 Thread Michael McCandless
I tentatively think you are correct: the file format itself does not impose this limitation. But in a least a couple places internally, Lucene uses a java int to hold the term number, which is actually a limit of 2,147,483,648 terms. I'll update fileformats.html for 2.9. Mike On Sat, Apr 4, 200

Re: Term level boosting

2009-03-25 Thread Grant Ingersoll
In contrib/analysis there are also some TokenFilters that provide examples of using Payloads. See the org.apache.lucene.analysis.payloads package: http://lucene.apache.org/java/2_4_1/api/contrib-analyzers/org/apache/lucene/analysis/payloads/package-summary.html -Grant On Mar 24, 2009, at 4

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
Seid Mohammed wrote: ok, but I need to know how to proceed with it. I mean how to include to my application many thanks Seid M You may want to look at the following articles: http://lucene.jugem.jp/?eid=133 http://lucene.jugem.jp/?eid=134 articles are in Japanese, but ignore them. :) Pro

Re: Term level boosting

2009-03-24 Thread Seid Mohammed
ok, but I need to know how to proceed with it. I mean how to include to my application many thanks Seid M On 3/24/09, Koji Sekiguchi wrote: > Seid Mohammed wrote: >> Hi All >> I want my lucene to index documents and making some terms to have more >> boost value. >> so, if I index the document "

Re: Term level boosting

2009-03-24 Thread Koji Sekiguchi
Seid Mohammed wrote: Hi All I want my lucene to index documents and making some terms to have more boost value. so, if I index the document "The quick fox jumps over the lazy dog" and I want the term fox and dog to have greater boost value. How can I do that Thanks a lot seid M How about

Re: term position in phrase query using queryparser

2009-03-02 Thread Matt Ronge
On Feb 25, 2009, at 2:52 PM, Tim Williams wrote: Is there a syntax to set the term position in a query built with queryparser? For example, I would like something like: PhraseQuery q = new PhraseQuery(); q.add(t1, 0); q.add(t2, 0); q.setSlop(0); As I understand it, the slop defaults to 0, bu

Re: Term precendence

2009-02-15 Thread Yonik Seeley
On Sun, Feb 15, 2009 at 10:50 AM, Joel Halbert wrote: > When constructing a query, using a series of terms e.g. > > Term1=X, Term2=Y etc... > > does it make sense, like in sql, to place to most restrictive term query > first? > > i.e. if I know that the query will be mainly constrained by the valu

Re: term frequency normalization

2009-02-12 Thread Chris Hostetter
: The easiest way to change the tf calculation would be overwriting : tf in an own implementation of Similarity like it's done in : SweetSpotSimilarity. But the average term frequency of the : document is missing. Is there a simple way to get or calc this : number? there was quite a bit of discus

Re: term offsets info seems to be wrong...

2009-01-16 Thread Koji Sekiguchi
Mark, This is exactly what I want and it worked perfectly. Thanks! I'll post my highlighter to JIRA in a few days (hopefully). It uses term offsets with positions (WITH_POSITIONS_OFFSETS) to support PhraseQuery. Thanks again, Koji Mark Miller wrote: Okay, Koji, hopefully I'll be luckier

Re: term offsets info seems to be wrong...

2009-01-16 Thread Mark Miller
Okay, Koji, hopefully I'll be luckier suggesting this this time. Have you tried http://issues.apache.org/jira/browse/LUCENE-1448 yet? I am not sure if it's in an applyable state, but I hope that covers your issue. On Fri, Jan 16, 2009 at 7:15 PM, Koji Sekiguchi wrote: > Hello, > > I'm writi

Re: Term Frequency and IndexSearcher

2009-01-16 Thread Chris Hostetter
: References: : : <1998.130.159.185.12.1232021837.squir...@webmail.cis.strath.ac.uk> : Date: Thu, 15 Jan 2009 04:49:49 -0800 (PST) : Subject: Term Frequency and IndexSearcher http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion

Re: Term Frequency and IndexSearcher

2009-01-15 Thread Murat Yakici
Hi Paul, I am tempted to suggest the following (I am assuming here that the document and the particular fields are TFVed when indexing): For every doc in the result set: - get the doc id - using the doc id, get the TermFreqVector of this document from the index reader (tfv=ireader.getTermFr

Re: Term numbering and range filtering

2008-11-19 Thread Paul Elschot
Tim, On Wednesday 19 November 2008 02:32:40 Tim Sturge wrote: ... > >> > >> This is less than 2x slower than the dedicated bitset and more > >> than 50x faster than the range boolean query. > >> > >> Mike, Paul, I'm happy to contribute this (ugly but working) code > >> if there is interest. Let

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
> With "Allow Filter as clause to BooleanQuery": > https://issues.apache.org/jira/browse/LUCENE-1345 > one could even skip the ConstantScoreQuery with this. > Unfortunately 1345 is unfinished for now. > That would be interesting; I'd like to see how much performance improves. >> startup: 2811

Re: Term numbering and range filtering

2008-11-18 Thread Paul Elschot
On Wednesday 19 November 2008 00:43:56 Tim Sturge wrote: > I've finished a query time implementation of a column stride filter, > which implements DocIdSetIterator. This just builds the filter at > process start and uses it for each subsequent query. The index itself > is unchanged. > > The resul

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index:
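The pattern Tim describes — build the filter once at process start, then reuse it for every query — can be illustrated without Lucene using `java.util.BitSet` in place of a `DocIdSetIterator`. A simplified sketch; all names here are illustrative, not the actual implementation:

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Simplified illustration of a precomputed per-document filter: the
// bitset is built once at startup and reused for every query.
// BitSet.nextSetBit plays the role of DocIdSetIterator.advance here.
public class ColumnStrideFilter {
    private final BitSet matching;

    public ColumnStrideFilter(int maxDoc, IntPredicate accepts) {
        matching = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (accepts.test(doc)) matching.set(doc);
        }
    }

    // Analogue of DocIdSetIterator.advance(): first matching doc >= target,
    // or -1 when the iterator is exhausted.
    public int nextDoc(int target) {
        return matching.nextSetBit(target);
    }
}
```

Because the per-query cost is just bitset iteration, this is consistent with the large speedups reported over re-evaluating a range boolean query each time.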

Re: term offsets wrong depending on analyzer

2008-11-11 Thread Michael McCandless
Just to followup... I opened these three issues: https://issues.apache.org/jira/browse/LUCENE-1441 (fixed in 2.9) https://issues.apache.org/jira/browse/LUCENE-1442 (fixed in 2.9) https://issues.apache.org/jira/browse/LUCENE-1448 (still iterating) Mike Christian Reuschling wrote: Hi Guy

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Paul Elschot wrote: On Tuesday 11 November 2008 21:55:45 Michael McCandless wrote: Also, one nice optimization we could do with the "term number column-stride array" is do bit packing (borrowing from the PFOR code) dynamically. Ie since we know there are X unique terms in this segment, whe

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 21:55:45 Michael McCandless wrote: > Also, one nice optimization we could do with the "term number > column-stride array" is do bit packing (borrowing from the PFOR code) > dynamically. > > Ie since we know there are X unique terms in this segment, when > populating t

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Also, one nice optimization we could do with the "term number column-stride array" is do bit packing (borrowing from the PFOR code) dynamically. Ie since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the r
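The optimization above — with X unique terms in a segment, each docID-to-term-number entry needs only ceil(log2(X)) bits — can be sketched as a packed array of fixed-width values. This is hypothetical standalone code, not the Lucene/PFOR implementation:

```java
// Sketch of dynamic bit packing: with X unique terms, each term ordinal
// fits in ceil(log2(X)) bits, packed back to back into a long[].
public class PackedOrds {
    private final long[] blocks;
    private final int bitsPerValue;

    // Minimum bits needed to represent values in [0, maxValue].
    public static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    public PackedOrds(int[] ords, int numUniqueTerms) {
        bitsPerValue = bitsRequired(numUniqueTerms - 1);
        blocks = new long[(ords.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < ords.length; i++) set(i, ords[i]);
    }

    private void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        if (shift + bitsPerValue > 64) {          // value spans two blocks
            blocks[block + 1] |= value >>> (64 - shift);
        }
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & ((1L << bitsPerValue) - 1);
    }
}
```

With 8 unique terms this stores each ordinal in 3 bits instead of a 32-bit int, roughly a 10x reduction for that case.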

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 11:29:27 Michael McCandless wrote: > > The other part of your proposal was to somehow "number" term text > such that term range comparisons can be implemented as fast int > comparisons. ... > >http://fontoura.org/papers/paramsearch.pdf > > However that'd be quite a bit
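The term-numbering idea in this exchange — assign terms ordinals in sorted order so that a term range test reduces to two int comparisons per document — can be sketched in a few lines. The class and method names below are illustrative only:

```java
import java.util.Arrays;

// Sketch of term numbering: terms sorted lexicographically get ordinals
// 0..X-1, so "term in [lower, upper]" becomes an int range test against
// a per-document ordinal instead of repeated string comparisons.
public class TermOrdinals {
    private final String[] sortedTerms; // unique terms, sorted

    public TermOrdinals(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    // Ordinal of the first term >= text (insertion point on a miss).
    public int ceilingOrd(String text) {
        int idx = Arrays.binarySearch(sortedTerms, text);
        return idx >= 0 ? idx : -idx - 1;
    }

    // Ordinal of the last term <= text, or -1 if none.
    public int floorOrd(String text) {
        int idx = Arrays.binarySearch(sortedTerms, text);
        return idx >= 0 ? idx : -idx - 2;
    }

    // The range check itself: two int comparisons per document.
    public boolean inRange(int docOrd, String lower, String upper) {
        return docOrd >= ceilingOrd(lower) && docOrd <= floorOrd(upper);
    }
}
```

The string-to-ordinal lookups happen once per query; the per-document work is then purely integer arithmetic, which is what makes this attractive for range filtering over many documents.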
