Re: i'm Lucene beginner. help me

2012-06-26 Thread Adrien Grand
Hi kjysmu, On Tue, Jun 26, 2012 at 11:22 AM, kjysmu wrote: > What i want with lucene is that i wanna get it's image ids for certain query > (tag) > > how can i implement it using Lucene with Java? I moved the discussion to java-user@lucene instead of dev@lucene since your question is not related

Re: Lucene 4.0.0 - find term position.

2012-12-07 Thread Adrien Grand
Hi Vitaly, On Fri, Dec 7, 2012 at 3:24 PM, wrote: > I try to use or Terms tfvector = reader.getTermVector(docId, "contents"); > or Fields fields = reader.getTermVectors(docId); > but I get null from these calls. > What is wrong? These methods will always return null unless you turn term vect

Re: StoredFieldsFormat / documentation

2013-01-24 Thread Adrien Grand
Hi Bernd, On Thu, Jan 24, 2013 at 11:55 AM, Bernd Müller wrote: > Hi Simon, > >> you mean where it is used? Look at the org.apache.lucene.codecs.Codec >> class, it has a method: >> >> public abstract StoredFieldsFormat storedFieldsFormat(); >> >> which returns a stored fields format used to enc

Re: Need help regarding understanding internals of Lucene Index.

2013-01-25 Thread Adrien Grand
Hi Vignesh, This is a very broad question! The following links might help you: - Lucene documentation: http://lucene.apache.org/core/4_1_0/index.html - File formats: http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/codecs/lucene41/package-summary.html#package_description - The block t

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand
Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, or

Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-29 Thread Adrien Grand
Arun, Lucene uses a very light compression algorithm so I'm a little surprised it can make indexing 2x slower. Could you run indexing under a profiler to make sure it really is what makes indexing slower? Thanks! -- Adrien

Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-30 Thread Adrien Grand
On Wed, Jan 30, 2013 at 8:08 AM, arun k wrote: > Adrein, > > I have created an index of size 370M of 1 million docs of 40 fields of 40 > chars and did the profiling. > I see that the indexing and in particular the addDocument & > ConcurrentMergeScheduler in 4.1 takes double the time compared to 3.

Re: Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread Adrien Grand
Hi, On Fri, Feb 1, 2013 at 6:51 PM, saisantoshi wrote: > Prior to 4.0, there was an optimize() in the IndexWriter which was merging > the index files. Is there any settings that can be done on the > TieredMergePolicy so that I want to limit the number of files produced > during the indexing. Seg

Re: updateDocument question

2013-02-06 Thread Adrien Grand
Hi Thomas, On Wed, Feb 6, 2013 at 2:50 PM, Becker, Thomas wrote: > I've built a search prototype feature for my application using Lucene, and it > works great. The application monitors a remote system and currently indexes > just a few core attributes of the objects on that system. I get > n

Re: updateDocument question

2013-02-07 Thread Adrien Grand
On Thu, Feb 7, 2013 at 1:54 PM, Becker, Thomas wrote: > Thanks for the response Adrien. I guess I'll just leave things as they are > for now. To be clear though, do merged segments get cleaned up completely > even if the IndexWriter is never closed? The way it works is that indexing data crea

Re: Indexing directly from stdin in lucene 3.5

2013-02-19 Thread Adrien Grand
Hi, On Tue, Feb 19, 2013 at 11:04 AM, A. L. Benhenni wrote: > I am currently writing an indexer class to index texts from stdin. I also > need the text to be tokenized and stored to access the termvector of the > document. Actually, you don't need to store documents to access their term vectors,

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Adrien Grand
Hi Steve, On Mon, Mar 25, 2013 at 4:16 AM, Steve Rowe wrote: > Please request either on the java-user@lucene.apache.org or on > d...@lucene.apache.org to have your wiki username added to the > ContributorsGroup page - this is a one-time step. Can you add 'jpountz' to the ContributorsGroup? Tha

Re: Beginner's questions

2013-03-27 Thread Adrien Grand
Hi Paul, On Wed, Mar 27, 2013 at 1:58 PM, Paul Bell wrote: > As to the ideas raised in the links you pointed me to: the first link shows > the instantiation of a Term object via > >writer.UpdateDocument(new Term("IDField", *id*), doc); > > yet in the 4.2.0 docs I see no Term constructor that

Re: Beginner's questions

2013-03-27 Thread Adrien Grand
On Wed, Mar 27, 2013 at 9:04 PM, Paul Bell wrote: > Thanks Adrien. > > I've scraped together a simple program in the Lucene 4.2 idiom (see below). > Does this illustrate what you meant by your last sentence? > > The code adds/indexes 5 documents all of whose content is identical, but > whose 'id'

Re: Indexing Term Frequency Vectors

2013-03-28 Thread Adrien Grand
Hi, On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam wrote: > I believe that when Lucene indexes documents, it generates counts for a > term by counting how many times the term appears in a particular document. > Instead of having Lucene do the counting, I want to do my own counting and > feed a term-

Re: Storing Documents in Lucene

2013-03-28 Thread Adrien Grand
On Thu, Mar 28, 2013 at 11:06 PM, Paul wrote: > Hi, Hi Paul, > Some of the stuff I've read suggests that Lucene is not especially > well-suited to storing the documents. It's supposed to be great at indexing > those documents, but not so great at storing the docs themselves. > > Can someone sh

Re: Beginner's questions

2013-03-29 Thread Adrien Grand
Hi Paul, On Fri, Mar 29, 2013 at 1:38 PM, Paul Bell wrote: > Last night reading in "Lucene in Action, 2nd edition," I came upon this > about addDocument(Document, Analyzer): "Adds the document using the > provided analyzer for tokenization. But be careful! In order for searches > to work correctl

Re: Discrepancies between search results and reader.document(i).get("path")

2013-03-29 Thread Adrien Grand
Hi, On Fri, Mar 29, 2013 at 10:23 AM, Bushman, Lamont wrote: > This snippet of one of my classes looks at all of my documents and displays > their file path. > > Directory dir = FSDirectory.open(mInd

Re: Discrepancies between search results and reader.document(i).get("path")

2013-03-29 Thread Adrien Grand
On Sat, Mar 30, 2013 at 12:39 AM, Bushman, Lamont wrote: > However, with your response, especially if I come across problems later. > reader.liveDocs() is not found in IndexWriter. I am guessing you are > referring to the TermsEnum class. I assume numDocs() returns the amount of > documents

Re: "4.1 consuming more memory than 3.0.2 while Indexing"

2013-04-01 Thread Adrien Grand
On Mon, Apr 1, 2013 at 1:56 PM, Arun Kumar K wrote: > Hi Guys, Hi, > I have been finding out the heap space requirement for indexing and > searching with 3.0.2 vs 4.1 (with BlockPostings Format). > > I have a 2GB index with 1 million docs with around 42 fields with 40 fields > being random strin

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
Hi Andi, Here is how you could retrieve positions from your document: Terms termVector = indexReader.getTermVector(docId, fieldName); TermsEnum reuse = null; TermsEnum iterator = termVector.iterator(reuse); BytesRef ref = null; DocsAndPositionsEnum docsAndPositions = null;

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha wrote: > Hi Adrien, > Thank you very much for the reply. > > I have two other small question about this: > 1) Is "final int freq = docsAndPositions.freq();" the same with > "iterator.totalTermFreq()" ? In my tests it returns the same result and from >

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov wrote: > Hello! Hi Igor, > I have a ~20GB index and try to make a concurrent search over it. > The index has 16 segments, I run SpanQuery.getSpans() on each segment > concurrently. > I see really small performance improvement of searching concurre

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov wrote: > Yes, the number of documents is not too large (about 90 000), but the queries > are very hard. Although they're just boolean, a typical query can produce a > result with tens of millions of hits. How can there be tens of millions of hits

Re: Indexing Term Frequency Vectors

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:10 PM, Sharon W Tam wrote: > Are there any other ideas? Since scoring seems to be what you are interested in, you could have a look at payloads: they can store arbitrary data and can be used to score matches. -- Adrien

Re: DocValues questions

2013-04-04 Thread Adrien Grand
Hi, On Thu, Apr 4, 2013 at 10:30 AM, Wei Wang wrote: > A few quick questions about DocValues: > > 1. If only small number of documents have a ShortDocValueField defined, > should each document in the index has this field filled with some value? > The add() function of Document seems not enforce a

Re: DocValues questions

2013-04-04 Thread Adrien Grand
On Thu, Apr 4, 2013 at 11:03 PM, Wei Wang wrote: > Given the new Lucene 4.2 DocValues API, it seems no matter it is byte, > short, int, or long, they are all stored as NumericDocValuesField. Does > this mean "long" values are always stored regardless of the initial type? > If so, do we still save

Re: DocValues questions

2013-04-05 Thread Adrien Grand
On Fri, Apr 5, 2013 at 4:05 AM, Wei Wang wrote: > Do we need to use setLongValue() all the time? Yes. -- Adrien - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@l

Re: DocValues space usage

2013-04-09 Thread Adrien Grand
Hi, On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang wrote: > DocValues makes fast per doc value lookup possible, which is nice. But it > brings other interesting issues. > > Assume there are 100M docs and 200 NumericDocValuesFields, this ends up > with huge number of disk and memory usage, even if there

Re: Indexing Term Frequency Vectors

2013-04-09 Thread Adrien Grand
Hi, On Tue, Apr 9, 2013 at 5:24 PM, Sharon Tam wrote: > I tried following following this payloads tutorial to attach the term > frequencies as payloads: > http://searchhub.org/2009/08/05/getting-started-with-payloads/ > > But I'm confused as to where I need to override the term frequency counter

Re: IntField question

2013-04-10 Thread Adrien Grand
Hi, On Wed, Apr 10, 2013 at 9:34 AM, Wei Wang wrote: > IntField inherits from Field class a function called setByteValue(). > However, if we call it, it gives an error message: > > java.lang.IllegalArgumentException: cannot change value type from Integer > to Byte > > 1. If this not allowed for I

Re: IntField question

2013-04-10 Thread Adrien Grand
Hi, On Wed, Apr 10, 2013 at 4:59 PM, Wei Wang wrote: > Okay. Since there is no ByteField, setByteValue will never by used. It > seems like a dead function. Right, Lucene doesn't have byte or short fields. > That makes sense. If we don't need positional info (virtually all terms are > at the sam

Re: Update a bunch of documents

2013-04-12 Thread Adrien Grand
Hi, On Thu, Apr 11, 2013 at 5:46 PM, Carsten Schnober wrote: > This is limited to one > field only (not the one on which the query is typically performed!), > shouldn't that help? Unfortunately not. Lucene doesn't support in-place updates so updating a document is equivalent to deleting the old
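
The "no in-place updates" point can be pictured with a toy model. The sketch below is not Lucene code: the class `UpdateSketch` and its methods are hypothetical stand-ins, with an append-only list playing the role of the index and a `java.util.BitSet` marking deleted entries. It only illustrates why an update behaves like a delete followed by an add, and why the updated document ends up with a new internal position.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy append-only "index": an update marks the old entry deleted and appends a new one. */
final class UpdateSketch {
    private final List<String[]> docs = new ArrayList<>(); // each entry is {id, body}
    private final BitSet deleted = new BitSet();

    /** Appends a document and returns its internal position. */
    int add(String id, String body) {
        docs.add(new String[] {id, body});
        return docs.size() - 1;
    }

    /** Marks every live document with this id as deleted, then appends the new version. */
    int update(String id, String body) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i)[0].equals(id)) {
                deleted.set(i);
            }
        }
        return add(id, body);
    }

    /** Number of slots ever written, including deleted ones. */
    int maxDoc() {
        return docs.size();
    }

    /** Number of documents that are still live. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }
}
```

In this model the whole document is rewritten even when a single field changes, and the space held by the deleted entry is only reclaimed by a later compaction step, which is roughly the role merging plays in a real index.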

Re: DiskDocValuesFormat

2013-04-13 Thread Adrien Grand
Hi Wei, On Sat, Apr 13, 2013 at 7:44 AM, Wei Wang wrote: > I am trying to use DiskDocValuesFormat for a particular > BinaryDocValuesField. It seems there is no good examples showing how to do > this. The only hint I got from various docs and forums is set some codec in > IndexWriter. Could someon

Re: Please explain the example

2013-04-21 Thread Adrien Grand
Hi, On Thu, Apr 18, 2013 at 3:46 PM, Gaurav Ranjan wrote: > I am a student and studying the functionality of Lucene for my project work. > The DocDelta example on this link is not clear > http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-ext
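
For readers puzzled by the same DocDelta example: as far as the linked format documentation describes it, when frequencies are indexed each posting stores the gap to the previous document ID shifted left by one bit, with the low bit set when the frequency is exactly one; otherwise the frequency follows as a separate value. The Java sketch below works through that rule on plain integers. The class and method names are illustrative assumptions, and the real format writes these values as VInts rather than collecting them in a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative DocDelta encoding on plain ints (the on-disk format uses VInts). */
final class DocDeltaSketch {
    /** docIds must be strictly increasing; freqs[i] is the term frequency in docIds[i]. */
    static List<Integer> encode(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previous = 0; // the first delta is taken relative to zero
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - previous;
            previous = docIds[i];
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1); // odd value: frequency is one, nothing else follows
            } else {
                out.add(delta * 2);     // even value: the frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }
}
```

A term that occurs once in document 7 and three times in document 11 encodes as 15, 8, 3: 7 * 2 + 1 = 15 for the first posting, then (11 - 7) * 2 = 8 followed by the frequency 3.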

Re: Too many unique terms

2013-04-24 Thread Adrien Grand
Hi Manuel, On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand wrote: > Hi there, > Looking at my index (about 1M docs) i see lot of unique terms, more > than 8M which is a significant part of my total term count. These are very > likely useless terms, binaries or other meaningless numbers that co

Re: Distinction between AtomicReader and CompositeReader

2013-04-24 Thread Adrien Grand
Hi Paul On Wed, Apr 24, 2013 at 1:35 PM, Paul Taylor wrote: > Trying to convert some Lucene 3 code to Lucene 4, > > I want to use termEnums.docs(ir.getLiveDocs()) to only return docs that have > not been deleted for a particular term. However getLiveDocs() is only > available for AtomicReaders, a

Re: org.apache.lucene.classification - bug in SimpleNaiveBayesClassifier

2013-04-24 Thread Adrien Grand
Hi Alexey, On Tue, Apr 23, 2013 at 3:28 PM, Alexey Anatolevitch wrote: > I was trying it with 4.2.1 and SimpleNaiveBayesClassifier seems to have a > bug - the local copy of BytesRef referenced by foundClass is affected by > subsequent TermsEnum.iterator.next() calls as the shared BytesRef.bytes >

Re: Too many unique terms

2013-04-29 Thread Adrien Grand
On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand wrote: > Hi, real thanks for the previous reply. > For now i'm not able to make a separation between these useless words, > whether they contain words or digits. > I liked the idea of iterating with TermsEnum. Will it also delete the > occurances

Re: Too many unique terms

2013-04-29 Thread Adrien Grand
Hi, On Mon, Apr 29, 2013 at 10:38 PM, Manuel Le Normand wrote: > I want to make sure: iterating with the TermsEnum will not delete all the > terms occuring in the same doc that includes the single term, but only the > single term right? > Going through the Class TermEnum i cannot find any "delete

Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi, On Tue, May 14, 2013 at 10:35 AM, Rider Carrion Cleger wrote: > - Can I store the lucene index in a mongodb database ? I don't know whether it's possible, but even if it was, I would not recommend it. Lucene works best on local filesystems, and even better if the disk is an SSD. If your inte

Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi, On Tue, May 14, 2013 at 1:34 PM, Rider Carrion Cleger wrote: > So, can I have for sure scalability and safety with a distribution on top > of Lucene like Solr ? Yes, Solr can help you shard your index and add replicas, see http://wiki.apache.org/solr/SolrCloud. -- Adrien

Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
Hi, On Fri, May 17, 2013 at 11:10 AM, Hu Jing wrote: > I want to know the max value of a long field. > I read lucene api , but don't find any api about this? > does someone can supply any hits about how to implement this. To do this efficiently, your field needs to have doc values[1]. First, it

Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
On Fri, May 17, 2013 at 11:36 AM, Adrien Grand wrote: > if (liveDocs != null || liveDocs.get(i)) { Sorry, I meant "if (liveDocs == null || liveDocs.get(i)) {". -- Adrien
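
The corrected condition can be tried out without Lucene on the classpath. In the minimal Java sketch below, `java.util.BitSet` stands in for Lucene's live-docs bits and a plain `long[]` stands in for the per-document values; both are illustrative assumptions, as is the helper name `maxOverLiveDocs`. A `null` live-docs set is taken to mean "no deletions", which is why the `== null` test has to come first.

```java
import java.util.BitSet;

/** Maximum over non-deleted documents, using stand-ins for Lucene's live docs and values. */
final class LiveDocsMax {
    /** liveDocs == null means no document is deleted; otherwise bit i is set for live docs. */
    static long maxOverLiveDocs(long[] values, BitSet liveDocs) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            // Short-circuit: get(i) is only evaluated when liveDocs is non-null.
            if (liveDocs == null || liveDocs.get(i)) {
                max = Math.max(max, values[i]);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        long[] values = {3L, 42L, 7L};
        BitSet live = new BitSet(values.length);
        live.set(0);
        live.set(2); // doc 1 (value 42) is treated as deleted
        System.out.println(maxOverLiveDocs(values, null));
        System.out.println(maxOverLiveDocs(values, live));
    }
}
```

With the uncorrected condition (`!= null ||`), a `null` live-docs set would lead straight to a `NullPointerException`, and a non-null one would accept every document, deleted or not.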

Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 4:48 PM, Arun Kumar K wrote: > Hi Guys, Hi, > I have been trying to understand DocValues and get some hands on and have > observed few things. > > I have added LongDocValuesField to the documents like: > doc.add(new LongDocValuesField("id",1)); > > 1> In 4.0 i saw that th

Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 8:55 PM, Arun Kumar K wrote: > Thanks for clarifying the things. > I have some doubts regarding sorting : >> >> While you can do that, I don't recommend it. For example, if you have >> 5 fields, loading all fields from stored fields requires at most 1 >> disk seek while loa

Re: confirm subscribe to java-user@lucene.apache.org

2013-06-03 Thread Adrien Grand
Hi Manoj, This is maybe related to the compression support which was added in Lucene 4.1. Although it improves performance on large indexes, it might prove to be slightly slower on indexes that completely fit in the file-system cache, especially if you fetch a large number of records at each reque

Re: Please add me as a wiki editor

2013-06-10 Thread Adrien Grand
Hi Lance, On Mon, Jun 10, 2013 at 4:55 AM, Lance Norskog wrote: > I'm responsible for the OpenNLP wiki page: > https://wiki.apache.org/solr/OpenNLP > > Please add me to the list of editors. I just added you to the ContributorsGroup, please let me know if you have trouble editing wiki pages. --

Re: posting list traversal code

2013-06-13 Thread Adrien Grand
Hi, On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov wrote: > Document id on the index level is offset of the document in the index. It can > change over time for the same document, for example when merging several > segments. They are also stored in order in posting lists. This allows fast > p
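
The point about document IDs being stored in increasing order within posting lists can be illustrated with a small intersection routine. This is a minimal sketch on plain `int[]` arrays, not Lucene's actual traversal code (which also uses skip data); the class and method names are assumptions made for the example.

```java
import java.util.Arrays;

/** Intersects two posting lists given as strictly increasing arrays of document IDs. */
final class PostingsIntersection {
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // document contains both terms
                out[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {  // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Because both inputs are sorted, a single forward pass over each list is enough; neither list is ever rewound.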

Re: posting list traversal code

2013-06-13 Thread Adrien Grand
On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar wrote: > Thank you very much. I think I need to play a bit with the code before > asking more questions. Here is the context for my questions: > > I was at Facebook until recently and worked extensively on the Unicorn > search backend. Unicorn allo

Re: segments and sorting

2013-06-15 Thread Adrien Grand
Hi, On Fri, Jun 14, 2013 at 11:24 PM, Sriram Sankar wrote: > For my use case of having all docs sorted by a static rank and being able > to cut off retrieval after a certain number of docs, I have to sort all my > docs using the static rank (and Lucene 4 has a way to do this). > > When an index h

Re: Lucene pointing to existing DB Index

2013-06-15 Thread Adrien Grand
Hi, On Sat, Jun 15, 2013 at 6:55 AM, Pradeep B wrote: > Hi > I have just started out on lucene and experimenting with some possibilities. > My goal is to try to exploit an existing database index (which in my case > is an inverted index) to serve as a Lucene Index. > this helps me avoid need of

Re: merging policy is not triggered behind the scene

2013-06-15 Thread Adrien Grand
Hi Lei, On Fri, Jun 14, 2013 at 1:06 AM, Reg wrote: > I noticed if I do the merging in the following way, > IndexWriter.mabyeMerge() is never triggered automatically by the merge > scheduler. > > > IndexWriter writer = ...; > > IndexReader[] readers = ...; > > writer.addIndexes(readers) > > write

Re: segments and sorting

2013-06-18 Thread Adrien Grand
On Tue, Jun 18, 2013 at 1:05 AM, Sriram Sankar wrote: > I'm sorry - I meant "DocValue" not "FieldValue". Slide 20 in the following > deck talks about the 2Gb limit. Doc values don't have this limit anymore. However, there is a limit of ~32kb per term, but this shouldn't be a problem with reasona

Re: Upgrading from 3.6.1 to 4.3.0 and Custom collector

2013-06-18 Thread Adrien Grand
Hi, You didn't say specifically what your problem is so I assume it is with the following method: On Tue, Jun 18, 2013 at 4:37 AM, Peyman Faratin wrote: > public void setNextReader(IndexReader reader, int docBase) > throws IOException{ > this.docBase =

Re: segments and sorting

2013-06-19 Thread Adrien Grand
Hi, On Wed, Jun 19, 2013 at 12:16 AM, Sriram Sankar wrote: > Is it possible to do this more efficiently using a merge sort? Assuming > the individual segments are already sorted, is there a wrapper that I can > use where I can pass the same sorting function? I'm guessing the > SlowCompositeRead
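
The merge-sort idea raised in the question can be sketched independently of Lucene. The example below merges several already-sorted runs with a priority queue; the `int` values stand in for static ranks, and the class `SortedSegmentMerge` is a hypothetical helper, not an existing Lucene wrapper or merge policy.

```java
import java.util.PriorityQueue;

/** K-way merge of already-sorted runs; values stand in for per-document static ranks. */
final class SortedSegmentMerge {
    static int[] merge(int[][] segments) {
        // Queue entries are {value, segmentIndex, offsetInSegment}, ordered by value.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        int total = 0;
        for (int s = 0; s < segments.length; s++) {
            total += segments[s].length;
            if (segments[s].length > 0) {
                queue.add(new int[] {segments[s][0], s, 0});
            }
        }
        int[] out = new int[total];
        int n = 0;
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            out[n++] = head[0];
            int next = head[2] + 1;
            if (next < segments[head[1]].length) {
                // Refill the queue from the segment the head came from.
                queue.add(new int[] {segments[head[1]][next], head[1], next});
            }
        }
        return out;
    }
}
```

With k sorted runs holding n values in total, this costs O(n log k) comparisons, compared with O(n log n) for re-sorting everything from scratch.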

Re: Doing concurrent searches efficiently

2013-06-19 Thread Adrien Grand
Hi Roberto, On Wed, Jun 19, 2013 at 12:57 PM, Roberto Ragusa wrote: > Hi, > > I would like an expert opinion about how to optimally do concurrent > searches on the same index (let's suppose there are several threads > doing searches). Consider these options: > > a) one IndexReader, all threads us

Re: build of trunk hangs

2013-06-20 Thread Adrien Grand
Hi, On Thu, Jun 20, 2013 at 5:59 PM, Tom Burton-West wrote: > I'm trying to build trunk and when I run "ant compile" > the build hangs right after "Building replicator" at the line > "common.resolve:". (see below for more context) > > I'm not familiar with Ivy so I'm not too sure where to look f

Re: Payload Matching Query

2013-06-20 Thread Adrien Grand
Hi Michal, Although payloads can be used at query time to customize scoring, they can't be used for searching. Lucene only allows searching on terms. -- Adrien

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi, On Sun, Jun 23, 2013 at 9:08 PM, Savia Beson wrote: > I think Mathias was talking about the case with many smallish fields that all > get read per document. DV approach would mean seeking N times, while stored > fields, only once? Or you meant he should encode all his fields into single

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi, On Mon, Jun 24, 2013 at 2:47 PM, Mathias Lux wrote: > Still, I've read that all the BinaryDocValues go directly to memory. > Am I right with this? It is true that the current default implementation stores them in memory. However, disk doc values formats can be configured on a per-field basis

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-25 Thread Adrien Grand
Hi, On Mon, Jun 24, 2013 at 6:13 PM, Mathias Lux wrote: > When searching for an image within memory I came down to 44ms. > Therefore, 77ms is totally acceptable in these terms. My benchmarking > of the BinaryDocValuesField showed that it'd come close to the 44ms, > but I didn't go for a full eval

Re: Securing stored data using Lucene

2013-06-25 Thread Adrien Grand
On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu wrote: > Hello, Hi, > I am sorry I was not a bit more explicit. I am trying to find an acceptable > way to encrypt the data to prevent any access of it in any way unless the > person who is trying to access it knows how to decrypt it. As I ment

Re: In memory index (current status in Lucene)

2013-07-04 Thread Adrien Grand
On Tue, Jul 2, 2013 at 10:09 AM, Toke Eskildsen wrote: > I wonder if Java's ByteBuffer could be used to make a more GC-friendly > RAMDirectory? For the record, there is an open issue about it: https://issues.apache.org/jira/browse/LUCENE-2292. -- Adrien

Re: Please Help solve problem of bad read performance in lucene 4.2.1

2013-07-07 Thread Adrien Grand
Indeed, Lucene 4.1+ may be a bit slower for indices that completely fit in your file-system cache. On the other hand, you should see better performance with indices which are larger than the amount of physical memory of your machine. Your reading benchmark only measures IndexReader.get(int) which s

Re: NRT + static rank based sorting

2013-07-09 Thread Adrien Grand
Hi Sriram, On Tue, Jul 9, 2013 at 5:06 AM, Sriram Sankar wrote: > I've finally got something running and will send you some performance > numbers as promised shortly. In the meanwhile, I've a question regarding > the use of real time indexing along with ordering by static rank. Before > each se

Re: posting list strings

2013-07-09 Thread Adrien Grand
Hi, Lucene stores the string because it may need it to run prefix or range queries. We don't have a hash-based terms dictionary right now but I know some people wrote one since they don't need support for these queries, see for instance the Earlybird paper[1]. Then if you can find a perfect hashin
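
The trade-off between a sorted terms dictionary and a hash-based one can be seen with the JDK collections alone. In the sketch below a `TreeMap` plays the role of a sorted dictionary, so a prefix lookup becomes a range scan; a `HashMap` would answer exact lookups but has no ordering to scan. The class name, the helper `prefixTerms`, and the use of `Character.MAX_VALUE` as an upper bound are assumptions made for illustration and are unrelated to Lucene's actual terms dictionary implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Prefix lookup on a sorted dictionary, something a hash table cannot offer directly. */
final class TermsDictSketch {
    /** Returns all terms starting with prefix, in sorted order. */
    static List<String> prefixTerms(SortedMap<String, Integer> dict, String prefix) {
        // Every term with this prefix sorts in [prefix, prefix + '\uffff'), except terms
        // whose next character is '\uffff' itself; the sketch ignores that edge case.
        return new ArrayList<>(dict.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> dict = new TreeMap<>();
        dict.put("lucene", 1);
        dict.put("lucid", 2);
        dict.put("luke", 3);
        dict.put("solr", 4);
        System.out.println(prefixTerms(dict, "luc"));
    }
}
```

Range queries follow the same pattern with explicit lower and upper bounds, which is why dropping the sorted structure only makes sense when neither kind of query is needed.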

Re: Another question on sorting documents

2013-07-18 Thread Adrien Grand
Hi, On Thu, Jul 18, 2013 at 7:15 AM, Sriram Sankar wrote: > The approach we have discussed in an earlier thread uses: > > writer.addIndexes(new SortingAtomicReader(...)); > > I want to confirm (this is not absolutely clear to me yet) that the above > call will not create multiple segments - i.e.,

Re: Performance measurements

2013-07-24 Thread Adrien Grand
Hi, On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar wrote: > termA AND (termB1 OR termB2 OR ... OR termBn) Maybe this comment is not appropriate for your use-case, but if you don't actually need scoring from the disjunction on the right of the query, a TermsFilter will be faster when n gets large

Re: Query serialization/deserialization

2013-07-28 Thread Adrien Grand
Hi Denis, Indeed, Query.toString() only tries to give a human-understandable representation of what the query searches for and doesn't guarantee that it can be parsed again and would give the same query. We don't provide tools to serialize queries but since query parsing is usually lightweight com

Re: getNumericDocValues

2013-07-29 Thread Adrien Grand
Hi, On Mon, Jul 29, 2013 at 4:56 PM, Yonghui Zhao wrote: > I want to know what will be returned if the input docID is not a valid id, > for examples: > > 1. the docID beyonds the reader scope In that case, the behavior is not defined, it might throw an exception or return a random value. You sh

Re: Cache Field Lucene 3.6.0

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 4:09 PM, andi rexha wrote: > Hi, I have a stored and tokenized field, and I want to cache all the field > values. > > I have one document in the index, with the "field.value" => "hello world" > and with tokens => "hello", "world". > I try to extract the field

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 5:34 PM, Robert Muir wrote: > I'm not sure if there is a similar one for vectors. There is, it has been done for stored fields and term vectors at the same time[1]. [1] https://issues.apache.org/jira/browse/LUCENE-4928 -- Adrien ---

Re: sorting with lucene 4.3

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 8:19 PM, Nicolas Guyot wrote: > When sorting numerically, the search seems to take a bit of a while > compared to the lexically sorted search. > Also when sorting numerically the result is sorted within each page but no > globally as opposed to the lexical sorted searc

Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.

2013-08-30 Thread Adrien Grand
Ankit, The stack traces you are showing only say there was an out of memory error. In such cases, the stack trace is unfortunately not always helpful since the allocation may fail on a small object because another object is taking all the memory of the JVM. Can you come up with a small piece of co

Re: Optimize Lucene 4.4 for CPU usage

2013-08-31 Thread Adrien Grand
Hi, On Sat, Aug 31, 2013 at 6:55 AM, Rose, Stuart J wrote: > I've noticed that processes that were previously IO bound (in 3.5) are now > CPU bound (in 4.4) and I expect it is due to the compression/decompression of > term vector fields in 4.4. > > It would be nice if users of 4.4 could turn t

Re: Making lucene indexing multi threaded

2013-09-02 Thread Adrien Grand
Hi, Lucene's IndexWriter can safely accept updates coming from several threads; just make sure to share the same IndexWriter instance across all threads. No external locking is necessary. 30 minutes sounds like a lot for 3 files unless they are large. You can have a look at http://wiki.apache

Re: Lucene handling of duplicate terms

2013-09-05 Thread Adrien Grand
Hi, On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson wrote: > I have a use case where some of my documents have duplicate terms in > various fields or within the same field. > > For an example, I may have a million documents with just the term "foo" in > field A, and one particular document wit

Re: Strange performance of Lucene 4.4.0

2013-09-10 Thread Adrien Grand
Sort.INDEXORDER simply returns matching documents in index order, whereas by default a score is computed and Lucene selects the top N matching documents from your index. On Mon, Sep 9, 2013 at 7:33 PM, Mirko Sertic wrote: > Ok, using Sort.INDEXORDER for default sorting is blazing fast. Just for my > unde

Re: possible latency increase from Lucene versions 4.1 to 4.4?

2013-09-16 Thread Adrien Grand
Hi John, I just had a look at Mike's benchmarks[1][2], which don't show any performance difference over approximately the past year. But this only tests a conjunction of two terms so it might still be that latency worsened for more complex queries. [1] http://people.apache.org/~mikemccand/lucenebench/AndHigh

Re: A question about "seek past EOF: MMapIndexInput"

2013-09-18 Thread Adrien Grand
Hi, This means that there is either a bug in Lucene or that your index is corrupted. Can you reproduce this failure if you reindex data? The output of CheckIndex would be interesting as well, see http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/CheckIndex.html#main%28java.lang.Stri

Re: Position problems in 4.3.0

2013-09-18 Thread Adrien Grand
Hi, This looks bad! Can you write a small test case that reproduces the issue so that we can try to understand what happens here? Thanks! -- Adrien

Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi, Since Lucene 4.0 which introduced codecs, it is not possible anymore to know based on filename extensions whether files have been created by Lucene or not: every codec is free to use any file extension. On Wed, Sep 18, 2013 at 1:03 PM, Yonghui Zhao wrote: > In lucene 4.3.0 there is no IndexF

Re: How to modify the Lucene 4 index?

2013-09-18 Thread Adrien Grand
Hi, Are you talking about updating the content of the index or customizing the file formats of the index? On Tue, Sep 17, 2013 at 11:31 PM, Ralf Bierig wrote: > Hi all, > > is there any good documentation of how to change and modify the index of > Lucene version 4 other than what is already on t

Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi, On Wed, Sep 18, 2013 at 1:39 PM, Yonghui Zhao wrote: > Got it. Currently I don't use any custom codecs. Part of the problem is that even the current codec keeps evolving, and file extensions that exist today might not be used anymore in 6 months and vice-versa. I would strongly recommend not

Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Adrien Grand
Hi Benson, On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies wrote: > The multithreaded index searcher fans out across segments. How aggressively > does 'optimize' reduce the number of segments? If the segment count goes > way down, is there some other way to exploit multiple cores? forceMerge[1

[ANNOUNCE] Apache Lucene 4.5 released

2013-10-05 Thread Adrien Grand
. If that is the case, please try another mirror. This also goes for Maven access. -- Adrien Grand

Re: optimal way to access many TermVectors

2013-10-08 Thread Adrien Grand
Hi, On Mon, Oct 7, 2013 at 9:31 PM, Rose, Stuart J wrote: > Is there an optimal way to access many document TermVectors (in the same > chunk) consecutively when using the LZ4 termvector compression? > > I'm curious to know whether all TermVectors in a single compressed chunk are > decompressed

Re: external file stored field codec

2013-10-11 Thread Adrien Grand
On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov wrote: > I've been running some tests comparing storing large fields (documents, say > 100K .. 10M) as files vs. storing them in Lucene as stored fields. Initial > results seem to indicate storing them externally is a win (at least for > binary doc

Re: external file stored field codec

2013-10-13 Thread Adrien Grand
Hi Michael, I'm not aware enough of operating system internals to know exactly what happens when a file is opened, but it sounds to me like having separate files per document or field adds levels of indirection when loading stored fields, so I would be surprised if it actually proved to be more effic

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Adrien Grand
Hi Stephen, On Wed, Oct 23, 2013 at 9:29 AM, Stephen GRAY wrote: > UNOFFICIAL > Hi everyone, > > I have a question about how to retrieve the values in a > NumericDocValuesField. I understand how to do this in situations where you > have an AtomicReaderContext available > (context.reader().getN

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Adrien Grand
Hi, On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan wrote: > Sorting is not an option for our case so we will most likely implement a > variant that merges the segments in one pass. Using TimSort is great but in > our case the 2 segments will be highly interspersed and would not benefit > from th

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-24 Thread Adrien Grand
Hi Stephen, On Thu, Oct 24, 2013 at 1:18 AM, Stephen GRAY wrote: > I actually need to loop through a large number of documents (50,000 - > 100,000) calculating a number of statistics (min, max, sum) so I really need > the most efficient/fastest solution available. It sounds like it would be >
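For what it's worth, the min/max/sum loop described in the question can be sketched in plain Java as below. This is only an illustration under assumptions: `DocValueStats`, `summarize` and the `valueOf` lookup are hypothetical names, and the lookup stands in for however you read the per-document value (in Lucene 4.x that would typically be `NumericDocValues.get(docId)`, which is not reproduced here).

```java
import java.util.LongSummaryStatistics;
import java.util.function.IntToLongFunction;

// Hypothetical helper, not part of Lucene.
final class DocValueStats {

    // Visits each doc id once and accumulates min, max, sum and count.
    // valueOf is an assumed stand-in for the per-document value lookup.
    static LongSummaryStatistics summarize(int[] docIds, IntToLongFunction valueOf) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        for (int docId : docIds) {
            stats.accept(valueOf.applyAsLong(docId));
        }
        return stats;
    }

    public static void main(String[] args) {
        long[] values = {5L, -2L, 9L};   // made-up values, indexed by doc id
        int[] docIds = {0, 1, 2};
        LongSummaryStatistics stats = summarize(docIds, docId -> values[docId]);
        System.out.println(stats.getMin() + " " + stats.getMax() + " " + stats.getSum());
    }
}
```

A single pass keeps the cost at one value lookup per document, so most of the time should go into the lookup itself rather than the statistics.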

Re: Merging ordered segments without re-sorting.

2013-10-24 Thread Adrien Grand
Hi, On Thu, Oct 24, 2013 at 12:20 AM, Arvind Kalyan wrote: > I will benchmark the available approach itself then, in that case. Will > revert back if the performance is unacceptable. For the record, last time I checked, indexing was 2x slower on average on a 10M document collection (see https://

Re: Lucene doubt

2014-02-17 Thread Adrien Grand
Hi Pedro, Lucene indeed supports indexing data from several threads into a single IndexWriter instance, and it will make use of all your I/O and CPU. You can learn more about how it works at http://blog.trifork.com/2011/05/03/lucene-indexing-gains-concurrency/ On Mon, Feb 17, 2014 at 3:54 PM, Ped

Re: Stored fields and OS file caching

2014-04-04 Thread Adrien Grand
Hi Vitaly, Doc values are indeed well-suited for grouping and sorting. However stored fields remain better at returning field values to users since they guarantee a worst-case of one disk seek per document. The filesystem cache typically caches data by blocks of 4KB. This plays more nicely with d

Re: Performance issues with the default field compression

2014-04-09 Thread Adrien Grand
Hi Alex, Indeed, one or several (the number depends on the size of your documents) documents need to be fully decompressed in order to read a single field of a single document. Regarding the stored fields visitor, the default one doesn't return STOP when the field has been found because other fie
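To illustrate why reading one document means decompressing its neighbours too, here is a minimal sketch of chunk-level compression. It is not Lucene's implementation: it uses `java.util.zip` (DEFLATE) instead of LZ4, and `ChunkSketch`, `docLengths` and `readDoc` are hypothetical names chosen for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not Lucene code: several documents share one compressed chunk.
final class ChunkSketch {
    final byte[] compressed;   // all documents of the chunk, compressed together
    final int[] docLengths;    // uncompressed byte length of each document

    ChunkSketch(String... docs) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        docLengths = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            byte[] bytes = docs[i].getBytes(StandardCharsets.UTF_8);
            docLengths[i] = bytes.length;
            raw.write(bytes, 0, bytes.length);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw.toByteArray());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        compressed = out.toByteArray();
    }

    // Reading one document inflates the whole chunk, then slices out the document.
    String readDoc(int docIndex) {
        int total = 0;
        for (int length : docLengths) {
            total += length;
        }
        byte[] raw = new byte[total];
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        try {
            int offset = 0;
            while (offset < raw.length) {
                int n = inflater.inflate(raw, offset, raw.length - offset);
                if (n == 0 && (inflater.finished() || inflater.needsInput())) {
                    throw new IllegalStateException("chunk shorter than expected");
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        int start = 0;
        for (int i = 0; i < docIndex; i++) {
            start += docLengths[i];
        }
        return new String(raw, start, docLengths[docIndex], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ChunkSketch chunk = new ChunkSketch("alpha", "beta", "gamma");
        System.out.println(chunk.readDoc(1));
    }
}
```

In this sketch the cost of `readDoc` grows with the size of the whole chunk rather than the size of the requested document, which is the trade-off discussed above.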

Re: Reading a v2 index in v4

2014-06-09 Thread Adrien Grand
Hi, It is not possible to read 2.x indices from Lucene 4, even with a custom codec. For instance, Lucene 4 needs to hook into SegmentInfos.read to detect old 3.x indices and force the use of the Lucene3x codec since these indices don't expose what codec has been used to write them. On Mon, Jun 9

Re: EarlyTerminatingSortingCollector help needed..

2014-06-21 Thread Adrien Grand
Hi Ravikumar, On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan wrote: > If my "numDocsToCollect" = 50 and no.of. segments = 15, then > collector.collect() will be called 750 times. That is indeed the worst case. However, if some of your segments have fewer than 50 matches, `collect` will o
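The arithmetic behind that worst case can be sketched as follows. This assumes every segment is sorted so that collection stops after `numDocsToCollect` hits per segment; `CollectCallEstimate` and its methods are hypothetical helpers for illustration, not Lucene API, and the per-segment match counts in `main` are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class CollectCallEstimate {

    // Upper bound: every segment has at least numDocsToCollect matches.
    static long worstCase(int numDocsToCollect, int numSegments) {
        return (long) numDocsToCollect * numSegments;
    }

    // A segment with fewer matches than numDocsToCollect contributes
    // only its own match count to the total number of collect() calls.
    static long withMatchCounts(int numDocsToCollect, List<Integer> matchesPerSegment) {
        long calls = 0;
        for (int matches : matchesPerSegment) {
            calls += Math.min(matches, numDocsToCollect);
        }
        return calls;
    }

    public static void main(String[] args) {
        // 50 docs to collect across 15 segments, as in the question.
        System.out.println(worstCase(50, 15));
        // Made-up match counts for four segments.
        System.out.println(withMatchCounts(50, Arrays.asList(200, 10, 0, 75)));
    }
}
```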

Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Adrien Grand
On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan wrote: > For a normal sorting-query, on a top-level searcher, I execute > > TopDocs docs = searcher.search(query, 50, sortField) > > Then I can issue reader.document() for final list of exactly 50 docs, which > gives me a global order across
