Re: i'm Lucene beginner. help me

2012-06-26 Thread Adrien Grand
Hi kjysmu, On Tue, Jun 26, 2012 at 11:22 AM, kjysmu kjy...@gmail.com wrote: What i want with lucene is that i wanna get it's image ids for certain query (tag) how can i implement it using Lucene with Java? I moved the discussion to java-user@lucene instead of dev@lucene since your question

Re: Lucene 4.0.0 - find term position.

2012-12-07 Thread Adrien Grand
Hi Vitaly, On Fri, Dec 7, 2012 at 3:24 PM, vitaly_arte...@mcafee.com wrote: I try to use or Terms tfvector = reader.getTermVector(docId, contents); or Fields fields = reader.getTermVectors(docId); but I get null from these calls. What is wrong? These methods will always return null

Re: StoredFieldsFormat / documentation

2013-01-24 Thread Adrien Grand
Hi Bernd, On Thu, Jan 24, 2013 at 11:55 AM, Bernd Müller belu...@googlemail.com wrote: Hi Simon, you mean where it is used? Look at the org.apache.lucene.codecs.Codec class, it has a method: public abstract StoredFieldsFormat storedFieldsFormat(); which returns a stored fields format

Re: Need help regarding understanding internals of Lucene Index.

2013-01-25 Thread Adrien Grand
Hi Vignesh, This is a very broad question! The following links might help you: - Lucene documentation: http://lucene.apache.org/core/4_1_0/index.html - File formats: http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/codecs/lucene41/package-summary.html#package_description - The block

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand
Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler,

Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-29 Thread Adrien Grand
Arun, Lucene uses a very light compression algorithm so I'm a little surprised it can make indexing 2x slower. Could you run indexing under a profiler to make sure it really is what makes indexing slower? Thanks! -- Adrien

Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-30 Thread Adrien Grand
On Wed, Jan 30, 2013 at 8:08 AM, arun k arunk...@gmail.com wrote: Adrein, I have created an index of size 370M of 1 million docs of 40 fields of 40 chars and did the profiling. I see that the indexing and in particular the addDocument ConcurrentMergeScheduler in 4.1 takes double the time

Re: Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread Adrien Grand
Hi, On Fri, Feb 1, 2013 at 6:51 PM, saisantoshi saisantosh...@gmail.com wrote: Prior to 4.0, there was an optimize() in the IndexWriter which was merging the index files. Is there any settings that can be done on the TieredMergePolicy so that I want to limit the number of files produced

Re: updateDocument question

2013-02-06 Thread Adrien Grand
Hi Thomas, On Wed, Feb 6, 2013 at 2:50 PM, Becker, Thomas thomas.bec...@netapp.com wrote: I've built a search prototype feature for my application using Lucene, and it works great. The application monitors a remote system and currently indexes just a few core attributes of the objects on

Re: updateDocument question

2013-02-07 Thread Adrien Grand
On Thu, Feb 7, 2013 at 1:54 PM, Becker, Thomas thomas.bec...@netapp.com wrote: Thanks for the response Adrien. I guess I'll just leave things as they are for now. To be clear though, do merged segments get cleaned up completely even if the IndexWriter is never closed? The way it works is

Re: Indexing directly from stdin in lucene 3.5

2013-02-19 Thread Adrien Grand
Hi, On Tue, Feb 19, 2013 at 11:04 AM, A. L. Benhenni albenhe...@gmail.com wrote: I am currently writing an indexer class to index texts from stdin. I also need the text to be tokenized and stored to access the termvector of the document. Actually, you don't need to store documents to access

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Adrien Grand
Hi Steve, On Mon, Mar 25, 2013 at 4:16 AM, Steve Rowe sar...@gmail.com wrote: Please request either on the java-user@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step. Can you add 'jpountz' to the

Re: Beginner's questions

2013-03-27 Thread Adrien Grand
Hi Paul, On Wed, Mar 27, 2013 at 1:58 PM, Paul Bell arach...@gmail.com wrote: As to the ideas raised in the links you pointed me to: the first link shows the instantiation of a Term object via writer.UpdateDocument(new Term(IDField, *id*), doc); yet in the 4.2.0 docs I see no Term

Re: Beginner's questions

2013-03-27 Thread Adrien Grand
On Wed, Mar 27, 2013 at 9:04 PM, Paul Bell arach...@gmail.com wrote: Thanks Adrien. I've scraped together a simple program in the Lucene 4.2 idiom (see below). Does this illustrate what you meant by your last sentence? The code adds/indexes 5 documents all of whose content is identical, but

Re: Indexing Term Frequency Vectors

2013-03-28 Thread Adrien Grand
Hi, On Thu, Mar 28, 2013 at 8:25 PM, Sharon Tam sharon...@gmail.com wrote: I believe that when Lucene indexes documents, it generates counts for a term by counting how many times the term appears in a particular document. Instead of having Lucene do the counting, I want to do my own counting

Re: Storing Documents in Lucene

2013-03-28 Thread Adrien Grand
On Thu, Mar 28, 2013 at 11:06 PM, Paul arach...@gmail.com wrote: Hi, Hi Paul, Some of the stuff I've read suggests that Lucene is not especially well-suited to storing the documents. It's supposed to be great at indexing those documents, but not so great at storing the docs themselves.

Re: Beginner's questions

2013-03-29 Thread Adrien Grand
Hi Paul, On Fri, Mar 29, 2013 at 1:38 PM, Paul Bell arach...@gmail.com wrote: Last night reading in Lucene in Action, 2nd edition, I came upon this about addDocument(Document, Analyzer): Adds the document using the provided analyzer for tokenization. But be careful! In order for searches to

Re: Discrepancies between search results and reader.document(i).get(path)

2013-03-29 Thread Adrien Grand
Hi, On Fri, Mar 29, 2013 at 10:23 AM, Bushman, Lamont bus08...@byui.edu wrote: This snippet of one of my classes looks at all of my documents and displays their file path. Directory dir =

Re: Discrepancies between search results and reader.document(i).get(path)

2013-03-29 Thread Adrien Grand
On Sat, Mar 30, 2013 at 12:39 AM, Bushman, Lamont bus08...@byui.edu wrote: However, with your response, especially if I come across problems later. reader.liveDocs() is not found in IndexWriter. I am guessing you are referring to the TermsEnum class. I assume numDocs() returns the amount

Re: 4.1 consuming more memory than 3.0.2 while Indexing

2013-04-01 Thread Adrien Grand
On Mon, Apr 1, 2013 at 1:56 PM, Arun Kumar K arunk...@gmail.com wrote: Hi Guys, Hi, I have been finding out the heap space requirement for indexing and searching with 3.0.2 vs 4.1 (with BlockPostings Format). I have a 2GB index with 1 million docs with around 42 fields with 40 fields being

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha a_re...@hotmail.com wrote: Hi Adrien, Thank you very much for the reply. I have two other small question about this: 1) Is final int freq = docsAndPositions.freq(); the same with iterator.totalTermFreq() ? In my tests it returns the same result

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov ishalymi...@yandex-team.ru wrote: Hello! Hi Igor, I have a ~20GB index and try to make a concurrent search over it. The index has 16 segments, I run SpanQuery.getSpans() on each segment concurrently. I see really small performance improvement

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov ishalymi...@yandex-team.ru wrote: Yes, the number of documents is not too large (about 90 000), but the queries are very hard. Although they're just boolean, a typical query can produce a result with tens of millions of hits. How can there be

Re: Indexing Term Frequency Vectors

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:10 PM, Sharon W Tam s...@mit.edu wrote: Are there any other ideas? Since scoring seems to be what you are interested in, you could have a look at payloads: they can store arbitrary data and can be used to score matches. -- Adrien

Re: DocValues questions

2013-04-04 Thread Adrien Grand
Hi, On Thu, Apr 4, 2013 at 10:30 AM, Wei Wang welshw...@gmail.com wrote: A few quick questions about DocValues: 1. If only small number of documents have a ShortDocValueField defined, should each document in the index has this field filled with some value? The add() function of Document

Re: DocValues questions

2013-04-04 Thread Adrien Grand
On Thu, Apr 4, 2013 at 11:03 PM, Wei Wang welshw...@gmail.com wrote: Given the new Lucene 4.2 DocValues API, it seems no matter it is byte, short, int, or long, they are all stored as NumericDocValuesField. Does this mean long values are always stored regardless of the initial type? If so, do

Re: DocValues questions

2013-04-05 Thread Adrien Grand
On Fri, Apr 5, 2013 at 4:05 AM, Wei Wang welshw...@gmail.com wrote: Do we need to use setLongValue() all the time? Yes. -- Adrien

Re: DocValues space usage

2013-04-09 Thread Adrien Grand
Hi, On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang welshw...@gmail.com wrote: DocValues makes fast per doc value lookup possible, which is nice. But it brings other interesting issues. Assume there are 100M docs and 200 NumericDocValuesFields, this ends up with huge number of disk and memory

Re: Indexing Term Frequency Vectors

2013-04-09 Thread Adrien Grand
Hi, On Tue, Apr 9, 2013 at 5:24 PM, Sharon Tam sharon...@gmail.com wrote: I tried following this payloads tutorial to attach the term frequencies as payloads: http://searchhub.org/2009/08/05/getting-started-with-payloads/ But I'm confused as to where I need to override the term

Re: IntField question

2013-04-10 Thread Adrien Grand
Hi, On Wed, Apr 10, 2013 at 4:59 PM, Wei Wang welshw...@gmail.com wrote: Okay. Since there is no ByteField, setByteValue will never be used. It seems like a dead function. Right, Lucene doesn't have byte or short fields. That makes sense. If we don't need positional info (virtually all terms

Re: Update a bunch of documents

2013-04-12 Thread Adrien Grand
Hi, On Thu, Apr 11, 2013 at 5:46 PM, Carsten Schnober schno...@ids-mannheim.de wrote: This is limited to one field only (not the one on which the query is typically performed!), shouldn't that help? Unfortunately not. Lucene doesn't support in-place updates so updating a document is

Re: DiskDocValuesFormat

2013-04-13 Thread Adrien Grand
Hi Wei, On Sat, Apr 13, 2013 at 7:44 AM, Wei Wang welshw...@gmail.com wrote: I am trying to use DiskDocValuesFormat for a particular BinaryDocValuesField. It seems there is no good examples showing how to do this. The only hint I got from various docs and forums is set some codec in

Re: Please explain the example

2013-04-21 Thread Adrien Grand
Hi, On Thu, Apr 18, 2013 at 3:46 PM, Gaurav Ranjan gaurav.ranjan.i...@gmail.com wrote: I am a student and studying the functionality of Lucene for my project work. The DocDelta example on this link is not clear
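A note for readers of this thread: as I understand the Lucene 4.0 postings file-format documentation, when frequencies are indexed, DocDelta/2 is the gap to the previous document number, an odd DocDelta means the frequency is one, and an even DocDelta is followed by the frequency as a separate VInt. The sketch below is plain Java working on already-decoded VInt values; the class and method names are made up for illustration and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

final class DocDeltaDecoder {
    /** Decodes a DocDelta[, Freq] sequence into {docId, freq} pairs. */
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            // The upper bits carry the gap to the previous document number.
            doc += docDelta >>> 1;
            // Low bit set: frequency is one. Low bit clear: frequency is the next value.
            int freq = (docDelta & 1) != 0 ? 1 : vints[i++];
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        for (int[] p : decode(new int[] {15, 8, 3})) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

Under this reading, the sequence 15, 8, 3 decodes to document 7 with frequency 1, then document 11 with frequency 3.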

Re: Too many unique terms

2013-04-24 Thread Adrien Grand
Hi Manuel, On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand manuel.lenorm...@gmail.com wrote: Hi there, Looking at my index (about 1M docs) i see lot of unique terms, more than 8M which is a significant part of my total term count. These are very likely useless terms, binaries or other

Re: Distinction between AtomicReader and CompositeReader

2013-04-24 Thread Adrien Grand
Hi Paul On Wed, Apr 24, 2013 at 1:35 PM, Paul Taylor paul_t...@fastmail.fm wrote: Trying to convert some Lucene 3 code to Lucene 4, I want to use termEnums.docs(ir.getLiveDocs()) to only return docs that have not been deleted for a particular term. However getLiveDocs() is only available for

Re: org.apache.lucene.classification - bug in SimpleNaiveBayesClassifier

2013-04-24 Thread Adrien Grand
Hi Alexey, On Tue, Apr 23, 2013 at 3:28 PM, Alexey Anatolevitch alexeyl...@gmail.com wrote: I was trying it with 4.2.1 and SimpleNaiveBayesClassifier seems to have a bug - the local copy of BytesRef referenced by foundClass is affected by subsequent TermsEnum.iterator.next() calls as the

Re: Too many unique terms

2013-04-29 Thread Adrien Grand
On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi, real thanks for the previous reply. For now i'm not able to make a separation between these useless words, whether they contain words or digits. I liked the idea of iterating with TermsEnum. Will it also

Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi, On Tue, May 14, 2013 at 10:35 AM, Rider Carrion Cleger rider.carr...@gmail.com wrote: - Can I store the lucene index in a mongodb database ? I don't know whether it's possible, but even if it was, I would not recommend it. Lucene works best on local filesystems, and even better if the disk

Re: lucene and mongodb

2013-05-14 Thread Adrien Grand
Hi, On Tue, May 14, 2013 at 1:34 PM, Rider Carrion Cleger rider.carr...@gmail.com wrote: So, can I have for sure scalability and safety with a distribution on top of Lucene like Solr ? Yes, Solr can help you shard your index and add replicas, see http://wiki.apache.org/solr/SolrCloud. --

Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
Hi, On Fri, May 17, 2013 at 11:10 AM, Hu Jing huj@gmail.com wrote: I want to know the max value of a long field. I read the Lucene API but didn't find anything for this. Can someone supply any hints about how to implement this? To do this efficiently, your field needs to have doc

Re: how to get max value of a long field?

2013-05-17 Thread Adrien Grand
On Fri, May 17, 2013 at 11:36 AM, Adrien Grand jpou...@gmail.com wrote: if (liveDocs != null || liveDocs.get(i)) { Sorry, I meant if (liveDocs == null || liveDocs.get(i)) {. -- Adrien
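The correction matters because a null liveDocs means "no deletions": with `!=`, a null liveDocs falls through to liveDocs.get(i) and throws a NullPointerException, while a non-null one makes the condition always true, so deleted documents get counted. Below is a minimal plain-Java sketch of the corrected check. It models liveDocs with java.util.BitSet and the per-document values with a long[]; both are stand-ins chosen for illustration, not the Lucene types, and the class name is hypothetical.

```java
import java.util.BitSet;

final class MaxLiveValue {
    /**
     * Returns the maximum value among live documents, or Long.MIN_VALUE if none is live.
     * A null liveDocs means the segment has no deletions, so every document is live.
     */
    static long maxValue(long[] values, BitSet liveDocs) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            if (liveDocs == null || liveDocs.get(i)) {
                max = Math.max(max, values[i]);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        long[] values = {3L, 42L, 7L};
        BitSet live = new BitSet(values.length);
        live.set(0);
        live.set(2); // document 1 is treated as deleted
        System.out.println(maxValue(values, live)); // deleted document is skipped
        System.out.println(maxValue(values, null)); // no deletions: all documents count
    }
}
```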

Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 4:48 PM, Arun Kumar K arunk...@gmail.com wrote: Hi Guys, Hi, I have been trying to understand DocValues and get some hands on and have observed few things. I have added LongDocValuesField to the documents like: doc.add(new LongDocValuesField(id,1)); 1 In 4.0 i saw

Re: Lucene 4.2 DocValues

2013-05-28 Thread Adrien Grand
On Tue, May 28, 2013 at 8:55 PM, Arun Kumar K arunk...@gmail.com wrote: Thanks for clarifying the things. I have some doubts regarding sorting : While you can do that, I don't recommend it. For example, if you have 5 fields, loading all fields from stored fields requires at most 1 disk seek

Re: confirm subscribe to java-user@lucene.apache.org

2013-06-03 Thread Adrien Grand
Hi Manoj, This is maybe related to the compression support which was added in Lucene 4.1. Although it improves performance on large indexes, it might prove to be slightly slower on indexes that completely fit in the file-system cache, especially if you fetch a large number of records at each

Re: Please add me as a wiki editor

2013-06-10 Thread Adrien Grand
Hi Lance, On Mon, Jun 10, 2013 at 4:55 AM, Lance Norskog goks...@gmail.com wrote: I'm responsible for the OpenNLP wiki page: https://wiki.apache.org/solr/OpenNLP Please add me to the list of editors. I just added you to the ContributorsGroup, please let me know if you have trouble editing

Re: posting list traversal code

2013-06-13 Thread Adrien Grand
Hi, On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov dot...@gmail.com wrote: Document id on the index level is offset of the document in the index. It can change over time for the same document, for example when merging several segments. They are also stored in order in posting lists. This

Re: posting list traversal code

2013-06-13 Thread Adrien Grand
On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar san...@gmail.com wrote: Thank you very much. I think I need to play a bit with the code before asking more questions. Here is the context for my questions: I was at Facebook until recently and worked extensively on the Unicorn search backend.

Re: segments and sorting

2013-06-15 Thread Adrien Grand
Hi, On Fri, Jun 14, 2013 at 11:24 PM, Sriram Sankar san...@gmail.com wrote: For my use case of having all docs sorted by a static rank and being able to cut off retrieval after a certain number of docs, I have to sort all my docs using the static rank (and Lucene 4 has a way to do this).

Re: Lucene pointing to existing DB Index

2013-06-15 Thread Adrien Grand
Hi, On Sat, Jun 15, 2013 at 6:55 AM, Pradeep B bpradeep.m...@gmail.com wrote: Hi I have just started out on lucene and experimenting with some possibilities. My goal is to try to exploit an existing database index (which in my case is an inverted index) to serve as a Lucene Index. this

Re: merging policy is not triggered behind the scene

2013-06-15 Thread Adrien Grand
Hi Lei, On Fri, Jun 14, 2013 at 1:06 AM, Reg register9...@gmail.com wrote: I noticed if I do the merging in the following way, IndexWriter.maybeMerge() is never triggered automatically by the merge scheduler. IndexWriter writer = ...; IndexReader[] readers = ...;

Re: segments and sorting

2013-06-18 Thread Adrien Grand
On Tue, Jun 18, 2013 at 1:05 AM, Sriram Sankar san...@gmail.com wrote: I'm sorry - I meant DocValue not FieldValue. Slide 20 in the following deck talks about the 2Gb limit. Doc values don't have this limit anymore. However, there is a limit of ~32kb per term, but this shouldn't be a problem

Re: Upgrading from 3.6.1 to 4.3.0 and Custom collector

2013-06-18 Thread Adrien Grand
Hi, You didn't say specifically what your problem is so I assume it is with the following method: On Tue, Jun 18, 2013 at 4:37 AM, Peyman Faratin peymanfara...@gmail.com wrote: public void setNextReader(IndexReader reader, int docBase) throws IOException{

Re: segments and sorting

2013-06-19 Thread Adrien Grand
Hi, On Wed, Jun 19, 2013 at 12:16 AM, Sriram Sankar san...@gmail.com wrote: Is it possible to do this more efficiently using a merge sort? Assuming the individual segments are already sorted, is there a wrapper that I can use where I can pass the same sorting function? I'm guessing the
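The merge-sort idea raised in the question can be sketched outside Lucene as a k-way merge over inputs that are each already sorted. The code below is only an illustration under stated assumptions: each long[] stands in for one segment's documents already ordered by static rank, the class and method names are hypothetical, and this is not the wrapper Lucene itself provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class SortedSegmentMerge {
    /** Merges already-sorted arrays into one sorted list, stopping after limit entries. */
    static List<Long> mergeTopN(long[][] segments, int limit) {
        // Each queue entry is {segmentIndex, positionInSegment}, ordered by the value it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Long.compare(segments[a[0]][a[1]], segments[b[0]][b[1]]));
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) {
                heads.add(new int[] {s, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heads.isEmpty() && merged.size() < limit) {
            int[] head = heads.poll();
            merged.add(segments[head[0]][head[1]]);
            // Advance within the segment the smallest value came from.
            if (head[1] + 1 < segments[head[0]].length) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] segments = {{1, 4, 9}, {2, 3, 10}, {}};
        System.out.println(mergeTopN(segments, 4));
    }
}
```

Because every input is already sorted, each step only compares the current head of each segment, so the cost per emitted entry is logarithmic in the number of segments rather than in the number of documents.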

Re: Doing concurrent searches efficiently

2013-06-19 Thread Adrien Grand
Hi Roberto, On Wed, Jun 19, 2013 at 12:57 PM, Roberto Ragusa m...@robertoragusa.it wrote: Hi, I would like an expert opinion about how to optimally do concurrent searches on the same index (let's suppose there are several threads doing searches). Consider these options: a) one IndexReader,

Re: build of trunk hangs

2013-06-20 Thread Adrien Grand
Hi, On Thu, Jun 20, 2013 at 5:59 PM, Tom Burton-West tburt...@umich.edu wrote: I'm trying to build trunk and when I run ant compile the build hangs right after Building replicator at the line common.resolve:. (see below for more context) I'm not familiar with Ivy so I'm not too sure where

Re: Payload Matching Query

2013-06-20 Thread Adrien Grand
Hi Michal, Although payloads can be used at query time to customize scoring, they can't be used for searching. Lucene only allows searching on terms. -- Adrien

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi, On Sun, Jun 23, 2013 at 9:08 PM, Savia Beson eks...@googlemail.com wrote: I think Mathias was talking about the case with many smallish fields that all get read per document. DV approach would mean seeking N times, while stored fields, only once? Or you meant he should encode all his

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-24 Thread Adrien Grand
Hi, On Mon, Jun 24, 2013 at 2:47 PM, Mathias Lux m...@itec.uni-klu.ac.at wrote: Still, I've read that all the BinaryDocValues go directly to memory. Am I right with this? It is true that the current default implementation stores them in memory. However, disk doc values formats can be

Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?

2013-06-25 Thread Adrien Grand
Hi, On Mon, Jun 24, 2013 at 6:13 PM, Mathias Lux m...@itec.uni-klu.ac.at wrote: When searching for an image within memory I came down to 44ms. Therefore, 77ms is totally acceptable in these terms. My benchmarking of the BinaryDocValuesField showed that it'd come close to the 44ms, but I

Re: Securing stored data using Lucene

2013-06-25 Thread Adrien Grand
On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu rafaela.voicule...@gmail.com wrote: Hello, Hi, I am sorry I was not a bit more explicit. I am trying to find an acceptable way to encrypt the data to prevent any access of it in any way unless the person who is trying to access it knows how

Re: In memory index (current status in Lucene)

2013-07-04 Thread Adrien Grand
On Tue, Jul 2, 2013 at 10:09 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: I wonder if Java's ByteBuffer could be used to make a more GC-friendly RAMDirectory? For the record, there is an open issue about it: https://issues.apache.org/jira/browse/LUCENE-2292. -- Adrien

Re: Please Help solve problem of bad read performance in lucene 4.2.1

2013-07-07 Thread Adrien Grand
Indeed, Lucene 4.1+ may be a bit slower for indices that completely fit in your file-system cache. On the other hand, you should see better performance with indices which are larger than the amount of physical memory of your machine. Your reading benchmark only measures IndexReader.get(int) which

Re: NRT + static rank based sorting

2013-07-09 Thread Adrien Grand
Hi Sriram, On Tue, Jul 9, 2013 at 5:06 AM, Sriram Sankar san...@gmail.com wrote: I've finally got something running and will send you some performance numbers as promised shortly. In the meanwhile, I've a question regarding the use of real time indexing along with ordering by static rank.

Re: posting list strings

2013-07-09 Thread Adrien Grand
Hi, Lucene stores the string because it may need it to run prefix or range queries. We don't have a hash-based terms dictionary right now but I know some people wrote one since they don't need support for these queries, see for instance the Earlybird paper[1]. Then if you can find a perfect

Re: Another question on sorting documents

2013-07-18 Thread Adrien Grand
Hi, On Thu, Jul 18, 2013 at 7:15 AM, Sriram Sankar san...@gmail.com wrote: The approach we have discussed in an earlier thread uses: writer.addIndexes(new SortingAtomicReader(...)); I want to confirm (this is not absolutely clear to me yet) that the above call will not create multiple

Re: Performance measurements

2013-07-24 Thread Adrien Grand
Hi, On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar san...@gmail.com wrote: termA AND (termB1 OR termB2 OR ... OR termBn) Maybe this comment is not appropriate for your use-case, but if you don't actually need scoring from the disjunction on the right of the query, a TermsFilter will be faster

Re: Query serialization/deserialization

2013-07-28 Thread Adrien Grand
Hi Denis, Indeed, Query.toString() only tries to give a human-understandable representation of what the query searches for and doesn't guarantee that it can be parsed again and would give the same query. We don't provide tools to serialize queries but since query parsing is usually lightweight

Re: getNumericDocValues

2013-07-29 Thread Adrien Grand
Hi, On Mon, Jul 29, 2013 at 4:56 PM, Yonghui Zhao zhaoyong...@gmail.com wrote: I want to know what will be returned if the input docID is not a valid id, for examples: 1. the docID beyonds the reader scope In that case, the behavior is not defined, it might throw an exception or return a

Re: Cache Field Lucene 3.6.0

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 4:09 PM, andi rexha a_re...@hotmail.com wrote: Hi, I have a stored and tokenized field, and I want to cache all the field values. I have one document in the index, with the field.value = hello world and with tokens = hello, world. I try to extract the

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 5:34 PM, Robert Muir rcm...@gmail.com wrote: I'm not sure if there is a similar one for vectors. There is, it has been done for stored fields and term vectors at the same time[1]. [1] https://issues.apache.org/jira/browse/LUCENE-4928 -- Adrien

Re: sorting with lucene 4.3

2013-07-30 Thread Adrien Grand
Hi, On Tue, Jul 30, 2013 at 8:19 PM, Nicolas Guyot sfni...@gmail.com wrote: When sorting numerically, the search seems to take a bit of a while compared to the lexically sorted search. Also when sorting numerically the result is sorted within each page but not globally as opposed to the

Re: Files greater than 20 MB not getting Indexed. No files generated except write.lock even after 8-9 minutes.

2013-08-30 Thread Adrien Grand
Ankit, The stack traces you are showing only say there was an out of memory error. In such cases, the stack trace is unfortunately not always helpful since the allocation may fail on a small object because another object is taking all the memory of the JVM. Can you come up with a small piece of

Re: Optimize Lucene 4.4 for CPU usage

2013-08-31 Thread Adrien Grand
Hi, On Sat, Aug 31, 2013 at 6:55 AM, Rose, Stuart J stuart.r...@pnnl.gov wrote: I've noticed that processes that were previously IO bound (in 3.5) are now CPU bound (in 4.4) and I expect it is due to the compression/decompression of term vector fields in 4.4. It would be nice if users of

Re: Making lucene indexing multi threaded

2013-09-02 Thread Adrien Grand
Hi, Lucene's IndexWriter can safely accept updates coming from several threads, just make sure to share the same IndexWriter instance across all threads, no external locking is necessary. 30 minutes sounds like a lot for 3 files unless they are large. You can have a look at

Re: Lucene handling of duplicate terms

2013-09-05 Thread Adrien Grand
Hi, On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson k...@spotify.com wrote: I have a use case where some of my documents have duplicate terms in various fields or within the same field. For an example, I may have a million documents with just the term foo in field A, and one particular

Re: Strange performance of Lucene 4.4.0

2013-09-10 Thread Adrien Grand
Sort.INDEXORDER just lets you know about matching documents while by default a score is computed and Lucene selects the top N matching documents from your index. On Mon, Sep 9, 2013 at 7:33 PM, Mirko Sertic mirko.ser...@web.de wrote: Ok, using Sort.INDEXORDER for default sorting is blazing fast.

Re: A question about seek past EOF: MMapIndexInput

2013-09-18 Thread Adrien Grand
Hi, This means that there is either a bug in Lucene or that your index is corrupted. Can you reproduce this failure if you reindex data? The output of CheckIndex would be interesting as well, see

Re: Position problems in 4.3.0

2013-09-18 Thread Adrien Grand
Hi, This looks bad! Can you write a small test case that reproduces the issue so that we can try to understand what happens here? Thanks! -- Adrien

Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi, Since Lucene 4.0 which introduced codecs, it is not possible anymore to know based on filename extensions whether files have been created by Lucene or not: every codec is free to use any file extension. On Wed, Sep 18, 2013 at 1:03 PM, Yonghui Zhao zhaoyong...@gmail.com wrote: In lucene

Re: How to modify the Lucene 4 index?

2013-09-18 Thread Adrien Grand
Hi, Are you talking about updating the content of the index or customizing the file formats of the index? On Tue, Sep 17, 2013 at 11:31 PM, Ralf Bierig ralf.bie...@gmail.com wrote: Hi all, is there any good documentation of how to change and modify the index of Lucene version 4 other than

Re: IndexFileNameFilter

2013-09-18 Thread Adrien Grand
Hi, On Wed, Sep 18, 2013 at 1:39 PM, Yonghui Zhao zhaoyong...@gmail.com wrote: Got it. Currently I don't use any custom codecs. Part of the problem is that even the current codec keeps evolving, and file extensions that exist today might not be used anymore in 6 months and vice-versa. I would

Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Adrien Grand
Hi Benson, On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies ben...@basistech.com wrote: The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce the number of segments? If the segment count goes way down, is there some other way to exploit multiple

[ANNOUNCE] Apache Lucene 4.5 released

2013-10-05 Thread Adrien Grand
. If that is the case, please try another mirror. This also goes for Maven access. -- Adrien Grand

Re: optimal way to access many TermVectors

2013-10-08 Thread Adrien Grand
Hi, On Mon, Oct 7, 2013 at 9:31 PM, Rose, Stuart J stuart.r...@pnnl.gov wrote: Is there an optimal way to access many document TermVectors (in the same chunk) consecutively when using the LZ4 termvector compression? I'm curious to know whether all TermVectors in a single compressed chunk are

Re: external file stored field codec

2013-10-11 Thread Adrien Grand
On Fri, Oct 11, 2013 at 7:03 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: I've been running some tests comparing storing large fields (documents, say 100K .. 10M) as files vs. storing them in Lucene as stored fields. Initial results seem to indicate storing them externally is a

Re: external file stored field codec

2013-10-13 Thread Adrien Grand
Hi Michael, I'm not aware enough of operating system internals to know what exactly happens when a file is open but it sounds to me like having separate files per document or field adds levels of indirection when loading stored fields, so I would be surprised if it actually proved to be more

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-23 Thread Adrien Grand
Hi Stephen, On Wed, Oct 23, 2013 at 9:29 AM, Stephen GRAY stephen.g...@immi.gov.au wrote: UNOFFICIAL Hi everyone, I have a question about how to retrieve the values in a NumericDocValuesField. I understand how to do this in situations where you have an AtomicReaderContext available

Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Adrien Grand
Hi, On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan bas...@gmail.com wrote: Sorting is not an option for our case so we will most likely implement a variant that merges the segments in one pass. Using TimSort is great but in our case the 2 segments will be highly interspersed and would not

Re: Retrieving values for a NumericDocValuesField [SEC=UNOFFICIAL]

2013-10-24 Thread Adrien Grand
Hi Stephen, On Thu, Oct 24, 2013 at 1:18 AM, Stephen GRAY stephen.g...@immi.gov.au wrote: I actually need to loop through a large number of documents (50,000 - 100,000) calculating a number of statistics (min, max, sum) so I really need the most efficient/fastest solution available. It

Re: Merging ordered segments without re-sorting.

2013-10-24 Thread Adrien Grand
Hi, On Thu, Oct 24, 2013 at 12:20 AM, Arvind Kalyan bas...@gmail.com wrote: I will benchmark the available approach itself then, in that case. Will revert back if the performance is unacceptable. For the record, last time I checked, indexing was 2x slower on average on a 10M document

Re: Lucene doubt

2014-02-17 Thread Adrien Grand
Hi Pedro, Lucene indeed supports indexing data from several threads into a single IndexWriter instance, and it will make use of all your I/O and CPU. You can learn more about how it works at http://blog.trifork.com/2011/05/03/lucene-indexing-gains-concurrency/ On Mon, Feb 17, 2014 at 3:54 PM,

Re: Stored fields and OS file caching

2014-04-04 Thread Adrien Grand
Hi Vitaly, Doc values are indeed well-suited for grouping and sorting. However stored fields remain better at returning field values to users since they guarantee a worst-case of one disk seek per document. The filesystem cache typically caches data by blocks of 4KB. This plays more nicely with

Re: Performance issues with the default field compression

2014-04-09 Thread Adrien Grand
Hi Alex, Indeed, one or several (the number depends on the size of your documents) documents need to be fully decompressed in order to read a single field of a single document. Regarding the stored fields visitor, the default one doesn't return STOP when the field has been found because other

Re: Reading a v2 index in v4

2014-06-09 Thread Adrien Grand
Hi, It is not possible to read 2.x indices from Lucene 4, even with a custom codec. For instance, Lucene 4 needs to hook into SegmentInfos.read to detect old 3.x indices and force the use of the Lucene3x codec since these indices don't expose what codec has been used to write them. On Mon, Jun

Re: EarlyTerminatingSortingCollector help needed..

2014-06-21 Thread Adrien Grand
Hi Ravikumar, On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: If my numDocsToCollect = 50 and no.of. segments = 15, then collector.collect() will be called 750 times. That is the worst-case indeed. However if some of your segments have less than
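The 750 figure above is simply 50 x 15. A rough way to reason about the general case, assuming that each segment stops collecting once numDocsToCollect documents have been collected from it (my reading of the worst case discussed here, not a statement about the collector's exact implementation), is to sum min(matches in segment, numDocsToCollect) over all segments. The class and method names below are hypothetical.

```java
final class CollectCalls {
    /** Upper bound on collect() calls when every segment stops after numDocsToCollect hits. */
    static long upperBound(int[] matchesPerSegment, int numDocsToCollect) {
        long calls = 0;
        for (int matches : matchesPerSegment) {
            // A segment with fewer matches than the cut-off contributes only its matches.
            calls += Math.min(matches, numDocsToCollect);
        }
        return calls;
    }

    public static void main(String[] args) {
        int[] matches = new int[15];
        java.util.Arrays.fill(matches, 1000); // every segment has plenty of matches
        System.out.println(upperBound(matches, 50));
    }
}
```

With 15 segments that each hold at least 50 matches the bound is 750; any segment with fewer matches lowers it.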

Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Adrien Grand
On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: For a normal sorting-query, on a top-level searcher, I execute TopDocs docs = searcher.search(query, 50, sortField) Then I can issue reader.document() for final list of exactly 50 docs, which

Re: EarlyTerminatingSortingCollector help needed..

2014-06-25 Thread Adrien Grand
On Mon, Jun 23, 2014 at 3:56 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Yes, we can get the top-50 docs finally. I am not denying that. I will probably re-phrase my question. Apologize if I am not clear How do we ensure global sort-order during search across all

Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Adrien Grand
Hi Piotr, On Mon, Dec 15, 2014 at 9:43 PM, Piotr Idzikowski piotridzikow...@gmail.com wrote: Hello. I am going to switch to newest (4.10.2) version of Lucene and I'd like to make some optimization in my index and code. I would like to use DocValuesField to get values but also for filtering

Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Adrien Grand
On Tue, Dec 16, 2014 at 3:25 PM, Piotr Idzikowski piotridzikow...@gmail.com wrote: So for instance if I store documents with e.g. a creation date and I have data (millions of documents) from the last, let's say, 3 years and I'd like to do a range filter to get docs from some month only, is it better to

Re: Throwing CollectionTerminatedException from Collector.getLeafCollector

2015-03-02 Thread Adrien Grand
Hi András, It feels useful to me too, I think we should document this behaviour. For the record, this other issue has just been opened and mentions this problem: https://issues.apache.org/jira/browse/LUCENE-6326. On Mon, Mar 2, 2015 at 1:17 PM, András Péteri apet...@b2international.com wrote: Hi,
