Re: Collector is collecting more than the specified hits

2014-02-18 Thread saisantoshi
The above works fine, but how do I get the state of the *last docID*? Also, multiple users will be accessing this, and we need to maintain the integrity of the last docID. Can we learn the last docID from the collector's collect call? Thanks, Ranjith. -- View this message in context:

Re: Collector is collecting more than the specified hits

2014-02-18 Thread saisantoshi
There might be an issue with the approach below: the saved docID might be deleted before the next call to search, and I am not sure whether that breaks the search functionality. Thanks, Ranjith. -- View this message in context:

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
The collector is collecting all the documents. Let's say I have 50k documents and I want the collector to give me the results for a given start and maxHits. Can we get this functionality from Lucene? For example, the very first time I want to collect hits 0-100, and the next time I want to collect from
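One common pattern for the start/maxHits paging asked about here is to keep the best start + maxHits hits and then return only the slice that begins at start. In Lucene this roughly corresponds to creating the collector with start + maxHits and calling topDocs(start, maxHits) on it. The sketch below models only the slicing logic in plain Java, with no Lucene dependency; the class PageSlice and its method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: keeps the best (start + maxHits)
// scores seen so far, then returns only the requested page.
final class PageSlice {
    static List<Float> page(float[] scores, int start, int maxHits) {
        List<Float> page = new ArrayList<>();
        if (start < 0 || maxHits <= 0) return page;
        int capacity = start + maxHits;
        // Min-heap: the weakest retained score sits at the head and is evicted first.
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (float s : scores) {
            if (heap.size() < capacity) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll();
                heap.add(s);
            }
        }
        // Order the retained hits best-first, then skip the first 'start' of them.
        List<Float> best = new ArrayList<>(heap);
        best.sort(Collections.reverseOrder());
        for (int i = start; i < best.size() && page.size() < maxHits; i++) {
            page.add(best.get(i));
        }
        return page;
    }
}
```

Note that the cost grows with start + maxHits, so deep pages are more expensive than the first one.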

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
Could you please elaborate on the above? I am not sure if the collector is already doing it or do I need to call any other API? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329p4117883.html Sent

Re: Collector is collecting more than the specified hits

2014-02-17 Thread saisantoshi
As I mentioned in my original post, I am calling like the below: MyCollector collector; TopScoreDocCollector topScore = TopScoreDocCollector.create(firstIndex+numHits, true); IndexSearcher searcher = new IndexSearcher(reader); try {

Re: Collector is collecting more than the specified hits

2014-02-14 Thread saisantoshi
I am not interested in the scores at all. My requirement is simple: I only need the first 100 hits, or the numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting the
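In Lucene 4.0 a collector's collect() is invoked for every matching document, so a collector does not stop on its own. One workaround that has been used is to throw an unchecked exception from collect() once enough hits are in and to catch it around the search call. The sketch below models that idea in plain Java; EarlyStop, EnoughHits, and LimitedCollector are hypothetical stand-ins, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for a search loop and a collector; not Lucene classes.
final class EarlyStop {
    // Unchecked signal used to abandon the search loop once enough hits are in.
    static final class EnoughHits extends RuntimeException {}

    static final class LimitedCollector {
        final int numHits;
        final List<Integer> docs = new ArrayList<>();

        LimitedCollector(int numHits) { this.numHits = numHits; }

        void collect(int doc) {
            if (docs.size() >= numHits) throw new EnoughHits();
            docs.add(doc);
        }
    }

    // Feeds matching doc IDs to the collector and stops when it signals it is done.
    static List<Integer> search(int[] matchingDocs, int numHits) {
        LimitedCollector collector = new LimitedCollector(numHits);
        try {
            for (int doc : matchingDocs) collector.collect(doc);
        } catch (EnoughHits done) {
            // Expected: the collector already holds numHits documents.
        }
        return collector.docs;
    }
}
```

The hits kept this way are simply the first numHits in iteration order, which matches the "irrespective of scores" requirement but gives no ranking guarantee.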

Collector is collecting more than the specified hits

2014-02-13 Thread saisantoshi
The problem with the collector below is that the collect method does not stop after the numHits count has been reached. Is there a way to stop the collector from collecting docs after it has reached the specified numHits? For example: * TopScoreDocCollector topScore = TopScoreDocCollector.create(numHits,

Lucene 4.0 chokes on multiple requests

2014-02-05 Thread saisantoshi
We recently upgraded to Lucene 4.0 and found performance issues when searching. Upon some analysis, we found that it chokes when multiple requests come in for Lucene search. User1 - Search User2 - search User3 - search The search request done by User Search1 is still waiting

Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
I have created strings like the below searchtext +sampletext and when I try to search the following using *** or *+** it does not give any result. I am using the QueryParser.escape(String s) method to handle the special characters, but it does not look like it did anything. Also, when I search
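Escaping only tells the query parser to treat syntax characters literally; it does not change what the analyzer later does with them. The sketch below is a plain-Java approximation of that escaping step. The class QueryEscape is hypothetical, and the character set is an assumption that should be checked against the QueryParser version in use.

```java
// Hypothetical approximation of query-syntax escaping; the character set below
// is an assumption and should be checked against the QueryParser in use.
final class QueryEscape {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every query-syntax character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Even when the escaped text parses cleanly, an analyzer that strips punctuation at index time leaves nothing for a literal + or * to match, which would be consistent with the empty results described here.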

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
StandardAnalyzer both at index and search time. We use the default one and don't have any custom analyzers. Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674p4096710.html Sent from the Lucene - Java Users

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
Thanks. So, if I understand correctly, StandardAnalyzer won't work for the case below, as it strips out the special characters and searches only on searchText (in this case). queryText = *searchText* If we want to do a search like *** then we need to use WhitespaceAnalyzer. Please let me

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread saisantoshi
What about other characters, such as the ' (quote) character? We have a requirement that a text can start with 'sampletext', and when I search with '* it does not return any results, but when I search with sample* it does return the result. Thanks, Ranjith, -- View this message in context:

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-03-08 Thread saisantoshi
Could someone please comment on the above? Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4045855.html Sent from the Lucene - Java Users mailing list archive at

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-03-06 Thread saisantoshi
Thanks for the response and really appreciate your help. I have read the documentation but could not get it in the first read as I was new to Lucene. I have changed it to AtomicReader and it seems to be working fine. One last clarification is do we also need to use AtomicReader for the following

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-28 Thread saisantoshi
Could someone please comment on the above code snippet? Also, one observation is that our search results are not consistent depending on whether we use *IndexReader vs AtomicReader*. Could this be a problem? Thanks, Sai. -- View this message in context:

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-28 Thread saisantoshi
Thanks a lot. Really appreciate your help here. I have read through the document and understand that the IndexReader uses sub-readers (to look into the index files) and AtomicReader does not. But how does this affect search? I think search results should be consistent

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
I want to get the Document in the code below, and that's why I need an indexReader: public void collect(int doc) throws IOException { // ADD YOUR CUSTOM LOGIC HERE *Document doc = indexReader.document(doc)* delegate.collect(doc); } But this seems to be the problem as the
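In Lucene 4.0 the doc passed to collect() is relative to the current segment, and the collector learns each segment's offset (docBase) through setNextReader(AtomicReaderContext). Passing the segment-relative ID straight to a top-level reader can therefore fetch the wrong document. The sketch below models the ID arithmetic in plain Java; DocBaseDemo, GlobalIdCollector, and setNextSegment are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-segment collection; not Lucene classes.
final class DocBaseDemo {
    static final class GlobalIdCollector {
        private int docBase; // offset of the current segment within the whole index
        final List<Integer> globalIds = new ArrayList<>();

        // Called once per segment before any collect() call for that segment.
        void setNextSegment(int docBase) { this.docBase = docBase; }

        // 'doc' is relative to the current segment; adding docBase gives the index-wide ID.
        void collect(int doc) { globalIds.add(docBase + doc); }
    }

    // segmentSizes[i] is the number of documents in segment i;
    // hits[i] lists the segment-relative IDs that matched in segment i.
    static List<Integer> run(int[] segmentSizes, int[][] hits) {
        GlobalIdCollector collector = new GlobalIdCollector();
        int base = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextSegment(base);
            for (int doc : hits[seg]) collector.collect(doc);
            base += segmentSizes[seg];
        }
        return collector.globalIds;
    }
}
```

With a single-segment index the offset is always 0, which may explain why code that ignores docBase can appear to work until the index grows a second segment.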

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
Thanks. Is there any issue with the way we are calling indexReader.getDocument(doc)? I am not sure how to get an AtomicReaderContext in the method below. Any pointers on how to get that instance are appreciated. public void collect(int doc) throws IOException { // ADD YOUR CUSTOM

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-02-27 Thread saisantoshi
Here is how I am using it: public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector(IndexReader indexReader, PositiveScoresOnlyCollector topScore) { super(topScore); this.indexReader = indexReader;

Boolean Query not working in Lucene 4.0

2013-02-26 Thread saisantoshi
The following query does not seem to work after we upgraded from 2.4 to 4.0: *+type:sometype +title:sometitle** Any ideas about where to look? Is the above query syntactically correct? I would appreciate your advice. Thanks, Sai. -- View this message in

IndexSearcher.close() removed in 4.0

2013-02-18 Thread saisantoshi
I understand from the JIRA ticket (LUCENE-3640) that IndexSearcher.close() is a no-op, but it is not clear to me why. Could someone shed some light on this? We were using this method in older versions; is it safe to remove the call now? Just want to understand the

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-02-05 Thread saisantoshi
I am looking at the versions supported by newer version of Tika (1.3) and was not sure what version(s) of the Microsoft office it supports (97/2000/2010/2013) for each of the below? http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats Microsoft word (also does it support

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-02-01 Thread saisantoshi
Are you closing or committing your IndexWriter after each added document? Because if you add 100 docs you should not see 100 versions of these files, only one set of files in the end (many docs are written to one segment). No. What I meant to say here is if 100 users have updated the document

Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread saisantoshi
I am using the TieredMergePolicy and using the compound index: TieredMergePolicy mergePolicy = new TieredMergePolicy(); indexWriterConfig.setMergePolicy(mergePolicy.setNoCFSRatio(1.0d)); Prior to 4.0, there was an optimize() in the IndexWriter which was merging the index files. Is there any

Re: Example settings for TieredMergePolicy : Lucene 4.0

2013-02-01 Thread saisantoshi
Thanks. I read this (and also tried it out in my code) and understand that forceMerge(1) is not advisable for performance reasons. My question is: if we don't have a way to compress these files, indexing will produce an enormous number of files, which will lead to file system issues (such as

IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
I am using the following to create the IndexWriter (for my indexing): IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, new LimitTokenCountAnalyzer(analyzer, MAX_FIELD_SCAN_LENGTH)); if (create) { // create will be true for indexing

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
It's _0.si ( typo) For second update, create = false. Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766p4037785.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

2013-01-31 Thread saisantoshi
Is it by design? The older API (2.4) does not have this problem. Let's say I have 100 updates or so; then it will create 100 versions of those files in the index. This would increase the number of files in the index directory and might cause file system issues. It would be good to just have

Re: List of files that Lucene 4.0 generates during indexing

2013-01-30 Thread saisantoshi
The following files are originally created files (upon an initial indexing): _0.fdt _0.fdx _0.fnm _0.si _0_Lucene40_0.frq _0_Lucene40_0.prx _0_Lucene40_0.tim _0_Lucene40_0.tip _0_nrm.cfe _0_nrm.cfs index.v0008

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread saisantoshi
We are not using Solr, just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from pdf/word/excel documents. I am mainly interested in reading the contents of the documents and indexing them using Lucene. My question here is, is Tika

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread saisantoshi
Thanks a lot. Can we wrap TopScoreDocCollector in PositiveScoresOnlyCollector? I need only positive scores, and I don't think TopScoreDocCollector can handle that by itself, right? Thanks, Sai -- View this message in context:

Re: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-25 Thread saisantoshi
I am not looking for negative scores and want to skip it. Thanks, Sai -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4036378.html Sent from the Lucene - Java Users mailing

Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-25 Thread saisantoshi
I want to index document content (such as PDF/Word/Excel) and am wondering if there are any good readers that I can integrate with Lucene 4.0. Any pointers or example code are appreciated. The Lucene in Action book mentions Tika as the library to use, but I am not sure if this is the preferred

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. I checked it out. Here is the list of files that have been generated: _0.fdt _0.fdx _0.fnm _0.si _0_Lucene40_0.frq _0_Lucene40_0.prx _0_Lucene40_0.tim _0_Lucene40_0.tip _0_nrm.cfe _0_nrm.cfs

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks Michael. The additional file in the list is just a typo. One more question: we were using 2.4 before, and it generated only a few files _0.cfs _0.cfx // segment files I am assuming that the 2.4 version has the compound index structure enabled by default. Do we need to set it explicitly

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. Are there any best practices to follow here, or should we leave the default (which is the hybrid approach, as you mentioned)? -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-files-that-Lucene-4-0-generates-during-indexing-tp4035993p4036086.html Sent from the Lucene -

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks a lot. One last question, how do we set it? IndexWriter.??? Thanks, Ranjith. -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-files-that-Lucene-4-0-generates-during-indexing-tp4035993p4036091.html Sent from the Lucene - Java Users mailing list archive at

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-24 Thread saisantoshi
Can someone please help us here to validate the above? Thanks, Sai. -- View this message in context: http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-tp4035806p4036093.html Sent from the Lucene - Java Users mailing

Re: List of files that Lucene 4.0 generates during indexing

2013-01-24 Thread saisantoshi
Thanks. Could you please also comment on the following as well? http://lucene.472066.n3.nabble.com/TopDocCollector-vs-TopScoreDocCollector-semantics-changed-in-4-0-not-backward-comptabile-td4035806.html Thanks and really appreciate your help. Thanks, Sai. -- View this message in context:

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-23 Thread saisantoshi
Thanks. We decided to delete the existing index directories and recreate it once we upgrade to 4.0 (unless we hit any major api blockers during compilation, we will prefer to go to 3.6.2 first and then later to 4.0). Thanks, Sai. -- View this message in context:

IndexWriter.optimize() is removed in 4.0?

2013-01-23 Thread saisantoshi
There is no optimize() method in 4.0. I looked at the 3.6 docs, which mention the following. Does it mean that we no longer need this method and should not use it anymore? Is there a replacement method that we need to use, since it is deprecated as of version 3.6.0? /*

TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
Our current search implementation (based on 2.4.0) uses a collector extending the TopDocCollector class public class MyHitCollector extends TopDocsCollector { private IndexReader indexReader; private CustomFilter customFilter; public MyHitCollector (IndexReader indexReader, int

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
I am sorry, but I am confused looking at the change logs and the enhancements done, since we are jumping from 2.4 to 4.0. Could you please point me to any example code that extends one of the new collectors? That would help a lot. It would also be great if you could give some pointers on how we can

Extending TopScoreDocCollector to write a custom collector

2013-01-23 Thread saisantoshi
I would like to write a custom collector ( similar to the one which is inside the source of TopScoreDocCollector like InOrderTopScoreDocCollector). The reason for extending this is because InOrderTopScoreDocCollector and OutOfOrderTopScoreDocCollector are private to the class and I really wanted

Re: Extending TopScoreDocCollector to write a custom collector

2013-01-23 Thread saisantoshi
Here is the way I implemented a collector class. Appreciate if you could let me know of any issues.. public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector (IndexReader indexReader,PositiveScoresOnlyCollector

RE: TopDocCollector vs TopScoreDocCollector (semantics changed in 4.0, not backward comptabile)

2013-01-23 Thread saisantoshi
Here is the way I implemented a collector class. Appreciate if you could let me know of any issues.. public class MyCollector extends PositiveScoresOnlyCollector { private IndexReader indexReader; public MyCollector (IndexReader indexReader,PositiveScoresOnlyCollector

IndexSearcher.search(Weight weight, Filter filter, HitCollector results) is not there in 4.0 version

2013-01-22 Thread saisantoshi
We are using the following method with Lucene 2.4.0: public void search(Weight weight, Filter filter, HitCollector results) throws IOException We are upgrading to the latest version and, looking at the API (4.0), the above signature has been

RE: IndexSearcher.search(Weight weight, Filter filter, HitCollector results) is not there in 4.0 version

2013-01-22 Thread saisantoshi
Thanks. Can we use the following method in 4.0 as a replacement for the above method? However, we will rewrite this to use FilteredQuery later but don't want to refactor a lot. public void search(Query query, Filter filter, Collector results) throws IOException

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-22 Thread saisantoshi
We are upgrading from 2.4 to 4.0. What are the options here (delete the existing index directories and reindex with the upgraded version)? We don't want to take any intermediate steps that would mean more work to upgrade to the latest version again. Thanks, Sai. -- View this message in

RE: Are Search Index directories backward comptabile? ( when upgrading to latest lucene version)

2013-01-22 Thread saisantoshi
Also, I am not sure about the following statement: A direct update from 2.x to 4.0 is not possible Are you saying that it's impossible to upgrade from 2.4 to 4.0? Why can't we upgrade? Are there technical limitations in Lucene that prevent upgrading from 2.4 to 4.x? I am

StandardAnalyzer: Support for Japanese

2013-01-10 Thread saisantoshi
We are using StandardAnalyzer for indexing some Japanese Keywords. It works fine so far but just wanted to confirm if the StandardAnalyzer can fully support it ( I have read somewhere in Lucene In Action book, that StandardAnalyzer does support CJK). Just want to confirm if my understanding is

Re: Field.Store.YES vs Field.Store.NO

2013-01-10 Thread saisantoshi
Not sure what the following means: using Field.Store.NO the field itself is definitely searchable. You will not be able to retrieve the field value itself For example, if we have a file that we upload using some keywords and if the keyword (is of type Field.Store.NO but is analyzed)

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-10 Thread saisantoshi
Thanks for all the responses. Apart from the API changes, is there any major functionality change from 2.4.0 - 4.x version. I know we need to modify the API to the latest version but just curious if we need to be aware of any functional changes so as to do more thorough testing? Thanks, Sai.

Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
We have an existing application that uses Lucene 2.4.0. We are thinking of upgrading it to the latest version (4.0). I am not sure of the process involved in upgrading. Is it just copying the jars? If yes, which jars do we need to copy over? Will it be backward

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
Thanks. Could you please elaborate on what is needed other than replacing the jars? Are the jars listed the only jars, or are any additional jars required? Is the API not backward compatible? I mean, are the API calls we are using in 2.4.0 not supported by 4.0? Has the signature

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread saisantoshi
Are there any best practices that we can follow? We want to get to the latest version and are wondering if we can go directly from 2.4.0 to 4.x (as opposed to 2.x - 3.x and 3.x - 4.x), so that it saves not only time but also a testing cycle at each migration hop. Are there any limitations in

Lucene support for multi byte characters : 2.4.0 (version).

2013-01-08 Thread saisantoshi
We are using Lucene (2.4.0 libraries) for implementing search in our application. We are using Standard Analyzer for Analyzer part. Our application has a documents upload feature which lets you upload the documents and be able to put in some keywords (while uploading it). When we search (using

Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread saisantoshi
Does the Lucene StandardAnalyzer work for all languages when tokenizing before indexing (since we are using Java, I think the content is converted to UTF-8 before tokenizing/indexing)? Or do we need to use special analyzers for each language? In this case, if a document has a mixed case (