Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Ian Lea
Thanks for the response. https://issues.apache.org/jira/browse/LUCENE-10314

Will we still be able to decide, maybe years down the line, that we do want
to search on fieldX after all, and be able to change the code and reindex
the maybe small proportion of documents that have a value for fieldX
without having to create a new index from scratch?  That happens.


--
Ian.


On Tue, Dec 14, 2021 at 1:54 PM Adrien Grand  wrote:

> This looks related to the new changes around schema validation. Lucene
> now requires a field to either be absent from a document or be indexed
> with the exact same options (index options, points dimensions, norms,
> doc values type, etc.) as already indexed documents that also have
> this field.
>
> However it's a bug that Lucene fails to open an index that was legal
> in Lucene 8. Can you file a JIRA issue?
>
> On Mon, Dec 13, 2021 at 4:23 PM Ian Lea  wrote:
> >
> > Hi
> >
> >
> > We have a long-standing index with some mandatory fields and some
> optional
> > fields that has been through multiple lucene upgrades without a full
> > rebuild and on testing out an upgrade from version 8.11.0 to 9.0.0, when
> > opening an IndexWriter we are hitting the exception
> >
> > Exception in thread "main" java.lang.IllegalArgumentException: cannot
> > change field "language" from index options=NONE to inconsistent index
> > options=DOCS
> > at
> >
> org.apache.lucene.index.FieldInfo.verifySameIndexOptions(FieldInfo.java:245)
> > at
> >
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:421)
> > at
> >
> org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(FieldInfos.java:357)
> > at
> >
> org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1263)
> > at
> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1116)
> >
> > Where language is one of our optional fields.
> >
> > Presumably this is at least somewhat related to "Index options can no
> > longer be changed dynamically" as mentioned at
> > https://lucene.apache.org/core/9_0_0/MIGRATE.html although it fails
> before
> > our code attempts to update the index, and we are not trying to change
> any
> > index options.
> >
> > Adding some debug output to IndexWriter and FieldInfos, and logging
> > rather than throwing the exception, I see
> >
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >
> > where there is one line per segment.  It logs the exception whenever
> > other=DOCS.  Subset with segment info:
> >
> > segment _x8(8.2.0):c31753/-1:[diagnostics={timestamp=1565623850605,
> > lucene.version=8.2.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> > mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> > java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> > java.runtime.version=11.0.3+7,
> > os=Linux}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}]
> >
> >  language curr=NONE, other=NONE
> >
> > segment _y9(8.7.0):c43531/-1:[diagnostics={timestamp=1604597581562,
> > lucene.version=8.7.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> > mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> > java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> > java.runtime.version=11.0.3+7,
> > os=Linux}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_SPEED}]
> >
> >  language curr=NONE, other=DOCS
> >
> > NOT throwing java.lang.IllegalArgumentException: cannot change field
> > "language" from index options=NONE to inconsistent index options=DOCS
> >
> >
> > Some variation on an old-fashioned "not set" versus "not present" bug
> > perhaps?
> >
> >
> > --
> > Ian.
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Lucene 9.0.0 inconsistent index options

2021-12-13 Thread Ian Lea
Hi


We have a long-standing index with some mandatory fields and some optional
fields that has been through multiple lucene upgrades without a full
rebuild and on testing out an upgrade from version 8.11.0 to 9.0.0, when
opening an IndexWriter we are hitting the exception

Exception in thread "main" java.lang.IllegalArgumentException: cannot
change field "language" from index options=NONE to inconsistent index
options=DOCS
at
org.apache.lucene.index.FieldInfo.verifySameIndexOptions(FieldInfo.java:245)
at
org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:421)
at
org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(FieldInfos.java:357)
at
org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1263)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1116)

Where language is one of our optional fields.

Presumably this is at least somewhat related to "Index options can no
longer be changed dynamically" as mentioned at
https://lucene.apache.org/core/9_0_0/MIGRATE.html although it fails before
our code attempts to update the index, and we are not trying to change any
index options.
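
For what it's worth, the new check itself is easy to reproduce in
isolation (a minimal sketch with made-up field values, not our real
code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.ByteBuffersDirectory;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        IndexWriter w = new IndexWriter(new ByteBuffersDirectory(),
            new IndexWriterConfig(new StandardAnalyzer()));

        Document d1 = new Document();
        d1.add(new StringField("language", "en", Field.Store.YES)); // index options=DOCS
        w.addDocument(d1);

        Document d2 = new Document();
        d2.add(new StoredField("language", "fr")); // stored only: index options=NONE
        w.addDocument(d2); // throws IllegalArgumentException: cannot change
                           // field "language" from index options=DOCS to
                           // inconsistent index options=NONE
    }
}

But our index was built legally under 8.x, and each document either has
the field or omits it, so the failure at open time is surprising.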

Adding some debug output to IndexWriter and FieldInfos, and logging
rather than throwing the exception, I see

 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=DOCS
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=NONE
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS
 language curr=NONE, other=DOCS

where there is one line per segment.  It logs the exception whenever
other=DOCS.  Subset with segment info:

segment _x8(8.2.0):c31753/-1:[diagnostics={timestamp=1565623850605,
lucene.version=8.2.0, java.vm.version=11.0.3+7, java.version=11.0.3,
mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
java.runtime.version=11.0.3+7,
os=Linux}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}]

 language curr=NONE, other=NONE

segment _y9(8.7.0):c43531/-1:[diagnostics={timestamp=1604597581562,
lucene.version=8.7.0, java.vm.version=11.0.3+7, java.version=11.0.3,
mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
java.runtime.version=11.0.3+7,
os=Linux}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_SPEED}]

 language curr=NONE, other=DOCS

NOT throwing java.lang.IllegalArgumentException: cannot change field
"language" from index options=NONE to inconsistent index options=DOCS


Some variation on an old-fashioned "not set" versus "not present" bug perhaps?


--
Ian.


Re: [VOTE] Lucene logo contest

2020-06-18 Thread Ian Lea
A.  Non-PMC.


--
Ian.


On Wed, Jun 17, 2020 at 1:28 PM jim ferenczi  wrote:

> I vote option A (PMC vote)
>
> Le mer. 17 juin 2020 à 14:24, Felix Kirchner <
> felix.kirch...@uni-wuerzburg.de> a écrit :
>
> > A
> >
> > non-PMC
> >
> > Am 16.06.2020 um 00:08 schrieb Ryan Ernst:
> > > Dear Lucene and Solr developers!
> > >
> > > In February a contest was started to design a new logo for Lucene [1].
> > That
> > > contest concluded, and I am now (admittedly a little late!) calling a
> > vote.
> > >
> > > The entries are labeled as follows:
> > >
> > > A. Submitted by Dustin Haver [2]
> > >
> > > B. Submitted by Stamatis Zampetakis [3] Note that this has several
> > > variants. Within the linked entry there are 7 patterns and 7 color
> > > palettes. Any vote for B should contain the pattern number, like B1 or
> > B3.
> > > If a B variant wins, we will have a followup vote on the color palette.
> > >
> > > C. The current Lucene logo [4]
> > >
> > > Please vote for one of the three (or nine depending on your
> perspective!)
> > > above choices. Note that anyone in the Lucene+Solr community is invited
> > to
> > > express their opinion, though only Lucene+Solr PMC cast binding votes
> > > (indicate non-binding votes in your reply, please). This vote will
> close
> > > one week from today, Mon, June 22, 2020.
> > >
> > > Thanks!
> > >
> > > [1] https://issues.apache.org/jira/browse/LUCENE-9221
> > > [2]
> > >
> >
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> > > [3]
> > >
> >
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> > > [4]
> > https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
> > >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: lucene Input and Output format

2017-08-02 Thread Ian Lea
What are the full package names for these interfaces?  I don't think they
are org.apache.lucene.


--
Ian.


On Wed, Aug 2, 2017 at 9:00 AM, Ranganath B N 
wrote:

> Hi,
>
>   It's not about the file formats. Rather It is about LuceneInputFormat
> and LuceneOutputFormat interfaces which deals with getsplit(),
> getRecordReader() and getRecordWriter() methods. Are there any
> Implementations for these interfaces?
>
>
> Thanks,
> Ranganath B. N.
>
> -Original Message-
> From: Adrien Grand [mailto:jpou...@gmail.com]
> Sent: Tuesday, August 01, 2017 7:23 PM
> To: java-user@lucene.apache.org
> Cc: Vadiraj Muradi
> Subject: Re: lucene Input and Output format
>
> Which part of the index do you want to learn about? Here are some
> descriptions of the file formats:
>  - terms dict:
> http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/
> codecs/blocktree/BlockTreeTermsWriter.html
>  - postings:
> http://lucene.apache.org/core/6_6_0/core/index.html?org/
> apache/lucene/index/IndexableField.html
>  - doc values:
> http://lucene.apache.org/core/6_6_0/core/index.html?org/
> apache/lucene/index/IndexableField.html
>  - stored fields:
> http://lucene.apache.org/core/6_6_0/core/index.html?org/
> apache/lucene/index/IndexableField.html
>
> Le lun. 31 juil. 2017 à 15:02, Ranganath B N  a
> écrit :
>
> >
> >
> >
> > Hi All,
> >
> >  Can you point me to some of the implementations  of lucene Input
> > and Output format? I wanted to know them to  understand the
> > distributed implementation approach.
> >
> >
> > Thanks,
> > Ranganath B. N.
> >
>


Re: join on two txt files data using apache lucene

2017-07-14 Thread Ian Lea
Looks like your screenshot didn't make it, but never mind: I'm sure we all
know what text files look like.

A join on two ID fields sounds more like SQL database territory rather than
lucene.  Lucene is not an SQL database.  But I typed "lucene join" into a
well known search engine and the top hit was
http://lucene.apache.org/core/6_5_1/join/org/apache/lucene/search/join/package-summary.html
.
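
If you do pursue the join module, JoinUtil is the usual entry point.  A
rough sketch (untested; fromSearcher/toSearcher are assumed to be
IndexSearchers over your two indexes, or the same index):

Query joinQuery = JoinUtil.createJoinQuery(
    "BE_GEO_ID",             // fromField
    false,                   // multipleValuesPerDocument
    "Partner_ID",            // toField
    new MatchAllDocsQuery(), // restrict the "from" side here if needed
    fromSearcher,
    ScoreMode.None);
TopDocs joined = toSearcher.search(joinQuery, 10);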


--
Ian.

On Thu, Jul 13, 2017 at 7:34 PM, Shaik Nizamuddin 
wrote:

> Hi Mike,
>
> I have a use case .
>
> I have 2 .txt files containg 10 millions of record (please find an
> attached screen shot). i want to do inner join using apache lucene.
> there is relationship BE_GEO_ID and Partner_ID. i want to  one single
> file.please reply asap.
>
> Regards
> Nizamuddin
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>


Re: Un-used index files are not getting released

2017-05-09 Thread Ian Lea
Something looks slightly out of sync, with _29ip.cfs shown by $ lsof but
not by $ ls but that could just be by chance of timing, lucene doing its
stuff behind the scenes.

What does $ ls -al show?

What are the 1761 files returned by the java listFiles() call?  Are you
sure there isn't something using your index directory for some non-lucene
purpose?  That's usually best avoided.


--
Ian.


On Mon, May 8, 2017 at 6:38 PM, Siraj Haider <si...@jobdiva.com> wrote:

> Hi Ian,
> We do not open any IndexReader explicitly. We keep one instance on
> IndexWriter open (and never close) and for searching we use
> SearcherManager. I checked the lsof and did not find any files with delete
> status.
>
> Following is the output of lsof for lucene1:
> [lucene@lidxnj39 ~]$ /usr/sbin/lsof | grep lucene1
> java  32332  lucene   71r  REG  8,1        606739  64749587 /lucene1/index/_29ip.cfs
> java  32332  lucene   73r  REG  8,1  191494805022  64749583 /lucene1/index/_2988.cfs
> java  32332  lucene   76r  REG  8,1       1423548  64749585 /lucene1/index/_29io.cfs
> java  32332  lucene   80r  REG  8,1       1204827  64749586 /lucene1/index/_29in.cfs
> java  32332  lucene   81r  REG  8,1       5453524  64749588 /lucene1/index/_29il.cfs
> java  32332  lucene   86r  REG  8,1       5453524  64749588 /lucene1/index/_29il.cfs
> java  32332  lucene   90r  REG  8,1      37530221  64749590 /lucene1/index/_29im.cfs
> java  32332  lucene   92r  REG  8,1       1204827  64749586 /lucene1/index/_29in.cfs
> java  32332  lucene   96r  REG  8,1      37530221  64749590 /lucene1/index/_29im.cfs
> java  32332  lucene  101r  REG  8,1       1423548  64749585 /lucene1/index/_29io.cfs
> java  32332  lucene  111r  REG  8,1      53364009  64749606 /lucene1/index/_29hj.cfs
> java  32332  lucene  114r  REG  8,1      53364009  64749606 /lucene1/index/_29hj.cfs
> java  32332  lucene  117r  REG  8,1  191494805022  64749583 /lucene1/index/_2988.cfs
> java  32332  lucene  119r  REG  8,1     195525434  64749601 /lucene1/index/_29fj.cfs
> java  32332  lucene  139r  REG  8,1     195525434  64749601 /lucene1/index/_29fj.cfs
>
> Following is the directory listing of the folder:
> [lucene@lidxnj39 ~]$ ls -l /lucene1/index/
> total 187294328
> -rw-r--r--. 1 lucene mis 1451 May  8 13:31 _2988_8i.del
> -rw-r--r--. 1 lucene mis 191494805022 May  8 02:10 _2988.cfs
> -rw-r--r--. 1 lucene mis   65 May  8 13:26 _29fj_8.del
> -rw-r--r--. 1 lucene mis195525434 May  8 11:21 _29fj.cfs
> -rw-r--r--. 1 lucene mis   24 May  8 12:50 _29hj_2.del
> -rw-r--r--. 1 lucene mis 53364009 May  8 12:46 _29hj.cfs
> -rw-r--r--. 1 lucene mis  5453524 May  8 13:29 _29il.cfs
> -rw-r--r--. 1 lucene mis 37530221 May  8 13:29 _29im.cfs
> -rw-r--r--. 1 lucene mis  1204827 May  8 13:30 _29in.cfs
> -rw-r--r--. 1 lucene mis  1423548 May  8 13:31 _29io.cfs
> -rw-r--r--. 1 lucene mis 1714 May  8 13:31 segments_2615
> -rw-r--r--. 1 lucene mis   20 May  8 13:31 segments.gen
>
> But when I get the number of files in that index folder using java
> (File.listFiles()) it lists 1761 files in that folder. This count goes down
> to a double digit number when I restart the tomcat.
>
> Thanks for looking into it.
>
> --
> Regards
> -Siraj Haider
> (212) 306-0154
>
> -Original Message-
> From: Ian Lea [mailto:ian@gmail.com]
> Sent: Friday, May 05, 2017 9:33 AM
> To: java-user@lucene.apache.org
> Subject: Re: Un-used index files are not getting released
>
> The most common cause is unclosed index readers.  If you run lsof against
> the tomcat process id and see that some deleted files are still open,
> that's almost certainly the problem.  Then all you have to do is track it
> down in your code.
>
>
> --
> Ian.
>
>
> On Thu, May 4, 2017 at 10:09 PM, Siraj Haider <si...@jobdiva.com> wrote:
>
> > Hi all,
> > We recently switched to Lucene 6.5 from 2.9 and we have an issue that
> > the files in index directory are not getting released after the
> > IndexWriter finishes up writing a batch of documents. We are using
> > IndexFolder.listFiles().length to check the number of files in index
> > folder. We have even tried closing/reopening the
> > IndexWriter/SearcherManager/MMapDirectory after indexing each batch to
> > see if that would release the

Re: Un-used index files are not getting released

2017-05-05 Thread Ian Lea
The most common cause is unclosed index readers.  If you run lsof against
the tomcat process id and see that some deleted files are still open,
that's almost certainly the problem.  Then all you have to do is track it
down in your code.
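
If you are using SearcherManager, the usual culprit is an acquire()
without a matching release().  The safe pattern is roughly:

IndexSearcher s = searcherManager.acquire();
try {
    // run searches with s
} finally {
    searcherManager.release(s); // without this, deleted segment files stay open
}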


--
Ian.


On Thu, May 4, 2017 at 10:09 PM, Siraj Haider  wrote:

> Hi all,
> We recently switched to Lucene 6.5 from 2.9 and we have an issue that the
> files in index directory are not getting released after the IndexWriter
> finishes up writing a batch of documents. We are using
> IndexFolder.listFiles().length to check the number of files in index
> folder. We have even tried closing/reopening the
> IndexWriter/SearcherManager/MMapDirectory after indexing each batch to
> see if that would release the files but it didn't. When we shutdown the
> tomcat and restart it, only then we see that number drop, which proves that
> there were some deleted files still held by Lucene somewhere. Can you
> please direct me on what needs to be checked?
>
> P.S. I apologize for the duplicate email, as I didn't see my yesterday's
> email in the list.
>
> Regards
> -Siraj
>
> 
>
> This electronic mail message and any attachments may contain information
> which is privileged, sensitive and/or otherwise exempt from disclosure
> under applicable law. The information is intended only for the use of the
> individual or entity named as the addressee above. If you are not the
> intended recipient, you are hereby notified that any disclosure, copying,
> distribution (electronic or otherwise) or forwarding of, or the taking of
> any action in reliance on, the contents of this transmission is strictly
> prohibited. If you have received this electronic transmission in error,
> please notify us by telephone, facsimile, or e-mail as noted above to
> arrange for the return of any electronic mail or attachments. Thank You.
>


Re: unable to delete document via the IndexWriter.deleteDocuments(term) method

2017-02-17 Thread Ian Lea
Hi


Sounds like you should use FieldType.setTokenized(false).  For the
equivalent field in some of my lucene indexes I use

FieldType idf = new FieldType();
idf.setStored(true);
idf.setOmitNorms(true);
idf.setIndexOptions(IndexOptions.DOCS);
idf.setTokenized(false);
idf.freeze();

There's also PerFieldAnalyzerWrapper,  in oal.analysis.miscellaneous for
version 6.x although I have a feeling it was elsewhere in earlier versions.
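
A sketch of wiring that up, assuming smartcn for your text field as you
describe:

Map<String, Analyzer> perField = new HashMap<>();
perField.put("_id", new KeywordAnalyzer());
Analyzer analyzer =
    new PerFieldAnalyzerWrapper(new SmartChineseAnalyzer(), perField);
// pass this one analyzer to both IndexWriterConfig and the query parser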


--
Ian.



On Fri, Feb 17, 2017 at 12:26 PM, Armnotstrong <zhaoxmu...@gmail.com> wrote:

> Thanks, Ian:
>
> You saved my day!
>
> And there is a further question to ask:
>
> Since the analyzer could only be configured through the IndexWriter,
> using  different
> analyzers for different Fields is not possible, right? I only want
> this '_id' field to identify
> the document in index, so I could update or delete the specific
> document from index
> when needed, the real searching field is a text field, which should be
> analysed by
> smart_cn analyser.
>
> Thus, I think it will good to have such an configure option as
> IndexOptions.NOT_ANALYSED.
> I remember to have that in the old version of lucene, but not found in
> version 5.x
>
> Any suggestion to bypass that?
>
> Sorry for my bad English.
>
> 2017-02-17 19:40 GMT+08:00 Ian Lea <ian@gmail.com>:
> > Hi
> >
> >
> > SimpleAnalyzer uses LetterTokenizer which divides text at non-letters.
> > Your add and search methods use the analyzer but the delete method
> doesn't.
> >
> > Replacing SimpleAnalyzer with KeywordAnalyzer in your program fixes it.
> > You'll need to make sure that your id field is left alone.
> >
> >
> > Good to see a small self-contained test program.  A couple of suggestions
> > to make it even better if there's a next time:
> >
> > Use final static String ID = "_id" and ... KEY =
> > "5836962b0293a47b09d345f1".  Minimises the risk of typos.
> >
> > And use RAMDirectory.  Means your program doesn't leave junk on my disk
> if
> > I run it, and also means it starts with an empty index each time.
> >
> >
> > --
> > Ian.
> >
> >
> > On Fri, Feb 17, 2017 at 10:04 AM, Armnotstrong <zhaoxmu...@gmail.com>
> wrote:
> >
> >> Hi, all:
> >>
> >> I am Using version 5.5.4, and find can't delete a document via the
> >> IndexWriter.deleteDocuments(term) method.
> >>
> >> Here is the test code:
> >>
> >> import org.apache.lucene.analysis.core.SimpleAnalyzer;
> >> import org.apache.lucene.document.Document;
> >> import org.apache.lucene.document.Field;
> >> import org.apache.lucene.document.FieldType;
> >> import org.apache.lucene.index.*;
> >> import org.apache.lucene.queryparser.classic.ParseException;
> >> import org.apache.lucene.queryparser.classic.QueryParser;
> >> import org.apache.lucene.search.IndexSearcher;
> >> import org.apache.lucene.search.Query;
> >> import org.apache.lucene.search.ScoreDoc;
> >> import org.apache.lucene.store.Directory;
> >> import org.apache.lucene.store.FSDirectory;
> >>
> >> import java.io.IOException;
> >> import java.nio.file.Paths;
> >>
> >> public class TestSearch {
> >> static SimpleAnalyzer analyzer = new SimpleAnalyzer();
> >>
> >> public static void main(String[] argvs) throws IOException,
> >> ParseException {
> >> generateIndex("5836962b0293a47b09d345f1");
> >> query("5836962b0293a47b09d345f1");
> >> delete("5836962b0293a47b09d345f1");
> >> query("5836962b0293a47b09d345f1");
> >>
> >> }
> >>
> >> public static void generateIndex(String id) throws IOException {
> >> Directory directory = FSDirectory.open(Paths.get("/
> >> tmp/test/lucene"));
> >> IndexWriterConfig config = new IndexWriterConfig(analyzer);
> >> IndexWriter iwriter = new IndexWriter(directory, config);
> >> FieldType fieldType = new FieldType();
> >> fieldType.setStored(true);
> >> fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_
> >> AND_POSITIONS_AND_OFFSETS);
> >> Field idField = new Field("_id", id, fieldType);
> >> Document doc = new Document();
> >> doc.add(idField);
> >> iwriter.addDocument(doc);
> >> iwriter.close();
> >>
> >> }
> >>
> >> public static void qu

Re: unable to delete document via the IndexWriter.deleteDocuments(term) method

2017-02-17 Thread Ian Lea
Hi


SimpleAnalyzer uses LetterTokenizer which divides text at non-letters.
Your add and search methods use the analyzer but the delete method doesn't.

Replacing SimpleAnalyzer with KeywordAnalyzer in your program fixes it.
You'll need to make sure that your id field is left alone.


Good to see a small self-contained test program.  A couple of suggestions
to make it even better if there's a next time:

Use final static String ID = "_id" and ... KEY =
"5836962b0293a47b09d345f1".  Minimises the risk of typos.

And use RAMDirectory.  Means your program doesn't leave junk on my disk if
I run it, and also means it starts with an empty index each time.


--
Ian.


On Fri, Feb 17, 2017 at 10:04 AM, Armnotstrong  wrote:

> Hi, all:
>
> I am Using version 5.5.4, and find can't delete a document via the
> IndexWriter.deleteDocuments(term) method.
>
> Here is the test code:
>
> import org.apache.lucene.analysis.core.SimpleAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.FieldType;
> import org.apache.lucene.index.*;
> import org.apache.lucene.queryparser.classic.ParseException;
> import org.apache.lucene.queryparser.classic.QueryParser;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
>
> import java.io.IOException;
> import java.nio.file.Paths;
>
> public class TestSearch {
> static SimpleAnalyzer analyzer = new SimpleAnalyzer();
>
> public static void main(String[] argvs) throws IOException,
> ParseException {
> generateIndex("5836962b0293a47b09d345f1");
> query("5836962b0293a47b09d345f1");
> delete("5836962b0293a47b09d345f1");
> query("5836962b0293a47b09d345f1");
>
> }
>
> public static void generateIndex(String id) throws IOException {
> Directory directory = FSDirectory.open(Paths.get("/
> tmp/test/lucene"));
> IndexWriterConfig config = new IndexWriterConfig(analyzer);
> IndexWriter iwriter = new IndexWriter(directory, config);
> FieldType fieldType = new FieldType();
> fieldType.setStored(true);
> fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_
> AND_POSITIONS_AND_OFFSETS);
> Field idField = new Field("_id", id, fieldType);
> Document doc = new Document();
> doc.add(idField);
> iwriter.addDocument(doc);
> iwriter.close();
>
> }
>
> public static void query(String id) throws ParseException, IOException
> {
> Query query = new QueryParser("_id", analyzer).parse(id);
> Directory directory = FSDirectory.open(Paths.get("/
> tmp/test/lucene"));
> IndexReader ireader  = DirectoryReader.open(directory);
> IndexSearcher isearcher = new IndexSearcher(ireader);
> ScoreDoc[] scoreDoc = isearcher.search(query, 100).scoreDocs;
> for(ScoreDoc scdoc: scoreDoc){
> Document doc = isearcher.doc(scdoc.doc);
> System.out.println(doc.get("_id"));
> }
> }
>
> public static void delete(String id){
> try {
>  Directory directory =
> FSDirectory.open(Paths.get("/tmp/test/lucene"));
> IndexWriterConfig config = new IndexWriterConfig(analyzer);
> IndexWriter iwriter = new IndexWriter(directory, config);
> Term term = new Term("_id", id);
> iwriter.deleteDocuments(term);
> iwriter.commit();
> iwriter.close();
> }catch (IOException e){
> e.printStackTrace();
> }
> }
> }
>
>
> --
> 
> best regards & a nice day
> Zhao Ximing
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Disabling Lucene Scoring/Ranking

2017-01-09 Thread Ian Lea
oal.search.ConstantScoreQuery?

"A query that wraps another query and simply returns a constant score equal
to the query boost for every document that matches the query. It therefore
simply strips of all scores and returns a constant one."
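
For example (field and value made up):

Query q = new ConstantScoreQuery(new TermQuery(new Term("field", "value")));
// every matching doc now scores the same, so ranking is effectively off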


--
Ian.


On Mon, Jan 9, 2017 at 11:39 AM, Taher Galal 
wrote:

> Hi,
>
> What about writing your own scoring that just give a value of 1 to all the
> documents that are hits?
>
> On Mon, Jan 9, 2017 at 12:17 PM, Rajnish kamboj 
> wrote:
>
> > My application does not require scoring/ranking.  All data is equally
> > important for me.
> >
> > Search query can return any documents matching search criteria.
> >
> > So, Is there a way to completely disable scoring/ranking altogether?
> > OR Is there a better solution to it.
> >
> > Regards
> > Rajnish
> >
>


Re: Favoring Terms Occurring in Close Proximity

2016-06-27 Thread Ian Lea
No, it implies that Lucene is a low level library that allows people like
you and me, application developers, to develop applications that meet our
business and technical needs.

Like you, most of the things I work with prefer documents where the search
terms are close together, often preferably in the right order.  They also
rate certain fields as being more important than others.  So we build
fairly complex boolean queries with an assortment of queries - lots of
spans - with boosting, to try and get the results we need.

Some of the projects built on top of lucene may provide some built in
support for this but as far as I'm aware Lucene doesn't, and nor do I
expect it to.
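
As a flavour of the hand-built approach, Ahmet's example from the quoted
mail below would look something like this in code (sketch, 6.x API,
field name made up):

BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(new TermQuery(new Term("body", "temperate")), BooleanClause.Occur.SHOULD);
b.add(new TermQuery(new Term("body", "climates")), BooleanClause.Occur.SHOULD);
PhraseQuery near = new PhraseQuery(5, "body", "temperate", "climates"); // slop 5
b.add(new BoostQuery(near, 100f), BooleanClause.Occur.SHOULD);
Query q = b.build();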


--
Ian.


On Mon, Jun 27, 2016 at 1:55 PM, Daniel Bigham  wrote:

> Hi Ahmet,
>
> Yes, thanks... that did come to mind and is the strategy I'm playing with.
>
> However, if you are giving a user a plain text field and using the Lucene
> query parser, it doesn't create optional clauses for boosting purposes.
>
> Does this imply that anyone wanting to use Lucene in conjunction with an
> input field needs to write a custom query parser if they want reasonable
> results?
>
> - On Jun 24, 2016, at 12:25 PM, Ahmet Arslan 
> wrote:
>
> > Hi Daniel,
>
> > You can add optional clauses to your query for boosting purposes.
>
> > for example,
>
> > temperate OR climates OR "temperate climates"~5^100
>
> > ahmet
>
> > On Friday, June 24, 2016 5:07 PM, Daniel Bigham 
> wrote:
> > Something significant that I've noticed about using the default Lucene
> > query parser is that if your user enters a query like:
>
> > "temperate climates"
>
> > ... it will get turned into an OR query:
>
> > temperate OR climates
>
> > This means that a document that contains the literal substring
> > "temperate climates" will be on equal footing with a document that
> > contains "temperate emotions may go a long way to keeping the peace as
> > we continue to discuss climate change".
>
> > So far as I know, your typical search engine definitely does not ignore
> > the relative positions of terms.
>
> > And so my question is -- how do people typically deal with this when
> > using Lucene? What is wanted is a query that desires search terms to be
> > close together, but failing that, is ok with the terms simply occurring
> > in the document.
>
> > And again -- the ultimate desire isn't just to construct a Query object
> > to accomplish that, but to hook things up in such a way that a user can
> > enter a query in an input box and have the system take their flat string
> > and turn it into an intelligent query that acts somewhat like today's
> > modern search engines in terms of wanting terms to be close to each
> other.
>
> > This is such a "basic" use case of a search system that I'm tempted to
> > think there must be well worn paths for doing this in Lucene.
>
> > Thanks,
> > Daniel
>
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>


Re: LockFactory issue observed in lucene while getting instance of indexWriter

2016-06-16 Thread Ian Lea
Sounds to me like it's related to the index not having been closed properly
or still being updated or something.  I'd worry about that.

--
Ian.


On Thu, Jun 16, 2016 at 11:19 AM, Mukul Ranjan  wrote:

> Hi,
>
> I'm observing below exception while getting instance of indexWriter-
>
> java.lang.IllegalArgumentException: Directory MMapDirectory@"directoryName"
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ec79746 still
> has pending deleted files; cannot initialize IndexWriter
>
> Is it related to the default used NativeFSLockFactory. Should I use
> simpleFSLockFactory to avoid this type of issue. Please suggest as I'm
> getting the above exception in my application.
>
> Thanks,
> Mukul
> Visit eGain on YouTube and
> LinkedIn
>


Re: Using Lucene to model ownership of documents

2016-06-16 Thread Ian Lea
I'd definitely go for b).  The index will of course be larger for every
extra bit of data you store but it doesn't sound like this would make much
difference.  Likewise for speed of indexing.
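
For b) I'd index one untokenized owner token per customer rather than
one analyzed text blob, and attach it as a non-scoring clause (sketch;
names made up):

for (String customerId : ownersOfThisDoc) {
    doc.add(new StringField("owner", customerId, Field.Store.NO));
}
...
BooleanQuery.Builder b = new BooleanQuery.Builder();
b.add(userTextQuery, BooleanClause.Occur.MUST);
b.add(new TermQuery(new Term("owner", customerId)), BooleanClause.Occur.FILTER);

Occur.FILTER matches without contributing to the score, which is what
you want for an access check.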


--
Ian.


On Wed, Jun 15, 2016 at 2:25 PM, Geebee Coder  wrote:

> Hi there,
> I would like to use Lucene to solve the following problem:
>
> 1.We have about 100k customers and we have 25 millions of documents.
>
> 2.When a customer performs a text search on the document space, we want to
> return only documents that the customer has access to.
>
> 3.The # of documents a customer owns varies a lot. some have close to 23
> million, some have close to 10k and some own a third of the documents etc.
>
> What is an efficient way to use Lucene in this scenario in terms of
> performance and indexing?
> We have tried a number of solutions such as
>
>  a)100k boolean fields per document that indicates whether a customer has
> access to the document.
>  b)A single text field that has a list of customers who owns the document
> e.g. (customers field : "abc abd cfx...")
> c) the above option with shards by customers
>
> The search performance for a was bad. b,c performed better for search
> but lengthened the time needed for indexing & index size.
> We are also thinking about using a custom filter but we are concerned about
> the memory requirements.
>
> Any ideas/suggestions would be really appreciated.
>


Re: Selective Output fields in Search Result. Lucene 5.5.0

2016-05-16 Thread Ian Lea
Would
http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/index/IndexReader.html#document(int,%20java.util.Set)
be what you are looking for?
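
i.e. something like (field name made up):

Set<String> wanted = Collections.singleton("myField");
Document d = reader.document(docId, wanted); // loads only that stored field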


--
Ian.


On Mon, May 16, 2016 at 1:39 PM,  wrote:

> Hello,
>
> I am storing close to 100 fields in a single document which is being
> indexed. There are million such documents. Lucene version is 5.5.0
>
> During search, I wish to get only 1 field out of these 100 fields as a
> result. The conventional approach is to get the required hits and get the
> selected field by iterating over each Document.
>
> Problem with conventional approach is, the document so retrieved contains
> all the fields bloating my memory. (i.e. I get 100 fields in each document
> whereas I should get only 1 field to save on resource and memory)
>
> I would like to have only 1 particular field in the output document.
>
> Is there any approach for this ? I have looked into the api docs and stuff
> but could not find anything related to this.
>
> Appreciate your guidance. Thanks in advance.
>
> Regards,
> Ankit
> "Confidentiality Warning: This message and any attachments are intended
> only for the use of the intended recipient(s).
> are confidential and may be privileged. If you are not the intended
> recipient. you are hereby notified that any
> review. re-transmission. conversion to hard copy. copying. circulation or
> other use of this message and any attachments is
> strictly prohibited. If you are not the intended recipient. please notify
> the sender immediately by return email.
> and delete this message and any attachments from your system.
>
> Virus Warning: Although the company has taken reasonable precautions to
> ensure no viruses are present in this email.
> The company cannot accept responsibility for any loss or damage arising
> from the use of this email or attachment."
>


Re: Need help in alphanumeric search

2015-10-01 Thread Ian Lea
>> > ResultSet rs = stmt.executeQuery(sql);
>> >   int i=0;
>> >   while (rs.next()) {
>> >  Document d = new Document();
>> >  d.add(new TextField("cpn", rs.getString("cpn"), Field.Store.YES));
>> >
>> >  writer.addDocument(d);
>> >  i++;
>> >  }
>> >   stmt.close();
>> >   rs.close();
>> >
>> >   return i;
>> > }
>> >
>> >
>> > Searching code:
>> >
>> > public class SimpleDBSearcher {
>> > // PLASTRON
>> > private static final String LUCENE_QUERY = "SD*"; private static final
>> int
>> > MAX_HITS = 500; private static final String INDEX_DIR = "C:/DBIndexAll/";
>> >
>> > public static void main(String[] args) throws Exception { // File
>> indexDir = new
>> > File(SimpleDBIndexer.INDEX_DIR); final Path indexDir =
>> > Paths.get(SimpleDBIndexer.INDEX_DIR);
>> > String query = LUCENE_QUERY;
>> > SimpleDBSearcher searcher = new SimpleDBSearcher();
>> > searcher.searchIndex(indexDir, query); }
>> >
>> > private void searchIndex(Path indexDir, String queryStr) throws
>> Exception {
>> > Directory directory = FSDirectory.open(indexDir); System.out.println("The
>> > query string is " + queryStr); MultiFieldQueryParser queryParser = new
>> > MultiFieldQueryParser(new String[] { "cpn" }, new StandardAnalyzer());
>> > IndexReader reader = DirectoryReader.open(directory); IndexSearcher
>> > searcher = new IndexSearcher(reader);
>> > queryParser.getAllowLeadingWildcard();
>> >
>> > Query query = queryParser.parse(queryStr); TopDocs topDocs =
>> > searcher.search(query, MAX_HITS);
>> >
>> > ScoreDoc[] hits = topDocs.scoreDocs;
>> > System.out.println(hits.length + " Record(s) Found"); for (int i = 0; i <
>> > hits.length; i++) { int docId = hits[i].doc; Document d =
>> searcher.doc(docId);
>> > System.out.println("\"cpn value is:\" " + d.get("cpn")); } if
>> (hits.length == 0) {
>> > System.out.println("No Data Founds "); }
>> >
>> > }
>> > }
>> >
>> >
>> > Please help here, thanks in advance.
>> >
>> > Regards,
>> > Bhaskar
>> >
>> > On Tue, Sep 29, 2015 at 3:47 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >
>> > > Hi Erick,
>> > >
>> > > This mail was in Lucene's user mailing list. This is not about Solr,
>> > > so user cannot provide his Solr config! :-) In any case, it would be
>> > > good to get the Analyzer + code you use while indexing and also the
>> > > code (+ Analyzer) that creates the query while searching.
>> > >
>> > > Uwe
>> > >
>> > > -
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: u...@thetaphi.de
>> > >
>> > >
>> > > > -Original Message-
>> > > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > > > Sent: Monday, September 28, 2015 6:01 PM
>> > > > To: java-user
>> > > > Subject: Re: Need help in alphanumeric search
>> > > >
>> > > > You need to supply the definitions of this field from your
>> > > > schema.xml
>> > > file,
>> > > > both the  and 
>> > > >
>> > > > Additionally, please provide the results of the query you're trying
>> > > > with =true appended.
>> > > >
>> > > > The adminUI/analysis page is very helpful in these situations as
>> well.
>> > > Select
>> > > > the appropriate core from the drop-down on the left and you'll see
>> > > > an "analysis"
>> > > > section appear that shows you exactly what happens when the field is
>> > > > analyzed.
>> > > >
>> > > > Best,
>> > > > Erick
>> > > >
>> > > > On Mon, Sep 28, 2015 at 5:01 AM, Bhaskar <bhaskar1...@gmail.com>
>> > wrote:
>> > > > > Thanks Lan for reply.
>> > > > >
>> > > > > cpn values are like 123-0049, 342-043, ab23-090, hedwsdg
>> > > > >
>> > > > > my application is working when i gave search  for below inputs
>> > >

Re: Need help in alphanumeric search

2015-09-28 Thread Ian Lea
Hi


Can you provide a few examples of values of cpn that a) are and b) are
not being found, for indexing and searching.

You may also find some of the tips at
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2F_incorrect_hits.3F
useful.

You haven't shown the code that created the IndexWriter so the tip
about using the same analyzer at index and search time might be
relevant.
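
For what it's worth, if cpn values are exact codes like 123-0049 that
should match as typed, the usual pattern is an untokenized field plus a
TermQuery (a sketch, and not necessarily your problem):

doc.add(new StringField("cpn", rs.getString("cpn"), Field.Store.YES));
...
Query q = new TermQuery(new Term("cpn", "123-0049"));

StandardAnalyzer splits values like 123-0049 at the hyphen, which often
explains this kind of miss.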



--
Ian.


On Mon, Sep 28, 2015 at 10:49 AM, Bhaskar  wrote:
> Hi,
> I am beginner in Apache lucene, I am using 5.3.1.
> I have created  the index on the database result. The index values are
> having alphanumeric and strings values. I am able to search the strings but
> I am not able to search alphanumeric values.
>
> Can someone help me here.
>
> Below is indexing code...
>
> int indexDocs(IndexWriter writer, Connection conn) throws Exception {
> Statement stmt = conn.createStatement();
>   ResultSet rs = stmt.executeQuery(sql);
>   int i=0;
>   while (rs.next()) {
>  Document d = new Document();
> // System.out.println("cpn is" + rs.getString("cpn"));
> // System.out.println("mpn is" + rs.getString("mpn"));
>
>   d.add(new TextField("cpn", rs.getString("cpn"), Field.Store.YES));
>
>
>  writer.addDocument(d);
>  i++;
>  }
> }
>
> Searching code:
>
>
> private void searchIndex(Path indexDir, String queryStr) throws Exception {
> Directory directory = FSDirectory.open(indexDir);
> System.out.println("The query string is " + queryStr);
> // MultiFieldQueryParser queryParser = new MultiFieldQueryParser(new
> // String[] {"mpn"}, new StandardAnalyzer());
> // IndexReader reader = IndexReader.open(directory);
> IndexReader reader = DirectoryReader.open(directory);
> IndexSearcher searcher = new IndexSearcher(reader);
> Analyzer analyzer = new StandardAnalyzer();
> analyzer.tokenStream("cpn", queryStr);
> QueryParser parser = new QueryParser("cpn", analyzer);
> parser.setDefaultOperator(Operator.OR);
> parser.getAllowLeadingWildcard();
> parser.setAutoGeneratePhraseQueries(true);
> Query query = parser.parse(queryStr);
> searcher.search(query, 100);
> TopDocs topDocs = searcher.search(query, MAX_HITS);
>
> ScoreDoc[] hits = topDocs.scoreDocs;
> System.out.println(hits.length + " Record(s) Found");
> for (int i = 0; i < hits.length; i++) {
> int docId = hits[i].doc;
> Document d = searcher.doc(docId);
> System.out.println("\"value is:\" " + d.get("cpn"));
> }
> if (hits.length == 0) {
> System.out.println("No Data Founds ");
> }
>
>
> Thanks in advance.
>
> --
> Keep Smiling
> Thanks & Regards
> Bhaskar.
> Mobile:9866724142

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Improvement performance of my indexing with Lucene

2015-09-09 Thread Ian Lea
See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

Also double check that it's Lucene that you should be concentrating
on.  In my experience it's often the reading of the data from a
database, if that's what you are doing, that is the bottleneck.


--
Ian.


On Wed, Sep 9, 2015 at 6:07 AM, Modassar Ather  wrote:
> There are few things you can try to improve indexing performance.
>
> 1. Try indexing documents in batches.
> 2. You can try multi-threaded indexing. What I mean to say is feed the data
> using multiple threads to the indexer.
> 3. Analysis of memory utilization and GC tuning.
>
> Following are few links which has few details on Solr indexing performance.
> http://wiki.apache.org/solr/SolrPerformanceFactors
> https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>
> Regards,
> Modassar
>
> On Wed, Sep 9, 2015 at 7:29 AM, Humberto Rocha  wrote:
>
>> Hi,
>>
>> I need to improve the performance of my indexing with Lucene .
>>
>> Is there any material (eg, article, book , tutorial ) that can be used for
>> this?
>>
>> Could anyone help me please ?
>>
>> Thanks a lot!
>>
>> --
>> Humberto
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Improvement performance of my indexing with Lucene

2015-09-09 Thread Ian Lea
The link that I sent,
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed is for Lucene,
not Solr.  The second item on the list is to make sure you are using
the latest version of lucene so that would be a good starting point.


--
Ian.


On Wed, Sep 9, 2015 at 3:10 PM, Humberto Rocha <humro...@gmail.com> wrote:
> Thanks a lot !
>
> But do you know some links that helps implement these optimization options
> without the Solr (using only lucene) ?
>
> I am using lucene 4.9.
>
> More thanks.
>
> Humberto
>
>
> On Wed, Sep 9, 2015 at 5:23 AM, Ian Lea <ian@gmail.com> wrote:
>
>> See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>>
>> Also double check that it's Lucene that you should be concentrating
>> on.  In my experience it's often the reading of the data from a
>> database, if that's what you are doing, that is the bottleneck.
>>
>>
>> --
>> Ian.
>>
>>
>> On Wed, Sep 9, 2015 at 6:07 AM, Modassar Ather <modather1...@gmail.com>
>> wrote:
>> > There are few things you can try to improve indexing performance.
>> >
>> > 1. Try indexing documents in batches.
>> > 2. You can try multi-threaded indexing. What I mean to say is feed the
>> data
>> > using multiple threads to the indexer.
>> > 3. Analysis of memory utilization and GC tuning.
>> >
>> > Following are few links which has few details on Solr indexing
>> performance.
>> > http://wiki.apache.org/solr/SolrPerformanceFactors
>> >
>> https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>> >
>> > Regards,
>> > Modassar
>> >
>> > On Wed, Sep 9, 2015 at 7:29 AM, Humberto Rocha <humro...@gmail.com>
>> wrote:
>> >
>> >> Hi,
>> >>
>> >> I need to improve the performance of my indexing with Lucene .
>> >>
>> >> Is there any material (eg, article, book , tutorial ) that can be used
>> for
>> >> this?
>> >>
>> >> Could anyone help me please ?
>> >>
>> >> Thanks a lot!
>> >>
>> >> --
>> >> Humberto
>> >>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Humberto Rocha

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Improvement performance of my indexing with Lucene

2015-09-09 Thread Ian Lea
> Great! I will upgrade Lucene then.

Good start.

> I'm not using database.

Fine, but you must be getting your data from somewhere.  Maybe that is
blazingly fast, maybe it isn't.

> Are there some java samples code ?
>
> Samples with:
>
> 1. indexing documents in batches.

I think this means call IndexWriter.commit() every some-large-number
of docs rather than some-small-number.
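
e.g. (sketch; the batch size is made up and worth tuning):

 int n = 0;
 for (Document doc : docs) {
     writer.addDocument(doc);
     if (++n % 50000 == 0) writer.commit();
 }
 writer.commit();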

> 2. Multi-threaded indexing

I don't have examples, but pseudocode would look something like

 IndexWriter iw = whatever
 Thread t1 = whatever(iw, data-source-1)
 Thread t2 = whatever(iw, data-source-2)
 ...
 t1.start()
 t2.start()
 ...
 wait ...
 iw.close()
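
Or in real Java (untested sketch; 5.x-style IndexWriterConfig, and
fetchDocs() is a made-up stand-in for your data source - IndexWriter
itself is safe to share across threads):

 final IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(analyzer));
 Runnable task = new Runnable() {
     public void run() {
         try {
             for (Document doc : fetchDocs()) iw.addDocument(doc);
         } catch (IOException e) {
             throw new RuntimeException(e);
         }
     }
 };
 Thread t1 = new Thread(task);
 Thread t2 = new Thread(task);
 t1.start(); t2.start();
 t1.join(); t2.join();
 iw.close();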


--
Ian.


> On Wed, Sep 9, 2015 at 11:23 AM, Ian Lea <ian@gmail.com> wrote:
>
>> The link that I sent,
>> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed is for Lucene,
>> not Solr.  The second item on the list is to make sure you are using
>> the latest version of lucene so that would be a good starting point.
>>
>>
>> --
>> Ian.
>>
>>
>> On Wed, Sep 9, 2015 at 3:10 PM, Humberto Rocha <humro...@gmail.com> wrote:
>> > Thanks a lot !
>> >
>> > But do you know some links that helps implement these optimization
>> options
>> > without the Solr (using only lucene) ?
>> >
>> > I am using lucene 4.9.
>> >
>> > More thanks.
>> >
>> > Humberto
>> >
>> >
>> > On Wed, Sep 9, 2015 at 5:23 AM, Ian Lea <ian@gmail.com> wrote:
>> >
>> >> See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>> >>
>> >> Also double check that it's Lucene that you should be concentrating
>> >> on.  In my experience it's often the reading of the data from a
>> >> database, if that's what you are doing, that is the bottleneck.
>> >>
>> >>
>> >> --
>> >> Ian.
>> >>
>> >>
>> >> On Wed, Sep 9, 2015 at 6:07 AM, Modassar Ather <modather1...@gmail.com>
>> >> wrote:
>> >> > There are few things you can try to improve indexing performance.
>> >> >
>> >> > 1. Try indexing documents in batches.
>> >> > 2. You can try multi-threaded indexing. What I mean to say is feed the
>> >> data
>> >> > using multiple threads to the indexer.
>> >> > 3. Analysis of memory utilization and GC tuning.
>> >> >
>> >> > Following are few links which has few details on Solr indexing
>> >> performance.
>> >> > http://wiki.apache.org/solr/SolrPerformanceFactors
>> >> >
>> >>
>> https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>> >> >
>> >> > Regards,
>> >> > Modassar
>> >> >
>> >> > On Wed, Sep 9, 2015 at 7:29 AM, Humberto Rocha <humro...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I need to improve the performance of my indexing with Lucene .
>> >> >>
>> >> >> Is there any material (eg, article, book , tutorial ) that can be
>> used
>> >> for
>> >> >> this?
>> >> >>
>> >> >> Could anyone help me please ?
>> >> >>
>> >> >> Thanks a lot!
>> >> >>
>> >> >> --
>> >> >> Humberto
>> >> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Humberto Rocha
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Humberto Rocha

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter is not closing the FDs (deleted files)

2015-09-01 Thread Ian Lea
From a glance, you need to close the old reader after calling
openIfChanged if it gives you a new one.

See 
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged(org.apache.lucene.index.DirectoryReader).
You may wish to pay attention to the words about not closing readers
while they may still be in use.
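
The shape of it is roughly:

DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {
    DirectoryReader old = reader;
    reader = newReader;
    old.close(); // but only once no in-flight search is still using it
}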


--
Ian.


On Tue, Sep 1, 2015 at 12:55 PM, Marcio Napoli  wrote:
> Hey Anton,
>
> I use this way:
>
> Thanks!
>
> @PostConstruct
> public void create() {
> final String parent = System.getProperty("jboss.server.data.dir");
> final File indexFullPath = new File(parent, CIDADAO_INDEX_PATH);
> try {
> final FSDirectory directory = FSDirectory.open(indexFullPath);
> this.openIndex(directory);
> } catch (Exception e) {
> logger.error("Problema no índice lucene para Cidadão", e);
> throw EJBUtil.rollbackNE(e);
> }
> }
>
> private void openIndex(Directory directory) throws IOException {
> final IndexWriterConfig config = new IndexWriterConfig(LUCENE_4_10_3,
> DEFAULT_ANALYZER);
> config.setMaxThreadStates(2);
> config.setCheckIntegrityAtMerge(true);
> this.writer = new IndexWriter(directory, config);
> this.reader = DirectoryReader.open(this.writer, true);
> }
>
>
> private void addToIndex(final CidadaoBean cidadaoBean, boolean commit) {
> if (cidadaoBean == null || !cidadaoBean.isSetId()) {
> return;
> }
>
> try {
> CidadaoHelper.addToIndex(writer, cidadaoBean, commit);
> if (commit) {
> DirectoryReader newReader = DirectoryReader.openIfChanged(this.reader);
> if (newReader != null) {
> this.reader = newReader;
> }
> }
> } catch (Exception e) {
> logger.error("Não foi possível inserir o Cidadao no índice Lucene \n" +
> cidadaoBean, e);
> e.printStackTrace();
> }
> }
>
>
>
>
> Em seg, 31 de ago de 2015 às 21:13, Anton Zenkov 
> escreveu:
>
>> Are you sure you are not holding open readers somewhere?
>>
>> On Mon, Aug 31, 2015 at 7:46 PM, Marcio Napoli 
>> wrote:
>>
>> > Hey! :)
>> >
>> > It seems IndexWriter is not closing the descriptors of the removed files,
>> > see the log below.
>> >
>> > Thanks,
>> > Napoli
>> >
>> > [root@server01 log]# ls -l /proc/59491/fd  | grep index
>> > l-wx--. 1 wildfly wildfly 64 Ago 31 11:26 429 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/write.lock
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 529 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_4.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 530 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_3.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 531 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_2.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 532 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_1.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 533 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_0.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 535 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_a.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 536 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_9.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 537 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_8.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 538 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_7.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 539 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_6.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 540 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5_Lucene41_0.doc
>> > (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 541 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5_Lucene41_0.pos
>> > (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 542 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5_Lucene41_0.tim
>> > (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 543 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5.nvd (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 544 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5.fdt (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 11:26 545 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_5_Lucene410_0.dvd
>> > (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 20:25 619 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_o.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 20:25 676 ->
>> > /usr/local/wildfly-2.0/standalone/data/index/cidadao/_k.cfs (deleted)
>> > lr-x--. 1 wildfly wildfly 64 Ago 31 20:25 677 ->
>> > 

Re: IndexReader returns all fields, but IndexSearcher does not

2015-06-02 Thread Ian Lea
Hi - I suggest you narrow the problem down to a small self-contained
example and if you still can't get it to work, show us the code.  And
tell us what version of Lucene you are using.


--
Ian.


On Mon, Jun 1, 2015 at 5:20 PM, Rahul Kotecha
kotecha.rahul...@gmail.com wrote:
 Hi All,
 I am trying to query an index.
 When I try to read the index using IndexReader, I am able to print all the
 fields (close to 30 fields stored) in the index.
 However, when I run a query on the same index using IndexSearcher, I am
 able to get only a couple of fields instead of all the fields as returned
 by IndexReader.

 Any help would be greatly appreciated.

 Regards,
 Rahul Kotecha

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filtering question

2015-03-11 Thread Ian Lea
Can you use a BooleanFilter (or ChainedFilter in 4.x) alongside your
BooleanQuery?   Seems more logical and I suspect would solve the problem.
Caching filters can be good too, depending on how often your data changes.
See CachingWrapperFilter.
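
Something like this (untested sketch, 4.x API; the NDV filter arguments
are made up):

Filter owner = new TermFilter(new Term("owner", "UserA"));
Filter tag = new CachingWrapperFilter(new MyNDVFilter("myNdvField", "myTag"));
BooleanFilter both = new BooleanFilter();
both.add(owner, BooleanClause.Occur.MUST);
both.add(tag, BooleanClause.Occur.MUST);
searcher.search(booleanQuery, both, 50);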

--
Ian.


On Tue, Mar 10, 2015 at 12:45 PM, Chris Bamford cbamf...@mimecast.com
wrote:


  Hi,

  I have an index of 30 docs, 20 of which have an owner field of UserA
 and 10 of UserB.
 I also have a query which consists of:

  BooleanQuery:
 -- Clause 1: TermQuery
 -- Clause 2: FilteredQuery
 - Branch 1: MatchAllDocsQuery()
 - Branch 2: MyNDVFilter

  I execute my search as follows:

  searcher.search( booleanQuery,
 new TermFilter(new Term("owner", "UserA")),
 50);

  The TermFilter's job is to reduce the number of searchable documents
 from 30 to 20, which it does for all clauses of the BooleanQuery except for
 MyNDVFilter which iterates through the full 30 docs, 10 needlessly.  How
 can I restrict it so it behaves the same as the other query branches?

  MyNDVFilter source code:

  public class MyNDVFilter extends Filter {

  private String fieldName;
 private String matchTag;

 public MyNDVFilter(String ndvFieldName, String matchTag) {
 this.fieldName = ndvFieldName;
 this.matchTag = matchTag;
 }

  @Override
 public DocIdSet getDocIdSet(AtomicReaderContext context, Bits
 acceptDocs) throws IOException {

  AtomicReader reader = context.reader();
 int maxDoc = reader.maxDoc();
 final FixedBitSet bitSet = new FixedBitSet(maxDoc);
 BinaryDocValues ndv = reader.getBinaryDocValues(fieldName);

  if (ndv != null) {
 for (int i = 0; i < maxDoc; i++) {
 BytesRef br = ndv.get(i);
 if (br.length > 0) {
 String strval = br.utf8ToString();
 if (strval.equals(matchTag)) {
 bitSet.set(i);
 System.out.println("MyNDVFilter " + matchTag +
  " matched " + i + " [" + strval + "]");
 }
 }
 }
 }

  return new DVDocSetId(bitSet);// just wraps a FixedBitSet
 }
 }







Re: Difference between StoredField vs Other Fields with Field.Store.YES

2015-03-11 Thread Ian Lea
 Is there a difference between using StoredField and using other types of
 fields with Field.Store.YES?

It will depend on what the other type of field is.  As the javadoc for
Field states, the xxxField classes are sugar.  If you are doing
standard things on standard data it's generally easier to use the
sugar classes, but if you need control you can build your own fields
however you like.
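For example (untested sketch, made-up field name and value), these
two are roughly equivalent:

  // sugar class:
  doc.add(new TextField("title", "hello world", Field.Store.YES));

  // hand-rolled, with explicit control over the options:
  FieldType ft = new FieldType();
  ft.setIndexed(true);   // inverted index entry
  ft.setTokenized(true); // run through the analyzer
  ft.setStored(true);    // keep the original value
  ft.freeze();
  doc.add(new Field("title", "hello world", ft));

StoredField is just the stored-only corner of the same spectrum.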


 Another question, Is it a good practise to use NumericDocValuesField
 instead of using usual Fields (IntField, LongField, StringField ...etc)
 with Field.Store.NO ?

Sorry, can't answer that.


--
Ian.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing an IntField but getting StoredField from found Document

2015-02-19 Thread Ian Lea
I think if you follow the Field.fieldType().numericType() chain you'll
end up with INT or DOUBLE or whatever.

But if you know you stored it as an IntField then surely you already
know it's an integer?  Unless you sometimes store different things in
the one field.  I wouldn't do that.
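If you really do need to check at retrieval time, the instanceof
approach in your snippet below should work (untested sketch, made-up
field name):

  IndexableField f = hitDoc.getField("price");
  Number n = f.numericValue();
  if (n instanceof Integer) {
      // was indexed as an IntField
  } else if (n instanceof Double) {
      // was indexed as a DoubleField
  }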


--
Ian.


On Thu, Feb 19, 2015 at 12:22 PM, Clemens Wyss DEV clemens...@mysign.ch wrote:
 When I index a Document with an IntField and then find that very Document the 
 former IntField is returned as StoredField. How do I determine the original 
 fieldtype (IntField, LongField, DoubleField ...)?

 Must I?
 Number number = field.numericValue();
 if( number != null )
 {
   if( number instanceof Integer)
   {
 ...
   }
   else if( number instanceof Double)
   {
 ...
   }
   ...
 }

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Query

2015-02-18 Thread Ian Lea
You mean you'd like a BooleanQuery.setMaximumNumberShouldMatch()
method?  Unfortunately that doesn't exist and I can't think of a
simple way of doing it.


--
Ian.


On Wed, Feb 18, 2015 at 5:26 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote:
 Thanks Ian. Also, if I have a unigram in the query, and I want to make sure
 I match only index entries that do not have more than 2 tokens, is there a
 way to do that too?

 Thanks

 On Wed, Feb 18, 2015 at 2:23 AM, Ian Lea ian@gmail.com wrote:

 Break the query into words then add them as TermQuery instances as
 optional clauses to a BooleanQuery with a call to
 setMinimumNumberShouldMatch(2) somewhere along the line.  You may want
 to do some parsing or analysis on the query terms to avoid problems of
 case matching and the like.


 --
 Ian.


 On Tue, Feb 17, 2015 at 4:57 PM, Deepak Gopalakrishnan dgk...@gmail.com
 wrote:
  Hello,
 
  I have a rather simple query. I have a list where I have terms like and
  then my query is more natural language. I want to be able to retrieve
   matches that has atleast 2 words in common between the query and the
 index
 
  Can you guys suggest a Query Type and a field that I should be using?
 
  --
  Regards,
  *Deepak Gopalakrishnan*

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 Regards,
 *Deepak Gopalakrishnan*
 *Mobile*:+918891509774
 *Skype* : deepakgk87
 http://myexps.blogspot.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: URL/Email tokenizer

2015-02-17 Thread Ian Lea
Ah, you want to do it the hard way.  Sorry, can't help you there - I
prefer to do things the simple way - easier to write and to maintain
and, in my experience, usually more robust in the long run.


--
Ian.


On Tue, Feb 17, 2015 at 11:42 AM, Ravikumar Govindarajan
ravikumar.govindara...@gmail.com wrote:
 Thanks Ian

 What I am currently doing is duplicating the data into 2 different fields
 and having my own PerFieldAnalyzerWrapper just like you pointed out

 Is there a good way to do this in a single-pass? Like how Bi-Grams or
 Common-Grams do…

 --
 Ravi

 On Tue, Feb 17, 2015 at 3:08 PM, Ian Lea ian@gmail.com wrote:

 Sounds like a job for
 org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper.


 --
 Ian.


 On Tue, Feb 17, 2015 at 8:51 AM, Ravikumar Govindarajan
 ravikumar.govindara...@gmail.com wrote:
  We have a requirement in that E-mail addresses need to be added in a
  tokenized form to one field while untokenized form is added to another
 field
 
  Ex:
 
  I have mailed a...@xyz.com . It should tokenize as below
 
  body = {I, have, mailed, abc, xyz, com};
 
  I also have a body-addr field. Tokenizer needs to extract e-mail
 addresses
  from body field and add them as below
 
  body-addr = {a...@xyz.com}
 
  How to achieve this via tokenizer chain?
 
  --
  Ravi

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Query

2015-02-17 Thread Ian Lea
Break the query into words then add them as TermQuery instances as
optional clauses to a BooleanQuery with a call to
setMinimumNumberShouldMatch(2) somewhere along the line.  You may want
to do some parsing or analysis on the query terms to avoid problems of
case matching and the like.
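Untested sketch, with a made-up field name; queryWords is whatever
your parsing/analysis step produced:

  BooleanQuery bq = new BooleanQuery();
  for (String word : queryWords) {
      bq.add(new TermQuery(new Term("body", word)), BooleanClause.Occur.SHOULD);
  }
  bq.setMinimumNumberShouldMatch(2); // at least 2 words must match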


--
Ian.


On Tue, Feb 17, 2015 at 4:57 PM, Deepak Gopalakrishnan dgk...@gmail.com wrote:
 Hello,

 I have a rather simple query. I have a list where I have terms like and
 then my query is more natural language. I want to be able to retrieve
  matches that has atleast 2 words in common between the query and the index

 Can you guys suggest a Query Type and a field that I should be using?

 --
 Regards,
 *Deepak Gopalakrishnan*

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: URL/Email tokenizer

2015-02-17 Thread Ian Lea
Sounds like a job for
org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper.
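Untested sketch (4.x; the analyzer choices are only examples, imports
omitted):

  Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
  // keeps e-mail addresses as single tokens:
  perField.put("body-addr", new UAX29URLEmailAnalyzer(Version.LUCENE_4_10_0));
  // everything else, including body, splits on the usual boundaries:
  Analyzer analyzer = new PerFieldAnalyzerWrapper(
      new StandardAnalyzer(Version.LUCENE_4_10_0), perField);

You still have to add the text to both fields yourself at index time.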


--
Ian.


On Tue, Feb 17, 2015 at 8:51 AM, Ravikumar Govindarajan
ravikumar.govindara...@gmail.com wrote:
 We have a requirement in that E-mail addresses need to be added in a
 tokenized form to one field while untokenized form is added to another field

 Ex:

 I have mailed a...@xyz.com . It should tokenize as below

 body = {I, have, mailed, abc, xyz, com};

 I also have a body-addr field. Tokenizer needs to extract e-mail addresses
 from body field and add them as below

 body-addr = {a...@xyz.com}

 How to achieve this via tokenizer chain?

 --
 Ravi

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: occurrence of two terms with the highest frequency

2015-02-13 Thread Ian Lea
Sorry, finger trouble.  Should have been oal which is shorthand for
org.apache.lucene, so org.apache.lucene.search.TotalHitCountCollector.

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/TotalHitCountCollector.html
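For example, to count one pair (untested; field name made up):

  BooleanQuery q = new BooleanQuery();
  q.add(new TermQuery(new Term("body", "flying")), BooleanClause.Occur.MUST);
  q.add(new TermQuery(new Term("body", "shooting")), BooleanClause.Occur.MUST);
  TotalHitCountCollector collector = new TotalHitCountCollector();
  searcher.search(q, collector);
  int flyingAndShooting = collector.getTotalHits(); // co-occurrence count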


--
Ian.


On Fri, Feb 13, 2015 at 6:55 PM, Maisnam Ns maisnam...@gmail.com wrote:
 Thanks Ian for your help. But I didn't get aol search - what is it? I tried
 searching in Google but couldn't find it.

 Thanks

 On Fri, Feb 13, 2015 at 3:00 AM, Ian Lea ian@gmail.com wrote:

 I think you can do it with 4 simple queries:

 1) +flying +shooting

 2) +flying +fighting

 etc.

 or BooleanQuery equivalents with MUST clauses.  Use
 aol.search.TotalHitCountCollector and it should be blazingly fast,
 even if you have more than 100 docs.


 --
 Ian.


 On Thu, Feb 12, 2015 at 5:42 PM, Maisnam Ns maisnam...@gmail.com wrote:
  Hi,
 
  Can someone help me with this use case.
 
  Use case: Say there are 4 key words 'Flying', 'Shooting', 'fighting' and
  'looking' in 100 documents to search for.
 
  Consider 'Flying' and 'Shooting' co-occur (together) in 70 documents
  whereas

  'Flying' and 'fighting' co-occur in 14 documents

  'Flying' and 'looking' co-occur in 2 documents and so on.
 
  I have to list them in order or rather show them on a web page
  1. Flying , Shooting -70
  2. Flying , fighting - 14
  3 Flying , looking -2
 
  How to achieve this and please tell me what kind of query is this
  co-occurrence frequency.
  Is this possible in Lucene? And how to proceed?
 
  Please help and thanks in advance.
 
  Regards

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: occurrence of two terms with the highest frequency

2015-02-12 Thread Ian Lea
I think you can do it with 4 simple queries:

1) +flying +shooting

2) +flying +fighting

etc.

or BooleanQuery equivalents with MUST clauses.  Use
aol.search.TotalHitCountCollector and it should be blazingly fast,
even if you have more than 100 docs.


--
Ian.


On Thu, Feb 12, 2015 at 5:42 PM, Maisnam Ns maisnam...@gmail.com wrote:
 Hi,

 Can someone help me with this use case.

 Use case: Say there are 4 key words 'Flying', 'Shooting', 'fighting' and
 'looking' in 100 documents to search for.

 Consider 'Flying' and 'Shooting' co-occur (together) in 70 documents
 whereas

 'Flying' and 'fighting' co-occur in 14 documents

 'Flying' and 'looking' co-occur in 2 documents and so on.

 I have to list them in order or rather show them on a web page
 1. Flying , Shooting -70
 2. Flying , fighting - 14
 3 Flying , looking -2

 How to achieve this and please tell me what kind of query is this
 co-occurrence frequency.
 Is this possible in Lucene? And how to proceed?

 Please help and thanks in advance.

 Regards

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardQueryParser with date/time fields stored as longs

2015-02-11 Thread Ian Lea
In my case, for any non-trivial search, I always build a boolean query
with relevant parsing for each field, where applicable using something
like oal.queryparser.classic.QueryParser.

So if I had some free text docs with a date field, the latter stored
as you suggest, and a query along the lines of datefrom: 2010-01-01
dateto: 2014-12-31 words: whatever, passed to my app by whatever
means, I'd parse the date fields into a NumericRangeQuery, pass the
words through a lucene supplied parser, getting back some Query
object, then glue them all in to a BooleanQuery with the relevant
MUST/SHOULD logic and boosts and whatever else I wanted.
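Untested sketch of that glue (4.x; field names and the epoch millis
are made up, exception handling omitted):

  Query dates = NumericRangeQuery.newLongRange("datefield",
      1262304000000L, 1419984000000L, true, true); // parsed from the date strings
  Query words = new QueryParser(Version.LUCENE_4_10_0, "text",
      new StandardAnalyzer(Version.LUCENE_4_10_0)).parse("whatever");
  BooleanQuery q = new BooleanQuery();
  q.add(dates, BooleanClause.Occur.MUST);
  q.add(words, BooleanClause.Occur.MUST);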


--
Ian.

On Wed, Feb 11, 2015 at 2:37 PM, Jon Stewart
j...@lightboxtechnologies.com wrote:
 Ok... so how does anyone ever use date-time queries in lucene with the
 new recommended way of using longs?


 Jon


 On Wed, Feb 11, 2015 at 9:26 AM, Ian Lea ian@gmail.com wrote:
 Ah well, you've got me there.  I'm not a lucene developer and rather
 thought that I'd leave the implementation as an exercise for the
 reader.  Good luck!


 --
 Ian.


 On Wed, Feb 11, 2015 at 2:20 PM, Jon Stewart
 j...@lightboxtechnologies.com wrote:
 Eek. So is there a parsing component somewhere that gets handed a
 field name and query components (e.g., "created", "2010-01-01",
 "2014-12-31"), which I can derive from, parse the timestamp strings,
 and then turn the whole thing into a numeric range query component?


 Jon


 On Wed, Feb 11, 2015 at 9:10 AM, Ian Lea ian@gmail.com wrote:
 To the best of my knowledge you are spot on with everything you say,
 except that the component to parse the strings doesn't exist.  I
 suspect that a contribution to add that to StandardQueryParser might
 well be accepted.


 --
 Ian.


 On Wed, Feb 11, 2015 at 4:21 AM, Jon Stewart
 j...@lightboxtechnologies.com wrote:
 Hello,

 I've done a lot of googling, but haven't stumbled upon the magic
 answer: how does one use StandardQueryParser with numeric fields
 representing timestamps, to allow for range queries?

 When indexing, my timestamp fields are ISO 8601 strings. I'm parsing
 them and then storing the milliseconds epoch time as a long, i.e.:

   doc.add(new LongField("created", ts.getMillis(), Field.Store.NO));

 From reading around, this seems to be the preferred method to index a
 timestamp (makes sense). However, how can you get StandardQueryParser
 to handle a query like created:[2010-01-01 TO 2014-12-31]?

 For other numeric fields, StandardQueryParser.setNumericConfigMap() is
 working just fine for me. It would seem that the created field would
 have to be part of this map in order to execute the range query
 properly, but that there must also be a component to parse the
 date/time strings in the query and convert them to long values, right?

 Thanks in advance,

 Jon

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 Jon Stewart, Principal
 (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 Jon Stewart, Principal
 (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardQueryParser with date/time fields stored as longs

2015-02-11 Thread Ian Lea
To the best of my knowledge you are spot on with everything you say,
except that the component to parse the strings doesn't exist.  I
suspect that a contribution to add that to StandardQueryParser might
well be accepted.


--
Ian.


On Wed, Feb 11, 2015 at 4:21 AM, Jon Stewart
j...@lightboxtechnologies.com wrote:
 Hello,

 I've done a lot of googling, but haven't stumbled upon the magic
 answer: how does one use StandardQueryParser with numeric fields
 representing timestamps, to allow for range queries?

 When indexing, my timestamp fields are ISO 8601 strings. I'm parsing
 them and then storing the milliseconds epoch time as a long, i.e.:

   doc.add(new LongField("created", ts.getMillis(), Field.Store.NO));

 From reading around, this seems to be the preferred method to index a
 timestamp (makes sense). However, how can you get StandardQueryParser
 to handle a query like created:[2010-01-01 TO 2014-12-31]?

 For other numeric fields, StandardQueryParser.setNumericConfigMap() is
 working just fine for me. It would seem that the created field would
 have to be part of this map in order to execute the range query
 properly, but that there must also be a component to parse the
 date/time strings in the query and convert them to long values, right?

 Thanks in advance,

 Jon

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardQueryParser with date/time fields stored as longs

2015-02-11 Thread Ian Lea
Ah well, you've got me there.  I'm not a lucene developer and rather
thought that I'd leave the implementation as an exercise for the
reader.  Good luck!


--
Ian.


On Wed, Feb 11, 2015 at 2:20 PM, Jon Stewart
j...@lightboxtechnologies.com wrote:
 Eek. So is there a parsing component somewhere that gets handed a
 field name and query components (e.g., "created", "2010-01-01",
 "2014-12-31"), which I can derive from, parse the timestamp strings,
 and then turn the whole thing into a numeric range query component?


 Jon


 On Wed, Feb 11, 2015 at 9:10 AM, Ian Lea ian@gmail.com wrote:
 To the best of my knowledge you are spot on with everything you say,
 except that the component to parse the strings doesn't exist.  I
 suspect that a contribution to add that to StandardQueryParser might
 well be accepted.


 --
 Ian.


 On Wed, Feb 11, 2015 at 4:21 AM, Jon Stewart
 j...@lightboxtechnologies.com wrote:
 Hello,

 I've done a lot of googling, but haven't stumbled upon the magic
 answer: how does one use StandardQueryParser with numeric fields
 representing timestamps, to allow for range queries?

 When indexing, my timestamp fields are ISO 8601 strings. I'm parsing
 them and then storing the milliseconds epoch time as a long, i.e.:

   doc.add(new LongField("created", ts.getMillis(), Field.Store.NO));

 From reading around, this seems to be the preferred method to index a
 timestamp (makes sense). However, how can you get StandardQueryParser
 to handle a query like created:[2010-01-01 TO 2014-12-31]?

 For other numeric fields, StandardQueryParser.setNumericConfigMap() is
 working just fine for me. It would seem that the created field would
 have to be part of this map in order to execute the range query
 properly, but that there must also be a component to parse the
 date/time strings in the query and convert them to long values, right?

 Thanks in advance,

 Jon

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 Jon Stewart, Principal
 (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search on a field by a single word

2015-02-11 Thread Ian Lea
If you only ever want to retrieve based on exact match you could index
the name field using org.apache.lucene.document.StringField.  Do be
aware that it is exact: if you do nothing else, a search for "a" will
not match "A" or "A ".

Or you could so something with start and end markers e.g. index your
documents as something like STARTMARKER a b c ENDMARKER and search
using a phrase or span query including the markers.
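Untested sketch of the StringField variant:

  doc.add(new StringField("name", "a", Field.Store.YES)); // indexed as one exact token
  ...
  TopDocs hits = searcher.search(new TermQuery(new Term("name", "a")), 10);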


--
Ian.



On Wed, Feb 11, 2015 at 7:55 AM, wangdong hrdxwa...@gmail.com wrote:
 Hi folks

 I have a question as follows:

 suppose there are 3 document in field name:
 1) a b c
 2) a b
 3) a

 I just want to retrieve doc 3) only. I tried to use syntax like this:
 name:a
 but I find it is not correct. Is there any way to solve my question?

 please help me!
 thanks ahead!



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: combine to MultiTermQuery with OR

2015-02-10 Thread Ian Lea
org.apache.lucene.search.BooleanQuery.


--
Ian.


On Tue, Feb 10, 2015 at 3:28 PM, Sascha Janz sascha.j...@gmx.net wrote:

 Hi,

 i want to combine two MultiTermQueries.

 One searches over FieldA, one over FieldB.  Both queries should be combined 
 with OR operator.

 so in lucene Syntax i want  to search

 FieldA:Term1 OR FieldB:Term1,   FieldA:Term2 OR FieldB:Term2, FieldA:Term3 OR 
 FieldB:Term3...

 how can i do this?

 greetings
 sascha

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Re: combine to MultiTermQuery with OR

2015-02-10 Thread Ian Lea
Yep, that looks good to me.


--
Ian.


On Tue, Feb 10, 2015 at 5:01 PM, Sascha Janz sascha.j...@gmx.net wrote:
 hm, I already thought this could be the solution but didn't know how to do the
 OR operation

 so i tried this

 BooleanQuery bquery = new BooleanQuery();
 bquery.add(queryFieldA, BooleanClause.Occur.SHOULD);
 bquery.add(queryFieldB, BooleanClause.Occur.SHOULD);

 Is this the correct way?


 Gesendet: Dienstag, 10. Februar 2015 um 17:31 Uhr
 Von: Ian Lea ian@gmail.com
 An: java-user@lucene.apache.org
 Betreff: Re: combine to MultiTermQuery with OR
 org.apache.lucene.search.BooleanQuery.


 --
 Ian.


 On Tue, Feb 10, 2015 at 3:28 PM, Sascha Janz sascha.j...@gmx.net wrote:

 Hi,

 i want to combine two MultiTermQueries.

 One searches over FieldA, one over FieldB. Both queries should be combined 
 with OR operator.

 so in lucene Syntax i want to search

 FieldA:Term1 OR FieldB:Term1, FieldA:Term2 OR FieldB:Term2, FieldA:Term3 OR 
 FieldB:Term3...

 how can i do this?

 greetings
 sascha

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boolean Search Query is not working

2015-01-23 Thread Ian Lea
How about home^10 house^10 flat. See
http://lucene.apache.org/core/4_10_3/queryparser/index.html
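Or build it programmatically (untested sketch; field name made up):

  BooleanQuery q = new BooleanQuery();
  TermQuery home = new TermQuery(new Term("body", "home"));
  home.setBoost(10f); // weight home highly
  q.add(home, BooleanClause.Occur.SHOULD);
  TermQuery house = new TermQuery(new Term("body", "house"));
  house.setBoost(10f); // and house
  q.add(house, BooleanClause.Occur.SHOULD);
  q.add(new TermQuery(new Term("body", "flat")), BooleanClause.Occur.SHOULD);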


--
Ian.


On Fri, Jan 23, 2015 at 7:17 AM, Priyanka Tufchi
priyanka.tuf...@launchship.com wrote:
 Hi ALL

 I am  working on a project which uses lucene for searching . I am
 struggling with boolean based Query : Actual Scenario is

 e.g.
  In the query, if I give house home flat
  then
  it should search house or home or flat, but I want to give them
 weightage, like house and home should get high weight and flat should
 get less than the rest.
 If a document contains home, the Lucene search should not go for house and
 flat.

 I searched on the Internet for some good stuff but was not able to find any
 code sample or proper syntax for reference.


 Thanks
 Priyanka


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boolean Search Query is not working

2015-01-23 Thread Ian Lea
Use IndexSearcher.explain() to help figure out what matched, why.  And
watch out for typos: jakarta != jakarata.

If you still can't figure it out, post here a very small completely
self-contained program or test case, using RAMDirectory, that
demonstrates the problem.
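For example (untested):

  // why did the first hit score the way it did?
  Explanation exp = searcher.explain(query, topDocs.scoreDocs[0].doc);
  System.out.println(exp);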


--
Ian.


On Fri, Jan 23, 2015 at 10:27 AM, Priyanka Tufchi
priyanka.tuf...@launchship.com wrote:
 Hi Ian

 I tried with what u sent

 Query: java^5 jakarta^5 apache tomcat
 Documents: 1: java jakarta tomcat
            2: java jakarata
            3: java jakarta apache

 Score:     1: 0.27094576
            3: 0.27094576
            2: 0.010494952


 If we go by the query, documents 1 and 3 are getting the same score. It is
 not working.

 Thanks
 Priyanka


 On Fri, Jan 23, 2015 at 3:19 PM, Ian Lea ian@gmail.com wrote:

 How about home^10 house^10 flat. See
 http://lucene.apache.org/core/4_10_3/queryparser/index.html


 --
 Ian.


 On Fri, Jan 23, 2015 at 7:17 AM, Priyanka Tufchi
 priyanka.tuf...@launchship.com wrote:
  Hi ALL
 
  I am  working on a project which uses lucene for searching . I am
  struggling with boolean based Query : Actual Scenario is
 
  e.g.
   In the query, if I give house home flat
   then
   it should search house or home or flat, but I want to give them
  weightage, like house and home should get high weight and flat should
  get less than the rest.
  If a document contains home, the Lucene search should not go for house and
  flat.
 
  I searched on the Internet for some good stuff but was not able to find any
  code sample or proper syntax for reference.
 
 
  Thanks
  Priyanka
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MultiPhraseQuery:Rewrite to BooleanQuery

2015-01-21 Thread Ian Lea
Are you asking if your two suggestions

1) a MultiPhraseQuery or

2) a BooleanQuery made up of multiple PhraseQuery instances

are equivalent?  If so, I'd say that they could be if you build them
carefully enough.  For the specific examples you show I'd say not and
would wonder if you get correct hits, particularly for your
MultiPhraseQuery which looks wrong to me, based on my reading of the
javadoc.  But I haven't tried or tested your code - I assume you have.


If you are asking something else, please explain more clearly.

--
Ian.


On Wed, Jan 21, 2015 at 2:50 PM, ku3ia dem...@gmail.com wrote:
 ku3ia wrote
 Hi folks!
 I have a multiphrase query, for example, from units:

 Directory indexStore = newDirectory();
 RandomIndexWriter writer = new RandomIndexWriter(random(), indexStore);
 add("blueberry chocolate pie", writer);
 add("blueberry chocolate tart", writer);
 IndexReader r = writer.getReader();
 writer.close();

 IndexSearcher searcher = newSearcher(r);
 MultiPhraseQuery q = new MultiPhraseQuery();
 q.add(new Term("body", "blueberry"));
 q.add(new Term("body", "chocolate"));
 q.add(new Term[] {new Term("body", "pie"), new Term("body", "tart")});
 assertEquals(2, searcher.search(q, 1).totalHits);
 r.close();
 indexStore.close();

 I need to know which phrase query matched. Explanation doesn't
 return exact information, only that there is a match by this query. So can I
 rewrite this query to Boolean?, like

 BooleanQuery q = new BooleanQuery();

 PhraseQuery pq1 = new PhraseQuery();
 pq1.add(new Term("body", "blueberry"));
 pq1.add(new Term("body", "chocolate"));
 pq1.add(new Term("body", "pie"));
 q.add(pq1, BooleanClause.Occur.SHOULD);

 PhraseQuery pq2 = new PhraseQuery();
 pq2.add(new Term("body", "blueberry"));
 pq2.add(new Term("body", "chocolate"));
 pq2.add(new Term("body", "tart"));
 q.add(pq2, BooleanClause.Occur.SHOULD);

 In this case I'll know exactly which query I have a match on. But the main
 question is: is this rewrite equivalent?
 Thanks.

 --
 dennis yermakov
 mailto:

 demesg@

 Any ideas?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/MultiPhraseQuery-Rewrite-to-BooleanQuery-tp4178898p4180863.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: forceMerge(1) grows index and does not shrink back

2015-01-20 Thread Ian Lea
Unclosed readers can definitely cause problems with index size, by
preventing the deletion of merged-away segments.  lsof can be useful
for diagnosing that.

As to the rest, I for one have lost track of what problems you've got
with which of your indexes.  I suggest you remove the forceMerge call,
double check for unclosed readers or anything else hanging on to index
files, then post a new message if you've still got problems.


--
Ian.


On Mon, Jan 19, 2015 at 2:16 PM, Jürgen Albert
j.alb...@data-in-motion.biz wrote:
 Hi,

 Am 19.01.2015 um 14:13 schrieb Uwe Schindler:

 Hi,

 we use 4.8.1. We know that the javadoc advises against it. Like I wrote,
 the
 deletion of old documents (that appear during an update) would be done
 while closing the writer.

 This is not true. The merge policy continuously merges segments that
 contain deletions. The problem you might have is the following:
 If you call forceMerge(1) for the first time, your index is reduced from a
 well distributed multi-segment index to one single, large segment. If you
 then apply deletes, they are applied against this large segment. Newly added
 documents are added to new segments. Those new segments are small, so they
 are merged with preference. The deletions in the huge single segment are
 very unlikely merged away, because Lucene only touches this segment as a
 last resort. So the problem starts when you call forceMerge for the first
 time!

 If you don’t call forceMerge and continuously index, your deletions will be
 removed quite fast. This is especially true if the deletions are
 well-distributed over the whole index! There are tons of instances with
 Elasticsearch and Lucene doing this all the time. They never ever close
 their writer. Be sure to use TieredMergePolicy (the default), because this
 one prefers segments that have many deletions. The old LogMergePolicy does
 not respect deletes, but should no longer be used, unless you rely on a
 specific index order of your documents.

 We use the default, which is the TieredMergePolicy as far as I can see. If
 what you write is true, I wonder why our index started growing in the first
 place. We have 2 indices, where the bigger one receives an update on every
 document every couple of days and a smaller one where every document is
 updated randomly over a period of roughly 3 minutes. After a couple of days,
 the indices became 12 GB each (the bigger one started with 2 GB and the
 smaller one with a couple of Megabytes). This should not happen if the
 MergePolicy works as intended. Can unclosed readers cause such a problem. We
 use a SearchManager to avoid this, but there can always be the possibility.

 On the other hand we have the case I initially described. We have a fresh
 index, that we populate. No reader is opened and no additional updates have
 been made. Therefore I see no reason why forceMerge triples the size of the
 index at all.

 Unfortunately we can't close the writer and we
 chose the force merge as alternative with less afford. Could
 forceMergeDeletes serve our purpose here?

 It could, but has the same problem like above. The only difference to
 forceMerge is that it only merges segments which have deletions.

 I will take a look into it with lsof, but I'm pretty sure, the files will
 be held by
 some javaprocess.

 Jürgen.

 Am 19.01.2015 um 13:36 schrieb Ian Lea:

 Do you need to call forceMerge(1) at all?  The javadoc, certainly for
 recent versions of lucene, advises against it.  What version of lucene
 are you running?

 It might be helpful to run lsof against the index directory
 before/during/after the merge to see what files are coming or going,
 or if there are any marked as deleted but still present.  That would
 imply that something, somewhere, was holding on to the files.


 --
 Ian.


 On Fri, Jan 16, 2015 at 1:57 PM, Jürgen Albert
 j.alb...@data-in-motion.biz wrote:

 Hi,

 because we have constant updates on our index, we can't really close
 the index from time to time. Therefore we decided to trigger
 forceMerge  when the traffic is lowest, the clean up.

 On our development laptops (Windows and Linux) it works as expected,
 but on the real servers we have some weird behaviour.

 Scenario:

 We create a fresh index and populate it. This results in an index
 with a size of 2 GB. If we trigger forceMerge(1) and a commit()
 afterwards for this index, the index grows over the next 10 minutes
 to 6 GB and does not shrink back. During the whole process no reader is

 opened on the index.

 If I try the same stunt with the same data on my Windows Laptop, it
 does nothing at all and finishes after a few ms.

 Any Ideas?

 Technical details:
 We use an MMapDirectory and the Server is a Debian7 Kernel 3.2 in a
 KVM. The file system is Ext4.

 Thx,

 Jürgen Albert.

 --
 Jürgen Albert
 Geschäftsführer

 Data In Motion UG (haftungsbeschränkt)

 Kahlaische Str. 4
 07745 Jena

 Mobil:  0157-72521634
 E-Mail: j.alb...@datainmotion.de
 Web: www.datainmotion.de

Re: forceMerge(1) grows index and does not shrink back

2015-01-19 Thread Ian Lea
Do you need to call forceMerge(1) at all?  The javadoc, certainly for
recent versions of lucene, advises against it.  What version of lucene
are you running?

It might be helpful to run lsof against the index directory
before/during/after the merge to see what files are coming or going,
or if there are any marked as deleted but still present.  That would
imply that something, somewhere, was holding on to the files.


--
Ian.


On Fri, Jan 16, 2015 at 1:57 PM, Jürgen Albert
j.alb...@data-in-motion.biz wrote:
 Hi,

 because we have constant updates on our index, we can't really close the
 index from time to time. Therefore we decided to trigger forceMerge  when
 the traffic is lowest, to clean up.

 On our development laptops (Windows and Linux) it works as expected, but on
 the real servers we have some weird behaviour.

 Scenario:

 We create a fresh index and populate it. This results in an index with a
 size of 2 GB. If we trigger forceMerge(1) and a commit() afterwards for this
 index, the index grows over the next 10 minutes to 6 GB and does not shrink
 back. During the whole process no reader is opened on the index.
 If I try the same stunt with the same data on my Windows Laptop, it does
 nothing at all and finishes after a few ms.

 Any Ideas?

 Technical details:
 We use an MMapDirectory and the Server is a Debian7 Kernel 3.2 in a KVM. The
 file system is Ext4.

 Thx,

 Jürgen Albert.

 --
 Jürgen Albert
 Geschäftsführer

 Data In Motion UG (haftungsbeschränkt)

 Kahlaische Str. 4
 07745 Jena

 Mobil:  0157-72521634
 E-Mail: j.alb...@datainmotion.de
 Web: www.datainmotion.de

 XING:   https://www.xing.com/profile/Juergen_Albert5

 Rechtliches

 Jena HBR 507027
 USt-IdNr: DE274553639
 St.Nr.: 162/107/04586


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: trouble with Collector and FieldCache

2015-01-15 Thread Ian Lea
How are you storing the id field?  A wild guess might be that this
error might be caused by having some documents with id stored,
perhaps, as a StringField or TextField and some as an IntField.
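i.e. make sure every document that has the field adds it the same
way, e.g. (untested sketch):

  doc.add(new IntField("id", idValue, Field.Store.YES));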


--
Ian.


On Wed, Jan 14, 2015 at 2:07 PM, Sascha Janz sascha.j...@gmx.net wrote:

 hello,

 i am using lucene 4.6.  in my query i use a collector to get field values.

 setNextReader is implemented as below.

 public void setNextReader(AtomicReaderContext context) throws IOException {

 cacheIDs = FieldCache.DEFAULT.getInts(context.reader(), "id", true);
 }

 and collect

 public void collect(int docnr) throws IOException {
 int id = 0;
 id = cacheIDs .get(docnr);
 ...


 }

 But sometimes i got an exception like

 2015-01-12 13:32:49,342 [ID:1e8e1dff-9a57-11e4-a697-c29020524153]ERROR 
 (org.apache.lucene.search.join.Collector) - Error getInts
 java.lang.ArrayIndexOutOfBoundsException: 0
 at 
 org.apache.lucene.util.NumericUtils.getPrefixCodedIntShift(NumericUtils.java:208)
 at org.apache.lucene.util.NumericUtils$2.accept(NumericUtils.java:493)
 at 
 org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:241)
 at 
 org.apache.lucene.search.FieldCacheImpl$Uninvert.uninvert(FieldCacheImpl.java:307)
 at 
 org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:678)
 at 
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:211)
 at 
 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:570)
 at 
 org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:631)
 at 
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:211)
 at 
 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:570)
 at 
 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:545)
 at 
 org.apache.lucene.search.join.MyCollector.setNextReader(MyCollector.java:264)


 when i try to get the field the normal way like this

 Document doc = reader.document(docnr);
 String sid = doc.get("id");
   int id = Integer.parseInt(sid);

 everything is fine.

 am i doing something wrong?

 regards
 Sascha

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: AlreadyClosedException on new index

2015-01-06 Thread Ian Lea
Presumably no exception is thrown from the new IndexWriter() call?
 I'd double check that, and try some harmless method call on the
writer and make sure that works.  And run CheckIndex against the
index.
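CheckIndex can also be run programmatically, something like this
untested sketch (path made up, exception handling omitted):

  Directory dir = FSDirectory.open(new File("/path/to/index"));
  CheckIndex ci = new CheckIndex(dir);
  CheckIndex.Status status = ci.checkIndex(); // walks the whole index
  System.out.println("clean? " + status.clean);
  dir.close();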


--
Ian.


On Tue, Jan 6, 2015 at 5:05 PM, Brian Call
brian.c...@soterawireless.com wrote:
 Hi Tomoko,

 Thank you for your response! We’ve actually never seen this before in the 
 three+ years of developing using Lucene 3.6.x. The only time we’ve ever seen 
 this kind of exception is once recently in a running production system and it 
 caught me way off guard. We’re deploying on Suse linux (enterprise), and 
 jdk1.7.

 Our application creates and deletes indices within the context of a single 
 thread so I don’t think another thread is abruptly closing the index. All 
 index open/close operations are always done sequentially in the context of 
 one thread as operation requests are received.

 Blessings,
 Brian



 On Jan 6, 2015, at 2:16 AM, Tomoko Uchida tomoko.uchida.1...@gmail.com 
 wrote:

 Hi,

 How often does this error occur?
 You do not tell the lucene version, but I guess you use lucene 3.x
 according to the stack trace...
 IndexWriter would not be closed until IndexWriter.close() method is called
 explicitly.
 https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_2/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
  
 https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_2/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java

 Do you have custom codes wrapping lucene objects? There can be any codes
 which call IndexWriter.close() unexpectedly?

 If your application seems to have no problem, you need to share more
 information including lucene version and system environments.

 Regards,
 Tomoko


 2015-01-06 8:32 GMT+09:00 Brian Call brian.c...@soterawireless.com 
 mailto:brian.c...@soterawireless.com:

 Hi Guys,

 So I’m seeing a problem in production that is very bizarre and I wanted to
 see if anyone else has encountered this issue and/or has a suggestion or
 fix. Here goes:

 We create a wrapper around an index and searcher manager to encapsulate
 both. First we create the IndexWriter and then immediately after create the
 SearcherManager, like this:

 indexWriter = new IndexWriter(indexDirectory, getWriterConfig());
 searcherManager = new SearcherManager(indexWriter, true, new
 ExecutorSearcherFactory());


 During construction of the SearcherManager this is thrown:

 Caused by: org.apache.lucene.store.AlreadyClosedException: this
 IndexWriter is closed
at
 org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:766)
at
 org.apache.lucene.index.IndexWriter.getDirectory(IndexWriter.java:1909)
at
 com.triagewireless.h1s.session.data.index.LuceneIndexMergeScheduler.getIndexDirectory(LuceneIndexMergeScheduler.java:162)
at
 com.triagewireless.h1s.session.data.index.LuceneIndexMergeScheduler.access$000(LuceneIndexMergeScheduler.java:31)
at
 com.triagewireless.h1s.session.data.index.LuceneIndexMergeScheduler$MergeTask.equals(LuceneIndexMergeScheduler.java:127)
at
 java.util.concurrent.ArrayBlockingQueue.contains(ArrayBlockingQueue.java:497)
at
 com.triagewireless.h1s.session.data.index.LuceneIndexMergeScheduler.merge(LuceneIndexMergeScheduler.java:148)
at
 org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2740)
at
 org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2734)
at
 org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:457)
at
 org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:399)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:296)
at
 org.apache.lucene.search.SearcherManager.init(SearcherManager.java:82)
at
 com.triagewireless.h1s.session.data.index.LuceneIndex.initNoCache(LuceneIndex.java:312)
at
 com.triagewireless.h1s.session.data.index.LuceneIndex.init(LuceneIndex.java:270)
at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
 org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:346)
at
 org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:299)
at
 org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:132)
... 57 more


 Why? No other threads are acting on this index, etc. The index writer was
 just created so how could it already be closed?

 I’m completely baffled on this one guys… so many thanks in advance for
 your help! I’ll take any 

Re: batch-update-pattern, NoMergeScheduler?

2014-12-23 Thread Ian Lea
Hi


I can't give an exact answer to your question but my experience has
been that it's best to leave all the merge/buffer/etc settings alone.
If you are doing a bulk update of a large number of docs then it's no
surprise that you are seeing a heavy IO load.  If you can, it's likely
to be worth giving lucene a dedicated disk or at least make sure
there's as little contention as possible - that's just general advice
for any workload.  There is always going to a limiting factor
somewhere.

You could also experiment with multiple threads, or multiple jobs
writing to separate indexes with a standalone merge at the end.  In my
experience these have generally been more trouble than they're worth,
but the occasions when I do bulk loads of large number of docs are
sufficiently rare that I'm not too bothered how long it takes.


--
Ian.


On Mon, Dec 22, 2014 at 9:45 AM, Clemens Wyss DEV clemens...@mysign.ch wrote:
 One of our indexes is updated completely quite frequently - batch update 
 or re-index.
 If so more than 2million documents are added/updated to/in the very index. 
 This creates an immense IO load on our system. Does it make sense to set 
 merge scheduler to NoMergeScheduler (and/or MergePolicy to NoMergePolicy). Or 
 is merging not relevant as the commit is done at the very end only?

 Context information:
 At the moment the writer's config consists only of setRAMBufferSizeMB:
 IndexWriterConfig config = new IndexWriterConfig( 
 IndexManager.CURRENT_LUCENE_VERSION, analyzer );
 config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
 //config.setMergeScheduler( NoMergeScheduler.INSTANCE );
 config.setRAMBufferSizeMB( 20 );

 The update logic is as follows:
 indexWriter.deleteAll()
 ...
 for all elements do {
 ...
 indexWriter.updateDocument( term, doc ); // in order to omit duplicate 
 entries
 ...
 }
 indexWriter.commit

 What is the proposed way to perform such a batch update?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Ian Lea
Telling us the version of lucene and the OS you're running on is
always a good idea.

A guess here is that you aren't closing index readers, so the JVM will
be holding on to deleted files until it exits.

A combination of du, ls, and lsof commands should prove it, or just
lsof: run it against the java process and look for deleted files.  If
you're on unix that is.
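And to rule out leaks, make sure every reader gets released, e.g.
(untested sketch, assuming 4.x):

  try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // search and consume results inside this block
  }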


--
Ian.


On Mon, Nov 10, 2014 at 11:03 PM, Rob Nikander rob.nikan...@gmail.com wrote:
 Hi,

 I have an index that's about 700 MB, and it grows over days until it
 causes problems with disk size, at about 5GB.  If the JVM process ends, the
 index shrinks back to about 700MB. I'm calling IndexWriter.commit() all the
 time.  What else do you call to get it to compact its use of space?

 thank you,
 Rob

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SpanTermQuery Issue

2014-10-03 Thread Ian Lea
Toronto != toronto.  From the javadocs for StandardAnalyzer:

Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter,

LowerCaseFilter does what you would expect.
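So search for the analyzed, lowercased form (untested):

  SpanTermQuery test2 = new SpanTermQuery(new Term("f", "toronto"));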


--
Ian.



On Fri, Oct 3, 2014 at 3:52 AM, Xu Chu 1989ch...@gmail.com wrote:
 Hi everyone

 In the following piece of code, Query test1 returns a result, while Query
 test2 does not!! Obviously “Toronto” appears in the doc.
 Can anyone tell me what’s wrong?

 Thanks



 private static void testRAMSpan() throws IOException
 {
     RAMDirectory directory = new RAMDirectory();
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
     IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);

     IndexWriter writer = new IndexWriter(directory, iwc);
     Document doc = new Document();
     doc.add(new TextField("f",
             "Project X took place in Waterloo. It was successful",
             Field.Store.YES));
     writer.addDocument(doc);

     doc = new Document();
     doc.add(new TextField("f",
             "Project Y took place in Toronto. It was also good.",
             Field.Store.YES));
     writer.addDocument(doc);
     writer.close();

     IndexReader reader = DirectoryReader.open(directory);
     IndexSearcher searcher = new IndexSearcher(reader);

     SpanTermQuery test1 = new SpanTermQuery(new Term("f", "good"));
     System.out.println("test span query: " + test1.toString());
     TopDocs results1 = searcher.search(test1, 100);
     System.out.println("num result 1: " + results1.totalHits);

     SpanTermQuery test2 = new SpanTermQuery(new Term("f", "Toronto"));
     System.out.println("test span query: " + test2.toString());
     TopDocs results2 = searcher.search(test2, 100);
     System.out.println("num result 2: " + results2.totalHits);
 }

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Case sensitivity

2014-09-19 Thread Ian Lea
PerFieldAnalyzerWrapper is the way to mix and match fields and analyzers.

Personally I'd simply store the case-insensitive field with a call to
toLowerCase() on the value and equivalent on the search string.

You will of course use more storage, but you don't need to store the
text contents for both variants so it won't be double.  Unless you
aren't storing the original either.
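Untested sketch of that (field names made up):

  doc.add(new StringField("name", value, Field.Store.YES));
  doc.add(new StringField("name_lc", value.toLowerCase(Locale.ROOT), Field.Store.NO));

and then run case-insensitive searches against name_lc with a
lowercased term.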


--
Ian.


On Fri, Sep 19, 2014 at 2:50 PM, John Cecere john.cec...@oracle.com wrote:
 I've considered this, but there are two problems with it. First of all, it
 feels like I'm still taking up twice the storage, I'm just doing it using a
 single index rather than two of them. This doesn't sound like it's buying me
 anything.

 The second problem with this is simply that I haven't figured out how to do
 this. I assume in creating two fields you would implement two separate
 analyzers on them, one using LowerCaseFilter and the other not. I haven't
 made the connection on how to tie an Analyzer to a particular field. It
 seems to be tied to the IndexWriterConfig and the IndexWriter.

 Thanks,
 John


 On 9/19/14 9:36 AM, Paul Libbrecht wrote:

 two fields?

 paul


 On 19 sept. 2014, at 15:07, John Cecere john.cec...@oracle.com wrote:

 Is there a way to set up Lucene so that both case-sensitive and
 case-insensitive searches can be done without having to generate two
 indexes?

 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Ian Lea
Hi


On running a quick test after a handful of minor code changes to deal
with 4.10 deprecations, a program that updates an existing index
failed with

Exception in thread "main" java.lang.IllegalStateException: cannot
write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)

and along the way did something to the index to make it unusable.

Digging a bit deeper and working on a different old test index that
was lying around, and taking a backup first this time, this is
reproducible.

The working index:

total 1036
-rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
-rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
-rw-r--r-- 1 tril users   1119 Jan 18  2013 _0.fnm
-rw-r--r-- 1 tril users 378015 Jan 18  2013 _0_Lucene40_0.frq
-rw-r--r-- 1 tril users 350628 Jan 18  2013 _0_Lucene40_0.tim
-rw-r--r-- 1 tril users  13988 Jan 18  2013 _0_Lucene40_0.tip
-rw-r--r-- 1 tril users311 Jan 18  2013 _0.si
-rw-r--r-- 1 tril users 69 Jan 18  2013 segments_2
-rw-r--r-- 1 tril users 20 Jan 18  2013 segments.gen

and output from 4.10 CheckIndex

Opening index @ index/

Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
  1 of 1: name=_0 docCount=15730
version=4.0.0.2
codec=Lucene40
compound=false
numFiles=7
size (MB)=0.987
diagnostics = {os=Linux, os.version=3.1.0-1.2-desktop,
source=flush, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06
02:58:12, os.arch=amd64, java.version=1.7.0_10, java.vendor=Oracle
Corporation}
no deletions
test: open reader.OK
test: check integrity.OK
test: check live docs.OK
test: fields..OK [13 fields]
test: field norms.OK [0 fields]
test: terms, freq, prox...OK [53466 terms; 217447 terms/docs
pairs; 139382 tokens]
test: stored fields...OK [15730 total field count; avg 1 fields per doc]
test: term vectorsOK [0 total vector count; avg 0
term/freq vector fields per doc]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0
NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

No problems were detected with this index.


Now run this little program

public static void main(final String[] _args) throws Exception {
File index = new File(_args[0]);
IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_4_10_0,
new StandardAnalyzer());
iwcfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
Directory d = FSDirectory.open(index, new SimpleFSLockFactory(index));
IndexWriter iw = new IndexWriter(d, iwcfg);
Document doc1 = new Document();
doc1.add(new StringField("type", "test", Field.Store.NO));
iw.addDocument(doc1);
iw.close();
}

and it fails with

Exception in thread "main" java.lang.IllegalStateException: cannot
write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:524)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:1017)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4549)
at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3062)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:915)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:986)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:956)
at t.main(t.java:25)

and when run CheckIndex again get


Opening index @ index/

ERROR: could not read any segments file in directory
java.nio.file.NoSuchFileException: /tmp/lucene/index/_0.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:176)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:334)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at 
org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:52)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:362)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:458)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:759)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:454)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:414)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)

which is true

total 1032
-rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
-rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
-rw-r--r-- 1 tril users   1119 

Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Ian Lea
Sent to your personal email address.


--
Ian.


On Wed, Sep 10, 2014 at 12:36 PM, Robert Muir rcm...@gmail.com wrote:
 Ian, this looks terrible, thanks for reporting this. Is there any
 possible way I could have a copy of that working index to make it
 easier to reproduce?

 On Wed, Sep 10, 2014 at 7:01 AM, Ian Lea ian@gmail.com wrote:
 Hi


 On running a quick test after a handful of minor code changes to deal
 with 4.10 deprecations, a program that updates an existing index
 failed with

 Exception in thread "main" java.lang.IllegalStateException: cannot
 write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
 at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)

 and along the way did something to the index to make it unusable.

 Digging a bit deeper and working on a different old test index that
 was lying around, and taking a backup first this time, this is
 reproducible.

 The working index:

 total 1036
 -rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
 -rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
 -rw-r--r-- 1 tril users   1119 Jan 18  2013 _0.fnm
 -rw-r--r-- 1 tril users 378015 Jan 18  2013 _0_Lucene40_0.frq
 -rw-r--r-- 1 tril users 350628 Jan 18  2013 _0_Lucene40_0.tim
 -rw-r--r-- 1 tril users  13988 Jan 18  2013 _0_Lucene40_0.tip
 -rw-r--r-- 1 tril users311 Jan 18  2013 _0.si
 -rw-r--r-- 1 tril users 69 Jan 18  2013 segments_2
 -rw-r--r-- 1 tril users 20 Jan 18  2013 segments.gen

 and output from 4.10 CheckIndex

 Opening index @ index/

 Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
   1 of 1: name=_0 docCount=15730
 version=4.0.0.2
 codec=Lucene40
 compound=false
 numFiles=7
 size (MB)=0.987
 diagnostics = {os=Linux, os.version=3.1.0-1.2-desktop,
 source=flush, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06
 02:58:12, os.arch=amd64, java.version=1.7.0_10, java.vendor=Oracle
 Corporation}
 no deletions
 test: open reader.OK
 test: check integrity.OK
 test: check live docs.OK
 test: fields..OK [13 fields]
 test: field norms.OK [0 fields]
 test: terms, freq, prox...OK [53466 terms; 217447 terms/docs
 pairs; 139382 tokens]
 test: stored fields...OK [15730 total field count; avg 1 fields per 
 doc]
 test: term vectorsOK [0 total vector count; avg 0
 term/freq vector fields per doc]
 test: docvalues...OK [0 docvalues fields; 0 BINARY; 0
 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

 No problems were detected with this index.


 Now run this little program

 public static void main(final String[] _args) throws Exception {
 File index = new File(_args[0]);
 IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_4_10_0,
 new StandardAnalyzer());
 iwcfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
 Directory d = FSDirectory.open(index, new SimpleFSLockFactory(index));
 IndexWriter iw = new IndexWriter(d, iwcfg);
 Document doc1 = new Document();
 doc1.add(new StringField("type", "test", Field.Store.NO));
 iw.addDocument(doc1);
 iw.close();
 }

 and it fails with

 Exception in thread "main" java.lang.IllegalStateException: cannot
 write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
 at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
 at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:524)
 at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:1017)
 at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4549)
 at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3062)
 at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
 at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:915)
 at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:986)
 at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:956)
 at t.main(t.java:25)

 and when I run CheckIndex again I get


 Opening index @ index/

 ERROR: could not read any segments file in directory
 java.nio.file.NoSuchFileException: /tmp/lucene/index/_0.si
 at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
 at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
 at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
 at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:176)
 at java.nio.channels.FileChannel.open(FileChannel.java:287)
 at java.nio.channels.FileChannel.open(FileChannel.java:334)
 at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
 at org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:52)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:362)
 at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:458)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)

Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Ian Lea
Yes, quite possible.  I do sometimes download and test beta versions.

This isn't really a problem for me - it has only happened on test
indexes that I don't care about, but there might be live indexes out
there that are also affected and having them made unusable would be
undesirable, to put it mildly.  A message saying "Unsupported version"
would be much better.


--
Ian.


On Wed, Sep 10, 2014 at 12:41 PM, Uwe Schindler u...@thetaphi.de wrote:
 Hi Ian,

 this index was created with the BETA version of Lucene 4.0:

 Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
   1 of 1: name=_0 docCount=15730

 4.0.0.2 was the index version number of Lucene 4.0-BETA. This is not a 
 supported version and may not open correctly. In Lucene 4.10 we changed 
 version handling and parsing version numbers a bit, so this may be the cause 
 for the error.

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Ian Lea [mailto:ian@gmail.com]
 Sent: Wednesday, September 10, 2014 1:01 PM
 To: java-user@lucene.apache.org
 Subject: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo
 unless codec is Lucene3x (got: Lucene40)

 Hi


 On running a quick test after a handful of minor code changes to deal with
 4.10 deprecations, a program that updates an existing index failed with

 Exception in thread "main" java.lang.IllegalStateException: cannot write 3x
 SegmentInfo unless codec is Lucene3x (got: Lucene40) at
 org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)

 and along the way did something to the index to make it unusable.

 Digging a bit deeper and working on a different old test index that was lying
 around, and taking a backup first this time, this is reproducible.

 The working index:

 total 1036
 -rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
 -rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
 -rw-r--r-- 1 tril users   1119 Jan 18  2013 _0.fnm
 -rw-r--r-- 1 tril users 378015 Jan 18  2013 _0_Lucene40_0.frq
 -rw-r--r-- 1 tril users 350628 Jan 18  2013 _0_Lucene40_0.tim
 -rw-r--r-- 1 tril users  13988 Jan 18  2013 _0_Lucene40_0.tip
 -rw-r--r-- 1 tril users    311 Jan 18  2013 _0.si
 -rw-r--r-- 1 tril users 69 Jan 18  2013 segments_2
 -rw-r--r-- 1 tril users 20 Jan 18  2013 segments.gen

 and output from 4.10 CheckIndex

 Opening index @ index/

 Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
   1 of 1: name=_0 docCount=15730
 version=4.0.0.2
 codec=Lucene40
 compound=false
 numFiles=7
 size (MB)=0.987
 diagnostics = {os=Linux, os.version=3.1.0-1.2-desktop, source=flush,
 lucene.version=4.0.0 1394950 - rmuir - 2012-10-06 02:58:12, os.arch=amd64,
 java.version=1.7.0_10, java.vendor=Oracle Corporation}
 no deletions
 test: open reader.OK
 test: check integrity.OK
 test: check live docs.OK
 test: fields..OK [13 fields]
 test: field norms.OK [0 fields]
 test: terms, freq, prox...OK [53466 terms; 217447 terms/docs pairs; 
 139382
 tokens]
 test: stored fields...OK [15730 total field count; avg 1 fields per 
 doc]
 test: term vectorsOK [0 total vector count; avg 0 term/freq 
 vector
 fields per doc]
 test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0
 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

 No problems were detected with this index.


 Now run this little program

 public static void main(final String[] _args) throws Exception {
 File index = new File(_args[0]);
 IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_4_10_0,
 new StandardAnalyzer());
 iwcfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
 Directory d = FSDirectory.open(index, new SimpleFSLockFactory(index));
 IndexWriter iw = new IndexWriter(d, iwcfg);
 Document doc1 = new Document();
 doc1.add(new StringField("type", "test", Field.Store.NO));
 iw.addDocument(doc1);
 iw.close();
 }

 and it fails with

 Exception in thread "main" java.lang.IllegalStateException: cannot write 3x
 SegmentInfo unless codec is Lucene3x (got: Lucene40)
 at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
 at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:524)
 at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:1017)
 at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4549)
 at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3062)
 at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
 at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:915)
 at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:986)
 at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:956)
 at t.main(t.java:25)

 and when I run CheckIndex again I get


 Opening index @ index/

 ERROR: could not read any segments file in directory

Re: Fetching stored data takes more time

2014-08-04 Thread Ian Lea
Retrieving stored data is always likely to take longer than not doing
so.  There are some tips in
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.

But taking over a minute to retrieve data for 50 hits sounds
excessive.  Are you sure about those figures?
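
If you only need one or two stored fields per hit it can also help to
load just those rather than the whole document.  A minimal untested
sketch, assuming a stored field called "content" (names made up):

Set<String> fieldsToLoad = Collections.singleton("content");
for (ScoreDoc sd : topDocs.scoreDocs) {
    // loads only the named stored fields instead of every stored field
    Document d = searcher.doc(sd.doc, fieldsToLoad);
    String content = d.get("content");
}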


--
Ian.


On Thu, Jul 31, 2014 at 2:36 AM, Ganesh emailg...@yahoo.co.in wrote:
 Hello all,

 I am using Lucene 4.9 and the index size is 7 GB. Search is faster, it takes
 1 second to return the results  (50 hits). I loop through the result and
 fetching the stored data for all and it takes more time. Some times it takes
 more than a minute.

 Could some one guide me.. how to resolve this issue.

 Regards
 Ganesh




Re: How does Lucene decide which fields to index?

2014-08-04 Thread Ian Lea
You tell it what you want.  See the javadocs for
org.apache.lucene.document.Field and friends such as TextField.
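
If the underlying problem is that getTermVectors() only shows some
fields, note that it only returns fields indexed with term vectors
enabled.  An untested sketch, using your field name (titleText is a
placeholder):

FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setStoreTermVectors(true);  // without this the field won't appear in getTermVectors()
doc.add(new Field("doctitle", titleText, ft));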


--
Ian.


On Mon, Aug 4, 2014 at 2:43 PM, Sachin Kulkarni kulk...@hawk.iit.edu wrote:
 Hi,

 I am using lucene 4.6.0 to index a dataset.
 I have the following fields:
 doctitle, docbody, docname, docid, date.
 But when I access the fields using indexReader.getTermVectors(indexedDocID)
 then I only get two fields
 docbody and docname.

 How do I index so that I also get doctitle?

 Thank you.

 Regards,
 Sachin Kulkarni




Re: More Like This query is not working.

2014-07-21 Thread Ian Lea
That's not completely self-contained which means that we can't compile
and test it ourselves.

This looks very dodgy:

doc.add(new TextField("searchedText", ...

mlt.setFieldNames("SearchedText");
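
Field names are case-sensitive, so MoreLikeThis is looking at a field
that was never indexed.  The names need to match exactly, something
like this (sketch):

mlt.setFieldNames(new String[] { "searchedText" });  // must match the name used at index time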


--
Ian.


On Mon, Jul 21, 2014 at 12:41 PM, Rajendra Rao
rajendra@launchship.com wrote:
 Hello Ian,

 I am using version 4.1


 I am expecting the id of the SearchedText which is similar to QueryStr,
 but I am getting a ScoreDoc[] hits of size 0

 Below is code I am using.

 ___

 My Parameter is
 ArrayList which contains DocID and Criteria Text (collection of Documents
 to be passed for indexing)
 Query  text  in QueryStr  .


 StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);

 // 1. create the index
 Directory index = new RAMDirectory();
 IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41,
 analyzer);

 IndexWriter w = new IndexWriter(index, config);
 int counter = 0;

 while (counter < lstDocBean.size()) {

 String searchedText = lstDocBean.getText();
 String docID = lstDocBean.getDocID();

 Document doc = new Document();
 doc.add(new TextField("DocID", docID, shouldStore.YES));
 doc.add(new TextField("searchedText", searchedText,
 shouldStore.YES));

 w.addDocument(doc);

 counter++;

 }

 w.close();

 int hitsperpage = 10;
 IndexReader reader = IndexReader.open(index);

 IndexSearcher searcher = new IndexSearcher(reader);

 // get similar doc
 // Reader sReader = new StringReader();
 MoreLikeThis mlt = new MoreLikeThis(reader);
 mlt.setAnalyzer(analyzer);

 mlt.setFieldNames("SearchedText");

 Reader reader1 = new StringReader(queryStr);
 Query Searchedquery = mlt.like(reader1, null);

 TopDocs results = searcher.search(Searchedquery, 10);
 ScoreDoc[] hits = results.scoreDocs;

 for (int i = 0; i < hits.length; ++i) {
 int docId = hits[i].doc;

 Document d = searcher.doc(docId);
 String sys_DocID = d.get("DocID");
 double score = hits[i].score;

 }










 On Fri, Jul 18, 2014 at 7:34 PM, Ian Lea ian@gmail.com wrote:

 You need to supply more info.  Tell us what version of lucene you are
 using and provide a very small completely self-contained example or
 test case showing exactly what you expect to happen and what is
 happening instead.


 --
 Ian.


 On Fri, Jul 18, 2014 at 11:50 AM, Rajendra Rao
 rajendra@launchship.com wrote:
  Hello
 
 
  I am using more like this query .But size of Score Docs i am getting is 0
  I found that it
  In Query Searchedquery = mlt.like(reader1, criteria);
 
  query object contain following value
  boost 1.0
  all clauses element data is null
 
 
  I used following code
  MoreLikeThis mlt = new MoreLikeThis(reader);
   // 
  mlt.setAnalyzer(analyzer);
 
   Reader reader1 = new StringReader(Requirement);
  Query Searchedquery = mlt.like(reader1, criteria);
 
 
  please guide me.




Re: Lucene Query Wrong Result for phrase.

2014-07-18 Thread Ian Lea
Probably because something in the analysis chain is removing the
hyphen.  Check out the javadocs.  Generally you should also make sure
you use the same analyzer at index and search time.


--
Ian.


On Fri, Jul 18, 2014 at 6:52 AM, itisismail it.is.ism...@gmail.com wrote:
 Hi, I have created an index with 1 field with a simple message like (hello - world).
 Now when I search like +body: \"hello world\" I should not
 get any result, because I have wrapped my search words in double quotes and
 did not specify the dash(-) between (hello and world), but I am still getting
 that result. I am using Lucene version 3.0 with the snowball analyzer, and
 construct the query like: String qry = "+body:\"hello world\""; Query query = new
 QueryParser(Version.LUCENE_30, "body", analyzer).parse(qry); Why is lucene
 ignoring the dash(-) when the search is a phrase?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Lucene-Query-Wrong-Result-for-phrase-tp4147842.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: More Like This query is not working.

2014-07-18 Thread Ian Lea
You need to supply more info.  Tell us what version of lucene you are
using and provide a very small completely self-contained example or
test case showing exactly what you expect to happen and what is
happening instead.


--
Ian.


On Fri, Jul 18, 2014 at 11:50 AM, Rajendra Rao
rajendra@launchship.com wrote:
 Hello


 I am using more like this query .But size of Score Docs i am getting is 0
 I found that it
 In Query Searchedquery = mlt.like(reader1, criteria);

 query object contain following value
 boost 1.0
 all clauses element data is null


 I used following code
 MoreLikeThis mlt = new MoreLikeThis(reader);
  // 
 mlt.setAnalyzer(analyzer);

  Reader reader1 = new StringReader(Requirement);
 Query Searchedquery = mlt.like(reader1, criteria);


 please guide me.




Re: Can phrasequery allow mismatch?

2014-07-17 Thread Ian Lea
Might be able to do it with some combination of SpanNearQuery, with
suitable values for slop and inOrder, combined into a BooleanQuery
with setMinimumNumberShouldMatch = number of SpanNearQuery instances -
1.

So, making this up as I go along, you'd have

SpanNearQuery sn1 = B after A, slop 0, in order
SpanNearQuery sn2 = C after B, slop 0, in order
SpanNearQuery sn3 = D after C, slop 0, in order
SpanNearQuery sn4 = E after D, slop 0, in order
BooleanQuery bq = whatever(sn1, sn2, sn3, sn4)
bq.setMinimumNumberShouldMatch(3)

Might work.  The value 3 should perhaps be 2.  Or a larger value of
slop might help to match C X E rather than C D E.  In that case
minshouldmatch would be 4, I think.  I'm getting confused so will
stop.
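
For reference, one of those pairwise clauses in code might look like
this (untested sketch, assuming a field called "f"):

SpanNearQuery sn1 = new SpanNearQuery(
    new SpanQuery[] { new SpanTermQuery(new Term("f", "a")),
                      new SpanTermQuery(new Term("f", "b")) },
    0,      // slop
    true);  // in order
BooleanQuery bq = new BooleanQuery();
bq.add(sn1, BooleanClause.Occur.SHOULD);
// ... add sn2, sn3, sn4 the same way ...
bq.setMinimumNumberShouldMatch(3);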


--
Ian.



On Thu, Jul 17, 2014 at 8:22 AM, Yonghui Zhao zhaoyong...@gmail.com wrote:
 Hi,

 I want to implement a query like phrase query with slop 0, but I can allow
 one term mismatch.

 For example,  the text is A  B  C D E

 I want to match this text with the query  A B C X E.

 X mismatches the D.

 i.e. Query A B C D E  will match “W1 W2 W3 W4 W5”,  the 5 words are
 consecutive and at most one word is mismatched.




Re: Warm up IndexReader

2014-07-14 Thread Ian Lea
There's no magic to it - just build a query or six and fire them at
your newly opened reader.  If you want to put the effort in you could
track recent queries and use them, or make sure you warm up searches
on particular fields.  Likewise, if you use Lucene's sorting and/or
filters, it might be worth adding them in to the mix as well.

Start simple and stop when you've found something good enough.
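
A warm up could be as simple as this, run whenever you open a new
reader (sketch; the field and terms are placeholders - use whatever is
typical for your app):

IndexSearcher searcher = new IndexSearcher(reader);
for (String term : new String[] { "common", "query", "terms" }) {
    // throwaway searches to prime the OS file cache and field caches
    searcher.search(new TermQuery(new Term("contents", term)), 10);
}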


--
Ian.


On Fri, Jul 11, 2014 at 5:16 PM, Jason.H 469673...@qq.com wrote:
 I found my first search on new created IndexReader is slow , but after i made 
 a search , it will be much faster
 i'd like to do such warm up in the back end of my application rather than 
 wait for the user to warm this up because it's a little bit unfriendly design 
 .
 I  know a warmer would do the intial warm up , but i don't know the exact 
 way to implement this . i googled this , but i can't find any solution .
 can anybody do some help here ~ tks advance ! :D




Re: IndexSearcher.doc thread safe problem

2014-07-09 Thread Ian Lea
It's more likely to be a demonstration that concurrent programming is
hard, results often hard to predict and debugging very hard.

Or perhaps you simply need to add acceptsDocsOutOfOrder() to your
collector, returning false.

Either way, hard to see any evidence of a thread-safety problem in lucene.

If adding acceptsDocsOutOfOrder() doesn't fix it, I suggest you verify
that your queue is getting the values you expect, in the order you
expect, consistently.  Then worry about the display part, first
checking everything without any lucene calls.
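
To be concrete, the in-order collector would look something like this
(untested sketch, based on your summarized code):

searcher.search(query, filter, new Collector() {
    private int docBase;
    public void setScorer(Scorer scorer) {}
    public void setNextReader(AtomicReaderContext ctx) { docBase = ctx.docBase; }
    public void collect(int doc) { queue.add(docBase + doc); }
    public boolean acceptsDocsOutOfOrder() { return false; }  // collect in docid order
});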


--
Ian.

On Wed, Jul 9, 2014 at 5:59 AM, 김선무 guks...@gmail.com wrote:
 Hi all,

 I know IndexSearcher is thread safe.
 But IndexSearcher.doc is not thread safe maybe...

 I try to below
 
 First, I extract docID at index directory. And that docID add on
 queue(ConcurrentLinkedQueue)

 Second, extract field value using docID poll at this queue after extract
 process end. This process is  work to multi-threads.

 For this I used the following summarized code below:
 searcher.search( query, filter, new Collector() { public void collect( int
 doc ) { queue.add( docBase + doc ); } } );
 Thread thread1 = new Thread( () -> { while( !queue.isEmpty() ) {
 System.out.println( searcher.doc(queue.poll()).get("content") ); } } );
 Thread thread2 = new Thread( thread1 );
 thread1.start();
 thread2.start();
 ---

 Result was different in every execution.

 My method is wrong? or IndexSearcher bug?

 Please help me




Re: re-use IndexWriter

2014-07-08 Thread Ian Lea
Read the javadocs to understand the difference between commit() and
flush().  You need commit(), or close().

There are no hard and fast rules and it depends on how much data you
are indexing, how fast, how many searches you're getting and how up to
date they need to be.  And how much you worry about losing indexed
data.

One option is to pick a value that makes sense to you and commit() the
writer every n seconds|minutes|hours|docs.  close() it when your
indexing job exits.  You'll need to reopen index searchers to pick up
changes.  See the javadocs for IndexSearcher.

Another option is to use lucene's near-real-time (NRT) features.  Also
see the IndexSearcher javadocs for a way in to that.
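
With NRT, SearcherManager does most of the work for you.  A sketch:

SearcherManager mgr = new SearcherManager(writer, true, null);
// after a batch of updates:
mgr.maybeRefresh();
IndexSearcher s = mgr.acquire();
try {
    // searches here see the latest refreshed changes, no commit needed
} finally {
    mgr.release(s);
}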


--
Ian.


On Tue, Jul 8, 2014 at 10:08 AM, Jason.H 469673...@qq.com wrote:
 nowadays , i've been trying every way to improve the performance of indexing 
 , IndexWriter's close operation is really costly , and the Lucene's doc 
 sugguest to re-use IndexWriter instance , i  did it , i  kept the indexWriter 
 instance , and give it back to every request thread , But there comes a big 
 problem ,  i never search the index changes because the index changes is till 
 in the RAM , maybe there's a way to flush all the changes to the stable 
 Storage and this operation don't close the IndexWriter so i could re-use it  
 . am i right at this point ?

 there're several point i don't quite understand ..

 1, what's the difference between commit and flush  ?   i thought with these 
 two method , i could see the changes in my Directory without closing 
 IndexWriter .

 2, when should i close the writer ? if i use it Singleton(i don't have to 
 worry about the LockObtainException) , and i don't have to worry about the 
 changes because commit and flush would do this , then i don't have to close 
 it any more ...




Re: Lucene Upgrade from 2.9.x to 4.7.x

2014-05-29 Thread Ian Lea
The migration guide that came out with 4.0 is probably the best place to start.

http://lucene.apache.org/core/4_8_1/MIGRATE.html is from the current
release but probably hasn't changed since 4.0.  There's also the
changes file with every release.  And if you browse the list archives
I expect you'll find similar questions and answers.


--
Ian.


On Thu, May 29, 2014 at 3:12 PM, Baldwin, David david_bald...@bmc.com wrote:
 I am looking for the same.   Need to upgrade from 2.9.2 .

 -Original Message-
 From: Buddhavarapu, Suresh [mailto:suresh.buddhavar...@emc.com]
 Sent: Thursday, May 29, 2014 7:57 AM
 To: java-user@lucene.apache.org
 Subject: Lucene Upgrade from 2.9.x to 4.7.x

 Hello,

 I'm looking for some documents/information on upgrade from Lucene 2.9.x to 
 4.7.2. This is the first time I'm doing a upgrade. Can someone point me to 
 some help in this?
 Is there a documentation on what has changed from 2.9.x to 4.7.2?

 Thanks,
 Suresh




Re: Which is better ,Search through query and whole text document or search through query with document field.

2014-02-13 Thread Ian Lea
The one that meets your requirements most easily will be the best.

If people will want to search for words in particular fields you'll
need to split it but if they only ever want to search across all
fields there's no point.

A common requirement is to want both, in which case you can split it
and also store everything in a common field called something like
contents.  Or look at MultiFieldQueryParser.
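
MultiFieldQueryParser drops in where the standard parser goes
(sketch; the field names are made up and the Version should match
yours):

QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_46,
    new String[] { "title", "author", "contents" }, analyzer);
Query q = parser.parse(userInput);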


--
Ian.


On Thu, Feb 13, 2014 at 10:16 AM, Rajendra Rao
rajendra@launchship.com wrote:
 Hello,

 I have a query and a document. It's unstructured natural text. I used lucene
 for searching the document on the query. If I separate the document into
 fields and then search, what will be the difference?
 I can't check it because right now I don't have field-separated data. But in
 future we will have.

 Thanks.




Re: Delete a field in old documents

2014-01-07 Thread Ian Lea
You'll have to reindex.


--
Ian.


On Mon, Jan 6, 2014 at 2:11 PM, manoj raj manojluc...@gmail.com wrote:
 Hi,

 I have stored fields. I want to delete a single field in all documents. Can
 i do that without reindexing? if yes, is it costly operations..?


 Thanks,
 Manoj.




Re: Slow Index Writes

2014-01-07 Thread Ian Lea
I don't have a working example but I believe it's pretty
straightforward.  See  DirectoryReader.open() and .openIfChanged().
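
Something like this, reopening only when the index has changed
(untested sketch):

DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
// ... later, after the writer has committed more changes ...
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {  // null means nothing has changed
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}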


--
Ian.


On Tue, Jan 7, 2014 at 1:41 PM, Klaus Schaefers
klaus.schaef...@ligatus.com wrote:
 Hi,


 I was looking for some examples but I just found some using an NRTManager
 class? In Lucene 4.5 I cannot find the class (missing a maven dependency?).
 Can anyone point me to a working example?

 Cheers,

 Klaus



 On Fri, Jan 3, 2014 at 11:49 AM, Ian Lea ian@gmail.com wrote:

 You will indeed get poor performance if you commit for every doc.  Can
 you compromise and commit every, say, 1000 docs, or once every few
 minutes, or whatever makes sense for your app.

 Or look at lucene's near-real-time search features.  Google Lucene
 NRT for info.

 Or use Elastic Search.


 --
 Ian.


 On Fri, Jan 3, 2014 at 10:21 AM, Klaus Schaefers
 klaus.schaef...@ligatus.com wrote:
  Hi,
 
  I am trying to use a lucene as a kind of key value store, but I
 encountered
  some bad performance issues. When I try to add my data as documents to
 the
  index I get an average write rate of 3 documents / second!! This seems to
  me ridiculously slow and I guess I must have somewhere an error. Please
  have a look at my code:
 
 
 
   Directory dir = new NIOFSDirectory(file);
  Analyzer analyzer =  new StandardAnalyzer(Version.LUCENE_45);
  IndexWriterConfig config = new
 IndexWriterConfig(Version.LUCENE_45,
  analyzer);
  IndexWriter writer = new IndexWriter(dir, config);
 
  int eventCount = 1000;
  for(int i=0; i < eventCount; i++){
  Document doc = new Document();
  doc.add(new StringField("id", i + "id", Store.YES));
  doc.add(new StoredField("b", buildVector()));
  writer.addDocument(doc);
  writer.commit();
  }
  dir.close();
  writer.close()
 
 
  Not calling the commit function seems to fix the issue, but I guess this
  would then have some issues if I want to read values in the mean time. My
  normal use case would be to read something from the index, maybe alter it
  and then write back. So I would have roughly 50% of reads.
 
  I tried also an embedded version of elastic search and it manages to go
 to
  2000 documents/ per second. As its based on lucene as well I guess I do
  something wrong in my code.
 
 
  THX for the help,
 
  Klaus
 
 
  --
 
  --
 
  Klaus Schaefers
  Senior Optimization Manager
 
  Ligatus GmbH
  Hohenstaufenring 30-32
  D-50674 Köln
 
  Tel.:  +49 (0) 221 / 56939 -784
  Fax:  +49 (0) 221 / 56 939 - 599
  E-Mail: klaus.schaef...@ligatus.com
  Web: www.ligatus.de
 
  HRB Köln 56003
  Geschäftsführung:
  Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
  Dipl.-Wirtschaftsingenieur Arne Wolter





 --

 --

 Klaus Schaefers
 Senior Optimization Manager

 Ligatus GmbH
 Hohenstaufenring 30-32
 D-50674 Köln

 Tel.:  +49 (0) 221 / 56939 -784
 Fax:  +49 (0) 221 / 56 939 - 599
 E-Mail: klaus.schaef...@ligatus.com
 Web: www.ligatus.de

 HRB Köln 56003
 Geschäftsführung:
 Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
 Dipl.-Wirtschaftsingenieur Arne Wolter




Re: Slow Index Writes

2014-01-03 Thread Ian Lea
You will indeed get poor performance if you commit for every doc.  Can
you compromise and commit every, say, 1000 docs, or once every few
minutes, or whatever makes sense for your app.
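
e.g. (sketch - makeDoc() is a placeholder for your document building):

for (int i = 0; i < eventCount; i++) {
    writer.addDocument(makeDoc(i));
    if (i % 1000 == 999) {
        writer.commit();  // amortizes the commit cost over 1000 docs
    }
}
writer.close();  // commits whatever is left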

Or look at lucene's near-real-time search features.  Google Lucene
NRT for info.

Or use Elastic Search.


--
Ian.


On Fri, Jan 3, 2014 at 10:21 AM, Klaus Schaefers
klaus.schaef...@ligatus.com wrote:
 Hi,

 I am trying to use a lucene as a kind of key value store, but I encountered
 some bad performance issues. When I try to add my data as documents to the
 index I get an average write rate of 3 documents / second!! This seems to
 me ridiculously slow and I guess I must have somewhere an error. Please
 have a look at my code:



 Directory dir = new NIOFSDirectory(file);
 Analyzer analyzer =  new StandardAnalyzer(Version.LUCENE_45);
 IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45,
 analyzer);
 IndexWriter writer = new IndexWriter(dir, config);

 int eventCount = 1000;
 for(int i=0; i < eventCount; i++){
 Document doc = new Document();
 doc.add(new StringField("id", i + "id", Store.YES));
 doc.add(new StoredField("b", buildVector()));
 writer.addDocument(doc);
 writer.commit();
 }
 dir.close();
 writer.close()


 Not calling the commit function seems to fix the issue, but I guess this
 would then have some issues if I want to read values in the mean time. My
 normal use case would be to read something from the index, maybe alter it
 and then write back. So I would have roughly 50% of reads.

 I tried also an embedded version of elastic search and it manages to go to
 2000 documents/ per second. As its based on lucene as well I guess I do
 something wrong in my code.


 THX for the help,

 Klaus


 --

 --

 Klaus Schaefers
 Senior Optimization Manager

 Ligatus GmbH
 Hohenstaufenring 30-32
 D-50674 Köln

 Tel.:  +49 (0) 221 / 56939 -784
 Fax:  +49 (0) 221 / 56 939 - 599
 E-Mail: klaus.schaef...@ligatus.com
 Web: www.ligatus.de

 HRB Köln 56003
 Geschäftsführung:
 Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
 Dipl.-Wirtschaftsingenieur Arne Wolter




Re: Deletion of Index not happening in Lucene 4.3

2013-11-29 Thread Ian Lea
How do you know it's not working?  My favourite suggestion: post a
very small self-contained RAMDirectory based program or test case, or
maybe 2 in this case, for 3.6 and 4.3, that demonstrates the problem.


--
Ian.


On Fri, Nov 29, 2013 at 6:00 AM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 I try deleting the document from the index like below. It is working in case
 of Lucene 3.6. But the document is not getting deleted for Lucene 4.3


 Term term = new Term(path, value);

 mWriter.deleteDocuments(term);
 mWriter.commit();

 Please kindly help..
 --
 Thanks and Regards
 Vignesh Srinivasan




Re: java.lang.NoSuchFieldError: STOP_WORDS_SET

2013-11-13 Thread Ian Lea
Pasting that line into a chunk of code works fine for me, with 4.5
rather than 4.3 but I don't expect that matters.  Have you got a) all
the right jars in your classpath and b) none of the wrong jars?


--
Ian.

On Wed, Nov 13, 2013 at 11:20 AM, Hang Mang gucko.gu...@googlemail.com wrote:
 Hi guys,

 I'm using Lucene 4.3 and I'm getting this Exception:

 java.lang.NoSuchFieldError: STOP_WORDS_SET


 at this line in my code:

 CharArraySet DEFAULT_STOP_SET = StandardAnalyzer.STOP_WORDS_SET;


 This is driving me crazy, I don't know what's wrong!




Re: subscribe

2013-11-11 Thread Ian Lea
Have you set an analyzer when you create your IndexWriter?


--
Ian.

P.S.  Please start new questions in new messages with sensible subjects.


On Mon, Nov 11, 2013 at 9:00 AM, Rohit Girdhar rohit.ii...@gmail.com wrote:
 Hi

 I was trying to use the lucene JAVA API to create an index. I am repeatedly
 getting NullPointerException when I try to add a document with a
 TextField() field to the IndexWriter. The exception is:
 http://pastebin.com/KFZT4XNV
 I even tried to use the deprecated Field() API, but still the same
 exception.
 Any pointers?

 Thanks!




Re: IndexWriter.addDocument() gives NullPointerException when used with a doc containing TextField

2013-11-11 Thread Ian Lea
Well, you certainly shouldn't need to take any hacky approaches.
TextFields do work.  Maybe post a small but complete code sample if
you can't figure it out.


--
Ian.


On Mon, Nov 11, 2013 at 7:45 PM, Rohit Girdhar rohit.ii...@gmail.com wrote:
 Hi Ian,
 Yes, I am using a StandardAnalyzer with the IndexWriter.
 Actually I kind-of fixed the issue, by this hacky approach:
 ```
   f = new Field("title", title_string, Field.Store.YES,
 Field.Index.ANALYZED);
   f2 = new TextField("title", f.tokenStream())
 ```
 and then using f2 as the field for the doc does not give that exception.
 However, I'm still not sure what went wrong in using the other constructor
 for TextField...

 Thanks

 PS: Sorry about that, didn't realize that while posting :( . Updated the
 message subject now.


 On Mon, Nov 11, 2013 at 10:00 PM, Ian Lea ian@gmail.com wrote:

 Have you set an analyzer when you create your IndexWriter?


 --
 Ian.

 P.S.  Please start new questions in new messages with sensible subjects.


 On Mon, Nov 11, 2013 at 9:00 AM, Rohit Girdhar rohit.ii...@gmail.com
 wrote:
  Hi
 
  I was trying to use the lucene JAVA API to create an index. I am
 repeatedly
  getting NullPointerException when I try to add a document with a
  TextField() field to the IndexWriter. The exception is:
  http://pastebin.com/KFZT4XNV
  I even tried to use the deprecated Field() API, but still the same
  exception.
  Any pointers?
 
  Thanks!





 --
 *rohit*




Re: Search sentence from document based on keyword as input using lucene

2013-10-17 Thread Ian Lea
If you're using Solr you'd be better off asking this on the Solr list:
http://lucene.apache.org/solr/discussion.html.

You might also like to clarify what you want with regard to sentence
vs document.  If you want to display the sentences of a matched doc,
surely you just do it: store what you need and display it however you
like.  If you want to display the sentence, rather than a document, in
which a match occurred, that is a different question.  Sound rather
like highlighting.


--
Ian.


On Thu, Oct 17, 2013 at 8:28 AM, Avni Sompura avni.somp...@gmail.com wrote:
 Hi Team,

 I have one requirement where i have to display sentences of valid document
 if the keyword(input string) is found in that document.

 I am thinking if parent-child relation will work?

 DocBean

 int doc_id
 String doc_path
 String content_id

 ContentBean

 int content_id
 String content;

 Need your suggestion on above problem.
 Can this be done efficiently using SOLR?


 Thanks,
 Avni




Re: Optimizing Filters

2013-10-17 Thread Ian Lea
Yes, I think you should have a play. But on an index that is as
realistic as you can make it - there may be variations in performance
of the different queries and filters depending on term frequencies and
loads of other stuff I don't understand.  General point being simply
that YMMV.
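
If you do find a way to key filters on the request, reuse might look
something like this (sketch - the cache and buildFilter() are
placeholders):

Filter f = cache.get(key);
if (f == null) {
    // caches the matching docs per segment on first use
    f = new CachingWrapperFilter(buildFilter(request));
    cache.put(key, f);
}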


--
Ian.


On Wed, Oct 16, 2013 at 3:07 PM, James Clarke jcla...@basistech.com wrote:
 Filters are created programmatically per request (and customized for the
 request) thus in order to benefit from CachingWrapperFilter we require a
 mechanism for looking up CachingWrapperFilters based on the request. But this 
 is
 certainly an area worth trying (we could probably reuse each filter 10 times,
 because of the variation in requests and NRT search).

 I was hoping to improve query latency by reformulating the filters and
 queries. However my intuition of the best practice for filter and query
 construction is lacking i.e., is it better to use a TermsFilter and
 MatchAllDocsQuery or a BooleanQuery of TermQuerys, or a BooleanQuery of
 ConstantScoreQuerys of TermQuery etc.

 Maybe I should just hunker down and create a synthetic index and try many
 different combinations of filter/query construction.

 On Oct 11, 2013, at 7:33 AM, Ian Lea ian@gmail.com wrote:

 Are you going to be caching and reusing the filters e.g. by
 CachingWrapperFilter?  The main benefit of filters is in reuse.  It
 takes time to build them in the first place, likely roughly equivalent
 to running the underlying query although with variations as you
 describe.  Or are you saying that querying with filters is slow?


 --
 Ian.


 On Thu, Oct 10, 2013 at 7:01 PM, James Clarke jcla...@basistech.com wrote:
 Are there any best practices for constructing Filters to search efficiently?
 From my non-exhaustive experiments I cannot intuit how to construct my filters
 to achieve best performance.

 I have an index (Lucene 4.3) of about 1.8M documents which contain a field
 acting as a flag (evidence:true). Initially all the documents I am interested in
 searching have this field. Later as the index grows some documents will not have
 this field.

 In the simplest case I want to filter on documents with evidence:true. Running a
 couple of hundred queries sequentially and recording how long it takes to
 complete.

 * No filter: ~40s
 * QueryWrapperFilter(TermQuery(evidence:true)): ~80s
 * FieldValueFilter(evidence): ~43s
 * TermsFilter(evidence:true): ~50s

 This suggests QWF is a bad idea.

 A more complex filter is:

  (evidence:true AND (cid:x OR cid:y ...) AND language:eng)

 Where 1.8M documents evidence:true, 2-4 documents per cid clause, 1-60 cid
 clauses, and 1.4M documents lang:eng.

 Our initial implementation uses QWF of a BooleanQuery(TQ AND BQ(OR) AND TQ)
 which takes ~210s.

 Adjusting this to be a BooleanFilter(TermsFilter AND TermsFilter AND
 TermsFilter) sees things slow down to ~239s!

 Any advice on optimizing these filters would be appreciated!

 James





Re: PhraseQuery boost doesn't affect ScoreDoc.score

2013-10-17 Thread Ian Lea
Boosting query clauses means more this clause is more important than
that clause rather than make the score for this search higher.  I
use it for biblio searching when want to search across multiple fields
and want matches in titles to be more important than matches in
blurbs..  Amended version of your program, pasted below, produces this
output

$ java Test3 title | grep -e Query -e First
Query: title:"amber eyes" blurb:"amber eyes": 2 hits, boost=1.0 on title
First: The Hare with Amber Eyes, Boost 1.0, score 0.353553:
Query: title:"amber eyes"^5.0 blurb:"amber eyes": 2 hits, boost=5.0 on title
First: The Hare with Amber Eyes, Boost 5.0, score 0.490290:


$ java Test3 blurb | grep -e Query -e First
Query: title:"amber eyes" blurb:"amber eyes": 2 hits, boost=1.0 on blurb
First: The Hare with Amber Eyes, Boost 1.0, score 0.353553:
Query: title:"amber eyes" blurb:"amber eyes"^5.0: 2 hits, boost=5.0 on blurb
First: Some Book, Boost 5.0, score 0.429004:

The first run boosts matches on title, the second boosts matches on
blurb and this affects the result ordering when boosting is > 1.
Looking at precise scores is generally not very helpful.

My test was with 4.5 rather than 4.4 but I'm sure that's irrelevant.


--
Ian.



public class Test3 {
public static void main(String[] args) throws IOException {
RAMDirectory dir = new RAMDirectory();
Version version = Version.LUCENE_45;
try (
IndexWriter writer = new IndexWriter(dir, new
IndexWriterConfig(version, new StandardAnalyzer(version)))) {
Document doc1 = new Document();
doc1.add(new TextField("title", "Some Book", Field.Store.YES));
doc1.add(new TextField("blurb", "This book is not as good as The Hare with Amber Eyes", Field.Store.YES));
writer.addDocument(doc1);

Document doc2 = new Document();
doc2.add(new TextField("title", "The Hare with Amber Eyes", Field.Store.YES));
doc2.add(new TextField("blurb", "This book is brilliant", Field.Store.YES));
writer.addDocument(doc2);
}
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
search(searcher, 1, args[0]);
search(searcher, 5, args[0]);
}

private static void search(IndexSearcher searcher, float boost,
String ftoboost) throws IOException {
BooleanQuery bq = new BooleanQuery();
PhraseQuery tpq = new PhraseQuery();
tpq.add(new Term("title", "amber"));
tpq.add(new Term("title", "eyes"));
PhraseQuery bpq = new PhraseQuery();
bpq.add(new Term("blurb", "amber"));
bpq.add(new Term("blurb", "eyes"));
if ("title".equals(ftoboost)) {
   tpq.setBoost(boost);
}
if ("blurb".equals(ftoboost)) {
   bpq.setBoost(boost);
}
bq.add(tpq, BooleanClause.Occur.SHOULD);
bq.add(bpq, BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(bq, 10);
System.out.printf("Query: %s: %s hits, boost=%s on %s\n",
 bq, hits.scoreDocs.length, boost, ftoboost);
if (hits != null && hits.scoreDocs.length > 0) {
   ScoreDoc doc = hits.scoreDocs[0];
   int docid = doc.doc;
   System.out.printf("First: %s, Boost %g, score %g:%n%s%n%n",
 searcher.doc(docid).get("title"), boost, doc.score,
 searcher.explain(bq, doc.doc));
}
}
}

On Wed, Oct 16, 2013 at 7:03 AM, denis.zhdanov denzhda...@gmail.com wrote:
 Hello,

 Have a question about default PhraseQuery boost processing. The
 Query.setBoost()
 http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/Query.html#setBoost(float)
 says:
 "Sets the boost for this query clause to b. Documents matching this clause
 will (in addition to the normal weightings) have their score multiplied by b"

 However, that's not true for PhraseQuery. Example:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.TextField;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.PhraseQuery;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.util.Version;

 import java.io.IOException;

 public class Test {
 public static void main(String[] args) throws IOException {
 RAMDirectory dir = new RAMDirectory();
 Version version = Version.LUCENE_44;
 try (IndexWriter writer = new IndexWriter(dir, new
 IndexWriterConfig(version, new StandardAnalyzer(version)))) {
 Document document = new Document();
 document.add(new TextField("data", "1 2 3", Field.Store.YES));
 writer.addDocument(document);
 }

 IndexSearcher searcher = new
 IndexSearcher(DirectoryReader.open(dir));
 search(searcher, 1);
 search(searcher, 5);
 }

 private static void search(IndexSearcher searcher, float boost) 

Re: QueryParser stripping off Hyphen from query

2013-10-15 Thread Ian Lea
If you want to keep hyphens you could try WhitespaceAnalyzer.  But
that may of course have knock on effects on other searches.  Don't
forget to use the same analyzer for indexing and searching, unless
you're doing clever things.

An alternative is to create the queries directly in code, but you'll
still need to match up what you pass with what has been indexed.
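
e.g. building the query as a term query directly, bypassing analysis
(sketch, using your field name; it only matches if "ab-cde" was
indexed as a single token):

Query q = new TermQuery(new Term("CONTENTS", "ab-cde"));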

--
Ian.


On Tue, Oct 15, 2013 at 12:38 AM,  raghavendra.k@barclays.com wrote:
 Could you please suggest which Analyzer to use in this case?

 I haven’t yet explored much with Analyzers. I've always used the 
 StandardAnalyzer.

 Regards,
 Raghu


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Monday, October 14, 2013 6:38 PM
 To: java-user@lucene.apache.org
 Subject: RE: QueryParser stripping off Hyphen from query

 The problem is not query parser, it is your analyzer.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: raghavendra.k@barclays.com
 [mailto:raghavendra.k@barclays.com]
 Sent: Tuesday, October 15, 2013 12:15 AM
 To: java-user@lucene.apache.org
 Subject: QueryParser stripping off Hyphen from query

 Hi,

 I am using the regular QueryParser to form a PhraseQuery. It works
 fine, but when it consists of a hyphen, it gets removed, hence
 resulting in unexpected results.

 Note: I am NOT using the QueryParser.escape() method before parse()
 method as it results in a BooleanQuery, while I want a PhraseQuery.

 Please suggest how to retain the hyphen (-) in my query.

 *** Code **
 Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
 QueryParser parser = new QueryParser(Version.LUCENE_43, "CONTENTS",
 analyzer);
 Query query = parser.parse(strSearch);
 logger.info("Type of query: " + query.getClass().getSimpleName());
 logger.info("query.toString: " + query.toString());

 *** Log output ***
 Contents of strSearch: "ab-cde"
 Type of query: PhraseQuery
 query.toString: CONTENTS:"ab cde"

 Regards,
 Raghu





Re: Calculating min, max and sum of a field in docs returned by search [SEC=UNOFFICIAL]

2013-10-14 Thread Ian Lea
I'd start with the simple approach of a stored field and only worry
about performance if you needed to.  Field caching would likely help
if you did need to.
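
The simple version is just a loop over the hits (sketch, assuming the
IntField is also stored and called "value"):

int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
long sum = 0;
for (ScoreDoc sd : topDocs.scoreDocs) {
    int v = searcher.doc(sd.doc).getField("value").numericValue().intValue();
    min = Math.min(min, v);
    max = Math.max(max, v);
    sum += v;
}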


--
Ian.


On Mon, Oct 14, 2013 at 2:04 AM, Stephen GRAY stephen.g...@immi.gov.au wrote:
 UNOFFICIAL
 Hi everyone,

 I'd appreciate some help with a problem I'm having. I have a collection of 
 documents in my index. Each doc contains an IntField with a value in it. What 
 I want is to find out the minimum, maximum and sum of this field for all 
 documents returned by a search.

 I was thinking that I could do this by adding my documents to a facet, then 
 using AssociationIntSumFacetRequest to calculate the sum over the facet 
 returned, but I can't figure out how to use AssociationIntSumFacetRequest - 
 there is nowhere to pass on the name of the field I want to sum over for 
 example. As for getting the minimum and maximum, presumably you'd have to 
 write your own FacetRequest, which is hard.

 I could store the field, then iterate over all the docs returned, getting 
 each doc and calculating min, max and sum, but in my index there are likely 
 to be upwards of 10 million documents after a while so this might be rather 
 slow.

 Any help would be greatly appreciated.

 Thanks,
 Steve

 UNOFFICIAL


 



Re: wildcard search not working on file paths

2013-10-14 Thread Ian Lea
Do some googling on leading wildcards and read things like
http://www.gossamer-threads.com/lists/lucene/java-user/175732 and pick
an option you like.


--
Ian.


On Mon, Oct 14, 2013 at 9:12 AM, nischal reddy
nischal.srini...@gmail.com wrote:
 Hi,

 I have problem with doing wild card search on file path fields.

 i have a field filePath where i store complete path of files.

 i have used StringField to store the field (i assume by default
 StringField will not be tokenized) .

 doc.add(new StringField(FIELD_FILE_PATH,resourcePath, Store.YES));

 I am using StandardAnalyzer for IndexWriter

 but since i am using a StringField the fields are not analyzed.

 After the files are indexed i checked it with Luke the path seems fine. And
 when i do wildcard searches with luke i am getting desired results.

 But when i do the same search in my code with IndexSearcher i am getting
 zero docs

 My searching code looks something like this

 indexSearcher.search(new WildcardQuery(new
 Term("filePath", "*SuperClass.cls")), 100);

 this is returning zero documents.

 But when i just use * in query it is returning all the documents

 indexSearcher.search(new WildcardQuery(new Term("filePath", "*")), 100);

 only when i use some queries like prefix wildcard etc it is not working

 What is possibly going wrong.

 Thanks,
 Nischal Y




Re: wildcard search not working on file paths

2013-10-14 Thread Ian Lea
Seems to me that it should work.  I suggest you show us a complete
self-contained example program that demonstrates the problem.


--
Ian.


On Mon, Oct 14, 2013 at 12:42 PM, nischal reddy
nischal.srini...@gmail.com wrote:
 Hi Ian,

 Actually im able to do wildcard searches on all the fields except the
 filePath field. I am able to do both the leading and trailing wildcard
 searches on all the fields,
 but when i do the wildcard search on filepath field it is somehow not
 working, an eg file path would look some thing like this \Samples\F1.cls
 i think because of \ present in the field it is failing. when i do a
 wildcard search with the query filePath : * it is indeed returning all
 the docs in the index. But when i do any other wildcard searches(leading or
 trailing) it is not working, any clues why it is working in other fields
 and not working on filePath field.

 TIA,
 Nischal Y


 On Mon, Oct 14, 2013 at 4:55 PM, Ian Lea ian@gmail.com wrote:

 Do some googling on leading wildcards and read things like
 http://www.gossamer-threads.com/lists/lucene/java-user/175732 and pick
 an option you like.


 --
 Ian.


 On Mon, Oct 14, 2013 at 9:12 AM, nischal reddy
 nischal.srini...@gmail.com wrote:
  Hi,
 
  I have problem with doing wild card search on file path fields.
 
  i have a field filePath where i store complete path of files.
 
  i have used StringField to store the field (i assume by default
  StringField will not be tokenized) .
 
  doc.add(new StringField(FIELD_FILE_PATH,resourcePath, Store.YES));
 
  I am using StandardAnalyzer for IndexWriter
 
  but since i am using a StringField the fields are not analyzed.
 
  After the files are indexed i checked it with Luke the path seems fine.
 And
  when i do wildcard searches with luke i am getting desired results.
 
  But when i do the same search in my code with IndexSearcher i am getting
  zero docs
 
  My searching code looks something like this
 
  indexSearcher.search(new WildcardQuery(new
  Term("filePath", "*SuperClass.cls")), 100);
 
  this is returning zero documents.
 
  But when i just use * in query it is returning all the documents
 
  indexSearcher.search(new WildcardQuery(new Term("filePath", "*")), 100);
 
  only when i use some queries like prefix wildcard etc it is not working
 
  What is possibly going wrong.
 
  Thanks,
  Nischal Y




Re: wildcard search not working on file paths

2013-10-14 Thread Ian Lea
You seem to be indexing paths delimited by backslash then saying a
search for Samples/* doesn't match anything.  No surprises there, if
I've read your code correctly.  Since you are creating wildcard
queries directly from Terms I don't think that lucene escaping is
relevant here,  But the presence of all the backslashes in paths and
java code doesn't help.  I'd convert them all to standard unix /a/b/c
format, for searching anyway: you can always store the original if you
want to use that in results.
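
i.e. something like this at index time (sketch; field names made up):

String searchablePath = originalPath.replace('\\', '/');
doc.add(new StringField("filePathSearch", searchablePath, Store.NO));
doc.add(new StoredField("filePathDisplay", originalPath));  // original kept for display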

One further small tip: your sample program is good, with no external
dependencies, but would be even better if you used RAMDirectory.  That
way I could run it on my non-Windows system if I wanted to, with the
addition of some imports.


--
Ian.


On Mon, Oct 14, 2013 at 7:55 PM, nischal reddy
nischal.srini...@gmail.com wrote:
 Hi Ian,

 Please find a sample program below which better illustrates the scenario


 public class TestWriter {
 public static void main(String[] args) throws IOException {
 createIndex();
 searchIndex();
 }

 public static void createIndex() throws IOException {
 Directory directory = FSDirectory.open(new File("C:\\temp"));

 IndexWriterConfig iwriter = new IndexWriterConfig(
 Version.LUCENE_44, new
 StandardAnalyzer(Version.LUCENE_44));

 IndexWriter iWriter = new IndexWriter(directory, iwriter);

 Document document1 = new Document();

 document1.add(new StringField("FILE_PATH",
 "\\Samples\\Batching\\runner.p", Store.YES));
 document1.add(new StringField("contents", "runnerfile",
 Store.YES));

 iWriter.addDocument(document1);

 Document document2 = new Document();

 document2.add(new StringField("FILE_PATH",
 "\\Samples\\Business\\stopper.p", Store.YES));
 document2
 .add(new StringField("contents", "stopperfile",
 Store.YES));

 iWriter.addDocument(document2);
 iWriter.commit();
 iWriter.close();


 }

 public static void searchIndex() throws IOException {

 Directory directory = FSDirectory.open(new File("C:\\temp"));
 IndexReader indexReader = DirectoryReader.open(directory);
 IndexSearcher indexSearcher = new IndexSearcher(indexReader);

 // Create a wildcard query to get all file paths
 // This query works fine and returns all the docs in index
 Query query1 = new WildcardQuery(new Term("FILE_PATH", "*"));
 TopDocs topDocs = indexSearcher.search(query1, 100);
 System.out.println("total no of docs " + topDocs.totalHits);

 // Create a wildcard query to search for paths starting with
 // "/Samples"
 // This query doesnt work and returns zero docs
 // doesnt work with "*Samples//*" either
 // but works with "*Samples*"
 Query query2 = new WildcardQuery(new Term("FILE_PATH",
 "*Samples/*"));
 TopDocs topDocs2 = indexSearcher.search(query2, 100);
 System.out.println("total no of docs " + topDocs2.totalHits);

 // Create a wildcard query to search for paths ending with runner.p
 // This query works and returns 1 doc
 Query query3 = new WildcardQuery(new Term("FILE_PATH",
 "*runner.p"));
 TopDocs topDocs3 = indexSearcher.search(query3, 100);
 System.out.println("total no of docs " + topDocs3.totalHits);

 // Queries to search in contents field

 // Create a wildcard query to search for contents starting with
 // "runner"
 // This query works and returns one doc
 Query query4 = new WildcardQuery(new Term("contents", "runner*"));
 TopDocs topDocs4 = indexSearcher.search(query4, 100);
 System.out.println("total no of docs " + topDocs4.totalHits);

 // Create a wildcard query to search for contents ending with file
 // This query works and returns two docs
 Query query5 = new WildcardQuery(new Term("contents", "*file"));
 TopDocs topDocs5 = indexSearcher.search(query5, 100);
 System.out.println("total no of docs " + topDocs5.totalHits);

 }

 }


 I observed that the file path separator that I am using in the field and
 the lucene escape character seem to be the same, so whenever I use an escape
 character in the query the search fails; if I don't use the escape
 sequence it returns the results properly.

 Though I am escaping \ by giving two \\, the query is still failing.

 One way to solve this problem is to replace all \ with / while
 indexing, and subsequently use / as the file path separator while searching.

 But I would prefer not to meddle with the filepath. So is there any
 alternative that solves this problem without replacing the file path?

 TIA,
 Nischal Y



 On Mon, Oct 14, 2013 at 10:31 PM, Ian Lea ian@gmail.com wrote:

 Seems to me that it should work.  I suggest you show us a complete
 self-contained example program

Re: Multiple Keywords - Regular and Any Order Search

2013-10-11 Thread Ian Lea
Looks like you can achieve most of what you want by using AND rather
than OR.  I think that all the should/should-not examples you give
will work if you use AND on your content field.

For ordering, I suggest you look at SpanNearQuery.  That can consider
order and slop, the distance between the search terms.
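For example, a sketch of an unordered exact-adjacency match (the field
name and the slop of 0 are assumptions; raise the slop to allow gaps
between the terms):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "raining")),
    new SpanTermQuery(new Term("contents", "heavily")),
    new SpanTermQuery(new Term("contents", "today"))
};
// slop=0, inOrder=false: the terms must sit next to each other but may
// appear in any order, so "raining today heavily" should also match.
SpanNearQuery q = new SpanNearQuery(clauses, 0, false);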

You may also want to consider separate fields if you care whether
"raining beautiful abc" should match or not.  You could use
MultiFieldQueryParser or build up a BooleanQuery in code, or build a
complicated string to pass to the standard query parser.  There are
other query parsers as well that might work for you e.g.
org.apache.lucene.queryparser.surround.parser.QueryParser


--
Ian.


On Thu, Oct 10, 2013 at 4:54 PM,  raghavendra.k@barclays.com wrote:
 Hi,

 I have implemented Lucene to search for a single keyword across multiple 
 fields and it works great. I did this by concatenating all the fields into a 
 contents field and searching against this field.

 When I give multiple keywords against this setup, Lucene by default does an 
 OR search, leading to loads of duplicates. This, I understand is an expected 
 behaviour.


 1.   Hence the first thing that I am trying to achieve is search 
 functionality for multiple keywords. The most popular suggestion is to 
 implement PhraseQuery. I will try this out, but please let me know if you can 
 provide an example or any suggestions.



 2.   Once the multiple keywords search is implemented, I need to provide 
 another option to the users. They should be able to check a checkbox "Search 
 in any order". If checked, and the same keywords of the phrase are present in 
 a particular field BUT in a different order, that should still be a match. I 
 don't know how to implement this without forming all permutations of the 
 phrase and then performing an AND search. This could be very expensive in 
 terms of performance. Please let me know if Lucene provides a way to do this.



 Examples for Item 2:



 Field1: RAINING HEAVILY TODAY  Field2: BEAUTIFUL MORNING  Field3: 
 ABC CORPORATION LIMITED



 Search1: RAINING HEAVILY TODAY - Should Match

 Search2: RAINING TODAY HEAVILY - Should Match

 Search3: RAIN TODAY HEAVILY - Should NOT Match

 Search4: ABC CORPORATION LIMITED - Should Match

 Search5: ABC CORP LIMITED - Should NOT Match

 Search6: ABC LIMITED CORPORATION - Should Match



 I am also not sure if the contents field approach will work in this case. 
 Do I need to index the fields separately using MultiFieldQueryParser to 
 achieve this?


 Sorry for the lengthy question. I would greatly appreciate any suggestions or 
 inputs.

 Regards,
 Raghu






Re: Optimizing Filters

2013-10-11 Thread Ian Lea
Are you going to be caching and reusing the filters e.g. by
CachingWrapperFilter?  The main benefit of filters is in reuse.  It
takes time to build them in the first place, likely roughly equivalent
to running the underlying query although with variations as you
describe.  Or are you saying that querying with filters is slow?
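As a sketch, reuse might look like this (the field and query are taken
from the mail below; keep the Filter instance around between searches,
since the cache lives in the wrapper object):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

// Build once; the wrapper caches the computed DocIdSet per segment,
// so only the first search against each segment pays the build cost.
Filter evidence = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("evidence", "true"))));

// Then on every search:
// TopDocs td = searcher.search(query, evidence, 100);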


--
Ian.


On Thu, Oct 10, 2013 at 7:01 PM, James Clarke jcla...@basistech.com wrote:
 Are there any best practices for constructing Filters to search efficiently?
 From my non-exhaustive experiments I cannot intuit how to construct my filters
 to achieve best performance.

 I have an index (Lucene 4.3) of about 1.8M documents which contain a field
 acting as a flag (evidence:true). Initially all the documents I am interested
 in searching have this field. Later as the index grows some documents will
 not have this field.

 In the simplest case I want to filter on documents with evidence:true.
 Running a couple of hundred queries sequentially and recording how long it
 takes to complete.

  * No filter: ~40s
  * QueryWrapperFilter(TermQuery(evidence:true)): ~80s
  * FieldValueFilter(evidence): ~43s
  * TermsFilter(evidence:true): ~50s

 This suggests QWF is a bad idea.

 A more complex filter is:

   (evidence:true AND (cid:x OR cid:y ...) AND language:eng)

 Where 1.8M documents evidence:true, 2-4 documents per cid clause, 1-60 cid
 clauses, and 1.4M documents lang:eng.

 Our initial implementation uses QWF of a BooleanQuery(TQ AND BQ(OR) AND TQ)
 which takes ~210s.

 Adjusting this to be a BooleanFilter(TermsFilter AND TermsFilter AND
 TermsFilter) sees things slow down to ~239s!

 Any advice on optimizing these filters would be appreciated!

 James





Re: Performance/scoring impacts with multiple occurrences of a field

2013-10-11 Thread Ian Lea
With multiple fields of the same name vs a single field I doubt you'd
be able to tell the difference in performance or matching or scoring
in normal use.  There may be some matching/ranking effect if you are
looking at, say, span queries across the multiple fields.

Try it out and see what happens.


--
Ian.


On Tue, Oct 8, 2013 at 3:03 AM, Earl Hood e...@earlhood.com wrote:
 Using Lucene 3.

 I know Lucene supports multiple occurrences of a field, and if one
 searches on that field, all fields are checked for hits.  One question I
 have is whether there is a performance difference when all the data I
 want to index is represented by a single field vs multiple fields of the
 same name.

 The other question is whether the scoring of results differs between the use
 of a single field vs multiple fields of the same name.

 For results ranking, I am guessing there is an effect based on
 https://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_search_over_multiple_fields.3F
 and
 https://wiki.apache.org/lucene-java/LuceneFAQ#Does_the_position_of_the_matches_in_the_text_affect_the_scoring.3F
 But I am not sure if this only applicable for cases of different fields
 names vs fields of the same name.

 --ewh




Re: queries with && doesn't work but AND does

2013-10-10 Thread Ian Lea
Looks like you've got some XML processing in there somewhere.  Nothing
to do with lucene.  This code:

public static void main(String[] _args) throws Exception {
    QueryParser qp = new QueryParser(Version.LUCENE_44,
            "x",
            new StandardAnalyzer(Version.LUCENE_44));
    for (String s : _args) {
        System.out.printf("%s: %s\n", s, qp.parse(s));
    }
}

produces this output:

hello && goodbye: +x:hello +x:goodbye
hello AND goodbye: +x:hello +x:goodbye


--
Ian.


On Thu, Oct 10, 2013 at 11:32 AM, Devi pulaparti pvkd...@gmail.com wrote:
 toString output by queryparser.parse() for the query "TEST && USAGE" is
 content:TEST content:"amp amp" content:USAGE
 and for the query "TEST AND USAGE" it is +content:TEST +content:USAGE.
 Any idea why the analyzer is treating && as content?




 On Thu, Oct 10, 2013 at 2:50 PM, Alan Burlison alan.burli...@gmail.comwrote:

 On 10/10/2013 09:27, Devi pulaparti wrote:

  In our search application, queries like "test && usage" do not return
 correct results but "test AND usage" works fine.  So queries with &&
 don't work but AND does. We are using the default queryparser with the
 standard analyzer. Could someone please help me resolve this. Please let
 me know if you need more details of the implementation.


 Most likely cause is that the analyzer is discarding non-alphanumeric
 tokens. Use toString on the query returned by queryparser.parse() to see
 what's in there.

 --
 Alan Burlison
 --





Re: Multiphrase Query in Lucene 4.3

2013-10-03 Thread Ian Lea
Certainly sounds like a bug in your analyzer.  You could start a new
thread if you need help with that.  But from your previous email it
sounds like you could use WhitespaceTokenizer chained with
LowerCaseFilter.
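A minimal sketch of that chain, assuming Lucene 4.3:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

Analyzer anl = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        // Split on whitespace only; numbers, punctuation and dates stay intact.
        Tokenizer src = new WhitespaceTokenizer(Version.LUCENE_43, reader);
        TokenStream tok = new LowerCaseFilter(Version.LUCENE_43, src);
        return new TokenStreamComponents(src, tok);
    }
};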


--
Ian.


On Thu, Oct 3, 2013 at 7:16 AM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 In my analyzer, the problem actually occurs for words which are preceded by
 punctuation marks.

 For example:
 if I am indexing the content "Andrey Gubarev,JingGoogle,Inc."

 and I search "Andrew Gubarev", it is not working properly since the word
 "Andrew" is preceded by the punctuation ",".


 On Thu, Oct 3, 2013 at 11:23 AM, VIGNESH S vigneshkln...@gmail.com wrote:

 Hi Ian,

  In Lucene is there any default Analyzer we can use which will ignore only
  spaces?
  All other numbers, punctuation, dates, everything it should preserve.
 
  I created my analyzer with a tokenizer which returns
  Character.isDefined(cn) && (!Character.isWhitespace(cn)).
  My analyzer will use a lower case filter on top of the tokenizer. This works
  perfectly in case of 3.6.
  In 4.3 it is creating problems in the offsets of tokens.




 On Mon, Sep 30, 2013 at 8:21 PM, Ian Lea ian@gmail.com wrote:

 Whenever someone says they are using a custom analyzer that has to be
 a suspect.  Does it work if you use one of the core lucene analyzers
 instead?  Have you used Luke to verify that the index holds what you
 think it does?


 --
 Ian.


 On Mon, Sep 30, 2013 at 3:21 PM, VIGNESH S vigneshkln...@gmail.com
 wrote:
  Hi,
 
  It is not a problem with case, because I am using LowercaseFilter.
 
  My analyzer is a custom analyzer which will ignore just white spaces. All
  other numbers, dates and other special characters it will consider. The same
  analyzer works for Lucene 3.6.
 
 
  When I do a single term query for "Geoffrey" it is giving hits. But when
  given as a part of a multiphrase query it is not able to find it. When the
  below code is executed with, say, word = "Geoffrey", it is not finding the
  word itself:
 
  if (TermsEnum.SeekStatus.FOUND == trm.seekCeil(new BytesRef(word))) {
      do {
          String s = trm.term().utf8ToString();
          if (s.equals(word)) {
              termsWithPrefix.add(new Term("content", s));
          } else {
              break;
          }
      } while (trm.next() != null);
  }
 
 
 
  On Mon, Sep 30, 2013 at 3:01 PM, Ian Lea ian@gmail.com wrote:
 
  Whenever someone says something along the lines of a search for
  "geoffrey" not matching "Geoffrey", the case difference springs out.
  Can't recall what if anything you said about the analysis side of
  things but that could be the cause.  See
 
 
 http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2F_incorrect_hits.3F
 
  If on the other hand the problem is more obscure, and only related to
  the multi phrase stuff, I suggest you build a tiny but complete
  RAMDirectory based program or test case that shows the problem and
  post it here.
 
 
  --
  Ian.
 
 
 
  On Mon, Sep 30, 2013 at 6:46 AM, VIGNESH S vigneshkln...@gmail.com
  wrote:
   Hi,
  
    Thanks for your reply. The problem I face is that there is a phrase
    "Geoffrey Romer" in my field.
   
    I am forming a MultiPhraseQuery object properly, like "Geoffrey
    Romer", but when I do a search it is not returning hits. This problem
    I am facing is not for all phrases; it happens for only a few phrases.
   
    When I do a single query like "Geoffrey" it is giving a hit. But when
    I do it in MultiphraseQuery it is not able to find "geoffrey". I
    confirmed this by doing trm.seekCeil(new BytesRef("Geoffrey")) and
    then when I do String s = trm.term().utf8ToString() it is pointing to
    a different word instead of "geoffrey". seekCeil is working properly
    for many phrases though.
   
    What could be the problem? Please kindly suggest.
  
  
  
   On Fri, Sep 27, 2013 at 6:58 PM, Allison, Timothy B. 
 talli...@mitre.org
  wrote:
  
   1) An alternate method to your original question would be to do
  something
   like this (I haven't compiled or tested this!):
  
    Query q = new PrefixQuery(new Term("field", "app"));
   
    q = q.rewrite(indexReader);
    Set<Term> terms = new HashSet<Term>();
    q.extractTerms(terms);
    Term[] arr = terms.toArray(new Term[terms.size()]);
    MultiPhraseQuery mpq = new MultiPhraseQuery();
    mpq.add(new Term("field", "microsoft"));
    mpq.add(arr);
  
  
   2) At a higher level, do you need to generate your query
  programmatically?
Here are three parsers that could handle this:
 a) ComplexPhraseQueryParser
 b) SurroundQueryParser:
 oal.queryparser.surround.parser.QueryParser
  c) experimental: <self_promotion degree="shameless">
    http://issues.apache.org/jira/browse/LUCENE-5205 </self_promotion>
  
  
   -Original Message-
   From: VIGNESH S

Re: Multiphrase Query in Lucene 4.3

2013-10-03 Thread Ian Lea
Then I suggest you start a new thread, posting all relevant details
and preferably a cut down but complete program, with all relevant
code and no irrelevant code, and with simple examples, input and output,
of what does and doesn't work.


--
Ian.


On Thu, Oct 3, 2013 at 12:28 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 Ian,
 Thanks for your reply..
 I am facing the same problem if i use whiteSpaceTokenizer also.
 My analyzer works perfect in case of Lucene 3.6.

 Thanks and Regards
 Vignesh Srinivasan

 On Thu, Oct 3, 2013 at 3:23 PM, Ian Lea ian@gmail.com wrote:

 Certainly sounds like a bug in your analyzer.  You could start a new
 thread if you need help with that.  But from your previous email it
 sounds like you could use WhitespaceTokenizer chained with
 LowerCaseFilter.


 --
 Ian.


 On Thu, Oct 3, 2013 at 7:16 AM, VIGNESH S vigneshkln...@gmail.com wrote:
  Hi,
 
  In my Analyzer,problem actually occurs for words which are preceded by
  punctuation marks..
 
  For Example:
  If I am Indexing content,Andrey Gubarev,JingGoogle,Inc.
 
  If I search Andrew Gubarev ,It is not working properly since word
 Andrew
  is preceded by punctuation ,.
 
 
  On Thu, Oct 3, 2013 at 11:23 AM, VIGNESH S vigneshkln...@gmail.com
 wrote:
 
  Hi Ian,
 
   In Lucene is there any default Analyzer we can use which will ignore
   only spaces? All other numbers, punctuation, dates, everything it
   should preserve.
  
   I created my analyzer with a tokenizer which returns
   Character.isDefined(cn) && (!Character.isWhitespace(cn)).
   My analyzer will use a lower case filter on top of the tokenizer. This
   works perfectly in case of 3.6; in 4.3 it is creating problems in the
   offsets of tokens.
 
 
 
 
  On Mon, Sep 30, 2013 at 8:21 PM, Ian Lea ian@gmail.com wrote:
 
  Whenever someone says they are using a custom analyzer that has to be
  a suspect.  Does it work if you use one of the core lucene analyzers
  instead?  Have you used Luke to verify that the index holds what you
  think it does?
 
 
  --
  Ian.
 
 
  On Mon, Sep 30, 2013 at 3:21 PM, VIGNESH S vigneshkln...@gmail.com
  wrote:
   Hi,
  
    It is not a problem with case, because I am using LowercaseFilter.
   
    My analyzer is a custom analyzer which will ignore just white spaces.
    All other numbers, dates and other special characters it will
    consider. The same analyzer works for Lucene 3.6.
   
   
    When I do a single term query for "Geoffrey" it is giving hits. But
    when given as a part of a multiphrase query it is not able to find it.
    When the below code is executed with, say, word = "Geoffrey", it is
    not finding the word itself:
   
    if (TermsEnum.SeekStatus.FOUND == trm.seekCeil(new BytesRef(word))) {
        do {
            String s = trm.term().utf8ToString();
            if (s.equals(word)) {
                termsWithPrefix.add(new Term("content", s));
            } else {
                break;
            }
        } while (trm.next() != null);
    }
  
  
  
   On Mon, Sep 30, 2013 at 3:01 PM, Ian Lea ian@gmail.com wrote:
  
   Whenever someone says something along the lines of a search for
   "geoffrey" not matching "Geoffrey", the case difference springs out.
   Can't recall what if anything you said about the analysis side of
   things but that could be the cause.  See
  
  
 
 http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2F_incorrect_hits.3F
  
   If on the other hand the problem is more obscure, and only related
 to
   the multi phrase stuff, I suggest you build a tiny but complete
   RAMDirectory based program or test case that shows the problem and
   post it here.
  
  
   --
   Ian.
  
  
  
   On Mon, Sep 30, 2013 at 6:46 AM, VIGNESH S vigneshkln...@gmail.com
 
   wrote:
Hi,
   
     Thanks for your reply. The problem I face is that there is a phrase
     "Geoffrey Romer" in my field.
    
     I am forming a MultiPhraseQuery object properly, like "Geoffrey
     Romer", but when I do a search it is not returning hits. This
     problem I am facing is not for all phrases; it happens for only a
     few phrases.
    
     When I do a single query like "Geoffrey" it is giving a hit. But
     when I do it in MultiphraseQuery it is not able to find "geoffrey".
     I confirmed this by doing trm.seekCeil(new BytesRef("Geoffrey")) and
     then when I do String s = trm.term().utf8ToString() it is pointing
     to a different word instead of "geoffrey". seekCeil is working
     properly for many phrases though.
    
     What could be the problem? Please kindly suggest.
   
   
   
On Fri, Sep 27, 2013 at 6:58 PM, Allison, Timothy B. 
  talli...@mitre.org
   wrote:
   
1) An alternate method to your original question would be to do
   something
like this (I haven't compiled or tested

Re: Handling abrupt shutdown while indexing

2013-10-03 Thread Ian Lea
I'd write a shutdown method that calls close() in a controlled manner
and invoke it at 23:55.  You could also call commit() at whatever
interval makes sense to you but if you carried on killing the JVM
you'd still be liable to lose any docs indexed since the last commit.

This is standard stuff just like any file system or database.  If you
don't commit or close you are likely to lose data.
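A sketch of the controlled shutdown (the writer variable is an
assumption; note that a JVM shutdown hook still won't run on kill -9, so
periodic commits remain worthwhile):

import org.apache.lucene.index.IndexWriter;

static void closeOnShutdown(final IndexWriter writer) {
    // Runs on normal JVM termination and SIGTERM, but not on kill -9.
    Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
            try {
                writer.close(); // commits pending docs, releases write.lock
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}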


--
Ian.

On Thu, Oct 3, 2013 at 1:40 PM, Ramprakash Ramamoorthy
youngestachie...@gmail.com wrote:
 Team,

  We have our app using lucene 4.1. Docs keep getting indexed and we
 close the index by 00:00 hrs every day and open a new one for the next
 calendar day. However, in case of an abrupt shutdown/kill of the JVM, the
 app server crashes and the live indexes end up remaining without a
 segments.gen file, and with a write.lock file. Say the abrupt shutdown
 happens by 23:55; all the documents indexed from 00:01 to 23:54 are lost.
 NRT searchers function well before the shutdown.

  Is there any way I can recover such indexes without a segments.gen?
 Or is committing periodically (say once an hour, or every 1000 docs indexed)
 the only way? Or am I missing something very basic? Thanks in advance.

 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Chennai, India.




Re: Problem with MultiPhrase Query in Lucene 4.3

2013-10-03 Thread Ian Lea
Are you sure it's not failing because "adhoc" != "ad-hoc"?


--
Ian.


On Thu, Oct 3, 2013 at 3:07 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 I am trying to do a MultiPhraseQuery in Lucene 4.3. It is working perfectly
 for all scenarios except the below scenario:
 when I try to search for a phrase which is preceded by any punctuation, it
 is not working.

 TextContent: "Dremel is a scalable, interactive ad-hoc query system for
 analysis
 of read-only nested data. By combining multi-level execution
 trees and columnar data layout, it is capable of running aggregation"

 Search phrase: "interactive adhoc"

 The above search is failing because "interactive adhoc" is preceded by ","
 in the original text.


 I am doing indexing like this. Sample code for indexing; I have used
 the whitespace analyzer.

 Document doc = new Document();

 contents = "Dremel is a scalable, interactive ad-hoc query system for
 analysis
 of read-only nested data. By combining multi-level execution
 trees and columnar data layout, it is capable of running aggregation";

 FieldType offsetsType = new FieldType(TextField.TYPE_STORED);

 Field field = new Field("content", "", offsetsType);

 doc.add(field);
 field.setStringValue(contents);

 mWriter.addDocument(doc);

 In the search I am forming a MultiPhraseQuery object and adding the tokens of
 the search phrase.

 Before adding the tokens, I validated like this:

 LinkedList<Term> termsWithPrefix = new LinkedList<Term>();
 trm.seekCeil(new BytesRef(word));
 do {
     String s = trm.term().utf8ToString();
     if (s.startsWith(word)) {
         termsWithPrefix.add(new Term("content", s));
     } else {
         break;
     }
 } while (trm.next() != null);
 mpquery.add(termsWithPrefix.toArray(new Term[0]));

 It is working for all scenarios except the scenarios where the search
 phrase is preceded by punctuation.

 In the case of text preceded by punctuation, trm.seekCeil(new BytesRef(word))
 is pointing at a different word, which actually causes the problem.

 Please kindly help..


 --
 Thanks and Regards
 Vignesh Srinivasan




Re: Problem with MultiPhrase Query in Lucene 4.3

2013-10-03 Thread Ian Lea
Below is a little self-contained test program.  You may recognise some
of the code.

Here's the output from a couple of runs using lucene 4.4.0.

$ java ian.G1 "Dremel is a scalable, interactive ad-hoc query system" \
    "interactive ad-hoc"
term=interactive
term=ad-hoc
+content:interactive +content:ad-hoc: totalHits=1


$ java ian.G1 "Dremel is a scalable, interactive ad-hoc query system" \
    "interactive adhoc"
term=interactive
+content:interactive: totalHits=1

All looks OK to me.  Maybe you can make it fail, or use it to help fix
your problem.

--
Ian.

package ian;

import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.en.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queries.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.util.*;

public class G1 {

void test(String _contents, String _words) throws Exception {
    String contents = _contents;
    String words = _words;

    RAMDirectory dir = new RAMDirectory();
    Analyzer anl = new WhitespaceAnalyzer(Version.LUCENE_44);
    IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_44,
            anl);
    IndexWriter iw = new IndexWriter(dir, iwcfg);

    FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
    Field field = new Field("content", "", offsetsType);
    Document doc = new Document();
    doc.add(field);
    field.setStringValue(contents);
    iw.addDocument(doc);
    iw.close();

    IndexReader rdr = DirectoryReader.open(dir);
    Fields fields = MultiFields.getFields(rdr);
    Terms terms = fields.terms("content");

    BooleanQuery bq = new BooleanQuery();
    String[] worda = _words.split(" ");
    for (String w : worda) {
        LinkedList<Term> termsWithPrefix = new LinkedList<Term>();
        TermsEnum trm = terms.iterator(null);
        trm.seekCeil(new BytesRef(w));
        do {
            String s = trm.term().utf8ToString();
            if (s.startsWith(w)) {
                termsWithPrefix.add(new Term("content", s));
                System.out.printf("term=%s\n", s);
            }
            else {
                break;
            }
        }
        while (trm.next() != null);

        if (!termsWithPrefix.isEmpty()) {
            MultiPhraseQuery mpquery = new MultiPhraseQuery();
            mpquery.add(termsWithPrefix.toArray(new Term[0]));
            bq.add(mpquery, BooleanClause.Occur.MUST);
        }
    }

    IndexSearcher searcher = new IndexSearcher(rdr);
    TopDocs results = searcher.search(bq, 10);
    System.out.printf("%s: totalHits=%s\n",
            bq, results.totalHits);
}



public static void main(String[] _args) throws Exception {
G1 t = new G1();
t.test(_args[0], _args[1]);
}
}


On Thu, Oct 3, 2013 at 4:10 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 Sorry, that's my typo.

 It's not failing because of that.


 On Thu, Oct 3, 2013 at 8:17 PM, Ian Lea ian@gmail.com wrote:

 Are you sure it's not failing because adhoc != ad-hoc?


 --
 Ian.


 On Thu, Oct 3, 2013 at 3:07 PM, VIGNESH S vigneshkln...@gmail.com wrote:
  Hi,
 
   I am trying to do a MultiPhraseQuery in Lucene 4.3. It is working
   perfectly for all scenarios except the below scenario: when I try to
   search for a phrase which is preceded by any punctuation, it is not
   working.
  
   TextContent: "Dremel is a scalable, interactive ad-hoc query system for
   analysis of read-only nested data. By combining multi-level execution
   trees and columnar data layout, it is capable of running aggregation"
  
   Search phrase: "interactive adhoc"
  
   The above search is failing because "interactive adhoc" is preceded
   by "," in the original text.
 
 
   I am doing indexing like this. Sample code for indexing; I have used
   the whitespace analyzer.
  
   Document doc = new Document();
  
   contents = "Dremel is a scalable, interactive ad-hoc query system for
   analysis of read-only nested data. By combining multi-level execution
   trees and columnar data layout, it is capable of running aggregation";
  
   FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
  
   Field field = new Field("content", "", offsetsType);
  
   doc.add(field);
   field.setStringValue(contents);
  
   mWriter.addDocument(doc);
  
   In the search I am forming a MultiPhraseQuery object and adding the
   tokens of the search phrase.
  
   Before adding the tokens, I validated like this:
  
   LinkedList<Term> termsWithPrefix = new LinkedList<Term>();
   trm.seekCeil(new BytesRef(word));
   do {
       String s = trm.term().utf8ToString();
       if (s.startsWith(word)) {
           termsWithPrefix.add(new Term("content", s));
       } else {
           break;
       }
   } while (trm.next() != null);
   mpquery.add(termsWithPrefix.toArray(new Term[0]));
 
   It is working for all scenarios except the scenarios where the search
   phrase is preceded by punctuation.
  
   In the case of text preceded by punctuation, trm.seekCeil(new
   BytesRef(word)) is pointing at a different word, which actually causes
   the problem.
 
  Please kindly help..
 
 
  --
  Thanks and Regards
  Vignesh Srinivasan


Re: Reindexing problem: Indexing folder size keeps on growing for same remote folder

2013-10-02 Thread Ian Lea
Yes, as I suggested, you could search on your unique id and not index
if already present.  Or, as Uwe suggested, call updateDocument instead
of add, again using the unique id.
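For the updateDocument route, a sketch (the field name "id" is an
assumption; it must be indexed as a single un-analyzed token, e.g. a
StringField):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Atomically deletes any document(s) whose "id" term matches, then adds
// the new document; behaves like a plain add if no match exists.
void upsert(IndexWriter writer, String id, Document doc) throws IOException {
    writer.updateDocument(new Term("id", id), doc);
}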


--
Ian.


On Tue, Oct 1, 2013 at 6:41 PM, gudiseashok gudise.as...@gmail.com wrote:
 I am really sorry if something confused you. As I said, I am indexing a
 folder which contains mylogs.log, mylogs1.log, mylogs2.log etc; I am not
 indexing them as flat files.
 I have tokenized each line of text with a regex and am storing them as
 fields like messageType, timeStamp, message.

 So I don't care which file among those 4 files has this particular
 content; I just want to insert only new records.
 My job routine will update these log files every 30 minutes, storing
 each row as a document. So when I read the files after 30 minutes for
 indexing, the mylogs1.log content will be the previous version of the
 mylog.log content.
 So if I want to avoid writing the same record (from another file among
 those 4) again, could you please suggest what I need to do while calling
 add or updateDocument?

 Do I need to run a search before inserting any row, or do I have a better
 way to eliminate the duplicate writes?

 I really appreciate your time reading this, and thanks for responding.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Rendexing-problem-Indexing-folder-size-is-keep-on-growing-for-same-remote-folder-tp4092835p4092990.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi server

2013-10-01 Thread Ian Lea
I'm not aware of a lucene rather than Solr or whatever tutorial.  A
search for something like lucene sharding will get hits.

Why don't you want to use Solr or Katta or similar?  They've already
done much of the hard work.

How much data are you talking about?

What are your master-master requirements?  I don't think that even
Solr provides multi-master capability.


--
Ian.


On Mon, Sep 30, 2013 at 8:08 PM, Neda Grbic neda.gr...@mangora.org wrote:
 Hi all

 I'm hoping to use Lucene in my project, but I have two master-master
 servers. Is there any good tutorial how to make Lucene scalable (without
 Solr and similar web applications).

 Thanks




Re: Reindexing problem: Indexing folder size keeps on growing for same remote folder

2013-10-01 Thread Ian Lea
milliseconds as unique keys are a bad idea unless you are 100% certain
you'll never be creating 2 docs in the same millisecond.  And are you
saying the log record A1 from file a.log indexed at 14:00 will have
the same unique id as the same record from the same file indexed at
14:30 or will it be different?

If the same, you can use updateDocument as Uwe suggested.

If different, and you want to replace all the docs already indexed
from file a.log with the current contents of a.log, I suggest you
store the file name as an indexed field for each record from each file
and, when you reindex a file, start by calling
IndexWriter.deleteDocuments(Term t) where t is a Term that references
the file name.
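A sketch of that delete-then-reindex step (the field name "filename" is
an assumption):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Before reindexing a.log, drop every record previously indexed from it;
// each record's document must carry an un-analyzed "filename" field.
writer.deleteDocuments(new Term("filename", "a.log"));
// ... then addDocument() one document per record of the current a.log.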

--
Ian.


On Tue, Oct 1, 2013 at 2:20 PM, gudiseashok gudise.as...@gmail.com wrote:
 I am afraid my document in the above code already has a unique key (with
 milliseconds; I hope this is enough to differentiate it from other records).

 My requirement is simple, I have a folder with a.log,b.log and c.log files
 which will be updated every 30 minutes, I want to update the index of these
 files and re-indexing them. I am trying to explore the lucene-indexing but
 some how I am not able to get much help other than demo java files.

 Kindly suggest.


 Regards
 Ashok Gudise.






Re: Reindexing problem: Indexing folder size keeps on growing for same remote folder

2013-10-01 Thread Ian Lea
I'm still a bit confused about exactly what you're indexing, when, but
if you have a unique id and don't want to add or update a doc that's
already present, add the unique id to the index and search (TermQuery
probably) for each one and skip if already present.
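A sketch of the skip-if-present check (the field name "uid" is an
assumption; the searcher must be refreshed after commits or it won't see
recently added records):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Adds doc only if no document with this uid is already in the index.
void addIfAbsent(IndexSearcher searcher, IndexWriter writer,
                 String uid, Document doc) throws IOException {
    boolean present =
        searcher.search(new TermQuery(new Term("uid", uid)), 1).totalHits > 0;
    if (!present) {
        writer.addDocument(doc);
    }
}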

Can't you change the log rotation/copying/indexing so that you only
index new data?

To start a fresh index, use IndexWriterConfig.OpenMode.CREATE.
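For example (the version and analyzer here are assumptions):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44, analyzer);
cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // discards any existing index
IndexWriter writer = new IndexWriter(dir, cfg);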


--
Ian.


On Tue, Oct 1, 2013 at 4:51 PM, gudiseashok gudise.as...@gmail.com wrote:
 Hi

 Basically my log folder consists of four log files,
 abc.log, abc1.log, abc2.log, abc3.log, as my log appender writes them. Every
 30 minutes the content of all these files will be changed; for example, after
 30 minutes the content of abc1.log will be replaced with the existing abc.log
 content and abc.log will have new content (Timestamp is DD-MM- MM-ss:S).
 Since I am going through reindexing every 30 minutes, I don't want to
 reindex the same record which was already present with the same timestamp.

 Also I want to do a clean-up every week (clean-up in the sense that I want
 to delete all indexes and do a fresh indexing of these 4 files); how do I
 do this efficiently?

 I really appreciate your time reading this, and kindly suggest a better way.


 Regards
 Ashok Gudise






Re: Multiphrase Query in Lucene 4.3

2013-09-30 Thread Ian Lea
Whenever someone says something along the lines of a search for
"geoffrey" not matching "Geoffrey", the case difference springs out.
Can't recall what if anything you said about the analysis side of
things but that could be the cause.  See
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2F_incorrect_hits.3F

If on the other hand the problem is more obscure, and only related to
the multi phrase stuff, I suggest you build a tiny but complete
RAMDirectory based program or test case that shows the problem and
post it here.


--
Ian.



On Mon, Sep 30, 2013 at 6:46 AM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 Thanks for your reply. The problem I face is that there is a phrase "Geoffrey
 Romer" in my field.

 I am forming a MultiPhraseQuery object properly, like "Geoffrey Romer", but
 when I do a search it is not returning hits. This problem I am facing is not
 for all phrases; it happens for only a few phrases.

 When I do a single query like "Geoffrey" it is giving a hit. But when I do it
 in MultiphraseQuery it is not able to find "geoffrey". I confirmed this by
 doing trm.seekCeil(new BytesRef("Geoffrey")) and then when I
 do String s = trm.term().utf8ToString() it is pointing to a different word
 instead of "geoffrey". seekCeil is working properly for many phrases though.

 What could be the problem? Please kindly suggest.



 On Fri, Sep 27, 2013 at 6:58 PM, Allison, Timothy B. 
 talli...@mitre.orgwrote:

 1) An alternate method to your original question would be to do something
 like this (I haven't compiled or tested this!):

 Query q = new PrefixQuery(new Term("field", "app"));

 q = q.rewrite(indexReader);
 Set<Term> terms = new HashSet<Term>();
 q.extractTerms(terms);
 Term[] arr = terms.toArray(new Term[terms.size()]);
 MultiPhraseQuery mpq = new MultiPhraseQuery();
 mpq.add(new Term("field", "microsoft"));
 mpq.add(arr);


 2) At a higher level, do you need to generate your query programmatically?
  Here are three parsers that could handle this:
   a) ComplexPhraseQueryParser
   b) SurroundQueryParser: oal.queryparser.surround.parser.QueryParser
   c) experimental: <self_promotion degree="shameless">
 http://issues.apache.org/jira/browse/LUCENE-5205 </self_promotion>


 -Original Message-
 From: VIGNESH S [mailto:vigneshkln...@gmail.com]
 Sent: Friday, September 27, 2013 3:33 AM
 To: java-user@lucene.apache.org
 Subject: Re: Multiphrase Query in Lucene 4.3

 Hi,

 The word I am giving is "Romer Geoffrey". The word is in the field.

 I do trm.seekCeil(new BytesRef("Geoffrey")) and then String s =
 trm.term().utf8ToString();

 It is giving a different word. I think this is why my MultiPhraseQuery is
 not giving the desired results.

 What may be the reason..




 On Fri, Sep 27, 2013 at 11:49 AM, VIGNESH S vigneshkln...@gmail.com
 wrote:

  Hi Ian,
 
  Thanks for your reply.
 
  I am doing something similar to this. In the MultiPhraseQuery object the
  actual phrase is going in properly, but it is not returning any hits.
 
  In Lucene 3.6 I implemented the same logic and it works.
 
  In Lucene 4.3 I created the index for that using
 
   FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
   offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
 
  For MultiPhraseQuery, do I need to add any other parameter in
  addition to this while indexing?
 
  Is there any MultiPhraseQuery test java file for Lucene 4.3? I checked in
  the Lucene branch and was not able to find one. Please kindly help.
 
 
 
 
 
 
  On Thu, Sep 26, 2013 at 2:55 PM, Ian Lea ian@gmail.com wrote:
 
  I use the code below to do something like this.  Not exactly what you
  want but should be easy to adapt.
 
 
  public List<String> findTerms(IndexReader _reader,
      String _field) throws IOException {
    List<String> l = new ArrayList<String>();
    Fields ff = MultiFields.getFields(_reader);
    Terms trms = ff.terms(_field);
    TermsEnum te = trms.iterator(null);
    BytesRef br;
    while ((br = te.next()) != null) {
      l.add(br.utf8ToString());
    }
    return l;
  }
 
  --
  Ian.
 
  On Wed, Sep 25, 2013 at 3:04 PM, VIGNESH S vigneshkln...@gmail.com
  wrote:
   Hi,
  
   In the example of MultiPhraseQuery it is mentioned:
  
   "To use this class, to search for the phrase 'Microsoft app*' first use
   add(Term) on the term 'Microsoft', then find all terms that have 'app'
   as prefix using IndexReader.terms(Term), and use
   MultiPhraseQuery.add(Term[] terms) to add them to the query"
  
  
   How can I replicate the same in Lucene 4.3, since
   IndexReader.terms(Term) is no longer available?
  
   --
   Thanks and Regards
   Vignesh Srinivasan
 
 
 
 
 
  --
  Thanks and Regards
  Vignesh Srinivasan
  9739135640
 



 --
 Thanks and Regards
 Vignesh

Re: Multiphrase Query in Lucene 4.3

2013-09-30 Thread Ian Lea
Whenever someone says they are using a custom analyzer that has to be
a suspect.  Does it work if you use one of the core lucene analyzers
instead?  Have you used Luke to verify that the index holds what you
think it does?


--
Ian.


On Mon, Sep 30, 2013 at 3:21 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 It is not a problem with case, because I am using LowercaseFilter.

 My analyzer is a custom analyzer which will ignore just white spaces. All
 other numbers, dates and other special characters it will consider. The same
 analyzer works for Lucene 3.6.


 When I do a single term query for "Geoffrey" it is giving hits. But when
 given as a part of a multiphrase query it is not able to find it. When the
 below code is executed with, say, word = "Geoffrey", it is not finding the
 word itself:

 if (TermsEnum.SeekStatus.FOUND == trm.seekCeil(new BytesRef(word))) {
     do {
         String s = trm.term().utf8ToString();
         if (s.equals(word)) {
             termsWithPrefix.add(new Term("content", s));
         } else {
             break;
         }
     } while (trm.next() != null);
 }



 On Mon, Sep 30, 2013 at 3:01 PM, Ian Lea ian@gmail.com wrote:

 Whenever someone says something along the lines of a search for
 "geoffrey" not matching "Geoffrey", the case difference springs out.
 Can't recall what if anything you said about the analysis side of
 things but that could be the cause.  See

 http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2F_incorrect_hits.3F

 If on the other hand the problem is more obscure, and only related to
 the multi phrase stuff, I suggest you build a tiny but complete
 RAMDirectory based program or test case that shows the problem and
 post it here.


 --
 Ian.



 On Mon, Sep 30, 2013 at 6:46 AM, VIGNESH S vigneshkln...@gmail.com
 wrote:
  Hi,
 
   Thanks for your reply. The problem I face is that there is a phrase
   "Geoffrey Romer" in my field.
  
   I am forming a MultiPhraseQuery object properly, like "Geoffrey Romer",
   but when I do a search it is not returning hits. This problem I am
   facing is not for all phrases; it happens for only a few phrases.
  
   When I do a single query like "Geoffrey" it is giving a hit. But when
   I do it in MultiphraseQuery it is not able to find "geoffrey". I
   confirmed this by doing trm.seekCeil(new BytesRef("Geoffrey")) and then
   when I do String s = trm.term().utf8ToString() it is pointing to a
   different word instead of "geoffrey". seekCeil is working properly for
   many phrases though.
  
   What could be the problem? Please kindly suggest.
 
 
 
  On Fri, Sep 27, 2013 at 6:58 PM, Allison, Timothy B. talli...@mitre.org
 wrote:
 
   1) An alternate method to your original question would be to do
   something like this (I haven't compiled or tested this!):
  
   Query q = new PrefixQuery(new Term("field", "app"));
  
   q = q.rewrite(indexReader);
   Set<Term> terms = new HashSet<Term>();
   q.extractTerms(terms);
   Term[] arr = terms.toArray(new Term[terms.size()]);
   MultiPhraseQuery mpq = new MultiPhraseQuery();
   mpq.add(new Term("field", "microsoft"));
   mpq.add(arr);
 
 
  2) At a higher level, do you need to generate your query
 programmatically?
   Here are three parsers that could handle this:
a) ComplexPhraseQueryParser
b) SurroundQueryParser: oal.queryparser.surround.parser.QueryParser
     c) experimental: <self_promotion degree="shameless">
   http://issues.apache.org/jira/browse/LUCENE-5205 </self_promotion>
 
 
  -Original Message-
  From: VIGNESH S [mailto:vigneshkln...@gmail.com]
  Sent: Friday, September 27, 2013 3:33 AM
  To: java-user@lucene.apache.org
  Subject: Re: Multiphrase Query in Lucene 4.3
 
  Hi,
 
   The word I am giving is "Romer Geoffrey". The word is in the field.
  
   I do trm.seekCeil(new BytesRef("Geoffrey")) and then String s =
   trm.term().utf8ToString();
  
   It is giving a different word. I think this is why my MultiPhraseQuery
   is not giving the desired results.
 
  What may be the reason..
 
 
 
 
  On Fri, Sep 27, 2013 at 11:49 AM, VIGNESH S vigneshkln...@gmail.com
  wrote:
 
   Hi Ian,
  
   Thanks for your reply.
  
   I am doing something similar to this. In the MultiPhraseQuery object
   the actual phrase is going in properly, but it is not returning any
   hits.
  
   In Lucene 3.6 I implemented the same logic and it works.
  
   In Lucene 4.3 I created the index for that using
  
    FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
    offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
  
   For MultiPhraseQuery, do I need to add any other parameter in
   addition to this while indexing?
  
   Is there any MultiPhraseQuery test java file for Lucene 4.3? I checked
   in the Lucene branch and was not able to find one. Please

Re: Multiphrase Query in Lucene 4.3

2013-09-26 Thread Ian Lea
I use the code below to do something like this.  Not exactly what you
want but should be easy to adapt.


public List<String> findTerms(IndexReader _reader,
  String _field) throws IOException {
  List<String> l = new ArrayList<String>();
  Fields ff = MultiFields.getFields(_reader);
  Terms trms = ff.terms(_field);
  TermsEnum te = trms.iterator(null);
  BytesRef br;
  while ((br = te.next()) != null) {
    l.add(br.utf8ToString());
  }
  return l;
}

--
Ian.

On Wed, Sep 25, 2013 at 3:04 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 Hi,

 In the example of MultiPhraseQuery it is mentioned:

 "To use this class, to search for the phrase 'Microsoft app*' first use
 add(Term) on the term 'Microsoft', then find all terms that have 'app' as
 prefix using IndexReader.terms(Term), and use MultiPhraseQuery.add(Term[]
 terms) to add them to the query"


 How can I replicate the same in Lucene 4.3, since IndexReader.terms(Term) is
 no longer available?

 --
 Thanks and Regards
 Vignesh Srinivasan




Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-09-26 Thread Ian Lea
Is this OOM happening as part of your early morning optimize or at
some other point?  By optimize do you mean IndexWriter.forceMerge(1)?
You really shouldn't have to use that. If the index grows forever
without it then something else is going on which you might wish to
report separately.


--
Ian.


On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen mich...@loot.co.za wrote:
 We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM
 error.

 As background, our index contains about 14 million documents (growing
 slowly) and we process about 1 million updates per day. It's about 8GB on
 disk.  I'm not sure if the Lucene segments merge the way they used to in the
 early versions, but we've always optimized at 3am to get rid of dead space
 in the index, or otherwise it grows forever.

 The mergeSegments was working under 4.3.1 but the index has grown somewhat
 on disk since then, probably due to a couple of added NumericDocValues
 fields.  The java process is assigned about 3GB (the maximum, as it's
 running on a 32 bit i686 Linux box), and it still goes OOM.

 Any advice as to the possible cause and how to circumvent it would be great.
 Here's the stack trace:

 org.apache.lucene.index.MergePolicy$MergeException:
 java.lang.OutOfMemoryError: Java heap space
 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:212)
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:174)
 org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:301)
 org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:253)
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:215)
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)


 Thanks,
 Michael.




Re: Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ian Lea
It's reasonable that "block-major" won't find anything.
"block-major-57" should match.

The split into "block" and "major-57" will be because, from the javadocs
for ClassicTokenizer, it "Splits words at hyphens, unless there's a
number in the token, in which case the whole token is interpreted as a
product number and is not split".  So I guess it splits on the first
hyphen but not the second.

ClassicAnalyzer/Tokenizer is general purpose and will never meet
everyone's requirements all the time.  You could try a different
analyzer, or build your own.  That's what the javadoc recommends.
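A quick way to see exactly what the tokenizer emits, as a sketch
assuming Lucene 4.1 (the field name "f" is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

TokenStream ts = new ClassicAnalyzer(Version.LUCENE_41)
        .tokenStream("f", new StringReader("block-major-57"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term); // prints "block", then "major-57"
}
ts.end();
ts.close();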


--
Ian.


On Fri, Sep 20, 2013 at 1:26 PM, Ramprakash Ramamoorthy
youngestachie...@gmail.com wrote:
 Sorry, hit the send button accidentally the last time. Please read below :

 Hello,

 We're using lucene 4.1. We have the word block-major-57
 indexed. Using the classic analyzer, we get the following tokens: "block" and
 "major-57".

 When I search for "block-major*", the document doesn't match.
 However searching for "block*" works perfectly. Is this a bug, or am I doing
 something wrong?


 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Chennai, India.




Re: Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ian Lea
Oh, sorry, didn't catch that.  There are some spurious asterisks in
your message, as displayed by gmail anyway.  The most recent one has
"block-major*".

I don't know the answer.  Some unwanted interaction between the
tokenization and query parser and wildcards?  If it's going to split
"block-major-57" into "block" and "major-57", will it also split the
query "block-major*" into "block" and "major*" or leave it as
"block-major*"?  The first might be expected to work, the latter
wouldn't.

Maybe try storing this field without analysis, or just with something
simple like downcasing, and searching with a PrefixQuery?  I think
that would work.
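A sketch of the unanalyzed-field variant (the field name "code" is an
assumption; lower-case the value yourself at index and query time if you
want case-insensitive matching):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Index the raw value as one token, so hyphens are never split.
Document doc = new Document();
doc.add(new StringField("code", "block-major-57", Field.Store.YES));

// Prefix match with no tokenization surprises.
Query q = new PrefixQuery(new Term("code", "block-major"));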


--
Ian.


On Fri, Sep 20, 2013 at 1:48 PM, Ramprakash Ramamoorthy
youngestachie...@gmail.com wrote:
 On Fri, Sep 20, 2013 at 6:11 PM, Ian Lea ian@gmail.com wrote:

 It's reasonable that block-major won't find anything.
 block-major-57 should match.


 Thank you Ian, I understand. But my question is why wouldn't
 "block-major*" match? Please note the wildcard at the end! Thanks.


 The split into block and major-57 will be because, from the javadocs
 for ClassicTokenizer, Splits words at hyphens, unless there's a
 number in the token, in which case the whole token is interpreted as a
 product number and is not split..  So I guess it splits on the first
 hyphen but not the second.

 ClassicAnalyzer/Tokenizer is general purpose and will never meet
 everyone's requirement all the time.  You could try a different
 analyzer, or build your own.  That's what the javadoc recommends.


 --
 Ian.


 On Fri, Sep 20, 2013 at 1:26 PM, Ramprakash Ramamoorthy
 youngestachie...@gmail.com wrote:
  Sorry, hit the send button accidentally the last time. Please read below
 :
 
  Hello,
 
   We're using lucene 4.1. We have the word block-major-57
   indexed. Using the classic analyzer, we get the following tokens:
   "block" and "major-57".
  
   When I search for "block-major*", the document doesn't match.
   However searching for "block*" works perfectly. Is this a bug, or am I
   doing something wrong?
 
 
  --
  With Thanks and Regards,
  Ramprakash Ramamoorthy,
  Chennai, India.





 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Chennai, India




Re: Multiple field instances and Field.Store.NO

2013-09-16 Thread Ian Lea
Not exactly dumb, and I can't tell you exactly what is happening here,
but lucene stores some info at the index level rather than the field
level, and things can get confusing if you don't use the same Field
definition consistently for a field.

From the javadocs for org.apache.lucene.document.Field:

NOTE: the field type is an IndexableFieldType. Making changes to the
state of the IndexableFieldType will impact any Field it is used in.
It is strongly recommended that no changes be made after Field
instantiation.

--
Ian.


On Mon, Sep 16, 2013 at 11:33 AM, Alan Burlison alan.burli...@gmail.com wrote:
 I'm creating multiple instances of a field, some with Field.Store.YES
 and some with Field.Store.NO, with Lucene 4.4. If Field.Store.YES is
 set then I see multiple instances of the field in the documents in the
 resulting index; if I use Field.Store.NO then I only see a single
 field. Is that expected or am I doing something dumb?

 Thanks,

 --
 Alan Burlison
 --




Re: Regarding Compression Tool

2013-09-13 Thread Ian Lea
Are you talking about CompressionTools as in
http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/document/CompressionTools.html?

They've long been superseded by a completely different, low-level,
transparent compression method.

Anyway, use them to compress stored fields, not fields you want to search on.
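A sketch of that split, assuming Lucene 4.x (index the plain text for
searching; store only the compressed bytes; the field names here are
made up):

import java.util.zip.DataFormatException;
import org.apache.lucene.document.CompressionTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.IndexSearcher;

// At index time: one searchable field, one compressed stored field.
Document makeDoc(String text) {
    Document doc = new Document();
    doc.add(new TextField("body", text, Field.Store.NO)); // searchable, not stored
    doc.add(new StoredField("body_z", CompressionTools.compressString(text)));
    return doc;
}

// At retrieval time: decompress the stored bytes back to the original text.
String fetchBody(IndexSearcher searcher, int docId)
        throws java.io.IOException, DataFormatException {
    return CompressionTools.decompressString(
            searcher.doc(docId).getBinaryValue("body_z"));
}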


--
Ian.


On Fri, Sep 13, 2013 at 6:19 AM, Jebarlin Robertson jebar...@gmail.com wrote:
 Hi,

 I am trying to store all the field values using CompressionTools, but when I
 search for any content it is not finding any results.

 Can you help me with how to create the Field with CompressionTools to add to
 the Document, and how to decompress it when searching for any content in it?

 --
 Thanks  Regards,
 Jebarlin Robertson.R




Re: Check if Term present in Existing Index before Merging indexes from Directory.

2013-09-11 Thread Ian Lea
If you want to stick with the approach of multiple indexes you'll have
to add some logic to work round it.

Option 1.

Post merge, loop through all docs identifying duplicates and deleting
the one(s) you don't want.


Option 2.

Pre merge, read all indexes in parallel, identifying and deleting as above.


Option 3.

When creating a new index, check the existing index first and delete matches
or don't index the file, whichever makes sense.


I'm sure there are other options as well, but no instant solutions.
One obvious option is to skip the merging altogether: if you want one
big index, why not just work directly with that, using updateDocument
with filename as the Term.



--
Ian.


On Wed, Sep 11, 2013 at 1:40 PM, Ankit Murarka
ankit.mura...@rancoretech.com wrote:
 Hello

 Have a peculiar problem to deal with and I am sure there must be some way to
 handle it.

 1. Indexes exist on the server for existing files.
 2. Generating indexing is automated so files when generated will also lead
 to index generation.
 3. I am merging the newly generated indexes and existing index.

 /*Field of prime importance is fileName.*/

 Now since merging is being done with /* writer.addIndexes(Directory name)*/

 The same file if indexed again is being added in the indexes twice. So in
 Hit I am getting more than 1 entries for same file. No problem with the
 HIT..

 Problem is with the same file being indexed two times during merging..

 I need to ensure that when I merge indexes, if term say /*File1*/ is
 already present, the indexes should be updated instead of adding. This is
 supposed to happen during indexing process.

 Kindly guide me as to how this can be achieved; the javadocs do not seem
 to help.

 TIA.

 --
 Regards

 Ankit Murarka

 What lies behind us and what lies before us are tiny matters compared with
 what lies within us


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Concurrent Search

2013-09-06 Thread Ian Lea
For the singleton technique that I use, the per-search code looks like

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

 SearcherManager sm = LuceneSearcherManagerCache.get(indexdir);
 IndexSearcher s = sm.acquire();   // ref-counted; must be released
 try {
   search(...);
 }
 finally {
   sm.release(s);                  // hand the searcher back
 }
 s = null;                         // never use it after release

where LuceneSearcherManagerCache is the singleton class that
initialises and caches SearcherManager instances by index directory.
It calls maybeRefresh() on each get, which of course isn't
particularly efficient, but this is used, within tomcat, for
occasional searches on small indexes with no knowledge of whether or
when a particular index may have changed.  In practice, on my indexes
on my hardware it is, as usual with lucene, fast.

As I think I said, the initialization of SearcherManager is 100% default:

new SearcherManager(dir, new SearcherFactory());
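
A minimal sketch of such a cache class, under the same assumptions
(the class itself is not part of Lucene; error handling kept simple):

 import java.io.File;
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.lucene.search.SearcherFactory;
 import org.apache.lucene.search.SearcherManager;
 import org.apache.lucene.store.FSDirectory;

 public final class LuceneSearcherManagerCache {

   private static final Map<String, SearcherManager> CACHE =
       new HashMap<String, SearcherManager>();

   private LuceneSearcherManagerCache() {
   }

   public static synchronized SearcherManager get(String indexdir)
       throws IOException {
     SearcherManager sm = CACHE.get(indexdir);
     if (sm == null) {
       // 100% default initialisation, as above.
       sm = new SearcherManager(FSDirectory.open(new File(indexdir)),
                                new SearcherFactory());
       CACHE.put(indexdir, sm);
     }
     // Pick up any index changes; see the efficiency caveat above.
     sm.maybeRefresh();
     return sm;
   }
 }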



Hope that helps.


--
Ian.


On Thu, Sep 5, 2013 at 11:21 PM, David Miranda
david.b.mira...@gmail.com wrote:
 Do you have a practical example of the use of SearcherManager
 (initialising it and using it to search)?

 Thanks in advance.


 2013/9/5 Stephen Green eelstretch...@gmail.com

 You can implement a ServletContextListener for your app and open the index there
 (in the contextInitialized method). You can then create the SearcherManager
 from the IndexReader/Searcher and store it in the ServletContext, where it
 can be fetched out by your REST servlets.

 This is a typical pattern that we use for lots of Web apps that use
 resources like Lucene.
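
 A minimal sketch of that pattern, assuming the Servlet 3.0 API (the
 listener class, attribute name and index path are all illustrative):

  import java.io.File;
  import java.io.IOException;
  import javax.servlet.ServletContextEvent;
  import javax.servlet.ServletContextListener;
  import javax.servlet.annotation.WebListener;
  import org.apache.lucene.search.SearcherFactory;
  import org.apache.lucene.search.SearcherManager;
  import org.apache.lucene.store.FSDirectory;

  @WebListener
  public class SearcherManagerListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
      try {
        SearcherManager sm = new SearcherManager(
            FSDirectory.open(new File("/path/to/index")),
            new SearcherFactory());
        // Servlets fetch it via getServletContext().getAttribute("searcherManager").
        sce.getServletContext().setAttribute("searcherManager", sm);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void contextDestroyed(ServletContextEvent sce) {
      // Close the manager (and its underlying reader) on shutdown.
      SearcherManager sm = (SearcherManager)
          sce.getServletContext().getAttribute("searcherManager");
      if (sm != null) {
        try {
          sm.close();
        } catch (IOException e) {
          // ignore on shutdown
        }
      }
    }
  }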


 On Thu, Sep 5, 2013 at 12:05 PM, Ian Lea ian@gmail.com wrote:

  I use a singleton class but there are other ways in tomcat.  Can't
  remember what - maybe application scope.
 
 
  --
  Ian.
 
 
  On Thu, Sep 5, 2013 at 4:46 PM, David Miranda david.b.mira...@gmail.com
  wrote:
   Where can I initialize the SearcherManager variable so that I can later
   use it in the REST servlet to search the index?
  
  
   2013/9/5 Ian Lea ian@gmail.com
  
   I think that blog post was bleeding edge and the API changed a bit
   subsequently.
  
   I use
  
   Directory dir = whatever;
   SearcherManager sm = new SearcherManager(dir, new SearcherFactory());
  
   to get default behaviour.  The javadocs for SearcherFactory explain
   that you can write your own implementation if you want custom
   behaviour such as warming.
  
  
   --
   Ian.
  
  
    On Thu, Sep 5, 2013 at 3:53 PM, David Miranda
    david.b.mira...@gmail.com wrote:
     Hi,

     I'm trying to implement my code with SearcherManager to make my app
     thread-safe. I'm following this post:

     http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html

     There is a class that implements SearcherWarmer. I can't find this
     class in the Lucene library; what class is that?

     Thanks.
   
   
2013/9/5 Aditya findbestopensou...@gmail.com
   
 Hi

 If you want to use a REST service for your search, my advice would be
 to use Solr, as it has built-in REST API functionality.
   
 If you want to use Lucene then below are my comments:
 1. In the doSearch function you are creating a reader object. If this
 call is invoked for every query it will be very expensive. You need to
 create the reader once globally and reopen it if the index is modified;
 better still, use SearcherManager (see the sketch below).
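
 A minimal sketch of that reopen pattern, assuming Lucene 4.x and
 long-lived reader and searcher fields (the names are illustrative):

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;

  // openIfChanged returns null when the index has not changed.
  DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
  if (newReader != null) {
    IndexReader old = reader;
    reader = newReader;
    searcher = new IndexSearcher(reader);
    old.close();  // release the superseded reader
  }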
   
Regards
Aditya
 www.findbestopensource.com - Search from 1 Million open source projects.
   
   
   
 On Thu, Sep 5, 2013 at 6:46 AM, David Miranda
 david.b.mira...@gmail.com wrote:

  Hi,

  I'm developing a web application that contains a REST service in
  Tomcat and receives several requests per second. The REST requests
  search a Lucene index; to do this I use IndexSearcher.

  My questions are:
  - Are there concurrency problems with multiple concurrent searches?
  - What is the best design pattern for this?

  public class IndexResearch {
    private static int MAX_HITS = 500;
    private static String DIRECTORY = "indexdir";
    private IndexSearcher searcher;
    private StandardAnalyzer analyzer;

    public IndexResearch() {
    }

    public String doSearch(String text) throws Exception {
      analyzer = new StandardAnalyzer(Version.LUCENE_43);
      text = QueryParser.escape(text);
      Query q = new QueryParser(Version.LUCENE_43, "field", analyzer).parse(text);
      File indexDirectory = new File(DIRECTORY);
      IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDirectory));
      searcher = new IndexSearcher(reader);

      /*more code*/
    }
  }


  Can I create, in the servlet, one object of this class per client
  request (is that the best design pattern)?

 Thanks in advance.

   
   
   
   
--
Regards

Re: Lucene Concurrent Search

2013-09-05 Thread Ian Lea
Take a look at org.apache.lucene.search.SearcherManager.

From the javadocs: "Utility class to safely share IndexSearcher
instances across multiple threads, while periodically reopening".


--
Ian.


On Thu, Sep 5, 2013 at 2:16 AM, David Miranda david.b.mira...@gmail.com wrote:
 Hi,

 I'm developing a web application that contains a REST service in
 Tomcat and receives several requests per second. The REST requests
 search a Lucene index; to do this I use IndexSearcher.

 My questions are:
 - Are there concurrency problems with multiple concurrent searches?
 - What is the best design pattern for this?

 public class IndexResearch {
   private static int MAX_HITS = 500;
   private static String DIRECTORY = "indexdir";
   private IndexSearcher searcher;
   private StandardAnalyzer analyzer;

   public IndexResearch() {
   }

   public String doSearch(String text) throws Exception {
     analyzer = new StandardAnalyzer(Version.LUCENE_43);
     text = QueryParser.escape(text);
     Query q = new QueryParser(Version.LUCENE_43, "field", analyzer).parse(text);
     File indexDirectory = new File(DIRECTORY);
     IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDirectory));
     searcher = new IndexSearcher(reader);

     /*more code*/
   }
 }


 Can I create, in the servlet, one object of this class per client request
 (Is that the best design pattern)?

 Thanks in advance.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


