RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
Hi,

Extending an existing Analyzer is not useful, because it is just a factory that 
returns a TokenStream instance to consumers. If you want to change the 
Tokenizer of an existing Analyzer, just clone it and rewrite its 
createComponents() method, see the example in the Javadocs: 
http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html

If you want to add additional TokenFilters to the chain, you can do this with 
AnalyzerWrapper 
(http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/AnalyzerWrapper.html),
 but this does not work with Tokenizers, because those are instantiated before 
the TokenFilters which depend on them, so changing the Tokenizer afterwards is 
impossible.
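A minimal sketch of the clone-and-override approach, against the Lucene 4.10 API (where createComponents() still receives the Reader); `MyTokenizer` is a placeholder for your own Tokenizer implementation, and the filter chain shown is only illustrative:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Your custom Tokenizer goes at the head of the chain...
    Tokenizer source = new MyTokenizer(reader); // placeholder class
    // ...followed by whatever TokenFilters you need.
    TokenStream result = new LowerCaseFilter(Version.LUCENE_4_10_3, source);
    return new TokenStreamComponents(source, result);
  }
}
```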

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Vihari Piratla [mailto:viharipira...@gmail.com]
> Sent: Monday, January 12, 2015 8:51 AM
> To: java-user@lucene.apache.org
> Subject: Custom tokenizer
> 
> Hi,
> I am trying to implement a custom tokenizer for my application and I have
> a few questions about it.
> 1. Is there a way to give an existing analyzer (say EnglishAnalyzer) the
> custom tokenizer and make it use this tokenizer instead of, say,
> StandardTokenizer?
> 2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer declared
> final? Because of that, I cannot extend them.
> 
> Thank you.
> --
> V


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom tokenizer

2015-01-12 Thread Vihari Piratla
Thanks for the reply.

Hmm, I understand.
I know about AnalyzerWrapper, but that is not what I am looking for.

I also know about cloning and overriding. I want my analyzer to behave
exactly like EnglishAnalyzer, and right now I am copying the code from
EnglishAnalyzer to mimic its behavior, which is a dirty solution.
Is there any cleaner solution to this problem?

Thank you.


-- 
V


RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
> Thanks for the reply.
> 
> Hmm, I understand.
> I know about AnalyzerWrapper, but that is not what I am looking for.
> 
> I also know about cloning and overriding. I want my analyzer to behave
> exactly like EnglishAnalyzer, and right now I am copying the code from
> EnglishAnalyzer to mimic its behavior, which is a dirty solution.
> Is there any cleaner solution to this problem?

NO.

Analyzers that are provided by Lucene have a configuration (a combination of 
Tokenizer and TokenFilters) that won't change unless the matchVersion differs 
(which is documented in the Javadocs). The reason for this is: if you have 
indexed with a given analyzer, you must always use it unmodified when 
updating or searching the index; otherwise the results of those actions are 
undefined. So across Lucene upgrades every Analyzer should return exactly the 
same results. Otherwise all users would need to rebuild their indexes even on 
minor version updates.

Also, see the Lucene Analyzers as "example" code. What counts here is the 
combination of Tokenizer and TokenFilters, which is freely configurable. The 
ones provided by Lucene are useful for common cases, but whenever you have 
custom requirements, you have to define your Analyzer *completely* yourself. 
This is also what Solr and Elasticsearch users do in their config files.
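For the EnglishAnalyzer case specifically, a self-defined Analyzer could reproduce its 4.10 chain (StandardFilter, possessive stripping, lowercasing, stopwords, Porter stemming) with a custom Tokenizer swapped in. This is a sketch: `MyTokenizer` is a placeholder, and the chain should be verified against the EnglishAnalyzer source of your exact version:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public final class CustomEnglishAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Placeholder for your own Tokenizer, replacing StandardTokenizer.
    Tokenizer source = new MyTokenizer(reader);
    // The rest mirrors EnglishAnalyzer's default chain.
    TokenStream result = new StandardFilter(Version.LUCENE_4_10_3, source);
    result = new EnglishPossessiveFilter(Version.LUCENE_4_10_3, result);
    result = new LowerCaseFilter(Version.LUCENE_4_10_3, result);
    result = new StopFilter(Version.LUCENE_4_10_3, result,
                            EnglishAnalyzer.getDefaultStopSet());
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```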

Uwe




howto: handle temporal visibility of a document?

2015-01-12 Thread Clemens Wyss DEV
We have documents that are only visible during a certain time window 
(visiblefrom-visibleto). To avoid having to ask the originating object of each 
document whether it is currently visible (after the query), we'd like to put 
metadata into the documents so that visibility can be determined at query time 
(by the query itself or a query filter). Any suggestions on how to index and 
query this metadata?


AW: howto: handle temporal visibility of a document?

2015-01-12 Thread Clemens Wyss DEV
I'll add/start with my proposal ;)

Document-meta fields:
+ visiblefrom [long]
+ visibleto [long]

Query or query filter:
(*:* -visiblefrom:[* TO *] AND -visibleto:[* TO *]) 
OR (*:* -visiblefrom:[* TO *] AND visibleto:[<now> TO *]) 
OR (*:* -visibleto:[* TO *] AND visiblefrom:[* TO <now>]) 
OR ( visiblefrom:[* TO <now>] AND visibleto:[<now> TO *])
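On the index side, this could be sketched with numeric long fields (field names taken from the proposal; `visibleFrom`/`visibleTo` are assumed nullable epoch-millis values from the originating object, and a missing bound is modeled by omitting the field, which is what the -field:[* TO *] clauses test for):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;

// Sketch: index each visibility bound only when it is set.
Document doc = new Document();
if (visibleFrom != null) {
  doc.add(new LongField("visiblefrom", visibleFrom, Field.Store.NO));
}
if (visibleTo != null) {
  doc.add(new LongField("visibleto", visibleTo, Field.Store.NO));
}
```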



fill 'empty' facet-values, sampling, taxoreader

2015-01-12 Thread Rob Audenaerde
Hi all,

I'm building an application in which users can add arbitrary documents, and
all fields will be added as facets as well. This allows users to browse
their documents by their own defined facets easily.

However, when the number of documents gets very large, I switch to
random-sampled facets to make sure the application stays responsive. By the
nature of sampling, documents (and thus facet-values) will be missed.

I let the user select the number of facet values he wants to see for each
facet. For example, the default is 10. If a facet contains values 1 to 20,
the user will always see 10 values if all documents are returned in the
search and no sampling is done.

If sampling is done and the values are non-uniformly distributed, the user
might end up with only 5 values instead of 10. I want to 'fill' the 5 empty
facet-value slots with existing facet values and an unknown facet count
(?). The reason behind this is that such a value might exist in the
result set, and for interaction purposes it is very nice if this value can
be selected and added to the query, to quickly find whether there are
documents that also contain this facet value.

It is even more useful if these facet values are sorted not by count but
by label. The user can then quickly see that there are documents that
contain a certain value.

I can iterate over the ordinals via the TaxonomyReader and TaxonomyFacets
(by leveraging the 'children'), but these ordinals might no longer be used
in the documents.

What would be a good approach to tackle this issue?


MultiPhraseQuery:Rewrite to BooleanQuery

2015-01-12 Thread dennis yermakov
Hi folks!
I have a multiphrase query, for example, from units:

Directory indexStore = newDirectory();
RandomIndexWriter writer = new RandomIndexWriter(random(), indexStore);
add("blueberry chocolate pie", writer);
add("blueberry chocolate tart", writer);
IndexReader r = writer.getReader();
writer.close();

IndexSearcher searcher = newSearcher(r);
MultiPhraseQuery q = new MultiPhraseQuery();
q.add(new Term("body", "blueberry"));
q.add(new Term("body", "chocolate"));
q.add(new Term[] {new Term("body", "pie"), new Term("body", "tart")});
assertEquals(2, searcher.search(q, 1).totalHits);
r.close();
indexStore.close();

I need to know which phrase the query matched on. Explanation doesn't
return that exact information, only that there is a match by this query. So
can I rewrite this query to a BooleanQuery, like

BooleanQuery q = new BooleanQuery();

PhraseQuery pq1 = new PhraseQuery();
pq1.add(new Term("body", "blueberry"));
pq1.add(new Term("body", "chocolate"));
pq1.add(new Term("body", "pie"));
q.add(pq1, BooleanClause.Occur.SHOULD);

PhraseQuery pq2 = new PhraseQuery();
pq2.add(new Term("body", "blueberry"));
pq2.add(new Term("body", "chocolate"));
pq2.add(new Term("body", "tart"));
q.add(pq2, BooleanClause.Occur.SHOULD);

In this case I'll know exactly on which query I have a match. But the main
question is: is this rewrite equivalent?
Thanks.

-- 
dennis yermakov
mailto: dem...@gmail.com


Re: AW: howto: handle temporal visibility of a document?

2015-01-12 Thread Michael Sokolov
The basic idea seems sound, but I think you can simplify that query a 
bit. For one thing, the *:* clauses can be removed in a few places; 
also, if you index an explicit null value you won't need them at all. For 
visiblefrom, if you don't have a from time, use 0; for visibleto, if you 
don't have a to time, use Long.MAX_VALUE.
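With those sentinels indexed (0 when there is no from time, Long.MAX_VALUE when there is no to time, e.g. as LongFields), the whole filter could collapse to two range clauses. A sketch against the 4.x numeric range API (field names from the proposal):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;

long now = System.currentTimeMillis();
// A document is visible iff visiblefrom <= now <= visibleto.
BooleanQuery visibility = new BooleanQuery();
visibility.add(NumericRangeQuery.newLongRange("visiblefrom", null, now, true, true),
               BooleanClause.Occur.MUST);
visibility.add(NumericRangeQuery.newLongRange("visibleto", now, null, true, true),
               BooleanClause.Occur.MUST);
```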


-Mike




Re: Finding a match for an automaton against a FST

2015-01-12 Thread Michael McCandless
On Sat, Jan 10, 2015 at 8:23 AM, Olivier Binda  wrote:
> On 01/10/2015 11:00 AM, Michael McCandless wrote:
>>
>> On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda 
>> wrote:
>>>
>>> Hello.
>>>
>>> 1) What is the best way to check if an automaton (from a regex or a
>>> string
>>> with a wildcard)
>>> has at least 1 match against a FST (from a WFSTCompletionLookup) ?
>>
>> You need to implement "intersect".  We already have this method for
>> two automata (Operations.java); maybe you can start from that but
>> cutover to the FST APIs instead for the 2nd automaton?
>
>
> I looked a bit into this. This is complicated stuff :/

Sorry, yes it is.  If you have any ideas to simplify the APIs that
would be awesome :)

> I think I get what the nested loops in intersect() do: transitions consist
> of a two-dimensional array, and somehow those arrays are intersected.
> I don't understand yet why there is a .min and a .max for a transition (why
> not just a codepoint?)

Most automaton transitions cover a wide range of unicode characters,
so requiring a separate transition for each would be too costly (too
many objects / RAM).

> FST and Automaton (and maybe the Lucene codec stuff) are 3 different
> implementations
> of finite state machines/transducers, right?

I think we have only 2 implementations (FST, Automaton).

> How does regexQuery (automaton) match against an index?
> Does it use intersect() internally? (if it does, maybe I could reuse that
> code too)

RegexpQuery in core (NOT to be confused with the much slower,
differing in name by only one letter, RegexQuery in sandbox) builds an
Automaton and then uses Terms.intersect API.

However, I would not look for inspiration from Terms.intersect: that
implementation (in block tree terms dict) works with the terms
dictionary data structures to perform a fast intersection and that
code is crazy complex.

Possibly a place to look for inspiration/poaching is
FSTUtil.intersectPrefixPaths: that intersects an automaton with an
FST.  It's used by the fuzzy suggester...

Mike McCandless

http://blog.mikemccandless.com



>
>
>>
>>> 2) Also, is there a simple/efficient way to find the lowest and the
>>> highest
>>> arcs of a FST that match against  an automaton ?
>>
>> Hmm arcs leaving which state?  The initial state?  You could simply
>> walk all arcs leaving the initial state from the FST and check if the
>> automaton accepts them leaving its initial state (assuming the
>> automaton has no dead states)?
>>
>> Or, if you are already doing an intersection here, just save this
>> information as a side effect since you will have already computed it.
>
>
> thanks for the tips, it helps.
> Olivier





AW: AW: howto: handle temporal visibility of a document?

2015-01-12 Thread Clemens Wyss DEV
Thx, I will simplify/optimize ;)




Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
Thanks Mike,


> OK.  It would be good to know where all your RAM is being consumed,
> and how much of that is really the terms index: it ought to be a very
> small part of it.

I made a bunch of heap dumps. I just watched with jconsole and ran jmap 
-histo when memory use got high. I've appended a bit more from the error 
trace and the top memory users from one of the heap dumps below.

I tried to send a bunch of heap dumps to the mailing list but the message
got rejected. I'll send them directly to you.

Tom




java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:659)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
---
top memory users from one of the heap dumps:

   1:   1131932 2546933736  [B
   2:308670  743033280  [I
   3:696803  203038680  [C
   4:383039   36771744
 org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:   1089864   26156736
 org.apache.lucene.util.AttributeSource$State
   6:544870   26153760
 org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl
   7:687500   1650  org.apache.lucene.util.BytesRef
   8:1358209779040  org.apache.lucene.util.fst.FST$Arc
   9:3825199180456
 org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm
  10:382037916  org.apache.lucene.codecs.TermStats
  11:5449528719232  org.apache.lucene.util.BytesRefBuilder


Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-12 Thread Tom Burton-West
Thanks Mike,

Do you know how I can configure Solr to use the min=200 and
max=398 block sizes you suggested?  Or should I ask on the Solr list?

Tom
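For reference, on the Lucene side those block sizes are wired in through a custom codec; Solr would additionally need the codec registered via SPI and a codecFactory, which is indeed a Solr-list question. A hedged sketch (the codec name and delegate choice are illustrative):

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;

// A codec that delegates everything to the default 4.10 codec but swaps in
// a postings format with larger terms-dict block sizes (min=200, max=398,
// the values suggested earlier in this thread).
public class LargeBlockCodec extends FilterCodec {
  private final PostingsFormat postings = new Lucene41PostingsFormat(200, 398);

  public LargeBlockCodec() {
    super("LargeBlockCodec", new Lucene410Codec());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }
}
```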

On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> The first int to Lucene41PostingsFormat is the min block size (default
> 25) and the second is the max (default 48) for the block tree terms
> dict.
>
> The max must be >= 2*(min-1).
>
> Since you were using 8X the default before, maybe try min=200 and
> max=398?  However, block tree should have been more RAM efficient than
> 3.x's terms index... if you run CheckIndex with -verbose it will print
> additional details about the block structure of your terms indices...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West 
> wrote:
> > Hello all,
> >
> > We have over 3 billion unique terms in our indexes and with Solr 3.x we
> set
> > the TermIndexInterval to about 8 times its default value in order to
> index
> > without OOMs.  (
> > http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
> >
> > We are now working with Solr 4 and running into memory issues and are
> > wondering if we need to do something analogous for Solr 4.
> >
> > The javadoc for IndexWriterConfig (
> >
> http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
> > )
> > indicates that the lucene 4.1 postings format has some parameters which
> may
> > be set:
> > "..To configure its parameters (the minimum and maximum size for a
> block),
> > you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int,
> > int)
> > <
> https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29
> >
> > "
> >
> > Is there documentation or discussion somewhere about how to determine
> > appropriate parameters or some detail about what setting the maxBlockSize
> > and minBlockSize does?
> >
> > Tom Burton-West
> > http://www.hathitrust.org/blogs/large-scale-search
>


StoredField available in Collector.setNextReader

2015-01-12 Thread Hasenberger, Josef
Hello,

I have tried to retrieve values stored via the StoredField type inside a 
Collector when its setNextReader(AtomicReaderContext) method is called.
I used the following method from FieldCache, but do not get back any values:
  FieldCache.DEFAULT.getTerms(indexReader, field, false);

Retrieving the values from the document itself during the call to 
Collector.collect(int) works fine, but this is much slower than getting all 
terms at once via the method above.

My question:
Is there a way to get binary content with performance similar to the concept 
described above, i.e. retrieving the field terms when the reader is set on a 
Collector?


Besides, the concept works fine for any stored field that is indexed, e.g. like 
in the following code snippet:

final FieldType fieldType = new FieldType();
{
fieldType.setStored(true);
fieldType.setIndexed(true); // need to index, otherwise no fast 
retrieval of terms in collector is possible
fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
fieldType.setTokenized(false);
fieldType.setOmitNorms(true);
fieldType.freeze();
}

Field field = new Field(fieldName, fieldValue, fieldType); // 
fieldValue is of type String

But this does not allow me to store binary content (i.e. byte[] values) as is 
possible with StoredField: the Field constructor expects content of type 
String. I have tried converting the content into base64-encoded strings, but 
the conversion from base64-encoded strings back to byte arrays is quite 
expensive for large indexes.
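One possibility (my assumption, not something confirmed in this thread): binary doc values, which are designed for exactly this kind of per-segment random access and accept raw byte[] content. Sketched against the 4.10 API, with `myBytes` and the field name "payload" as placeholders:

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;

// Indexing: store the raw bytes as doc values, no base64 round-trip needed.
Document doc = new Document();
doc.add(new BinaryDocValuesField("payload", new BytesRef(myBytes)));

// In Collector.setNextReader(AtomicReaderContext context): grab the
// per-segment doc values once.
BinaryDocValues payloads = context.reader().getBinaryDocValues("payload");

// In Collector.collect(int doc): lookup by segment-local docID,
// without loading the stored document.
BytesRef value = payloads.get(doc);
```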


Thanks for your advice.

Best regards,

Josef



Problem with Custom FieldComparator

2015-01-12 Thread Victor Podberezski
I'm migrating a web application from Lucene 2.4.1 to Lucene 2.9.4
(basically because of this bug:
https://issues.apache.org/jira/browse/LUCENE-1304).

I'm trying to migrate a custom sort field according to some examples I read,
but I cannot make it work right.

I have a field with string values, and when I find a pattern I extract a
number (the priority). This priority is used for sorting the documents.

The field has values like this:

"pub.generic1 pub.generic1.zonahome pub.generic1.zonahome.prio.1
pub.generic1.zonahome.lateral.slash.derecha
pub.generic1.zonahome.lateral.slash.derecha.prio.1
pub.generic1.seccion.seccion1 pub.generic1.seccion.seccion1.prio.10"

This is the new comparator code:

public class HighTrafficSortComparator
extends FieldComparatorSource {

protected static final Log LOG =
CmsLog.getLog(HighTrafficSortComparator.class);

private List<String> priorityPreffix;

private boolean ascending = false;

public HighTrafficSortComparator(String[] preffix, boolean ascending) {

this.ascending = ascending;
// here I build the priorityPreffix list
// ...

}

public  FieldComparator newComparator(String fieldname, int numHits, int
sortPos, boolean reversed) throws IOException {

return new HighTrafficFieldComparator(numHits, fieldname);
}



class HighTrafficFieldComparator extends FieldComparator {

String field;
int[] docValues;
int[] slotValues;
int bottomValue;

HighTrafficFieldComparator(int numHits, String fieldName) {
slotValues = new int[numHits];
field = fieldName;
}

public void copy(int slot, int doc) {
slotValues[slot] = docValues[doc];
}

public int compare(int slot1, int slot2) {
return slotValues[slot1] - slotValues[slot2];
}

public int compareBottom(int doc) {
return bottomValue - docValues[doc];
}

public void setBottom(int bottom) {
bottomValue = slotValues[bottom];
}

public void setNextReader(IndexReader reader, int docBase) throws
IOException {
docValues = FieldCache.DEFAULT.getInts(reader, field, new
FieldCache.IntParser() {
public final int parseInt(final String val) {
 return getPrioridad(val);
}
});
}

public Comparable value(int slot) {
return new Integer(slotValues[slot]);
}
}

private Integer getPrioridad(String text) {
int prioridad = !ascending ? Integer.MAX_VALUE : Integer.MIN_VALUE;
if (text!=null) {
String[] termstext = text.split(" ");
for (String termtext : termstext) {
int idx = termtext.indexOf(NoticiacontentExtrator.KEY_SEPARATOR + "prio" +
NoticiacontentExtrator.VALUE_SEPARATOR);
if (idx>-1)
{
//is a priority.
String termPreffix = termtext.substring(0,idx);
 if (priorityPreffix.contains(termPreffix))
{
//has the requested priority

try {
int prioridadTerm = Integer.parseInt(termtext.substring(idx+6));

 if (!ascending && prioridadTerm < prioridad)
prioridad = prioridadTerm;
else if (ascending && prioridadTerm > prioridad)
prioridad = prioridadTerm;


 }
catch (NumberFormatException ex) {

}



}
}
}
}
return new Integer(prioridad);
}



}

this is how I use this custom sort:

camposOrden = new SortField(luceneFieldName, new
HighTrafficSortComparator(preffix,isAscending), isAscending);

When I run the query the result is not sorted correctly, but I don't know
what I'm doing wrong.

This is the old code working correctly in lucene 2.4.1:

public class HighTrafficSortComparator
implements SortComparatorSource {
 private List<String> priorityPreffix;
 public ScoreDocComparator newComparator(final IndexReader indexReader,
final String fieldname) throws IOException {
 return new ScoreDocComparator() {
private Map<Integer, Integer> cachedScores = new HashMap<Integer, Integer>();

public int compare(ScoreDoc scoreDoc1, ScoreDoc scoreDoc2) {
try {
 Integer priorityDoc1 = cachedScores.get(scoreDoc1.doc);
Integer priorityDoc2 = cachedScores.get(scoreDoc2.doc);
 if (priorityDoc1==null) {
final Document doc1 = indexReader.document(scoreDoc1.doc);
final String strVal1 = doc1.get(fieldname);


priorityDoc1 = getPrioridad(strVal1);
cachedScores.put(scoreDoc1.doc, priorityDoc1);
 }
if (priorityDoc2==null) {
final Document doc2 = indexReader.document(scoreDoc2.doc);
final String strVal2 = doc2.get(fieldname);

priorityDoc2 = getPrioridad(strVal2);
cachedScores.put(scoreDoc2.doc, priorityDoc2);

}

return priorityDoc1.compareTo(priorityDoc2);

} catch (IOException e) {
LOG.error("Cannot read doc", e); }
return 0;
}
 public Comparable sortValue(ScoreDoc scoreDoc)
{
try {
Integer priorityDoc = cachedScores.get(scoreDoc.doc);

if (priorityDoc==null) {
final Document doc = indexReader.document(scoreDoc.doc);
final String strVal = doc.get(fieldname);
priorityDoc = getPrioridad(strVal);
}
 return priorityDoc;

}
catch (IOException e) {
LOG.error("Cannot read doc", e); }
return 0;
 }
 public int sortType() {
return SortField.CUSTOM;
}

private Integer getPrioridad(String text) {
int prioridad = !ascending ? Integer.MAX_VALUE : Integer.MIN_VALUE;
 if (text!=null) {
String[] termstext = text.split(" ");
for (String termtext : termstext) {
int idx = termtext.indexOf(NoticiacontentExtrator.KEY_SEPARATOR + "prio" +
NoticiacontentExtrator.VALUE_SEPARATOR);
if 

RE: RE: howto: handle temporal visibility of a document?

2015-01-12 Thread Clemens Wyss DEV
reduced to:
( ( *:* -visiblefrom:[* TO *] AND -visibleto:[* TO *] ) 
OR (-visiblefrom:[* TO *] AND visibleto:[<now> TO <maxlong>])
OR (-visibleto:[* TO *] AND visiblefrom:[0 TO <now>])
OR ( visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>]) )

> also if you index an explicit null value you won't need them at all
Could it then be reduced to
(-visiblefrom:[* TO *] AND visibleto:[<now> TO <maxlong>])
OR (-visibleto:[* TO *] AND visiblefrom:[0 TO <now>])
OR ( visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>])
? 
Would I gain a lot more speed if I set visiblefrom to 0 and visibleto to 
<maxlong>? The query would then be: 
visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>]

And a rather Solr'y question, nevertheless I ask it here:
I intended to use this very query as a filter query (fq), but I guess it 
doesn't make sense because '<now>' changes on every call ;)
