Re: TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

2023-07-04 Thread Chris Hostetter


I hacked up the test a bit so it would compile against 9.0 and confirmed 
the problem existed there as well.

So going back a little farther with some manual bisection (to account for 
the transition from ant to gradle) led me to the following...

# first bad commit: [2719cf6630eb2bd7cb37d0e8462dc912d8fafd83] 
LUCENE-9431: UnifiedHighlighter WEIGHT_MATCHES is now true by default 
(#362)

...my impression here is that this problem probably existed for a while 
somewhere in a 'WEIGHT_MATCHES' code path, and this commit just exposed 
the problem "by default".

That impression seemed to be confirmed by tweaking my test patch (against 
2719cf6630eb2bd7cb37d0e8462dc912d8fafd83) to use...

  UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher,
      indexAnalyzer) {
    @Override
    protected Set<HighlightFlag> getFlags(String field) {
      final Set<HighlightFlag> x =
          new java.util.HashSet<>(super.getFlags(field));
      x.remove(HighlightFlag.WEIGHT_MATCHES);
      return x;
    }
  };

...and the tests started to pass.

Again, i don't really understand this code, but: knowing that the problem 
happens with TermVectorOffsetStrategy means that usages of WEIGHT_MATCHES 
in getOffsetStrategy's ANALYSIS codepath probably aren't relevant -- which 
leads me to assume the source of the problem is 
probably FieldOffsetStrategy.createOffsetsEnumsWeightMatcher ?


But this brings me back to not really understanding what code is "at 
fault" here ? ... The existence of WEIGHT_MATCHES and the design of
FieldOffsetStrategy.createOffsetsEnumsWeightMatcher to return an 
OffsetsEnum ordered by the "weighted" matches implies that it's 
expected/allowed for the offsets in Passages to be out of (ordinal) order 
... so does that mean DefaultPassageFormatter is broken for not 
expecting this?
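
(fwiw: if the answer turns out to be that DefaultPassageFormatter should 
cope, a minimal sketch of the kind of defensive fix i'd imagine -- purely 
hypothetical, just visiting each Passage's matches in start-offset order 
when appending -- would be something like...

  // hypothetical: sort match indices by matchStarts before appending,
  // so out of (ordinal) order matches can't produce end < start
  Integer[] order = new Integer[passage.getNumMatches()];
  for (int i = 0; i < order.length; i++) {
    order[i] = i;
  }
  java.util.Arrays.sort(order, (a, b) ->
      Integer.compare(passage.getMatchStarts()[a], passage.getMatchStarts()[b]));
  // ...then format() iterates matches via order[i] instead of i

...assuming 'passage' is the Passage instance being formatted.)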



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

2023-06-29 Thread Chris Hostetter


With some trial and error I realized two things...

1) the order of the terms in the BooleanQuery seems to matter
   - but in terms of their "natural order", not the order in the doc
   
   (which is why i was so confused trying to reproduce it)

2) the problem happens when using termVectors but *NOT* using 
termVectorPositions

Test patch below demonstrates the problem (applies to branch_9x)


-Hoss
http://www.lucidworks.com/


diff --git a/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java b/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
index 341318739f1..b94d60c3f85 100644
--- a/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
+++ b/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
@@ -76,6 +76,51 @@ public class TestUnifiedHighlighterTermVec extends LuceneTestCase {
     dir.close();
   }
 
+  public void testTermVecButNoPositions1() throws Exception {
+    testTermVecButNoPositions("x", "y", "y x", "y x");
+  }
+  public void testTermVecButNoPositions2() throws Exception {
+    testTermVecButNoPositions("y", "x", "y x", "y x");
+  }
+  public void testTermVecButNoPositions3() throws Exception {
+    testTermVecButNoPositions("zzz", "yyy", "zzz yyy", "zzz yyy");
+  }
+  public void testTermVecButNoPositions4() throws Exception {
+    testTermVecButNoPositions("zzz", "yyy", "yyy zzz", "yyy zzz");
+  }
+  public void testTermVecButNoPositions(String aaa, String bbb,
+                                        String indexed, String expected)
+      throws Exception {
+
+    final FieldType tvNoPosType =
+        new FieldType(org.apache.lucene.document.TextField.TYPE_STORED);
+    tvNoPosType.setStoreTermVectors(true);
+    // tvNoPosType.setStoreTermVectorPositions(true); // cause of problem seems to be lack of positions
+    tvNoPosType.setStoreTermVectorOffsets(true);
+    tvNoPosType.freeze();
+
+    RandomIndexWriter iw = new RandomIndexWriter(random(), dir, indexAnalyzer);
+
+    Field body = new Field("body", indexed, tvNoPosType);
+    Document document = new Document();
+    document.add(body);
+    iw.addDocument(document);
+    try (IndexReader ir = iw.getReader()) {
+      iw.close();
+      IndexSearcher searcher = newSearcher(ir);
+      BooleanQuery query =
+          new BooleanQuery.Builder()
+              // WTF? order of the terms in the boolean query also matters?
+              .add(new TermQuery(new Term("body", aaa)), BooleanClause.Occur.MUST)
+              .add(new TermQuery(new Term("body", bbb)), BooleanClause.Occur.MUST)
+              .build();
+      TopDocs topDocs = searcher.search(query, 10);
+      assertEquals(1, topDocs.totalHits.value);
+      UnifiedHighlighter highlighter =
+          UnifiedHighlighter.builder(searcher, indexAnalyzer).build();
+      String[] snippets = highlighter.highlight("body", query, topDocs, 2);
+      assertEquals(1, snippets.length);
+      assertTrue(snippets[0], snippets[0].contains(expected));
+    }
+  }
+
   public void testFetchTermVecsOncePerDoc() throws IOException {
     RandomIndexWriter iw = new RandomIndexWriter(random(), dir, indexAnalyzer);

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

2023-06-29 Thread Chris Hostetter



I've got a user getting java.lang.IndexOutOfBoundsException from the 
UnifiedHighlighter in Solr 9.1.0 w/Lucene 9.3.0


(And FWIW, this same data, w/same configs, in 8.11.1, purportedly didn't 
have this problem)



I don't really understand the highlighter code very well, but AFAICT:

- DefaultPassageFormatter seems to assume that the "matches"
  inside a single Passage will be "in order" (by offset)
  - it accounts for the possibility that they overlap
  - but not that matchEnds[i+1] < matchStarts[i]
- but in some cases (i don't understand why)
  - TermVectorOffsetStrategy can produce Passages that are "reversed"
  - apparently based on the iteration order from
    OfMatchesIteratorWithSubs ?

Which means DefaultPassageFormatter can trigger IOOBE in StringBuilder..

java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
  at java.lang.AbstractStringBuilder.checkRange(Unknown Source) ~[?:?]
  at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:?]
  at java.lang.StringBuilder.append(Unknown Source) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:133) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:84) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:25) ~[?:?]
  at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:94) ~[?:?]
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:954) ~[?:?]
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:824) ~[?:?]
  at org.apache.solr.highlight.UnifiedSolrHighlighter.doHighlighting(UnifiedSolrHighlighter.java:165) ~[?:?]

...as it tries to append a subsequence based on the start+end of 
"overlapping" matches that don't actually overlap -- the end of the 
"i+1" match is just strictly less than the "start" of the "i" match, 
because of how the Passage was built.



I'm still trying to wrap my head around all the moving pieces to 
try and reproduce this in a small scale lucene test, but in the meantime I 
patched some of the 9.3.0 highlighter code (patch below sig) to include 
some debugging output to kind of show what's happening here...


http://localhost:8983/solr/workplace/select?fl=Expertise,id&defType=lucene&df=Expertise&q=machine+learning&hl=true&rows=1&q.op=OR&echoParams=all

nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:learning,[8-16])
nocommit: Passage2030658055.addMatch(8,16,[6c 65 61 72 6e 69 6e 67],1)
nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:machine,[0-7])
nocommit: Passage2030658055.addMatch(0,7,[6d 61 63 68 69 6e 65],1)
nocommit: format([[Passage[0-16]{learning[8-16],machine[0-7]}score=2.7656934]],Machine Learning) <-- class org.apache.lucene.search.uhighlight.TermVectorOffsetStrategy
nocommit: append(,Machine Learning,0,8)
nocommit: append(Machine ,Machine Learning,8,7)
2023-06-29 21:11:15.711 ERROR (qtp1528769018-17) [ x:workplace] o.a.s.h.RequestHandlerBase java.lang.IndexOutOfBoundsException: start 8, end 7, length 16 => java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
  at java.base/java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716)
java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
  at java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716) ~[?:?]
  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:631) ~[?:?]
  at java.lang.StringBuilder.append(StringBuilder.java:217) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:134) ~[?:?]

..note how the OfMatchesIteratorWithSubs (OffsetEnum) enumerates over the 
two terms in this order...


term:learning,[8-16]
term:machine,[0-7]

...and that order is preserved in the final Passage -- leading 
DefaultPassageFormatter.format() to decide that the two matches in this 
Passage overlap (because the start of match#1 (machine[0-7]) is less than 
the end of match#0 (learning[8-16])) ... but they don't overlap, one is 
strictly before the other, so it winds up passing StringBuilder.append an 
end < start.



 * Has anyone seen any failures like this ?
 * Is this a bug in DefaultPassageFormatter's assumptions,
   or in the ordering produced by the OffsetEnum ?
 * Does anyone have a theory where/how the problem might have changed
   between 8.11 and 9.3 ?


-Hoss
http://www.lucidworks.com/



diff --git 
a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
 
b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
index 345e2b61316..c82362b5eac 100644
--- 
a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter

Re: Reproducible crash matching phrases

2021-02-10 Thread Chris Hostetter


: I'm attaching an updated file as well with these changes.
: 
: This happens in Lucene 8.8.0 (and probably since 8.4.0).

Ok -- cool ... with the updated code i was able to reproduce on branch_8x, 
and with 8.8 & 8.7 (but not 8.4) -- I've distilled your patch into a test 
case and attached it to a new jira...

https://issues.apache.org/jira/browse/LUCENE-9762

FYI: with this updated code the error *DOES* reproduce for me regardless 
of Directory type -- i suspect your original comment about it not failing 
if you used ByteBuffersDirectory was because that would have been a 
"clean" index every time, and the old code was only failing with your 
existing index on disk.

let's see if the folks with the low level expertise can figure out what's 
going wrong here.


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Reproducible crash matching phrases

2021-02-10 Thread Chris Hostetter


: I've been able to reproduce a crash we are seeing in our product with newer
: Lucene versions.

Can you be specific?  What exact versions of Lucene are you using that 
reproduces this failure?  If you know of other "older" versions where you 
can't reproduce the problem, that info would also be helpful...


I tried running your test code against the current branch_8x and was 
unable to trigger any sort of failure.  I also tried using 8.4.1 based on 
the stack trace indicating that you must be using a version of lucene no 
older than 8.4 given the codec in use -- and was also unable to reproduce 
any sort of problem.

Also note that as written your LuceneCrash code leaves an index on disk 
which is re-used the next time the code is run: does the problem reproduce 
for you if you manually "rm -r /tmp/xxx" and run it again, or is the 
problem specific to having some "cruft" documents left in the index from 
previous runs?  Can you zip up the contents of /tmp/xxx on your machine 
and attach it to a new jira?


: Interestingly, the bug does not happen if the index is created on a
: ByteBuffersDirectory.

That makes it seem like the bug might be filesystem specific -- what impl 
does the FSDirectory.open() call in your code return?



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: explainOther SOLR concept?

2019-06-27 Thread Chris Hostetter

: It’s a Solr-only param for adding to debug=true….

at the Lucene level it's just calling the explain() method on an arbitrary 
docId regardless of whether that doc appear in the topN results for that 
query (or if it matches the query at all)
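
(ie: roughly the following -- a sketch, assuming you already have an 
IndexSearcher, a parsed Query, and the internal docId you care about...

  // Explanation.isMatch() will be false if the doc doesn't match at all
  Explanation exp = searcher.explain(query, docId);
  System.out.println(exp.isMatch() + " : " + exp);

...which works even for docs outside the topN.)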


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Question about usage of LuceneTestCase

2018-08-27 Thread Chris Hostetter


: Current version of Luke supports FS based directory implementations only.
: (I think it will be better if future versions support non-FS based custom
: implementations, such as HdfsDirectoryFactory for users who need it.)
: Disabling the randomization, at least for now, sounds reasonable to me too.
: I'll try this way.

Be careful with this assumption...

The randomization of directory types isn't just about things like "let's 
try a RAM Dir"; it also includes things like "let's randomize a dir that 
simulates Windows filesystem quirks" -- stuff that would be very handy to 
test with a re-usable tool like Luke, where you expect users to run on a 
variety of platforms / filesystems.

i haven't looked closely into what exactly that "useFactory(null)" call 
does, but it's probably worth getting to the bottom of the failures and 
*IF* it's tied to some specific dir type or codec, using annotations to 
suppress them -- rather than just eliminating all directory randomization.
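
(ie: something like the following on the test class -- a sketch, the 
exact codec/filesystem names being whatever actually shows up in your 
failures...

  // hypothetical: suppress only the problematic randomizations,
  // keeping the rest of the Directory/codec randomization intact
  @LuceneTestCase.SuppressCodecs({"SimpleText", "Memory"})
  @LuceneTestCase.SuppressFileSystems({"WindowsFS"})
  public class MyLukeTest extends LuceneTestCase {
    // ... tests ...
  }
)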


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Practical usages of arbitrary Shingles when using a query parser?

2018-07-31 Thread Chris Hostetter


: The query parser is confused by these overlapping positions indeed, which
: it interprets as synonyms. I was going to write that you should set the

Sure -- i'm not blaming the QueryParser, what it does with the 
Shingles output makes sense (and actually works! .. just not as efficiently 
as possible).  I'm trying to figure out how to make the ShingleFilter 
output more useful in the query time analyzer usecase.

: it interprets as synonyms. I was going to write that you should set the
: same min and max shingle sizes at query time, but while writing that I
: realized that you probably wanted to keep outputing shorter shingles so
: that a phrase query on 2 terms with a max shingle size of 3 would still use

Yes exactly ... if at index time you output both unigrams and shingles of 
sizes 2-5, and at query time you have a "phrase" of only 2 words, ideally 
the filter should output a simple Token so you can make a single TermQuery 
-- likewise if you have a phrase of 3 words, or 4 words, or 5 words, 
those should ideally all produce single tokens.

Your suggestion of "same min & max at query time" where min=max=X is 
something i briefly considered, but that means you're only optimizing the 
"phrases" of length "X", all shorter phrases just use unigrams, and in 
fact there is no point in building shingles of any size other than X at 
index time.

: shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
: like 'outputShinglesOfTheMaximumSizeOnly'?

That's what i was thinking -- but i haven't dug into the code enough to 
understand how complex that would be. (i was starting with "Am i missing 
something about how/why this shouldn't/doesn't already exist?")

: For the record, in addition to the problems that you mentioned,
: ShingleFilter proved very hard to be fixed in order to work correctly on
: top of synonyms when X != Y[1], which encouraged Alan work on a new
: FixedShingleFilter[2] that deals with index-time synonyms (ie. ignores

Yeah ... i can't even imagine the complexity of dealing with "graph" based 
synonyms and shingles (didn't read your link for fear of my own sanity)

: position length) just fine but only allows X == Y. Also instead of feeding
: an analyzer with shingles to the query parser, we found it more
: user-friendly to add an option to text fields in order to index 2-shingles
: into a separate field and redirect phrase queries to it.[3] We did

Right ... i'm actually looking at a system now that puts uni-shingles, 
bi-shingles, and tri-shingles in 3 diff fields, and then pre-parses the 
input to figure out how long it is to decide which field to query ... i'm 
trying to simplify that.

Ideally what I'd like to be able to say is "give me a phrase; if the 
field is configured w/o any shingles at all it will work fine (via 
PhraseQuery), but if the analyzer is configured with shingles it will be 
even faster (via TermQuery) if/when the query phrase is "shorter" than 
the max shingle length."


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Practical usages of arbitrary Shingles when using a query parser?

2018-07-30 Thread Chris Hostetter



Although I've been aware of Shingles and some of their useful applications 
for a long time, today is the first time i really sat down and tried to do 
something non-trivial with them myself.


My objective seems relatively straightforward: given a corpus of text and 
some analyzer (for sake of discussion let's assume simple whitespace 
tokenization w/lowercasing) i want to be able to say "I am happy to trade 
index time/size for faster queries of shorter phrases"


So instead of just indexing "the quick brown fox jumped over the lazy dog" 
as a field with 9 terms, I might want to add ShingleFilterFactory to the 
end of my analyzer using [[minShingleSize="2" maxShingleSize="2" 
outputUnigrams="true"]] and now I have a field w/17 terms, but if I get a 
query for a "phrase" of 2 words/terms, i should in theory be able to just 
use a TermQuery under the covers -- making it just as "fast" as a query for 
a single word/term.  But meanwhile longer phrases should still "just work" 
as if i didn't have any shingles.
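
(fwiw: in straight lucene-java terms that index-time chain is roughly the 
following -- a sketch using the modern no-arg analysis constructors...

  Analyzer indexAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer src = new WhitespaceTokenizer();
      TokenStream tok = new LowerCaseFilter(src);
      ShingleFilter shingles = new ShingleFilter(tok, 2, 2); // min=2, max=2
      shingles.setOutputUnigrams(true);
      return new TokenStreamComponents(src, shingles);
    }
  };
)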


So far so good...

If I actually index a corpus as described above, and then at query time I 
use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2" 
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the 
expected TermQuery for either a single word input or two-word input ... 
for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one 
composed of bi-shingles instead of individual unigrams, but AFAICT the 
position info is set correctly so that it will only match the documents 
that would have been matched w/o any shingles (and IIUC the term stats 
for the shingles seem like they should probably result in subjectively 
"better" scores? not certain on this bit, but also not overly concerned 
about it)


The problem is that (unless I'm missing something) this doesn't really 
work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.


If i change my index time ShingleFilterFactory to use [[minShingleSize="2" 
maxShingleSize="N" outputUnigrams="true"]], the equivalent change to the 
query time analyzer would be [[minShingleSize="2" maxShingleSize="N" 
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while 
that does seem to cause "phrase" input of all sizes to be converted by the 
analyzer+QueryParser into a query that (AFAICT) will match the correct 
documents (compared to using no shingles), it's only "optimized" as a 
TermQuery for one & two word phrases.  For input phrases longer than 2 
terms it generates a SpanOrQuery wrapping multiple SpanNearQueries, 
i believe because of the overlapping positions of the bi/tri/quad-etc.. 
shingles.


There just doesn't seem to be any good/generic way to leverage a field 
built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]] 
(where X != Y) at query time using a QueryParser configured with out of 
the box analyzer components.


It seems like what's missing is a ShingleFilter(Factory) configuration 
that means "output the maximum possible shingle size between MIN and 
MAX based on the size of the input stream" ... but that doesn't seem to 
exist.


Does anyone have any advice/suggestions on how to approach this type of 
problem based on their own experiences?  Does anyone have first hand 
experience using maxShingleSize > 2 with a QueryParser (and w/o any 
preconceived assumptions about the length of the input)?



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Size of Document

2018-07-05 Thread Chris Hostetter


: Subject: Size of Document
: To: java-user@lucene.apache.org
: References:
: 
:  
: Message-ID: 
: In-Reply-To:
: 

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.




-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ClassicAnalyzer Behavior on accent character

2017-10-26 Thread Chris Hostetter


Classic is ... "classic" ... it exists largely for historical purposes to 
provide a tokenizer that does exactly what the javadocs say it does 
(regarding punctuation, "product numbers", and email addresses), so that 
people who depend on that behavior can continue to rely on it.

Standard is ... "standard" ... it implements the Unicode Standard text 
segmentation rules.


: Date: Fri, 20 Oct 2017 18:58:35 +0530
: From: Chitra 
: Reply-To: java-user@lucene.apache.org
: To: Lucene Users 
: Subject: Re: ClassicAnalyzer Behavior on accent character
: 
: Hi,
:  I found the difference and understand the behavior of both
: tokenizers appropriately.
: 
: Could you please suggest me which one is the better to use
: ClassicTokenizer/StandardTokenizer?
: 
: -- 
: Regards,
: Chitra
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: payload at the document level

2017-10-05 Thread Chris Hostetter

what you're describing is essentially just DocValues -- for each document, 
you can have an arbitrary byte[] (or number, or sorted list of numbers), 
and you could write a custom query/similarity/collector that can access 
that "docvalue" at search time to decide if it's a match (or how to score 
it)
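
(a sketch of what that might look like with a BinaryDocValuesField -- the 
field name is made up, and this uses the lucene 7.x iterator-style 
DocValues API...

  // index time: one arbitrary byte[] "payload" per document
  Document doc = new Document();
  doc.add(new BinaryDocValuesField("payload", new BytesRef(myBytes)));

  // search time, per segment: look it up by docID
  BinaryDocValues dv = leafReader.getBinaryDocValues("payload");
  if (dv.advanceExact(docID)) {
    BytesRef payload = dv.binaryValue();
    // ... use payload in your custom matching/scoring ...
  }
)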




: Date: Thu, 5 Oct 2017 14:15:01 -0700
: From: Lisheng Zhang 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: payload at the document level
: 
: Hi, I understand payload should be associated with an indexed term, but i
: remembered long time ago it was suggested that we should have payload at
: the document level (for whole document), such that we can get payload by
: docID only.
: 
: Do we have this feature implemented in lucene 7 (i searched doc & code and
: could not find)?
: 
: Thanks very much for helps, Lisheng
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: LongPoint.newRangeQuery results differ from LegacyNumericRangeQuery.newLongRange

2017-07-24 Thread Chris Hostetter

The Points data structures are completely different and distinct 
from the Term Index structures used by LegacyNumeric fields -- just having 
the backwards codec (or using merges to convert indexes to the new index 
format) isn't enough -- you have to reindex.
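
(ie: after reindexing, the Points equivalent is roughly the following -- 
a sketch with a hypothetical field...

  // index time
  doc.add(new LongPoint("price", 42L));

  // query time
  Query q = LongPoint.newRangeQuery("price", 0L, 100L);
)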



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Changing the default FSLockFactory implementation

2017-05-31 Thread Chris Hostetter


: We are experiencing some “Lock obtain timed out: NativeFSLock@” issues 
: on or NFS file system, could someone please show me, what’s the right 
: way to switch the Lucene default NativeFSLockFactory to 
: SimpleFSLockFactory?

You can specify the LockFactory used when opening your Directory...

http://lucene.apache.org/core/6_5_0/core/org/apache/lucene/store/FSDirectory.html#open-java.nio.file.Path-org.apache.lucene.store.LockFactory-
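
...ie, roughly the following (a sketch, substituting your own index Path):

  Directory dir = FSDirectory.open(Paths.get("/path/to/index"),
                                   SimpleFSLockFactory.INSTANCE);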




-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: Un-used index files are not getting released

2017-05-11 Thread Chris Hostetter

: We do not open any IndexReader explicitly. We keep one instance on 
: IndexWriter open (and never close) and for searching we use 
: SearcherManager. I checked the lsof and did not find any files with 
: delete status.

what exactly does your SearcherManager usage look like?  is every 
searcher = acquire() associated with a corresponding release(searcher) ?
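
(the canonical pattern being -- a sketch...

  IndexSearcher s = searcherManager.acquire();
  try {
    // ... do *all* searching against 's' in here ...
  } finally {
    searcherManager.release(s);
    s = null; // never touch the searcher after release
  }
)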


: Following is the output of lsof for lucene1:
: 0;lucene@lidxnj39:~[lucene@lidxnj39 ~]$ /usr/sbin/lsof | grep lucene1
...
: But when I get the number of files in that index folder using java 
: (File.listFiles()) it lists 1761 files in that folder. This count goes 
: down to a double digit number when I restart the tomcat.

If the JVM/Lucene had the file open, then lsof should list it -- the fact 
that your lsof list (essentially) matches your "ls -l" (accounting for a 
few files that IndexWriter may have deleted but an active searcher may be 
using) seems to suggest everything is working fine ... since only 
File.listFiles() disagrees, that has me fairly suspicious of the java 
code you have using File.listFiles().

what is the full list of file names File.listFiles() returns?

as someone else asked: what does "ls -al" on that dir return at the same 
time as your File.listFiles() call?



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: will lucene traverse all segments to search a 'primary key'term or will it stop as soon as it get one?

2017-04-21 Thread Chris Hostetter
: Lucene by default will search all segments, because it does not know that
: your field is a primary key.
: 
: Trejkaz's suggestion to early-terminate should work well.  You could also
: write custom code that uses TermsEnum on each segment.

Before you go too far down the rabbit hole of writing any custom code, 
make sure to do some experiments and actually measure the performance of 
a uniqueKey lookup using a simple needsScores=false search ... the way 
TermQuery works across segments is very low cost for the segments 
where the Term doesn't exist in any docs at all.
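
(ie: measure something as simple as the following -- a sketch, with a 
hypothetical "id" field...

  // top-1 lookup; scoring cost is negligible for a unique term
  TopDocs hits = searcher.search(new TermQuery(new Term("id", key)), 1);
  if (hits.totalHits > 0) {
    Document d = searcher.doc(hits.scoreDocs[0].doc);
  }

...before assuming you need anything fancier.)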


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to get document effectively. or FieldCache example

2017-04-21 Thread Chris Hostetter

: then which one is right tool for text searching in files. please can you
: suggest me?

so far all you've done is show us your *indexing* code; and said that 
after you do a search, calling searcher.doc(docid) on 500,000 documents is 
slow.

But you still haven't described the usecase you are trying to solve -- ie: 
*WHY* do you want these 500,000 results from your search? Once you get 
those Documents back, *WHAT* are you going to do with them?

If you show us some code, and talk us through your goal, then we can help 
you -- otherwise all we can do is warn you that the specific 
searcher.doc(docid) API isn't designed to be efficient at that large a 
scale.  Other APIs in Lucene are designed to be efficient at large scale, 
but we don't really know what to suggest w/o knowing what you're trying to 
do...

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


PS: please, Please PLEASE upgrade to Lucene 6.x.  3.6 is more than 5 years 
old, and completely unsupported -- any advice you are given on this list 
is likely to refer to APIs that are completely different than the version 
of Lucene you are working with.


: 
: 
: On Fri, Apr 21, 2017 at 2:01 PM, Adrien Grand  wrote:
: 
: > Lucene is not designed for retrieving that many results. What are you doing
: > with those 5 lacs documents, I suspect this is too much to display so you
: > probably perform some computations on them? If so maybe you could move them
: > to Lucene using eg. facets? If that does not work, I'm afraid that Lucene
: > is not the right tool for your problem.
: >
: > Le ven. 21 avr. 2017 à 08:56, neeraj shah  a
: > écrit :
: >
: > > Yes I fetching around 5 lacs result from index searcher.
: > > Also i am indexing each line of each file because while searching i need
: > > all the lines of a file which has matched term.
: > > Please tell me am i doing it right.
: > > {code}
: > >
: > > InputStream  is = new BufferedInputStream(new FileInputStream(file));
: > > BufferedReader bufr = new BufferedReader(new InputStreamReader(is));
: > > String inputLine="" ;
: > >
: > > while((inputLine=bufr.readLine())!=null ){
: > > Document doc = new Document();
: > > doc.add(new
: > >
: > > Field("contents",inputLine,Field.Store.YES,Field.Index.
: > ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
: > > doc.add(new
: > > Field("title",section,Field.Store.YES,Field.Index.NOT_ANALYZED));
: > > String newRem = new String(rem);
: > >
: > > doc.add(new
: > > Field("fieldsort",newRem,Field.Store.YES,Field.Index.ANALYZED));
: > > doc.add(new Field("fieldsort2",rem.toLowerCase().replaceAll("-",
: > > "").replaceAll(" ", ""),Field.Store.YES,Field.Index.ANALYZED));
: > >
: > > doc.add(new
: > > Field("field1",Author,Field.Store.YES,Field.Index.NOT_ANALYZED));
: > > doc.add(new
: > > Field("field2",Book,Field.Store.YES,Field.Index.NOT_ANALYZED));
: > > doc.add(new
: > > Field("field3",sec,Field.Store.YES,Field.Index.NOT_ANALYZED));
: > >
: > > writer.addDocument(doc);
: > >
: > > }
: > > is.close();
: > >
: > > {/code}
: > >
: > > On Thu, Apr 20, 2017 at 5:57 PM, Adrien Grand  wrote:
: > >
: > > > IndexSearcher.doc is the right way to retrieve documents. If this is
: > > > slowing things down for you, I'm wondering that you might be fetching
: > too
: > > > many results?
: > > >
: > > > Le jeu. 20 avr. 2017 à 14:16, neeraj shah  a
: > > > écrit :
: > > >
: > > > > Hello Everyone,
: > > > >
: > > > > I am using Lucene 3.6. I have to index around 60k docuemnts. After
: > > > > performing the search when i try to reterive documents from seacher
: > > using
: > > > > searcher.doc(docid)  it slows down the search .
: > > > > Please is there any other way to get the document.
: > > > >
: > > > > Also if anyone can give me an end-to-end example for working
: > > FieldCache.
: > > > > While implementing the cache i have :
: > > > >
: > > > > int[] fieldIds = FieldCache.DEFAULT.getInts(indexMultiReader, "id");
: > > > >
: > > > > now i dont know how to further use the fieldIds for improving search.
: > > > > Please give me an end-to-end example.
: > > > >
: > > > > Thanks
: > > > > Neeraj
: > > > >
: > > >
: > >
: >
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Automata and Transducer on Lucene 6

2017-04-19 Thread Chris Hostetter

: pairs). It is this kind of
: "high-level goal" I asked about. Your answer only adds to the mystery:

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: question

2017-01-19 Thread Chris Hostetter

: Yes, they should be the same unless the field is indexed with shingles, in 
that case order matters.
: Markus 

just to clarify...

The examples provided show *strings* which would have to 
be parsed into Query objects by a query parser.

the *default* QueryParser will produce queries that result in the answers 
to your questions being "yes"

But it is fairly trivial to tweak/extend the query parser to produce 
diff behavior.

Examples...

you could use SpanNearQuery with order enforced to make the two 
queries in #1 match completely diff documents.

You could force the use of PhraseQuery or SpanNearQuery in all queries, 
such that the answer to #3 is "no", each pair of queries (with 
terms in same order) would match the same set of docs.  Conversely, you 
could make those queries always use BooleanQuery with SHOULD clauses so 
all 4 queries would match the same (much larger) set of docs.

...It's entirely up to the query parsing code to decide what types of 
queries to produce from those inputs.
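
(for example, the ordered SpanNearQuery mentioned in #1 would look 
roughly like this -- a sketch, field name made up...

  SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("body", "sas")),
    new SpanTermQuery(new Term("body", "institute"))
  };
  // slop=0, inOrder=true: "sas institute" matches, "institute sas" doesn't
  Query q = new SpanNearQuery(clauses, 0, true);
)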

:  
: -Original message-
: > From:Julius Kravjar 
: > Sent: Monday 16th January 2017 18:20
: > To: java-user@lucene.apache.org
: > Subject: question
: > 
: > May I have one question? One company - we used their sw - talked to us that
: > in Lucene it is normal that the search results for
: > 
: > 1.
: > "sas institute"
: > "institute sas"
: > are the same.
: > 
: > 2.
: > sas institute
: > institute sas
: > are the same
: > 
: > 3.
: > the number of searches of "sas institute" is smaller then sas institute
: > (analogically "institute sas" is smaller then institute sas
: > 
: > 
: > 
: > Should we believe them? Manythanks in advance.
: > 
: > Best regards
: > 
: > J. Kravjar
: > 
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Where did earthDiameter go?

2017-01-12 Thread Chris Hostetter

I don't know the rhyme/reason but it looks like it was removed (w/o 
any deprecation first i guess) as part of LUCENE-7123 in commit: 
ce3114233bdc45e71a315cb6ece64475d2d6b1d4

in that commit, existing callers in the lucene code base were changed to 
use "2 * GeoProjectionUtils.SEMIMAJOR_AXIS"

(once the jira outage is over you might find more info in the comments of 
that issue)
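
(so in the snippet you quote below, the drop-in change would presumably be...

  // was: double D = SloppyMath.earthDiameter(lat);
  double D = 2 * GeoProjectionUtils.SEMIMAJOR_AXIS;

...though note the old method varied with latitude, while the constant is 
just the equatorial approximation.)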


: Date: Thu, 12 Jan 2017 12:00:31 +1100
: From: Trejkaz 
: Reply-To: java-user@lucene.apache.org
: To: Lucene Users Mailing List 
: Subject: Where did earthDiameter go?
: 
: Hi.
: 
: I don't know why, but we have some kind of esoteric logic in our own
: code to simplify a circle on the Earth to a bounding box, clearly
: something to do with computing geo queries.
: 
: double lonMin = -180.0, lonMax = 180.0;
: if (!closeToPole(latMin, latMax)) {
:double D = SloppyMath.earthDiameter(lat);
: double d = D * Math.sin((90.0 - lat) * Math.PI / 180.0); //
: diameter of a disk formed by parallel at latitude = lat
:double kmPerLonDeg = Math.PI * d / 360.0;
: double distanceInLonDeg = distanceKm / kmPerLonDeg;
:lonMin = lon - distanceInLonDeg;
: lonMax = lon + distanceInLonDeg;
: }
: 
: This SloppyMath.earthDiameter(latitude) method appears to be gone in
: v6.3.0 but I don't see any mention of a replacement in the changelog.
: Is there a replacement? Do I just slot in a constant and hope that
: nobody notices? I mean, if the maths are supposed to be "sloppy"... :D
: 
: TX
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem sorting long integers

2016-12-13 Thread Chris Hostetter


How are you constructing your SortField at query time?

Are you sure you are using SortField.Type.LONG ?

Can you show us some minimal, self-contained, reproducible code 
demonstrating your problem? (ie: create an index with 2 docs, then do 
a simple search for both and sort them and show that the order is wrong)
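
(for the indexing code you show below, the query-side sort should look 
roughly like this -- a sketch...

  Sort sort = new Sort(new SortField(SORT_FIELD_PREFIX + name, SortField.Type.LONG));
  TopDocs hits = searcher.search(query, 10, sort);

...if you're accidentally using SortField.Type.INT there, that would 
explain the modulo 2^32 behavior you describe.)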


: Date: Tue, 13 Dec 2016 18:30:09 +0100
: From: Jaime 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Problem sorting long integers
: 
: Hello,
: 
: With Lucene 6.1.0, I'm trying to search sorting the results by a long column.
: This is my indexing code:
: 
: doc.add(new LongPoint(name, Long.parseLong(value)));
: doc.add(new StoredField(name, value));
: doc.add(new NumericDocValuesField(SORT_FIELD_PREFIX + name,
: Long.parseLong(value)));
: 
: SORT_FIELD_PREFIX is a prefix I use for sorting fields.
: My problem is, the sorting is being made modulo 4G (2^32) (and interpreted as
: signed values).
: So, first values are 2G, 4G, 6G, 8G, etc. Then would come 2G +1, 4G +1, 6G +1
: and so on.
: (I.e. only the 4 least significant bytes are taken into account).
: 
: Is this a bug, or there is some problem with my code?
: 
: Best Regards.
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How exclude empty fields?

2016-11-16 Thread Chris Hostetter
: The issue I have is that some promotions are permanent so they don't have
: an endDate set.
: 
: I tried doing:
: 
: ( +Promotion.endDate:[210100TOvariable containing yesterday's date]
: || -Promotion.endDate:* )

1) mixing prefix ops with "||" like this is most certainly not doing what 
you think...

https://lucidworks.com/blog/why-not-and-or-and-not/

2) combine that with Ahmet's point about needing a "MatchAllDocsQuery" to 
"select all docs" from which you can then "exclude docs with an endDate" 
to give you the final results of "docs w/o an endDate" ...

BooleanQuery(
  Should(NumericRangeQuery("endDate:[X TO Y]"))
  Should(BooleanQuery(
Must(MatchAllDocsQuery())
MustNot(FieldValueQuery("endDate"))
  ))
)

...either that, or index a new boolean field "permanent" and then simplify 
your query to basically just be "endDate:[X TO Y] OR permanent:true"
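
(as actual lucene-java code that first structure is roughly the following 
-- a sketch, assuming a points-based endDate field; swap in whatever range 
query matches your actual field type...

  Query hasEnded = LongPoint.newRangeQuery("endDate", X, Y); // X/Y = your date bounds
  Query noEndDate = new BooleanQuery.Builder()
      .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
      .add(new FieldValueQuery("endDate"), BooleanClause.Occur.MUST_NOT)
      .build();
  Query q = new BooleanQuery.Builder()
      .add(hasEnded, BooleanClause.Occur.SHOULD)
      .add(noEndDate, BooleanClause.Occur.SHOULD)
      .build();
)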







-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Unsubscribing problems

2016-09-07 Thread Chris Hostetter

Peyman: I'll contact you off list to try and address your specific 
problem.

As a general reminder for all users: If you need help with the mailing 
list, step #1 should be to email the automated help system via 
java-user-help@lucene (identified in the Mailing-List and List-Help MIME 
headers of every email)

The automated response will then point you to java-user-owner@lucene to 
contact the human list moderators if you still need additional help.

FWIW: Some additional helpful tips can be found here: 
https://wiki.apache.org/solr/Unsubscribing%20from%20mailing%20lists


: Date: Wed, 7 Sep 2016 11:11:52 -0400
: From: Robust Links 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Unsubscribing problems
: 
: Hi
: 
: I am not sure who to report this to but I have tried to unsubscribe from 
lucene lists (including java-user@lucene.apache.org) without success many times 
now. I have sent an unsubscribe email to all of the list servers on this page, 
with no bounces. 
: 
: https://lucene.apache.org/core/discussion.html 

: 
: I am however still receiving emails. How does one unsubscribe?
: 
: thank you
: 
: Peyman

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BooleanQuery rewrite optimization

2016-08-08 Thread Chris Hostetter

Off the top of my head, i think any optimization like that would also need 
to account for minNrShouldMatch, wouldn't it?

if your query is "(X Y Z #X)" w/minshouldmatch=2, and you rewrite that 
query to "(+X Y Z)" w/minshouldmatch=2 you now have a semantically diff 
query that won't match as many documents as the original.

in that example, you could decrement minshouldmatch (=1) ... but i'm not 
sure if that holds as a general rule for all possible permutations/values 
... i'd have to think about it.

An interesting edge case to think about is "(X X Y #X)" w/minshouldmatch=2 
... pretty sure that would give you very diff scores if you rewrote it to 
"(+X X Y)" (or "(+X Y)") w/minshouldmatch=1



: Hello all, I noticed while debugging a query that BooleanQuery will 
: rewrite itself to remove FILTER clauses that are also MUST as an 
: optimization/simplification, which makes total sense. So (+f:x #f:x) 
: will become (+f:x). However, shouldn't there also be another 
: optimization to remove FILTER clauses that are also SHOULD, while 
: converting them to MUST? So, for eg. query (f:x #f:x) will become 
: (+f:x). I did an initial simple implementation and the tests seem to 
: pass. Are there any cases where this does not hold? 
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: disable field length normalization on specific fields?

2016-03-28 Thread Chris Hostetter

yep, just use a customized similarity that doesn't include a length factor 
when computing the norm.

If you are currently using TFIDFSimilarity (or one of its subclasses) 
then the computeNorm method delegates to a lengthNorm method, and you 
can override that to return "1" for fields with a certain name regardless 
of the length.

If you are currently using something else -- like BM25Similarity perhaps 
-- you'll probably have to override the computeNorm method and 
write a slightly longer calculation based on whatever logic is in the 
computeNorm method you are currently using -- look for usages of 
FieldInvertState.getLength() and remove/replace that with a fixed value.
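
(a minimal sketch of the first case, assuming the 5.x-era ClassicSimilarity 
and a hypothetical set of field names to skip...

  public class NoLengthNormSimilarity extends ClassicSimilarity {
    private final java.util.Set<String> noNormFields; // your field names

    public NoLengthNormSimilarity(java.util.Set<String> noNormFields) {
      this.noNormFields = noNormFields;
    }

    @Override
    public float lengthNorm(FieldInvertState state) {
      if (noNormFields.contains(state.getName())) {
        return state.getBoost(); // keep the index-time boost, ignore length
      }
      return super.lengthNorm(state);
    }
  }
)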




: Date: Wed, 9 Mar 2016 13:23:16 -0500
: From: Matt Savona 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: disable field length normalization on specific fields?
: 
: Hi all,
: 
: I am trying to understand if the following is possible:
: 
: I would like to have several fields in my index which are boosted at index
: time. Because they are to be boosted at index time, their field type
: requires omitNorms(false).
: 
: However, I do not want field length normalization to affect the scoring of
: these fields. For example, finding the term 'baseball' (1:5 words) should
: score exactly the same as (1:100 words).
: 
: There are other fields in my index which are not boosted, so
: omitNorms(true) is acceptable on them. However, I do not want to broadly
: disable length normalization on every single field (I have at least one
: where I require it). Thus, I am not certain a custom Similarity class is
: appropriate.
: 
: Is it possible to simply disable length normalization on a a field-by-field
: basis, while still allowing index-time boosting?
: 
: Thank you in advance!
: 
: - Matt
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 500 millions document for loop.

2015-11-15 Thread Chris Hostetter

:   public void collect(int docID) throws IOException {
:     Document doc = indexSearcher.doc(docID, loadFields);
:     found.found(doc);
:   }

Based on your description of the calculation you are doing on all of these 
docs, you will probably find using DocValues on the "to" field and using 
that in your calculations will be a lot faster than dealing with the 
StoredFields...
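
(ie: something shaped like this inside your collector -- a sketch using 
the 5.x-era random access DocValues API...

  // per-segment setup, instead of indexSearcher.doc(...):
  SortedDocValues toDV = context.reader().getSortedDocValues("to");

  public void collect(int docID) throws IOException {
    BytesRef to = toDV.get(docID);
    // ... accumulate your "to" stats directly, no stored field load ...
  }
)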

: >> We have ~10 indexes for 500M documents, each document
: >> has «archive date», and «to» address, one of our task is
: >> calculate statistics of «to» for last year. Right now we are
: >> using search archive_date:(current_date - 1 year) and paginate


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Lucene 4.x -> 5.x: Converting FieldValueFilter to FieldValueQuery

2015-11-05 Thread Chris Hostetter

: > The fact that you need to index doc values is related to another change in
: > which we removed Lucene's FieldCache and now recommend to use doc values
: > instead. Until you reindex with doc values, you can temporarily use
: > UninvertingReader[1] to have the same behaviour as in Lucene 4.x.
: 
: Is indexing doc values vs. UninvertingReader a space/time tradeoff? Or
: is the general recommendation to index doc values and only resort to an
: UninvertingReader when a full 4.x to 5.x conversion is too costly in
: terms of dev. hours?

Using docValues is going to be better for almost every use case -- it adds 
additional disk space (and some additional time spent building the index) 
but saves a large amount of RAM + time when opening IndexSearchers.

It is essentially a more efficient, on disk, version of FieldCache, 
constructed when building the index instead of "on the fly" the first 
time it's used.
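
(the temporary bridge looks roughly like this -- a sketch, with a 
hypothetical numeric field...

  Map<String, UninvertingReader.Type> mapping = new HashMap<>();
  mapping.put("price", UninvertingReader.Type.LONG);
  DirectoryReader reader =
      UninvertingReader.wrap(DirectoryReader.open(dir), mapping);
)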



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: sizes of non-fdt files affected by compression settings

2015-11-04 Thread Chris Hostetter

: This setting can only affect the size of the fdt (and fdx) files. I suspect
: you saw differences in the size of other files because it caused Lucene to
: run different merges (because segments had different sizes), and the
: compression that we use for postings/terms worked better, but it could have
: been the other way as well.

You can check the number of documents in each segment to verify Adrien's 
comments.

If you want to do a true "apples to apples" comparison on just the impacts 
of stored field compression, choose something like the NoMergePolicy or 
LogDocMergePolicy for your test to ensure that the number of documents per 
segment are not impacted by the size (in bytes) of any of the files in 
those segments.
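
(ie: roughly the following -- a sketch against 5.2.1...

  IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
  iwc.setCodec(new Lucene50Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
  iwc.setMergePolicy(new LogDocMergePolicy()); // doc-count based, size independent

...so the per-segment doc counts stay identical across both runs.)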


: > Hello,
: >
: > I'm experimenting with Lucene 5.2.1 and I see something I cannot find an
: > easy explanation for in the api docs.
: > Depending on whether I pick BEST_COMPRESSION or BEST_SPEED mode for
: > StoredFieldsFormat almost all files become smaller for BEST_COMPRESSION
: > mode. I expected only .fdt files to be smaller but for some reason the
: > following file types also shrink very significantly:
: > .fdx, .doc, .pos. Term dictionary (.tim) also gets smaller though not as
: > significantly. Weirdly enough .tip becomes a little bigger for the best
: > compressions setting.
: > Index contained about 10M small (~300 bytes each) text docs.
: >
: > I guess I could go through the code myself to understand this but may be
: > someone can shed some light on this.
: >
: > Thanks!
: >
: > Anton
: >
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Pagination using searchAfter

2015-09-04 Thread Chris Hostetter

: I want to use the searchAfter API in IndexSearcher. This API takes ScoreDoc as
: argument. Do we need to store the last ScoreDoc value (ScoreDoc value from
: previous search)? When multiple users perform search, then it might be
: difficult to store the last ScoreDoc value.
: 
: I guess, docid value is not valid when the IndexReader is reopened. We have
: multiple users simultaneously querying the index. Every minute the IndexReader
: will be reopened.

1) yes, you do have to keep track of the ScoreDoc/FieldDoc of the last 
result on the last page in order to use searchAfter as designed.

2) the ScoreDoc objects actually do include the docid, and as such you 
can't reliably use a ScoreDoc from an older reader when doing a 
searchAfter with a newer reader -- the design assumes you keep a 
consistent searcher for each "user"

3) one approach you can take is to serialize/deserialize all of the 
necessary info contained in the ScoreDoc to capture all of the 
information about the relative position in the sorted doc set, and track 
this per user.  This approach can work even with re-opened readers as long 
as the docid is irrelevant -- in order to ensure this, you must guarantee 
that perfect "ties" are impossible (ie: your final sort criteria must 
produce a unique value for every document)

This last suggestion is how Solr deals with searchAfter using the "cursor" 
API; the code used in Solr may be relevant for you to implement your own 
solution using lucene-java directly

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
https://issues.apache.org/jira/browse/SOLR-5463
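
(for reference, the basic lucene-java pattern is -- a sketch...

  TopDocs page1 = searcher.search(query, 10, sort);
  // a FieldDoc when sorting; this is the per-user state you must track:
  ScoreDoc last = page1.scoreDocs[page1.scoreDocs.length - 1];
  TopDocs page2 = searcher.searchAfter(last, query, 10, sort);
)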

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting a proper ID value into every document

2015-06-05 Thread Chris Hostetter

: If you cannot do this for whatever reason, I vaguely remember someone
: posting a link to a program they'd put together to do this for a
: docValues field, you'd have to search the archives to find it.

It was Toke - he generated DocValues for an existing index by writing an 
IndexReader Filter that would "fake" the DocValues based on the value of a 
Stored field, and used that fake for a one time "copy" of the IndexReader 
into a new index (where the docValues were written "for real")...

https://sbdevel.wordpress.com/2014/12/15/changing-field-type-in-lucenesolr/
https://github.com/netarchivesuite/dvenabler

...if all you care about is generating *some* uniqueId field value for 
each doc, you could do the same thing fairly efficiently, skipping the 
stored field reading and just using a global counter.


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: multi valued facets

2015-06-04 Thread Chris Hostetter

: Set the field to multiValued="true" in your schema.  How'd you manage to 
: get multiple values in there without an indexing error?  An existing 
: index built with Lucene directly?

Erik: this isn't a Solr question -- the error message mentioned comes from 
the lucene/facets FacetsConfig class.

: > I am trying to add a facet for which each document can have multiple 
values, but am receiving the following exception:
: > dimension "Role Name" is not multiValued, but it appears more than once in 
this document
: > 
: > How do I create a MultiValued Facet?
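
...ie, the fix on the lucene/facets side is roughly:

  FacetsConfig config = new FacetsConfig();
  config.setMultiValued("Role Name", true);
  // then build docs via config.build(taxoWriter, doc) as usual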


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Specifying a Version vs. not specifying a Version

2015-05-29 Thread Chris Hostetter

: Now StandardTokenizer(Version, Reader) is deprecated and the docs say
: to use StandardTokenizer(Reader) instead. But I can't do that, because
: that constructor hardcodes Version.LATEST, which will break backwards
: compatibility in the future (its Javadoc even confirms that this is
: the case.)

Welcome to the wonderful world of "Lucene Analysis Compatibility 
Post-LUCENE-5859" !

https://issues.apache.org/jira/browse/LUCENE-5859

If you have a strong stomach, reading that issue might help you understand 
some of the arguments made for/against pruning "Version" from much of the 
analysis APIs -- I can't honestly tell you what the final 
decision/justification for the current state of things is (or even if the 
issue comments reflect it, whatever it was -- or if the final decision was 
made in a diff jira) because I stopped following the issue once I started 
being personally attacked for making technical arguments.


My best understanding based on what I see in the current code, is that if 
you care about backcompat:
 
 * you must call setVersion() on any *Analyzer* instances you construct 
before using them
 * you must *not* construct Tokenizers or TokenFilters directly -- 
instead you must use the corresponding Factory and pass the 
LUCENE_MATCH_VERSION_PARAM to request an instance.

the Analyzers and Factories are now the only things that worry about 
Version semantics, and instantiate diff classes or call diff methods on 
the objects they instantiate, as needed to "best fit" the Version you have 
specified.
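
(concretely, something like the following -- a sketch, the version 
numbers being whatever you need to stay compatible with...

  // (1) Analyzer instances:
  Analyzer a = new StandardAnalyzer();
  a.setVersion(Version.LUCENE_5_0_0);

  // (2) factories instead of direct Tokenizer/TokenFilter ctors:
  Map<String, String> args = new HashMap<>();
  args.put(AbstractAnalysisFactory.LUCENE_MATCH_VERSION_PARAM, "5.0.0");
  TokenizerFactory f = new StandardTokenizerFactory(args);
)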



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Chris Hostetter

: I already know how Object#clone() works:

May I humbly suggest that you: a) relax a bit; b) keep reading the 
rest of the javadocs for that method?

: As BytesRef#clone() is overriding Object#clone(), I expect it to
: comply with that.

BytesRef#clone() functions virtually identically to the way Object#clone 
functions, as noted in the *remainder* of the method javadocs you 
quoted...

https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#clone%28%29

>> ...this method creates a new instance of the class of this object and 
>> initializes all its fields with exactly the contents of the corresponding 
>> fields of this object, as if by assignment; the contents of the fields 
>> are not themselves cloned. Thus, this method performs a "shallow copy" 
>> of this object, not a "deep copy" operation.

...regardless of what convention is *suggested*, BytesRef behaves in the 
manner consistent with Object#clone, and in the manner consistent with the 
*purpose* of the class (to be a *Reference*) and in the manner that gives 
the most flexibility for its use -- if it did a deep clone by default, we 
might as well just use byte[] everywhere in the API.

: Essentially, BytesRef
: cannot be used as an object.
...
: If Lucene wanted a shallow clone, it could have added shallowClone() while

If you really feel strongly about this, and want to advocate for more 
consistency around the meaning/implementation of "clone()" in Java APIs, 
I suggest you take it up with the OpenJDK project, and focus on a more 
high visibility, widely used (in the Java community as a whole) 
class, such as ArrayList, HashSet, or HashMap etc...  (all of which 
implement Cloneable, and all of which are documented as implementing 
Object#clone by making a shallow copy)

https://docs.oracle.com/javase/7/docs/api/java/util/ArrayList.html#clone%28%29
https://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html#clone%28%29
https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html#clone%28%29
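
If what you want is a deep copy, the class already provides one explicitly 
-- a quick demonstration of the difference:

  byte[] data = new byte[] { 1, 2, 3 };
  BytesRef ref = new BytesRef(data);
  BytesRef shallow = ref.clone();            // shares the same backing byte[]
  BytesRef deep = BytesRef.deepCopyOf(ref);  // actually copies the bytes
  data[0] = 42;
  // now shallow.bytes[0] == 42, but deep.bytes[0] is still 1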


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene/Solr Revolution 2015 - Austin Oct 13-16 - CFP ends next Week

2015-04-30 Thread Chris Hostetter


(cross posted, please confine any replies to general@lucene)

A quick reminder and/or heads up for those who haven't heard yet: this 
year's Lucene/Solr Revolution is happening in Austin, Texas in October.  The 
CFP and Early Bird registration are currently open.  (CFP ends May 8, 
Early Bird ends May 31)


http://lucenerevolution.org/

More details below...

- - -

Are you a developer, business practitioner, data scientist, or Solr 
enthusiast doing something interesting with Lucene/Solr? The last day to 
submit your proposal for Lucene/Solr Revolution 2015 is May 8. Don't miss 
your chance to represent the Solr community by speaking at this year's 
conference.


Last year, speakers from companies like Twitter, Airbnb, and Bloomberg 
shared how they are using Lucene and Solr to solve complex business 
problems and build mission-critical apps. If you are doing something 
innovative with Lucene/Solr and other open source tools, have best 
practices insight at any level, or just have something cool to share that 
is Solr-related, we want to hear from you!


Call for Papers is open through May 8. Submit your proposal now.


Not submitting a talk this year but still want to attend? Save up to $500 
on conference registration packages when you register by May 31.


Stay up to date on everything Revolution by following us on Twitter 
@lucenesolrrev or joining us on Facebook.




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene indexing speed on NVMe drive

2015-04-30 Thread Chris Hostetter

: Hi. I am studying Lucene performance and in particular how it benefits from 
faster I/O such as SSD and NVMe.

: parameters as used in nightlyBench. (Hardware: Intel Xeon, 2.5GHz, 20 
: processor ,40 with hyperthreading, 64G Memory) and study indexing speed 
...
: I get best performance (200GB/hour) with 20 indexing threads, increasing 
: number of threads to 40 hurts performance. Similarly increasing 
: maxConcurrentMerges above 3-5 doesn't seem to give me any benefit. I am 
: wondering what the bottleneck is, or anyone has insight on set of 

Maybe I'm missing something, but it sounds like you are CPU bound.  

Hyperthreading isn't going to help you if you are maxing out 20 (real) 
CPUs -- IIUC it only helps with some additional parallelization when 
processes are blocked by something else -- ie: IO bound.




-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filters execution efficiency

2015-03-26 Thread Chris Hostetter

FWIW: If you're reading LIA, part of your confusion may be that Filters, 
and when/how they are factored into iterating over scorers, has changed 
significantly over the years.

: Date: Fri, 27 Mar 2015 00:45:14 +0100
: From: Adrien Grand 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Cc: Mousom Dhar Gupta 
: Subject: Re: Filters execution efficiency
: 
: Applying a filter with a filtered query works very similarly to a
: Boolean query with 2 MUST clauses. The query and filter iterators are
: advanced in a leap-frog fashion in order to compute the intersection.
: So the filter is neither applied before or after the query but rather
: at the same time.
: 
: On Wed, Mar 18, 2015 at 2:59 AM, Manjesh Nilange
:  wrote:
: > Hi all,
: >
: > I have recently started to learn about Lucene and I'm a little confused
: > about how Filters work. I am going through the "Lucene in Action" book and
: > did some Internet research, but haven't found an answer yet, hence this
: > email...
: >
: > From basic experimentation, I know that Filters work on the entire document
: > set (before the inverted index is looked up). Why not run the filters on
: > the hits after the index lookup? That'd reduce the number of filter
: > executions...I don't imagine that running the filters first would
: > effectively reduce the size of the index (making the lookup faster). Any
: > guidance is much appreciated!
: >
: > Thanks in advance,
: 
: 
: 
: -- 
: Adrien
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 
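
(For anyone curious, the leap-frog intersection Adrien describes looks 
roughly like this -- an illustrative sketch in terms of two 
DocIdSetIterators "a" and "b", not the actual Lucene implementation:)

  int d1 = a.nextDoc();
  int d2 = b.nextDoc();
  while (d1 != DocIdSetIterator.NO_MORE_DOCS && d2 != DocIdSetIterator.NO_MORE_DOCS) {
    if (d1 == d2) {
      collect(d1);         // both iterators agree: doc matches query AND filter
      d1 = a.nextDoc();
    } else if (d1 < d2) {
      d1 = a.advance(d2);  // skip the lagging iterator forward
    } else {
      d2 = b.advance(d1);
    }
  }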

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Eclipse Compiled lucene-core-5.0.0.jar Not Working in Solr

2015-03-09 Thread Chris Hostetter

: If you need to make changes to an existing 4.10 installation, pull down the 
4.10
: source code and work from _that_, which you can do with something like:

based on the error, I don't think he's trying to drop the 
lucene-core-5.0.0.jar into a Solr 4 install -- I suspect he's compiled & 
built the 5.0 jar file w/o including the necessary SPI metadata in the 
jar file itself -- so when Solr loads it, there is no SPI metadata for the 
(default) Lucene50 codec.

If I'm correct: you need to change how you build your jar files to include 
all the necessary jar metadata -- if you do it using the official 
lucene/solr ant build scripts, it should work fine.
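
(For reference: the SPI metadata in question lives under META-INF/services 
inside the jar -- a correctly built lucene-core-5.0.0.jar should contain 
files along these lines, with the Codec file listing 
org.apache.lucene.codecs.lucene50.Lucene50Codec:)

  META-INF/services/org.apache.lucene.codecs.Codec
  META-INF/services/org.apache.lucene.codecs.PostingsFormat
  META-INF/services/org.apache.lucene.codecs.DocValuesFormat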


: 
: svn checkout 
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_10.
: 
: Best,
: Erick
: 
: On Sun, Mar 8, 2015 at 9:49 PM, Andrew Jie Zhou  wrote:
: > Hi, all,
: >
: > I want to compile lucene-core-5.0.0.jar  and add it to Solr's library.
: > What I do is,
: > 1. Create a new java project,
: > 2. Copy source code of Lucene core into the project,
: > 3. Export a jar,
: > 4. Replace the jar in Solr.
: >
: > The log is:
: > INFO  - 2015-03-09 04:01:56.384; org.eclipse.jetty.server.Server;
: > jetty-8.1.10.v20130312
: > INFO  - 2015-03-09 04:01:56.417;
: > org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor
: > /Users/jie/Documents/data/infosense/dumpling/solr/server/contexts at
: > interval 0
: > INFO  - 2015-03-09 04:01:56.423;
: > org.eclipse.jetty.deploy.DeploymentManager; Deployable added:
: > 
/Users/jie/Documents/data/infosense/dumpling/solr/server/contexts/solr-jetty-context.xml
: > INFO  - 2015-03-09 04:01:57.515;
: > org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for
: > /solr, did not find org.apache.jasper.servlet.JspServlet
: > INFO  - 2015-03-09 04:01:57.570;
: > org.apache.solr.servlet.SolrDispatchFilter;
: > SolrDispatchFilter.init()WebAppClassLoader=712256162@2a742aa2
: > INFO  - 2015-03-09 04:01:57.581; org.apache.solr.core.SolrResourceLoader;
: > JNDI not configured for solr (NoInitialContextEx)
: > INFO  - 2015-03-09 04:01:57.582; org.apache.solr.core.SolrResourceLoader;
: > using system property solr.solr.home:
: > /Users/jie/Documents/data/infosense/dumpling/solr/server/solr
: > INFO  - 2015-03-09 04:01:57.584; org.apache.solr.core.SolrResourceLoader;
: > new SolrResourceLoader for directory:
: > '/Users/jie/Documents/data/infosense/dumpling/solr/server/solr/'
: > ERROR - 2015-03-09 04:01:57.680;
: > org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check
: > solr/home property and the logs
: > ERROR - 2015-03-09 04:01:57.702; org.apache.solr.common.SolrException;
: > null:java.lang.ExceptionInInitializerError
: > at
: > 
org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:208)
: > at
: > org.apache.solr.core.SolrResourceLoader.(SolrResourceLoader.java:144)
: > at
: > org.apache.solr.core.SolrResourceLoader.(SolrResourceLoader.java:258)
: > at
: > 
org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:179)
: > at
: > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:129)
: > at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
: > at
: > 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
: > at
: > org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:719)
: > at
: > 
org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
: > at
: > org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1252)
: > at
: > 
org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:710)
: > at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:494)
: > at
: > 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
: > at
: > 
org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
: > at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
: > at
: > 
org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
: > at
: > 
org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
: > at
: > 
org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
: > at
: > 
org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
: > at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
: > at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
: > at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
: > at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
: > at
: > 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
: > at
: > 
org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
: > at
: > 
org.eclipse.jetty.util.component.AbstractLifeCycle.sta

Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)

2015-01-27 Thread Chris Hostetter

: I seem to remember reading that certain versions of lucene were 
: incompatible with some java versions although I cannot find anything to 
: verify this. As we have tens of thousands of large indexes, backwards 
: compatibility without the need to reindex on an upgrade is of prime 
: importance to us.

All known JVM bugs affecting Lucene are listed here...

https://wiki.apache.org/lucene-java/JavaBugs


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



REMINDER: ApacheCon 2015 Call For Papers Ends This Week (February 1st)

2015-01-26 Thread Chris Hostetter


(cross posted, please confine replies to general@lucene)


ApacheCon 2015 will be in Austin, Texas, April 13-17.

http://apachecon.com/

The Call For Papers is currently open, but it ends 2015-02-01 (11:55PM GMT-0600)

https://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


This is a great opportunity to showcase how you use Lucene/Solr, or help 
teach people about features of Lucene/Solr that you think folks might 
not know enough about or fully appreciate.


All levels of talks are welcome -- you don't have to be a Lucene/Solr 
expert to submit a proposal.  Talks targeted at entry level users, and 
talks by novice users about their experiences are frequently in high 
demand.


For more information, and advice on how to prepare a great talk, please 
see the CFP webpage...


https://events.linuxfoundation.org/events/apachecon-north-america/program/cfp



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-13 Thread Chris Hostetter
: 
: The first int to Lucene41PostingsFormat is the min block size (default
: 25) and the second is the max (default 48) for the block tree terms
: dict.

we were discussing over on the solr-user mailing list how Tom would/could 
go about configuring Solr to use a custom subclass of 
Lucene41PostingsFormat where he overrode those min/max constructor params, 
but I realized I have no idea how he's supposed to leverage the plumbing in 
PostingsFormat to override the "name" of the format so it's used properly 
in SPI.

Lucene41PostingsFormat's constructor options only allow overriding the 
block sizes, not the "name" that gets propagated up to the PostingsFormat() 
constructor ... so what is the expected way to write a subclass?
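
i.e. the only subclass you *can* write looks something like this sketch 
(the block sizes here are made up), which inherits the "Lucene41" name -- 
and it's unclear how that is supposed to play nicely with SPI:

  public class BigBlockPostingsFormat extends Lucene41PostingsFormat {
    public BigBlockPostingsFormat() {
      // min/max term block sizes; the defaults are 25/48
      super(200, 398);
    }
  }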


: On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West  wrote:
: > Hello all,
: >
: > We have over 3 billion unique terms in our indexes and with Solr 3.x we set
: > the TermIndexInterval to about 8 times its default value in order to index
: > without OOMs.  (
: > http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
: >
: > We are now working with Solr 4 and running into memory issues and are
: > wondering if we need to do something analogous for Solr 4.
: >
: > The javadoc for IndexWriterConfig (
: > 
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
: > )
: > indicates that the lucene 4.1 postings format has some parameters which may
: > be set:
: > "..To configure its parameters (the minimum and maximum size for a block),
: > you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int,
: > int)
: > 

: > "
: >
: > Is there documentation or discussion somewhere about how to determine
: > appropriate parameters or some detail about what setting the maxBlockSize
: > and minBlockSize does?
: >
: > Tom Burton-West
: > http://www.hathitrust.org/blogs/large-scale-search
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Chris Hostetter

: In Lucene you don't need to use a query parser for that, especially 
: because range Queries is suboptimal and slow: There is already a very 
: fast query/filter available. Ahmet Arslan already mentioned that, we had 
: the same discussion a few weeks ago: 
: http://find.searchhub.org/document/abb73b45a48cb89e

Thanks for reminding me Uwe, been meaning to file that as an 
improvement...

https://issues.apache.org/jira/browse/SOLR-6927



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: CFP and Travel Assistance now open for ApacheCon North America 2015

2014-12-16 Thread Chris Hostetter


(NOTE: cross posted to several lucene lists, if you have replies, please 
confine them to general@lucene)


-- Forwarded message --

In case you've missed it:

- ApacheCon North America returns to Austin, Texas, 13-17 April 2015 
http://apachecon.com/

- Call for Papers open until 1 February --submissions and presentation 
guidelines 
http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp

- Become involved with the program selection process --check out 
http://s.apache.org/60N

- Applications accepted for Apache Travel Assistance --deadline is 6 February! 
http://www.apache.org/travel/


We look forward to seeing you in Austin!


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Compiling and running Lucene/Solr based on github does not seem to work

2014-12-05 Thread Chris Hostetter

For future questions about solr, please use solr-user@lucene ...

: ant compile
: ant test
: 
: successfully. Also Jetty  seems to startup fine, but when I access
: 
: http://localhost:8983/solr/
: 
: then I receive
...

Note the "Instructions for Building Apache Solr from Source" section of 
solr/README.txt ...

4. Navigate to the "solr" folder and issue an "ant" command to see the 
available options
   for building, testing, and packaging Solr.
  
   NOTE: 
   To see Solr in action, you may want to use the "ant example" command to build
   and package Solr into the server/webapps directory. See also 
server/README.txt.


...if you are still having problems, then please email more details (ie: 
exact commands run, log messages, etc...) to solr-user@lucene.


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How best to compare two sentences

2014-12-04 Thread Chris Hostetter

: For a number of years I've been doing this for some time by creating a
: RAMDirectory, creating a document for one of the sentence and then doing  a
: search using the other sentence and seeing if we get a good match. This has
: worked reasonably well but since improving the performance of other parts of
: the application this part has become a performance bottleneck, not that
: suprising as Im creating all these objects just for a one off search, and I
: have to do this for many sentence pairs.

I'm not an academic, and I don't want to undermine the very goal-specific 
advice given by other folks in this thread - but I do want to point out 
that if you are doing *lots* of comparisons like this, then building a 
RAMDirectory for each and every "known" song title to compare with each and 
every "new" song title is already a super inefficient use of lucene.

if instead you built and *kept* a lucene index containing all known song 
titles (one per doc) and then queried it for each "new" song title that 
came in, you'd probably find yourself with a much more efficient solution 
w/o needing to spend a lot of time investigating new algorithms.
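
A rough sketch of that "build once, query many times" approach (4.x-era 
APIs; the field and variable names are illustrative):

  Directory dir = new RAMDirectory();  // or FSDirectory for a persistent index
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
  IndexWriter writer = new IndexWriter(dir, iwc);
  for (String title : knownTitles) {   // build ONCE, up front
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    writer.addDocument(doc);
  }
  writer.close();

  IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
  // then, for each incoming title, just run a query against the kept index:
  QueryParser qp = new QueryParser(Version.LUCENE_4_10_0, "title", analyzer);
  TopDocs top = searcher.search(qp.parse(QueryParser.escape(newTitle)), 1);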



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to map lucene scores to range from 0~100?

2014-11-12 Thread Chris Hostetter

: I met a new trouble. In my system, we should score the doc range from 0 
: to 100. There are some easy ways to map lucene scores to this scope. 
: Thanks for your help~

https://wiki.apache.org/lucene-java/ScoresAsPercentages




-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dangerous reflection access to sun.misc.Cleaner by class org.apache.lucene.store.MMapDirectory$MMapIndexInput$1 detected!

2014-11-03 Thread Chris Hostetter

FYI: random googling for "Dangerous reflection access" indicates these are 
logged by "TopSecurityManager" in NetBeans

random clicking on random messages in the NetBeans forums suggests:

1) these INFO messages are designed to only show up if you run with 
assertions on (evidently under the assumption that only developers would 
run with assertions, and developers should be warned if their apps do this 
-- but evidently not enough to log it as a true WARN)

2) there is apparently no way to configure NetBeans not to log this

3) there is the occasional mention of implementing a LoggingFilter to 
suppress these log messages...

http://netbeans-org.1045718.n5.nabble.com/What-means-this-error-java-lang-Exception-Dangerous-reflection-access-to-sun-misc-Unsafe-by-class-td5732754.html

4) if this is a true Java SecurityManager (not clear to me, I didn't keep 
digging) then you could probably just disable it completely using the 
normal JVM mechanisms for specifying your own SecurityManager.






: Date: Mon, 3 Nov 2014 16:45:36 +0100
: From: Jean-Claude Dauphin 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Dangerous reflection access to sun.misc.Cleaner by class
: org.apache.lucene.store.MMapDirectory$MMapIndexInput$1 detected!
: 
: Hello,
: 
: We have a NetBeans RCP application using Lucene 4.7 for indexing and
: searching.
: 
: When indexing, the following message is displayed:
: 
: Dangerous reflection access to sun.misc.Cleaner by class
: org.apache.lucene.store.MMapDirectory$MMapIndexInput$1 detected!
: 
: It does not stop indexing but it is quite annoying. Any idea on how to get
: rid of this message or should we ignore it ??
: 
: Thank you in advance for any information on this issue.
: 
: Best wishes,
: 
: JCD
: 
: 
: -- 
: Jean-Claude Dauphin
: 
: jc.daup...@gmail.com
: jc.daup...@afus.unesco.org
: 
: http://kenai.com/projects/j-isis/
: http://www.unesco.org/isis/
: http://www.unesco.org/idams/
: http://www.greenstone.org
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting min/max of numeric doc-values facets

2014-10-09 Thread Chris Hostetter

: Is there some way when faceted search is executed, we can retrieve the
: possible min/max values of numeric doc-values field with supplied custom
: ranges in (LongRangeFacetCounts) or some other way to do it ?
: 
: As i believe this can give application hint, and next search request can be
: much smarter, e.g custom ranges can be more specific ?

You can use the StatsComponent to find out the min/max values of a field 
(constrained by your query, or *:* if you want the min/max across the 
entire index) and then you can use those values in your subsequent 
queries...

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
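
e.g. a request along these lines (the field name is illustrative):

  .../select?q=*:*&rows=0&stats=true&stats.field=price

...returns the min, max, and other aggregates for the "price" field.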

It's not currently possible to get the "actual" min/max *within* each 
range bucket of a facet.range (which is what you seem to be asking for, 
although I may be misunderstanding) but it's something being actively 
investigated as part of a larger objective to better integrate stats & 
facets...

https://issues.apache.org/jira/browse/SOLR-6352
https://issues.apache.org/jira/browse/SOLR-6348


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Notifications of new Lucene-Releases

2014-10-06 Thread Chris Hostetter

: Lucene doesn't have a dedicated announce list; maybe subscribe to
: Apache's announce list?  But then you get announcements for all Apache
: projects ... maybe add a mail filter ;)

there are also the "product" info feeds which you can subscribe to...

https://projects.apache.org/projects/lucene_core.html
https://projects.apache.org/feeds/rss/lucene_core.xml

https://projects.apache.org/projects/solr.html
https://projects.apache.org/feeds/rss/solr.xml

Generated from our DOAP files...

https://lucene.apache.org/core/doap.rdf
https://lucene.apache.org/solr/doap.rdf



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NOTICE: Seeking Moderators for java-user@lucene

2014-10-03 Thread Chris Hostetter

: After a few days (probably on friday?) i'll file an infra request to replace
: all current moderators with the new list of volunteers.

Thanks to all our volunteers, watch this jira to know when the change 
happens...

https://issues.apache.org/jira/browse/INFRA-8429


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



NOTICE: Seeking Moderators for java-user@lucene

2014-09-30 Thread Chris Hostetter


Hey folks,

I was on vacation for the past 7 days - 6 days ago someone sent an email 
directly to the java-user moderator list asking for subscription help and 
never got any response -- indicating that all of our other list moderators 
are either no longer active, or just happened to be on vacation the exact 
same time as me.


In any case: time for another round of "Do you want to be a moderator?"

Being a moderator is really easy, details of what's involved posted 
here...


https://wiki.apache.org/solr/MailingListModeratorInfo

...in general, being a good moderator typically requires ~30 seconds of 
work a day skimming emails and occasionally hitting reply (usually blank 
replies to the automated system, occasionally a reply with a link to the 
unsubscribe info page)



If you would like to be a moderator, or if you are already a moderator and 
would like to continue to be a moderator, please reply back to this 
message.  If you would like to use an alternative email address for 
moderation, please include that address in your reply (else I'll use the 
one you send your mail from)


After a few days (probably on Friday?) I'll file an infra request to 
replace all current moderators with the new list of volunteers.




-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Snowball filter - Error instantiating stemmer for a language

2014-09-05 Thread Chris Hostetter

To see about improving the error messages when users make mistakes like 
this...

https://issues.apache.org/jira/browse/LUCENE-5926



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Snowball filter - Error instantiating stemmer for a language

2014-09-04 Thread Chris Hostetter

Odd ... the class org/tartarus/snowball/ext/CatalanStemmer.class should 
exist in the same jar as SnowballPorterFilterFactory -- can you please 
confirm that you see it there?

$ jar tf lucene-analyzers-common-4.6-SNAPSHOT.jar | grep CatalanStemmer
org/tartarus/snowball/ext/CatalanStemmer.class


The only explanation I can think of is that maybe you manually copied some 
of the java or class files directly into your project instead of using the 
jars?

if not: can you describe a bit more about your project and how you are 
running this code?  what does the classpath look like exactly?


: Date: Thu, 4 Sep 2014 03:25:21 -0700 (PDT)
: From: atawfik 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Snowball filter - Error instantiating stemmer for a language
: 
: I am trying to use some filters from the snowball package. However, when I
: run the following code:
: 
: Map args = new HashMap<>();
: TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_46, new
: StringReader("Some text"));
: args.put("luceneMatchVersion", "4.6");
: args.put("language", "Catalan");
: SnowballPorterFilterFactory factory = new SnowballPorterFilterFactory(args);
: TokenFilter filter = factory.create(tokenStream);
: 
: I got the following error:
: 
: Exception in thread "main" java.lang.RuntimeException: Error instantiating
: stemmer for language Catalanfrom class null
:   at
: 
org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory.create(SnowballPorterFilterFactory.java:79)
:   at analysis.AutoLanguageAnalyizer.main(AutoLanguageAnalyizer.java:205)
: Caused by: java.lang.NullPointerException
:   at
: 
org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory.create(SnowballPorterFilterFactory.java:77)
:   ... 1 more
: 
: 
: Any idea on how to resolve this.
: 
: Thanks in advance.
: 
: Regards
: Ameer
: 
: 
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/Snowball-filter-Error-instantiating-stemmer-for-a-language-tp4156882.html
: Sent from the Lucene - Java Users mailing list archive at Nabble.com.
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Should .tip/.doc/.tii files be missing/deleted?

2014-09-03 Thread Chris Hostetter

: following files (I'm not listing all extensions) are deleted immediately
: upon IndexWriter.close() being called:
: 
: *.fdt, *.tip, *.tii, .*pos
: 
: Only the following 5 files are left in all cases
: _0.cfe
: _0.cfs

...you've got the compound file format configured, so each time a segment 
is finished being written it is immediately combined into those cfe/cfs files

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene410/package-summary.html#file-names
https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setUseCompoundFile%28boolean%29

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/MergePolicy.html#setMaxCFSSegmentSizeMB%28double%29
https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/MergePolicy.html#setNoCFSRatio%28double%29
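
If you'd rather keep the individual per-extension files around, you can 
turn compound files off -- roughly (4.10-era API):

  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
  iwc.setUseCompoundFile(false);            // don't write new segments as CFS
  iwc.getMergePolicy().setNoCFSRatio(0.0);  // don't merge segments into CFS either
  IndexWriter writer = new IndexWriter(dir, iwc);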


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Seeking Additional Moderator Volunteers for java-user@lucene

2014-07-29 Thread Chris Hostetter
On Wed, 23 Jul 2014, Yalamarthi, Vineel wrote:

: Can I be volunteer too

Vineel: sorry I didn't see your response until now.  Thanks for 
volunteering, but ASF infra already processed the request and now we've got 
plenty of moderators. (I think it was actually processed before you even 
replied)



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Seeking Additional Moderator Volunteers for java-user@lucene

2014-07-23 Thread Chris Hostetter

Thanks folks, plenty of new volunteers

https://issues.apache.org/jira/browse/INFRA-8082



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Seeking Additional Moderator Volunteers for java-user@lucene

2014-07-23 Thread Chris Hostetter


We're doing some housekeeping of the moderators of this list, and looking 
for any new folks that would like to volunteer. (we currently have 3 
active moderators, 1-2 additional mods would be helpful for good coverage)


If you'd like to volunteer to be a moderator, please reply back to this 
thread and specify which email address you'd like to use as a moderator 
(if different from the one you use when sending the email)


Being a moderator is really easy: you'll get some extra emails in your 
inbox with MODERATE in the subject, which you skim to see if they are spam 
-- if they are you delete them, if not you "reply all" to let them get 
sent to the list, and authorize that person to send future messages w/o 
moderation.


Occasionally, you'll see an explicit email to java-user-owner@lucene from 
a user asking for help related to their subscription (usually 
unsubscribing problems) and you and the other moderators chime in with 
assistance when possible.


More details can be found here...

https://wiki.apache.org/solr/MailingListModeratorInfo

-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Different Scores For Same Query on Identical Index

2014-07-16 Thread Chris Hostetter

: I created an index with three documents, ran a query, and noted the scores.
: Then I deleted one of the documents using IndexWriter.tryDeleteDocument, and
: then re-added the exact same document. (I saved the Document in an instance
: variable, so I couldn't have accidentally changed any of the fields.) After
: rerunning the query, I am getting back the same documents, but with
: different scores. Anyone know what's happening? I can post my code if anyone
: is interested. Thanks!

Deleted documents still affect term stats like IDF until they are expunged.  
If you look at the Explanation from the Query for each of your matching 
docs, you'll see the differences in the stats and how they affect the 
scores.
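
i.e. something like this (the docId is illustrative -- a re-added document 
gets a new internal id, so look it up again after reopening):

  Explanation before = searcher.explain(query, docId);
  // ... delete + re-add the doc, reopen the searcher, re-find the docId ...
  Explanation after = searcher.explain(query, docId);
  // comparing the two shows the changed docFreq/maxDoc stats feeding the IDF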


FYI, related info (relevant to the idea of "comparing" scores across 
indexes or queries)...
https://wiki.apache.org/lucene-java/ScoresAsPercentages


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexSearcher.doc thread safe problem

2014-07-09 Thread Chris Hostetter

: 4. Syncronized searcher.doc method call in multi-thread(like this: public
: synchronized Document getValue( IndexSearcher searcher, int docId ) {
: return searcher.doc( docId ); })
: ==> every execution is same.
:but If I use this method, It is no difference with single thread
: performance.
: 
: What do you think about it?

You're asking us about the behavior of IndexSearcher.doc from 
multi-threaded code, but you haven't shown us enough code to even guess as 
to what your problem might be -- let alone reproduce it.

Can you please submit a fully self-contained program -- one that builds an 
index and then searches it -- and which demonstrates the problem you are 
having?   That way folks trying to help you will have actual code they can 
run on their machine to understand the problem you are describing.



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query rewriting - caching rewritten quries

2014-07-02 Thread Chris Hostetter
: In the system which I develop I have to store many query objects in memory.
: The system also receives documents. For each document MemoryIndex is
: instantiated. I execute all stored queries on this MemoryIndex. I realized
: that searching over MemoryIndex takes much time for query rewriting. I'm
: wondering if I can cache rewritten queries to avoid still rewritting. Is
: there any way to do it?

it depends on what you want to do with the cached queries.

the rewritten queries are relative to the IndexReader passed to the 
rewrite() method -- you can't re-use them against a new/different 
IndexReader (not even if it's a reopened reader against the same index)

from a memory standpoint, rewritten queries also tend to be larger than 
the original query (due to term expansion), so even if you plan on using 
these cached queries over and over against the same IndexReader 
(although there's not much point in that -- you might as well just cache 
the results instead) you're trading the time needed for rewrite() against 
the memory needed for the cache.
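
(if it helps, the reader-relative constraint boils down to this -- a sketch:)

  IndexReader reader = memoryIndex.createSearcher().getIndexReader();
  Query rewritten = query.rewrite(reader);
  // "rewritten" is only valid against THIS reader; the next MemoryIndex
  // (or even a reopened reader) needs its own rewrite() call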


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: ApacheCon deadlines: CFP June 25 / Travel Assistance Jul 25

2014-06-12 Thread Chris Hostetter


(NOTE: cross-posted announcement, please confine any replies to 
general@lucene)


As you may be aware, ApacheCon will be held this year in Budapest, on
November 17-23. (See http://apachecon.eu for more info.)

### ### 1 - Call For Papers - June 25

The CFP for the conference is still open, but will end on June 25th.

If you have an idea for a Lucene/Solr related session @ ApacheCon please 
submit it.  All types of sessions are of interest to ApacheCon attendees 
-- from deep technical talks about internals, hands-on tutorials of 
specific features, general introductions for beginners, "How we did X" 
case studies about your own experiences, etc...


Please consider submitting a proposal, at
http://events.linuxfoundation.org//events/apachecon-europe/program/cfp


### ### 2 - Travel Assistance - July 25th

The Travel Assistance Committee (TAC) is happy to announce that 
applications for ApacheCon Europe 2014 will be accepted until July 25th.


Applications are welcome from individuals within the Apache community
at-large, users, developers, educators, students, Committers, and Members,
who need financial support to attend ApacheCon.

Please be aware the seats are very limited, and all applicants will be
scored on their individual merit.

More information can be found at http://www.apache.org/travel including a
link to the online application and detailed instructions for submitting.




-Hoss
http://www.lucidworks.com/


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: will score get changed as document continuously added.

2014-06-11 Thread Chris Hostetter

: Yes the score will change, because the new documents change the 
: statistics. In general, scores cannot be seen as absolute numbers, they 
: are only useful to compare between search results of the exact same 
: query at the same index snapshot. They have no global meaning.

This wiki page goes into more depth (it focuses on a more specific 
question people frequently ask, but the core of the problem is the 
same as what you are asking about)...

https://wiki.apache.org/lucene-java/ScoresAsPercentages


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: absence of searchAfter method with Collector parameter in Lucene IndexSearcher

2014-06-06 Thread Chris Hostetter

: I was wondering why there is no search method in lucene Indexsearcher to
: search after last reference by passing collector. Say a method with
: signature like searchAfter(Query query, ScoreDoc after, Collector results).

searchAfter only makes sense if there is a Sort involved -- either 
explicitly or implicitly on "score"

When you use a Collector, even if your collector produces "ScoreDoc" 
objects, a subsequent (hypothetical) 
call searchAfter(Query,ScoreDoc,Collector) would have no idea what the 
meaning of "after" was for that ScoreDoc.

(Even if the ScoreDoc was an instance of FieldDoc that encapsulated the 
values for the sort fields, it doesn't know what the fieldNames are, or 
what the comparator/direction to use against those field+values are to 
know what is "after" them).

So from an API standpoint: it just doesn't make any sense.

if you want searchAfter functionality along with custom Collector logic, 
take a look at things like TopFieldCollector.create(...) which you could 
then wrap in your own Collector.
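
e.g. a rough sketch (one of the 4.x-era create() overloads; the sort field 
and page size are illustrative):

  Sort sort = new Sort(new SortField("date", SortField.Type.LONG));
  TopFieldCollector tfc = TopFieldCollector.create(
      sort, 10, (FieldDoc) lastHitOfPreviousPage,
      true,    // fillFields
      false,   // trackDocScores
      false,   // trackMaxScore
      false);  // docsScoredInOrder
  searcher.search(query, tfc);  // or wrap tfc inside your own Collector first
  TopDocs nextPage = tfc.topDocs();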


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: is there a historical reason why default conjunction operator is "OR"?

2014-04-16 Thread Chris Hostetter

: I  recently wondered,
: why lucene's default conjunction operator is "OR".
: Is there a historical reason for that?

The only 'default' is in the query parser -- if you construct the 
BooleanQuery objects programmatically you must always be explicit about the 
Occur property of each Clause.

In the parser the default is "OR" aka "SHOULD" because the prefix operator 
syntax has no operator for "SHOULD" ... 

  "+" => MUST
  "-" => MUST_NOT, 
  absense of an operator => "SHOULD"
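
i.e. programmatically, every clause spells out its Occur (4.x-era API shown):

  BooleanQuery bq = new BooleanQuery();
  bq.add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.SHOULD);   //  a
  bq.add(new TermQuery(new Term("f", "b")), BooleanClause.Occur.MUST);     // +b
  bq.add(new TermQuery(new Term("f", "c")), BooleanClause.Occur.MUST_NOT); // -c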


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4 single segment performance improvement tips?

2014-03-05 Thread Chris Hostetter
: Our runtime/search use-case is very simple: run filters to select all docs
: that match some conditions specified in a filter query (we do not use
: Lucene scoring) and return the first 100 docs that match (this is an
: over-simplification)

"first" as defined how? in order collected by a custom collector, or via 
some sort?

: On a machine with nothing else running, we are unable to move the needle on
: CPU utilization to serve higher QPS. We see that most of the time is spent
: in BlockTreeTermsReader.FieldReader.iterator() when we run profiling tools
: to see where time is being spent. The CPU usage doesn't cross 30% (we have
: multiple threads one per each client connected over a Jetty connection all
: taken from a bounded thread-pool). We tried the usual suspects like
: tweaking size of the threadpool, changing some jvm parameters like newsize,
: heapsize, using cms for old gen, parnew for newgen, etc.

You said you have one thread per client, but you didn't mention anything 
about varying the number of clients -- did you try increasing the number 
of clients hitting your application concurrently?  It's possible that your 
box is "beefy" enough that 30% of the available CPU is all that's needed 
for the number of active concurrent threads you are using (increasing the 
size of the threadpool isn't going to affect anything if there aren't more 
clients utilizing those threads)


-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: Lucene/Solr @ ApacheCon (Denver, April 7-9)

2014-02-27 Thread Chris Hostetter


(cross posted, please keep any replies to general@lucene)

ApacheCon Denver is coming up and registration is currently open.

In addition to a solid 3-day track of Lucene & Solr related talks, there 
are also some post-conference events that are open to anyone even if you 
don't attend the conference proper...



* Registration
https://events.linuxfoundation.org/events/apachecon-north-america/attend/register

* Schedule of Sessions
http://events.linuxfoundation.org/events/apachecon-north-america/program/schedule

* Fast Feather Track
http://s.apache.org/FFT14

(Monday afternoon. This is a series of short talks, ~15 minutes in length, 
covering what's new / interesting / exciting / recently changed etc. Tell 
us about a new project, a new feature, recent updates, tools we should 
know etc. A great chance for new groups and new speakers to experience 
giving an ApacheCon talk in a short form and friendly environment, plus 
welcoming of old hands too!)


* Hackathon
https://events.linuxfoundation.org/events/apachecon-north-america/program/hackathons
http://wiki.apache.org/apachecon/HackathonNA14

(Every day. There's space for dedicated project hackathons welcoming 
of new people, as well as the usual adhoc gatherings. Please sign up if you'll 
be doing something you'd like new people to join in with, so attendees can find 
out what's available.)


* Lightning Talks
http://sched.co/MKcn0O

(Tuesday night. As ever, we look forward to people entertaining, 
informing, challenging and inspiring us, with the odd bit of music or 
silliness no doubt too! You have your 5 minutes... Serious technical talks 
may be better in the Fast Feather Track.  Jim will collect submissions 
before and during the event.)


* BarCamp
https://events.linuxfoundation.org/events/apachecon-north-america/program/barcamp
http://wiki.apache.org/apachecon/BarCampApacheDenver

(Thursday. Don't dash off on Wednesday after the last talk, stay to the 
Thursday and join the unconference! Sessions typically range from about the 
ASF and how it works, through new projects, cool tools, handy processes, 
things from outside we can learn from, things from inside we can learn 
from, what's hot in the incubator, and what's hot in our host city! With 
30 minute sessions, you'll never be bored.)


* Project summits
https://events.linuxfoundation.org/events/apachecon-north-america/extend-the-experience/project-summits-and-meetups
http://s.apache.org/ACNA14Smt

(Thursday and Friday. There's still some space for projects who want to do 
their own mini summits, uninterrupted hackathons, talks and more.)




-Hoss
http://www.lucidworks.com/


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Limiting the fields a user can query on

2014-02-19 Thread Chris Hostetter

: Is there a way to limit the fields a user can query by when using the
: standard query parser or a way to get all fields/terms that make up a query
: without writing custom code for each query subclass?

"limit" in what way?  do you want to throw a parse error if they give you 
"field_you_do_not_allow:foo*" or do you want to treat th entire string 
(including the colon) as a prefix? or do you want to ignore the clause 
entirely?

depending on your goal, you could subclass the QueryParser and override 
the various methods to throw an exception (or change what query gets 
produced) based on the field name passed to you -- or you could use an 
IndexReader wrapper that hides the existence of the terms in the fields 
you don't want to allow, so the queries rewrite to no-ops.
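
e.g. for the "throw a parse error" flavor, a sketch along these lines 
(classic QueryParser, 4.x-era; ALLOWED is a hypothetical whitelist of 
field names):

  QueryParser qp = new QueryParser(Version.LUCENE_4_10_0, "text", analyzer) {
    @Override
    protected Query getFieldQuery(String field, String queryText, boolean quoted)
        throws ParseException {
      if (!ALLOWED.contains(field)) {
        throw new ParseException("field not allowed: " + field);
      }
      return super.getFieldQuery(field, queryText, quoted);
    }
  };

(a complete version would override the range/prefix/fuzzy/wildcard methods 
the same way)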



-Hoss
http://www.lucidworks.com/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[REMINDER] ApacheCon NA 2014 Travel Assistance Applications Due Feb 7

2014-02-05 Thread Chris Hostetter


(NOTE: cross posted, if you feel the need to reply, please keep it on 
general@lucene)


As a reminder, Travel Assistance Applications for ApacheCon NA 2014 are 
due on Feb 7th (about 48 hours from now)


Details are below, please note that if you have any questions about this 
program or the application, they should be addressed to 
travel-assista...@apache.org


-Hoss
http://www.lucidworks.com/


-- Forwarded message --
Date: Wed, Jan 15, 2014 at 4:41 PM
Subject: ApacheCon NA 2014 Travel Assistance Applications now open!
Reply-To: travel-assista...@apache.org


The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon North America 2014 are now open! This
announcement serves as a purpose for you (pmcs@) to let members of your
community know about both ApacheConNA 2014 and about the TAC assistance to
attend. Could you please forward this announcement to your community, along
if possible with information on how your project is involved in ApacheCon
this year?

ApacheConNA will be held in Denver, Colorado, April 7-9, 2014.

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons. For more info on this years
applications and qualifying criteria please visit the TAC website at <
http://www.apache.org/travel/ >.   Applications are already open, so don't
delay!

*The important date*...

   - Friday February 7th 2014 - TAC applications close.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process your request); this will enable TAC
to announce successful awards shortly afterwards.

As usual TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

We look forward to greeting everyone in Denver, Colorado in April.

Kind Regards

Lewis

(On behalf of the Travel Assistance Committee)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



REMINDER: Call For Papers: ApacheCon North America 2014 -- ends Feb 1st

2014-01-27 Thread Chris Hostetter


(Note: cross posted, please keep any replies to general@lucene)

Quick reminder that the CFP for ApacheCon (Denver) ends on Saturday...

http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


Ladies and Gentlemen, start writing your proposals. The Call For Papers 
for ApacheCon North America 2014 is now open, and is open until February 
1st, 2014. Note that we are on a very short timeline this year, so don't 
assume that we'll extend the CFP, just because we've done so every time 
before.




-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: ApacheCon NA 2014 Travel Assistance Applications now open!

2014-01-15 Thread Chris Hostetter


(Note: cross-posted to various lucene user lists, if you have replies 
please keep them on general@lucene, but pleast note that specific 
questions should be addressed to travel-assista...@apache.org)


- - - Forwarded Announcement - - -

The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon North America 2014 are now open!

http://www.apachecon.com/
https://www.apache.org/travel/

ApacheConNA will be held in Denver, Colorado, April 7-9, 2014.

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons. For more info on this years
applications and qualifying criteria please visit the TAC website at <
http://www.apache.org/travel/ >.   Applications are already open, so don't
delay!

*The important date*...

   - Friday February 7th 2014 - TAC applications close.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process your request); this will enable TAC
to announce successful awards shortly afterwards.

As usual TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

We look forward to greeting everyone in Denver, Colorado in April.

Kind Regards

Lewis

(On behalf of the Travel Assistance Committee)
travel-assista...@apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Scanning through inverted index

2013-11-27 Thread Chris Hostetter

: The goal is to construct the iterator
: 
: Iterator: term -> [doc1, doc2, ...]

That iterator already exists -- it's a DocsEnum.
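
(for the record, getting at it looks roughly like this with 4.x-era APIs:)

  Terms terms = MultiFields.getTerms(reader, "body");
  TermsEnum termsEnum = terms.iterator(null);
  BytesRef term;
  while ((term = termsEnum.next()) != null) {
    DocsEnum docs = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
    int doc;
    while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      // term -> doc
    }
  }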

Erick's question is what your *end* goal is ... what are you attempting to 
do that you are asking about accessing a low level iterator over all the 
docs that contain a term, because based on your ultimate goal, there may 
be better suggestions about how to achieve your goal (or specific 
suggestions about how to get/use the DocsEnum) ...

https://people.apache.org/~hossman/#xyproblem

XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: Stump The Chump @ Lucene Revolution EU - Tommorrow

2013-11-05 Thread Chris Hostetter


(Note: cross posted announcement, please confine any replies to solr-user)


Hey folks,

On Wednesday, I'll be doing a "Stump The Chump" session at Lucene 
Revolution EU in Dublin Ireland.


  http://lucenerevolution.org/stump-the-chump

If you aren't familiar with "Stump The Chump" it is a Q&A style session
where I (the Chump) get put on the hot seat to answer tough / interesting
/ unusual questions about Lucene & Solr -- live, on stage, in front of
hundreds of people who are laughing at me, with judges who have all seen
and thought about the questions in advance and get to mock me and make me
look bad.

It's really a lot of fun.

Even if you won't be at the conference, you can still participate by
emailing your challenging question to st...@lucenerevolution.org.
(Regardless of whether you already found a solution to a tough problem,
you can still submit it and see what kind of creative solution I might
come up with under pressure.)

Prizes will be awarded at the discretion of the judges, and video should
be posted online at some point soon after the con -- more details and
links to videos of past sessions are in my recent blog posts...

  http://searchhub.org/tag/chump/



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: Lucene/Solr Revolution EU 2013 - Session List & Early Bird Pricing

2013-09-24 Thread Chris Hostetter


(NOTE: cross-posted to various lists, please reply only to general@lucene 
w/ any questions or follow ups)


Hey folks,

2 announcements regarding the upcoming Lucene/Solr Revolution EU 2013 in 
Dublin (November 4-7)...


## 1) Session List Now Posted

I'd like to thank everyone who helped vote for the sessions that 
interested you during the community voting period.  The bulk of the 
sessions that were selected, and will be presented, are now listed online 
-- a few more will be added once we get final confirmation from the 
remaining speakers who were selected...


  http://lucenerevolution.org/sessions

## 2) Early Bird Pricing Ends Soon

"Early bird" discount registration pricing is available until Monday, 
September 30th -- after that, the registration cost will increase by $100 
USD.  So if you are planning to go, you should register soon and save some 
money...


  http://lucenerevolution.org/registration


Additional details about the conference can be found at the website, or 
feel free to reply to this email with any questions...


  http://lucenerevolution.org


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: Lucene/Solr Revolution EU 2013: Registration & Community Voting

2013-08-26 Thread Chris Hostetter


(NOTE: cross-posted to various lists, please reply only to general@lucene 
w/ any questions or follow ups)



2 Announcements folks should be aware of regarding the upcoming 
Lucene/Solr Revolution EU 2013 in Dublin...



# 1) Registration Now Open

Registration is now open for Lucene/Solr Revolution EU 2013, the biggest 
open source conference dedicated to Apache Lucene/Solr.  Two-day training 
workshops will precede the conference.  You can benefit from discounted 
conference rates if you register early.


http://lucenerevolution.org/registration

More info...
http://searchhub.org/2013/08/15/lucenesolr-revolution-eu-registration-is-open/


# 2) Community Voting on Agenda (Until September 9th)

The Lucene/Solr Revolution free voting system allows you to vote on your 
favorite topics. The sessions that receive the highest number of votes 
will be automatically added to the Lucene/Solr Revolution EU 2013 agenda. 
The remaining sessions will be selected by a committee of industry experts 
who will take into account the community’s votes as well as their own 
expertise in the area.


http://lucenerevolution.org/2013/call-for-papers-survey

More info...
http://searchhub.org/2013/08/23/help-us-set-the-agenda-for-lucenesolr-revolution-eu/

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: QueryParser for DisjunctionMaxQuery, et al.

2013-07-23 Thread Chris Hostetter

: Subject: QueryParser for DisjunctionMaxQuery, et al.
: References: <1374578398714-4079673.p...@n3.nabble.com>
: In-Reply-To: <1374578398714-4079673.p...@n3.nabble.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ANNOUNCE: CFP Lucene/Solr Revolution EU 2013 (Deadline August 2nd)

2013-07-08 Thread Chris Hostetter


(NOTE: cross-posted to various lists, please reply only to general@lucene 
w/ any questions or follow ups)


The Call for Papers for Lucene/Solr Revolution EU 2013 is currently open.

http://www.lucenerevolution.org/2013/call-for-papers

Lucene/Solr Revolution is the biggest open source conference dedicated to 
Apache Lucene/Solr. The great content delivered by speakers like you is 
the heart of the conference. If you are a practitioner, business leader, 
architect, data scientist or developer and have something important to 
share, we welcome your submission.


We are particularly interested in compelling use cases and success 
stories, best practices, and technology insights.


This year, sessions will be selected by a committee of industry experts 
and your peers through the Community Choice voting process. Once the Call 
for Papers has closed, the voting process will begin. The top vote getters 
in each track will automatically be added to the Lucene/Solr Revolution 
agenda. The remaining sessions will be selected by a committee of industry 
experts who will take into account the Community Choice votes as well as 
their own expertise in the area.


Key Dates:
June 3, 2013: CFP opens
August 2, 2013: CFP closes
August 12, 2013: Community voting begins
September 1, 2013: Community voting ends
September 22, 2013: All speakers notified of submission status

Tracks and Format:
Technical Deep Dive (75 minute tutorials)
Lucene/Solr in Action (45 minute use case presentations)
Lucene/Solr in the Big Data Ecosystem (45 minute technical presentations)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Read an solr index with two different lucene formats

2013-06-14 Thread Chris Hostetter

: I used solr to query the index, and verified that each document does have a
: non-blank date field.  I suspect that it's because the lucene-3.6 api I am
: using can not read datefield correctly from documents written in lucene 1.4
: format.

how did you verify that they all have a non-blank value?

my wild shot in the dark guess here...

1) you are "verifying" that every doc has a value in the date field by 
using something like q=date:[* TO *] and seeing that its numfound 
matches q=*:*
2) at some point your date field was indexed but not stored, and a large 
number of documents were added during this time.
3) so now all of your documents have an *indexed* value for the date 
field, but many of them have no *stored* value for the date field.
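
for illustration, a rough sketch (3.x field API, made-up field value) of how 
a field ends up in that state -- indexed, so q=date:[* TO *] matches it, but 
with nothing stored to return...

  // Store.NO + Index.NOT_ANALYZED: searchable, but no stored value
  doc.add(new Field("date", "20130614",
                    Field.Store.NO, Field.Index.NOT_ANALYZED));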


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ERROR help me please ,org.apache.lucene.search.IndexSearcher.(Ljava/lang/String;)V

2013-05-17 Thread Chris Hostetter

: Well IndexSearcher doesn't have a constructor that accepts a string,
: maybe you should pass in an indexreader instead?

specifically: the code you are trying to run was compiled against a version 
of lucene in which the IndexSearcher class had a constructor that accepted 
a single string argument -- but at runtime, the classpath you are using 
contains a compiled version of the IndexSearcher class where that 
constructor does not exist.

the likely situation is that your code was compiled with an older version 
of lucene (once upon a time IndexSearcher had a constructor that took a 
String argument) and you are now running that code against a newer version 
of lucene.  Alternatively: it's possible you have many versions of lucene 
in your classpath, and the JVM is getting confused.
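
for reference, a sketch of the modern equivalent (the index path here is 
made up): open the Directory and IndexReader yourself, and pass the reader 
in...

  // sketch against the 3.x API.  (on 4.x, use
  // DirectoryReader.open(dir) instead of IndexReader.open(dir))
  Directory dir = FSDirectory.open(new File("/path/to/index"));
  IndexReader reader = IndexReader.open(dir);
  IndexSearcher searcher = new IndexSearcher(reader);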


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why does index boosting a field to 2.0f on a document have such a dramatic effect

2013-04-04 Thread Chris Hostetter

: At index time I boost the alias field of a small set of documents, setting the
: boost to 2.0f, which I thought meant equivalent to doubling the score this doc
: would get over another doc, everything else being equal.

1) you haven't shown us enough details to be certain, but based on the 
code you've provided it looks like you are adding a boost for *each* field 
instance named "alias" if the value of artistGuid is in your 
artistGuIdSet...

: if(artistGuIdSet.contains(artistGuid)) {
: for(IndexableField indexablefield:doc.getFields())
: {
: if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
: {
: Field field = (Field)indexablefield;
: field.setBoost(ARTIST_DOC_BOOST);

...so a doc with N values in the "alias" field gets the 2.0f boost applied 
once per instance -- and since per-instance boosts are multiplied together, 
that's an effective field boost of 2^N, not just 2.
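
a sketch of what you probably intended (reusing your own class/constant 
names): set the boost on a single instance, so it's only applied once...

  boolean boosted = false;
  for (IndexableField indexablefield : doc.getFields()) {
    if (indexablefield.name().equals(ArtistIndexField.ALIAS.getName())
        && !boosted) {
      // boost only the first "alias" instance
      ((Field) indexablefield).setBoost(ARTIST_DOC_BOOST);
      boosted = true;
    }
  }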

2) Looking at the URL you mentioned

: http://search.musicbrainz.org/?type=artist&query=Jean&explain=true

...the debug explanation currently produced by that URL says...

6.4894321E10 = (MATCH) weight(alias:jean in 7610) [MusicbrainzSimilarity], 
result of:
   ...
   7.5161928E9 = fieldNorm(doc=7610)

You need to look at your "MusicbrainzSimilarity" class and its fieldNorm 
method to determine for certain why it's producing such large values -- we 
have no idea how that's implemented.


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardAnalyzer class not present in Lucene 4.2.0

2013-03-25 Thread Chris Hostetter

: Thank you very much Arjen. I had to separately download and install the
: jar.  it was not present in my lucene installation directory. I had
: downloaded the lucene zip file and ran the command  "ant" after extracting
: it. Did i miss anything.?

if you download & build lucene from source, then:

1) the lucene-analyzers-common code would be in the analysis/common module

2) running "ant" defaults to only building lucene-core.  if you want to 
build all of the modules as well you need to run "ant jar"


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Migrating SnowballAnalyzer to 4.1

2013-02-28 Thread Chris Hostetter

: Subject: Migrating SnowballAnalyzer to 4.1
: References:
: 
:  
: In-Reply-To:
: 

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Searching for keywords .net,c#,...

2013-02-26 Thread Chris Hostetter

: which seems to override incrementToken() ( guess as I don't know java )
: however using lucene.net 3.0.3, I can override

Lucene.Net is a completely separate project from Lucene, with its own 
APIs, release cycles, and user community.

Your best bet at getting help from people who are familiar with 
Lucene.Net (and .Net in general), would be on the Lucene.Net user list...


https://lucenenet.apache.org/community.html


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ApacheCon meetup

2013-02-19 Thread Chris Hostetter

: Subject: ApacheCon meetup
: 
: Any other Lucene/Solr enthusiasts attending ApacheCon in Portland next week?

I won't make it to ApacheCon this year (first time in a long time 
actually) but I'm fairly certain there will be a Lucene MeetUp of some 
kind -- there always is.

This is usually organized via the ApacheCon wiki, so interested 
participants should sign up there...

https://wiki.apache.org/apachecon/CommunityEventsNA13
https://wiki.apache.org/apachecon/ApacheMeetupsNA13


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Large Index Query Help!

2013-01-29 Thread Chris Hostetter

: Subject: Large Index Query Help!
: References: <1359429227142-4036943.p...@n3.nabble.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NPE when adding a Document to an IndexWriter

2013-01-09 Thread Chris Hostetter

: thanks for your reply.  please see attached.  I tried to maintain the
: structure of the code that I need to use in the library I'm building.  I think
: it should work for you as long as you remove the package declaration at the
: top.

I can't currently try your code, but skimming through it i'd bet money the 
problem is in your Analyzer.  Have you tried simplifying your test down 
and just using "StandardAnalyzer" to rule that out?

In particular i see this...

>>> Analyzer.TokenStreamComponents tsc = new Analyzer.TokenStreamComponents( 
>>>   getCharTokenizer( reader )
>>> , getTokenFilterChain( reader, config ) 
>>> );

...passing the same Reader to two diff methods there is almost certainly 
not what you want to do.
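
a sketch of the wiring you probably want (adapting your helper methods so 
that only the Tokenizer ever touches the Reader)...

  // the Tokenizer consumes the Reader; the filter chain should wrap
  // the Tokenizer, never read from the Reader directly
  Tokenizer source = getCharTokenizer(reader);
  TokenStream chain = getTokenFilterChain(source, config);
  return new Analyzer.TokenStreamComponents(source, chain);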



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NPE when adding a Document to an IndexWriter

2013-01-09 Thread Chris Hostetter

: I keep getting an NPE when trying to add a Doc to an IndexWriter. I've
: minimized my code to very basic code.  what am I doing wrong? pseudo-code:

can you post a full test that other people can run to try and reproduce?  

it doesn't even have to be a junit test -- just some complete java code 
people can paste into a main method and compile would be enough (right now 
we have no idea what IndexWriterConfig you are using (could easily affect 
things) or what Directory you are using (less likely, but still))


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Chris Hostetter
: Unfortunately, no...I am not combine every two term into one. I am
: combining a specific pair.

I'm confused ... you've already said that you expect you will need a 
custom filter because your usecase is very special -- and you haven't 
given us many details about exactly when/why/how you want a filter to 
decide to combine two tokens, so no one can make a guess as to whether any 
existing filter fits your usecase exactly -- but Alan did point out an 
example of a filter you could look at as a guide for how to go about 
combining tokens, and your response to that was that it isn't exactly what 
you are looking for.

I think you either need to give us more info about exactly what you 
are looking for, or you need to look closer at the code for ShingleFilter 
and ask more specific questions about the parts you don't understand in 
the quest to implement your own custom filter.
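
to give you a starting point, here's the bare skeleton of a custom filter 
(all names made up) -- the interesting part, deciding *when* to merge two 
tokens, is exactly the detail we're missing...

  public final class PairCombiningFilter extends TokenFilter {
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);

    public PairCombiningFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      // inspect termAtt here; to merge with the *next* token you
      // would capture/restore state the way ShingleFilter does
      return true;
    }
  }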

: > Have a look at ShingleFilter:
: > 
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html

: > > I think developing my own filter is the only resolution...but I just
: > cannot
: > > find a guide to help me understand what I need to do to implement a
: > > TokenFilter.


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about ordering rule of SpanNearQuery

2012-11-21 Thread Chris Hostetter

: I am confused with the ordering rule about SpanNearQuery. For example, I 
: indicate the slot in SpanNearQuery is 10. And the results are all the 
: qualified documents. Is it true that any document with shorter distance 
...
: it till uses tf-idf algorithm to rank the docs. Or there is some complex 
: algorithm blending the distance and tf-idf algorithm.

It's blended ... think of each occurrence of a specified span as a 
"pseudo-term", but instead of each occurrence incrementing the 
"pseudo-term-frequency" by "1", it increments it by a floating point 
number based on how sloppy the match was (an exact match is usually "1", 
a sloppy match is usually something smaller)...

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#sloppyFreq%28int%29
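
for reference, DefaultSimilarity's implementation is tiny -- the bigger the 
distance, the smaller the increment (an exact match has distance 0 and 
contributes 1.0)...

  public float sloppyFreq(int distance) {
    return 1.0f / (distance + 1);
  }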

As Jack mentioned: look at the explain results for the details for any 
specific query & doc

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there anything in Lucene 4.0 that provides 'absolute' scoring so that i can compare the scoring results of different searches ?

2012-10-25 Thread Chris Hostetter

https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F
https://wiki.apache.org/lucene-java/ScoresAsPercentages

The fundamental problem of attempting to compare scores for different 
searches is the same in your situation as in the goal of trying to 
"normalize" scores to a fixed range.  But the subtle difference between 
your question and that FAQ is that if i'm understanding your goal 
correctly, you could modify your similarity to eliminate the "queryNorm" 
factor and get a ... i'm not sure what to call it ... "more raw" score 
back, which *might* be suitable for your purposes. (i'm not sure)
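
a sketch (untested) of what that modification might look like...

  public class RawScoreSimilarity extends DefaultSimilarity {
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
      return 1.0f; // neutralize the queryNorm factor
    }
  }

  // then: searcher.setSimilarity(new RawScoreSimilarity());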


: Is there anything in Lucene 4.0 that provides 'absolute' scoring so that i can
: compare the scoring results of different searches ?
: 
: To explain if I do a search for two values fred OR jane and there is a
: document that contains both those words exaclty then that document will score
: 100, documents that contain only one word will score less. But if there is no
: document contain both words but there is one document that contains fred then
: that document will score 100 even though it didnt match jane at all. (Im
: clearly ignoring all the complexties but you get my gist)
: 
: So all documents returned from a search are scored relative to each other, but
: I cannot perform a second score and sensibly compare the score of the first
: search with the second whihc is what I would like to do.
: 
: Why would I want to this ?
: 
: In a music database  have an index of releases and a separate indexes of
: artists, usually the user just searches artists or releases. But sometimes
: they want to search all and interleave the results from the two indexes, but
: its not sensible for me to interleave them based on their score at the moment.
: 
: thanks Paul
: 
: 
: 
: -
: To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
: For additional commands, e-mail: java-user-h...@lucene.apache.org
: 
: 

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: short search terms

2012-09-26 Thread Chris Hostetter

: I have a key field that will only ever have a length of 3 characters. I am
: using a StandardAnalyzer and a QueryParser to create the Query
: (parser.parse(string)), and an IndexReader and IndexSearcher to execute the
: query (searcher(query)). I can't seem to find a setter to allow for a 3
: character search string. There is one setMinWordLen, but it isn't applicable

there's a lot of missing information here ... what do you mean "allow for 
a 3 character search string" ... the query parser doesn't have anything in 
it that would prevent a 3 (or 2, or 1) character search string, so i 
suspect that's not really the question you mean to ask.

what is the problem you are actually seeing?  do you have a query that isn't 
matching the docs you think it should? what query? what docs? what does 
the code look like?

can you explain more what this 3 character field represents, and how you 
want to use it?

https://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Issue with documentation for org.apache.lucene.analysis.synonym.SynonymMap.Builder.add() method

2012-09-06 Thread Chris Hostetter

: Converted to U+000 by what, I wonder? Javadoc shouldn't be doing that. If
: it does,  I wonder if we need \\u instead?

apparently it is...

https://mail-archives.apache.org/mod_mbox/harmony-dev/200802.mbox/%3c47b2f7ae.2000...@gmail.com%3E




-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Seeking more moderators for java-user@lucene

2012-08-28 Thread Chris Hostetter

: I have tried multiple times to unsubscribe, and it never works.  Could you 
unsubscribe me?

Anyone having trouble unsubscribing should read the help page on the 
wiki and follow the instructions there if they need more help...

https://wiki.apache.org/solr/Unsubscribing%20from%20mailing%20lists

(I've just added a more prominent link to this page to the main solr 
discussion.html page: https://lucene.apache.org/solr/discussion.html )


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Seeking more moderators for java-user@lucene

2012-08-27 Thread Chris Hostetter


Greetings subscribers to java-user@lucene.

I've been offline for the past ~5 days, and when i looked at my email 
again this morning I found a message to java-user@lucene sitting in the 
moderator queue since Aug 22nd.


Messages sitting in the queue that long are a good indication that we 
don't have enough (active) mailing list moderators.


Being a moderator is really easy: you'll get some extra emails in your 
inbox with MODERATE in the subject, which you skim to see if they are spam 
-- if they are you delete them, if not you "reply all" to let them get 
sent to the list, and authorize that person to send future messages w/o 
moderation.


If you would like to volunteer to be a moderator, please contact 
java-user-owner@lucene.






-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[ANNOUNCE] Lucene/Solr @ ApacheCon Europe - August 13th Deadline for CFP and Travel Assistance applications

2012-08-06 Thread Chris Hostetter


ApacheCon Europe will be happening 5-8 November 2012 in Sinsheim, Germany 
at the Rhein-Neckar-Arena.  Early bird tickets go on sale this Monday, 6 
August.


  http://www.apachecon.eu/

The Lucene/Solr track is shaping up to be quite impressive this year, so 
make your plans to attend and submit your session proposals ASAP!


-- CALL FOR PAPERS --

The Call for Participation for ApacheCon Europe has been extended to 13 
August!


To submit a presentation and for more details, visit 
http://www.apachecon.eu/cfp/


Post a banner on your Website to show your support for ApacheCon Europe or 
North America (24-28 February 2013 in Portland, OR)! Download at 
http://www.apache.org/events/logos-banners/


We look forward to seeing you!

 -the Apache Conference Committee & ApacheCon Planners

--- TRAVEL ASSISTANCE ---

We're pleased to announce Travel Assistance (TAC) applications for 
ApacheCon Europe 2012 are now open!


The Travel Assistance Committee exists to help those that would like to 
attend ApacheCon events, but are unable to do so for financial reasons. 
For more info on this year's Travel Assistance application criteria please 
visit the TAC website at < http://www.apache.org/travel/ >.


Some important dates... The original application period officially opened 
on 23rd July, 2012. Applicants have until the 13th August 2012 to submit 
their applications (which should contain as much supporting material as 
required to efficiently and accurately process your request); this will 
enable the Travel Assistance Committee to announce successful awards on or 
shortly after the 24th August, 2012.


As always TAC expects to deal with a range of applications from many 
diverse backgrounds so we encourage (as always) anyone thinking about 
sending in a TAC application to get it in ASAP.


We look forward to greeting everyone in Sinsheim, Germany in November.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: change of API Javadoc interface funtionality in 4.0.x

2012-07-18 Thread Chris Hostetter

: What is the sense of removing the "Index" from the API Javadoc for Lucene and 
Solr?

It was heavily bloating the size of the releases...

https://issues.apache.org/jira/browse/LUCENE-3977

It's pretty easy to turn this back on and rebuild the docs locally.  Feel 
free to open a jira and submit a patch to make it a build prop (so you 
could put "javadoc.index=true" in your build.user.properties or use "ant 
-Djavadoc.index=true javadocs")



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to unsubscribe from this list?

2012-06-25 Thread Chris Hostetter

G.Long: I'm replying to the list so this info is visible to anyone who is 
curious, but if you have specific followup questions, please reply to 
java-user-owner@lucene ...

: Thanks. I tried this but it did not work so asking :).

1) sending an unsubscribe request will trigger an automated response
requesting verification.  you must reply to the special email address that
sends these automated verification requests in order to complete the
request.

2) in order to unsubscribe, you must send your unsubscribe request from
the exact address used to subscribe -- not an alternate address that
forwards to your current address.  if the address you attempt to
unsubscribe is not currently in the subscriber list, the automated emails
will mention this.

3) if you are unable to unsubscribe through the automated email system, 
please contact java-user-owner@lucene and help us understand what went 
wrong so we can try to fix it.  did you try emailing java-user-help@lucene 
first? did you get any automated response when you emailed 
java-user-unsubscribe? did you get any response after replying with 
verification?

4) in order for the list owners to intervene and manually unsubscribe you,
you need to provide the "Return-Path" header from a message you are
receiving from the list(s) you wish to unsubscribe from -- if there are
multiple lists, then you need to provide an example of the "Return-Path"
header from at least one message on each list.


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Chris Hostetter

: Subject: need to find locations of query hits in doc: works fine for regular
:  text but not for phone numbers
: Message-ID: 
: References: <1339635547170-3989548.p...@n3.nabble.com>
: In-Reply-To: <1339635547170-3989548.p...@n3.nabble.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Bizarre Search order request

2012-05-25 Thread Chris Hostetter

: For example, if I display of 20 results, I might want to limit it to a 
: maximum of 10 "mail", 10 "blog" and 10 "website" documents.  Which ones 
: get displayed and how they were ordered would depend on the normal 
: relevancy ranking, but, for example, once I had 10 "mail" objects to 
: display on the page, the effect would be that other "mail" objects 
: relevancy would drop below "blog" and "website".  If there aren't 10 of 
: one of these, then the I'm allowed to exceed the maximum of 10 so that I 
: get 20 results.  What I don't want is 20 "mail" documents if there are 
: "blog" and/or "website" documents to display.

Most of what you're asking about is a straightforward use of 
Result Grouping...

http://wiki.apache.org/solr/FieldCollapsing

...the nit is your statement 'the effect would be that other "mail" 
objects relevancy would drop below "blog" and "website"' ... grouping 
doesn't change the relevancy scores, it just limits the number of results 
per field value.
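
a request along these lines (with a hypothetical "doctype" field holding 
mail/blog/website) would give you at most 10 hits per type...

  q=your+query&group=true&group.field=doctype&group.limit=10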

the canonical UI is to show users the top N per group on the main page 
(you can interleave them if you want, or leave them grouped), but give 
them links to see all (ie: redo the search with no grouping and allow 
pagination) or see all of a particular type (ie: redo the search with no 
grouping and an fq; allow pagination)

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: old fashioned....."Too many open files"!

2012-05-18 Thread Chris Hostetter

: the point is that I keep the readers open to share them across search. Is
: this wrong?

your goal is fine, but where in your code do you think you are doing that? 

I don't see any readers ever being shared.  You open new ones (which are 
never closed) in every call to getSearcher()

: > >while(i.hasNext()){
: > >Map.Entry index = (Map.Entry)i.next();
: > >IndexWriter iw = (IndexWriter)index.getValue();
: > >readers.add(IndexReader.open(iw, true));
: > >}
: > >
: > >MultiReader mr = new MultiReader(readers.toArray(readerList));
: > >return new IndexSearcher(mr);
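
for the single-writer case, a sketch of what actual sharing looks like 
(SearcherManager -- available since 3.5, constructor shown is the 3.6/4.x 
flavor -- does the reference counting and reopening for you; with multiple 
writers you'd cache and periodically reopen your MultiReader the same way)...

  // null = use the default SearcherFactory
  SearcherManager mgr = new SearcherManager(iw, true, null);

  // per search request:
  IndexSearcher searcher = mgr.acquire();
  try {
    // run the search with this shared searcher
  } finally {
    mgr.release(searcher); // don't close it yourself
  }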


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Repeatability of results

2012-04-04 Thread Chris Hostetter

: OK this could make sense (floating point math is frustrating!).
: 
: But, Lucene generally scores one document at a time, so in theory just
: changing its docid shouldn't alter the order of float operations.

i haven't thought this through, but couldn't scorer re-ordering in 
BooleanScorer2 possibly tickle weird little floating point math 
eccentricities?

if the documents are in diff orders, then the skipTo calls (or are they 
called advance() now?) would result in the subScorers being in a diff 
order right?  ... so the floats from each subscorer would be added in a 
diff order?
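
a trivial demonstration that float addition isn't associative -- the same 
three subscores summed in a different order give different totals...

  float a = 1.0e8f, b = -1.0e8f, c = 1.0e-3f;
  System.out.println((a + b) + c);  // 0.001
  System.out.println(a + (b + c));  // 0.0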


-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


