[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082898#comment-14082898
 ] 

David Smiley commented on LUCENE-5156:
--

I can understand why this change was made -- better not to support seek-by-ord 
at all than to offer an optional operation that callers expect to be fast but 
that is actually slow.  What if it were made fast, along with seekCeil(), which 
is also slow right now?  For example, the first time either seekCeil or an ord 
method is called, build an array of term start positions indexed by ordinal, 
which otherwise wouldn't be built.  Then you could do a binary search for 
seekCeil and a direct lookup for seekExact.  The lazily created array could 
also be shared across repeated requests for the Terms of the current document.

Why bother, you might ask?  I'm working on a way to have the default 
highlighter search directly against the Terms from term vectors instead of 
re-inverting into a MemoryIndex.  I'll post a separate issue for that with 
code, of course; it works but isn't as efficient as it could be because 
seekCeil on term vectors' Terms is O(N).
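
The lazily built, ordinal-indexed array described above might look roughly like 
the sketch below.  It is not Lucene code: the class name LazyOrdTermBuffer, the 
packed [length][bytes] record layout, and the use of String.compareTo (rather 
than byte order) are all assumptions made to keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/**
 * Sketch only: the sorted terms of one document packed back to back as
 * [length byte][term bytes] records. The layout is an illustrative assumption,
 * not the CompressingTermVectors format.
 */
final class LazyOrdTermBuffer {
    private final byte[] data;   // packed records
    private final int numTerms;
    private int[] startByOrd;    // null until a seek method is first called

    /** Terms must be sorted by String.compareTo and shorter than 256 bytes each. */
    LazyOrdTermBuffer(String... sortedTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String term : sortedTerms) {
            byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
            out.write(bytes.length);
            out.write(bytes, 0, bytes.length);
        }
        data = out.toByteArray();
        numTerms = sortedTerms.length;
    }

    boolean isIndexBuilt() {
        return startByOrd != null;
    }

    /** One linear pass recording where each term starts; runs at most once. */
    private int[] index() {
        if (startByOrd == null) {
            int[] starts = new int[numTerms];
            int pos = 0;
            for (int ord = 0; ord < numTerms; ord++) {
                starts[ord] = pos;
                pos += 1 + (data[pos] & 0xFF);
            }
            startByOrd = starts;
        }
        return startByOrd;
    }

    /** Direct lookup by ordinal: constant time once the index exists. */
    String termAt(int ord) {
        int start = index()[ord];
        return new String(data, start + 1, data[start] & 0xFF, StandardCharsets.UTF_8);
    }

    /** Ordinal of the smallest term >= target, or numTerms if there is none. */
    int seekCeil(String target) {
        int lo = 0, hi = numTerms;   // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (termAt(mid).compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Code that only iterates never triggers index(), so it pays nothing extra; the 
int[] costs four bytes per term and is allocated only when a seek is requested.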

 CompressingTermVectors termsEnum should probably not support seek-by-ord
 

 Key: LUCENE-5156
 URL: https://issues.apache.org/jira/browse/LUCENE-5156
 Project: Lucene - Core
 Issue Type: Bug
 Reporter: Robert Muir
 Fix For: 4.5, 5.0
 Attachments: LUCENE-5156.patch


 Just like term vectors before it, it has an O(n) seek-by-term. 
 But this one also advertises a seek-by-ord, and that is O(n) as well.
 This could cause e.g. CheckIndex to be very slow, because if a TermsEnum 
 supports ord, CheckIndex runs a bunch of seeking tests. (Another solution would 
 be to leave it, and add a boolean so CheckIndex never does seeking tests for 
 term vectors, only for real fields.)
 However, I think it's also kinda a trap: in my opinion, if seek-by-ord is 
 supported anywhere, you expect it to be faster than linear time.
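
 To illustrate why advertising ord support matters to callers, here is a small 
 sketch of how a consumer might probe for it.  It assumes only that, as with 
 Lucene's TermsEnum, an unsupported ord() throws UnsupportedOperationException. 
 OrdSource and OrdProbe are hypothetical names, and this is not the actual 
 CheckIndex code.

```java
/** Hypothetical stand-in for an enum whose ord() is an optional operation. */
interface OrdSource {
    long ord();   // may throw UnsupportedOperationException
}

final class OrdProbe {
    /** Returns true if ord() works, false if the implementation opts out. */
    static boolean supportsOrd(OrdSource source) {
        try {
            source.ord();
            return true;    // caller may now assume seek-by-ord is usable
        } catch (UnsupportedOperationException e) {
            return false;   // caller skips all ord-based seeking
        }
    }
}
```

 A caller that gets true here will reasonably go on to seek by ord many times, 
 which is where an O(n) implementation becomes expensive.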



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083026#comment-14083026
 ] 

Robert Muir commented on LUCENE-5156:
-

That's unrelated to term vectors. We shouldn't have such caching in the default 
codec; it can easily blow up on a large document.




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083029#comment-14083029
 ] 

Robert Muir commented on LUCENE-5156:
-

Personally, I would do such a thing with a FilterTerms + FilterReader: you just 
check whether docid == lastDocID, and you have your cache.

But I don't think it should be in the default codec. I also happen to think term 
vectors aren't a good data structure for highlighting anyway.
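
The "docid == lastDocID" check mentioned above amounts to a single-entry cache. 
A minimal sketch of just that idea follows, without the Lucene filter classes. 
LastDocCache and its loader function are hypothetical names standing in for 
whatever the wrapping reader would actually do, and the sketch assumes 
non-negative doc IDs and single-threaded use.

```java
import java.util.function.IntFunction;

/**
 * Sketch only: remembers the value loaded for the most recent document ID so
 * that repeated requests for the same document skip the expensive load.
 */
final class LastDocCache<T> {
    private final IntFunction<T> loader;   // hypothetical: loads per-document data
    private int lastDocID = -1;            // -1 means nothing is cached yet
    private T lastValue;

    LastDocCache(IntFunction<T> loader) {
        this.loader = loader;
    }

    T get(int docID) {
        if (docID != lastDocID) {          // miss: load and replace the single entry
            lastValue = loader.apply(docID);
            lastDocID = docID;
        }
        return lastValue;
    }
}
```

Because only one entry is kept, memory use is bounded by the largest single 
document, and moving to a new document discards the previous entry.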




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083065#comment-14083065
 ] 

Robert Muir commented on LUCENE-5156:
-

I also think it's OK if we fix the codec to have a faster seekExact (not by 
copying stuff into a large array on the first call though, just by fixing the 
data structure / how it accesses data).

That would solve the actual problem you have here in a clean way.




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083064#comment-14083064
 ] 

David Smiley commented on LUCENE-5156:
--

I agree on the caching thing -- that is, the part where you ask for the Terms 
of the same document again.  Never mind that part -- as I thought about it I 
realized I didn't need it after all.

bq. But i dont think it should be in the default codec. I also happen to think 
term vectors arent a good datastructure for highlighting anyway.

The default highlighter fully respects the positions and other aspects of the 
user's query, unlike the other highlighters.  Some applications demand that a 
highlight be accurate to the query, even if the query uses custom span queries 
that do tricks with payloads, etc.  It would be nice if the other highlighters 
supported accurate highlights for such queries, but they don't, so today this 
is the one to use for accurate highlights on complex queries.  The default 
highlighter requires a Terms instance reflecting the current document -- it 
currently gets one by re-inverting into a MemoryIndex, but it can be hacked to 
accept a Terms directly from term vectors.  

So you don't like the idea of improving the performance of term vector seekCeil 
in the default codec?  Is that a -1 or -0?  The change I propose seems harmless 
-- the code would not build up the new offset array unless consuming code 
calls seekCeil or the ord methods.




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083072#comment-14083072
 ] 

Robert Muir commented on LUCENE-5156:
-

Sorry David, it's not about being against speeding something up; it's about how 
you propose implementing it.

Copying all the data from the entire document into another array on the first 
read for the doc is a really trashy thing to do here. Instead, we should 
just fix it correctly, so that seekCeil() is not linear time. 




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2014-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083079#comment-14083079
 ] 

Robert Muir commented on LUCENE-5156:
-

Also, this should be discussed on its own issue rather than on an unrelated, 
closed, year-old one. (Sorry, it's not really related to seek-by-ord; your 
problem is a more general one, and it wasn't created by this issue, nor even by 
compressing term vectors, but is older than that... and this issue is closed.)




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2013-08-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730673#comment-13730673
 ] 

Michael McCandless commented on LUCENE-5156:


+1


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2013-08-06 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730688#comment-13730688
 ] 

Adrien Grand commented on LUCENE-5156:
--

Patch looks good, +1




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2013-08-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730869#comment-13730869
 ] 

ASF subversion and git services commented on LUCENE-5156:
-

Commit 1511009 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1511009 ]

LUCENE-5156: remove seek-by-ord from CompressingTermVectors




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2013-08-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730880#comment-13730880
 ] 

ASF subversion and git services commented on LUCENE-5156:
-

Commit 1511014 from [~rcmuir] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511014 ]

LUCENE-5156: remove seek-by-ord from CompressingTermVectors




[jira] [Commented] (LUCENE-5156) CompressingTermVectors termsEnum should probably not support seek-by-ord

2013-08-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727443#comment-13727443
 ] 

Adrien Grand commented on LUCENE-5156:
--

+1 this is trappy
