Re: issue with automatic iterable detection?

2011-03-03 Thread Andi Vajda


Indeed, this is why I put that assertion there :-)
It's a bit of guesswork figuring out what all the possibilities are.
I'll add support for arrays there.

Andi..

On Thu, 3 Mar 2011, Bill Janssen wrote:


This looks like a problem.

This is with an svn checkout of branch_3x.

Bill

Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py", line 98, in <module>
    cpp.jcc(sys.argv)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 548, in jcc
    addRequiredTypes(cls, typeset, generics)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 233, in addRequiredTypes
    addRequiredTypes(cls, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 240, in addRequiredTypes
    raise NotImplementedError, repr(cls)
NotImplementedError: <Type: double[]>
%



Re: issue with automatic iterable detection?

2011-03-03 Thread Andi Vajda


On Thu, 3 Mar 2011, Bill Janssen wrote:


Did a fresh checkout and here's the next issue.

This one may be harder to fix...


No, it's just another one of these Type classes, WildcardType.
I should have a fix shortly. Sorry for the mess.

Andi..



Bill

Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py", line 98, in <module>
    cpp.jcc(sys.argv)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 551, in jcc
    addRequiredTypes(cls, typeset, generics)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 233, in addRequiredTypes
    addRequiredTypes(cls, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 243, in addRequiredTypes
    raise NotImplementedError, repr(cls)
NotImplementedError: <Type: ?>



Re: issue with automatic iterable detection?

2011-03-03 Thread Andi Vajda


On Thu, 3 Mar 2011, Bill Janssen wrote:


Andi Vajda va...@apache.org wrote:


  Bill,

Did that solve your problem ?


Hmmm, I'm still seeing it.  And some other stuff:


Could you please send me the Java code that triggers this ?

Andi..


build/_GoodStuff/__wrap03__.cpp: In function ‘PyObject* 
com::parc::goodstuff::relations::t_Something1SeqIterator_nextElement(com::parc::goodstuff::relations::t_Something1SeqIterator*,
 PyObject*)’:
build/_GoodStuff/__wrap03__.cpp:9122: error: ‘class 
com::parc::goodstuff::relations::t_Something1SeqIterator’ has no member named 
‘parameters’
build/_GoodStuff/__wrap03__.cpp:9122: error: ‘class 
com::parc::goodstuff::relations::t_Something1SeqIterator’ has no member named 
‘parameters’
build/_GoodStuff/__wrap03__.cpp: In function ‘PyObject* 
com::parc::goodstuff::family::t_Something2Iterator_nextElement(com::parc::goodstuff::family::t_Something2Iterator*,
 PyObject*)’:
build/_GoodStuff/__wrap03__.cpp:15376: error: ‘class 
com::parc::goodstuff::family::t_Something2Iterator’ has no member named 
‘parameters’
build/_GoodStuff/__wrap03__.cpp:15376: error: ‘class 
com::parc::goodstuff::family::t_Something2Iterator’ has no member named 
‘parameters’
build/_GoodStuff/__wrap03__.cpp: At global scope:
build/_GoodStuff/__wrap03__.cpp:27749: error: ‘t_JArray’ was not declared in 
this scope
build/_GoodStuff/__wrap03__.cpp:27749: error: parse error in template argument 
list
build/_GoodStuff/__wrap03__.cpp:27749: error: insufficient contextual 
information to determine type
build/_GoodStuff/__wrap03__.cpp:27749: warning: ‘>>’ operator will be treated 
as two right angle brackets in C++0x
build/_GoodStuff/__wrap03__.cpp:27749: warning: suggest parentheses around ‘>>’ 
expression
build/_GoodStuff/__wrap03__.cpp:27749: error: spurious ‘>>’, use ‘>’ to 
terminate a template argument list
build/_GoodStuff/__wrap03__.cpp:27749: error: expected primary-expression 
before ‘)’ token
build/_GoodStuff/__wrap03__.cpp:27749: error: too many initializers for 
‘PyTypeObject’
build/_GoodStuff/__wrap03__.cpp:41430: error: ‘t_JArray’ was not declared in 
this scope
build/_GoodStuff/__wrap03__.cpp:41430: error: parse error in template argument 
list
build/_GoodStuff/__wrap03__.cpp:41430: error: insufficient contextual 
information to determine type
build/_GoodStuff/__wrap03__.cpp:41430: error: expected primary-expression 
before ‘)’ token
build/_GoodStuff/__wrap03__.cpp:41430: error: too many initializers for 
‘PyTypeObject’
error: command 'gcc' failed with exit status 1

I think when I tried it this afternoon (I was running out the door and
kind of rushed) I just did a wrap, and not a --build.

Sorry about that.

Bill




Andi..

On Feb 28, 2011, at 20:05, Andi Vajda va...@apache.org wrote:



On Sun, 27 Feb 2011, Bill Janssen wrote:


Andi Vajda va...@apache.org wrote:


It may be simplest if you can send me the source file for this class
as well as a small jar file I can use to reproduce this ?


Turns out to be simple to reproduce.  Put the attached in a file called
test.java, and run this sequence:

% javac -classpath . test.java
% jar cf test.jar *.class
% python -m jcc.__main__ --python test --shared --jar /tmp/test.jar --build 
--vmarg -Djava.awt.headless=true


This was a tougher one. It was triggered by a combination of things:
 - no wrapper requested for java.io.File, and no --package java.io
 - a subclass of a parameterized class, or an implementor of a
   parameterized interface, wasn't pulling in the classes used as type
   parameters (java.io.File here).
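
For illustration, here is a minimal, hypothetical class of the kind that
triggers this; it stands in for the actual test.java attachment (not shown
here): an implementor of a parameterized interface whose type parameter is
java.io.File.

    import java.io.File;
    import java.util.Iterator;

    // Hypothetical trigger: the type parameter java.io.File is used by a
    // parameterized supertype but is not itself requested for wrapping.
    public class test implements Iterator<File> {
        public boolean hasNext() { return false; }
        public File next() { return null; }
        public void remove() { throw new UnsupportedOperationException(); }
    }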

A fix is checked into jcc trunk/branch_3x rev 1075642.
This also includes the earlier fix about using absolute class names.

Andi..




Using JCC / PyLucene with JEPP?

2011-03-03 Thread Bill Janssen
New topic.

I'd like to wrap my UpLib codebase, which is Python using PyLucene, in
Java using JEPP (http://jepp.sourceforge.net/), so that I can use it
with Tomcat.

Now, am I going to have to do some trickery to get a VM?  Or will
getVMEnv() just work with a previously initialized JVM?

Bill


Re: issue with automatic iterable detection?

2011-03-03 Thread Bill Janssen
Here's one of the generated lines that's causing me grief.

DECLARE_TYPE(RankIterator, t_RankIterator, ::java::lang::Object, 
RankIterator, t_RankIterator_init_, PyObject_SelfIter, ((PyObject 
*(*)(t_RankIterator *)) get_next<t_RankIterator,t_JArray<jint>,JArray<jint>>), 
t_RankIterator__fields_, 0, 0);

It yields this:

build/_PPD/__wrap02__.cpp:27284: error: ‘t_JArray’ was not declared in this 
scope
build/_PPD/__wrap02__.cpp:27284: error: parse error in template argument list
build/_PPD/__wrap02__.cpp:27284: error: insufficient contextual information to 
determine type
build/_PPD/__wrap02__.cpp:27284: warning: ‘>>’ operator will be treated as two 
right angle brackets in C++0x
build/_PPD/__wrap02__.cpp:27284: warning: suggest parentheses around ‘>>’ 
expression
build/_PPD/__wrap02__.cpp:27284: error: spurious ‘>>’, use ‘>’ to terminate a 
template argument list
build/_PPD/__wrap02__.cpp:27284: error: expected primary-expression before ‘)’ 
token
build/_PPD/__wrap02__.cpp:27284: error: too many initializers for ‘PyTypeObject’

Where does t_JArray get defined?  I can't find it.

Bill


Re: issue with automatic iterable detection?

2011-03-03 Thread Andi Vajda


On Thu, 3 Mar 2011, Andi Vajda wrote:



On Mar 3, 2011, at 22:09, Bill Janssen jans...@parc.com wrote:


Here's one of the generated lines that's causing me grief.

   DECLARE_TYPE(RankIterator, t_RankIterator, ::java::lang::Object, RankIterator, 
t_RankIterator_init_, PyObject_SelfIter, ((PyObject *(*)(t_RankIterator *)) 
get_next<t_RankIterator,t_JArray<jint>,JArray<jint>>),


Ah yes, that's invalid C++. Nested generics need a space between the two '>' of 
the closing '>>'. Otherwise, the C++ parser reads it as the right-shift operator, 
believe it or not. Should be easy enough to fix in jcc.


Fixed in trunk/branch_3x rev 1077828.

Andi..



Andi..


t_RankIterator__fields_, 0, 0);

It yields this:

build/_PPD/__wrap02__.cpp:27284: error: ‘t_JArray’ was not declared in this 
scope
build/_PPD/__wrap02__.cpp:27284: error: parse error in template argument list
build/_PPD/__wrap02__.cpp:27284: error: insufficient contextual information to 
determine type
build/_PPD/__wrap02__.cpp:27284: warning: ‘>>’ operator will be treated as two 
right angle brackets in C++0x
build/_PPD/__wrap02__.cpp:27284: warning: suggest parentheses around ‘>>’ 
expression
build/_PPD/__wrap02__.cpp:27284: error: spurious ‘>>’, use ‘>’ to terminate a 
template argument list
build/_PPD/__wrap02__.cpp:27284: error: expected primary-expression before ‘)’ 
token
build/_PPD/__wrap02__.cpp:27284: error: too many initializers for ‘PyTypeObject’

Where does t_JArray get defined?  I can't find it.

Bill




[jira] Commented: (SOLR-1489) A UTF-8 character is output twice (Bug in Jetty)

2011-03-03 Thread Jun Ohtani (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001909#comment-13001909
 ] 

Jun Ohtani commented on SOLR-1489:
--

Sekiguchi-san, I checked bufsize [510-512] and only saw "B" once. Maybe it is OK. 

 A UTF-8 character is output twice (Bug in Jetty)
 

 Key: SOLR-1489
 URL: https://issues.apache.org/jira/browse/SOLR-1489
 Project: Solr
  Issue Type: Bug
 Environment: Jetty-6.1.3
 Jetty-6.1.21
 Jetty-7.0.0RC6
Reporter: Jun Ohtani
Assignee: Koji Sekiguchi
Priority: Critical
 Attachments: SOLR-1489.patch, error_utf8-example.xml, 
 jetty-6.1.22.jar, jetty-util-6.1.22.jar, jettybugsample.war, jsp-2.1.zip, 
 servlet-api-2.5-20081211.jar


 A UTF-8 character is output twice under particular conditions.
 I attach the sample data (error_utf8-example.xml).
 Register only the sample data, then click the following URL:
 http://localhost:8983/solr/select?q=*%3A*&version=2.2&start=0&rows=10&omitHeader=true&fl=attr_json&wt=json
 The sample data is only "B", but the response is "BB".
 When wt=phps, an error occurs in the PHP unserialize() function.
 This looks like a bug in Jetty.
 jettybugsample.war is the simplest way to reproduce the problem.
 Copy it to example/webapps, start the Jetty server, and click the following URL:
 http://localhost:8983/jettybugsample/filter/hoge
 As before, "B" is output twice; sysout shows "B" only once.
 I have tested this on Jetty 6.1.3, 6.1.21, and 7.0.0rc6.
 (When testing with 6.1.21 or 7.0.0rc6, change bufsize from 128 to 512 in 
 web.xml.)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Solr-trunk - Build # 1428 - Failure

2011-03-03 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1428/

All tests passed

Build Log (for compile errors):
[...truncated 14806 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Boost function problem with disquerymax

2011-03-03 Thread Gastone Penzo
You are right. It was not an indexed field, just stored.
Thanx

2011/3/2 Yonik Seeley yo...@lucidimagination.com

 On Wed, Mar 2, 2011 at 11:34 AM, Gastone Penzo gastone.pe...@gmail.com
 wrote:
  Hi,
  for search I use dismax
  and I want to boost a field with the bf parameter, like:
  ...bf=boost_has_img^5
  the boost_has_img field of my document is 3:
  <int name="boost_has_img">3</int>
  if I see the results in debug query mode I can see:
0.0 = (MATCH) FunctionQuery(int(boost_has_img)), product of:
  0.0 = int(boost_has_img)=0
  5.0 = boost
  0.06543833 = queryNorm
  why is the score 0 if the value is 3 and the boost is 5???

 Solr thinks the value of boost_has_img is 0 for that document.
 Is boost_has_img an indexed field?
 If so, verify that the value is actually 3 for that specific document.


 -Yonik
 http://lucidimagination.com




-- 

Gastone Penzo
Webster Srl
www.webster.it
www.libreriauniversitaria.it


perfect match in dismax search

2011-03-03 Thread Gastone Penzo
How do I obtain a perfect match with a dismax query?

e.g.:

I want to search for "hello i love you" with defType=dismax in the title field,
and I want to obtain results whose title is exactly "hello i love you", with
all these terms in this order.

No fewer words, nothing more.
How is it possible?

I tried with +(hello i love you), but if I have a title which is "hello i
love you mum", it matches, and I don't want that!

Thanx


-- 

Gastone Penzo
Webster Srl
www.webster.it
www.libreriauniversitaria.it


[jira] Resolved: (SOLR-1489) A UTF-8 character is output twice (Bug in Jetty)

2011-03-03 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-1489.
--

   Resolution: Fixed
Fix Version/s: 4.0
   3.1

Marking resolved as duplicate of SOLR-2381.

 A UTF-8 character is output twice (Bug in Jetty)
 

 Key: SOLR-1489
 URL: https://issues.apache.org/jira/browse/SOLR-1489
 Project: Solr
  Issue Type: Bug
 Environment: Jetty-6.1.3
 Jetty-6.1.21
 Jetty-7.0.0RC6
Reporter: Jun Ohtani
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.1, 4.0

 Attachments: SOLR-1489.patch, error_utf8-example.xml, 
 jetty-6.1.22.jar, jetty-util-6.1.22.jar, jettybugsample.war, jsp-2.1.zip, 
 servlet-api-2.5-20081211.jar


 A UTF-8 character is output twice under particular conditions.
 I attach the sample data (error_utf8-example.xml).
 Register only the sample data, then click the following URL:
 http://localhost:8983/solr/select?q=*%3A*&version=2.2&start=0&rows=10&omitHeader=true&fl=attr_json&wt=json
 The sample data is only "B", but the response is "BB".
 When wt=phps, an error occurs in the PHP unserialize() function.
 This looks like a bug in Jetty.
 jettybugsample.war is the simplest way to reproduce the problem.
 Copy it to example/webapps, start the Jetty server, and click the following URL:
 http://localhost:8983/jettybugsample/filter/hoge
 As before, "B" is output twice; sysout shows "B" only once.
 I have tested this on Jetty 6.1.3, 6.1.21, and 7.0.0rc6.
 (When testing with 6.1.21 or 7.0.0rc6, change bufsize from 128 to 512 in 
 web.xml.)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001967#comment-13001967
 ] 

Michael McCandless commented on LUCENE-2822:


I think we should stick with our private timer thread (and we should definitely 
make it stop-able).

I've seen too many problems associated with relying on the system's time for 
important things like timing out queries, eg when daylight savings time 
strikes, or the clock is being aggressively corrected, and suddenly a bunch 
of queries are truncated.  In theory System.nanoTime should be immune to this 
(it's the system's timer and not any notion of wall clock time), but in 
practice, I don't think we should risk it.

 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2385) Backport latest /browse improvements to branch_3x

2011-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001969#comment-13001969
 ] 

Jan Høydahl commented on SOLR-2385:
---

I classified SOLR-2383 as a Bug, not a feature, because most people downloading 
Solr 3.1 will start customizing facets and get puzzled when the range facet 
still reads "Price ($)" and their own facets do not show up. I'm sure this 
will generate a bunch of traffic on the mailing lists.

 Backport latest /browse improvements to branch_3x
 -

 Key: SOLR-2385
 URL: https://issues.apache.org/jira/browse/SOLR-2385
 Project: Solr
  Issue Type: Improvement
  Components: Response Writers
Affects Versions: 3.1
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
  Labels: velocity
 Fix For: 3.1

 Attachments: SOLR-2385.patch, SOLR-2385.patch


 There are a lot of improvements in TRUNK Velocity GUI which will work well 
 even for 3.1

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2383) Velocity: Generalize range and date facet display

2011-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001970#comment-13001970
 ] 

Jan Høydahl commented on SOLR-2383:
---

I would appreciate it if someone could test out this patch on your own data, 
trying various combinations of facet.range and gaps, to see if it is watertight.

 Velocity: Generalize range and date facet display
 -

 Key: SOLR-2383
 URL: https://issues.apache.org/jira/browse/SOLR-2383
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Reporter: Jan Høydahl
  Labels: facet, range, velocity
 Attachments: SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch


 Velocity (/browse) GUI has hardcoded price range facet and a hardcoded 
 manufacturedate_dt date facet. Need general solution which work for any 
 facet.range and facet.date.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: wind down for 3.1?

2011-03-03 Thread Sanne Grinovero
Hello all,
Is there any update on the 3.1 status?
I'm really looking forward to it :)

Regards,
Sanne


2011/2/16 Chris Hostetter hossman_luc...@fucit.org:

 : 1. javadocs warnings/errors: this is a constant battle, it's worth
 : considering if the build should actually fail if you get one of these,
 : in my opinion if we can do this we really should. it's frustrating to

 for a brief period we did, and then we rolled it back...

        https://issues.apache.org/jira/browse/LUCENE-875

 : 2. introducing new compiler warnings: another problem just being left
 : for someone else to clean up later, another constant losing battle.
 : 99% of the time (for non-autogenerated code) the warnings are
 : useful... in my opinion we should not commit patches that create new
 : warnings.

 it's hard to spot new compiler warnings when there are already so many
 ... if we can get down to 0 then we can add hacks to make the build fail
 if someone adds 1 but until then we have an uphill battle.


 -Hoss

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: wind down for 3.1?

2011-03-03 Thread Robert Muir
On Thu, Mar 3, 2011 at 7:43 AM, Sanne Grinovero
sanne.grinov...@gmail.com wrote:
 Hello all,
 Is there any update on the 3.1 status?
 I'm really looking forward to it :)


Yes, we are currently in the feature freeze, but it seems to be taking shape.

I'm planning on creating the release branch this weekend and getting
our first RC out Sunday (Steven Rowe volunteered to help with the
maven side, thanks!).

If you want to help, for example you can checkout the lucene code from
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/
then you can run 'ant clean dist dist-src' and inspect the artifacts
it puts in the dist/ folder and report any problems.

If everyone waits until we build an RC before reviewing how things
look and reporting problems, it's going to significantly slow down the
release process, as generating RCs for both Lucene and Solr at the
moment is nontrivial (which is why Steven and I have set aside
this day to try to build RC1; if the vote doesn't pass it might be
weeks before we have the time to build RC2).

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: wind down for 3.1?

2011-03-03 Thread Sanne Grinovero
2011/3/3 Robert Muir rcm...@gmail.com:
 On Thu, Mar 3, 2011 at 7:43 AM, Sanne Grinovero
 sanne.grinov...@gmail.com wrote:
 Hello all,
 Is there any update on the 3.1 status?
 I'm really looking forward to it :)


 Yes, we are currently in the feature freeze, but it seems to be taking 
 shape.

 I'm planning on creating the release branch this weekend and getting
 our first RC out Sunday (Steven Rowe volunteered to help with the
 maven side, thanks!).

 If you want to help, for example you can checkout the lucene code from
 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/
 then you can run 'ant clean dist dist-src' and inspect the artifacts
 it puts in the dist/ folder and report any problems.

 If everyone waits until we build an RC before reviewing how things
 look and reporting problems, it's going to significantly slow down the
 release process, as generating RCs for both Lucene and Solr at the
 moment is nontrivial (which is why Steven and I have set aside
 this day to try to build RC1; if the vote doesn't pass it might be
 weeks before we have the time to build RC2).

Cheers, thanks a lot. I'm definitely testing it often, and will report
anything weird.
I can't speak for Solr, though, as we mostly use Lucene.

Sanne

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002002#comment-13002002
 ] 

Robert Muir commented on LUCENE-2822:
-

bq. I think we should stick with our private timer thread (and we should 
definitely make it stop-able).

And no private thread should start in the static initializer... it's fine for 
all instances to share a single private timer thread, but it should be 
lazy-loaded, along the lines of the sketch below.
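
A minimal sketch of that idea (not the actual Lucene patch; the TimerThread
and stopTimer() names here are illustrative assumptions), using the
initialization-on-demand holder idiom so that merely loading the collector
class starts no thread:

    class TimerThread extends Thread {
      private volatile long time;     // coarse clock, advanced by this thread
      private volatile boolean stop;  // makes the shared thread stoppable

      TimerThread() {
        super("TimeLimitingCollector timer thread");
        setDaemon(true);
      }

      @Override public void run() {
        while (!stop) {
          time += 10;
          try { Thread.sleep(10); } catch (InterruptedException e) { break; }
        }
      }

      long getMilliseconds() { return time; }
      void stopTimer() { stop = true; interrupt(); }
    }

    final class TimerHolder {
      // The JVM runs this initializer only on first access to TimerHolder,
      // i.e. on the first search that actually uses a time limit, not when
      // the collector class itself is loaded via Class.forName.
      static final TimerThread TIMER = new TimerThread();
      static { TIMER.start(); }
    }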


 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Lucene.Net] CI Task Update: Hudkins

2011-03-03 Thread Michael Herndon
Sorry, I've been just reading through the list & wiki
(http://wiki.apache.org/general/Hudson).

I subscribed to the list last night and haven't received the usual message
for activating subscriptions; I'll try again today and get on the list to
see what is already on the Windows slave.

We also have to ask if others are interested in tools that are not installed
and, if not, install them under home/username.

-  michael.

On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard thowar...@gmail.com wrote:

 I've been following builds@ for the past couple of days. Looks like
 they just finished the migration to Jenkins.

 Michael - Have you had a chance to contact them and find out what
 tools are available out of our list? Want me to do that?

 Thanks,
 Troy

 On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard slomb...@theta.net wrote:
  +1
 
  Scott
 
  On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig bode...@apache.org
 wrote:
 
  On 2011-02-28, Troy Howard wrote:
 
   One quick concern I have is how many of the things listed are already
   available on the Apache Hudson server?
 
  builds@apache is the place to ask.
 
   A lot of this is .NET specific, so unlikely that it will already be
   available.
 
  well, the DotCMIS build seems to be using Sandcastle Helpfile Builder,
  judging by the console output.
 
   We'll have to request that ASF Infra team install these tools for us,
   and they may not agree, or there might be licensing issues, etc.. Not
   sure. I'd start the conversation with them now to suss this out.
 
  Really, go to the builds list.  License issues usually don't show up for
  build tools.  It would be good if anybody on the team could volunteer time
  to help administer the Windows slave.
 
   - Mono is going to be a requirement moving forward
 
  This could be done on a non-Windows slave just to be completely sure it
  works.  This may require installing a newer Mono (or just pulling in
  a different Debian package source for Mono) than is installed by
  default.
 
   - Project structure was being discussed on the LUCENENET-377 thread.
 
  As a quick note, in general we prefer the mailing list or JIRA for
  discussions around the ASF.
 
  Stefan
 
 



[jira] Created: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread David Byrne (JIRA)
NGramTokenizer shouldn't trim whitespace


 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor


Before I tokenize my strings, I am padding them with white space:

String foobar = " " + foo + " " + bar + " ";

When constructing term vectors from ngrams, this strategy has a couple 
benefits.  First, it places special emphasis on the starting and ending of a 
word.  Second, it improves the similarity between phrases with swapped words.  
" foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".

The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
to do some preprocessing on my strings before I can tokenize them:

foobar.replaceAll(" ","$"); //arbitrary char not in my data

This is undocumented, so users won't realize their strings are being trim()'ed, 
unless they look through the source, or examine the tokens manually.

I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
there a compelling reason against this?


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002019#comment-13002019
 ] 

Robert Muir commented on LUCENE-2947:
-

Hi Dave, in my opinion there are a lot of problems with our current 
NGramTokenizer (yours is just one) and it would be a good idea to consider 
creating a new one. We could rename the old one to ClassicNGramTokenizer or 
something for people that need the backwards compatibility.

A lot of the problems already have open JIRA issues: I gave my opinion on some 
of the broken-ness here: LUCENE-1224. The largest problem is that these 
tokenizers only examine the first 1024 chars of the document. They shouldn't 
just discard anything after 1024 chars. There is no need to load the 'entire 
document' into memory... n-gram tokenization can work on a sliding window 
across the document.

In my opinion, part of n-gram character tokenization is being able to configure 
what is a token character and what is not. (Note I don't mean a Java character 
here, but a character in the more abstract sense, e.g. a character might have 
diacritics and be treated as a single unit.) For some applications maybe this 
is just 'alphabetic letters'; for other apps perhaps even punctuation could be 
considered 'relevant'. So it should somehow be flexible.  Furthermore, in the 
case of word-spanning n-grams, you should be able to collapse runs of 
non-characters into a single marker (e.g. _), and usually you would want to 
do this for the start and end of the string too.

Here's a visual representation of how things should look when you use these 
tokenizers, in my opinion:
http://www.csee.umbc.edu/~nicholas/601/SIGIR08-Poster.pdf
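
A minimal, hypothetical sketch of that sliding-window idea, independent of
the Lucene Tokenizer API (for brevity it treats UTF-16 code units as
characters, sidestepping the supplementary-character issue raised above):

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    class SlidingNGrams {
      // Emits character n-grams from a window slid over the reader, never
      // buffering the whole document. Runs of non-letters, plus the start
      // and end of input, collapse to a single '_' marker.
      static List<String> ngrams(Reader in, int n) throws IOException {
        List<String> out = new ArrayList<String>();
        StringBuilder window = new StringBuilder("_"); // start-of-string marker
        boolean lastWasMarker = true;
        int c;
        while ((c = in.read()) != -1) {
          boolean letter = Character.isLetter((char) c);
          if (!letter && lastWasMarker) continue;      // collapse marker runs
          window.append(letter ? (char) c : '_');
          lastWasMarker = !letter;
          if (window.length() > n) window.deleteCharAt(0);
          if (window.length() == n) out.add(window.toString());
        }
        if (!lastWasMarker) {                          // end-of-string marker
          window.append('_');
          if (window.length() > n) window.deleteCharAt(0);
          if (window.length() == n) out.add(window.toString());
        }
        return out;
      }

      public static void main(String[] args) throws IOException {
        // prints [_fo, foo, oo_, o_b, _ba, bar, ar_]
        System.out.println(ngrams(new StringReader("foo bar"), 3));
      }
    }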

 NGramTokenizer shouldn't trim whitespace
 

 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor

 Before I tokenize my strings, I am padding them with white space:
 String foobar = " " + foo + " " + bar + " ";
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread David Byrne (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Byrne updated LUCENE-2947:


Attachment: NGramTokenizerTest.java

A simple failing JUnit test illustrating the problem.

 NGramTokenizer shouldn't trim whitespace
 

 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor
 Attachments: NGramTokenizerTest.java


 Before I tokenize my strings, I am padding them with white space:
 String foobar = " " + foo + " " + bar + " ";
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread David Byrne (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002026#comment-13002026
 ] 

David Byrne commented on LUCENE-2947:
-

Thanks for the feedback, Robert.  I'll give it a shot and try to write a new 
one.  I wanted to write a tokenizer to support skip-grams anyway.

 NGramTokenizer shouldn't trim whitespace
 

 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor
 Attachments: NGramTokenizerTest.java


 Before I tokenize my strings, I am padding them with white space:
 String foobar = " " + foo + " " + bar + " ";
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002031#comment-13002031
 ] 

Robert Muir commented on LUCENE-2947:
-

Thank you... by the way if you want to do skip-grams as a separate tokenizer or 
whatever, you know whatever makes sense...

I could imagine some of the n-gram variations might need to be their own 
tokenizers to prevent things from being too complicated but perhaps they could 
still share some code.

(But maybe you have some way to fit skipgrams in there easily)


 NGramTokenizer shouldn't trim whitespace
 

 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor
 Attachments: NGramTokenizerTest.java


 Before I tokenize my strings, I am padding them with white space:
 String foobar = " " + foo + " " + bar + " ";
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2011-03-03 Thread David Byrne (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002035#comment-13002035
 ] 

David Byrne commented on LUCENE-2947:
-

Yeah, I was originally planning to implement skip-grams as a separate tokenizer. 
Since we are re-evaluating ngram tokenization in general, maybe I can come up 
with an elegant solution.  Support for positional ngrams is another thing to 
consider.

 NGramTokenizer shouldn't trim whitespace
 

 Key: LUCENE-2947
 URL: https://issues.apache.org/jira/browse/LUCENE-2947
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor
 Attachments: NGramTokenizerTest.java


 Before I tokenize my strings, I am padding them with white space:
 String foobar = " " + foo + " " + bar + " ";
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: LUCENE-2948.patch

Initial patch.  This is a checkpoint of work-in-progress -- all tests pass, but 
there are zillions of nocommits to be resolved...

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Michael McCandless (JIRA)
Make var gap terms index a partial prefix trie
--

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0
 Attachments: LUCENE-2948.patch

Var gap stores (in an FST) the indexed terms (every 32nd term, by
default), minus their non-distinguishing suffixes.

However, often times the resulting FST is close to a prefix trie in
some portion of the terms space.

By allowing some nodes of the FST to store all outgoing edges,
including ones that do not lead to an indexed term, and by recording
that this node is then authoritative as to what terms exist in the
terms dict from that prefix, we can get some important benefits:

  * It becomes possible to know that a certain term prefix cannot
exist in the terms index, which means we can save a disk seek in
some cases (like PK lookup, docFreq, etc.)

  * We can query for the next possible prefix in the index, allowing
some MTQs (eg FuzzyQuery) to save disk seeks.

Basically, the terms index is able to answer questions that previously
required seeking/scanning in the terms dict file.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2381) The included jetty server does not support UTF-8

2011-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002069#comment-13002069
 ] 

Uwe Schindler commented on SOLR-2381:
-

Ok, thanks for reporting back. So there may have been a problem in the past with 
XMLWriter, which is solved in Lucene trunk. Can you also check branch_3x 
(Lucene 3.1)? That is the next release, and trunk (Lucene 4.0) is very 
unstable.

 The included jetty server does not support UTF-8
 

 Key: SOLR-2381
 URL: https://issues.apache.org/jira/browse/SOLR-2381
 Project: Solr
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: SOLR-2381.patch, SOLR-ServletOutputWriter.patch, 
 jetty-6.1.26-patched-JETTY-1340.jar, jetty-util-6.1.26-patched-JETTY-1340.jar


 Some background here: 
 http://www.lucidimagination.com/search/document/6babe83bd4a98b64/which_unicode_version_is_supported_with_lucene
 Some possible solutions:
 * wait and see if we get resolution on 
 http://jira.codehaus.org/browse/JETTY-1340. To be honest, I am not even sure 
 where jetty is being maintained (there is a separate jetty project at 
 eclipse.org with another bugtracker, but the older releases are at codehaus).
 * include a patched version of jetty with correct utf-8, using that patch.
 * remove jetty and include a different container instead.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002071#comment-13002071
 ] 

Uwe Schindler commented on LUCENE-2822:
---

bq. I think we should stick with our private timer thread (and we should 
definitely make it stop-able).

I think this is still the best option, as both System.nanoTime() and 
System.currentTimeMillis() use system calls that are really expensive. 
nanoTime() has no wall-clock problems, that's true, but it is still a no-go 
for every collected hit!

 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2948:
---

Attachment: LUCENE-2948.patch

New patch -- changes nextPossiblePrefix to return a SeekStatus.

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Maven Lucene Plugin : Contribution

2011-03-03 Thread Paritosh Ranjan

Hi,

I have released an open source project, maven-lucene-plugin 
(https://sourceforge.net/projects/lucene-plugin/), hosted at SourceForge. 
It's a Maven plugin with which an index can be created (without writing any 
code) from a file source. The structure of the index can be defined in 
a lucene.xml file. I have also created a dependency which provides 
easy-to-use methods that work on the same index created by the 
maven-lucene-plugin (using the same lucene.xml).


This plugin removes the need to know Lucene's API in order to use Lucene.

The full documentation about the plugin can be found here: 
http://xebee.xebia.in/2011/02/28/maven-lucene-plugin/. The plugin is 
available in the Central Maven Repository 
(http://repo1.maven.org/maven2/com/) and can also be browsed here: 
https://oss.sonatype.org/index.html#nexus-search;quick%7Emaven-lucene-plugin.


I would highly appreciate your feedback on the plugin, along with some 
suggestions on how to improve it. As this is only the first version, we 
have a lot to develop in the plugin. If you are interested in contributing 
to the plugin, please write to me. That would be very helpful.


If Apache wants, I would be glad to donate the plugin to Apache to make 
it stronger. Right now, the source code is on SourceForge (here: 
https://lucene-plugin.svn.sourceforge.net/svnroot/lucene-plugin/trunk/maven-lucene-plugin/) 
and the artifacts are in the Central Maven Repository (here: 
http://repo1.maven.org/maven2/com/xebia/).


Thanks and Regards,
Paritosh Ranjan



Re: Unintuitive NGramTokenizer behavior

2011-03-03 Thread Grant Ingersoll

On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

 I have a minor quibble about Lucene's NGramTokenizer.
 
 Before I tokenize my strings, I am padding them with white space:
 
 String foobar = " " + foo + " " + bar + " ";
 
 When constructing term vectors from ngrams, this strategy has a couple 
 benefits.  First, it places special emphasis on the starting and ending of a 
 word.  Second, it improves the similarity between phrases with swapped words. 
 " foo bar " matches " bar foo " more closely than "foo bar" matches "bar 
 foo".
 
 

I'm not following this argument.  What does the extra whitespace give you here? 
 

 The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
 to do some preprocessing on my strings before I can tokenize them:
 
 foobar.replaceAll(" ","$"); //arbitrary char not in my data
 
 

I'm confused.  If you are padding them up front, then why don't you just do the 
arbitrary char trick then?  Where is the extra processing?

 This is undocumented, so users won't realize their strings are being 
 trim()'ed, unless they look through the source, or examine the tokens 
 manually.
 
 

It may be undocumented, but I think it is pretty standard as to what users 
expect out of a tokenizer.

 I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
 there a compelling reason against this?
 
 

Unfortunately, I'm not following your reasons for doing it.  I won't say I'm 
against it at this point, but I don't see a compelling reason to change it 
either, so if you could clarify that would be great.  It's been around for quite 
some time in its current form and I think it fits most people's expectations of 
ngrams.

-Grant

[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002115#comment-13002115
 ] 

Mark Harwood commented on LUCENE-2822:
--

FYI - I visited a site today using LUCENE-1720 live on a large index (2 
billion docs, sharded, with 5-minute update intervals). They haven't noticed any 
significant degradation of search performance as a result of using this approach.


 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Unintuitive NGramTokenizer behavior

2011-03-03 Thread Robert Muir
On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll gsing...@apache.org wrote:

 Unfortunately, I'm not following your reasons for doing it.  I won't say I'm
 against it at this point, but I don't see a compelling reason to change it
 either so if you could clarify that would be great.  It's been around for
 quite some time in its current form and I think it fits most people's
 expectations of ngrams.

Grant, I'm sorry, but I couldn't disagree more.

There are many variations on ngram tokenization (word-internal,
word-spanning, skipgrams), besides allowing flexibility for what
should be a word character and what should not be (e.g.
punctuation), and how to handle the specifics of these.

But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:
1. it discards anything after the first 1024 code units of the document.
2. it uses partial characters (UTF-16 code units) as its fundamental
measure, potentially creating lots of invalid unicode.
3. it forms n-grams in the wrong order, contributing to #1. I
explained this in LUCENE-1224.

It's these reasons that made me suggest we completely rewrite it... people
that are just indexing English documents with < 1024 chars per
document and don't care about these things can use
ClassicNGramTokenizer.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [Lucene.Net] how to add a new record to existing index

2011-03-03 Thread Digy
I don't think that I understand your problem.
Is it something like 
IndexWriter writer = new IndexWriter(path, analyzer, *false*,
IndexWriter.DEFAULT_MAX_FIELD_LENGTH);
..
writer.AddDocument(doc);

DIGY
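
In Java Lucene terms (the Lucene.Net API mirrors this), the key is the third
constructor argument: create == false opens the existing index for appending
rather than building a new one. A minimal sketch, with a hypothetical index
path and field:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Open the existing index (create == false) and append one document.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/index")),
        new StandardAnalyzer(Version.LUCENE_30),
        false,  // false = append to the existing index, do not recreate it
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("lmname", "tom", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();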

-Original Message-
From: Wen Gao [mailto:samuel.gao...@gmail.com] 
Sent: Thursday, March 03, 2011 1:43 AM
To: lucene-net-...@lucene.apache.org
Subject: Re: [Lucene.Net] how to add a new record to existing index

Hi Digy,
It was my fault that I didn't say it clearly. I mean I have created an
index, but it is not updated in real time. So I want to update the index
every time after I add data to the database, to keep the index up-to-date.
My data is what the user inputs, which is inserted into the database.

BTW, I know how to delete a term from the index using IndexReader. Likewise, I
want to write a term to the created index instead of creating a new index.
I appreciate your time.

Thanks,
Wen

2011/3/2 Digy digyd...@gmail.com


 First of all, your code doesn't mean anything to me other than that you add
 some fields to a document object.

 Also, I can't see what you mean by an *existing* index. The directory you
 pass to the IndexWriter is the index you use, and every document added
 (using IndexWriter's AddDocument) is written to that
 index.

 I think we have problems using a common terminology.

 DIGY

 PS: It would be better if you used the user mailing list to ask questions.
 This
 mailing list is intended for development purposes.




 -Original Message-
 From: Wen Gao [mailto:samuel.gao...@gmail.com]
 Sent: Wednesday, March 02, 2011 11:02 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] how to add a new record to existing index

 Hi,
 I have already created an index, and I want to insert an index record into
 this existing index every time I insert a new record into the database.
 For example, if I want to insert a record ("l1", 15, "tom",
 20, "2010/01/02") into my *existing* index, how can I do this? (I don't want
 to create a new index, which takes too much time.)

 my format of index is as follows:
  ///
  doc.Add(new Lucene.Net.Documents.Field(
        "lmname",
        readerreader1["lmname"].ToString(),
        //new System.IO.StringReader(readerreader["cname"].ToString()),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.TOKENIZED)
  );

  //lmid
  doc.Add(new Lucene.Net.Documents.Field(
        "lmid",
        readerreader1["lmid"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // nick name of user
  doc.Add(new Lucene.Net.Documents.Field(
        "nickName",
        readerreader1["nickName"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // uid
  doc.Add(new Lucene.Net.Documents.Field(
        "uid",
        readerreader1["uid"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // acttime
  doc.Add(new Lucene.Net.Documents.Field(
        "acttime",
        readerreader1["acttime"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
  writer.AddDocument(doc);
  ///

 Thanks,
 Wen





Re: Unintuitive NGramTokenizer behavior

2011-03-03 Thread David Byrne
Grant,

To explain the advantage:

Trigrams for "foo bar": 'foo', 'oo ', 'o b', ' ba', 'bar'
Trigrams for "bar foo": 'bar', 'ar ', 'r f', ' fo', 'foo'

Only two out of eight unique trigrams match.

Trigrams for " foo bar ": ' fo', 'foo', 'oo ', 'o b', ' ba', 'bar', 'ar '
Trigrams for " bar foo ": ' ba', 'bar', 'ar ', 'r f', ' fo', 'foo', 'oo '

Six out of eight unique trigrams match.

I can't do the character replacement up front, because foo and bar might
already contain whitespace as well.  Anyway, it's a hack, and if my
arbitrary character ever gets introduced into the data I am in trouble.

Not only is this undocumented, but it seems unintentional if you look at the
comments in the code.

FYI, I opened up an issue regarding this: http://bit.ly/eqhTO1
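
A minimal sketch of the overlap computation above (illustration only, not
Lucene code):

public class TrigramOverlap {
    static java.util.Set<String> trigrams(String s) {
        java.util.Set<String> grams = new java.util.LinkedHashSet<String>();
        for (int i = 0; i + 3 <= s.length(); i++)
            grams.add(s.substring(i, i + 3));
        return grams;
    }

    public static void main(String[] args) {
        java.util.Set<String> a = trigrams(" foo bar ");
        java.util.Set<String> b = trigrams(" bar foo ");
        java.util.Set<String> union = new java.util.LinkedHashSet<String>(a);
        union.addAll(b);
        a.retainAll(b); // a now holds only the shared trigrams
        // prints "6 of 8"; without the padding it drops to 2 of 8
        System.out.println(a.size() + " of " + union.size());
    }
}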
 On Mar 3, 2011 1:00 PM, Grant Ingersoll gsing...@apache.org wrote:

 On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

 I have a minor quibble about Lucene's NGramTokenizer.

 Before I tokenize my strings, I am padding them with white space:

 String foobar = " " + foo + " " + bar + " ";

 When constructing term vectors from ngrams, this strategy has a couple
benefits. First, it places special emphasis on the starting and ending of a
word. Second, it improves the similarity between phrases with swapped words.
" foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".



 I'm not following this argument. What does the extra whitespace give you
here?

 The problem is that Lucene's NGramTokenizer trims whitespace. This forces
me to do some preprocessing on my strings before I can tokenize them:

 foobar.replaceAll(" ", "$"); // arbitrary char not in my data



 I'm confused. If you are padding them up front, then why don't you just do
the arbitrary char trick then? Where is the extra processing?

 This is undocumented, so users won't realize their strings are being
trim()'ed, unless they look through the source, or examine the tokens
manually.



 It may be undocumented, but I think it is pretty standard as to what users
expect out of a tokenizer.

 I am proposing NGramTokenizer should be changed to respect whitespace. Is
there a compelling reason against this?



 Unfortunately, I'm not following your reasons for doing it. I won't say
I'm against it at this point, but I don't see a compelling reason to change
it either so if you could clarify that would be great. It's been around for
quite some time in its current form and I think it fits most people's
expectations of ngrams.

 -Grant


[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002129#comment-13002129
 ] 

Uwe Schindler commented on LUCENE-2822:
---

Mark: But LUCENE-1720 does not use
System.nanoTime()/System.currentTimeMillis(), so what is your comment about?

 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.
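
 The anti-pattern, sketched (my illustration, not the actual
 TimeLimitingCollector source):

 public class StaticTimerHolder {
     static final Thread TIMER = new Thread() {
         { setDaemon(true); }
         @Override public void run() {
             while (true) {
                 // tick forever; no flag or interrupt hook to stop this
                 try { Thread.sleep(1000); } catch (InterruptedException e) {}
             }
         }
     };
     static {
         // runs as soon as anything (even Class.forName) initializes the class
         TIMER.start();
     }
 }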

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)
Solr Admin Interface, reworked
--

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor


*The idea was to create a new, fresh (and hopefully clean) Solr Admin 
Interface.* [Based on this 
[ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]

I've quickly created a Github-Repository (Just for me, to keep track of the 
changes)
» https://github.com/steffkes/solr-admin 

[This commit shows the 
differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
 between the old/existing index.jsp and my new one (which was copy-cut/paste'd 
from the existing one).

Main Action takes place in 
[js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] 
which is actually neither clean nor pretty .. just work-in-progress.

Actually it's Work in Progress, so ... give it a try. It's developed with 
Firefox as Browser, so, for a first impression .. please don't use _things_ 
like Internet Explorer or so ;o

Jan already suggested a bunch of good things, i'm sure there are more ideas 
over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Lucene.Net] CI Task Update: Hudkins

2011-03-03 Thread Wyatt Barnett
I'd ask if we need to install this stuff on hudson at all -- most of
it is command line utilities that can be transported with the source in
svn anyhow. Sidebar advantages here are that it is much easier to debug the
build scripts since you have no environmental dependencies and you've
got all the toys in one download.

On Thu, Mar 3, 2011 at 10:05 AM, Michael Herndon mhern...@o19s.com wrote:
 Sorry, I've just been reading through the list & wiki (
 http://wiki.apache.org/general/Hudson)

 I subscribed to the list last night and haven't received the usual message
 for activating subscriptions, I'll try again today and get on the list to
 see what is already on the windows slave.

 We also have to ask if others are interested in tools that are not installed
 and, if not, install them under home/username.

 -  michael.

 On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard thowar...@gmail.com wrote:

 I've been following builds@ for the past couple of days. Looks like
 they just finished the migration to Jenkins.

 Michael - Have you had a chance to contact them and find out what
 tools are available out of our list? Want me to do that?

 Thanks,
 Troy

 On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard slomb...@theta.net wrote:
  +1
 
  Scott
 
  On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig bode...@apache.org
 wrote:
 
  On 2011-02-28, Troy Howard wrote:
 
   One quick concern I have, is how much of the things listed are already
   available on the Apache hudson server?
 
  builds@apache is the place to ask.
 
   A lot of this is .NET specific, so unlikely that it will already be
   available.
 
  well, the DotCMIS build seems to be using Sandcastle Helpfile Builder,
  judging by the console output.
 
   We'll have to request that ASF Infra team install these tools for us,
   and they may not agree, or there might be licensing issues, etc.. Not
   sure. I'd start the conversation with them now to suss this out.
 
  Really, go to the builds list.  License issues usually don't show up for
  build tools.  It may be good if anybody of the team could volunteer time
  helping administrate the Windows slave.
 
   - Mono is going to be a requirement moving forward
 
  This could be done on a non-Windows slave just to be completely sure it
  works.  This may require installing a newer Mono (or just pulling in
  a different Debian package source for Mono) than is installed by
  default.
 
   - Project structure was being discussed on the LUCENENET-377 thread.
 
  As a quick note, in general we prefer the mailing list or JIRA for
  discussions around the ASF.
 
  Stefan
 
 




[jira] Updated: (LUCENE-2919) IndexSplitter that divides by primary key term

2011-03-03 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2919:
-

Attachment: LUCENE-2919.patch

First cut.  Roughly divides an index by the exclusive mid term given.  

 IndexSplitter that divides by primary key term
 --

 Key: LUCENE-2919
 URL: https://issues.apache.org/jira/browse/LUCENE-2919
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: LUCENE-2919.patch


 Index splitter that divides by primary key term.  The contrib 
 MultiPassIndexSplitter we have divides by docid, however to guarantee 
 external constraints it's sometimes necessary to split by a primary key term 
 id.  I think this implementation is a fairly trivial change.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Unintuitive NGramTokenizer behavior

2011-03-03 Thread Robert Muir
On Thu, Mar 3, 2011 at 2:06 PM, Grant Ingersoll gsing...@apache.org wrote:

 On Mar 3, 2011, at 1:10 PM, Robert Muir wrote:

 On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll gsing...@apache.org wrote:

 Unfortunately, I'm not following your reasons for doing it.  I won't say I'm
 against it at this point, but I don't see a compelling reason to change it
 either so if you could clarify that would be great.  It's been around for
 quite some time in its current form and I think it fits most people's
 expectations of ngrams.

 Grant I'm sorry, but I couldnt disagree more.

 There are many variations on ngram tokenization (word-internal,
 word-spanning, skipgrams), besides allowing flexibility for what
 should be a word character and what should not be (e.g.
 punctuation), and how to handle the specifics of these.

 But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:
 1. it discards anything after the first 1024 code units of the document.
 2. it uses partial characters (UTF-16 code units) as its fundamental
 measure, potentially creating lots of invalid unicode.
 3. it forms n-grams in the wrong order, contributing to #1. I
 explained this in LUCENE-1224

 Sure, but those are ancillary to the whitespace question that was asked about.


Not really? It's the more general form of the whitespace question.

I'm saying you should be able to say 'this is part of a word', but
then also specify if you want to fold runs of non-characters into a
single thing (e.g. '_') or into nothing at all, or whatever.

Additionally, NGramTokenizer should also support an option to treat
start and end of string as non-characters... in my opinion this
should be the default, and its absence is the root cause of Dave's issue?

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Lucene.Net] CI Task Update: Hudkins

2011-03-03 Thread Michael Herndon
The licenses on some of the tools may not cover having them in svn.  If
there were free open source versions that allowed redistributing the
binaries of all the tools listed that did the job well, we could put them
into svn.

However as far as I know there isn't a pure .net open source tool chain for
some of the tools listed above.

Thus it needs to be put on the slave or accessible from the slave with the
limited licenses.

Also one would want to install hudson plugins that generate reports and
graphical information based on the xml documents generated during the build.
Then have hudson process them on initial load and display them on the
dashboard / build output.

- Michael


On Thu, Mar 3, 2011 at 1:57 PM, Wyatt Barnett wyatt.barn...@gmail.comwrote:

 I'd ask if we need to install this stuff on hudson at all -- most of
 it is command line utilities that can be transported with the source in
 svn anyhow. Sidebar advantages here are that it is much easier to debug the
 build scripts since you have no environmental dependencies and you've
 got all the toys in one download.

 On Thu, Mar 3, 2011 at 10:05 AM, Michael Herndon mhern...@o19s.com
 wrote:
  Sorry, I've just been reading through the list & wiki (
  http://wiki.apache.org/general/Hudson)
 
  I subscribed to the list last night and haven't received the usual
 message
  for activating subscriptions, I'll try again today and get on the list to
  see what is already on the windows slave.
 
  We also have to ask if others are interested in tools that are not installed
  and, if not, install them under home/username.
 
  -  michael.
 
  On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard thowar...@gmail.com wrote:
 
  I've been following builds@ for the past couple of days. Looks like
  they just finished the migration to Jenkins.
 
  Michael - Have you had a chance to contact them and find out what
  tools are available out of our list? Want me to do that?
 
  Thanks,
  Troy
 
  On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard slomb...@theta.net
 wrote:
   +1
  
   Scott
  
   On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig bode...@apache.org
  wrote:
  
   On 2011-02-28, Troy Howard wrote:
  
One quick concern I have, is how much of the things listed are
 already
available on the Apache hudson server?
  
   builds@apache is the place to ask.
  
A lot of this is .NET specific, so unlikely that it will already be
available.
  
   well, the DotCMIS build seems to be using Sandcastle Helpfile Builder,
   judging by the console output.
  
We'll have to request that ASF Infra team install these tools for
 us,
and they may not agree, or there might be licensing issues, etc..
 Not
sure. I'd start the conversation with them now to suss this out.
  
   Really, go to the builds list.  License issues usually don't show up
 for
   build tools.  It may be good if anybody of the team could volunteer
 time
   helping administrate the Windows slave.
  
- Mono is going to be a requirement moving forward
  
   This could be done on a non-Windows slave just to be completely sure it
   works.  This may require installing a newer Mono (or just pulling in
   a different Debian package source for Mono) than is installed by
   default.
  
- Project structure was being discussed on the LUCENENET-377
 thread.
  
   As a quick note, in general we prefer the mailing list or JIRA for
   discussions around the ASF.
  
   Stefan
  
  
 
 




-- 
Michael Herndon
Senior Developer (mhern...@o19s.com)
804.767.0083

[connect online]
http://www.opensourceconnections.com
http://www.amptools.net
http://www.linkedin.com/pub/michael-herndon/4/893/23
http://www.facebook.com/amptools.net
http://www.twitter.com/amptools-net


[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

2011-03-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2948:


Attachment: LUCENE-2948_automaton.patch

Nice work Mike, I think I found a bug with nextPossiblePrefix though?

I attached my modifications to try to use this with Automaton. (just the 
automaton parts):

I'm also somehow triggering BlockReader's assert about crossing over index 
terms with other tests...

I think I could see the problem here... is it that 
nextPossiblePrefix(BytesRef prefix) means it wants me to truly pass in a 
prefix? Obviously a consumer doesn't know which portion of his term is/isn't a 
prefix!

So we would have to expose that :(, or alternatively change the semantics to 
nextPossiblePrefix(BytesRef term)? In other words, in this situation of 
1[\u]234567891 it would simply return true, because it knows 1* exists 
rather than forwarding me to s? Maybe this is what was intended all along and 
it's just an off-by-one?

{noformat}
[junit] NOTE: reproduce with: ant test -Dtestcase=TestFuzzyQuery 
-Dtestmethod=testTokenLengthOpt 
-Dtests.seed=4471452442745287654:-2341611255635429887 -Dtests.codec=Standard

// NOTE: this index has two terms: 12345678911 and segment

[junit] - Standard Output ---
[junit] candidate: [\u]1234567891
[junit] not found, goto: 1
[junit] candidate: 1[\u]234567891
[junit] not found, goto: s --- this is the problem, because 12345678911 
exists
[junit] candidate: s1234567891
[junit] found!
[junit] candidate: t1234567891
[junit] found!
{noformat}

 Make var gap terms index a partial prefix trie
 --

 Key: LUCENE-2948
 URL: https://issues.apache.org/jira/browse/LUCENE-2948
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, 
 LUCENE-2948_automaton.patch


 Var gap stores (in an FST) the indexed terms (every 32nd term, by
 default), minus their non-distinguishing suffixes.
 However, often times the resulting FST is close to a prefix trie in
 some portion of the terms space.
 By allowing some nodes of the FST to store all outgoing edges,
 including ones that do not lead to an indexed term, and by recording
 that this node is then authoritative as to what terms exist in the
 terms dict from that prefix, we can get some important benefits:
   * It becomes possible to know that a certain term prefix cannot
 exist in the terms index, which means we can save a disk seek in
 some cases (like PK lookup, docFreq, etc.)
   * We can query for the next possible prefix in the index, allowing
 some MTQs (eg FuzzyQuery) to save disk seeks.
 Basically, the terms index is able to answer questions that previously
 required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [Lucene.Net] how to add a new record to existing index

2011-03-03 Thread Granroth, Neal V.
There are several things to consider.

The first is what DIGY pointed out.  The third parameter of the IndexWriter 
constructor determines if the code is creating a new index or opening an 
existing index for additions.  The code must specify false to open an 
existing index for additions.

A second thing to consider is that the additions made to the index with 
writer.AddDocument() will not be visible until the IndexWriter is closed, or 
the Commit() method is called.

A third thing to consider, instances of IndexReader can only see the content of 
the index at the time the IndexReader instance was opened.  Even after the 
IndexWriter commits its changes, IndexReader instances must be re-opened in 
order to see the new index content.

It seems you should check your code to ensure:

- IndexWriter constructor is being called with the right parameters to open an 
existing index.

- IndexWriter is closed or commit is called after changes have been made.

- IndexReader instances are re-opened after changes have been committed.
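
A minimal sketch of that cycle (Lucene 3.x-era Java API; the Lucene.Net calls
in this thread are the Pascal-cased equivalents; path and field values are
placeholders):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AppendToExistingIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                false, // create == false: open the existing index for additions
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("lmid", "l1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.commit(); // additions stay invisible until commit (or close)
        writer.close();

        // readers only see the index as of when they were (re)opened
        IndexReader reader = IndexReader.open(dir, true);
        System.out.println("numDocs=" + reader.numDocs());
        reader.close();
    }
}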


- Neal

-Original Message-
From: Digy [mailto:digyd...@gmail.com] 
Sent: Thursday, March 03, 2011 12:16 PM
To: lucene-net-...@lucene.apache.org
Subject: RE: [Lucene.Net] how to add a new record to existing index

I don't think that I understand your problem.
Is it something like:
IndexWriter writer = new IndexWriter(path, analyzer, *false*,
IndexWriter.DEFAULT_MAX_FIELD_LENGTH);
..
writer.AddDocument(doc);

DIGY

-Original Message-
From: Wen Gao [mailto:samuel.gao...@gmail.com] 
Sent: Thursday, March 03, 2011 1:43 AM
To: lucene-net-...@lucene.apache.org
Subject: Re: [Lucene.Net] how to add a new record to existing index

Hi Digy,
It was my fault that I didn't say it clearly. I mean I have created an
index, but it is not updated in real time. So I want to update the index
every time I add data to the database, to keep the index up-to-date. My data
is what the user inputs, and it is inserted into the database.

BTW, I know how to delete a term from the index using IndexReader. Likewise, I
want to write a term to the created index instead of creating a new index.
I appreciate your time.

Thanks,
Wen

2011/3/2 Digy digyd...@gmail.com


 First of all, your code doesn't mean anything to me other than that you add
 some fields to a document object.

 Also, I can't see what you mean by *existing* index. The directory you
 pass to the IndexWriter is the index you use
 and every document added (using IndexWriter's AddDocument) is written to
 that index.

 I think we have problems using a common terminology.

 DIGY

 PS: It would be better if you use the user mailing list to ask questions.
 This mailing list is intended for development purposes.




 -Original Message-
 From: Wen Gao [mailto:samuel.gao...@gmail.com]
 Sent: Wednesday, March 02, 2011 11:02 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] how to add a new record to existing index

 Hi,
 I already have created an index, and I want to insert an index record to
 this existing index everytime I insert a new record to database.
 For example, if I want to insert a record ("l1", 15, "tom",
 20, "2010/01/02") into my *existing* index, how can I do this? (I don't want
 to create a new index, which takes too much time.)

 my format of index is as follows:
  ///
  doc.Add(new Lucene.Net.Documents.Field(
      "lmname",
      readerreader1["lmname"].ToString(),
      //new System.IO.StringReader(readerreader["cname"].ToString()),
      Lucene.Net.Documents.Field.Store.YES,
      Lucene.Net.Documents.Field.Index.TOKENIZED));

  // lmid
  doc.Add(new Lucene.Net.Documents.Field(
      "lmid",
      readerreader1["lmid"].ToString(),
      Lucene.Net.Documents.Field.Store.YES,
      Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // nick name of user
  doc.Add(new Lucene.Net.Documents.Field(
      "nickName",
      readerreader1["nickName"].ToString(),
      Lucene.Net.Documents.Field.Store.YES,
      Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // uid
  doc.Add(new Lucene.Net.Documents.Field(
      "uid",
      readerreader1["uid"].ToString(),
      Lucene.Net.Documents.Field.Store.YES,
      Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

  // acttime
  doc.Add(new Lucene.Net.Documents.Field(
      "acttime",
      readerreader1["acttime"].ToString(),
      Lucene.Net.Documents.Field.Store.YES,
      Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
  writer.AddDocument(doc);
///

 Thanks,
 Wen





[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002159#comment-13002159
 ] 

Ryan McKinley commented on SOLR-2399:
-

Any thoughts on implementing with velocity templates?

I don't want to slow this down since any effort is great! But long term, it 
would be great to drop JSP completely.

 Solr Admin Interface, reworked
 --

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor

 *The idea was to create a new, fresh (and hopefully clean) Solr Admin 
 Interface.* [Based on this 
 [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]
 I've quickly created a Github-Repository (Just for me, to keep track of the 
 changes)
 » https://github.com/steffkes/solr-admin 
 [This commit shows the 
 differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
 between the old/existing index.jsp and my new one (which was 
 copy-cut/paste'd from the existing one).
 Main Action takes place in 
 [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js]
  which is actually neither clean nor pretty .. just work-in-progress.
 Actually it's Work in Progress, so ... give it a try. It's developed with 
 Firefox as Browser, so, for a first impression .. please don't use _things_ 
 like Internet Explorer or so ;o
 Jan already suggested a bunch of good things, i'm sure there are more ideas 
 over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002182#comment-13002182
 ] 

Stefan Matheis (steffkes) commented on SOLR-2399:
-

Ryan, actually not - but that is only based on the fact that i've never worked 
with them. After a first look at http://velocity.apache.org/ there is no 
"Getting Started"-Noob-Stefan-Tutorial, no Getting Started at all ;o .. but 
i'll check this.

 Solr Admin Interface, reworked
 --

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor

 *The idea was to create a new, fresh (and hopefully clean) Solr Admin 
 Interface.* [Based on this 
 [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]
 I've quickly created a Github-Repository (Just for me, to keep track of the 
 changes)
 » https://github.com/steffkes/solr-admin 
 [This commit shows the 
 differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
 between the old/existing index.jsp and my new one (which was 
 copy-cut/paste'd from the existing one).
 Main Action takes place in 
 [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js]
  which is actually neither clean nor pretty .. just work-in-progress.
 Actually it's Work in Progress, so ... give it a try. It's developed with 
 Firefox as Browser, so, for a first impression .. please don't use _things_ 
 like Internet Explorer or so ;o
 Jan already suggested a bunch of good things, i'm sure there are more ideas 
 over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002182#comment-13002182
 ] 

Stefan Matheis (steffkes) edited comment on SOLR-2399 at 3/3/11 8:27 PM:
-

Ryan, actually not - but that is only based on the fact that i've never worked 
with them. 

// -Edit

After having a first look, i'm not really sure what/how that should help? The 
index.jsp is needed to gather Information about Cores [using 
org.apache.solr.core.CoreContainer] .. and i (just) don't see if (and if so, 
how) it is possible to pass that information to the Velocity-Thingy.

If that could be done .. no point about dropping that index.jsp out of order :)

  was (Author: steffkes):
Ryan, actually not - but that is only based on the fact that i've never 
worked with them. After a first look at http://velocity.apache.org/ there is no 
"Getting Started"-Noob-Stefan-Tutorial, no Getting Started at all ;o .. but 
i'll check this.
  
 Solr Admin Interface, reworked
 --

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor

 *The idea was to create a new, fresh (and hopefully clean) Solr Admin 
 Interface.* [Based on this 
 [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]
 I've quickly created a Github-Repository (Just for me, to keep track of the 
 changes)
 » https://github.com/steffkes/solr-admin 
 [This commit shows the 
 differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
 between the old/existing index.jsp and my new one (which was 
 copy-cut/paste'd from the existing one).
 Main Action takes place in 
 [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js]
  which is actually neither clean nor pretty .. just work-in-progress.
 Actually it's Work in Progress, so ... give it a try. It's developed with 
 Firefox as Browser, so, for a first impression .. please don't use _things_ 
 like Internet Explorer or so ;o
 Jan already suggested a bunch of good things, i'm sure there are more ideas 
 over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: issue with automatic iterable detection?

2011-03-03 Thread Bill Janssen
Andi Vajda va...@apache.org wrote:

   Bill,
 
 Did that solve your problem ?

Haven't had a chance to try it yet.  Will report back when I do.

Bill

 
 Andi..
 
 On Feb 28, 2011, at 20:05, Andi Vajda va...@apache.org wrote:
 
  
  On Sun, 27 Feb 2011, Bill Janssen wrote:
  
  Andi Vajda va...@apache.org wrote:
  
  It may be simplest if you can send me the source file for this class
  as well as a small jar file I can use to reproduce this ?
  
  Turns out to be simple to reproduce.  Put the attached in a file called
  test.java, and run this sequence:
  
  % javac -classpath . test.java
  % jar cf test.jar *.class
  % python -m jcc.__main__ --python test --shared --jar /tmp/test.jar 
  --build --vmarg -Djava.awt.headless=true
  
  This was a tougher one. It was triggered by a combination of things:
   - no wrapper requested for java.io.File or --package java.io
   - a subclass of a parameterized class or interface implementor of a
 parameterized interface wasn't pulling in classes used as type
 parameters (java.io.File here).
  
  A fix is checked into jcc trunk/branch_3x rev 1075642.
  This also includes the earlier fix about using absolute class names.
  
  Andi..


[jira] Created: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)
FieldAnalysisRequestHandler; add information about token-relation
-

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml

The XML-Output (simplified example attached) is missing one small piece of 
information .. which could be very useful to build a nice Analysis-Output, and 
that's the Token-Relation (if there is a special/correct word for this, please 
correct me).

Meaning, it is actually not possible to follow the Analysis-Process 
(completely) when the Tokenizers/Filters drop Tokens (e.g. StopWord) 
or split them into multiple Tokens (e.g. WordDelimiter).

Would it be possible to include this information? If so, it would be possible 
to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short 
scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Matheis (steffkes) updated SOLR-2400:


Attachment: 110303_FieldAnalysisRequestHandler_output.xml

 FieldAnalysisRequestHandler; add information about token-relation
 -

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml


 The XML-Output (simplified example attached) is missing one small piece of 
 information .. which could be very useful to build a nice Analysis-Output, and 
 that's the Token-Relation (if there is a special/correct word for this, please 
 correct me).
 Meaning, it is actually not possible to follow the Analysis-Process 
 (completely) when the Tokenizers/Filters drop Tokens (e.g. StopWord) 
 or split them into multiple Tokens (e.g. WordDelimiter).
 Would it be possible to include this information? If so, it would be possible 
 to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - 
 short scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Matheis (steffkes) updated SOLR-2400:


Attachment: 110303_FieldAnalysisRequestHandler_view.png

 FieldAnalysisRequestHandler; add information about token-relation
 -

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
 110303_FieldAnalysisRequestHandler_view.png


 The XML-Output (simplified example attached) is missing one small piece of 
 information .. which could be very useful to build a nice Analysis-Output, and 
 that's the Token-Relation (if there is a special/correct word for this, please 
 correct me).
 Meaning, it is actually not possible to follow the Analysis-Process 
 (completely) when the Tokenizers/Filters drop Tokens (e.g. StopWord) 
 or split them into multiple Tokens (e.g. WordDelimiter).
 Would it be possible to include this information? If so, it would be possible 
 to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - 
 short scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002260#comment-13002260
 ] 

Robert Muir commented on LUCENE-2822:
-

bq. I think this is still the best variant, as both System.nanoTime() and 
System.currentTimeMillis() use system calls that are really expensive. 

Sorry, it's too funny: playing with LUCENE-2948 I saw a big slowdown on Windows 
that Mike didn't see on Linux... finally tracked it down to an uncommented 
nanoTime :)

 TimeLimitingCollector starts thread in static {} with no way to stop them
 -

 Key: LUCENE-2822
 URL: https://issues.apache.org/jira/browse/LUCENE-2822
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 See the comment in LuceneTestCase.
 If you even do Class.forName("TimeLimitingCollector") it starts up a thread 
 in a static method, and there isn't a way to kill it.
 This is broken.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments

2011-03-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002289#comment-13002289
 ] 

Mark Miller commented on LUCENE-1824:
-

This has 4 votes and 5 watchers - is it ready to go in?

 FastVectorHighlighter truncates words at beginning and end of fragments
 ---

 Key: LUCENE-1824
 URL: https://issues.apache.org/jira/browse/LUCENE-1824
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
 Environment: any
Reporter: Alex Vigdor
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-1824.patch


 FastVectorHighlighter does not take word boundaries into consideration when 
 building fragments, so that in most cases the first and last word of a 
 fragment are truncated.  This makes the highlights less legible than they 
 should be.  I will attach a patch to BaseFragmentBuilder that resolves this 
 by expanding the start and end boundaries of the fragment to the first 
 whitespace character on either side of the fragment, or the beginning or end 
 of the source text, whichever comes first.  This significantly improves 
 legibility, at the cost of returning a slightly larger number of characters 
 than specified for the fragment size.
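
 A sketch of the described expansion (illustration, not the actual patch):

 public class FragmentBounds {
     // widen [start, end) to the nearest whitespace or text edge
     static int expandStart(String text, int start) {
         while (start > 0 && !Character.isWhitespace(text.charAt(start - 1)))
             start--;
         return start;
     }

     static int expandEnd(String text, int end) {
         while (end < text.length() && !Character.isWhitespace(text.charAt(end)))
             end++;
         return end;
     }

     public static void main(String[] args) {
         String text = "the quick brown fox";
         // a fragment cut mid-word: [5, 13) == "uick bro"
         System.out.println(text.substring(expandStart(text, 5),
                                           expandEnd(text, 13)));
         // prints "quick brown"
     }
 }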

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002292#comment-13002292
 ] 

Uwe Schindler commented on SOLR-2400:
-

The position is used e.g. in analysis.jsp to do exactly what you want to 
have. It is the token position. If no broken TokenFilters are used that do 
not correctly modify the posIncr attribute, you can simply use it for alignment.
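
For instance, a sketch of recovering absolute positions from posIncr (assuming
the 3.1/trunk analysis attribute API):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class PositionDump {
    static void dump(TokenStream ts) throws IOException {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute incr =
                ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int pos = 0;
        while (ts.incrementToken()) {
            // 0 = stacked on the previous token, >1 = gap (e.g. removed stopword)
            pos += incr.getPositionIncrement();
            System.out.println(pos + "\t" + term.toString());
        }
        ts.end();
        ts.close();
    }
}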

 FieldAnalysisRequestHandler; add information about token-relation
 -

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
 110303_FieldAnalysisRequestHandler_view.png


 The XML-Output (simplified example attached) is missing one small piece of 
 information .. which could be very useful to build a nice Analysis-Output, and 
 that's the Token-Relation (if there is a special/correct word for this, please 
 correct me).
 Meaning, it is actually not possible to follow the Analysis-Process 
 (completely) when the Tokenizers/Filters drop Tokens (e.g. StopWord) 
 or split them into multiple Tokens (e.g. WordDelimiter).
 Would it be possible to include this information? If so, it would be possible 
 to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - 
 short scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper

2011-03-03 Thread Grant Ingersoll (JIRA)
FastVectorHighlighter FieldTermStack could likely benefit from using 
TermVectorMapper
-

 Key: LUCENE-2949
 URL: https://issues.apache.org/jira/browse/LUCENE-2949
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0.3, 4.0
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 3.2, 4.0


Based on my reading of the FieldTermStack constructor that loads the vector 
from disk, we could probably save a bunch of time and memory by using the 
TermVectorMapper callback mechanism instead of materializing the full array of 
terms into memory and then throwing most of them out.
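
For illustration, a rough sketch of the callback shape (not the proposed
FieldTermStack change itself; the filtering criterion is assumed):

import java.util.Set;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class FilteringMapper extends TermVectorMapper {
    private final Set<String> wanted; // e.g. just the query terms

    public FilteringMapper(Set<String> wanted) {
        this.wanted = wanted;
    }

    @Override
    public void setExpectations(String field, int numTerms,
                                boolean storeOffsets, boolean storePositions) {
        // called once per field before map(); could pre-size structures here
    }

    @Override
    public void map(String term, int frequency,
                    TermVectorOffsetInfo[] offsets, int[] positions) {
        if (wanted.contains(term)) {
            // keep only the entries the highlighter cares about, instead of
            // materializing every term's arrays up front
        }
    }
}
// usage: reader.getTermFreqVector(docId, "body", new FilteringMapper(queryTerms));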

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002299#comment-13002299
 ] 

Robert Muir commented on LUCENE-1824:
-

just an idea: it seems like using a breakiterator would be the way to go here.


 FastVectorHighlighter truncates words at beginning and end of fragments
 ---

 Key: LUCENE-1824
 URL: https://issues.apache.org/jira/browse/LUCENE-1824
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
 Environment: any
Reporter: Alex Vigdor
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-1824.patch


 FastVectorHighlighter does not take word boundaries into consideration when 
 building fragments, so that in most cases the first and last word of a 
 fragment are truncated.  This makes the highlights less legible than they 
 should be.  I will attach a patch to BaseFragmentBuilder that resolves this 
 by expanding the start and end boundaries of the fragment to the first 
 whitespace character on either side of the fragment, or the beginning or end 
 of the source text, whichever comes first.  This significantly improves 
 legibility, at the cost of returning a slightly larger number of characters 
 than specified for the fragment size.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002310#comment-13002310
 ] 

Hoss Man commented on SOLR-2399:


bq. the index.jsp is needed to gather Information about Cores [using 
org.apache.solr.core.CoreContainer]

This is where the core admin handler should be useful -- you can use it to get 
a list of cores and their statuses.  In the example solr.xml (and by default if 
no solr.xml exists) it's available at /admin/cores but that can be changed -- 
for now your JSP should be able to ask the CoreContainer for it using 
getAdminPath()  

(If it would be useful, we could also add a simple bit of info to the 
SystemInfoRequestHandler (/_corename_/admin/system) output to let the UI (and 
external clients) know what path (if any) they can use to access the 
CoreAdminHandler if all they have is the URL for a single core.)

bq. I don't want to slow this down since any effort is great! but long term, it 
would be great to drop JSP completly

i agree it would be nice to show off using the velocity writer to style handler 
responses in the admin ui, but i think that the general approach of using a jsp 
(or servlet) as the master controller for creating a base HTML page that then 
uses javascript to query all of the individual handler APIs makes a lot of 
sense -- if for no other reason than that i don't think the velocity writer 
could really be used on the output of the CoreAdminHandler (can it? .. what 
context would it load the templates from?)

Ultimately the problem we're always going to run into is that people can 
customize the paths of things in their configs - not just CoreAdminHandler but 
even all of the various core-specific admin handlers.  

I don't think that's something we really have to be worried about right now 
(the existing admin UI certainly doesn't) but using a simple servlet/index.jsp 
gives us the ability to at least start with a direct java call to answer the 
question: what is the URL of the coreadmin handler? And then from there 
everything can be dynamically driven.

If the logic in the JSP is simple enough, and the real work is done in the 
javascript, then porting that JSP to velocity should ultimately be pretty 
straightforward (if there is a strong desire)
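
Sketched (the CoreContainer lookup is elided -- however the existing index.jsp
obtains its reference -- and this servlet name is hypothetical):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.solr.core.CoreContainer;

public class AdminPageServlet extends HttpServlet {
    private CoreContainer cores; // assumed to be set up at init time

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // the one direct java call: where is the core admin handler?
        String adminPath = cores.getAdminPath(); // e.g. "/admin/cores"
        resp.setContentType("text/html");
        // hand the answer to the javascript layer, which does the real work
        resp.getWriter().println(
                "<script>var coreAdminUrl = '" + adminPath + "';</script>");
    }
}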





 Solr Admin Interface, reworked
 --

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor

 *The idea was to create a new, fresh (and hopefully clean) Solr Admin 
 Interface.* [Based on this 
 [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]
 I've quickly created a Github-Repository (Just for me, to keep track of the 
 changes)
 » https://github.com/steffkes/solr-admin 
 [This commit shows the 
 differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
  between old/existing index.jsp and my new one (which is could 
 copy-cut/paste'd from the existing one).
 Main Action takes place in 
 [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js]
  which is actually neither clean nor pretty .. just work-in-progress.
 Actually it's Work in Progress, so ... give it a try. It's developed with 
 Firefox as Browser, so, for a first impression .. please don't use _things_ 
 like Internet Explorer or so ;o
 Jan already suggested a bunch of good things, i'm sure there are more ideas 
 over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Stefan Matheis (steffkes) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002313#comment-13002313
 ] 

Stefan Matheis (steffkes) commented on SOLR-2400:
-

Uwe, that was the first thing i thought myself, yes - but .. let's take "flat" 
(starting on position 4) and follow it. Passing StopFilter, still position 4; 
arriving at WordDelimiter, it's position 6 - the dash was dropped out for 
being a StopWord, and VA902B gets split up into three Tokens.

So, what i guess is missing .. is some type of information that, for 
example, the original Token on position 2 (VA902B) is split and now (partially) 
placed on positions 3 through 6 .. also, for example, that "flat" is no longer 
on position 4, because it's moved to 6.

Or did i just miss something really simple?

 FieldAnalysisRequestHandler; add information about token-relation
 -

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
 110303_FieldAnalysisRequestHandler_view.png


 The XML-Output (simplified example attached) is missing one small piece of 
 information .. which could be very useful to build a nice Analysis-Output, and 
 that's the Token-Relation (if there is a special/correct word for this, please 
 correct me).
 Meaning, it is actually not possible to follow the Analysis-Process 
 (completely) when the Tokenizers/Filters drop Tokens (e.g. StopWord) 
 or split them into multiple Tokens (e.g. WordDelimiter).
 Would it be possible to include this information? If so, it would be possible 
 to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - 
 short scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: issue with automatic iterable detection?

2011-03-03 Thread Andi Vajda


On Thu, 3 Mar 2011, Andi Vajda wrote:


Indeed, this is why I put that assertion there :-)
It's a bit of guesswork what all the possibilities are there.
I'll add support for arrays there.


Fix is checked into rev 1076883.
Back to you, Bill.

Thanks !

Andi..



Andi..

On Thu, 3 Mar 2011, Bill Janssen wrote:


This looks like a problem.

This is with an svn checkout of branch_3x.

Bill

122, in _run_module_as_main
   __main__, fname, loader, pkg_name)
 File /usr/lib/python2.6/runpy.py, line 34, in _run_code
   exec code in run_globals
 File 
/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py, 
line 98, in module

   cpp.jcc(sys.argv)
 File 
/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, 
line 548, in jcc

   addRequiredTypes(cls, typeset, generics)
 File 
/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, 
line 233, in addRequiredTypes

   addRequiredTypes(cls, typeset, True)
 File 
/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, 
line 238, in addRequiredTypes

   addRequiredTypes(ta, typeset, True)
 File 
/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, 
line 240, in addRequiredTypes

   raise NotImplementedError, repr(cls)
NotImplementedError: Type: double[]
%





[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002334#comment-13002334
 ] 

Mark Miller commented on LUCENE-2939:
-

My last patch is missing a couple required test compile changes - I excluded 
that class because I had some test code in it.

I'll put up a new patch as soon as I get a chance with the test class changes 
(Scorer init method gets a new param and there are a couple anonymous impls in 
test)

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream
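
 A sketch of the idea (not the actual patch; the helper is made up):

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.index.memory.MemoryIndex;

 public class CappedMemoryIndexing {
     static void addCapped(MemoryIndex mi, String field, String text,
                           Analyzer analyzer, int maxDocCharsToAnalyze) {
         // analyze only a prefix of a huge field instead of the whole thing
         String capped = text.length() > maxDocCharsToAnalyze
                 ? text.substring(0, maxDocCharsToAnalyze)
                 : text;
         mi.addField(field, capped, analyzer);
     }
 }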

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002335#comment-13002335
 ] 

Mark Miller commented on LUCENE-2939:
-

Honestly, if I was not so busy, I'd say we should really get this in for 3.1.

If you are doing something like desktop search, this can be a really cruel 
highlighter perf problem.

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002340#comment-13002340
 ] 

Mark Miller commented on LUCENE-2939:
-

P.S. This one is really a bad bug in my mind - we switched this to be the 
default, and the old Highlighter did not suffer like this in these situations.

Looking back over the email archives, it bit more than a few people. I'm pretty 
sure this bug was the impetus for the Fast Vector Highlighter (which is still 
valuable if you *really* do want to highlight over every token in your 3 
billion word PDF file ;) ).

You pay this huge perf penalty for no gain and no reason. If you are talking 
wikipedia size docs, it won't affect you - but for long documents, doing 10 
snippets can be prohibitive, with no workaround. That is not a friendly 
neighborhood highlighter.

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked

2011-03-03 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002381#comment-13002381
 ] 

Ryan McKinley commented on SOLR-2399:
-

bq. If the logic in the JSP is simple enough, and the real work is done in the 
javascript, then porting that JSP to velocity should ultimately be pretty 
straight forward (if there is a strong desire)

Yes, if anyone is willing to give the admin pages some much needed design love, 
I really don't want anything to slow that down.  In the future, if there is 
interest, it would be great to do this w/o JSP; the details of how will take 
some work.

 Solr Admin Interface, reworked
 --

 Key: SOLR-2399
 URL: https://issues.apache.org/jira/browse/SOLR-2399
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Minor

 *The idea was to create a new, fresh (and hopefully clean) Solr Admin 
 Interface.* [Based on this 
 [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]]
 I've quickly created a GitHub repository (just for me, to keep track of the 
 changes)
 » https://github.com/steffkes/solr-admin 
 [This commit shows the 
 differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d]
  between the old/existing index.jsp and my new one (which is copied, cut and 
 pasted from the existing one).
 The main action takes place in 
 [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js]
  which is actually neither clean nor pretty .. just work in progress.
 So ... give it a try. It's developed with Firefox as the browser, so for a 
 first impression please don't use _things_ like Internet Explorer ;o
 Jan already suggested a bunch of good things; I'm sure there are more ideas 
 over there :)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002394#comment-13002394
 ] 

Grant Ingersoll commented on LUCENE-2939:
-

I can backport if you want.

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 5555 - Failure

2011-03-03 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk//

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety

Error Message:
Error occurred in thread Thread-63: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc
 (Too many open files in system)

Stack Trace:
junit.framework.AssertionFailedError: Error occurred in thread Thread-63:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc
 (Too many open files in system)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145)
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc
 (Too many open files in system)
at 
org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety(TestIndexReaderReopen.java:833)




Build Log (for compile errors):
[...truncated 3110 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002400#comment-13002400
 ] 

Mark Miller commented on LUCENE-2939:
-

bq. i think the offsetLength calculation needs to be inside the incrementToken?

I do not follow ... incrementToken is:

+  @Override
+  public boolean incrementToken() throws IOException {
+    int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
+    if (offsetCount < offsetLimit && input.incrementToken()) {
+      offsetCount += offsetLength;
+      return true;
+    }
+    return false;
+  }

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002402#comment-13002402
 ] 

Robert Muir commented on LUCENE-2939:
-

Exactly - so what are the attribute's values before calling 
input.incrementToken()?

I don't think it is good practice to work with uninitialized values.


 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2939:


Attachment: LUCENE-2939.patch

This includes the change to the test to make it compile.

Still no CHANGES entry.

The compile change to the test is a back-compat break: the Scorer needs to know 
the maxCharsToAnalyze setting.

Have not had time to consider this further yet.

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1431) CommComponent abstracted

2011-03-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002410#comment-13002410
 ] 

Jason Rutherglen commented on SOLR-1431:


What's the status of this one?

 CommComponent abstracted
 

 Key: SOLR-1431
 URL: https://issues.apache.org/jira/browse/SOLR-1431
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Noble Paul
Priority: Trivial
 Fix For: Next

 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, 
 SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 We'll abstract CommComponent in this issue.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2011-03-03 Thread JohnWu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002432#comment-13002432
 ] 

JohnWu commented on SOLR-1395:
--

OK, all,

we have the correct result back (from slave02 to master):

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">MA147LL/A</str>
    <str name="name">Apple 60 GB iPod with Video Playback Black</str>
    <str name="manu">Apple Computer Inc.</str>
    ...

note:
  if you use Tomliu's patch, please correct the code in QueryComponent:

// JohnWu: corrected the && to ||; need to decide whether shards is null
if (shards == null) {
    hasShardURL = false;
} else {
    hasShardURL = shards != null || shards.indexOf('/') > 0;
}

so the query core can enter the distributed process and get the hits; the 
DocSlice is cast to a DocumentList.

If you have any problems, please ask me and we can discuss them together.

Thanks!


johnWu

 Integrate Katta
 ---

 Key: SOLR-1395
 URL: https://issues.apache.org/jira/browse/SOLR-1395
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Next

 Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
 back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
 katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
 log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
 solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
 solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
 solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
 solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
 zkclient-0.1-dev.jar, zookeeper-3.2.1.jar

   Original Estimate: 336h
  Remaining Estimate: 336h

 We'll integrate Katta into Solr so that:
 * Distributed search uses Hadoop RPC
 * Shard/SolrCore distribution and management
 * Zookeeper based failover
 * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-03-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002433#comment-13002433
 ] 

Robert Muir commented on LUCENE-2939:
-

{quote}
I see what you mean now - though I still don't understand your previous comment.
I assume that it's just defaulting to 0 - 0 now?
{quote}

Only the first time.

But imagine you try to reuse this tokenstream (maybe it's not being reused now, 
but it could be in the future)... the values for the last token of the previous 
doc are, say, 10 - 5... the consumer calls reset(Reader) with the new document 
and then reset(), which clears your accumulator, but this attribute is still 
10 - 5 until input.incrementToken()... only then does the tokenizer update the 
values.
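
To spell that out against the quoted patch: a sketch of the ordering being 
suggested - read the offset attribute only after input.incrementToken() has 
returned true, and clear the accumulator in reset() so a reused stream starts 
from zero. (The reset() override is my assumption here, not necessarily what 
the final patch will do; field names are taken from the quoted snippet.)

  @Override
  public boolean incrementToken() throws IOException {
    if (offsetCount < offsetLimit && input.incrementToken()) {
      // the attribute now reflects the token just produced, not a stale
      // value left over from the previous token (or the previous document)
      int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
      offsetCount += offsetLength;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    offsetCount = 0; // start counting again when the stream is reused
  }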


 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch, LUCENE-2939.patch


 huge documents can be drastically slower than need be because the entire 
 field is added to the memory index
 this cost can be greatly reduced in many cases if we try and respect 
 maxDocCharsToAnalyze
 things can be improved even further by respecting this setting with 
 CachingTokenStream

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2329) old index files not deleted on slave

2011-03-03 Thread Ryosuke Fujita (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002439#comment-13002439
 ] 

Ryosuke Fujita commented on SOLR-2329:
--

I had a similar problem, but after I added modify/write permissions for the 
solr user, the old index files vanished.
My OS is Windows Server 2008, though, and yours is CentOS - is that related? 
Which user invokes the replication task?

 old index files not deleted on slave
 

 Key: SOLR-2329
 URL: https://issues.apache.org/jira/browse/SOLR-2329
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Affects Versions: 4.0
 Environment: centos 5.5
 ext3 file system
Reporter: Edwin Khodabakchian
 Attachments: solrconfig.xml, solrconfig_slave.xml


 I have set up index replication (triggered on optimize). The problem I
 am having is the old index files are not being deleted on the slave.
 After each replication, I can see the old files still hanging around
 as well as the files that have just been pulled. This causes the data
 directory size to increase by the index size every replication until
 the disk fills up.
 I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup
 is pretty vanilla. I can reproduce this on multiple slaves.
 Checking the logs, I see the following error:
 SEVERE: SnapPull failed
 org.apache.solr.common.SolrException: Index fetch failed :
at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
at 
 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
at 
 org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
at 
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
at 
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at 
 org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more
 lsof reveals that the file is still opened from the java process.
 Contents of the index data dir:
 master:
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
 -rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
 -rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
 -rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
 -rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
 -rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 _24e.prx
 -rw-rw-r-- 1 feeddo feeddo 283M Dec 18 01:23 _24e.frq
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 18 01:24 segments_1xz
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 18 01:24 _24e.nrm
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 13:15 _25z.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 13:16 _25z.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 13:16 _25z.fdt
 -rw-rw-r-- 1 feeddo feeddo 484M Dec 18 13:35 _25z.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 18 

[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

2011-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491
 ] 

Uwe Schindler commented on SOLR-2400:
-

Stefan, this is a general issue with TokenStreams that add tokens. TokenStreams 
that remove tokens *should* automatically preserve positions, but not even all 
of those do that correctly (we were fixing some of them lately). The way Lucene 
analysis works makes it impossible to guarantee any correspondence of the 
position numbers: only what comes out at the end matters to the indexer, so the 
steps in between cannot be tracked. AnalysisReqHandler, on the other hand, does 
some bad hacks to look inside the analysis (by using temporary TokenStreams 
that buffer tokens), which is not the general use-case of TokenStreams.

I wonder a little bit about your XML file: it only contains text and position, 
but it should also contain rawTerm, startOffset and endOffset. When I call 
analysis I get all of those attributes, not only two of them. Is this a 
hand-made file, or what is the problem? Which Solr version?

One possibility might be the char offset into the original text, because that 
should point to the character offsets of the begin and end of the token in the 
original stream instead of the token position; but this is likely to break for 
lots of TokenFilters (WordDelimiterFilter would work as long as you don't do 
stemming before...). The problem is incorrect handling of offset calculation 
(also leading to bugs in highlighting) when the inserted terms are longer than 
their originals.

Altogether: it's unlikely that you can implement this in a way that works for 
all combinations of TokenStream components.
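
To illustrate the preserve-positions point for token-removing streams: a 
minimal sketch, not actual Solr or Lucene code (the class name and the drop 
criterion are invented). The position increments of dropped tokens are folded 
into the next surviving token, which is the only correlation the indexer ever 
sees:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class DropShortTokensFilter extends TokenFilter {
  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);

  public DropShortTokensFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    int skipped = 0;
    while (input.incrementToken()) {
      if (termAtt.length() > 2) { // invented criterion: drop short tokens
        // fold the increments of the dropped tokens into this one, so the
        // surviving tokens keep their positions relative to the original text
        posIncrAtt.setPositionIncrement(
            posIncrAtt.getPositionIncrement() + skipped);
        return true;
      }
      skipped += posIncrAtt.getPositionIncrement();
    }
    return false;
  }
}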

 FieldAnalysisRequestHandler; add information about token-relation
 -

 Key: SOLR-2400
 URL: https://issues.apache.org/jira/browse/SOLR-2400
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Stefan Matheis (steffkes)
Priority: Minor
 Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
 110303_FieldAnalysisRequestHandler_view.png


 The XML output (simplified example attached) is missing one small piece of 
 information which could be very useful for building a nice analysis output: 
 the token relation (if there is a special/correct word for this, please 
 correct me).
 Meaning, it is currently not possible to follow the analysis process 
 (completely) when the tokenizers/filters drop tokens (e.g. StopWord) or split 
 them into multiple tokens (e.g. WordDelimiter).
 Would it be possible to include this information? If so, it would be possible 
 to create an improved analysis page for the new Solr admin (SOLR-2399) - 
 short scribble attached

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org