WordDelimiterFilter preserveOriginal position increment

2012-10-23 Thread Jay Luker
Hi,

I'm having an issue with the WDF preserveOriginal=1 setting and the
matching of a phrase query. Here's an example of the text that is
being indexed:

...obtained with the Southern African Large Telescope,SALT...

A lot of our text is extracted from PDFs, so this kind of formatting
junk is very common.

The phrase query that is failing is:

Southern African Large Telescope

From looking at the analysis debugger I can see that the WDF is
getting the term Telescope,SALT and correctly splitting on the
comma. The problem seems to be that the original term is given the 1st
position, e.g.:

Pos  Term
1  Southern
2  African
3  Large
4  Telescope,SALT  -- original term
5  Telescope
6  SALT

Only by adding a phrase slop of ~1 do I get a match.
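
For reference, the slop workaround is just the quoted phrase with a slop value
appended, e.g.:

q="Southern African Large Telescope"~1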

I realize that the WDF is behaving correctly in this case (or at least
I can't imagine a rational alternative). But I'm curious if anyone can
suggest a way to work around this issue that doesn't involve adding
phrase query slop.

Thanks,
--jay


Re: WordDelimiterFilter preserveOriginal position increment

2012-10-23 Thread Jay Luker
Bah... While attempting to duplicate this on our 4.x instance I
realized I was mis-reading the analysis output. In the example I
mentioned it was actually a SynonymFilter in the analysis chain that
was affecting the term position (we have several synonyms for
telescope).

Regardless, it seems to not be a problem in 4.x.

Thanks,
--jay

On Tue, Oct 23, 2012 at 10:45 AM, Shawn Heisey s...@elyograg.org wrote:
 On 10/23/2012 8:16 AM, Jay Luker wrote:

  From looking at the analysis debugger I can see that the WDF is
 getting the term Telescope,SALT and correctly splitting on the
 comma. The problem seems to be that the original term is given the 1st
 position, e.g.:

 Pos  Term
 1  Southern
 2  African
 3  Large
 4  Telescope,SALT  -- original term
 5  Telescope
 6  SALT


 Jay, I have WDF with preserveOriginal turned on.  I get the following from
 WDF parsing in the analysis page on either 3.5 or 4.1-SNAPSHOT, and the
 analyzer shows that all four of the query words are found in consecutive
 fields.  On the new version, I had to slide a scrollbar to the right to see
 the last term.  Visually they were not in consecutive fields on the new
 version (they were on 3.5), but the position number says otherwise.


 1  Southern
 2  African
 3  Large
 4  Telescope,SALT
 4  Telescope
 5  SALT
 5  TelescopeSALT

 My full WDF parameters:
 index: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
 catenateWords=1, splitOnNumerics=1, stemEnglishPossessive=1,
 luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
 catenateNumbers=1}
 query: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
 catenateWords=0, splitOnNumerics=1, stemEnglishPossessive=1,
 luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
 catenateNumbers=0}

 I understand from other messages on the mailing list that I should not have
 preserveOriginal on the query side, but I have not yet changed it.

 If your position numbers really are what you indicated, you may have found a
 bug.  I have not tried the released 4.0.0 version, I expect to deploy from
 the 4.x branch under development.

 Thanks,
 Shawn



Re: NumericRangeQuery: what am I doing wrong?

2011-12-15 Thread Jay Luker
On Wed, Dec 14, 2011 at 5:02 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 I'm a little lost in this thread ... if you are programmatically constructing
 a NumericRangeQuery object to execute in the JVM against a Solr index,
 that suggests you are writing some sort of Solr plugin (or embedding
 Solr in some way)

It's not you; it's me. I'm just doing weird things, partly, I'm sure,
due to ignorance, but sometimes out of expediency. I was experimenting
with ways to do a NumericRangeFilter, and the tests I was trying used
the Lucene api to query a Solr index, therefore I didn't have access
to the IndexSchema. Also my question might have been better directed
at the lucene-general list to avoid confusion.

Thanks,
--jay


NumericRangeQuery: what am I doing wrong?

2011-12-14 Thread Jay Luker
I can't get NumericRangeQuery or TermQuery to work on my integer id
field. I feel like I must be missing something obvious.

I have a test index that has only two documents, id:9076628 and
id:8003001. The id field is defined like so:

<field name="id" type="tint" indexed="true" stored="true" required="true" />

A MatchAllDocsQuery will return the 2 documents, but any queries I try
on the id field return no results. For instance,

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);
    System.out.println("query: " + q);
    assertEquals(2, searcher.search(q, 5).totalHits);
}

public void testIdSearch() throws IOException {
    Query q = new TermQuery(new Term("id", "9076628"));
    System.out.println("query: " + q);
    assertEquals(1, searcher.search(q, 5).totalHits);
}

Both tests fail with totalHits being 0. This is using solr/lucene
trunk, but I tried also with 3.2 and got the same results.

What could I be doing wrong here?

Thanks,
--jay


Re: NumericRangeQuery: what am I doing wrong?

2011-12-14 Thread Jay Luker
On Wed, Dec 14, 2011 at 2:04 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, seems like it should work, but there are two things you might try:
 1> just execute the query in Solr: id:[1 TO 100]. Does that work?

Yep, that works fine.

 2> I'm really grasping at straws here, but it's *possible* that you
     need to use the same precisionstep as tint (8?)? There's a
     constructor that takes precisionStep as a parameter, but the
     default is 4 in the 3.x code.

Ah-ha, that was it. I did not notice the alternate constructor. The
field was originally indexed with solr's default int type, which has
precisionStep=0 (i.e., don't index at different precision levels).
The equivalent value for the NumericRangeQuery constructor is 32. This
isn't exactly intuitive, but I was able to figure it out with a careful
reading of the javadoc.
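
For future googlers, the call ends up looking something like this (untested
sketch; the 32 matches a precisionStep=0 field as discussed above, and Solr's
stock "tint" type would want 8 instead):

// precisionStep goes right after the field name in the alternate factory method
Query q = NumericRangeQuery.newIntRange("id", 32, 1, 1000, true, true);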

Thanks!
--jay


Re: RegexQuery performance

2011-12-12 Thread Jay Luker
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson erickerick...@gmail.com wrote:
 My off-the-top-of-my-head notion is you implement a
 Filter whose job is to emit some special tokens when
 you find strings like this that allow you to search without
 regexes. For instance, in the example you give, you could
 index something like...oh... I don't know, ###VER### as
 well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0.
 Now, when searching for docs with the pattern you used
 as an example, you look for ###VER### instead. I guess
 it all depends on how many regexes you need to allow.
 This wouldn't work at all if you allow users to put in arbitrary
 regexes, but if you have a small enough number of patterns
 you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this
feature, as well as the variety of regexes that would be used, is small
enough that it could definitely work. It turns it into a problem of
collecting the necessary regexes, plus the UI details.
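
For the archives, here's a rough sketch of the kind of filter Erick describes
(untested, written against the Lucene 3.x/4.x attribute API; the ###VER###
marker text and the class name are just placeholders):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class VersionMarkerFilter extends TokenFilter {
  // same identifier pattern as above; matching tokens get a searchable marker injected
  private static final Pattern VER = Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State marker;  // pending marker, captured from a matching term

  public VersionMarkerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (marker != null) {
      // emit ###VER### at the same position as the identifier we just passed through
      restoreState(marker);
      marker = null;
      termAtt.setEmpty().append("###VER###");
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (VER.matcher(termAtt).matches()) {
      marker = captureState();  // remember it so the marker can follow next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    marker = null;
  }
}

Because the marker gets a position increment of 0, it sits on the same position
as the identifier (synonym-style), so phrase and proximity queries still behave.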

Thanks!
--jay


Re: RegexQuery performance

2011-12-10 Thread Jay Luker
Hi Erick,

On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson erickerick...@gmail.com wrote:
 Could you show us some examples of the kinds of things
 you're using regex for? I.e. the raw text and the regex you
 use to match the example?

Sure!

An example identifier would be IRAS-A-FPA-3-RDR-IMPS-V6.0, which
identifies a particular Planetary Data System data set. Another
example is ULY-J-GWE-8-NULL-RESULTS-V1.0. These kind of strings
frequently appear in the references section of the articles, so the
context looks something like,

 ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
Tholen, D. J. 1989, in Asteroids II, ed ... 

The simple & straightforward regex I've been using is
/[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
haven't put my mind to it because I assumed the primary performance
issue was elsewhere.

 The reason I ask is that perhaps there are other approaches,
 especially thinking about some clever analyzing at index time.

 For instance, perhaps NGrams are an option. Perhaps
 just making WordDelimiterFilterFactory do its tricks. Perhaps.

WordDelimiter does help in the sense that if you search for a specific
identifier you will usually find fairly accurate results, even for
cases where the hyphens resulted in the term being broken up. But I'm
not sure how WordDelimiter can help if I want to search for a pattern.

I tried a few tweaks to the index, like putting a minimum character
count for terms, making sure WordDelimiter's preserveOriginal is
turned on, indexing without lowercasing so that I don't have to use
Pattern.CASE_INSENSITIVE. Performance was not improved significantly.

The new RegexpQuery mentioned by R. Muir looks promising, but I
haven't built an instance of trunk yet to try it out. Any other
suggestions appreciated.
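
For anyone who gets there before I do, the trunk usage appears to be roughly
the following (untested; "body" is a placeholder field name, and note the
automaton-based RegexpQuery syntax is not java.util.regex, so shorthand like
\d likely needs to be spelled out as [0-9]):

// org.apache.lucene.search.RegexpQuery, run against the term dictionary
Query q = new RegexpQuery(new Term("body", "[A-Z0-9:\\-]+V[0-9]+\\.[0-9]+"));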

Thanks!
--jay


 In other words, this could be an XY problem

 Best
 Erick

 On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.


 This RegexQuery is not really scalable in my opinion; it's always
 linear in the number of terms except in super-rare circumstances where
 it can compute a common prefix (and slow to boot).

 You can try svn trunk's RegexpQuery -- don't forget the p -- instead,
 from lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/,
 etc.)

 The performance is faster, but keep in mind it's only as good as the
 regular expressions; if the regular expressions are like /.*foo.*/,
 then it's just as slow as a wildcard of *foo*.

 --
 lucidimagination.com


RegexQuery performance

2011-12-08 Thread Jay Luker
Hi,

I am trying to provide a means to search our corpus of nearly 2
million fulltext astronomy and physics articles using regular
expressions. A small percentage of our users need to be able to
locate, for example, certain types of identifiers that are present
within the fulltext (grant numbers, dataset identifiers, etc).

My straightforward attempts to do this using RegexQuery have been
successful only in the sense that I get the results I'm looking for.
The performance, however, is pretty terrible, with most queries taking
five minutes or longer. Is this the performance I should expect
considering the size of my index and the massive number of terms? Are
there any alternative approaches I could try?

Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
using a RangeQuery based on document id

Any suggestions appreciated.

Thanks,
--jay


Re: PatternTokenizer failure

2011-11-30 Thread Jay Luker
On Tue, Nov 29, 2011 at 9:37 AM, Michael Kuhlmann k...@solarier.de wrote:
 Jay,
 I think the problem is this:

 You're checking whether the character preceding the array of at least one
 whitespace is not a hyphen.

 However, when you have more than one whitespace, like this:
 "foo- \n bar"
 then there's another run of whitespace - "\n " - which is preceded by the
 first whitespace - " ".

 Therefore, you'll need to not only check for preceding hyphens, but also for
 preceding whitespaces.

 I'll leave this as an exercise for you. ;)

 -Kuli

Just for the sake of closure, you were correct. I needed to update the
regex to include a whitespace character in the negative look-behind,
i.e., (?<![-\s])\s+.
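
A quick way to sanity-check the corrected pattern outside of Solr, with plain
java.util.regex:

import java.util.Arrays;
import java.util.regex.Pattern;

public class LookBehindCheck {
  public static void main(String[] args) {
    // split on whitespace only when it is preceded by neither a hyphen nor whitespace
    Pattern p = Pattern.compile("(?<![-\\s])\\s+");
    System.out.println(Arrays.toString(p.split("foo- \n bar"))); // one token, no split
    System.out.println(Arrays.toString(p.split("foo bar")));     // [foo, bar]
  }
}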

Thanks,
--jay


Re: InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

2011-11-30 Thread Jay Luker
I am having a similar issue with OffsetExceptions during highlighting.
In all of the explanations and bug reports I'm reading there is a
mention this is all the result of a problem with HTMLStripCharFilter.
But my analysis chains don't (that I'm aware of) make use of
HTMLStripCharFilter, so can someone explain what else might be going
on? Or is it acknowledged that the bug may exist elsewhere?

Thanks,
--jay

On Fri, Nov 11, 2011 at 4:37 AM, Vadim Kisselmann
v.kisselm...@googlemail.com wrote:
 Hi Edwin, Chris

 It's an old bug. I have big problems too with OffsetExceptions when I use
 highlighting or Carrot.
 It looks like a problem with HTMLStripCharFilter.
 The patch doesn't work.

 https://issues.apache.org/jira/browse/LUCENE-2208

 Regards
 Vadim



 2011/11/11 Edwin Steiner edwin.stei...@gmail.com

 I just entered a bug: https://issues.apache.org/jira/browse/SOLR-2891

 Thanks & regards, Edwin

 On Nov 7, 2011, at 8:47 PM, Chris Hostetter wrote:

 
  : finally I want to use Solr highlighting. But there seems to be a
 problem
  : if I combine the char filter and the compound word filter in
 combination
  : with highlighting (an
  : org.apache.lucene.search.highlight.InvalidTokenOffsetsException is
  : raised).
 
   Definitely sounds like a bug somewhere in dealing with the offsets.
 
  can you please file a Jira, and include all of the data you have provided
  here?  it would also be helpful to know what the analysis tool says about
  the various attributes of your tokens at each stage of the analysis?
 
  : SEVERE: org.apache.solr.common.SolrException:
 org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall
 exceeds length of provided text sized 12
  :     at
 org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
  :     at
 org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
  :     at
 org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
  :     at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  :     at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  :     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
  :     at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  :     at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  :     at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  :     at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  :     at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  :     at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
  :     at
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
  :     at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
  :     at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
  :     at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
  :     at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  :     at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
  :     at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
  :     at
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
  :     at
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
  :     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  :     at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  :     at java.lang.Thread.run(Thread.java:680)
  : Caused by:
 org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall
 exceeds length of provided text sized 12
  :     at
 org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
  :     at
 org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
  :     ... 23 more
 
 
  -Hoss





PatternTokenizer failure

2011-11-28 Thread Jay Luker
Hi all,

I'm trying to use PatternTokenizer and not getting expected results.
Not sure where the failure lies. What I'm trying to do is split my
input on whitespace except in cases where the whitespace is preceded
by a hyphen character. So to do this I'm using a negative look behind
assertion in the pattern, e.g. (?<!-)\s+.

Expected behavior:
foo bar -> [foo,bar] - OK
foo \n bar -> [foo,bar] - OK
foo- bar -> [foo- bar] - OK
foo-\nbar -> [foo-\nbar] - OK
foo- \n bar -> [foo- \n bar] - FAILS

Here's a test case that demonstrates the failure:

public void testPattern() throws Exception {
    Map<String,String> args = new HashMap<String, String>();
    args.put( PatternTokenizerFactory.GROUP, "-1" );
    args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
    Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
    PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
    tokFactory.init( args );
    TokenStream stream = tokFactory.create( reader );
    assertTokenStreamContents(stream, new String[] { "blah", "foo",
        "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
}

This fails with the following output:
org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>

Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay


Re: Document has fields with different update frequencies: how best to model

2011-06-11 Thread Jay Luker
You are correct that ExternalFileField values can only be used in
query functions (i.e. scoring, basically). Sorry for firing off that
answer without reading your use case more carefully.

I'd be inclined towards giving your Option #1 a try, but that's
without knowing much about the scale of your app, size of your index,
documents, etc. Unneeded field updates are only a problem if they're
causing performance problems, right? Otherwise, trying to avoid them seems
like premature optimization.

--jay

On Sat, Jun 11, 2011 at 5:26 AM, lee carroll
lee.a.carr...@googlemail.com wrote:
 Hi Jay
 I thought external file field could not be returned as a field but
 only used in scoring.
 Trunk has pseudo-fields which can take a function value, but we can't
 move to trunk.

 Also, it's a more general question around schema design: what happens if
 you have several fields with different update frequencies? It does not
 seem external file field is the use case for this.



 On 10 June 2011 20:13, Jay Luker lb...@reallywow.com wrote:
 Take a look at ExternalFileField [1]. It's meant for exactly what you
 want to do here.

 FYI, there is an issue with caching of the external values introduced
 in v1.4 but, thankfully, resolved in v3.2 [2]

 --jay

 [1] 
 http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
 [2] https://issues.apache.org/jira/browse/SOLR-2536


 On Fri, Jun 10, 2011 at 12:54 PM, lee carroll
 lee.a.carr...@googlemail.com wrote:
 Hi,
 We have a document type which has fields which are pretty static. Say
 they change once every 6 months. But the same document has a field
 which changes hourly.
 What are the best approaches to index this document?

 Eg
 Hotel ID (static) , Hotel Description (static and costly to get from a
 url etc), FromPrice (changes hourly)

 Option 1
 Index hourly as a single document and don't worry about the unneeded
 field updates

 Option 2
 Split into 2 document types and index independently. This would
 require the front end application to query multiple times?
 doc1
 ID,Description,DocType
 doc2
 ID,HotelID,Price,DocType

 application performs searches based on hotel attributes
 for each hotel match issue query to get price


 Any other options ? Can you query across documents ?

 We run 1.4.1, we could maybe update to 3.2 but I don't think I could
 swing to trunk for JOIN feature (if that indeed is JOIN's use case)

 Thanks in advance

 PS Am I just worrying about de-normalised data and should sort the
 source data out maybe by caching and get over it ...?

 cheers Lee c





Re: Document has fields with different update frequencies: how best to model

2011-06-10 Thread Jay Luker
Take a look at ExternalFileField [1]. It's meant for exactly what you
want to do here.

FYI, there is an issue with caching of the external values introduced
in v1.4 but, thankfully, resolved in v3.2 [2]

--jay

[1] 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
[2] https://issues.apache.org/jira/browse/SOLR-2536


On Fri, Jun 10, 2011 at 12:54 PM, lee carroll
lee.a.carr...@googlemail.com wrote:
 Hi,
 We have a document type which has fields which are pretty static. Say
 they change once every 6 months. But the same document has a field
 which changes hourly.
 What are the best approaches to index this document?

 Eg
 Hotel ID (static) , Hotel Description (static and costly to get from a
 url etc), FromPrice (changes hourly)

 Option 1
 Index hourly as a single document and don't worry about the unneeded
 field updates

 Option 2
 Split into 2 document types and index independently. This would
 require the front end application to query multiple times?
 doc1
 ID,Description,DocType
 doc2
 ID,HotelID,Price,DocType

 application performs searches based on hotel attributes
 for each hotel match issue query to get price


 Any other options ? Can you query across documents ?

 We run 1.4.1, we could maybe update to 3.2 but I don't think I could
 swing to trunk for JOIN feature (if that indeed is JOIN's use case)

 Thanks in advance

 PS Am I just worrying about de-normalised data and should sort the
 source data out maybe by caching and get over it ...?

 cheers Lee c



Re: Solr performance

2011-05-11 Thread Jay Luker
On Wed, May 11, 2011 at 7:07 AM, javaxmlsoapdev vika...@yahoo.com wrote:
 I have some 25-odd fields with stored=true in schema.xml. Retrieving
 5,000 records takes a few seconds. I also tried passing fl and only
 including one field in the response, but the response time is the same. What are
 the things to look at to tune the performance?


Confirm that you have enableLazyFieldLoading set to true in
solrconfig.xml. I suspect it is since that's the default.

Is the request taking a few seconds the first time, but returns
quickly on subsequent requests?

Also, may or may not be relevant, but you might find a few bits of
info in this thread enlightening:
http://lucene.472066.n3.nabble.com/documentCache-clarification-td1780504.html

--jay


Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Jay Luker
Hi Emyr,

You could try using the extractOnly=true parameter [1]. Of course,
you'll need to repost the extracted text manually.
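
If it helps, here's a rough SolrJ sketch of an extract-only round trip
(untested; written against the SolrJ 1.4/3.x API, and the server URL and file
name are just placeholders):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnlyExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("some.doc"));       // the document to run through Tika
    req.setParam("extractOnly", "true");     // extract text/metadata only; nothing is indexed
    NamedList<Object> result = server.request(req);
    System.out.println(result);              // extracted text and metadata come back in the response
  }
}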

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only


On Thu, May 5, 2011 at 9:36 AM, Emyr James emyr.ja...@sussex.ac.uk wrote:
 Hi All,

 I have solr and tika installed and am happily extracting and indexing
 various files.
 Unfortunately on some word documents it blows up since it tries to
 auto-generate a 'title' field but my title field in the schema is single
 valued.

 Here is my config for the extract handler...

 <requestHandler name="/update/extract"
     class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
   <lst name="defaults">
     <str name="uprefix">ignored_</str>
   </lst>
 </requestHandler>

 Is there a config option to make it only extract text, or ideally to allow
 me to specify which metadata fields to accept ?

 E.g. I'd like to use any author metadata it finds but to not use any title
 metadata it finds as I want title to be single valued and set explicitly
 using a literal.title in the post request.

 I did look around for some docs but all I can find are very basic examples.
 There's no comprehensive configuration documentation out there as far as I
 can tell.


 ALSO...

 I get some other bad responses coming back such as...

 [HTML error page returned by Tomcat 6.0.28; inline CSS boilerplate omitted]
 HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

 java.lang.NoSuchMethodError:
 org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
    at
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
    at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:636)
 type: Status report
 message: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

 For the above my url was...

 http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten

 I guess there's something special I need to be able to process PowerPoint
 files? Maybe I need to get the latest Apache POI? Any suggestions
 welcome...


 Regards,

 Emyr



tika/pdfbox knobs levers

2011-04-13 Thread Jay Luker
Hi all,

I'm wondering if there are any knobs or levers i can set in
solrconfig.xml that affect how pdfbox text extraction is performed by
the extraction handler. I would like to take advantage of pdfbox's
ability to normalize diacritics and ligatures [1], but that doesn't
seem to be the default behavior. Is there a way to enable this?

Thanks,
--jay

[1] 
http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html


Re: UIMA example setup w/o OpenCalais

2011-04-08 Thread Jay Luker
Thank you, that worked.

For the record, my objection to the OpenCalais service is that their
ToS states that they will retain a copy of the metadata submitted by
you, and that by submitting data to the service you grant Thomson
Reuters a non-exclusive perpetual, sublicensable, royalty-free license
to that metadata. The AlchemyAPI service ToS states only that they
retain the *generated* metadata.

Just a warning to anyone else thinking of experimenting with Solr & UIMA.

--jay

On Fri, Apr 8, 2011 at 6:45 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 Hi Jay,
 you should be able to do so by simply removing the OpenCalaisAnnotator from
 the execution pipeline commenting the line 124 of the file:
 solr/contrib/uima/src/main/resources/org/apache/uima/desc/OverridingParamsExtServicesAE.xml
 Hope this helps,
 Tommaso

 2011/4/7 Jay Luker lb...@reallywow.com

 Hi,

 I would like to experiment with the UIMA contrib package, but I have
 issues with the OpenCalais service's ToS and would rather not use it.
 Is there a way to adapt the UIMA example setup to use only the
 AlchemyAPI service? I tried simply leaving out the OpenCalais api key
 but I get exceptions thrown during indexing.

 Thanks,
 --jay




UIMA example setup w/o OpenCalais

2011-04-07 Thread Jay Luker
Hi,

I would like to experiment with the UIMA contrib package, but I have
issues with the OpenCalais service's ToS and would rather not use it.
Is there a way to adapt the UIMA example setup to use only the
AlchemyAPI service? I tried simply leaving out the OpenCalais api key
but I get exceptions thrown during indexing.

Thanks,
--jay


Re: Highlight snippets for a set of known documents

2011-04-01 Thread Jay Luker
It turns out the answer is I'm a moron; I had an unnoticed rows=1
nestled in the querystring I was testing with.

Anyway, thanks for replying!

--jay

On Fri, Apr 1, 2011 at 4:25 AM, Stefan Matheis
matheis.ste...@googlemail.com wrote:
 Jay,

 i'm not sure, but did you try it w/ brackets?
 q=foobar&fq={!q.op=OR}(id:1 id:5 id:11)

 Regards
 Stefan

 On Thu, Mar 31, 2011 at 6:40 PM, Jay Luker lb...@reallywow.com wrote:
 Hi all,

 I'm trying to get highlight snippets for a set of known documents and
 I must be doing something wrong because it's only sort of working.

 Say my query is foobar and I already know that docs 1, 5 and 11 are
 matches. Now I want to retrieve the highlight snippets for the term
 foobar for docs 1, 5 and 11. What I assumed would work was something
 like: ...q=foobar&fq={!q.op=OR}id:1 id:5 id:11. This returns
 numfound=3 in the response, but I only get the highlight snippets for
 document id:1. What am I doing wrong?

 Thanks,
 --jay




Help with parsing configuration using SolrParams/NamedList

2011-02-16 Thread Jay Luker
Hi,

I'm trying to use a CustomSimilarityFactory and pass in per-field
options from the schema.xml, like so:

 <similarity class="org.ads.solr.CustomSimilarityFactory">
   <lst name="field_a">
     <int name="min">500</int>
     <int name="max">1</int>
     <float name="steepness">0.5</float>
   </lst>
   <lst name="field_b">
     <int name="min">500</int>
     <int name="max">2</int>
     <float name="steepness">0.5</float>
   </lst>
 </similarity>

My problem is I am utterly failing to figure out how to parse this
nested option structure within my CustomSimilarityFactory class. I
know that the settings are available as a SolrParams object within the
getSimilarity() method. I'm convinced I need to convert to a NamedList
using params.toNamedList(), but my java fu is too feeble to code the
dang thing. The closest I seem to get is the top level as a NamedList
where the keys are field_a and field_b, but then my values are
strings, e.g., {min=500,max=1,steepness=0.5}.

Anyone who could dash off a quick example of how to do this?
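
To make the question concrete, here's roughly the shape of loop I'm after
(untested; it assumes the nested <lst> entries actually arrive as NamedList
values rather than the flattened strings described above, and the Settings
holder is just a made-up placeholder):

// inside getSimilarity(), with params being the SolrParams mentioned above
static class Settings { int min; int max; float steepness; }

NamedList<?> top = params.toNamedList();
Map<String, Settings> perField = new HashMap<String, Settings>();
for (int i = 0; i < top.size(); i++) {
  Object val = top.getVal(i);
  if (val instanceof NamedList) {                 // one <lst> per field
    NamedList<?> nl = (NamedList<?>) val;
    Settings s = new Settings();
    s.min = Integer.parseInt(nl.get("min").toString());
    s.max = Integer.parseInt(nl.get("max").toString());
    s.steepness = Float.parseFloat(nl.get("steepness").toString());
    perField.put(top.getName(i), s);              // keyed by field_a, field_b, ...
  }
}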

Thanks,
--jay


Re: Sending binary data as part of a query

2011-02-01 Thread Jay Luker
On Mon, Jan 31, 2011 at 9:22 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 that class should probably have been named ContentStreamUpdateHandlerBase
 or something like that -- it tries to encapsulate the logic that most
 RequestHandlers using ContentStreams (for updating) need to worry about.

 Your QueryComponent (as used by SearchHandler) should be able to access
 the ContentStreams the same way that class does ... call
 req.getContentStreams().

 Sending a binary stream from a remote client depends on how the client is
 implemented -- you can do it via HTTP using the POST body (with or w/o
 multi-part mime) in any language you want. If you are using SolrJ you may
 again run into an assumption that using ContentStreams means you are doing
 an Update but that's just a vernacular thing ... something like a
 ContentStreamUpdateRequest should work just as well for a query (as long
 as you set the necessary params and/or request handler path)

Thanks for the help. I was just about to reply to my own question for
the benefit of future googlers when I noticed your response. :)

I actually got this working, much the way you suggest. The client is
python. I created a gist with the script I used for testing [1].

On the solr side my QueryComponent grabs the stream, uses
jzlib.ZInputStream to decompress it, then translates the incoming
integers in the bitset (my solr schema.xml integer ids) to the lucene
ids and creates a docSetFilter with them.
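
For future googlers, the reading side ends up looking roughly like the method
below, which would live in the custom QueryComponent (untested sketch, not the
exact code I'm running; it assumes jzlib on the classpath, a client that
zlib-compresses a plain sequence of 4-byte big-endian ints, and "id" as the
schema's integer unique key -- adjust to whatever your client actually sends):

// rough sketch; needs the obvious imports (java.io.*, com.jcraft.jzlib.ZInputStream,
// org.apache.lucene.index.Term, org.apache.solr.common.util.ContentStream,
// org.apache.solr.request.SolrQueryRequest, org.apache.solr.schema.FieldType,
// org.apache.solr.search.SolrIndexSearcher)
private List<Integer> readLuceneIds(SolrQueryRequest req) throws IOException {
  List<Integer> luceneIds = new ArrayList<Integer>();
  Iterable<ContentStream> streams = req.getContentStreams();
  if (streams == null) return luceneIds;
  SolrIndexSearcher searcher = req.getSearcher();
  FieldType idType = req.getSchema().getFieldType("id");    // our integer key field
  for (ContentStream cs : streams) {
    DataInputStream in = new DataInputStream(new ZInputStream(cs.getStream()));
    try {
      while (true) {
        int externalId = in.readInt();                      // integer id sent by the client
        String indexed = idType.readableToIndexed(Integer.toString(externalId));
        int docId = searcher.getFirstMatch(new Term("id", indexed));
        if (docId != -1) luceneIds.add(docId);              // now a lucene docid
      }
    } catch (EOFException endOfStream) {
      // finished this stream
    } finally {
      in.close();
    }
  }
  return luceneIds;
}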

Very relieved to get this working as it's the basis of a talk I'm
giving next week [2]. :-)

--jay

[1] https://gist.github.com/806397
[2] http://code4lib.org/conference/2011/luker


Sending binary data as part of a query

2011-01-28 Thread Jay Luker
Hi all,

Here is what I am interested in doing: I would like to send a
compressed integer bitset as a query to solr. The bitset integers
represent my document ids and the results I want to get back is the
facet data for those documents.

I have successfully created a QueryComponent class that, assuming it
has the integer bitset, can turn that into the necessary DocSetFilter
to pass to the searcher, get back the facets, etc. That part all works
right now because I'm using either canned or randomly generated
bitsets on the server side.

What I'm unsure how to do is actually send this compressed bitset from
a client to solr as part of the query. From what I can tell, the Solr
API classes that are involved in handling binary data as part of a
request assume that the data is a document to be added. For instance,
extending ContentStreamHandlerBase requires implementing some kind of
document loader and an UpdateRequestProcessorChain and a bunch of
other stuff that I don't really think I should need. Is there a
simpler way? Anyone tried or succeeded in doing anything similar to
this?

Thanks,
--jay


Re: Using jetty's GzipFilter in the example solr.war

2010-11-15 Thread Jay Luker
On Sun, Nov 14, 2010 at 12:49 AM, Kiwi de coder kiwio...@gmail.com wrote:
 try putting your filter at the top of web.xml (instead of the middle or bottom). I tried
 this for a few days and it's the only simple solution (not sure whether putting it at the
 top is required by the spec or whether it's a bug)

Thank you.

An explanation of why this worked is probably better explored on the
jetty list, but, for the record, it did.

--jay


Using jetty's GzipFilter in the example solr.war

2010-11-13 Thread Jay Luker
Hi,

I thought I'd try turning on gzip compression but I can't seem to get
jetty's GzipFilter to actually compress my responses. I unpacked the
example solr.war and tried adding variations of the following to the
web.xml (and then rejar-ed), but as far as I can tell, jetty isn't
actually compressing anything.

<filter>
  <filter-name>GZipFilter</filter-name>
  <display-name>Jetty's GZip Filter</display-name>
  <description>Filter that zips all the content on-the-fly</description>
  <filter-class>org.mortbay.servlet.GzipFilter</filter-class>
  <init-param>
    <param-name>mimeTypes</param-name>
    <param-value>*</param-value>
  </init-param>
</filter>

<filter-mapping>
  <filter-name>GZipFilter</filter-name>
  <url-pattern>*</url-pattern>
</filter-mapping>

I've also tried explicitly listing mime-types and assigning the
filter-mapping using servlet-name. I can see that the GzipFilter is
being loaded when I add -DDEBUG to the jetty startup command. But as
far as I can tell from looking at the response headers nothing is
being gzipped. I'm expecting to see Content-Encoding: gzip in the
response headers.

Anyone successfully gotten this to work?

Thanks,
--jay


Re: documentCache clarification

2010-10-29 Thread Jay Luker
On Thu, Oct 28, 2010 at 7:27 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 The queryResultCache is keyed on Query,Sort,Start,Rows,Filters and the
 value is a DocList object ...

 http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html

 Unlike the Document objects in the documentCache, the DocLists in the
 queryResultCache never get modified (technically Solr doesn't actually
 modify the Documents either, the Document just keeps track of its fields
 and updates itself as Lazy Load fields are needed)

 if a DocList containing results 0-10 is put in the cache, it's not
 going to be of any use for a query with start=50.  but if it contains 0-50
 it *can* be used if start < 50 and rows < 50 -- that's where the
 queryResultWindowSize comes in.  if you use start=0&rows=10, but your
 window size is 50, SolrIndexSearcher will (under the covers) use
 start=0&rows=50 and put that in the cache, returning a slice from 0-10
 for your query.  the next query asking for 10-20 will be a cache hit.

This makes sense but still doesn't explain what I'm seeing in my cache
stats. When I issue a request with rows=10 the stats show an insert
into the queryResultCache. If I send the same query, this time with
rows=1000, I would not expect to see a cache hit but I do. So it seems
like there must be something useful in whatever gets cached on the
first request for rows=10 for it to be re-used by the request for
rows=1000.

--jay


Re: documentCache clarification

2010-10-28 Thread Jay Luker
On Wed, Oct 27, 2010 at 9:13 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : schema.) My evidence for this is the documentCache stats reported by
 : solr/admin. If I request rows=10&fl=id followed by
 : rows=10&fl=id,title I would expect to see the 2nd request result in
 : a 2nd insert to the cache, but instead I see that the 2nd request hits
 : the cache from the 1st request. rows=10&fl=* does the same thing.

 your evidence is correct, but your interpretation is incorrect.

 the objects in the documentCache are lucene Documents, which contain a
 List of Field references.  when enableLazyFieldLoading=true is set, and
 there is a documentCache miss, the Document fetched from the IndexReader only
 contains the Fields specified in the fl, and all other Fields are marked
 as LOAD_LAZY.

 When there is a cache hit on that uniqueKey at a later date, the Fields
 already loaded are used directly if requested, but the Fields marked
 LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader and then
 the Document updates the reference to the newly actualized fields (which
 are no longer marked LOAD_LAZY)

 So with different fl params, the same Document Object is continually
 used, but the Fields in that Document grow as the fields requested (using
 the fl param) change.

Great stuff. Makes sense. Thanks for the clarification, and if no one
objects I'll update the wiki with some of this info.

I'm still not clear on this statement from the wiki's description of
the documentCache: (Note: This cache cannot be used as a source for
autowarming because document IDs will change when anything in the
index changes so they can't be used by a new searcher.)

Can anyone elaborate a bit on that. I think I've read it at least 10
times and I'm still unable to draw a mental picture. I'm wondering if
the document IDs referred to are the ones I'm defining in my schema,
or are they the underlying lucene ids, i.e. the ones that, according
to the Lucene in Action book, are relative within each segment?


 : will *not* result in an insert to queryResultCache. I have tried
 : various increments--10, 100, 200, 500--and it seems the magic number
 : is somewhere between 200 (cache insert) and 500 (no insert). Can
 : someone explain this?

 In addition to the queryResultMaxDocsCached config option already
 mentioned (which controls whether a DocList is cached based on its size)
 there is also the queryResultWindowSize config option which may confuse
 your cache observations.  if the window size is 50 and you ask for
 start=0&rows=10 what actually gets cached is 0-50 (assuming there are
 more than 50 results) so a subsequent request for start=10&rows=10 will be
 a cache hit.

Just so I'm clear, does the queryResultCache operate in a similar
manner as the documentCache as to what is actually cached? In other
words, is it the caching of the docList object that is reported in the
cache statistics hits/inserts numbers? And that object would get
updated with a new set of ordered doc ids on subsequent, larger
requests. (I'm flailing a bit to articulate the question, I know). For
example, if my queryResultMaxDocsCached is set to 200 and I issue a
request with rows=500, then I won't get a docList object entry in the
queryResultCache. However, if I issue a request with rows=10, I will
get an insert, and then a later request for rows=500 would re-use and
update that original cached docList object. Right? And would it be
updated with the full list of 500 ordered doc ids or only 200?

Thanks,
--jay


documentCache clarification

2010-10-27 Thread Jay Luker
Hi all,

The solr wiki says this about the documentCache: The more fields you
store in your documents, the higher the memory usage of this cache
will be.

OK, but if I have enableLazyFieldLoading set to true and in my request
parameters specify fl=id, then the number of fields per document
shouldn't affect the memory usage of the document cache, right?

Thanks,
--jay


Re: documentCache clarification

2010-10-27 Thread Jay Luker
(btw, I'm running 1.4.1)

It looks like my assumption was wrong. Regardless of the fields
selected using the fl parameter and the enableLazyFieldLoading
setting, solr apparently fetches from disk and caches all the fields
in the document (or maybe just those that are stored=true in my
schema.) My evidence for this is the documentCache stats reported by
solr/admin. If I request rows=10&fl=id followed by
rows=10&fl=id,title I would expect to see the 2nd request result in
a 2nd insert to the cache, but instead I see that the 2nd request hits
the cache from the 1st request. rows=10&fl=* does the same thing.
i.e., the first request, even though I have
enableLazyFieldLoading=true and I'm only asking for the ids, fetches
the entire document from disk and inserts into the documentCache.
Subsequent requests, regardless of which fields I actually select,
don't hit the disk but are loaded from the documentCache. Is this
really the expected behavior and/or am I misunderstanding something?

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in an insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?

Thanks,
--jay

On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 I was wondering about this too some time ago. I've found more information
 in SOLR-52 and some correspondence on this one, but it didn't give me a
 definitive answer.

 [1]: https://issues.apache.org/jira/browse/SOLR-52
 [2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html

 On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
 Hi all,

 The solr wiki says this about the documentCache: The more fields you
 store in your documents, the higher the memory usage of this cache
 will be.

 OK, but if i have enableLazyFieldLoading set to true and in my request
 parameters specify fl=id, then the number of fields per document
 shouldn't affect the memory usage of the document cache, right?

 Thanks,
 --jay

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: Autocommit not happening

2010-07-23 Thread Jay Luker
For the sake of any future googlers I'll report my own clueless but
thankfully brief struggle with autocommit.

There are two parts to the story: Part One is where I realize my
autoCommit config was not contained within my updateHandler. In
Part Two I realized I had typed autocommit rather than
autoCommit.

--jay

On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa jo...@ipstreet.com wrote:
 On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:

 Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
 happening in my Solr installation.


 [snip]

 Never mind... I have discovered my boneheaded mistake. It's so silly, I 
 wish I could retract my question from the archives.