Re: Facets - What's a better term for non-technical people?

2007-12-11 Thread Adrian Sutton

On 11/12/2007, at 8:32 PM, Benjamin O'Steen wrote:

So, has anyone got a good example of the language they might use over,
say, a set of radio buttons and fields on a web form, to indicate that
selecting one or more of these would return facets? 'Show grouping by'
or 'List the sets that the results fall into', or something similar.


'Filter by' is what I'd use, though unfortunately 'filter' is already
used in Solr. The two are very much related, since a facet is generally
added as a filter query, but not close enough to use the same term.


Other terms that are close but not quite right would be 'groups' or
'categories'. Maybe 'Limit to', so that facets would be 'limiters'. I
think 'facet' is the right term, and what you need is to add 'see also'
style entries under a bunch of these other terms.
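
For what it's worth, this is how the two concepts meet in practice:
when a user picks a facet value in the UI, the next request typically
adds it back to Solr as a filter query, e.g. (facet and fq are standard
parameters; the category field and values here are made up):

/select?q=repositories&facet=true&facet.field=category&fq=category:reports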


Regards,

Adrian Sutton
http://www.symphonious.net



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

One last one: when you send HTML to Solr, do you also replace special
characters and tags with named entities? I did this and HTMLStripper
doesn't seem to recognise the tags :-S Whereas if I try to input the
HTML as-is, the indexer throws exceptions (as having tags within XML
tags is obviously not valid). How should this part be done?


We didn't do anything at all to the HTML; the editor returns valid
XHTML (using numeric entities, never named entities, which aren't
valid in XML and don't tend to work in XHTML), and we do string
concatenation to build up the /update request body like:


requestBody += "<str name=\"content\">" + xhtmlContent + "</str>";
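
For context, Solr's XML update message wraps each field in a <field>
element inside <add><doc>. A minimal sketch of building one the same
way (variable and field names are illustrative, not our exact code):

String requestBody = "<add><doc>";
requestBody += "<field name=\"id\">" + documentId + "</field>";
requestBody += "<field name=\"content\">" + xhtmlContent + "</field>";
requestBody += "</doc></add>";

This only stays well-formed because the editor guarantees xhtmlContent
is already valid XML with any markup characters escaped.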

Solr seems to handle it. From what people are suggesting, though, you'd
be better off converting to plain text before indexing it with Solr.
Something like JTidy (http://jtidy.sf.net) can parse most of the HTML
that's around, and you can iterate over the DOM to extract the text
from there.
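
A rough sketch of that approach (untested; Tidy and parseDOM are
JTidy's actual API, the wrapper class is made up):

import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.tidy.Tidy;

public class HtmlTextExtractor {
    // Parse (possibly messy) HTML with JTidy and return just the text.
    public static String extractText(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes()), null);
        StringBuilder text = new StringBuilder();
        appendText(doc, text);
        return text.toString().trim();
    }

    // Depth-first walk of the DOM, collecting text nodes.
    private static void appendText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            appendText(child, out);
        }
    }
}

The extracted text can then go into the Solr field in place of the raw
markup.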


Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:

(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets, do you get tags
or entities, or do you convert back to tags for presentation? What's
the best way out? It would help me a lot if you briefly explained your
configuration.


We happen to develop an HTML editor, so we know 100% for certain that
the XHTML is valid XML. Given that, we just throw the raw XHTML at
Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this
stage we haven't configured highlighting at all, so our index is only
used for search and for retrieving a document ID. At some point I'd
like to add highlighting, and it sounds like the best way to do so
would be to index the document text instead of the HTML.
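
A field type along these lines would wire that tokenizer in (a sketch
of schema.xml, not our exact configuration; the type name and the
lowercase filter are illustrative):

<fieldType name="html" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>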


Beyond that, we also use Solr as an optimization for extracting
information such as which content was most recently changed, which
pages link to others, etc. For the page linking, we actually identify
which pages are linked to prior to indexing and store them as a
separate field - Solr itself has no understanding of the linking.
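
So a page might be indexed along these lines (hypothetical field
names; the links field is multi-valued):

<add>
  <doc>
    <field name="id">page-42</field>
    <field name="lastModified">2007-10-01T12:00:00Z</field>
    <field name="linksTo">page-7</field>
    <field name="linksTo">page-13</field>
  </doc>
</add>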


Oh and I should note, I'm very new to Solr so I'm probably not doing  
things the best way, but I'm getting great results anyway.


Regards,

Adrian Sutton
http://www.symphonious.net



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
Named entity references are valid in XML. They just need to be
declared before they are used[1], unless they are one of the builtin
named entities &lt;, &gt;, &apos;, &quot; or &amp; -- these are always
valid when parsing with an XML parser.


Correct - it was an offhand comment and I skipped over all the
details. In general, named entities other than the built-ins aren't
declared at the top of the file, and many parsers don't bother to read
in external DTDs, so any entities declared there aren't read and are
therefore considered invalid.
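
For reference, the declaration that would have to appear at the top of
the document (in the internal DTD subset, which an XML parser is
required to read) looks like this, using nbsp as the example:

<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
]>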



XHTML is XML, so if parsed by an XML parser, XML's built-in named
entities are available, and if the parser doesn't ignore external
entities, then the same set of (roughly) 250 named entities defined in
HTML are available as well[2].


Except that no browser I know of actually reads in the XHTML DTD when
in standards-compliant mode, so none of those entities are actually
viable unless you include the declarations for them at the top of
every XHTML document (which is ludicrous).


The bottom line is that it's far, far better to use numeric entities
in XML (e.g. &#160; rather than &nbsp;) and simply ignore all but the
built-in named entities if you want any confidence that the document
will be parsed correctly - hence my offhand comment.


Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton
I see that you're using the HTML analyzer.  Unfortunately that does  
not play very well with highlighting at the moment. You may get  
garbled output.


Is it the HTML analyzer or the fact that it's HTML content? If it's
just the analyzer, you could copy the HTML content to another field
with a different analyzer and use that one for highlighting (but
search on the original field). Would this work, and if so, which
analyzer would be suitable for the second field?
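
In schema.xml terms the idea would be something like this (a sketch
with made-up field and type names):

<field name="content" type="html" indexed="true" stored="true"/>
<field name="content_hl" type="text" indexed="true" stored="true"/>
<copyField source="content" dest="content_hl"/>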


Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton

On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
In general, I don't recommend indexing HTML content straight to
Solr. None of the Solr contributors do this, so the use case hasn't
received a lot of love.


We're indexing XHTML straight to Solr and it's working great so far.

I'm actually somewhat surprised that several people are interested
in this but none have been sufficiently interested to implement a
solution to contribute:


http://issues.apache.org/jira/browse/SOLR-42


Didn't know there was a problem to solve. We're a fair way off
actually playing with highlighting, but I'll keep an eye on this for
when we get to it.



-Mike


Thanks,

Adrian Sutton
http://www.symphonious.net



Re: solr locked itself out

2007-09-17 Thread Adrian Sutton

ulimit reports unlimited, and cat /proc/sys/fs/file-max shows 11769


I just went through the same kind of mistake - ulimit on its own
doesn't report what you think it does; what you should check is ulimit
-n (the -n is needed to query the open-file limit, not just to set
it). If you're using bash as your shell that will almost certainly be
1024, which I've found isn't enough to search and write at the same
time. The commit winds up throwing an exception and the lock file(s)
get left around, causing further problems.


The first thing I'd try is upping the ulimit -n to 2 and seeing if
that resolves the issue; it did for me.


Regards,

Adrian Sutton
http://www.symphonious.net


Re: Searching Versioned Resources

2007-09-13 Thread Adrian Sutton

On 13/09/2007, at 2:36 PM, Adrian Sutton wrote:

I think you can use the CollapseFilter to collapse on the version
field. However, I think you need to modify the CollapseFilter code to
sort by version and get the latest version returned.


Ooo, that's very cool. I assume the patches haven't actually been
applied yet? This would let me just collapse on the name field, and
if I could get it to sort by modification date before collapsing
it'd be perfect.


I have a feeling I'm going to wind up extremely lost, but I'll  
delve into the patch and see what I can find.


For the benefit of the archives (and anyone else following along), it  
looks like the current version of the patch in JIRA will actually do  
what I need. The collapse filter will return the first N documents it  
iterates over and collapse the rest, but before iterating it will  
sort the documents by the sort parameters you specify. So to get the  
latest version I simply set sort=version,desc.


It seems to work well. Though the patch needs to be updated again to
work with HEAD, it's not too hard to resolve the differences.


Regards,

Adrian Sutton.
http://www.symphonious.net/


Re: Problem with word 'Repository' in facets

2007-09-13 Thread Adrian Sutton

I am facing an issue with facet fields. I have a 'category' field in
my schema.xml file and I have mapped it to facet fields. But the
problem is, one of the categories is 'Repository', which displays
fine in the category field, but in the facet field it shows as
'Repositori' - the 'y' is getting changed to 'i'. I don't know why.

I tried giving 'Repositry', and it shows me 'repos'.


This looks like the work of a stemming filter - perhaps you have
solr.EnglishPorterFilterFactory as one of the filters for that field?
You could add 'Repository' to the list of protected words, remove the
stemming filter from the indexing analyzer, or add the stemming filter
to the query analyzer as well, so both sides stem the same way.
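
The protected-words option looks like this in the analyzer
configuration (a sketch; protwords.txt is the file name used in the
Solr example schema):

<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

Any word listed in that file, one per line, is left unstemmed.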


Regards,

Adrian Sutton
http://www.symphonious.net


Searching Versioned Resources

2007-09-12 Thread Adrian Sutton

Hi all,
The documents we're indexing are versioned, and generally we only
want search results to return the latest version of a document;
however, there are a couple of scenarios where I'd like to be able to
include previous versions in the search results.


It feels like a straightforward case for a filter, but given that
each document has independent version numbers it's hard to know what
to filter on. The only solution I can think of at the moment is to
index each new version twice - once with the actual version and once
with version=latest. We'd then tweak the ID field in such a way that
there is only ever one version of each document with version=latest.
It's then simple to use a filter for version=latest when we search.
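
Concretely, indexing version 3 of a document would send something
like this (hypothetical field names; the second doc's ID is stable,
so each new version simply overwrites it):

<add>
  <doc>
    <field name="id">doc-1:3</field>
    <field name="version">3</field>
  </doc>
  <doc>
    <field name="id">doc-1:latest</field>
    <field name="version">latest</field>
  </doc>
</add>

Normal searches would then add fq=version:latest.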


Is there a better way? Is there a way to achieve this without having  
to index the document twice?


Thanks in advance,

Adrian Sutton
http://www.symphonious.net





Re: Searching Versioned Resources

2007-09-12 Thread Adrian Sutton

I think you can use the CollapseFilter to collapse on the version
field. However, I think you need to modify the CollapseFilter code to
sort by version and get the latest version returned.


Ooo, that's very cool. I assume the patches haven't actually been
applied yet? This would let me just collapse on the name field, and
if I could get it to sort by modification date before collapsing it'd
be perfect.


I have a feeling I'm going to wind up extremely lost, but I'll delve  
into the patch and see what I can find.


Thanks,

Adrian Sutton
http://www.symphonious.net


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton
We use DirectSolrConnection via JNI in a couple of client apps that
sometimes push hundreds of thousands of new docs as fast as Solr will
take them. It would crash relentlessly if I didn't force all calls
to update or query onto the same thread using Objective-C's
@synchronized and a message queue. I never narrowed down whether this
was a Solr issue or a JNI one.


That doesn't sound promising. I'll throw in synchronization around
the update code and see what happens. That doesn't seem good for
performance though. Can Solr as a web app handle multiple updates at
once, or does it synchronize to avoid it?
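
Something like this is what I have in mind (a sketch only;
DirectSolrConnection.request(path, body) is the real API, the wrapper
class is made up):

import org.apache.solr.servlet.DirectSolrConnection;

public class SerializedSolrUpdater {
    private final DirectSolrConnection solr;
    private final Object updateLock = new Object();

    public SerializedSolrUpdater(DirectSolrConnection solr) {
        this.solr = solr;
    }

    // Only one thread may run an update at a time.
    public String update(String xmlBody) throws Exception {
        synchronized (updateLock) {
            return solr.request("/update", xmlBody);
        }
    }

    // Queries go straight through, unsynchronized.
    public String query(String pathAndParams) throws Exception {
        return solr.request(pathAndParams, null);
    }
}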


Thanks,

Adrian Sutton
http://www.symphonious.net


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton

On 11/09/2007, at 8:46 AM, Ryan McKinley wrote:
Lucene opens a lot of files. It can easily get beyond 1024 (I think
that's the default limit). I'm no expert on how the file handling
works, but I think more files are open if you are searching and
writing at the same time.


If you can't increase the limit you can try:
 <useCompoundFile>true</useCompoundFile>

It is slower, but an option if you are unable to change the ulimit on
the deployed machines.


I've done a bit of poking on the server and ulimit doesn't seem to be  
the problem:

e2wiki:~$ ulimit
unlimited
e2wiki:~$ cat /proc/sys/fs/file-max
170355

So either there's something going on behind my back (quite possible,
it's a VM) or Lucene is opening a really insane number of files. I
did check that those values were the same for the tomcat55 user that
Tomcat actually runs as. An lsof -p on the Tomcat process always
shows 40 files in use, and the total open files sits around 1000-1500
even when reindexing all the content. I'll watch it a bit more over
time and see what happens.


I notice that Confluence recommends at least 20 for the max file
limit (at least before they switched to compound indexing), so it's
possible that the 170355 limit could be reached, but it seems
unlikely with our load.


If you need to use this in production soon, I'd suggest sticking  
with 1.2 for a while.  There has been a LOT of action in trunk and  
it may be good to let it settle before upgrading a production system.


You should not need to upgrade to fix the write.lock and Too Many
Open Files problems. Try increasing the ulimit or using a compound
file before upgrading.


We're quite a way off real production; it's just internal use at the
moment (on the real production server, but we're a small company so
we can handle having some problems). I'll try out the current nightly
build and see how it goes, as much as anything out of interest, but I
probably won't pull new builds very often.


Thanks again,

Adrian Sutton
http://www.symphonious.net


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton

On 11/09/2007, at 9:48 AM, Ryan McKinley wrote:

try: ulimit -n

ulimit on its own is something else.  On my machine I get:

[EMAIL PROTECTED]:~$ ulimit
unlimited
[EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max
364770
[EMAIL PROTECTED]:~$ ulimit -n
1024


I have to run:
ulimit -n 2

to get lucene to run w/ a large index...


Bingo, I'm an idiot - or rather, I now know *why* I'm an idiot. :)   
I'll give it a go.


Also, this is likely to be the cause of my write.lock problems - the
Too many open files exception just occurred and the write.lock file
gets left around (should have seen that one coming too).


Thanks for your help, I'm anticipating that this will solve our  
problems.


Regards,

Adrian Sutton
http://www.symphonious.net