Re: Facets - What's a better term for non technical people?
On 11/12/2007, at 8:32 PM, Benjamin O'Steen wrote:
> So, has anyone got a good example of the language they might use over, say, a set of radio buttons and fields on a web form, to indicate that selecting one or more of these would return facets? 'Show grouping by' or 'List the sets that the results fall into' or something similar.

'Filter by' is what I'd use, which unfortunately is already a term in Solr, though it's very much related since a facet value is generally applied as a filter query. Not close enough to reuse the same term, though. Other terms that are close but not quite right would be 'groups' or 'categories'. Maybe 'Limit to', so facets would be 'limiters'. I think 'facet' is the right term, and what you need is to add "see also"-type entries under a bunch of these other terms.

Regards,
Adrian Sutton
http://www.symphonious.net
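To make the facet/filter-query relationship concrete, the usual Solr request flow looks something like this (the core URL, query term, and field name are illustrative; the parameters `q`, `facet`, `facet.field` and `fq` are standard Solr):

```
# First request: ask for facet counts alongside the results
http://localhost:8983/solr/select?q=dinosaur&facet=true&facet.field=category

# When the user picks the "books" facet value, re-run the query with a
# filter query (fq) restricting results to that set
http://localhost:8983/solr/select?q=dinosaur&facet=true&facet.field=category&fq=category:books
```

So whatever label the UI uses ('Limit to', 'Filter by', ...), each selected facet value typically becomes an additional `fq` parameter on the next request.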
Re: unable to figure out nutch type highlighting in solr....
> One last one: when you send HTML to solr, do you too replace special chars and tags with named entities? I did this and HTMLStripper doesn't seem to recognise the tags :-S While if I try and input HTML as-is the indexer throws exceptions (as having tags within XML tags is obviously not valid). How to do this part?

We didn't do anything at all to the HTML; the editor returns valid XHTML (using numeric entities, never named entities, which aren't valid in XML and don't tend to work in XHTML) and we do string concatenation to build up the /update request body like:

requestBody += "<str name=\"content\">" + xhtmlContent + "</str>";

Solr seems to handle it. From what people are suggesting, though, you'd be better off converting to plain text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) can parse most of the HTML that's around, and you can iterate over the DOM to extract the text from there.

Regards,
Adrian Sutton
http://www.symphonious.net
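For the case where the content is plain text rather than already-valid XHTML, a minimal sketch of building a field element safely looks like this. The class and helper names are hypothetical; the field name "content" matches the snippet above.

```java
// Sketch: hand-building part of a Solr /update request body, escaping the
// characters that are unsafe inside XML text content.
public class UpdateBodyBuilder {

    // Escape the five characters XML reserves; everything else passes through.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap an already-XML-safe value in a <str> field element, as in the
    // string concatenation shown in the mail above.
    static String fieldElement(String name, String xmlSafeValue) {
        return "<str name=\"" + name + "\">" + xmlSafeValue + "</str>";
    }
}
```

If the value really is valid XHTML you can pass it straight to `fieldElement` as we do; if it is arbitrary text, run it through `escapeXml` first.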
Re: unable to figure out nutch type highlighting in solr....
On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
> (Query esp. Adrian): If you are indexing XHTML, do you replace tags with entities before giving it to solr? If so, when you get back snippets do you get tags or entities, or do you convert again to tags for presentation? What's the best way out? It would help me a lot if you briefly explain your configuration.

We happen to develop an HTML editor, so we know 100% for certain that the XHTML is valid XML. Given that, we just throw the raw XHTML at Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this stage we haven't configured highlighting at all, so our index is used for search and for retrieving a document ID. At some point I'd like to add highlighting, and it sounds like the best way to do so would be to index the document text instead of the HTML.

Beyond that, we also use Solr as an optimization for extracting information such as which content was most recently changed, which pages link to others, etc. On the page linking, we actually identify which pages are linked to prior to indexing and store them as a separate field; Solr itself has no understanding of the linking.

Oh, and I should note that I'm very new to Solr, so I'm probably not doing things the best way, but I'm getting great results anyway.

Regards,
Adrian Sutton
http://www.symphonious.net
Re: unable to figure out nutch type highlighting in solr....
> Named entity references are valid in XML. They just need to be declared before they are used[1], unless they are one of the built-in named entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when parsing with an XML parser.

Correct; it was an offhand comment and I skipped over all the details. In general, named entities other than the built-ins aren't declared at the top of the file, and many parsers don't bother to read in external DTDs, so any entities declared there aren't read and are therefore considered invalid.

> XHTML is XML, so if parsed by an XML parser, XML's built-in named entities are available, and if the parser doesn't ignore external entities, then the same set of (roughly) 250 named entities defined in HTML are available as well[2].

Except that no browser I know of actually reads in the XHTML DTD when in standards-compliant mode, so none of those entities are actually viable unless you include the declarations for them at the top of every XHTML document (which is ludicrous). The bottom line is that it's far, far better to use numeric entities in XML and simply ignore all but the built-in named entities if you want any confidence that the document will be parsed correctly -- hence my offhand comment.

Regards,
Adrian Sutton
http://www.symphonious.net
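Rewriting named entities as numeric character references is straightforward. A hedged sketch (the entity table here is a tiny illustrative subset of HTML's roughly 250 named entities, not the full list):

```java
import java.util.Map;

// Sketch: replace known HTML named entities with numeric character
// references so the output parses in any XML parser without a DTD.
// Unknown entities and the text around them pass through untouched.
public class EntityNumericizer {

    private static final Map<String, Integer> NAMED = Map.of(
            "nbsp", 160, "copy", 169, "eacute", 233, "mdash", 8212);

    static String toNumeric(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int semi;
            if (c == '&' && (semi = html.indexOf(';', i)) > i) {
                String name = html.substring(i + 1, semi);
                Integer codePoint = NAMED.get(name);
                if (codePoint != null) {
                    out.append("&#").append(codePoint).append(';');
                    i = semi + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

The built-in entities (amp, lt, etc.) are deliberately absent from the table, so they survive unchanged, matching the advice above to keep only the built-ins named.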
Re: unable to figure out nutch type highlighting in solr....
> I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output.

Is it the HTML analyzer or the fact that it's HTML content? If it's just the analyzer, you could always copy the HTML content to another field with a different analyzer and use that for highlighting (but search on the original field). Would this work, and if so, which analyzer would be suitable for the second field?

Adrian Sutton
http://www.symphonious.net
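A sketch of how that two-field arrangement could look in schema.xml; the field names, and the choice of a plain text type for the copy, are assumptions rather than a tested configuration:

```xml
<!-- Search on "content" (HTML-aware analysis); highlight on a plain-text
     copy. Field and type names here are illustrative. -->
<field name="content" type="html" indexed="true" stored="true"/>
<field name="content_plain" type="text" indexed="true" stored="true"/>
<copyField source="content" dest="content_plain"/>
```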
Re: unable to figure out nutch type highlighting in solr....
On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
> In general, I don't recommend indexing HTML content straight to Solr. None of the Solr contributors do this, so the use case hasn't received a lot of love.

We're indexing XHTML straight to Solr and it's working great so far.

> I'm actually somewhat surprised that several people are interested in this but none have been sufficiently interested to implement a solution to contribute: http://issues.apache.org/jira/browse/SOLR-42
>
> -Mike

Didn't know there was a problem to solve. We're a fair way off actually playing with highlighting, but I'll keep an eye on this for when we get to it.

Thanks,
Adrian Sutton
http://www.symphonious.net
Re: solr locked itself out
> ulimit is unlimited and cat /proc/sys/fs/file-max 11769

I just went through the same kind of mistake: ulimit on its own doesn't report what you think it does. What you should check is `ulimit -n` (the -n isn't just the option for setting the value, you need it when querying too). If you're using bash as your shell, that will almost certainly be 1024, which I've found isn't enough to search and write at the same time. The commit winds up throwing an exception and the lock file(s) get left around, causing further problems. The first thing I'd try is upping ulimit -n to a larger value and seeing if that resolves the issue; it did for me.

Regards,
Adrian Sutton
http://www.symphonious.net
Re: Searching Versioned Resources
On 13/09/2007, at 2:36 PM, Adrian Sutton wrote:
>> I think you can use the CollapseFilter to collapse on version field. However, I think you need to modify the CollapseFilter code to sort by version and get the latest version returned.
>
> Ooo, that's very cool. I assume the patches haven't actually been applied yet? This would let me just collapse on the name field, and if I could get it to sort by modification date before collapsing it'd be perfect. I have a feeling I'm going to wind up extremely lost, but I'll delve into the patch and see what I can find.

For the benefit of the archives (and anyone else following along), it looks like the current version of the patch in JIRA will actually do what I need. The collapse filter will return the first N documents it iterates over and collapse the rest, but before iterating it will sort the documents by the sort parameters you specify. So to get the latest version I simply set sort=version,desc. It seems to work well, though the patch needs to be updated again to work with HEAD; it's not too hard to resolve the differences.

Regards,
Adrian Sutton.
http://www.symphonious.net/
Re: Problem with word 'Repository' in facets
> I am facing an issue with facet fields. I have one field, category, in my schema.xml file and I have mapped it to facet fields. But the problem is, one of the categories is Repository, which displays fine in the category field but in the facet field shows as Repositori; the y is getting changed to i. I don't know why. I tried giving Repositry, and it showed me repos.

This looks like the work of a stemming filter; perhaps you have solr.EnglishPorterFilterFactory as one of the filters for that field? You could add Repository to the stemmer's list of protected words, remove the stemming filter from the indexing analyzer, or add the stemming filter to the query analyzer.

Regards,
Adrian Sutton
http://www.symphonious.net
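For reference, the first option could look something like this in schema.xml; the field type name is illustrative, but `solr.EnglishPorterFilterFactory` and its `protected` attribute are standard Solr:

```xml
<!-- Index-time analyzer with the stemmer told to leave certain words alone.
     Add "Repository" (lowercased, if a lowercase filter runs first) to
     protwords.txt in the conf directory. -->
<fieldtype name="text_stemmed" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
```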
Searching Versioned Resources
Hi all,
The documents we're indexing are versioned, and generally we only want search results to return the latest version of a document. However, there are a couple of scenarios where I'd like to be able to include previous versions in the search results. It feels like a straightforward case of a filter, but given that each document has independent version numbers, it's hard to know what to filter on.

The only solution I can think of at the moment is to index each new version twice: once with the version and once with version=latest. We'd then tweak the ID field in such a way that there is only one version of each document with version=latest. It's then simple to use a filter for version=latest when we search.

Is there a better way? Is there a way to achieve this without having to index the document twice?

Thanks in advance,
Adrian Sutton
http://www.symphonious.net
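The index-it-twice idea can be sketched as plain Java that builds the two `<doc>` elements for an `<add>` request. The class, method, and field names here are illustrative, not from any actual implementation:

```java
// Sketch: each new version is posted both under its real version number and
// under version=latest, with the ID tweaked so only one "latest" copy exists
// per document (re-adding a doc with the same id replaces the old one).
public class VersionedDocs {

    static String doc(String id, String version, String body) {
        return "<doc>"
             + "<field name=\"id\">" + id + "</field>"
             + "<field name=\"version\">" + version + "</field>"
             + "<field name=\"content\">" + body + "</field>"
             + "</doc>";
    }

    // Returns the two <doc> elements to send in one <add> request.
    static String[] addBoth(String docId, int version, String body) {
        return new String[] {
            doc(docId + "-v" + version, Integer.toString(version), body),
            doc(docId + "-latest", "latest", body) // overwrites previous latest
        };
    }
}
```

Searches for current content then just add a filter query of version:latest; searches across history filter it out or omit the filter entirely.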
Re: Searching Versioned Resources
> I think you can use the CollapseFilter to collapse on version field. However, I think you need to modify the CollapseFilter code to sort by version and get the latest version returned.

Ooo, that's very cool. I assume the patches haven't actually been applied yet? This would let me just collapse on the name field, and if I could get it to sort by modification date before collapsing it'd be perfect. I have a feeling I'm going to wind up extremely lost, but I'll delve into the patch and see what I can find.

Thanks,
Adrian Sutton
http://www.symphonious.net
Re: DirectSolrConnection, write.lock and Too Many Open Files
> We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one.

That doesn't sound promising. I'll throw in synchronization around the update code and see what happens. That doesn't seem good for performance, though. Can Solr as a web app handle multiple updates at once, or does it synchronize to avoid it?

Thanks,
Adrian Sutton
http://www.symphonious.net
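One way to get the same effect as the objc message queue on the Java side is to funnel every update through a single-threaded executor. A minimal sketch; `indexDocument` stands in for whatever DirectSolrConnection call you actually make and here just echoes its input so the sketch stays runnable:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: serialize all index updates onto one worker thread so the
// embedded connection never sees concurrent writes.
public class SerializedUpdater {

    private final ExecutorService updateQueue = Executors.newSingleThreadExecutor();

    // Placeholder for the real DirectSolrConnection update call.
    String indexDocument(String xml) {
        return xml;
    }

    // All callers share one queue, so updates run strictly one at a time.
    String submitUpdate(String xml) {
        try {
            return updateQueue.submit(() -> indexDocument(xml)).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() {
        updateQueue.shutdown();
    }
}
```

Callers block until their update completes, which keeps error handling simple; for better throughput you could return the `Future` instead of calling `get()` immediately.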
Re: DirectSolrConnection, write.lock and Too Many Open Files
On 11/09/2007, at 8:46 AM, Ryan McKinley wrote:
> lucene opens a lot of files. It can easily get beyond 1024 (I think the default). I'm no expert on how the file handling works, but I think more files are open if you are searching and writing at the same time. If you can't increase the limit you can try:
>
>   <useCompoundFile>true</useCompoundFile>
>
> It is slower, but if you are unable to change the ulimit on the deployed machines...

I've done a bit of poking on the server and ulimit doesn't seem to be the problem:

e2wiki:~$ ulimit
unlimited
e2wiki:~$ cat /proc/sys/fs/file-max
170355

So there's either something going on behind my back (quite possible, it's a VM) or lucene is opening a really insane number of files. I did check that those values are the same for the tomcat55 user that Tomcat actually runs as. An lsof -p on the Tomcat process always shows 40 files in use, and the total open files sits around 1000-1500 even when reindexing all the content. I'll watch it a bit more over time and see what happens. I notice that Confluence recommends at least 20 for the max file limit, at least before they switched to compound indexing, so it's possible that the 170355 limit could be reached, but it seems unlikely with our load.

> If you need to use this in production soon, I'd suggest sticking with 1.2 for a while. There has been a LOT of action in trunk and it may be good to let it settle before upgrading a production system. You should not need to upgrade to fix the write.lock and Too Many Open Files problem. Try increasing ulimit or using a compound file before upgrading.

We're quite a way off real production; it's just internal use at the moment (on the real product server, but we're a small company so we can handle having some problems). I'll try out the current nightly build and see how it goes, as much as anything out of interest, but I probably won't pull new builds very often.

Thanks again,
Adrian Sutton
http://www.symphonious.net
Re: DirectSolrConnection, write.lock and Too Many Open Files
On 11/09/2007, at 9:48 AM, Ryan McKinley wrote:
> try:
>
>   ulimit -n
>
> ulimit on its own is something else. On my machine I get:
>
>   [EMAIL PROTECTED]:~$ ulimit
>   unlimited
>   [EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max
>   364770
>   [EMAIL PROTECTED]:~$ ulimit -n
>   1024
>
> I have to run: ulimit -n 2 to get lucene to run w/ a large index...

Bingo, I'm an idiot -- or rather, I now know *why* I'm an idiot. :) I'll give it a go. Also, this is likely the cause of my write.lock problems: the Too many files exception just occurred and the write.lock file gets left around (should have seen that one coming too). Thanks for your help; I'm anticipating that this will solve our problems.

Regards,
Adrian Sutton
http://www.symphonious.net