Re: Obtaining SOLR index size on disk
On Jul 17, 2009, at 8:45 PM, J G wrote:

> Is it possible to obtain the SOLR index size on disk through the SOLR API? I've read through the docs and mailing list questions but can't seem to find the answer.

No, but it'd be a great addition to the /admin/system handler, which returns lots of other useful trivia like the free memory, ulimit, uptime, and such.

Erik
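Since the API doesn't expose it, one workaround is to measure the index directory on disk directly. A minimal sketch in Java, assuming a default single-core layout (the solr/data/index path is only an illustration; point it at your actual data directory):

    import java.io.File;

    public class IndexSizeOnDisk {
        // Recursively sum the sizes of all files under the index directory.
        static long sizeOf(File f) {
            if (f.isFile()) {
                return f.length();
            }
            long total = 0;
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    total += sizeOf(child);
                }
            }
            return total;
        }

        public static void main(String[] args) {
            // Default path is an assumption; pass your core's index directory as an argument.
            File indexDir = new File(args.length > 0 ? args[0] : "solr/data/index");
            System.out.println("Index size: " + sizeOf(indexDir) + " bytes");
        }
    }

Running this against a live index is safe since it only reads file metadata, though the number can fluctuate while background merges are in flight.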
Truncated XML responses from CoreAdminHandler
The Solr application I'm working on has many concurrently active cores - of the order of 1000s at a time. The management application depends on being able to query Solr for the current set of live cores, a requirement I've been satisfying using the STATUS core admin handler method. However, once the number of active cores reaches a particular threshold (which I haven't determined exactly), the response to the STATUS method is truncated, resulting in malformed XML.

My debugging so far has revealed:

- when doing STATUS queries from the local machine, they succeed, untruncated, 90% of the time
- when local STATUS queries do fail, they are always truncated to the same length: 73685 bytes in my case
- when doing STATUS queries from a remote machine, they fail due to truncation every time
- remote STATUS queries are always truncated to the same length: 24704 bytes in my case
- the failing STATUS queries take visibly longer to complete on the client - a few seconds for a truncated result versus 1 second for an untruncated result
- all STATUS queries return a successful 200 HTTP code
- all STATUS queries are logged as returning in ~700ms in Solr's info log
- during failing (truncated) responses, Solr's CPU usage spikes to saturation
- behaviour seems the same whatever client I use: wget, curl, Python, ...

Using Solr 1.3.0 (r694707), Jetty 6.1.3.

At the moment, the main puzzle for me is that the local and remote behaviour are so different. It leads me to think it is something to do with network transmission speed. But the response really isn't that big (untruncated it's ~1MB), and the CPU spike seems to suggest that something in the process of serialising the core information is taking too long and causing a timeout?

Any suggestions on settings to tweak, ways to get extra debug information, or other ways to ascertain the active core list would be much appreciated!

James
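One way to reproduce the failure mode described here is to fetch the STATUS response and check whether it parses as XML, since the truncation is silent at the HTTP level. A hypothetical minimal client, assuming the default host, port, and CoreAdmin path:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class CoreStatusCheck {
        public static void main(String[] args) throws Exception {
            // Host, port, and context path are assumptions; adjust for your deployment.
            URL url = new URL("http://localhost:8983/solr/admin/cores?action=STATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("HTTP status: " + conn.getResponseCode());
            InputStream in = conn.getInputStream();
            try {
                // A truncated body fails here with a parse error, even though
                // the server reported HTTP 200.
                DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
                System.out.println("Response is well-formed XML");
            } catch (Exception e) {
                System.out.println("Malformed (truncated?) response: " + e.getMessage());
            } finally {
                in.close();
            }
        }
    }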
Re: Truncated XML responses from CoreAdminHandler
James,

Not enough memory and Garbage Collection? Connecting to Solr via JConsole should show it.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
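JConsole attaches from outside the process; the same heap and collector figures can also be read programmatically via the platform MBeans. A small sketch of what to look at (run inside the same JVM, e.g. from a test hook, rather than attached remotely):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapAndGcReport {
        public static void main(String[] args) {
            // Heap usage and per-collector counts -- the data JConsole displays.
            System.out.println("Heap: "
                    + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
            for (GarbageCollectorMXBean gc
                    : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                        + " collections, " + gc.getCollectionTime() + " ms total");
            }
        }
    }

If GC is the culprit, collection time climbing in step with the truncated responses would show it.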
Re: Wikipedia or reuters like index for testing facets?
It's only really effective if the number of tokens in the Sink is expected to be significantly less than the number in the original stream (my various tests showed around 50%, but YMMV), so it isn't likely useful for most copyField situations. For Solr to utilize it, the schema would have to allow for giving ids to the various TokenFilters so that you could identify the Tees and the Sinks. At least that was my first thought on it.

-Grant

On Jul 17, 2009, at 7:50 PM, Jason Rutherglen wrote:

I saw the discussion about TeeSinkTokenFilter on java-user, and was wondering how Solr performs copy fields? Couldn't Solr by default utilize a TeeSinkTokenFilter-like class for copying fields?

> That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?

On Fri, Jul 17, 2009 at 9:57 AM, Grant Ingersoll gsing...@apache.org wrote:

It's likely quite different. That link is meant to be stable for benchmarking purposes within Lucene. Note, one thing I wish I had time for: hook Tee/Sink capabilities into Solr such that one could use the WikipediaTokenizer and then Tee the Categories, etc. off to separate fields automatically for faceting, etc.

-Grant

On Jul 17, 2009, at 10:48 AM, Jason Rutherglen wrote:

The question that comes to mind is how it's different than http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2 Guess we'd need to download it and take a look!

On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin peter.wola...@acquia.com wrote:

AWS provides some standard data sets, including an extract of all wikipedia content: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249 Looks like it's not being updated often, so this or another AWS data set could be a consistent basis for benchmarking?

-Peter

On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Yeah, that's what I was thinking of as an alternative: use enwiki and randomly generate facet data along with it. However, for consistent benchmarking the random data would need to stay the same, so that people could execute the same benchmark consistently in their own environment.

On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller markrmil...@gmail.com wrote:

Why don't you just randomly generate the facet data? That's prob the best way, right? You can control the uniques and ranges.

On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll gsing...@apache.org wrote:

Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer in Lucene can pull out richer syntax which could then be Teed/Sinked to other fields - things like categories, related links, etc. Mostly, though, I was just commenting on the fact that it isn't hard to at least use it for getting docs into Solr.

-Grant

On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:

You think enwiki has enough data for faceting?

On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll gsing...@apache.org wrote:

At a min, it is trivial to use the EnWikiDocMaker and then send the doc over SolrJ...
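A rough sketch of the approach in the last few messages - generate facet values from a fixed random seed so every run produces identical data, and push the documents over SolrJ. The core URL and the category/year field names are illustrative assumptions; the schema would need matching fields:

    import java.util.Random;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RandomFacetIndexer {
        public static void main(String[] args) throws Exception {
            // Fixed seed => identical facet values on every run, keeping the
            // benchmark reproducible across machines and environments.
            Random rnd = new Random(42L);
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("category", "cat" + rnd.nextInt(50)); // 50 unique values
                doc.addField("year", 1990 + rnd.nextInt(20));      // bounded numeric range
                solr.add(doc);
            }
            solr.commit();
        }
    }

Pinning the seed is what makes the benchmark portable: anyone can rebuild the exact same index and compare numbers.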
On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

> Is there a standard index like what Lucene uses for contrib/benchmark for executing faceted queries over? Or maybe we can randomly generate one that works in conjunction with wikipedia? That way we can execute real world queries against faceted data. Or we could use the Lucene/Solr mailing lists and other data (ala Lucid's faceted site) as a standard index?

I don't think there is any standard set of docs for Solr testing - there is not a real benchmark contrib - though I know more than a few of us have hacked up pieces of Lucene benchmark to work with Solr - I think I've done it twice now ;) Would be nice to get things going.

I was thinking the other day: I wonder how hard it would be to make Lucene Benchmark generic enough to accept Solr impls and Solr algs? It does a lot that would suck to duplicate.

--
- Mark
http://www.lucidimagination.com
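For the Tee/Sink idea raised at the top of this thread, a minimal sketch against the Lucene 2.9-era API (field names here are hypothetical; Solr itself has no such wiring, which is Grant's point about needing schema support):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TeeSinkTokenFilter;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class TeeSinkCopyFieldSketch {
        public static Document makeDoc(String text) {
            // The tee wraps the source analysis chain; the sink re-emits the
            // same tokens into a second field, so the text is analyzed once.
            TeeSinkTokenFilter tee = new TeeSinkTokenFilter(
                    new WhitespaceTokenizer(new StringReader(text)));
            TeeSinkTokenFilter.SinkTokenStream sink = tee.newSinkTokenStream();

            Document doc = new Document();
            // Add the tee field first: it must be consumed before the sink.
            doc.add(new Field("body", tee));       // primary field consumes the tee
            doc.add(new Field("body_copy", sink)); // "copy" field fed from the sink
            return doc;
        }
    }

The win over copyField is that the source text goes through analysis once instead of twice - but per the numbers above, that only pays off when the sink keeps a small fraction of the tokens.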
Re: Wikipedia or reuters like index for testing facets?
I have something that maybe could be made into one: http://uncorpora.org/

It is resolutions of the United Nations General Assembly in 6 official languages, aligned at the paragraph level in an XML (Translation Memory eXchange) format. The 6 languages are: English, French, Spanish, Arabic, Chinese, Russian.

Facets could be derived from already-encoded information for:

1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is marked off with XML tags

Markup and all, it is about 170 Mbytes across the 6 languages. If that looks useful, I would be happy to work with more experienced Solr users to beat it into the right shape.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

> Is there a standard index like what Lucene uses for contrib/benchmark for executing faceted queries over? Or maybe we can randomly generate one that works in conjunction with wikipedia? That way we can execute real world queries against faceted data. Or we could use the Lucene/Solr mailing lists and other data (ala Lucid's faceted site) as a standard index?
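To get a feel for what indexing it would involve: TMX is plain XML, so a streaming pass that turns each seg element into a document is close to trivial. A hypothetical sketch (core URL and field names assumed; the session/committee/resolution facets would be filled in from the record metadata described above):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class TmxIndexer {
        public static void main(String[] args) throws Exception {
            // Pass the .tmx file as the first argument.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            int id = 0;
            while (xml.hasNext()) {
                // TMX stores each aligned paragraph as a <tu> holding one
                // <tuv xml:lang="..."><seg>text</seg></tuv> per language;
                // this simplified pass indexes every <seg> it finds.
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "seg".equals(xml.getLocalName())) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(id++));
                    doc.addField("text", xml.getElementText());
                    // Facet fields (session, committee, resolution) would be
                    // filled in here from the surrounding record's metadata.
                    solr.add(doc);
                }
            }
            solr.commit();
        }
    }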