Re: Obtaining SOLR index size on disk

2009-07-18 Thread Erik Hatcher


On Jul 17, 2009, at 8:45 PM, J G wrote:
Is it possible to obtain the SOLR index size on disk through the  
SOLR API? I've read through the docs and mailing list questions but  
can't seem to find the answer.


No, but it'd be a great addition to the /admin/system handler which  
returns lots of other useful trivia like the free memory, ulimit,  
uptime, and such.
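In the meantime it is easy enough to measure from outside Solr; a minimal sketch, assuming a typical single-core layout where the index lives under the data directory (the path is a placeholder):

    // Rough workaround sketch (not a Solr API): sum the file sizes under the
    // index directory. Adjust the placeholder path for your installation.
    import java.io.File;

    public class IndexSizeOnDisk {
        static long sizeOf(File f) {
            if (f.isFile()) {
                return f.length();
            }
            long total = 0;
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    total += sizeOf(child);
                }
            }
            return total;
        }

        public static void main(String[] args) {
            File indexDir = new File("/path/to/solr/data/index"); // placeholder
            System.out.println("Index size on disk: " + sizeOf(indexDir) + " bytes");
        }
    }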


Erik



Truncated XML responses from CoreAdminHandler

2009-07-18 Thread James Brady
The Solr application I'm working on has many concurrently active cores - of
the order of 1000s at a time.

The management application depends on being able to query Solr for the
current set of live cores, a requirement I've been satisfying using the
STATUS core admin handler method.
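(For reference, the STATUS call is just an HTTP GET against the core admin handler; a minimal sketch, assuming the default port and the adminPath from the stock multicore example:)

    // Sketch of the STATUS request; host, port and the /admin/cores path are
    // assumptions based on the example multicore configuration.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CoreStatus {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/admin/cores?action=STATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder body = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                body.append(line).append('\n');
            }
            in.close();
            System.out.println("HTTP " + conn.getResponseCode()
                    + ", " + body.length() + " chars received");
        }
    }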

However, once the number of active cores reaches a particular threshold
(which I haven't determined exactly), the response to the STATUS method is
truncated, resulting in malformed XML.

My debugging so far has revealed:

   - when doing STATUS queries from the local machine, they succeed,
   untruncated, 90% of the time
   - when local STATUS queries do fail, they are always truncated to the
   same length: 73685 bytes in my case
   - when doing STATUS queries from a remote machine, they fail due to
   truncation every time
   - remote STATUS queries are always truncated to the same length: 24704
   bytes in my case
   - the failing STATUS queries take visibly longer to complete on the
   client - a few seconds for a truncated result versus 1 second for an
   untruncated result
   - all STATUS queries return a successful 200 HTTP code
   - all STATUS queries are logged as returning in ~700ms in Solr's info log
   - during failing (truncated) responses, Solr's CPU usage spikes to
   saturation
   - behaviour seems the same whatever client I use: wget, curl, Python, ...

Using Solr 1.3.0 694707, Jetty 6.1.3.

At the moment, the main puzzle for me is that the local and remote
behaviour is so different. That leads me to think it has something to do
with network transmission speed. But the response really isn't that big
(untruncated it's ~1MB), and the CPU spike seems to suggest that something
in the process of serialising the core information is taking too long and
causing a timeout?

Any suggestions on settings to tweak, ways to get extra debug information,
or ascertain the active core list in some other way would be much
appreciated!
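For what it's worth, a minimal client-side check along these lines would record exactly how many bytes arrive and whether they parse as XML (the URL is an assumption; adjust for the deployment):

    // Hedged sketch: fetch the STATUS response, record its exact length, and
    // check whether it is well-formed XML, to see where truncation occurs.
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class StatusTruncationCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/admin/cores?action=STATUS");
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                buf.write(chunk, 0, n);
            }
            in.close();
            byte[] bytes = buf.toByteArray();
            System.out.println("Received " + bytes.length + " bytes");
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new ByteArrayInputStream(bytes));
                System.out.println("Response is well-formed XML");
            } catch (Exception e) {
                System.out.println("Truncated/malformed XML: " + e.getMessage());
            }
        }
    }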

James


Re: Truncated XML responses from CoreAdminHandler

2009-07-18 Thread Otis Gospodnetic

James,

Not enough memory and Garbage Collection?  Connecting to Solr via JConsole 
should show it.
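For the record, the heap numbers JConsole shows can also be pulled over JMX programmatically; a hedged sketch, assuming Solr's JVM was started with the usual com.sun.management.jmxremote flags (host and port below are placeholders):

    // Sketch: read Solr's heap usage remotely over JMX (what JConsole displays).
    // Assumes the Solr JVM exposes JMX; host and port are placeholders.
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SolrHeapCheck {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("Heap: " + memory.getHeapMemoryUsage());
            connector.close();
        }
    }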


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Wikipedia or reuters like index for testing facets?

2009-07-18 Thread Grant Ingersoll
It's only really effective if the number of tokens in the Sink is  
expected to be significantly less than the main stream (my various tests  
showed around 50% or less, but YMMV), so it isn't likely useful for most  
copy-field situations.  For Solr to utilize it, the schema would have to  
allow for giving ids to the various TokenFilters so that you could  
identify the Tees and the Sinks.  At least that was my first thought on it.
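Very roughly, an untested sketch of the idea, using the trunk TeeSinkTokenFilter and contrib WikipediaTokenizer class names of the time (there is no Solr hook for any of this yet):

    // Rough, untested sketch: index the full Wikipedia text in one field and
    // tee only the category tokens off to a second field for faceting.
    import java.io.StringReader;

    import org.apache.lucene.analysis.TeeSinkTokenFilter;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.util.AttributeSource;
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

    public class TeeCategoriesSketch {
        public static Document build(String wikiMarkup) {
            Tokenizer wiki = new WikipediaTokenizer(new StringReader(wikiMarkup));
            TeeSinkTokenFilter tee = new TeeSinkTokenFilter(wiki);
            // Sink that accepts only tokens the tokenizer typed as categories.
            TeeSinkTokenFilter.SinkTokenStream categories =
                tee.newSinkTokenStream(new TeeSinkTokenFilter.SinkFilter() {
                    public boolean accept(AttributeSource source) {
                        TypeAttribute type =
                            (TypeAttribute) source.getAttribute(TypeAttribute.class);
                        return WikipediaTokenizer.CATEGORY.equals(type.type());
                    }
                });
            Document doc = new Document();
            doc.add(new Field("body", tee));            // main stream, consumed first
            doc.add(new Field("category", categories)); // teed-off category tokens
            return doc;
        }
    }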


-Grant
On Jul 17, 2009, at 7:50 PM, Jason Rutherglen wrote:


I saw the discussion about TeeSinkTokenFilter on java-user, and
was wondering how Solr performs copy fields. Couldn't Solr by
default utilize a TeeSinkTokenFilter-like class for copying
fields?

That link is meant to be stable for benchmarking purposes within  
Lucene.


The fields are different?

On Fri, Jul 17, 2009 at 9:57 AM, Grant Ingersoll gsing...@apache.org wrote:

It's likely quite different.  That link is meant to be stable for
benchmarking purposes within Lucene.

Note, one thing I wish I had time for: Hook in Tee/Sink capabilities into
Solr such that one could use the WikipediaTokenizer and then Tee the
Categories, etc. off to separate fields automatically for faceting, etc.

-Grant

On Jul 17, 2009, at 10:48 AM, Jason Rutherglen wrote:


The question that comes to mind is how it's different than

http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

Guess we'd need to download it and take a look!

On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin peter.wola...@acquia.com wrote:


AWS provides some standard data sets, including an extract of all
wikipedia content:


http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data set
could be a consistent basis for benchmarking?

-Peter

On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:


Yeah that's what I was thinking of as an alternative, use enwiki
and randomly generate facet data along with it. However for
consistent benchmarking the random data would need to stay the
same so that people could execute the same benchmark
consistently in their own environment.
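A minimal sketch of one way to keep the random data the same everywhere: derive every facet value from a fixed seed, so any machine regenerates identical values (the field names and ranges below are purely illustrative):

    // Illustrative sketch: deterministic "random" facet data. With a fixed
    // seed, every run on every machine generates the same facet values, so
    // the benchmark stays reproducible.
    import java.util.Random;

    public class SeededFacetData {
        public static String[] facetsForDoc(long docId) {
            Random rnd = new Random(42L + docId);         // fixed base seed per document
            String category = "cat" + rnd.nextInt(100);   // 100 unique categories
            String source   = "source" + rnd.nextInt(10); // 10 unique sources
            String year     = Integer.toString(1990 + rnd.nextInt(20)); // a date-ish range
            return new String[] { category, source, year };
        }
    }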

On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller markrmil...@gmail.com wrote:


Why don't you just randomly generate the facet data? That's prob the best
way, right? You can control the uniques and ranges.

On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll gsing...@apache.org wrote:


Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
in Lucene can pull out richer syntax which could then be Teed/Sinked to
other fields.  Things like categories, related links, etc.  Mostly, though,
I was just commenting on the fact that it isn't hard to at least use it for
getting docs into Solr.

-Grant

On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:

 You think enwiki has enough data for faceting?


On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll gsing...@apache.org wrote:

At a min, it is trivial to use the EnWikiDocMaker and then send the doc
over SolrJ...
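A hedged sketch of the SolrJ half of that (1.3-era SolrJ API; the field names are placeholders and the EnwikiDocMaker loop is elided):

    // Sketch of "send the doc over SolrJ"; assume `title` and `body` were
    // pulled out of a document produced by the benchmark contrib's
    // EnwikiDocMaker. Field names and the Solr URL are placeholders.
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class EnwikiToSolr {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // ... loop over EnwikiDocMaker output, extracting title/body per article ...
            String title = "Example article";
            String body = "Example body text";
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", title);
            doc.addField("title", title);
            doc.addField("body", body);
            solr.add(doc);
            solr.commit();
        }
    }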

On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Is there a standard index like what Lucene uses for contrib/benchmark for
executing faceted queries over? Or maybe we can randomly generate one that
works in conjunction with wikipedia? That way we can execute real world
queries against faceted data. Or we could use the Lucene/Solr mailing lists
and other data (ala Lucid's faceted site) as a standard index?



I don't think there is any standard set of docs for solr testing - there is
not a real benchmark contrib - though I know more than a few of us have
hacked up pieces of Lucene benchmark to work with Solr - I think I've done
it twice now ;)

Would be nice to get things going. I was thinking the other day: I wonder
how hard it would be to make Lucene Benchmark generic enough to accept Solr
impls and Solr algs?

It does a lot that would suck to duplicate.

--
--
- Mark

http://www.lucidimagination.com








--
--
- Mark

http://www.lucidimagination.com







--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com







--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search

Re: Wikipedia or reuters like index for testing facets?

2009-07-18 Thread Alexandre Rafalovitch
I have something that maybe could be made into one: http://uncorpora.org/

It is resolutions of the United Nations General Assembly in 6 official
languages aligned on a paragraph level in an XML (Translation Memory
eXchange) format. The 6 languages are: English, French, Spanish,
Arabic, Chinese, Russian.

Facets could be derived from already encoded information for:
1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is
marked off with XML tags

Markup and all, it is about 170 Mbytes across the 6 languages.

If that looks useful, I would be happy to work with more experienced
Solr users to beat it into the right shape.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:
 Is there a standard index like what Lucene uses for contrib/benchmark for
 executing faceted queries over? Or maybe we can randomly generate one that
 works in conjunction with wikipedia? That way we can execute real world
 queries against faceted data. Or we could use the Lucene/Solr mailing lists
 and other data (ala Lucid's faceted site) as a standard index?