Default result rows

2008-06-18 Thread Mihails Agafonovs
Hi!

Where can I define how many rows must be returned in the result?
The default is 10, and specifying another value each time through the
URL or the advanced interface isn't convenient.
 Kind regards, Mihails

Deleting Solr index

2008-06-18 Thread Mihails Agafonovs
How can I clear the whole Solr index?
 Kind regards, Mihails

Re: Deleting Solr index

2008-06-18 Thread j . L
just rm -r SOLR_DIR/data/index.
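A rough sketch of that approach (the paths and the stop/start steps are
assumptions about a typical setup, not part of the original reply); Solr should
be stopped first so nothing holds the index open, and an empty index is
recreated on restart:

  # stop the servlet container running Solr (command depends on your setup)
  cd $SOLR_HOME            # e.g. example/solr in the distribution layout
  rm -rf data/index
  # restart Solr; it recreates an empty data/index on startup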


2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]:

 How can I clear the whole Solr index?
  Kind regards, Mihails




-- 
regards
j.L


Re: Default result rows

2008-06-18 Thread Shalin Shekhar Mangar
You can configure this in solrconfig.xml under the "defaults" section for
the StandardRequestHandler:

<requestHandler name="standard" class="solr.StandardRequestHandler"
default="true">
<!-- default values for query parameters -->
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">30</int>
   <str name="fl">*</str>
   <str name="version">2.1</str>
 </lst>
  </requestHandler>
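As a quick check (hostname, port, and query term are placeholders based on the
example Jetty setup), after restarting Solr so the new defaults are picked up,
a request that omits the rows parameter should come back with up to 30 documents:

  curl 'http://localhost:8983/solr/select?q=solr'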

2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]:

 Hi!

 Where can I define how many rows must be returned in the result?
 The default is 10, and specifying another value each time through the
 URL or the advanced interface isn't convenient.
  Kind regards, Mihails




-- 
Regards,
Shalin Shekhar Mangar.


Re: Deleting Solr index

2008-06-18 Thread Shalin Shekhar Mangar
You can delete by query *:* (which matches all documents)

http://wiki.apache.org/solr/UpdateXmlMessages
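For example, following that wiki page, the delete and the commit can be posted
with curl (hostname and port are placeholders for the example Jetty setup):

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
       --data-binary '<delete><query>*:*</query></delete>'
  curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
       --data-binary '<commit/>'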

2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]:

 How can I clear the whole Solr index?
  Kind regards, Mihails




-- 
Regards,
Shalin Shekhar Mangar.


Re: Default result rows

2008-06-18 Thread Mihails Agafonovs
Doesn't work :(. None of the parameters in the defaults section is
being read. Solr still uses the predefined default parameters.

P.S. In the defaults section I should also be able to specify what
stylesheet to use, right?
 Quoting Shalin Shekhar Mangar: You can configure this in
 solrconfig.xml under the "defaults" section for the
 StandardRequestHandler:
 <requestHandler name="standard" class="solr.StandardRequestHandler"
 default="true">
 <!-- default values for query parameters -->
 <lst name="defaults">
 <str name="echoParams">explicit</str>
 <int name="rows">30</int>
 <str name="fl">*</str>
 <str name="version">2.1</str>
 </lst>
 </requestHandler>
 2008/6/18 Mihails Agafonovs <[EMAIL PROTECTED]>:
 > Hi!
 >
 > Where can I define how many rows must be returned in the result?
 > The default is 10, and specifying another value each time through the
 > URL or the advanced interface isn't convenient.
 >  Kind regards, Mihails
 --
 Regards,
 Shalin Shekhar Mangar.
 Kind regards, Mihails



SOLR-236 patch works

2008-06-18 Thread JLIST
I had the patch problem, but I manually created that file and the
Solr nightly builds fine.

After replacing solr.war with apache-solr-solrj-1.3-dev.jar,
in solrconfig.xml, I added this:

<searchComponent name="collapse"
  class="org.apache.solr.handler.component.CollapseComponent" />

Then I added this to the standard and dismax request handlers:
<requestHandler name="standard" ...>
  <arr name="components">
    <str>collapse</str>
  </arr>
</requestHandler>

I added collapse.field=<field>&collapse.threshold=<n> to the query, and the
result collapsed as expected.
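For reference, a request using those parameters might look like the following
sketch (hostname, port, query, field name, and threshold are all placeholders,
since the real values depend on the schema):

  curl -G 'http://localhost:8983/solr/select' \
       --data-urlencode 'q=ipod' \
       --data-urlencode 'collapse.field=site' \
       --data-urlencode 'collapse.threshold=1'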

 Can you provide feedback about this particular patch once you try
 it?  I'd like to get it on Solr 1.3, actually, so any feedback would
 help.

 Thanks,
 Otis





Re: Did you mean functionality

2008-06-18 Thread Lucas F. A. Teixeira

Yeah, I read it.
Thanks a lot, I'm waiting for it!

[]s,

Lucas

Lucas Frare A. Teixeira
[EMAIL PROTECTED]
Tel: +55 11 3660.1622 - R3018



Grant Ingersoll wrote:

Also see http://wiki.apache.org/solr/SpellCheckComponent

I expect to commit fairly soon.

On Jun 17, 2008, at 5:46 PM, Otis Gospodnetic wrote:


Hi Lucas,

Have a look at (the patch in) SOLR-572, lots of work happening there 
as we speak.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 

From: Lucas F. A. Teixeira [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, June 17, 2008 4:30:12 PM
Subject: Did you mean functionality

Hello everybody,

I need to integrate the Lucene SpellChecker contrib lib in my
application, but I'm using the EmbeddedSolrServer to access all
indexes.
I want to know what I should do (if someone has any step-by-step guide,
link, tutorial, or smoke signal) during indexing, and of
course how to search through the words generated by this API.

I can use the lib itself to search the suggestions, without using Solr,
but I'm confused about how I should proceed when indexing these docs.

Thanks a lot,

[]s,

--
Lucas Frare A. Teixeira
[EMAIL PROTECTED]
Tel: +55 11 3660.1622 - R3018









Re: Feature idea - delete and commit from web interface ?

2008-06-18 Thread Koji Sekiguchi

A patch for this had been posted before, though I don't know whether it can delete.
It can add documents and commit from the admin GUI.

https://issues.apache.org/jira/browse/SOLR-85

Koji

JLIST wrote:

It seems that the web interface only supports select but not delete.
Is it possible to do delete from the browser? It would be nice to be
able to do delete and commit, and even post (put XML in an html form)
from the admin web interface :)

Also, does delete have to be a POST? A GET should do.




  




Bug Solr/bin/commit problem - fails to commit correctly and render response

2008-06-18 Thread McBride, John
Hello,

I am using the solr/bin/commit file to commit index changes after index
distribution in the collection distribution operations model.

The commit script is printed at the end of the email.

When I run the script as is, I get the following error:

commit request to Solr at port 8080 failed

This is corrected with the following addition to the line:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
Becomes:
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>" -H 'Content-type:text/xml; charset=utf-8'`

This works, but the log reports an error, because the response is not as
expected.
Solr returns: <int name="status">0</int>

But the commit script expects the response to match the regular
expression 'result.*status="0"'.
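One local workaround (only a sketch, not the official fix) is to make the status
check in the script match the XML that Solr actually returns:

  # old check in solr/bin/commit
  #   echo $rs | grep '<result.*status="0"' > /dev/null 2>&1
  # replacement matching the <int name="status">0</int> response
  echo $rs | grep '<int name="status">0</int>' > /dev/null 2>&1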


Has anybody else had problems using this commit script?
Where can I get the latest version?  I got this script from the solr 1.2
package.

Thanks,
John

---
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Shell script to force a commit of all changes since last commit
# for a Solr server

orig_dir=$(pwd)
cd ${0%/*}/..
solr_root=$(pwd)
cd ${orig_dir}

unset solr_hostname solr_port webapp_name user verbose debug
. ${solr_root}/bin/scripts-util

# set up variables
prog=${0##*/}
log=${solr_root}/logs/${prog}.log

# define usage string
USAGE="\
usage: $prog [-h hostname] [-p port] [-w webapp_name] [-u username] [-v]
   -h  specify Solr hostname
   -p  specify Solr port number
   -w  specify name of Solr webapp (defaults to solr)
   -u  specify user to sudo to before running script
   -v  increase verbosity
   -V  output debugging info
"

# parse args
while getopts h:p:w:u:vV OPTION
do
case $OPTION in
h)
solr_hostname=$OPTARG
;;
p)
solr_port=$OPTARG
;;
w)
webapp_name=$OPTARG
;;
u)
user=$OPTARG
;;
v)
verbose=v
;;
V)
debug=V
;;
*)
echo "$USAGE"
exit 1
esac
done

[[ -n $debug ]] && set -x

if [[ -z ${solr_port} ]]
then
echo "Solr port number missing in $confFile or command line."
echo "$USAGE"


exit 1
fi

# use default hostname if not specified
if [[ -z ${solr_hostname} ]]
then
solr_hostname=localhost
fi

# use default webapp name if not specified
if [[ -z ${webapp_name} ]]
then
webapp_name=solr
fi

fixUser $@

start=`date +%s`

logMessage started by $oldwhoami
logMessage command: $0 $@

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
if [[ $? != 0 ]]
then
  logMessage failed to connect to Solr server at port ${solr_port}
  logMessage commit failed
  logExit failed 1
fi

# check status of commit request
echo $rs | grep '<result.*status="0"' > /dev/null 2>&1
if [[ $? != 0 ]]
then
  logMessage commit request to Solr at port ${solr_port} failed:
  logMessage $rs
  logExit failed 2
fi

logExit ended 0
---



RE: Bug Solr/bin/commit problem - fails to commit correctly and render response

2008-06-18 Thread McBride, John
Ok I checked out the nightly builds and the two changes have been made.

I will use the SOLR 1.3 version of solr/bin/commit.

Thanks,
John 

-Original Message-
From: McBride, John [mailto:[EMAIL PROTECTED] 
Sent: 18 June 2008 11:48
To: solr-user@lucene.apache.org
Subject: Bug Solr/bin/commit problem - fails to commit correctly and
render response

Hello,

I am using the solr/bin/commit file to commit index changes after index
distribution in the collection distribution operations model.

The commit script is printed at the end of the email.

When I run the script as is, I get the following error:

commit request to Solr at port 8080 failed

This is corrected with the following addition to the line:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
Becomes:
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>" -H 'Content-type:text/xml; charset=utf-8'`

This works, but the log reports an error, because the response is not as
expected.
Solr returns: <int name="status">0</int>

But the commit script expects the response to match the regular
expression 'result.*status="0"'.


Has anybody else had problems using this commit script?
Where can I get the latest version?  I got this script from the solr 1.2
package.

Thanks,
John

---
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Shell script to force a commit of all changes since last commit
# for a Solr server

orig_dir=$(pwd)
cd ${0%/*}/..
solr_root=$(pwd)
cd ${orig_dir}

unset solr_hostname solr_port webapp_name user verbose debug
. ${solr_root}/bin/scripts-util

# set up variables
prog=${0##*/}
log=${solr_root}/logs/${prog}.log

# define usage string
USAGE="\
usage: $prog [-h hostname] [-p port] [-w webapp_name] [-u username] [-v]
   -h  specify Solr hostname
   -p  specify Solr port number
   -w  specify name of Solr webapp (defaults to solr)
   -u  specify user to sudo to before running script
   -v  increase verbosity
   -V  output debugging info
"

# parse args
while getopts h:p:w:u:vV OPTION
do
case $OPTION in
h)
solr_hostname=$OPTARG
;;
p)
solr_port=$OPTARG
;;
w)
webapp_name=$OPTARG
;;
u)
user=$OPTARG
;;
v)
verbose=v
;;
V)
debug=V
;;
*)
echo "$USAGE"
exit 1
esac
done

[[ -n $debug ]] && set -x

if [[ -z ${solr_port} ]]
then
echo "Solr port number missing in $confFile or command line."
echo "$USAGE"


exit 1
fi

# use default hostname if not specified
if [[ -z ${solr_hostname} ]]
then
solr_hostname=localhost
fi

# use default webapp name if not specified
if [[ -z ${webapp_name} ]]
then
webapp_name=solr
fi

fixUser $@

start=`date +%s`

logMessage started by $oldwhoami
logMessage command: $0 $@

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
if [[ $? != 0 ]]
then
  logMessage failed to connect to Solr server at port ${solr_port}
  logMessage commit failed
  logExit failed 1
fi

# check status of commit request
echo $rs | grep '<result.*status="0"' > /dev/null 2>&1
if [[ $? != 0 ]]
then
  logMessage commit request to Solr at port ${solr_port} failed:
  logMessage $rs
  logExit failed 2
fi

logExit ended 0
---



never desallocate RAM...during search

2008-06-18 Thread Roberto Nieto
Hi users,

Some days ago I asked a question about RAM use during searches, but I didn't
solve my problem with the ideas that some expert users gave me. After running
some tests I can ask a more specific question, hoping someone can help me.

My problem is that I need highlighting and I have quite big docs (text files
of 40MB). The conclusion of my tests is that if I set rows to 10, the content
of the first 10 results is cached. This is probably normal because it is
needed for the highlighting, but this memory is never deallocated even though
I set Solr's caches to 0. Because of this, the memory grows until it is close
to the heap limit; then the GC starts to deallocate memory, but at that point
the searches are quite slow. Is this normal behavior? Can I configure some
Solr parameter to force the deallocation of results after each search?
[I'm using Solr 1.2]

Another thing I found is that although I comment out (in solrconfig) all of
these options:
  filterCache, queryResultCache, documentCache, enableLazyFieldLoading,
  useFilterForSortedQuery, boolTofilterOptimizer
the stats always show caching:true.

I'm probably missing some stupid thing, but I can't find it.

If anyone can help me... I'm quite desperate.


Rober.


Re: Default result rows

2008-06-18 Thread Otis Gospodnetic
Use rows=NNN in the URL.
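For example (placeholder host, port, and query; a per-request value overrides
whatever default is configured in solrconfig.xml):

  curl 'http://localhost:8983/solr/select?q=solr&rows=50'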


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: Mihails Agafonovs [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 18, 2008 4:30:53 AM
 Subject: Default result rows
 
 Hi!
 
 Where can I define how many rows must be returned in the result?
 The default is 10, and specifying another value each time through the
 URL or the advanced interface isn't convenient.
 Kind regards, Mihails



Re: SOLR-236 patch works

2008-06-18 Thread Otis Gospodnetic
That looks right.  CollapseComponent replaces QueryComponent.


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: JLIST [EMAIL PROTECTED]
 To: Otis Gospodnetic solr-user@lucene.apache.org
 Sent: Wednesday, June 18, 2008 5:24:25 AM
 Subject: SOLR-236 patch works
 
 I had the patch problem but I manually created that file and
 solr nightly builds fine.
 
 After replacing solr.war with apache-solr-solrj-1.3-dev.jar,
 in solrconfig.xml, I added this:
 
 
 <searchComponent name="collapse"
   class="org.apache.solr.handler.component.CollapseComponent" />
 
 Then I added this to the standard and dismax request handlers:
 
 <requestHandler name="standard" ...>
   <arr name="components">
     <str>collapse</str>
   </arr>
 </requestHandler>
 
 I added collapse.field=<field>&collapse.threshold=<n> to the query, and the
 result collapsed as expected.
 
  Can you provide feedback about this particular patch once you try
  it?  I'd like to get it on Solr 1.3, actually, so any feedback would
  help.
 
  Thanks,
  Otis



Re: Feature idea - delete and commit from web interface ?

2008-06-18 Thread Otis Gospodnetic
As for POST vs. GET - don't let REST purists hear you. :)
Actually, isn't there a DELETE HTTP method that REST purists would say should 
be used in case of doc deletion?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: JLIST [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 18, 2008 4:13:09 AM
 Subject: Feature idea - delete and commit from web interface ?
 
 It seems that the web interface only supports select but not delete.
 Is it possible to do delete from the browser? It would be nice to be
 able to do delete and commit, and even post (put XML in an html form)
 from the admin web interface :)
 
 Also, does delete have to be a POST? A GET should do.



Re: never desallocate RAM...during search

2008-06-18 Thread Otis Gospodnetic
Hi,
I don't have the answer about why cache still shows true, but as far as 
memory usage goes, based on your description I'd guess the memory is allocated 
and used by the JVM which typically  tries not to run GC unless it needs to.  
So if you want to get rid of that used memory, you need to talk to the JVM and 
persuade it to run GC.  I don't think there is a way to manage memory usage 
directly.  There is System.gc() that you can call, but that's only a 
suggestion for the JVM to run GC.
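As a side note (an assumption about the deployment, not something stated in the
thread), the heap ceiling itself is fixed when the JVM hosting Solr is started,
e.g. with the example Jetty launcher:

  java -Xmx512m -jar start.jar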


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: Roberto Nieto [EMAIL PROTECTED]
 To: solr-user solr-user@lucene.apache.org
 Sent: Wednesday, June 18, 2008 7:43:12 AM
 Subject: never desallocate RAM...during search
 
 Hi users,
 
 Some days ago I asked a question about RAM use during searches, but I didn't
 solve my problem with the ideas that some expert users gave me. After running
 some tests I can ask a more specific question, hoping someone can help me.
 
 My problem is that I need highlighting and I have quite big docs (text files
 of 40MB). The conclusion of my tests is that if I set rows to 10, the content
 of the first 10 results is cached. This is probably normal because it is
 needed for the highlighting, but this memory is never deallocated even though
 I set Solr's caches to 0. Because of this, the memory grows until it is close
 to the heap limit; then the GC starts to deallocate memory, but at that point
 the searches are quite slow. Is this normal behavior? Can I configure some
 Solr parameter to force the deallocation of results after each search?
 [I'm using Solr 1.2]
 
 Another thing I found is that although I comment out (in solrconfig) all of
 these options:
  filterCache, queryResultCache, documentCache, enableLazyFieldLoading,
  useFilterForSortedQuery, boolTofilterOptimizer
 the stats always show caching:true.
 
 I'm probably missing some stupid thing, but I can't find it.
 
 If anyone can help me... I'm quite desperate.
 
 
 Rober.



Re: missing document count?

2008-06-18 Thread Chris Hostetter

: not hard, but useful information to have handy without additional
: manipulations on my part.

: our pages are the results of multiple queries.  so, given a max number of
: records per page (or total), the rows asked of query2 is max - query1, of

in the common case, counting the number of docs in a result is just as 
easy as reading some attribute containing the count.  It sounds like you 
have a more complicated case where what you really want is the count of 
how many docs there are in the entire response (ie: multiple result 
sections) ... that count is admittedly a little more work but would also be 
completely useless to most clients if it was included in the response 
(just as the number of fields in each doc, or the total number of strings 
in the response) ... there is a lot of metadata that *could* be included 
in the response, but we don't bother when the client can compute that 
metadata just as easily as the server -- among other things, it helps keep 
the response size smaller.

This was actually one of the original guiding principles of Solr: support 
features that are faster/cheaper/easier/more-efficient on the central 
server than they would be on the clients (sorting, docset caching, 
faceting, etc...)



-Hoss



Re: missing document count?

2008-06-18 Thread Geoffrey Young



Chris Hostetter wrote:

: not hard, but useful information to have handy without additional
: manipulations on my part.

: our pages are the results of multiple queries.  so, given a max number of
: records per page (or total), the rows asked of query2 is max - query1, of

in the common case, counting the number of docs in a result is just as 
easy as reading some attribute containing the count. 


I suppose :)  in my mind, one (potentially) requires just a read, while 
the other requires some further manipulations.  but I suppose most 
modern languages have optimizations for things like array size :)


It sounds like you 
have a more complicated case where what you really want is the count of 
how many docs there are in the entire response 


I don't know how complex it is to ask for documents in the response, but 
yes :)


(ie: multiple result 
sections) ... 


multiple results from multiple queries, not a single query.

but really, I wasn't planning on having anyone (solr or otherwise) 
solving my needs.  I just find it odd that I need to discern the number 
of returned results.


that count is admittedly a little more work but would also be 
completely useless to most clients if it was included in the response 


perhaps :)

(just as the number of fields in each doc, or the total number of strings 
in the response) ... there is a lot of metadata that *could* be included 
in the response, but we don't bother when the client can compute that 
metadata just as easily as the server -- among other things, it helps keep 
the response size smaller.


agreed - smaller is better.

as for 'the client as easily as the server', I assumed that solr was keeping 
track of the document count already, if only to see when the number of 
documents exceeds the rows parameter.  if so, all the people who care 
about number of documents in the result (which, I'll assume, is more 
than those who care about total strings in the response ;) are all 
re-computing a known value.




This was actually one of the original guiding principles of Solr: support 
features that are faster/cheaper/easier/more-efficient on the central 
server than they would be on the clients (sorting, docset caching, 
faceting, etc...)


sure, I'll buy that.  but in my mind it was only exposing something solr 
already was calculating anyway.


regardless, thanks for taking the time :)

--Geoff


Re[2]: Feature idea - delete and commit from web interface ?

2008-06-18 Thread JLIST
GET makes it possible to delete from a browser address bar,
which you can not do with DELETE :)

 As for POST vs. GET - don't let REST purists hear you. :)
 Actually, isn't there a DELETE HTTP method that REST purists
 would say should be used in case of doc deletion?




Re[2]: Feature idea - delete and commit from web interface ?

2008-06-18 Thread JLIST

Sounds like web designer's fault. No permission check and no
confirmation for deletion?

 Never, never delete with a GET. The Ultraseek spider deleted 20K
 documents on an intranet once because they gave it admin perms and
 it followed the "delete this page" link on every page.




Re: Re[2]: Feature idea - delete and commit from web interface ?

2008-06-18 Thread Walter Underwood
The spider was given an admin login so it could access all
content. Reasonable decision if the pages had been designed well.

Even with a confirmation, never delete with a GET. Use POST.
If the spider ever discovers the URL that the confirmation
uses, it will still delete the content.

Luckily, they had a backup.

wunder

On 6/18/08 1:55 PM, JLIST [EMAIL PROTECTED] wrote:

 
 Sounds like web designer's fault. No permission check and no
 confirmation for deletion?
 
 Never, never delete with a GET. The Ultraseek spider deleted 20K
 documents on an intranet once because they gave it admin perms and
 it followed the "delete this page" link on every page.
 
 



Re: Re[2]: Feature idea - delete and commit from web interface ?

2008-06-18 Thread Craig McClanahan
On Wed, Jun 18, 2008 at 1:55 PM, JLIST [EMAIL PROTECTED] wrote:

 Sounds like web designer's fault. No permission check and no
 confirmation for deletion?


Nope ... application designer's fault for misusing the web.  Allowing
deletes on a GET violates HTTP/1.1 requirements (not just RESTful
ones) that GET requests not have side effects, so an app that works
that way is going to mess up when HTTP caching is in use ... as lots
of people found to their chagrin when they installed Google Desktop's
caching capabilities, and the cache played by the standard HTTP rules
(GETs are supposed to be idempotent, having no side effects, so it's
just fine to issue the same GET as many times as desired).

If you want an easy way to do deletes from a browser, just set up a
little form that does a POST and includes the id of the document you
want to delete.  Then you're playing by the rules, and won't make a
fool of yourself when crawlers or caches interact with your
application.
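Outside a browser, the same kind of POST can be scripted; a rough curl
equivalent (document id, host, and port are placeholders):

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
       --data-binary '<delete><id>SOLR1000</id></delete>'
  curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
       --data-binary '<commit/>'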

Craig McClanahan

 Never, never delete with a GET. The Ultraseek spider deleted 20K
 documents on an intranet once because they gave it admin perms and
 it followed the "delete this page" link on every page.





Re: scaling / sharding questions

2008-06-18 Thread Phillip Farber
This may be slightly off topic, for which I apologize, but is related to 
the question of searching several indexes as Lance describes below, quoting:


 We also found that searching a few smaller indexes via the Solr 1.3 
Distributed Search feature is actually faster than searching one large

index, YMMV.

The wiki describing distributed search lists several limitations which 
set me to wonder about two limitations in particular and what the value 
is mainly with respect to scoring:


1) No distributed idf

Does this mean that the Lucene scoring algorithm is computed without the 
idf factor, i.e. we just get term frequency scoring?


2) Doesn't support consistency between stages, e.g. a shard index can be 
changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS


What does this mean or where can I find out what it means?

Thanks!

Phil




Lance Norskog wrote:

Yes, I've done this split-by-delete several times. The halved index still
uses as much disk space until you optimize it.

As to splitting policy: we use an MD5 signature as our unique ID. This has
the lovely property that we can wildcard.  'contentid:f*' denotes 1/16 of
the whole index. This 1/16 is a very random sample of the whole index. We
use this for several things. If we use this for shards, we have a query that
matches a shard's contents.

The Solr/Lucene syntax does not support modular arithmetic, and so it will
not let you query a subset that matches one of your shards.

We also found that searching a few smaller indexes via the Solr 1.3
Distributed Search feature is actually faster than searching one large
index, YMMV. So for us, a large pile of shards will be optimal anyway, so we
have no need to rebalance.

It sounds like you're not storing the data in a backing store, but are
storing all data in the index itself. We have found this challenging.

Cheers,

Lance Norskog

-Original Message-
From: Jeremy Hinegardner [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 13, 2008 3:36 PM

To: solr-user@lucene.apache.org
Subject: Re: scaling / sharding questions

Sorry for not keeping this thread alive, lets see what we can do...

One option I've thought of for 'resharding' would splitting an index into
two by just copying it, the deleting 1/2 the documents from one, doing a
commit, and delete the other 1/2 from the other index and commit.  That is:

  1) Take original index
  2) copy to b1 and b2
  3) delete docs from b1 that match a particular query A
  4) delete docs from b2 that do not match a particular query A
  5) commit b1 and b2

Has anyone tried something like that?

As for how to know where each document is stored, generally we're
considering unique_document_id % N.  If we rebalance we change N and
redistribute, but that probably will take too much time.  That makes us
move more towards a staggered, age-based approach where the most recent
docs filter down to permanent indexes based upon time.

Another thought we've had recently is to have many many many physical
shards, on the indexing writer side, but then merge groups of them into
logical shards which are snapshotted to reader solrs' on a frequent basis.
I haven't done any testing along these lines, but logically it seems like an
idea worth pursuing.

enjoy,

-jeremy

On Fri, Jun 06, 2008 at 03:14:10PM +0200, Marcus Herou wrote:

Cool sharding technique.

We as well are thinking of how to move docs from one index to another 
because we need to re-balance the docs when we add new nodes to the
cluster.
We do only store ids in the index, otherwise we could have moved stuff 
around with IndexReader.document(x) or so. Luke 
(http://www.getopt.org/luke/) is able to reconstruct the indexed Document
data so it should be doable.
However I'm thinking of actually just deleting the docs from the old 
index and adding new Documents to the new node. It would be cool not to 
waste cpu cycles by reindexing already indexed stuff but...


And we as well will have data amounts in the range you are talking 
about. We perhaps could share ideas ?


How do you plan to store where each document is located? I mean you 
probably need to store info about the Document and its location 
somewhere, perhaps in a clustered DB? We will probably go for HBase for
this.
I think the number of documents is less important than the actual data 
size (just speculating). We currently search 10M (will get much much 
larger) indexed blog entries on one machine where the JVM has 1G heap, 
the index size is 3G and response times are still quite fast. This is 
a readonly node though and is updated every morning with a freshly 
optimized index. Someone told me that you probably need twice the RAM 
if you plan to both index and search at the same time. If I were you I 
would just test to index X entries of your data and then start to 
search in the index with lower JVM settings each round and when 
response times get too slow or you hit OOE, then you get a rough estimate
of the bare minimum X RAM needed for Y entries.
I think we 

Re: scaling / sharding questions

2008-06-18 Thread Yonik Seeley
On Wed, Jun 18, 2008 at 5:53 PM, Phillip Farber [EMAIL PROTECTED] wrote:
 Does this mean that the Lucene scoring algorithm is computed without the idf
 factor, i.e. we just get term frequency scoring?

No, it means that the idf calculation is done locally on a single shard.
With a big index that is randomly mixed, this should not have a
practical impact.

 2) Doesn't support consistency between stages, e.g. a shard index can be
 changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS

 What does this mean or where can I find out what it means?

STAGE_EXECUTE_QUERY finds the ids of matching documents.
STAGE_GET_FIELDS retrieves the fields of matching documents.

A change to a document could possibly happen in between, and one would
end up retrieving a document that no longer matched the query.  In
practice, this is rarely an issue.

-Yonik


Re: Re[2]: Feature idea - delete and commit from web interface ?

2008-06-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
The implementation may provide a form where the user can
type in a doc id to delete, or a Lucene query.

If it is a POST, so be it.
But let us have the functionality.

--Noble

On Thu, Jun 19, 2008 at 2:40 AM, Craig McClanahan [EMAIL PROTECTED] wrote:
 On Wed, Jun 18, 2008 at 1:55 PM, JLIST [EMAIL PROTECTED] wrote:

 Sounds like web designer's fault. No permission check and no
 confirmation for deletion?


 Nope ... application designer's fault for misusing the web.  Allowing
 deletes on a GET violates HTTP/1.1 requirements (not just RESTful
 ones) that GET requests not have side effects, so an app that works
 that way is going to mess up when HTTP caching is in use ... as lots
 of people found to their chagrin when they installed Google Desktop's
 caching capabilities, and the cache played by the standard HTTP rules
 (GETs are supposed to be idempotent, having no side effects, so it's
 just fine to issue the same GET as many times as desired).

 If you want an easy way to do deletes from a browser, just set up a
 little form that does a POST and includes the id of the document you
 want to delete.  Then you're playing by the rules, and won't make a
 fool of yourself when crawlers or caches interact with your
 application.

 Craig McClanahan

 Never, never delete with a GET. The Ultraseek spider deleted 20K
 documents on an intranet once because they gave it admin perms and
 it followed the "delete this page" link on every page.







-- 
--Noble Paul


Re: Slight issue with classloading and DataImportHandler

2008-06-18 Thread Brendan Grainger

Hi,

I am actually providing the fully qualified classname in the  
configuration and I was still getting a ClassNotFoundException. If you  
look at the code in SolrResourceLoader they actually explicitly add  
the jars in solr-home/lib to the classloader:


static ClassLoader createClassLoader(File f, ClassLoader loader) {
    if( loader == null ) {
      loader = Thread.currentThread().getContextClassLoader();
    }
    if (f.canRead() && f.isDirectory()) {
      File[] jarFiles = f.listFiles();
      URL[] jars = new URL[jarFiles.length];
      try {
        for (int j = 0; j < jarFiles.length; j++) {
          jars[j] = jarFiles[j].toURI().toURL();
          log.info("Adding '" + jars[j].toString() + "' to Solr classloader");
        }
        return URLClassLoader.newInstance(jars, loader);
      } catch (MalformedURLException e) {
        SolrException.log(log, "Can't construct solr lib class loader", e);
      }
    }
    log.info("Reusing parent classloader");
    return loader;
  }


This seems to me to be why my class is now found when I include my
utilities jar in solr-home/lib.


Thanks
Brendan

On Jun 18, 2008, at 11:49 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



hi,
DIH does not load classes using the SolrResourceLoader. It tries
Class.forName() with the name you provide; if that fails, it prepends
"org.apache.solr.handler.dataimport." and retries.

This is true not just for Transformers but also for EntityProcessor,
DataSource, and Evaluator

The reason for doing so is that we do not use any of the 'solr.'
packages in DIH. All our implementations fall into the default package
and we can directly use them w/o the package name.

So , if you are writing your own implementations use the default
package or provide the fully qualified class name.

--Noble

On Thu, Jun 19, 2008 at 8:09 AM, Jon Baer [EMAIL PROTECTED] wrote:
Thanks.  Yeah took me a while to figure out I needed to do  
something like
transformer=com.mycompany.solr.MyTransformer on the entity before  
it would

work ...

- Jon

On Jun 18, 2008, at 1:51 PM, Brendan Grainger wrote:


Hi,

I set up the new DataImportHandler last night to replace some custom
import code I'd written, and so far I'm loving it, thank you.

I had one issue you might want to know about. I have some Solr
extensions I've written and packaged in a jar which I place in:

solr-home/lib

as per:


http://wiki.apache.org/solr/SolrPlugins#head-59e2685df65335e82f8936ed55d260842dc7a4dc

This works well for my handlers, but a custom Transformer I wrote and
packaged the same way was throwing a ClassNotFoundException. I tracked it
down to the DocBuilder.loadClass method, which was just doing a
Class.forName. Anyway, I fixed it for the moment by (probably doing
something stupid and) creating a SolrResourceLoader (which I imagine could
be an instance variable, but at 3am I just wanted to get it working).
Anyway, this fixes the problem:
fixes the problem:

@SuppressWarnings("unchecked")
static Class loadClass(String name) throws ClassNotFoundException {
  SolrResourceLoader loader = new SolrResourceLoader( null );
  return loader.findClass(name);
  // return Class.forName(name);
}

Brendan







--
--Noble Paul




Re: Slight issue with classloading and DataImportHandler

2008-06-18 Thread Noble Paul നോബിള്‍ नोब्ळ्
Aah! We always assumed that people put their custom jars in the
WEB-INF/lib folder of the Solr webapp and hence that they are automatically
on the classpath. We shall make the necessary changes.
--Noble

On Thu, Jun 19, 2008 at 10:06 AM, Brendan Grainger
[EMAIL PROTECTED] wrote:
 Hi,

 I am actually providing the fully qualified classname in the configuration
 and I was still getting a ClassNotFoundException. If you look at the code in
 SolrResourceLoader they actually explicitly add the jars in solr-home/lib to
 the classloader:

 static ClassLoader createClassLoader(File f, ClassLoader loader) {
    if( loader == null ) {
      loader = Thread.currentThread().getContextClassLoader();
    }
    if (f.canRead() && f.isDirectory()) {
      File[] jarFiles = f.listFiles();
      URL[] jars = new URL[jarFiles.length];
      try {
        for (int j = 0; j < jarFiles.length; j++) {
          jars[j] = jarFiles[j].toURI().toURL();
          log.info("Adding '" + jars[j].toString() + "' to Solr classloader");
        }
        return URLClassLoader.newInstance(jars, loader);
      } catch (MalformedURLException e) {
        SolrException.log(log, "Can't construct solr lib class loader", e);
      }
    }
    log.info("Reusing parent classloader");
    return loader;
  }


 This seems to me to be why my class is now found when I include my
 utilities jar in solr-home/lib.

 Thanks
 Brendan

 On Jun 18, 2008, at 11:49 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 hi,
 DIH does not load classes using the SolrResourceLoader. It tries
 Class.forName() with the name you provide; if that fails, it prepends
 "org.apache.solr.handler.dataimport." and retries.

 This is true not just for Transformers but also for EntityProcessor,
 DataSource, and Evaluator

 The reason for doing so is that we do not use any of the 'solr.'
 packages in DIH. All our implementations fall into the default package
 and we can directly use them w/o the package name.

 So , if you are writing your own implementations use the default
 package or provide the fully qualified class name.

 --Noble

 On Thu, Jun 19, 2008 at 8:09 AM, Jon Baer [EMAIL PROTECTED] wrote:

 Thanks.  Yeah took me a while to figure out I needed to do something like
 transformer=com.mycompany.solr.MyTransformer on the entity before it
 would
 work ...

 - Jon

 On Jun 18, 2008, at 1:51 PM, Brendan Grainger wrote:

 Hi,

 I set up the new DataimportHandler last night to replace some custom
 import code I'd written and so far I'm loving it thank you.

 I had one issue you might want to know about it. I have some solr
 extensions I've written and packaged in a jar which I place in:

 solr-home/lib

 as per:



 http://wiki.apache.org/solr/SolrPlugins#head-59e2685df65335e82f8936ed55d260842dc7a4dc

 This works well for my handlers, but a custom Transformer I wrote and
 packaged the same way was throwing a ClassNotFoundException. I tracked it
 down to the DocBuilder.loadClass method, which was just doing a
 Class.forName. Anyway, I fixed it for the moment by (probably doing
 something stupid and) creating a SolrResourceLoader (which I imagine could
 be an instance variable, but at 3am I just wanted to get it working).
 Anyway, this fixes the problem:

 @SuppressWarnings("unchecked")
 static Class loadClass(String name) throws ClassNotFoundException {
  SolrResourceLoader loader = new SolrResourceLoader( null );
  return loader.findClass(name);
  // return Class.forName(name);
 }

 Brendan





 --
 --Noble Paul





-- 
--Noble Paul


Re: Slight issue with classloading and DataImportHandler

2008-06-18 Thread Chris Hostetter

: Aah! We always assumed that people put their custom jars in the
: WEB-INF/lib folder of the Solr webapp and hence that they are automatically
: on the classpath. We shall make the necessary changes.

It would be better to use the classloader from the SolrResourceLoader ... 
that should be safe for anyone with any setup. 

 DIH does not load classes using the SolrResourceLoader. It tries
 Class.forName() with the name you provide; if that fails, it prepends
 "org.apache.solr.handler.dataimport." and retries.
...
 The reason for doing so is that we do not use any of the 'solr.'
 packages in DIH. All our implementations fall into the default package
 and we can directly use them w/o the package name.

FWIW: there isn't really a "solr." package ... "solr." can be used as 
a short-form alias for the likely package when Solr resolves classes, 
where the likely package varies by context and there can be multiple 
options that it tries in order.

DIH could do the same thing, letting the short form "solr." signify that
Transformers, Evaluators, etc are in the o.a.s.handler.dataimport package.

the advantage of this over what it sounds like DIH currently does is that 
if there is an o.a.s.handler.dataimport.WizWatTransformer but someone 
wants to write their own (package-less) WizWatTransformer they can and 
refer to it simply as WizWatTransformer (whereas to use the one that 
ships with DIH they would specify solr.WizWatTransformer).  There's no 
ambiguity as to which one someone means unless they create a package 
called solr ... but then they'd just be looking for trouble :)



-Hoss



Re: Seeking suggestions - keyword related site promotion

2008-06-18 Thread Stephen Weiss
Is there a fixed set of keywords?  If so, I suppose you could simply  
index these keywords into a field for each site (either through some  
kind of automatic parser or manually - from personal experience I  
would recommend manually unless you have tens of thousands of these  
things), and then search that field with each word in the query (with  
OR).  Any site that had one of these keywords would match it if it  
were used in the query...
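A rough illustration of that idea (the field name, keyword, boost value, host,
and port are invented for the example; the real values depend on your schema):

  curl -G 'http://localhost:8983/solr/select' \
       --data-urlencode 'q=content:camera OR site_keywords:camera^5'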


If there is no list here and you're just indexing all the content of  
all these sites... isn't that what Nutch is designed for?


--
Steve

On Jun 18, 2008, at 11:05 PM, JLIST wrote:


Hi all,

This is what I'm trying to do: since some sources (say,
some web sites) are more authoritative than other sources
on certain subjects, I'd like to promote those sites when
the query contains certain keywords. I'm not sure what
is the best way to implement this. I suppose I can index
the keywords in a field for all pages from that site but
this isn't very efficient, and any changes in the keyword
list would require re-indexing all pages of that site.
I wonder if there is a more efficient way that can dynamically
promote sites from a domain that is considered more related
to the queries. Any suggestion is welcome.

Thanks,
Jack