Re: schema.xml in other than conf folder

2011-01-10 Thread Shanmugavel SRD

Chris,
   Our Solr conf folder is on a read-only file system, but the data directory
(index) is not. Per our production environment guidelines, the configuration
files must be on a read-only file system.
Thanks,
SRD
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/schema-xml-in-other-than-conf-folder-tp2206587p2225625.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tuning StatsComponent

2011-01-10 Thread stockii

Hello.

I'm using the StatsComponent to get the sum of amounts, but the Solr
StatsComponent is very slow on a huge index of 30 million documents. How can
I tune the StatsComponent?

The problem is that I have 5 currencies and I need to send a new request for
each currency. That makes the Solr search sometimes very slow. =(

Any ideas?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2225809.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH load only selected documents with XPathEntityProcessor

2011-01-10 Thread Bernd Fehling
Hi Gora,

thanks a lot, very nice solution, it works perfectly.
I will dig more into the ScriptTransformer; it seems to be very powerful.

Regards,
Bernd

On 08.01.2011 14:38, Gora Mohanty wrote:
 On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Hello list,

 is it possible to load only selected documents with XPathEntityProcessor?
 While loading docs I want to drop/skip/ignore documents with missing URL.

 Example:
 <documents>
   <document>
     <title>first title</title>
     <id>identifier_01</id>
     <link>http://www.foo.com/path/bar.html</link>
   </document>
   <document>
     <title>second title</title>
     <id>identifier_02</id>
     <link></link>
   </document>
 </documents>

 The first document should be loaded; the second document should be ignored
 because it has an empty link (this should also work for a missing link field).
 [...]
 
 You can use a ScriptTransformer, along with $skipRow/$skipDoc.
 E.g., something like this for your data import configuration file:
 
 <dataConfig>
   <script><![CDATA[
     function skipRow(row) {
       var link = row.get('link');
       if (link == null || link == '') {
         row.put('$skipRow', 'true');
       }
       return row;
     }
   ]]></script>
   <dataSource type="FileDataSource" />
   <document>
     <entity name="f" processor="FileListEntityProcessor"
             baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
             recursive="true" rootEntity="false" dataSource="null">
       <entity name="top" processor="XPathEntityProcessor"
               forEach="/documents/document" url="${f.fileAbsolutePath}"
               transformer="script:skipRow">
         <field column="link" xpath="/documents/document/link" />
         <field column="title" xpath="/documents/document/title" />
         <field column="id" xpath="/documents/document/id" />
       </entity>
     </entity>
   </document>
 </dataConfig>
 
 Regards,
 Gora
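
For reference, the same skip logic could also be packaged as a compiled DIH
transformer instead of an inline script. A minimal sketch, assuming
Solr 1.4-era DataImportHandler (the package and class name here are made up):

package com.example.dih;

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Drops any row whose 'link' column is missing or empty by setting $skipRow.
public class SkipEmptyLinkTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object link = row.get("link");
        if (link == null || link.toString().trim().length() == 0) {
            row.put("$skipRow", Boolean.TRUE);
        }
        return row;
    }
}

The entity would then reference it with
transformer="com.example.dih.SkipEmptyLinkTransformer", with the compiled jar
on Solr's classpath.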


Re: Tuning StatsComponent

2011-01-10 Thread Gora Mohanty
On Mon, Jan 10, 2011 at 2:28 PM, stockii st...@shopgate.com wrote:

 Hello.

 I'm using the StatsComponent to get the sum of amounts, but the Solr
 StatsComponent is very slow on a huge index of 30 million documents. How
 can I tune the StatsComponent?

Not sure about this problem.

 The problem is that I have 5 currencies and I need to send a new request
 for each currency. That makes the Solr search sometimes very slow. =(
[...]

I guess that you mean the search from the front-end is slow.

It is difficult to make a guess without details of your index,
and of your queries, but one thing that immediately jumps
out is that you could shard the Solr index by currency, and
have your front-end direct queries for each currency to the
appropriate Solr server.

Please do share a description of what all you are indexing,
how large your index is, and what kind of queries you are
running. I take it that you have already taken a look at
http://wiki.apache.org/solr/SolrPerformanceFactors
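
As a rough illustration of that setup, a minimal SolrJ sketch, assuming one
core (or shard) per currency (the URL, core name, and field names below are
made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CurrencyStats {
    public static void main(String[] args) throws Exception {
        // The front-end picks the server for the requested currency, so a
        // stats request only touches that currency's (smaller) index.
        SolrServer eurCore =
            new CommonsHttpSolrServer("http://localhost:8983/solr/sales_eur");
        SolrQuery q = new SolrQuery("product:bla");
        q.setRows(0);               // only the stats section is needed
        q.set("stats", true);
        q.set("stats.field", "amount");
        QueryResponse rsp = eurCore.query(q);
        System.out.println(rsp.getResponse().get("stats"));
    }
}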

Regards,
Gora


Re: Tuning StatsComponent

2011-01-10 Thread stockii

Oh, thanks for your fast reply.

I will try the suggestions.

In the meanwhile, more information about my index:

I have 2 Solr instances with 6 cores. Each core has its own index, and one
core's index is about 30 million documents.

Each document has (stats-relevant):
amount
amount_euro
currency_id

user_costs
user_costs_euro
currency_id_user_costs

So for each currency I send a request to the StatsComponent like this:

stats=true&json.nl=map&wt=javabin&rows=0&version=2&fl=uniquekey,score&start=0&stats.field=amount&q=QUERY&isShard=true&fq=product:bla+currency_id:EUR&fsv=true

The stats.field and filter change for each of my 5 currencies, so for ONE
search request I need to send 10 requests to get the sums, and that is
veeery slow =(

I am searching over two shards, sometimes more than two.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2226258.html
Sent from the Solr - User mailing list archive at Nabble.com.


segment gets corrupted (after background merge ?)

2011-01-10 Thread Stéphane Delprat

Hi,

We are using :
Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

# java -version
java version 1.6.0_20
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

We want to index 4M docs in one core (and once that works fine we will add 
other cores with 2M docs on the same server) (1 doc ~= 1 kB).


We use SOLR replication every 5 minutes to update the slave server 
(queries are executed on the slave only)


Documents change very quickly; during a normal day we will have approx:

* 200 000 updated docs
* 1000 new docs
* 200 deleted docs


I attached the last good checkIndex : solr20110107.txt
And the corrupted one : solr20110110.txt


This is not the first time a segment has gotten corrupted on this server; 
that's why I run frequent checkIndex. (But as you can see, the first 
segment has 1,800,000 docs and it checks out fine!)
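
(In case it is useful to others: checkIndex above refers to Lucene's
CheckIndex tool, which we run roughly like this -- the classpath and index
path are of course environment-specific:

java -ea:org.apache.lucene... -cp lucene-core-2.9.3.jar \
    org.apache.lucene.index.CheckIndex /solr/multicore/core1/data/index

Adding -fix would drop unreadable segments and their documents, so we only
run it read-only.)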



I can't find any SEVERE/FATAL messages or exceptions in the Solr logs.


I also attached my schema.xml and solrconfig.xml


Is there something wrong with what we are doing? Do you need other info?


Thanks,

Opening index @ /solr/multicore/core1/data/index/

Segments file=segments_i7t numSegments=9 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 9: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, 
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun 
Microsystems Inc.}
has deletions [delFileName=_ncc_13m.del]
test: open reader.OK [105940 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 
248678841 tokens]
test: stored fields...OK [51585300 total field count; avg 29.719 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

  2 of 9: name=_nqt docCount=431889
compound=false
hasProx=true
numFiles=9
size (MB)=1,671.375
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, 
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun 
Microsystems Inc.}
has deletions [delFileName=_nqt_gt.del]
test: open reader.OK [10736 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 
67787288 tokens]
test: stored fields...OK [12562924 total field count; avg 29.83 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

  3 of 9: name=_ol7 docCount=913886
compound=false
hasProx=true
numFiles=9
size (MB)=3,567.63
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, 
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun 
Microsystems Inc.}
has deletions [delFileName=_ol7_3.del]
test: open reader.OK [11 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs pairs; 
152947518 tokens]
test: stored fields...OK [29587930 total field count; avg 32.376 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

  4 of 9: name=_ol2 docCount=1011
compound=false
hasProx=true
numFiles=8
size (MB)=6.959
diagnostics = {os.version=2.6.26-2-amd64, os=Linux, lucene.version=2.9.3 
951790 - 2010-06-06 01:30:55, source=flush, os.arch=amd64, 
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields..OK [38 fields]
test: field norms.OK [38 fields]
test: terms, freq, prox...OK [54205 terms; 220705 terms/docs pairs; 389336 
tokens]
test: stored fields...OK [27402 total field count; avg 27.104 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

  5 of 9: name=_ol3 docCount=1000
compound=false
hasProx=true
numFiles=8
size (MB)=6.944
diagnostics = {os.version=2.6.26-2-amd64, os=Linux, lucene.version=2.9.3 
951790 - 2010-06-06 01:30:55, source=flush, os.arch=amd64, 
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields..OK [33 fields]
test: 

Re: Internal Server Error when indexing a pdf file

2011-01-10 Thread Grijesh.singh

Check your libraries for the Tika-related jar files. The Tika jars must be
on Solr's classpath.

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Internal-Server-Error-when-indexing-a-pdf-file-tp2214617p2226374.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tuning StatsComponent

2011-01-10 Thread stockii

When I start the StatsComponent I get this message:

INFO: UnInverted multi-valued field
{field=product,memSize=4336,tindexSize=46,time=0,phase1=0,nTerms=1,bigTerms=1,termInstances=0,uses=0}

What does this mean?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2226555.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Creating Solr index from map/reduce

2011-01-10 Thread Joan
Thanks Alexander

2011/1/3 Alexander Kanarsky kanarsky2...@gmail.com

 Joan,

 the current version of the patch assumes the location and names of the
 schema and solrconfig files ($SOLR_HOME/conf); this is hardcoded (see the
 SolrRecordWriter constructor). Multi-core configuration with separate
 configuration locations via solr.xml is not supported for now. As a
 workaround, you could link or copy the schema and solrconfig files to
 match the hardcoded assumption.
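
 For example (a sketch; the paths are hypothetical):

 ln -s $SOLR_HOME/schema/schema-xx.xml   $SOLR_HOME/conf/schema.xml
 ln -s $SOLR_HOME/conf/solrconfig_xx.xml $SOLR_HOME/conf/solrconfig.xml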

 Thanks,
 -Alexander

 On Wed, Dec 29, 2010 at 2:50 AM, Joan joan.monp...@gmail.com wrote:
  If I rename my custom schema file (schema-xx.xml), which is located in
  SOLR_HOME/schema/, and then copy it to the conf folder and finally try to
  run CSVIndexer, it shows me another error:

  Caused by: java.lang.RuntimeException: Can't find resource 'solrconfig.xml'
  in classpath or
  '/tmp/hadoop-root/mapred/local/taskTracker/archive/localhost/tmp/b7611d6d-9cc7-4237-a240-96ecaab9f21a.solr.zip/conf/'

  I don't understand, because I have a Solr configuration file (solr.xml)
  where I define all cores:

  <core name="core_name"
        instanceDir="solr-data/index"
        config="solr/conf/solrconfig_xx.xml"
        schema="solr/schema/schema_xx.xml"
        properties="solr/conf/solrcore.properties" />

  But I think that when I run CSVIndexer it doesn't know that solr.xml
  exists, and it tries to look for schema.xml and solrconfig.xml in the
  default folder (conf).
 
 
 
  2010/12/29 Joan joan.monp...@gmail.com
 
  Hi,
 
  I'm trying to generate a Solr index from Hadoop (map/reduce), so I'm
  using the patch from SOLR-1301
  (https://issues.apache.org/jira/browse/SOLR-1301); however, I can't get
  it to work.

  I'm running CSVIndexer with some arguments (the directory for the Solr
  index, -solr the Solr home, and the input, in this case a CSV):

  HADOOP_INSTALL/bin/hadoop jar my.jar CSVIndexer INDEX_FOLDER -solr
  /SOLR_HOME CSV FILE PATH

  Before running CSVIndexer, I put the CSV file into HDFS.

  My Solr home doesn't have the default configuration layout; it is divided
  into multiple folders:

  /conf
  /schema

  I have custom Solr configuration files, so CSVIndexer can't find
  schema.xml. Obviously it won't be able to find it, because that file
  doesn't exist; in my case it is named schema-xx.xml, and CSVIndexer looks
  for it inside the conf folder and doesn't know that the schema folder
  exists. And I have a Solr configuration file (solr.xml) where I configure
  multiple cores.

  I tried to modify Solr's paths but it still doesn't work.

  I understand that CSVIndexer copies the specified Solr home into HDFS
  (/tmp/hadoop-user/mapred/local/taskTracker/archive/...) and when it tries
  to find schema.xml, the file doesn't exist:
 
  10/12/29 10:18:11 INFO mapred.JobClient: Task Id :
  attempt_201012291016_0002_r_00_1, Status : FAILED
  java.lang.IllegalStateException: Failed to initialize record writer for
  my.jar, attempt_201012291016_0002_r_00_1
  at
 
  org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:253)
  at
 
 org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:152)
  at
  org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)
  Caused by: java.io.FileNotFoundException: Source
 
 '/tmp/hadoop-guest/mapred/local/taskTracker/archive/localhost/tmp/e8be5bb1-e910-47a1-b5a7-1352dfec2b1f.solr.zip/conf/schema.xml'
  does not exist
  at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:636)
  at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:606)
  at
 
  org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:222)
  ... 4 more
 



Solr trunk for production

2011-01-10 Thread Otis Gospodnetic
Hello,

Are people using Solr trunk in serious production environments?  I suspect the 
answer is yes, just want to see if there are any gotchas/warnings.

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Replication: abort-fetch and restarting

2011-01-10 Thread Markus Jelsma
Any thoughts on this one? Should i add a ticket?

On Tuesday 04 January 2011 20:08:40 Markus Jelsma wrote:
 Hi,
 
 It seems abort-fetch nicely removes the index directory that I'm
 replicating to, which is fine. Restarting, however, does not trigger the
 same cleanup that the abort-fetch command does. At least, that's what my
 tests seem to tell me.
 
 Shouldn't a restart of Solr nicely clean up the mess before exiting? And
 shouldn't starting Solr also look for mess left behind by a possible sudden
 shutdown of the server, in which case the mess obviously could not have
 been cleaned?
 
 If I now stop, clean and start my slave, it will attempt to download an
 existing index. If I abort-fetch, it will clean up the mess and (due to
 low-interval polling) make another attempt. If I restart instead of
 abort-fetch, however, the old temporary directory stays and needs to be
 deleted manually.
 
 Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


How to let crawlers in, but prevent their damage?

2011-01-10 Thread Otis Gospodnetic
Hi,

How do people with public search services deal with bots/crawlers?
And I don't mean to ask how one bans them (robots.txt) or slow them down (Delay 
stuff in robots.txt) or prevent them from digging too deep in search results...

What I mean is that when you have publicly exposed search that bots crawl, they 
issue all kinds of crazy queries that result in errors, that add noise to 
Solr 
caches, increase Solr cache evictions, etc. etc.

Are there some known recipes for dealing with them, minimizing their negative 
side-effects, while still letting them crawl you?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: DIH - Closing ResultSet in JdbcDataSource

2011-01-10 Thread Shane Perry
Gora,

Thanks for the response.  After taking another look, you are correct about
the hasnext() closing the ResultSet object (in 1.4.1 as well as 1.4.0).  I
didn't recognize the case difference between the two function calls, so I
missed it.  I'll keep looking into the original issue and reply if I find a
cause/solution.

Shane

On Sat, Jan 8, 2011 at 4:04 AM, Gora Mohanty g...@mimirtech.com wrote:

 On Sat, Jan 8, 2011 at 1:10 AM, Shane Perry thry...@gmail.com wrote:
  Hi,
 
  I am in the process of migrating our system from Postgres 8.4 to Solr
  1.4.1. Our system is fairly complex, and as a result I have had to define
  19 base entities in the data-config.xml definition file. Each of these
  entities executes 5 queries. When doing a full-import, as each entity
  completes, the server hosting Postgres shows 5 "idle in transaction"
  connections for the entity.

  In digging through the code, I found that the JdbcDataSource wraps the
  ResultSet object in a custom ResultSetIterator object, leaving the
  ResultSet open. Walking through the code I can't find a close() call
  anywhere on the ResultSet. I believe this results in the "idle in
  transaction" processes.
 [...]

 Have not examined the "idle in transaction" issue that you mention, but
 the ResultSet object in a ResultSetIterator is closed in the private
 hasnext() method, when there are no more results, or if there is an
 exception. hasnext() is called by the public hasNext() method that should
 be used when iterating over the results, so I see no issue there.
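
 To illustrate the pattern (a simplified sketch, not the actual Solr
 source):

 import java.sql.ResultSet;
 import java.sql.SQLException;

 // Close-on-exhaustion: the ResultSet is released as soon as iteration
 // runs out of rows or hits an error.
 class ResultSetIteratorSketch {
     private final ResultSet resultSet;

     ResultSetIteratorSketch(ResultSet rs) { this.resultSet = rs; }

     private boolean hasnext() {
         try {
             if (resultSet.next()) return true;
         } catch (SQLException e) {
             // fall through and close on error
         }
         close();
         return false;
     }

     private void close() {
         try { resultSet.close(); } catch (SQLException ignored) {}
     }
 }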

 Regards,
 Gora

 P.S. This is from Solr 1.4.0 code, but I would not think that
this part of the code would have changed.



strange SOLR behavior with required field attribute

2011-01-10 Thread Bernd Fehling
Dear list,

while trying different options with DIH and SciptTransformer I also
tried using the required=true option for a field.

I have 3 records:
<documents>
  <document>
    <title>first title</title>
    <id>identifier_01</id>
    <link>http://www.foo.com/path/bar.html</link>
  </document>
  <document>
    <title>second title</title>
    <id>identifier_02</id>
    <link></link>
  </document>
  <document>
    <title>thierd title</title>
    <id>identifier_03</id>
  </document>
</documents>

schema.xml snippet:
<field name="title" type="string" indexed="true" stored="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="link" type="string" indexed="true" stored="true" required="true" />

After loading I have 2 records in the index.

<str name="title">first title</str>
<str name="id">identifier_01</str>
<str name="link">http://www.foo.com/path/bar.html</str>

<str name="title">second title</str>
<str name="id">identifier_02</str>
<str name="link"/>

Sure, I get a SolrException in the logs saying "missing required field: link",
but this is for the third record, whereas the second record gets loaded even
though link is empty.

So I guess this is a feature of Solr?

And the required attribute means the presence of the tag and not
the presence of content for the tag, right?

Regards
Bernd



Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Ken Krugler

Hi Otis,

From what I learned at Krugle, the approach that worked for us was:

1. Block all bots on the search page.

2. Expose the target content via statically linked pages that are
separately generated from the same backing store, and optimized for target
search terms (extracted from your own search logs).


-- Ken

On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote:


[...]



--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread lee carroll
Sorry, not an answer, but a +1 vote for finding out best practice for this.

Related to it is DoS attacks. We have rewrite rules in between the proxy
server and Solr which attempt to filter out undesirable stuff, but would it
be better to have a query app doing this?

Any standard rewrite rules which drop invalid or potentially malicious
queries would be very nice :-)
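
For illustration, the kind of rule we have in mind looks something like this
(a hypothetical Apache mod_rewrite sketch; the limits are arbitrary):

RewriteEngine On
# refuse very long query strings and local-params syntax from outside
RewriteCond %{QUERY_STRING} ^.{512,}$ [OR]
RewriteCond %{QUERY_STRING} \{! [NC]
RewriteRule ^/solr/ - [F]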

lee c

On 10 January 2011 13:41, Otis Gospodnetic otis_gospodne...@yahoo.comwrote:

 [...]




Re: Multivalued fields and facet performance

2011-01-10 Thread Otis Gospodnetic
Hi Howard,

This is normal.  Your first query is reading a bunch of index data from
disk, and your RAM is then caching it.  If your first query involves
sorting, some more data for the FieldCache is read and stored.  If there
are multiple sort fields, one such structure for each.  If facets are
involved, more of that stuff.  If you are optimizing your index, you are
likely forcing more disk IO.
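
(Since warming comes up below: a typical firstSearcher/newSearcher warming
entry in solrconfig.xml looks like this -- the query and facet field are
only examples:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">myfacetfield</str>
    </lst>
  </arr>
</listener>

The same block with event="newSearcher" warms each new searcher after a
commit.)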

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Howard Lee how...@workdigital.co.uk
 To: solr-user@lucene.apache.org
 Sent: Mon, January 10, 2011 8:59:03 AM
 Subject: Multivalued fields and facet performance
 
 Hi,
 
 I'd appreciate some explanation of what may be going on in the following
 scenario using multivalued fields and facets.
 
 Solr version: 1.5
 
 Our index contains 35 million docs, and our search is using 2 multivalued
 fields as facets. There are approx 5 million different values in one field
 and 5000 in the other. We are seeing the following, and I'm curious as to
 what is actually happening in the background.
 
 The first search can take up to 5 minutes; all subsequent queries of any q
 return in under a second. This is fine unless you are the first search or
 a new searcher.
 
 I plan on adding a firstSearcher and newSearcher in the config to avoid
 long delays every time the index is updated (once a day), but I have
 concerns about the length of the delay in launching a new searcher, and
 whether this is causing too much overhead.
 
 Can someone explain to me what processes are going on in the background
 that cause this behaviour, so I can understand the implications or make
 some adjustments in the config to compensate?
 
 thanx
 
 Howard
 


Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Otis Gospodnetic
Hi Ken, thanks Ken. :)

The problem with this approach is that it exposes very limited content to 
bots/web search engines.

Take http://search-lucene.com/ for example.  People enter all kinds of queries 
in web search engines and end up on that site.  People who visit the site 
directly don't necessarily search for those same things.  Plus, new terms are 
entered to get to search-lucene.com every day, so keeping up with that would 
mean constantly generating more and more of those static pages.  Basically, the 
tail is super long.  On top of that, new content is constantly being generated, 
so one would have to also constantly both add and update those static pages.

I have a feeling there is not a good solution for this because on one hand 
people don't like the negative bot side effect, on the other hand people want 
as much of their sites indexed by the big guys.  The only half-solution that 
comes to mind involves looking at who's actually crawling you and who's 
bringing you visitors, then blocking those with a bad ratio of those two - 
bots that crawl a lot but don't bring a lot of value.

Any other ideas?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Ken Krugler kkrugler_li...@transpac.com
 To: solr-user@lucene.apache.org
 Sent: Mon, January 10, 2011 9:43:49 AM
 Subject: Re: How to let crawlers in, but prevent their damage?
 
 [...]
 
 
 
 
 
 


Re: strange SOLR behavior with required field attribute

2011-01-10 Thread Koji Sekiguchi

(11/01/10 23:26), Bernd Fehling wrote:

[...]


Bernd,

This seems like the same problem as SOLR-1973, which I've recently fixed
in trunk and 3x, but I'm not sure. Which version are you using?
Can you try trunk or 3x? If you still get the same error with trunk/3x,
please open a JIRA issue.

Koji
--
http://www.rondhuit.com/en/


Re: Multivalued fields and facet performance

2011-01-10 Thread Howard Lee
Otis,
The reason I ask is that I run a number of sites on Solr, some with 10
million+ docs, faceting on similar types of data, and have not seen anywhere
near this length of initial delay. The main difference is that those sites
facet on single-value fields rather than multivalued, and that this site is
searching 3 times the volume of data. Would switching to single-valued
(I'd rather not) make much of a difference?

I've also noticed that multivalued fields aren't populating the Lucene
FieldCache. Is this the correct behaviour?

Regards

Howard

On 10 January 2011 14:55, Otis Gospodnetic otis_gospodne...@yahoo.comwrote:

 [...]




-- 
WORKDIGITAL LTD
workdigital.co.uk
32-34 Broadwick Street
W1A 2HG London, UK

Howard Lee
CEO

M  +44(0)7931 476 766
E  how...@workdigital.co.uk

workhound.co.uk - salarytrack.co.uk - twitterjobsearch.com -
dreamjobalert.co.uk - recruitmentadnetwork.com


Token Counter

2011-01-10 Thread supersoft

Hello,

I would like to know if there is a trivial procedure/tool for displaying the
number of appearances of each token from query results. 

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Token-Counter-tp2227795p2227795.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: strange SOLR behavior with required field attribute

2011-01-10 Thread Bernd Fehling
Hi Koji,

I'm using apache-solr-4.0-2010-11-24_09-25-17 from trunk.

A grep for SOLR-1973 in CHANGES.txt says that it should have been fixed.
Strange...

Regards,
Bernd



On 10.01.2011 16:14, Koji Sekiguchi wrote:
 [...]


Storing metadata from post parameters and XML

2011-01-10 Thread Walter Closenfleight
I'm very unclear on how to associate what I need to a Solr index entry.
Based on what I've read thus far, you can extract data from text files and
store that in a Solr document.

I have hundreds of thousands of documents in a database/svn type system.
When I index a file, it is likely going to be local to the filesystem, and I
know the location it will take on in the database. So, when I index, I want
to provide a path so that the document can be found when someone else does a
search.

123.xml may look like:

<mydoc>
<title>my title</title>
<para>Every foobar has its day</para>
<figure href="/abc/xxx.gif"><caption>My caption</caption></figure>
</mydoc>

and the proprietary location I want it to be associated with is:

/abc/def/ghi/123.xml

So, when a user does a search for "foobar", it returns some information
about 123.xml, but most importantly the location should be available.

I have yet to find (in the schema.xml or otherwise) where you can define
that path to store, and how you would pass along that parameter when
indexing the document.

Instead, from the examples I can find, including the book, you store fields
from your data into the index. In the book's examples (a music database),
searching for "Cherub Rock" returns a list of tracks with their duration,
track name, album name, and artist. In other words, the full-text data you
retrieve is the only information the search index has to offer.

Just for example, using the exampledocs post.jar, I'm envisioning something
like this:

java -jar post.jar 123.xml -dblocation /abc/def/ghi/123.xml -othermeta1
xxx -othermeta2 zzz

Then the Solr doc would look like:
<doc>
<field name="id">123</field>
<field name="dblocation">/abc/def/ghi/123.xml</field>
<field name="othermeta1">xxx</field>
<field name="othermeta2">zzz</field>
<field name="title">my title</field>
<field name="graphic">/abc/xxx.gif</field>
<field name="text">Every foobar has its day My caption</field>
</doc>

This way, when a user searches for "foobar", they get item 123 back, review
the search result, and if they decide that's the data they want, they can
use the dblocation field to retrieve the data for editing purposes (and then
re-index it following their edits).

I'm guessing I just haven't found the right terms to look into yet, as I'm
very new to this. Thanks for any direction you can provide. Also, if Solr
appears to be the wrong tool for what I need, let me know as well!

Thank you,
Walter


Re: Storing metadata from post parameters and XML

2011-01-10 Thread Stefan Matheis
Hey Walter,

what's against just putting your db-location in a 'string' field, and use it
like any other value?
There is no special field-type for something like a
path/directory/location-information, afaik.
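
For example, a plain string field in schema.xml (the field name is whatever
you choose):

<field name="dblocation" type="string" indexed="true" stored="true" />

and in the update XML you post, include it like any other field:

<field name="dblocation">/abc/def/ghi/123.xml</field>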

Regards
Stefan

On Mon, Jan 10, 2011 at 4:50 PM, Walter Closenfleight 
walter.p.closenflei...@gmail.com wrote:

 [...]



Re: Storing metadata from post parameters and XML

2011-01-10 Thread Walter Closenfleight
Stefan,



You're right. I was attempting to post some quick pseudo-code, but that
<doc/> is pretty misleading; they should have been str elements, like <str
name="dblocation">/abc/def/ghi/123.xml</str>, or something to that effect.



Thanks,

Walter


On Mon, Jan 10, 2011 at 10:08 AM, Stefan Matheis 
matheis.ste...@googlemail.com wrote:

 [...]



Help needed in handling plurals

2011-01-10 Thread taimurAQ

Hi,

I am currently facing the following problematic scenario:

At index time, I index a field with the value "Laptop".
At index time, I index another field with the value "Laptops".
At query time, I search for "Laptops".

What is happening right now is that I am only getting back "Laptops" in the
results, whereas I would like both "Laptop" and "Laptops" to be included. I
do not want to use the Porter stemmer due to its aggressive nature, and I
have tried to set up the Pling stemmer as a custom filter in my analyzer,
but to no avail.

Can anyone guide me as to:
1. Where to put the PlingStemmer.class file.
2. How to set up the custom filter in the schema.xml file.

Thanks in advance.

Regards,
Taimur

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-needed-in-handling-plurals-tp2228165p2228165.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Token Counter

2011-01-10 Thread Shawn Heisey

On 1/10/2011 8:38 AM, supersoft wrote:

Hello,

I would like to know if there is a trivial procedure/tool for displaying the
number of appearances of each token from query results.

Thanks


Unless I'm misunderstanding what you mean, this sounds exactly like facets.

http://wiki.apache.org/solr/SolrFacetingOverview

An example URL (rows=0 for less distraction):

http://HOST:8983/solr/CORE/select/?q=horse&rows=0&facet=true&facet.field=keywords

Am I misunderstanding your question?

Thanks,
Shawn



Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Dennis Gearon
I don't know about stopping the problems with the issues that you've raised.

But I do know that web sites that aren't idempotent with GET requests are in a 
hurt locker. That seems to be WAY too many of them.
This means: don't do anything with GET that changes the contents of your web 
site.

Regarding a more direct answer to your question, you'd probably have to have 
some sort of filtering applied. And anyway, crawlers only issue 'queries' based 
on the URLs found on the site, right? So are you going to have weird URLs 
embedded in your site?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Otis Gospodnetic otis_gospodne...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 5:41:17 AM
Subject: How to let crawlers in, but prevent their damage?

[...]


Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Ken Krugler


On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:


 Hi Ken, thanks Ken. :)

 The problem with this approach is that it exposes very limited content to
 bots/web search engines.

 Take http://search-lucene.com/ for example.  People enter all kinds of
 queries in web search engines and end up on that site.  People who visit
 the site directly don't necessarily search for those same things.  Plus,
 new terms are entered to get to search-lucene.com every day, so keeping up
 with that would mean constantly generating more and more of those static
 pages.  Basically, the tail is super long.

To clarify - the issue of using actual user search traffic is one of SEO,
not what content you expose.

If, for example, people commonly do a search for "java something" then
that's a hint that the URL to the static content, and the page title,
should have the language as part of it.

So you shouldn't be generating static pages based on search traffic.
Though you might want to decide what content to favor (see below) based on
popularity.

 On top of that, new content is constantly being generated, so one would
 have to also constantly both add and update those static pages.

Yes, but that's why you need to automate that content generation, and do it
on a regular (e.g. weekly) basis.

The big challenges we ran into were:

1. Dealing with badly behaved bots that would hammer the site.

We wound up putting this content on a separate system, so it wouldn't
impact users on the main system.

And generating a regular report by user agent & IP address, so that we
could block by robots.txt and IP when necessary.

2. Figuring out how to structure the static content so that it didn't look
like spam to Google/Yahoo/Bing.

You don't want to have too many links per page, or too much depth, but that
constrains how many pages you can reasonably expose.

We had project scores based on code, activity, usage - so we used that to
rank the content and focus on exposing early (low depth) the good stuff.
You could do the same based on popularity, from search logs.

Anyway, there's a lot to this topic, but it doesn't feel very Solr
specific. So apologies for reducing the signal-to-noise ratio with talk
about SEO :)

-- Ken

[...]








--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Dennis Gearon
- Original Message 

From: lee carroll lee.a.carr...@googlemail.com
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 6:48:12 AM
Subject: Re: How to let crawlers in, but prevent their damage?

Sorry not an answer but a +1 vote for finding out best practice for this.

Related to it is DOS attacks. We have rewrite rules  in between the proxy
server and solr which attempts to filter out undesriable stuff but would it
be better to have a query app doing this?

any standard rewrite rules which drop invalid or potentially malicious
queries would be very nice :-

What exactly are malicious queries (besides scraping)? What's the problem with 
invalid queries? Unless someone is doing a custom crawl/scrape of your site, 
how are they going to issue queries that aren't already on the site as URLs?

[...]





Re: PHP PECL solr API library

2011-01-10 Thread Dennis Gearon
Yeah, it doesn't look like an easy, CRUD-based interface.



- Original Message 
From: Lukas Kahwe Smith m...@pooteeweet.org
To: solr-user@lucene.apache.org
Sent: Sun, January 9, 2011 11:33:16 PM
Subject: Re: PHP PECL solr API library


On 10.01.2011, at 08:16, Dennis Gearon wrote:

 Anyone have any experience using this library?
 
 http://us3.php.net/solr
 

Yeah. it works quite well.
However, imho the API is a maze. Also, it's lacking critical stuff like
escaping, and nice-to-have stuff like Lucene query parsing/rewriting.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Dennis Gearon
Hmmm, so if someone says they have SEO skills on their resume, they COULD be 
talking about optimizing the SEARCH engine at some site, not just a web site to 
be crawled by search engines?




- Original Message 
From: Ken Krugler kkrugler_li...@transpac.com
To: solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 9:07:43 AM
Subject: Re: How to let crawlers in, but prevent their damage?


On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:

 Hi Ken, thanks Ken. :)
 
 The problem with this approach is that it exposes very limited content to
 bots/web search engines.
 
 Take http://search-lucene.com/ for example.  People enter all kinds of queries
 in web search engines and end up on that site.  People who visit the site
 directly don't necessarily search for those same things.  Plus, new terms are
 entered to get to search-lucene.com every day, so keeping up with that would
 mean constantly generating more and more of those static pages.  Basically, 
the
 tail is super long.

To clarify - the issue of using actual user search traffic is one of SEO, not 
what content you expose.

If, for example, people commonly do a search for java something then that's 
a hint that the URL to the static content, and the page title, should have the 
language as part of it.

So you shouldn't be generating static pages based on search traffic. Though you 
might want to decide what content to favor (see below) based on popularity.

 On top of that, new content is constantly being generated,
 so one would have to also constantly both add and update those static pages.

Yes, but that's why you need to automate that content generation, and do it on 
a 
regular (e.g. weekly) basis.

The big challenges we ran into were:

1. Dealing with badly behaved bots that would hammer the site.

We wound up putting this content on a separate system, so it wouldn't impact 
users on the main system.

And generating a regular report by user agent  IP address, so that we could 
block by robots.txt and IP when necessary.

2. Figuring out how to structure the static content so that it didn't look like 
spam to Google/Yahoo/Bing

You don't want to have too many links per page, or too much depth, but that 
constrains how many pages you can reasonably expose.

We had project scores based on code, activity, usage - so we used that to rank 
the content and focus on exposing early (low depth) the good stuff. You could 
do the same based on popularity, from search logs.

Anyway, there's a lot to this topic, but it doesn't feel very Solr specific. So 
apologies for reducing the signal-to-noise ratio with talk about SEO :)

-- Ken

 I have a feeling there is not a good solution for this because on one hand
 people don't like the negative bot side effect, on the other hand people want 
as
 much of their sites indexed by the big guys.  The only half-solution that 
comes
 to mind involves looking at who's actually crawling you and who's bringing you
 visitors, then blocking those with a bad ratio of those two - bots that crawl 
a
 lot but don't bring a lot of value.
 
 Any other ideas?
 
 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 From: Ken Krugler kkrugler_li...@transpac.com
 To: solr-user@lucene.apache.org
 Sent: Mon, January 10, 2011 9:43:49 AM
 Subject: Re: How to let crawlers in, but prevent their damage?
 
 Hi Otis,
 
 From what I learned at Krugle, the approach that worked for us  was:
 
 1. Block all bots on the search page.
 
 2. Expose the target  content via statically linked pages that are separately
 generated from the same  backing store, and optimized for target search terms
 (extracted from your own  search logs).
 
 -- Ken
 
 On Jan 10, 2011, at 5:41am, Otis Gospodnetic  wrote:
 
 Hi,
 
 How do people with public search  services deal with bots/crawlers?
 And I don't mean to ask how one bans  them (robots.txt) or slow them down
 (Delay
 stuff in robots.txt) or  prevent them from digging too deep in search
 results...
 
 What I  mean is that when you have publicly exposed search that bots crawl,
 they
 issue all kinds of crazy queries that result in errors, that add noise to
 Solr
 caches, increase Solr cache evictions, etc. etc.
 
 Are there some known recipes for dealing with them, minimizing their
 negative
 side-effects, while still letting them crawl you?
 
 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 --
 Ken  Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n  g
 
 
 
 
 
 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g


Re: Help needed in handling plurals

2011-01-10 Thread Ahmet Arslan

--- On Mon, 1/10/11, taimurAQ taimur_qure...@hotmail.com wrote:

 From: taimurAQ taimur_qure...@hotmail.com
 Subject: Help needed in handling plurals
 To: solr-user@lucene.apache.org
 Date: Monday, January 10, 2011, 6:35 PM
 
 Hi,
 
 I am currently facing the following problematic scenario:
 
 At index time, i index a field by the value of Laptop
 At index time, i index another field with the value of
 Laptops
 At query time, i search for Laptops.
 
 What is happening right now is that i am only getting back
 Laptops in the
 results, whereas i would like both Laptop and Laptops
 to be included. I
 do not want to use the Porter stemmer due to its aggressive
 nature, and i
 have tried to set up Pling Stemmer as a custom filter in my
 analyzer, but,
 to no avail.
 
 Can anyone guide me as to:
 1. Where to put the PlingStemmer.class file.
 2. How to set up the custom filter in the schema.xml file.

For an alternative to PlingStemmer see :
http://search-lucene.com/m/uHzMd2h5uDK1/

To integrate Pling to solr you need to write a custom TokenFilterFactory.
http://wiki.apache.org/solr/SolrPlugins

public class PlingStemFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new PlingStemFilter(input);
  }
}
You can do this by modifying existing subclasses.
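
The filter itself could look like this - a minimal sketch, assuming the 
PlingStemmer you have exposes a static stem(String) method (import for 
PlingStemmer omitted):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class PlingStemFilter extends TokenFilter {
  // gives read/write access to the current token's text
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public PlingStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // replace each token with its singular form
    termAtt.setTermBuffer(PlingStemmer.stem(termAtt.term()));
    return true;
  }
}

After your jar is in solrhome/lib, reference the factory in schema.xml as
<filter class="my.package.PlingStemFilterFactory"/> (package name yours).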

You need to create a jar file and put it into solrhome/lib directory.
All custom codes must be included as jar files.





  


first steps in nlp

2011-01-10 Thread lee carroll
Hi

I'm indexing a set of documents which have a conversational writing style.
In particular the authors are very fond
of listing facts in a variety of ways (this is to keep a human reader
interested) but it's causing my index trouble.

For example instead of listing facts like: the house is white, the castle is
pretty.

We get the house is the complete opposite of black and the castle is not
ugly.

What are the best approaches to resolve these sorts of issues? Even if it's
just handling "not" correctly, that would be a good start.


cheers lee c


Re: Token Counter

2011-01-10 Thread supersoft

As I understand, a faceted search would be useful if keywords is a
multivalued field and its field values are just single tokens. 

I want to display the occurrences of the tokens which appear in an indexed
(and stored) text field.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Token-Counter-tp2227795p2228991.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Token Counter

2011-01-10 Thread Sasank Mudunuri
Faceting will do this for you. Check out:
http://wiki.apache.org/solr/SimpleFacetParameters#facet.field

This param allows you to specify a field which should be treated as a facet.
 It will iterate over each Term in the field and generate a facet count using
 that Term as the constraint.


For a text field, it actually does go over each of the indexed tokens.
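
For example (core URL and field name assumed):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=text&facet.limit=20

returns the top 20 terms in the field. Note each count is the number of
documents containing that token, not the total number of occurrences.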


On Mon, Jan 10, 2011 at 10:11 AM, supersoft elarab...@gmail.com wrote:


 As I understand, a faceted search would be useful if keywords is a
 multivalued field and the its field value is just a token.

 I want to display the occurences of the tokens wich appear in a indexed
 (and
 stored) text field.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Token-Counter-tp2227795p2228991.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: first steps in nlp

2011-01-10 Thread Grant Ingersoll

On Jan 10, 2011, at 12:42 PM, lee carroll wrote:

 Hi
 
 I'm indexing a set of documents which have a conversational writing style.
 In particular the authors are very fond
 of listing facts in a variety of ways (this is to keep a human reader
 interested) but its causing my index trouble.
 
 For example instead of listing facts like: the house is white, the castle is
 pretty.
 
 We get the house is the complete opposite of black and the castle is not
 ugly.
 
 What are the best approaches to resolve these sorts of issues. Even if its
 just handling not correctly would be a good start
 

Hmm, good problem.  I guess I'd start by stepping back and ask what is the 
problem you are trying to solve?  You've stated, I think, one half of the 
problem, namely that your authors have a conversational style, but you haven't 
stated what your users are expecting to do with this information?  Is this a 
pure search app?  Is it something else that is just backed by Solr but the user 
would never do a search?  

Do you have a relevance problem?  Also, what is your notion of handling "not" 
correctly?  In other words, more details are welcome!

-Grant

--
Grant Ingersoll
http://www.lucidimagination.com



Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
I have a fairly classic master/slave set up.

Response times on the slave are generally good with blips periodically, 
apparently when replication is happening.

Occasionally however the process will have one incredibly slow query and 
will peg the CPU at 100%.

The weird thing is that it will remain that way even if we stop querying 
it and stop replication and then wait for over 20 minutes. The only way 
to fix the problem at that point is to restart tomcat.

Looking at slow queries around the time of the incident they don't look 
particularly bad - they're predominantly filter queries running under 
dismax and there doesn't seem to be anything unusual about them.

The index file is about 266G and has 30G of disk free. The machine has 
50G of RAM and is running with -Xmx35G.

Looking at the processes running it appears to be the main Java thread 
that's CPU bound, not the child threads. 

Stracing the process gives a lot of brk calls (presumably some 
sort of wait loop) with occasional blips of: 


mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
325, {1294683789, 614186000}, ) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mmap(0x7fc2e023, 121962496, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fc2e023
mmap(0x7fbca58e, 237568, PROT_NONE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
0x7fbca58e

Any ideas about what's happening and if there's any way to mitigate it? 
If the box at least recovered then I could run another slave and load 
balance between them working on the principle that the second box 
would pick up the slack whilst the first box restabilised but, as it is, 
that's not reliable.

Thanks,

Simon



Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Brian Burke
This sounds like it could be garbage collection related, especially with a heap 
that large.  Depending on your jvm tuning, a FGC could take quite a while, 
effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil <pid> or similar to monitor the 
garbage collection?
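
For example, against the Tomcat pid (pid and interval illustrative):

jstat -gcutil 12345 5000

prints GC utilization every 5 seconds; a full-GC storm shows up as the FGC
count climbing while O (old generation utilization) stays pinned near 100.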


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

 I have a fairly classic master/slave set up.
 
 Response times on the slave are generally good with blips periodically, 
 apparently when replication is happening.
 
 Occasionally however the process will have one incredibly slow query and 
 will peg the CPU at 100%.
 
 The weird thing is that it will remain that way even if we stop querying 
 it and stop replication and then wait for over 20 minutes. The only way 
 to fix the problem at that point is to restart tomcat.
 
 Looking at slow queries around the time of the incident they don't look 
 particularly bad - they're predominantly filter queries running under 
 dismax and there doesn't seem to be anything unusual about them.
 
 The index file is about 266G and has 30G of disk free. The machine has 
 50G of RAM and is running with -Xmx35G.
 
 Looking at the processes running it appears to be the main Java thread 
 that's CPU bound, not the child threads. 
 
 Stracing the process gives a lot of brk instructions (presumably some 
 sort of wait loop) with occasional blips of: 
 
 
 mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
 325, {1294683789, 614186000}, ) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mmap(0x7fc2e023, 121962496, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fc2e023
 mmap(0x7fbca58e, 237568, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fbca58e
 
 Any ideas about what's happening and if there's anyway to mitigate it? 
 If the box at least recovered then I could run another slave and load 
 balance between them working on the principle that the second box 
 would pick up the slack whilst the first box restabilised but, as it is, 
 that's not reliable.
 
 Thanks,
 
 Simon
 



Re: How to let crawlers in, but prevent their damage?

2011-01-10 Thread Chris Hostetter

: What I mean is that when you have publicly exposed search that bots crawl, 
they 
: issue all kinds of crazy queries that result in errors, that add noise to 
Solr 
: caches, increase Solr cache evictions, etc. etc.

I dealt with this type of thing a few years back by having my front end app 
execute queries against different solr tiers based on the User-Agent. Typical 
users to the main tier, known bots of partners to their own alt tier, 
known bots of public crawlers to a third alt tier.

in some cases these alternate tiers had the same configs as my normal 
search tier, but by being distinct, the unusual and erratic query volume 
and number of unique queries didn't screw up the cache rates or user stats 
generated by log parsing that i would use on my regular search tier.  In 
other cases the tiers had slightly different configs, ie: the bots of my 
known partners ran twice a day at predictable times, didn't do any 
faceting, and used a very predictable set of filters -- so i did 
snappulling only twice a day, and force warmed those filters.

i advocate this kind of distinct search tiers per user base even for 
human users -- assuming your volume is high enough and you have the 
budget for the hardware -- users who do similar queries on a certain 
subset of documents (with tons of faceting on a certain subset of fields) 
should all use the same set of query servers -- but if a different group of 
users tend to issue different types of queries (and facet on different 
fields) and you know this in advance -- you might as well have that second 
group of people query different boxes.

it's essentially session affinity except it's not about sessions -- it's 
about expected behavior based on what you know about the user 

-Hoss


Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Dennis Gearon
One other possibility is that the OS or BIOS is doing that, at least on a 
laptop. 
There is a new feature where, if the load is low enough, non-multi-threaded 
applications can be assigned to one processor and that processor has its clock 
boosted, so older software will run faster on the new processors - otherwise 
it runs SLOWER.

My brother has a CAD program that runs slower on his new quad core because the 
base clock speed is slower than a single-processor CPU. The software company is 
not taking the time to rewrite their code, except where they add features or 
fixes. 




- Original Message 

From: Brian Burke bbu...@techtarget.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 10:56:27 AM
Subject: Re: Box occasionally pegs one cpu at 100%

This sounds like it could be garbage collection related, especially with a heap 
that large.  Depending on your jvm tuning, a FGC could take quite a while, 
effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil   or similar to monitor the 
garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

 I have a fairly classic master/slave set up.
 
 Response times on the slave are generally good with blips periodically, 
 apparently when replication is happening.
 
 Occasionally however the process will have one incredibly slow query and 
 will peg the CPU at 100%.
 
 The weird thing is that it will remain that way even if we stop querying 
 it and stop replication and then wait for over 20 minutes. The only way 
 to fix the problem at that point is to restart tomcat.
 
 Looking at slow queries around the time of the incident they don't look 
 particularly bad - they're predominantly filter queries running under 
 dismax and there doesn't seem to be anything unusual about them.
 
 The index file is about 266G and has 30G of disk free. The machine has 
 50G of RAM and is running with -Xmx35G.
 
 Looking at the processes running it appears to be the main Java thread 
 that's CPU bound, not the child threads. 
 
 Stracing the process gives a lot of brk instructions (presumably some 
 sort of wait loop) with occasional blips of: 
 
 
 mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
 325, {1294683789, 614186000}, ) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mmap(0x7fc2e023, 121962496, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fc2e023
 mmap(0x7fbca58e, 237568, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fbca58e
 
 Any ideas about what's happening and if there's anyway to mitigate it? 
 If the box at least recovered then I could run another slave and load 
 balance between them working on the principle that the second box 
 would pick up the slack whilst the first box restabilised but, as it is, 
 that's not reliable.
 
 Thanks,
 
 Simon
 


Re: Improving Solr performance

2011-01-10 Thread Paul
 I see from your other messages that these indexes all live on the same 
 machine.
 You're almost certainly I/O bound, because you don't have enough memory for 
 the
 OS to cache your index files.  With 100GB of total index size, you'll get best
 results with between 64GB and 128GB of total RAM.

Is that a general rule of thumb? That it is best to have about the
same amount of RAM as the size of your index?

So, with a 5GB index, I should have between 4GB and 8GB of RAM
dedicated to solr?


Re: Improving Solr performance

2011-01-10 Thread Markus Jelsma
No, it also depends on the queries you execute (sorting is a big consumer) and 
the number of concurrent users.

 Is that a general rule of thumb? That it is best to have about the
 same amount of RAM as the size of your index?
 
 So, with a 5GB index, I should have between 4GB and 8GB of RAM
 dedicated to solr?


Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
I see a lot of people using shards to hold different types of 
documents, and it almost always seems to be a bad solution. Shards are 
intended for distributing a large index over multiple hosts -- that's 
it.  Not for some kind of federated search over multiple schemas, not 
for access control.


Why not put everything in the same index, without shards, and just use 
an 'fq' limit in order to limit to the specific documents you'd like to 
search over in a given search? I think that would achieve your goal a 
lot more simply than shards -- then you use sharding only if and when 
your index grows to be so large you'd like to distribute it over 
multiple hosts, and when you do so you choose a shard key that will have 
more or less equal distribution across shards.
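
For example, with a 'kind' field on every document (field and values
assumed), restricting a search to two of the five kinds is just:

/solr/select?q=whatever&fq=kind:(typeA OR typeB)

and that filter gets cached in the filterCache independently of the main
query, so it's cheap to reuse.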


Using shards for access control or schema management just leads to 
headaches.


[Apparently Solr could use some highlighted documentation on what shards 
are really for, as it seems to be a very common issue on this list, 
someone trying to use them for something else and then inevitably 
finding problems with that approach.]


Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:

The reason of this distribution is the kind of the documents. In spite of
having the same schema structure (and solr conf), a document belongs to 1 of
5 different kinds.

Each kind corresponds to a concrete shard and due to this, the implemented
client tool avoids searching in all the shards when the users selects just
one or a few of kinds. The tool runs a multisharded query of the proper
shards. I guess this is a right approach but correct me if I am wrong.

The real problem of this architecture is the correlation between concurrent
users and response time:
1 query: n seconds
2 queries: 2*n second each query
3 queries: 3*n seconds each query
and so...

This is being a real headache because 1 single query has an acceptable
response time, but when many users are accessing the server the
performance degrades badly.


Re: Tuning StatsComponent

2011-01-10 Thread Jonathan Rochkind
I found StatsComponent to be slow only when I didn't have enough RAM 
allocated to the JVM.  I'm not sure exactly what was causing it, but it 
was pathologically slow -- and then adding more RAM to the JVM made it 
incredibly fast.


On 1/10/2011 4:58 AM, Gora Mohanty wrote:

On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com  wrote:

Hello.

i`m using the StatsComponent to get the sum of amounts. but solr
statscomponent is very slow on a huge index of 30 Million documents. how can
i tune the statscomponent ?

Not sure about this problem.


the problem is, that i have 5 currencys and i need to send for each currency
a new request. thats make the solr search sometimes very slow. =(

[...]

I guess that you mean the search from the front-end is slow.

It is difficult to make a guess without details of your index,
and of your queries, but one thing that immediately jumps
out is that you could shard the Solr index by currency, and
have your front-end direct queries for each currency to the
appropriate Solr server.

Please do share a description of what all you are indexing,
how large your index is, and what kind of queries you are
running. I take it that you have already taken a look at
http://wiki.apache.org/solr/SolrPerformanceFactors

Regards,
Gora



Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

2011-01-10 Thread Jason Rutherglen
 most of the Solr sites I know of
 have much larger indexes than ram and expect everything to work
 smoothly

Hmm... In that case, throttling the merges would probably help most,
though, yes, that's not available today.  In lieu of that, I'd run
large merges during off-peak hours, or better yet, use Solr's
replication, eg, merge on the master where queries aren't hitting
anything.  Perhaps that'd throw off the NRT interval though.

On Sun, Jan 9, 2011 at 8:55 PM, Lance Norskog goks...@gmail.com wrote:
 Ok. I was talking about what tools are available now- much better
 things are in the NRT work. I don't know how merges work now, in re
 multitasking and thread contention. Most of the Solr sites I know of
 have much larger indexes than ram and expect everything to work
 smoothly.

 Lance

 On Sun, Jan 9, 2011 at 9:18 AM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 The older MergePolicies followed a strategy which is quite disruptive in an 
 NRT environment.

 Can you elaborate as to why (maybe we need to place this in a wiki)?
 If large merges are running in their own thread, they should not
 disrupt queries, eg, there won't be CPU contention.  The IO contention
 can be disruptive, depending on the size and type of hardware, however
 in the ideal case of the index 'fitting' into RAM/IO cache, then a
 large merge should not affect queries (or indexing).

 I think what's useful that is being developed for not disrupting NRT
 with merges is DirectIOLinuxDirectory:
 https://issues.apache.org/jira/browse/LUCENE-2500  It's also useful
 for the non-NRT use case because anytime IO cache pages are evicted,
 queries will slow down (unless the index is too large to fit in RAM
 anyways).

 On Sat, Jan 8, 2011 at 7:55 PM, Lance Norskog goks...@gmail.com wrote:
 There are always slowdowns when merging new segments during indexing.
 A MergePolicy decides when to merge segments.  The older MergePolicies
 followed a strategy which is quite disruptive in an NRT environment.

 There is a new feature in 3.x  the trunk called
 'BalancedSegmentMergePolicy'. This new MergePolicy is designed for the
 near-real-time use case. It was contributed by LinkedIn. You may find
 it works well enough for your case.

 Lance

 On Thu, Jan 6, 2011 at 10:21 AM, Stephen Boesch java...@gmail.com wrote:
 Thanks Yonik,
  Using a stable release of Solr what would you suggest to do - given
 MultiSearch's demise and the other work is still ongoing?

 2011/1/6 Yonik Seeley yo...@lucidimagination.com

 On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch java...@gmail.com wrote:
  Solr/lucene newbie here ..
 
  We would like searches against a solr/lucene index to immediately be 
  able
 to
  view data that was added.  I stress small amount of new data given 
  that
  any significant amount would require excessive  latency.

 There has been significant ongoing work in lucene-core for NRT (near real
 time).
 We need to overhaul Solr's DirectUpdateHandler2 to take advantage of
 all this work.
 Mark Miller took a first crack at it (sharing a single IndexWriter,
 letting lucene handle the concurrency issues, etc)
 but if there's a JIRA issue, I'm having trouble finding it.

  Looking around, i'm wondering if the direction would be a MultiSearcher
  living on top of our standard directory-based IndexReader as well as a
  custom Searchable that handles the newest documents - and then combines
 the
  two results?

 If you look at trunk, MultiSearcher has already gone away.

 -Yonik
 http://www.lucidimagination.com





 --
 Lance Norskog
 goks...@gmail.com





 --
 Lance Norskog
 goks...@gmail.com



Re: Tuning StatsComponent

2011-01-10 Thread Grant Ingersoll
StatsComponent, like many things, relies on FieldCache (and the related 
uninverted version in Solr for multivalued fields), which takes up memory and 
is related to the number of documents in the index.  Strings in FieldCache can 
also be expensive.
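
As an aside, for the original per-currency problem it may be worth trying
stats.facet, which breaks the stats down by another field in a single
request (field names assumed):

/solr/select?q=*:*&rows=0&stats=true&stats.field=amount&stats.facet=currency

That returns sum/min/max etc. of 'amount' once per distinct 'currency'
value, so the five separate requests collapse into one.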

-Grant

On Jan 10, 2011, at 4:10 PM, Jonathan Rochkind wrote:

 I found StatsComponent to be slow only when I didn't have enough RAM 
 allocated to the JVM.  I'm not sure exactly what was causing it, but it was 
 pathologically slow -- and then adding more RAM to the JVM made it incredibly 
 fast.
 
 On 1/10/2011 4:58 AM, Gora Mohanty wrote:
 On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com  wrote:
 Hello.
 
 i`m using the StatsComponent to get the sum of amounts. but solr
 statscomponent is very slow on a huge index of 30 Million documents. how can
 i tune the statscomponent ?
 Not sure about this problem.
 
 the problem is, that i have 5 currencys and i need to send for each currency
 a new request. thats make the solr search sometimes very slow. =(
 [...]
 
 I guess that you mean the search from the front-end is slow.
 
 It is difficult to make a guess without details of your index,
 and of your queries, but one thing that immediately jumps
 out is that you could shard the Solr index by currency, and
 have your front-end direct queries for each currency to the
 appropriate Solr server.
 
 Please do share a description of what all you are indexing,
 how large your index is, and what kind of queries you are
 running. I take it that you have already taken a look at
 http://wiki.apache.org/solr/SolrPerformanceFactors
 
 Regards,
 Gora
 

--
Grant Ingersoll
http://www.lucidimagination.com/



Re: Improving Solr performance

2011-01-10 Thread Toke Eskildsen
On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
  I see from your other messages that these indexes all live on the same 
  machine.
  You're almost certainly I/O bound, because you don't have enough memory for 
  the
  OS to cache your index files.  With 100GB of total index size, you'll get 
  best
  results with between 64GB and 128GB of total RAM.
 
 Is that a general rule of thumb? That it is best to have about the
 same amount of RAM as the size of your index?

It does not seem like there is a clear current consensus on hardware to
handle IO problems. I am firmly in the SSD camp, but as you can see from
the current thread, other people recommend RAM and/or extra machines.

I can say that our tests with RAM and spinning disks showed us that a
lot of RAM certainly helps a lot, but also that it takes a considerable
amount of time to warm the index before the performance is satisfactory.
It might be helped with disk cache tricks, such as copying the whole
index to /dev/null before opening it in Solr.
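
For example, something like this before starting Solr (path illustrative):

cat /var/solr/data/index/* > /dev/null

which just reads every index file once so that the OS page cache is warm
before the first queries arrive.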

 So, with a 5GB index, I should have between 4GB and 8GB of RAM
 dedicated to solr?

Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
index size recommendation.



Re: Improving Solr performance

2011-01-10 Thread Dennis Gearon
What I seem to see suggested here is to use different cores for the things you 
suggested:
  different types of documents
  Access Control Lists

I wonder how sharding would work in that scenario?

Me, I plan on:
  For security:
    Using a permissions field
  For different schemas:
    Dynamic fields with enough premade fields to handle it.


The one thing I don't think my approach does well with is statistics.

 Dennis Gearon



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Cc: supersoft elarab...@gmail.com
Sent: Mon, January 10, 2011 1:08:00 PM
Subject: Re: Improving Solr performance

I see a lot of people using shards to hold different types of documents, and 
it almost always seems to be a bad solution. Shards are intended for 
distributing a large index over multiple hosts -- that's it.  Not for some kind 
of federated search over multiple schemas, not for access control.

Why not put everything in the same index, without shards, and just use an 'fq' 
limit in order to limit to the specific documents you'd like to search over in a 
given search? I think that would achieve your goal a lot more simply than 
shards -- then you use sharding only if and when your index grows to be so large 
you'd like to distribute it over multiple hosts, and when you do so you choose a 
shard key that will have more or less equal distribution across shards.

Using shards for access control or schema management just leads to headaches.

[Apparently Solr could use some highlighted documentation on what shards are 
really for, as it seems to be a very common issue on this list, someone trying 
to use them for something else and then inevitably finding problems with that 
approach.]

Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:
 The reason of this distribution is the kind of the documents. In spite of
 having the same schema structure (and solr conf), a document belongs to 1 of
 5 different kinds.
 
 Each kind corresponds to a concrete shard and due to this, the implemented
 client tool avoids searching in all the shards when the users selects just
 one or a few of kinds. The tool runs a multisharded query of the proper
 shards. I guess this is a right approach but correct me if I am wrong.
 
 The real problem of this architecture is the correlation between concurrent
 users and response time:
 1 query: n seconds
 2 queries: 2*n second each query
 3 queries: 3*n seconds each query
 and so...
 
  This is being a real headache because 1 single query has an acceptable
  response time, but when many users are accessing the server the
  performance degrades badly.



Re: first steps in nlp

2011-01-10 Thread lee carroll
Hi Grant,

It's a search relevancy problem. For example:

a document about london reads like

London is not very good for a peaceful break.

we analyse this at the (i can't remember the technical term) lexical
level? (bloody hell, i think you may have written the book!) anyway, the
analysis produces tokens in our index of, say:

London good peaceful holiday

users search for cities which would be nice for them to take a holiday in,
say the search is:
good for a peaceful break

and bang, london is top. talk about a relevancy problem :-)

now i was thinking of using phrase matches in the synonyms file but is that
the best approach or could nlp help here?

cheers lee




On 10 January 2011 18:21, Grant Ingersoll gsing...@apache.org wrote:


 On Jan 10, 2011, at 12:42 PM, lee carroll wrote:

  Hi
 
  I'm indexing a set of documents which have a conversational writing
 style.
  In particular the authors are very fond
  of listing facts in a variety of ways (this is to keep a human reader
  interested) but its causing my index trouble.
 
  For example instead of listing facts like: the house is white, the castle
 is
  pretty.
 
  We get the house is the complete opposite of black and the castle is not
  ugly.
 
  What are the best approaches to resolve these sorts of issues. Even if
 its
  just handling not correctly would be a good start
 

 Hmm, good problem.  I guess I'd start by stepping back and ask what is the
 problem you are trying to solve?  You've stated, I think, one half of the
 problem, namely that your authors have a conversational style, but you
 haven't stated what your users are expecting to do with this information?
  Is this a pure search app?  Is it something else that is just backed by
 Solr but the user would never do a search?

 Do you have a relevance problem?  Also, what is your notion of handling
 not correctly?  In other words, more details are welcome!

 -Grant

 --
 Grant Ingersoll
 http://www.lucidimagination.com




Re: Improving Solr performance

2011-01-10 Thread mike anderson
Not sure if this was mentioned yet, but if you are doing slave/master
replication you'll need 2x the RAM at replication time. Just something to
keep in mind.

-mike

On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
   I see from your other messages that these indexes all live on the same
 machine.
   You're almost certainly I/O bound, because you don't have enough memory
 for the
   OS to cache your index files.  With 100GB of total index size, you'll
 get best
   results with between 64GB and 128GB of total RAM.
 
  Is that a general rule of thumb? That it is best to have about the
  same amount of RAM as the size of your index?

  It does not seem like there is a clear current consensus on hardware to
  handle IO problems. I am firmly in the SSD camp, but as you can see from
  the current thread, other people recommend RAM and/or extra machines.

 I can say that our tests with RAM and spinning disks showed us that a
 lot of RAM certainly helps a lot, but also that it takes a considerable
 amount of time to warm the index before the performance is satisfactory.
 It might be helped with disk cache tricks, such as copying the whole
 index to /dev/null before opening it in Solr.

  So, with a 5GB index, I should have between 4GB and 8GB of RAM
  dedicated to solr?

 Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
 index size recommendation.




Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
 This sounds like it could be garbage collection related, especially 
 with a heap that large.  Depending on your jvm tuning, a FGC could 
 take quite a while, effectively 'pausing' the JVM.
 
 Have you looked at something like jstat -gcutil or similar to monitor 
 the garbage collection?

I think you may have hit the nail on the head. 

Having checked the configuration again I noticed that the -server flag 
didn't appear to be present in the options passed to Java (I'm convinced 
it used to be there). As I understand it, this would mean that the 
Parallel GC wouldn't be implicitly enabled.

If that's true then that's a definite strong candidate for causing the 
root process and only the root process to peg a single CPU.

Anybody have any experience of the differences between

-XX:+UseParallelGC 

and

-XX:+UseConcMarkSweepGC with -XX:+UseParNewGC

?

I believe -XX:+UseParallelGC  is the default with -server so I suppose 
that's a good place to start but I'd appreciate any anecdotes or 
experiences.
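
For the record, the CMS variant would look something like this (heap size
ours, GC logging flags included for diagnosis, everything else illustrative):

java -server -Xmx35g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ...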





Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind

On 1/10/2011 5:03 PM, Dennis Gearon wrote:

What I seem to see suggested here is to use different cores for the things you
suggested:
   different types of documents
   Access Control Lists

I wonder how sharding would work in that scenario?


Sharding has nothing to do with that scenario at all. Different cores 
are essentially _entirely separate_.  While it can be convenient to use 
different cores like this, it means you don't get ANY searches that 
'join' over multiple 'kinds' of data in different cores.


Solr is not great at handling heterogeneous data like that.  Putting it 
in separate cores is one solution, although then they are entirely 
separate.  If that works, great.  Another solution is putting them in 
the same index, but using mostly different fields, and perhaps having a 
'type' field shared amongst all of your 'kinds' of data, and then always 
querying with an 'fq' for the right 'kind'.  Or if the fields they use 
are entirely different, you don't even need the fq, since a query on a 
certain field will only match a certain 'kind' of document.


Solr is not great at handling complex queries over data with 
heterogeneous schemata. Solr wants you to flatten all your data into 
one single set of documents.


Sharding is a way of splitting up a single index (multiple cores are 
_multiple indexes_) amongst several hosts for performance reasons, 
mostly when you have a very large index.  That is it.  The end.  If you 
have multiple cores, that's the same as having multiple solr indexes 
(which may or may not happen to be on the same machine). Any one or more 
of those cores could be sharded if you want. This is a separate issue.






Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
And I don't think I've seen anyone suggest a separate core just for 
Access Control Lists. I'm not sure what that would get you.


Perhaps a separate store that isn't Solr at all, in some cases.

On 1/10/2011 5:36 PM, Jonathan Rochkind wrote:

Access Control Lists


Re: Improving Solr performance

2011-01-10 Thread Markus Jelsma
Any sources to cite for this statement? And are you talking about RAM 
allocated to the JVM or available for OS cache?

 Not sure if this was mentioned yet, but if you are doing slave/master
 replication you'll need 2x the RAM at replication time. Just something to
 keep in mind.
 
 -mike
 
 On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen 
t...@statsbiblioteket.dkwrote:
  On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
I see from your other messages that these indexes all live on the
same
  
  machine.
  
You're almost certainly I/O bound, because you don't have enough
memory
  
  for the
  
OS to cache your index files.  With 100GB of total index size, you'll
  
  get best
  
results with between 64GB and 128GB of total RAM.
   
   Is that a general rule of thumb? That it is best to have about the
   same amount of RAM as the size of your index?
  
  It does not seem like there is a clear current consensus on hardware to
  handle IO problems. I am firmly in the SSD camp, but as you can see from
  the current thread, other people recommend RAM and/or extra machines.
  
  I can say that our tests with RAM and spinning disks showed us that a
  lot of RAM certainly helps a lot, but also that it takes a considerable
  amount of time to warm the index before the performance is satisfactory.
  It might be helped with disk cache tricks, such as copying the whole
  index to /dev/null before opening it in Solr.
  
   So, with a 5GB index, I should have between 4GB and 8GB of RAM
   dedicated to solr?
  
  Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
  index size recommendation.


Re: Solr Spellcheker automatically tokenizes on period marks

2011-01-10 Thread Sebastian M

I've noticed that the spellcheck component also seems to tokenize by itself
on question marks, not only period marks. 

Based on the spellcheck definition above, does anyone know how to stop Solr
from tokenizing strings on queries such as

www.sometest.com

(which causes suggestions of the form www.www.sometest.com.com)

It gets really messy if the user then clicks the above suggestion, which
causes a suggestion such as www.www.www.sometest.com.com.com to be given.

Thanks in advance!
Sebastian
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Spellcheker-automatically-tokenizes-on-period-marks-tp2131844p2231170.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread François Schiettecatte
This reminded me of a situation I ran into in the past where the JVM was being 
rendered useless because it was calling FGC repeatedly. Effectively what was 
going on is that a very large array was allocated which swamped the JVM memory 
and caused it to thrash, much like an OS.

Here are some links which will help (at least they helped me):

http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html  (you need to 
read this one)

http://java.sun.com/performance/reference/whitepapers/tuning.html   (and 
this one).

http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

http://java.sun.com/performance/jvmstat/

http://blogs.sun.com/watt/resource/jvm-options-list.html

jstat is also very good for seeing what is going on in the JVM. I also recall 
there was a way to trace GC in the JVM but can't recall how off the top of my 
head, maybe it was a JVM option.

Hope this helps.

Cheers

François


On Jan 10, 2011, at 5:13 PM, Simon Wistow wrote:

 On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
 This sounds like it could be garbage collection related, especially 
 with a heap that large.  Depending on your jvm tuning, a FGC could 
 take quite a while, effectively 'pausing' the JVM.
 
 Have you looked at something like jstat -gcutil or similar to monitor 
 the garbage collection?
 
 I think you may have hit the nail on the head. 
 
 Having checked the configuration again I noticed that the -server flag 
 didn't appear to be present in the options passed to Java (I'm convinced 
 it used to be there). As I understand it, this would mean that the 
 Parallel GC wouldn't be implicitly enabled.
 
 If that's true then that's a definite strong candidate for causing the 
 root process and only the root process to peg a single CPU.
 
 Anybody have any experience of the differences between
 
 -XX:+UseParallelGC 
 
 and
 
 -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC
 
 ?
 
 I believe -XX:+UseParallelGC  is the default with -server so I suppose 
 that's a good place to start but I'd appreciate any anecdotes or 
 experiences.
 
 
 



Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said:
 http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html(you 
 need to read this one)
 
 http://java.sun.com/performance/reference/whitepapers/tuning.html (and 
 this one).

Yeah, I have these two pages bookmarked :)

 jstat is also very good for seeing what is going on in the JVM. I also 
 recall there was a way to trace GC in the JVM but cant recall how off 
 the top of my head, maybe it was a JVM option.

You can use -XX:+PrintGC and -XX:+PrintGCDetails (and 
-XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file.

I'm also finding NewRelic's RPM system great for monitoring Solr - the 
integration is really good, I give it two thumbs up.




RE: Empty value/string matching

2011-01-10 Thread Viswa S

Anyone know why this would not be working in Solr? Just to recap, we are 
trying to exclude documents which have fields with missing values from the 
search results. I have tried the following and none of it seems to be working:

1. *:* -field:[* TO *]
2. -field:[* TO *]
3. field:

The fields are either typed string or custom, and the query parser used is 
the LuceneQParser. The below suggested solutions of using some default values 
do not work for our use case.

Thanks,
Viswa

 From: bob.sandif...@sirsidynix.com
 To: solr-user@lucene.apache.org
 Date: Mon, 22 Nov 2010 08:35:22 -0700
 Subject: RE: Empty value/string matching
 
 One possibility to consider - if you really need documents with specifically 
 empty or non-defined values (if that's not an oxymoron :)), and you have 
 control over the values you send into the indexing, you could set a special 
 value that means 'no value'. We've done that in a similar vein, using 
 something like '@@EMPTY@@' for a given field, meaning that the original 
 document didn't actually have a value for that field.  I.E. it is something 
 very unlikely to be a 'real' value - and then we can easily select on 
 documents by querying for the field:@@EMPTY@@ instead of the negated form of 
 the select...  However, we haven't considered things like what it does to 
 index size.  It's relatively rare for us (that there not be a value), so our 
 'gut feel' is that it's not impacting the indexes very much size-wise or 
 performance-wise.
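 
 The selects then look something like this (field name assumed):
 
 q=myfield:@@EMPTY@@          docs that came in without a value
 q=*:* -myfield:@@EMPTY@@     docs that had a real value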
 
 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com 
 
  -Original Message-
  From: Viswa S [mailto:svis...@hotmail.com]
  Sent: Saturday, November 20, 2010 5:38 PM
  To: solr-user@lucene.apache.org
  Subject: RE: Empty value/string matching
  
  
  Erick,
  Thanks for the quick response. The output i showed is on a test
  instance i created to simulate this issue. I intentionally tried to
  create documents with no values by creating xml nodes with field
  name=fieldName/field, but having values in the other fields in a
  document.
  Are you saying that there is no way have a field with no value?, with
  text fields they seem to make sense than for string?.
  You are right on fieldName:[* TO *] results, which basically returned
  all the documents which included the couple of documents in question.
  -Viswa
   Date: Sat, 20 Nov 2010 17:20:53 -0500
   Subject: Re: Empty value/string matching
   From: erickerick...@gmail.com
   To: solr-user@lucene.apache.org
  
   I don't think that's correct. The documents wouldn't be showing
   up in the facets if they had no value for the field. So I think
  you're
   being mislead by the printout from the faceting. Perhaps you
   have unprintable characters in there or some such. Certainly the
   name:  is actually a value, admittedly just a space. As for the
   other, I suspect something similar.
  
   What results do you get back when you just search for
   FieldName:[* TO *]? I'm betting you get all the docs back,
   but I've been very wrong before.
  
   Best
   Erick
  
   On Sat, Nov 20, 2010 at 5:02 PM, Viswa S svis...@hotmail.com wrote:
  
   
    Yes I do have a couple of documents with no values and one with an
   empty string. Find below the output of a facet on the fieldName.
   Thanks, Viswa
   
   
int name=2/intint name=CASTIGO.4302/intint
name=GDOGPRODY.4242/intint name=QMAGIC.4122/intint
  name=
1/int
 Date: Sat, 20 Nov 2010 15:29:06 -0500
 Subject: Re: Empty value/string matching
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org

 Are you absolutely sure your documents really don't have any
  values for
 FieldName? Because your results are perfectly correct if every
  doc has
a
 value for FieldName.

 Or are you saying there no such field as FieldName?

 Best
 Erick

 On Sat, Nov 20, 2010 at 3:12 PM, Viswa S svis...@hotmail.com
  wrote:

 
  Folks, am trying to query documents which have no values present. I have
  used the following constructs and it doesn't seem to work on the solr dev
  tip (as of 09/22) or the 1.4 builds.
  1. (*:* AND -FieldName:[* TO *]) - returns no documents, parsedquery was
  +MatchAllDocsQuery(*:*) -FieldName:[* TO *]
  2. -FieldName:[* TO *] - returns no documents, parsedquery was
  -FieldName:[* TO *]
  3. FieldName: - returns no documents, parsedquery was empty
  (<str name="parsedquery"/>)
  The field is type string, using the LuceneQParser. I have also tried
  FieldName:[* TO *] to see if the documents with no terms are ignored, and
  it didn't seem to be the case; the result set was everything.
  Any help would be appreciated. -Viswa
 
   
   
  
 
  

Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
Hi all,

I'm implementing Solr for a course and book search service for college
students, and I'm running into some issues with the highlighting plugin.
After a few minutes of tinkering, searching on Google, searching the group
archives and not finding anything, I thought I would see if anyone else is
having this problem and if not what I am doing to cause it.

Basically, the issue is that whenever I turn on highlighting for a certain
field, I get either (1) inconsistent highlights or (2) bizarre highlight
output for some of the results. A few of the results look correct.

Here's my solrconfig.xml: http://pastie.org/private/iz3fd77innxb5r2v63zpa

Broken output: http://pastie.org/private/pyptpektckitp2piqvcgw

As you can see, I searched for history. In the results, a few times that
the query is highlighted, you'll see that the name fields contain strings
such as <span>History</span><span>History</span><span>History</span>
, instead of just highlighting it once.

I don't have the knowledge to understand why Solr would treat African
American History: From Emancipation to the Present differently than African
American Women's History, other than one is longer than the other, or why
it would double or quadruple the highlighted response. I tried to figure out
what configuration option could change this, to no avail.

If anyone has any input, I would be very grateful. Thank you!

Dan


Post size limit to Solr?

2011-01-10 Thread Stephen Powis
Is there a max POST size limit when sending documents over to Solrs update
handler to be indexed?  Right now I've self imposed a limit of sending a max
of 50 docs per request to solr in my PHP code..and that seems to work fine.
I was just curious as to if there was a limit somewhere at which Solr will
complain?

Thanks
Stephen


Re: Post size limit to Solr?

2011-01-10 Thread Ahmet Arslan
 Is there a max POST size limit when
 sending documents over to Solrs update
 handler to be indexed?  Right now I've self imposed a
 limit of sending a max
 of 50 docs per request to solr in my PHP code..and that
 seems to work fine.
 I was just curious as to if there was a limit somewhere at
 which Solr will
 complain?

I think this is related to servlet container. 
Default maxPostSize for Tomcat is 2 megabytes.
http://tomcat.apache.org/tomcat-5.5-doc/config/http.html
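
If you need to raise it, it's an attribute on the HTTP connector in
Tomcat's server.xml, e.g. (value illustrative, in bytes; 0 or a negative
value disables the limit):

<Connector port="8080" protocol="HTTP/1.1" maxPostSize="10485760" />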





Re: Solr highlighting is botching output

2011-01-10 Thread Ahmet Arslan
 I'm implementing Solr for a course and book search service
 for college
 students, and I'm running into some issues with the
 highlighting plugin.
 After a few minutes of tinkering, searching on Google,
 searching the group
 archives and not finding anything, I thought I would see if
 anyone else is
 having this problem and if not what I am doing to cause
 it.
 
 Basically, the issue is that whenever I turn on
 highlighting for a certain
 field, I get either (1) inconsistent highlights or (2)
 bizarre highlight
 output for some of the results. A few of the results look
 correct.
 
 Here's my solrconfig.xml: http://pastie.org/private/iz3fd77innxb5r2v63zpa
 
 Broken output: http://pastie.org/private/pyptpektckitp2piqvcgw
 
 As you can see, I searched for history. In the results, a
 few times that
 the query is highlighted, you'll see that the name fields
 contain strings
 such as
 <span>History</span><span>History</span><span>History</span>
 , instead of just highlighting it once.
 
 I don't have the knowledge to understand why Solr would
 treat African
 American History: From Emancipation to the Present
 differently than African
 American Women's History, other than one is longer than
 the other, or why
 it would double or quadruple the highlighted response. I
 tried to figure out
 what configuration option could change this, to no avail.
 
 If anyone has any input, I would be very grateful. Thank
 you!

That's really strange. Can you provide us the field type definition of the text 
field, the full search URL that caused that output, and the Solr version?


  


Re: Solr highlighting is botching output

2011-01-10 Thread Ahmet Arslan
 Thats really strange. Can you provide us field type
 definition of text field. And full search URL that caused
 that output. And the solr version.
 

Also, did you enable term vectors on text field?


  


Re: Post size limit to Solr?

2011-01-10 Thread Stephen Powis
Thanks!

On Mon, Jan 10, 2011 at 9:27 PM, Ahmet Arslan iori...@yahoo.com wrote:

  Is there a max POST size limit when
  sending documents over to Solrs update
  handler to be indexed?  Right now I've self imposed a
  limit of sending a max
  of 50 docs per request to solr in my PHP code..and that
  seems to work fine.
  I was just curious as to if there was a limit somewhere at
  which Solr will
  complain?

 I think this is related to servlet container.
 Default maxPostSize for Tomcat is 2 megabytes.
 http://tomcat.apache.org/tomcat-5.5-doc/config/http.html






Synonyms at index time

2011-01-10 Thread TxCSguy

Hi,

I'm not sure if this question is better posted in Solr - User or Solr - Dev,
but I'll start here.

I'm interested to find some documentation that describes in detail how
synonym expansion is handled at index time.  
http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/
This article explains what the index looks like for three example
documents.  However, I'm looking for some documentation about what the index
(the inverted index) looks like when synonyms are thrown into the mix.  

Thanks in advance for your help.
-Mark 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Synonyms-at-index-time-tp2232470p2232470.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Input raw log file

2011-01-10 Thread Dinesh

can you give an example, like something that is currently being used? I'm an
engineering student and my project is to index all the real-time log files
from different devices, use some artificial intelligence, and produce
useful data out of it. I'm doing this for my college, and I've been struggling
for more than a month even to get started. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Input-raw-log-file-tp2210043p2232604.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
On Mon, Jan 10, 2011 at 6:48 PM, Ahmet Arslan iori...@yahoo.com wrote:

 Thats really strange. Can you provide us field type definition of text
 field. And full search URL that caused that output. And the solr version.


Sure. Full search URL:

/solr/select?indent=on&version=2.2&q=history&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=

Here's the type definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="114" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Re: Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
On Mon, Jan 10, 2011 at 6:51 PM, Ahmet Arslan iori...@yahoo.com wrote:

  Thats really strange. Can you provide us field type
  definition of text field. And full search URL that caused
  that output. And the solr version.
 

 Also, did you enable term vectors on text field?


Not sure what those are, so I'm guessing no :)


icq or other 'instant gratification' communication forums for Solr

2011-01-10 Thread Dennis Gearon
Are there any chatrooms or ICQ rooms to ask questions late at night to people 
who stay up or are on other side of planet?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



How to insert this using Solr PHP?

2011-01-10 Thread Dennis Gearon
I am switching between building the query to a Solr instance by hand and doing 
it with PHP Solr Extension.

I have this query that my dev partner said to insert before all the other 
column 
searches. What kind of query is it and how do I get it into the query in an 
'OOP' style using the PHP Solr extension? In particular, I'm interested in what 
is the part in the query 'q={!.}. Is that a filter query? How do I put it 
into the query . . . I already asked that ;-)

URL_BASE?wt=json&indent=true&start=0&rows=20&q={!spatial lat=xx.x 
long=xxx.x radius=10 unit=km threadCount=3} OTHER COLUMNS, blah blah

 
bcc: my partner

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Solr highlighting is botching output

2011-01-10 Thread Ahmet Arslan

Not sure about your solr version but probably it can be 
https://issues.apache.org/jira/browse/LUCENE-2266

Is there a special reason for using EdgeNGramTokenizerFactory?
Replacing this tokenizer with WhiteSpaceTokenizer should solve this.

Or upgrade solr version.

And I don't see span either in your search URL or solrconfig.xml, so how is 
span popping up in the response?


 
 Sure. Full search URL:
 
 /solr/select?indent=on&version=2.2&q=history&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=
 
 Here's the type definition:
 
     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="114"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>
 





Re: Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
On Mon, Jan 10, 2011 at 9:19 PM, Ahmet Arslan iori...@yahoo.com wrote:


 I'm not sure about your Solr version, but this is probably
 https://issues.apache.org/jira/browse/LUCENE-2266

 Is there a special reason for using EdgeNGramTokenizerFactory?
 Replacing this tokenizer with WhitespaceTokenizer should solve this.


I'm trying to implement autocomplete, so I need to be able to search within
words. Maybe I was using it incorrectly, but the WhitespaceTokenizer would
only index whole words.

econ needs to match economics, econometrics, etc.

 Or upgrade your Solr version.


Oops, forgot to mention the version. I'm running Solr 1.4.1.


 Also, I don't see <span> in either your search URL or your solrconfig.xml; how
 is <span> showing up in the response?


My mistake. I was playing around with the pre/post parameters. Everything
else is the same.


Re: Solr highlighting is botching output

2011-01-10 Thread Ahmet Arslan
Replacing the EdgeNGramTokenizerFactory with the

 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="114"/>

combination should solve your problem while preserving your search within
words.

Searching for histo will then return: African American <em>Histo</em>ry
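
For completeness, a minimal sketch of the reworked index-time analyzer.
Lowercasing is moved ahead of the gram filter here so the grams are already
case-folded; the rest of your chain can stay as it was:

    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- grams are now generated per word token, so "econ" still
           matches "economics" and "econometrics" -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="114"/>
      <!-- remaining filters (stopwords, word delimiter, stemming,
           dedup) as in the definition quoted above -->
    </analyzer>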



Re: Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
Awesome, thank you so much! That did the trick.







Re: How to insert this using Solr PHP?

2011-01-10 Thread Ahmet Arslan
 I'm interested in the 'q={!.}' part of the query. Is that a filter query?

It is the local params syntax: http://wiki.apache.org/solr/LocalParams
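
To get that into 'OOP' style with the PECL Solr extension, here is a minimal
sketch. It assumes the {!spatial} query parser plugin is installed on the
server; the lat/long values are the placeholders from the original mail, and
'name:history' stands in for the OTHER COLUMNS part of the hand-built URL:

<?php
// Sketch only: assumes the PECL Solr extension (SolrClient/SolrQuery).
$client = new SolrClient(array(
    'hostname' => 'localhost',   // placeholder for URL_BASE
    'port'     => 8983,
    'path'     => '/solr',
));

$query = new SolrQuery();
// The local-params prefix goes at the front of the main query string;
// it is not a filter query, it selects and configures the query parser.
$query->setQuery('{!spatial lat=xx.x long=xxx.x radius=10 unit=km threadCount=3}name:history');
$query->setStart(0);
$query->setRows(20);
// No wt=json here: the extension handles the response format internally.

$response = $client->query($query);
print_r($response->getResponse());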


  


Solr highlighting is botching output

2011-01-10 Thread Dan Loewenherz
Hi all,

I'm implementing Solr for a course and book search service for college
students, and I'm running into some issues with the highlighting plugin.
After a few minutes of tinkering, searching on Google, and searching the group
archives without finding anything, I thought I would see if anyone else is
having this problem and, if not, what I am doing to cause it.

Basically, the issue is that whenever I turn on highlighting for a certain
field, I get either (1) inconsistent highlights or (2) bizarre highlight
output for some of the results. A few of the results look correct.

Here's my solrconfig.xml: http://pastie.org/private/iz3fd77innxb5r2v63zpa

Broken output: http://pastie.org/private/pyptpektckitp2piqvcgw

As you can see, I searched for history. In the results, where the query is
highlighted, you'll see that the name fields sometimes contain strings such as
<span>History</span><span>History</span><span>History</span>, instead of the
term being highlighted just once.

I don't know enough to understand why Solr would treat "African American
History: From Emancipation to the Present" differently from "African American
Women's History", other than that one title is longer than the other, or why
it would double or quadruple the highlighting in the response. I tried to
figure out which configuration option could change this, to no avail.

If anyone has any input, I would be very grateful. Thank you!

Dan