Weighting categories

2012-02-06 Thread Ramo Karahasan
Hi,

 

I have a table with products and their proper categories. Is it possible to
weight categories, so that a user who searches for apple ipad doesn't get a
magazine about the apple ipad as the first result, but rather the apple ipad hardware?
I'm using DIH for indexing data, but don't know if there is any post-processing step
to weight the categories I have.

 

Thanks,

Ramo



Re: multiple values encountered for non multiValued field type:[text/html, text, html]

2012-02-06 Thread William_Xu
error message:


 org.apache.solr.common.SolrException: ERROR: [http://bbs.dichan.com/] multiple values encountered for non multiValued field type: [text/html, text, html]
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)


--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-type-text-html-text-html-tp3719088p3719103.html
Sent from the Solr - User mailing list archive at Nabble.com.


Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown

Hi,

I need to tokenise on whitespace, full-stop, and comma ONLY.

Currently using solr.WhitespaceTokenizerFactory with 
WordDelimiterFilterFactory but this is also splitting on , /, 
new-line, etc.


It seems such a simple setup - what am I doing wrong? What do you use 
for such normal searching?


Thanks,
Rob

--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Symbols in synonyms

2012-02-06 Thread Robert Brown
Is it good practice, common, or even possible to put symbols in my 
list of synonyms?


I'm having trouble indexing and searching for A&E, with it being 
split on the &.


We already convert .net to dotnet, but don't want to store every 
combination of 2 letters: A&E, M&E, etc.





--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Replication problem on windows

2012-02-06 Thread Rafał Kuć
Hello!

We have Solr running on Windows. Once in a while we see a problem with
replication failing. While slave server replicates the index, it throws
exception like the following:

SEVERE: Unable to copy index file from: 
D:\web\solr\collection\data\index.2011102510\_3s.fdt
to: D:\web\solr\Collection\data\index\_3s.fdt java.io.FileNotFoundException:
D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot 
find the file specified)

We've added commitReserveDuration to the master server
configuration, but it didn't change that situation, the error still
happens once in a while.

Did anyone encounter such error ?

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch



Re: multiple values encountered for non multiValued field type:[text/html, text, html]

2012-02-06 Thread tamanjit.bin...@yahoo.co.in
Hi, I am not sure if what you are doing is possible, i.e. having a schema other
than that provided by Nutch. The schema provided by Nutch in its
nutch-dir\conf directory is to be used as the Solr schema.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-type-text-html-text-html-tp3719088p3719253.html
Sent from the Solr - User mailing list archive at Nabble.com.


Phonetic search and matching

2012-02-06 Thread Dirk Högemann
Hi,

I have a question on phonetic search and matching in solr.
In our application all the content of an article is written to a full-text
search field, which provides stemming and a phonetic filter (cologne
phonetic for german).
This is the relevant part of the configuration for the index analyzer
(search is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this results sometimes in strange, but also explainable,
matches.
For example:

The content field indexes the following string: "Donnerstag von 13 bis 17 Uhr".

This results in a match if we search for "puf", as the result of the
phonetic filter for this is "13".
(As a consequence, the "13" is then also highlighted.)

Does anyone have an idea how to handle this in a reasonable way, so that a
search for "puf" does not match "13" in the content?

Thanks in advance!

Dirk


Improving performance for SOLR geo queries?

2012-02-06 Thread Matthias Käppler
Hi,

we need to perform fast geo lookups on an index of ~13M places, and
were running into performance problems here with SOLR. We haven't done
a lot of query optimization / SOLR tuning up until now so there's
probably a lot of things we're missing. I was wondering if you could
give me some feedback on the way we do things, whether they make
sense, and especially why a supposed optimization we implemented
recently seems to have no effect, when we actually thought it would
help a lot.

What we do is this: our API is built on a Rails stack and talks to
SOLR via a Ruby wrapper. We have a few filters that almost always
apply, which we put in filter queries. Filter cache hit rate is
excellent, about 97%, and cache size caps at 10k filters (max size is
32k, but it never seems to reach that many, probably because we
replicate / delta update every few minutes). Still, geo queries are
slow, about 250-500msec on average. We send them with cache=false, so
as to not flood the fq cache and cause undesirable evictions.

Now our idea was this: while the actual geo queries are poorly
cacheable, we could clearly identify geographical regions which are
more often queried than others (naturally, since we're a user driven
service). Therefore, we dynamically partition Earth into a static grid
of overlapping boxes, where the grid size (the distance of the nodes)
depends on the maximum allowed search radius. That way, for every user
query, we would always be able to identify a single bounding box that
covers it. This larger bounding box (200km edge length) we would send
to SOLR as a cached filter query, along with the actual user query
which would still be sent uncached. Ex:

User asks for places in 10km around 49.14839,8.5691, then what we will
send to SOLR is something like this:

fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
fq={!bbox cache=true d=100.0 sfield=location_ll
pt=49.4684836290799,8.31165802979391} -- this one we derive
automatically

That way SOLR would intersect the two filters and return the same
results as when only looking at the smaller bounding box, but keep the
larger box in cache and speed up subsequent geo queries in the same
regions. Or so we thought; unfortunately this approach did not help
query execution times get better, at all.
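
For illustration, a minimal SolrJ sketch of sending the two filter queries above (the Solr URL and rows value are assumed, not taken from this setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GeoFilterSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        // User-specific 10 km filter, deliberately kept out of the filter cache.
        q.addFilterQuery("{!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}");
        // Grid-cell filter derived in the app layer; cached and reused across users.
        q.addFilterQuery("{!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}");
        q.setRows(20);

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}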

Question is: why does it not help? Shouldn't it be faster to search on
a cached bbox with only a few hundred thousand places? Is it a good
idea to make these kinds of optimizations in the app layer (we do this
as part of resolving the SOLR query in Ruby), and does it make sense
at all? We're not sure what kind of optimizations SOLR already does in
its query planner. The documentation is (sorry) miserable, and
debugQuery yields no insight into which optimizations are performed.
So this has been a hit and miss game for us, which is very ineffective
considering that it takes considerable time to build these kinds of
optimizations in the app layer.

Would be glad to hear your opinions / experience around this.

Thanks!

-- 
Matthias Käppler
Lead Developer API & Mobile

Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com

Managing Director: Ian Brotherston
Amtsgericht Hamburg
HRB 95913

This e-mail and its attachments may contain confidential and/or
privileged information. If you are not the intended recipient (or have
received this e-mail in error) please notify the sender immediately
and destroy this e-mail and its attachments. Any unauthorized copying,
disclosure or distribution of this e-mail and  its attachments is
strictly forbidden. This notice also applies to future messages.


Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Ahmet Arslan
 I need to tokenise on whitespace, full-stop, and comma
 ONLY.
 
 Currently using solr.WhitespaceTokenizerFactory with
 WordDelimiterFilterFactory but this is also splitting on
 , /, new-line, etc.

WDF is customizable via types=wdftypes.txt parameter. 

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/wdftypes.txt

Alternatively you can convert . and , to whitespace (before tokenizer) by 
MappingCharFilterFactory. 

http://lucene.apache.org/solr/api/org/apache/solr/analysis/MappingCharFilterFactory.html
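
For illustration, a minimal Lucene-level sketch (Lucene 3.x API assumed) of what mapping '.' and ',' to whitespace ahead of a whitespace tokenizer does; in Solr you would configure this with MappingCharFilterFactory and a mapping file rather than in code:

import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MappingSketch {
    public static void main(String[] args) throws Exception {
        // Map '.' and ',' to a space before the tokenizer ever sees the text.
        NormalizeCharMap map = new NormalizeCharMap();
        map.add(".", " ");
        map.add(",", " ");

        MappingCharFilter mapped = new MappingCharFilter(map, CharReader.get(new StringReader("foo.bar,baz qux")));
        WhitespaceTokenizer tokens = new WhitespaceTokenizer(Version.LUCENE_35, mapped);
        CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
        tokens.reset();
        while (tokens.incrementToken()) {
            System.out.println(term.toString()); // prints foo, bar, baz, qux
        }
        tokens.close();
    }
}

Since a char filter corrects token offsets back to the original text, highlighting should still line up with the unmapped input.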


Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown
My fear is what will then happen with highlighting if I use re-mapping?



On Mon, 6 Feb 2012 03:33:03 -0800 (PST), Ahmet Arslan
iori...@yahoo.com wrote:
 I need to tokenise on whitespace, full-stop, and comma
 ONLY.

 Currently using solr.WhitespaceTokenizerFactory with
 WordDelimiterFilterFactory but this is also splitting on
 , /, new-line, etc.
 
 WDF is customizable via types=wdftypes.txt parameter. 
 
 https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/wdftypes.txt
 
 Alternatively you can convert . and , to whitespace (before
 tokenizer) by MappingCharFilterFactory.
 
 http://lucene.apache.org/solr/api/org/apache/solr/analysis/MappingCharFilterFactory.html



Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Ahmet Arslan

 My fear is what will then happen with
 highlighting if I use re-mapping?

What do you mean by re-mapping?


Re: Which Tokeniser (and/or filter)

2012-02-06 Thread Robert Brown
Mapping dots to spaces. I don't think that's workable anyway, since
.net would cause issues.

Trying out the wdftypes now...


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Mon, 6 Feb 2012 04:10:18 -0800 (PST), Ahmet Arslan
iori...@yahoo.com wrote:
 My fear is what will then happen with
 highlighting if I use re-mapping?
 
 What do you mean by re-mapping?



multiple values encountered for non multiValued field type:[text/html, text, html]

2012-02-06 Thread William_Xu
Hi everyone:

When I index my crawl results from some BBS site with Solr, I get
that error. Is there someone who could help me?

My Solr schema is:

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
  <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
  <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

  <field name="weight" type="float" indexed="true" stored="true"/>
  <field name="price" type="float" indexed="true" stored="true"/>
  <field name="popularity" type="int" indexed="true" stored="true"/>
  <field name="inStock" type="boolean" indexed="true" stored="true"/>

  <field name="store" type="location" indexed="true" stored="true"/>

  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="subject" type="text_general" indexed="true" stored="true"/>
  <field name="description" type="text_general" indexed="true" stored="true"/>
  <field name="comments" type="text_general" indexed="true" stored="true"/>
  <field name="author" type="text_general" indexed="true" stored="true"/>
  <field name="keywords" type="text_general" indexed="true" stored="true"/>
  <field name="category" type="text_general" indexed="true" stored="true"/>
  <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="last_modified" type="date" indexed="true" stored="true"/>
  <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="content" type="textMaxWord" indexed="true" stored="true" multiValued="true"/>
  <field name="cache_content" type="text_cache" indexed="false" stored="true"/>
  <field name="segment" type="string" indexed="false" stored="true"/>
  <field name="boost" type="float" indexed="true" stored="true"/>
  <field name="digest" type="string" indexed="false" stored="true"/>
  <field name="host" type="string" indexed="true" stored="false"/>
  <field name="cache" type="string" indexed="true" stored="false"/>
  <field name="site" type="string" indexed="true" stored="false"/>
  <field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
  <field name="tstamp" type="string" indexed="true" stored="true"/>
  <field name="date" type="date" indexed="true" stored="true"/>
  <field name="type" type="string" indexed="true" stored="true"/>

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="simple" type="textSimple" indexed="true" stored="true"/>
  <field name="complex" type="textComplex" indexed="true" stored="true"/>

  <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

  <field name="manu_exact" type="string" indexed="true" stored="false"/>

  <field name="payloads" type="payloads" indexed="true" stored="true"/>

  <dynamicField name="*_i"   type="int"          indexed="true" stored="true"/>
  <dynamicField name="*_s"   type="string"       indexed="true" stored="true"/>
  <dynamicField name="*_l"   type="long"         indexed="true" stored="true"/>
  <dynamicField name="*_t"   type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_b"   type="boolean"      indexed="true" stored="true"/>
  <dynamicField name="*_f"   type="float"        indexed="true" stored="true"/>
  <dynamicField name="*_d"   type="double"       indexed="true" stored="true"/>

  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

  <dynamicField name="*_dt" type="date"     indexed="true" stored="true"/>
  <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

  <dynamicField name="*_ti"  type="tint"    indexed="true" stored="true"/>
  <dynamicField name="*_tl"  type="tlong"   indexed="true" stored="true"/>
  <dynamicField name="*_tf"  type="tfloat"  indexed="true" stored="true"/>
  <dynamicField name="*_td"  type="tdouble" indexed="true" stored="true"/>
  <dynamicField name="*_tdt" type="tdate"   indexed="true" stored="true"/>

  <dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>

  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
  <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

  <dynamicField name="random_*" type="random"/>

 </fields>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-type-text-html-text-html-tp3719088p3719088.html
Sent from the Solr - User mailing list archive at Nabble.com.


Searching context within a book

2012-02-06 Thread pistacchio
I'm very new to Solr and I'm evaluating it. My task is to look for words
within a corpus of books and return them within a small context. So far, I'm
storing the books in a database split by paragraphs (slicing the books by
line breaks); I do a full-text search and return the row.

In Solr, would I have to do the same, or can I add the whole book (in .txt
format) and, whenever a match is found, return something like the match plus
100 words before and 100 words after or something like that? Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-context-within-a-book-tp3718997p3718997.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen

See response below

Erick Erickson skrev:

Unfortunately, the answer is it depends(tm).

First question: How are you indexing things? SolrJ? post.jar?
  

SolrJ, CommonsHttpSolrServer

But some observations:

1 sure, using multiple cores will have some parallelism. So will
using a single core but using something like SolrJ and
StreamingUpdateSolrServer.
So SolrJ with CommonsHttpSolrServer will not support handling several 
requests concurrently?

 Especially with trunk (4.0)
 and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on 
this Document Writer Per Thread stuff. A link or something?

 In 3.x, you'll
 see some pauses when segments are merged that you
 can't get around (per core). See:
 
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
 for an excellent writeup. But whether or not you use several
 cores should be determined by your problem space, certainly
 not by trying to increase the throughput. Indexing usually
 take a back seat to search performance.
  

We will have few searches, but a lot of indexing.

2 general settings are hard to come by. If you're sending
  structured documents that use Tika to parse the data
  behind the scenes, your performance will be much
  different (slower) than sending SolrInputDocuments
 (SolrJ).
  

We are sending SolrInputDocuments

3 The recommended servlet container is, generally,
  The one you're most comfortable with. Tomcat is
  certainly popular. That said, use whatever you're
  most comfortable with until you see a performance
 problem. Odds are you'll find your load on Solr is a
  at its limit before your servlet container has problems.
  

So Jetty is not just an easy-to-use but non-performant container?

4 Monitor you CPU, fire more requests at it until it
 hits 100%. Note that there are occasions where the
servlet container limits the number of outstanding
 requests it will allow and queues ones over that
 limit (find the magic setting to increase this if it's a
 problem, it differs by container). If you start to see
 your response times lengthen but the CPU not being
fully utilized, that may be the cause.
  
Actually right now, I am trying to find out what my bottleneck is. The 
setup is more complex than I would bother you with, but basically I 
have servers with 80-90% IO-wait and only 5-10% real CPU usage. It 
might not be a Solr-related problem - I am investigating different 
things - but I just wanted to know a little more about how Jetty/Solr works 
in order to make a qualified guess.

5 How high is high performance? On a stock solr
 with the Wikipedia dump (11M docs), all running on
 my laptop, I see 7K docs/sec indexed. I know of
 installations that see 60 docs/sec or even less. I'm
sending simple docs with SolrJ locally and they're
 sending huge documents over the wire that Tika
 handles. There are just so many variables it's hard
 to say anything except try it and see..
  
Well, eventually we need to be able to index and delete about 50 million 
documents per day. We will need to keep a history of 2 years of data 
in our system; deletion will not start before we have been in production 
for 2 years. At that point in time the system needs to contain 2 years * 
365 days/year * 50 million docs/day = 36.5 billion documents. At that point 
50 million documents need to be deleted and indexed per day - before that we 
only need to index 50 million documents per day. We are aware that we are 
probably going to need a certain amount of hardware for this, but the most 
important thing is that we make a scalable setup, so that we can get to 
this kind of numbers at all. Right now I am focusing on getting the most out 
of one Solr instance, potentially with several cores, though.

Best
Erick

On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote:
  

Hi

This topic has probably been covered before, but I haven't had the luck to
find the answer.

We are running Solr instances with several cores inside. Solr is running
out-of-the-box on top of Jetty. I believe Jetty receives all the
HTTP requests about indexing new documents and forwards them to the Solr
engine. What kind of parallelism does this setup provide? Can more than one
index request get processed concurrently? How many? How do I increase the
number of index requests that can be handled in parallel? Will I get better
parallelism by running on another web container than Jetty - e.g. Tomcat?
What is the recommended web container for high-performance production
systems?

Thanks!

Regards, Per Steffensen



  




Re: effect of continuous deletes on index's read performance

2012-02-06 Thread Erick Erickson
Your continuous deletes won't affect performance
noticeably, that's true.

But you're really doing bad things with the commit after every
add or delete. You haven't said whether you have a master/
slave setup or not, but assuming you're searching on
the same machine you're indexing to, each time you commit,
you're forcing the underlying searcher to close and re-open and
any attendant autowarming to occur. All to get a single
document searchable. 20 times a second. If you have a master/
slave setup, you're forcing the slave to fetch the changed
parts of the index every time it polls, which is better than
what's happening on the master, but still rather often.

400K documents isn't very big by Solr standards, so unless
you can show performance problems, I wouldn't be concerned
about index size, as Otis says, your per-document commit
is probably hurting you far more than any index size
savings.

I'd actually think carefully about whether you need even
10 second commits. If you can stretch that out to minutes,
so much the better. But it all depends upon your problem
space.
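
For illustration, a minimal SolrJ sketch (3.x API and field names assumed) of letting the server handle commits via commitWithin instead of committing after every single add:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");            // illustrative field names
        doc.addField("text", "hello world");

        // Ask Solr to make the document visible within 10 seconds,
        // instead of issuing an explicit commit after every add.
        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(10000);
        req.process(solr);
    }
}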

Best
Erick


On Mon, Feb 6, 2012 at 2:59 AM, prasenjit mukherjee
prasen@gmail.com wrote:
 Thanks Otis. commitWithin  will definitely work for me ( as I
 currently am using 3.4 version, which doesnt have NRT yet ).

 Assuming that I use commitWithin=10secs, are you saying that the
 continuous deletes ( without commit ) wont have any affect on
 performance ?
 I was under the impression that deletes just mark the doc-ids (
 essentially means that the index size will remain the same ) , but
 wont actually do the compaction till someone calls optimize/commit, is
 my assumption  not true ?

 -Thanks,
 Prasenjit

 On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hi Prasenjit,

 It sounds like at this point your main enemy might be those per-doc-add 
 commits.  Don't commit until you need to see your new docs in results.  And 
 if you need NRT then use softCommit option with Solr trunk 
 (http://search-lucene.com/?q=softcommitfc_project=Solr) or use commitWithin 
 to limit commit's performance damage.


  Otis

 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: prasenjit mukherjee prasen@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, February 6, 2012 1:17 AM
Subject: effect of continuous deletes on index's read performance

I have a use case where documents are continuously added @ 20 docs/sec
( each doc add is also doing a commit )  and docs continuously getting
deleted at the same rate. So the searchable index size remains the
same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6).

Will it have pauses when deletes triggers compaction. Or with every
commits ( while adds ) ? How bad they will effect on search response
time.

-Thanks,
Prasenjit





Re: effect of continuous deletes on index's read performance

2012-02-06 Thread Nagendra Nagarajayya
You could also try Solr 3.4 with RankingAlgorithm as this offers NRT.  
You can get more information about NRT for Solr 3.4 from here:


http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/5/2012 11:59 PM, prasenjit mukherjee wrote:

Thanks Otis. commitWithin  will definitely work for me ( as I
currently am using 3.4 version, which doesnt have NRT yet ).

Assuming that I use commitWithin=10secs, are you saying that the
continuous deletes ( without commit ) wont have any affect on
performance ?
I was under the impression that deletes just mark the doc-ids (
essentially means that the index size will remain the same ) , but
wont actually do the compaction till someone calls optimize/commit, is
my assumption  not true ?

-Thanks,
Prasenjit

On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com  wrote:

Hi Prasenjit,

It sounds like at this point your main enemy might be those per-doc-add commits.  Don't 
commit until you need to see your new docs in results.  And if you need NRT then use 
softCommit option with Solr trunk 
(http://search-lucene.com/?q=softcommitfc_project=Solr) or use commitWithin to limit 
commit's performance damage.


  Otis


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html





From: prasenjit mukherjeeprasen@gmail.com
To: solr-usersolr-user@lucene.apache.org
Sent: Monday, February 6, 2012 1:17 AM
Subject: effect of continuous deletes on index's read performance

I have a use case where documents are continuously added @ 20 docs/sec
( each doc add is also doing a commit )  and docs continuously getting
deleted at the same rate. So the searchable index size remains the
same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6).

Will it have pauses when deletes triggers compaction. Or with every
commits ( while adds ) ? How bad they will effect on search response
time.

-Thanks,
Prasenjit









Re: effect of continuous deletes on index's read performance

2012-02-06 Thread prasenjit mukherjee
Pardon my ignorance, but why can't the IndexWriter and IndexSearcher share
the same underlying in-memory data structure, so that the IndexSearcher need
not be reopened with every commit?


On 2/6/12, Erick Erickson erickerick...@gmail.com wrote:
 Your continuous deletes won't affect performance
 noticeably, that's true.

 But you're really doing bad things with the commit after every
 add or delete. You haven't said whether you have a master/
 slave setup or not, but assuming you're searching on
 the same machine you're indexing to, each time you commit,
 you're forcing the underlying searcher to close and re-open and
 any attendant autowarming to occur. All to get a single
 document searchable. 20 times a second. If you have a master/
 slave setup, you're forcing the slave to fetch the changed
 parts of the index every time it polls, which is better than
 what's happening on the master, but still rather often.

 400K documents isn't very big by Solr standards, so unless
 you can show performance problems, I wouldn't be concerned
 about index size, as Otis says, your per-document commit
 is probably hurting you far more than any index size
 savings.

 I'd actually think carefully about whether you need even
 10 second commits. If you can stretch that out to minutes,
 so much the better. But it all depends upon your problem
 space.

 Best
 Erick


 On Mon, Feb 6, 2012 at 2:59 AM, prasenjit mukherjee
 prasen@gmail.com wrote:
 Thanks Otis. commitWithin  will definitely work for me ( as I
 currently am using 3.4 version, which doesnt have NRT yet ).

 Assuming that I use commitWithin=10secs, are you saying that the
 continuous deletes ( without commit ) wont have any affect on
 performance ?
 I was under the impression that deletes just mark the doc-ids (
 essentially means that the index size will remain the same ) , but
 wont actually do the compaction till someone calls optimize/commit, is
 my assumption  not true ?

 -Thanks,
 Prasenjit

 On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hi Prasenjit,

 It sounds like at this point your main enemy might be those per-doc-add
 commits.  Don't commit until you need to see your new docs in results.
 And if you need NRT then use softCommit option with Solr trunk
 (http://search-lucene.com/?q=softcommitfc_project=Solr) or use
 commitWithin to limit commit's performance damage.


  Otis

 
 Performance Monitoring SaaS for Solr -
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: prasenjit mukherjee prasen@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, February 6, 2012 1:17 AM
Subject: effect of continuous deletes on index's read performance

I have a use case where documents are continuously added @ 20 docs/sec
( each doc add is also doing a commit )  and docs continuously getting
deleted at the same rate. So the searchable index size remains the
same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6).

Will it have pauses when deletes triggers compaction. Or with every
commits ( while adds ) ? How bad they will effect on search response
time.

-Thanks,
Prasenjit





-- 
Sent from my mobile device


Re: effect of continuous deletes on index's read performance

2012-02-06 Thread Michael McCandless
On Mon, Feb 6, 2012 at 8:20 AM, prasenjit mukherjee
prasen@gmail.com wrote:

 Pardon my ignorance, Why can't the IndexWriter and IndexSearcher share
 the same underlying in-memory datastructure so that IndexSearcher need
 not be reopened with every commit.

Because the semantics of an IndexReader in Lucene guarantee an
unchanging point-in-time view of the index, as of when that
IndexReader was opened.

That said, Lucene has near-real-time readers, which keep point-in-time
semantics but are very fast to open after adding/deleting docs, and do
not require a (costly) commit.  EG see my blog post:


http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html

The tests I ran there indexed at a highish rate (~1000 1KB sized docs
per second, or 1 MB plain text per second, or ~2X Twitter's peak rate,
at least as of last July), and the reopen latency was fast (~ 60
msec).  Admittedly this was a fast machine, and the index was on a
good SSD, and I used NRTCachingDir and MemoryCodec for the id field.

But net/net Lucene's NRT search is very fast.  It should easily handle
your 20 docs/second rate, unless your docs are enormous
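
For illustration, a minimal Lucene 3.5-style sketch (API usage assumed) of opening an NRT reader from the IndexWriter and cheaply refreshing it, with no commit involved:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        // NRT reader: a point-in-time view taken straight from the writer, no commit needed.
        IndexReader reader = IndexReader.open(writer, true);

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // Cheap refresh: returns a new reader that sees the added document,
        // or null if nothing has changed since the old reader was opened.
        IndexReader newReader = IndexReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
            reader.close();
            reader = newReader;
        }
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("docs visible: " + searcher.getIndexReader().numDocs());

        searcher.close();
        reader.close();
        writer.close();
    }
}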

Solr trunk has finally cutover to using these APIs, but unfortunately
this has not been backported to Solr 3.x.  You might want to check out
ElasticSearch, an alternative to Solr, which does use Lucene's NRT
APIs

Mike McCandless

http://blog.mikemccandless.com


Is Solr waiting for data to arrive

2012-02-06 Thread Per Steffensen

Hi

I have a setup where a lot is going on, but where there is about 80-90% 
IO-wait (%wa in top). I have a suspicion that this is due to slow 
networking. I would like someone to help me interpret thread dumps 
(retrieved using kill -3).


Whenever I do threaddumps I see that most threads have this stacktrace:
2036752846@qtp-1221696456-205 prio=10 tid=0x7f8f50102000 
nid=0x3a31 runnable [0x7f90908e3000]

  java.lang.Thread.State: RUNNABLE
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at org.mortbay.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:382)
   at org.mortbay.io.bio.StreamEndPoint.fill(StreamEndPoint.java:114)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.fill(SocketConnector.java:198)

   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
   at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


I'm not sure if this indicates
1) that they are just hanging around waiting for the next request (doing 
nothing), or
2) that a request has been initiated and that the 
thread is waiting to receive the data.
I guess if it is 1) I haven't confirmed my suspicion, but if it is 2) I 
probably have. Can anyone help me with the interpretation? Thanks!


Regards, Per Steffensen


Re: Replication problem on windows

2012-02-06 Thread Shawn Heisey

On 2/6/2012 3:04 AM, Rafał Kuć wrote:

Hello!

We have Solr running on Windows. Once in a while we see a problem with
replication failing. While slave server replicates the index, it throws
exception like the following:

SEVERE: Unable to copy index file from: 
D:\web\solr\collection\data\index.2011102510\_3s.fdt
to: D:\web\solr\Collection\data\index\_3s.fdt java.io.FileNotFoundException:
D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot 
find the file specified)

We've added commitReserveDuration to the master server
configuration, but it didn't change that situation, the error still
happens once in a while.

Did anyone encounter such error ?


I found another old mailing list entry by searching google for your 
error message without filename/path.


It looked like they solved it by adding/updating the following config 
line, found in solrconfig.xml in <deletionPolicy>, which is found in 
<mainIndex>. Increasing that number will increase the on-disk size of the 
index on your master server.


<str name="maxCommitsToKeep">2</str>

The directory paths in their error messages look exactly like yours, 
down to the difference in case between the from and to strings, so I 
fear that I am pointing you at information that you already have.


http://www.xkcd.com/979/

Thanks,
Shawn



Re: Replication problem on windows

2012-02-06 Thread Rafał Kuć
Hello!

Thanks for the answer Shawn.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch


 On 2/6/2012 3:04 AM, Rafał Kuć wrote:
 Hello!

 We have Solr running on Windows. Once in a while we see a problem with
 replication failing. While slave server replicates the index, it throws
 exception like the following:

 SEVERE: Unable to copy index file from: 
 D:\web\solr\collection\data\index.2011102510\_3s.fdt
 to: D:\web\solr\Collection\data\index\_3s.fdt java.io.FileNotFoundException:
 D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot 
 find the file specified)

 We've added commitReserveDuration to the master server
 configuration, but it didn't change that situation, the error still
 happens once in a while.

 Did anyone encounter such error ?

 I found another old mailing list entry by searching google for your 
 error message without filename/path.

 It looked like they solved it by adding/updating the following config 
 line, found in solrconfig.xml in deletionPolicy, which is found in 
 mainIndex.  Increasing that number will increase the on-disk size of the
 index on your master server.

 str name=maxCommitsToKeep2/str

 The directory paths in their error messages look exactly like yours, 
 down the the difference in case between the from and to strings, so I 
 fear that I am pointing you at information that you already have.

 http://www.xkcd.com/979/

 Thanks,
 Shawn







Re: Searching context within a book

2012-02-06 Thread Robert Stewart
You are probably better off splitting up each book into separate SOLR 
documents, one document per paragraph (each document with same book ID,  ISBN, 
etc.).  Then you can use field-collapsing on the book ID to return a single 
document per book.  And you can use highlighting to show the paragraph that 
matched the query.
You will need to store the full-text in SOLR in order to use highlighting 
feature and/or to return the text in the search results.
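
For illustration, a minimal SolrJ sketch of such a query (the Solr URL and the book_id/text field names are assumed, not prescribed):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BookSearchSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("text:whale");   // one Solr document per paragraph
        q.set("group", true);                        // collapse results to one entry per book
        q.set("group.field", "book_id");
        q.setHighlight(true);                        // return the matching paragraph as a snippet
        q.addHighlightField("text");
        q.setHighlightSnippets(1);

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse());       // grouped hits plus highlighting section
    }
}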


On Feb 6, 2012, at 2:13 AM, pistacchio wrote:

 I'm very new to Solr and I'm evaluating it. My task is to look for words
 within a corpus of books and return them within a small context. So far, I'm
 storing the books in a database split by paragraphs (slicing the books by
 line breaks), I do a fulltext search and return the row.
 
 In Solr, would I have to do the same, or can I add the whole book (in .txt
 format) and, whenever a match is found, return something like the match plus
 100 words before and 100 words after or something like that? Thanks
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Searching-context-within-a-book-tp3718997p3718997.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Parallel indexing in Solr

2012-02-06 Thread Erick Erickson
Right. See below.

On Mon, Feb 6, 2012 at 7:53 AM, Per Steffensen st...@designware.dk wrote:
 See response below

 Erick Erickson skrev:

 Unfortunately, the answer is it depends(tm).

 First question: How are you indexing things? SolrJ? post.jar?


 SolrJ, CommonsHttpSolrServer

 But some observations:

 1 sure, using multiple cores will have some parallelism. So will
    using a single core but using something like SolrJ and
    StreamingUpdateSolrServer.

 So SolrJ with CommonsHttpSolrServer will not support handling several
 requests concurrently?


Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with
a different constructor.
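
For illustration, a minimal sketch of that drop-in swap, batching documents as well (URL, queue size, thread count, and field names are assumptions):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexSketch {
    public static void main(String[] args) throws Exception {
        // queueSize and threadCount control how many update requests are buffered
        // and sent to Solr concurrently.
        SolrServer solr = new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);          // illustrative fields
            doc.addField("text", "document body " + i);
            batch.add(doc);
            if (batch.size() == 500) {               // send documents in batches, not one by one
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
    }
}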

  Especially with trunk (4.0)
     and the Document Writer Per Thread stuff.

 We are using trunk (4.0). Can you provide me with a little more info on this
 Document Writer Per Thread stuff. A link or something?


I already did, follow the link I provided.

  In 3.x, you'll
     see some pauses when segments are merged that you
     can't get around (per core). See:

 http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
     for an excellent writeup. But whether or not you use several
     cores should be determined by your problem space, certainly
     not by trying to increase the throughput. Indexing usually
     take a back seat to search performance.


 We will have few searches, but a lot of indexing.


Hmmm, this is the inverse of most installations, so it's good to know.

 2 general settings are hard to come by. If you're sending
      structured documents that use Tika to parse the data
      behind the scenes, your performance will be much
      different (slower) than sending SolrInputDocuments
     (SolrJ).


 We are sending SolrInputDocuments

 3 The recommended servlet container is, generally,
      The one you're most comfortable with. Tomcat is
      certainly popular. That said, use whatever you're
      most comfortable with until you see a performance
     problem. Odds are you'll find your load on Solr is a
      at its limit before your servlet container has problems.


 So Jetty in not a easy to use, but non-performance-container?


Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents at Solr; the container
is doing very little work. You are batching up your Solr documents,
aren't you?

 4 Monitor you CPU, fire more requests at it until it
     hits 100%. Note that there are occasions where the
    servlet container limits the number of outstanding
     requests it will allow and queues ones over that
     limit (find the magic setting to increase this if it's a
     problem, it differs by container). If you start to see
     your response times lengthen but the CPU not being
    fully utilized, that may be the cause.


 Actually right now, I am trying to find our what my bottleneck is. The setup
 is more complex, than I would bother you with, but basically I have servers
 with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
 Solr-related problem, I am investigating different things, but just wanted
 to know a little more about how Jetty/Solr works in order to make a
 qualified guess.

You should see this differ with StreamingUpdateSolrServer assuming your
client can feed documents fast enough. You can consider having multiple
clients feed the same solr indexer if necessary.



 5 How high is high performance? On a stock solr
     with the Wikipedia dump (11M docs), all running on
     my laptop, I see 7K docs/sec indexed. I know of
     installations that see 60 docs/sec or even less. I'm
    sending simple docs with SolrJ locally and they're
     sending huge documents over the wire that Tika
     handles. There are just so many variables it's hard
     to say anything except try it and see..


 Well eventaually we need to be able to index and delete about 50mio
 documents per day. We will need to keep a history of 2 years of data in
 our system, deletion will not start before we have been in production for 2
 years. At that point in time the system needs to contain 2 year * 365
 days/year * 50mio docs/day = 36,5billion documents. At that point 50mio
 documents need to be deleted and index per day - before that we only need to
 index 50mio documents per day. We are aware that we are probably going to
 need a certain amout of hardware for this, but most important thing is that
 we make a scalable setup so that we can get to this kind of numbers at all.
 Right now I am focusing on getting most out of one Solr instance potentially
 with several cores, though.

My off-the-top-of-my-head feeling is that this will be a LOT of hardware. You'll
without doubt be sharding the index. NOTE: Shards are cores, just special
purpose ones, i.e. they all use the same schema. When Solr folks see cores,
we assume several cores that may have different schemas and handle
unrelated queries. It sounds like you're talking about a sharded system rather
than independent cores, is that so?

Re: Parallel indexing in Solr

2012-02-06 Thread Sami Siren
On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:


 Actually right now, I am trying to find our what my bottleneck is. The setup
 is more complex, than I would bother you with, but basically I have servers
 with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
 Solr-related problem, I am investigating different things, but just wanted
 to know a little more about how Jetty/Solr works in order to make a
 qualified guess.

What kind of/how many discs do you have for your shards? ..also what
kind of server are you experimenting with?

--
 Sami Siren


Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen



So SolrJ with CommonsHttpSolrServer will not support handling several
requests concurrently?




Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with
a different constructor.
  
I will try to do that. It is a little bit difficult for me, as we are 
actually not dealing with Solr ourselves. We are using Lily, but I will 
modify Lily, compile, and try to see how it goes.
  

 Especially with trunk (4.0)
and the Document Writer Per Thread stuff.
  

We are using trunk (4.0). Can you provide me with a little more info on this
Document Writer Per Thread stuff. A link or something?




I already did, follow the link I provided.
  

Ah, ok, I didn't get it the first time that the link below was about that.



http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/

  

So Jetty in not a easy to use, but non-performance-container?




Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents at Solr, the container
is doing very little work. You are batching up your Solr documents
aren't you?
  
Haven't looked into Lily to see whether or not documents are batched, but 
I will. I didn't expect Jetty to be the problem; basically I just wanted to 
know that it was not a stupid everything-in-a-single-thread container, 
almost designed to not perform (because the focus might be different, 
e.g. providing an easy-to-use/understand container for testing etc.).
  

Actually right now, I am trying to find our what my bottleneck is.



You should see this differ with StreamingUpdateSolrServer assuming your
client can feed documents fast enough. You can consider having multiple
clients feed the same solr indexer if necessary.
  

Thanks!



5 How high is high performance? On a stock solr
with the Wikipedia dump (11M docs), all running on
my laptop, I see 7K docs/sec indexed. I know of
installations that see 60 docs/sec or even less. I'm
   sending simple docs with SolrJ locally and they're
sending huge documents over the wire that Tika
handles. There are just so many variables it's hard
to say anything except try it and see..

  

50mio documents need to be deleted and indexed per day. 2 years history = 36 
billion docs in store



My off-the-top-of-my-head feeling is that this will be a LOT of hardware.
Well, it takes what it takes. Someone else will buy the hardware. My 
first concern is to make sure we have a system that scales, so that we 
can buy our way out of problems by buying more hardware. On the other hand, 
of course, I want to provide a system that makes the most of the hardware.

 You'll
without doubt be sharding the index. NOTE: Shards are cores, just special
purpose ones, i.e. they're all use the same schema. When Solr folks see cores,
we assume that the several cores that may have different schemas and handle
unrelated queries. It sounds like you're talking about a sharded system rather
than independent cores, is that so?
  
Yes that is correct. We only have one single schema/config shared by all 
cores through ZK. So the many cores are just for sharding, because I do 
not expect that it will work very well with 20 billion docs in the same 
core/shard :-)

You should have no trouble indexing 50M documents/day, even assuming that the
ingestion rate is not evenly distributed. The link I referenced talks
about indexing 10M documents in a little over 6 minutes. YMMV however. I think
you're going along the right path when trying to push a single indexer to
the max. My setup uses Jetty and is getting 5-7K docs/second so I doubt it's
inherently a Jetty problem, although there may be configuration tweaks getting
in your way.

Bottom line: I doubt it's a Jetty issue at this point but I've been
wrong on too many
occasions to count. I'd be looking other places first though. Start
with the streaming
update solr server though, and also whether your clients can spit out documents
fast enough...
  

I will have a look at all that. Thanks!

Best
Erick
  




Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen

Sami Siren skrev:

On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:


  

Actually right now, I am trying to find our what my bottleneck is. The setup
is more complex, than I would bother you with, but basically I have servers
with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
Solr-related problem, I am investigating different things, but just wanted
to know a little more about how Jetty/Solr works in order to make a
qualified guess.



What kind of/how many discs do you have for your shards? ..also what
kind of server are you experimenting with?
  
Grrr, that's where I have a little fight with operations. For now they 
gave me one (fairly big) machine with XenServer. I create my machines 
as Xen VMs on top of that. One of the things I don't like about this 
(besides that I don't trust Xen to do its virtualization right, or at 
least not to provide me with correct readings on IO) is that disk space is 
assigned from an iSCSI-connected SAN that they all share (including the 
line out there). But for now it actually doesn't look like disk IO 
problems. It looks like network bottlenecks (but to some extent they 
all also share the network) among all the components in our setup - our 
client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc). Well it 
is complex, but anyways ...

--
 Sami Siren

  




Commit call - ReadTimeoutException - usage scenario for big update requests and the ioexception case

2012-02-06 Thread Torsten Krah
Hi,

I wonder if it is possible to commit data to Solr without having to
catch SocketReadTimeout exceptions.

I am calling commit(false, false) using a streaming server instance -
but I still have to wait 30 seconds and catch the timeout from the HTTP
method.
It does not matter if it's 30 or 60; it will fail depending on how long it
takes until the update request is processed - or can I tweak things here?

So what's the way to go here? Is there any other option, or must I catch those
exceptions and go on like I do now?
The operation itself does finish successfully - later on, when it's done -
on the server side, and all the data is committed and searchable.


regards

Torsten




Re: SolrCell maximum file size

2012-02-06 Thread Augusto Camarotti
Thanks for the tips Erick. I'm really talking about 2.5 GB files full of data to 
be indexed, like .csv, .xls or .ods files and so on. I guess I will try a 
large increase in the memory the JVM will be able to use. 
 
Regards,
 
Augusto

 Erick Erickson erickerick...@gmail.com 1/27/2012 1:22 pm 
Hmmm, I'd go considerably higher than 2.5G. Problem is, the Tika
processing will need memory; I have no idea how much. Then you'll
have a bunch of stuff for Solr to index, etc.

But I also suspect that this will be about useless to index (assuming
you're talking lots of data, not say just the meta-data associated
with a video or something). How do you provide a meaningful snippet
of such a huge amount of data?

If it *is* say a video or whatever where almost all of the data won't
make it into the index anyway, you're probably better off using
tika directly on the client and only sending the bits to Solr that
you need in the form of a SolrInputDocument (I'm thinking that
you'll be doing this in SolrJ) rather than transmit 2.5G over the
network and throwing almost all of it away
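
For illustration, a minimal sketch of that client-side approach (Tika facade API assumed; URL, file path, and field names are illustrative):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ClientSideTikaSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Extract text on the client so only the fields we need cross the wire.
        Tika tika = new Tika();
        tika.setMaxStringLength(10 * 1024 * 1024);   // cap extracted text; value is illustrative
        String text = tika.parseToString(new File("/path/to/big-file.xls"));

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "big-file-1");            // illustrative field names
        doc.addField("content", text);
        solr.add(doc);
        solr.commit();
    }
}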

If the entire 2.5G is data to be indexed, you'll probably want to
consider breaking it up into smaller chunks in order to make it
useful.

Best
Erick

On Fri, Jan 27, 2012 at 3:43 AM, Augusto Camarotti
augu...@prpb.mpf.gov.br wrote:
 I'm talking about 2 GB files. It means that I'll have to allocate something 
 bigger than that for the JVM? Something like 2,5 GB?

 Thanks,

 Augusto Camarotti

 Erick Erickson erickerick...@gmail.com 1/25/2012 1:48 pm 
 Mostly it depends on your container settings, quite often that's
 where the limits are. I don't think Solr imposes any restrictions.

 What size are we talking about anyway? There are implicit
 issues with how much memory parsing the file requires, but you
 can allocate lots of memory to the JVM to handle that.

 Best
 Erick

 On Tue, Jan 24, 2012 at 10:24 AM, Augusto Camarotti
 augu...@prpb.mpf.gov.br wrote:
 Hi everybody

 Does anyone knows if there is a maximum file size that can be uploaded to 
 the extractingrequesthandler via http request?

 Thanks in advance,

 Augusto Camarotti


Re: Parallel indexing in Solr

2012-02-06 Thread Erick Erickson
grin. I've had recurring discussions with executive level folks that no
matter how many VMs you host on a machine, and no matter how big that
machine is, there really, truly, *is* some hardware underlying it all that
really, truly, *does* have some limits.

And adding more VMs doesn't somehow get around those limits..

Good Luck!
Erick

On Mon, Feb 6, 2012 at 10:55 AM, Per Steffensen st...@designware.dk wrote:
 Sami Siren skrev:

 On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk
 wrote:




 Actually right now, I am trying to find our what my bottleneck is. The
 setup
 is more complex, than I would bother you with, but basically I have
 servers
 with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
 Solr-related problem, I am investigating different things, but just
 wanted
 to know a little more about how Jetty/Solr works in order to make a
 qualified guess.



 What kind of/how many discs do you have for your shards? ..also what
 kind of server are you experimenting with?


 Grrr, thats where I have a little fight with operations. For now they gave
 me one (fairly big) machine with XenServer. I create my machines as Xen
 VM's on top of that. One of the things I dont like about this (besides that
 I dont trust Xen to do its virtualization right, or at least not provide me
 with correct readings on IO) is that disk space is assigned from an iSCSI
 connected SAN that they all share (including the line out there). But for
 now actually it doesnt look like disk IO problems. It looks like
 networks-bottlenecks (but to some extend they all also shard network) among
 all the components in our setup - our client plus Lily stack (HDFS, HBase,
 ZK, Lily Server, Solr etc). Well it is complex, but anyways ...

 --
  Sami Siren






solrcore.properties

2012-02-06 Thread Walter Underwood
Looking at SOLR-1335 and the wiki, I'm not quite sure of the final behavior for 
this.

These properties are per-core, and not visible in other cores, right?

Are variables substituted in solr.xml, so I can swap in different properties 
files for dev, test, and prod? Like this:

<core name="mary" properties="conf/solrcore-${env:dev}.properties"/>

If that does not work, what are the best practices for managing dev/test/prod 
configs for Solr?

wunder
--
Walter Underwood
wun...@wunderwood.org
Search Guy, Chegg.com





Re: Performance degradation with distributed search

2012-02-06 Thread oleole
Yonik,

Thanks for your reply. Yeah, that's the first thing I tried (adding fsv=true
to the query) and it surprised me too. Could it be due to us using many
complex sorts (20 sorts with dismax, and, or...)? Is there anything that can be
optimized? Looks like it's calculated twice in Solr?

XJ

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performance-degradation-with-distributed-search-tp3715060p3720739.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance degradation with distributed search

2012-02-06 Thread XJ
BTW we just upgraded to Solr 3.5 from Solr 1.4. That's why we want to
explore the improvements/new features of distributed search.

On Mon, Feb 6, 2012 at 12:30 PM, oleole oleol...@gmail.com wrote:

 Yonik,

 Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true
 to the query) and it surprised me too. Could it due to we're using many
 complex sortings (20 sortings with dismax, and, or...). Any thing it can be
 optimized? Looks like it's calculated twice in solr?

 XJ

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Performance-degradation-with-distributed-search-tp3715060p3720739.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Performance degradation with distributed search

2012-02-06 Thread Yonik Seeley
On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote:
 Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true
 to the query) and it surprised me too. Could it due to we're using many
 complex sortings (20 sortings with dismax, and, or...). Any thing it can be
 optimized? Looks like it's calculated twice in solr?

It currently does calculate it twice... but only for those documents
being returned (which should not be significant).
What is rows set to?

-Yonik
lucidimagination.com


Re: Performance degradation with distributed search

2012-02-06 Thread XJ
Hm, just looked at the log: only 112 matched, and start=0, rows=30.

On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote:
  Thanks for your reply. Yeah that's the first thing I tried (adding
 fsv=true
  to the query) and it surprised me too. Could it due to we're using many
  complex sortings (20 sortings with dismax, and, or...). Any thing it can
 be
  optimized? Looks like it's calculated twice in solr?

 It currently does calculate it twice... but only for those documents
 being returned (which should not be significant).
 What is rows set to?

 -Yonik
 lucidimagination.com



Re: Performance degradation with distributed search

2012-02-06 Thread Yonik Seeley
On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote:
 hm.. just looked at the log only 112 matched, and start=0, rows=30

Are any of the sort criteria sort-by-function with anything complex
(like an embedded relevance query)?

-Yonik
lucidimagination.com



 On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:

 On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote:
  Thanks for your reply. Yeah that's the first thing I tried (adding
  fsv=true
  to the query) and it surprised me too. Could it due to we're using many
  complex sortings (20 sortings with dismax, and, or...). Any thing it can
  be
  optimized? Looks like it's calculated twice in solr?

 It currently does calculate it twice... but only for those documents
 being returned (which should not be significant).
 What is rows set to?

 -Yonik
 lucidimagination.com




Re: Performance degradation with distributed search

2012-02-06 Thread XJ
Yes, as I mentioned in a previous email, we do dismax queries (with different
mm values), Solr function queries (map, etc.), and math calculations (sum,
product, log). I understand those are expensive, but in the worst case they
should only double the time, not take it from 200ms to 1200ms, right?

XJ

On Mon, Feb 6, 2012 at 2:37 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote:
  hm.. just looked at the log only 112 matched, and start=0, rows=30

 Are any of the sort criteria sort-by-function with anything complex
 (like an embedded relevance query)?

 -Yonik
 lucidimagination.com


 
  On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com
 
  wrote:
 
  On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote:
   Thanks for your reply. Yeah that's the first thing I tried (adding
   fsv=true
   to the query) and it surprised me too. Could it due to we're using
 many
   complex sortings (20 sortings with dismax, and, or...). Any thing it
 can
   be
   optimized? Looks like it's calculated twice in solr?
 
  It currently does calculate it twice... but only for those documents
  being returned (which should not be significant).
  What is rows set to?
 
  -Yonik
  lucidimagination.com
 
 



Re: Performance degradation with distributed search

2012-02-06 Thread Yonik Seeley
On Mon, Feb 6, 2012 at 5:53 PM, XJ oleol...@gmail.com wrote:
 Yes as I mentioned in previous email, we do dismax queries(with different mm
 values), solr function queries (map, etc) math calculations (sum, product,
 log). I understand those are expensive. But worst case it should only double
 the time not going from 200ms to 1200ms right?

You mention dismax... but I assume that's as the main query and you
sort by score (which is fine).
The only issue with relevancy queries is if you sorted by one that was
not the main query - this is not yet optimized.

But for straight function queries that don't contain embedded
relevancy queries, I would definitely not expect the degradation you
are seeing - hence we should try to get to the bottom of this.
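
To be concrete, by a sort with an embedded relevancy query I mean something
like this (field names and query strings made up for illustration):

  q={!dismax qf=name}apple ipad
  &sort=query($cat_boost) desc, score desc
  &cat_boost={!dismax qf=category}hardware

Sorting the main dismax query by score is the fast path; it's the query()
function inside the sort spec that currently hits the unoptimized code.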

-Yonik
lucidimagination.com



 XJ

 On Mon, Feb 6, 2012 at 2:37 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:

 On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote:
  hm.. just looked at the log only 112 matched, and start=0, rows=30

 Are any of the sort criteria sort-by-function with anything complex
 (like an embedded relevance query)?

 -Yonik
 lucidimagination.com


 
  On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley
  yo...@lucidimagination.com
  wrote:
 
  On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote:
   Thanks for your reply. Yeah that's the first thing I tried (adding
   fsv=true
   to the query) and it surprised me too. Could it due to we're using
   many
   complex sortings (20 sortings with dismax, and, or...). Any thing it
   can
   be
   optimized? Looks like it's calculated twice in solr?
 
  It currently does calculate it twice... but only for those documents
  being returned (which should not be significant).
  What is rows set to?
 
  -Yonik
  lucidimagination.com
 
 




spell check - preserve case in suggestions

2012-02-06 Thread Satish Kumar
Hi,

Say that the field name has the following terms:

Giants
Manning
New York


When someone searches for gants or Gants, I need the suggestion to be
returned as Giants (capital G - the same case as in the content that was
indexed). Using a lowercase filter in both the index and query analyzers I
get the suggestion giants, but all the letters are in lower case. Is it
possible to preserve the case in suggestions, yet still get suggestions for
an input term in upper, lower, or mixed case?
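
For reference, the analyzer on the field feeding the spellchecker is roughly
this (simplified; the type name is just a placeholder):

  <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

so I can see why the dictionary only contains giants; what I am after is a
way to map the suggestion back to the original-cased Giants.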


Thanks,
Satish


Re: Solr with Scala

2012-02-06 Thread Tommy Chheng
I have created a Solr plugin using Scala. It works without problems.

I wouldn't go as far as saying Scala improves Solr performance, but you
can definitely use Scala to add missing functionality or custom query
parsing. Just build a jar using maven/sbt and put it in Solr's lib
directory.
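
As a rough illustration (untested as written here, and the class and method
signatures are from memory of the 3.x API, so treat it as a sketch rather
than a drop-in), a minimal query-parser plugin in Scala looks something like:

  import org.apache.lucene.index.Term
  import org.apache.lucene.search.{Query, TermQuery}
  import org.apache.solr.common.params.SolrParams
  import org.apache.solr.common.util.NamedList
  import org.apache.solr.request.SolrQueryRequest
  import org.apache.solr.search.{QParser, QParserPlugin}

  // Untested sketch: class/method names recalled from the Solr 3.x plugin API.
  class ExampleQParserPlugin extends QParserPlugin {
    override def init(args: NamedList[_]): Unit = {}

    override def createParser(qstr: String, localParams: SolrParams,
                              params: SolrParams, req: SolrQueryRequest): QParser =
      new QParser(qstr, localParams, params, req) {
        // Trivial parser: treat the whole query string as a term in the "id" field.
        override def parse(): Query = new TermQuery(new Term("id", qstr))
      }
  }

You register it in solrconfig.xml with something like

  <queryParser name="example" class="ExampleQParserPlugin"/>

and drop the jar (plus the Scala runtime library) into the core's lib
directory.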


On Sun, Feb 5, 2012 at 4:06 PM, deniz denizdurmu...@gmail.com wrote:
 Hi all,

 I have a question about scala and solr... I am curious if we can use solr
 with scala (plugins etc) to improve performance.

 anybody used scala on solr? could you tell me opinions about them?

 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-with-Scala-tp3718539p3718539.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Tommy Chheng


Re: multiple values encountered for non multiValued field type:[text/html, text, html]

2012-02-06 Thread William_Xu
Thank you for your reply, it is very helpful for me!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-type-text-html-text-html-tp3719088p3721305.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance degradation with distributed search

2012-02-06 Thread XJ
Yonik, thanks for your explanation. I've created a ticket here
https://issues.apache.org/jira/browse/SOLR-3104

On Mon, Feb 6, 2012 at 4:28 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Mon, Feb 6, 2012 at 6:16 PM, XJ oleol...@gmail.com wrote:
  Sorry I didn't make this clear. Yeah we use dismax in main query, as
 well as
  in sort orders (different from main queries). Because of our complicated
  business logic, we need many different relevancy queries in different
 sort
  orders (other than sort by score, we also have around 20 other different
  sort orders, some of them are dismax queries). However, this is
 something we
  can not get away from right now. What kind of optimization I can try to
 do
  there?

 OK, so basically it's slow because functions with embedded relevancy
 queries are forward only - if you request the value for a docid
 previous to the last, we need to reboot the query (re-weight, ask for
 the scorer, etc).  This means that for your 30 documents, that will
 require rebooting the query about 15 times (assuming that roughly half
 of the time the next docid will be less than the previous one).

 Unfortunately there's not much you can do externally... we need to
 implement optimizations at the Solr level for this.
 Can you open a JIRA issue for this?

 -Yonik
 lucidimagination.com



Re: summing facets on a specific field

2012-02-06 Thread Johannes Goll
you can use the StatsComponent

http://wiki.apache.org/solr/StatsComponent

with stats=true&stats.price=category&stats.facet=category

and pull the sum fields from the resulting stats facets.

Johannes

2012/2/5 Paul Kapla paul.ka...@gmail.com:
 Hi everyone,
 I'm pretty new to solr and I'm not sure if this can even be done. Is there
 a way to sum a specific field per each item in a facet. For example, you
 have an ecommerce site that has the following documents:

 id,category,name,price
 1,books,'solr book', $10.00
 2,books,'lucene in action', $12.00
 3,video,'cool video', $20.00

 so instead of getting (when faceting on category)
 books(2)
 video(1)

 I'd like to get:
 books ($22)
 video ($20)

 Is this something that can be even done? Any feedback would be much
 appreciated.



-- 
Dipl.-Ing.(FH)
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878
USA


Re: summing facets on a specific field

2012-02-06 Thread Johannes Goll
I meant

stats=true&stats.field=price&stats.facet=category
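
For example (core URL, field and facet names are just illustrative):

  http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=price&stats.facet=category

The stats section of the response then contains, under the price field, one
block per category value, roughly of the form

  <lst name="books">
    ...
    <double name="sum">22.0</double>
    ...
  </lst>

so the per-category totals (books $22, video $20) can be read straight from
the sum entries.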

2012/2/6 Johannes Goll johannes.g...@gmail.com:
 you can use the StatsComponent

 http://wiki.apache.org/solr/StatsComponent

 with stats=true&stats.price=category&stats.facet=category

 and pull the sum fields from the resulting stats facets.

 Johannes

 2012/2/5 Paul Kapla paul.ka...@gmail.com:
 Hi everyone,
 I'm pretty new to solr and I'm not sure if this can even be done. Is there
 a way to sum a specific field per each item in a facet. For example, you
 have an ecommerce site that has the following documents:

 id,category,name,price
 1,books,'solr book', $10.00
 2,books,'lucene in action', $12.00
 3,video,'cool video', $20.00

 so instead of getting (when faceting on category)
 books(2)
 video(1)

 I'd like to get:
 books ($22)
 video ($20)

 Is this something that can be even done? Any feedback would be much
 appreciated.



 --
 Dipl.-Ing.(FH)
 Johannes Goll
 211 Curry Ford Lane
 Gaithersburg, Maryland 20878
 USA



-- 
Dipl.-Ing.(FH)
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878
USA