Debugging on Tika

2012-02-03 Thread Arkadi Colson

Hi

I'm using Tika 0.10 for indexing my documents, but I am not getting the 
expected results when doing a search, even after I delete the index and 
start over again.
Some of the words in, for example, a PDF document can be found, but most of 
them cannot. Is it related to some language setting perhaps? How can I 
start debugging Tika? Any tips?


Thx!

--
Smartbit bvba
Hoogstraat 13
B-3670 Meeuwen
T: +32 11 64 08 80
F: +32 89 46 81 10
W: http://www.smartbit.be
E: ark...@smartbit.be



Parallel indexing in Solr

2012-02-03 Thread Per Steffensen

Hi

This topic has probably been covered before, but I haven't had the luck to 
find the answer.


We are running Solr instances with several cores inside. Solr is running 
out-of-the-box on top of Jetty. I believe Jetty receives all the 
HTTP requests about indexing new documents and forwards them to the Solr 
engine. What kind of parallelism does this setup provide? Can more than 
one index request get processed concurrently? How many? How can I increase 
the number of index requests that can be handled in parallel? Will I get 
better parallelism by running on another web container than Jetty - e.g. 
Tomcat? What is the recommended web container for high-performance 
production systems?


Thanks!

Regards, Per Steffensen


Re: error in indexing

2012-02-03 Thread leonardo2
Can someone help me?

Leonardo

--
View this message in context: 
http://lucene.472066.n3.nabble.com/error-in-indexing-tp3709686p3712495.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Debugging on Tika

2012-02-03 Thread Oleg Tikhonov
Hi Arkadi,

You can try to extract text from your documents using Tika's CLI (more
details: http://tika.apache.org/0.7/gettingstarted.html).
If that succeeds, it means that something goes wrong during the
indexing. Tika only extracts text and metadata from the documents and sends
this text to Lucene. Lucene itself constructs the index. You can inspect that
index using Luke (http://code.google.com/p/luke/).
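
For example, something like this (a rough sketch -- I'm assuming you have the
standalone tika-app jar for your version on hand, and sample.pdf stands in
for one of your documents):

  # dump the plain text exactly as Tika extracts it
  java -jar tika-app-0.10.jar --text sample.pdf

If the words you can't find in Solr are already missing from that output, the
problem is in extraction rather than in indexing or querying.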

Hope it helps.

Oleg


On Fri, Feb 3, 2012 at 10:43 AM, Arkadi Colson ark...@smartbit.be wrote:

 [quote of the original message trimmed]




Solr index update approach

2012-02-03 Thread Listas Discussões
hi,
I have an opinion mining application running Solr that serves to retrieve
documents and perform some analytics using facet queries.
It works great. But I have a big issue.
The document has an attribute for opinion that is automatically detected,
but users can change it if it's not correct.

A document may be shared by several users, and each user can change the
opinion of the document.
And the opinion may be different for each user.
The opinion value is crucial here because it's the main facet field in the
analytic view.

The thing is that Solr does not handle document updates; right now I need to
delete the document first and recreate the whole doc in the index to change
it with the new metadata.
And of course this is not fast enough. So I'm probably doing this the wrong
way.
It seems to me that this is not a good approach and I should not update the
index this way. The index should be more static, otherwise I will be
reindexing the whole index too often.

I'm running Solr with a master/slave topology (2 slave replicas): the
master to write and the slaves to read.
The Solr index is fed from a PostgreSQL database.

I was wondering about using a NoSQL key-value database to store this kind of
metadata and keep the index untouched.
That way I could keep the index intact and store the users' custom data
there.

It would fit if this value were not used by facet queries. That's the
problem.

So my question is: what would be the best approach to handle this kind of
use case with Solr?
If this is not a usual use case, consider for example favorite docs.
Favorite docs are probably a common use case in information retrieval.
How do you handle, for example, favorite docs between users?

I'd be very interested to hear about the best approach here.

best
Arian


Re: Debugging on Tika

2012-02-03 Thread Ahmet Arslan
 I'm using Tika 0.10 for indexing my documents but I am not
 getting the expected results when doing a search. Even after
 I delete the index and started over again.
 Some of the words in for example a PDF document can be found
 but most of them not. 

It could be the maxFieldLength setting in solrconfig.xml. Try setting it to:

  <maxFieldLength>2147483647</maxFieldLength>


Re: Solr index update approach

2012-02-03 Thread Mikhail Khludnev
Hello Arian,

Please look into
http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html
- it can be useful for your purpose. If you need to count facets against an
external field, you need to develop your own component - that shouldn't be a
big deal.
The relevant Solr nuts and bolts:
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://wiki.apache.org/solr/FunctionQuery
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/
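
For reference, an ExternalFileField is declared in schema.xml roughly like
this (a sketch based on my reading of the wiki; the field name and defaults
are made up, and keyField must be your unique key field):

  <fieldType name="extOpinion" keyField="id" defVal="0" stored="false"
             indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
  <field name="user_opinion" type="extOpinion"/>

The values then live in a plain text file named external_user_opinion in the
index data directory, one key=value pair per line, e.g.:

  doc1=1.0
  doc2=-1.0

That file can be replaced and picked up on a searcher reload without
reindexing, which is what makes it attractive for frequently changing
per-document values.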

Regards

On Fri, Feb 3, 2012 at 3:39 PM, Listas Discussões
lis...@arianpasquali.com wrote:

 [quote of the original message trimmed]



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Which patch 236 to choose for collapse - Solr 3.5

2012-02-03 Thread Erick Erickson
Preetesh:

I'm not understanding here. I believe Tamanjit is correct. Your example
works if and only if *all* the groups are returned, which happens in the
example case but not in the general case. Try your experiment with
&rows=3 and you'll find that (trunk, example data):

This search:
http://localhost:8983/solr/select?q=*:*&group=true&group.field=manu_exact&group.ngroups=true&rows=3
returns this (lots of stuff removed for clarity):

<response>
  <lst name="responseHeader">
    <lst name="params">
      <str name="group.field">manu_exact</str>
      <str name="group.ngroups">true</str>
      <str name="group">true</str>
      <str name="q">*:*</str>
      <str name="rows">3</str>
    </lst>
  </lst>
  <lst name="grouped">
    <lst name="manu_exact">
      <int name="matches">28</int>
      <int name="ngroups">13</int>
      <arr name="groups">
        <lst>
          <null name="groupValue"/>
          <result name="doclist" numFound="12" start="0">
          </result>
        </lst>
        <lst>
          <str name="groupValue">Samsung Electronics Co. Ltd.</str>
          <result name="doclist" numFound="1" start="0">
          </result>
        </lst>
        <lst>
          <str name="groupValue">Maxtor Corp.</str>
          <result name="doclist" numFound="1" start="0">
          </result>
        </lst>
      </arr>
    </lst>
  </lst>
</response>

The sum of the numFound values is different from "matches".

Or perhaps I'm misunderstanding your example...

Best
Erick

On Fri, Feb 3, 2012 at 12:46 AM, preetesh dubey dubeypreet...@gmail.com wrote:
 Nope!
 If you are doing grouping, then "matches" is always the total no. of results
 and "ngroups" is the number of groups. Every group can have some docs
 belonging to it, which can be anything according to the provided
 group.limit parameter. If you take the sum of all the docs of each group,
 then it's equivalent to "matches".
 OK, you can do one experiment: execute a simple query in Solr which
 returns very few results.
 1) Execute the query *without grouping* in the browser and check the
 xml/json response... it will show you the total no. of result matches as a
 response in numFound, e.g.
 <result name="response" numFound="20" start="0">
 Let's say:
 a) numFound without grouping = 20
 2) Now execute the same query *with grouping* parameters and look at
 the xml/json response in the browser. It will show you the results like this:
 /**
 <lst name="groupid">
 <int name="matches">20</int>
 <int name="ngroups">12</int>
 <arr name="groups">
 <lst>
 <str name="groupValue">4362</str>
 <result name="doclist" numFound="1" start="0">...</result>
 </lst>
 <lst>
 <str name="groupValue">3170</str>
 <result name="doclist" numFound="3" start="0">...</result>
 </lst>
 **/

 b) matches with grouping = 20

 Now take the sum of the docs of every group:
 <result name="doclist" numFound="1" start="0">...</result>
 <result name="doclist" numFound="3" start="0">...</result>
 i.e. the sum of numFound=1 and numFound=3.
 Let's say:
 c) sum of groups = 1 + 3 + ... You will find a == b == c at the end.
 Do that experiment and reply back...

 After doing the sum, compare a), b), c).

 On Fri, Feb 3, 2012 at 10:32 AM, tamanjit.bin...@yahoo.co.in 
 tamanjit.bin...@yahoo.co.in wrote:

 Ummm.. I think there is some confusion here.

 As per my understanding, matches is the total no. of docs which the original
 query/filter query returned. On these docs grouping is done. So matches may
 not actually be equal to the total no. of docs returned in your result, post
 grouping. It's just a subset of the matches, divided into groups.

 Is my understanding correct?

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Which-patch-236-to-choose-for-collapse-Solr-3-5-tp3697685p3712195.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 Thanks & Regards
 Preetesh Dubey


Re: error in indexing

2012-02-03 Thread Erick Erickson
Perhaps you could review:
http://wiki.apache.org/solr/UsingMailingLists

You really haven't shown us what it is that you're doing that
generates this error; about all you've said is "it doesn't work".

I'd start with trying to index a document with only the required
fields for your particular schema (i.e. fields in schema.xml
where required="true") and build up from there. Many people use
SolrJ to index docs, so I'd assume it's something in your setup,
which you haven't shown us.
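
For instance, a minimal test document in Solr's XML update format (a sketch --
I'm assuming "id" is your only required field; adjust to your schema) that
you can send with post.jar:

  <add>
    <doc>
      <field name="id">test-1</field>
    </doc>
  </add>

If that indexes cleanly, add your other fields back a few at a time until the
error reappears.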

Best
Erick

On Fri, Feb 3, 2012 at 4:05 AM, leonardo2 leonardo.rigut...@gmail.com wrote:
 Someone can help me?

 Leonardo

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/error-in-indexing-tp3709686p3712495.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Which patch 236 to choose for collapse - Solr 3.5

2012-02-03 Thread preetesh dubey
Erick,
Yes, you are correct. But with that example I only wanted to explain to
Tamanjit that "matches" in the Solr response counts all docs which matched
the grouped query.
Tamanjit, if you want the counts of docs of only the first page according to
the rows parameter, then the only way is the one you mentioned... iterate and
count...
There was a small misunderstanding between us: I thought Tamanjit wanted all
matched docs, but I think he wanted to know the docs matched in the first
page according to the rows parameter.

On Fri, Feb 3, 2012 at 7:32 PM, Erick Erickson erickerick...@gmail.com wrote:

 [quote of Erick's reply trimmed; see above]



-- 
Thanks & Regards
Preetesh Dubey


Re: Parallel indexing in Solr

2012-02-03 Thread Erick Erickson
Unfortunately, the answer is "it depends"(tm).

First question: how are you indexing things? SolrJ? post.jar?

But some observations:

1) Sure, using multiple cores will give you some parallelism. So will
   using a single core with something like SolrJ and
   StreamingUpdateSolrServer (see the sketch below), especially with
   trunk (4.0) and the Document Writer Per Thread stuff. In 3.x, you'll
   see some pauses when segments are merged that you can't get around
   (per core). See:
   http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
   for an excellent writeup. But whether or not you use several cores
   should be determined by your problem space, certainly not by trying
   to increase the throughput. Indexing usually takes a back seat to
   search performance.
2) General settings are hard to come by. If you're sending structured
   documents that Tika parses behind the scenes, your performance will
   be much different (slower) than sending SolrInputDocuments (SolrJ).
3) The recommended servlet container is, generally, the one you're most
   comfortable with. Tomcat is certainly popular. That said, use
   whatever you're most comfortable with until you see a performance
   problem. Odds are you'll find your load on Solr is at its limit
   before your servlet container has problems.
4) Monitor your CPU, and fire more requests at it until it hits 100%.
   Note that there are occasions where the servlet container limits the
   number of outstanding requests it will allow and queues the ones
   over that limit (find the magic setting to increase this if it's a
   problem; it differs by container). If you start to see your response
   times lengthen but the CPU not being fully utilized, that may be the
   cause.
5) How high is "high performance"? On a stock Solr with the Wikipedia
   dump (11M docs), all running on my laptop, I see 7K docs/sec
   indexed. I know of installations that see 60 docs/sec or even less.
   I'm sending simple docs with SolrJ locally, and they're sending huge
   documents over the wire that Tika handles. There are just so many
   variables that it's hard to say anything except "try it and see".
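
To illustrate point 1, here is a minimal SolrJ sketch (my assumptions: Solr
3.x reachable on localhost, a schema with "id" and "text" fields; the queue
size and thread count are just illustrative):

  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelIndexing {
    public static void main(String[] args) throws Exception {
      // Buffers adds in a queue of 100 and drains it with 4 sender threads,
      // so documents are sent to Solr concurrently over several connections.
      StreamingUpdateSolrServer server =
          new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
      for (int i = 0; i < 10000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("text", "body of document " + i);
        server.add(doc);
      }
      server.commit(); // flush the queue and make the docs searchable
    }
  }

How much of that parallelism survives inside Solr then depends on the
container's request threads and, in 3.x, on segment merges as noted above.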

Best
Erick

On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote:
 [quote of the original message trimmed]


Re: Solr index update approach

2012-02-03 Thread Listas Discussões
hi Mikhail

External fields were one of the options, but I was not 100% sure they would
fit.
I will study this option some more.

thank you so much for your reply
Arian

2012/2/3 Mikhail Khludnev mkhlud...@griddynamics.com

 [quote of Mikhail's reply trimmed]




-- 
Arian Pasquali
FEUP researcher
twitter: @arianpasquali
www.arianpasquali.com


Re: Shard timeouts on large (1B docs) Solr cluster

2012-02-03 Thread Marc Sturlese
timeAllowed can be used outside distributed search. It is used by the
TimeLimitingCollector. When the search time equals timeAllowed, it will
stop searching and return the results it could find until then.
This can be a problem when using incremental indexing: Lucene starts
searching from the bottom and new docs are inserted at the top, so
timeAllowed could cause new docs to never appear in the search results.
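
For reference, it is just a per-request parameter in milliseconds, e.g.
(hostname and query are placeholders):

  http://localhost:8983/solr/select?q=*:*&timeAllowed=2000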

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shard-timeouts-on-large-1B-docs-Solr-cluster-tp3691229p3713263.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr index update approach

2012-02-03 Thread Ahmet Arslan

 external fields was one of the options, but I was not 100%
 sure if it would
 fit.
 I will study more about this option.

I was wondering if Lucene's ToChildBlockJoinQuery and/or ToParentBlockJoinQuery
can be a replacement for ExternalFileField.

http://www.searchworkings.org/blog/-/blogs/tochildblockjoinquery-in-lucene

Also, what are the similarities and differences from Solr's join QueryParser?

http://wiki.apache.org/solr/Join
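
For reference, the join syntax described on that wiki page looks like this
(field names taken from the wiki example):

  q={!join from=manu_id_s to=id}ipod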


Zero Matches Weirdness

2012-02-03 Thread Marian Steinbach
Hi!

I am having a weird issue with a search string not producing a match
where it should. I can reproduce it with both 3.4 and 3.5.

"Where it should" means that I am getting a hit in the Analysis tool
in the admin panel, but not in a query via /select.

Now when I try

   select?q=Am+Heidstamm...

I get zero results back. But, when I quote the string

  select?q=%22Am+Heidstamm%22...

I get several hits.

BTW, the token am is filtered out in the field text, since it's in a
stopword list.

Any ideas on how this can be explained?

My defaultSearchField is "text". The field gets its content via
several copyField statements.

The configuration for text is as follows:

   <field name="text" type="text_de" indexed="true" stored="false"
          multiValued="true"/>

The configuration for type text_de is this:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <!-- protect slashes from tokenizer by replacing with something unique -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([A-Z]+)/([0-9]+)/([0-9]+)" replacement="$1ḧ$2ḧ$3"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([0-9]+)/([0-9]+)" replacement="$1ḧ$2"/>
        <!-- protect paragraph symbol from tokenizer -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="§\s*([0-9]+)" replacement="ǚ$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" preserveOriginal="1"
                splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_de.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanMinimalStemFilterFactory"/>
        <!-- get slashes back in -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="ḧ"
                replacement="/"/>
        <!-- get paragraph symbols back in -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="ǚ"
                replacement="§"/>
    </analyzer>
</fieldType>


Log output for the unquoted phrase:

INFO: [] webapp=/solr path=/select
params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&debugQuery=true&start=0&q=Am+Heidstamm&hl.fl=betreff&wt=json&fq=&hl=true&rows=10}
hits=0 status=0 QTime=29

... and for the quoted one:

INFO: [] webapp=/solr path=/select
params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&start=0&q="Am+Heidstamm"&hl.fl=betreff&wt=standard&fq=&hl=true&rows=10&version=2.2}
hits=14 status=0 QTime=244


Thanks!


Re: SolrCloud war?

2012-02-03 Thread Darren Govoni

UPDATE:

I set my app server's [1] jetty.port system property to be equal to the app 
server's open port and was able to get two Solr shards to talk.


The overall properties I set are:

App server domain 1:

bootstrap_confdir
collection.configName
jetty.port
solr.solr.home
zkRun

App server domain 2:

bootstrap_confdir
collection.configName
jetty.port
solr.solr.home
zkHost
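
(In Glassfish I set these as JVM options; roughly, with example values:

  asadmin create-jvm-options "-Djetty.port=9090"
  asadmin create-jvm-options "-Dsolr.solr.home=/opt/solr/home"

-- the exact asadmin quoting may differ by version, so treat this as a
sketch.)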

I deployed each war app into the "/solr" context. I presume it's needed 
for remote URL addressing.

I checked the zookeeper config page and it shows both shards.

Awesome.

[1] Glassfish 3.1.1

On 02/01/2012 08:50 PM, Mark Miller wrote:

I have not yet tried to run SolrCloud in another app server, but it shouldn't 
be a problem.

One issue you might have is the fact that we count on hostPort coming from the 
system property jetty.port. This is set in the default solr.xml - the hostPort 
defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you 
are not going to use jetty.port.


- Mark Miller
lucidimagination.com
On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote:


Hi,
  I'm trying to get the SolrCloud2 examples to work using a war-deployed Solr 
in Glassfish.
The startup properties must be different in this case, because it's having 
trouble connecting to ZooKeeper when I deploy the solr war file.

Perhaps the embedded zookeeper has trouble running in an app server?

Any tips appreciated!

Darren

On 01/30/2012 06:58 PM, Darren Govoni wrote:

Hi,
  Is there any issue with running the new SolrCloud deployed as a war in 
another app server?
Has anyone tried this yet?

thanks.



Re: Zero Matches Weirdness

2012-02-03 Thread Dmitry Kan
What about the query side of the field?

On Fri, Feb 3, 2012 at 6:11 PM, Marian Steinbach mar...@sendung.de wrote:

 [quote of the original message trimmed]




-- 
Regards,

Dmitry Kan


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-03 Thread Dipti Srivastava
Hi Mark,
Thanks for looking into the issue.

As for specifying the bootstrap dir for each instance with ZK, it was just
a typo on my side. I went back and looked at my script on the second and
3rd nodes and it did not have the bootstrap dir, so I had specified it for
only the very FIRST node that registers ZK.

2. java -DzkRun=ec2-compute-2.amazonaws.com:9983
-Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr
-DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,
ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar


Thanks!
Dipti

On 2/2/12 8:27 PM, Mark Miller markrmil...@gmail.com wrote:

Thanks Dipti!

One thing that seems off is that you are passing the bootstrap_confdir
param on each instance?

Other than that though, the problem you are seeing is indeed a bug -
though hidden if using localhost. I'll fix it here:
https://issues.apache.org/jira/browse/SOLR-3091

Again, thanks for the detailed report.

- mark


On Feb 2, 2012, at 4:44 PM, Dipti Srivastava wrote:

 Hi Mark,
 I am trying to set up on 4 ami's, where 3 of the instances will have the
 embedded ZK running. Here are the startup commands for all 4.

 - Note that on the 4th instance I do not have the ZK host and bootstrap
 conf dir specified. The 4th instance throws exception (earlier in this
 email chain) at startup.
 - Ideally, I should not have to specify the host for the -DzkRun since
it
 is the localhost, but without that I get the exception as well.

 1. java -DzkRun=ec2-compute-1.amazonaws.com:9983
 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr
 -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf

 -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar


 2. java -DzkRun=ec2-compute-2.amazonaws.com:9983
 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr
 -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf

 -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar


 3. java -DzkRun=ec2-compute-3.amazonaws.com:9983
 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr
 -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf

 -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar


 4. java -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr

 -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar


 Thanks,
 Dipti

 On 1/31/12 10:18 AM, Mark Miller markrmil...@gmail.com wrote:

 Hey Dipti -

 Can you give the exact startup cmds you are using for each of the
 instances? I have got Example C going, so I'll have to try and dig into
 whatever you are seeing.

 - mark

 On Jan 27, 2012, at 12:53 PM, Dipti Srivastava wrote:

 Hi Mark,
 Did you get a chance to look into the issues with running the embedded
 Zookeeper ensemble, as per Example C, from the
 http://wiki.apache.org/solr/SolrCloud2

 Hi All,
 Did anyone else run multiple shards with embedded zk ensemble
 successfully? If so would like some tips on any issues that you came
 across.

 Regards,
 Dipti

 From: diptis dipti.srivast...@apollogrp.edu
 Date: Fri, 23 Dec 2011 10:32:52 -0700
 To: markrmil...@gmail.com markrmil...@gmail.com
 Subject: Re: Release build or code for SolrCloud

 Hi Mark,
 There is some issue with specifying localhost vs actual host names for
 zk. When I changed my script to specify the actual hostname (which
 should be local by default), the first, 2nd and 3rd instances came up --
 the ones that have the embedded zk running. Now, I am getting the same
 exception for the 4th AMI which is NOT part of the zookeeper ensemble. I
 want to run zk only on 3 of the 4 instances.

 java -Dbootstrap_confdir=./solr/conf -DzkRun=ami-1:9983
 -DzkHost=ami-1:9983,ami-2:9983,ami-3:9983 -DnumShards=2 -jar start.jar

 Dipti

 From: Mark Miller markrmil...@gmail.com
 Reply-To: markrmil...@gmail.com markrmil...@gmail.com
 Date: Fri, 23 Dec 2011 09:34:52 -0700
 To: diptis dipti.srivast...@apollogrp.edu
 Subject: Re: Release build or code for SolrCloud

 I'm having trouble getting a quorum up using the built in SolrZkServer
 as well - so i have not been able to replicate this - I'll have to
keep
 digging. Not sure if it's due to a ZooKeeper update or what yet.

 2011/12/21 Dipti Srivastava dipti.srivast...@apollogrp.edu
 Hi Mark,
 Thanks! So now I am deploying a 4 node cluster on AMI's and the main
 instance that bootstraps the config to the zookeeper does not come
up I
 get an exception as follows. My solrcloud.sh looks like

 #!/usr/bin/env bash

 cd ..

 rm -r -f example/solr/zoo_data
 rm -f example/example.log

 cd example
 #java -DzkRun -DnumShards=2 -DSTOP.PORT=7983 -DSTOP.KEY=key -jar
 start.jar
 1example.log 21 
 java -Dbootstrap_confdir=./solr/conf -DzkRun
 

Re: Zero Matches Weirdness

2012-02-03 Thread Marian Steinbach
2012/2/3 Dmitry Kan dmitry@gmail.com:
 What about query side of the field?


It's identical. At least that's what I think, since I didn't specify
the type="query" or type="index" attribute for the analyzer part.

Marian


Re: Zero Matches Weirdness

2012-02-03 Thread Dmitry Kan
Actually, I wouldn't count on it and would just specify the index and query
sides explicitly. Just to play it safe.

On Fri, Feb 3, 2012 at 8:34 PM, Marian Steinbach mar...@sendung.de wrote:

 2012/2/3 Dmitry Kan dmitry@gmail.com:
  What about query side of the field?
 

 It's identical. At least that's what I think, since I din't specify
 the type=query or type=index attribute for the analyzer part.

 Marian




-- 
Regards,

Dmitry Kan


Re: Zero Matches Weirdness

2012-02-03 Thread Erik Hatcher
No, don't do that.  That's definitely not good advice.  If the analysis chain 
is the same for both index and query, just use <analyzer>.

As for Marian's issue... was there literally a "+" in the query or was that 
URL-encoded?   Try debugQuery=true for both queries and see what you get for the 
query parsing output.

Erik

On Feb 3, 2012, at 14:18 , Dmitry Kan wrote:

 [quote of the earlier exchange trimmed]



Another zero match issue

2012-02-03 Thread Van Tassell, Kristian
Hi everyone!

I'm also having some zero match weirdness. When I execute this search:

?q=Create+a+self+contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+fileName^8.0+title^4.0

I get ZERO results.

If I remove the fileName qf parameter (an indexed but not stored field), I get 
5 hits.

?q=Create+a+self+contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+title^4.0

Putting quotes around the original query returns the hits, but that shouldn't be 
required, I would think.

Also, removing part of the query text gives the intended results(!):

?q=contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+fileName^8.0+title^4.0

These search parameters haven't seemed to be a problem until this example. 
Other searches with the same parameters return their intended results.

What are some things I should be looking at? Thanks in advance!

Debug info:

<str name="rawquerystring">Create a self contained Part Module</str>
<str name="querystring">Create a self contained Part Module</str>
<str name="parsedquery">+((DisjunctionMaxQuery((fileName:Create^8.0 |
title:creat^4.0 | text:creat^0.8 | location:creat^0.9))
DisjunctionMaxQuery((fileName:a^8.0)) DisjunctionMaxQuery((fileName:self^8.0 |
title:self^4.0 | text:self^0.8 | location:self^0.9))
DisjunctionMaxQuery((fileName:contained^8.0 | title:contain^4.0 |
text:contain^0.8 | location:contain^0.9))
DisjunctionMaxQuery((fileName:Part^8.0 | title:part^4.0 | text:part^0.8 |
location:part^0.9)) DisjunctionMaxQuery((fileName:Module^8.0 | title:modul^4.0
| text:modul^0.8 | location:modul^0.9)))~6)</str>

<str name="parsedquery_toString">+(((fileName:Create^8.0 | title:creat^4.0 |
text:creat^0.8 | location:creat^0.9) (fileName:a^8.0) (fileName:self^8.0 |
title:self^4.0 | text:self^0.8 | location:self^0.9) (fileName:contained^8.0 |
title:contain^4.0 | text:contain^0.8 | location:contain^0.9) (fileName:Part^8.0
| title:part^4.0 | text:part^0.8 | location:part^0.9) (fileName:Module^8.0 |
title:modul^4.0 | text:modul^0.8 | location:modul^0.9))~6)</str>


Setting solrj server connection timeout

2012-02-03 Thread Shawn Heisey
Is the following a reasonable approach to setting a connection timeout 
with SolrJ?


queryCore.getHttpClient().getHttpConnectionManager().getParams()
.setConnectionTimeout(15000);
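
If I'm reading the commons-httpclient 3.x API correctly, the read timeout
can be set the same way on the same params object (the value here is just a
placeholder):

queryCore.getHttpClient().getHttpConnectionManager().getParams()
             .setSoTimeout(30000);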

Right now I have all my solr server objects sharing a single HttpClient 
that gets created using the multithreaded connection manager, where I 
set the timeout for all of them.  Now I will be letting each server 
object create its own HttpClient object, and using the above statement 
to set the timeout on each one individually.  It'll use up a bunch more 
memory, as there are 56 server objects, but maybe it'll work better.  
The total of 56 objects comes about from 7 shards, a build core and a 
live core per shard, two complete index chains, and for each of those, 
one server object for access to CoreAdmin and another for the index.


The impetus for this, as it's possible I'm stating an XY problem: 
Currently I have an occasional problem where SolrJ connections throw an 
exception.  When it happens, nothing is logged in Solr.  My code is 
smart enough to notice the problem, send an email alert, and simply try 
again at the top of the next minute.  The simple explanation is that 
this is a Linux networking problem, but I never had any problem like 
this when I was using Perl with LWP to keep my index up to date.  I sent 
a message to the list some time ago on this exception, but I never got a 
response that helped me figure it out.


Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
    at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
    ... 3 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
    ... 7 more


Thanks,
Shawn



Re: Zero Matches Weirdness

2012-02-03 Thread Marian Steinbach
2012/2/3 Erik Hatcher erik.hatc...@gmail.com:
 As for Marian's issue... was there literally a + in the query or was that 
 urlencoded?   Try debugQuery=true for both queries and see what you get for 
 the query parsing output.


I tested both + and %20 with and without quotes, it doesn't make a
difference whether I use + or %20.

Here is the debug output for the unquoted version (zero hits):

debug: {
    "rawquerystring": "Am Heidstamm",
    "querystring": "Am Heidstamm",
    "parsedquery": "+((DisjunctionMaxQuery((aktenzeichen:Am^10.0))
DisjunctionMaxQuery((text:heidstamm^0.1 | betreff:heidstamm^3.0 |
aktenzeichen:Heidstamm^10.0)))~2)",
    "parsedquery_toString": "+(((aktenzeichen:Am^10.0)
(text:heidstamm^0.1 | betreff:heidstamm^3.0 |
aktenzeichen:Heidstamm^10.0))~2)",
    "QParser": "ExtendedDismaxQParser",
}


And for the quoted version (with hits):

{
    "rawquerystring": "\"Am Heidstamm\"",
    "querystring": "\"Am Heidstamm\"",
    "parsedquery": "+DisjunctionMaxQuery((text:heidstamm^0.1 |
betreff:heidstamm^3.0 | aktenzeichen:\"Am Heidstamm\"^10.0))",
    "parsedquery_toString": "+(text:heidstamm^0.1 | betreff:heidstamm^3.0
| aktenzeichen:\"Am Heidstamm\"^10.0)",
    "explain": { },
    "QParser": "ExtendedDismaxQParser",
}


As it seems to me, the +(((aktenzeichen:Am^10.0) (text:heidstamm^0.1
| betreff:heidstamm^3.0 | aktenzeichen:Heidstamm^10.0))~2) condition
cannot be fulfilled. I have AND as the default operator. The term
(aktenzeichen:Am^10.0) cannot be satisfied. The thing is: why does
it even appear there?

This is my current qf:

   betreff^5.0 aktenzeichen^10.0 body^0.2 text^0.1

I have just changed this to only

   text^0.1

for the sake of testing, and then it works.

It seems as if I haven't quite understood the impact of qf. I thought
it would allow me to boost the score based on a string appearing in a
field. I didn't expect it to affect what matches and what doesn't.

Marian


Re: Zero Matches Weirdness

2012-02-03 Thread Marian Steinbach
I just got rid of the one field, aktenzeichen, which never matched, in the
qf string. Now it works fine. Solved for now.

Thanks!


Re: Another zero match issue

2012-02-03 Thread Chris Hostetter
: 
?q=Create+a+self+contained+Part+ModuledefType=edismaxqf=location^0.9+text^0.8+fileName^8.0+title^4.0
: 
: I get ZERO results.
: 
: If I remove the fileName qf parameter (an indexed but not stored field), I 
get 5 hits.

lemme guess: fileName doesn't use stopwords but the other fields do, correct?

you're getting zero matches because you've told dismax that every clause 
must match something, and "a" is a clause in your query that gets ignored 
for every field that uses stopwords, but for fields that don't use 
stopwords (like fileName) it is kept around, and you get no matches 
for the whole query unless that clause gets a match.


http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
http://www.lucidimagination.com/search/document/ca18cbded00bdc79#6a30d2ed7914a4d9
https://issues.apache.org/jira/browse/SOLR-3085

...if you google for "dismax stopwords" you'll find lots of discussion on 
how/why this happens.

in general you really need to think carefully about the fields you put in 
your qf param, and make sure their query analyzers play nicely with 
each other in these multi-term query situations.
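
one workaround discussed in those links is to relax the mm (minimum should
match) requirement so that not every clause has to match, e.g. something
like (the value is only an example):

  defType=edismax&mm=2<-1

...but read SOLR-3085 for the trade-offs before leaning on that.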


-Hoss


Re: Zero Matches Weirdness

2012-02-03 Thread Dmitry Kan
Ok, thanks, Erick, good to know. Sorry for the confusion.

On Fri, Feb 3, 2012 at 9:42 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 [quote of Erik's reply trimmed]




-- 
Regards,

Dmitry Kan


Re: SolrCloud war?

2012-02-03 Thread Mark Miller

On Feb 3, 2012, at 1:04 PM, Darren Govoni wrote:

 I deployed each war app into the /solr context. I presume its needed by 
 remote URL addressing.

Yup - but you can override this by setting the hostContext in solr.xml. It 
defaults to solr as that fits the example jetty distribution.

- Mark Miller
lucidimagination.com


Re: error in indexing

2012-02-03 Thread Chris Hostetter

: Subject: Re: error in indexing

FWIW: it's really crucial to state which version of Solr you are using 
when you have bugs with error stack traces like this -- going back through 
the versions i'm *guessing* that you are using Solr 1.4.1 (or possibly 
older), correct?

Based on that assumption (and the stack trace) i *think* your problem is 
that somehow you are adding a field to your documents where the *name* of 
the field is null ... but unless you left something out of the java code 
you posted i'm not really sure how that would be possible.  are you sure 
you don't have any other code adding fields to these SolrInputDocuments?

: output_documents = new ArrayList<SolrInputDocument>();
: while (...) {
:   sdoc = new SolrInputDocument();
:   sdoc.setField("id", idb);
:   sdoc.setField("file_id", id);
:   sdoc.addField("box_text", zone.Text);
:   final Iterator<WPWord> it_on_words = zone.Words.iterator();
:   while (it_on_words.hasNext()) {
:     final WPWord word = it_on_words.next();
:     final String word_box = word.boxesToString();
:     final String word_payload = word.Text + "|" + word_box;
:     sdoc.addField("word", word_payload);
:   }



-Hoss


ReversedWildcardFilterFactory and PorterStemFilterFactory

2012-02-03 Thread Jamie Johnson
I'd like to use both the ReversedWildcardFilterFactory and
PorterStemFilterFactory on a text field that I have. I'd like to avoid
stemming the reversed tokens and would also like to avoid reversing
the stemmed tokens.  My original thought was to have the
ReversedWildcardFilterFactory higher in the chain, but what would this
do with the stemmer?  Would it attempt to stem the reversed tokens or
are they ignored?  What is the best way to achieve the result I am
looking for in a single field?

Again, the goal is to have text come in and have it be reversed and stemmed,
but I don't want the stemmed tokens reversed and I don't want the reversed
tokens stemmed. Is this possible?
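
For reference, the kind of index-time chain I have in mind looks roughly
like this (a sketch; the type name and tokenizer are arbitrary):

  <fieldType name="text_rev" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

...where the open question is what the stemmer does with the reversed tokens
that ReversedWildcardFilterFactory emits alongside the originals.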


Re: frange with multi-valued fields

2012-02-03 Thread Chris Hostetter

: Has anyone had experience using frange with multi-valued fields?  In 
: solr 3.5 doing so results in the error: can not use FieldCache on 
: multivalued field

correct.

: Here's the use case.  We have multiple years attached to each document 
: and want to be able to refine by a year range.  We're currently using 
: the standard range query syntax [ 1900 TO 1910 ] which works, but those 
: queries are slower than we would like.  I've seen reports that using 
: frange can greatly improve performance.  
: http://solr.pl/en/2011/05/30/quick-look-frange/

note that there is a mistake in the "Faster implementation" 
column of the performance table in that article .. the actual data (and the 
paragraph after the table) indicate that...

"standard range query is faster only for queries that cover a
small number of terms from the given field."

Yonik got similar results when he did testing on range queries over 
strings, but the specifics on where the cut-off point was were slightly 
different...

https://yonik.wordpress.com/2009/07/06/ranges-over-functions-in-solr-1-4/

In general you'd have to test it, but for things like years -- 
unless you are dealing with really big spans of time (ie: 
[1901 TO 20]) and have ranges that are generally large relative to 
the total span of data you are dealing with -- i seriously doubt frange 
would be much faster for you even if you had a single-valued field, and the 
bottom line is frange won't work with multivalued fields.
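
For reference, the two syntaxes being compared (with a hypothetical
single-valued "year" field):

  fq=year:[1900 TO 1910]            (standard range query)
  fq={!frange l=1900 u=1910}year    (frange, a function range query)

frange evaluates a function against one value per document via the
FieldCache, which is exactly why it can't work on multivalued fields.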

forget about frange for a moment, and tell us more about your specific 
situation. to start with: what field configuration are you using right now 
for your "year" field? specifically: are you using TrieIntField? have you 
tried tuning the options on it? how many unique year values are in your 
corpus? how big do your ranges usually get?



https://people.apache.org/~hossman/#xyproblem
Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




-Hoss


Re: multiple index analyzer chains on a field

2012-02-03 Thread Jamie Johnson
Looking closer I think I asked the wrong question, please disregard and I
will start a new chain with that question

On Friday, February 3, 2012, Jamie Johnson jej2...@gmail.com wrote:
 Is it possible to have multiple index analysis chains on a single field?


Performance degradation with distributed search

2012-02-03 Thread XJ
Hello,

I am experimenting with Solr distributed search / random sharding (we
currently use geo sharding), hoping to gain some performance and also
scalability in the future (the index size keeps growing and geo sharding is
hard to scale).

However, I'm seeing worse performance with distributed search on a testing
server of 6 shards, 15-core CPU, 24G mem; the index size is about 8G on each
shard. With geo sharding it can easily take a 150 QPS load with good response
times. Now with distributed search there are timeouts and the average
response time also increases. This is probably no big surprise, since I'm
using the same number of shards plus the overhead of distributed
search/merge/HTTP network etc.

When I look into the details (slow queries), I found some real issues that I
need help with. For example, a query which takes 200ms with geo sharding now
times out (2000ms) with distributed search, and each shard query
(isShard=true) takes about 1200ms. But if I run the query against that shard
alone (without distributed search), it only takes 200ms. So I compared the
two query URLs; the only difference is that the shard query issued by
distributed search has fsv=true. I understand field sort values are needed
during the merge process, but I didn't expect that to make this much
difference in performance, although we do have a lot of sort orders (about
20 different sort orders).

Any suggestion/comment on the performance problem I'm having with
distributed search? Is distributed search the right choice for me? What
other setup/idea can I try?

thanks,
XJ