Re: Tomcat creates a thread for each SOLR core

2014-04-15 Thread Atanas Atanasov
Hello again,

The current situation: after setting the two options so that the cores are not
loaded on startup, and with ramBufferSizeMB=32, Tomcat is stable and
responsive, and the thread count peaks at about 60.
Browsing and storing are fast. I should note that I have many cores, each with
a small number of documents.
Unfortunately, creating a new core still takes about 20 minutes.
The next step will be downgrading to Java 7u25. Any other suggestions will be
highly appreciated. Thanks in advance.
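For reference, the two options are presumably the lazy-loading core attributes
in the legacy solr.xml; a minimal sketch (core names and the cache size below
are placeholders, not the actual values used):

  <solr persistent="true">
    <cores adminPath="/admin/cores" transientCacheSize="128">
      <core name="core1" instanceDir="core1" loadOnStartup="false" transient="true"/>
      <!-- ... one entry per core ... -->
    </cores>
  </solr>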

P.S. The previous Solr version I upgraded from was 3.6.

Regards,
Atanas Atanasov


On Thu, Apr 10, 2014 at 6:06 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/10/2014 12:40 AM, Atanas Atanasov wrote:
  I need some help. After updating to Solr 4.4, the Tomcat process is
  consuming about 2 GB of memory and the CPU usage is about 40% for about
  10 minutes after the start. However, the bigger problem is that I have
  about 1000 cores, and it seems that a thread is created for each core.
  The process has more than 1000 threads and everything is extremely slow.
  Creating or unloading a core, even without documents, takes about 20
  minutes. Searching is more or less fine, but storing also takes a long
  time.
  Is there some configuration I missed or got wrong? There aren't many
  calls. I use 64-bit Tomcat 7, Solr 4.4, and the latest 64-bit Java. The
  machine has 24 GB of RAM and a 16-core CPU, and is running Windows
  Server 2008 R2. The index is updated every 30 seconds / 10,000 documents.
  I haven't checked the number of threads before the update, because I
  didn't have to; it was working just fine. Any suggestion will be highly
  appreciated, thank you in advance.

 If creating a core takes 20 minutes, that sounds to me like the JVM is
 doing constant full garbage collections to free up enough memory for
 basic system operation.  It could also be explained by temporary work
 threads having to wait to execute because the servlet container will not
 allow them to run.

 When indexing is happening, each core will set aside some memory for
 buffering index updates.  By default, the value of ramBufferSizeMB is
 100.  If all your cores are indexing at once, multiply the indexing
 buffer by 1000, and you'll require 100GB of heap memory.  You'll need to
 greatly reduce that buffer size.  This buffer was 32MB by default in 4.0
 and earlier.  If you are not setting this value, this change sounds like
 it might fully explain what you are seeing.
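 In solrconfig.xml that looks something like this (a sketch; pick whatever
 value fits your heap):

   <indexConfig>
     <ramBufferSizeMB>32</ramBufferSizeMB>
   </indexConfig>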

 https://issues.apache.org/jira/browse/SOLR-4074

 What version did you upgrade from?  Solr 4.x is a very different beast
 than earlier major versions.  I believe there may have been some changes
 made to reduce memory usage in versions after 4.4.0.

 The jetty that comes with Solr is configured to allow 10,000 threads.
 Most people don't have that many, even on a temporary basis, but bad
 things happen when the servlet container will not allow Solr to start as
 many as it requires.  I believe that the typical default maxThreads
 value you'll find in a servlet container config is 200.
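 In Tomcat, maxThreads is set on the Connector element in server.xml; a
 sketch (the port and other attributes here are only examples):

   <Connector port="8080" protocol="HTTP/1.1"
              connectionTimeout="20000"
              maxThreads="10000"
              redirectPort="8443"/>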

 Erick's right about a 6GB heap being very small for what you are trying
 to do.  Putting 1000 cores on one machine is something I would never
 try.  If it became a requirement I had to deal with, I wouldn't try it
 unless the machine had a lot more CPU cores, hundreds of gigabytes of
 RAM, and a lot of extremely fast disk space.

 If this worked before a Solr upgrade, I'm amazed.  Congratulations to
 you for fine work!

 NB: Oracle Java 7u25 is what you should be using.  7u40 through 7u51
 have known bugs that affect Solr/Lucene.  These should be fixed by 7u60.
  A pre-release of that is available now, and it should be generally
 available in May 2014.

 Thanks,
 Shawn




[ANNOUNCE] Apache Solr 4.7.2 released.

2014-04-15 Thread Robert Muir
April 2014, Apache Solr™ 4.7.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: filter capabilities are limited?

2014-04-15 Thread horot
The variables being compared are of string data type. I cannot apply
mathematical functions to them without first converting the type (string to
integer). Will I be able to build this scheme without changing the filter?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-capabilities-are-limited-tp4130458p4131174.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr join and lucene scoring

2014-04-15 Thread mm

Thank you for the clarification.
We really need scoring with solr joins, but as you can see I'm not a  
specialist in solr development.
We would like to hire somebody with more experience to write a qparser  
plugin for scoring in joins and donate the source code to the community.


Any suggestions where we could find somebody with the fitting experience?


Quoting Mikhail Khludnev mkhlud...@griddynamics.com:


On Wed, Apr 9, 2014 at 1:33 PM, m...@preselect-media.com wrote:


Hello Mikhail,

Thanks for the clarification. I'm a little bit confused by Alvaro's answer,
but my own tests didn't result in a proper score, so I think you're right
and it's still not implemented.

What do you mean by the impedance between Lucene and Solr?


It's an old story, and unfortunately obvious. Using Lucene's code in Solr
might not be straightforward. I haven't looked at this problem in
particular; it's just a caveat.



Why isn't scoring in joins implemented in Solr anyway, when Lucene offers a
solution for it?


As you can see, these are two separate implementations. It seems like the
Solr guys just didn't care about scoring (and here I share their point).
It's just an exercise for someone who needs it.




Best regards,
Moritz

Quoting Mikhail Khludnev mkhlud...@griddynamics.com:

 On Thu, Apr 3, 2014 at 1:42 PM, m...@preselect-media.com wrote:


 Hello,


referencing to this issue:
https://issues.apache.org/jira/browse/SOLR-4307

Is it still not possible with the solr query time join to use scoring?

 It's still not implemented.

https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L549


 Do I still have to write my own plugin or is there a plugin somewhere I

could use?

I never wrote a plugin for solr before, so I would prefer if I don't have
to start from scratch.

 The right approach from my POV is to use Lucene's join
 (https://github.com/apache/lucene-solr/blob/trunk/lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java)
 in a new QParser, but solving the impedance between Lucene and Solr might
 be tricky.





THX,
Moritz





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com









--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com






Autocomplete with Case-insensitive feature

2014-04-15 Thread Sunayana
Hi All,

I have been trying out the autocomplete feature in Solr 4.7.1 using the
Suggester. I have configured it to display phrase suggestions as well. The
problem is: if I type "game" I get suggestions for "game" or phrases
containing "game", but if I type "Game" *no suggestion is displayed at all*.
How can I get case-insensitive suggestions?
I have defined the fields in schema.xml like this:
<field name="name_autocomplete" type="text_auto" indexed="true"
       stored="true" multiValued="true"/>
<copyField source="name" dest="name_autocomplete"/>
<fieldType name="text_auto" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocomplete with Case-insensitive feature

2014-04-15 Thread Dmitry Kan
Hi,

Configure a LowerCaseFilterFactory on the query side of your field type config.
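In config terms that means adding the filter to the query-time analyzer of
the field type, roughly like this (a sketch based on the config quoted below,
not a tested setup):

  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>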

Dmitry


On Tue, Apr 15, 2014 at 10:50 AM, Sunayana sunayana...@wipro.com wrote:

 Hi All,

 I have been trying out this autocomplete feature in Solr4.7.1 using
 Suggester.I have configured it to display phrase suggestions also.Problem
 is
 If I type game I get suggestions as game or phrases containing game.
 But If I type Game *no suggestion is displayed at all*.How can I get
 suggestions case-insensitive?
 I have defined in schema.xml fields like this:
  field name=name_autocomplete type=text_auto indexed=true
 stored=true multiValued=true /
 copyField source=name dest=name_autocomplete /
 fieldType name=text_auto class=solr.TextField
 positionIncrementGap=100 
 analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter class=solr.ShingleFilterFactory
 minShingleSize=2
 maxShingleSize=4
 outputUnigrams=true
 outputUnigramsIfNoShingles=true/


 /analyzer
 analyzer type=query
 tokenizer class=solr.KeywordTokenizerFactory/
 filter class=solr.TrimFilterFactory /

  /analyzer
 /fieldType




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-15 Thread Salman Akram
Looking at this, sharding seems to be the best and simplest option to handle
such queries.


On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Salman,
 Let me drop a few thoughts on

 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E

 There are two aspects to this question:
 1. dealing with long-running processing (thread divergence actions,
 http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310), and
 2. the actual time checking.
 Terminating or aborting a thread (2) is just a way of tracking time
 externally and sending interrupt(), which the thread should react to (they
 don't do that now), so we return to the core issue (1).

 Solr's timeAllowed is the proper way to handle these things; the only
 problem is that it expects that only the core search is long-running, but in
 your case rewriting MultiTermQuery-s takes a huge amount of time.
 Let's consider this problem. First of all, MultiTermQuery.rewrite() is
 nearly a design issue: after a heavy rewrite occurs, the result is thrown
 away once the search is done. I think the most straightforward way to
 address this issue is by caching these expensive queries. Solr does it well
 with filter queries (http://wiki.apache.org/solr/CommonQueryParameters#fq).
 However, that only works for conjunctive-normal-form-like queries
 (http://en.wikipedia.org/wiki/Conjunctive_normal_form); there is a
 workaround that allows caching disjunction legs, see
 http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
 If you still want to run expensively rewritten queries, you need to
 implement a timeout check (similar to TimeLimitingCollector) for the
 TermsEnum returned from MultiTermQuery.getTermsEnum(). Wrapping the actual
 TermsEnum is a good way to do this; to inject the time-limiting wrapper
 TermsEnum into queries, you might consider overriding methods like
 SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query
 tree after parsing.



 On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

  Anyone?
 
 
  On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram 
  salman.ak...@northbaysolutions.net wrote:
 
   With reference to this thread
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
 I
  wanted to know if there was any response to that or if Chris Harris
   himself can comment on what he ended up doing, that would be great!
  
  
   --
   Regards,
  
   Salman Akram
  
  
 
 
  --
  Regards,
 
  Salman Akram
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Regards,

Salman Akram


Re: Autocomplete with Case-insensitive feature

2014-04-15 Thread Sunayana
Hi,

Did you mean changing the field type to:
<fieldType name="text_auto" class="solr.TextField"
           positionIncrementGap="100" indexed="true" stored="false"
           multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

This did not work out for me. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182p4131198.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing Big Data With or Without Solr

2014-04-15 Thread Vineet Mishra
Hi All,

I have worked with Solr 3.5 to implement real-time search on some 100 GB of
data. That worked fine, but it was a little slow on complex queries (multiple
grouped/joined queries).
Now I want to index some real Big Data (around 4 TB or even more). Can
SolrCloud be the solution for this? If not, what could be the best possible
solution in this case?

*Stats for the previous Implementation:*
It was a master-slave architecture with normal standalone instances of Solr
3.5. There were around 12 Solr instances running on different machines.

*Things to consider for the next implementation:*
Since all the data is sensor data, duplication and uniqueness of readings are
factors to consider.

*This is really urgent; please treat it as a priority and suggest a set of
feasible solutions.*

Regards


Re: Class not found ICUFoldingFilter (SOLR-4852)

2014-04-15 Thread Ronak Kirit
Hello Shawn,

Thanks for your reply.

Yes, I have defined ${solr.solr.home} explicitly, and all the mentioned jars
are present in ${solr.solr.home}/lib. solr.log also shows that those files are
getting added once (grep icu4 solr.log). I could see these lines in the log:

INFO  - 2014-04-15 15:40:21.448; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/icu4j-49.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-icu-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-morfologik-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-smartcn-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-stempel-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-uima-4.3.1.jar' to classloader

But I still get the same exception, ICUFoldingFilter not found. However,
copying those files to WEB-INF/lib works fine for me.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Class-not-found-ICUFoldingFilter-SOLR-4852-tp4130612p4131221.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Class not found ICUFoldingFilter (SOLR-4852)

2014-04-15 Thread ronak kirit
Hello Shawn,

Thanks for your reply.

Yes, I have defined ${solr.solr.home} explicitly, and all the mentioned
jars are present in ${solr.solr.home}/lib. solr.log also shows that those
files are getting added once (grep icu4 solr.log). I could see these lines
in the log:

INFO  - 2014-04-15 15:40:21.448; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/icu4j-49.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-icu-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-morfologik-4.3.1.jar' to
classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-smartcn-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-stempel-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-uima-4.3.1.jar' to classloader

But I still get the same exception, ICUFoldingFilter not found. However,
copying those files to WEB-INF/lib works fine for me.

Thanks,
Ronak


On Fri, Apr 11, 2014 at 3:14 PM, ronak kirit ronak...@gmail.com wrote:

 Hello,

 I am facing the same issue discussed at SOLR-4852. I am getting the error
 below:

 Caused by: java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.lucene.analysis.icu.ICUFoldingFilter
 at
 org.apache.lucene.analysis.icu.ICUFoldingFilterFactory.create(ICUFoldingFilterFactory.java:50)
   at
 org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67)


 I am using solr-4.3.1. As discussed at SOLR-4852, I had all the jars at
 (SOLR_HOME)/lib and there is no reference to lib via any of solrconfig.xml
 or schema.xml.

 I have also tried setting sharedLib=foo, but that didn't work either.
 However, if I remove all of the files below:

 icu4j-49.1.jar
 lucene-analyzers-morfologik-4.3.1.jar
 lucene-analyzers-stempel-4.3.1.jar
 solr-analysis-extras-4.3.1.jar
 lucene-analyzers-icu-4.3.1.jar
 lucene-analyzers-smartcn-4.3.1.jar
 lucene-analyzers-uima-4.3.1.jar

 from $(solrhome)/lib and move them to solr-webapp/webapp/WEB-INF/lib,
 things work fine.

 Any guess? Any help?

 Thanks,

 Ronak



Re: Error Arising from when I start to crawl

2014-04-15 Thread Cihad Guzel
Hi Ridwan,

This error is not related to Solr. Solr is only used in the IndexerJob for
Nutch; this error is thrown from the InjectorJob, and it is related to Nutch
and Gora. Check your HBase and Nutch configuration, and make sure HBase is
running correctly and that you are using the correct version. For more
accurate answers, you should ask on the Nutch user list and provide more
information.


2014-04-14 5:11 GMT+03:00 Alexandre Rafalovitch arafa...@gmail.com:

 This is most definitely not a Solr issue, so you may want to check with
 Gora's list.

 However, as a quick general hint, your problem seems to be in this
 part: 3530@engr-MacBookProlocalhost . I assume there should be a server name
 there, but it seems to be two names joined together. So I would check where
 that (possibly the hbase listen address) is defined and ensure it is correct.

 Regards,
  Alex
 On 14/04/2014 8:46 am, Ridwan Naibi ridwan.na...@gmail.com wrote:

  Hi there,
 
  I get the following error after I run the following command. Can you
  please let me know what the problem is? I have exhausted online tutorials
  trying to solve this issue. Thanks
 
  engr@engr-MacBookPro:~/NUTCH_HOME/apache-nutch-2.2.1/runtime/local$
  bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
  InjectorJob: starting at 2014-04-14 02:28:56
  InjectorJob: Injecting urlDir: urls/seed.txt
  InjectorJob: org.apache.gora.util.GoraException:
  java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a
  host:port pair: � 3530@engr-MacBookProlocalhost,43200,1397436949832
  at org.apache.gora.store.DataStoreFactory.createDataStore(
  DataStoreFactory.java:167)
  at org.apache.gora.store.DataStoreFactory.createDataStore(
  DataStoreFactory.java:135)
  at org.apache.nutch.storage.StorageUtils.createWebStore(
  StorageUtils.java:75)
  at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
  at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
  at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
  Caused by: java.lang.RuntimeException:
 java.lang.IllegalArgumentException:
  Not a host:port pair: � 3530@engr-MacBookProlocalhost
 ,43200,1397436949832
  at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
  at org.apache.gora.store.DataStoreFactory.initializeDataStore(
  DataStoreFactory.java:102)
  at org.apache.gora.store.DataStoreFactory.createDataStore(
  DataStoreFactory.java:161)
  ... 7 more
  Caused by: java.lang.IllegalArgumentException: Not a host:port pair: �
  3530@engr-MacBookProlocalhost,43200,1397436949832
  at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:60)
  at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(
  MasterAddressTracker.java:63)
  at org.apache.hadoop.hbase.client.HConnectionManager$
  HConnectionImplementation.getMaster(HConnectionManager.java:354)
  at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
  at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
  ... 9 more
 
 



Re: Analysis Tool Not Working for CharFilterFactory?

2014-04-15 Thread Alexandre Rafalovitch
Which version of Solr? I think there was a bug in the UI. You can check the
network traffic to confirm.
On 15/04/2014 5:32 pm, Steve Huckle steve.huc...@gmail.com wrote:

  I have used a CharFilterFactory in my schema.xml for the fieldType
 text_general, so that queries for "cafe" and "café" return the same results.
 It works correctly. Here's the relevant part of my schema.xml:

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory"
                  mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory"
                  mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

 However, using the analysis tool within the admin UI, if I analyse
 text_general with any field values for index and query, the output for ST,
 SF and LCF is empty in every case. Is this a bug?


 --
 Steve Huckle

 If you print this email, eventually you'll want to throw it away. But there 
 is no away. So don't print this email, even if you have to.




Re: multiple analyzers for one field

2014-04-15 Thread Michael Sokolov
A blog post is a great idea, Alex!  I think I should wait until I have a 
complete end-to-end implementation done before I write about it though, 
because I'd also like to include some tips about configuring the new 
suggesters with Solr (the documentation on the wiki hasn't quite caught 
up yet, I think), and I don't have that working as I'd like just yet.  
But I will follow up with something soon; probably I will be able to 
share code on a public repo.


-Mike

On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:

Hi Mike,

Glad I was able to help. Good note about the PoolingReuseStrategy, I
did not think of that either.

  Is there a blog post or a GitHub repository coming with more details
on that? Sounds like something others may benefit from as well.

Regards,
Alex.
P.s. If you don't have your own blog, I'll be happy to host such
article on mine.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:

I lost the original thread; sorry for the new / repeated topic, but thought
I would follow up to let y'all know that I ended up implementing Alex's idea
to implement an UpdateRequestProcessor in order to apply different analysis
to different fields when doing something analogous to copyFields.

It was pretty straightforward except that when there are multiple values, I
ended up needing multiple copies of the same Analyzer.  I had to implement a
new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
foreseen.

-Mike




Re: multiple analyzers for one field

2014-04-15 Thread Alexandre Rafalovitch
Your call, though from experience this sounds like either two or no blog
posts. I certainly have killed a bunch of good articles by waiting for
perfection :-)
On 15/04/2014 7:01 pm, Michael Sokolov msoko...@safaribooksonline.com
wrote:

 A blog post is a great idea, Alex!  I think I should wait until I have a
 complete end-to-end implementation done before I write about it though,
 because I'd also like to include some tips about configuring the new
 suggesters with Solr (the documentation on the wiki hasn't quite caught up
 yet, I think), and I don't have that working as I'd like just yet.  But I
 will follow up with something soon; probably I will be able to share code
 on a public repo.

 -Mike

 On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:

 Hi Mike,

 Glad I was able to help. Good note about the PoolingReuseStrategy, I
 did not think of that either.

   Is there a blog post or a GitHub repository coming with more details
 on that? Sounds like something others may benefit from as well.

 Regards,
 Alex.
 P.s. If you don't have your own blog, I'll be happy to host such
 article on mine.

 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
 msoko...@safaribooksonline.com wrote:

 I lost the original thread; sorry for the new / repeated topic, but
 thought
 I would follow up to let y'all know that I ended up implementing Alex's
 idea
 to implement an UpdateRequestProcessor in order to apply different
 analysis
 to different fields when doing something analogous to copyFields.

 It was pretty straightforward except that when there are multiple
 values, I
 ended up needing multiple copies of the same Analyzer.  I had to
 implement a
 new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
 foreseen.

 -Mike





Re: Indexing Big Data With or Without Solr

2014-04-15 Thread Furkan KAMACI
Hi Vineet;

I've been using SolrCloud for that kind of Big Data, and I think you should
consider using it. If you have any problems you can ask them here.

Thanks;
Furkan KAMACI


2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

 Hi All,

 I have worked with Solr 3.5 to implement real time search on some 100GB
 data, that worked fine but was little slow on complex queries(Multiple
 group/joined queries).
 But now I want to index some real Big Data(around 4 TB or even more), can
 SolrCloud be solution for it if not what could be the best possible
 solution in this case.

 *Stats for the previous Implementation:*
 It was Master Slave Architecture with normal Standalone multiple instance
 of Solr 3.5. There were around 12 Solr instance running on different
 machines.

 *Things to consider for the next implementation:*
 Since all the data is sensor data hence it is the factor of duplicity and
 uniqueness.

 *Really urgent, please take the call on priority with set of feasible
 solution.*

 Regards



Bug within the solr query parser (version 4.7.1)

2014-04-15 Thread Johannes Siegert

Hi,

I have updated my Solr instance from 4.5.1 to 4.7.1. Now the parsed
query does not seem to be correct.


Query: q=*:*&fq=title:TE&debug=true

Before the update the parsed filter query was +title:te +title:t +title:e.
After the update the parsed filter query is +((title:te title:t)/no_coord)
+title:e. It seems like a bug in the query parser.


I have also validated the parsed filter query with the analysis
component. The result was +title:te +title:t +title:e.


The behavior is the same for all special characters that split a word into
two parts.


I use the following WordDelimiterFilter on the query side:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
        preserveOriginal="1"/>


Thanks.

Johannes


Additional information:

Debug before the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+title:te +title:t +title:e</str>
  </arr>
  ...

Debug after the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+((title:te title:t)/no_coord) +title:e</str>
  </arr>
  ...

title-field definition:

<fieldType name="text_title" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            splitOnNumerics="1" preserveOriginal="1"
            stemEnglishPossessive="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


clusterstate.json does not reflect current state of down versus active

2014-04-15 Thread Rich Mayfield
Solr 4.7.1

I am trying to orchestrate a fast restart of a SolrCloud (4.7.1) cluster. I
was hoping that clusterstate.json would reflect the up/down state of each
core as well as whether or not a given core was the leader.

clusterstate.json is not kept up to date with what I see going on in my
logs though - I see the leader election process play out. I would expect
that state would show down immediately for replicas on the node that I
have shut down.

Eventually, after about 30 minutes, all of the leader election processes
complete and clusterstate.json gets updated to the true state for each
replica.

Why does it take so long for clusterstate.json to reflect the correct
state? Is there a better way to determine the state of the system?

(In my case, each node has upwards of 1,000 1-shard collections. There are
two nodes in the cluster - each collection has 2 replicas.)

Thanks much.
rich


Re: clusterstate.json does not reflect current state of down versus active

2014-04-15 Thread Shawn Heisey
On 4/15/2014 8:58 AM, Rich Mayfield wrote:
 I am trying to orchestrate a fast restart of a SolrCloud (4.7.1). I was
 hoping to use clusterstate.json would reflect the up/down state of each
 core as well as whether or not a given core was leader.

 clusterstate.json is not kept up to date with what I see going on in my
 logs though - I see the leader election process play out. I would expect
 that state would show down immediately for replicas on the node that I
 have shut down.

 Eventually, after about 30 minutes, all of the leader election processes
 complete and clusterstate.json gets updated to the true state for each
 replica.

 Why does it take so long for clusterstate.json to reflect the correct
 state? Is there a better way to determine the state of the system?

 (In my case, each node has upwards of 1,000 1-shard collections. There are
 two nodes in the cluster - each collection has 2 replicas.)

First, I'll admit that my experience with SolrCloud is not as extensive
as my experience with non-cloud installs.  I do have a SolrCloud (4.2.1)
install, but it's the smallest possible redundant setup -- three
servers, two run Solr and Zookeeper, the third runs Zookeeper only.

What are you trying to achieve with your restart?  Can you just reload
the collections one by one instead?

Assuming that reloading isn't going to work for some reason (rebooting
for OS updates is one possibility), we need to determine why it takes so
long for a node to stabilize.

Here's a bunch of info about performance problems with Solr.  I wrote
it, so we can discuss it in depth if you like:

http://wiki.apache.org/solr/SolrPerformanceProblems

I have three possible suspicions for the root of your problem.  It is
likely to be one of them, but it could be a combination of any or all of
them.  Because this happens at startup, I don't think it's likely that
you're dealing with a GC problem caused by a very large heap.

1) The system is replaying 1000 transaction logs (possibly large, one
for each core) at startup, and also possibly initiating index recovery
using replication.
2) You don't have enough RAM to cache your index effectively.
3) Your Java heap is too small.

If your zookeeper ensemble does not use separate disks from your Solr
data (or separate servers), there could be an issue with zookeeper
client timeouts that's completely separate from any other problems.

I haven't addressed the fact that your cluster state doesn't update
quickly.  This might be a bug, but if we can deal with the slow
startup/stabilization first, then we can see whether there's anything
left to deal with on the cluster state.

Thanks,
Shawn



Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Shawn Heisey
On 4/15/2014 9:41 AM, Alexey Kozhemiakin wrote:
 We've faced a strange data corruption issue with one of our clients old solr 
 setup (3.6).

 When we do a query (id:X OR id:Y) we get 2 nodes, one contains normal doc 
 data, another is empty (doc /).
 We've looked inside lucene index using Luke - same story, one of documents is 
 empty.
 When we click on 1st document - it shows nothing.
 http://snag.gy/O5Lgq.jpg


 Probably files for stored data were corrupted? But luke index check says OK.
 Any clues how to troubleshoot root cause?

Do you know for sure that the index was OK at some point?  Do you know
what might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might
be able to figure it out ... but if you don't know when it happened or
you don't have logs, it might not be possible to know what happened. 
The document may have simply been indexed incorrectly.

Thanks,
Shawn



Empty documents in Solr\lucene 3.6

2014-04-15 Thread Alexey Kozhemiakin
Dear Community,

We've faced a strange data corruption issue with one of our clients old solr 
setup (3.6).

When we do a query (id:X OR id:Y) we get two doc nodes; one contains normal
doc data, the other is empty (<doc/>).
We've looked inside the Lucene index using Luke - same story, one of the
documents is empty.
When we click on the first document, it shows nothing.
http://snag.gy/O5Lgq.jpg


Perhaps the files for stored data were corrupted? But the Luke index check says OK.
Any clues on how to troubleshoot the root cause?

Best regards,
Alexey



Race condition in Leader Election

2014-04-15 Thread Rich Mayfield
I see something similar where, given ~1000 shards, both nodes spend a LOT of 
time sorting through the leader election process. Roughly 30 minutes.

I too am wondering - if I force all leaders onto one node, then shut down both, 
then start up the node with all of the leaders on it first, then start up the 
other node, then I think I would have a much faster startup sequence.

Does that sound reasonable? And if so, is there a way to trigger the leader 
election process without taking the time to unload and recreate the shards?

 Hi
 
   When restarting a node in solrcloud, i run into scenarios where both the
 replicas for a shard get into recovering state and never come up causing
 the error No servers hosting this shard. To fix this, I either unload one
 core or restart one of the nodes again so that one of them becomes the
 leader.
 
 Is there a way to force leader election for a shard for solrcloud? Is
 there a way to break ties automatically (without restarting nodes) to make
 a node as the leader for the shard?
 
 
 Thanks
 Nitin


RE: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Alexey Kozhemiakin
The system was up and running for a long time (months) without any updates.
There were no crashes for sure; at least, the support team says so.
Logs indicate that at some point there was not enough disk space (caused by
the weekend index optimization).


Were there any other similar cases, or is this unique to us?


Alexey.
-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, April 15, 2014 18:50
To: solr-user@lucene.apache.org
Subject: Re: Empty documents in Solr\lucene 3.6

Do you know for sure that the index was OK at some point?  Do you know what 
might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might be able 
to figure it out ... but if you don't know when it happened or you don't have 
logs, it might not be possible to know what happened. 
The document may have simply been indexed incorrectly.

Thanks,
Shawn



Re: Race condition in Leader Election

2014-04-15 Thread Mark Miller
We have to fix that then.

-- 
Mark Miller
about.me/markrmiller

On April 15, 2014 at 12:20:03 PM, Rich Mayfield (mayfield.r...@gmail.com) wrote:

I see something similar where, given ~1000 shards, both nodes spend a LOT of 
time sorting through the leader election process. Roughly 30 minutes.  

I too am wondering - if I force all leaders onto one node, then shut down both, 
then start up the node with all of the leaders on it first, then start up the 
other node, then I think I would have a much faster startup sequence.  

Does that sound reasonable? And if so, is there a way to trigger the leader 
election process without taking the time to unload and recreate the shards?  

 Hi  
  
 When restarting a node in solrcloud, i run into scenarios where both the  
 replicas for a shard get into recovering state and never come up causing  
 the error No servers hosting this shard. To fix this, I either unload one  
 core or restart one of the nodes again so that one of them becomes the  
 leader.  
  
 Is there a way to force leader election for a shard for solrcloud? Is  
 there a way to break ties automatically (without restarting nodes) to make  
 a node as the leader for the shard?  
  
  
 Thanks  
 Nitin  


Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Shawn Heisey
On 4/15/2014 10:22 AM, Alexey Kozhemiakin wrote:
 The system was up and running for long time(months) without any updates.
 There was no crashes for sure, at least support team says so.
 Logs indicate that at some point there was not enough disk space (caused by 
 weekend index optimization).

Software behavior becomes very difficult to define when a resource (RAM,
disk space, etc) is completely exhausted.  Even if Lucene's behavior is
well defined (which I think it might be -- the index itself is NOT
corrupt), Solr is another layer here, and I don't know whether its
behavior is well defined.  I suspect that it's not.  This might explain
what you're seeing.  That might be the only information you'll get, if
there's nothing else in the logs besides the inability to write to the disk.

Thanks,
Shawn



Re: What's the actual story with new morphline and hadoop contribs?

2014-04-15 Thread Wolfgang Hoschek
The solr morphline jars are integrated with solr by way of the solr specific 
solr/contrib/map-reduce module.

Ingestion from Flume into Solr is available here: 
http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

FWIW, for our purposes we see no role for DataImportHandler anymore.

Wolfgang.

On Apr 15, 2014, at 6:01 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 The use case I keep thinking about is Flume/Morphline replacing
 DataImportHandler. So, when I saw morphline shipped with Solr, I tried
 to understand whether it is a step towards it.
 
 As it is, I am still not sure I understand why those jars are shipped
 with Solr, if it is not actually integrating into Solr.
 
 Regards,
   Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 
 
 On Mon, Apr 14, 2014 at 8:36 PM, Wolfgang Hoschek whosc...@cloudera.com 
 wrote:
 Currently all Solr morphline use cases I’m aware of run in processes outside 
 of the Solr JVM, e.g. in Flume, in MapReduce, in HBase Lily Indexer, etc. 
 These ingestion processes generate Solr documents for Solr updates. Running 
 in external processes is done to improve scalability, reliability, 
 flexibility and reusability. Not everything needs to run inside of the Solr 
 JVM.
 
 We haven’t found a use case for it so far, but it would be easy to add an 
 UpdateRequestProcessor that runs a morphline inside of the Solr JVM.
 
 Here is more background info:
 
 http://kitesdk.org/docs/current/kite-morphlines/index.html
 
 http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
 
 http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf
 
 Wolfgang.
 
 On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com 
 wrote:
 
 Hello,
 
 I saw that 4.7.1 has morphline and hadoop contribution libraries, but
 I can't figure out the degree to which they are useful to _Solr_
 users. I found one hadoop example in the readme that does some sort
 injection into Solr. Is that the only use case supported?
 
 I thought that maybe there is a UpdateRequestProcessor or Handler
 end-point or something that hooks into morphline to do
 similar/alternative work to DataImportHandler. But I can't see any
 entry points or examples for that.
 
 Anybody knows what the story is and/or what the future holds?
 
 Regards,
   Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 



Re: What is Overseer?

2014-04-15 Thread Chris Hostetter

: So, is Overseer really only an implementation detail or something that Solr
: Ops guys need to be very aware of?

Most people don't ever need to worry about the overseer - it's magic and 
it will take care of itself.  

The recent work on adding support for an overseer role in 4.7 was 
specifically for people who *want* to worry about it.

I've updated several places in the solr ref guide to remove some
misleading claims about the overseer (some old docs equated it to running
embedded zookeeper) and added some more info to the glossary:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole



-Hoss
http://www.lucidworks.com/


cache warming questions

2014-04-15 Thread Matt Kuiper
Hello,

I have a few questions regarding how Solr caches are warmed.

My understanding is that there are two ways to warm internal Solr caches (only 
one way for document cache and lucene FieldCache):

Auto warming - occurs when there is a current searcher handling requests and a
new searcher is being prepared.  When a new searcher is opened, its caches may
be prepopulated, or autowarmed, with cached objects from the caches in the old
searcher. autowarmCount is the number of cached items that will be regenerated
in the new searcher (http://wiki.apache.org/solr/SolrCaching#autowarmCount).
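For example, autowarmCount is set per cache in solrconfig.xml; a sketch (the
sizes below are placeholders, not recommendations):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="32"/>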

Explicit warming - where the static warming queries specified in solrconfig.xml
for the newSearcher and firstSearcher listeners are executed when a new searcher
is being prepared, as in the sketch below.
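Those listeners look roughly like this in solrconfig.xml (a sketch; the
queries and the sort field are placeholders):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">some common query</str><str name="sort">price asc</str></lst>
    </arr>
  </listener>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">another static warming query</str></lst>
    </arr>
  </listener>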

What does it mean that items will be regenerated or prepopulated from the 
current searcher's cache to the new searcher's cache?  I doubt it means copy, 
as the index has likely changed with a commit and possibly invalidated some 
contents of the cache.  Are the queries, or filters, that define the contents 
of the current caches re-executed for the new searcher's caches?

For the case where auto warming is configured, a current searcher is active,
and static warming queries are defined, how do auto warming and explicit
warming work together? Or do they?  Is only one type of warming activated to
fill the caches?

Thanks,
Matt


Re: Question regarding solrj

2014-04-15 Thread Prashant Golash
Sorry for not replying sooner.
It was the wrong version of SolrJ that the client was using (as it was
third-party code, we couldn't find this out earlier). After fixing the
version, things seem to be working fine.

Thanks for your response!!!


On Sun, Apr 13, 2014 at 7:26 PM, Erick Erickson erickerick...@gmail.comwrote:

 You say I can't change the client. What is the client written in?
 What does it expect? Does it use the same version of SolrJ?

 Best,
 Erick

 On Sun, Apr 13, 2014 at 6:40 AM, Prashant Golash
 prashant.gol...@gmail.com wrote:
  Thanks for your feedback. Following are some more details
 
  Version of solr : 4.3.0
  Version of solrj : 4.3.0
 
  The way I am returning response to client:
 
 
  Request Holder is the object containing post process request from client
  (After renaming few of the fields, and internal to external mapping of
 the
  fields)
 
  *Snippet of code*
 
  WS.WSRequestHolder requestHolder = WS.url(url);
  // requestHolder processing of few fields
  return requestHolder.get().map(
  new F.FunctionWS.Response, Result() {
  @Override
  public Result apply(WS.Response response)
  throws Throwable {
  System.out.println(Response header:
 
  + response.getHeader(Content-Type));
  System.out.println(Response:  +
  response.getBody());
  *return
  ok(response.asByteArray()).as(response.getHeader(Content-Type));*
  }
  }
  );
 
  Thanks,
  Prashant
 
 
  On Sun, Apr 13, 2014 at 3:35 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:
 
  Hi;
 
  If you had a chance to change the code at client side I would suggest to
  try that:
 
 
 http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html#setParser(org.apache.solr.client.solrj.ResponseParser)
  There
  maybe a problem about character encoding of your Play App and here is
 the
  information:
 
  Javabin is a custom binary format used to write out Solr's response in a
  fast and efficient manner. As of Solr 3.1, the JavaBin format has
 changed
  to version 2. Version 2 serializes strings differently: instead of
 writing
  the number of UTF-16 characters followed by the bytes in Modified UTF-8
 it
  writes the number of UTF-8 bytes followed by the bytes in UTF-8.
 
  Which version of Solr and Solrj do you use respectively? On the other
 hand
  if you give us more information I can help you because there may be any
  other interesting thing as like here:
  https://issues.apache.org/jira/browse/SOLR-5744
 
  Thanks;
  Furkan KAMACI
 
 
  2014-04-12 22:18 GMT+03:00 Prashant Golash prashant.gol...@gmail.com:
 
   Hi Solr Gurus,
  
   I have some doubt related to solrj client.
  
   My scenario is like this:
  
  - There is a proxy server (Play App) which internally queries solr.
  - The proxy server is called from client side, which uses Solrj
  library.
  The issue is that I can't change client code. I can only change
  configurations to call different servers, hence I need to use
 SolrJ.
  - Results are successfully returned from my play app in
   *java-bin*format without modify them, but on client side, I am
   receiving this
  exception:
  
   Caused by: java.lang.NullPointerException
   * at
  
  
 
 org.apache.solr.common.util.JavaBinCodec.readExternString(JavaBinCodec.java:689)*
   * at
  
 org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)*
   * at
  
 
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)*
   * at
  
  
 
 org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)*
   * at
  
  
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)*
   * at
  
  
 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)*
   * at
  
  
 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)*
   * at
 org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310)*
   * at
  
  
 
 com.ibm.commerce.foundation.internal.server.services.search.util.SearchQueryHelper.query(SearchQueryHelper.java:125)*
   * at
  
  
 
 com.ibm.commerce.foundation.server.services.rest.search.processor.solr.SolrRESTSearchExpressionProcessor.performSearch(SolrRESTSearchExpressionProcessor.java:506)*
   * at
  
  
 
 com.ibm.commerce.foundation.server.services.search.SearchServiceFacade.performSearch(SearchS*
   erviceFacade.java:193)
  
   I am not sure, if this exception is related to some issue in response
   format or with respect to querying non-solr server from solrj.
  
   Let me know your thoughts
  
   Thanks,
   Prashant
  
 



Distributed commits in CloudSolrServer

2014-04-15 Thread Peter Keegan
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
ZKs. The Solr indexes are behind a load balancer. There is one
CloudSolrServer client updating the indexes. The index schema includes 3
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I
observe that the commits occur sequentially, not in parallel, on the leader
and replica. The duration of each commit is about a minute. Most of this
time is spent reloading the 3 ExternalFileField files. Because of the
sequential commits, there is a period of time (1 minute+) when the index
searchers will return different results, which can cause a bad user
experience. This will get worse as replicas are added to handle
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not
in parallel? Is there a way to change this behavior?

2. If instead, the commits were done in parallel by a separate client via a
GET to each Solr instance, how would this client get the host/port values
for each Solr instance from zookeeper? Are there any downsides to doing
commits this way?

Thanks,
Peter


Re: multiple analyzers for one field

2014-04-15 Thread Michael Sokolov

Ha! You were right.  Thanks for the nudge; here's my post:

http://blog.safariflow.com/2014/04/15/search-suggestions-with-solr-2/

there's code at http://github.com/safarijv/ifpress-solr-plugin

cheers

-Mike

On 04/15/2014 08:18 AM, Alexandre Rafalovitch wrote:

Your call, though from experience thus sounds like either two or no blog
posts. I certainly have killed a bunch of good articles by waiting for
perfection:-)
On 15/04/2014 7:01 pm, Michael Sokolov msoko...@safaribooksonline.com
wrote:


A blog post is a great idea, Alex!  I think I should wait until I have a
complete end-to-end implementation done before I write about it though,
because I'd also like to include some tips about configuring the new
suggesters with Solr (the documentation on the wiki hasn't quite caught up
yet, I think), and I don't have that working as I'd like just yet.  But I
will follow up with something soon; probably I will be able to share code
on a public repo.

-Mike

On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:


Hi Mike,

Glad I was able to help. Good note about the PoolingReuseStrategy, I
did not think of that either.

   Is there a blog post or a GitHub repository coming with more details
on that? Sounds like something others may benefit from as well.

Regards,
 Alex.
P.s. If you don't have your own blog, I'll be happy to host such
article on mine.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:


I lost the original thread; sorry for the new / repeated topic, but
thought
I would follow up to let y'all know that I ended up implementing Alex's
idea
to implement an UpdateRequestProcessor in order to apply different
analysis
to different fields when doing something analogous to copyFields.

It was pretty straightforward except that when there are multiple
values, I
ended up needing multiple copies of the same Analyzer.  I had to
implement a
new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
foreseen.

-Mike





Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Dmitry Kan
Alexey,

1. Can you take a backup of the index and run the index checker with the -fix
option? Does it modify the index at all?
2. Are all the missing fields configured as stored? Are they marked as
required in the schema, or optional?

Dmitry


On Tue, Apr 15, 2014 at 7:22 PM, Alexey Kozhemiakin 
alexey_kozhemia...@epam.com wrote:

 The system was up and running for long time(months) without any updates.
 There was no crashes for sure, at least support team says so.
 Logs indicate that at some point there was not enough disk space (caused
 by weekend index optimization).


 Were there any other similar cases or it's unique for us?


 Alexey.
 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Tuesday, April 15, 2014 18:50
 To: solr-user@lucene.apache.org
 Subject: Re: Empty documents in Solr\lucene 3.6

 Do you know for sure that the index was OK at some point?  Do you know
 what might have happened when it became not OK, like a system crash?

 If you have Solr logs from whatever event caused the problem, we might be
 able to figure it out ... but if you don't know when it happened or you
 don't have logs, it might not be possible to know what happened.
 The document may have simply been indexed incorrectly.

 Thanks,
 Shawn




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: What is Overseer?

2014-04-15 Thread Jack Krupansky
I should have suggested three levels in my question: 1) important to average 
users, 2) expert-only, and 3) internal implementation detail. Yes, 
expert-only does have a place, but it is good to mark features as such.


-- Jack Krupansky

-Original Message- 
From: Chris Hostetter

Sent: Tuesday, April 15, 2014 1:48 PM
To: solr-user@lucene.apache.org
Subject: Re: What is Overseer?


: So, is Overseer really only an implementation detail or something that 
Solr

: Ops guys need to be very aware of?

Most people don't ever need to worry about the overseer - it's magic and
it will take care of itself.

The recent work on adding support for an overseer role in 4.7 was
specifically for people who *want* to worry about it.

I've updated several places in the solr ref guide to remove some
missleading claims about hte overseer (some old docs equated it to running
embedded zookeeper) and add some more info to the glossary..

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole



-Hoss
http://www.lucidworks.com/ 



Re: Distributed commits in CloudSolrServer

2014-04-15 Thread Mark Miller
Inline responses below.
-- 
Mark Miller
about.me/markrmiller

On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote:

I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 
ZKs. The Solr indexes are behind a load balancer. There is one 
CloudSolrServer client updating the indexes. The index schema includes 3 
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I 
observe that the commits occur sequentially, not in parallel, on the leader 
and replica. The duration of each commit is about a minute. Most of this 
time is spent reloading the 3 ExternalFileField files. Because of the 
sequential commits, there is a period of time (1 minute+) when the index 
searchers will return different results, which can cause a bad user 
experience. This will get worse as replicas are added to handle 
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user 
queries. 

My questions: 

1. Is there a reason that the distributed commits are done in sequence, not 
in parallel? Is there a way to change this behavior? 


The reason is that updates are currently done this way - it’s the only safe way 
to do it without solving some more problems. I don’t think you can easily 
change this. I think we should probably file a JIRA issue to track a better 
solution for commit handling. I think there are some complications because of 
how commits can be added on update requests, but it's something we probably want 
to try and solve before tackling *all* updates to replicas in parallel with the 
leader.



2. If instead, the commits were done in parallel by a separate client via a 
GET to each Solr instance, how would this client get the host/port values 
for each Solr instance from zookeeper? Are there any downsides to doing 
commits this way? 

Not really, other than the extra management.
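
For the mechanics of question 2, a minimal, untested sketch (SolrJ 4.x; the
collection name and ZooKeeper hosts below are made up) of reading the replica
URLs from ZooKeeper cluster state and committing each core directly:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

import java.util.ArrayList;
import java.util.List;

public class ParallelCommitter {
  public static void main(String[] args) throws Exception {
    CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    cloud.connect();
    ClusterState state = cloud.getZkStateReader().getClusterState();

    List<Thread> workers = new ArrayList<Thread>();
    for (Slice slice : state.getSlices("collection1")) {
      for (Replica replica : slice.getReplicas()) {
        // base_url is like http://host:8983/solr, core is the core name on that node
        final String coreUrl = replica.getStr(ZkStateReader.BASE_URL_PROP)
            + "/" + replica.getStr(ZkStateReader.CORE_NAME_PROP);
        Thread t = new Thread() {
          public void run() {
            HttpSolrServer core = new HttpSolrServer(coreUrl);
            try {
              // Note: depending on version/params, a commit sent to one core may
              // still be distributed by Solr itself - verify against your setup.
              core.commit();
            } catch (Exception e) {
              e.printStackTrace();  // real code should retry/log properly
            } finally {
              core.shutdown();
            }
          }
        };
        workers.add(t);
        t.start();
      }
    }
    for (Thread t : workers) {
      t.join();  // wait for all per-core commits to finish
    }
    cloud.shutdown();
  }
}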





Thanks, 
Peter 


Transformation on a numeric field

2014-04-15 Thread Jean-Sebastien Vachon
Hi All,

I am looking for a way to index a numeric field and its value divided by 1 000 
into another numeric field.
I thought about using a CopyField with a PatternReplaceFilterFactory to keep 
only the first few digits (cutting the last three).

Solr complains that I can not have an analysis chain on a numeric field:

Core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Plugin init failure for [schema.xml] fieldType truncated_salary: FieldType: 
TrieIntField (truncated_salary) does not support specifying an analyzer. Schema 
file is /data/solr/solr-no-cloud/Core1/schema.xml


Is there a way to accomplish this ?

Thanks


Re: Transformation on a numeric field

2014-04-15 Thread Rafał Kuć
Hello!

You can achieve that using update processor, for example look here: 
http://wiki.apache.org/solr/ScriptUpdateProcessor

What you would have to do, in general, is create a script that takes
the value of the field, divides it by 1000, and puts it in another
field - the target numeric field.

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


 Hi All,

 I am looking for a way to index a numeric field and its value
 divided by 1 000 into another numeric field.
 I thought about using a CopyField with a
 PatternReplaceFilterFactory to keep only the first few digits (cutting the 
 last three).

 Solr complains that I can not have an analysis chain on a numeric field:

 Core:
 org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
 Plugin init failure for [schema.xml] fieldType truncated_salary:
 FieldType: TrieIntField (truncated_salary) does not support
 specifying an analyzer. Schema file is
 /data/solr/solr-no-cloud/Core1/schema.xml


 Is there a way to accomplish this ?

 Thanks



Re: Transformation on a numeric field

2014-04-15 Thread Jack Krupansky
You can use an update processor. The stateless script update processor will 
let you write arbitrary JavaScript code, which can do this calculation.


You should be able to figure it  out from the wiki:
http://wiki.apache.org/solr/ScriptUpdateProcessor

My e-book has plenty of script examples for this processor as well.

We could also write a generic script that takes a source and destination 
field name and then does a specified operation on it, like add an offset or 
multiply by a scale factor.
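
If you would rather avoid scripting altogether, the same calculation can be
done in a small custom UpdateRequestProcessor. A rough, untested sketch against
the Solr 4.x API - the source field name "salary" is assumed here, and
"truncated_salary" is the target field type mentioned in the original post:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DivideFieldUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue("salary");
        if (value != null) {
          long salary = Long.parseLong(value.toString());
          doc.setField("truncated_salary", salary / 1000);  // drop the last three digits
        }
        super.processAdd(cmd);  // hand the document on to the rest of the chain
      }
    };
  }
}

It would still need to be registered in an updateRequestProcessorChain in
solrconfig.xml, just like the script processor.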


-- Jack Krupansky

-Original Message- 
From: Jean-Sebastien Vachon

Sent: Tuesday, April 15, 2014 3:57 PM
To: 'solr-user@lucene.apache.org'
Subject: Transformation on a numeric field

Hi All,

I am looking for a way to index a numeric field and its value divided by 1 
000 into another numeric field.
I thought about using a CopyField with a PatternReplaceFilterFactory to keep 
only the first few digits (cutting the last three).


Solr complains that I can not have an analysis chain on a numeric field:

Core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Plugin init failure for [schema.xml] fieldType truncated_salary: 
FieldType: TrieIntField (truncated_salary) does not support specifying an 
analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml



Is there a way to accomplish this ?

Thanks 



Odd extra character duplicates in spell checking

2014-04-15 Thread Ed Smiley
Hi,
I am going to make this question pretty short, so I don’t overwhelm with 
technical details until  the end.
I suspect that some folks may be seeing this issue without the particular 
configuration we are using.

What our problem is:

  1.  Correctly spelled words are coming back flagged as not spelled correctly, with 
the original, correctly spelled word plus a single oddball character appended, offered 
as multiple suggestions.
  2.  Incorrectly spelled words are returning correct spelling suggestions, each with 
a single oddball character appended, as multiple suggestions.
  3.  We’re seeing this in Solr 4.5x and 4.7x.

Example:

The returned corrections each differ only by a single appended character (unicode shown in square brackets).

correction=attitude[2d]
correction=attitude[2f]
correction=attitude[2026]

Spurious characters:

  *   Unicode Character 'HYPHEN-MINUS' (U+002D)
  *   Unicode Character 'SOLIDUS' (U+002F)
  *   Unicode Character 'HORIZONTAL ELLIPSIS' (U+2026)

Anybody see anything like this?  Anybody fix something like this?

Thanks!
—Ed


OK, here’s the gory details:


What we are doing:
We have developed an application that returns "did you mean" spelling 
alternatives against a specific (presumably misspelled) word.
We’re using the vocabulary of indexed pages of a specified book as the source 
of the alternatives, so this is not a general dictionary spell check, we are 
returning only matching alternatives.
So when I say “correctly spelled” I mean they are words found on at least one 
page.  We are using the collations, so that we restrict ourselves to those 
pages in one book.
We are having to check for and “fix up” these faulty results.  That’s not a 
robust or desirable solution.

We are using SolrJ to get the collations:

  private static final String DID_YOU_MEAN_REQUEST_HANDLER = "/spell";
  ...
  SolrQuery query = new SolrQuery(q);
  query.set("spellcheck", true);
  query.set(SpellingParams.SPELLCHECK_COUNT, 10);
  query.set(SpellingParams.SPELLCHECK_COLLATE, true);
  query.set(SpellingParams.SPELLCHECK_COLLATE_EXTENDED_RESULTS, true);
  query.set("wt", "json");
  query.setRequestHandler(DID_YOU_MEAN_REQUEST_HANDLER);
  query.set("shards.qt", DID_YOU_MEAN_REQUEST_HANDLER);
  query.set("shards.tolerant", true);
  etc.

but we can duplicate the behavior without SolrJ; the collations /
misspellingsAndCorrections in the response below show it, e.g.:
solr/pg1/spell?q=+doc-id:(810500)+AND+attitudex&spellcheck=true&spellcheck.count=10&spellcheck.collate=true&spellcheck.collateExtendedResults=true&wt=json&qt=%2Fspell&shards.qt=%2Fspell&shards.tolerant=true


{"responseHeader":{"status":0,"QTime":60},
 "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},
 "spellcheck":{"suggestions":[
   "attitudex",{"numFound":6,"startOffset":21,"endOffset":30,"origFreq":0,
     "suggestion":[{"word":"attitudes","freq":362486},{"word":"attitu dex","freq":4819},
       {"word":"atti tudex","freq":3254},{"word":"attit udex","freq":159},
       {"word":"attitude-","freq":1080},{"word":"attituden","freq":261}]},
   "correctlySpelled",false,
   "collation",["collationQuery","doc-id:(810500) AND attitude-","hits",2,
     "misspellingsAndCorrections",["attitudex","attitude-"]],
   "collation",["collationQuery","doc-id:(810500) AND attitude/","hits",2,
     "misspellingsAndCorrections",["attitudex","attitude/"]],
   "collation",["collationQuery","doc-id:(810500) AND attitude…","hits",2,
     "misspellingsAndCorrections",["attitudex","attitude…"]]]}}

The configuration is:

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">text</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">25</int>
  <int name="minBreakLength">3</int>
</lst>

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.2</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">25</int>
  <int name="minQueryLength">4</int>
  <float name="maxQueryFrequency">1</float>
</lst>

--

Ed Smiley, Senior Software Architect, eBooks
ProQuest | 161 E Evelyn Ave|
Mountain View, CA 94041 | USA |
+1 650 475 8700 extension 3772

Re: svn vs GIT

2014-04-15 Thread Jeff Wartes

I guess I should've double-checked it was still the case before saying
anything, but I'm glad to be proven wrong.
Yes, it worked nicely for me when I tried today, which should simplify my
life a bit.


On 4/14/14, 4:35 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/14/2014 12:56 PM, Ramkumar R. Aiyengar wrote:
 ant compile / ant -f solr dist / ant test certainly work, I use them
with a
 git working copy. You trying something else?
 On 14 Apr 2014 19:36, Jeff Wartes jwar...@whitepages.com wrote:

 I vastly prefer git, but last I checked, (admittedly, some time ago)
you
 couldn't build the project from the git clone. Some of the build
scripts
 assumed some svn commands will work.

The nightly-smoke build target uses svn.  There is a related smoketest
script that uses provided URL parameters (or svn if it's a checkout from
svn and the parameters are not supplied) to obtain artifacts for
testing.  This may not be the only build target that uses facilities not
available from git, but it's the only one that I know about for sure.

Ordinary people should be able to use repositories cloned from the
git.apache.org or github mirrors with no problem if they are not using
exotic build targets or build scripts.

When I tried 'ant precommit' it worked, but it did say at least once in
what scrolled by that this was not an SVN checkout, so the
'-check-svn-working-copy' build target (which is part of precommit)
didn't work.

Thanks,
Shawn




Re: cache warming questions

2014-04-15 Thread Erick Erickson
bq: What does it mean that items will be regenerated or prepopulated
from the current searcher's cache...

You're right, the values aren't cached. They can't be since the
internal Lucene document id is used to identify docs, and due to
merging the internal ID may bear no relation to the old internal ID
for a particular document.

I find it useful to think of Solr's caches as a  map where the key is
the query and the value is some representation of the found
documents. The details of the value don't matter, so I'll skip them.

What matters is the key. Consider the filter cache. You put something
like fq=price:[0 TO 100] on a URL. Solr then uses the fq  clause as
the key to the filterCache.

Here's the sneaky bit. When you specify an autowarm count of N for the
filterCache, when a new searcher is opened the first N keys from the
map are re-executed in the new searcher's context and the results put
into the new searcher's filterCache.

bq:  ...how does auto warming and explicit warming work together?

They're orthogonal. IOW, the autowarming for each cache is executed as
well as the newSearcher static warming queries. Use the static queries
to do things like fill the sort caches etc.
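
For example, a typical solrconfig.xml arrangement combines an autowarmCount on
the cache with static warming queries; the cache sizes, query, and sort field
below are illustrative only:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
  </arr>
</listener>

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
    <lst><str name="q">static warming query</str></lst>
  </arr>
</listener>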

Incidentally, this bears on why there's a firstSearcher and
newSearcher. The newSearcher queries are run in addition to the
cache autowarms. firstSearcher static queries are only run when a Solr
server is started the first time, and there are no cache entries to
autowarm. So the firstSearcher queries might be quite a bit more
complex than newSearcher queries.

HTH,
Erick

On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper matt.kui...@issinc.com wrote:
 Hello,

 I have a few questions regarding how Solr caches are warmed.

 My understanding is that there are two ways to warm internal Solr caches 
 (only one way for document cache and lucene FieldCache):

 Auto warming - occurs when there is a current searcher handling requests and 
 new searcher is being prepared.  When a new searcher is opened, its caches 
 may be prepopulated or autowarmed with cached object from caches in the old 
 searcher. autowarmCount is the number of cached items that will be 
 regenerated in the new searcher.
 http://wiki.apache.org/solr/SolrCaching#autowarmCount

 Explicit warming - where the static warming queries specified in 
 Solrconfig.xml for newSearcher and firstSearcher listeners are executed when 
 a new searcher is being prepared.

 What does it mean that items will be regenerated or prepopulated from the 
 current searcher's cache to the new searcher's cache?  I doubt it means copy, 
 as the index has likely changed with a commit and possibly invalidated some 
 contents of the cache.  Are the queries, or filters, that define the contents 
 of the current caches re-executed for the new searcher's caches?

 For the case where auto warming is configured, a current searcher is active, 
 and static warming queries are defined how does auto warming and explicit 
 warming work together? Or do they?  Is only one type of warming activated to 
 fill the caches?

 Thanks,
 Matt


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-15 Thread Steve Davids
I have also experienced a similar problem on our cluster, so I went ahead and 
opened SOLR-5986 to track the issue. I know Apache Blur has implemented a 
mechanism to kill these long-running term enumerations; it would be fantastic if 
Solr could get a similar mechanism.

-Steve

On Apr 15, 2014, at 5:23 AM, Salman Akram salman.ak...@northbaysolutions.net 
wrote:

 Looking at this, sharding seems to be the best and simplest option to handle such
 queries.
 
 
 On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:
 
 Hello Salman,
 Let's me drop few thoughts on
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
 
 There two aspects of this question:
 1. dealing with long running processing (thread divergence actions
 http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and
 2. an actual time checking.
 terminating or aborting thread (2.) are just a way to tracking time
 externally, and send interrupt() which the thread should react on, which
 they don't do now, and we returning to the core issue (1.)
 
 Solr's timeAllowed is the proper way to handle these things; the only
 problem is that it assumes only the core search is long running, but in
 your case rewriting MultiTermQuery-s takes a huge amount of time.
 Let's consider this problem. First of all, MultiTermQuery.rewrite() is nearly a
 design issue: after the heavy rewrite occurs, its result is thrown away once the
 search is done. I think the most straightforward way to address this
 issue is by caching these expensive queries. Solr does it well
 (http://wiki.apache.org/solr/CommonQueryParameters#fq), however only for
 http://en.wikipedia.org/wiki/Conjunctive_normal_form -like queries; there
 is a workaround that allows caching disjunction legs, see
 http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
 If you still want to run expensively rewritten queries, you need to
 implement a timeout check (similar to TimeLimitingCollector) for the TermsEnum
 returned from MultiTermQuery.getTermsEnum(). Wrapping the actual TermsEnums
 is the good way; to inject a time-limiting wrapper TermsEnum into queries,
 you might consider overriding methods like
 SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query tree
 after parsing.
 
 
 
 On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:
 
 Anyone?
 
 
 On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:
 
 With reference to this thread
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
 I
 wanted to know if there was any response to that or if Chris Harris
 himself can comment on what he ended up doing, that would be great!
 
 
 --
 Regards,
 
 Salman Akram
 
 
 
 
 --
 Regards,
 
 Salman Akram
 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com
 
 
 
 
 -- 
 Regards,
 
 Salman Akram
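
To illustrate the wrapper idea Mikhail describes above - wrapping the TermsEnum
returned from MultiTermQuery.getTermsEnum() with a time check - here is a rough,
untested sketch against the Lucene 4.x API; the class name and the exception
used are illustrative only:

import java.io.IOException;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Aborts term enumeration once a deadline has passed. The unchecked exception
// is illustrative; the request layer has to catch it and degrade gracefully.
public class TimeLimitingTermsEnum extends FilterAtomicReader.FilterTermsEnum {
  private final long deadlineNanos;

  public TimeLimitingTermsEnum(TermsEnum in, long timeoutMillis) {
    super(in);
    this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
  }

  @Override
  public BytesRef next() throws IOException {
    if (System.nanoTime() > deadlineNanos) {
      throw new RuntimeException("term enumeration exceeded time limit");
    }
    return super.next();
  }
}

A subclassed query (or an overridden SolrQueryParserBase.newWildcardQuery)
would then have to return this wrapper from its getTermsEnum(), and the request
layer would need to turn the exception into a partial-results response.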



Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
Hi Gurus,

In my Solr cluster I've got multiple shards, each shard containing
~500,000,000 documents, with the total index size being ~1 TB.

I was just wondering how much more can I keep on adding to the shard before
we reach a tipping point and the performance starts to degrade?

Also, as a best practice, what is the recommended number of docs / size of a shard?

Txz in advance :)

-- 
Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*


Re: deleting large amount data from solr cloud

2014-04-15 Thread Vinay Pothnis
Another update:

I removed the replicas - to avoid the replication doing a full copy. I am
able delete sizeable chunks of data.
But the overall index size remains the same even after the deletes. It does
not seem to go down.

I understand that Solr would do this in the background - but I don't see
the decrease in overall index size even after 1-2 hours.
I can see a bunch of .del files in the index directory, but they do
not seem to get cleaned up. Is there any way to monitor/follow the progress
of index compaction?

Also, does triggering optimize from the admin UI help to compact the
index size on disk?
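
For reference, optimize (from the UI, over HTTP, or via SolrJ) forces merging,
and merging is what actually rewrites segments without their deleted documents;
a commit with expungeDeletes=true is a lighter option that only merges segments
that have deletions (subject to the merge policy). A rough SolrJ 4.x sketch,
with a placeholder core URL:

// Classes are in org.apache.solr.client.solrj.*; URL is a placeholder.
HttpSolrServer core = new HttpSolrServer("http://host1:8983/solr/core1_shard1_replica2");

// Full optimize: merges segments down to one and rewrites the index, dropping
// deleted documents, but it needs free disk roughly the size of the index.
core.optimize(true, true, 1);

// Lighter alternative: commit with expungeDeletes=true.
UpdateRequest req = new UpdateRequest();
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
req.setParam("expungeDeletes", "true");
req.process(core);

core.shutdown();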

Thanks
Vinay


On 14 April 2014 12:19, Vinay Pothnis poth...@gmail.com wrote:

 Some update:

 I removed the auto warm configurations for the various caches and reduced
 the cache sizes. I then issued a call to delete a day's worth of data (800K
 documents).

 There was no out of memory this time - but some of the nodes went into
 recovery mode. Was able to catch some logs this time around and this is
 what i see:

 
 *WARN  [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
 PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many updates received since start -
 startingUpdates no longer overlaps with our currentUpdates*
 *INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
 PeerSync Recovery was not successful - trying replication.
 core=core1_shard1_replica2*
 *INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
 Starting Replication Recovery. core=core1_shard1_replica2*
 *INFO  [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
 Begin buffering updates. core=core1_shard1_replica2*
 *INFO  [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
 Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/. core=core1_shard1_replica2*
 *INFO  [2014-04-14 18:11:00.536]
 [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
 client,
 config:maxConnections=128maxConnectionsPerHost=32followRedirects=false*
 *INFO  [2014-04-14 18:11:01.964]
 [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
 client,
 config:connTimeout=5000socketTimeout=2allowCompression=falsemaxConnections=1maxConnectionsPerHost=1*
 *INFO  [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller]  No
 value set for 'pollInterval'. Timer Task not started.*
 *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
 Master's generation: 1108645*
 *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
 Slave's generation: 1108627*
 *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
 Starting replication process*
 *INFO  [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
 Number of files in latest index in master: 814*
 *INFO  [2014-04-14 18:11:02.007]
 [org.apache.solr.core.CachingDirectoryFactory] return new directory for
 /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007*
 *INFO  [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
 Starting download to
 NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
 maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true*

 


 So, it looks like the number of updates is too huge for the regular
 replication and then it goes into full copy of index. And since our index
 size is very huge (350G), this is causing the cluster to go into recovery
 mode forever - trying to copy that huge index.

 I also read in some thread
 http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
 that there is a limit of 100 documents.

 I wonder if this has been updated to make that configurable since that
 thread. If not, the only option I see is to do a trickle delete of 100
 documents per second or something.

 Also - the other suggestion of using distributed=false might not help
 because the issue currently is that the replication is going to full copy.

 Any thoughts?

 Thanks
 Vinay







 On 14 April 2014 07:54, Vinay Pothnis poth...@gmail.com wrote:

 Yes, that is our approach. We did try deleting a day's worth of data at a
 time, and that resulted in OOM as well.

 Thanks
 Vinay


 On 14 April 2014 00:27, Furkan KAMACI furkankam...@gmail.com wrote:

 Hi;

 I mean you can divide the range (i.e. one week at each delete instead of
 one month) and try to check whether you still get an OOM or not.

 Thanks;
 Furkan KAMACI


 2014-04-14 7:09 GMT+03:00 Vinay Pothnis poth...@gmail.com:

  Aman,
  Yes - Will do!
 
  Furkan,
  How do you mean by 'bulk delete'?
 
  -Thanks
  Vinay
 
 
  On 12 April 2014 14:49, Furkan KAMACI furkankam...@gmail.com wrote:
 
   Hi;
  
   Do you get any problems when you index 

Re: Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Vinay Pothnis
You could look at this link to understand the factors that affect
SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems

Especially the sections about RAM and disk cache. If the index grows too
big for one node, it can lead to performance issues. From the looks of it,
500mil docs per shard may already be pushing it. How much does that
translate to in terms of index size on disk per shard?

-vinay


On 15 April 2014 21:44, Mukesh Jha me.mukesh@gmail.com wrote:

 Hi Gurus,

 In my solr cluster I've multiple shards and each shard containing
 ~500,000,000 documents total index size being ~1 TB.

 I was just wondering how much more can I keep on adding to the shard before
 we reach a tipping point and the performance starts to degrade?

 Also as best practice what is the recomended no of docs / size of shards .

 Txz in advance :)

 --
 Thanks  Regards,

 *Mukesh Jha me.mukesh@gmail.com*



Re: Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
My index size per shard varies between 250 GB and 1 TB.
The cluster is performing well even now, but I thought it's high time to
change it, so that a shard doesn't get too big.


On Wed, Apr 16, 2014 at 10:25 AM, Vinay Pothnis poth...@gmail.com wrote:

 You could look at this link to understand about the factors that affect the
 solrcloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems

 Especially, the sections about RAM and disk cache. If the index grows too
 big for one node, it can lead to performance issues. From the looks of it,
 500mil docs per shard - may be already pushing it. How much does that
 translate to in terms of index size on disk per shard?

 -vinay


 On 15 April 2014 21:44, Mukesh Jha me.mukesh@gmail.com wrote:

  Hi Gurus,
 
  In my solr cluster I've multiple shards and each shard containing
  ~500,000,000 documents total index size being ~1 TB.
 
  I was just wondering how much more can I keep on adding to the shard
 before
  we reach a tipping point and the performance starts to degrade?
 
  Also as best practice what is the recomended no of docs / size of shards
 .
 
  Txz in advance :)
 
  --
  Thanks  Regards,
 
  *Mukesh Jha me.mukesh@gmail.com*
 




-- 


Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*