Re: Fix sort order within an index ?

2013-10-08 Thread Upayavira


On Mon, Oct 7, 2013, at 11:09 PM, user 01 wrote:
 Any way to store documents in a fixed sort order within the indexes of
 certain fields(either the arrival order or sorted by int ids, that also
 serve as my unique key), so that I could store them optimized for
 browsing
 lists of items ?
 
 The order for browsing is always fixed & there are no further filter
 queries. Just I need to fetch the top 20 (most recently added) document
 with field value topic=x1
 
 I came across this article & a JIRA issue which encouraged me that
 something like this may be possible:
 
 http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
 
 https://issues.apache.org/jira/browse/LUCENE-4752

That ticket is an optimisation. If your IDs are sequential, you can sort
on them. Or you can add a timestamp field with a default of NOW, and
sort on that.
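
As a rough sketch of the second option (this is the field definition from the
stock example schema, which has a TrieDateField type named "date"):

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>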

q=topic:x1&rows=20&sort=id desc
Or
q=topic:x1&rows=20&sort=timestamp desc

Will get you what you ask for.

The above ticket might just make it a little faster.

Upayavira


Re: How to round solr score ?

2013-10-08 Thread Mamta Thakur
Thanks for your replies.
I am actually using the frange approach for now. The only downside I see there 
is that it makes the function call twice, calling createWeight() twice, and so my 
social connections are evaluated twice, which is quite a heavy operation. So I was 
wondering if I could get away with just one call.
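
For what it's worth, the double evaluation shows up in a minimal sketch of that
kind of frange filtering over the main query's score (the actual
social-connection function from this thread isn't shown, so this is only
illustrative):

  q=text:shoes&fq={!frange l=5}query($q)

Here query($q) re-parses and re-scores the main query inside the filter, so its
createWeight() runs once for q and once for the frange function.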





SolrCloud shard splitting keeps failing

2013-10-08 Thread Kalle Aaltonen
I have a test system where I have an index of 15M documents in one shard
that I would like to split in two. I've tried it four times now. I have a
stand-alone ZooKeeper running on the same machine.

The end result is that I have two new shards with state "construction", and
each has one replica which is down.

Two of the attempts failed because of heapspace. Now the heap size is 24GB.
I can't figure out from the logs what is going on.

I've attached a log of the latest attempt. Any help would be much
appreciated.

- Kalle Aaltonen


splitfail3.txt.gz
Description: GNU Zip compressed data


DIH with SolrCloud

2013-10-08 Thread Prasi S
Hi,
I have set up SolrCloud with Solr 4.4. The cloud has 2 Tomcat instances with
a separate ZooKeeper.

I execute the below command in the URL:

http://localhost:8180/solr/colindexer/dataimportmssql?command=full-import&commit=true&clean=false


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config-mssql.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2013-10-08 10:55:27</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken">0:0:1.585</str>
  </lst>
  <str name="WARNING">
    This response format is experimental. It is likely to change in the future.
  </str>
</response>

I don't get the "Indexing completed. Added ... documents" status message at
all. Also, when I check the dataimport page in the Solr admin, I get the below
status, and no documents are indexed.


[image: Inline image 1]

Not sure of the problem.


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Harald Kirsch

Hello Kalle,

we noticed the same problem some weeks ago:

http://lucene.472066.n3.nabble.com/Share-splitting-at-23-million-documents-gt-OOM-td4085064.html

Would be interesting to hear if there is more positive feedback this time.

We finally concluded that it may be worth starting with many shards 
right away. And as they grow, they can be distributed to other machines. 
This works, as we have tested (though not yet in production).


Regards,
Harald.

On 08.10.2013 08:43, Kalle Aaltonen wrote:


I have a test system where I have a index of 15M documents in one shard
that I would like to split in two. I've tried it four times now. I have
a stand-alone zookeeper running on the same machine.

The end result is that I have two new shards with state construction,
and each has one replica which is down.

Two of the attempts failed because of heapspace. Now the heap size is
24GB. I can't figure out from the logs what is going on.

I've attached a log of the latest attempt. Any help would be much
appreciated.

- Kalle Aaltonen





Regex to match one of two words

2013-10-08 Thread Dinusha Dilrukshi
I have an input that can have only 2 values Published or Deprecated. What
regular expression can I use to ensure that either of the two words was
submitted?

I tried with different regular expressions (as in [1], [2]) that
contain the most generic syntax, but Solr throws a parser exception when
validating these expressions. Could someone help me write a
regular expression that will be evaluated by the Solr parser?

[1] /^(PUBLISHED)?(DEPRECATED)?$/
[2] /(PUBLISHED)?(DEPRECATED)?/


SolrCore org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse
'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
first character in WildcardQuery
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)


Regards,
Dinusha.


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Shalin Shekhar Mangar
Hi Kalle,

The problem here is that certain actions are taking too long causing the
split process to terminate in between. For example, a commit on the parent
shard leader took 83 seconds in your case but the read timeout value is set
to 60 seconds only. We actually do not need to open a searcher during this
commit. I'll open an issue and attach a fix.

Longer term we need to introduce asynchronous commands so that status can
be reported in a better way.


On Tue, Oct 8, 2013 at 12:13 PM, Kalle Aaltonen
kalle.aalto...@zemanta.comwrote:


 I have a test system where I have a index of 15M documents in one shard
 that I would like to split in two. I've tried it four times now. I have a
 stand-alone zookeeper running on the same machine.

 The end result is that I have two new shards with state construction,
 and each has one replica which is down.

 Two of the attempts failed because of heapspace. Now the heap size is
 24GB. I can't figure out from the logs what is going on.

 I've attached a log of the latest attempt. Any help would be much
 appreciated.

 - Kalle Aaltonen






-- 
Regards,
Shalin Shekhar Mangar.


Re: DIH with SolrCloud

2013-10-08 Thread Raymond Wiker
It looks like your select statement does not return any rows... have you
verified it with some sort of SQL client?


On Tue, Oct 8, 2013 at 8:57 AM, Prasi S prasi1...@gmail.com wrote:

 Hi ,
 I have setup solrcloud with solr4.4. The cloud has 2 tomcat instances with
 separate zookeeper.

  i execute the below command in the url,


 http://localhost:8180/solr/colindexer/dataimportmssql?command=full-importcommit=trueclean=false


 response
 lst name=responseHeader
 int name=status0/int
 int name=QTime0/int
 /lst
 lst name=initArgs
 lst name=defaults
 str name=configdata-config-mssql.xml/str
 /lst
 /lst
 str name=commandstatus/str
 str name=statusidle/str
 str name=importResponse/
 lst name=statusMessages
 str name=Total Requests made to DataSource1/str
 str name=Total Rows Fetched0/str
 str name=Total Documents Skipped0/str
 str name=Full Dump Started2013-10-08 10:55:27/str
 str name=Total Documents Processed0/str
 str name=Time taken0:0:1.585/str
 /lst
 str name=WARNING
 This response format is experimental. It is likely to change in the future.
 /str
 /response

 I dont get Indexing completed. added  documents ...  status message at
 all. Also, when i check the dataimport in Solr admin page,get the below
 status. and no documents are indexed.


 [image: Inline image 1]

 Not sure of the problem.



SolrCloud+Tomcat 3 win VMs, 3 shards * 2 replica

2013-10-08 Thread magnum87
Hello,
I'm trying to deploy, using SolrCloud, a cluster of 3 Windows VMs, each
with an instance of Solr running on a Tomcat container AND with an external
ZooKeeper (3.4.5) (so 3 ZK + 3 Solr). I'm using Solr 4.2; the original conf
is multi-core (6 different cores).

I tried to set up a configuration of 3 shards each with 2 replica (1
original + 1), so that:
* VM1 -- shards 1,2
* VM2 -- shards 2,3
* VM3 -- shards 1,3

After days of googling, reading documentation (in particular
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
and http://wiki.apache.org/solr/SolrCloudTomcat ) and browsing
forums, I still can't find the solution.
Apparently the only way to force 2 shards on the same machine is to use the
Collections API (otherwise I could only deploy 3 shards * 1 replica, using
numShards, or 1 shard * 3 replicas).
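
For reference, the Collections API call for that layout would look roughly like
this (host, port and names are illustrative):

  http://vm1:8080/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf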

After several attempts (almost all combinations of adding/removing
bootstrap_conf=true, solr.xml persistent true/false, removing/leaving 'core'
tags in solr.xml, using DELETE/RELOAD/CREATE on collections) I managed to
deploy this configuration using bootstrap_conf=true and DELETEing and CREATEing
each collection, but when I stop the Solr service and then start it again, it
does not work (with or without bootstrap_conf etc.).
I think this is quite a standard use case; is there a simple solution avoiding
very ugly workarounds like deploying 2 Tomcats or more than 1 Solr per
Tomcat?

Thank you very much



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Tomcat-3-win-VMs-3-shards-2-replica-tp4094051.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH with SolrCloud

2013-10-08 Thread Prasi S
My select statement returns documents. I have checked the query in the SQL
server.

The problem is that the same configuration was working when given to the
default /dataimport handler. If I give it to the /dataimportmssql handler, I
get this type of behaviour.
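
For reference, a second DIH handler is normally registered in solrconfig.xml
along these lines (the config file name is taken from the response above, the
rest is a sketch):

  <requestHandler name="/dataimportmssql"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config-mssql.xml</str>
    </lst>
  </requestHandler>

It may be worth double-checking that this handler really points at the config
file you think it does.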


On Tue, Oct 8, 2013 at 1:28 PM, Raymond Wiker rwi...@gmail.com wrote:

 It looks like your select statement does not return any rows... have you
 verified it with some sort of SQL client?


 On Tue, Oct 8, 2013 at 8:57 AM, Prasi S prasi1...@gmail.com wrote:

  Hi ,
  I have setup solrcloud with solr4.4. The cloud has 2 tomcat instances
 with
  separate zookeeper.
 
   i execute the below command in the url,
 
 
 
 http://localhost:8180/solr/colindexer/dataimportmssql?command=full-importcommit=trueclean=false
 
 
  response
  lst name=responseHeader
  int name=status0/int
  int name=QTime0/int
  /lst
  lst name=initArgs
  lst name=defaults
  str name=configdata-config-mssql.xml/str
  /lst
  /lst
  str name=commandstatus/str
  str name=statusidle/str
  str name=importResponse/
  lst name=statusMessages
  str name=Total Requests made to DataSource1/str
  str name=Total Rows Fetched0/str
  str name=Total Documents Skipped0/str
  str name=Full Dump Started2013-10-08 10:55:27/str
  str name=Total Documents Processed0/str
  str name=Time taken0:0:1.585/str
  /lst
  str name=WARNING
  This response format is experimental. It is likely to change in the
 future.
  /str
  /response
 
  I dont get Indexing completed. added  documents ...  status message at
  all. Also, when i check the dataimport in Solr admin page,get the below
  status. and no documents are indexed.
 
 
  [image: Inline image 1]
 
  Not sure of the problem.
 



What is the full list of Solr Special Characters?

2013-10-08 Thread Furkan KAMACI
I found that:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

at that URL:
http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping+Special+Characters

I'm using Solr 4.5. Is there any full list of special characters to escape
inside my custom search API before making a request to SolrCloud?


Re: What is the full list of Solr Special Characters?

2013-10-08 Thread Furkan KAMACI
Actually I want to remove special characters and not send them into my
Solr indexes. I mean a user can send a special query, much like a SQL injection,
and I want to protect my system from such scenarios.
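
If the custom search API is built on SolrJ, one option (a sketch, assuming
SolrJ is on the classpath and "title" is an illustrative field name) is to
escape rather than strip the characters:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.util.ClientUtils;

  // Backslash-escapes every Lucene/Solr query-syntax special character
  String safe = ClientUtils.escapeQueryChars(userInput);
  SolrQuery query = new SolrQuery("title:" + safe);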


2013/10/8 Furkan KAMACI furkankam...@gmail.com

 I found that:

  + - && || ! ( ) { } [ ] ^ " ~ * ? : \

 at that URL:
 http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping+Special+Characters

 I'm using Solr 4.5 Is there any full list of special characters to escape
 inside my custom search API before making a request to SolrCloud?



Re: documents are not commited distributively in solr cloud tomcat with core discovery, range is null for shards in clusterstate.json

2013-10-08 Thread Liu Bo
I've solved this problem myself.

If you use core discovery, you must specify the numShards parameter in
core.properties, or else Solr won't allocate a hash range to each shard and
documents won't be distributed properly.
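
For example, a core.properties along these lines (values from the setup
described below, with numShards added):

  name=content
  collection=content_collection
  shard=shard1
  numShards=3
  loadOnStartup=true
  transient=false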

Using core discovery to set up SolrCloud in Tomcat is much easier and
cleaner than the CoreAdmin approach described in the wiki:
http://wiki.apache.org/solr/SolrCloudTomcat.

It cost me some time to move from Jetty to Tomcat, but I think our IT team
will like it this way. :)




On 6 October 2013 23:53, Liu Bo diabl...@gmail.com wrote:

 Hi all

 I've sent out this mail before, but I only subscribed to lucene-user but
 not solr-user at that time. Sorry for repeating if any and your help will
 be much of my appreciation.

 I'm trying out the tutorial about SolrCloud, and I managed to write my
 own plugin to import data from our set of databases. I use SolrWriter from
 the DataImporter package and the docs can be distributed and committed to shards.

 Everything works fine using Jetty from the Solr example, but when I move
 to Tomcat, SolrCloud seems not to be configured right, as the documents are
 just committed to the shard where the update request goes.

 The cause is probably that the range is null for the shards in clusterstate.json.
 The router is implicit instead of compositeId as well.

 Is there anything missed or configured wrong in the following steps? How
 could I fix it. Your help will be much of my appreciation.

 PS, the SolrCloudTomcat wiki page isn't up to date for 4.4 with core discovery; I'm
 trying this out after reading the SolrCloud, SolrCloudJboss, and CoreAdmin wiki
 pages.

 Here's what I've done and some useful logs:

 1. start three zookeeper server.
 2. upload configuration files to zookeeper, the collection name is
 content_collection
 3. start three tomcat instants on three server with core discovery

 a) core file:
  name=content
  loadOnStartup=true
  transient=false
  shard=shard1   (different on each server)
  collection=content_collection
 b) solr.xml

 <solr>
   <solrcloud>
     <str name="host">${host:}</str>
     <str name="hostContext">${hostContext:solr}</str>
     <int name="hostPort">8080</int>
     <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
     <str name="zkHost">10.199.46.176:2181,10.199.46.165:2181,10.199.46.158:2181</str>
     <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
   </solrcloud>

   <shardHandlerFactory name="shardHandlerFactory"
                        class="HttpShardHandlerFactory">
     <int name="socketTimeout">${socketTimeout:0}</int>
     <int name="connTimeout">${connTimeout:0}</int>
   </shardHandlerFactory>
 </solr>

 4. In the solr.log I see the three shards are recognized, and the
 SolrCloud admin can see that content_collection has three shards as well.
 5. I write documents to content_collection using my update request; the
 documents are only committed to the shard the request goes to. In the log I can
 see the DistributedUpdateProcessorFactory is in the processor chain and a
 distributed commit is triggered:

 INFO  - 2013-09-30 16:31:43.205;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 updata request processor factories:

 INFO  - 2013-09-30 16:31:43.206;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.LogUpdateProcessorFactory@4ae7b77

 INFO  - 2013-09-30 16:31:43.207;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.*DistributedUpdateProcessorFactory*
 @5b2bc407

 INFO  - 2013-09-30 16:31:43.207;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.RunUpdateProcessorFactory@1652d654

 INFO  - 2013-09-30 16:31:43.283; org.apache.solr.core.SolrDeletionPolicy;
 SolrDeletionPolicy.onInit: commits: num=1


 commit{dir=/home/bold/work/tomcat/solr/content/data/index,segFN=segments_1,generation=1}

 INFO  - 2013-09-30 16:31:43.284; org.apache.solr.core.SolrDeletionPolicy;
 newest commit generation = 1

 INFO  - 2013-09-30 16:31:43.440; *org.apache.solr.update.SolrCmdDistributor;
 Distrib commit to*:[StdNode: http://10.199.46.176:8080/solr/content/,
 StdNode: http://10.199.46.165:8080/solr/content/]
 params:commit_end_point=true&commit=true&softCommit=false&waitSearcher=true&expungeDeletes=false

 but the documents won't go to the other shards; the other shards only get a
 commit request with no documents:

 INFO  - 2013-09-30 16:31:43.841;
 org.apache.solr.update.DirectUpdateHandler2; start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

 INFO  - 2013-09-30 16:31:43.855; org.apache.solr.core.SolrDeletionPolicy;
 SolrDeletionPolicy.onInit: commits: num=1


 commit{dir=/home/bold/work/tomcat/solr/content/data/index,segFN=segments_1,generation=1}

 INFO  - 2013-09-30 16:31:43.855; org.apache.solr.core.SolrDeletionPolicy;
 newest commit 

Re: Improving indexing performance

2013-10-08 Thread Matteo Grolla
Thanks Erick,
I think I have been able to exhaust a resource:
if I split the data in 2 and upload it with 2 clients like benchmark
1.1, it takes 120s; here the bottleneck is my LAN.
If I use a setting like benchmark 1, the bottleneck is probably the
ramBuffer.

I'm going to buy a Gigabit ethernet cable so I can make a better test.

OutOfMemory error: it's the solrj client that crashes.
I'm using Solr 4.2.1 and the corresponding solrj client.
HttpSolrServer works fine;
ConcurrentUpdateSolrServer gives me problems, and I didn't
understand how to size the queueSize parameter optimally.
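
For reference, this is roughly how the two clients are constructed (a sketch
using the SolrJ 4.2 API; the URL is illustrative, while the queue size and
thread count match the benchmark settings in this thread):

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  // plain HTTP client: each add() call is a separate request to Solr
  HttpSolrServer http = new HttpSolrServer("http://solrhost:8983/solr");

  // concurrent client: buffers documents in an internal queue (queueSize=20000)
  // and streams them to Solr from 4 background threads
  ConcurrentUpdateSolrServer concurrent =
      new ConcurrentUpdateSolrServer("http://solrhost:8983/solr", 20000, 4);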


On Oct 7, 2013, at 14:03, Erick Erickson wrote:

 Just skimmed, but the usual reason you can't max out the server
 is that the client can't go fast enough. Very quick experiment:
 comment out the server.add line in your client and run it again,
 does that speed up the client substantially? If not, then the time
 is being spent on the client.
 
 Or split your csv file into, say, 5 parts and run it from 5 different
 PCs in parallel.
 
 bq:  I can't rely on auto commit, otherwise I get an OutOfMemory error
 This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
 allocating more memory to the JVM running Solr.
 
 bq: committing every 100k docs gives worse performance
 It'll be best to specify openSearcher=false for max indexing throughput
 BTW. You should be able to do this quite frequently, 15 seconds seems
 quite reasonable.
 
 Best,
 Erick
 
 On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 I'd like to have some suggestion on how to improve the indexing performance 
 on the following scenario
 I'm uploading 1M docs to solr,
 
 every docs has
id: sequential number
title:  small string
date: date
body: 1kb of text
 
 Here are my benchmarks (they are all single executions, not averages from 
 multiple executions):
 
 1)  using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 
total time: 143035ms
 
 1.1)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 <ramBufferSizeMB>500</ramBufferSizeMB>
 <maxBufferedDocs>10</maxBufferedDocs>
 
total time: 134493ms
 
 1.2)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 <mergeFactor>30</mergeFactor>
 
total time: 143134ms
 
 2)  using a solrj client from another pc in the lan (100Mbps)
with httpsolrserver
with javabin format
add documents to the server in batches of 1k docs   ( server.add( 
 collection ) )
auto commit every 15s with openSearcher=false and commit after last 
 document
 
total time: 139022ms
 
 3)  using a solrj client from another pc in the lan (100Mbps)
with concurrentupdatesolrserver
 with javabin format
add documents to the server in batches of 1k docs   ( server.add( 
 collection ) )
server queue size=20k
server threads=4
no auto-commit and commit every 100k docs
 
total time: 167301ms
 
 
 --On the solr server--
 cpu averages 25%
at best 100% for 1 core
 IO  is still far from being saturated
iostat gives a pattern like this (every 5 s)
 
time(s) %util
100 45,20
105 1,68
110 17,44
115 76,32
120 2,64
125 68
130 1,28
 
 I thought that using concurrentupdatesolrserver I was able to max cpu or IO 
 but I wasn't.
 With concurrentupdatesolrserver I can't rely on auto commit, otherwise I get 
 an OutOfMemory error
 and I found that committing every 100k docs gives worse performance than 
 auto commit every 15s (benchmark 3 with httpsolrserver took 193515)
 
 I'd really like to understand why I can't max out the resources on the 
 server hosting solr (disk above all)
 And I'd really like to understand what I'm doing wrong with 
 concurrentupdatesolrserver
 
 thanks
 



Re: Regex to match one of two words

2013-10-08 Thread Jack Krupansky

Why use regular expressions at all?

Try:

published OR deprecated

-- Jack Krupansky

-Original Message- 
From: Dinusha Dilrukshi

Sent: Tuesday, October 08, 2013 3:32 AM
To: solr-user@lucene.apache.org
Subject: Regex to match one of two words

I have an input that can have only 2 values Published or Deprecated. What
regular expression can I use to ensure that either of the two words was
submitted?

I tried with different regular expressions (as in the [1], [2]) that
contains most generic syntax.. But Solar throws parser exception when
validating these expressions.. Could someone help me on writing this
regular expression that will evaluate by the Solar parser.

[1] /^(PUBLISHED)?(DEPRECATED)?$/
[2] /(PUBLISHED)?(DEPRECATED)?/


SolrCore org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse
'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
first character in WildcardQuery
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)


Regards,
Dinusha. 



Re: {soft}Commit and cache flusing

2013-10-08 Thread Dmitry Kan
Tim,
I suggest you open a new thread and not reply to this one to get noticed.
Dmitry


On Mon, Oct 7, 2013 at 9:44 PM, Tim Vaillancourt t...@elementspace.comwrote:

 Is there a way to make autoCommit only commit if there are pending changes,
 ie: if there are 0 adds pending commit, don't autoCommit (open-a-searcher
 and wipe the caches)?

 Cheers,

 Tim


 On 2 October 2013 00:52, Dmitry Kan solrexp...@gmail.com wrote:

  right. We've got the autoHard commit configured only atm. The
 soft-commits
  are controlled on the client. It was just easier to implement the first
  version of our internal commit policy that will commit to all solr
  instances at once. This is where we have noticed the reported behavior.
 
 
  On Wed, Oct 2, 2013 at 9:32 AM, Bram Van Dam bram.van...@intix.eu
 wrote:
 
   if there are no modifications to an index and a softCommit or
 hardCommit
   issued, then solr flushes the cache.
  
  
   Indeed. The easiest way to work around this is by disabling auto
 commits
   and only commit when you have to.
  
 



Re: Fix sort order within an index ?

2013-10-08 Thread user 01
@Upayavira:

q=topic:x1&rows=20&sort=id desc
Or
q=topic:x1&rows=20&sort=timestamp desc

Will get you what you ask for.

Yeah, I know that I could use sort & that will work, but I asked just for an
optimized way. Also, that ticket has been fixed, so shouldn't I now be able to
make use of the fixed sort order?


On Tue, Oct 8, 2013 at 11:59 AM, Upayavira u...@odoko.co.uk wrote:



 On Mon, Oct 7, 2013, at 11:09 PM, user 01 wrote:
  Any way to store documents in a fixed sort order within the indexes of
  certain fields(either the arrival order or sorted by int ids, that also
  serve as my unique key), so that I could store them optimized for
  browsing
  lists of items ?
 
  The order for browsing is always fixed & there are no further filter
  queries. Just I need to fetch the top 20 (most recently added) document
  with field value topic=x1
 
  I came across this article & a JIRA issue which encouraged me that
  something like this may be possible:
 
  http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
 
  https://issues.apache.org/jira/browse/LUCENE-4752

 That ticket is an optimisation. If your IDs are sequential, you can sort
 on them. Or you can add a timestamp field with a default of NOW, and
 sort on that.

  q=topic:x1&rows=20&sort=id desc
  Or
  q=topic:x1&rows=20&sort=timestamp desc

 Will get you what you ask for.

 The above ticket might just make it a little faster.

 Upayavira



Applying an AND search considering several document snippets as a single document

2013-10-08 Thread Rodrigo Rosenfeld Rosas

Hi there, this is my first message to this list :)

In our application we have a document split into several pages. When the 
user searches for words in a document we want to bring back all documents 
containing all the words, but we'd like to add a link to the specific 
page for each highlight.


Currently, I can think of a solution like indexing both the full 
documents and the pages and doing this in two steps (conceptually, as I 
haven't actually implemented this):


- perform an AND search across the full documents only and retrieve 
the document ids
- perform an OR search across the pages index only for those pages 
belonging to the previously returned document ids so that I could build 
the link to the specific returned pages.
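
Sketching those two steps as queries (field names such as type, doc_id and
body are hypothetical):

  1) q=body:(word1 AND word2)&fq=type:document&fl=id
  2) q=body:(word1 OR word2)&fq=type:page&fq=doc_id:(17 OR 42)&hl=true&hl.fl=body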


But since the AND search is already a bit slow here, I'd like to avoid 
two Solr queries if possible, as I already need another RDBMS query as 
well and all of that adds up.


Is there any way I could tell Solr to consider all indexed documents 
with a specified attribute as a single document for the purpose of AND 
matching?


Thanks in advance,
Rodrigo.



Hardware dimension for new SolrCloud cluster

2013-10-08 Thread Henrik Ossipoff Hansen
We're in the process of moving onto SolrCloud, and have gotten to the point 
where we are considering how to do our hardware setup.

We're limited to VMs running on our server cluster and storage system, so 
buying new physical servers is out of the question - the question is how we 
should dimension the new VMs.

Our document area is somewhat small, with about 1.2 million orders (rising of 
course), 75k products (divided into 5 countries - each of which will be its own 
collection/core) and some million customers.

In our current master/slave setup, we only index the products, with each 
country taking up about 35 MB of disk space. The index frequency is more or less 
updating the indexes 8 times per hour (mostly this is not all data though, but 
atomic updates with new stock data, new prices etc.).

Our upcoming order and customer indexes however will more or less receive 
updates on the fly as it happens (softcommit) and we expect the same to be 
the case for products in the near future.

- For hardware, it's down to 1 or 2 cores - current master runs with 2 cores
- RAM - currently our master runs with 6 GB only
- How much heap space should we allocate for max heap?

We currently plan on this setup:
- 1 machine for a simple loadbalancer
- 4 VMs totally for the Solr machines themselves (for both leaders and 
replicas, just one replica per shard is enough for our use case)
- A quorum of 3 ZKs

Question is - is this machine setup enough? And how exactly do we dimension the 
Solr machines?

Any help, pointers or resources will be much appreciated :)

Thank you!

Re: Hardware dimension for new SolrCloud cluster

2013-10-08 Thread primoz . skale
I think Mr. Erickson summarized the issue of hardware sizing quite well in 
the following article:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best regards,

Primož




From:   Henrik Ossipoff Hansen h...@entertainment-trading.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Date:   08.10.2013 14:59
Subject:Hardware dimension for new SolrCloud cluster



We're in the process of moving onto SolrCloud, and have gotten to the 
point where we are considering how to do our hardware setup.

We're limited to VMs running on our server cluster and storage system, so 
buying new physical servers is out of the question - the question is how 
we should dimension the new VMs.

Our document area is somewhat small, with about 1.2 million orders (rising 
of course), 75k products (divided into 5 countries - each which will be 
their own collection/core) and some million customers.

In our current master/slave setup, we only index the products, with each 
country taking up about 35 MB of disk space. The index frequency i more or 
less updating the indexes 8 times per hour (mostly this is not all data 
thought, but atomic updates with new stock data, new prices etc.).

Our upcoming order and customer indexes however will more or less receive 
updates on the fly as it happens (softcommit) and we expect the same to 
be the case for products in the near future.

- For hardware, it's down to 1 or 2 cores - current master runs with 2 
cores
- RAM - currently our master runs with 6 GB only
- How much heap space should we allocate for max heap?

We currently plan on this setup:
- 1 machine for a simple loadbalancer
- 4 VMs totally for the Solr machines themselves (for both leaders and 
replicas, just one replica per shard is enough for our use case)
- A qorum of 3 ZKs

Question is - is this machine setup enough? And how exactly do we 
dimension the Solr machines?

Any help, pointers or resources will be much appreciated :)

Thank you!


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Shalin Shekhar Mangar
I was wrong in saying that we don't need to open a searcher; we do. I
committed a fix in SOLR-5314 to use soft commits instead of hard commits. I
also increased the read timeout value. Both of these together will reduce
the likelihood of such a thing happening.

https://issues.apache.org/jira/browse/SOLR-5314


On Tue, Oct 8, 2013 at 1:24 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Hi Kalle,

 The problem here is that certain actions are taking too long causing the
 split process to terminate in between. For example, a commit on the parent
 shard leader took 83 seconds in your case but the read timeout value is set
 to 60 seconds only. We actually do not need to open a searcher during this
 commit. I'll open an issue and attach a fix.

 Longer term we need to introduce asynchronous commands so that status can
 be reported in a better way.


 On Tue, Oct 8, 2013 at 12:13 PM, Kalle Aaltonen 
 kalle.aalto...@zemanta.com wrote:


 I have a test system where I have a index of 15M documents in one shard
 that I would like to split in two. I've tried it four times now. I have a
 stand-alone zookeeper running on the same machine.

 The end result is that I have two new shards with state construction,
 and each has one replica which is down.

 Two of the attempts failed because of heapspace. Now the heap size is
 24GB. I can't figure out from the logs what is going on.

 I've attached a log of the latest attempt. Any help would be much
 appreciated.

 - Kalle Aaltonen






 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.


Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
I am using 4.3.  It is not related to bugs related to last_index_time.  The
problem is caused by the fact that the parent entity and child entity use
different data source (different databases on different hosts).

From the log output, I do see the delta query of the child entity being
executed correctly and finding all the rows that have been modified for the
child entity.  But it fails when it executes the parentDeltaQuery because
it is still using the database connection from the child entity (i.e.
datasource ds2 in my example above).

Is there a way to tell DIH to use a different datasource in the
parentDeltaQuery?

Bill


On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Which version of Solr and what kind of SQL errors? There were some bugs in
 4.x related to last_index_time, but it does not sound related.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:

  Here is my DIH config:
 
  <dataConfig>
    <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost1/dbname1" user="db_username1"
        password="db_password1"/>
    <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost2/dbname2" user="db_username2"
        password="db_password2"/>
    <document name="products">
      <entity name="item" dataSource="ds1" query="select * from item">
        <field column="ID" name="id" />
        <field column="NAME" name="name" />

        <entity name="feature" dataSource="ds2" query="select
            description from feature where item_id='${item.ID}'">
          <field name="features" column="description" />
        </entity>
      </entity>
    </document>
  </dataConfig>
 
  I am having trouble with delta import.  I think it is because the main
  entity and the sub-entity use different data source.  I have tried using
  both a delta query:
 
  deltaQuery="select id from item where id in (select item_id as id from
  feature where last_modified > '${dih.last_index_time}') or last_modified
  > '${dih.last_index_time}'"

  and a parentDeltaQuery:

  <entity name="feature" pk="ITEM_ID" query="select DESCRIPTION as features
  from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from
  FEATURE where last_modified > '${dih.last_index_time}'"
  parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
 
  I ended up with an SQL error for both.  Is there any way to make delta
  import work in my case?
 
  Bill
 



RE: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Dyer, James
Bill,

I do not believe there is any way to tell it to use a different datasource for 
the parent delta query.  

If you used this approach, would it solve your problem:  
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ?
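
If it helps, the pattern from that wiki page applied to your item entity would
look roughly like this (a sketch only, adjust to your schema):

  <entity name="item" dataSource="ds1"
          query="select * from item
                 where '${dataimporter.request.clean}' != 'false'
                    or last_modified > '${dih.last_index_time}'">

and the import is then always run with command=full-import&clean=false, so
only the changed rows are selected.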

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Bill Au [mailto:bill.w...@gmail.com] 
Sent: Tuesday, October 08, 2013 8:50 AM
To: solr-user@lucene.apache.org
Subject: Re: problem with data import handler delta import due to use of 
multiple datasource

I am using 4.3.  It is not related to bugs related to last_index_time.  The
problem is caused by the fact that the parent entity and child entity use
different data source (different databases on different hosts).

From the log output, I do see the the delta query of the child entity being
executed correctly and found all the rows that have been modified for the
child entity.  But it fails when it executed the parentDeltaQuery because
it is still using the database connection from the child entity (ie
datasource ds2 in my example above).

Is there a way to tell DIH to use a different datasource in the
parentDeltaQuery?

Bill


On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Which version of Solr and what kind of SQL errors? There were some bugs in
 4.x related to last_index_time, but it does not sound related.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:

  Here is my DIH config:
 
  dataConfig
  dataSource name=ds1 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost1/dbname1 user=db_username1
  password=db_password1/
  dataSource name=ds2 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost2/dbname2 user=db_username2
  password=db_password2/
  document name=products
  entity name=item dataSource=ds1 query=select * from item
  field column=ID name=id /
  field column=NAME name=name /
 
  entity name=feature dataSource=ds2 query=select
  description from feature where item_id='${item.ID}'
  field name=features column=description /
  /entity
  /entity
  /document
  /dataConfig
 
  I am having trouble with delta import.  I think it is because the main
  entity and the sub-entity use different data source.  I have tried using
  both a delta query:
 
  deltaQuery=select id from item where id in (select item_id as id from
  feature where last_modified  '${dih.last_index_time}') or last_modified
  gt; '${dih.last_index_time}'
 
  and a parentDeltaQuery:
 
  entity name=feature pk=ITEM_ID query=select DESCRIPTION as features
  from FEATURE where ITEM_ID='${item.ID}' deltaQuery=select ITEM_ID from
  FEATURE where last_modified  '${dih.last_index_time}'
  parentDeltaQuery=select ID from item where ID=${feature.ITEM_ID}/
 
  I ended up with an SQL error for both.  Is there any way to make delta
  import work in my case?
 
  Bill
 




Effect of multiple white space at WhiteSpaceTokenizer

2013-10-08 Thread Furkan KAMACI
I use Solr 4.5 and I have a WhitespaceTokenizer in my schema. What is the
difference (index size and performance) between these two sentences:

First one: "This is a sentence."
Second one: "This   is a  sentence."


RE: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread Dyer, James
Shamik,

Are you using a request handler other than /select, and if so, did you set 
shards.qt in your request?  It should be set to the name of the request 
handler you are using.

See http://wiki.apache.org/solr/SpellCheckComponent?#Distributed_Search_Support
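
For example, if your custom handler is registered as /search (name
illustrative), each request would carry something like:

  http://host:8983/solr/collection1/search?q=whatevr&spellcheck=true&shards.qt=/search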

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Shamik Bandopadhyay [mailto:sham...@gmail.com] 
Sent: Monday, October 07, 2013 4:47 PM
To: solr-user@lucene.apache.org
Subject: How to achieve distributed spelling check in SolrCloud ?

Hi,

  We are in the process of transitioning to SolrCloud (4.4) from
Master-Slave architecture (4.2) . One of the issues I'm facing now is with
making spell check work. It only seems to work if I explicitly set
distrib=false. I'm using a custom request handler and included the spell
check option.

   <str name="spellcheck">on</str>
   <str name="spellcheck.collate">true</str>
   <str name="spellcheck.onlyMorePopular">false</str>
   <str name="spellcheck.extendedResults">false</str>
   <str name="spellcheck.count">1</str>
   <str name="spellcheck.dictionary">default</str>
  </lst>
  <!-- append spellchecking to our list of components -->
  <arr name="last-components">
   <str>spellcheck</str>
  </arr>

The spellcheck component has the usual configuration.

The spell check is part of the request handler which is being used to
execute a distributed query. I can't possibly add distrib=false.

Just wondering if there's a way to address this.

Any pointers will be appreciated.

-Thanks,
Shamik



RE: Effect of multiple white space at WhiteSpaceTokenizer

2013-10-08 Thread Markus Jelsma
Result is the same and performance difference should be negligible, unless 
you're uploading megabytes of white space. Consecutive white space should be 
collapsed outside of Solr/Lucene anyway because it'll end up in your stored 
field. Index size will be slightly bigger but not much due to compression.
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Tuesday 8th October 2013 16:21
 To: solr-user@lucene.apache.org
 Subject: Effect of multiple white space at WhiteSpaceTokenizer
 
 I use Solr 4.5 and I have a WhiteSpaceTokenizer at my schema. What is the
 difference (index size and performance) for that two sentences:
 
 First one: This is a sentence.
 Second one: This   is a  sentence.
 


Adding Functionalities in SOLR

2013-10-08 Thread Ankit Kumar
*1. Span NOT Operator*

 We have a business use case to use SPAN NOT queries in SOLR. Query
Parser of LUCENE currently doesn't support/parse SPAN NOT queries.

2.Adding Recursive and Range Proximity

  *Recursive Proximity *is a proximity query within a proximity query

Ex:   “ “income tax”~5   statement” ~4  The recursion can be up to any
level.

* Range Proximity*: Currently we can only define a number as the range; we
want an interval as the range.

Ex: “profit income”~3,5,  “United America”~-5,4



3. Complex  Queries

A complex query is a query formed with a combination of Boolean operators
or proximity queries or range queries or any possible combination of these.

Ex:“(income AND tax) statement”~4

  “ “income tax”~4  (statement OR period) ”~3

  (“ income” SPAN NOT  “income tax” ) source ~3,5

 Can anyone suggest us some way of achieving these 3 functionalities in
SOLR ???
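
As you note, the default query parser doesn't parse SPAN NOT, so one route for
the first item is a custom QParserPlugin (or client-side Lucene code) built on
Lucene's span queries. A minimal sketch of the Lucene side (field and terms
illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanNotQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery income    = new SpanTermQuery(new Term("body", "income"));
  SpanQuery tax       = new SpanTermQuery(new Term("body", "tax"));
  // the phrase "income tax" as a span (slop 0, in order)
  SpanQuery incomeTax = new SpanNearQuery(new SpanQuery[]{income, tax}, 0, true);
  // matches "income" only where it is NOT part of "income tax"
  SpanQuery notQuery  = new SpanNotQuery(income, incomeTax);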


Re: Regex to match one of two words

2013-10-08 Thread Walter Underwood
Or a boolean field for published, with false meaning deprecated.

wunder

On Oct 8, 2013, at 3:42 AM, Jack Krupansky wrote:

 Why use regular expressions at all?
 
 Try:
 
 published OR deprecated
 
 -- Jack Krupansky
 
 -Original Message- From: Dinusha Dilrukshi
 Sent: Tuesday, October 08, 2013 3:32 AM
 To: solr-user@lucene.apache.org
 Subject: Regex to match one of two words
 
 I have an input that can have only 2 values Published or Deprecated. What
 regular expression can I use to ensure that either of the two words was
 submitted?
 
 I tried with different regular expressions (as in the [1], [2]) that
 contains most generic syntax.. But Solar throws parser exception when
 validating these expressions.. Could someone help me on writing this
 regular expression that will evaluate by the Solar parser.
 
 [1] /^(PUBLISHED)?(DEPRECATED)?$/
 [2] /(PUBLISHED)?(DEPRECATED)?/
 
 
 SolrCore org.apache.solr.common.SolrException:
 org.apache.lucene.queryParser.ParseException: Cannot parse
 'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
 first character in WildcardQuery
 at
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)
 
 
 Regards,
 Dinusha. 

--
Walter Underwood
wun...@wunderwood.org





Re: ALIAS feature, can be used for what?

2013-10-08 Thread Michael Della Bitta
CREATEALIAS is also used to move an alias.
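
For example (alias and collection names illustrative), re-running CREATEALIAS
with the same alias name repoints it at the new collection:

  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2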

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Fri, Oct 4, 2013 at 5:41 AM, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 I have been asked the same question. There are only DELETEALIAS and
 CREATEALIAS actions available, so is there a way to achieve uninterrupted
 switch of an alias from one index to another? Are we lacking a MOVEALIAS
 command?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:

  I need delete the alias for the old collection before point it to the
 new, right?
 
  --
  Yago Riveiro
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
  On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
 
  Hi,
 
  Imagine you have an index and you need to reindex your data into a new
  index, but don't want to have to reconfigure or restart client apps
  when you want to point them to the new index. This is where aliases
  come in handy. If you created an alias for the first index and made
  your apps hit that alias, then you can just repoint the same alias to
  your new index and avoid having to touch client apps.
 
  No, I don't think you can write to multiple collections through a
 single alias.
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
  Today I was thinking about the ALIAS feature and the utility on Solr.
 
  Can anyone explain me with an example where this feature may be useful?
 
  It's possible have an ALIAS of multiples collections, if I do a write
 to the
  alias, Is this write replied to all collections?
 
  /Yago
 
 
 
  -
  Best regards
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
  Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
 
 
 
 
 
 




Re: ALIAS feature, can be used for what?

2013-10-08 Thread Michael Della Bitta
You can index to an alias that points at only one collection. Works fine!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Fri, Oct 4, 2013 at 7:59 AM, Upayavira u...@odoko.co.uk wrote:

 I've used this feature to great effect. I have logs coming in, and I
 create a core for each day. At the end of each day, I create a new core
 for tomorrow, unload any cores over 2 months old, then create a set of
 aliases (all, month, week, today) pointing to just the cores
 that are needed for that range. Thus, my app can efficiently query the
 bit of the index it is really interested in.

 You cannot, as far as I am aware, index directly to an alias. It
 wouldn't know what to do with the content. However, you can create an
 alias over the top of an existing one, and it will replace it. Works
 nicely.

 Upayavira

 On Fri, Oct 4, 2013, at 10:41 AM, Jan Høydahl wrote:
  Hi,
 
  I have been asked the same question. There are only DELETEALIAS and
  CREATEALIAS actions available, so is there a way to achieve uninterrupted
  switch of an alias from one index to another? Are we lacking a MOVEALIAS
  command?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:
 
   I need delete the alias for the old collection before point it to the
 new, right?
  
   --
   Yago Riveiro
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
   On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
  
   Hi,
  
   Imagine you have an index and you need to reindex your data into a new
   index, but don't want to have to reconfigure or restart client apps
   when you want to point them to the new index. This is where aliases
   come in handy. If you created an alias for the first index and made
   your apps hit that alias, then you can just repoint the same alias to
   your new index and avoid having to touch client apps.
  
   No, I don't think you can write to multiple collections through a
 single alias.
  
   Otis
   --
   Solr  ElasticSearch Support -- http://sematext.com/
   Performance Monitoring -- http://sematext.com/spm
  
  
  
   On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
   Today I was thinking about the ALIAS feature and the utility on Solr.
  
   Can anyone explain me with an example where this feature may be
 useful?
  
   It's possible have an ALIAS of multiples collections, if I do a
 write to the
   alias, Is this write replied to all collections?
  
   /Yago
  
  
  
   -
   Best regards
   --
   View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
   Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
  
  
  
  
  
  
 



Bootstrapping / Full Importing using Solr Cloud

2013-10-08 Thread Mark
We are in the process of upgrading our Solr cluster to the latest and greatest 
Solr Cloud. I have some questions regarding full indexing though. We're 
currently running a long job (~30 hours) using DIH to do a full index of over 
10M products. This process consumes a lot of memory and, while updating, cannot 
handle any user requests.

How, or what, would be the best way to go about this when using Solr Cloud? 
First off, does DIH work with cloud? Would I need to separate out my DIH 
indexing machine from the machines serving up user requests? If not going down 
the DIH route, what are my best options (SolrJ?)

Thanks for the input

Case insensitive suggestion - Suggester with external dictionary

2013-10-08 Thread SolrLover

I am using suggester that uses external dictionary file for suggestions (as
below).

# This is a sample dictionary file.

iPhone3g
iPhone4     295
iPhone5c    620
iPhone4g    710

Everything works fine except for the fact that the suggester seems to be
case sensitive.

/suggest?q=ip is not matching any of the entries in the dictionary (listed
above). Is there a way to make the suggester case insensitive when using
external dictionary file?

Thanks for your help!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Case-insensitive-suggestion-Suggester-with-external-dictionary-tp4094133.html
Sent from the Solr - User mailing list archive at Nabble.com.


EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Tyler Foster
Hey Everyone,
When faceting on a field using the EdgeNGramFilterFactory the returned
facets values include all of the n-gram values. Is there a way to limit
this list to the stored values without creating a new field?

Thanks in advance!

Tyler


RE: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Markus Jelsma
Facets do not return stored values; it's usually a bad idea to tokenize 
or do heavy analysis on facet fields. You need to facet on a copy of your field instead.
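
A sketch of that (field and type names assumed, not from the original schema):

  <field name="title" type="text_edgengram" indexed="true" stored="true"/>
  <field name="title_facet" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_facet"/>

Facet on title_facet and search/highlight on title.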

-Original message-
 From:Tyler Foster tfos...@cloudera.com
 Sent: Tuesday 8th October 2013 19:28
 To: solr-user@lucene.apache.org
 Subject: EdgeNGramFilterFactory and Faceting
 
 Hey Everyone,
 When faceting on a field using the EdgeNGramFilterFactory the returned
 facets values include all of the n-gram values. Is there a way to limit
 this list to the stored values without creating a new field?
 
 Thanks in advance!
 
 Tyler
 


Re: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Shalin Shekhar Mangar
Tyler, faceting works on indexed content and not stored content.


On Tue, Oct 8, 2013 at 10:45 PM, Tyler Foster tfos...@cloudera.com wrote:

 Hey Everyone,
 When faceting on a field using the EdgeNGramFilterFactory the returned
 facets values include all of the n-gram values. Is there a way to limit
 this list to the stored values without creating a new field?

 Thanks in advance!

 Tyler




-- 
Regards,
Shalin Shekhar Mangar.


Re: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Tyler Foster
Thanks, that was the way it was looking. I just wanted to make sure I
wasn't missing something.


On Tue, Oct 8, 2013 at 10:32 AM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Facets do not return the stored constraints, it's usually bad idea to
 tokenize or do some have analysis on facet fields. You need to copy your
 field instead.

 -Original message-
  From:Tyler Foster tfos...@cloudera.com
  Sent: Tuesday 8th October 2013 19:28
  To: solr-user@lucene.apache.org
  Subject: EdgeNGramFilterFactory and Faceting
 
  Hey Everyone,
  When faceting on a field using the EdgeNGramFilterFactory the returned
  facets values include all of the n-gram values. Is there a way to limit
  this list to the stored values without creating a new field?
 
  Thanks in advance!
 
  Tyler
 



RE: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread shamik
James,

  Thanks for your reply. The shards.qt did the trick. I read the
documentation earlier but was not clear on the implementation, now it
totally makes sense.

Appreciate your help.

Regards,
Shamik



--
View this message in context: 
http://lucene.472066.n3.nabble.com/RE-How-to-achieve-distributed-spelling-check-in-SolrCloud-tp4094113p4094137.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.4.0 Shard Update Errors (503) but cloud graph shows all green?

2013-10-08 Thread dmarini
Hi!

We are running Solr 4.4.0 on a 3 node linux cluster and have about 2
collections storing product data with no problems. Yesterday, I attempted to
create another one of these collections using the Collections API, but I had
forgotten to upload the config to the zookeeper prior to making the call and
it failed spectacularly as expected :). The API command I ran was to create
a 3 shard collection with a replicationFactor of 2 and maxShardsPerNode set to
2, since the default understandably causes issues on 3 node clusters.

Since I ran that command however, I see the following message in the red
'SolrCore Initialization Failures' when I load up the admin for 2 out of 3 of
the nodes (the following is from one of the boxes):

MyNewCollection_shard1_replica2:
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
Could not find configName for collection MyNewCollection
found:[MyFirstCollection, MySecondCollection]

MyNewCollection_shard3_replica1:
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
Could not find configName for collection MyNewCollection
found:[MyFirstCollection, MySecondCollection]

My first question is, how do I get this to go away since the cores never
actually got created? I looked in the solr directory and I do not see folders
with the core names (which I'm under the impression the implicit core walking
uses to determine what cores to attempt to load).

Second, and a bit stranger, is that since I messed up that command, I now
appear to be seeing errors in the admin log (every 2 seconds) when attempting
to update documents in the other 2 collections that were working fine prior to
the command being run. Specifically, I'm seeing these messages repeating over
and over near constantly:

14:07:11 ERROR SolrCmdDistributor shard update error StdNode:
http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2/:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2
returned non ok status:503, message:Service Unavailable

14:07:11 ERROR SolrCore Request says it is coming from leader, but we are the leader:
distrib.from=http://10.0.1.30:8983/solr/MyFirstCollection_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2

14:07:11 ERROR SolrCore org.apache.solr.common.SolrException:
Request says it is coming from leader, but we are the leader

14:07:11 WARN RecoveryStrategy Stopping recovery for
zkNodeName=core_node1 core=MyFirstCollection_shard1_replica2

14:07:11 WARN RecoveryStrategy We have not yet recovered - but we are now the leader!
core=MyFirstCollection_shard1_replica2

The first error worries me much, as I think I'm losing data, but I can directly
query that shard from that machine with no issues and the cloud view from ALL
of the machines shows totally green.

I'm not sure how the failed command got the system into this state, and I'm
kicking myself for making that mistake to begin with, but I'm completely at a
loss for how to attempt to recover, since these are live collections that I
can't take down without incurring significant downtime.

Any ideas? Will reloading the cores that are throwing these messages help? Can
the zookeeper and solr not have the same idea as to who the leader is for that
shard? And if so, how do I re-introduce consistency there?

Appreciate any help that can be offered.

Thanks,
--Dave



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-4-0-Shard-Update-Errors-503-but-cloud-graph-shows-all-green-tp4094139.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread Jason Hellman
The shards.qt parameter is the easiest one to forget, with the most dramatic of 
consequences!

On Oct 8, 2013, at 11:10 AM, shamik sham...@gmail.com wrote:

 James,
 
  Thanks for your reply. The shards.qt did the trick. I read the
 documentation earlier but was not clear on the implementation, now it
 totally makes sense.
 
 Appreciate your help.
 
 Regards,
 Shamik
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/RE-How-to-achieve-distributed-spelling-check-in-SolrCloud-tp4094113p4094137.html
 Sent from the Solr - User mailing list archive at Nabble.com.
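
For anyone who lands here later: the fix amounts to sending shards.qt along with
the spellcheck request so it points at the request handler that actually has the
spellcheck component configured. A request of roughly this shape (host, collection,
and the /spell handler name are illustrative) keeps the check distributed correctly:

    http://host:8983/solr/collection1/spell?q=delll+ultra+sharp
        &spellcheck=true&spellcheck.collate=true&shards.qt=/spell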



Solr 4.4 - Master/Slave configuration - Replication Issue with Commits after deleting documents using Delete by ID

2013-10-08 Thread Akkinepalli, Bharat (ELS-CON)
Hi,
We have recently migrated from Solr 3.6 to Solr 4.4.  We are using the 
Master/Slave configuration in Solr 4.4 (not Solr Cloud).  We have noticed the 
following behavior/defect.

Configuration:
===

1.   The Hard Commit and Soft Commit are disabled in the configuration (we 
control the commits from the application)

2.   We have 1 Master and 2 Slaves configured and the pollInterval is 
configured to 10 Minutes.

3.   The Master is configured to have the replicateAfter as commit & startup

Steps to reproduce the problem:
==

1.   Delete a document in Solr (using delete by id).  URL - 
http://localhost:8983/solr/annotation/update with body as 
<delete><id>change.me</id></delete>

2.   Issue a commit in Master 
(http://localhost:8983/solr/annotation/update?commit=true).

3.   The replication of the DELETE WILL NOT happen.  The master and slave 
has the same Index version.

4.   If we try to issue another commit in Master, we see that it replicates 
fine.

Request you to please confirm if this is a known issue.  Thank you.

Regards,
Bharat Akkinepalli



Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
Thanks for the suggestion but that won't work as I have last_modified field
in both the parent entity and child entity as I want delta import to kick
in when either change.  That other approach has the same problem since the
parent and child entity uses different datasources.

Bill


On Tue, Oct 8, 2013 at 10:18 AM, Dyer, James
james.d...@ingramcontent.comwrote:

 Bill,

 I do not believe there is any way to tell it to use a different datasource
 for the parent delta query.

 If you used this approach, would it solve your problem:
 http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ?

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Bill Au [mailto:bill.w...@gmail.com]
 Sent: Tuesday, October 08, 2013 8:50 AM
 To: solr-user@lucene.apache.org
 Subject: Re: problem with data import handler delta import due to use of
 multiple datasource

 I am using 4.3.  It is not related to bugs related to last_index_time.  The
 problem is caused by the fact that the parent entity and child entity use
 different data source (different databases on different hosts).

 From the log output, I do see the the delta query of the child entity being
 executed correctly and found all the rows that have been modified for the
 child entity.  But it fails when it executed the parentDeltaQuery because
 it is still using the database connection from the child entity (ie
 datasource ds2 in my example above).

 Is there a way to tell DIH to use a different datasource in the
 parentDeltaQuery?

 Bill


 On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
 arafa...@gmail.comwrote:

  Which version of Solr and what kind of SQL errors? There were some bugs
 in
  4.x related to last_index_time, but it does not sound related.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:
 
   Here is my DIH config:
  
    <dataConfig>
      <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost1/dbname1" user="db_username1"
                  password="db_password1"/>
      <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost2/dbname2" user="db_username2"
                  password="db_password2"/>
      <document name="products">
        <entity name="item" dataSource="ds1" query="select * from item">
          <field column="ID" name="id"/>
          <field column="NAME" name="name"/>
          <entity name="feature" dataSource="ds2"
                  query="select description from feature where item_id='${item.ID}'">
            <field name="features" column="description"/>
          </entity>
        </entity>
      </document>
    </dataConfig>
  
   I am having trouble with delta import.  I think it is because the main
   entity and the sub-entity use different data source.  I have tried
 using
   both a delta query:
  
   deltaQuery="select id from item where id in (select item_id as id from
   feature where last_modified > '${dih.last_index_time}') or
   last_modified > '${dih.last_index_time}'"
  
   and a parentDeltaQuery:
  
   <entity name="feature" pk="ITEM_ID"
           query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
           deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dih.last_index_time}'"
           parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
  
   I ended up with an SQL error for both.  Is there any way to make delta
   import work in my case?
  
   Bill
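
For reference, the DeltaQueryViaFullImport pattern James points to looks roughly
like this when spread over two data sources (table, column, and datasource names
follow the config quoted above and are illustrative); note it only reacts to
changes that the parent query itself can detect:

    <entity name="item" dataSource="ds1" pk="ID"
            query="select * from item
                   where '${dih.request.clean}' != 'false'
                      or last_modified &gt; '${dih.last_index_time}'">
      <field column="ID" name="id"/>
      <entity name="feature" dataSource="ds2"
              query="select description from feature where item_id='${item.ID}'">
        <field name="features" column="description"/>
      </entity>
    </entity>

It is then run with command=full-import&clean=false so only the rows matched by
the last_modified test are re-fetched.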
  
 




Re: What is the full list of Solr Special Characters?

2013-10-08 Thread Shawn Heisey

On 10/8/2013 3:01 AM, Furkan KAMACI wrote:
 Actually I want to remove special characters and not send them into my
 Solr indexes. I mean a user can send a special query, like a SQL injection,
 and I want to protect my system from such scenarios.

There is a newer javadoc than the *very* old one you are looking at:

http://lucene.apache.org/core/4_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

When I compare that list to what's actually in the SolrJ 
escapeQueryChars method, it looks like that method does one additional 
character - the semicolon.


http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_5_0/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java

Just search the page for escapeQueryChars to see the java code.

Thanks,
Shawn
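
If the query string is assembled in a SolrJ client, a minimal sketch of using that
method looks like this (the raw input is just an example):

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeExample {
        public static void main(String[] args) {
            String raw = "(title:foo) AND \"bar\"";   // illustrative, possibly hostile, user input
            String safe = ClientUtils.escapeQueryChars(raw);
            // 'safe' can now be embedded in a query without being parsed as query syntax
            System.out.println(safe);
        }
    }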



run filter queries after post filter

2013-10-08 Thread Rohit Harchandani
Hey,
I am using solr 4.0 with my own PostFilter implementation which is executed
after the normal solr query is done. This filter has a cost of 100. Is it
possible to run filter queries on the index after the execution of the post
filter?
I tried adding the below line to the url but it did not seem to work:
fq={!cache=false cost=200}field:value
Thanks,
Rohit


Re: no such field error:smaller big block size details while indexing doc files

2013-10-08 Thread sweety
This is my new schema.xml:

<schema name="documents">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="comments" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="keywords" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="contents" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="revision_number" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <dynamicField name="ignored_*" type="string" indexed="false" stored="true" multiValued="true"/>
    <dynamicField name="*" type="ignored" multiValued="true"/>
    <copyField source="id" dest="text"/>
    <copyField source="author" dest="text"/>
  </fields>
  <types>
    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField"/>
    <fieldType name="integer" class="solr.IntField"/>
    <fieldType name="long" class="solr.LongField"/>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
  <uniqueKey>id</uniqueKey>
</schema>
I still get the same error.


 From: Erick Erickson [via Lucene] ml-node+s472066n4094013...@n3.nabble.com
To: sweety sweetyshind...@yahoo.com 
Sent: Tuesday, October 8, 2013 7:16 AM
Subject: Re: no such field error:smaller big block size details while indexing 
doc files
 


Well, one of the attributes parsed out of, probably the 
meta-information associated with one of your structured 
docs is SMALLER_BIG_BLOCK_SIZE_DETAILS and 
Solr Cel is faithfully sending that to your index. If you 
want to throw all these in the bit bucket, try defining 
a true catch-all field that ignores things, like this. 
<dynamicField name="*" type="ignored" multiValued="true"/> 

Best, 
Erick 

On Mon, Oct 7, 2013 at 8:03 AM, sweety [hidden email] wrote: 

 I'm trying to index .doc, .docx, and .pdf files; 
 I'm using this url: 
 curl "http://localhost:8080/solr/document/update/extract?literal.id=12&commit=true" 
 -F "myfile=@complex.doc" 
 
 This is the error I get: 
 Oct 07, 2013 5:02:18 PM org.apache.solr.common.SolrException log 
 SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchFieldError: 
 SMALLER_BIG_BLOCK_SIZE_DETAILS 
         at 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
  
         at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
  
         at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
  
         at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  
         at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  
         at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  
         at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
  
         at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) 
         at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) 
         at 
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928) 
         at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  
         at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) 
         at 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
  
         at 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
  
         at 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
  
         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
         at java.lang.Thread.run(Unknown Source) 
 Caused by: java.lang.NoSuchFieldError: SMALLER_BIG_BLOCK_SIZE_DETAILS 
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:93)
  
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:190)
  
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:184)
  
         at 
 org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:376)
  
         at 
 org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
  
         at 
 org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) 
         at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113) 
         at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
  
         at 
 

Re: solr cpu usage

2013-10-08 Thread Tim Vaillancourt
Yes, you've saved us all lots of time with this article. I'm about to do
the same for the old Jetty or Tomcat? container question ;).

Tim


On 7 October 2013 18:55, Erick Erickson erickerick...@gmail.com wrote:

 Tim:

 Thanks! Mostly I wrote it to have something official looking to hide
 behind when I didn't have a good answer to the hardware sizing question
 :).

 On Mon, Oct 7, 2013 at 2:48 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
  Fantastic article!
 
  Tim
 
 
  On 5 October 2013 18:14, Erick Erickson erickerick...@gmail.com wrote:
 
  From my perspective, your question is almost impossible to
  answer, there are too many variables. See:
 
 
 http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
 
  Best,
  Erick
 
  On Thu, Oct 3, 2013 at 9:38 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   More CPU cores means more concurrency.  This is good if you need to
  handle
   high query rates.
  
   Faster cores mean lower query latency, assuming you are not
 bottlenecked
  by
   memory or disk IO or network IO.
  
   So what is ideal for you depends on your concurrency and latency
 needs.
  
   Otis
   Solr  ElasticSearch Support
   http://sematext.com/
   On Oct 1, 2013 9:33 AM, adfel70 adfe...@gmail.com wrote:
  
   hi
   We're building a spec for a machine to purchase.
   We're going to buy 10 machines.
   we aren't sure yet how many proccesses we will run per machine.
   the question is  -should we buy faster cpu with less cores or slower
 cpu
   with more cores?
   in any case we will have 2 cpus in each machine.
   should we buy 2.6Ghz cpu with 8 cores or 3.5Ghz cpu with 4 cores?
  
   what will we gain by having many cores?
  
   what kinds of usages would make cpu be the bottleneck?
  
  
  
  
   --
   View this message in context:
   http://lucene.472066.n3.nabble.com/solr-cpu-usage-tp4092938.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 



dynamically adding core with auto-discovery in Solr 4.5

2013-10-08 Thread Jan Van Besien
Hi,

We are using auto discovery and have a use case where we want to be
able to add cores dynamically, without restarting solr.

In 4.4 we were able to
- add a directory (e.g. core1) with an empty core.properties
- call 
http://localhost:8983/solr/admin/cores?action=CREATE&core=core1&name=core1&instanceDir=%2Fsomewhere%2Fcore1

In 4.5 however this (the second step) fails, saying it cannot create a
new core in that directory because another core is already defined
there.

From the documentation (http://wiki.apache.org/solr/CoreAdmin), I
understand that since 4.3 we should actually do RELOAD. However,
RELOAD results in this stacktrace:

org.apache.solr.common.SolrException: Error handling 'reload' action
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:673)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:172)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:322) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.solr.common.SolrException: Unable to reload
core: core1 at 
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:936)
at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:691)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:671)
... 20 more Caused by: org.apache.solr.common.SolrException: No such
core: core1 at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:642)
... 21 more

Note that before I RELOAD, the core1 directory was created.

Also note that next to the core1 directory, there is a core0 directory
which has exactly the same content and is auto-discovered perfectly
fine at startup.

So... what should it be? Or am I missing something here?

thanks in advance,
Jan


Re: Accent insensitive multi-words suggester

2013-10-08 Thread Dominique Bejean

Thank you Erick.
I will try this.

Regards
Dominique

Le 06/10/13 03:03, Erick Erickson a écrit :

Consider implementing a special field that of the form
accentfolded|original

For instance, you'd index something like
ecole|école
ecole|école privée
as _terms_, not broken up at all.

Now, when you send something to the suggester (just eco or éco),
you fold it to eco too and get back these tokens.
Then the app layer breaks them up and displays them pleasingly.

Best
Erick

On Tue, Oct 1, 2013 at 5:45 PM, Dominique Bejean
dominique.bej...@eolya.fr wrote:

Hi,

Up to now, the best solution I found in order to implement a multi-words
suggester was to use ShingleFilterFactory filter at index time and the
termsComponent. At index time the analyzer was :

   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
             articles="lang/contractions_fr.txt"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
             outputUnigrams="true"/>
   </analyzer>


With ASCIIFoldingFilter filter, it works find if the user do not use
accent in query terms and all suggestions are without accents.
Without ASCIIFoldingFilter filter, it works find if the user do not forget
accent in query terms and all suggestions are with accents.

Note : I use the StopFilter to avoid suggestions including stop words and
particularly starting or ending with stop words.


What I need is a suggester where the user can use or not use the accent in
query terms and the suggestions are returned with accent.

For example, if the user type éco or eco, the suggester should return :

école
école primaire
école publique
école privée
école primaire privée


I think it is impossible to achieve this with the termComponents and I
should use the SpellCheckComponent instead. However, I don't see how to make
the suggester accent insensitive and return the suggestions with accents.

Did somebody already achieved that ?

Thank you.

Dominique


--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com



Re: {soft}Commit and cache flusing

2013-10-08 Thread Tim Vaillancourt
I have a genuine question with substance here. If anything this
nonconstructive, rude response was to get noticed. Thanks for
contributing to the discussion.

Tim


On 8 October 2013 05:31, Dmitry Kan solrexp...@gmail.com wrote:

 Tim,
 I suggest you open a new thread and not reply to this one to get noticed.
 Dmitry


 On Mon, Oct 7, 2013 at 9:44 PM, Tim Vaillancourt t...@elementspace.com
 wrote:

  Is there a way to make autoCommit only commit if there are pending
 changes,
  ie: if there are 0 adds pending commit, don't autoCommit (open-a-searcher
  and wipe the caches)?
 
  Cheers,
 
  Tim
 
 
  On 2 October 2013 00:52, Dmitry Kan solrexp...@gmail.com wrote:
 
   right. We've got the autoHard commit configured only atm. The
  soft-commits
   are controlled on the client. It was just easier to implement the first
   version of our internal commit policy that will commit to all solr
   instances at once. This is where we have noticed the reported behavior.
  
  
   On Wed, Oct 2, 2013 at 9:32 AM, Bram Van Dam bram.van...@intix.eu
  wrote:
  
if there are no modifications to an index and a softCommit or
  hardCommit
issued, then solr flushes the cache.
   
   
Indeed. The easiest way to work around this is by disabling auto
  commits
and only commit when you have to.
   
  
 



What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
I'm curious what the later shard-local bits do, if anything?

I have a very large cluster (256 shards) and I'm sending most of my data
with a single composite, e.g. 1234!unique_id, but I'm noticing the data
is being split among many of the shards.

My guess right now is that since I'm only using the default 16 bits my data
is being split across multiple shards (because of my high # of shards).

Thanks,
Brett
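
For context, the bits option being asked about is the optional /N suffix on the
route key; as documented for 4.5 it controls how many of the leading hash bits the
route key contributes (the values below are illustrative):

    1234/8!unique_id    8 bits of the hash come from "1234", the rest from "unique_id"
    1234!unique_id      default: 16 bits from "1234", 16 from "unique_id"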


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com wrote:
 I'm curious what the later shard-local bits do, if anything?

 I have a very large cluster (256 shards) and I'm sending most of my data
 with a single composite, e.g. 1234!unique_id, but I'm noticing the data
 is being split among many of the shards.

That shouldn't be the case.  All of your shards should have a lower
hash value with all 0 bits and an upper hash value of all 1s (i.e.
0x to 0x)
So you see any shards where that's not true?

Also, is the router set to compositeId?

-Yonik

 My guess right now is that since I'm only using the default 16 bits my data
 is being split across multiple shards (because of my high # of shards).

 Thanks,
 Brett


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
Router is definitely compositeId.

To be clear, data isn't being spread evenly... it's like it's *almost*
working. It's just odd to me that I'm slamming in data that's 99% of one
_route_ key yet after a few minutes (from a fresh empty index) I have 2
shards with a sizeable amount of data (68M and 128M) and the rest are very
small as expected.

The fact that two are receiving so much makes me think my data is being
split into two shards. I'm trying to debug more now.


On Tue, Oct 8, 2013 at 5:45 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  I'm curious what the later shard-local bits do, if anything?
 
  I have a very large cluster (256 shards) and I'm sending most of my data
  with a single composite, e.g. 1234!unique_id, but I'm noticing the
 data
  is being split among many of the shards.

 That shouldn't be the case.  All of your shards should have a lower
 hash value with all 0 bits and an upper hash value of all 1s (i.e.
 0x to 0x)
 So you see any shards where that's not true?

 Also, is the router set to compositeId?

 -Yonik

  My guess right now is that since I'm only using the default 16 bits my
 data
  is being split across multiple shards (because of my high # of shards).
 
  Thanks,
  Brett



limiting deep pagination

2013-10-08 Thread Peter Keegan
Is there a way to configure Solr 'defaults/appends/invariants' such that
the product of the 'start' and 'rows' parameters doesn't exceed a given
value? This would be to prevent deep pagination.  Or would this require a
custom requestHandler?

Peter


dynamic field question

2013-10-08 Thread Twomey, David

I am having trouble trying to return a particular dynamic field only instead of 
all dynamic fields.

Imagine I have a document with an unknown number of sections.  Each section can 
have a 'title' and a 'body'

 I have each section title and body as dynamic fields such as section_title_*  
and section_body_*

Imagine that some documents contain a section that has a title=Appendix

I want a query that will find all docs with that section and return just the 
Appendix section.

I don't know how to return just that one section though

I can copyField my dynamic field section_title_* into a static field called 
section_titles and query that for docs that contain the Appendix

But I don't know how to only return that one dynamic field

?q=section_titles:Appendixfl=section_body_*

Any ideas?   I can't seem to put a conditional in the fl parameter





Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
This is my clusterstate.json:
https://gist.github.com/bretthoerner/0098f741f48f9bb51433

And these are my core sizes (note large ones are sorted to the end):
https://gist.github.com/bretthoerner/f5b5e099212194b5dff6

I've only heavily sent 2 shards by now (I'm sharding by hour and it's
been running for 2). There *is* a little old data in my stream, but not
that much (like 5%). What's confusing to me is that 5 of them are rather
large, when I'd expect 2 of them to be.


On Tue, Oct 8, 2013 at 5:45 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  I'm curious what the later shard-local bits do, if anything?
 
  I have a very large cluster (256 shards) and I'm sending most of my data
  with a single composite, e.g. 1234!unique_id, but I'm noticing the
 data
  is being split among many of the shards.

 That shouldn't be the case.  All of your shards should have a lower
 hash value with all 0 bits and an upper hash value of all 1s (i.e.
 0x to 0x)
 So you see any shards where that's not true?

 Also, is the router set to compositeId?

 -Yonik

  My guess right now is that since I'm only using the default 16 bits my
 data
  is being split across multiple shards (because of my high # of shards).
 
  Thanks,
  Brett



Re: limiting deep pagination

2013-10-08 Thread Tomás Fernández Löbbe
I don't know of any OOTB way to do that, I'd write a custom request handler
as you suggested.

Tomás


On Tue, Oct 8, 2013 at 3:51 PM, Peter Keegan peterlkee...@gmail.com wrote:

 Is there a way to configure Solr 'defaults/appends/invariants' such that
 the product of the 'start' and 'rows' parameters doesn't exceed a given
 value? This would be to prevent deep pagination.  Or would this require a
 custom requestHandler?

 Peter



Re: limiting deep pagination

2013-10-08 Thread Erik Hatcher
I'd recommend a custom first-components SearchComponent.  Then it could 
simply validate (or adjust) the parameters or throw an exception. 

Knowing Tomás - that's probably what he'd really do :) 

Erik

On Oct 8, 2013, at 19:34, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

 I don't know of any OOTB way to do that, I'd write a custom request handler
 as you suggested.
 
 Tomás
 
 
 On Tue, Oct 8, 2013 at 3:51 PM, Peter Keegan peterlkee...@gmail.com wrote:
 
 Is there a way to configure Solr 'defaults/appends/invariants' such that
 the product of the 'start' and 'rows' parameters doesn't exceed a given
 value? This would be to prevent deep pagination.  Or would this require a
 custom requestHandler?
 
 Peter
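
A rough sketch of such a first-components check, assuming the intent is to cap how
deep paging can go (the class name and the 10,000 limit are illustrative, not an
existing Solr class):

    import java.io.IOException;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class PagingLimitComponent extends SearchComponent {
        private static final int MAX_OFFSET = 10000;  // illustrative cap

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            SolrParams params = rb.req.getParams();
            int start = params.getInt(CommonParams.START, 0);
            int rows = params.getInt(CommonParams.ROWS, 10);
            if (start + rows > MAX_OFFSET) {
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                        "start + rows must not exceed " + MAX_OFFSET);
            }
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // validation only; nothing to do at process time
        }

        @Override
        public String getDescription() {
            return "Rejects overly deep pagination";
        }

        @Override
        public String getSource() {
            return null;
        }
    }

It would then be registered as a searchComponent in solrconfig.xml and listed in
the request handler's first-components array.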
 


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com wrote:
 This is my clusterstate.json:
 https://gist.github.com/bretthoerner/0098f741f48f9bb51433

 And these are my core sizes (note large ones are sorted to the end):
 https://gist.github.com/bretthoerner/f5b5e099212194b5dff6

 I've only heavily sent 2 shards by now (I'm sharding by hour and it's
 been running for 2). There *is* a little old data in my stream, but not
 that much (like 5%). What's confusing to me is that 5 of them are rather
 large, when I'd expect 2 of them to be.

The cluster state looks fine at first glance... and each route key
should map to a single shard.
You could try a query to each of the big shards and see what IDs are in them.

-Yonik


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
I have a silly question, how do I query a single shard in SolrCloud? When I
hit solr/foo_shard1_replica1/select it always seems to do a full cluster
query.

I can't (easily) do a _route_ query before I know what each have.


On Tue, Oct 8, 2013 at 7:06 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  This is my clusterstate.json:
  https://gist.github.com/bretthoerner/0098f741f48f9bb51433
 
  And these are my core sizes (note large ones are sorted to the end):
  https://gist.github.com/bretthoerner/f5b5e099212194b5dff6
 
  I've only heavily sent 2 shards by now (I'm sharding by hour and it's
  been running for 2). There *is* a little old data in my stream, but not
  that much (like 5%). What's confusing to me is that 5 of them are rather
  large, when I'd expect 2 of them to be.

 The cluster state looks fine at first glance... and each route key
 should map to a single shard.
 You could try a query to each of the big shards and see what IDs are in
 them.

 -Yonik



Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
Ignore me I forgot about shards= from the wiki.


On Tue, Oct 8, 2013 at 7:11 PM, Brett Hoerner br...@bretthoerner.comwrote:

 I have a silly question, how do I query a single shard in SolrCloud? When
 I hit solr/foo_shard1_replica1/select it always seems to do a full cluster
 query.

 I can't (easily) do a _route_ query before I know what each have.


 On Tue, Oct 8, 2013 at 7:06 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  This is my clusterstate.json:
  https://gist.github.com/bretthoerner/0098f741f48f9bb51433
 
  And these are my core sizes (note large ones are sorted to the end):
  https://gist.github.com/bretthoerner/f5b5e099212194b5dff6
 
  I've only heavily sent 2 shards by now (I'm sharding by hour and it's
  been running for 2). There *is* a little old data in my stream, but not
  that much (like 5%). What's confusing to me is that 5 of them are
 rather
  large, when I'd expect 2 of them to be.

 The cluster state looks fine at first glance... and each route key
 should map to a single shard.
 You could try a query to each of the big shards and see what IDs are in
 them.

 -Yonik





Re: Improving indexing performance

2013-10-08 Thread Erick Erickson
queue size shouldn't really be too large, the whole point of
the concurrency is to keep from waiting around for the
communication with the server in a single thread. So having
a bunch of stuff backed up in the queue isn't buying you anything

And you can always increase the memory allocated to the JVM
running SolrJ...

Erick

On Tue, Oct 8, 2013 at 5:29 AM, Matteo Grolla matteo.gro...@gmail.com wrote:
 Thanks Erick,
 I think I have been able to exhaust a resource:
 if I split the data in 2 and upload it with 2 clients like benchmark
 1.1, it takes 120s; here the bottleneck is my LAN.
 If I use a setting like benchmark 1, the bottleneck is probably the
 ramBuffer.

 I'm going to buy a Gigabit ethernet cable so I can make a better test.

 OutOfMemory error: it's the solrj client that crashes.
 I'm using solr 4.2.1 and the corresponding solrj client.
 HttpSolrServer works fine;
 ConcurrentUpdateSolrServer gives me problems, and I didn't
 understand how to size the queueSize parameter optimally.


 Il giorno 07/ott/2013, alle ore 14:03, Erick Erickson ha scritto:

 Just skimmed, but the usual reason you can't max out the server
 is that the client can't go fast enough. Very quick experiment:
 comment out the server.add line in your client and run it again,
 does that speed up the client substantially? If not, then the time
 is being spent on the client.

 Or split your csv file into, say, 5 parts and run it from 5 different
 PCs in parallel.

 bq:  I can't rely on auto commit, otherwise I get an OutOfMemory error
 This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
 allocating more memory to the JVM running Solr.

 bq: committing every 100k docs gives worse performance
 It'll be best to specify openSearcher=false for max indexing throughput
 BTW. You should be able to do this quite frequently, 15 seconds seems
 quite reasonable.

 Best,
 Erick

 On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 I'd like to have some suggestion on how to improve the indexing performance 
 on the following scenario
 I'm uploading 1M docs to solr,

 every docs has
id: sequential number
title:  small string
date: date
body: 1kb of text

 Here are my benchmarks (they are all single executions, not averages from 
 multiple executions):

 1)  using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document

total time: 143035ms

 1.1)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
        <ramBufferSizeMB>500</ramBufferSizeMB>
        <maxBufferedDocs>10</maxBufferedDocs>

total time: 134493ms

 1.2)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
        <mergeFactor>30</mergeFactor>

total time: 143134ms

 2)  using a solrj client from another pc in the lan (100Mbps)
with httpsolrserver
with javabin format
add documents to the server in batches of 1k docs   ( 
 server.add( collection ) )
auto commit every 15s with openSearcher=false and commit after last 
 document

total time: 139022ms

 3)  using a solrj client from another pc in the lan (100Mbps)
with concurrentupdatesolrserver
        with javabin format
add documents to the server in batches of 1k docs   ( 
 server.add( collection ) )
server queue size=20k
server threads=4
no auto-commit and commit every 100k docs

total time: 167301ms


 --On the solr server--
 cpu averages25%
at best 100% for 1 core
 IO  is still far from being saturated
iostat gives a pattern like this (every 5 s)

time(s) %util
100 45,20
105 1,68
110 17,44
115 76,32
120 2,64
125 68
130 1,28

 I thought that using concurrentupdatesolrserver I was able to max cpu or IO 
 but I wasn't.
 With concurrentupdatesolrserver I can't rely on auto commit, otherwise I 
 get an OutOfMemory error
 and I found that committing every 100k docs gives worse performance than 
 auto commit every 15s (benchmark 3 with httpsolrserver took 193515)

 I'd really like to understand why I can't max out the resources on the 
 server hosting solr (disk above all)
 And I'd really like to understand what I'm doing wrong with 
 concurrentupdatesolrserver

 thanks
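
For reference, the openSearcher=false auto commit mentioned in this thread is
configured in solrconfig.xml roughly like this (the 15 second interval is just the
value used in these benchmarks):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>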




Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Shawn Heisey

On 10/8/2013 6:12 PM, Brett Hoerner wrote:

Ignore me I forgot about shards= from the wiki.


On Tue, Oct 8, 2013 at 7:11 PM, Brett Hoerner br...@bretthoerner.comwrote:


I have a silly question, how do I query a single shard in SolrCloud? When
I hit solr/foo_shard1_replica1/select it always seems to do a full cluster
query.

I can't (easily) do a _route_ query before I know what each have.


There is also the distrib=false parameter that will cause the request 
to be handled directly by the core it is sent to rather than being 
distributed/balanced by SolrCloud.


Thanks,
Shawn
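
For example, a per-core document count can be checked with a request of this shape
(host and core name are illustrative); with distrib=false the numFound reflects
only the core the request was sent to:

    http://host:8983/solr/foo_shard1_replica1/select?q=*:*&rows=0&distrib=false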



Re: Bootstrapping / Full Importing using Solr Cloud

2013-10-08 Thread Erick Erickson
DIH works with SolrCloud as far as I understand. But
moving to SolrJ has several advantages:
1 you have more control over our process, beter
ability to debug etc.
2 If you can partition your data up amongst
several clients, you can probably get through your jobs
much faster.
3 You're not overloading one machine with both the
DIH bits and the indexing bits.

There are some other options, I generally prefer SolrJ
though. Others have different opinions of course.

Best,
Erick

On Tue, Oct 8, 2013 at 12:57 PM, Mark static.void@gmail.com wrote:
 We are in the process of upgrading our Solr cluster to the latest and 
 greatest Solr Cloud. I have some questions regarding full indexing though. 
 We're currently running a long job (~30 hours) using DIH to do a full index 
 on over 10M products. This process consumes a lot of memory and while 
 updating can not handle any user requests.

 How, or what would be the best way going about this when using Solr Cloud? 
 First off, does DIH work with cloud? Would I need to separate out my DIH 
 indexing machine from the machines serving up user requests? If not going 
 down the DIH route, what are my best options (solrj?)

 Thanks for the input
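
For what it's worth, a minimal sketch of the SolrJ route against SolrCloud looks
like the following; the zkHost string, collection name, fields, and the synthetic
loop standing in for a real data source are all illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("products");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 10000; i++) {            // stand-in for the real data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("name", "product " + i);
                batch.add(doc);
                if (batch.size() >= 1000) {              // send in batches, not one doc at a time
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
            server.shutdown();
        }
    }

Splitting the input among several such clients, as Erick suggests, is then just a
matter of running more than one of them over disjoint slices of the data.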


Re: run filter queries after post filter

2013-10-08 Thread Erick Erickson
Hmmm, seems like it should. What's your evidence that it isn't working?

Best,
Erick

On Tue, Oct 8, 2013 at 4:10 PM, Rohit Harchandani rhar...@gmail.com wrote:
 Hey,
 I am using solr 4.0 with my own PostFilter implementation which is executed
 after the normal solr query is done. This filter has a cost of 100. Is it
 possible to run filter queries on the index after the execution of the post
 filter?
 I tried adding the below line to the url but it did not seem to work:
 fq={!cache=false cost=200}field:value
 Thanks,
 Rohit


Re: no such field error:smaller big block size details while indexing doc files

2013-10-08 Thread Erick Erickson
Hmmm, that is odd, the glob dynamicField should
pick this up.

Not quite sure what's going on. You an parse the file
via Tika yourself and look at what's in there, it's a relatively
simple SolrJ program, here's a sample:
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick

On Tue, Oct 8, 2013 at 4:15 PM, sweety sweetyshind...@yahoo.com wrote:
 This my new schema.xml:
 schema  name=documents
 fields
 field name=id type=string indexed=true stored=true required=true 
 multiValued=false/
 field name=author type=string indexed=true stored=true 
 multiValued=true/
 field name=comments type=text indexed=true stored=true 
 multiValued=false/
 field name=keywords type=text indexed=true stored=true 
 multiValued=false/
 field name=contents type=text indexed=true stored=true 
 multiValued=false/
 field name=title type=text indexed=true stored=true 
 multiValued=false/
 field name=revision_number type=string indexed=true stored=true 
 multiValued=false/
 field name=_version_ type=long indexed=true stored=true 
 multiValued=false/
 dynamicField name=ignored_* type=string indexed=false stored=true 
 multiValued=true/
 dynamicField name=* type=ignored  multiValued=true /
 copyfield source=id dest=text /
 copyfield source=author dest=text /
 /fields
 types
 fieldtype name=ignored stored=false indexed=false 
 class=solr.StrField /
 fieldType name=integer class=solr.IntField /
 fieldType name=long class=solr.LongField /
 fieldType name=string class=solr.StrField  /
 fieldType name=text class=solr.TextField /
 /types
 uniqueKeyid/uniqueKey
 /schema
 I still get the same error.

 
  From: Erick Erickson [via Lucene] ml-node+s472066n4094013...@n3.nabble.com
 To: sweety sweetyshind...@yahoo.com
 Sent: Tuesday, October 8, 2013 7:16 AM
 Subject: Re: no such field error:smaller big block size details while 
 indexing doc files



 Well, one of the attributes parsed out of, probably the
 meta-information associated with one of your structured
 docs is SMALLER_BIG_BLOCK_SIZE_DETAILS and
 Solr Cel is faithfully sending that to your index. If you
 want to throw all these in the bit bucket, try defining
 a true catch-all field that ignores things, like this.
 dynamicField name=* type=ignored multiValued=true /

 Best,
 Erick

 On Mon, Oct 7, 2013 at 8:03 AM, sweety [hidden email] wrote:

 Im trying to index .doc,.docx,pdf files,
 im using this url:
 curl
 http://localhost:8080/solr/document/update/extract?literal.id=12commit=true;
 -Fmyfile=@complex.doc

 This is the error I get:
 Oct 07, 2013 5:02:18 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchFieldError:
 SMALLER_BIG_BLOCK_SIZE_DETAILS
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
 at
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.lang.NoSuchFieldError: SMALLER_BIG_BLOCK_SIZE_DETAILS
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:93)
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:190)
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:184)
 at
 org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:376)
 at
 org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
 at
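
A minimal sketch of the parse it yourself check Erick suggests, using Tika directly
to list the metadata names a document would hand to Solr Cell (assumes tika-core
and tika-parsers on the classpath; the class name is illustrative):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaPeek {
        public static void main(String[] args) throws Exception {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                BodyContentHandler text = new BodyContentHandler(-1);  // -1 = no write limit
                Metadata meta = new Metadata();
                new AutoDetectParser().parse(in, text, meta, new ParseContext());
                for (String name : meta.names()) {
                    System.out.println(name + " = " + meta.get(name));
                }
            }
        }
    }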
 

Re: How to share Schema between multicore on Solr 4.4

2013-10-08 Thread Shawn Heisey

On 10/7/2013 6:02 AM, Dharmendra Jaiswal wrote:

I am using Solr 4.4 version with SolrCloud on Windows machine.
Somehow i am not able to share schema between multiple core.


If you're in SolrCloud mode, then you already *are* sharing your 
schema.  You are also sharing your configuration.  Both of them are in 
zookeeper.  All collections (and all shards within a collection) which 
use a given config name are using the same copy.


Any copies of your config/schema that might be on your disk are *NOT* 
being used.  If you are starting Solr with any bootstrap options, then 
the config set that is in zookeeper might be getting overwritten by 
what's on your disk when Solr restarts, but otherwise SolrCloud *only* 
uses zookeeper for config/schema. The bootstrap options are meant to be 
used once, and I actually prefer to get SolrCloud operational without 
using bootstrap options at all.


Thanks,
Shawn
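
For reference, pushing a config set to ZooKeeper without any bootstrap options can
be done with the zkcli script that ships with Solr, roughly as follows (ZooKeeper
hosts, paths, and names are illustrative):

    cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
        -cmd upconfig -confdir /path/to/myconf -confname shared_conf

    cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
        -cmd linkconfig -collection mycollection -confname shared_conf

Every collection linked to the same confname then shares one copy of the schema
and config.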



Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 8:27 PM, Shawn Heisey s...@elyograg.org wrote:
 There is also the distrib=false parameter that will cause the request to
 be handled directly by the core it is sent to rather than being
 distributed/balanced by SolrCloud.

Right - this is probably the best option for diagnosing what is in what index.

-Yonik


Re: How to warm up filter queries for a category field with 1000 possible values ?

2013-10-08 Thread Shawn Heisey

On 10/7/2013 12:36 AM, user 01 wrote:

what's the way to warm up filter queries for a category field with 1000
possible values. Would I need to write 1000 lines manually in the
solrconig.xml or what is the format?


Erick has given you awesome advice.  Here's something a little bit 
different that doesn't invalidate his advice:


If you have enough free RAM (not used by programs) for good OS disk 
caching, then as soon as you do one query that checks this field, then 
all 1000 values for that field are likely to be in RAM, and the next 
query against that field is going to be lightning fast, because the 
operating system will not have to read the disk to get the information.  
Although it is slightly faster to get informatin out of Solr's caches 
than the OS disk cache, the operating system is far better at managing 
huge caches than Solr and Java are.


http://wiki.apache.org/solr/SolrPerformanceProblems#General_information

Thanks,
Shawn
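
For completeness, explicit warming, if you still want a handful of representative
filters primed in solrconfig.xml rather than all 1000 values, looks roughly like
this (field and values are illustrative):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="fq">category:books</str></lst>
        <lst><str name="q">*:*</str><str name="fq">category:electronics</str></lst>
      </arr>
    </listener>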



Re: SolrJ best practices

2013-10-08 Thread Shawn Heisey

On 10/7/2013 3:08 PM, Mark wrote:

Some specific questions:
- When working with HttpSolrServer should we keep around instances for ever or 
should we create a singleton that can/should be used over and over?
- Is there a way to change the collection after creating the server or do we 
need to create a new server for each collection?


If at all possible, you should create your server object and use it for 
the life of your application.  SolrJ is threadsafe.  If there is any 
part of it that's not, the javadocs should say so - the SolrServer 
implementations definitely are.


By using the word collection you are implying that you are using 
SolrCloud ... but earlier you said HttpSolrServer, which implies that 
you are NOT using SolrCloud.


With HttpSolrServer, your base URL includes the core or collection name 
- http://server:port/solr/corename; for example.  Generally you will 
need one object for each core/collection, and another object for 
server-level things like CoreAdmin.


With SolrCloud, you should be using CloudSolrServer instead, another 
implementation of SolrServer that is constantly aware of the SolrCloud 
clusterstate.  With that object, you can use setDefaultCollection, and 
you can also add a collection parameter to each SolrQuery or other 
request object.


Thanks,
Shawn
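
In code, the keep one instance around advice usually just means something like this
(URL and core name are illustrative):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SolrClientHolder {
        // created once and shared across threads; SolrServer implementations are thread-safe
        public static final HttpSolrServer PRODUCTS =
                new HttpSolrServer("http://solrhost:8983/solr/products");
    }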



Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Saurabh Saxena
Repeated the experiments on a local system: a single-shard SolrCloud with a
replica. Tried to index 10K docs. All the indexing operations were
directed to the replica Solr node. While the documents were getting indexed
on the replica, I shut down the leader Solr node. Out of 10K docs, only 9900
docs got indexed. If I repeat the experiment without shutting down the
leader instance, all 10K docs get indexed. I am using curl to upload the
docs, and there was no curl error while uploading documents.

Following error was there in replica log file.

ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: No registered leader was found,
collection:test_collection slice:shard1

Attached replica log file.


On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.comwrote:

 Sorry for the late reply.

 All the documents have unique id. If I repeat the experiment, the num of
 docs indexed changes (I guess it depends when I shutdown a particular
 shard). When I do the experiment without shutting down leader Shards, all
 80k docs get indexed (which I think proves that all documents are valid).

 I need to dig the logs to find error message. Also, I am not tracking of
 curl return code, will run again and reply.

 Regards,
 Saurabh


 On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson 
 erickerick...@gmail.comwrote:

 And do any of the documents have the same uniqueKey, which
 is usually called id? Subsequent adds of docs with the same
 uniqueKey replace the earlier one.

 It's not definitive because it changes as merges happen, old copies
 of docs that have been deleted or updated will be purged, but what
 does your admin page show for maxDoc? If it's more than numDocs
 then you have duplicate uniqueKeys. NOTE: if you optimize
 (which you usually shouldn't) then maxDoc and numDocs will be
 the same so if you test this don't optimize.

 Best,
 Erick


 On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
 wun...@wunderwood.org wrote:
  Did all of the curl update commands return success? Ane errors in the
 logs?
 
  wunder
 
  On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
 
  Is it possible that some of those 80K docs were simply not valid? e.g.
  had a wrong field, had a missing required field, anything like that?
  What happens if you clear this collection and just re-run the same
  indexing process and do everything else the same?  Still some docs
  missing?  Same number?
 
  And what if you take 1 document that you know is valid and index it
  80K times, with a different ID, of course?  Do you see 80K docs in the
  end?
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  Doc count did not change after I restarted the nodes. I am doing a
 single
  commit after all 80k docs. Using Solr 4.4.
 
  Regards,
  Saurabh
 
 
  On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  Interesting. Did the doc count change after you started the nodes
 again?
  Can you tell us about commits?
  Which version? 4.5 will be out soon.
 
  Otis
  Solr  ElasticSearch Support
  http://sematext.com/
  On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
 
  Hello,
 
  I am testing High Availability feature of SolrCloud. I am using the
  following setup
 
  - 8 linux hosts
  - 8 Shards
  - 1 leader, 1 replica / host
  - Using Curl for update operation
 
  I tried to index 80K documents on replicas (10K/replica in
 parallel).
  During indexing process, I stopped 4 Leader nodes. Once indexing is
 done,
  out of 80K docs only 79808 docs are indexed.
 
  Is this an expected behaviour ? In my opinion replica should take
 care of
  indexing if leader is down.
 
  If this is an expected behaviour, any steps that can be taken from
 the
  client side to avoid such a situation.
 
  Regards,
  Saurabh Saxena
 
 
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 





stats on dynamic fields?

2013-10-08 Thread Li Xu
Hi,

I don't seem to be able to find any info on the possibility to get stats on
dynamic fields. stats=truestates.field=xyz_* appears to literally treat
xyz_* as the field name with a star. Is there a way to get stats on
dynamic fields without explicitly listing them in the query?

Thanks!
Li
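
For reference, the explicit listing the question mentions is simply one stats.field
parameter per resolved field name (names illustrative):

    stats=true&stats.field=xyz_price&stats.field=xyz_weight&stats.field=xyz_length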


Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Mark Miller
The attachment did not go through - try using pastebin.com or something.

Are you adding docs with curl one at a time or in bulk per request.

- Mark

On Oct 8, 2013, at 9:58 PM, Saurabh Saxena ssax...@gopivotal.com wrote:

 Repeated the experiments on local system. Single shard Solrcloud with a 
 replica. Tried to index 10K docs. All the indexing operation were redirected 
 to replica Solr node. While the document while getting indexed on replica, I 
 shutdown the leader Solr node. Out of 10K docs, only 9900 docs got indexed. 
 If I repeat the experiment without shutting down the leader instance, all 10K 
 docs get indexed. I am using curl to upload the docs, there was no curl error 
 while uploading documents. 
 
 Following error was there in replica log file. 
 
 ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException; 
 org.apache.solr.common.SolrException: No registered leader was found, 
 collection:test_collection slice:shard1
 
 Attached replica log file. 
 
 
 On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.com wrote:
 Sorry for the late reply.
 
 All the documents have unique id. If I repeat the experiment, the num of docs 
 indexed changes (I guess it depends when I shutdown a particular shard). When 
 I do the experiment without shutting down leader Shards, all 80k docs get 
 indexed (which I think proves that all documents are valid). 
 
 I need to dig the logs to find error message. Also, I am not tracking of curl 
 return code, will run again and reply.
 
 Regards,
 Saurabh 
 
 
 On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 And do any of the documents have the same uniqueKey, which
 is usually called id? Subsequent adds of docs with the same
 uniqueKey replace the earlier one.
 
 It's not definitive because it changes as merges happen, old copies
 of docs that have been deleted or updated will be purged, but what
 does your admin page show for maxDoc? If it's more than numDocs
 then you have duplicate uniqueKeys. NOTE: if you optimize
 (which you usually shouldn't) then maxDoc and numDocs will be
 the same so if you test this don't optimize.
 
 Best,
 Erick
 
 
 On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
 wun...@wunderwood.org wrote:
  Did all of the curl update commands return success? Ane errors in the logs?
 
  wunder
 
  On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
 
  Is it possible that some of those 80K docs were simply not valid? e.g.
  had a wrong field, had a missing required field, anything like that?
  What happens if you clear this collection and just re-run the same
  indexing process and do everything else the same?  Still some docs
  missing?  Same number?
 
  And what if you take 1 document that you know is valid and index it
  80K times, with a different ID, of course?  Do you see 80K docs in the
  end?
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena ssax...@gopivotal.com 
  wrote:
  Doc count did not change after I restarted the nodes. I am doing a single
  commit after all 80k docs. Using Solr 4.4.
 
  Regards,
  Saurabh
 
 
  On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  Interesting. Did the doc count change after you started the nodes again?
  Can you tell us about commits?
  Which version? 4.5 will be out soon.
 
  Otis
  Solr  ElasticSearch Support
  http://sematext.com/
  On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com wrote:
 
  Hello,
 
  I am testing High Availability feature of SolrCloud. I am using the
  following setup
 
  - 8 linux hosts
  - 8 Shards
  - 1 leader, 1 replica / host
  - Using Curl for update operation
 
  I tried to index 80K documents on replicas (10K/replica in parallel).
  During indexing process, I stopped 4 Leader nodes. Once indexing is 
  done,
  out of 80K docs only 79808 docs are indexed.
 
  Is this an expected behaviour ? In my opinion replica should take care 
  of
  indexing if leader is down.
 
  If this is an expected behaviour, any steps that can be taken from the
  client side to avoid such a situation.
 
  Regards,
  Saurabh Saxena
 
 
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 
 



Re: ALIAS feature, can be used for what?

2013-10-08 Thread Mark Miller
Right - update aliases should only map an alias to one collection, but are 
perfectly valid.

Read aliases can map to multiple collections or just one.

There is currently only a create alias command and not an update alias command. 
I suppose because the impl for create just happened to work for update as well, 
so I guess I figured why add it explicitly. I figured we could still do it 
later - and I suppose we probably should.

I also intend to add a list alias command: 
https://issues.apache.org/jira/browse/SOLR-4968

- Mark

On Oct 8, 2013, at 11:31 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 You can index to an alias that points at only one collection. Works fine!
 
 Michael Della Bitta
 
 Applications Developer
 
 o: +1 646 532 3062  | c: +1 917 477 7906
 
 appinions inc.
 
 “The Science of Influence Marketing”
 
 18 East 41st Street
 
 New York, NY 10017
 
 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/
 
 
 On Fri, Oct 4, 2013 at 7:59 AM, Upayavira u...@odoko.co.uk wrote:
 
 I've used this feature to great effect. I have logs coming in, and I
 create a core for each day. At the end of each day, I create a new core
 for tomorrow, unload any cores over 2 months old, then create a set of
 aliases (all, month, week, today) pointing to just the cores
 that are needed for that range. Thus, my app can efficiently query the
 bit of the index it is really interested in.
 
 You cannot, as far as I am aware, index directly to an alias. It
 wouldn't know what to do with the content. However, you can create an
 alias over the top of an existing one, and it will replace it. Works
 nicely.
 
 Upayavira
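
  A sketch of that nightly rollover in SolrCloud terms, assuming one collection
  per day and illustrative names (configName and shard/replica counts are
  assumptions):

      # Create tomorrow's collection
      curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_20131009&numShards=1&replicationFactor=2&collection.configName=logs"

      # Rebuild the rolling read aliases over just the collections each range needs;
      # CREATEALIAS over an existing alias name replaces it
      curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=today&collections=logs_20131009"
      curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=week&collections=logs_20131003,logs_20131004,logs_20131005,logs_20131006,logs_20131007,logs_20131008,logs_20131009"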
 
 On Fri, Oct 4, 2013, at 10:41 AM, Jan Høydahl wrote:
 Hi,
 
 I have been asked the same question. There are only DELETEALIAS and
 CREATEALIAS actions available, so is there a way to achieve uninterrupted
 switch of an alias from one index to another? Are we lacking a MOVEALIAS
 command?
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:
 
 I need to delete the alias for the old collection before pointing it to the
 new one, right?
 
 --
 Yago Riveiro
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
 On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
 
 Hi,
 
 Imagine you have an index and you need to reindex your data into a new
 index, but don't want to have to reconfigure or restart client apps
 when you want to point them to the new index. This is where aliases
 come in handy. If you created an alias for the first index and made
 your apps hit that alias, then you can just repoint the same alias to
 your new index and avoid having to touch client apps.
 
 No, I don't think you can write to multiple collections through a
 single alias.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
 Today I was thinking about the ALIAS feature and its utility in Solr.
 
 Can anyone explain to me, with an example, where this feature may be
 useful?
 
 Is it possible to have an ALIAS of multiple collections? If I do a
 write to the
 alias, is this write applied to all collections?
 
 /Yago
 
 
 
 -
 Best regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
 Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
 
 
 
 
 
 
 
 



Re: dynamic field question

2013-10-08 Thread Jack Krupansky
I'd suggest that each of your source document sections be a distinct 
Solr document. All of the sections could have a source document ID field 
to tie them together.


Dynamic fields work best when used in moderation. Your use case seems like 
an excessive use of dynamic fields.


-- Jack Krupansky
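
A minimal sketch of that per-section modelling, assuming fields named
source_doc_id, section_title and section_body are defined in the schema (or
match a dynamic-field rule), and a core named collection1:

    # Index each section of a source document as its own Solr document
    curl "http://localhost:8983/solr/collection1/update?commit=true" \
         -H "Content-Type: application/json" -d '
    [
      {"id": "doc42_sec1", "source_doc_id": "doc42", "section_title": "Introduction", "section_body": "..."},
      {"id": "doc42_sec9", "source_doc_id": "doc42", "section_title": "Appendix",     "section_body": "..."}
    ]'

    # Return only the Appendix sections, plus the id of the parent document
    curl "http://localhost:8983/solr/collection1/select?q=section_title:Appendix&fl=source_doc_id,section_body"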

-Original Message- 
From: Twomey, David

Sent: Tuesday, October 08, 2013 6:59 PM
To: solr-user@lucene.apache.org
Subject: dynamic field question


I am having trouble trying to return a particular dynamic field only instead 
of all dynamic fields.


Imagine I have a document with an unknown number of sections.  Each section 
can have a 'title' and a 'body'


I have each section title and body as dynamic fields such as section_title_* 
and section_body_*


Imagine that some documents contain a section that has a title=Appendix

I want a query that will find all docs with that section and return just the 
Appendix section.


I don't know how to return just that one section though

I can copyField my dynamic field section_title_* into a static field called 
section_titles and query that for docs that contain the Appendix


But I don't know how to only return that one dynamic field

?q=section_titles:Appendix&fl=section_body_*

Any ideas?   I can't seem to put a conditional in the fl parameter





Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Saurabh Saxena
Pastebin link: http://pastebin.com/cnkXhz7A

I am doing a bulk request. I am uploading 100 files, each file having 100
docs.

-Saurabh
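
For reference, a bulk upload of one file of documents per request might look like
the sketch below (file names, collection name and JSON format are assumptions;
the same idea applies to XML files posted to /update):

    # Post one file containing ~100 documents per request, then commit once at the end
    for f in docs_*.json; do
      curl "http://localhost:8983/solr/test_collection/update" \
           -H "Content-Type: application/json" --data-binary @"$f"
    done
    curl "http://localhost:8983/solr/test_collection/update?commit=true"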


On Tue, Oct 8, 2013 at 7:39 PM, Mark Miller markrmil...@gmail.com wrote:

 The attachment did not go through - try using pastebin.com or something.

 Are you adding docs with curl one at a time or in bulk per request?

 - Mark

 On Oct 8, 2013, at 9:58 PM, Saurabh Saxena ssax...@gopivotal.com wrote:

  Repeated the experiments on a local system: a single-shard SolrCloud with a
 replica. Tried to index 10K docs. All the indexing operations were
 directed to the replica Solr node. While the documents were getting indexed
 on the replica, I shut down the leader Solr node. Out of 10K docs, only 9900
 docs got indexed. If I repeat the experiment without shutting down the
 leader instance, all 10K docs get indexed. I am using curl to upload the
 docs; there was no curl error while uploading documents.
 
  The following error was in the replica log file.
 
  ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException;
 org.apache.solr.common.SolrException: No registered leader was found,
 collection:test_collection slice:shard1
 
  Attached replica log file.
 
 
  On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  Sorry for the late reply.
 
  All the documents have a unique id. If I repeat the experiment, the number of
 docs indexed changes (I guess it depends on when I shut down a particular
 shard). When I do the experiment without shutting down leader shards, all
 80k docs get indexed (which I think proves that all documents are valid).
 
  I need to dig through the logs to find the error message. Also, I am not tracking
 the curl return code; I will run again and reply.
 
  Regards,
  Saurabh
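
  One simple way to track the return codes mentioned above is to print the HTTP
  status of each update request (collection name and file name are assumptions):

      # Capture the HTTP status so a failed update is not silently ignored
      curl -sS -o /dev/null -w "%{http_code}\n" \
           "http://localhost:8983/solr/test_collection/update" \
           -H "Content-Type: application/json" --data-binary @docs_001.json
      # curl's own exit code ($?) additionally signals transport-level failures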
 
 
  On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  And do any of the documents have the same uniqueKey, which
  is usually called id? Subsequent adds of docs with the same
  uniqueKey replace the earlier one.
 
  It's not definitive because it changes as merges happen, old copies
  of docs that have been deleted or updated will be purged, but what
  does your admin page show for maxDoc? If it's more than numDocs
  then you have duplicate uniqueKeys. NOTE: if you optimize
  (which you usually shouldn't) then maxDoc and numDocs will be
  the same so if you test this don't optimize.
 
  Best,
  Erick
 
 
  On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
  wun...@wunderwood.org wrote:
    Did all of the curl update commands return success? Any errors in the
 logs?
  
   wunder
  
   On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
  
   Is it possible that some of those 80K docs were simply not valid? e.g.
   had a wrong field, had a missing required field, anything like that?
   What happens if you clear this collection and just re-run the same
   indexing process and do everything else the same?  Still some docs
   missing?  Same number?
  
   And what if you take 1 document that you know is valid and index it
   80K times, with a different ID, of course?  Do you see 80K docs in the
   end?
  
   Otis
   --
    Solr & ElasticSearch Support -- http://sematext.com/
   Performance Monitoring -- http://sematext.com/spm
  
  
  
   On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena 
 ssax...@gopivotal.com wrote:
   Doc count did not change after I restarted the nodes. I am doing a
 single
   commit after all 80k docs. Using Solr 4.4.
  
   Regards,
   Saurabh
  
  
   On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
   Interesting. Did the doc count change after you started the nodes
 again?
   Can you tell us about commits?
   Which version? 4.5 will be out soon.
  
   Otis
    Solr & ElasticSearch Support
   http://sematext.com/
   On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  
   Hello,
  
   I am testing High Availability feature of SolrCloud. I am using the
   following setup
  
   - 8 linux hosts
   - 8 Shards
   - 1 leader, 1 replica / host
   - Using Curl for update operation
  
   I tried to index 80K documents on replicas (10K/replica in
 parallel).
   During indexing process, I stopped 4 Leader nodes. Once indexing
 is done,
   out of 80K docs only 79808 docs are indexed.
  
   Is this an expected behaviour ? In my opinion replica should take
 care of
   indexing if leader is down.
  
   If this is an expected behaviour, any steps that can be taken from
 the
   client side to avoid such a situation.
  
   Regards,
   Saurabh Saxena
  
  
  
   --
   Walter Underwood
   wun...@wunderwood.org