How to reindex in solr

2011-12-01 Thread Kashif Khan
Hi all,

I have my Solr index built completely, and now I have added a new field to the
schema which is a copyField of another field. Please suggest how I can
reindex Solr without going through the full process I used the first time,
because the data for some fields is very time-consuming to obtain. I have
been trying to reindex from SolrJ, and it indexes well except for the fields
marked as stored=false. On the production servers this reindexing now fails
with an out-of-memory / swap-space error. Please suggest a good method to
reindex from the Lucene indexes, including fields with stored=false.
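For reference, the kind of copyField addition described above would look something like the following in schema.xml (the field and type names here are placeholders, not the actual schema):

```xml
<!-- Hypothetical example of the new copyField setup; actual names will differ. -->
<field name="name_copy" type="text_general" indexed="true" stored="false"/>
<copyField source="name" dest="name_copy"/>
```

Note that copyField is applied at index time, which is why the index must be rebuilt for the new field to be populated.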

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-reindex-in-solr-tp3550871p3550871.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Seek past EOF

2011-12-01 Thread Ruben Chadien
We are using ext3 on Debian.

Noticed today that I only need to reload the core to get it working again.


On 30 November 2011 19:59, Simon Willnauer
simon.willna...@googlemail.com wrote:

 can you give us some details about what filesystem you are using?

 simon

 On Wed, Nov 30, 2011 at 3:07 PM, Ruben Chadien ruben.chad...@aspiro.com
 wrote:
  Happened again….
 
  I got 3 directories in my index dir
 
  4096 Nov  4 09:31 index.2004083156
  4096 Nov 21 10:04 index.2021090440
  4096 Nov 30 14:55 index.2029024919
 
  As you can see, the first two are old and also empty; the last one, from
  today, contains 9 files, none of which are 0 bytes, for a total size of
  7 GB. The size of the index on the master is 14 GB.
 
  Any ideas on what to look for ?
 
  Thanks
  Ruben Chadien
 
 
 
 
  On 29 November 2011 15:58, Mark Miller markrmil...@gmail.com wrote:
 
  Hmm...I've seen a bug like this, but I don't think it would be tickled
 if
  you are replicating config files...
 
  It def looks related though ... I'll try to dig around.
 
  Next time it happens, take a look on the slave for 0 size files - also
 if
  the index dir on the slave is plain 'index' or has a timestamp as part
 of
  the name (eg timestamp.index).
 
  On Tue, Nov 29, 2011 at 9:53 AM, Ruben Chadien 
 ruben.chad...@aspiro.com
  wrote:
 
   Hi, for the moment there are no 0 sized files, but all indexes are
  working
   now. I will have to look next time it breaks.
   Yes, the directory name is index and it replicates the schema and a
   synonyms file.
  
   /Ruben Chadien
  
   On 29 November 2011 15:29, Mark Miller markrmil...@gmail.com wrote:
  
Also, on your master, what is the name of the index directory? Just
'index'?
   
And are you replicating config files as well or no?
   
   
On Nov 29, 2011, at 9:23 AM, Mark Miller wrote:
   
 Does the problem index have any 0 size files in it?

 On Nov 29, 2011, at 2:54 AM, Ruben Chadien wrote:

  Hi all,

  After upgrading to Solr 3.4 we are having trouble with replication.
  The setup is one indexing master with a few slaves that replicate the
  indexes once every night.
  The largest index is 20 GB, and the master and slaves are on the same
  DMZ.

  Almost every night one of the indexes (17 in total) fails after
  replication with a seek-past-EOF error.

  SEVERE: Error during auto-warming of
  key:org.apache.solr.search.QueryResultKey@bda006e3 : java.io.IOException:
  seek past EOF
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.seek(MMapDirectory.java:347)
  at org.apache.lucene.index.SegmentTermEnum.seek(SegmentTermEnum.java:114)
  at org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:203)
  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:273)
  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:210)
  at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:507)
  at org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
  at org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
  at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:77)
  at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:82)


 After a restart the errors are gone, anyone else seen this ?

 Thanks
 Ruben Chadien

 - Mark Miller
 lucidimagination.com











   
  
  
   --
   *Ruben Chadien
   *Senior Developer
   Mobile +47 900 35 371
   ruben.chad...@aspiro.com
   *
  
   Aspiro Music AS*
   Øvre Slottsgate 25, P.O. Box 8710 Youngstorget, N-0028 Oslo
   Tel +47 452 86 900, fax +47 22 37 36 59
   www.aspiro.com/music
  
 
 
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 
 
 




-- 
*Ruben Chadien
*Senior Developer
Mobile +47 900 35 371
ruben.chad...@aspiro.com
*

Aspiro Music AS*
Øvre Slottsgate 25, P.O. Box 8710 Youngstorget, N-0028 Oslo
Tel +47 452 86 900, fax +47 22 37 36 59
www.aspiro.com/music


Problem with hunspell french dictionary

2011-12-01 Thread Nathan Castelein
Hi,

I'm trying to add the HunspellStemFilterFactory to my Solr project.

I'm trying this on a fresh new download of Solr 3.5.

I downloaded a French dictionary here (found via
http://wiki.services.openoffice.org/wiki/Dictionaries#French_.28France.2C_29):
http://www.dicollecte.org/download/fr/hunspell-fr-moderne-v4.3.zip

But when I start Solr and go to the Solr Analysis page, an error occurs.

Here is the trace:

java.lang.RuntimeException: Unable to load hunspell data!
[dictionary=en_GB.dic,affix=fr-moderne.aff]
at 
org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:82)
at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:546)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:126)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:461)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
range: 3 at java.lang.String.charAt(Unknown Source) at
org.apache.lucene.analysis.hunspell.HunspellDictionary$DoubleASCIIFlagParsingStrategy.parseFlags(HunspellDictionary.java:382)
at
org.apache.lucene.analysis.hunspell.HunspellDictionary.parseAffix(HunspellDictionary.java:165)
at
org.apache.lucene.analysis.hunspell.HunspellDictionary.readAffixFile(HunspellDictionary.java:121)
at
org.apache.lucene.analysis.hunspell.HunspellDictionary.init(HunspellDictionary.java:64)
at
org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:46)
I can't find where the problem is. It seems my dictionary isn't properly
formatted for hunspell, but I tried two different dictionaries and had the
same problem.

I also tried with an English dictionary, and... it works!

So I think my French dictionary is wrong for hunspell, but I don't know why...

Can you help me?


Re: mysolr python client

2011-12-01 Thread Alejandro Gonzalez
Sounds great for a Python project I'm involved in right now. I'll take a
deeper look at it.

Thanks, Marco

2011/11/30 Marco Martinez mmarti...@paradigmatecnologico.com

 Hi all,

 For anyone interested, recently I've been using a new Solr client for
 Python. It's easy and pretty well documented. If you're interested its site
 is: http://mysolr.redtuna.org/
 bye!

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42



Re: mysolr python client

2011-12-01 Thread Jens Grivolla

On 11/30/2011 05:40 PM, Marco Martinez wrote:

For anyone interested, recently I've been using a new Solr client for
Python. It's easy and pretty well documented. If you're interested its site
is: http://mysolr.redtuna.org/


Do you know what advantages it has over pysolr or solrpy? On the page it
only says that "mysolr was born to be a fast and easy-to-use client for
Apache Solr's API" and that "existing Python clients didn't fulfill these
conditions."


Thanks,
Jens



Re: Weird docs-id clustering output in Solr 1.4.1

2011-12-01 Thread Vadim Kisselmann
Hi Stanislaw,
did you already have time to create a patch?
If not, could you please tell me which lines in which class of the source
code are relevant?
Thanks and regards
Vadim Kisselmann



2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com

 Hi,
 the quick and dirty way sounds good :)
 It would be great if you could send me a patch for 1.4.1.


 By the way, I tested Solr 3.5 with my 1.4.1 test index.
 I can search and optimize, but clustering doesn't work (java.lang.Integer
 cannot be cast to java.lang.String).
 The uniqueKey for my docs is the id (sint).
 This was the error message:


 Problem accessing /solr/select/. Reason:

Carrot2 clustering failed

 org.apache.solr.common.SolrException: Carrot2 clustering failed
at
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
at
 org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
 Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast
 to java.lang.String
at
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
at
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
... 23 more

  In this case it's better for me to upgrade/patch the 1.4.1 version.

 Best regards
 Vadim




 2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com

 
  But my actual live system works on solr 1.4.1. i can only change my
  solrconfig.xml and integrate new packages...
   I'll check the possibility of upgrading from 1.4.1 to 3.5 with the same
   index (without reindexing) with luceneMatchVersion 2.9.
   I hope it works...
 

 Another option would be to check out Solr 1.4.1 source code, fix the issue
 and recompile the clustering component. The quick and dirty way would be
 to
  convert all identifiers to strings in the clustering component, before
 they are returned for serialization (I can send you a patch that does
 this). The proper way would be to fix the root cause of the problem, but
 I'd need to dig deeper into the code to find this.

 Staszek





Re: Problem with hunspell french dictionary

2011-12-01 Thread Chris Male
It seems there's a problem with the code that parses the dictionary.
Can you open a JIRA issue with the same information so we can look into
fixing it?

On Thu, Dec 1, 2011 at 10:14 PM, Nathan Castelein 
nathan.castel...@gmail.com wrote:

 Hi,

 I'm trying to add the HunspellStemFilterFactory to my Solr project.

 I'm trying this on a fresh new download of Solr 3.5.

  I downloaded a French dictionary here (found via
  http://wiki.services.openoffice.org/wiki/Dictionaries#French_.28France.2C_29
  ):
  http://www.dicollecte.org/download/fr/hunspell-fr-moderne-v4.3.zip

  But when I start Solr and go to the Solr Analysis page, an error occurs.

  Here is the trace:

 java.lang.RuntimeException: Unable to load hunspell data!
 [dictionary=en_GB.dic,affix=fr-moderne.aff]
at
 org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:82)
at
 org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:546)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:126)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:461)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
at
 org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at
 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)

 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
 range: 3 at java.lang.String.charAt(Unknown Source) at

 org.apache.lucene.analysis.hunspell.HunspellDictionary$DoubleASCIIFlagParsingStrategy.parseFlags(HunspellDictionary.java:382)
 at

 org.apache.lucene.analysis.hunspell.HunspellDictionary.parseAffix(HunspellDictionary.java:165)
 at

 org.apache.lucene.analysis.hunspell.HunspellDictionary.readAffixFile(HunspellDictionary.java:121)
 at

 org.apache.lucene.analysis.hunspell.HunspellDictionary.init(HunspellDictionary.java:64)
 at

 org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:46)
  I can't find where the problem is. It seems my dictionary isn't properly
  formatted for hunspell, but I tried two different dictionaries and had
  the same problem.

  I also tried with an English dictionary, and... it works!

  So I think my French dictionary is wrong for hunspell, but I don't know
  why...

  Can you help me?




-- 
Chris Male | Software Developer | DutchWorks | www.dutchworks.nl


Error in New Solr version

2011-12-01 Thread Pawan Darira
Hi

I am migrating from Solr 1.4 to Solr 3.2. I am getting the error below in
my logs:

org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.component.CollapseComponent'

I could not find a satisfactory solution on Google. Please help.

thanks
Pawan


Re: Error in New Solr version

2011-12-01 Thread Vadim Kisselmann
Hi,
comment out the lines with the collapse component in your solrconfig.xml if
you don't need it.
Otherwise, you're missing the right jars for this component, or the paths to
those jars in your solrconfig.xml are wrong.
Regards,
Vadim
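For anyone searching the archives later: CollapseComponent came from the old SOLR-236 field-collapsing patch and does not ship with stock Solr 3.x, so the registration to comment out in solrconfig.xml typically looks something like this (the component name may differ in your config):

```xml
<!-- From the SOLR-236 field-collapsing patch; not included in Solr 3.x.
     Comment this out, or install the matching jar via a <lib> entry. -->
<searchComponent name="collapse"
                 class="org.apache.solr.handler.component.CollapseComponent"/>
```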



2011/12/1 Pawan Darira pawan.dar...@gmail.com

 Hi

 I am migrating from Solr 1.4 to Solr 3.2. I am getting below error in my
 logs

 org.apache.solr.common.SolrException: Error loading class
 'org.apache.solr.handler.component.CollapseComponent

 Could not found satisfactory solution on google. please help

 thanks
 Pawan



Re: make fuzzy search for phrase

2011-12-01 Thread meghana
Any solutions? I am just stuck on this. :(

--
View this message in context: 
http://lucene.472066.n3.nabble.com/make-fuzzy-search-for-phrase-tp3542079p3551203.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: mysolr python client

2011-12-01 Thread Marc SCHNEIDER
Hi Marco,

Great! Maybe you can add it on the Solr wiki? (
http://wiki.apache.org/solr/IntegratingSolr).

Regards,
Marc.

On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote:

 On 11/30/2011 05:40 PM, Marco Martinez wrote:

 For anyone interested, recently I've been using a new Solr client for
 Python. It's easy and pretty well documented. If you're interested its
 site
 is: http://mysolr.redtuna.org/


 Do you know what advantages it has over pysolr or solrpy? On the page it
 only says mysolr was born to be a fast and easy-to-use client for Apache
 Solr’s API and because existing Python clients didn’t fulfill these
 conditions.

 Thanks,
 Jens




Re: mysolr python client

2011-12-01 Thread Marco Martinez
Done!

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2011/12/1 Marc SCHNEIDER marc.schneide...@gmail.com

 Hi Marco,

 Great! Maybe you can add it on the Solr wiki? (
 http://wiki.apache.org/solr/IntegratingSolr).

 Regards,
 Marc.

 On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote:

  On 11/30/2011 05:40 PM, Marco Martinez wrote:
 
  For anyone interested, recently I've been using a new Solr client for
  Python. It's easy and pretty well documented. If you're interested its
  site
  is: http://mysolr.redtuna.org/
 
 
  Do you know what advantages it has over pysolr or solrpy? On the page it
  only says mysolr was born to be a fast and easy-to-use client for Apache
  Solr’s API and because existing Python clients didn’t fulfill these
  conditions.
 
  Thanks,
  Jens
 
 



Re: Solr and Ping PHP

2011-12-01 Thread akopov
Hi,

I know it's been a while since you posted this question but I'm experiencing
the same problem with my instance of Solr (sometimes ping returns false for
no visible reason) and I just wonder if you found the solution.

Thank you.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Ping-PHP-tp2254214p3550917.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Weird docs-id clustering output in Solr 1.4.1

2011-12-01 Thread Stanislaw Osinski
Hi Vadim,

I've had limited connectivity, so I couldn't check out the complete 1.4.1
code and test the changes. Here's what you can try:

In this file:

http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515&view=markup

around line 216 you will see:

for (Document doc : docs) {
  docList.add(doc.getField(solrId));
}

You need to change this to:

for (Document doc : docs) {
  docList.add(doc.getField(solrId).toString());
}

Let me know if this did the trick.

Cheers,

S.

On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann
v.kisselm...@googlemail.com wrote:

 Hi Stanislaw,
 did you already have time to create a patch?
 If not, can you tell me please which lines in which class in source code
 are relevant?
 Thanks and regards
 Vadim Kisselmann



 2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com

  Hi,
  the quick and dirty way sound good:)
  It would be great if you can send me a patch for 1.4.1.
 
 
  By the way, i tested Solr. 3.5 with my 1.4.1 test index.
  I can search and optimize, but clustering doesn't work (java.lang.Integer
  cannot be cast to java.lang.String)
  My uniqieKey for my docs it the id(sint).
  These here was the error message:
 
 
  Problem accessing /solr/select/. Reason:
 
 Carrot2 clustering failed
 
  org.apache.solr.common.SolrException: Carrot2 clustering failed
 at
 
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
 at
 
 org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
 at
 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
 at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
 at
 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at
  org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at
 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
  org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at
  org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at
 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at
  org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at
  org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at
 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at
 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 at
 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
  Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast
  to java.lang.String
 at
 
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
 at
 
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
 ... 23 more
 
  It this case it's better for me to upgrade/patch the 1.4.1 version.
 
  Best regards
  Vadim
 
 
 
 
  2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com
 
  
   But my actual live system works on solr 1.4.1. i can only change my
   solrconfig.xml and integrate new packages...
   i check the possibility to upgrade from 1.4.1 to 3.5 with the same
 index
   (without reinidex) with luceneMatchVersion 2.9.
   i hope it works...
  
 
  Another option would be to check out Solr 1.4.1 source code, fix the
 issue
  and recompile the clustering component. The quick and dirty way would be
  to
  convert all identifiers to strings in the clustering component, before
 the
  they are returned for serialization (I can send you a patch that does
  this). The proper way would be to fix the root cause of the problem, but
  I'd need to dig deeper into the code to find this.
 
  Staszek
 
 
 



Re: make fuzzy search for phrase

2011-12-01 Thread Erick Erickson
What did you do to install it? Which code line did you start from?
Solr 1.4? 3.1? A fresh trunk update?

What jar? The usual method of applying a patch is to get the entire source
tree, apply the patch and then re-compile all of solr. Perhaps this page
would help:

http://wiki.apache.org/solr/HowToContribute

Note that this patch is a zip file, not in the usual patch format, so doing this
may be a bit tricky.

Best
Erick

On Thu, Dec 1, 2011 at 6:00 AM, meghana meghana.rav...@amultek.com wrote:
 any solutions?? i am just get stuck in this.  :(

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/make-fuzzy-search-for-phrase-tp3542079p3551203.html
 Sent from the Solr - User mailing list archive at Nabble.com.


highlight issue

2011-12-01 Thread Radha Krishna Reddy
Hi,

I am indexing around 2000 names using Solr; the highlight flag is on while
querying.

For some names I am getting the search substring appended at the start.

Suppose my search query is *Rak*. In my database I have the name *Rakesh
Chaturvedi*.
I am getting *<em>Rak</em><em>Rak</em>esh Chaturvedi* as the response.

Same is the case with the following names:

Search Dhar -- highlight <em>Dhar</em><em>Dhar</em>mesh Darshan
Search Suda -- highlight <em>Suda</em><em>Suda</em>rshan Faakir

Can someone help me?

I am using the following filters for index and query.

<fieldType name="text_autofill" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>

Thanks and Regards,
Radha Krishna Reddy.


Re: when using group=true facet numbers are incorrect

2011-12-01 Thread O. Klein
https://issues.apache.org/jira/browse/SOLR-2898 has been created for this.

Thanx Martijn!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/when-using-group-true-facet-numbers-are-incorrect-tp3488605p3551741.html
Sent from the Solr - User mailing list archive at Nabble.com.


(fq=field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache

2011-12-01 Thread Antoine LE FLOC'H
Hello,

Is there any difference in the way things are stored in the filterCache if
I do

(fq=field1:val1 AND field2:val2)
or
fq=field1:val1&fq=field2:val2

even though these are logically identical? What gets stored exactly? Also
can you point me to where in the Solr source code this processing happens ?

Thank you.

Antoine.


Configuring the Distributed

2011-12-01 Thread Jamie Johnson
I am currently looking at the latest solrcloud branch and was
wondering if there was any documentation on configuring the
DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
to be added/modified to make distributed indexing work?


Re: (fq=field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache

2011-12-01 Thread Tanguy Moal

Hello,
Quoting http://wiki.apache.org/solr/SolrCaching#filterCache:

"The filter cache stores the results of any filter queries ("fq"
parameters) that Solr is explicitly asked to execute. (Each filter is
executed and cached separately. When it's time to use them to limit the
number of results returned by a query, this is done using set
intersections.)"

Finding what best suits your needs probably depends on how field1:val1 and
field2:val2 vary together, i.e. whether there is a correlation between
issuing field2:val2 and having issued field1:val1 (or the other way around).
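To make the "cached separately, intersected at query time" point concrete, here is a toy sketch (plain Python with invented data; this is an illustration of the caching behavior, not Solr code):

```python
# Toy model of Solr's filterCache: entries are keyed by the exact fq string,
# values are the matching doc-id sets.
docs = {
    1: {"field1": "val1", "field2": "val2"},
    2: {"field1": "val1", "field2": "other"},
    3: {"field1": "other", "field2": "val2"},
}

def match(fq):
    # Evaluate "field:value" clauses joined by AND (a stand-in for real parsing).
    result = set(docs)
    for clause in fq.split(" AND "):
        field, value = clause.strip("() ").split(":")
        result &= {d for d, fields in docs.items() if fields.get(field) == value}
    return frozenset(result)

filter_cache = {}

def cached(fq):
    if fq not in filter_cache:
        filter_cache[fq] = match(fq)
    return filter_cache[fq]

# One combined filter -> one cache entry, reusable only as a whole.
combined = cached("(field1:val1 AND field2:val2)")

# Two separate filters -> two entries, intersected at query time, and each
# reusable by any other query that repeats just one of them.
separate = cached("field1:val1") & cached("field2:val2")

assert combined == separate == frozenset({1})
assert len(filter_cache) == 3  # combined key plus the two individual keys
```

So the two forms return the same documents, but they populate the cache differently, which is where the correlation between the individual filters matters.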


Hope this helps ;-)

Tanguy


Le 01/12/2011 16:01, Antoine LE FLOC'H a écrit :

Hello,

Is there any difference in the way things are stored in the filterCache if
I do

(fq=field1:val1 AND field2:val2)
or
fq=field1:val1&fq=field2:val2

even though these are logically identical? What gets stored exactly? Also
can you point me to where in the Solr source code this processing happens ?

Thank you.

Antoine.





Re: highlight issue

2011-12-01 Thread Koji Sekiguchi

Suppose my search query is *Rak*. In my database I have the name *Rakesh
Chaturvedi*.
I am getting *<em>Rak</em><em>Rak</em>esh Chaturvedi* as the response.

Same is the case with the following names:

Search Dhar -- highlight <em>Dhar</em><em>Dhar</em>mesh Darshan
Search Suda -- highlight <em>Suda</em><em>Suda</em>rshan Faakir

Can someone help me?

I am using the following filters for index and query.

<fieldType name="text_autofill" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>


I don't think the Highlighter can support an n-gram field.
Can you try commenting out the EdgeNGramFilterFactory, re-indexing, and then
highlighting again?

koji
--
Check out Query Log Visualizer for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/


Solr cache size information

2011-12-01 Thread elisabeth benoit
Hello,

If anybody can help, I'd like to confirm a few things about Solr's caches
configuration.

If I want to calculate cache size in memory relative to cache size in
solrconfig.xml

For Document cache

size in memory = size in solrconfig.xml * average size of all fields
defined in fl parameter   ???

For Filter cache

size in memory = size in solrconfig.xml * WHAT (the size of an id) ??? (I
don't use facet.enum method)

For Query result cache

size in memory = size in solrconfig.xml * the size of an id ???


I would also like to know the relation between Solr's cache sizes and the JVM
max heap size.

If anyone has an answer or a link for further reading to suggest, it would
be greatly appreciated.

Thanks,
Elisabeth


Re: Weird docs-id clustering output in Solr 1.4.1

2011-12-01 Thread Vadim Kisselmann
Hi Stanislaw,

unfortunately it doesn't work.
I changed line 216 with the new toString() part and rebuilt the
source.
Still the same behavior, without errors (because of the changes).
Is there another line to change?

Thanks and regards
Vadim



2011/12/1 Stanislaw Osinski stanislaw.osin...@carrotsearch.com

 Hi Vadim,

 I've had limited connectivity, so I couldn't check out the complete 1.4.1
 code and test the changes. Here's what you can try:

 In this file:


 http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515&view=markup

 around line 216 you will see:

 for (Document doc : docs) {
  docList.add(doc.getField(solrId));
 }

 You need to change this to:

 for (Document doc : docs) {
  docList.add(doc.getField(solrId).toString());
 }

 Let me know if this did the trick.

 Cheers,

 S.

 On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann
 v.kisselm...@googlemail.comwrote:

  Hi Stanislaw,
  did you already have time to create a patch?
  If not, can you tell me please which lines in which class in source code
  are relevant?
  Thanks and regards
  Vadim Kisselmann
 
 
 
  2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com
 
   Hi,
   the quick and dirty way sound good:)
   It would be great if you can send me a patch for 1.4.1.
  
  
   By the way, i tested Solr. 3.5 with my 1.4.1 test index.
   I can search and optimize, but clustering doesn't work
 (java.lang.Integer
   cannot be cast to java.lang.String)
    My uniqueKey for my docs is the id (sint).
   These here was the error message:
  
  
   Problem accessing /solr/select/. Reason:
  
  Carrot2 clustering failed
  
    org.apache.solr.common.SolrException: Carrot2 clustering failed
      at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
      at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
      at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
      at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
      at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      at org.mortbay.jetty.Server.handle(Server.java:326)
      at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
      at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
      at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
      at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
      at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
      at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
    Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
      at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
      at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
      ... 23 more
  
    In this case it's better for me to upgrade/patch the 1.4.1 version.
  
   Best regards
   Vadim
  
  
  
  
   2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com
  
   
But my actual live system works on solr 1.4.1. i can only change my
solrconfig.xml and integrate new packages...
i check the possibility to upgrade from 1.4.1 to 3.5 with the same
  index
(without reinidex) with luceneMatchVersion 2.9.
i hope it works...
   
  
   Another option would be to check out Solr 1.4.1 source code, fix the
  issue
   and recompile the clustering component. The quick and dirty way would
 be
   to
   convert all identifiers to strings in the 

switching on hl.requireFieldMatch reducing highlighted fields returned

2011-12-01 Thread Robert Brown
I have a query which is highlighting 3 snippets in 1 field, and 1 
snippet in another field.


By enabling hl.requireFieldMatch, only the latter highlighted field is 
returned.


from this...

<lst name="highlighting">
  <lst name="348231">
    <arr name="content_stemmed">
      <str>
      plc Whetstone Temporary [hl-on]Sales[hl-off] Assistant Customer
      service Cashier work 08
      </str>
      <str>
      and customer queries. 07 / 99 – 2003 Debenhams Central London
      [hl-on]Sales[hl-off] Adviser Customer
      </str>
      <str>
      Central London [hl-on]Sales[hl-off] Assistant Customer service; Visual
      merchandising; Dealing
      </str>
      <str>
      with telephone enquiries; Assisted in the [hl-on]production[hl-off] of
      jewellery, e.g. setting stones
      </str>
    </arr>
    <arr name="skills_stemmed">
      <str>
      [hl-on]product[hl-off] knowledge [hl-on]sales[hl-off] experience
      </str>
    </arr>
  </lst>
</lst>


to this...

<lst name="highlighting">
  <lst name="348231">
    <arr name="skills_stemmed">
      <str>product knowledge [hl-on]sales[hl-off] experience</str>
    </arr>
  </lst>
</lst>


I'm doing this so the word product and its variants are NOT 
highlighted - they match against a different field.




Re: mysolr python client

2011-12-01 Thread Rubén Abad
Hi Jens,

Our objective with mysolr was to create a pythonic Apache Solr binding, but
we have also been working on speed and concurrency. We always use the Python
QueryResponseWriter, because it saves us dependencies (an XML or JSON
parser).

We would also like to create a complete concurrent API, but at the moment
only querying is working.

Our main goal is to keep evolving mysolr with the feedback we receive from
the community.

I hope I have answered your questions.

Thanks for your interest,
Rubén Abad rua...@gmail.com


On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote:

 On 11/30/2011 05:40 PM, Marco Martinez wrote:

 For anyone interested, recently I've been using a new Solr client for
 Python. It's easy and pretty well documented. If you're interested its
 site
 is: http://mysolr.redtuna.org/


 Do you know what advantages it has over pysolr or solrpy? On the page it
 only says mysolr was born to be a fast and easy-to-use client for Apache
 Solr’s API and because existing Python clients didn’t fulfill these
 conditions.

 Thanks,
 Jens




DataImportHandler w/ multivalued fields

2011-12-01 Thread Briggs Thompson
Hello Solr Community!

I am implementing a data connection to Solr through the Data Import Handler
and non-multivalued fields are working correctly, but multivalued fields
are not getting indexed properly.

I am new to DataImportHandler, but from what I could find, an entity is
the way to go for a multivalued field. The weird thing is that data is being
indexed for only one row, meaning only the first raw_tag gets populated.


Anyone have any ideas?
Thanks,
Briggs

This is the relevant part of the schema:

   <field name="raw_tag" type="text_en_lessAggressive" indexed="true"
stored="false" multivalued="true"/>
   <field name="raw_tag_string" type="string" indexed="false"
stored="true" multivalued="true"/>
   <copyField source="raw_tag" dest="raw_tag_string"/>

And the relevant part of data-import.xml:

<document name="merchant">
  <entity name="site" query="select * from site">
    <field column="siteId" name="siteId" />
    <field column="domain" name="domain" />
    <field column="aliasFor" name="aliasFor" />
    <field column="title" name="title" />
    <field column="description" name="description" />
    <field column="requests" name="requests" />
    <field column="requiresModeration" name="requiresModeration" />
    <field column="blocked" name="blocked" />
    <field column="affiliateLink" name="affiliateLink" />
    <field column="affiliateTracker" name="affiliateTracker" />
    <field column="affiliateNetwork" name="affiliateNetwork" />
    <field column="cjMerchantId" name="cjMerchantId" />
    <field column="thumbNail" name="thumbNail" />
    <field column="updateRankings" name="updateRankings" />
    <field column="couponCount" name="couponCount" />
    <field column="category" name="category" />
    <field column="adult" name="adult" />
    <field column="rank" name="rank" />
    <field column="redirectsTo" name="redirectsTo" />
    <field column="wwwRequired" name="wwwRequired" />
    <field column="avgSavings" name="avgSavings" />
    <field column="products" name="products" />
    <field column="nameChecked" name="nameChecked" />
    <field column="tempFlag" name="tempFlag" />
    <field column="created" name="created" />
    <field column="enableSplitTesting" name="enableSplitTesting" />
    <field column="affiliateLinklock" name="affiliateLinklock" />
    <field column="hasMobileSite" name="hasMobileSite" />
    <field column="blockSite" name="blockSite" />
    <entity name="merchant_tags" pk="siteId"
            query="select raw_tag, freetags.id,
                   freetagged_objects.object_id as siteId
                   from freetags
                   inner join freetagged_objects
                   on freetags.id=freetagged_objects.tag_id
                   where freetagged_objects.object_id='${site.siteId}'">
      <field column="raw_tag" name="raw_tag"/>
    </entity>
  </entity>
</document>


spatial search or null

2011-12-01 Thread dan whelan

Hi,

how would I go about constructing a solr 3.2 spatial query that would 
return documents that are in a specified radius OR documents that have 
no location information.


The query would have a similar result as this:   q=City:San Diego OR 
-City:['' TO *]



Thanks


RE: Solr cache size information

2011-12-01 Thread Andrew Lundgren
 For Filter cache
 
 size in memory = size in solrconfig.xml * WHAT (the size of an id) ???
 (I
 don't use facet.enum method)


As I understand it, size is the number of queries that will be cached.  In my short 
experience, the memory consumed is data dependent.  If a huge number of documents 
match an FQ, then the cached entry will be very large; if you get a single match, 
the cached result will take much less memory. 

I don't know if there is a way to bound the cache by memory rather than by 
number of results.  I think all of the Solr caches behave this way, but I am not sure.
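As a rough rule of thumb (an estimate, not something Solr reports): a filterCache entry that matches many documents is commonly stored as a bitset of maxDoc bits, i.e. about maxDoc / 8 bytes per entry, so the worst case scales with both the configured size and the index size:

```python
# Rough filterCache sizing sketch (a worst-case estimate, not an exact
# accounting): one full bitset of max_doc bits per cached filter.
def filter_cache_bytes(max_doc, num_entries):
    bytes_per_entry = max_doc // 8  # max_doc bits -> bytes
    return bytes_per_entry * num_entries

# Example: 10M documents, filterCache size=512 in solrconfig.xml.
worst_case = filter_cache_bytes(10_000_000, 512)
print(round(worst_case / (1024 ** 2)), "MB")  # ~610 MB if every entry is a full bitset
```

Sparse filters may be stored more compactly (e.g. as sorted document-id lists), so real usage is usually well below this ceiling.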


 NOTICE: This email message is for the sole use of the intended recipient(s) 
and may contain confidential and privileged information. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please contact the sender by reply email and destroy all 
copies of the original message.




Re: spatial search or null

2011-12-01 Thread Rob Brown
Recently had this myself...

http://wiki.apache.org/solr/SpatialSearch#How_to_combine_with_a_sub-query_to_expand_results
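Following that wiki pattern, the filter can be sketched like this (the field name store and the pt/d values are placeholders, not from the original question; the (*:* -field:[* TO *]) form is needed because a pure negative clause inside an OR does not match on its own):

```python
# Build a "within radius OR no location" Solr request URL query string.
from urllib.parse import urlencode

# Documents with no value in the location field, OR within 10 km of the point.
fq = '(*:* -store:[* TO *]) OR _query_:"{!geofilt sfield=store pt=32.7,-117.1 d=10}"'
params = urlencode({"q": "*:*", "fq": fq})
print(params)
```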



-- 

IntelCompute
Web Design and Online Marketing

http://www.intelcompute.com


-Original Message-
From: dan whelan d...@adicio.com
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: spatial search or null
Date: Thu, 01 Dec 2011 10:22:38 -0800

Hi,

how would I go about constructing a solr 3.2 spatial query that would 
return documents that are in a specified radius OR documents that have 
no location information.

The query would have a similar result as this:   q=City:San Diego OR 
-City:['' TO *]


Thanks



Re: DataImportHandler w/ multivalued fields

2011-12-01 Thread Briggs Thompson
In addition, I tried a query like the one below and changed the column definition
to
<field column="raw_tag" name="raw_tag" splitBy=", "/>
and still no luck. It is indexing the full content now, but not as multivalued.
It seems like splitBy isn't working properly.

select group_concat(freetags.raw_tag separator ', ') as raw_tag, site.*
from site
left outer join
  (freetags inner join freetagged_objects)
 on (freetags.id = freetagged_objects.tag_id
   and site.siteId = freetagged_objects.object_id)
group  by site.siteId

Am I doing something wrong?
Thanks,
Briggs Thompson

On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson 
w.briggs.thomp...@gmail.com wrote:

 Hello Solr Community!

 I am implementing a data connection to Solr through the Data Import
 Handler and non-multivalued fields are working correctly, but multivalued
 fields are not getting indexed properly.

 I am new to DataImportHandler, but from what I could find, the entity is
 the way to go for multivalued field. The weird thing is that data is being
 indexed for one row, meaning first raw_tag gets populated.


 Anyone have any ideas?
 Thanks,
 Briggs

 This is the relevant part of the schema:

field name =raw_tag type=text_en_lessAggressive indexed=true
 stored=false multivalued=true/
field name =raw_tag_string type=string indexed=false
 stored=true multivalued=true/
copyField source=raw_tag dest=raw_tag_string/

 And the relevant part of data-import.xml:

 document name=merchant
 entity name=site
   query=select * from site 
 field column=siteId name=siteId /
 field column=domain name=domain /
 field column=aliasFor name=aliasFor /
 field column=title name=title /
 field column=description name=description /
 field column=requests name=requests /
 field column=requiresModeration name=requiresModeration /
 field column=blocked name=blocked /
 field column=affiliateLink name=affiliateLink /
 field column=affiliateTracker name=affiliateTracker /
 field column=affiliateNetwork name=affiliateNetwork /
 field column=cjMerchantId name=cjMerchantId /
 field column=thumbNail name=thumbNail /
 field column=updateRankings name=updateRankings /
 field column=couponCount name=couponCount /
 field column=category name=category /
 field column=adult name=adult /
 field column=rank name=rank /
 field column=redirectsTo name=redirectsTo /
 field column=wwwRequired name=wwwRequired /
 field column=avgSavings name=avgSavings /
 field column=products name=products /
 field column=nameChecked name=nameChecked /
 field column=tempFlag name=tempFlag /
 field column=created name=created /
 field column=enableSplitTesting name=enableSplitTesting /
 field column=affiliateLinklock name=affiliateLinklock /
 field column=hasMobileSite name=hasMobileSite /
 field column=blockSite name=blockSite /
 entity name=merchant_tags pk=siteId
 query=select raw_tag, freetags.id,
 freetagged_objects.object_id as siteId
from freetags
inner join freetagged_objects
on freetags.id=freetagged_objects.tag_id
 where freetagged_objects.object_id='${site.siteId}'
 field column=raw_tag name=raw_tag/
  /entity
 /entity
 /document



Re: DataImportHandler w/ multivalued fields

2011-12-01 Thread Rahul Warawdekar
Hi Briggs,

By saying multivalued fields are not getting indexed properly, do you mean
to say that you are not able to search on those fields?
Have you tried actually searching your Solr index for those multivalued
terms to make sure it returns the search results?

One possibility could be that the multivalued fields are getting indexed
correctly and are searchable.
However, since your schema.xml has a raw_tag field whose stored
attribute is set to false, you may not be able to see those fields.



On Thu, Dec 1, 2011 at 1:43 PM, Briggs Thompson w.briggs.thomp...@gmail.com
 wrote:

 In addition, I tried a query like below and changed the column definition
 to
field column=raw_tag name=raw_tag splitBy=, /
 and still no luck. It is indexing the full content now but not multivalued.
 It seems like the splitBy ins't working properly.

select group_concat(freetags.raw_tag separator ', ') as raw_tag, site.*
 from site
 left outer join
  (freetags inner join freetagged_objects)
 on (freetags.id = freetagged_objects.tag_id
   and site.siteId = freetagged_objects.object_id)
 group  by site.siteId

 Am I doing something wrong?
 Thanks,
 Briggs Thompson

 On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson 
 w.briggs.thomp...@gmail.com wrote:

  Hello Solr Community!
 
  I am implementing a data connection to Solr through the Data Import
  Handler and non-multivalued fields are working correctly, but multivalued
  fields are not getting indexed properly.
 
  I am new to DataImportHandler, but from what I could find, the entity is
  the way to go for multivalued field. The weird thing is that data is
 being
  indexed for one row, meaning first raw_tag gets populated.
 
 
  Anyone have any ideas?
  Thanks,
  Briggs
 
  This is the relevant part of the schema:
 
 field name =raw_tag type=text_en_lessAggressive indexed=true
  stored=false multivalued=true/
 field name =raw_tag_string type=string indexed=false
  stored=true multivalued=true/
 copyField source=raw_tag dest=raw_tag_string/
 
  And the relevant part of data-import.xml:
 
  document name=merchant
  entity name=site
query=select * from site 
  field column=siteId name=siteId /
  field column=domain name=domain /
  field column=aliasFor name=aliasFor /
  field column=title name=title /
  field column=description name=description /
  field column=requests name=requests /
  field column=requiresModeration name=requiresModeration
 /
  field column=blocked name=blocked /
  field column=affiliateLink name=affiliateLink /
  field column=affiliateTracker name=affiliateTracker /
  field column=affiliateNetwork name=affiliateNetwork /
  field column=cjMerchantId name=cjMerchantId /
  field column=thumbNail name=thumbNail /
  field column=updateRankings name=updateRankings /
  field column=couponCount name=couponCount /
  field column=category name=category /
  field column=adult name=adult /
  field column=rank name=rank /
  field column=redirectsTo name=redirectsTo /
  field column=wwwRequired name=wwwRequired /
  field column=avgSavings name=avgSavings /
  field column=products name=products /
  field column=nameChecked name=nameChecked /
  field column=tempFlag name=tempFlag /
  field column=created name=created /
  field column=enableSplitTesting name=enableSplitTesting
 /
  field column=affiliateLinklock name=affiliateLinklock /
  field column=hasMobileSite name=hasMobileSite /
  field column=blockSite name=blockSite /
  entity name=merchant_tags pk=siteId
  query=select raw_tag, freetags.id,
  freetagged_objects.object_id as siteId
 from freetags
 inner join freetagged_objects
 on freetags.id=freetagged_objects.tag_id
  where freetagged_objects.object_id='${site.siteId}'
  field column=raw_tag name=raw_tag/
   /entity
  /entity
  /document
 




-- 
Thanks and Regards
Rahul A. Warawdekar


Re: DataImportHandler w/ multivalued fields

2011-12-01 Thread Briggs Thompson
Hey Rahul,

Thanks for the response. I actually just figured it out, thankfully :). To
answer your question, raw_tag is indexed and not stored (tokenized),
and there is a copyField from raw_tag to raw_tag_string, which would
be used for facets. That *should have* been displayed in the results.

The silly mistake I made was not camel-casing multiValued, which is
clearly the source of the problem.

The second email I sent, changing the query and using split for the
multivalued field, had an error in it in the form of a missing line:
transformer="RegexTransformer"
in the entity declaration.
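For anyone hitting the same issue: with transformer="RegexTransformer" declared on the entity, splitBy behaves roughly like a regex split on the column value, turning one concatenated string into the multiple values of a multiValued field (a Python approximation, with made-up tag values):

```python
# Approximation of DIH's RegexTransformer splitBy: the column value is
# split on the given regex into multiple field values.
import re

raw_tag = "coupons, shopping, deals"   # as produced by group_concat(... separator ', ')
values = re.split(", ", raw_tag)       # splitBy=", "
print(values)  # ['coupons', 'shopping', 'deals']
```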

Anyhow, thanks for the quick response!

Briggs


On Thu, Dec 1, 2011 at 12:57 PM, Rahul Warawdekar 
rahul.warawde...@gmail.com wrote:

 Hi Briggs,

 By saying multivalued fields are not getting indexed prperly, do you mean
 to say that you are not able to search on those fields ?
 Have you tried actually searching your Solr index for those multivalued
 terms and make sure if it returns the search results ?

 One possibility could be that the multivalued fields are getting indexed
 correctly and are searchable.
 However, since your schema.xml has a raw_tag field whose stored
 attribute is set to false, you may not be able to see those fields.



 On Thu, Dec 1, 2011 at 1:43 PM, Briggs Thompson 
 w.briggs.thomp...@gmail.com
  wrote:

  In addition, I tried a query like below and changed the column definition
  to
 field column=raw_tag name=raw_tag splitBy=, /
  and still no luck. It is indexing the full content now but not
 multivalued.
  It seems like the splitBy ins't working properly.
 
 select group_concat(freetags.raw_tag separator ', ') as raw_tag,
 site.*
  from site
  left outer join
   (freetags inner join freetagged_objects)
  on (freetags.id = freetagged_objects.tag_id
and site.siteId = freetagged_objects.object_id)
  group  by site.siteId
 
  Am I doing something wrong?
  Thanks,
  Briggs Thompson
 
  On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson 
  w.briggs.thomp...@gmail.com wrote:
 
   Hello Solr Community!
  
   I am implementing a data connection to Solr through the Data Import
   Handler and non-multivalued fields are working correctly, but
 multivalued
   fields are not getting indexed properly.
  
   I am new to DataImportHandler, but from what I could find, the entity
 is
   the way to go for multivalued field. The weird thing is that data is
  being
   indexed for one row, meaning first raw_tag gets populated.
  
  
   Anyone have any ideas?
   Thanks,
   Briggs
  
   This is the relevant part of the schema:
  
  field name =raw_tag type=text_en_lessAggressive indexed=true
   stored=false multivalued=true/
  field name =raw_tag_string type=string indexed=false
   stored=true multivalued=true/
  copyField source=raw_tag dest=raw_tag_string/
  
   And the relevant part of data-import.xml:
  
   document name=merchant
   entity name=site
 query=select * from site 
   field column=siteId name=siteId /
   field column=domain name=domain /
   field column=aliasFor name=aliasFor /
   field column=title name=title /
   field column=description name=description /
   field column=requests name=requests /
   field column=requiresModeration
 name=requiresModeration
  /
   field column=blocked name=blocked /
   field column=affiliateLink name=affiliateLink /
   field column=affiliateTracker name=affiliateTracker /
   field column=affiliateNetwork name=affiliateNetwork /
   field column=cjMerchantId name=cjMerchantId /
   field column=thumbNail name=thumbNail /
   field column=updateRankings name=updateRankings /
   field column=couponCount name=couponCount /
   field column=category name=category /
   field column=adult name=adult /
   field column=rank name=rank /
   field column=redirectsTo name=redirectsTo /
   field column=wwwRequired name=wwwRequired /
   field column=avgSavings name=avgSavings /
   field column=products name=products /
   field column=nameChecked name=nameChecked /
   field column=tempFlag name=tempFlag /
   field column=created name=created /
   field column=enableSplitTesting
 name=enableSplitTesting
  /
   field column=affiliateLinklock name=affiliateLinklock
 /
   field column=hasMobileSite name=hasMobileSite /
   field column=blockSite name=blockSite /
   entity name=merchant_tags pk=siteId
   query=select raw_tag, freetags.id,
   freetagged_objects.object_id as siteId
  from freetags
  inner join freetagged_objects
  on freetags.id=freetagged_objects.tag_id
   where freetagged_objects.object_id='${site.siteId}'
   field 

Dealing with dashes with solr.PatternReplaceCharFilterFactory

2011-12-01 Thread Aaron Wong
Hi all,

We're encountering a problem when querying terms with dashes (and other
non-alphanumeric characters). For example, we use
PatternReplaceCharFilterFactory to replace dashes with blank characters
for both index and query, yet any terms with dashes in them will not
return any results.

For example:
searching for 'cdka' won't return any results, even though 'cdka-1' should
be indexed.

This is similar to the problem posted here (
http://stackoverflow.com/questions/6459695/solr-ngramtokenizerfactory-and-patternreplacecharfilterfactory-analyzer-result),
which got no response.

The following is the relevant part of the schema:
-
<fieldType name="edge_ngram" class="solr.TextField"
positionIncrementGap="1">
  <analyzer type="index">
    <charfilter class="solr.PatternReplaceCharFilterFactory"
pattern="-" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" splitOnNumerics="0" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" catenateAll="0"
splitOnCaseChange="0"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
maxGramSize="15" side="front" />
  </analyzer>
  <analyzer type="query">
    <charfilter class="solr.PatternReplaceCharFilterFactory"
pattern="-" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" splitOnNumerics="0" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" catenateAll="0"
splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

  <fields>
    <field name="names_auto" type="edge_ngram" indexed="true" stored="true"
multiValued="false" />
    ..
  </fields>
-
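For what it's worth, here is a Python approximation of what the index-time chain above should emit for cdka-1 (dash stripped by the char filter, then front edge n-grams of size 2-15; the WordDelimiterFilter step is a no-op here since all its options are off). If a query for cdka really matches nothing, it is worth checking on the admin analysis page whether the char filter actually ran at index time:

```python
# Approximation of the intended index-time analysis above.
def analyze(term, min_gram=2, max_gram=15):
    term = term.replace("-", "").lower()   # PatternReplaceCharFilter + lowercase
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

grams = analyze("cdka-1")
print(grams)  # ['cd', 'cdk', 'cdka', 'cdka1']
```

Since 'cdka' is among the expected grams, a query analyzed to 'cdka' should match it; a mismatch suggests the configured chain is not being applied as written.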

Thanks for any help anyone can provide!

Aaron


Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



Hi Jaime - take a look at solrconfig-distrib-update.xml in
solr/core/src/test-files

You need to enable the update log, add an empty replication handler def,
and an update chain with solr.DistributedUpdateProcessFactory in it.

-- 
- Mark

http://www.lucidimagination.com


Re: Error in New Solr version

2011-12-01 Thread Samuel García Martínez
You are using the uncommitted FieldCollapse component for 1.4.x.

Now, on 3.x, the field collapse component is not that anymore. You must remove
it and configure the out-of-the-box one.

On Thu, Dec 1, 2011 at 11:34 AM, Vadim Kisselmann 
v.kisselm...@googlemail.com wrote:

 Hi,
 comment out the lines with the collapse component in your solrconfig.xml if
 you don't need it.
 Otherwise, you're missing the right jars for this component, or the paths to
 these jars in your solrconfig.xml are wrong.
 regards
 vadim



 2011/12/1 Pawan Darira pawan.dar...@gmail.com

  Hi
 
  I am migrating from Solr 1.4 to Solr 3.2. I am getting below error in my
  logs
 
  org.apache.solr.common.SolrException: Error loading class
  'org.apache.solr.handler.component.CollapseComponent'
 
  Could not find a satisfactory solution on Google. Please help.
 
  thanks
  Pawan
 




-- 
Un saludo,
Samuel García.


Re: mysolr python client

2011-12-01 Thread Óscar Marín Miró
Nice job, pythonic solr access!! Thanks for the effort

On Thu, Dec 1, 2011 at 5:53 PM, Rubén Abad rua...@gmail.com wrote:

 Hi Jens,

 Our objective with mysolr was to create a pythonic Apache Solr binding. But
 we
 also have been working in speed and concurrency. We always use the Python
 QueryResponseWriter, because it avoids us dependencies (a XML or JSON
 parser).

 We would also like to create a complete concurrent API, but at the moment
 only querying is working.

 Our main goal is to keep evolving mysolr with the feedback we receive from
 the community.

 I hope I have answered your questions.

 Thanks for your interest,
 Rubén Abad rua...@gmail.com


 On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote:

  On 11/30/2011 05:40 PM, Marco Martinez wrote:
 
  For anyone interested, recently I've been using a new Solr client for
  Python. It's easy and pretty well documented. If you're interested its
  site
  is: http://mysolr.redtuna.org/
 
 
  Do you know what advantages it has over pysolr or solrpy? On the page it
  only says mysolr was born to be a fast and easy-to-use client for Apache
  Solr’s API and because existing Python clients didn’t fulfill these
  conditions.
 
  Thanks,
  Jens
 
 




-- 
Whether it's science, technology, personal experience, true love,
astrology, or gut feelings, each of us has confidence in something that we
will never fully comprehend.
 --Roy H. William


Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Thanks I will try this first thing in the morning.

On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

 --
 - Mark

 http://www.lucidimagination.com



Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Another question: is there any support for repartitioning the index
if a new shard is added?  What is the recommended approach for
handling this?  It seems that the hashing algorithm (and probably
any) would require the index to be repartitioned should a new shard be
added.

On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks I will try this first thing in the morning.

 On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

 --
 - Mark

 http://www.lucidimagination.com




Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Not yet - we don't plan on working on this until a lot of other stuff is
working solid at this point. But someone else could jump in!

There are a couple ways to go about it that I know of:

A more long term solution may be to start using micro shards - each index
starts as multiple indexes. This makes it pretty fast to move mirco shards
around as you decide to change partitions. It's also less flexible as you
are limited by the number of micro shards you start with.

A more simple and likely first step is to use an index splitter . We
already have one in lucene contrib - we would just need to modify it so
that it splits based on the hash of the document id. This is super
flexible, but splitting will obviously take a little while on a huge index.
The current index splitter is a multi-pass splitter - good enough to start
with, but with most files under codec control these days, we may be able to
make a single-pass splitter soon as well.

Eventually you could imagine using both options - micro shards that could
also be split as needed. Though I still wonder if micro shards will be
worth the extra complications myself...

Right now though, the idea is that you should pick a good number of
partitions to start given your expected data ;) Adding more replicas is
trivial though.

- Mark





-- 
- Mark

http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
I am not familiar with the index splitter that is in contrib, but I'll
take a look at it soon.  So the process sounds like it would be to run
this on all of the current shards' indexes based on the hash algorithm.
 Is there also an index merger in contrib which could be used to merge
indexes?  I'm assuming this would be the process?




Re: Configuring the Distributed

2011-12-01 Thread Mark Miller

On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would 
be to split as many as you felt you needed to.

If you wanted to keep the full balance always, this would mean splitting every 
shard at once, yes. But this depends on how many boxes (partitions) you are 
willing/able to add at a time.

You might just split one index to start - now its hash range would be handled 
by two shards instead of one (if you have 3 replicas per shard, this would mean 
adding 3 more boxes). When you needed to expand again, you would split another 
index that was still handling its full starting range. As you grow, once you 
split every original index, you'd start again, splitting one of the now half 
ranges.

 Is there also an index merger in contrib which could be used to merge
 indexes?  I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that 
can do this). But I'm not sure where this fits in?

- Mark

 

- Mark Miller
lucidimagination.com

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Hmm... this doesn't sound like the hashing algorithm that's on the
branch, right?  The algorithm you're mentioning sounds like there is
some logic which is able to tell that a particular range should be
distributed between 2 shards instead of 1.  So it seems like a trade-off
between repartitioning the entire index (on every shard) and having a
custom hashing algorithm which is able to handle the situation where 2
or more shards map to a particular range.


Multithreaded DIH bug

2011-12-01 Thread Mark
I'm trying to use multiple threads with DIH, but I keep receiving the 
following error: Operation not allowed after ResultSet closed

Is there any way I can fix this?

Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: Error in 
multi-threaded import
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.sql.SQLException: Operation not allowed after ResultSet closed
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:339)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:228)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:262)
at 
org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.getAllNonCachedRows(CachedSqlEntityProcessor.java:72)
at 
org.apache.solr.handler.dataimport.EntityProcessorBase.getIdCacheData(EntityProcessorBase.java:201)
at 
org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.nextRow(CachedSqlEntityProcessor.java:60)
at 
org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:449)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:402)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:469)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:356)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:409)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

at java.lang.Thread.run(Thread.java:636)
Caused by: java.sql.SQLException: Operation not allowed after ResultSet 
closed

at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ResultSetImpl.checkClosed(ResultSetImpl.java:794)
at com.mysql.jdbc.ResultSetImpl.next(ResultSetImpl.java:7139)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:331)

... 14 more



Re: (fq=field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache

2011-12-01 Thread Shawn Heisey

On 12/1/2011 8:01 AM, Antoine LE FLOC'H wrote:

Is there any difference in the way things are stored in the filterCache if
I do

(fq=field1:val1 AND field2:val2)
or
fq=field1:val1&fq=field2:val2

even though these are logically identical? What gets stored exactly? Also,
can you point me to where in the Solr source code this processing happens?


Your first example would result in one entry in filterCache, probably 
for +field1:val1 +field2:val2 which is what the parser ultimately 
reduces the query to.  Your second example will result in two separate 
entries in filterCache.


The second example takes more cache space, but it is also more 
reusable.  If you started with a clean cache and sent 
fq=field2:val2&fq=field3:val3 immediately after sending your second 
example, one of the filter queries would be satisfied from the cache, so 
Solr would use fewer resources on the query as a whole.  If you sent 
your first example and then fq=(field2:val2 AND field3:val3) there 
would be no speedup from the cache, because the new query wouldn't match 
the previous one at all.
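
To make the caching difference concrete, here is a toy model of the behavior Shawn describes. It is purely illustrative: the real filterCache maps parsed filter queries to document-set bitsets, not strings, and the names below are made up.

```python
# Toy model: one cache entry per fq parameter, keyed by the fq string.
cache = {}

def compute_docset(fq):
    return f"docset({fq})"  # stand-in for the expensive bitset computation

def run_query(fq_params):
    """Look up each fq parameter independently; return the number of cache hits."""
    hits = 0
    for fq in fq_params:
        if fq in cache:
            hits += 1
        else:
            cache[fq] = compute_docset(fq)  # cache miss: compute and store
    return hits

# Separate fq params: two entries, and field2:val2 is reusable later.
run_query(["field1:val1", "field2:val2"])
print(run_query(["field2:val2", "field3:val3"]))  # 1 hit: field2:val2 reused

# One combined fq: a single entry, only reusable as an exact match.
cache.clear()
run_query(["field1:val1 AND field2:val2"])
print(run_query(["field2:val2 AND field3:val3"]))  # 0 hits: no exact match
```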


Thanks,
Shawn



Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Right now let's say you have one shard - everything there hashes to range X.

Now you want to split that shard with an Index Splitter.

You divide range X in two - giving you two ranges - then you start splitting. 
This is where the current Splitter needs a little modification. You decide 
which doc should go into which new index by rehashing each doc id in the index 
you are splitting - if its hash is greater than X/2, it goes into index1 - if 
it's less, index2. I think there are a couple of current Splitter impls, but one 
of them does something like: give me an id - now if the ids in the index are 
above that id, go to index1; if below, index2. We need to instead do a quick 
hash rather than a simple id compare.
 
Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in 
zookeeper - we don't do that yet because it's not needed yet. Instead we 
currently just calculate that on the fly (too often at the moment - on every 
request :) I intend to fix that of course).

At the start, zk would say: for range X, go to this shard. After the split, it 
would say: for range less than X/2, go to the old node; for range greater than 
X/2, go to the new node.
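
A toy sketch of that splitting decision (the hash function here is an arbitrary stand-in, and the names are made up - the cloud branch's actual doc-id hashing may differ):

```python
import hashlib

def doc_hash(doc_id: str) -> int:
    """Arbitrary stand-in for the branch's doc-id hash, mapped into 32 bits."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % 2**32

def target_index(doc_id: str, lo: int, hi: int) -> str:
    """After splitting range [lo, hi) in two, decide which new index gets the doc."""
    h = doc_hash(doc_id)
    assert lo <= h < hi, "doc id does not hash into this shard's range"
    mid = (lo + hi) // 2
    # Upper half of the range goes to index1, lower half to index2.
    return "index1" if h >= mid else "index2"

# Splitting a shard that currently owns the full 32-bit range:
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "->", target_index(doc_id, 0, 2**32))
```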

- Mark


Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Yes, the ZK method seems much more flexible.  Adding a new shard would
be simply updating the range assignments in ZK.  Where is this
currently on the list of things to accomplish?  I don't have time to
work on this now, but if you (or anyone) could provide direction I'd
be willing to work on this when I had spare time.  I guess a JIRA
detailing where/how to do this could help.  Not sure if the design has
been thought out that far though.

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
node.  If you have that many on each node, then adding a new node consists
of moving some shards to the new machine rather than moving lots of little
documents.

Much faster.  As in thousands of times faster.
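
A toy illustration of that rebalancing (the node names and shard counts are made up): with many micro-shards per node, adding a node only reassigns whole shards - no per-document copying is involved.

```python
# Two nodes, 20 micro-shards each; shards are identified by integer ids.
nodes = {"node1": list(range(0, 20)), "node2": list(range(20, 40))}

def add_node(nodes, new_node):
    """Rebalance by moving whole micro-shards onto the new node."""
    nodes[new_node] = []
    target = sum(len(s) for s in nodes.values()) // len(nodes)
    for node, shards in nodes.items():
        # Existing nodes shed shards until they are down to the target count.
        while node != new_node and len(shards) > target:
            nodes[new_node].append(shards.pop())
    return nodes

add_node(nodes, "node3")
print({n: len(s) for n, s in nodes.items()})  # -> {'node1': 13, 'node2': 13, 'node3': 14}
```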

On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:

 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
  Right now lets say you have one shard - everything there hashes to range
 X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm.This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but I'll
  take a look at it soon.  So the process sounds like it would be to run
  this on all of the current shards indexes based on the hash algorithm.
 
  Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
  If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
  You might just split one index to start - now it's hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half ranges.
 
  Is there also an index merger in contrib which could be used to merge
  indexes?  I'm assuming this would be the process?
 
  You can merge with IndexWriter.addIndexes (Solr also has an admin
 command that can do this). But I'm not sure where this fits in?
 
  - Mark
 
 
  On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Not yet - we don't plan on working on this until a lot of other
 stuff is
  working solid at this point. But someone else could jump in!
 
  There are a couple ways to go about it that I know of:
 
  A more long term solution may be to start using micro shards - each
 index
 starts as multiple indexes. This makes it pretty fast to move micro
 shards
  around as you decide to change partitions. It's also less flexible
 as you
  are limited by the number of micro shards you start with.
 
  A more simple and likely first step is to use an index splitter . We
  already have one in lucene contrib - we would just need to modify it
 so
  that it splits based on the hash of the document id. This is super
  flexible, but splitting will obviously take a little while on a huge
 index.
  The current index splitter is a multi 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
In this case we are still talking about moving a whole index at a time rather 
than lots of little documents. You split the index into two, and then ship one 
of them off.

The extra cost you can avoid with micro sharding will be the cost of splitting 
the index - which could be significant for a very large index. I have not done 
any tests though.

The cost of 20 micro-shards is that you will always have tons of segments 
unless you are very heavily merging - and even in the very unusual case of each 
micro shard being optimized, you have essentially 20 segments. That's best case 
- normal case is likely in the hundreds.

This can be a fairly significant % hit at search time.

You also have the added complexity of managing 20 indexes per node in solr code.

I think that both options have their +/-'s and eventually we could perhaps 
support both.

To kick things off though, adding another partition should be a rare event if 
you plan carefully, and I think many will be able to handle the cost of 
splitting (you might even mark the replica you are splitting on so that it's 
not part of queries while it's 'busy' splitting).

- Mark
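
The rehash-and-split rule discussed in this thread - divide the shard's hash range in two and rehash each doc id to decide which half-index it lands in - can be sketched as follows. Names and the CRC32 hash are illustrative only, not the actual Solr/Lucene splitter API:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the split decision: rehash each doc id and route it to the
// lower or upper half of the shard's hash range.
public class SplitRouter {

    // Deterministic non-negative hash of a document id (CRC32 here;
    // the branch under discussion would use its own hash function).
    static long hash(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // 0 .. 2^32-1
    }

    // The shard currently owns [lo, hi); after the split, docs hashing
    // below the midpoint stay in index1, the rest go to index2.
    static boolean goesToIndex1(String docId, long lo, long hi) {
        long mid = lo + (hi - lo) / 2;
        return hash(docId) < mid;
    }
}
```

The only change from the existing id-compare splitter is that the predicate compares a hash of the id against the range midpoint rather than the raw id against a pivot id.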

On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:

 Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
 node.  If you have that many on each node, then adding a new node consists
 of moving some shards to the new machine rather than moving lots of little
 documents.
 
 Much faster.  As in thousands of times faster.
 
 On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm. This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
 If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
  You might just split one index to start - now its hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Sorry - missed something - you also have the added cost of shipping the new 
half index to all of the replicas of the original shard with the splitting 
method. Unless you somehow split on every replica at the same time - then of 
course you wouldn't be able to avoid the 'busy' replica, and it would probably 
be fairly hard to juggle.


On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

 In this case we are still talking about moving a whole index at a time rather 
 than lots of little documents. You split the index into two, and then ship 
 one of them off.
 
 The extra cost you can avoid with micro sharding will be the cost of 
 splitting the index - which could be significant for a very large index. I 
 have not done any tests though.
 
 The cost of 20 micro-shards is that you will always have tons of segments 
 unless you are very heavily merging - and even in the very unusual case of 
 each micro shard being optimized, you have essentially 20 segments. That's 
 best case - normal case is likely in the hundreds.
 
 This can be a fairly significant % hit at search time.
 
 You also have the added complexity of managing 20 indexes per node in solr 
 code.
 
 I think that both options have their +/-'s and eventually we could perhaps 
 support both.
 
 To kick things off though, adding another partition should be a rare event if 
 you plan carefully, and I think many will be able to handle the cost of 
 splitting (you might even mark the replica you are splitting on so that it's 
 not part of queries while it's 'busy' splitting).
 
 - Mark
 
 On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
 
 Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
 node.  If you have that many on each node, then adding a new node consists
 of moving some shards to the new machine rather than moving lots of little
 documents.
 
 Much faster.  As in thousands of times faster.
 
 On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm. This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
 If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
So I couldn't resist, I attempted to do this tonight, I used the
solrconfig you mentioned (as is, no modifications), I setup a 2 shard
cluster in collection1, I sent 1 doc to 1 of the shards, updated it
and sent the update to the other.  I don't see the modifications
though I only see the original document.  The following is the test

public void update() throws Exception {

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);

    solrDoc.addField("content", "initial value");

    SolrServer server = servers
            .get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);

    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards",
            "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
}

key is my unique field in schema.xml

What am I doing wrong?

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.

 Now you want to split that shard with an Index Splitter.

 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc id 
 in the index you are splitting - if its hash is greater than X/2, it goes 
 into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now if 
 the id's in the index are above that id, goto index1, if below, index2. We 
 need to instead do a quick hash rather than simple id compare.

 Why do you need to do this on every shard?

 The other part we need that we dont have is to store hash range assignments 
 in zookeeper - we don't do that yet because it's not needed yet. Instead we 
 currently just simply calculate that on the fly (too often at the moment - 
 on every request :) I intend to fix that of course).

 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.

 - Mark

 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm. This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.

 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:

 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.

 If you wanted to keep the full balance always, this would mean splitting 
 every shard at once, yes. But this depends on how many boxes (partitions) 
 you are willing/able to add at a time.

  You might just split one index to start - now its hash range would be 
 handled by two shards instead of one (if you have 3 replicas per shard, 
 this would mean adding 3 more boxes). When you needed to expand again, you 
 would split another index that was still handling its full starting range. 
 As you grow, once you split every original index, you'd start again, 
 splitting one of the now 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
It's not full of details yet, but there is a JIRA issue here:
https://issues.apache.org/jira/browse/SOLR-2595

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:

 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
  Right now lets say you have one shard - everything there hashes to range
 X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
   hmmm. This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but I'll
  take a look at it soon.  So the process sounds like it would be to run
  this on all of the current shards indexes based on the hash algorithm.
 
  Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
  If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
   You might just split one index to start - now its hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half ranges.
 
  Is there also an index merger in contrib which could be used to merge
  indexes?  I'm assuming this would be the process?
 
  You can merge with IndexWriter.addIndexes (Solr also has an admin
 command that can do this). But I'm not sure where this fits in?
 
  - Mark
 
 
  On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Not yet - we don't plan on working on this until a lot of other
 stuff is
  working solid at this point. But someone else could jump in!
 
  There are a couple ways to go about it that I know of:
 
  A more long term solution may be to start using micro shards - each
 index
  starts as multiple indexes. This makes it pretty fast to move micro
 shards
  around as you decide to change partitions. It's also less flexible
 as you
  are limited by the number of micro shards you start with.
 
  A more simple and likely first step is to use an index splitter . We
  already have one in lucene contrib - we would just need to modify it
 so
  that it splits based on the hash of the document id. This is super
  flexible, but splitting will obviously take a little while on a huge
 index.
  The current index splitter is a multi pass splitter - good enough to
 start
  with, but most files under codec control these days, we may be able
 to make
  a single pass splitter soon as well.
 
  Eventually you could imagine using both options - micro shards that
 could
  also 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Hmm...sorry bout that - so my first guess is that right now we are not 
distributing a commit (easy to add, just have not done it).

Right now I explicitly commit on each server for tests.

Can you try explicitly committing on server1 after updating the doc on server 2?

I can start distributing commits tomorrow - been meaning to do it for my own 
convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it 
defaults to 3. That will give you one leader and one replica.

- Mark
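
Following the numShards suggestion above, a two-node setup along these lines would give one leader plus one replica. Only numShards is confirmed by this thread; the other property names (zkRun, zkHost, jetty.port, bootstrap_confdir) are assumptions from the SolrCloud branch of this era, so treat this as an illustrative sketch:

```shell
# Node 1: embedded ZooKeeper, bootstrap config, declare 1 logical shard.
java -Dbootstrap_confdir=./solr/conf -DzkRun -DnumShards=1 -jar start.jar &

# Node 2: joins the same ZK and becomes the replica of shard1.
java -Djetty.port=7574 -DzkHost=localhost:9983 -DnumShards=1 -jar start.jar &
```

With numShards=1 both cores cover the full hash range, so both returning the doc is expected; with numShards=2 each core would own half the range.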

On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
 public void update() throws Exception {
 
     String key = "1";
 
     SolrInputDocument solrDoc = new SolrInputDocument();
     solrDoc.setField("key", key);
 
     solrDoc.addField("content", "initial value");
 
     SolrServer server = servers
             .get("http://localhost:8983/solr/collection1");
     server.add(solrDoc);
 
     server.commit();
 
     solrDoc = new SolrInputDocument();
     solrDoc.addField("key", key);
     solrDoc.addField("content", "updated value");
 
     server = servers.get("http://localhost:7574/solr/collection1");
 
     UpdateRequest ureq = new UpdateRequest();
     ureq.setParam("update.chain", "distrib-update-chain");
     ureq.add(solrDoc);
     ureq.setParam("shards",
             "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
     ureq.setParam("self", "foo");
     ureq.setAction(ACTION.COMMIT, true, true);
     server.request(ureq);
     System.out.println("done");
 }
 
 key is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc 
 id in the index you are splitting - if its hash is greater than X/2, it 
 goes into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now if 
 the id's in the index are above that id, goto index1, if below, index2. We 
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range assignments 
 in zookeeper - we don't do that yet because it's not needed yet. Instead we 
 currently just simply calculate that on the fly (too often at the moment - 
 on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm. This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.
 
 If you 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Thanks for the quick response.  With that change (have not done
numShards yet) shard1 got updated.  But now when executing the
following queries I get information back from both, which doesn't seem
right

http://localhost:7574/solr/select/?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>



On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).

 Right now I explicitly commit on each server for tests.

 Can you try explicitly committing on server1 after updating the doc on server 
 2?

 I can start distributing commits tomorrow - been meaning to do it for my own 
 convenience anyhow.

 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.

 - Mark

 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test

 public void update() throws Exception {
 
     String key = "1";
 
     SolrInputDocument solrDoc = new SolrInputDocument();
     solrDoc.setField("key", key);
 
     solrDoc.addField("content", "initial value");
 
     SolrServer server = servers
             .get("http://localhost:8983/solr/collection1");
     server.add(solrDoc);
 
     server.commit();
 
     solrDoc = new SolrInputDocument();
     solrDoc.addField("key", key);
     solrDoc.addField("content", "updated value");
 
     server = servers.get("http://localhost:7574/solr/collection1");
 
     UpdateRequest ureq = new UpdateRequest();
     ureq.setParam("update.chain", "distrib-update-chain");
     ureq.add(solrDoc);
     ureq.setParam("shards",
             "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
     ureq.setParam("self", "foo");
     ureq.setAction(ACTION.COMMIT, true, true);
     server.request(ureq);
     System.out.println("done");
 }

 key is my unique field in schema.xml

 What am I doing wrong?

 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.

 Now you want to split that shard with an Index Splitter.

 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc 
 id in the index you are splitting - if its hash is greater than X/2, it 
 goes into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now 
 if the id's in the index are above that id, goto index1, if below, index2. 
 We need to instead do a quick hash rather than simple id compare.

 Why do you need to do this on every shard?

 The other part we need that we dont have is to store hash range 
 assignments in zookeeper - we don't do that yet because it's not needed 
 yet. Instead we currently just simply calculate that on the fly (too often 
 at the moment - on every request :) I intend to fix that of course).

 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.

 - Mark

 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm. This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.

 On 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Not sure offhand - but things will be funky if you don't specify the correct 
numShards.

The instance to shard assignment should be using numShards to assign. But then 
the hash to shard mapping actually goes on the number of shards it finds 
registered in ZK (it doesn't have to, but really these should be equal).

So basically you are saying "I want 3 partitions", but you are only starting up 
2 nodes, and the code is just not happy about that, I'd guess. For the system to 
work properly, you have to fire up at least as many servers as numShards.

What are you trying to do? 2 partitions with no replicas, or one partition with 
one replica?

In either case, I think you will have better luck if you fire up at least as 
many servers as the numShards setting. Or lower the numShards setting.

This is all a work in progress by the way - what you are trying to test should 
work if things are setup right though.

- Mark
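
The hash-to-shard mapping described above - equal ranges carved out of the hash space for however many shards are registered - can be sketched like this. The names and the use of a 32-bit space are illustrative, not the branch's actual code:

```java
// Sketch: partition the 32-bit hash space into numShards equal ranges
// and map a document's hash to a shard index.
public class HashToShard {

    // hash is assumed to be in [0, 2^32).
    static int shardFor(long hash, int numShards) {
        long rangeSize = (1L << 32) / numShards;
        int shard = (int) (hash / rangeSize);
        // Integer division can leave a small remainder range at the top;
        // clamp it onto the last shard.
        return Math.min(shard, numShards - 1);
    }
}
```

This is why firing up fewer nodes than numShards misbehaves: the assignment step expects numShards partitions, while the mapping step only sees the shards actually registered in ZK, so the two disagree.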


On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right
 
 http://localhost:7574/solr/select/?q=*:*
 <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
 http://localhost:8983/solr/select?q=*:*
 <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
 
 
 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).
 
 Right now I explicitly commit on each server for tests.
 
 Can you try explicitly committing on server1 after updating the doc on 
 server 2?
 
 I can start distributing commits tomorrow - been meaning to do it for my own 
 convenience anyhow.
 
 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.
 
 - Mark
 
 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
 
 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
 public void update() throws Exception {
 
   String key = 1;
 
   SolrInputDocument solrDoc = new SolrInputDocument();
   solrDoc.setField(key, key);
 
   solrDoc.addField(content, initial value);
 
   SolrServer server = servers
   
 .get(http://localhost:8983/solr/collection1;);
   server.add(solrDoc);
 
   server.commit();
 
   solrDoc = new SolrInputDocument();
   solrDoc.addField(key, key);
   solrDoc.addField(content, updated value);
 
   server = 
 servers.get(http://localhost:7574/solr/collection1;);
 
   UpdateRequest ureq = new UpdateRequest();
   ureq.setParam(update.chain, distrib-update-chain);
   ureq.add(solrDoc);
   ureq.setParam(shards,
   
 localhost:8983/solr/collection1,localhost:7574/solr/collection1);
   ureq.setParam(self, foo);
   ureq.setAction(ACTION.COMMIT, true, true);
   server.request(ureq);
   System.out.println(done);
   }
 
 key is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range 
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little 
 modification. You decide which doc should go into which new index by 
 rehashing each doc id in the index you are splitting - if its hash is 
 greater than X/2, it goes into index1 - if its less, index2. I think 
 there are a couple current Splitter impls, but one of them does something 
 like, give me an id - now if the id's in the index are above that id, 
 goto index1, if below, index2. We need to instead do a quick hash rather 
 than simple id compare.
 
 Why do you need to 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Getting late - didn't really pay attention to your code I guess - why are you 
adding the first doc without specifying the distrib update chain? This is not 
really supported. It's going to just go to the server you specified - even with 
everything setup right, the update might then go to that same server or the 
other one depending on how it hashes. You really want to just always use the 
distrib update chain.  I guess I don't yet understand what you are trying to 
test. 

Sent from my iPad

On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote:

 Not sure offhand - but things will be funky if you don't specify the correct 
 numShards.
 
 The instance to shard assignment should be using numShards to assign. But 
 then the hash to shard mapping actually goes on the number of shards it finds 
 registered in ZK (it doesn't have to, but really these should be equal).
 
 So basically you are saying, I want 3 partitions, but you are only starting 
 up 2 nodes, and the code is just not happy about that I'd guess. For the 
 system to work properly, you have to fire up at least as many servers as 
 numShards.
 
 What are you trying to do? 2 partitions with no replicas, or one partition 
 with one replica?
 
 In either case, I think you will have better luck if you fire up at least as 
 many servers as the numShards setting. Or lower the numShards setting.
 
 This is all a work in progress by the way - what you are trying to test 
 should work if things are setup right though.
 
 - Mark
 
 
 On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
 
 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right
 
 http://localhost:7574/solr/select/?q=*:*
 docstr name=key1/strstr name=content_mvtxtupdated 
 value/str/doc
 
 http://localhost:8983/solr/select?q=*:*
 docstr name=key1/strstr name=content_mvtxtupdated 
 value/str/doc
 
 
 
 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).
 
 Right now I explicitly commit on each server for tests.
 
 Can you try explicitly committing on server1 after updating the doc on 
 server 2?
 
 I can start distributing commits tomorrow - been meaning to do it for my 
 own convenience anyhow.
 
 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.
 
 - Mark
 
 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
 
 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
 public void update() throws Exception {
 
  String key = 1;
 
  SolrInputDocument solrDoc = new SolrInputDocument();
  solrDoc.setField(key, key);
 
  solrDoc.addField(content, initial value);
 
  SolrServer server = servers
  
 .get(http://localhost:8983/solr/collection1;);
  server.add(solrDoc);
 
  server.commit();
 
  solrDoc = new SolrInputDocument();
  solrDoc.addField(key, key);
  solrDoc.addField(content, updated value);
 
  server = 
 servers.get(http://localhost:7574/solr/collection1;);
 
  UpdateRequest ureq = new UpdateRequest();
  ureq.setParam(update.chain, distrib-update-chain);
  ureq.add(solrDoc);
  ureq.setParam(shards,
  
 localhost:8983/solr/collection1,localhost:7574/solr/collection1);
  ureq.setParam(self, foo);
  ureq.setAction(ACTION.COMMIT, true, true);
  server.request(ureq);
  System.out.println(done);
  }
 
 key is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range 
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you 

Possible to facet across two indices, or document types in single index?

2011-12-01 Thread Jeff Schmidt
Hello:

I'm trying to relate together two different types of documents.  Currently I 
have 'node' documents that reside in one index (core), and 'product mapping' 
documents that are in another index.  The product mapping index is used to map 
tenant products to nodes. The nodes are canonical content that gets updated 
every quarter, where as the product mappings can change at any time.

I put them in two indexes because (1) canonical content changes rarely, and I 
don't want product mapping changes to affect it (commit, re-open searchers 
etc.), and I would like to support multiple tenants mapping products to the 
same canonical content to avoid duplication (a few GB).

This arrange has worked well thus far, but only in the sense that for each node 
result returned, I can query the product mapping index to determine the 
products mapped to the node.  I combine this information within my application 
and return it to the client.  This works okay in that there are only 5-20 
results returned per page (start, rows).  But now I'm being asked to facet the 
product catagories (multi-valued field within a product mapping document) along 
with other facets defined in the canonical content.

Can this be done with Solr 3.5.0?  I've been looking into sub-queries, function 
queries etc.  Also, I've seen various postings indicating that one needs to 
denormalize more.  I don't want to add product information as fields to the 
canonical content. Not only does that defeat my objective (1) above, but Solr 
does not support incremental updates of document fields.

So, one approach is to issue by query to the canonical index and get all of the 
document IDs (could be 1000s), and then issue a filter query to the product 
mapping index with all of these IDs and have Solr facet the product categories. 
 Is that efficient?  I suppose I could use HTTP POST (via SolrJ) to convey that 
payload of IDs?  I could then take the facet results of that query and combine 
them with the canonical index results and return them to the client.

That may be do-able, but then let's say the user clicks on a product category 
facet value to narrow the node results to only those mapped to category XYZ. 
This will not affect the query issued against the canonical content index.  
Instead, I think I'd have to go through the canonical results and eliminate the 
nodes that are not associated with product category XYZ.  Then, if the current 
page of results is inadequate (rows=10, but 3 nodes were eliminated), I'd have 
to go back to the canonical index to get more rows, eliminate some some again 
perhaps, get more etc.  That sounds unappealing and low performing.

Is there a Solr way to do this?  My Packt Apache Solr 3 Enterprise Search 
Server book (page 34) states regarding separate indices:

If you do develop separate schemas and if you need to search across 
your indices in one search then you must perform a distributed search, 
described in the last chapter. A distributed search is usually a feature 
employed for a large corpus but it applies here too.

But in the chapter it goes on to talk about dealing with sharding, replication 
etc. to support a large corpus, not necessarily tying together two different 
indexes.

Is it possible to accomplish my goal in a less ugly way than I outlined above?  
Since we only have a single tenant to worry about, I could use a combined index 
at least for a few months (separate fields per document type, IDs are unique 
among then all) if that makes a difference.

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068











XPathEntityProcessor, Fields without Content, and Null-backup

2011-12-01 Thread Michael Watts
Hello Solr and Solr-Users,

I can't confidently say I completeley understand all that these classes so 
boldy tackle (that is, XPathEntityProcessor and XPathRecordReader) , but there 
may be someone who does. Nonetheless,I think I've got some or most of this 
right, and more likely there are more someones like that. So, I won't qualify 
everything I say with a maybe -- lets this be the refactoring of those.

Whenver mapping an XML file into a Solr Index, within the XPathRecordReader, 
(used by the XPathEntityProcessor within the DataImportHandler), if (A) a field 
is perceived to be null and is multivalued, it is pushed a value of null (on 
top of any other values it previously had). Otherwise (B) for multivalued 
fields, any found value is pushed onto its existing list of values, and the 
field is marked as found within the frame (a.k.a record).

In general, when the end-tag of a record is seen, (C) the XPathRecordReader 
clears all of the field's values which have been marked as found, as tidiness 
is a value and they are supposedly no longer useful.
However, suppose that for a given record and multivalued field, a value is 
never found (though it may have been found for other fields in the record), 
only (A) will have occurred, never will (B) have occurred, the field will never 
have been marked as found, and thus (C) never will have occurred for the field.

So, the field will remain, with its list of nulls.
This list of nulls will grow until either the last record or a non-null value 
is seen.
And so, (1) an out-of-memory error may occur, given sufficiently many records 
and a mortal computer.
Moreover, (2), a transformer cannot reliably depend on the number of nulls in 
the field (and this information cannot be guaranteed to be determined by some 
other value).

I will try to provide more information, if this seems an issue and if there 
doesn't seem to be an answer.
At this point, if I understand the problem correctly, it seems the answer is to 
'mark' those null fields, considering 'null' and added value.

Thanks,
Michael Watts



Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Really just trying to do a simple add and update test, the chain
missing is just proof of my not understanding exactly how this is
supposed to work.  I modified the code to this

String key = 1;

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField(key, key);

solrDoc.addField(content_mvtxt, initial value);

SolrServer server = servers
.get(http://localhost:8983/solr/collection1;);

UpdateRequest ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
ureq.add(solrDoc);
ureq.setParam(shards,

localhost:8983/solr/collection1,localhost:7574/solr/collection1);
ureq.setParam(self, foo);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField(key, key);
solrDoc.addField(content_mvtxt, updated value);

server = servers.get(http://localhost:7574/solr/collection1;);

ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
// ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285);
ureq.add(solrDoc);
ureq.setParam(shards,

localhost:8983/solr/collection1,localhost:7574/solr/collection1);
ureq.setParam(self, foo);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
// server.add(solrDoc);
server.commit();
server = servers.get(http://localhost:8983/solr/collection1;);


server.commit();
System.out.println(done);

but I'm still seeing the doc appear on both shards.After the first
commit I see the doc on 8983 with initial value.  after the second
commit I see the updated value on 7574 and the old on 8983.  After the
final commit the doc on 8983 gets updated.

Is there something wrong with my test?

On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote:
 Getting late - didn't really pay attention to your code I guess - why are you 
 adding the first doc without specifying the distrib update chain? This is not 
 really supported. It's going to just go to the server you specified - even 
 with everything setup right, the update might then go to that same server or 
 the other one depending on how it hashes. You really want to just always use 
 the distrib update chain.  I guess I don't yet understand what you are trying 
 to test.

 Sent from my iPad

 On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote:

 Not sure offhand - but things will be funky if you don't specify the correct 
 numShards.

 The instance to shard assignment should be using numShards to assign. But 
 then the hash to shard mapping actually goes on the number of shards it 
 finds registered in ZK (it doesn't have to, but really these should be 
 equal).

 So basically you are saying, I want 3 partitions, but you are only starting 
 up 2 nodes, and the code is just not happy about that I'd guess. For the 
 system to work properly, you have to fire up at least as many servers as 
 numShards.

 What are you trying to do? 2 partitions with no replicas, or one partition 
 with one replica?

 In either case, I think you will have better luck if you fire up at least as 
 many servers as the numShards setting. Or lower the numShards setting.

 This is all a work in progress by the way - what you are trying to test 
 should work if things are setup right though.

 - Mark


 On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right

 http://localhost:7574/solr/select/?q=*:*
 docstr name=key1/strstr name=content_mvtxtupdated 
 value/str/doc

 http://localhost:8983/solr/select?q=*:*
 docstr name=key1/strstr name=content_mvtxtupdated 
 value/str/doc



 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).

 Right now I explicitly commit on each server for tests.

 Can you try explicitly committing on server1 after updating the doc on 
 server 2?

 I can start distributing commits tomorrow - been meaning to do it for my 
 own convenience anyhow.

 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.

 - Mark

 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I 

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Well, this goes both ways.

It is not that unusual to take a node down for maintenance of some kind or
even to have a node failure.  In that case, it is very nice to have the
load from the lost node be spread fairly evenly across the remaining
cluster.

Regarding the cost of having several micro-shards, they are also an
opportunity for threading the search.  Most sites don't have enough queries
coming in to occupy all of the cores in modern machines so threading each
query can actually be a substantial benefit in terms of query time.

On Thu, Dec 1, 2011 at 6:37 PM, Mark Miller markrmil...@gmail.com wrote:

 To kick things off though, adding another partition should be a rare event
 if you plan carefully, and I think many will be able to handle the cost of
 splitting (you might even mark the replica you are splitting on so that
 it's not part of queries while its 'busy' splitting).



Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
With micro-shards, you can use random numbers for all placements with minor
constraints like avoiding replicas sitting in the same rack.  Since the
number of shards never changes, things stay very simple.

On Thu, Dec 1, 2011 at 6:44 PM, Mark Miller markrmil...@gmail.com wrote:

 Sorry - missed something - you also have the added cost of shipping the
 new half index to all of the replicas of the original shard with the
 splitting method. Unless you somehow split on every replica at the same
 time - then of course you wouldn't be able to avoid the 'busy' replica, and
 it would probably be fairly hard to juggle.


 On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

  In this case we are still talking about moving a whole index at a time
 rather than lots of little documents. You split the index into two, and
 then ship one of them off.
 
  The extra cost you can avoid with micro sharding will be the cost of
 splitting the index - which could be significant for a very large index. I
 have not done any tests though.
 
  The cost of 20 micro-shards is that you will always have tons of
 segments unless you are very heavily merging - and even in the very unusual
 case of each micro shard being optimized, you have essentially 20 segments.
 Thats best case - normal case is likely in the hundreds.
 
  This can be a fairly significant % hit at search time.
 
  You also have the added complexity of managing 20 indexes per node in
 solr code.
 
  I think that both options have there +/-'s and eventually we could
 perhaps support both.
 
  To kick things off though, adding another partition should be a rare
 event if you plan carefully, and I think many will be able to handle the
 cost of splitting (you might even mark the replica you are splitting on so
 that it's not part of queries while its 'busy' splitting).
 
  - Mark
 
  On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
 
  Of course, resharding is almost never necessary if you use micro-shards.
  Micro-shards are shards small enough that you can fit 20 or more on a
  node.  If you have that many on each node, then adding a new node
 consists
  of moving some shards to the new machine rather than moving lots of
 little
  documents.
 
  Much faster.  As in thousands of times faster.
 
  On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Yes, the ZK method seems much more flexible.  Adding a new shard would
  be simply updating the range assignments in ZK.  Where is this
  currently on the list of things to accomplish?  I don't have time to
  work on this now, but if you (or anyone) could provide direction I'd
  be willing to work on this when I had spare time.  I guess a JIRA
  detailing where/how to do this could help.  Not sure if the design has
  been thought out that far though.
 
  On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Right now lets say you have one shard - everything there hashes to
 range
  X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
  splitting. This is where the current Splitter needs a little
 modification.
  You decide which doc should go into which new index by rehashing each
 doc
  id in the index you are splitting - if its hash is greater than X/2, it
  goes into index1 - if its less, index2. I think there are a couple
 current
  Splitter impls, but one of them does something like, give me an id -
 now if
  the id's in the index are above that id, goto index1, if below,
 index2. We
  need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
  assignments in zookeeper - we don't do that yet because it's not needed
  yet. Instead we currently just simply calculate that on the fly (too
 often
  at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
  split, it would say, for range less than X/2 goto the old node, for
 range
  greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm.This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where
 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but
 I'll
  take a look at it soon.  So the process sounds like it would be to
 run
  this on all of the current shards