can't overwrite and can't delete by id

2013-11-22 Thread Mingfeng Yang
Recently, I found out that I can't delete a doc by id or overwrite a doc
in my Solr index, which is running Solr 4.4.0.

Say I have a doc http://pastebin.com/GqPP4Uw4 (to make it easier to
view, I use pastebin here). I tried to add a dynamic field rank_ti
to it, wanting to make it look like http://pastebin.com/dGnRRwux

The funny thing is that after I inserted the new version of the doc, if I
query curl 'localhost:8995/solr/select?wt=json&indent=true&q=id:28583776',
the two versions above appear randomly. And after half a minute,
version 2 disappears, which means the update did not get written to
disk.

I tried to delete by id with rsolr, and the doc just can't be removed.
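For reference, the kind of delete being attempted, expressed as a raw curl
call rather than through rsolr (id taken from the example above):

curl 'localhost:8995/solr/update?commit=true' -H 'Content-Type: application/json' -d '{"delete":{"id":"28583776"}}'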

Inserting new docs into the index is fine though.

Has anyone run into this strange behavior before?

Thanks
Ming


Re: can't overwrite and can't delete by id

2013-11-22 Thread Mingfeng Yang
BTW: it's a 4-shard SolrCloud cluster using ZooKeeper 3.3.5.





Re: Problem of facet on 170M documents

2013-11-04 Thread Mingfeng Yang
Erick,

It could have more than 4M distinct values.  The purpose of this facet is
to display the most frequent, say top 500, urls to users.

Sascha,

Thanks for the info. I will look into this thread thing.

Mingfeng


On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson erickerick...@gmail.comwrote:

 How many unique URLs do you have in your 9M
 docs? If your 9M hits have 4M distinct URLs, then
 this is not very valuable to the user.

 Sascha:
 Was that speedup on a single field or were you faceting over
 multiple fields? Because as I remember that code spins off
 threads on a per-field basis, and if I'm mis-remembering I need
 to look again!

 Best,
 Erick


 On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT sz...@gmx.de wrote:

  Hi Ming,
 
  which Solr version are you using? In case you use one of the latest
  versions (4.5 or above), try the new parameter facet.threads with a
  reasonable value (4 to 8 gave me a massive performance speedup when
  working with large facets, i.e. nTerms > 10^7).
 
  -Sascha
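
  For illustration, Ming's facet query with the threaded method enabled
  would look something like this (a sketch; facet.threads exists only in
  Solr 4.5+, and the host and fields are taken from this thread):

  curl 'localhost:8995/solr/select?q=*:*&fq=source:Video&facet=true&facet.field=url&facet.limit=500&facet.threads=8&wt=json'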
 
 



Problem of facet on 170M documents

2013-11-02 Thread Mingfeng Yang
I have an index with 170M documents, and two of the fields for each doc
are source and url. I want to know the top 500 most frequent urls from
the Video source.

So I did a facet with
fq=source:Video&facet=true&facet.field=url&facet.limit=500, and the
matching documents number about 9 million.

The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and 32G
memory. 16G is allocated for the Java heap. 4 master shards sit on one
machine, and 4 replicas on another machine, connected together via ZooKeeper.

Whenever I issue the query above, the response just takes too long and
the client gets timed out. Sometimes, when the end user is impatient,
he/she may wait a few seconds for the results, then kill the
connection, and then issue the same query again and again. The server
then has to deal with multiple such heavy queries simultaneously and
becomes so busy that we get a 'no server hosting shard' error, probably
due to lost communication between the Solr node and ZooKeeper.

Is there any way to deal with such a problem?

Thanks,
Ming


Re: spatial search, geofilt does not work

2013-08-20 Thread Mingfeng Yang
Oh, man. I have been trying to figure out the problem for half a day.
Perhaps Solr could use some error msg if the query format is invalid.

But, THANKS! David, you probably saved me another half day.

Ming-



On Mon, Aug 19, 2013 at 10:20 PM, David Smiley (@MITRE.org) 
dsmi...@mitre.org wrote:

 Thank goodness for Solr's feature of echoing params back in the response,
 as it helps diagnose problems like this. In your case, the filter query
 that Solr is seeing isn't what you (seemed to) have given on the command
 line:
 "fq":"!geofilt sfield=author_geo"
 Clearly wrong. Try escaping the braces with URL percent escapes, etc.
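
 For instance, an escaped version of the original query would look
 something like this (the braces percent-encoded as %7B and %7D so curl
 and the URL handling leave them intact):

 curl 'http://localhost/solr/select?q=*:*&fq=%7B!geofilt%20sfield=author_geo%7D&pt=35.0,35.0&d=10&wt=json&indent=true&fl=author_geo'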

 ~ David







 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book



spatial search, geofilt does not work

2013-08-19 Thread Mingfeng Yang
My Solr index has a field called author_geo which contains the author's
location, and I am trying to get all docs whose authors are within 10
km of 35.0,35.0 using the following query.

curl 'http://localhost/solr/select?q=*:*&fq={!geofilt%20sfield=author_geo}&pt=35.0,35.0&d=10&wt=json&indent=true&fl=author_geo'

I got one matching document which actually has no value for author_geo.

{
  "responseHeader":{
    "status":0,
    "QTime":7,
    "params":{
      "d":"10",
      "fl":"author_geo",
      "indent":"true",
      "q":"*:*",
      "pt":"35.0,35.0",
      "wt":"json",
      "fq":"!geofilt sfield=author_geo"}},
  "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
      {}]
  }}


But if I run the following query to do a sort, it shows clearly that
there are at least 6 docs which are within 10 km of 35.0,35.0.

curl 'http://localhost/solr/select?q=*:*&sort=geodist(author_geo,35,35)+asc&wt=json&indent=true&fl=author_geo,geodist(author_geo,35,35)&fq=author_geo:\[0,0%20TO%20360,360\]'

{
  "responseHeader":{
    "status":0,
    "QTime":10,
    "params":{
      "fl":"author_geo,geodist(author_geo,35,35)",
      "sort":"geodist(author_geo,35,35) asc",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"author_geo:[0,0 TO 360,360]"}},
  "response":{"numFound":78133,"start":0,"docs":[
      {
        "author_geo":"34.991199,34.991199",
        "geodist(author_geo,35,35)":1.2650756688780775},
      {
        "author_geo":"34.991199,34.991199",
        "geodist(author_geo,35,35)":1.2650756688780775},
      {
        "author_geo":"34.991199,34.991199",
        "geodist(author_geo,35,35)":1.2650756688780775},
      {
        "author_geo":"35.032242,35.032242",
        "geodist(author_geo,35,35)":4.634071252404282},
      {
        "author_geo":"35.04644,35.04644",
        "geodist(author_geo,35,35)":6.674485609316976},
      {
        "author_geo":"35.060379,35.060379",
        "geodist(author_geo,35,35)":8.67754019129343},
      {
        "author_geo":"34.924019,34.924019",
        "geodist(author_geo,35,35)":10.923479728441448},
      {
        "author_geo":"34.89296,34.89296",
        "geodist(author_geo,35,35)":15.389876355902395},
      {
        "author_geo":"34.89296,34.89296",
        "geodist(author_geo,35,35)":15.389876355902395},
      {
        "author_geo":"35.109669,35.109669",
        "geodist(author_geo,35,35)":15.759483283896515}]
  }}

Can anyone tell me if anything is wrong here?  I am using Solr 4.4.

Thanks,
Ming-


Re: spatial search, geofilt does not work

2013-08-19 Thread Mingfeng Yang
BTW: my schema.xml contains the following related lines.

<fieldType name="location" class="solr.LatLonType"
    subFieldSuffix="_coordinate"/>
<field name="author_geo" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true"
    stored="false"/>






list docs with geo location info

2013-08-15 Thread Mingfeng Yang
I have a schema with a geolocation field named author_geo defined as

 <field name="author_geo" type="location" indexed="true" stored="true"/>

How can I list docs whose author_geo fields are not empty?

It seems the filter query fq=author_geo:* does not work like it does for
other fields which are string, text, or float types.

curl
'localhost/solr/select?q=*:*&rows=10&wt=json&indent=true&fq=author_geo:*&fl=author_geo'

What's the right way of doing it?

Thanks,
Mingfeng


Re: list docs with geo location info

2013-08-15 Thread Mingfeng Yang
Figured it out: use author_geo:[* TO *] and it will do the trick.
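
For reference, the working filter applied to the earlier query looks like
this (the brackets backslash-escaped for curl, as in the spatial thread):

curl 'localhost/solr/select?q=*:*&rows=10&wt=json&indent=true&fq=author_geo:\[*+TO+*\]&fl=author_geo'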








plugin init failure for ShingleFilterFactory

2013-07-26 Thread Mingfeng Yang
I am trying to upgrade Solr to version 4.4, and it looks like Solr can't
load the ShingleFilterFactory class.

417 [coreLoadExecutor-4-thread-1] ERROR org.apache.solr.core.CoreContainer
 – Unable to create core: collection1
org.apache.solr.common.SolrException: Plugin init failure for [schema.xml]
fieldType textshingle: Plugin init failure for [schema.xml]
analyzer/filter: Error instantiating class:
'org.apache.lucene.analysis.shingle.ShingleFilterFactory'
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at
org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:467)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:164)
at
org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at
org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at
org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:268)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:655)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

the field definition in the schema.xml is

<fieldType name="textshingle" class="solr.TextField"
    positionIncrementGap="100" stored="false">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
        outputUnigrams="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="false"
    />
    <filter class="solr.SnowballPorterFilterFactory"
        language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
        outputUnigrams="true" outputUnigramIfNoNgram="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="false"
    />
    <filter class="solr.SnowballPorterFilterFactory"
        language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>


preserve special characters

2013-06-18 Thread Mingfeng Yang
We need to index and search lots of tweets, which can look like "@solr:
solr is great." or "@solr_lucene, good combination."

And we want to search with @solr or @solr_lucene. How can we preserve
"@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene
will be broken down into solr and lucene, which makes the search results
contain lots of non-relevant docs.

If using StandardTokenizer, the @ symbol is stripped.

Thanks,
Ming-


Re: preserve special characters

2013-06-18 Thread Mingfeng Yang
Hi Jack,

That seems like the solution I am looking for. Thanks so much!

(I can't find this "types" attribute for WDF documented anywhere.)

Ming-


On Tue, Jun 18, 2013 at 4:52 PM, Jack Krupansky j...@basetechnology.comwrote:

 The WDF has a "types" attribute which can specify one or more character
 type mapping files. You could create a file like:

 @ => ALPHA
 _ => ALPHA

 For example (from the book!):

 Example - Treat at-sign and underscores as text

  <fieldType name="text_at_under" class="solr.TextField"
      positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          types="at-under-alpha.txt"/>
    </analyzer>
  </fieldType>

 The file at-under-alpha.txt would contain:

  @ => ALPHA
  _ => ALPHA

 The analysis results:

Source: Hello @World_bar, r@end.
Tokens: 1: Hello 2: @World_bar 3: r@end


 -- Jack Krupansky




dynamic field

2013-06-17 Thread Mingfeng Yang
How is a dynamic field in Solr implemented? Does it get saved into the same
Document as other regular fields in the Lucene index?

Ming-


retrieve datefield value from document

2013-06-14 Thread Mingfeng Yang
I have an index first built with Solr 1.4 and later upgraded to Solr 3.6,
which has 150 million documents, and all docs have a datefield which is
not blank (verified by Solr query).

I am using the following code snippet to retrieve the values:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;

// open the index and print the stored date of every doc
IndexReader input = IndexReader.open(indexDir);
int maxDoc = input.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    Document d = input.document(i);
    System.out.println(d.get("date"));
}

However, about 100 million docs give null for d.get("date") and the other
50 million docs give the right values.

What could be wrong?

Ming-


Re: retrieve datefield value from document

2013-06-14 Thread Mingfeng Yang
Michael,

That's what I thought as well.  I would assume an optimization of the index
would rewrite all documents in the newer format then?

Ming-



On Fri, Jun 14, 2013 at 1:25 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Shot in the dark:

 You're using Lucene to read the index. That's sort of circumventing all the
 typing stuff that Solr does. Solr can deal with an index where some of the
 segments are in one format (say 1.4) and others are in another (3.6). Maybe
 they're being stored in a format in the newer (or older) segments that
 doesn't work with raw retrieval of the values through Lucene in the same
 way.

 Maybe it's able to retrieve the stored value from the indexed
 representation in one case rather than needing to store it.

 I'd query your index using EmbeddedSolrServer instead and see if that
 changes what you see.


 Michael Della Bitta

 Applications Developer

 o: +1 646 532 3062  | c: +1 917 477 7906

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 w: appinions.com http://www.appinions.com/





Re: retrieve datefield value from document

2013-06-14 Thread Mingfeng Yang
HI Dmitry,

No, the docs are not deleted.

Ming-


On Fri, Jun 14, 2013 at 1:31 PM, Dmitry Kan solrexp...@gmail.com wrote:

  Maybe a document was marked as deleted? See IndexReader.isDeleted:
  http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexReader.html#isDeleted(int)





Re: retrieve datefield value from document

2013-06-14 Thread Mingfeng Yang
How did you solve the problem then?

MIng


On Fri, Jun 14, 2013 at 3:24 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Yes, that should be what happens. But then I'd guess you'd be able to
 retrieve no dates. I've encountered this myself.



Re: retrieve datefield value from document

2013-06-14 Thread Mingfeng Yang
Figured out the solution.

The datefield in those documents was stored as binary, so what I should
do is:

import java.nio.ByteBuffer;
import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Fieldable;

// the stored value is the raw long timestamp; decode it via DateTools
Fieldable df = doc.getFieldable(fname);
byte[] ary = df.getBinaryValue();
ByteBuffer bb = ByteBuffer.wrap(ary);
long num = bb.getLong();
Date dt = DateTools.stringToDate(DateTools.timeToString(num,
    DateTools.Resolution.SECOND));

Then dt holds the date value.

Ming-


On Fri, Jun 14, 2013 at 4:20 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Use EmbeddedSolrServer rather than Lucene directly.



SolrEntityProcessor gets slower and slower

2013-06-10 Thread Mingfeng Yang
I am trying to migrate 100M documents from a Solr index (v3.6) to a
SolrCloud index (v4.1, 4 shards) by using SolrEntityProcessor. My
data-config.xml is like:

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
        url="http://10.64.35.117:8995/solr/" query="*:*" rows="2000"
        fl="author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"/>
  </document>
</dataConfig>

Initially, the data import rate is about 1K docs/second, but it eventually
decreases to 20 docs/second after running for tens of hours.

Last time I tried a data import with SolrEntityProcessor, the transfer
rate was as high as 3K docs/second.

Anyone has any clues what can cause the slowdown?

Thanks,
Ming-


shard splitting

2013-06-10 Thread Mingfeng Yang
From the Solr wiki, I saw this command (
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection_name&shard=shardId)
which splits one index into 2 shards. However, is there some way to split
into more shards?

Thanks,
Ming-


Re: shard splitting

2013-06-10 Thread Mingfeng Yang
Hi Shalin,

Do you mean that we can split 1->2, 2->4, 4->8 to get 8 shards eventually?

After splitting, if we want to set up a solrcloud with all 8 shards, how
shall we allocate the shards then?

Thanks,
Ming-


On Mon, Jun 10, 2013 at 9:55 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 No, it is hard coded to split into two shards only. You can call it
 recursively on a sub shard to split into more pieces. Please note that some
 serious bugs were found in that command which will be fixed in the next
 (4.3.1) release of Solr.
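
 For example, splitting recursively would look something like this (a
 sketch; SolrCloud names the sub-shards of shard1 as shard1_0 and
 shard1_1):

 http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1
 http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1_0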





 --
 Regards,
 Shalin Shekhar Mangar.



solr 3.6 use only one CPU

2013-05-30 Thread Mingfeng Yang
We have a Solr instance running on a 4-CPU box.

Sometimes we send a query to our Solr server and it takes up 100% of one
CPU and 60% of memory. I assume that if we send another query request,
Solr should be able to use another idling CPU. However, that is not the
case. Using top, I only see one busy CPU, and the client side just gets
stuck.

Is Solr 3.6 able to do multithreading to process requests?

Ming-


Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Hi Dmitry,

My index is not sharded, and since its size is so big, sharding won't help
much with the paging issue. Do you know any API which can help read from
the Lucene binary index directly? It would be nice if we could just scan
through the docs directly.

Thanks!
Ming-


On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Are you doing it once? Is your index sharded? If so, can you ask each shard
 individually?
 Another way would be to do it on Lucene level, i.e. read from the binary
 indices (API exists).

 Dmitry





Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Andre,

Thanks for the info! Unfortunately, my Solr is on version 3.6, and it
looks like those options are not available. :(

Ming-


On Mon, May 6, 2013 at 5:32 AM, Andre Bois-Crettez andre.b...@kelkoo.comwrote:

 On 05/06/2013 06:03 AM, Michael Sokolov wrote:


  You need to use a unique and stable sort key and page through documents
  by that key. For example, if you have a unique key, retrieve documents
  ordered by the unique key, and for each batch get documents > max(key)
  from the previous batch.

 -Mike
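
  A sketch of that pattern (assuming a unique id field; LAST_ID stands for
  the largest id returned by the previous batch, and the range query is
  percent-encoded so curl does not mangle the braces):

  curl 'localhost/solr/select?q=*:*&fq=id:%7BLAST_ID%20TO%20*%5D&sort=id+asc&rows=500&wt=json'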

 There are more details on the wiki:
 http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore


 --
 André Bois-Crettez

 Search technology, Kelkoo
 http://www.kelkoo.com/





iterate through each document in Solr

2013-05-05 Thread Mingfeng Yang
Dear Solr Users,

Does anyone know what is the best way to iterate through each document in
a Solr index with a billion entries?

I tried to use select?q=*:*&start=xx&rows=500 to get 500 docs each time
and then change the start value, but it got very slow after getting
through about 10 million docs.

Thanks,
Ming-


Re: facet.method enum vs fc

2013-04-19 Thread Mingfeng Yang
Joel,

Thanks for your kind reply. The problem is solved with sharding and using
facet.method=enum. I am curious about the difference between enum
and fc, such that enum works but fc does not. Do you know something about
this?

Thank you!

Regards,
Ming


On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote:

 Faceting on a high cardinality string field, like url, on a 120 million
 record index is going to be very memory intensive.

 You will very likely need to shard the index to get the performance that
 you need.

 In Solr 4.2, you can make the url field a disk-based DocValues field and
 shift the memory from Solr to the file system cache. But to run
 efficiently this will still take a lot of memory in the OS file cache.






 --
 Joel Bernstein
 Professional Services LucidWorks



Re: Updating clusterstate from the zookeeper

2013-04-19 Thread Mingfeng Yang
Right. I am wondering if/how we can download a specific file from
ZooKeeper, modify it, and then upload it to rewrite it. Anyone?
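
For what it's worth, one way to do that round trip is with ZooKeeper's own
zkCli.sh shell (host is an assumption; get prints the node, set overwrites
it):

zkCli.sh -server localhost:2181
  get /clusterstate.json
  set /clusterstate.json '{ ...edited JSON... }'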

Thanks,
Ming


On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 I would like to know the answer to this as well.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  Hello,
  After creating a distributed collection on several different servers I
  sometimes get to deal with failing servers (cores appear not available =>
  grey) or failing cores (down / unable to recover => brown / red).
  In case I wish to delete this erroneous collection (through the
  collection API), only the green nodes get erased, leaving a meaningless
  unavailable collection in the clusterstate.json.

  Is there any way to edit the clusterstate.json explicitly? If not, how
  do I update it so the collection as above gets deleted?
 
  Cheers,
  Manu



Re: facet.method enum vs fc

2013-04-18 Thread Mingfeng Yang
20G is allocated to Solr already.

Ming


On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen 
t...@statsbiblioteket.dkwrote:

 On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
  I am doing faceting on an index of 120M documents,
  on the field of url[...]

 I would guess that you would need 3-4GB for that.
 How much memory do you allocate to Solr?

 - Toke Eskildsen




facet.method enum vs fc

2013-04-17 Thread Mingfeng Yang
I am doing faceting on an index of 120M documents, on the field url,
using the following two queries. Note that the only difference between
the two queries is that the first one uses the default facet.method, and
the second one uses facet.method=enum. (Each document in the index
contains a review we extracted from the internet, with multiple fields;
the url field holds the link to the original web page. The number of
matching documents is about 5.3 million.)

http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0

http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0&facet.method=enum

The first method gives me an out-of-memory error (ERROR 500: Java heap
space java.lang.OutOfMemoryError: Java heap space), but the second one
runs fine, though very slowly (163 seconds).

According to the wiki and Solr documentation, the default facet.method=fc
uses less memory than facet.method=enum, doesn't it?

Thanks,
Ming


Re: facet.method enum vs fc

2013-04-17 Thread Mingfeng Yang
Does Solr 3.6 have facet.method=fcs? I tried anyway, and got

ERROR 500: GC overhead limit exceeded java.lang.OutOfMemoryError: GC
overhead limit exceeded.


On Wed, Apr 17, 2013 at 12:38 PM, Timothy Potter thelabd...@gmail.comwrote:

 What are your results when using facet.method=fcs?





Re: tokenizer of solr

2013-04-12 Thread Mingfeng Yang
Jack,

Thanks so much for this info.  It's awesome.

Ming


On Thu, Apr 11, 2013 at 7:32 PM, Jack Krupansky j...@basetechnology.comwrote:

 In that case, use the types="wdfftypes.txt" attribute of WDF and map @
 and _ to ALPHA as shown in:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory


 -- Jack Krupansky






tokenizer of solr

2013-04-11 Thread Mingfeng Yang
Dear Solr users and developers,

I am trying to index some documents, some of which are twitter messages,
and we have a problem when indexing retweets.

Say a twitter user named jpc_108 posts a tweet, and then someone retweets
his msg, and now @jpc_108 becomes part of the tweet text body.

It seems that before indexing, the tokenizer factory of Solr turns
@jpc_108 into jpc and 108, and when we search for jpc_108, it's not there
anymore.


Is there any way we can keep jpc_108 when it appears as @jpc_108?

Thanks,
Ming-


Re: tokenizer of solr

2013-04-11 Thread Mingfeng Yang
Looks like it's due to the word delimiter filter. Does anyone know if the
protected-words file supports regular expressions or not?

Ming


On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky j...@basetechnology.comwrote:

 Try the whitespace tokenizer.

 -- Jack Krupansky




update some fields vs replace the whole document

2013-03-08 Thread Mingfeng Yang
Generally speaking, which has better performance for Solr?
1. updating some fields or adding new fields into a document.
or
2. replacing the whole document.

As I understand it, updating fields needs to search for the corresponding
doc first and then replace the field values, while replacing the whole
document is just like adding a new document. Is that right?
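
For concreteness, a field-level (atomic) update in Solr 4 sends only the
changed fields, e.g. (a sketch with a made-up id and field):

curl 'localhost:8983/solr/update?commit=true' -H 'Content-Type: application/json' \
  -d '[{"id":"12345","rank_ti":{"set":5}}]'

whereas a full replace re-sends the entire document body.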


Re: update some fields vs replace the whole document

2013-03-08 Thread Mingfeng Yang
Then what's the difference between adding a new document vs.
replacing/overwriting a document?

Ming-


On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote:

 With an atomic update, you need to retrieve the stored fields in order
 to build up the full document to insert back.

 In either case, you'll have to locate the previous version and mark it
 deleted before you can insert the new version.

 I bet that the amount of time spent retrieving stored fields is matched
 by the time saved by not having to transmit those fields over the wire,
 although I'd be very curious to see someone actually test that.

 Upayavira




pivot facet with solrcloud (solr 4.1)

2013-03-04 Thread Mingfeng Yang
Looks like pivot facet with SolrCloud does not work (I am using Solr 4.1).

The query below returns no pivot facet results unless I add
shards=shard1.

http://localhost:8995/solr/collection1/select?q=*%3A*&facet=true&facet.mincount=1&facet.pivot=source_domain,author&rows=1&wt=json&facet.limit=5


When will this JIRA (https://issues.apache.org/jira/browse/SOLR-2894) be
implemented?

Thanks,
Ming-


solrcloud data directory structure

2013-02-22 Thread Mingfeng Yang
I see the items under my SolrCloud data directory of a replica node as:

drwxr-xr-x 2 solr solr42 Feb 22 18:19 index
drwxr-xr-x 2 solr solr 12288 Feb 23 01:00 index.20130222181947835
-rw-r--r-- 1 solr solr78 Feb 22 18:25 index.properties
-rw-r--r-- 1 solr solr   209 Feb 22 18:25 replication.properties
drwxr-xr-x 2 solr solr99 Feb 23 01:00 tlog

The index.<timestamp> directory is always there.

But in the old Solr master/slave replication setup, the index.<timestamp>
directory becomes index after replication is done.

What's the reason? Is it because in SolrCloud, the replica node is
always replicating?

Thanks,
Ming


Re: How to change the index dir in Solr 4.1

2013-02-21 Thread Mingfeng Yang
How about passing -Dsolr.data.dir=/your/data/dir on the command line to
java when you start the Solr service?
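
For example, with the stock Jetty-based example that ships with Solr (the
data path here is illustrative):

java -Dsolr.data.dir=/var/solr/data -jar start.jar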


On Thu, Feb 21, 2013 at 9:05 AM, chamara chama...@gmail.com wrote:

 Yes, that is what I am doing now. I thought this solution was not elegant
 for a deployment. Is there any other way to do this from the
 solrconfig.xml?






Re: can i install new SOLR 4.1 as slave (3.3 master)

2013-02-21 Thread Mingfeng Yang
I cannot give an affirmative answer, but I am thinking that it would have
potential problems, as the index formats in 3.3 and 4.1 are slightly
different.

Why don't you upgrade to 4.1? The only things you need to do are:
1. install Solr 4.1
2. copy all related config files from 3.3
3. back up the index data folder
4. shut down Solr 3.3
5. start Solr 4.1 with solr.data.dir pointing to the old dir




On Thu, Feb 21, 2013 at 10:54 AM, michaelweica m...@hipdigital.com wrote:

 Hi,

 Our SOLR master version is 3.3. Can I install a new box with SOLR 4.1 as
 a slave and replicate from the master's data?

 thanks






Re: RequestHandler init failure

2013-02-20 Thread Mingfeng Yang
Chris,

My config file did include the section that loads the related plugin.

Ming


On Tue, Feb 19, 2013 at 10:42 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Found it by myself.  It's here
 :
 http://mirrors.ibiblio.org/maven2/org/apache/solr/solr-dataimporthandler/4.1.0/
 :
 : Download and move the jar file to solr-webapp/webapp/WEB-INF/lib
 directory,
 : and the errors are all gone.

 you don't need to move/copy/add any jars into hte solr webapp (where they
 will be blown away if/when you upgrade the webapp)

 All you need to do is load the jar as a plugin...

 https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
 https://wiki.apache.org/solr/SolrConfigXml#lib


 -Hoss



RequestHandler init failure

2013-02-18 Thread Mingfeng Yang
When trying to use SolrEntityProcessor to do a data import from another
Solr index (Solr 4.1), I added the following to solrconfig.xml:

<requestHandler name="/data"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

and created a new file data-config.xml with:

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
        url="http://wolf:1Xnbdoq@myserver:8995/solr/" query="*:*"
        fl="id,md5_text,title,text"/>
  </document>
</dataConfig>


I got the following errors:

org.apache.solr.common.SolrException: RequestHandler init failure
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:794)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:607)
at
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:949)
at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.solr.common.SolrException: RequestHandler init failure
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:168)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:731)
... 13 more
Caused by: org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.dataimport.DataImportHandler'
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:438)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:507)
at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:581)
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:154)
... 14 more
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.handler.dataimport.DataImportHandler
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:422)
... 17 more
Feb 18, 2013 7:24:43 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Unable to create core:
collection1
at
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)

I assume that it's because the jar file for the DataImportHandler is not
included in the default Solr 4.1 distribution. Where can I find it?
Thanks
Ming


Re: RequestHandler init failure

2013-02-18 Thread Mingfeng Yang
Found it by myself. It's here:
http://mirrors.ibiblio.org/maven2/org/apache/solr/solr-dataimporthandler/4.1.0/

Download and move the jar file to the solr-webapp/webapp/WEB-INF/lib
directory, and the errors are all gone.

Ming


On Mon, Feb 18, 2013 at 11:52 AM, Mingfeng Yang mfy...@wisewindow.comwrote:

 When trying to use SolrEntityProcessor to do data import from another solr
 index (solor 4.1)

 I added  the following in solrconfig.xml

  requestHandler name=/data
 class=org.apache.solr.handler.dataimport.DataImportHandler
lst name=defaults
str name=configdata-config.xml/str
/lst
/requestHandler

 and create new file data-config.xml with
 dataConfig
   document
 entity name=sep processor=SolrEntityProcessor
 url=http://wolf:1Xnbdoq@myserver:8995/solr/; query=*:*
 fl=id,md5_text,title,text/
   /document
 /dataConfig


 I got the following errors:

 org.apache.solr.common.SolrException: RequestHandler init failure
 at org.apache.solr.core.SolrCore.init(SolrCore.java:794)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:607)
 at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:949)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
 at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
 at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.solr.common.SolrException: RequestHandler init failure
 at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:168)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:731)
 ... 13 more
 Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'
 at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:438)
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:507)
 at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:581)
 at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:154)
 ... 14 more
 Caused by: java.lang.ClassNotFoundException: org.apache.solr.handler.dataimport.DataImportHandler
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
 at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:422)
 ... 17 more
 Feb 18, 2013 7:24:43 PM org.apache.solr.common.SolrException log
 SEVERE: null:org.apache.solr.common.SolrException: Unable to create core: collection1
 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
 at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)

 I assume it's because the jar file for the DataImportHandler is not
 included in the default Solr 4.1 distribution.  Where can I find it?

 Thanks
 Ming



fastest way to rebuild Solr index

2013-02-14 Thread Mingfeng Yang
I have a few Solr indexes, each with 20-200 million documents, which were
indexed by querying multiple PostgreSQL databases.  If I rebuild the index
the same way, it will take a few months, because the PostgreSQL queries
are slow.

Now, I need to do the following changes to all indexes.
1. delete a couple fields from the Solr index
2. add a couple new fields
3. change the type of one field from string to int (see the sketch below)
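
For step 3, the schema change itself is a small edit in schema.xml; a
sketch, with rank as a hypothetical field name:

  <!-- before -->
  <field name="rank" type="string" indexed="true" stored="true"/>
  <!-- after (this is why a full reindex is needed) -->
  <field name="rank" type="int" indexed="true" stored="true"/>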

Luckily, all fields were indexed and stored.  My plan is to query an old
index, get the values for all fields, and then add them into the new index.

Are there any faster ways to build the new indexes in my case?

Thanks,
Ming


Re: fastest way to rebuild Solr index

2013-02-14 Thread Mingfeng Yang
Shawn,

Awesome.  Exactly something I am looking for.

Thanks!
Ming


On Thu, Feb 14, 2013 at 12:00 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2013 12:46 PM, Mingfeng Yang wrote:

 I have a few Solr indexes, each with 20-200 million documents, which were
 indexed by querying multiple PostgreSQL databases.  If I rebuild the
 index the same way, it will take a few months, because the PostgreSQL
 queries are slow.

 Now, I need to do the following changes to all indexes.
 1. delete a couple fields from the Solr index
 2. add a couple new fields
 3. change the type of one field from string to int

 Luckily, all fields were indexed and stored.  My plan is to query an old
 index, get the values for all fields, and then add them into the new index.


 Using the DataImportHandler with SolrEntityProcessor is probably your best
 bet.  I believe you would want to avoid updating the source index while
 using this.

 http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

 Thanks,
 Shawn
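
A data-config.xml along these lines might look like the following (a
minimal sketch; the source URL, rows value, and field list are
assumptions):

  <dataConfig>
    <document>
      <entity name="old" processor="SolrEntityProcessor"
              url="http://oldhost:8983/solr/oldcore" query="*:*"
              rows="1000"
              fl="id,title,text"/>  <!-- list only the fields being kept -->
    </document>
  </dataConfig>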




Traditional replication behind SolrCloud

2013-01-29 Thread Mingfeng Yang
Our application of Solr is somewhat atypical.  We constantly feed Solr
with lots of documents grabbed from the internet, and NRT searching is not
required.  A typical search returns millions of results, and query
responses need to be as fast as possible.

In a SolrCloud environment, indexing requests are constantly distributed
to all leaders and replicas, which may hurt query performance since the
replicas are indexing and searching at the same time.  I am thinking about
setting up traditional replication behind each shard of SolrCloud, with
the replication interval set to a few minutes, to minimize the impact of
indexing on system resources.

Or is there already some way to enforce traditional-style replication on
the replicas of SolrCloud?
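
By the traditional setup, I mean the standard ReplicationHandler config in
solrconfig.xml, roughly like this (a sketch; the host name, core name, and
the 5-minute poll interval are assumptions):

  <!-- on the indexing (master) core -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- on the searching (slave) core -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core1</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>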

Thanks,
Ming


Re: How to migrate SolrCloud shards to different servers?

2013-01-29 Thread Mingfeng Yang
An experiment showed that stopping all shards, removing the zoo_data
directory (assuming your ZooKeeper is used only for this particular
SolrCloud; otherwise, be cautious), and then starting the instances in
order works fine.
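
Roughly (a sketch; the zoo_data path is an assumption, and this wipes all
cluster state, so only do it on a ZooKeeper dedicated to this cluster):

  # stop all Solr nodes (and ZooKeeper, if standalone) first
  rm -rf /path/to/zookeeper/zoo_data
  # then start ZooKeeper and the Solr instances again, in the original order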

Ming



On Sat, Jan 26, 2013 at 5:31 AM, Per Steffensen st...@designware.dk wrote:

 Hi

 We have actually tested this and found that the following will do it
 * Shutdown all Solr nodes - make sure ZKs are still running
 * For each replica (shard-instance) move its data-folder to the new server
 (if they are not already available to it through some shared storage)
 * For each replica (shard-instance) also move the corresponding solr.xml
 * Extract clusterstate.json from ZK into a file. Modify that file so that
 hosts/IPs and ports are correct for the new setup. Replace
 clusterstate.json in ZK with the content of the modified file (see the
 sketch below)
 * Start new Solr nodes
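
 For the clusterstate.json step, Solr's bundled zkcli script can be used,
 roughly like this (a sketch; the zkhost string is an assumption):

   cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd getfile /clusterstate.json clusterstate.json
   # edit clusterstate.json: fix the base_url and node_name of each replica
   cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json clusterstate.json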

 Good luck!

 Regards, Per Steffensen



 On 1/26/13 6:56 AM, Mingfeng Yang wrote:

 Hi Mark,

 When I did testing with SolrCloud, I found the following.

 1. I started 4 shards on the same host on ports 8983, 8973, 8963, and 8953.
 2. Index some data.
 3. Shutdown all 4 shards.
 4. Started 4 shards again, all pointing to the same data directories and
 using the same configuration, except that now we use different ports:
 8983, 8973, 7633, and 7648.
 5. Now Solr has problems loading all cores properly.

 Therefore, I had the impression that ZooKeeper may have a memory of which
 hosts correspond to which shards. If I change the host info, it may get
 confused.  I could not find any related documentation or discussion about
 this issue.

 Thanks,
 Ming




 On Fri, Jan 25, 2013 at 5:52 PM, Mark Miller markrmil...@gmail.com
 wrote:

  You could do it that way.

 I'm not sure why you are worried about the leaders. That shouldn't
 matter.

 You could also start up new Solrs on the new machines as replicas of the
 cores you want to move - then once they are active, unload the cores on
 the
 old machine, stop the Solr instances and remove the stuff left on the
 filesystem.

 - Mark

 On Jan 25, 2013, at 7:42 PM, Mingfeng Yang mfy...@wisewindow.com
 wrote:

  Right now I have an index with four shards on a single EC2 server, each
 running on different ports.  Now I'd like to migrate three shards
 to independent servers.

 What should I do to safely accomplish this process?

 Can I just
 1. shutdown all four solr instances.
 2. copy three shards (indexes) to different servers.
 3. launch 4 Solr instances on 4 different servers, each with -DzkHost
 specified, pointing to the ZooKeeper servers.

 My impression is that ZooKeeper remembers which shards are leaders.  What
 I plan to do above might not elect the three new servers as leaders.  If
 so, what's the correct way to do it?

 Thanks,
 Ming






Re: Distributed search

2013-01-28 Thread Mingfeng Yang
In your case, since there are no concurrent queries, adding replicas won't
help much with response speed.  However, breaking your index into a few
shards does help query performance. I recently broke an index with 30
million documents (30G) into 4 shards, and the boost is pretty impressive
(roughly 2-5x faster for a complicated query).
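
For reference, a manual distributed query across four shards looks roughly
like this (host names and the query are assumptions; with SolrCloud the
shards parameter is filled in automatically):

  curl 'http://host1:8983/solr/select?q=field1:term1&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr,host4:8983/solr'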

Ming


On Mon, Jan 28, 2013 at 10:54 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Does adding replicas (on additional servers) help to improve search
 performance?

 It is known that each query goes to all the shards. It's clear that if we
 have massive load, then multiple cores serving the same shard are very
 useful.

 But what happens if I'll never have concurrent queries (one query in the
 system at any time), but I want these single queries to return faster?
 Will a bigger replication factor contribute?

 In particular, will a complicated query (with a large number of queried
 fields) go to multiple cores *of the same shard*? (E.g., core1 searching
 for term1 in field1, and core2 searching for term2 in field2.)

 And what about a query on a single field, which contains a lot of terms?

 Thanks in advance..



secure Solr server

2013-01-27 Thread Mingfeng Yang
Before Solr 4.0, I secured Solr by enabling password protection in Jetty.
However, password protection makes SolrCloud not work.
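
The pre-4.0 setup was roughly the standard servlet security constraint in
the webapp's web.xml plus a Jetty realm (a sketch; the role and realm
names are assumptions):

  <security-constraint>
    <web-resource-collection>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr Realm</realm-name>
  </login-config>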

We use EC2 now, and we need the web admin interface of Solr to be
accessible (with a password) from anywhere.

How do you protect your Solr server from unauthorized access?

Thanks,
Ming


maxScore field in SolrCloud response

2013-01-25 Thread Mingfeng Yang
We are migrating our Solr index from a single index to multiple shards with
SolrCloud. I noticed that when I query SolrCloud (all shards or just one of
the shards), the response has a maxScore field, but a query against the
single index does not include this field.

In both cases, we are using Solr 4.0.

Why is there such a difference?

Ming


How to migrate SolrCloud shards to different servers?

2013-01-25 Thread Mingfeng Yang
Right now I have an index with four shards on a single EC2 server, each
running on different ports.  Now I'd like to migrate three shards
to independent servers.

What should I do to safely accomplish this process?

Can I just
1. shutdown all four solr instances.
2. copy three shards (indexes) to different servers.
3. launch 4 Solr instances on 4 different servers, each with -DzkHost
specified, pointing to the ZooKeeper servers.

My impression is that ZooKeeper remembers which shards are leaders.  What I
plan to do above might not elect the three new servers as leaders.  If so,
what's the correct way to do it?

Thanks,
Ming


Re: How to migrate SolrCloud shards to different servers?

2013-01-25 Thread Mingfeng Yang
Hi Mark,

When I did testing with SolrCloud, I found the following.

1. I started 4 shards on the same host on ports 8983, 8973, 8963, and 8953.
2. Index some data.
3. Shutdown all 4 shards.
4. Started 4 shards again, all pointing to the same data directories and
using the same configuration, except that now we use different ports:
8983, 8973, 7633, and 7648.
5. Now Solr has problems loading all cores properly.

Therefore, I had the impression that ZooKeeper may have a memory of which
hosts correspond to which shards. If I change the host info, it may get
confused.  I could not find any related documentation or discussion about
this issue.

Thanks,
Ming




On Fri, Jan 25, 2013 at 5:52 PM, Mark Miller markrmil...@gmail.com wrote:

 You could do it that way.

 I'm not sure why you are worried about the leaders. That shouldn't matter.

 You could also start up new Solrs on the new machines as replicas of the
 cores you want to move - then once they are active, unload the cores on the
 old machine, stop the Solr instances and remove the stuff left on the
 filesystem.

 - Mark

 On Jan 25, 2013, at 7:42 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

  Right now I have an index with four shards on a single EC2 server, each
  running on different ports.  Now I'd like to migrate three shards
  to independent servers.
 
  What should I do to safely accomplish this process?
 
  Can I just
  1. shutdown all four solr instances.
  2. copy three shards (indexes) to different servers.
  3. launch 4 Solr instances on 4 different servers, each with -DzkHost
  specified, pointing to the ZooKeeper servers.
 
  My impression is that ZooKeeper remembers which shards are leaders.  What
  I plan to do above might not elect the three new servers as leaders.  If
  so, what's the correct way to do it?
 
  Thanks,
  Ming
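
For reference, Mark's replica-then-unload approach (quoted above) might
look like this with the Core Admin API (a sketch; host, collection, core,
and shard names are assumptions):

  # on the new machine: create a core that joins shard1 as a replica
  curl 'http://newhost:8983/solr/admin/cores?action=CREATE&name=shard1_replica2&collection=mycollection&shard=shard1'
  # once the new core shows as active in clusterstate.json, drop the old one
  curl 'http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=shard1_core&deleteIndex=true'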