Re: Solr substring search

2013-09-06 Thread Alvaro Cabrerizo
Hi:

I would start by looking at:

http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser

And the
org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java

Hope it helps.

On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider 
scott_schnei...@symantec.com wrote:

 Hello,

 I'm trying to find out how Solr runs a query for *foo*.  Google tells me
 that you need to use NGramFilterFactory for that kind of substring search,
 but I find that even with very simple fieldTypes, it just works.  (Perhaps
 because I'm testing on very small data sets, Solr is willing to look
 through all the keywords.)  E.g., this works on the tutorial.

 Can someone tell me exactly how this works and/or point me to the Lucene
 code that implements this?

 Thanks,
 Scott




Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
The input string is a normal HTML page with the word Zahlungsverkehr in it, and
my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

 And show us an input string and a query that fail.
 
 -- Jack Krupansky
 
 -Original Message- From: Shawn Heisey
 Sent: Thursday, September 05, 2013 2:41 PM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 On 9/5/2013 10:03 AM, Andreas Owen wrote:
 I would like to filter / replace a word during indexing, but it doesn't do
 anything and I don't get an error.
 
 in schema.xml i have the following:
 
 <field name="text_html" type="text_cutHtml" indexed="true" stored="true"
        multiValued="true"/>

 <fieldType name="text_cutHtml" class="solr.TextField">
   <analyzer>
     <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
   </analyzer>
 </fieldType>
 
 My second question is: where can I say that the expression is multiline? In
 JavaScript I can use /m at the end of the pattern.
 
 I don't know about your second question.  I don't know if that will be
 possible, but I'll leave that to someone who's more expert than I.
 
 As for the first question, here's what I have.  Did you reindex?  That
 will be required.
 
 http://wiki.apache.org/solr/HowToReindex
 
 Assuming that you did reindex, are you trying to search for ASDFGHJK in
 a field that contains more than just Zahlungsverkehr?  The keyword
 tokenizer might not do what you expect - it tokenizes the entire input
 string as a single token, which means that you won't be able to search
 for single words in a multi-word field without wildcards, which are
 pretty slow.
 
 Note that both the pattern and replacement are case sensitive.  This is
 how regex works.  You haven't used a lowercase filter, which means that
 you won't be able to search for asdfghjk.
 
 Use the analysis tab in the UI on your core to see what Solr does to
 your field text.
 
 Thanks,
 Shawn 
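
(A follow-up on the unanswered multiline question: the pattern attribute is
compiled as a plain Java regex, so embedded flags should work. A minimal,
untested sketch -- pattern and replacement are only illustrative:

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?si)zahlungsverkehr" replacement="ASDFGHJK"/>

Here (?s) enables DOTALL so that . also matches newlines, (?i) makes the
match case-insensitive, and (?m) would give ^ and $ the multiline semantics
of JavaScript's /m.)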



Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Nutan
I will try this, thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/unknown-stream-source-info-while-indexing-rich-doc-in-solr-tp4088136p4088490.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solrcloud shards backup/restoration

2013-09-06 Thread Shalin Shekhar Mangar
The replication handler's backup command was built for pre-SolrCloud.
It takes a snapshot of the index but it is unaware of the transaction
log which is a key component in SolrCloud. Hence unless you stop
updates, commit your changes and then take a backup, you will likely
miss some updates.

That being said, I'm curious to see how peer sync behaves when you try
to restore from a snapshot. When you say that you haven't been
successful in restoring, what exactly is the behaviour you observed?

On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote:
 Hello,

 I was looking for a good backup / recovery solution for the solrcloud
 indexes. I am more looking for restoring the indexes from the index
 snapshot, which can be taken using the replicationHandler's backup command.

 I am looking for something that works with solrcloud 4.3 eventually, but
 still relevant if you tested with a previous version.

 I haven't been successful in having the restored index replicate across the
 new replicas, after I restart all the nodes, with one node having the
 restored index.

 Is restoring the indexes on all the nodes the best way to do it ?
 --
 Regards,
 -Aditya Sakhuja



-- 
Regards,
Shalin Shekhar Mangar.
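
(For reference, the snapshot discussed here is triggered through the
replication handler, along these lines -- host, core name and location are
illustrative:

http://localhost:8983/solr/core1/replication?command=backup&location=/backups

Issuing a hard commit immediately before that call makes the snapshot as
complete as it can be, as noted above.)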


Re: Solr documents update on index

2013-09-06 Thread Shalin Shekhar Mangar
Yes, if a document with the same key exists, then the old document
will be deleted and replaced with the new document. You can also
partially update documents (we call it atomic updates) which reads the
old document from local index, updates it according to the request and
then replaces the old document with the new one.

See 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument

On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso
meligalet...@gmail.com wrote:
 Hi,

 I'm having a problem when solr indexes.
 It is updating documents already indexed. Is this a normal behavior?
 If a document with the same key already exists is it supposed to be updated?
 I was thinking that it is supposed to just update if the information in the
 RSS has changed.

 Appreciate your help

 --
 Sent from Gmail Mobile



-- 
Regards,
Shalin Shekhar Mangar.
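
(To illustrate the atomic-update option Shalin mentions: an XML atomic
update that changes a single field of an existing document could look like
this -- a minimal sketch, with "id" and "price" as illustrative field names:

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="price" update="set">99</field>
  </doc>
</add>

Atomic updates need the updateLog enabled in solrconfig.xml and, apart from
copyField targets, all fields stored.)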


Re: bucket count for facets

2013-09-06 Thread Shalin Shekhar Mangar
Stats Component can give you a count of non-null values in a field.

See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net wrote:
 Is there a way to get the count of buckets (ie unique values) for a field
 facet? the rudimentary approach of course is to get back all buckets, but
 in some cases this is a huge amount of data.

 thanks,

 steve



-- 
Regards,
Shalin Shekhar Mangar.


Re: Odd behavior after adding an additional core.

2013-09-06 Thread Shalin Shekhar Mangar
Can you give exact steps to reproduce this problem?

Also, are you sure you supplied numShards=4 while creating the collection?

On Fri, Sep 6, 2013 at 12:20 AM, mike st. john mstj...@gmail.com wrote:
 Using Solr 4.4, I used the collection admin API to create a collection: 4 shards,
 replicationFactor of 1.

 I did this so I could index my data, then bring in replicas later by adding
 cores via coreadmin.


 I added a new core via coreadmin. What I noticed shortly after adding the
 core: the leader of the shard where the new replica was placed was marked
 active, the new core was marked as the leader, and the routing was now set to
 implicit.



 I've reproduced this on another Solr setup as well.


 Any ideas?


 Thanks

 msj



-- 
Regards,
Shalin Shekhar Mangar.
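
(For reference, the Collections API call that sets numShards explicitly
looks roughly like this -- host and names illustrative:

http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=4&replicationFactor=1

Replicas added later through the CoreAdmin API would normally pass matching
collection and shard parameters, e.g. &collection=collection1&shard=shard2.)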


Regarding reducing qtime

2013-09-06 Thread prabu palanisamy
Hi

I am currently using Solr 3.5.0, with a Wikipedia dump (50 GB) indexed, on
Java 1.6. I am searching tweets against Solr. Currently it takes an average
of 210 milliseconds for each post, out of which 200 milliseconds are consumed
by the Solr server (QTime). I used the jconsole monitoring tool. The stats are:
   Heap usage - 10-50 MB
   No. of threads - 10-20
   No. of classes - around 3800


monitoring Solr RAM with graphite

2013-09-06 Thread Dmitry Kan
Hello!

I remember some time ago people were interested in how Solr instances can
be monitored with graphite. This blog post gives a hands-on example from my
experience of monitoring RAM usage of Solr.

http://dmitrykan.blogspot.fi/2013/09/monitoring-solr-with-graphite-and-carbon.html

Please note that this is not Solr-native monitoring, i.e. Solr is treated more
like a black box. It can still suffice for a persistent monitoring need.

Further stats can be added by querying Solr for cache usage and so on.

Regards,

Dmitry Kan


Re: Loading a SpellCheck dynamically

2013-09-06 Thread Shalin Shekhar Mangar
My guess is that you have a single request handler defined with all
your language specific spell check components. This is why you see
spellcheck values from all spellcheckers.

If the above is true, then I don't think there is a way to choose one
specific spellchecker component. The alternative is to define multiple
request handlers with one-to-one mapping with the spell check
components. Then you can send a request to one particular request
handler and the corresponding spell check component will return its
response.

On Thu, Sep 5, 2013 at 11:29 PM, Mr Havercamp mrhaverc...@gmail.com wrote:
 I currently have multiple spellchecks configured in my solrconfig.xml to
 handle a variety of different spell suggestions in different languages.

 In the snippet below, I have a catch-all spellcheck as well as an English-only
 one for more accurate matching (i.e. my schema.xml is set up to copy
 English-only fields to an English-specific textSpell_en field, and then I
 also copy to a generic textSpell field):

 ---solrconfig.xml---

 <searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
   <str name="queryAnalyzerFieldType">textSpell_en</str>

   <lst name="spellchecker">
     <str name="name">default</str>
     <str name="field">spell_en</str>
     <str name="spellcheckIndexDir">./spellchecker_en</str>
     <str name="buildOnOptimize">true</str>
   </lst>
 </searchComponent>

 <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
   <str name="queryAnalyzerFieldType">textSpell</str>

   <lst name="spellchecker">
     <str name="name">default</str>
     <str name="field">spell</str>
     <str name="spellcheckIndexDir">./spellchecker</str>
     <str name="buildOnOptimize">true</str>
   </lst>
 </searchComponent>

 My question is: when I query my Solr index, am I able to load, say, just
 spellcheck values from the spellcheck_en spellchecker rather than from both?
 This would be useful if I were to start implementing additional language
 spellchecks; E.g. spellcheck_ja, spellcheck_fr, etc.

 Thanks for any insights.

 Cheers


 Hayden



-- 
Regards,
Shalin Shekhar Mangar.
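
(A sketch of the one-to-one mapping described above, for solrconfig.xml --
handler name and defaults are illustrative:

<requestHandler name="/spell_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">default</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck_en</str>
  </arr>
</requestHandler>

A query sent to /spell_en would then consult only the English spellchecker.)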


Regarding improving performance of the solr

2013-09-06 Thread prabu palanisamy
 Hi

I am currently using Solr 3.5.0, with a Wikipedia dump (50 GB) indexed, on
Java 1.6.
I am searching Solr with text (which is actually Twitter tweets).
Currently it takes an average of 210 milliseconds for each post, out of
which 200 milliseconds are consumed by the Solr server (QTime). I used the
jconsole monitoring tool.

The stats are:
   Heap usage - 10-50 MB
   No. of threads - 10-20
   No. of classes - 3800
   CPU usage - 10-15%

Currently I am loading all the fields of the wikipedia.

I only need the Freebase category and Wikipedia category. I want to know
how to optimize the Solr server to improve the performance.

Could you please help me out in optimizing the performance?

Thanks and Regards
Prabu
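
(One relevant knob here: if only two fields are ever needed, the fl
parameter restricts what Solr returns, which cuts response size and
stored-field reads -- field names below are only illustrative guesses:

...&q=<tweet text>&fl=freebase_category,wikipedia_category

Marking the remaining Wikipedia fields stored="false", or not indexing them
at all, would shrink the index as well.)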


Re: Questions about Replication Factor on solrcloud

2013-09-06 Thread Shalin Shekhar Mangar
Comments inline:

On Wed, Sep 4, 2013 at 10:38 PM, Lisandro Montaño
lisan...@itivitykids.com wrote:
 Hi all,



 I’m currently working on deploying a SolrCloud distribution on CentOS
 machines and wanted to have more guidance about replication factor
 configuration.



 I have configured two servers with solrcloud over tomcat and a third server
 as zookeeper. I have configured successfully and have one server with
 collection1 available and the other with collection1_Shard1_Replica1.


How did you configure them this way? In particular, I'm confused as to
why there is collection1 on the first node and
collection1_Shard1_Replica1 on the other.



 My questions are:



 -  Can I have 1 shard and 2 replicas on two machines? What are the
 limitations or considerations to define this?

Yes you can have 1 shard and 2 replicas, one each on your two
machines. That is the way it is configured by default. For example,
this can be achieved if you create another collection
(numShards=1&replicationFactor=2) using the Collections API.


 -  How does replica works? (there is not too much info about it)

All replicas (physical shards) are peers who decide on a leader using
ZooKeeper. All updates are routed via the leader who forwards
(versioned) updates to other replicas. A query can be served by any
replica. If a replica goes down, then it will attempt to recover from
the current leader and then start serving requests. If the leader goes
down, then all the other replicas (after waiting for a certain time
for the old leader to come back) decide on a new leader.


 -  When I import data on collection1 it works properly, but when I
 do it in collection1_Shard1_Replica1 it fails. Is that an expected behavior?
 (Maybe if I have a better definition of replica’s I will understand it
 better)


Can you describe how it fails? Stack traces or excerpts from the Solr
logs will help.
-- 
Regards,
Shalin Shekhar Mangar.


Re: How to config SOLR server for spell check functionality

2013-09-06 Thread Shalin Shekhar Mangar
On Wed, Sep 4, 2013 at 4:56 PM, sebastian.manolescu
sebastian.manole...@yahoo.com wrote:
 I want to implement the spell check functionality offered by Solr using a MySQL
 database, but I don't understand how.
 Here is the basic flow of what I want to do.

 I have a simple inputText (in JSF) and if I type the word shwo the response
 to OutputLabel should be show.

 First of all I'm using the following tools and frameworks:

 JBoss application server 6.1.
 Eclipse
 JPA
 JSF(Primefaces)

 Steps I've done until now:

 Step 1: Download solr server from:
 http://lucene.apache.org/solr/downloads.html Extract content.

 Step 2: Add an environment variable:

 Variable name: solr.solr.home  Variable value:
 D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr --- where you have the Solr
 server

 Step 3:

 Open the solr war and add an env-entry to solr.war\WEB-INF\web.xml (the easy way):

 <env-entry>
   <env-entry-name>solr/home</env-entry-name>
   <env-entry-value>D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr</env-entry-value>
   <env-entry-type>java.lang.String</env-entry-type>
 </env-entry>

 OR import the project, change and build the war.

 Step 4: Browser: localhost:8080/solr/

 And the solr console appears.

 Until now all works well.

 I have found some useful code (my opinion) that returns:

 [collection1] webapp=/solr path=/spell
 params={spellcheck=on&q=whatever&wt=javabin&qt=/spell&version=2&spellcheck.build=true}
 hits=0 status=0 QTime=16

 Here is the code that gives the result from above:

 SolrServer solr;
 try {
     solr = new CommonsHttpSolrServer("http://localhost:8080/solr");

     ModifiableSolrParams params = new ModifiableSolrParams();
     params.set("qt", "/spell");
     params.set("q", "whatever");
     params.set("spellcheck", "on");
     params.set("spellcheck.build", "true");

     QueryResponse response = solr.query(params);
     SpellCheckResponse spellCheckResponse =
         response.getSpellCheckResponse();
     if (!spellCheckResponse.isCorrectlySpelled()) {
         for (Suggestion suggestion :
                 response.getSpellCheckResponse().getSuggestions()) {
             System.out.println("original token: " + suggestion.getToken()
                 + " - alternatives: " + suggestion.getAlternatives());
         }
     }
 } catch (Exception e) {
     // TODO Auto-generated catch block
     e.printStackTrace();
 }

 Questions:

 1. How do I make the database connection with my DB and search the content to
 see if there are any words that could match?

You can either write SolrJ code to index data into Solr or you can use
DataImportHandler.

http://wiki.apache.org/solr/DIHQuickStart
http://wiki.apache.org/solr/DataImportHandler

 2. How do I make the configuration (solrconfig.xml, schema.xml, etc.)?

You must first edit the schema.xml according to your data. See
https://cwiki.apache.org/confluence/display/solr/Documents%2C+Fields%2C+and+Schema+Design

 3. How do I send a string from my view (xhtml) so that the Solr server knows
 what to look for?

For search, you can use the SolrJ java client.

https://cwiki.apache.org/confluence/display/solr/Searching
http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr

You seem to have done your homework and have found most of the
resources. We will be able to help you better if you ask
specific questions instead.

-- 
Regards,
Shalin Shekhar Mangar.
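
(On the DataImportHandler route mentioned above, a minimal data-config
sketch for pulling rows out of MySQL -- connection details, table and
column names are all illustrative:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="user" password="pass"/>
  <document>
    <entity name="word" query="SELECT id, word FROM words">
      <field column="id" name="id"/>
      <field column="word" name="word"/>
    </entity>
  </document>
</dataConfig>

Each column must map to a field declared in schema.xml.)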


Re: bucket count for facets

2013-09-06 Thread Steven Bower
Understood. What I need is a count of the unique values in a field, and that
field is multi-valued (which makes the stats component a non-option).


On Fri, Sep 6, 2013 at 4:22 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Stats Component can give you a count of non-null values in a field.

 See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

 On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net
 wrote:
  Is there a way to get the count of buckets (ie unique values) for a field
  facet? the rudimentary approach of course is to get back all buckets, but
  in some cases this is a huge amount of data.
 
  thanks,
 
  steve



 --
 Regards,
 Shalin Shekhar Mangar.
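
(One possible workaround, untested: the Luke request handler reports
per-field term statistics, including a distinct term count, which for a
plain string field corresponds to the number of unique values -- field name
illustrative:

http://localhost:8983/solr/admin/luke?fl=myfield

Whether the reported number matches the intended notion of "unique values"
depends on the field's analysis chain.)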



Restrict Parsing duplicate file in Solr

2013-09-06 Thread shabbir
Hi, I am new to Solr. I am looking for an option for restricting duplicate file
indexing in Solr. Please let me know if it can be done with any configuration
change.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html
Sent from the Solr - User mailing list archive at Nabble.com.


Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread A Geek
hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. 
Basically I've the following data: 
[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array is being used to keep some count, say X, for that particular day.
Currently, I'm using the following field to store this data:

<field name="dataX" type="string" indexed="true" stored="true"
       multiValued="true"/>

and I'm using the Python library pySolr to store the data. Currently the data that
gets stored looks like this (it's an array of strings):

<arr name="dataX"><str>[20121108, 1]</str><str>[20121110, 7]</str>
<str>[2012, 2]</str><str>[20121112, 2]</str><str>[20121113, 2]</str>
<str>[20121116, 1]</str></arr>
Is there a way I can store the 2 dimensional array so that the inner array can
contain int values, like the one shown in the beginning example, such that the
final/stored data in SOLR looks something like:

<arr name="dataX">
  <arr name="index"><int>20121108</int><int>7</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
</arr>
Just a guess: I think for this case we need to add one more field [the index,
for instance] for each inner array, which will again be multivalued (and will
store int values only)? How do I add the actual 2 dimensional array, how do I
pass the inner arrays, and how do I store the full doc that contains this 2
dimensional array? Please help me sort out this issue.
Please share your views and point me in the right direction. Any help would be 
highly appreciated. 
I found similar things on the web, but not the one I'm looking for: 
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks

Re: Solr documents update on index

2013-09-06 Thread Luís Portela Afonso
Hi,

But I'm indexing RSS feeds. I want Solr to index them without changing the
existing information of a document with the same uniqueKey.
The best approach would be for Solr to update the doc if changes are detected, but I
can live without that.

I really would like Solr to not update the document if it already exists.

I'm using the DataImportScheduler to launch the scheduled index.

Appreciate any possible help.

On Sep 6, 2013, at 9:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com 
wrote:

 Yes, if a document with the same key exists, then the old document
 will be deleted and replaced with the new document. You can also
 partially update documents (we call it atomic updates) which reads the
 old document from local index, updates it according to the request and
 then replaces the old document with the new one.
 
 See 
 https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument
 
 On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso
 meligalet...@gmail.com wrote:
 Hi,
 
 I'm having a problem when solr indexes.
 It is updating documents already indexed. Is this a normal behavior?
 If a document with the same key already exists is it supposed to be updated?
 I has thinking that is supposed to just update if the information on the
 rss has changed.
 
 Appreciate your help
 
 --
 Sent from Gmail Mobile
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.





SOLR 3.6.1 auto complete sorting

2013-09-06 Thread Poornima Jay
Hi, 

We have implemented the Auto Complete feature on our site. Below are the Solr config
details.

schema.xml

 <fieldType class="solr.TextField" name="text_auto" positionIncrementGap="100">
   <analyzer type="index">
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
             preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30"
             minGramSize="1"/>
   </analyzer>
   <analyzer type="query">
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
             preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 <field name="dams_id" type="string" indexed="true" stored="true"/>

 <field name="published_date" type="date" indexed="true" stored="false"/>

 <field name="ph_su" type="text_auto" indexed="true" stored="true"
        multiValued="true"/>

 <!-- Copy fields Auto Complete -->
 <copyField source="title" dest="ph_su"/>
 <copyField source="product_catalogue" dest="ph_su"/>
 <copyField source="product_category_name" dest="ph_su"/>
  
The Solr query is:
q=ph_su%3Aepub&start=0&rows=10&fl=dams_id&wt=json&indent=on&hl=true&hl.fl=ph_su&hl.simple.pre=<b>&hl.simple.post=</b>

The requirement is to sort the results based on relevance and latest published
products for the search term.

I have tried the parameters below but nothing worked:

sort = dams_id desc,published_date desc
order_by = dams_id desc,published_date desc

Please let me know how to sort the results with relevance and published date 
descending.

Thanks,
Poornima
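
(A note on this: relevance is exposed as the pseudo-field "score", so
sorting by relevance first and then by recency would be expressed as:

sort=score desc,published_date desc

Fields used for sorting must be indexed, which published_date above is.)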


Re: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread Jack Krupansky
First you need to tell us how you wish to use and query the data. That will 
largely determine how the data must be stored. Give us a few example queries 
of how you would like your application to be able to access the data.


Note that Lucene has only simple multivalued fields - no structure or
nesting within a single field other than a list of scalar values.


But you can always store a complex structure as a BSON blob or JSON string 
if all you want is to store and retrieve it in its entirety without querying 
its internal structure. And note that Lucene queries are field level - does 
a field contain or match a scalar value.


-- Jack Krupansky

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 7:10 AM
To: solr user
Subject: Store 2 dimensional array( of int values) in solr 4.0

hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. 
Basically I've the following data:

[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array being used to keep some count say X for that particular day. 
Currently, I'm using the following field to store this data:
<field name="dataX" type="string" indexed="true" stored="true"
multiValued="true"/>
and I'm using python library pySolr to store the data. Currently the data 
that gets stored looks like this(its array of strings)
<arr name="dataX"><str>[20121108, 1]</str><str>[20121110, 7]</str>
<str>[2012, 2]</str><str>[20121112, 2]</str><str>[20121113, 2]</str>
<str>[20121116, 1]</str></arr>
Is there a way I can store the 2 dimensional array so that the inner array can
contain int values, like the one shown in the beginning example, such that
the final/stored data in SOLR looks something like:

<arr name="dataX">
  <arr name="index"><int>20121108</int><int>7</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
</arr>
Just a guess, I think for this case, we need to add one more field[the index 
for instance], for each inner array which will again be multivalued (which 
will store int values only)? How do I add the actual 2 dimensional array, 
how to pass the inner arrays and how to store the full doc that contains 
this 2 dimensional array. Please help me out sort this issue.
Please share your views and point me in the right direction. Any help would 
be highly appreciated.
I found similar things on the web, but not the one I'm looking for: 
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks 



Re: Restrict Parsing duplicate file in Solr

2013-09-06 Thread Jack Krupansky
Explain what you mean by restricting duplicate file indexing. Solr doesn't work
at the file level - only documents (rows or records) and fields and 
values.


-- Jack Krupansky

-Original Message- 
From: shabbir

Sent: Friday, September 06, 2013 12:24 AM
To: solr-user@lucene.apache.org
Subject: Restrict Parsing duplicate file in Solr

Hi I am new to Solr , I am looking for option of restricting duplicate file
indexing in solr.Please let me know if it can be done with any configuration
change.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html
Sent from the Solr - User mailing list archive at Nabble.com. 
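
(If the goal is to avoid indexing the same content twice, Solr's
deduplication support may be what's wanted: SignatureUpdateProcessorFactory
computes a hash over chosen fields and can overwrite duplicates. A minimal
solrconfig.xml sketch, with the signature and source fields illustrative --
see http://wiki.apache.org/solr/Deduplication:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The signature field has to exist in schema.xml, and the chain is activated
via the update handler's update.chain parameter.)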



Re: charfilter doesn't do anything

2013-09-06 Thread Jack Krupansky
Is there any chance that you changed your schema since you indexed the
data? If so, re-index the data.


If a * query finds nothing, that implies that the default field is empty. 
Are you sure the df parameter is set to the field containing your data? 
Show us your request handler definition and a sample of your actual Solr 
input (Solr XML or JSON?) so that we can see what fields are being 
populated.


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Friday, September 06, 2013 4:01 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

the input string is a normal html page with the word Zahlungsverkehr in it 
and my query is ...solr/collection1/select?q=*






Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
I've managed to get it working if I use the RegexTransformer and the string is on
the same line in my tika entity. But when the string is multilined it isn't
working, even though I tried (?s) to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
        transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
         replaceWith="QQQ" sourceColName="text"/>
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
       replaceWith="QQQ" sourceColName="text"/>

In JavaScript this works, but maybe because I only used a small string.



On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

 Is there any chance that your changed your schema since you indexed the data? 
 If so, re-index the data.
 
 If a * query finds nothing, that implies that the default field is empty. 
 Are you sure the df parameter is set to the field containing your data? 
 Show us your request handler definition and a sample of your actual Solr 
 input (Solr XML or JSON?) so that we can see what fields are being populated.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Friday, September 06, 2013 4:01 AM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 the input string is a normal html page with the word Zahlungsverkehr in it 
 and my query is ...solr/collection1/select?q=*
 



RE: Regarding improving performance of the solr

2013-09-06 Thread Jean-Sebastien Vachon
Have you checked the hit ratio of the different caches? Try to tune them to get 
rid of all evictions if possible.

Tuning the size of the caches and warming your searcher can give you a pretty
good improvement. You might want to check your analysis chain as well to see if
you're doing anything that is not necessary.



 -Original Message-
 From: prabu palanisamy [mailto:pr...@serendio.com]
 Sent: September-06-13 4:55 AM
 To: solr-user@lucene.apache.org
 Subject: Regarding improving performance of the solr
 
  Hi
 
 I am currently using solr -3.5.0,  indexed  wikipedia dump (50 gb) with java
 1.6.
 I am searching the solr with text (which is actually twitter tweets) .
 Currently it takes average time of 210 millisecond for each post, out of which
 200 millisecond is consumed by solr server (QTime).  I used the jconsole
 monitor tool.
 
 The stats are
Heap usage - 10-50Mb,
No of threads - 10-20
No of class- 3800,
Cpu usage - 10-15%
 
 Currently I am loading all the fields of the wikipedia.
 
 I only need the freebase category and wikipedia category. I want to know
 how to optimize the solr server to improve the performance.
 
 Could you please help me out in optimize the performance?
 
 Thanks and Regards
 Prabu
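
(For example, cache sizing and autowarming live in solrconfig.xml; the
numbers below are purely illustrative starting points:

<filterCache class="solr.FastLRUCache" size="4096"
             initialSize="1024" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache" size="1024"
                  initialSize="512" autowarmCount="128"/>

Watching hit ratio and evictions on the admin stats page after each change
shows whether the sizes help.)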
 


Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
It's always frustrating when someone replies with "Why not do it
a completely different way?". But I will anyway :).

There's no requirement at all that you send things to Solr to make
Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ
anyway, why not just parse on the client? This has the advantage
of allowing you to offload the Tika processing from Solr which can
be quite expensive. You can use the same Tika jars that come
with Solr or download whatever version from the Tika project
you want. That way, you can exercise much better control over
what's done.

Here's a skeletal program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick


On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:

 Is it possible to configure solr cell to only extract and store the body of
 a document when indexing?  I'm currently doing the following which I
 thought would work

 ModifiableSolrParams params = new ModifiableSolrParams();

 params.set("defaultField", "content");

 params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");

 ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
     "/update/extract");

 up.setParams(params);

 FileStream f = new FileStream(new File(..));

 up.addContentStream(f);

 up.setAction(ACTION.COMMIT, true, true);

 solrServer.request(up);


 But the result of content is as follows

 <arr name="content_mvtxt">
   <str/>
   <str>null</str>
   <str>ISO-8859-1</str>
   <str>text/plain; charset=ISO-8859-1</str>
   <str>Just a little test</str>
 </arr>


 What I had hoped for was just

 <arr name="content_mvtxt">
   <str>Just a little test</str>
 </arr>
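
(If the goal is to keep Tika's metadata out of the document entirely, one
server-side option is the extraction handler's uprefix/fmap parameters -- a
solrconfig.xml sketch, assuming the stock /update/extract handler and an
"ignored" dynamic field in the schema:

<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content_mvtxt</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

With <dynamicField name="ignored_*" type="ignored"/> in schema.xml, the
unknown metadata fields are dropped instead of indexed.)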



Facet Count and RegexTransformer splitBy

2013-09-06 Thread Raheel Hasan
Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure if it will work for Facet Count?

For example:
From query results (say this is the values in that field):
row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so facet count will get me:
'1' = 3 occurrence
'2' = 2 occur.
.
.
.and so on.





-- 
Regards,
Raheel Hasan
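
(For what it's worth, the split-and-facet combination generally works as
hoped when the target field is multiValued. A sketch of the DIH side --
entity, table and column names illustrative:

<entity name="item" transformer="RegexTransformer"
        query="SELECT id, tags FROM items">
  <field column="tags" splitBy=","/>
</entity>

Each token then becomes one value of the multivalued field, and faceting
with facet=true&facet.field=tags counts occurrences per value, as in the
example above.)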


RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-06 Thread Austin Rasmussen
Thanks for clearing that up Erick.  The updateLog XML element isn't present in 
any of the solrconfig.xml files, so I don't believe this is enabled.  

I posted the directory listing of all of the core data directories in a prior
post, but there are no files/folders found that contain tlog in their
names.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, September 06, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

bq: I'm actually not using the transaction log (or the 
NRTCachingDirectoryFactory); it's currently set up to use the 
MMapDirectoryFactory,

This isn't relevant to whether you're using the update log or not, this is just 
how the index is handled. Look for something in your solrconfig.xml
like:
 <updateLog>
   <str name="dir">${solr.ulog.dir:}</str>
 </updateLog>

The other thing to check is if you have files in a tlog directory that's a 
sibling to your index directory as Hoss suggested.

You may well NOT have any transaction log, but it's something to check.



Re: solrcloud shards backup/restoration

2013-09-06 Thread Mark Miller
I don't know that it's too bad though - it's always been the case that if you do 
a backup while indexing, it's just going to get up to the last hard commit. 
With SolrCloud that will still be the case. So just make sure you do a hard 
commit right before taking the backup - yes, it might miss a few docs in the 
tran log, but if you are taking a back up while indexing, you don't have great 
precision in any case - you will roughly get a snapshot for around that time - 
even without SolrCloud, if you are worried about precision and getting every 
update into that backup, you want to stop indexing and commit first. But if you 
just want a rough snapshot for around that time, in both cases you can still 
just don't hard commit and take a snapshot. 

Mark

Sent from my iPhone

On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar shalinman...@gmail.com 
wrote:

 The replication handler's backup command was built for pre-SolrCloud.
 It takes a snapshot of the index but it is unaware of the transaction
 log which is a key component in SolrCloud. Hence unless you stop
 updates, commit your changes and then take a backup, you will likely
 miss some updates.
 
 That being said, I'm curious to see how peer sync behaves when you try
 to restore from a snapshot. When you say that you haven't been
 successful in restoring, what exactly is the behaviour you observed?
 
 On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com 
 wrote:
 Hello,
 
 I was looking for a good backup / recovery solution for the solrcloud
 indexes. I am more looking for restoring the indexes from the index
 snapshot, which can be taken using the replicationHandler's backup command.
 
 I am looking for something that works with solrcloud 4.3 eventually, but
 still relevant if you tested with a previous version.
 
 I haven't been successful in have the restored index replicate across the
 new replicas, after I restart all the nodes, with one node having the
 restored index.
 
 Is restoring the indexes on all the nodes the best way to do it ?
 --
 Regards,
 -Aditya Sakhuja
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.


Re: solrcloud shards backup/restoration

2013-09-06 Thread Mark Miller
Phone typing. The end should not say "don't hard commit" - it should say "do a
hard commit and take a snapshot".

Mark

Sent from my iPhone

On Sep 6, 2013, at 7:26 AM, Mark Miller markrmil...@gmail.com wrote:

 I don't know that it's too bad though - its always been the case that if you 
 do a backup while indexing, it's just going to get up to the last hard 
 commit. With SolrCloud that will still be the case. So just make sure you do 
 a hard commit right before taking the backup - yes, it might miss a few docs 
 in the tran log, but if you are taking a back up while indexing, you don't 
 have great precision in any case - you will roughly get a snapshot for around 
 that time - even without SolrCloud, if you are worried about precision and 
 getting every update into that backup, you want to stop indexing and commit 
 first. But if you just want a rough snapshot for around that time, in both 
 cases you can still just don't hard commit and take a snapshot. 
 
 Mark
 
 Sent from my iPhone
 


Re: Solr substring search

2013-09-06 Thread Erick Erickson
Yah, you're getting away with it due to the small data size. As
your data grows, the underlying mechanisms have to enumerate
every term in the field in order to find terms that match, so it
can get _very_ expensive with large data sets.

Best to bite the bullet early or, better yet, see if you really need
to support this use-case.

Best,
Erick


On Fri, Sep 6, 2013 at 2:58 AM, Alvaro Cabrerizo topor...@gmail.com wrote:

 Hi:

 I would start looking:

 http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser

 And the
 org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java

 Hope it helps.

 On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider 
 scott_schnei...@symantec.com wrote:

  Hello,
 
  I'm trying to find out how Solr runs a query for *foo*.  Google tells
 me
  that you need to use NGramFilterFactory for that kind of substring
 search,
  but I find that even with very simple fieldTypes, it just works.
  (Perhaps
  because I'm testing on very small data sets, Solr is willing to look
  through all the keywords.)  e.g. This works on the tutorial.
 
  Can someone tell me exactly how this works and/or point me to the Lucene
  code that implements this?
 
  Thanks,
  Scott
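
(For the record, the wildcard-free way to support substring search is an
n-gram analysis chain at index time. A minimal sketch, with gram sizes
purely illustrative:

<fieldType name="text_substring" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The index grows considerably, but a plain query for foo then matches any
term containing that substring, without the term enumeration Erick
describes.)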
 
 



RE: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread A Geek
Hi, thanks for the quick reply. Sure, please find below the details as per your
query.
Essentially, I want to retrieve the doc through JSON [using JSON format as the SOLR
result output] and want JSON to present the data from the dataX field as a two
dimensional array of ints. When I store the data as shown below, it shows up in
JSON as an array of strings, where each internal array is shown as a string
(because that's how the field is configured and stored, not finding any
other option). The following is the current JSON output that I'm able to fetch:
"dataX":["[20130614, 2]","[20130615, 11]","[20130616, 1]","[20130617, 1]","[20130619, 8]","[20130620, 5]","[20130623, 5]"]
whereas I want to fetch dataX as something like:
"dataX":[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 8],[20130620, 5],[20130623, 5]]
As can be seen, dataX is essentially a 2D array where the internal array is
of two ints, one being the date and the other the count.
Please point me in the right direction. Appreciate your time.
Thanks.

 From: j...@basetechnology.com
 To: solr-user@lucene.apache.org
 Subject: Re: Store 2 dimensional array( of int values) in solr 4.0
 Date: Fri, 6 Sep 2013 08:44:06 -0400
 
 First you need to tell us how you wish to use and query the data. That will 
 largely determine how the data must be stored. Give us a few example queries 
 of how you would like your application to be able to access the data.
 
 Note that Lucene has only simple multivalued fields - no structure or 
 nesting within a single field other that a list of scalar values.
 
 But you can always store a complex structure as a BSON blob or JSON string 
 if all you want is to store and retrieve it in its entirety without querying 
 its internal structure. And note that Lucene queries are field level - does 
 a field contain or match a scalar value.
 
 -- Jack Krupansky
 
 
  

Re: Invalid Version when slave node pull replication from master node

2013-09-06 Thread Erick Erickson
Whoa! You should _not_ be using replication with SolrCloud. You can use
replication just fine with 4.4, just like you would have in 3.x say, but in
that case you should not be using the zkHost or zkRun parameters, should not
have a ZooKeeper ensemble running etc.

In SolrCloud, all updates are routed to all the nodes at index time,
otherwise
it couldn't support, say, NRT processing. This makes replication not only
unnecessary, but I wouldn't want to try to predict what problems that would
cause.

So keep a sharp distinction between running Solr 4x and SolrCloud. The
latter
is specifically enabled when you specify zkHost or zkRun when you start Solr
as per the SolrCloud page.

Best
Erick


On Wed, Sep 4, 2013 at 11:32 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote:

 Hi all
I solve the problem by add the coreName explicitly according to
 http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml.

But I want to make sure about that is it necessary to set the coreName
 explicitly. Is there any SolrJ API to pull the replication on the slave
 node from the master node?


 regards



 2013/9/5 YouPeng Yang yypvsxf19870...@gmail.com

  Hi again
 
I'm  using Solr4.4.
 
 
  2013/9/5 YouPeng Yang yypvsxf19870...@gmail.com
 
   Hi solr users

  I'm testing the replication within SolrCloud.
  I just uncommented the replication section separately on the master and
   slave nodes.
  The replication section settings on the master node:
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
    and on the slave node:
  <lst name="slave">
    <str name="masterUrl">http://10.7.23.124:8080/solr/#/</str>
    <str name="pollInterval">00:00:50</str>
  </lst>
 
  After startup, an error comes out on the slave node:
  80110110 [snapPuller-70-thread-1] ERROR
  org.apache.solr.handler.SnapPuller - Master at:
  http://10.7.23.124:8080/solr/#/ is not available. Index fetch failed.
  Exception: Invalid version (expected 2, but 60) or the data in not in
  'javabin' format
 
 
   Could anyone help me to solve the problem ?
 
 
  regards
 
 
 
 
 



Re: Tweaking boosts for more search results variety

2013-09-06 Thread Sai Gadde
Thank you Jack for the suggestion.

We can try grouping by site. But considering that the number of sites is only
about 1000 against an index size of 5 million, one can expect most of the
hits would be hidden, and for certain specific keywords only a handful of
actual results could be displayed if results are grouped by site.

We already group on a signature field to identify duplicate content in
these 5 million+ docs. But here the number of duplicates is only about
3-5% maximum.

Is there any workaround for these limitations with grouping?

Thanks
Shyam



On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky j...@basetechnology.com wrote:

 The grouping (field collapsing) feature somewhat addresses this - group by
 a site field and then if more than one or a few top pages are from the
 same site they get grouped or collapsed so that you can see more sites in a
 few results.

 See:
 http://wiki.apache.org/solr/FieldCollapsing
 https://cwiki.apache.org/confluence/display/solr/Result+Grouping

 -- Jack Krupansky

 -Original Message- From: Sai Gadde
 Sent: Thursday, September 05, 2013 2:27 AM
 To: solr-user@lucene.apache.org
 Subject: Tweaking boosts for more search results variety


 Our index is aggregated content from various sites on the web. We want good
 user experience by showing multiple sites in the search results. In our
 setup we are seeing most of the results from same site on the top.

 Here is some information regarding queries and schema
site - String field. We have about 1000 sites in index
sitetype - String field.  we have 3 site types
 omitNorms=true for both the fields

 Doc count varies largely based on site and sitetype by a factor of 10 -
 1000 times
 Total index size is about 5 million docs.
 Solr Version: 4.0

 In our queries we have a fixed and preferential boost for certain sites.
 sitetype has different and fixed boosts for 3 possible values. We turned
 off Inverse Document Frequency (IDF) for these boosts to work properly.
 Other text fields are boosted based on search keywords only.

 With this setup we often see a bunch of hits from a single site followed by
 next etc.,
 Is there any solution to see results from variety of sites and still keep
 the preferential boosts in place?



Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Erick Erickson
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi Mark,

 Got an issue to watch?

 Thanks,
 Markus

 -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Wednesday 4th September 2013 16:55
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  I'm going to try and fix the root cause for 4.5 - I've suspected what it
 is since early this year, but it's never personally been an issue, so it's
 rolled along for a long time.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
   Hey guys,
  
   I am looking into an issue we've been having with SolrCloud since the
   beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
 4.4.0
   yet). I've noticed other users with this same issue, so I'd really
 like to
   get to the bottom of it.
  
   Under a very, very high rate of updates (2000+/sec), after 1-12 hours
 we
   see stalled transactions that snowball to consume all Jetty threads in
 the
   JVM. This eventually causes the JVM to hang with most threads waiting
 on
   the condition/stack provided at the bottom of this message. At this
 point
    SolrCloud instances then start to see their neighbors (who also have all
    threads hung) as "down" w/Connection Refused, and the shards become "down"
    in state. Sometimes a node or two survives and just returns 503s "no server
    hosting shard" errors.
  
   As a workaround/experiment, we have tuned the number of threads sending
   updates to Solr, as well as the batch size (we batch updates from
  client ->
   solr), and the Soft/Hard autoCommits, all to no avail. Turning off
   Client-to-Solr batching (1 update = 1 call to Solr), which also did not
   help. Certain combinations of update threads and batch sizes seem to
   mask/help the problem, but not resolve it entirely.
  
   Our current environment is the following:
   - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
   - 3 x Zookeeper instances, external Java 7 JVM.
   - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
 and
   a replica of 1 shard).
   - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
 good
   day.
   - 5000 max jetty threads (well above what we use when we are healthy),
   Linux-user threads ulimit is 6000.
   - Occurs under Jetty 8 or 9 (many versions).
   - Occurs under Java 1.6 or 1.7 (several minor versions).
   - Occurs under several JVM tunings.
   - Everything seems to point to Solr itself, and not a Jetty or Java
 version
   (I hope I'm wrong).
  
   The stack trace that is holding up all my Jetty QTP threads is the
   following, which seems to be waiting on a lock that I would very much
 like
   to understand further:
  
   java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for  0x0007216e68d8 (a
   java.util.concurrent.Semaphore$NonfairSync)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
  at
  
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
  at
  
 org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
  at
  
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
  at
  
 org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
  at
  
 org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
  at
  
 org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
  at
  
 org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
  at
  
 org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
  at
  
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
  at
  
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
  at
  
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at
  
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
  at
  
 

Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-06 Thread Erick Erickson
bq: I'm actually not using the transaction log (or the
NRTCachingDirectoryFactory); it's currently set up to use the
MMapDirectoryFactory,

This isn't relevant to whether you're using the update log or not; this is
just how the index is handled. Look for something in your solrconfig.xml
like:
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

The other thing to check is if you have files in a tlog directory that's
a sibling to your index directory, as Hoss suggested.

You may well NOT have any transaction log, but it's something to check.



Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Jack Krupansky
Facet counts are per field - your counts are scattered across different 
fields.


There are additional capabilities in the facet component, but first you 
should describe exactly what your requirements are.


-- Jack Krupansky
-Original Message- 
From: Raheel Hasan

Sent: Friday, September 06, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Facet Count and RegexTransformersplitBy

Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure that it will work for facet counts.

For example:

From query results (say this is the values in that field):

row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so the facet count should give me:
'1' = 3 occurrences
'2' = 2 occurrences
.
.
.and so on.
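
The facet request I plan to use looks something like this (assuming the split
values end up in a multiValued field named vals; the names are just examples):

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=vals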





--
Regards,
Raheel Hasan 



Re: charfilter doesn't do anything

2013-09-06 Thread Shawn Heisey
On 9/6/2013 7:09 AM, Andreas Owen wrote:
 i've managed to get it working if i use the regexTransformer and the string is 
 on the same line in my tika entity. but when the string is multilined it isn't 
 working, even though i tried ?s to set the dotall flag.
 
 <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" 
 dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
 transformer="RegexTransformer">
   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" 
 replaceWith="QQQ" sourceColName="text" />
 </entity>
   
 then i tried it like this and i get a stackoverflow
 
 <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" 
 replaceWith="QQQ" sourceColName="text" />
 
 in javascript this works but maybe because i only used a small string.

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.
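
For example, here's a minimal sketch of a fieldType that strips the HTML
markup but still allows word-level search (the names are illustrative,
untested):

<fieldType name="text_html_strip" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- remove HTML/XML tags before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- split on whitespace/punctuation so individual words are searchable -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>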

Thanks,
Shawn



Re: Store 2 dimensional array( of int values) in solr 4.0

2013-09-06 Thread Jack Krupansky

You still haven't supplied any queries.

If all you really need is the JSON as a blob, simply store it as a string 
and parse the JSON in your application layer.
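
For example (a sketch; the field name is illustrative):

<field name="dataX_json" type="string" indexed="false" stored="true"/>

Store the whole 2D array as a single JSON string in that field, and decode it 
client-side after retrieval.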


-- Jack Krupansky

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 10:30 AM
To: solr user
Subject: RE: Store 2 dimensional array( of int values) in solr 4.0

Hi, thanks for the quick reply. Sure, please find below the details as per 
your query.
Essentially, I want to retrieve the doc through JSON [using JSON format as 
SOLR result output] and want JSON to present the data from the dataX field 
as a two dimensional array of ints. When I store the data as shown below, it 
shows up in JSON as an array of strings, where each inner array is rendered 
as a string (because that's how the field is configured and how I'm storing 
it, not finding any other option). Following is the current JSON output that 
I'm able to fetch:
"dataX": ["[20130614, 2]", "[20130615, 11]", "[20130616, 1]", "[20130617, 1]", "[20130619, 8]", "[20130620, 5]", "[20130623, 5]"]

whereas I want to fetch dataX as something like:
"dataX": [[20130614, 2], [20130615, 11], [20130616, 1], [20130617, 1], [20130619, 8], [20130620, 5], [20130623, 5]]
as can be seen, dataX is essentially a 2D array where each inner array holds 
two ints, one being a date and the other being a count.

Please point me in the right direction. Appreciate your time.
Thanks.


From: j...@basetechnology.com
To: solr-user@lucene.apache.org
Subject: Re: Store 2 dimensional array( of int values) in solr 4.0
Date: Fri, 6 Sep 2013 08:44:06 -0400

First you need to tell us how you wish to use and query the data. That 
will
largely determine how the data must be stored. Give us a few example 
queries

of how you would like your application to be able to access the data.

Note that Lucene has only simple multivalued fields - no structure or
nesting within a single field other that a list of scalar values.

But you can always store a complex structure as a BSON blob or JSON string
if all you want is to store and retrieve it in its entirety without 
querying
its internal structure. And note that Lucene queries are field level - 
does

a field contain or match a scalar value.

-- Jack Krupansky

-Original Message- 
From: A Geek

Sent: Friday, September 06, 2013 7:10 AM
To: solr user
Subject: Store 2 dimensional array( of int values) in solr 4.0

hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0].
Basically I have the following data:
[[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ...

The inner array is being used to keep some count, say X, for that particular 
day.

Currently, I'm using the following field to store this data:
<field name="dataX" type="string" indexed="true" stored="true"
multiValued="true"/>
and I'm using the python library pySolr to store the data. Currently the data
that gets stored looks like this (it's an array of strings):
<arr name="dataX"><str>[20121108, 1]</str><str>[20121110,
7]</str><str>[2012, 2]</str><str>[20121112, 2]</str><str>[20121113,
2]</str><str>[20121116, 1]</str></arr>
Is there a way I can store the 2 dimensional array such that the inner arrays 
contain int values, like the one shown in the beginning example, and the 
final/stored data in SOLR looks something like:
<arr name="dataX">
  <arr name="index"><int>20121108</int><int>7</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
</arr>
Just a guess: I think for this case we need to add one more field [the index, 
for instance] for each inner array, which will again be multivalued (and 
will store int values only)? How do I add the actual 2 dimensional array, 
how do I pass the inner arrays, and how do I store the full doc that contains
this 2 dimensional array? Please help me sort out this issue.
Please share your views and point me in the right direction. Any help 
would

be highly appreciated.
I found similar things on the web, but not the one I'm looking for:
http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html
Thanks






Re: Regarding improving performance of the solr

2013-09-06 Thread Shawn Heisey
On 9/6/2013 2:54 AM, prabu palanisamy wrote:
 I am currently using solr 3.5.0, and have indexed a wikipedia dump (50 gb) with
 java 1.6.
 I am searching solr with text (which is actually twitter tweets).
 Currently it takes an average of 210 milliseconds for each post, of which
 200 milliseconds are consumed by the solr server (QTime). I used the
 jconsole monitoring tool.

If the size of all your Solr indexes on disk is in the 50GB range of
your wikipedia dump, then for ideal performance, you'll want to have
50GB of free memory so the OS can cache your index.  You might be able
to get by with 25-30GB of free memory, depending on your index composition.

Note that this is memory over and above what you allocate to the Solr
JVM, and memory used by other processes on the machine.  If you do have
other services on the same machine, note that those programs might ALSO
require OS disk cache RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
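
As a rough worked example (numbers are illustrative): on a 64GB machine
running only Solr with an 8GB heap, roughly 64 - 8 = 56GB is left over for
the OS disk cache, which comfortably covers a 50GB index.  On a 32GB machine
the same setup leaves roughly 24GB, which is in the "might get by" range
described above.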

Thanks,
Shawn



Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
ok, i have html pages of the form <html>... <!--body--> content i want 
<!--/body--> ... </html>. i want to extract (index, store) only what is 
between the body comments. i thought regexTransformer would be best because 
xpath doesn't work in tika and i can't nest an XPathEntityProcessor to use 
xpath. what i have also found out is that the htmlparser from tika cuts my 
body comments out and tries to make well-formed html, which i would like to 
switch off.
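
the next thing i will try is the inline dotall flag with a reluctant 
quantifier, which should avoid the deep regex backtracking that blew the 
stack (a sketch, untested):

<field column="text_html" regex="(?s)&lt;!--body--&gt;(.*?)&lt;!--/body--&gt;" sourceColName="text"/>

without replaceWith, the RegexTransformer should put the content of group 1 
into text_html.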

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

 On 9/6/2013 7:09 AM, Andreas Owen wrote:
 i've managed to get it working if i use the regexTransformer and the string is 
 on the same line in my tika entity. but when the string is multilined it 
 isn't working, even though i tried ?s to set the dotall flag.
 
 <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" 
 dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
 transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" 
 replaceWith="QQQ" sourceColName="text" />
 </entity>
  
 then i tried it like this and i get a stackoverflow
 
 <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" 
 replaceWith="QQQ" sourceColName="text" />
 
 in javascript this works but maybe because i only used a small string.
 
 Sounds like we've got an XY problem here.
 
 http://people.apache.org/~hossman/#xyproblem
 
 How about you tell us *exactly* what you'd actually like to have happen
 and then we can find a solution for you?
 
 It sounds a little bit like you're interested in stripping all the HTML
 tags out.  Perhaps the HTMLStripCharFilter?
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Something that I already said: By using the KeywordTokenizer, you won't
 be able to search for individual words on your HTML input.  The entire
 input string is treated as a single token, and therefore ONLY exact
 entire-field matches (or certain wildcard matches) will be possible.
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
 
 Note that no matter what you do to your data with the analysis chain,
 Solr will always return the text that was originally indexed in search
 results.  If you need to affect what gets stored as well, perhaps you
 need an Update Processor.
 
 Thanks,
 Shawn



CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
Has anyone ever hit this when adding documents to SOLR?  What does it mean?


ERROR [http-8983-6] 2013-09-06 10:09:32,700 SolrException.java (line 108)
org.apache.solr.common.SolrException: Invalid CRLF

at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:175)

at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)

at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:663)

at
com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.execute(CassandraDispatchFilter.java:176)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)

at
com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.doFilter(CassandraDispatchFilter.java:139)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.audit.SolrHttpAuditLogFilter.doFilter(SolrHttpAuditLogFilter.java:194)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.index.solr.auth.CassandraAuthorizationFilter.doFilter(CassandraAuthorizationFilter.java:95)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
com.datastax.bdp.cassandra.index.solr.auth.DseAuthenticationFilter.doFilter(DseAuthenticationFilter.java:102)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)

at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)

at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)

at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)

at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)

at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)

at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)

at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)

at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)

at java.lang.Thread.run(Thread.java:722)

Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF

at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)

at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)

at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:387)

at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)

at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)

... 30 more

Caused by: java.io.IOException: Invalid CRLF

at
org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)

at
org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)

at
org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)

at org.apache.coyote.Request.doRead(Request.java:428)

at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)

at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:403)

at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)

at
org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)

at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)

at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)

at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)

at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)

at
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)

at
com.ctc.wstx.sr.StreamScanner.loadMoreFromCurrent(StreamScanner.java:1046)

at com.ctc.wstx.sr.StreamScanner.parseLocalName2(StreamScanner.java:1796)

at com.ctc.wstx.sr.StreamScanner.parseLocalName(StreamScanner.java:1756)

at
com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2914)

at
com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)

at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)

... 33 more


Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Jack Krupansky
You're not being clear here - are the commas delimiting fields or do you 
have one value per row?


Yes, you can tokenize a comma-delimited value in Solr.
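
For example, here's a sketch of a fieldType that splits on commas at index 
time (the names are illustrative, untested):

<fieldType name="csv_values" class="solr.TextField">
  <analyzer>
    <!-- each comma-separated value becomes its own term -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
    <!-- drop stray whitespace around the values -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Faceting on a field of this type (facet.field=yourField) then counts each 
value independently.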

-- Jack Krupansky

-Original Message- 
From: Raheel Hasan

Sent: Friday, September 06, 2013 11:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Facet Count and RegexTransformersplitBy

Hi,

What I want is very simple:

The query results:
row 1 = a,b,c,d
row 2 = a,f,r,e
row 3 = a,c,ff,e,b
..

facet count needed:
'a' = 3 occurrences
'b' = 2 occurrences
'c' = 2 occurrences
.
.
.


I searched and found a solution here:
http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

But I want to be sure if it will work.



On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky 
j...@basetechnology.com wrote:



Facet counts are per field - your counts are scattered across different
fields.

There are additional capabilities in the facet component, but first you
should describe exactly what your requirements are.

-- Jack Krupansky
-Original Message- From: Raheel Hasan
Sent: Friday, September 06, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Facet Count and RegexTransformersplitBy


Hi guyz,

Just a quick question:

I have a field that has CSV values in the database. So I will use the
DataImportHandler and will index it using RegexTransformer's splitBy
attribute. However, since this is the first time I am doing it, I just
wanted to be sure if it will work for Facet Count?

For example:
From query results (say this is the values in that field):
row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
.
.
.
.
so facet count will get me:
'1' = 3 occurrence
'2' = 2 occur.
.
.
.and so on.





--
Regards,
Raheel Hasan





--
Regards,
Raheel Hasan 



Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Raheel Hasan
It's a csv from the database. I will import it like this (say, for example,
the field is 'emailids' and it contains a csv of email ids):
<field column="mailId" splitBy="," sourceColName="emailids"/>



On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky j...@basetechnology.com wrote:

 You're not being clear here - are the commas delimiting fields or do you
 have one value per row?

 Yes, you can tokenize a comma-delimited value in Solr.


 -- Jack Krupansky

 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 11:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Facet Count and RegexTransformersplitBy


 Hi,

 What I want is very simple:

 The query results:
 row 1 = a,b,c,d
 row 2 = a,f,r,e
 row 3 = a,c,ff,e,b
 ..

 facet count needed:
 'a' = 3 occurrence
 'b' = 2 occur.
 'c' = 2 occur.
 .
 .
 .


 I searched and found a solution here:
 http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

 But I want to be sure if it will work.



 On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  Facet counts are per field - your counts are scattered across different
 fields.

 There are additional capabilities in the facet component, but first you
 should describe exactly what your requirements are.

 -- Jack Krupansky
 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 9:58 AM
 To: solr-user@lucene.apache.org
 Subject: Facet Count and RegexTransformersplitBy


 Hi guyz,

 Just a quick question:

 I have a field that has CSV values in the database. So I will use the
 DataImportHandler and will index it using RegexTransformer's splitBy
 attribute. However, since this is the first time I am doing it, I just
 wanted to be sure if it will work for Facet Count?

 For example:
 From query results (say this is the values in that field):
 row 1 = 1,2,3,4
 row 2 = 1,4,5,3
 row 3 = 2,1,20,66
 .
 .
 .
 .
 so facet count will get me:
 '1' = 3 occurrence
 '2' = 2 occur.
 .
 .
 .and so on.





 --
 Regards,
 Raheel Hasan




 --
 Regards,
 Raheel Hasan




-- 
Regards,
Raheel Hasan


Connection Established but waiting for response for a long time.

2013-09-06 Thread qungg
Hi,

I'm running solr 4.0 but using a legacy distributed search setup. I set the
shards parameter for search, but index into each solr shard directly.
The problem I have been experiencing is building connections with solr
shards. If I run a query, using wget, to get the number of records from each
individual shard (50 of them) sequentially, the request will hang at some
shards (seemingly at random). The wget log will say the connection is established
but it is waiting for a response. At that point I thought the Solr shard might
be under high load, but the strange behavior is that when I send another
request to the same shard (using wget again) from another thread, the
response comes back, and this triggers something in Solr to send back the
response for the first request I sent before. 

This also happens in my daily indexing. If I send a commit, it will sometimes
hang. However, if I send another commit to the same shard, both
commits will come back fine.

I'm running Solr on the stock jetty server, and some time back my boss told me to
set the maxIdleTime to 5000 for indexing purposes. I'm not sure if this
has anything to do with the strange behavior that I'm seeing right now. 

Please help me resolve this issue.

Thanks,
Qun





Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Raheel Hasan
let me further elaborate:
[dbtable1]
field1 = int
field2= string (solr indexing = true)
field3 = csv

[During import into solr]
splitBy=,

[After import]
solr will be searched for terms from field2.

[needed]
counts of occurrences of each value in the csv



 On Fri, Sep 6, 2013 at 9:35 PM, Raheel Hasan raheelhasan@gmail.com wrote:

 It's a csv from the database. I will import it like this (say, for example,
 the field is 'emailids' and it contains a csv of email ids):
 <field column="mailId" splitBy="," sourceColName="emailids"/>



 On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky j...@basetechnology.com wrote:

 You're not being clear here - are the commas delimiting fields or do you
 have one value per row?

 Yes, you can tokenize a comma-delimited value in Solr.


 -- Jack Krupansky

 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 11:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Facet Count and RegexTransformersplitBy


 Hi,

 What I want is very simple:

 The query results:
 row 1 = a,b,c,d
 row 2 = a,f,r,e
 row 3 = a,c,ff,e,b
 ..

 facet count needed:
 'a' = 3 occurrence
 'b' = 2 occur.
 'c' = 2 occur.
 .
 .
 .


 I searched and found a solution here:
 http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

 But I want to be sure if it will work.



 On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  Facet counts are per field - your counts are scattered across different
 fields.

 There are additional capabilities in the facet component, but first you
 should describe exactly what your requirements are.

 -- Jack Krupansky
 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 9:58 AM
 To: solr-user@lucene.apache.org
 Subject: Facet Count and RegexTransformersplitBy


 Hi guyz,

 Just a quick question:

 I have a field that has CSV values in the database. So I will use the
 DataImportHandler and will index it using RegexTransformer's splitBy
 attribute. However, since this is the first time I am doing it, I just
 wanted to be sure if it will work for Facet Count?

 For example:
 From query results (say this is the values in that field):
 row 1 = 1,2,3,4
 row 2 = 1,4,5,3
 row 3 = 2,1,20,66
 .
 .
 .
 .
 so facet count will get me:
 '1' = 3 occurrence
 '2' = 2 occur.
 .
 .
 .and so on.





 --
 Regards,
 Raheel Hasan




 --
 Regards,
 Raheel Hasan




 --
 Regards,
 Raheel Hasan




-- 
Regards,
Raheel Hasan


Re: CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
Thanks.  I realized there's an error in the ChunkedInputFilter...

I'm not sure if this means there's a bug in the client library I'm using
(solrj 4.3) or is a bug in the server SOLR 4.3?  Or is there something in
my data that's causing the issue?


On Fri, Sep 6, 2013 at 1:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Has anyone ever hit this when adding documents to SOLR?  What does it
 mean?

 Always check for the root cause...

 : Caused by: java.io.IOException: Invalid CRLF
 :
 : at
 :
 org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)

 ...so while Solr is trying to read XML off the InputStream from the
 client, an error is encountered by the ChunkedInputFilter.

 I suspect the client library you are using for the HTTP connection is
 claiming it's using chunking but isn't, or is doing something wrong with
 the chunking, or there is a bug in the ChunkedInputFilter.


 -Hoss



Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Raheel Hasan
basically, a field having a csv... and I need to find counts / the number of
occurrences of each csv value.


On Fri, Sep 6, 2013 at 8:54 PM, Raheel Hasan raheelhasan@gmail.com wrote:

 Hi,

 What I want is very simple:

 The query results:
 row 1 = a,b,c,d
 row 2 = a,f,r,e
 row 3 = a,c,ff,e,b
 ..

 facet count needed:
 'a' = 3 occurrence
 'b' = 2 occur.
 'c' = 2 occur.
 .
 .
 .


 I searched and found a solution here:

 http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

 But I want to be sure if it will work.



 On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote:

 Facet counts are per field - your counts are scattered across different
 fields.

 There are additional capabilities in the facet component, but first you
 should describe exactly what your requirements are.

 -- Jack Krupansky
 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 9:58 AM
 To: solr-user@lucene.apache.org
 Subject: Facet Count and RegexTransformersplitBy


 Hi guyz,

 Just a quick question:

 I have a field that has CSV values in the database. So I will use the
 DataImportHandler and will index it using RegexTransformer's splitBy
 attribute. However, since this is the first time I am doing it, I just
 wanted to be sure if it will work for Facet Count?

 For example:
 From query results (say this is the values in that field):
 row 1 = 1,2,3,4
 row 2 = 1,4,5,3
 row 3 = 2,1,20,66
 .
 .
 .
 .
 so facet count will get me:
 '1' = 3 occurrence
 '2' = 2 occur.
 .
 .
 .and so on.





 --
 Regards,
 Raheel Hasan




 --
 Regards,
 Raheel Hasan




-- 
Regards,
Raheel Hasan


Re: CRLF Invalid Exception ?

2013-09-06 Thread Chris Hostetter

: Has anyone ever hit this when adding documents to SOLR?  What does it mean?

Always check for the root cause...

: Caused by: java.io.IOException: Invalid CRLF
: 
: at
: 
org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)

...so while Solr is trying to read XML off the InputStream from the 
client, an error is encountered by the ChunkedInputFilter.  

I suspect the client library you are using for the HTTP connection is 
claiming it's using chunking but isn't, or is doing something wrong with 
the chunking, or there is a bug in the ChunkedInputFilter.


-Hoss


SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Raúl Cardozo
I'm migrating from 3.x to 4.x and I'm running some queries to verify that
everything works like before. I've found, however, that the query galaxy s3
returns far fewer results: in 3.x numFound=1628, in 4.x numFound=70.

Here's the relevant schema part:

<fieldtype name="text_pt" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
   <analyzer type="index">
   <charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="-" replacement="IIIHYPHENIII"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.PatternReplaceFilterFactory"
pattern="IIIHYPHENIII" replacement="-"/>
   <filter class="solr.ASCIIFoldingFilterFactory" />
   <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" preserveOriginal="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="false"
words="portugueseStopWords.txt"/>
   <filter class="solr.BrazilianStemFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
   <charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="-" replacement="IIIHYPHENIII"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.PatternReplaceFilterFactory"
pattern="IIIHYPHENIII" replacement="-"/>
   <filter class="solr.ASCIIFoldingFilterFactory" />
   <filter class="solr.SynonymFilterFactory" ignoreCase="true"
synonyms="portugueseSynonyms.txt" expand="true"/>
   <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
preserveOriginal="1" catenateNumbers="0" catenateAll="0"
protected="protwords.txt"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="false"
words="portugueseStopWords.txt"/>
   <filter class="solr.BrazilianStemFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldtype>

The synonyms involved in this query are:

siii, s3
galaxy, galax

My default search operator is AND (in both versions, even if it's
deprecated in 4.x), and the output of the debug is:

SOLR 3.x

str name=parsedquery+(title_search_pt:galaxy
title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s)
3)/str

SOLR 4.x

str name=parsedquery+((title_search_pt:galaxy
title_search_pt:galax)/no_coord) +(+title_search_pt:sii
+title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str

The weird thing is that it does not return results like 'galaxy s3'. This
is the debug query:

no match on required clause (+title_search_pt:sii +title_search_pt:s3
+title_search_pt:s +title_search_pt:3)
(NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s), *no
match on required clause (title_search_pt:sii)*
(NON-MATCH) no matching term
(MATCH) weight(title_search_pt:s3 in 1834535)
(MATCH) weight(title_search_pt:s in 1834535)
(MATCH) weight(title_search_pt:3 in 1834535)

How is it that sii is *required* when it should be OR'ed with s and s3?

The analysis output shows that sii has token position 2, like its synonyms:

position 1: galaxy, galax
position 2: sii, s3, s
position 3: 3

Thanks,

Raúl Cardozo.


Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Chris Hostetter

: it shows type as undefined for dynamic field ignored_* , and I am using

That means the running solr instance does not know anything about a 
dynamic field named ignored_* -- it doesn't exist.

: but on the admin page it shows schema :

the page showing the schema file just tells you what's on disk -- it has 
no way of knowing if you modified that file after starting up solr.

... Wait a minute ... i see your problem now...

...
: /fields 
: dynamicField name=ignored_* type=ignored indexed=false stored=true
: multiValued=true/

...your <dynamicField/> declaration needs to be inside your <fields> 
block.
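
i.e. something like this (an abbreviated sketch):

<fields>
  ... your other field declarations ...
  <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true"
                multiValued="true"/>
</fields>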


-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: I'm migrating from 3.x to 4.x and I'm running some queries to verify that
: everything works like before. I've found however that the query galaxy s3
: is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

is your entire schema 100% identical in both cases?
what is the luceneMatchVersion set to in your solrconfig.xml?


By the looks of your debug output, it appears that you are using 
autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- 
but the fieldType you posted here shows it set to false

: fieldtype name=text_pt class=solr.TextField
: positionIncrementGap=100 autoGeneratePhraseQueries=false

...i haven't tried to reproduce your specific situation, but that 
configuration doesn't smell right compared with what you are showing for 
the 3x output...

: SOLR 3.x
: 
: str name=parsedquery+(title_search_pt:galaxy
: title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s)
: 3)/str
: 
: SOLR 4.x
: 
: str name=parsedquery+((title_search_pt:galaxy
: title_search_pt:galax)/no_coord) +(+title_search_pt:sii
: +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str


-Hoss


Re: CRLF Invalid Exception ?

2013-09-06 Thread Brent Ryan
For what it's worth... I just updated to solrj 4.4 (even though my server
is solr 4.3) and it seems to have fixed the issue.

Thanks for the help!


On Fri, Sep 6, 2013 at 1:41 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I'm not sure if this means there's a bug in the client library I'm using
 : (solrj 4.3) or is a bug in the server SOLR 4.3?  Or is there something in
 : my data that's causing the issue?

 It's unlikely that an error in the data you pass to SolrJ methods would be
 causing this problem -- i'm pretty sure it's not even a problem with the
 raw xml data being streamed, it appears to be a problem with how that data
 is getting chunked across the wire.

 My best guess is that the most likely causes are either...
  * a bug in the HttpClient version you are using on the client side
  * a bug in the ChunkedInputFilter you are using on the server side
  * a misconfiguration on the HttpClient object you are using with SolrJ
(ie: claiming it's sending chunked when it's not?)


 -Hoss



Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Besides liking or not the behaviour we are getting in 3.x, I'm required to
keep everything working as closely as possible to how it worked before.

I have no idea why this is happening, but setting that field to true solved
the issue; now I get the exact same number of items in both queries!

I won't bother checking why that was so, since we'll be moving away from
the older version, which is the one showing the inconsistency.

But thanks a million.

If you have a SO user I can mark yours as answer here:
http://stackoverflow.com/questions/18661996/solr-4-x-vs-3-x-parsedquery-differences

Cheers
On Sep 6, 2013 4:15 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Our schema is identical except the version.
 : In 3.x it's 1.1 and in 4.x it's 1.5.

 That's kind of a significant difference to leave out -- independent of the
 question you are asking about here, it's going to make quite a few
 differences in how things are being being parsed, and what defaults are.

 If i'm understanding correctly: you like the behavior you are getting from
 Solr 3.x where phrases are generated automatically for you.

 what i can't understand, is how/why phrases are being generated
 automatically for you if you have that 'autoGeneratePhraseQueries=false'
 on your fieldType in your 3x schema ... that makes no sense to me.

 if you didn't have autoGeneratePhraseQueries specified at all, then the
 'version=1.1' would explain it (up to version=1.3, the default for
 autoGeneratePhraseQueries was true, but in version=1.4 and above, it
 defaults to false)  but with an explicit
 'autoGeneratePhraseQueries=false' i can't explain why 3x works the way
 you say it works for you.

 Bottom line: if you *want* the auto generated phrase query behavior
 in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your
 fieldType.



 :  : I'm migrating from 3.x to 4.x and I'm running some queries to verify
 that
 :  : everything works like before. I've found however that the query
 galaxy
 :  s3
 :  : is giving much less results. In 3.x numFound=1628, in 4.x
 numFound=70.
 : 
 :  is your entire schema 100% identical in both cases?
 :  what is the luceneMatchVersion set to in your solrconfig.xml?
 : 
 : 
 :  By the looks of your debug output, it appears that you are using
 :  autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x
 --
 :  but the fieldType you posted here shows it set to false
 : 
 :  : fieldtype name=text_pt class=solr.TextField
 :  : positionIncrementGap=100 autoGeneratePhraseQueries=false
 : 
 :  ...i haven't tried to reproduce your specific situation, but that
 :  configuration doesn't smell right compared with what you are showing
 for
 :  the 3x output...
 : 
 :  : SOLR 3.x
 :  :
 :  : str name=parsedquery+(title_search_pt:galaxy
 :  : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s)
 :  : 3)/str
 :  :
 :  : SOLR 4.x
 :  :
 :  : str name=parsedquery+((title_search_pt:galaxy
 :  : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
 :  : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str
 : 
 : 
 :  -Hoss
 : 
 :

 -Hoss



Re: CRLF Invalid Exception ?

2013-09-06 Thread Chris Hostetter

: I'm not sure if this means there's a bug in the client library I'm using
: (solrj 4.3) or is a bug in the server SOLR 4.3?  Or is there something in
: my data that's causing the issue?

It's unlikely that an error in the data you pass to SolrJ methods would be 
causing this problem -- i'm pretty sure it's not even a problem with the 
raw xml data being streamed, it appears to be a problem with how that data 
is getting chunked across the wire.

My best guess is that the most likely causes are either...
 * a bug in the HttpClient version you are using on the client side
 * a bug in the ChunkedInputFilter you are using on the server side
 * a misconfiguration on the HttpClient object you are using with SolrJ
   (ie: claiming it's sending chunked when it's not?)


-Hoss


Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a
few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).

Our environment:

Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
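
For reference, the commit settings above map to solrconfig.xml entries along
these lines (a sketch; openSearcher=false is the usual choice for the hard
commit but is an assumption here):

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>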

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with Connection Refused exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script normalizes the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote:

 Thanks!

 -Original message-
  From:Erick Erickson erickerick...@gmail.com
  Sent: Friday 6th September 2013 16:20
  To: solr-user@lucene.apache.org
  Subject: Re: SolrCloud 4.x hangs under high update volume
 
  Markus:
 
  See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
  On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
 
   Hi Mark,
  
   Got an issue to watch?
  
   Thanks,
   Markus
  
   -Original message-
From:Mark Miller markrmil...@gmail.com
Sent: Wednesday 4th September 2013 16:55
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume
   
I'm going to try and fix the root cause for 4.5 - I've suspected
 what it
   is since early this year, but it's never personally been an issue, so
 it's
   rolled along for a long time.
   
Mark
   
Sent from my iPhone
   
On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
   wrote:
   
 Hey guys,

 I am looking into an issue we've been having with SolrCloud since
 the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't
 tested
   4.4.0
 yet). I've noticed other users with this same issue, so I'd really
   like to
 get to the bottom of it.

 Under a very, very high rate of updates (2000+/sec), after 1-12
 hours
   we
 see stalled transactions that snowball to consume all Jetty
 threads in
   the
 JVM. This eventually causes the JVM to hang with most threads
 waiting
   on
 the condition/stack provided at the bottom of this message. At this
   point
 SolrCloud instances then start to see their neighbors (who also
 have
   all
 threads hung) as down w/Connection Refused, and the shards become
   down
 in state. Sometimes a node or two survives and just returns 503s
 no
   server
 hosting shard errors.

 As a workaround/experiment, we have tuned the number of threads
 sending
 updates to Solr, as well as the batch size (we batch updates from
   client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also
 did not
 help. Certain combinations of update threads and batch sizes seem
 to
 mask/help the problem, but not resolve it entirely.

 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
   and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
 on a
   good
 day.
 - 5000 max jetty threads (well above what we use when we are
 healthy),
 Linux-user threads ulimit is 6000.
 - Occurs under Jetty 8 or 9 (many versions).
 - Occurs under Java 1.6 or 1.7 (several minor versions).
 - Occurs under several JVM tunings.
 - Everything seems to point to Solr itself, and not a Jetty or Java
   version
 (I hope I'm wrong).

 The stack trace that is holding up all my Jetty QTP threads is the
 following, which seems to be waiting on a lock that I would very
 much
   like
 to understand further:

 java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x0007216e68d8 (a
 java.util.concurrent.Semaphore$NonfairSync)
at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at

  
 

RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-06 Thread Chris Hostetter

: Sorry for the multi-post, seems like the .tdump files didn't get 
: attached.  I've tried attaching them as .txt files this time.

Interesting ... it looks like 2 of your cores are blocked in loading while 
waiting for the searchers to open ... not clear if it's a deadlock or why, 
though - in both cases the coreLoaderThread is trying to register stuff 
with JMX, which is asking for stats right off the bat (not sure why), 
which requires accessing the searcher and is waiting for that to be 
available.  but then you also have newSearcher listener events which 
are using the spellcheck component which is blocked waiting for that 
searcher as well.

Do all of your cores have newSearcher event listeners configured, or just 
2? (i'm trying to figure out if it's a timing fluke that these two are 
stalled, or if it's something special about the configs)

Can you try removing the newSearcher listeners to confirm that that does in 
fact make the problem go away?

With the newSearcher listeners in place, can you try setting 
spellcheck=false as a query param on the newSearcher listeners you have 
configured and see if that works around the problem?
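
e.g. something along these lines in solrconfig.xml (the warming query itself 
is illustrative):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">your warming query</str>
      <!-- skip the spellcheck component during warming -->
      <str name="spellcheck">false</str>
    </lst>
  </arr>
</listener>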

Assuming it's just 2 cores using these listeners: can you reproduce this 
problem with a simpler setup where only one of the affected cores is in 
use?

can you reproduce using Solr 4.4?


It would be helpful if you could create a jira and attach...

* your complete configs -- or at least some configs similar to 
yours that are complete enough to reproduce the startup problem.  
* some sample data (based on 
your initial description, i'm guessing there at least needs to be a 
handful of docs in the index -- and most likelye they need to match your 
warming query -- but we don't need your actual indexes, just some docs 
that will work with your configs that we can index  restart to see the 
problem. 
* these thread dumps.


-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: Our schema is identical except the version.
: In 3.x it's 1.1 and in 4.x it's 1.5.

That's kind of a significant difference to leave out -- independent of the 
question you are asking about here, it's going to make quite a few 
differences in how things are being being parsed, and what defaults are.

If i'm understanding correctly: you like the behavior you are getting from 
Solr 3.x where phrases are generated automatically for you.

what i can't understand, is how/why phrases are being generated 
automatically for you if you have that 'autoGeneratePhraseQueries=false' 
on your fieldType in your 3x schema ... that makes no sense to me.

if you didn't have autoGeneratePhraseQueries specified at all, then the 
'version=1.1' would explain it (up to version=1.3, the default for 
autoGeneratePhraseQueries was true, but in version=1.4 and above, it 
defaults to false)  but with an explicit 
'autoGeneratePhraseQueries=false' i can't explain why 3x works the way 
you say it works for you.

Bottom line: if you *want* the auto generated phrase query behavior 
in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your 
fieldType.
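
e.g. (an abbreviated sketch based on the fieldType you posted; the analyzers 
stay unchanged):

<fieldtype name="text_pt" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  ...
</fieldtype>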



:  : I'm migrating from 3.x to 4.x and I'm running some queries to verify that
:  : everything works like before. I've found however that the query galaxy
:  s3
:  : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.
: 
:  is your entire schema 100% identical in both cases?
:  what is the luceneMatchVersion set to in your solrconfig.xml?
: 
: 
:  By the looks of your debug output, it appears that you are using
:  autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x --
:  but the fieldType you posted here shows it set to false
: 
:  : fieldtype name=text_pt class=solr.TextField
:  : positionIncrementGap=100 autoGeneratePhraseQueries=false
: 
:  ...i haven't tried to reproduce your specific situation, but that
:  configuration doesn't smell right compared with what you are showing for
:  the 3x output...
: 
:  : SOLR 3.x
:  :
:  : str name=parsedquery+(title_search_pt:galaxy
:  : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s)
:  : 3)/str
:  :
:  : SOLR 4.x
:  :
:  : str name=parsedquery+((title_search_pt:galaxy
:  : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
:  : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str
: 
: 
:  -Hoss
: 
: 

-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Shawn Heisey

On 9/6/2013 12:46 PM, Fermin Silva wrote:

Our schema is identical except the version.
In 3.x it's 1.1 and in 4.x it's 1.5.

Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4
i believe) and in 4.x we fixed it to 4_4.


The autoGeneratePhraseQueries parameter didn't exist before schema 
version 1.4.


I'm fairly sure that for your schema that is at version 1.1, the 
autoGeneratePhraseQueries value specified in the field definition will 
be ignored and the actual value that gets used will be true, which 
goes along with what Hoss has said.


See the comment about the version in the example schema on any 4.x Solr 
download.


Thanks,
Shawn



RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-06 Thread Austin Rasmussen
: Do all of your cores have newSearcher event listeners configured or just
: 2 (i'm trying to figure out if it's a timing fluke that these two are 
stalled, or if it's something special about the configs)

All of my cores have both the newSearcher and firstSearcher event listeners 
configured. (The firstSearcher actually doesn't have any queries configured 
against it, so it probably should just be removed altogether)

: Can you try removing the newSearcher listeners to confirm that that does in 
fact make the problem go away?

Removing the newSearcher listeners does not make the problem go away; 
however, removing the firstSearcher listener (even if the newSearcher 
listener is still configured) does make the problem go away.

: With the newSearcher listeners in place, can you try setting 
spellcheck=false as a query param on the newSearcher listeners you have 
configured and 
: see if that works around the problem?

Adding the spellcheck=false param to the firstSearcher listener does appear 
to work around the problem.

: Assuming it's just 2 cores using these listeners: can you reproduce this 
problem with a simpler setup where only one of the affected cores is in use?

Since it's not just these two cores, I'm not sure how to produce much of a 
simpler setup.  I did attempt to limit how many cores are loaded in the 
solr.xml, and found that if I cut it down to 56, it was able to load 
successfully (without any of the above config changed).

If I cut it down to 57 cores, it doesn't hang at registering core any more, 
it actually gets as far as  QuerySenderListener sending requests to 
Searcher@2f28849 main{StandardDirectoryReader(...

If 58+ cores are loaded at start up, that's when it begins to hang at 
registering core.  However, it always hangs on the *last* core configured in 
the solr.xml, regardless of how many cores are being loaded.


: can you reproduce using Solr 4.4?
: It would be helpful if you could create a jira and attach...
: * your complete configs -- or at least some configs similar to yours that are 
complete enough to reproduce the startup problem.  
: * some sample data (based on
: your initial description, i'm guessing there at least needs to be a handful 
of docs in the index -- and most likely they need to match your warming query 
-- but we don't need your actual indexes, just some docs that will work with 
your configs that we can index & restart to see the problem. 
: * these thread dumps.

I can likely get to this early next week, both checking into how this behaves 
using Solr 4.4 and submitting a JIRA with your requested info.


collections api setting dataDir

2013-09-06 Thread mike st. john
is there any way to change the dataDir while creating a collection via the
collection api?


Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than 
allowing for some more threads to possibly run at once. If I had to guess, I 
would say its likely this patch fixes the deadlock issue and your seeing 
another issue - which looks like the system cannot keep up with the requests or 
something for some reason - perhaps due to some OS networking settings or 
something (more guessing). Connection refused happens generally when there is 
nothing listening on the port. 

Do you see anything interesting change with the rest of the system? CPU usage 
spikes or something like that?

Clamping down further on the overall number of threads might help (which would 
require making something configurable). How many nodes are listed in zk under 
live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com wrote:

 Hey guys,
 
 (copy of my post to SOLR-5216)
 
 We tested this patch and unfortunately encountered some serious issues after
 a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
 writing about 5000 docs/sec total, using autoCommit to commit the updates
 (no explicit commits).
 
 Our environment:
 
Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
 
 After about 6 hours of stress-testing this patch, we see many of these
 stalled transactions (below), and the Solr instances start to see each
 other as down, flooding our Solr logs with Connection Refused exceptions,
 and otherwise no obviously-useful logs that I could see.
 
 I did notice some stalled transactions on both /select and /update,
 however. This never occurred without this patch.
 
 Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
 Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
 
 Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
 My script normalizes the ERROR-severity stack traces and returns them in
 order of occurrence.
 
 Summary of my solr.log: http://pastebin.com/pBdMAWeb
 
 Thanks!
 
 Tim Vaillancourt
 
 
 On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote:
 
 Thanks!
 
 -Original message-
 From:Erick Erickson erickerick...@gmail.com
 Sent: Friday 6th September 2013 16:20
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 Markus:
 
 See: https://issues.apache.org/jira/browse/SOLR-5216
 
 
 On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
 
 Hi Mark,
 
 Got an issue to watch?
 
 Thanks,
 Markus
 
 -Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Wednesday 4th September 2013 16:55
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud 4.x hangs under high update volume
 
 I'm going to try and fix the root cause for 4.5 - I've suspected
 what it
 is since early this year, but it's never personally been an issue, so
 it's
 rolled along for a long time.
 
 Mark
 
 Sent from my iPhone
 
 On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
 
 Hey guys,
 
 I am looking into an issue we've been having with SolrCloud since
 the
 beginning of our testing, all the way from 4.1 to 4.3 (haven't
 tested
 4.4.0
 yet). I've noticed other users with this same issue, so I'd really
 like to
 get to the bottom of it.
 
 Under a very, very high rate of updates (2000+/sec), after 1-12
 hours
 we
 see stalled transactions that snowball to consume all Jetty
 threads in
 the
 JVM. This eventually causes the JVM to hang with most threads
 waiting
 on
 the condition/stack provided at the bottom of this message. At this
 point
 SolrCloud instances then start to see their neighbors (who also
 have
 all
 threads hung) as down w/Connection Refused, and the shards become
 down
 in state. Sometimes a node or two survives and just returns 503s
 no
 server
 hosting shard errors.
 
 As a workaround/experiment, we have tuned the number of threads
 sending
 updates to Solr, as well as the batch size (we batch updates from
 client -
 solr), and the Soft/Hard autoCommits, all to no avail. Turning off
 Client-to-Solr batching (1 update = 1 call to Solr), which also
 did not
 help. Certain combinations of update threads and batch sizes seem
 to
 mask/help the problem, but not resolve it entirely.
 
 Our current environment is the following:
 - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
 - 3 x Zookeeper instances, external Java 7 JVM.
 - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
 shard
 and
 a replica of 1 shard).
 - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
 on a
 good
 day.
 - 5000 max jetty threads (well above what we use when we are
 healthy),
 Linux-user threads 

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. 
That 10k thread spike is good to know - that's no good and could easily be part 
of the problem. We want to keep that from happening. 

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote:

 [quoted text snipped]

Re: Odd behavior after adding an additional core.

2013-09-06 Thread mike st. john
hi,

curl 'http://192.168.0.1:8983/solr/admin/collections?action=CREATE&name=collectionx&numShards=4&replicationFactor=1&collection.configName=config1'

After that, I added approx. 100k documents and verified they were in the
index and distributed across the shards.


I then decided to start adding some replicas via coreadmin.

curl 'http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&collection.configName=config1'


Adding the core produced the following: it took away leader status from the
existing leader of the shard it was replicating, inserted itself as down,
and changed the doc routing to implicit.
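
For what it's worth, a variant I have not tried yet would pin the new core
to an explicit shard at creation time (shard1 below is just a placeholder
for the target shard):

curl 'http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&shard=shard1&collection.configName=config1'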


Thanks.



On Fri, Sep 6, 2013 at 4:24 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Can you give exact steps to reproduce this problem?

 Also, are you sure you supplied numShards=4 while creating the collection?

 On Fri, Sep 6, 2013 at 12:20 AM, mike st. john mstj...@gmail.com wrote:
  Using Solr 4.4, I used the collection admin to create a collection: 4
  shards, replication factor of 1.
 
  I did this so I could index my data, then bring in replicas later by
  adding cores via coreadmin.
 
 
  I added a new core via coreadmin. What I noticed shortly after adding the
  core: the leader of the shard where the new replica was placed was marked
  active, the new core was marked as the leader, and the routing was now
  set to implicit.
 
 
 
  I've reproduced this on another Solr setup as well.
 
 
  Any ideas?
 
 
  Thanks
 
  msj



 --
 Regards,
 Shalin Shekhar Mangar.



Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-06 Thread Nutan
It shows the type as undefined for the dynamic field ignored_*, and I am
using the default collection1 core, but on the admin page the schema shows:
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false"/>
<field name="author" type="string" indexed="true" stored="true"
multiValued="true"/>
<field name="comments" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="keywords" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="contents" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="title" type="text" indexed="true" stored="true"
multiValued="false"/>
<field name="revision_number" type="string" indexed="true" stored="true"
multiValued="false"/>
</fields>
<dynamicField name="ignored_*" type="ignored" indexed="false" stored="true"
multiValued="true"/>
<types>
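
For comparison, the stock "ignored" type definition from the Solr example
schema.xml is, as far as I recall (double-check against your version),
along these lines:

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true"
class="solr.StrField"/>

If that definition is missing from the types section, the ignored_* dynamic
field would show up as undefined.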





Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours
without this patch, but that isn't consistent. Sometimes we would only get
to 6 hours or less.

During the crash I can see an amazing spike in threads to 10k, which is
essentially our ulimit for the JVM, but strangely I see none of the
"OutOfMemory: cannot open native thread" errors that usually accompany
this. Weird!

We also notice a spike in CPU around the crash. The instability caused some
shard recovery/replication though, so that CPU may be a symptom of the
replication, or possibly the root cause. The CPU spikes fairly sharply from
about 20-30% utilization (system + user) to 60%, so the CPU, while spiking,
isn't quite pinned (very beefy Dell R720s - 16-core Xeons, whole index in
128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash
(from about 1300kbps written to 3500kbps), but this may have been the
replication, or the ERROR logging (we generally log nothing at WARN
severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea
what it is (may be useful or not):

java.lang.IllegalStateException :
  at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
  at java.lang.Thread.run(Thread.java:724)

On your live_nodes question, I don't have historical data from when the
crash occurred, which I guess is what you're looking for. I could add this
to our monitoring for future tests, however. I'd be glad to continue
testing, but I think more monitoring is needed first to understand this
better. Could we come up with a list of metrics that would be useful to see
after another test and successful crash?

Metrics needed:

1) # of live_nodes (see the sketch below).
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?
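
For (1), a rough sketch of pulling the live_nodes count with SolrJ -
untested, and assuming a ZooKeeper ensemble reachable at localhost:2181:

import java.util.Set;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class LiveNodesCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the same ZK ensemble the SolrCloud cluster uses.
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.connect(); // initializes the ZK state reader
        Set<String> liveNodes =
            server.getZkStateReader().getClusterState().getLiveNodes();
        System.out.println("live_nodes: " + liveNodes.size());
        server.shutdown();
    }
}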

Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller markrmil...@gmail.com wrote:

 Did you ever get to index that long before without hitting the deadlock?

 There really isn't anything negative the patch could be introducing, other
 than allowing for some more threads to possibly run at once. If I had to
 guess, I would say it's likely this patch fixes the deadlock issue and
 you're seeing another issue - which looks like the system cannot keep up
 with the requests for some reason - perhaps due to some OS networking
 settings or something (more guessing). Connection refused generally happens
 when there is nothing listening on the port.

 Do you see anything interesting change with the rest of the system? CPU
 usage spikes or something like that?

 Clamping down further on the overall number of threads might help (which
 would require making something configurable). How many nodes are listed in
 zk under live_nodes?

 Mark

 Sent from my iPhone

 On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com
 wrote:

  Hey guys,
 
  

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Enjoy your trip, Mark! Thanks again for the help!

Tim

On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote:

 Okay, thanks, useful info. Getting on a plane, but I'll look more at this
 soon. That 10k thread spike is good to know - that's no good and could
 easily be part of the problem. We want to keep that from happening.

 Mark

 Sent from my iPhone

 On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote:

  [quoted text snipped]

Re: solrcloud shards backup/restoration

2013-09-06 Thread Aditya Sakhuja
Thanks Shalin and Mark for your responses. I am on the same page about the
conventions for taking the backup. However, I am less sure about the
restoration of the index. Let's say we have 3 shards across 3 SolrCloud
servers.
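
(Concretely, I take the backup step per shard leader to be something like
the following sketch - localhost and core1 are placeholders - i.e. a hard
commit followed by the ReplicationHandler's backup command:

curl 'http://localhost:8983/solr/core1/update?commit=true'
curl 'http://localhost:8983/solr/core1/replication?command=backup'

Correct me if that isn't the intended sequence.)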

1. I am assuming we should take a backup from each of the shard leaders to
get a complete collection. Do you think that will get the complete index
(not worrying about what is not hard committed at the time of backup)?

2. How do we go about restoring the index in a fresh SolrCloud cluster?
From the structure of the snapshot I took, I did not see any
replication.properties or index.properties, which I normally see on
healthy SolrCloud cluster nodes.
If I have a snapshot named snapshot.20130905, does snapshot.20130905/* go
into data/index?

Thanks
Aditya



On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller markrmil...@gmail.com wrote:

 Phone typing. The end should not say don't hard commit - it should say
 do a hard commit and take a snapshot.

 Mark

 Sent from my iPhone

 On Sep 6, 2013, at 7:26 AM, Mark Miller markrmil...@gmail.com wrote:

 I don't know that it's too bad though - it's always been the case that if
 you do a backup while indexing, it's just going to get up to the last hard
 commit. With SolrCloud that will still be the case. So just make sure you
 do a hard commit right before taking the backup - yes, it might miss a few
 docs in the tran log, but if you are taking a backup while indexing, you
 don't have great precision in any case - you will roughly get a snapshot
 for around that time - even without SolrCloud, if you are worried about
 precision and getting every update into that backup, you want to stop
 indexing and commit first. But if you just want a rough snapshot for around
 that time, in both cases you can still just don't hard commit and take a
 snapshot.
 
  Mark
 
  Sent from my iPhone
 
  On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:
 
  The replication handler's backup command was built for pre-SolrCloud.
  It takes a snapshot of the index but it is unaware of the transaction
  log which is a key component in SolrCloud. Hence unless you stop
  updates, commit your changes and then take a backup, you will likely
  miss some updates.
 
  That being said, I'm curious to see how peer sync behaves when you try
  to restore from a snapshot. When you say that you haven't been
  successful in restoring, what exactly is the behaviour you observed?
 
  On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja 
 aditya.sakh...@gmail.com wrote:
  Hello,
 
  I was looking for a good backup / recovery solution for the solrcloud
  indexes. I am more looking for restoring the indexes from the index
  snapshot, which can be taken using the replicationHandler's backup
 command.
 
  I am looking for something that works with solrcloud 4.3 eventually,
 but
  still relevant if you tested with a previous version.
 
  I haven't been successful in have the restored index replicate across
 the
  new replicas, after I restart all the nodes, with one node having the
  restored index.
 
  Is restoring the indexes on all the nodes the best way to do it ?
  --
  Regards,
  -Aditya Sakhuja
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.




-- 
Regards,
-Aditya Sakhuja


Unknown attribute id in add:allowDups

2013-09-06 Thread Brian Robinson

Hello,
I'm working with the Pecl package, with Solr 4.3.1. I have a field defined
in my schema where id is the uniqueKey:


<field name="id" type="int" indexed="true" stored="true" required="true"
multiValued="false" />

<uniqueKey>id</uniqueKey>

I tried to add a doc to my index with the following code (simplified for 
the question):


$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', 12345);
$doc->addField('description', 'This is the content of the doc');
$updateResponse = $client->addDocument($doc);

When I do this, the doc is not added to the index, and I get the following
error in the logs in admin:


 Unknown attribute id in add:allowDups

However, I noticed that if I change the field to type string:

<field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />

...
$doc->addField('id', '12345');

the doc is added to the index, but I still get the error in the log.

So first, I was wondering, is there some other way I should be setting 
this up so that id can be an int instead of a string?


And then I was also wondering what this error is referring to. Is there 
some further way I need to define id? Or maybe define the uniqueKey 
differently?


Any help would be much appreciated.
Thanks,
Brian


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Hi,

Our schema is identical except for the version: in 3.x it's 1.1 and in 4.x
it's 1.5.

Also, in solrconfig.xml we have no luceneMatchVersion for 3.x (so it's
using 2_4, I believe), and in 4.x we fixed it to 4_4.
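
For reference, the relevant solrconfig.xml line in our 4.x config is along
these lines (a sketch, assuming Solr 4.4):

<luceneMatchVersion>LUCENE_44</luceneMatchVersion>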

Thanks
On Sep 6, 2013 3:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I'm migrating from 3.x to 4.x and I'm running some queries to verify that
 : everything works like before. I've found however that the query "galaxy
 : s3" is giving far fewer results. In 3.x numFound=1628, in 4.x numFound=70.

 is your entire schema 100% identical in both cases?
 what is the luceneMatchVersion set to in your solrconfig.xml?


 By the looks of your debug output, it appears that you are using
 autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x --
 but the fieldType you posted here shows it set to false

 : <fieldtype name="text_pt" class="solr.TextField"
 : positionIncrementGap="100" autoGeneratePhraseQueries="false">

 ...i haven't tried to reproduce your specific situation, but that
 configuration doesn't smell right compared with what you are showing for
 the 3x output...
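
 ...if the goal is to match the 3x behavior, presumably the thing to try is
 flipping that flag and reloading the core - a sketch, not something i've
 verified against your data:

 <fieldtype name="text_pt" class="solr.TextField"
 positionIncrementGap="100" autoGeneratePhraseQueries="true">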

 : SOLR 3.x
 :
 : <str name="parsedquery">+(title_search_pt:galaxy
 : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
 : 3")</str>
 :
 : SOLR 4.x
 :
 : <str name="parsedquery">+((title_search_pt:galaxy
 : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
 : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)</str>


 -Hoss



Re: Facet Count and RegexTransformersplitBy

2013-09-06 Thread Raheel Hasan
Hi,

What I want is very simple:

The query results:
row 1 = a,b,c,d
row 2 = a,f,r,e
row 3 = a,c,ff,e,b
..

facet count needed:
'a' = 3 occurrences
'b' = 2 occurrences
'c' = 2 occurrences
.
.
.


I searched and found a solution here:
http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

But I want to be sure if it will work.
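
For concreteness, the setup I have in mind is a DIH entity like the
following sketch (untested; my_table and the tags column are placeholders):

<entity name="item" transformer="RegexTransformer"
        query="SELECT id, tags FROM my_table">
  <field column="tags" splitBy=","/>
</entity>

plus a multiValued field in schema.xml:

<field name="tags" type="string" indexed="true" stored="true"
multiValued="true"/>

and then faceting on it with facet=true&facet.field=tags.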



On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote:

 Facet counts are per field - your counts are scattered across different
 fields.

 There are additional capabilities in the facet component, but first you
 should describe exactly what your requirements are.

 -- Jack Krupansky
 -Original Message- From: Raheel Hasan
 Sent: Friday, September 06, 2013 9:58 AM
 To: solr-user@lucene.apache.org
 Subject: Facet Count and RegexTransformersplitBy


 Hi guyz,

 Just a quick question:

 I have a field that has CSV values in the database. So I will use the
 DataImportHandler and will index it using RegexTransformer's splitBy
 attribute. However, since this is the first time I am doing it, I just
 wanted to be sure if it will work for Facet Count?

 For example:
 From query results (say this is the values in that field):
 row 1 = 1,2,3,4
 row 2 = 1,4,5,3
 row 3 = 2,1,20,66
 .
 .
 .
 .
 so facet count will get me:
 '1' = 3 occurrences
 '2' = 2 occurrences
 .
 .
 .and so on.





 --
 Regards,
 Raheel Hasan




-- 
Regards,
Raheel Hasan


Re: solrcloud shards backup/restoration

2013-09-06 Thread Tim Vaillancourt
I wouldn't say I love this idea, but wouldn't it be safe to LVM-snapshot
the Solr index? I think this may even work on a live server, depending on
some file I/O details. Has anyone tried this? (I sketch the rough commands
below.)

An in-Solr solution sounds more elegant, but considering the tlog concern
Shalin mentioned, I think this may work as an interim solution.
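
The rough sequence I have in mind (untested; the volume group and mount
point names are placeholders) would be:

lvcreate --snapshot --size 5G --name solr-snap /dev/vg0/solr
mount -o ro /dev/vg0/solr-snap /mnt/solr-snap
# copy the index off for the backup, then clean up:
umount /mnt/solr-snap
lvremove -f /dev/vg0/solr-snap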

Cheers!

Tim


On 6 September 2013 15:41, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

 [quoted text snipped]



Re: Solr Cloud hangs when replicating updates

2013-09-06 Thread Kevin Osborn
Thanks a ton Mark. I have tried SOLR-4816 and it didn't help. But I will
try Mark's patch next week, and see what happens.

-Kevin


On Thu, Sep 5, 2013 at 4:46 AM, Erick Erickson erickerick...@gmail.com wrote:

 If you run into this again, try a jstack trace. You should see
 evidence of being stuck in SolrCmdDistributor on a variable
 called semaphore... On current 4x this is around line 420.
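
  For example, something along these lines against the Solr JVM (assuming
  the stock example start.jar process; adjust the pattern to your setup):

  jstack $(pgrep -f start.jar) > solr-threads.txt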

 If you're using SolrJ, then SOLR-4816 is another thing to try.

 But Mark's patch would be best of all to test, If that doesn't
 fix it then the jstack suggestion would at least tell us if it's
 the issue we think it is.

 FWIW,
 Erick


 On Wed, Sep 4, 2013 at 12:51 PM, Mark Miller markrmil...@gmail.com
 wrote:

  It would be great if you could give this patch a try:
  http://pastebin.com/raw.php?i=aaRWwSGP
 
  - Mark
 
 
  On Wed, Sep 4, 2013 at 8:31 AM, Kevin Osborn kevin.osb...@cbsi.com
  wrote:
 
   Thanks. If there is anything I can do to help you resolve this issue,
 let
   me know.
  
   -Kevin
  
  
   On Wed, Sep 4, 2013 at 7:51 AM, Mark Miller markrmil...@gmail.com
  wrote:
  
     I'll look at fixing the root issue for 4.5. I've been putting it off
     for way too long.
   
Mark
   
Sent from my iPhone
   
On Sep 3, 2013, at 2:15 PM, Kevin Osborn kevin.osb...@cbsi.com
  wrote:
   
 I was having problems updating SolrCloud with a large batch of
  records.
The
 records are coming in bursts with lulls between updates.

 At first, I just tried large updates of 100,000 records at a time.
 Eventually, this caused Solr to hang. When hung, I can still query
   Solr.
 But I cannot do any deletes or other updates to the index.

 At first, my updates were going as SolrJ CSV posts. I have also
 tried
local
 file updates and had similar results. I finally slowed things down
 to
just
 use SolrJ's Update feature, which is basically just JavaBin. I am
  also
 sending over just 100 at a time in 10 threads. Again, it eventually
   hung.

 Sometimes, Solr hangs in the first couple of chunks. Other times,
 it
hangs
 right away.

 These are my commit settings:

 <autoCommit>
   <maxTime>15000</maxTime>
   <maxDocs>5000</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>
 <autoSoftCommit>
   <maxTime>3</maxTime>
 </autoSoftCommit>

 I have tried quite a few variations with the same results. I also tried
 various JVM settings with the same results. The only thing that seems to
 help is reducing the cluster size from 2 to 1.

 I also did a jstack trace. I did not see any explicit deadlocks, but I did
 see quite a few threads in WAITING or TIMED_WAITING. It is typically
 something like this:

  java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0x00074039a450 (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:139)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:474)
    at org.apache.solr.handler.loader.CSVLoaderBase.doAdd(CSVLoaderBase.java:395)
    at org.apache.solr.handler.loader.SingleThreadedCSVLoader.addDoc(CSVLoader.java:44)
    at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:364)
    at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
Batch Solr Server

2013-09-06 Thread gaoagong
Does anyone know if there is such a thing as a BatchSolrServer object in the
solrj code? I am currently using the ConcurrentUpdateSolrServer, but it
isn't doing quite what I expected. It will distribute the load of sending
through the HTTP client across different threads and manage the
connections, but it does not package the documents into bundles. This can be
done manually by calling solrServer.add(Collection<SolrInputDocument>
documents), which will create an UpdateRequest object for the entire
collection. When the ConcurrentUpdateSolrServer gets to this UpdateRequest
it will send all of the documents together in a single HTTP call.

What I want to be able to do is call solrServer.add(SolrInputDocument
document) and have the SolrServer grab the next batch (up to a specified
size) and then create an UpdateRequest. This would reduce the number of
individual requests the Solr servers have to handle, as well as any
per-HTTP-call overhead incurred.

Would this kind of functionality be worth while to anyone else? Should I
create such a SolrServer object?
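
To make the idea concrete, here is a rough sketch of the wrapper I have in
mind - untested, and BatchingSolrServer plus its wiring are my own
invention, not existing solrj API; only the delegate's add(Collection) call
is stock SolrJ:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

// Buffers single adds and forwards them to the delegate in batches, so each
// UpdateRequest (and each HTTP call) carries up to batchSize documents.
public class BatchingSolrServer {
    private final SolrServer delegate; // e.g. a ConcurrentUpdateSolrServer
    private final int batchSize;
    private final List<SolrInputDocument> buffer =
        new ArrayList<SolrInputDocument>();

    public BatchingSolrServer(SolrServer delegate, int batchSize) {
        this.delegate = delegate;
        this.batchSize = batchSize;
    }

    public synchronized void add(SolrInputDocument doc)
            throws SolrServerException, IOException {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered as one add(Collection) call, i.e. one
    // UpdateRequest for the whole batch.
    public synchronized void flush() throws SolrServerException, IOException {
        if (!buffer.isEmpty()) {
            delegate.add(new ArrayList<SolrInputDocument>(buffer));
            buffer.clear();
        }
    }
}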


