Re: number of replicas in Cloud
Hi Anshum,

I'm using Solr 4.4. Is there a problem with using a replicationFactor of 2?

On Thu, Sep 12, 2013 at 11:20 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Prasi, a replicationFactor of 2 is what you want. However, as of the current releases, this is not persisted.

On Thu, Sep 12, 2013 at 11:17 AM, Prasi S prasi1...@gmail.com wrote:

Hi, I want to set up SolrCloud with 2 shards and 1 replica for each shard:

    MyCollection: shard1, shard2
                  shard1-replica, shard2-replica

In this case I would use numShards=2. For the replication factor, should I give replicationFactor=1 or replicationFactor=2? Please advise.

Thanks, Prasi

--
Anshum Gupta
http://www.anshumgupta.net
Re: No or limited use of FieldCache
Thanks, guys. Now I know a little more about DocValues and realize that they will do the job wrt FieldCache.

Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:

Per, check zee Wiki, there is a page describing docvalues. We used them successfully in a Solr-for-analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 9:15 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

On 09/11/2013 08:40 AM, Per Steffensen wrote:

The reason I mention sort is that in my project, half a year ago, we dealt with the FieldCache OOM problem when doing sort requests. We basically just reject sort requests unless they hit below X documents - in case they do, we just find them without sorting and sort them ourselves afterwards.

Currently our problem is that we have to do a group/distinct (in SQL language) query, and we have found that we can do what we want using grouping (http://wiki.apache.org/solr/FieldCollapsing) or faceting - either will work for us. The problem is that they both use FieldCache, and we know that using FieldCache will lead to OOM exceptions with the amount of data each of our Solr nodes administrates. This time we really have no option of just limiting usage as we did with sort. Therefore we need a group/distinct functionality that works even on huge data amounts (and an algorithm using FieldCache will not).

I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea?

I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks!

There is Simon Willnauer's presentation
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
and this blog post
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
and this one that shows some performance comparisons:
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
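On the facet.method=enum question: enum faceting walks the indexed terms and intersects with the filterCache rather than populating the FieldCache, so it does sidestep the FieldCache (at the cost of one filter per term). A minimal SolrJ sketch of trying it - the URL and the field name "category" are made-up placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EnumFacetSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("category"); // hypothetical field
            // enum enumerates terms and uses the filterCache, not the FieldCache
            q.set("facet.method", "enum");
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("category").getValues());
            server.shutdown();
        }
    }

Whether enum is a good idea depends on the field: it tends to work well for low-cardinality fields and poorly for fields with millions of unique terms.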
Re: SolrCloud 4.x hangs under high update volume
Thanks Erick!

Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well.

I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance, and I could go 1-to-1 to simplify things.

The good news is we seem to be more stable since changing to a bigger client-to-Solr batch size and fewer client threads updating.

Cheers, Tim

On 11/09/13 04:19 AM, Erick Erickson wrote:

If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232...

Best, Erick

On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys,

Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud:

1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is that increasing the batch size reduces the likelihood of this issue happening.

2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud.

3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads).

To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are:

1) Increase the number of shards (2x) - the theory here is that this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all?

2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts?

Thanks all!

Cheers, Tim

On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote:

Enjoy your trip, Mark! Thanks again for the help!

Tim

On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote:

Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see no "OutOfMemory: cannot open native thread" errors that always follow this. Weird! We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite pinned (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):

    java.lang.IllegalStateException
        at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
        at org.eclipse.jetty.server.Response.sendError(Response.java:325)
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
        at ...
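For reference, the CloudSolrServer-plus-batching combination discussed above looks roughly like this in SolrJ 4.x. A sketch only; the ZooKeeper address, collection name, and field values are made up:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedCloudUpdates {
        public static void main(String[] args) throws Exception {
            // Routes requests using cluster state from ZooKeeper
            // instead of a load-balancer VIP.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 200; i++) { // batch of 200, as in the thread
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                batch.add(doc);
            }
            server.add(batch); // one request per batch, not per document
            server.commit();
            server.shutdown();
        }
    }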
DataImportHandler oddity
I'm trying to index a view in an Oracle database, and have come across some strange behaviour: all the VARCHAR2 fields are being returned as empty strings; this also applies to a datetime field converted to a string via TO_CHAR, and the url field built by concatenating two constant strings and a numeric field converted via TO_CHAR. If I cast the columns to CHAR(N), I get values back, but this is not an acceptable workaround (the maximum length of CHAR(N) is less than that of VARCHAR2(N), and the result is padded to the specified length).

Note that this query works as it should in sqldeveloper, and also in some code that uses the .NET sqlclient API.

The query I'm using is

    select 'APPLICATION' as sourceid,
           'http://app.company.com' || '/app/report.aspx?trsid=' || to_char(incident_no) as URL,
           incident_no, trans_date, location, responsible_unit,
           process_eng, product_eng, case_title, case_description,
           index_lob, investigated, investigated_eng,
           to_char(modified_date, 'YYYY-MM-DDTHH24:MI:SSZ') as modified_date
    from synx.dw_fast
    where (investigated 3)

while the view is

    INCIDENT_NO         NUMBER(38)
    TRANS_DATE          VARCHAR2(8)
    LOCATION            VARCHAR2(4000)
    RESPONSIBLE_UNIT    VARCHAR2(4000)
    PROCESS_ENG         VARCHAR2(4000)
    PROCESS_NO          VARCHAR2(4000)
    PRODUCT_ENG         VARCHAR2(4000)
    PRODUCT_NO          VARCHAR2(4000)
    CASE_TITLE          VARCHAR2(4000)
    CASE_DESCRIPTION    VARCHAR2(4000)
    INDEX_LOB           CLOB
    INVESTIGATED        NUMBER(38)
    INVESTIGATED_ENG    VARCHAR2(254)
    INVESTIGATED_NO     VARCHAR2(254)
    MODIFIED_DATE       DATE
Storing/indexing speed drops quickly
Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node. Storing/indexing from 100 threads on external machines, each thread one doc at a time, at full speed (they always have a new doc to store/index). See the attached images:

* iowait.png: measured I/O wait on the Solr machines
* doccount.png: measured number of docs in the Solr collection

Starting from an empty collection, things are fine wrt storing/indexing speed for the first two to three hours (100M docs per hour); then the speed drops dramatically to a level that is unacceptable for us (max 10M per hour). At the same time as the speed goes down, we see I/O wait increase dramatically. I am not 100% sure, but a quick investigation has shown that this is due to almost constant merging.

What to do about this problem? I know that you can play around with mergeFactor and the commit rate, but earlier tests showed that this does not really do the job - it might postpone the point where the problem occurs, but basically it is just a matter of time before merging exhausts the system. Is there a way to totally avoid merging and keep indexing speed at a high level, while still making sure that searches will perform fairly well when the data amounts become big? (I guess without merging you end up with lots and lots of small files, and I guess this is not good for search response time.)

Regards, Per Steffensen
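For reference, merging is controlled by the merge policy; in Solr 4.x the equivalent settings normally live under indexConfig in solrconfig.xml. As an illustration of the knobs involved (a sketch of the Lucene 4.x API, not a recommendation for these particular values):

    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeTuningSketch {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            // More segments allowed per tier => fewer, later merges
            // (faster indexing, more segments for searches to visit).
            mp.setSegmentsPerTier(20.0);
            // How many segments get merged in one go.
            mp.setMaxMergeAtOnce(20);
            // Segments larger than this are left alone.
            mp.setMaxMergedSegmentMB(5 * 1024);
            System.out.println(mp);
        }
    }

Raising these defers merge work at the cost of more segments per search; it cannot eliminate merging entirely.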
create a core with explicit node_name
Hi Solr users,

I want to create a core with a node_name through the API CloudSolrServer.query(SolrParams params). For example:

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/admin/cores");
    params.set("action", "CREATE");
    params.set("name", newcore.getName());
    params.set("shard", newcore.getShardname());
    params.set("collection.configName", newcore.getCollectionconfigname());
    params.set("schema", newcore.getSchemaXMLFilename());
    params.set("config", newcore.getSolrConfigFilename());
    params.set("coreNodeName", newcore.getCorenodename());
    params.set("node_name", "10.7.23.124:8080_solr");
    params.set("collection", newcore.getCollectionname());

The newcore object encapsulates the creation properties for the core. This does not seem to work: the core was created on another node. Do I need to send the params directly to the specific web server - here 10.7.23.124 - instead of using CloudSolrServer.query(SolrParams params)?

regards
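CloudSolrServer load-balances requests across the live nodes, so a CoreAdmin CREATE sent through it can land on any node. A sketch of addressing the target node's CoreAdmin handler directly with HttpSolrServer; the host follows the example above and the names are made up:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateCoreOnNode {
        public static void main(String[] args) throws Exception {
            // Talk to the node that should host the core.
            HttpSolrServer node = new HttpSolrServer("http://10.7.23.124:8080/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "newcore");             // hypothetical core name
            params.set("collection", "mycollection"); // hypothetical collection
            params.set("shard", "shard1");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/cores"); // CoreAdmin handler path
            System.out.println(node.request(req));
            node.shutdown();
        }
    }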
Re: charset encoding
No, Jetty - and yes, for Tomcat I've seen a couple of answers.

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using Tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are ISO-8859-1 (ANSI) encoded, and the meta content-encoding tag says so as well. The server HTTP header says it's UTF-8, and Firefox WebDeveloper agrees. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign characters for them - not the usual wrong chars with ¼ or the flag in them - so it seems that it's not simply the normal UTF-8/ISO-8859-1 discrepancy. Has anyone got an idea what's wrong?
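A quick way to check what the bytes really are before Tika/Solr get involved is to decode the same file both ways and see which rendering shows ä, ö, ü correctly. A throwaway sketch; the file name is made up:

    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CharsetCheck {
        public static void main(String[] args) throws Exception {
            byte[] raw = Files.readAllBytes(Paths.get("page.html")); // hypothetical file
            // If the file really is ISO-8859-1, the first line prints the
            // umlauts correctly and the second shows mojibake - and vice versa.
            System.out.println(new String(raw, Charset.forName("ISO-8859-1")));
            System.out.println(new String(raw, Charset.forName("UTF-8")));
        }
    }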
Re: ReplicationFactor for solrcloud
Hi Aditya,

You need to start another 6 instances (9 instances in total) to achieve this. The first 3 instances, as you mention, are already assigned to the 3 shards. The next 3 will become their replicas, followed by the next 3 as the next set of replicas. You could create two copies each of the example folder and start each one on a different Jetty port. See:
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster

Regards, Aloke

On 9/12/13, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hi - I am trying to set up 3 shards and 3 replicas for my SolrCloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node. I see each of the servers allocated to 1 shard each; however, I do not see 3 replicas allocated on each node. I specifically need to have 3 replicas across 3 servers with 3 shards. Can anyone think of a reason not to have this configuration?

--
Regards, -Aditya Sakhuja
Re: number of replicas in Cloud
Can you specify what you mean by 'problem'? I don't think there should be any issues with that. I hope this is what you followed in your attempt so far:
http://wiki.apache.org/solr/SolrCloud#Example_B:_Simple_two_shard_cluster_with_shard_replicas

On Thu, Sep 12, 2013 at 11:31 AM, Prasi S prasi1...@gmail.com wrote:

Hi Anshum, I'm using Solr 4.4. Is there a problem with using a replicationFactor of 2?

[snip]

--
Anshum Gupta
http://www.anshumgupta.net
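For what it's worth, on 4.x the same layout can also be created in one call through the Collections API, instead of relying on node start-up order and the numShards system property. A hedged SolrJ sketch; the host and collection name are made up:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateCollectionSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "mycollection"); // hypothetical
            params.set("numShards", 2);
            // replicationFactor counts all copies: 2 => a leader plus one
            // replica per shard, i.e. the layout Prasi describes.
            params.set("replicationFactor", 2);
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/collections");
            System.out.println(server.request(req));
            server.shutdown();
        }
    }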
Re: DataImportHandler oddity
Followup: I just tried modifying the select with

    select CAST('APPLICATION' as varchar2(100)) as sourceid, ...

and that caused the sourceid field to be empty. CASTing to char(100) gave me the expected value ('APPLICATION', right-padded to 100 characters). Meanwhile, Google gave me this: http://bugs.caucho.com/view.php?id=4224 (via http://forum.caucho.com/showthread.php?t=27574).

On Thu, Sep 12, 2013 at 8:25 AM, Raymond Wiker rwi...@gmail.com wrote:

I'm trying to index a view in an Oracle database, and have come across some strange behaviour: all the VARCHAR2 fields are being returned as empty strings...

[snip]
Re: charset encoding
Could it have something to do with the meta encoding tag being ISO-8859-1 while the HTTP header says UTF-8, so Firefox interprets it as UTF-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

No, Jetty - and yes, for Tomcat I've seen a couple of answers.

[snip]
Re: DataImportHandler oddity
This is probably a bug with the Oracle thin JDBC driver. Google found a similar issue:
http://stackoverflow.com/questions/4168494/resultset-getstring-on-varchar2-column-returns-empty-string

I don't think this is specific to DataImportHandler.

On Thu, Sep 12, 2013 at 12:43 PM, Raymond Wiker rwi...@gmail.com wrote:

Followup: I just tried modifying the select with CAST('APPLICATION' as varchar2(100)) as sourceid, and that caused the sourceid field to be empty...

[snip]

--
Regards, Shalin Shekhar Mangar.
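One way to take DataImportHandler out of the equation is to run the same statement through plain JDBC with the same driver jar. A minimal sketch; the connection URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class OracleVarchar2Check {
        public static void main(String[] args) throws Exception {
            Class.forName("oracle.jdbc.OracleDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("select 'APPLICATION' as sourceid from dual");
            while (rs.next()) {
                // On an affected driver/database combination this prints []
                // instead of [APPLICATION].
                System.out.println("[" + rs.getString("sourceid") + "]");
            }
            conn.close();
        }
    }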
Re: Storing/indexing speed drops quickly
Maybe the fact that we are never, ever going to delete or update documents can be used for something. If we delete, we will delete entire collections.

Regards, Per Steffensen

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node...

[snip]
Re: Storing/indexing speed drops quickly
It seems the attachments didn't make it through to this mailing list:

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node...

[snip]
Re: Regarding improving performance of the solr
Hi,

I tried to reindex Solr, and I get a regular-expression problem. The steps I followed are:

I started Solr with java -jar start.jar, then deleted the existing index:

    http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
    http://localhost:8983/solr/update?stream.body=<commit/>

I stopped the Solr server, then changed the indexed and stored attributes to false for some of the fields in schema.xml:

    <fields>
      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="revision" type="sint" indexed="false" stored="false"/>
      <field name="user" type="string" indexed="false" stored="false"/>
      <field name="userId" type="int" indexed="false" stored="false"/>
      <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="pagerank" type="text_general" indexed="true" stored="false"/>
      <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="category" type="string" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <copyField source="title" dest="titleText"/>

My data-config.xml:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/"
                url="/home/prabu/wikipedia_full_indexed_dump.xml"
                transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
          <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
          <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
          <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
          <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
          <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
          <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
          <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
          <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
          <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
          <field column="anchor_text" xpath="/mediawiki/page/anchor_text/" stripHTML="true"/>
          <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
          <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
          <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
          <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
        </entity>
      </document>
    </dataConfig>

I tried http://localhost:8983/solr/dataimport?command=full-import. At 50,000 documents, I get an error related to a regular expression:

    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)

I do not know how to proceed. Please help me out.

Thanks and Regards, Prabu

On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson erickerick...@gmail.com wrote:

Be a little careful when extrapolating from disk to memory. Any fields where you've set stored="true" will put data in segment files with the extensions .fdt and .fdx; these are the compressed verbatim copy of the data for stored fields and have very little impact on the memory required for searching. I've seen indexes where 75% of the data is stored and indexes where 5% of the data is stored. Summary of file extensions here:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html

Best, Erick

On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy
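For what it's worth, the repeating Pattern$Loop/Pattern$GroupTail frames in that trace are the usual signature of catastrophic backtracking, and the nested quantifiers in ((\[\[.*Category:.*\]\]\W?)+) are a plausible trigger on long article text. A sketch for testing a stricter variant outside Solr before changing data-config.xml - the alternative pattern is an assumption to validate against your data, not a drop-in equivalent:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexBacktrackCheck {
        public static void main(String[] args) {
            // Original: nested greedy quantifiers can backtrack exponentially
            // on inputs that almost-but-not-quite match.
            Pattern risky = Pattern.compile("((\\[\\[.*Category:.*\\]\\]\\W?)+)");
            // Stricter: forbid ']' inside the brackets, match one link at a time.
            Pattern safer = Pattern.compile("\\[\\[[^\\]]*Category:[^\\]]*\\]\\]");

            String sample = "intro text [[Category:Example]] more text";
            System.out.println(risky.matcher(sample).find()); // true, but slow on bad inputs
            Matcher m = safer.matcher(sample);
            while (m.find()) {
                System.out.println(m.group()); // [[Category:Example]]
            }
        }
    }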
Not able to deploy SOLR after applying OpenNLP patch
Hi,

My question is related to OpenNLP integration with Solr. I have successfully applied the OpenNLP LUCENE-2899-x.patch to the latest Solr branch_4x checkout from here:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x

I am also able to compile the source code, generate all related binaries, and create the war file. But I am facing issues while deploying Solr. Here is the error:

    Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text_opennlp: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:467)
        ... 15 more
    Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        ... 16 more
    Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:449)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:543)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        ... 20 more
    Caused by: java.lang.ClassNotFoundException: solr.OpenNLPTokenizerFactory
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:433)
        ... 24 more
    4446 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.SolrException: Unable to create core: collection1
        at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:931)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:563)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:244)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:236)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Please help me on this. Waiting for your reply. Thanks in advance.
SolrCloud behave differently on server and local
Hi all,

I am trying SolrCloud on my server. The server is a virtual machine. I have followed the SolrCloud wiki, http://wiki.apache.org/solr/SolrCloud. When I run SolrCloud it fails, but if I try it on my local machine, it runs successfully. Why does Solr behave differently on the server and locally? My solr.log is as follows:

    INFO  - 2013-09-12 14:50:13.389; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() done
    ERROR - 2013-09-12 14:50:13.433; org.apache.solr.core.CoreContainer; CoreContainer was not shutdown prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! instance=1423856966
    INFO  - 2013-09-12 14:50:13.483; org.eclipse.jetty.server.AbstractConnector; Started SocketConnector@0.0.0.0:8983
    INFO  - 2013-09-12 14:57:01.776; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312
    INFO  - 2013-09-12 14:57:01.838; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor /opt/Applications/solr-4.4.0/example/contexts at interval 0
    INFO  - 2013-09-12 14:57:01.846; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: /opt/Applications/solr-4.4.0/example/contexts/solr-jetty-context.xml
    INFO  - 2013-09-12 14:57:02.549; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
    INFO  - 2013-09-12 14:57:02.656; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init()
    INFO  - 2013-09-12 14:57:02.797; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx)
    INFO  - 2013-09-12 14:57:02.799; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI)
    INFO  - 2013-09-12 14:57:02.801; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/'
    INFO  - 2013-09-12 14:57:02.917; org.apache.solr.core.ConfigSolr; Loading container configuration from /opt/Applications/solr-4.4.0/example/solr/solr.xml
    ERROR - 2013-09-12 14:57:03.072; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check solr/home property and the logs
    ERROR - 2013-09-12 14:57:03.098; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Could not load SOLR configuration
        at org.apache.solr.core.ConfigSolr.fromFile(ConfigSolr.java:65)
        at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:89)
        at org.apache.solr.core.CoreContainer.init(CoreContainer.java:139)
        at org.apache.solr.core.CoreContainer.init(CoreContainer.java:129)
        at org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:139)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:122)
        at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:719)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1252)
        at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:710)
        at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:494)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
        at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
        at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
        at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
        at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
        at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
        at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
        at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:555)
        at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:230)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
        at ...
Re: SolrCloud 4.x hangs under high update volume
Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;).

Adrien has branched the code for Solr 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can, rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut...

FWIW, Erick

On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote:

Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch...

[snip]
Re: No or limited use of FieldCache
Per:

One thing I'll be curious about. From my reading of DocValues, it uses little or no heap. But it _will_ use memory from the OS, if I followed Simon's slides correctly. So I wonder if you'll hit swapping issues... which are better than OOMs, certainly...

Thanks, Erick

On Thu, Sep 12, 2013 at 2:07 AM, Per Steffensen st...@designware.dk wrote:

Thanks, guys. Now I know a little more about DocValues and realize that they will do the job wrt FieldCache.

Regards, Per Steffensen

[snip]
Re: Help in resolving the below retrieval issue
Hi,

I am also seeing this issue when the search query is something like "how are you?" (quotes for clarity). The query parser splits it into the tokens below:

    +text:whats +text:your +text:raashee?

However, when I remove the ? from the search query ("how are you"), I get the results. Is ? a special character? Should it be escaped as well?

On Wed, Sep 11, 2013 at 1:50 AM, Jack Krupansky j...@basetechnology.com wrote:

Removing stray hyphens (embedded hyphens, like CD-ROM, are okay) or escaping them with a backslash looks like your best bet. There's no query parser option to disable the hyphen as an exclusion operator, although an upgrade to a modern Solr should fix the problem.

-- Jack Krupansky

-----Original Message-----
From: Prathik Puthran
Sent: Tuesday, September 10, 2013 4:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Help in resolving the below retrieval issue

I'm using Solr 3.4. This bug is causing the 2nd term, i.e. "kumar", to be treated as an exclusion operator? Is it possible to configure the query parser to not treat the '-' as an exclusion operator? If not, is the only way to remove the '-' from the query string?

Thanks, Prathik

On Tue, Sep 10, 2013 at 10:36 PM, Jack Krupansky j...@basetechnology.com wrote:

What release of Solr are you using? It appears that the hyphen is being treated as an exclusion operator even though it is followed by a space. Solr 4.4 doesn't appear to do that, but maybe earlier releases had a problem. In any case, be careful with a leading hyphen in queries, since it does mean "exclude documents that contain the following term". Or, just escape any leading hyphen with a backslash.

-- Jack Krupansky

-----Original Message-----
From: Prathik Puthran
Sent: Tuesday, September 10, 2013 11:47 AM
To: d...@lucene.apache.org ; solr-user@lucene.apache.org
Subject: Re: Help in resolving the below retrieval issue

Thanks Erick for the response. I tried to debug the query. Below is the response in the debug node:

    <str name="rawquerystring">Rahul - kumar</str>
    <str name="querystring">Rahul - kumar</str>
    <str name="parsedquery">+text:Rahul -text:kumar</str>
    <str name="parsedquery_toString">+text:Rahul -text:kumar</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries"><str>Rahul - kumar</str></arr>
    <arr name="parsed_filter_queries"><str>+text:rahul -text:kumar</str></arr>

Does it mean the query parser has parsed it into the tokens "Rahul -" and "kumar"? Even if this was the case, Solr should be able to retrieve the documents, because I have indexed all the documents based on n-grams as well.

Thanks, Prathik

On Tue, Sep 10, 2013 at 7:09 PM, Erick Erickson erickerick...@gmail.com wrote:

Try adding debug=query to the URL. What I think you'll find is that you're running into a common issue: the difference between query parsing and analysis. When you submit anything with whitespace in it, the query parser will break it up _before_ it gets to the analysis part; you should see something in the debug portion of the query like field:rahul field:kumar and possibly even field:-. These are searched as separate tokens. By specifying KeywordTokenizer, at index time you'll have exactly one token, "rahul-kumar", in the index, which will not match any of the separated tokens.

Try escaping the spaces with a backslash. You could also try quoting the input, although that has some phrase implications. Do you really want this search to fail on just searching "rahul", though? Perhaps KeywordTokenizer isn't best here; it depends upon your use case...

Best, Erick

On Tue, Sep 10, 2013 at 8:10 AM, Prathik Puthran prathik.puthra...@gmail.com wrote:

Hi,

I am facing the below issue where Solr is not retrieving the indexed word for some cases. This happens whenever the indexed word has the string " - " (quotes for clarity) as a substring, i.e. a word prefix followed by a space, followed by '-', again followed by a space, and then the rest of the word suffix. When I search with the search query being the exact string, Solr returns no results.

Example: Indexed word -- "Rahul - kumar" (quotes for clarity)

If I search with the search query as below, Solr gives no results:
Search query -- "Rahul - kumar" (quotes for clarity)

However, the below search query returns the results:
Search query -- "Rahul kumar"

Can you please let me know what I am doing wrong here and what I should do to ensure the first query, i.e. "Rahul - kumar", returns the documents indexed using it. Below are the analyzers I am using.

Index time analyzer components:

    1) <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9 ])" replacement=""/>
    2) <tokenizer class="solr.KeywordTokenizerFactory"/>
    3) <filter class="solr.LowerCaseFilterFactory"/>
    4) <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    5) <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    6)
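If user-entered text should always be treated literally, SolrJ's ClientUtils.escapeQueryChars handles the escaping; it backslash-escapes the query-syntax characters discussed in this thread, including -, ?, * and whitespace. A small sketch:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeSketch {
        public static void main(String[] args) {
            // Prints: Rahul\ \-\ kumar
            System.out.println(ClientUtils.escapeQueryChars("Rahul - kumar"));
            // Prints: how\ are\ you\?
            System.out.println(ClientUtils.escapeQueryChars("how are you?"));
        }
    }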
Re: ReplicationFactor for solrcloud
You must specify maxShardsPerNode=3 for this to happen. By default maxShardsPerNode is 1, so only one shard is created per node.

On Thu, Sep 12, 2013 at 3:19 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hi - I am trying to set up 3 shards and 3 replicas for my SolrCloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node...

[snip]

--
Regards, Shalin Shekhar Mangar.
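As a sketch of the same point through the Collections API (hypothetical host and collection name):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateWithMaxShardsPerNode {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "mycollection"); // hypothetical
            params.set("numShards", 3);
            params.set("replicationFactor", 3);
            // 3 shards x 3 copies = 9 cores over 3 nodes => 3 cores per node,
            // which the default maxShardsPerNode=1 would refuse to place.
            params.set("maxShardsPerNode", 3);
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/collections");
            System.out.println(server.request(req));
            server.shutdown();
        }
    }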
Re: DataImportHandler oddity
That sounds reasonable. I've done some more digging, and found that the database instance in this case is an _OLD_ version of Oracle: 9.2.0.8.0. I also tried using the OCI driver (version 12), which refuses to even talk to this database. I have three other databases running on more recent versions of Oracle, and all three have worked fine with DataImportHandler.

On Thu, Sep 12, 2013 at 9:48 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

This is probably a bug with the Oracle thin JDBC driver...

[snip]
Re: DataImportHandler oddity
Thanks. It'd be great if you could update this thread if you ever find a workaround. We will document it on the DataImportHandlerFaq wiki page:
http://wiki.apache.org/solr/DataImportHandlerFaq

On Thu, Sep 12, 2013 at 4:56 PM, Raymond Wiker rwi...@gmail.com wrote:

That sounds reasonable. I've done some more digging, and found that the database instance in this case is an _OLD_ version of Oracle: 9.2.0.8.0...

[snip]

--
Regards, Shalin Shekhar Mangar.
Re: No or limited use of FieldCache
Yes, thanks. Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data arrays that, behind the scenes, goes to memory-mapped files when there is no more room on the heap. I never finished it, and it might be kind of pointless, because you would just be reading the data from the Lucene indices and writing it to memory-mapped files in order to use it - it is better to just use the data in the Lucene indices directly. But it had some nice features. That solution would also have the running-out-of-swap-space problem, though.

Regards, Per Steffensen

On 9/12/13 12:48 PM, Erick Erickson wrote:

Per: One thing I'll be curious about. From my reading of DocValues, it uses little or no heap. But it _will_ use memory from the OS if I followed Simon's slides correctly. So I wonder if you'll hit swapping issues... which are better than OOMs, certainly...

Thanks, Erick
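For comparison, Lucene can already serve index data straight from memory-mapped files via MMapDirectory: the OS pages index files in and out on demand, and nothing beyond small bookkeeping structures lives on the Java heap. A minimal sketch against the Lucene 4.x API; the index path is made up:

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapReaderSketch {
        public static void main(String[] args) throws Exception {
            MMapDirectory dir = new MMapDirectory(new File("/path/to/index")); // hypothetical
            DirectoryReader reader = DirectoryReader.open(dir);
            System.out.println("maxDoc=" + reader.maxDoc());
            reader.close();
            dir.close();
        }
    }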
Re: Help in resolving the below retrieval issue
Question mark and asterisk are wildcard characters, so if you want them to be treated as punctuation, either enclose the terms in quotes or escape the characters. Wildcard characters suppress the execution of some token filters if they are not able to cope with wildcards. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Thursday, September 12, 2013 7:01 AM To: solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue Hi, I am also seeing this issue when the search query is something like how are you? (Quotes for clarity). The query parser splits it to the below tokens: +text:whats +text:your +text:raashee? However when I remove the ? from the search query how are you I get the results. Is ? a special character? Should it be escaped as well? On Wed, Sep 11, 2013 at 1:50 AM, Jack Krupansky j...@basetechnology.comwrote: Removing stray hyphens (embedded hyphens, like CD-ROM, are okay) or escaping them with backslash looks like your best bests. There's no query parser option to disable the hyphen as an exlusion operator, although an upgrade to a modern Solr should fix the problem. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Tuesday, September 10, 2013 4:13 PM To: solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue I'm using Solr 3.4. This bug is causing the 2nd term i.e. kumar to be treated as an exclusion operator? Is it possible to configure the query parser to not treat the '-' as exclusion operator ? If not the only way is to remove the '-' from the query string? Thanks, Prathik On Tue, Sep 10, 2013 at 10:36 PM, Jack Krupansky j...@basetechnology.com **wrote: What release of Solr are you using? It appears that the hyphen is being treated as an exclusion operator even though it is followed by a space. Solr 4.4 doesn't appear to do that, but maybe earlier releases had a problem. In any case, be careful with leading hyphen in queries since it does mean exclude documents that contain the following term. Or, just escape any leading hyphen with a backslash. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Tuesday, September 10, 2013 11:47 AM To: d...@lucene.apache.org ; solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue Thanks Erick for the response. I tried to debug the query. Below is the response in the debug node str name=rawquerystringRahul - kumar/strstr name=querystringRahul - kumar/strstr name=parsedquery+text:Rahul -text:kumar/strstr name=parsedquery_toString+text:Rahul -text:kumar/strlst name=explain/str name=QParserLuceneQParser/strarr name=filter_queriesstrRahul - kumar/str/arrarr name=parsed_filter_queriesstr+text:rahul -text:kumar/str/arr Does it mean the query parser has parsed it to tokens Rahul - and kumar? Even if this was the case solr should be able to retrieve the documents because I have indexed all the documents based on n-grams as well. Thanks, Prathik On Tue, Sep 10, 2013 at 7:09 PM, Erick Erickson erickerick...@gmail.com * *wrote: Try adding debug=query to the url. What I think you'll find is that you're running into a common issue, the difference between query parsing and analysis. when you submit anything with whitespace in it, the query parser will break it up _before_ it gets to the analysis part, you should see something in the debug portion of the query like field:rahul field:kumar and possibly even field:- These are searched as separate tokens. 
By specifying KeywordTokenizer, at index time you'll have exactly one token, rahul-kumar, in the index, which will not match any of the separated tokens. Try escaping the spaces with a backslash. You could also try quoting the input, although that has some phrase implications. Do you really want this search to fail on just searching rahul, though? Perhaps KeywordTokenizer isn't best here; it depends upon your use-case... Best, Erick On Tue, Sep 10, 2013 at 8:10 AM, Prathik Puthran prathik.puthra...@gmail.com wrote: Hi, I am facing the below issue wherein Solr is not retrieving the indexed word in some cases. This happens whenever the indexed word has the string " - " as a substring, i.e. a word prefix followed by a space, which is followed by '-', again followed by a space, followed by the rest of the word. When I search with the search query being the exact string, Solr returns no results. Example: Indexed word -- Rahul - kumar (quotes for clarity). If I search with the search query below, Solr gives no results: Search query -- Rahul - kumar (quotes for clarity). However, the search query below returns the results: Search query -- Rahul kumar. Can you please let me know what I am doing wrong here, and what I should do to ensure the first query, i.e. Rahul - kumar, returns the documents indexed with it? Below are the analyzers I am using: Index time analyzer components: 1)
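Tying the advice above together: the safest client-side fix is to escape every Lucene query operator (the hyphen, ?, *, and even whitespace) before building the query. A small SolrJ sketch, assuming the field name text from the debug output above:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeQuery {
        public static void main(String[] args) {
            String raw = "Rahul - kumar";
            // escapeQueryChars backslash-escapes Lucene operators and whitespace,
            // so '-' and '?' reach the analyzer as literal characters instead of
            // being interpreted by the query parser.
            String escaped = ClientUtils.escapeQueryChars(raw);
            System.out.println("text:" + escaped); // text:Rahul\ \-\ kumar
        }
    }

With a KeywordTokenizer field, the escaped string survives query parsing as a single token and can match the single indexed token.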
Re: No or limited use of FieldCache
On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote: Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data-arrays that, behind the scenes, spills to memory-mapped files when there is no more room on the heap. That sounds a lot like disk-based DocValues. [...] That solution will also have the running-out-of-swap-space problem, though. Not really. Memory mapping works like the disk cache: there is no requirement that a certain amount of physical memory be available; it just takes what it can get. If there is not a lot of physical memory, it will require a lot of storage access, but it will not over-allocate swap space. It seems that different setups vary quite a lot in this area, and some systems are prone to aggressive use of the swap file, which can severely harm the responsiveness of applications whose data has been swapped out. However, this should still not result in any OOMs, as the system can always discard some of the memory-mapped data if it needs more physical memory. - Toke Eskildsen, State and University Library, Denmark
Re: No or limited use of FieldCache
On 9/12/13 3:28 PM, Toke Eskildsen wrote: On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote: Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data-arrays that, behind the scenes, spills to memory-mapped files when there is no more room on the heap. That sounds a lot like disk-based DocValues. He he. That solution will also have the running-out-of-swap-space problem, though. Not really. Memory mapping works like the disk cache: there is no requirement that a certain amount of physical memory be available; it just takes what it can get. If there is not a lot of physical memory, it will require a lot of storage access, but it will not over-allocate swap space. That was also my impression, but during the work I experienced some problems around swap space. I do not remember exactly what I saw, or exactly how I concluded that everything in memory-mapped files actually has to fit in physical memory + swap. I might very well have been wrong in that conclusion. It seems that different setups vary quite a lot in this area, and some systems are prone to aggressive use of the swap file, which can severely harm the responsiveness of applications whose data has been swapped out. However, this should still not result in any OOMs, as the system can always discard some of the memory-mapped data if it needs more physical memory. I saw no OOMs. - Toke Eskildsen, State and University Library, Denmark
Facet counting empty as well.. how to prevent this?
Hi, I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
(that is coming because the query results had 4 rows with no value in the field whose facet counts are being computed). Everything else is working just fine. -- Regards, Raheel Hasan
Re: SolrCloud behave differently on server and local
My problem is solved. My server's default Java version was 1.5; I upgraded the Java version. 2013/9/12 cihat güzel c.guzel@gmail.com Hi all. I am trying SolrCloud on my server. The server is a virtual machine. I have followed the SolrCloud wiki http://wiki.apache.org/solr/SolrCloud . When I run SolrCloud, it fails. But if I try on my local machine, it runs successfully. Why does Solr behave differently on the server and locally? My solr.log is as follows:
INFO  - 2013-09-12 14:50:13.389; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() done
ERROR - 2013-09-12 14:50:13.433; org.apache.solr.core.CoreContainer; CoreContainer was not shutdown prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! instance=1423856966
INFO  - 2013-09-12 14:50:13.483; org.eclipse.jetty.server.AbstractConnector; Started SocketConnector@0.0.0.0:8983
INFO  - 2013-09-12 14:57:01.776; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312
INFO  - 2013-09-12 14:57:01.838; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor /opt/Applications/solr-4.4.0/example/contexts at interval 0
INFO  - 2013-09-12 14:57:01.846; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: /opt/Applications/solr-4.4.0/example/contexts/solr-jetty-context.xml
INFO  - 2013-09-12 14:57:02.549; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
INFO  - 2013-09-12 14:57:02.656; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init()
INFO  - 2013-09-12 14:57:02.797; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx)
INFO  - 2013-09-12 14:57:02.799; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI)
INFO  - 2013-09-12 14:57:02.801; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/'
INFO  - 2013-09-12 14:57:02.917; org.apache.solr.core.ConfigSolr; Loading container configuration from /opt/Applications/solr-4.4.0/example/solr/solr.xml
ERROR - 2013-09-12 14:57:03.072; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr.
Check solr/home property and the logs
ERROR - 2013-09-12 14:57:03.098; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Could not load SOLR configuration
    at org.apache.solr.core.ConfigSolr.fromFile(ConfigSolr.java:65)
    at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:89)
    at org.apache.solr.core.CoreContainer.init(CoreContainer.java:139)
    at org.apache.solr.core.CoreContainer.init(CoreContainer.java:129)
    at org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:139)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:122)
    at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:719)
    at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
    at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1252)
    at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:710)
    at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:494)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
    at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
    at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
    at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
    at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
    at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
    at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
    at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:555)
    at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:230)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at
Re: Storing/indexing speed drops quickly
On 9/12/2013 2:14 AM, Per Steffensen wrote: Starting from an empty collection, things are fine wrt storing/indexing speed for the first two-three hours (100M docs per hour); then speed goes down dramatically, to a level that is, for us, unacceptable (max 10M per hour). At the same time as speed goes down, we see that I/O wait increases dramatically. I am not 100% sure, but a quick investigation has shown that this is due to almost constant merging. While constant merging is contributing to the slowdown, I would guess that your index is simply too big for the amount of RAM that you have. Let's ignore for a minute that you're distributed and just concentrate on one machine. After three hours of indexing, you have nearly 300 million documents. If you have a replicationFactor of 1, that's still 50 million documents per machine. If your replicationFactor is 2, you've got 100 million documents per machine. Let's focus on the smaller number for a minute. 50 million documents in an index, even if they are small documents, is probably going to result in an index size of at least 20GB, and quite possibly larger. In order to make Solr function with that many documents, I would guess that you have a heap that's at least 4GB in size. With only 8GB on the machine, this doesn't leave much RAM for the OS disk cache. If we assume that you have 4GB left for caching, then I would expect to see problems about the time your per-machine indexes hit 15GB in size. If you are making it beyond that with a total of 300 million documents, then I am impressed. Two things are going to happen when you have enough documents: 1) You are going to fill up your Java heap, and Java will need to do frequent collections to free up enough RAM for normal operation. When this problem gets bad enough, the frequent collections will be *full* GCs, which are REALLY slow. 2) The index will be so big that the OS disk cache cannot effectively cache it. I suspect that the latter is more of the problem, but both might be happening at nearly the same time. When dealing with an index of this size, you want as much RAM as you can possibly afford. I don't think I would try what you are doing without at least 64GB per machine, and I would probably use at least an 8GB heap on each one, quite possibly larger. With a heap that large, extreme GC tuning becomes a necessity. To cut down on the amount of merging, I go with a fairly large mergeFactor, but mergeFactor is basically deprecated in favor of TieredMergePolicy; there's a new way to configure it now. Here are the indexConfig settings that I use on my dev server:
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>
Thanks, Shawn
RE: Solr cloud shard goes down after SocketException in another shard
Neoman, Make sure that solr08-prod (or the elected leader at any time) isn't doing a stop-the-world garbage collection that takes long enough that the ZooKeeper connection times out. I've seen that in my cluster when I didn't have parallel GC enabled and my zkClientTimeout in solr.xml was too low. Thanks, Greg -Original Message- From: neoman [mailto:harira...@gmail.com] Sent: Thursday, September 12, 2013 9:19 AM To: solr-user@lucene.apache.org Subject: Solr cloud shard goes down after SocketException in another shard Exception in shard1 (solr01-prod) primary:
09/12/13 13:56:46:635|http-bio-8080-exec-66|ERROR|apache.solr.servlet.SolrDispatchFilter|null:ClientAbortException: java.net.SocketException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:406)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:342)
    at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:431)
    at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:419)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:91)
    at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:214)
    at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:95)
    at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:470)
    at org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:545)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:232)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:149)
    at org.apache.solr.common.util.JavaBinCodec.writeSolrDocument(JavaBinCodec.java:320)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:257)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:149)
    at org.apache.solr.common.util.JavaBinCodec.writeArray(JavaBinCodec.java:427)
    at org.apache.solr.common.util.JavaBinCodec.writeSolrDocumentList(JavaBinCodec.java:356)
Exception in shard1 (solr08-prod) secondary:
09/12/13 13:56:46:729|http-bio-8080-exec-50|ERROR|apache.solr.core.SolrCore|org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr08-prod:8080/solr/aq-core), but locally we don't think so. Request came from http://solr03-prod.phneaz:8080/solr/aq-core/
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
Our configuration: Solr 4.4, Tomcat 7, 3 shards. Thanks for your help -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr cloud shard goes down after SocketException in another shard
Thanks, Greg. Currently we have 60 seconds (we reduced it recently). I may have to reduce it again. Can you please share your timeout value? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576p4089582.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr cloud shard goes down after SocketException in another shard
Neoman, I've got ours set at 45 seconds:
<int name="zkClientTimeout">${zkClientTimeout:45000}</int>
-Original Message- From: neoman [mailto:harira...@gmail.com] Sent: Thursday, September 12, 2013 9:33 AM To: solr-user@lucene.apache.org Subject: Re: Solr cloud shard goes down after SocketException in another shard Thanks, Greg. Currently we have 60 seconds (we reduced it recently). I may have to reduce it again. Can you please share your timeout value? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576p4089582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Grouping by field substring?
Hi Jack, On Sep 11, 2013, at 5:34pm, Jack Krupansky wrote: Do a copyField to another field, with a limit of 8 characters, and then use that other field. Thanks - I should have included a few more details in my original question. The issue is that I've got an index with 200M records, of which about 50M have a unique value for this prefix (which is 32 characters long). So adding another indexed field would be significant, which is why I was hoping there was a way to do it via grouping/collapsing at query time. Or is that just not possible? Thanks, -- Ken -Original Message- From: Ken Krugler Sent: Wednesday, September 11, 2013 8:24 PM To: solr-user@lucene.apache.org Subject: Grouping by field substring? Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
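For reference, Jack's copyField suggestion would look something like the sketch below in schema.xml. The field names are hypothetical; maxChars is what truncates the copied value to the first 8 characters (Ken's case would use 32):

    <field name="url_prefix" type="string" indexed="true" stored="false"/>
    <copyField source="original_url" dest="url_prefix" maxChars="8"/>

A query could then group on the truncated field with group=true&group.field=url_prefix. As Ken notes, this trades query-time flexibility for extra index size, which is the concern with 200M records.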
Re: Facet counting empty as well.. how to prevent this?
On 9/12/2013 7:54 AM, Raheel Hasan wrote: I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
The facet.missing parameter has to do with whether or not to display counts for documents that have no value at all for that field. Even though it might seem wrong, the empty string is a valid value, so you can't fix this with faceting parameters. If you don't want that to be in your index, then you can add the LengthFilterFactory to your analyzer to remove terms with a length of less than 1. You might also check whether the field definition in your schema has a default value set to the empty string. If you are using DocValues (Solr 4.2 and later), then the indexed terms aren't used for facets, and it won't matter what you do to your analysis chain. With DocValues, Solr basically uses a value equivalent to the stored value. To get rid of the empty string with DocValues, you'll need to either change your indexing process so it doesn't send empty strings, or use a custom UpdateProcessor to change the data before it gets indexed. Thanks, Shawn
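A minimal sketch of the LengthFilterFactory approach Shawn mentions, assuming a hypothetical field type (the min/max values are illustrative; min=1 is what drops zero-length terms):

    <fieldType name="string_nonempty" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- drop zero-length tokens so "" never becomes an indexed term -->
        <filter class="solr.LengthFilterFactory" min="1" max="256"/>
      </analyzer>
    </fieldType>

As Shawn says, this only helps when facets come from indexed terms; with DocValues the empty string has to be removed before indexing.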
Re: Facet counting empty as well.. how to prevent this?
ok, so I got the idea... I will pull 7 facet values instead and remove the empty one... But there must be some setting that can be done in the facet configuration to ignore a certain value if we want to. On Thu, Sep 12, 2013 at 7:44 PM, Shawn Heisey s...@elyograg.org wrote: On 9/12/2013 7:54 AM, Raheel Hasan wrote: I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
The facet.missing parameter has to do with whether or not to display counts for documents that have no value at all for that field. Even though it might seem wrong, the empty string is a valid value, so you can't fix this with faceting parameters. If you don't want that to be in your index, then you can add the LengthFilterFactory to your analyzer to remove terms with a length of less than 1. You might also check whether the field definition in your schema has a default value set to the empty string. If you are using DocValues (Solr 4.2 and later), then the indexed terms aren't used for facets, and it won't matter what you do to your analysis chain. With DocValues, Solr basically uses a value equivalent to the stored value. To get rid of the empty string with DocValues, you'll need to either change your indexing process so it doesn't send empty strings, or use a custom UpdateProcessor to change the data before it gets indexed. Thanks, Shawn -- Regards, Raheel Hasan
Re: Get the commit time of a document in Solr
Slow down, back up, and now tell us what problem (if any!) you are really trying to solve. Don't leap to a proposed solution before you clearly state the problem to be solved. First, why do you think there is any problem at all? Or, what are you really trying to achieve? -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 1:04 PM To: solr-user@lucene.apache.org Subject: Re: Get the commit time of a document in Solr So, now I want to know when that document becomes searchable, or when it is committed. I have the following scenario: 1) Indexing starts at, say, 9:00 AM - with the above additions to the schema.xml I'll know the indexed time of each document I send to Solr via the update handler. Say 9:01, 9:02 and so on... Let's say I send a document every second between 9:00 and 9:30 AM, which makes 30*60 = 1800 docs. 2) Now at 9:30 AM I issue a hard commit, and now I'll be able to search these 1800 documents, which is fine. 3) Now I want to know that I can search these 1800 documents only at >= 9:30 AM but not < 9:30 AM, as I did not do a hard commit before 9:30 AM. In order to know that, is there a way in Solr, rather than having some application keep track of the documents it sends to Solr between any two commits? The reason I'm asking is, if there are, say, two parallel processes indexing to the same index and one process issues a commit, then whatever documents process two had indexed up to that point in time would also be committed, right? Now if I keep track of commit times in each process, it doesn't reflect the true commit times, as they are intertwined. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089638.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: charset encoding
It was the HTTP header; as soon as I force an iso-8859-1 header it works. On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote: could it have something to do with the meta encoding tag being iso-8859-1 while the HTTP header says utf-8, so Firefox interprets it as utf-8? On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote: no, Jetty; and yes, for Tomcat I've seen a couple of answers. On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote: Using Tomcat by any chance? The ML archive has the solution. May be on the Wiki, too. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote: I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are iso-8859-1 (ANSI) encoded, and the meta tag content-encoding says so as well. The server HTTP header says it's utf-8, and Firefox WebDeveloper agrees. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign signs, not the normal wrong chars with 1/4 or the flag in it. So it seems that it's not simply the normal utf-8/iso-8859-1 discrepancy. Has anyone got an idea what's wrong?
Re: Get the commit time of a document in Solr
Hi Jack, Sorry, I was not clear earlier. What I'm trying to achieve is: I want to know when a document is committed (hard commit). There can be a lot of time lapse (1 hour or more) between the time you indexed a document and the time you issue a commit, in my case. Now, I want to know exactly when a document is committed. In my previous example all 1800 docs are committed at 9:30 AM, and I want to know that time for those 1800 docs. In another batch it'll be some other time. The use-case is that I have more than one process sending update requests to Solr, each of those processes has a separate commit step, and I want to know the commit time of the documents that were committed when I gave a commit request. I hope I'm clear now - please let me know if I'm not. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089662.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud 4.x hangs under high update volume
Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a world of difference! Thanks so much again guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for SOLR 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote: Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well. I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance and I could go 1-to-1 to simplify things. The good news is we seem to be more stable since changing to a bigger client-solr batch size and fewer client threads updating. Cheers, Tim On 11/09/13 04:19 AM, Erick Erickson wrote: If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads). To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts? Thanks all! Cheers, Tim On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote: Enjoy your trip, Mark! Thanks again for the help! Tim On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote: Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening. Mark Sent from my iPhone On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey Mark, The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for
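For readers following along, a minimal SolrJ sketch of the CloudSolrServer approach Tim is considering (the ZooKeeper addresses, collection name, and document are hypothetical; in SolrJ 4.x the class is CloudSolrServer, later renamed CloudSolrClient):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudUpdate {
        public static void main(String[] args) throws Exception {
            // ZooKeeper-aware client: routes each document to its shard leader
            // directly, instead of bouncing through a load-balancer VIP.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }

Going through the leader directly is what reduces the cross-node forwarding (and thus the lock contention) discussed in this thread.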
Re: SolrCloud 4.x hangs under high update volume
Right, I don't see SOLR-5232 making 4.5, unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing. - Mark On Sep 12, 2013, at 2:12 PM, Erick Erickson erickerick...@gmail.com wrote: My take on it is this, assuming I'm reading this right: 1 SOLR-5216 - probably not going anywhere, 5232 will take care of it. 2 SOLR-5232 - expected to fix the underlying issue no matter whether you're using CloudSolrServer from SolrJ or sending lots of updates from lots of clients. 3 SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the meantime. I don't quite know whether SOLR-5232 will make it into 4.5 or not; it hasn't been committed anywhere yet. The Solr 4.5 release is imminent, RC0 is looking like it'll be ready to cut next week, so it might not be included. Best, Erick On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt t...@elementspace.com wrote: Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a world of difference! Thanks so much again guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for SOLR 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote: Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well. I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance and I could go 1-to-1 to simplify things. The good news is we seem to be more stable since changing to a bigger client-solr batch size and fewer client threads updating. Cheers, Tim On 11/09/13 04:19 AM, Erick Erickson wrote: If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads). To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in
Re: Get the commit time of a document in Solr
Sorry, but all you've done is reshuffle your previous statements, without telling us about the actual problem that you are trying to solve! Repeating myself: you, the application developer, can send a hard commit any time you want to assure that documents are searchable. Maybe not every millisecond, but, say, once a second with a soft commit and once a minute for a hard commit, using commitWithin to minimize commits when multiple processes are indexing data. AFAICT, no application should ever have to care when a document is actually committed - and you have control with commit, anyway. You, the application developer, can tune the commit interval to balance searchability and overall efficiency. There shouldn't be any problem there, given the variety of commit methods that Solr supports, but you have to make the choices. So, what's the problem you are trying to solve? You still haven't articulated it. It sounds as if you are trying to solve a non-problem. But we can't be sure, since you haven't articulated what the actual problem (if any) really is. -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Get the commit time of a document in Solr Hi Jack, Sorry, I was not clear earlier. What I'm trying to achieve is: I want to know when a document is committed (hard commit). There can be a lot of time lapse (1 hour or more) between the time you indexed a document and the time you issue a commit, in my case. Now, I want to know exactly when a document is committed. In my previous example all 1800 docs are committed at 9:30 AM, and I want to know that time for those 1800 docs. In another batch it'll be some other time. The use-case is that I have more than one process sending update requests to Solr, each of those processes has a separate commit step, and I want to know the commit time of the documents that were committed when I gave a commit request. I hope I'm clear now - please let me know if I'm not. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089662.html Sent from the Solr - User mailing list archive at Nabble.com.
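As an illustration of the commit cadence Jack describes (soft commit about once a second, hard commit about once a minute), a solrconfig.xml sketch might look like the following; the interval values are examples, not settings from this thread:

    <autoCommit>
      <maxTime>60000</maxTime>        <!-- hard commit every 60s, flushes to stable storage -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>         <!-- soft commit every 1s, makes docs searchable -->
    </autoSoftCommit>

With this split, the application never has to track commit times itself: visibility is bounded by the soft-commit interval and durability by the hard-commit interval.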
Re: Get the commit time of a document in Solr
On Sep 12, 2013, at 20:55, phanichaitanya pvempaty@gmail.com wrote: Apologies again. But here is another try: I want to make sure that documents that are indexed are committed in, say, an hour. I agree that passing commitWithin params and the like will make sure of that, based on the time configuration we set. But I want to make sure that the document is really committed within whatever time we set using commitWithin. It's a question asking for proof that Solr commits within that time if we add the commitWithin parameter to the configuration. That is about the commitWithin parameter option that you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that is the question we are pondering. If you have a timestamp field that defaults to NOW, you could do queries for a single document (q=*:* with rows=1), ranked by descending timestamp. If you're feeding constantly, and run these queries regularly, you should be able to get some sort of feel for the latency in the system.
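A sketch of that timestamp approach (the field name is hypothetical; default="NOW" stamps each document at index time, which approximates when the indexing happened, not the commit itself):

    <!-- in schema.xml -->
    <field name="index_time" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

A latency probe could then repeatedly run a query such as:

    /select?q=*:*&sort=index_time+desc&rows=1&fl=index_time

Comparing the newest visible index_time against the wall clock gives a rough feel for how far behind the searchable view is.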
Re: Get the commit time of a document in Solr
On 9/12/2013 12:55 PM, phanichaitanya wrote: I want to make sure that documents that are indexed are committed in, say, an hour. I agree that passing commitWithin params and the like will make sure of that, based on the time configuration we set. But I want to make sure that the document is really committed within whatever time we set using commitWithin. It's a question asking for proof that Solr commits within that time if we add the commitWithin parameter to the configuration. That is about the commitWithin parameter option that you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that is the question we are pondering. If these are ongoing requirements that you need to meet with every commit or with a large subset of commits, then I don't think there is any way to do it without writing custom plugins for Solr. If you are just trying to prove to someone that Solr is doing what you say it is, then you can do some simple testing: send an update request with as many documents as you want to test, and include commit=true on the request. If you are planning to use commitWithin, also include softCommit=true, because commitWithin is a soft commit. Time how long it takes for the update request to complete. That's approximately how long it will take for a real update/commit to happen. There will be some extra time for the indexing itself, but unless the document count is absolutely enormous, it shouldn't matter too much. If you want to test just the commit time, then (after making sure nothing else is sending updates or commits) send the update without any commit parameters, then send a commit request by itself and time how long the commit request takes. With enough RAM for proper OS disk caching, commits should be very fast even on an index with 10 million documents. Here is a wiki page that has a small amount of discussion about slow commits: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits Thanks, Shawn
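Concretely, Shawn's two timing tests might look like this against a local Solr; the URL, core name, and test document are placeholders:

    # 1) Time an update that commits (as a soft commit) in the same request:
    time curl 'http://localhost:8983/solr/collection1/update?commit=true&softCommit=true' \
         -H 'Content-Type: application/json' -d '[{"id":"test-1"}]'

    # 2) Time a commit by itself, after sending uncommitted updates:
    time curl 'http://localhost:8983/solr/collection1/update?commit=true'

The elapsed time of the first request approximates the worst-case delay between sending a document and it becoming searchable; the second isolates the commit cost alone.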
Re: charset encoding
On 9/12/2013 11:17 AM, Andreas Owen wrote: it was the http-header, as soon as i force a iso-8859-1 header it worked Glad you found a workaround! If you are in a situation where you cannot control the header of the request or modify the content itself to include charset information, or there's some reason you would rather not take that route, there will be another way with the next Solr release. https://issues.apache.org/jira/browse/SOLR-5082 Solr 4.5 will support an ie (input encoding) parameter for the update request so you can inform Solr what charset encoding to expect. The release process for Solr 4.5 has been started, it usually takes 2-3 weeks to complete. Thanks, Shawn
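Once 4.5 is out, the parameter Shawn mentions could be used roughly like this (a sketch only; the file name and update path are placeholders):

    curl 'http://localhost:8983/solr/update?ie=ISO-8859-1&commit=true' \
         -H 'Content-Type: text/xml' --data-binary @docs-latin1.xml

Until then, setting the charset explicitly in the Content-Type header (e.g. 'Content-Type: text/xml; charset=ISO-8859-1') is the usual way to tell Solr how the request body is encoded.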
Re: SolrCloud 4.x hangs under high update volume
That makes sense, thanks Erick and Mark for your help! :) I'll see if I can find a place to assist with the testing of SOLR-5232. Cheers, Tim On 12 September 2013 11:16, Mark Miller markrmil...@gmail.com wrote: Right, I don't see SOLR-5232 making 4.5 unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing. - Mark On Sep 12, 2013, at 2:12 PM, Erick Erickson erickerick...@gmail.com wrote: My take on it is this, assuming I'm reading this right: 1) SOLR-5216 - probably not going anywhere; 5232 will take care of it. 2) SOLR-5232 - expected to fix the underlying issue no matter whether you're using CloudSolrServer from SolrJ or sending lots of updates from lots of clients. 3) SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the meantime. I don't quite know whether SOLR-5232 will make it into 4.5 or not; it hasn't been committed anywhere yet. The Solr 4.5 release is imminent - RC0 looks like it'll be ready to cut next week - so it might not be included. Best, Erick On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt t...@elementspace.com wrote: Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing as there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those two in combination will make a world of difference! Thanks so much again, guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for Solr 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can, rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick
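As a footnote for readers of this thread: a minimal sketch of the batched CloudSolrServer approach discussed above, against the 4.x SolrJ API. The ZooKeeper addresses, collection name, and field names are placeholders, and the batch size of 200 merely echoes the size Tim reported; this is an illustration, not the poster's actual code.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedCloudUpdater {
    public static void main(String[] args) throws Exception {
        // ZooKeeper-aware client; with SOLR-4816 applied it can route each
        // document directly to the correct shard leader instead of relaying
        // through whichever node a load balancer happens to pick.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 200; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "example title " + i);
            batch.add(doc);
        }
        server.add(batch);  // one HTTP request per batch, not per document
        server.commit();    // or rely on autoCommit / commitWithin instead
        server.shutdown();
    }
}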
Re: Some highlighted snippets aren't being returned
maxAnalyzedChars did it! I wasn't setting that param, and I'm working with some very long documents. I also made the hl.fl param formatting change that you suggested, Aloke. Thanks again! - Eric On Sep 11, 2013, at 3:10 AM, Eric O'Hanlon elo2...@columbia.edu wrote: Thank you, Aloke and Bryan! I'll give this a try and I'll report back on what happens! - Eric On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Eric, As Bryan suggests, you should look at appropriately setting up the fragSize and maxAnalyzedChars for long documents. One issue I find with your search request is that in trying to highlight across three separate fields, you have added each of them as a separate request param: hl.fl=contents&hl.fl=title&hl.fl=original_url The way to do it would be (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass them as values to one comma (or space) separated field: hl.fl=contents,title,original_url Regards, Aloke On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: Eric, Your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates. http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars -- Bryan -Original Message- From: Eric O'Hanlon [mailto:elo2...@columbia.edu] Sent: Sunday, September 08, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Some highlighted snippets aren't being returned Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote: Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all, results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running solr as a webapp under Tomcat and I've included the query's solr params from the Tomcat log: ... webapp=/solr-4.2 path=/select params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108 ... For the query above (which can be simplified to say: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets.
Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app): highlighting=> {20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf=> {contents=> [...actual snippet is returned here...]}, 20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf=> {contents=> [...actual snippet is returned here...]}, 20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999=> {contents=> [...actual snippet is returned here...]}, 20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw=> {contents=> [...actual snippet is returned here...]}, 20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf=> {}} I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word "unangan", as expected, and this term appears in a text field that's indexed and stored, and is searched in all text searches. For example, one
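Putting Bryan's and Aloke's advice together, the relevant corner of the request would look something like the sketch below. The collection name and the maxAnalyzedChars value are illustrative, not from the thread; 1000000 simply needs to exceed the longest document you want snippets from.

http://localhost:8983/solr/collection1/select?q=Unangan&defType=edismax&qf=text&hl=true&hl.fl=contents,title,original_url&hl.fragsize=600&hl.maxAnalyzedChars=1000000&wt=ruby

Without hl.maxAnalyzedChars, the standard highlighter stops scanning for snippet candidates after the first 51,200 characters, which is why only the shorter documents were returning highlights.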
Re: Get the commit time of a document in Solr
On 9/12/2013 11:04 AM, phanichaitanya wrote: So, now I want to know when that document becomes searchable or when it is committed. I have the following scenario: 1) Indexing starts at say 9:00 AM - with the above additions to the schema.xml I'll know the indexed time of each document I send to Solr via the update handler. Say 9:01, 9:02 and so on ... let's say I send a document every second between 9:00 and 9:30 AM, which makes 30*60 = 1800 docs. 2) Now at 9:30 AM, I issue a hard commit and I'll be able to search these 1800 documents, which is fine. 3) Now I want to know that I can search these 1800 documents only at or after 9:30 AM, but not before 9:30 AM, as I did not do a hard commit before 9:30 AM. In order to know that, is there a way in Solr, rather than some application keeping track of the documents it sends to Solr between any two commits? The reason I'm asking is: if there are, say, two parallel processes indexing to the same index and one process issues a commit, then whatever documents process two indexed up to that point in time would also be committed, right? Now if I keep track of commit times in each process, it doesn't reflect the true commit times, as they are intertwined. From what I understand, if you use the default of NOW for a field in your schema, then all documents indexed in that request will have the timestamp of the time that indexing started. Assuming what I understand is the way it actually works, if you want the time to reflect anything even close to commit time, then you will need to send very small batches and you will need to commit after every batch. If you are indexing very quickly, you'll probably want those commits to be soft commits. You'll also want to have an autoCommit set up to do hard commits less frequently with openSearcher=false, or you'll run into the problem described at the link below. There is a good autoCommit example there: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup I've heard (but have not tested) that with the NOW default, large imports with the dataimporthandler will all have the timestamp of when the DIH request started, no matter what you do with autoCommit or autoSoftCommit. Thanks, Shawn
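The autoCommit setup Shawn refers to looks roughly like this in solrconfig.xml; the intervals are illustrative values, not numbers from this thread:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes segments to disk and truncates the transaction
       log; openSearcher=false means it does not change search visibility -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: cheap, makes newly indexed documents searchable -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>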
Re: Get the commit time of a document in Solr
Yes, the document will be searchable after it is committed. Although you can also do auto commits and commitWithin, which do not guarantee immediate visibility of index changes, you can do a hard commit any time you want to make a document searchable. -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 12:07 PM To: solr-user@lucene.apache.org Subject: Get the commit time of a document in Solr
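Purely as a sketch of the commitWithin option Jack mentions (the 60000 ms bound and the document are placeholders): posting

<add commitWithin="60000">
  <doc>
    <field name="id">doc-1</field>
  </doc>
</add>

to /update asks Solr to make the document searchable within 60 seconds even if no explicit commit follows; the SolrJ equivalent is server.add(doc, 60000). Note that commitWithin is only an upper bound on visibility, which is exactly why it doesn't answer the original question of precisely when a given document was committed.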
Get the commit time of a document in Solr
I'd like to know when a document is committed in Solr vs. the indexed time. For indexed time, I can add a field as: <field name="indexed_time" type="date" default="NOW" indexed="true" stored="true" />. If I have, say, 10 million docs indexed, I want to know the actual commit time of each document, which is what makes it searchable. The problem is to find the time when a document becomes searchable, which will be after it is committed (I don't want to do any soft commits). If there is a way to know this, please let me know, as I'd like to learn more details based on it.
Re: Regarding improving performance of the solr
Hi Prabu, It's difficult to tell what's going wrong without the full exception stack trace, including what the exception is. If you can provide the specific input that triggers the exception, that might also help. Steve On Sep 12, 2013, at 4:14 AM, prabu palanisamy pr...@serendio.com wrote: Hi, I tried to reindex Solr and I get a regular expression problem. The steps I followed are: I started the server with java -jar start.jar, then deleted the index:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>

I stopped the Solr server and changed indexed and stored to false for some of the fields in schema.xml:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="revision" type="sint" indexed="false" stored="false"/>
  <field name="user" type="string" indexed="false" stored="false"/>
  <field name="userId" type="int" indexed="false" stored="false"/>
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="pagerank" type="text_general" indexed="true" stored="false"/>
  <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/home/prabu/wikipedia_full_indexed_dump.xml" transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
      <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
      <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
      <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
      <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
      <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
      <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
      <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
      <field column="anchor_text" xpath="/mediawiki/page/anchor_text" stripHTML="true"/>
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
      <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
      <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
    </entity>
  </document>
</dataConfig>

I tried http://localhost:8983/solr/dataimport?command=full-import. At around 50,000 documents, I get an error related to the regular expression.
at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078) at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) at java.util.regex.Pattern$Branch.match(Pattern.java:4114) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078) at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) at java.util.regex.Pattern$Branch.match(Pattern.java:4114) I do not know how to proceed. Please help me out. Thanks and Regards Prabu On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson erickerick...@gmail.com wrote: Be a little careful when extrapolating from disk to memory. Any fields where you've set stored=true will put data in segment files with extensions .fdt and .fdx. These are the compressed verbatim copy of the data for stored fields and have very little
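An aside for later readers: the repeated Pattern$Loop/Pattern$Branch frames above are the classic signature of catastrophic backtracking (typically surfacing as a StackOverflowError) when a nested quantifier like ((\[\[.*Category:.*\]\]\W?)+) runs over very long article text. Purely as an illustration of the technique, not a verified fix for this data: bounding the unbounded .* with a character class that cannot cross ']' tames the backtracking, under the assumption that category link text never contains ']'.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegex {
    public static void main(String[] args) {
        // Original pattern: a greedy .* nested inside a repeated group
        // backtracks explosively on long inputs with many "[[...]]" runs.
        // Pattern risky = Pattern.compile("((\\[\\[.*Category:.*\\]\\]\\W?)+)");

        // Bounded alternative (assumption: no ']' inside the link text).
        Pattern safer = Pattern.compile("((\\[\\[[^\\]]*Category:[^\\]]*\\]\\]\\W?)+)");

        String text = "intro [[Category:History]] [[Category:Asia]] tail";
        Matcher m = safer.matcher(text);
        if (m.find()) {
            System.out.println(m.group(1)); // the matched run of category links
        }
    }
}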
Unable to connect to http://localhost:8983/solr/
Hi, I just have this issue that came out of nowhere. Everything was fine until all of a sudden the browser can't connect to this Solr. Here is the Solr log: INFO - 2013-09-12 20:07:58.142; org.eclipse.jetty.server.Server; jetty-8.1.8.v20121106 INFO - 2013-09-12 20:07:58.179; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor E:\Projects\G1\A1\trunk\solr_root\solrization\contexts at interval 0 INFO - 2013-09-12 20:07:58.191; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: E:\Projects\G1\A1\trunk\solr_root\solrization\contexts\solr-jetty-context.xml INFO - 2013-09-12 20:07:59.159; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet INFO - 2013-09-12 20:07:59.189; org.eclipse.jetty.server.handler.ContextHandler; started o.e.j.w.WebAppContext{/solr,file:/E:/Projects/G1/A1/trunk/solr_root/solrization/solr-webapp/webapp/},E:\Projects\G1\A1\trunk\solr_root\solrization/webapps/solr.war INFO - 2013-09-12 20:07:59.190; org.eclipse.jetty.server.handler.ContextHandler; started o.e.j.w.WebAppContext{/solr,file:/E:/Projects/G1/A1/trunk/solr_root/solrization/solr-webapp/webapp/},E:\Projects\G1\A1\trunk\solr_root\solrization/webapps/solr.war INFO - 2013-09-12 20:07:59.206; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() INFO - 2013-09-12 20:07:59.231; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx) INFO - 2013-09-12 20:07:59.231; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI) INFO - 2013-09-12 20:07:59.241; org.apache.solr.core.CoreContainer$Initializer; looking for solr config file: E:\Projects\G1\A1\trunk\solr_root\solrization\solr\solr.xml INFO - 2013-09-12 20:07:59.244; org.apache.solr.core.CoreContainer; New CoreContainer 24012447 INFO - 2013-09-12 20:07:59.244; org.apache.solr.core.CoreContainer; Loading CoreContainer using Solr Home: 'solr/' INFO - 2013-09-12 20:07:59.245; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/' INFO - 2013-09-12 20:07:59.483; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting socketTimeout to: 0 INFO - 2013-09-12 20:07:59.484; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting urlScheme to: http:// INFO - 2013-09-12 20:07:59.485; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting connTimeout to: 0 INFO - 2013-09-12 20:07:59.486; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxConnectionsPerHost to: 20 INFO - 2013-09-12 20:07:59.487; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting corePoolSize to: 0 INFO - 2013-09-12 20:07:59.488; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maximumPoolSize to: 2147483647 INFO - 2013-09-12 20:07:59.489; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxThreadIdleTime to: 5 INFO - 2013-09-12 20:07:59.490; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting sizeOfQueue to: -1 INFO - 2013-09-12 20:07:59.490; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting fairnessPolicy to: false INFO - 2013-09-12 20:07:59.498; org.apache.solr.client.solrj.impl.HttpClientUtil; Creating new http client, config:maxConnectionsPerHost=20&maxConnections=1&socketTimeout=0&connTimeout=0&retry=false INFO - 2013-09-12 20:07:59.671; org.apache.solr.core.CoreContainer; Registering Log Listener INFO - 2013-09-12 20:07:59.689;
org.apache.solr.core.CoreContainer; Creating SolrCore 'A1' using instanceDir: solr\A1 INFO - 2013-09-12 20:07:59.690; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr\A1\' INFO - 2013-09-12 20:07:59.724; org.apache.solr.core.SolrConfig; Adding specified lib dirs to ClassLoader INFO - 2013-09-12 20:07:59.726; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/solrization/lib/mysql-connector-java-5.1.25-bin.jar' to classloader INFO - 2013-09-12 20:07:59.727; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/dataimporthandler/lib/activation-1.1.jar' to classloader INFO - 2013-09-12 20:07:59.727; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/dataimporthandler/lib/mail-1.4.1.jar' to classloader INFO - 2013-09-12 20:07:59.728; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/dist/solr-dataimporthandler-4.3.0.jar' to classloader INFO - 2013-09-12 20:07:59.729; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.3.0.jar' to classloader INFO - 2013-09-12 20:07:59.729; org.apache.solr.core.SolrResourceLoader; Adding
Stop filter changes in Solr >= 4.4
While attempting to upgrade from Solr 4.3.0 to Solr 4.4.0 I ran into this exception: java.lang.IllegalArgumentException: enablePositionIncrements=false is not supported anymore as of Lucene 4.4 as it can create broken token streams, which led me to https://issues.apache.org/jira/browse/LUCENE-4963. I need to be able to match queries irrespective of intervening stopwords (which used to work with enablePositionIncrements=true). For instance: "foo of the bar" would find documents matching "foo bar", "foo of bar", and "foo of the bar". With this option deprecated in 4.4.0 I'm not clear on how to maintain the same functionality. The package javadoc adds: If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue", "sky" are indexed, with position(sky) = 3 + position(blue). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would not find that document because the position increment between "blue" and "sky" is only 1. If this behavior does not fit the application needs, the query parser needs to be configured to not take position increments into account when generating phrase queries. But there's no mention of how to actually configure the query parser to do this. Does anyone know how to deal with this issue as Solr moves toward 5.0? Crossposted from stackoverflow: http://stackoverflow.com/questions/18668376/solr-4-4-stopfilterfactory-and-enablepositionincrements
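For what it's worth, the "configure the query parser" hint in that javadoc maps to a Lucene-level switch; Solr's stock query parsers don't expose it as a request parameter, so the following is only a raw-Lucene sketch of what the javadoc means (field name and analyzer choice are illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PositionInsensitivePhrases {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer's default stop set removes "is" and "the",
        // leaving position gaps in the indexed token stream.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
        QueryParser qp = new QueryParser(Version.LUCENE_44, "contents", analyzer);
        // Ignore position increments (the holes left by stopwords) when
        // building phrase queries, so "blue sky" can match "blue is the sky".
        qp.setEnablePositionIncrements(false);
        Query q = qp.parse("\"blue sky\"");
        System.out.println(q); // phrase query with consecutive positions
    }
}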
Re: Get the commit time of a document in Solr
Solr admin exposes the time of the last commit. You can use that. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 12, 2013 3:22 PM, phanichaitanya pvempaty@gmail.com wrote: Apologies again, but here is another try: I want to make sure that documents that are indexed are committed within, say, an hour. I agree that passing commitWithin params and the like will ensure that, based on the time configurations we set. But I want to verify that a document is really committed within whatever time we set using commitWithin - essentially asking for proof that Solr commits within that time when we add the commitWithin parameter. That covers the commitWithin option you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that question has us puzzled. - Phani Chaitanya
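For later readers: the last-commit time Otis mentions is also visible outside the admin UI via the Luke request handler; the core name here is a placeholder, and the exact response layout should be verified against your version:

curl "http://localhost:8983/solr/collection1/admin/luke?numTerms=0&wt=json"

The index section of the response includes a lastModified value, i.e. the time of the last commit that changed the index. Note this is per index, not per document, so it still doesn't identify which documents a particular hard commit made visible.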
Solr 4.5 spatial search - distance and score
I'm trying to get the score by using a custom boost and also get the distance. I found David's code* to get it using Intersects, which I want to replace with {!geofilt} or geodist(). *David's code: https://issues.apache.org/jira/browse/SOLR-4255 He told me geodist() will be available again for this kind of field, which is a geohash type. So I'd like to know how it can be done today on 4.4 with {!geofilt}, and how it will be done on 4.5 using geodist(). Thanks in advance.
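For orientation, the kind of request sketched on SOLR-4255 looks roughly like the following; the field name, point, and distance are placeholders, and since the feature was still settling at the time of this thread, the exact parameter names and the release they land in should be double-checked against the issue:

q={!geofilt score=distance sfield=geo pt=-33.86,151.21 d=100}&fl=id,score&sort=score asc

The idea is that with score=distance on a recursive-prefix-tree (geohash-based) field, the score of each match is its distance from pt, so returning score in fl stands in for the old geodist() value.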