Re: admin-extra

2015-10-11 Thread Bill Au
admin-extra allows one to include additional links and/or information in
the Solr admin main page:

https://cwiki.apache.org/confluence/display/solr/Core-Specific+Tools

Bill

On Wed, Oct 7, 2015 at 5:40 PM, Upayavira  wrote:

> Do you use admin-extra within the admin UI?
>
> If so, please go to [1] and document your use case. The feature
> currently isn't implemented in the new admin UI, and without use-cases,
> it likely won't be - so if you want it in there, please help us
> understand how you use it!
>
> Thanks!
>
> Upayavira
>
> [1] https://issues.apache.org/jira/browse/SOLR-8140
>


solrcloud and core swapping

2015-08-28 Thread Bill Au
Is core swapping supported in SolrCloud?  If I have a 5-node SolrCloud
cluster and I do a core swap on the leader, will the core be swapped on the
other 4 nodes as well?  Or do I need to do a core swap on each node?

Bill


Re: Nested objects in Solr

2015-07-24 Thread Bill Au
What exactly do you mean by nested objects in Solr?  It would help if you
gave an example.  The Solr schema is flat as far as I know.

Bill

On Fri, Jul 24, 2015 at 9:24 AM, Rajesh rajesh.panneersel...@aspiresys.com
wrote:

 You can use nested entities like below.

 <document>
   <entity name="OuterEntity" pk="id"
           query="SELECT * FROM User">
     <field column="id" name="id" />
     <field column="name" name="name" />

     <entity name="InnerEntity" child="true"
             query="select * from subject">
     </entity>
   </entity>
 </document>




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nested-objects-in-Solr-tp4213212p4219039.html
 Sent from the Solr - User mailing list archive at Nabble.com.



DIH question: importing string containing comma-delimited list into a multiValued field

2015-07-17 Thread Bill Au
One of my database columns is a varchar containing a comma-delimited list of
values.  I would like to import these values into a multiValued field.  I
figure that I will need to write a ScriptTransformer to do that.  Is there
a better way?

Bill
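
For reference, DIH's RegexTransformer with splitBy can usually do this
without a custom ScriptTransformer.  A minimal sketch, assuming a column
named tags_csv feeding a multiValued field named tags (both names are
placeholders, not from the original config):

<entity name="item" transformer="RegexTransformer"
        query="select id, tags_csv from item">
  <field column="id" name="id" />
  <!-- splitBy splits the comma-delimited string into multiple values
       for the multiValued "tags" field -->
  <field column="tags_csv" name="tags" splitBy="," />
</entity>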


Re: SolrCloud indexing

2015-05-12 Thread Bill Au
Thanks for the reply.

Actually in our case we want the timestamp to be populated locally on each
node in the SolrCloud cluster.  We want to see if there is any delay in the
document being distributed within the cluster.  Just want to confirm that
the timestamp can be used for that purpose.

Bill

On Sat, May 9, 2015 at 11:37 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 5/9/2015 8:41 PM, Bill Au wrote:
  Is the behavior of document being indexed independently on each node in a
  SolrCloud cluster new in 5.x or is that true in 4.x also?
 
  If the document is indexed independently on each node, then if I query
 the
  document from each node directly, a timestamp could hold different values
  since the document is indexed independently, right?
 
   <field name="timestamp" type="date" indexed="true" stored="true"
   default="NOW" />

 SolrCloud has had that behavior from day one, when it was released in
 version 4.0.  You are correct that it can result in a different
 timestamp on each replica if the default comes from schema.xml.

 I am pretty sure that the solution for this problem is to set up an
 update processor chain that includes TimestampUpdateProcessorFactory to
 populate the timestamp field before the document is distributed to each
 replica.

 https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors

 Thanks,
 Shawn
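
For reference, a minimal solrconfig.xml sketch of the chain Shawn describes,
assuming the field is named timestamp (the chain name is a placeholder).
Placing TimestampUpdateProcessorFactory before the distributed processor
means the value is assigned once, before the document is forwarded to the
replicas, so all copies carry the same timestamp:

<updateRequestProcessorChain name="add-timestamp" default="true">
  <!-- sets "timestamp" only if the incoming document does not already have one -->
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>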




Re: SolrCloud indexing

2015-05-09 Thread Bill Au
Is the behavior of a document being indexed independently on each node in a
SolrCloud cluster new in 5.x or is that true in 4.x also?

If the document is indexed independently on each node, then if I query the
document from each node directly, a timestamp could hold different values
since the document is indexed independently, right?

<field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" />

Bill

On Fri, May 8, 2015 at 6:39 PM, Vincenzo D'Amore v.dam...@gmail.com wrote:

 I have just added a comment to the CWiki.
 Thanks again for your prompt answer Erick.

 Best,
 Vincenzo

 On Fri, May 8, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  bq: ...forwards the index notation to itself and any replicas...
 
  That's just odd phrasing.
 
  All that means is that the document is sent through the indexing process
  on the leader and all followers for a shard, and
  is indexed independently on each.
 
  This is as opposed to the old master/slave situation where the master
  indexed the doc, but the slave got the indexed
  version as part of a segment when it replicated.
 
  Could you add a comment to the CWiki calling the phrasing out? It
  really is a bit mysterious.
 
  Best,
  Erick
 
  On Thu, May 7, 2015 at 2:18 PM, Vincenzo D'Amore v.dam...@gmail.com
  wrote:
   Thanks Shawn.
  
   Just to make the picture more clear, I'm trying to understand why a 3
  node
   solrcloud cluster and a old style solr server take same time to index
  same
   documents.
  
   But in the wiki is written:
  
   If the machine is a leader, SolrCloud determines which shard the
 document
   should go to, forwards the document the leader for that shard, indexes
  the
   document for this shard, and *forwards the index notation to itself
 and
   any replicas*.
  
  
  
 
 https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
  
  
   Could you please explain what does it mean forwards the index
 notation
  ?
  
   On the other hand, on solrcloud I have 3 shards and 2 replicas for each
   shard. So, every node is indexing all the documents and this explains
 why
   solrcloud consumes same time compared to an old-style solr server.
  
  
  
   On Thu, May 7, 2015 at 3:08 PM, Shawn Heisey apa...@elyograg.org
  wrote:
  
   On 5/7/2015 3:04 AM, Vincenzo D'Amore wrote:
Thanks Erick. I'm not sure I got your answer.
   
I try to recap, when the raw document has to be indexed, it will be
forwarded to shard leader. Shard leader indexes the document for
 that
shard, and then forwards the indexed document to any replicas.
   
I want just be sure that when the raw document is forwarded from the
   leader
to the replicas it will be indexed only one time on the shard
 leader.
   From
what I understand replicas do not indexes, only the leader indexes.
  
   The document is indexed by all replicas.  There is no way to forward
 the
   indexed document, it can only forward the source document ... so each
   replica must index it independently.
  
   The old-style master-slave replication (which existed long before
   SolrCloud) copies the finished Lucene segments, so only the master
   actually does indexing.
  
   SolrCloud doesn't have a master, only multiple replicas, one of which
 is
   elected leader, and replication only comes into the picture if
 there's a
   serious problem and Solr determines that it can't use the transaction
   log to recover the index.
  
   Thanks,
   Shawn
  
  
  
  
   --
   Vincenzo D'Amore
   email: v.dam...@gmail.com
   skype: free.dev
   mobile: +39 349 8513251
 



 --
 Vincenzo D'Amore
 email: v.dam...@gmail.com
 skype: free.dev
 mobile: +39 349 8513251



timestamp field and atomic updates

2015-01-30 Thread Bill Au
I have a timestamp field in my schema to track when each doc was indexed:

<field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false" />

Recently, we have switched over to use atomic update instead of re-indexing
when we need to update a doc in the index.  It looks to me that the
timestamp field is not updated during an atomic update.  I have also looked
into TimestampUpdateProcessorFactory and it looks to me that won't help in
my case.

Is there anything within Solr that I can use to update the timestamp during
atomic update, or do I have to explicitly include the timestamp field as
part of the atomic update?

Bill
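
For reference, a sketch of the explicit approach mentioned at the end of the
question: include the timestamp field in the atomic update itself.  The id
and field values here are placeholders, and whether "NOW" is accepted as a
set value depends on the date field's date-math parsing, so treat that part
as an assumption:

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title" update="set">updated title</field>
    <!-- explicitly bump the timestamp as part of the atomic update -->
    <field name="timestamp" update="set">NOW</field>
  </doc>
</add>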


Solr atomic updates question

2014-07-08 Thread Bill Au
Solr atomic update allows for changing only one or more fields of a
document without having to re-index the entire document.  But what about
the case where I am sending in the entire document?  In that case the whole
document will be re-indexed anyway, right?  So I assume that there will be
no savings.  I am actually thinking that there will be a performance penalty
since atomic update requires Solr to retrieve all the fields first
before updating.

Bill


Re: Solr atomic updates question

2014-07-08 Thread Bill Au
Thanks for that under-the-cover explanation.

I am not sure what you mean by "mix atomic updates with regular field
values".  Can you give an example?

Thanks.

Bill


On Tue, Jul 8, 2014 at 6:56 PM, Steve McKay st...@b.abbies.us wrote:

 Atomic updates fetch the doc with RealTimeGet, apply the updates to the
 fetched doc, then reindex. Whether you use atomic updates or send the
 entire doc to Solr, it has to deleteById then add. The perf difference
 between the atomic updates and normal updates is likely minimal.

 Atomic updates are for when you have changes and want to apply them to a
 document without affecting the other fields. A regular add will replace an
 existing document completely. AFAIK Solr will let you mix atomic updates
 with regular field values, but I don't think it's a good idea.

 Steve

 On Jul 8, 2014, at 5:30 PM, Bill Au bill.w...@gmail.com wrote:

  Solr atomic update allows for changing only one or more fields of a
  document without having to re-index the entire document.  But what about
  the case where I am sending in the entire document?  In that case the
 whole
  document will be re-indexed anyway, right?  So I assume that there will
 be
  no saving.  I am actually thinking that there will be a performance
 penalty
  since atomic update requires Solr to first retrieve all the fields first
  before updating.
 
  Bill




Re: Solr atomic updates question

2014-07-08 Thread Bill Au
I see what you mean now.  Thanks for the example.  It makes things very
clear.

I have been thinking about the explanation in the original response more.
 According to that, both a regular update with the entire doc and an atomic update
involve a delete by id followed by an add.  But the Solr reference doc
(
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents)
says that:

The first is *atomic updates*. This approach allows changing only one or
more fields of a document without having to re-index the entire document.

But since Solr is doing a delete by id followed by an add, does "without
having to re-index the entire document" apply to the client side only?  On
the server side the add means that the entire document is re-indexed, right?

Bill


On Tue, Jul 8, 2014 at 7:32 PM, Steve McKay st...@b.abbies.us wrote:

 Take a look at this update XML:

 <add>
   <doc>
     <field name="employeeId">05991</field>
     <field name="employeeName">Steve McKay</field>
     <field name="office" update="set">Walla Walla</field>
     <field name="skills" update="add">Python</field>
   </doc>
 </add>

 Let's say employeeId is the key. If there's a fourth field, salary, on the
 existing doc, should it be deleted or retained? With this update it will
 obviously be deleted:

 <add>
   <doc>
     <field name="employeeId">05991</field>
     <field name="employeeName">Steve McKay</field>
   </doc>
 </add>

 With this XML it will be retained:

 <add>
   <doc>
     <field name="employeeId">05991</field>
     <field name="office" update="set">Walla Walla</field>
     <field name="skills" update="add">Python</field>
   </doc>
 </add>

 I'm not willing to guess what will happen in the case where non-atomic and
 atomic updates are present on the same add because I haven't looked at that
 code since 4.0, but I think I could make a case for retaining salary or for
 discarding it. That by itself reeks--and it's also not well documented.
 Relying on iffy, poorly-documented behavior is asking for pain at upgrade
 time.

 Steve

 On Jul 8, 2014, at 7:02 PM, Bill Au bill.w...@gmail.com wrote:

  Thanks for that under-the-cover explanation.
 
  I am not sure what you mean by mix atomic updates with regular field
  values.  Can you give an example?
 
  Thanks.
 
  Bill
 
 
  On Tue, Jul 8, 2014 at 6:56 PM, Steve McKay st...@b.abbies.us wrote:
 
  Atomic updates fetch the doc with RealTimeGet, apply the updates to the
  fetched doc, then reindex. Whether you use atomic updates or send the
  entire doc to Solr, it has to deleteById then add. The perf difference
  between the atomic updates and normal updates is likely minimal.
 
  Atomic updates are for when you have changes and want to apply them to a
  document without affecting the other fields. A regular add will replace
 an
  existing document completely. AFAIK Solr will let you mix atomic updates
  with regular field values, but I don't think it's a good idea.
 
  Steve
 
  On Jul 8, 2014, at 5:30 PM, Bill Au bill.w...@gmail.com wrote:
 
  Solr atomic update allows for changing only one or more fields of a
  document without having to re-index the entire document.  But what
 about
  the case where I am sending in the entire document?  In that case the
  whole
  document will be re-indexed anyway, right?  So I assume that there will
  be
  no saving.  I am actually thinking that there will be a performance
  penalty
  since atomic update requires Solr to first retrieve all the fields
 first
  before updating.
 
  Bill
 
 




Re: question about DIH solr-data-config.xml and XML include

2014-01-14 Thread Bill Au
The problem is with the admin UI not following the XML include to find the
entities, so it found none.  DIH itself does support XML include, as I can
issue the DIH commands via HTTP on the included entities successfully.

Bill


On Mon, Jan 13, 2014 at 8:03 PM, Shawn Heisey s...@elyograg.org wrote:

 On 1/13/2014 3:31 PM, Bill Au wrote:

 But when I use XML include, the Entity pull-down in the Dataimport section
 of the Solr admin UI is empty.  I know that happens when there is a syntax
 error in solr-data-config.xml.  Does DIH supports XML include?  Also I am
 not seeing any error message in the log even if I set log level to ALL.
  Is
 there any way to get DIH to log what it thinks is wrong
 solr-data-cofig.xml?


 Paying it forward.  Someone on this mailing list helped me with this.  I
 have tested this DIH configand found that it works:

  <?xml version="1.0" encoding="UTF-8" ?>
  <dataConfig xmlns:xi="http://www.w3.org/2001/XInclude">
    <dataSource type="JdbcDataSource"
      driver="com.mysql.jdbc.Driver"
      encoding="UTF-8"
      url="jdbc:mysql://${dih.request.dbHost}:3306/${dih.request.dbSchema}?zeroDateTimeBehavior=convertToNull"
      batchSize="-1"
      user="REDACTED"
      password="REDACTED"/>
    <document>
      <xi:include href="test-dih-include.xml" />
    </document>
  </dataConfig>

  The xmlns:xi attribute in the outer tag makes it possible to use the
 xi:include syntax later.

 I make extensive use of this in my solrconfig.xml file. There's almost no
 actual config in that file, everything is included from other files.

 When you look at the config in the admin UI, you will not see the included
 text, you'll only see the xi:include tag.

 Thanks,
 Shawn
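
For completeness, the included file itself is not shown in the thread.  A
hypothetical test-dih-include.xml would simply hold the entity definitions
that belong inside <document>, for example:

<entity name="test" query="SELECT id, name FROM test_table">
  <field column="id" name="id" />
  <field column="name" name="name" />
</entity>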




question about DIH solr-data-config.xml and XML include

2014-01-13 Thread Bill Au
I am trying to simplify my Solr DIH configuration by using XML schema
include element.  Here is an example:

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE doc [
<!ENTITY dataSource SYSTEM "include_datasource.xml">
<!ENTITY entity1 SYSTEM "include_entity1.xml">
<!ENTITY entity2 SYSTEM "include_entity2.xml">
]>
<dataConfig>
&dataSource;
<document>
&entity1;
&entity2;
</document>
</dataConfig>


I know my included XML files are good because if I put them all into a
single XML file, DIH works as expected.

But when I use XML include, the Entity pull-down in the Dataimport section
of the Solr admin UI is empty.  I know that happens when there is a syntax
error in solr-data-config.xml.  Does DIH support XML include?  Also I am
not seeing any error message in the log even if I set log level to ALL.  Is
there any way to get DIH to log what it thinks is wrong in solr-data-config.xml?

BTW, the admin UI shows the DIH config as shown above.  So I suspect that
DIH isn't actually doing the XML include.

Bill


Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
I am using 4.3.  It is not related to the last_index_time bugs.  The
problem is caused by the fact that the parent entity and child entity use
different data sources (different databases on different hosts).

From the log output, I do see the delta query of the child entity being
executed correctly, finding all the rows that have been modified for the
child entity.  But it fails when it executes the parentDeltaQuery because
it is still using the database connection from the child entity (i.e.,
datasource ds2 in my example above).

Is there a way to tell DIH to use a different datasource in the
parentDeltaQuery?

Bill


On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Which version of Solr and what kind of SQL errors? There were some bugs in
 4.x related to last_index_time, but it does not sound related.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:

  Here is my DIH config:
 
  dataConfig
  dataSource name=ds1 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost1/dbname1 user=db_username1
  password=db_password1/
  dataSource name=ds2 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost2/dbname2 user=db_username2
  password=db_password2/
  document name=products
  entity name=item dataSource=ds1 query=select * from item
  field column=ID name=id /
  field column=NAME name=name /
 
  entity name=feature dataSource=ds2 query=select
  description from feature where item_id='${item.ID}'
  field name=features column=description /
  /entity
  /entity
  /document
  /dataConfig
 
  I am having trouble with delta import.  I think it is because the main
  entity and the sub-entity use different data source.  I have tried using
  both a delta query:
 
  deltaQuery=select id from item where id in (select item_id as id from
  feature where last_modified  '${dih.last_index_time}') or last_modified
  gt; '${dih.last_index_time}'
 
  and a parentDeltaQuery:
 
  entity name=feature pk=ITEM_ID query=select DESCRIPTION as features
  from FEATURE where ITEM_ID='${item.ID}' deltaQuery=select ITEM_ID from
  FEATURE where last_modified  '${dih.last_index_time}'
  parentDeltaQuery=select ID from item where ID=${feature.ITEM_ID}/
 
  I ended up with an SQL error for both.  Is there any way to make delta
  import work in my case?
 
  Bill
 



Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
Thanks for the suggestion, but that won't work, as I have a last_modified field
in both the parent entity and the child entity because I want delta import to
kick in when either changes.  That other approach has the same problem since
the parent and child entities use different datasources.

Bill


On Tue, Oct 8, 2013 at 10:18 AM, Dyer, James
james.d...@ingramcontent.comwrote:

 Bill,

 I do not believe there is any way to tell it to use a different datasource
 for the parent delta query.

 If you used this approach, would it solve your problem:
 http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ?

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Bill Au [mailto:bill.w...@gmail.com]
 Sent: Tuesday, October 08, 2013 8:50 AM
 To: solr-user@lucene.apache.org
 Subject: Re: problem with data import handler delta import due to use of
 multiple datasource

 I am using 4.3.  It is not related to bugs related to last_index_time.  The
 problem is caused by the fact that the parent entity and child entity use
 different data source (different databases on different hosts).

 From the log output, I do see the the delta query of the child entity being
 executed correctly and found all the rows that have been modified for the
 child entity.  But it fails when it executed the parentDeltaQuery because
 it is still using the database connection from the child entity (ie
 datasource ds2 in my example above).

 Is there a way to tell DIH to use a different datasource in the
 parentDeltaQuery?

 Bill


 On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
 arafa...@gmail.comwrote:

  Which version of Solr and what kind of SQL errors? There were some bugs
 in
  4.x related to last_index_time, but it does not sound related.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:
 
   Here is my DIH config:
  
   dataConfig
   dataSource name=ds1 type=JdbcDataSource
  driver=com.mysql.jdbc.Driver
   url=jdbc:mysql://localhost1/dbname1 user=db_username1
   password=db_password1/
   dataSource name=ds2 type=JdbcDataSource
  driver=com.mysql.jdbc.Driver
   url=jdbc:mysql://localhost2/dbname2 user=db_username2
   password=db_password2/
   document name=products
   entity name=item dataSource=ds1 query=select * from
 item
   field column=ID name=id /
   field column=NAME name=name /
  
   entity name=feature dataSource=ds2 query=select
   description from feature where item_id='${item.ID}'
   field name=features column=description /
   /entity
   /entity
   /document
   /dataConfig
  
   I am having trouble with delta import.  I think it is because the main
   entity and the sub-entity use different data source.  I have tried
 using
   both a delta query:
  
   deltaQuery=select id from item where id in (select item_id as id from
   feature where last_modified  '${dih.last_index_time}') or
 last_modified
   gt; '${dih.last_index_time}'
  
   and a parentDeltaQuery:
  
   entity name=feature pk=ITEM_ID query=select DESCRIPTION as
 features
   from FEATURE where ITEM_ID='${item.ID}' deltaQuery=select ITEM_ID
 from
   FEATURE where last_modified  '${dih.last_index_time}'
   parentDeltaQuery=select ID from item where ID=${feature.ITEM_ID}/
  
   I ended up with an SQL error for both.  Is there any way to make delta
   import work in my case?
  
   Bill
  
 




problem with data import handler delta import due to use of multiple datasource

2013-10-05 Thread Bill Au
Here is my DIH config:

<dataConfig>
  <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
      url="jdbc:mysql://localhost1/dbname1" user="db_username1"
      password="db_password1"/>
  <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
      url="jdbc:mysql://localhost2/dbname2" user="db_username2"
      password="db_password2"/>
  <document name="products">
    <entity name="item" dataSource="ds1" query="select * from item">
      <field column="ID" name="id" />
      <field column="NAME" name="name" />

      <entity name="feature" dataSource="ds2" query="select
          description from feature where item_id='${item.ID}'">
        <field name="features" column="description" />
      </entity>
    </entity>
  </document>
</dataConfig>

I am having trouble with delta import.  I think it is because the main
entity and the sub-entity use different data sources.  I have tried using
both a delta query:

deltaQuery="select id from item where id in (select item_id as id from
feature where last_modified > '${dih.last_index_time}') or last_modified
> '${dih.last_index_time}'"

and a parentDeltaQuery:

<entity name="feature" pk="ITEM_ID" query="select DESCRIPTION as features
from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from
FEATURE where last_modified > '${dih.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>

I ended up with an SQL error for both.  Is there any way to make delta
import work in my case?

Bill
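
For reference, the delta-import-via-full-import approach that James suggests
in the replies above folds the delta condition into the main query, which is
then run as a full-import with clean=false.  A minimal sketch against the
item entity (the exact variable names follow the DIH wiki and should be
treated as assumptions):

<entity name="item" dataSource="ds1" pk="ID"
        query="select * from item
               where '${dataimporter.request.clean}' != 'false'
                  or last_modified > '${dataimporter.last_index_time}'">
  <field column="ID" name="id" />
  <field column="NAME" name="name" />
</entity>

It would then be triggered with command=full-import&clean=false, though as
noted above this does not by itself address the two-datasource problem.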


Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
I just double-checked my config.  We are using convertType=true.  Someone
else came up with the config so I am not sure why we are using it.  I will
try with it set to false to see if something else will break.  Thanks for
pointing that out.

This is my first time using DIH.  I really like what I have seen so far.

Bill


On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.  The
  raw debug response of DIH, it looks like the time porting of the datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
Setting convertType=false does solve the datetime issue, but there are now
other columns that were working before that are not working now.  Since I
have already done some research into the datetime-to-date issue and have not
been able to find a solution, I think I will have to keep convertType set to
false and deal with the other column types that are not working.

Thanks for your help.

Bill
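
For reference, the workaround being discussed is just the convertType flag on
the DIH data source definition; a minimal sketch (driver, url, and
credentials are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            user="db_user" password="db_password"
            convertType="false" />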


On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:

 I just double check my config.  We are using convertType=true.  Someone
 else came up with the config so I am not sure why we are using it.  I will
 try with it set to false to see if something else will break.  Thanks for
 pointing that out.

 This is my first time using DIH.  I really like what I have seen so far.

 Bill


 On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using
 SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.  The
  raw debug response of DIH, it looks like the time porting of the
 datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?
  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.





Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
So disabling convertType does provide a workaround for my problem with the
datetime column.  But the problem still exists when convertType is enabled
because DIH is not doing the conversion correctly for a Solr date field.
 A Solr date field does have a time portion but java.sql.Date does not.  So
DIH should not be calling ResultSet.getDate() for a solr date field.  It
should really be calling ResultSet.getTimestamp() instead.  Is the fix this
simple?  Am I missing anything?

If the fix is this simple I can submit and commit a patch to DIH.

Bill


On Sat, Jun 29, 2013 at 12:13 PM, Bill Au bill.w...@gmail.com wrote:

 Setting convertType=false does solve the datetime issue.  But there are
 now other columns that were working before but not working now.  Since I
 have already done some research into the datetime to date issue and not
 been able to find a solution, I think I will have to keep convertType set
 to false and deal with the other column type that are not working now.

 Thanks for your help.

 Bill


 On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:

 I just double check my config.  We are using convertType=true.  Someone
 else came up with the config so I am not sure why we are using it.  I will
 try with it set to false to see if something else will break.  Thanks for
 pointing that out.

 This is my first time using DIH.  I really like what I have seen so far.

 Bill


 On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using
 SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.
  The
  raw debug response of DIH, it looks like the time porting of the
 datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is
 using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?
  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.






Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
https://issues.apache.org/jira/browse/SOLR-4978


On Sat, Jun 29, 2013 at 2:33 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Yes we need to use getTimestamp instead of getDate. Please create an issue.

 On Sat, Jun 29, 2013 at 11:48 PM, Bill Au bill.w...@gmail.com wrote:
  So disabling convertType does provide a workaround for my problem with
  datetime column.  But the problem still exists when convertType is
 enabled
  because DIH is not doing the conversion correctly for a solr date field.
   Solr date field does have a time portion but java.sql.Date does not.  So
  DIH should not be calling ResultSet.getDate() for a solr date field.  It
  should really be calling ResultSet.getTimestamp() instead.  Is the fix
 this
  simple?  Am I missing anything?
 
  If the fix is this simple I can submit and commit a patch to DIH.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 12:13 PM, Bill Au bill.w...@gmail.com wrote:
 
  Setting convertType=false does solve the datetime issue.  But there are
  now other columns that were working before but not working now.  Since I
  have already done some research into the datetime to date issue and not
  been able to find a solution, I think I will have to keep convertType
 set
  to false and deal with the other column type that are not working now.
 
  Thanks for your help.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:
 
  I just double check my config.  We are using convertType=true.  Someone
  else came up with the config so I am not sure why we are using it.  I
 will
  try with it set to false to see if something else will break.  Thanks
 for
  pointing that out.
 
  This is my first time using DIH.  I really like what I have seen so
 far.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  The default in JdbcDataSource is to use ResultSet.getObject which
  returns the underlying database's type. The type specific methods in
  ResultSet are not invoked unless you are using convertType=true.
 
  Is MySQL actually returning java.sql.Timestamp objects?
 
  On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
   I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
  running
   into a very strange problem where data from a datetime column being
   imported with the right date but the time is 00:00:00.  I tried
 using
  SQL
   DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.
   The
   raw debug response of DIH, it looks like the time porting of the
  datetime
   data is already 00:00:00 in Solr jdbc query result.
  
   So I looked at the source code of DIH JdbcDataSource class.  It is
  using
   java.sql.ResultSet and its getDate() method to handle date column.
  The
   getDate() method returns java.sql.Date.  The java api doc for
  java.sql.Date
  
   http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
  
   states that:
  
   To conform with the definition of SQL DATE, the millisecond values
  wrapped
   by a java.sql.Date instance must be 'normalized' by setting the
 hours,
   minutes, seconds, and milliseconds to zero in the particular time
 zone
  with
   which the instance is associated.
  
   This seems to be describing exactly my problem.  Has anyone else
 notice
   this problem?  Has anyone use DIH to index SQL datetime
 successfully?
   If
   so can you send me the relevant portion of the DIH config?
  
   Bill
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 



 --
 Regards,
 Shalin Shekhar Mangar.



Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-28 Thread Bill Au
I am running Solr 4.3.0, using DIH to import data from MySQL.  I am running
into a very strange problem where data from a datetime column is being
imported with the right date but the time is 00:00:00.  I tried using SQL
DATE_FORMAT() and also the DIH DateFormatTransformer but nothing works.  In
the raw debug response of DIH, it looks like the time portion of the datetime
data is already 00:00:00 in Solr's jdbc query result.

So I looked at the source code of the DIH JdbcDataSource class.  It is using
java.sql.ResultSet and its getDate() method to handle date columns.  The
getDate() method returns java.sql.Date.  The java api doc for java.sql.Date

http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html

states that:

To conform with the definition of SQL DATE, the millisecond values wrapped
by a java.sql.Date instance must be 'normalized' by setting the hours,
minutes, seconds, and milliseconds to zero in the particular time zone with
which the instance is associated.

This seems to be describing exactly my problem.  Has anyone else noticed
this problem?  Has anyone used DIH to index SQL datetime successfully?  If
so, can you send me the relevant portion of the DIH config?

Bill


SolrCloud excluding certain files in conf from zookeeper

2013-06-14 Thread Bill Au
When using SolrCloud, is it possible to exclude certain files in the conf
directory from being loaded into Zookeeper?

We are keeping our own Solr-related config files in the conf directory, and
they are actually different for each node.  Right now the copy in Zookeeper is
overriding the local copy.

Bill


question about the file data/index.properties

2013-05-15 Thread Bill Au
I am running 2 separate 4.3 SolrCloud clusters.  On one of them I noticed
the file data/index.properties on the replica nodes, where the index
directory is named index.<value of the index property in index.properties>.
 On the other cluster, the index directory is just named index.

Under what condition is index.properties created?  I am trying to
understand why there is a difference between my 2 SolrCloud clusters.

Bill


Re: question about the file data/index.properties

2013-05-15 Thread Bill Au
Thanks for that info.  So besides the two that I have already seen, are
there any more ways that the index directory can be named?  I am working on
some home-grown administration scripts which need to know the name of the
index directory.

Bill


On Wed, May 15, 2013 at 7:13 PM, Mark Miller markrmil...@gmail.com wrote:

 It's fairly meaningless from a user perspective, but it happens when an
 index is replicated that cannot be simply merged with the existing index
 files and needs a new directory.

 - Mark

 On May 15, 2013, at 5:38 PM, Bill Au bill.w...@gmail.com wrote:

  I am running 2 separate 4.3 SolrCloud clusters.  On one of them I noticed
  the file data/index.properties on the replica nodes where the index
  directory is named index.value of index property in index.properties.
  On the other cluster, the index directory is just named index.
 
  Under what condition is index.properties created?  I am trying to
  understand why there is a difference between my 2 SolrCloud clusters.
 
  Bill




Best practice for rebuild index in SolrCloud

2013-04-08 Thread Bill Au
We are using SolrCloud for replication and dynamic scaling but not
distribution, so we are only using a single shard.  From time to time we
make changes to the index schema that require rebuilding the index.

Should I treat the rebuilding as just any other index operation?  It seems
to me it would be better if I could somehow take a node offline and rebuild
the index there, then put it back online and let the new index be
replicated from there.  But I am not sure how to do the latter.

Bill


multiple SolrCloud clusters with one ZooKeeper ensemble?

2013-03-28 Thread Bill Au
Can I use a single ZooKeeper ensemble for multiple SolrCloud clusters or
would each SolrCloud cluster requires its own ZooKeeper ensemble?

Bill


Re: multiple SolrCloud clusters with one ZooKeeper ensemble?

2013-03-28 Thread Bill Au
Thanks.

Now I have to go back and re-read the entire SolrCloud Wiki to see what
other info I missed and/or forgot.

Bill


On Thu, Mar 28, 2013 at 12:48 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Can I use a single ZooKeeper ensemble for multiple SolrCloud clusters or
 : would each SolrCloud cluster requires its own ZooKeeper ensemble?

 https://wiki.apache.org/solr/SolrCloud#Zookeeper_chroot

 (I'm going to FAQ this)


 -Hoss
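
For reference, the chroot approach from that FAQ boils down to giving each
SolrCloud cluster its own path suffix on the shared ensemble's zkHost string;
host names and paths below are placeholders:

-DzkHost=zk1:2181,zk2:2181,zk3:2181/solr-clusterA   (cluster A)
-DzkHost=zk1:2181,zk2:2181,zk3:2181/solr-clusterB   (cluster B)

The chroot path generally has to be created in ZooKeeper before Solr can use it.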



Solr 4.1 SolrCloud with 1 shard and 3 replicas

2013-03-27 Thread Bill Au
I am running Solr 4.1.  I have set up SolrCloud with 1 leader and 3
replicas, 4 nodes total.  Do query requests sent to a node only query the
replica on that node, or are they load-balanced to the entire cluster?

Bill


Re: Solr 4.1 SolrCloud with 1 shard and 3 replicas

2013-03-27 Thread Bill Au
Thanks for the info, Erik.

I had gone through the tutorial in the SolrCloud Wiki and verified that
queries are load balanced in the two-shard cluster with shard replicas
setup.  I was wondering if I needed to explicitly specify distrib=false in my
single-shard setup.  Glad to see that Solr is doing the right thing by
default in my case.

Bill

P.S. Thanks for a very informative webinar.  I am going to recommend it to my
co-workers once the recording is available.


On Wed, Mar 27, 2013 at 3:26 PM, Erik Hatcher erik.hatc...@gmail.comwrote:

 Requests to a node in your example would be answered by that node (no need
 to distribute; it's a single shard system) and it would not internally be
 routed otherwise either.  Ultimately it is up to the client to load-balance
 the initial requests into a SolrCloud cluster, but internally in a
 multi-shard distributed search request it will be load balanced beyond that
 initial node.

 CloudSolrServer does load balance, so if you're using that client it'll
 randomly pick a shard to send to from the client-side.  If you're using
 some other mechanism, it'll request directly to whatever node that you've
 specified directly for that initial request.

 Erik

 p.s. Thanks for attending the webinar, Bill!   I saw your name as one of
 the question askers.  Hopefully all that stuff I made up is close to the
 truth :)



 On Mar 27, 2013, at 14:51 , Bill Au wrote:

  I am running Solr 4.1.  I have set up SolrCloud with 1 leader and 3
  replicas, 4 nodes total.  Do query requests send to a node only query the
  replica on that node, or are they load-balanced to the entire cluster?
 
  Bill




Re: [ANNOUNCE] Apache Solr 4.2 released

2013-03-17 Thread Bill Au
The "Upgrading from Solr 4.1.0" section of the 4.2.0 CHANGES.txt says:

(No upgrade instructions yet)

To me that's not the same as "no need to do anything".  I think the doc
should be updated with either specific instructions or a statement that 4.2.0
is backward compatible with 4.1.0 so there is no need to do anything.

Bill


On Sun, Mar 17, 2013 at 6:12 AM, sandeep a sundipk...@gmail.com wrote:

 Hi , please let me know how to upgrade solr from 4.1.0 to 4.2.0.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/ANNOUNCE-Apache-Solr-4-2-released-tp4046510p4048201.html
 Sent from the Solr - User mailing list archive at Nabble.com.


multiple facet.prefix for the same facet.field VS multiple facet.query

2013-02-21 Thread Bill Au
There have been requests for supporting multiple facet.prefix for the same
facet.field.  There is an open JIRA with a patch:

https://issues.apache.org/jira/browse/SOLR-1351

Wouldn't using multiple facet.query achieve the same result?  I mean
something like:

facet.query=lastName:A*&facet.query=lastName:B*&facet.query=lastName:C*


Bill


Re: multiple facet.prefix for the same facet.field VS multiple facet.query

2013-02-21 Thread Bill Au
Never mind.  I just realized the difference between the two.  Sorry for the
noise.

Bill
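
For readers who hit the same question: a facet.query returns a single count
per query (one number covering everything that matches lastName:A*), while
facet.prefix constrains the term enumeration on facet.field and still returns
a separate count for each matching term, e.g.
facet=true&facet.field=lastName&facet.prefix=A gives per-name counts.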


On Thu, Feb 21, 2013 at 8:42 AM, Bill Au bill.w...@gmail.com wrote:

 There have been requests for supporting multiple facet.prefix for the same
 facet.field.  There is an open JIRA with a patch:

 https://issues.apache.org/jira/browse/SOLR-1351

 Wouldn't using multiple facet.query achieve the same result?  I mean
 something like:

  facet.query=lastName:A*&facet.query=lastName:B*&facet.query=lastName:C*


 Bill




Re: Solr 4.0 SolrCloud with AWS Auto Scaling

2013-01-04 Thread Bill Au
Thanks for pointing me to Solr's Zookeeper servlet.  I will look at the
source to see how I can use it to fulfill my needs.

Bill
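
For reference, a sketch of polling that servlet from a script; the path and
parameters below are what the 4.x admin UI uses and should be treated as
assumptions:

curl "http://localhost:8983/solr/zookeeper?detail=true&path=/clusterstate.json"

The returned JSON embeds clusterstate.json, in which the replica for the new
node should show "state":"active" before it is added to the load balancer.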


On Thu, Jan 3, 2013 at 6:43 PM, Mark Miller markrmil...@gmail.com wrote:

 Technically, you want to make sure zookeeper reports the node as live and
 active.

 You could use the same api that the UI uses for that - the
 localhost:port/solr/zookeeper (I think?) servlet.

 If you can't reach it for a node, it's obviously down - if you can reach
 it, parse the json and see if it notes the node as active?

 Not quite as clean as you'd like prob. Might be worth a JIRA issue to look
 at further options.

 - Mark

 On Jan 3, 2013, at 5:54 PM, Bill Au bill.w...@gmail.com wrote:

  Thanks, Mark.
 
  That does remove the node.  And it seems to do so permanently.  Even
 when I
  restart Solr after unloading, it does not join the SolrCloud cluster.
  And
  I can get it to re-join the cluster by creating the core.
 
  Anyone know if there is an API to determine the state of a node.  When
 AWS
  auto scaling add a new node, I need to make sure it has before active
  before I enable it in the load balancer.
 
  Bill
 
 
 
 
  On Thu, Jan 3, 2013 at 9:10 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  http://wiki.apache.org/solr/CoreAdmin#UNLOAD
 
  - Mark
 
  On Jan 3, 2013, at 9:06 AM, Bill Au bill.w...@gmail.com wrote:
 
  Mark,
 What do you mean by unload them?
 
  I am using an AWS load balancer with my auto scaling group in stead of
  using Solr's built-in load balancer.  I am no sharding my index.  I am
  using SolrCloud for replication only.  I am doing local search on each
  instance and sending all updates to the shard leader directly because I
  want to minimize traffic between nodes during search and update
 
  Bill
 
 
  On Wed, Jan 2, 2013 at 6:47 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
 
  On Jan 2, 2013, at 5:51 PM, Bill Au bill.w...@gmail.com wrote:
 
  Is anyone running Solr 4.0 SolrCloud with AWS auto scaling?
 
  My concern is that as AWS auto scaling add and remove instances to
  SolrCloud, the number of nodes in SolrCloud Zookeeper config will
 grow
  indefinitely as removed instances will never be used again.  AWS auto
  scaling will keep on adding new instances, and there is no way to
  remove
  them from Zookeeper, right?
 
  You can unload them and that removes them.
 
  What's the effect of have all these phantom
  nodes?
 
  Unless they are only replicas, they would need to be removed.
 
  Also, unless you are using elastic ips,
  https://issues.apache.org/jira/browse/SOLR-4078 may be of interest.
 
  - Mark
 
 




Re: Solr 4.0 SolrCloud with AWS Auto Scaling

2013-01-03 Thread Bill Au
With AWS auto scaling, one can specify a minimum number of instances for an
auto scaling group.  So there should never be an insufficient number of
replicas.  One can also specify a termination policy so that the newly
added nodes are removed first.

But with SolrCloud, as long as there are enough replicas, there is no wrong
node to remove, right?

AWS Beanstalk seems to be a wrapper for AWS auto scaling and other AWS
elastic services.  I am not sure if it offers the fine-grained control
that you have when using auto scaling directly.


On Wed, Jan 2, 2013 at 11:14 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 We've considered using AWS Beanstalk (hmm, what's the difference between
 AWS auto scaling and elastic beanstalk? not sure.) for search-lucene.com ,
 but the idea of something adding and removing nodes seems scary.  The
 scariest part to me is automatic removal of wrong nodes that ends up in
 data loss or insufficient number of replicas.

 But if somebody has done thing and has written up a how-to, I'd love to see
 it!

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Wed, Jan 2, 2013 at 5:51 PM, Bill Au bill.w...@gmail.com wrote:

  Is anyone running Solr 4.0 SolrCloud with AWS auto scaling?
 
  My concern is that as AWS auto scaling add and remove instances to
  SolrCloud, the number of nodes in SolrCloud Zookeeper config will grow
  indefinitely as removed instances will never be used again.  AWS auto
  scaling will keep on adding new instances, and there is no way to remove
  them from Zookeeper, right?  What's the effect of have all these phantom
  nodes?
 
  Bill
 



Re: Solr 4.0 SolrCloud with AWS Auto Scaling

2013-01-03 Thread Bill Au
Thanks, Mark.

That does remove the node.  And it seems to do so permanently.  Even when I
restart Solr after unloading, it does not join the SolrCloud cluster.  And
I can get it to re-join the cluster by creating the core.

Does anyone know if there is an API to determine the state of a node?  When
AWS auto scaling adds a new node, I need to make sure it is active
before I enable it in the load balancer.

Bill




On Thu, Jan 3, 2013 at 9:10 AM, Mark Miller markrmil...@gmail.com wrote:


 http://wiki.apache.org/solr/CoreAdmin#UNLOAD

 - Mark

 On Jan 3, 2013, at 9:06 AM, Bill Au bill.w...@gmail.com wrote:

  Mark,
  What do you mean by unload them?
 
  I am using an AWS load balancer with my auto scaling group in stead of
  using Solr's built-in load balancer.  I am no sharding my index.  I am
  using SolrCloud for replication only.  I am doing local search on each
  instance and sending all updates to the shard leader directly because I
  want to minimize traffic between nodes during search and update
 
  Bill
 
 
  On Wed, Jan 2, 2013 at 6:47 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Jan 2, 2013, at 5:51 PM, Bill Au bill.w...@gmail.com wrote:
 
  Is anyone running Solr 4.0 SolrCloud with AWS auto scaling?
 
  My concern is that as AWS auto scaling add and remove instances to
  SolrCloud, the number of nodes in SolrCloud Zookeeper config will grow
  indefinitely as removed instances will never be used again.  AWS auto
  scaling will keep on adding new instances, and there is no way to
 remove
  them from Zookeeper, right?
 
  You can unload them and that removes them.
 
  What's the effect of have all these phantom
  nodes?
 
  Unless they are only replicas, they would need to be removed.
 
  Also, unless you are using elastic ips,
  https://issues.apache.org/jira/browse/SOLR-4078 may be of interest.
 
  - Mark




Re: Solr PHP client

2012-12-14 Thread Bill Au
You need to configure and start Solr independent of any client you use.

Bill


On Fri, Dec 14, 2012 at 2:23 AM, Romita Saha
romita.s...@sg.panasonic.comwrote:

 Hi,

 Can anyone please guide me to use SolrPhpClient? The documents available
 are not clear. As to where to place SolrPhpClient?

 I have downloaded SolrPhpClient and have changed the following lines,
 specifying the path (where the files are present in my computer)


 require_once('/home/solr/SolrPhpClient/Apache/Solr/Document.php./Document.php');

 require_once('/home/solr/SolrPhpClient/Apache/Solr/Document.php./Response.php');

 After this I am unable to proceed. What and how should I index my
 documents now. How should I start my solr. Where to place the conf files.
 I see there are few html documents inside the folder
 SolrPhpClien/phpdocs.

 Could someone please help.

 Thanks and regards,
 Romita


Re: if I only need exact search, does frequency/score matter?

2012-12-14 Thread Bill Au
If your exact search returns more than one result, then by default they are
sorted by the score.

Bill


On Thu, Dec 13, 2012 at 11:41 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi

 If you are doing a pure boolean search - something matches or doesn't match
 and you don't care about scoring, relevancy, results order, then you can
 turn off frequency.
 See

 http://search-lucene.com/m/S27ja2IJStK1/turn+off+frequencysubj=Re+omitTermFreq+only+

 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm/index.html
 Search Analytics - http://sematext.com/search-analytics/index.html




 On Thu, Dec 13, 2012 at 7:07 PM, Jie Sun jsun5...@yahoo.com wrote:

  this is related to my previous post where I did not get any feedback
 yet...
 
  I am going through a practice to reduce the disk usage by solr index
 files.
 
  first step I took was to move some fields from stored to not stored. this
  reduced the size of .fdt by 30-60%.
 
  very promising... however I notice the .frq are taking almost as much
 disk
  space as the .fdt files.
 
  It seems .frq keeps the term frequency information.
 
  In our application, we only care about exact search (legal purpose), we
 do
  not care about search results in relevance (by score) at all.
 
  does this mean I can omit the freq? is it feasible in solr to turn the
  frequency off?
  I do need phrase search so I will have to keep the .prx which is also the
  huge files similar to .fdt files.
 
  Any suggestions or inputs?
  thanks
  Jie
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: setting hostPort for SolrCloud

2012-12-10 Thread Bill Au
Thanks for the information.

Bill


On Fri, Dec 7, 2012 at 3:04 PM, Mark Miller markrmil...@gmail.com wrote:

 Yup, solr.xml is pretty much required - especially if you want to use
 solrcloud.

 The only reason anything works without is for back compat.

 We are working towards removing the need for it, but's considered required
 these days.

 - Mark

 On Dec 7, 2012, at 11:04 AM, Bill Au bill.w...@gmail.com wrote:

  I actually was not using a solr.xml.  I am only using a single core.  I
 am
  using the default core name collection1.  I know for sure I will not be
  using more than a single core so I did not bother with having a solr.xml.
  Is that a bad thing?
 
  Everything works when I had tomcat config to run on port 8983.  But once
 I
  configure tomcat to use a different port, I notice that SolrCloud is
 still
  using port 8983 so it wasn't working.  I then tried adding
  -Djetty.port=8000 and -DhostPort=8000 to the environment variable
  JAVA_OPTS before running the tomcat start script bin/startup.sh.  But
  SolrCloud was still using 8983.  I ended up setting hostPort in solr.xml
  and got things working.
 
  It solr.xml is required, then I can just set the port for SolrCloud in
  there.  But I was hoping I did not have to bother with solr.xml at all.
  One less configuration file, one less thing that can go wrong.
 
  Bill
 
 
  On Wed, Dec 5, 2012 at 4:40 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  Be aware that you still have to setup tomcat to run Solr on the right
 port
  - and you also have to provide the port to Solr on startup. With jetty
 we
  do both with -Djetty.port - with Tomcat you have to setup Tomcat to run
 on
  the right port *and* tell Solr what that port is. By default that means
  also passing -Djetty.port - but you can change that to whatever you
 want in
  solr.xml (to hostPort or solr.port or whatever).
 
  The problem is that it's difficult for a webapp to find what ports it's
  running on - you can only do it when a request actually comes in to my
  knowledge.
 
  - Mark
 
  On Dec 5, 2012, at 1:05 PM, Bill Au bill.w...@gmail.com wrote:
 
  I am using tomcat.  In my tomcat start script I have tried setting
 system
  properties with both
 
  -Djetty.port=8080
 
  and
 
  -DhostPort=8080
 
  but neither changed the host port for SolrCloud.  It still uses the
  default
  8983.
 
  Bill
 
 
  On Wed, Dec 5, 2012 at 12:11 PM, Jack Krupansky 
 j...@basetechnology.com
  wrote:
 
  Solr runs in a container and the container controls the port. So, you
  need
  to tell the container which port to use.
 
  For example,
 
  java -Djetty.port=8180 -jar start.jar
 
  -- Jack Krupansky
 
  -Original Message- From: Bill Au
  Sent: Wednesday, December 05, 2012 10:30 AM
  To: solr-user@lucene.apache.org
  Subject: setting hostPort for SolrCloud
 
 
  Can hostPort for SolrCloud only be set in solr.xml?  I tried setting
 the
  system property hostPort and jetty.port on the Java command line but
  neither of them work.
 
  Bill
 
 
 




Re: PHP client

2012-12-07 Thread Bill Au
I have not used the pecl Solr client.  I have been using SolrPhpClient.  I
came across this patch for pecl when I was researching PHP clients for Solr
4.0.  SolrPhpClient has the same problem with 4.0 that this patch addresses.

Bill


On Fri, Dec 7, 2012 at 11:00 AM, Arkadi Colson ark...@smartbit.be wrote:

 Thanks for the info!

 Do you know if it's possible to use file uploads to Tika with this client?


 On 12/03/2012 03:56 PM, Bill Au wrote:

  https://bugs.php.net/bug.php?id=62332

 There is a fork with patches applied.


 On Mon, Dec 3, 2012 at 9:38 AM, Arkadi Colson ark...@smartbit.be wrote:

  Hi

 Anyone tested the pecl Solr Client in combination with SolrCloud? I seems
 to be broken since 4.0

 Best regard
 Arkadi





 --
 Met vriendelijke groeten

 Arkadi Colson

 Smartbit bvba . Hoogstraat 13 . 3670 Meeuwen
 T +32 11 64 08 80 . F +32 11 64 08 81




Re: setting hostPort for SolrCloud

2012-12-07 Thread Bill Au
 I actually was not using a solr.xml.  I am only using a single core.  I am
using the default core name collection1.  I know for sure I will not be
using more than a single core so I did not bother with having a solr.xml.
Is that a bad thing?

Everything worked when I had Tomcat configured to run on port 8983.  But once I
configured Tomcat to use a different port, I noticed that SolrCloud was still
using port 8983, so it wasn't working.  I then tried adding
-Djetty.port=8000 and -DhostPort=8000 to the environment variable
JAVA_OPTS before running the tomcat start script bin/startup.sh.  But
SolrCloud was still using 8983.  I ended up setting hostPort in solr.xml
and got things working.

If solr.xml is required, then I can just set the port for SolrCloud in
there.  But I was hoping I did not have to bother with solr.xml at all.
One less configuration file, one less thing that can go wrong.
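
For reference, the relevant bit of solr.xml would look roughly like this (port and
core name are just examples; the ${hostPort:...} syntax should still allow
overriding it with -DhostPort on the command line):

<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="${hostPort:8000}" hostContext="solr">
    <core name="collection1" instanceDir="." />
  </cores>
</solr>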

Bill


On Wed, Dec 5, 2012 at 4:40 PM, Mark Miller markrmil...@gmail.com wrote:

 Be aware that you still have to setup tomcat to run Solr on the right port
 - and you also have to provide the port to Solr on startup. With jetty we
 do both with -Djetty.port - with Tomcat you have to setup Tomcat to run on
 the right port *and* tell Solr what that port is. By default that means
 also passing -Djetty.port - but you can change that to whatever you want in
 solr.xml (to hostPort or solr.port or whatever).

 The problem is that it's difficult for a webapp to find what ports it's
 running on - you can only do it when a request actually comes in to my
 knowledge.

 - Mark

 On Dec 5, 2012, at 1:05 PM, Bill Au bill.w...@gmail.com wrote:

  I am using tomcat.  In my tomcat start script I have tried setting system
  properties with both
 
  -Djetty.port=8080
 
  and
 
  -DhostPort=8080
 
  but neither changed the host port for SolrCloud.  It still uses the
 default
  8983.
 
  Bill
 
 
  On Wed, Dec 5, 2012 at 12:11 PM, Jack Krupansky j...@basetechnology.com
 wrote:
 
  Solr runs in a container and the container controls the port. So, you
 need
  to tell the container which port to use.
 
  For example,
 
  java -Djetty.port=8180 -jar start.jar
 
  -- Jack Krupansky
 
  -Original Message- From: Bill Au
  Sent: Wednesday, December 05, 2012 10:30 AM
  To: solr-user@lucene.apache.org
  Subject: setting hostPort for SolrCloud
 
 
  Can hostPort for SolrCloud only be set in solr.xml?  I tried setting the
  system property hostPort and jetty.port on the Java command line but
  neither of them work.
 
  Bill
 




Re: PHP client

2012-12-07 Thread Bill Au
No news there.  But according to their roadmap, Solr 4.0 won't be fully
supported until Solarium 3.1.  There is no schedule for 3.1 yet as Solarium
3.0 first release candidate was released on Oct 4, 2012.

Bill


On Fri, Dec 7, 2012 at 2:01 PM, Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

 Any news on Solarium Project? Is the one I'm using with Solr 3.6!

 - Mensaje original -
 De: Bill Au bill.w...@gmail.com
 Para: solr-user@lucene.apache.org, Arkadi Colson ark...@smartbit.be
 Enviados: Viernes, 7 de Diciembre 2012 13:40:20
 Asunto: Re: PHP client

 I have not used the pecl Solr client.  I have been using SolrPhpClient.  I
 came across this patch for pecl when I was researching php client for Solr
 4.0.  SolrPhpClient has the same problem with 4.0 that this patch
 addresses.

 Bill


 On Fri, Dec 7, 2012 at 11:00 AM, Arkadi Colson ark...@smartbit.be wrote:

  Thanks for the info!
 
  Do you know if it's possible to use file uploads to Tika with this client?
 
 
  On 12/03/2012 03:56 PM, Bill Au wrote:
 
  https://bugs.php.net/bug.php?id=62332
 
  There is a fork with patches applied.
 
 
  On Mon, Dec 3, 2012 at 9:38 AM, Arkadi Colson ark...@smartbit.be
 wrote:
 
   Hi
 
  Anyone tested the pecl Solr Client in combination with SolrCloud? I
 seems
  to be broken since 4.0
 
  Best regard
  Arkadi
 
 
 
 
 
  --
  Met vriendelijke groeten
 
  Arkadi Colson
 
  Smartbit bvba . Hoogstraat 13 . 3670 Meeuwen
  T +32 11 64 08 80 . F +32 11 64 08 81
 
 


 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS
 INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci




Re: setting hostPort for SolrCloud

2012-12-05 Thread Bill Au
I am using tomcat.  In my tomcat start script I have tried setting system
properties with both

-Djetty.port=8080

and

-DhostPort=8080

but neither changed the host port for SolrCloud.  It still uses the default
8983.

Bill


On Wed, Dec 5, 2012 at 12:11 PM, Jack Krupansky j...@basetechnology.comwrote:

 Solr runs in a container and the container controls the port. So, you need
 to tell the container which port to use.

 For example,

 java -Djetty.port=8180 -jar start.jar

 -- Jack Krupansky

 -Original Message- From: Bill Au
 Sent: Wednesday, December 05, 2012 10:30 AM
 To: solr-user@lucene.apache.org
 Subject: setting hostPort for SolrCloud


 Can hostPort for SolrCloud only be set in solr.xml?  I tried setting the
 system property hostPort and jetty.port on the Java command line but
 neither of them work.

 Bill



Re: PHP client

2012-12-03 Thread Bill Au
https://bugs.php.net/bug.php?id=62332

There is a fork with patches applied.


On Mon, Dec 3, 2012 at 9:38 AM, Arkadi Colson ark...@smartbit.be wrote:

 Hi

 Anyone tested the pecl Solr Client in combination with SolrCloud? It seems
 to be broken since 4.0

 Best regard
 Arkadi




Re: consistency in SolrCloud replication

2012-11-16 Thread Bill Au
Yes, my original question is about search.  And Mark did answer it in his
original reply.  I am guessing that the replicas are updated sequentially,
so the newly added documents will be available in some replicas before
others.  I want to know where SolrCloud stands in terms of CAP.
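
For what it's worth, the realtime get and soft commit paths discussed in this
thread can be exercised roughly like this (host, collection and document id are
just examples):

# realtime get - returns the latest copy of a document without needing any commit
curl 'http://host1:8983/solr/collection1/get?id=doc1'

# soft commit - makes recent updates visible to normal searches
curl 'http://host1:8983/solr/collection1/update?softCommit=true'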

Bill


On Thu, Nov 15, 2012 at 10:31 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 I think Bill was asking about search
 I think the Q is whether the query hitting the shard where a doc was sent
 for indexing would see that doc even before that doc has been copied to
 replicas.

 I didn't test it, but I'd think the answer would be positive because of the
 xa log.

 Otis
 --
 Performance Monitoring - http://sematext.com/spm
 On Nov 15, 2012 11:30 AM, Mark Miller markrmil...@gmail.com wrote:

  It depends - no commit necessary for realtime get. Otherwise, yes, you
  would need to do at least a soft commit. That works the same way though -
  so if you make your update, then do a soft commit, you can be sure your
  next search will see the update on all the replicas. And with realtime
 get,
  of course no commit is necessary to see it.
 
  - Mark
 
  On Nov 15, 2012, at 10:40 AM, David Smiley (@MITRE.org) 
 dsmi...@mitre.org
  wrote:
 
   Mark Miller-3 wrote
   I'm talking about an update request. So if you make an update, when it
   returns, your next search will see the update, because it will be on
   all replicas.
  
   I presume this is only the case if (of course) the client also sent a
   commit.  So you're saying the commit call will not return unless all
   replicas have completed their commits.  Right?
  
   ~ David
  
  
  
   -
   Author:
  http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
   --
   View this message in context:
 
 http://lucene.472066.n3.nabble.com/consistency-in-SolrCloud-replication-tp4020379p4020518.html
   Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: consistency in SolrCloud replication

2012-11-15 Thread Bill Au
Thanks for the info, Mark.  By "a request won't return until it's affected
all replicas", are you referring to the update request or the query?

Bill


On Wed, Nov 14, 2012 at 7:57 PM, Mark Miller markrmil...@gmail.com wrote:

 It's included as soon as it has been indexed - though a request won't
 return until it's affected all replicas. Low latency eventual consistency.

 - Mark

 On Nov 14, 2012, at 5:47 PM, Bill Au bill.w...@gmail.com wrote:

  Will a newly indexed document included in search result in the shard
 leader
  as soon as it has been indexed locally or is it included in search result
  only after it has been forwarded to and indexed in all the replicas?
 
  Bill




Re: best practice for restarting the entire SolrCloud cluster

2012-11-08 Thread Bill Au
My replicas are actually on different machines, so they do come up.  The
problem I found is that since they can't get the leader, they just come up
but are not part of the cluster.  I can still do local searches with
distrib=false.  They do not retry to get the leader, so I have to restart
them after the leader has started in order to get them back into the
cluster.
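
For reference, the local search I mean is just hitting one node directly and
keeping the query on that node (host/port are examples):

curl 'http://host2:8983/solr/collection1/select?q=*:*&distrib=false&rows=0'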

Bill


On Thu, Nov 8, 2012 at 4:02 PM, Markus Jelsma markus.jel...@openindex.iowrote:

 Hi - i think you're seeing:
 https://issues.apache.org/jira/browse/SOLR-3993


 -Original message-
  From:Bill Au bill.w...@gmail.com
  Sent: Thu 08-Nov-2012 21:16
  To: solr-user@lucene.apache.org
  Subject: best practice for restarting the entire SolrCloud cluster
 
  I have a simple SolrCloud cluster with 4 Solr instances and 1 shard.  I
 can
  start and stop individual Solr instances without any problem.  But not
 when
  I have to shutdown all the Solr instances at the same time.
 
  After shutting down all the Solr instances, the first instance that
 starts
  up wait for all the replicas:
 
  INFO: Waiting until we see more replicas up: total=4 found=3
  timeoutin=169243
 
  In the meantime, any additional Solr instances that start up while the
  first one is waiting can't get the leader from zookeeper:
 
  SEVERE: Error getting leader from zk
  org.apache.solr.common.SolrException: Could not get leader props
 
  When the first Solr instance see all the replicas, it becomes the leader:
 
  INFO: Enough replicas found to continue.
  INFO: I may be the new leader - try and sync
 
  But it fails to sync with the instances that had failed to get the leader
  before:
 
  WARNING: PeerSync: core=collection1 url=http://host2:8983/solr exception
  talking to http://host2:8983/solr/collection1/, failed
  org.apache.solr.client.solrj.SolrServerException: Timeout occured while
  waiting response from server at: http://host2:8983/solr/collection1
 
  So I ended up with one for more replicas down after the restart.  I had
 to
  figure out which replica is down and restart them.
 
  What I also discovered is that if I start the first Solr instance and
 wait
  until it returns after the leaderVoteWait of 3 minutes, the rest of the
  Solr instance can be started without any problem since by then they can
 get
  the leader from zookeeper.
 
  Is there a better way to restart an entire SolrCloud cluster?
 
  Bill
 



Re: SolrCloud and distributed search

2012-10-29 Thread Bill Au
Do updates always start at the shard leader first?  If so one can save one
internal request by only sending updates to the shard leader.  I am
assuming that when the shard leader is down, SolrJ's CloudSolrServer is
smart enough to use the newly elected shard leader after a failover has
occurred.  Am I correct?
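
To make the question concrete, this is roughly the kind of indexing client I have
in mind - a minimal SolrJ sketch (ZooKeeper addresses and collection name are
placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer watches cluster state in ZooKeeper, so after a failover
        // it should pick up the newly elected leader on its own
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("title", "hello");
        server.add(doc);   // update is forwarded to the shard leader as discussed in this thread
        server.commit();
        server.shutdown();
    }
}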

Bill

On Fri, Oct 26, 2012 at 11:42 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 If you are going to use SolrJ, CloudSolrServer is even better than a
 round-robin load balancer for indexing, because it will send the documents
 straight to the shard leader (you save one internal request). If not,
 round-robin should be fine.

 Tomás

 On Fri, Oct 26, 2012 at 12:27 PM, Bill Au bill.w...@gmail.com wrote:

  I am thinking of using a load balancer for both indexing and querying to
  spread both the indexing and querying load across all the machines.
 
  Bill
 
  On Fri, Oct 26, 2012 at 10:48 AM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
   You should still use some kind of load balancer for searches, unless
 you
   use the CloudSolrServer (SolrJ) which includes the load balancing.
   Tomás
  
   On Fri, Oct 26, 2012 at 11:46 AM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
Yes, I think SolrCloud makes sense with a single shard for exactly
this reason, NRT and multiple replicas. I don't know how you'd get
 NRT
on multiple machines without it.
   
But do be aware of: https://issues.apache.org/jira/browse/SOLR-3971
A collection that is created with numShards=1 turns into a
numShards=2 collection after starting up a second core and not
specifying numShards.
   
Erick
   
On Fri, Oct 26, 2012 at 10:14 AM, Bill Au bill.w...@gmail.com
 wrote:
 I am currently using one master with multiple slaves so I do have
  high
 availability for searching now.

 My index does fit on a single machine and a single query does not
  take
too
 long to execute.  But I do want to take advantage of high
  availability
   of
 indexing and real time replication.  So it looks like I can set up
 SolrCloud with only 1 shard (ie numShards=1).

 In this case is SolrCloud still using distributed search behind the
 screen?  Will MoreLikeThis work?

 Does using SolrCloud with only 1 shard make any sense at all?

 Bill

 On Thu, Oct 25, 2012 at 4:29 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:

 It also provides high availability for indexing and searching.

 On Thu, Oct 25, 2012 at 4:43 PM, Bill Au bill.w...@gmail.com
  wrote:

  So I guess one would use SolrCloud for the same reasons as
   distributed
  search:
 
  When an index becomes too large to fit on a single system, or
  when a
 single
  query takes too long to execute.
 
  Bill
 
  On Thu, Oct 25, 2012 at 3:38 PM, Shawn Heisey 
 s...@elyograg.org
wrote:
 
   On 10/25/2012 1:29 PM, Bill Au wrote:
  
   Is SolrCloud using distributed search behind the scene?  Does
  it
have
  the
   same limitations (for example, doesn't support MoreLikeThis)
 distributed
   search has?
  
  
   Yes and yes.
  
  
 

   
  
 



Re: SolrCloud and distributed search

2012-10-26 Thread Bill Au
I am currently using one master with multiple slaves so I do have high
availability for searching now.

My index does fit on a single machine and a single query does not take too
long to execute.  But I do want to take advantage of high availability of
indexing and real time replication.  So it looks like I can set up
SolrCloud with only 1 shard (ie numShards=1).

In this case, is SolrCloud still using distributed search behind the
scenes?  Will MoreLikeThis work?

Does using SolrCloud with only 1 shard make any sense at all?
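
(For concreteness, I am picturing the usual wiki-style startup with a single
shard, e.g. a first node started with

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=1 -jar start.jar

and the other nodes started with just -DzkHost pointing at it, so every node
becomes a replica of that one shard.)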

Bill

On Thu, Oct 25, 2012 at 4:29 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 It also provides high availability for indexing and searching.

 On Thu, Oct 25, 2012 at 4:43 PM, Bill Au bill.w...@gmail.com wrote:

  So I guess one would use SolrCloud for the same reasons as distributed
  search:
 
  When an index becomes too large to fit on a single system, or when a
 single
  query takes too long to execute.
 
  Bill
 
  On Thu, Oct 25, 2012 at 3:38 PM, Shawn Heisey s...@elyograg.org wrote:
 
   On 10/25/2012 1:29 PM, Bill Au wrote:
  
   Is SolrCloud using distributed search behind the scene?  Does it have
  the
   same limitations (for example, doesn't support MoreLikeThis)
 distributed
   search has?
  
  
   Yes and yes.
  
  
 



Re: SolrCloud and distributed search

2012-10-26 Thread Bill Au
I am thinking of using a load balancer for both indexing and querying to
spread both the indexing and querying load across all the machines.

Bill

On Fri, Oct 26, 2012 at 10:48 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 You should still use some kind of load balancer for searches, unless you
 use the CloudSolrServer (SolrJ) which includes the load balancing.
 Tomás

 On Fri, Oct 26, 2012 at 11:46 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Yes, I think SolrCloud makes sense with a single shard for exactly
  this reason, NRT and multiple replicas. I don't know how you'd get NRT
  on multiple machines without it.
 
  But do be aware of: https://issues.apache.org/jira/browse/SOLR-3971
  A collection that is created with numShards=1 turns into a
  numShards=2 collection after starting up a second core and not
  specifying numShards.
 
  Erick
 
  On Fri, Oct 26, 2012 at 10:14 AM, Bill Au bill.w...@gmail.com wrote:
   I am currently using one master with multiple slaves so I do have high
   availability for searching now.
  
   My index does fit on a single machine and a single query does not take
  too
   long to execute.  But I do want to take advantage of high availability
 of
   indexing and real time replication.  So it looks like I can set up
   SolrCloud with only 1 shard (ie numShards=1).
  
   In this case is SolrCloud still using distributed search behind the
   screen?  Will MoreLikeThis work?
  
   Does using SolrCloud with only 1 shard make any sense at all?
  
   Bill
  
   On Thu, Oct 25, 2012 at 4:29 PM, Tomás Fernández Löbbe 
   tomasflo...@gmail.com wrote:
  
   It also provides high availability for indexing and searching.
  
   On Thu, Oct 25, 2012 at 4:43 PM, Bill Au bill.w...@gmail.com wrote:
  
So I guess one would use SolrCloud for the same reasons as
 distributed
search:
   
When an index becomes too large to fit on a single system, or when a
   single
query takes too long to execute.
   
Bill
   
On Thu, Oct 25, 2012 at 3:38 PM, Shawn Heisey s...@elyograg.org
  wrote:
   
 On 10/25/2012 1:29 PM, Bill Au wrote:

 Is SolrCloud using distributed search behind the scene?  Does it
  have
the
 same limitations (for example, doesn't support MoreLikeThis)
   distributed
 search has?


 Yes and yes.


   
  
 



Re: SolrCloud and distributed search

2012-10-25 Thread Bill Au
So I guess one would use SolrCloud for the same reasons as distributed
search:

When an index becomes too large to fit on a single system, or when a single
query takes too long to execute.

Bill

On Thu, Oct 25, 2012 at 3:38 PM, Shawn Heisey s...@elyograg.org wrote:

 On 10/25/2012 1:29 PM, Bill Au wrote:

 Is SolrCloud using distributed search behind the scene?  Does it have the
 same limitations (for example, doesn't support MoreLikeThis) distributed
 search has?


 Yes and yes.




Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-24 Thread Bill Au
I just filed a bug with all the details:

https://issues.apache.org/jira/browse/SOLR-3681

Bill

On Tue, Oct 23, 2012 at 2:47 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:

 : Just discovered that the replication admin REST API reports the correct
 : index version and generation:
 :
 : http://master_host:port/solr/replication?command=indexversion
 :
 : So is this a bug in the admin UI?

 Ya gotta be specific Bill: where in the admin UI do you think it's
 displaying the incorrect information?

 The Admin UI just adds pretty markup to information fetched from the
 admin handlers using javascript, so if there is a problem it's either in
 the admin handlers, or in the javascript possibly caching the olds values.

 Off the cuff, this reminds me of...

 https://issues.apache.org/jira/browse/SOLR-3681

 The root confusion there was that /admin/replication explicitly shows data
 about the commit point available for replication -- not the current commit
 point being searched on the master.

 So if you are seeing a disconnect, then perhaps it's just that same
 descrepency? -- allthough if you are *only* seeing a disconnect after a
 deleteByQuery (and not after document adds, or a deleteById) then that
 does smell fishy, and makes me wonder if there is a code path where the
 userData for the commits aren't being set properly.

 Can you file a bug with a unit test to reproduce?  or at the very list a
 set of specific commands to run against the solr example including what
 request handler URLs to hit (so there's no risk of confusion about the ui
 javascript behavior) to see the problem?


 -Hoss



Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-24 Thread Bill Au
Sorry, I copy/pasted the wrong link before.  Here is the correct one:

https://issues.apache.org/jira/browse/SOLR-3986

Bill

On Wed, Oct 24, 2012 at 10:26 AM, Bill Au bill.w...@gmail.com wrote:

 I just filed a bug with all the details:

 https://issues.apache.org/jira/browse/SOLR-3681

 Bill


 On Tue, Oct 23, 2012 at 2:47 PM, Chris Hostetter hossman_luc...@fucit.org
  wrote:

 : Just discovered that the replication admin REST API reports the correct
 : index version and generation:
 :
 : http://master_host:port/solr/replication?command=indexversion
 :
 : So is this a bug in the admin UI?

 Ya gotta be specific Bill: where in the admin UI do you think it's
 displaying the incorrect information?

 The Admin UI just adds pretty markup to information fetched from the
 admin handlers using javascript, so if there is a problem it's either in
 the admin handlers, or in the javascript possibly caching the olds values.

 Off the cuff, this reminds me of...

 https://issues.apache.org/jira/browse/SOLR-3681

 The root confusion there was that /admin/replication explicitly shows data
 about the commit point available for replication -- not the current commit
 point being searched on the master.

 So if you are seeing a disconnect, then perhaps it's just that same
 descrepency? -- allthough if you are *only* seeing a disconnect after a
 deleteByQuery (and not after document adds, or a deleteById) then that
 does smell fishy, and makes me wonder if there is a code path where the
 userData for the commits aren't being set properly.

 Can you file a bug with a unit test to reproduce?  or at the very list a
 set of specific commands to run against the solr example including what
 request handler URLs to hit (so there's no risk of confusion about the ui
 javascript behavior) to see the problem?


 -Hoss





Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-19 Thread Bill Au
It's not the browser cache.  I have tried reloading the admin page and
accessing the admin page from another machine.  Both show the older index
version and generation.  On the slave, replication did kick in and showed
the new index version and generation for the slave.  But the slave admin
page also shows the older index version and generation for the master.

If I do a second delete by query on the master, the master index generation
reported by the admin UI does go up by one on both the master and slave.  But
it is still one generation behind.
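
For the record, these are the checks I am comparing the admin UI against (host
names are examples):

curl 'http://master:8983/solr/replication?command=indexversion'
curl 'http://slave:8983/solr/replication?command=details'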

Bill

On Fri, Oct 19, 2012 at 7:09 AM, Erick Erickson erickerick...@gmail.comwrote:

 I wonder if you're getting hit by the browser caching the admin page and
 serving up the old version? What happens if you try from a different
 browser or purge the browser cache?

 Of course you have to refresh the master admin page, there's no
 automatic update but I assume you did that.

 Best
 Erick

 On Thu, Oct 18, 2012 at 1:59 PM, Bill Au bill.w...@gmail.com wrote:
  Just discovered that the replication admin REST API reports the correct
  index version and generation:
 
  http://master_host:port/solr/replication?command=indexversion
 
  So is this a bug in the admin UI?
 
  Bill
 
  On Thu, Oct 18, 2012 at 11:34 AM, Bill Au bill.w...@gmail.com wrote:
 
  I just upgraded to Solr 4.0.0.  I noticed that after a delete by query,
  the index version, generation, and size remain unchanged on the master
 even
  though the documents have been deleted (num docs changed and those
 deleted
  documents no longer show up in query responses).  But on the slave both
 the
  index version, generation, and size are updated.  So I though the master
  and slave were out of sync but in reality that is not true.
 
  What's going on here?
 
  Bill
 



Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-18 Thread Bill Au
I just upgraded to Solr 4.0.0.  I noticed that after a delete by query, the
index version, generation, and size remain unchanged on the master even
though the documents have been deleted (num docs changed and those deleted
documents no longer show up in query responses).  But on the slave both the
index version, generation, and size are updated.  So I thought the master
and slave were out of sync but in reality that is not true.

What's going on here?

Bill


Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-18 Thread Bill Au
Just discovered that the replication admin REST API reports the correct
index version and generation:

http://master_host:port/solr/replication?command=indexversion

So is this a bug in the admin UI?

Bill

On Thu, Oct 18, 2012 at 11:34 AM, Bill Au bill.w...@gmail.com wrote:

 I just upgraded to Solr 4.0.0.  I noticed that after a delete by query,
 the index version, generation, and size remain unchanged on the master even
 though the documents have been deleted (num docs changed and those deleted
 documents no longer show up in query responses).  But on the slave both the
 index version, generation, and size are updated.  So I though the master
 and slave were out of sync but in reality that is not true.

 What's going on here?

 Bill



Re: Debugging a Solr/Jetty Hung Process

2011-06-01 Thread Bill Au
Taking a thread dump will tell you what's going on.
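
With the Sun/Oracle JDK something along these lines works (pid is the Jetty/Solr
java process):

jstack -l <pid> > solr-threads.txt

or, if jstack is not available, send SIGQUIT and the dump shows up in the
process's stdout/stderr log:

kill -3 <pid>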

Bill

On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan chrisco...@plus3network.comwrote:

 About once a day a Solr/Jetty process gets hung on my server consuming 100%
 of one of the CPU's. Once this happens the server no longer responds to
 requests. I've looked through the logs to try and see if anything stands out
 but so far I've found nothing out of the ordinary.

 My current remedy is to log in and just kill the single processes that's
 hung. Once that happens everything goes back to normal and I'm good for a
 day or so.  I'm currently running the following:

 solr-jetty-1.4.0+ds1-1ubuntu1

 which is comprised of

 Solr 1.4.0
 Jetty 6.1.22
 on Unbuntu 10.10

 I'm pretty new to managing a Jetty/Solr instance so at this point I'm just
 looking for advice on how I should go about trouble shooting this problem.

 Chris


Re: very slow commits and overlapping commits

2011-05-27 Thread Bill Au
I managed to get a thread dump during a slow commit:

resin-tcp-connection-*:5062-129 Id=12721 in RUNNABLE total cpu
time=391530.ms user time=390620.ms
at java.lang.String.intern(Native Method)
at
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74)
at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:356)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:638)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:608)
at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:691)
at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:667)
at
org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:956)
at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:5207)
at
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4370)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2195)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2158)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2122)
at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:230)
at
org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:181)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
at
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
com.caucho.server.dispatch.FilterFilterChain.doFilter(FilterFilterChain.java:70)
at
com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:173)
at
com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:274)
at com.caucho.server.port.TcpConnection.run(TcpConnection.java:511)
at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:520)
at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
at java.lang.Thread.run(Thread.java:619)

It looks like Lucene's StringHelper is hardcoding the max size of the hash
table of SimpleStringInterner to 1024 and I might be hitting that limit,
causing an actual call to java.lang.String.intern().

I think I need to reduce the number of fields in my index.  Are there any
other things I can do to help in this case?
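
One thing I may experiment with (completely unverified - this assumes the
interner field is still a plain public static in the Lucene 2.9.x that ships
with Solr 1.4.1) is installing a larger interner very early at startup, before
any index is opened:

import org.apache.lucene.util.SimpleStringInterner;
import org.apache.lucene.util.StringHelper;

// bigger hash table so field/term interning stays in Lucene's own cache
// instead of falling through to java.lang.String.intern()
StringHelper.interner = new SimpleStringInterner(8192, 8);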

Bill

On Wed, May 25, 2011 at 11:28 AM, Bill Au bill.w...@gmail.com wrote:

 I am taking a snapshot after every commit.  From looking at the snapshots,
 it does not look like the delay in caused by segments merging because I am
 not seeing any large new segments after a commit.

 I still can't figure out why there is a 2 minutes gap between start
 commit and SolrDelectionPolicy.onCommit.  Will changing the deletion
 policy make any difference?  I am using the default deletion policy now.

 Bill

 2011/5/21 Erick Erickson erickerick...@gmail.com

 Well, committing less often is a possibility <g>. Here's what's probably
 happening. When you pass certain thresholds, segments are merged which can
 take quite some time.  How are you triggering commits? If it's external,
 think about using auto commit instead.

 Best
 Erick
 On May 20, 2011 6:04 PM, Bill Au bill.w...@gmail.com wrote:
  On my Solr 1.4.1 master I am doing commits regularly at a fixed
 interval.
 I
  noticed that from time to time commit will take longer than the commit
  interval, causing commits to overlap. Then things will get worse as
 commit
  will take longer and longer. Here is the logs for a long commit:
 
 
  [2011-05-18 23:47:30.071] start
 

 commit(optimize=false,waitFlush=false,waitSearcher=false,expungeDeletes=false)
  [2011-05-18 23:49:48.119] SolrDeletionPolicy.onCommit: commits:num=2
  [2011-05-18 23:49:48.119]
 

 commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpa,version=1247782702272,generation=249742,filenames=[_4dqu_2g.del,
  _4e66.tis, _4e3r.tis, _4e59.nrm, _4e68_1.del, _4e4n.prx, _4e4n.fnm,
  _4e67.fnm, _4e3r.frq, _4e3r.tii, _4e6d.fnm, _4e6c.prx, _4e68.fdx,
 _4e68.nrm,
  _4e6a.frq, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt,
 _4e0e.nrm,
  _4e4n.tis, _4e6e.fnm, _4e3r.prx, _4e66.fnm, _4e3r.nrm, _4e0e.prx

Re: very slow commits and overlapping commits

2011-05-25 Thread Bill Au
I am taking a snapshot after every commit.  From looking at the snapshots,
it does not look like the delay is caused by segment merging because I am
not seeing any large new segments after a commit.

I still can't figure out why there is a 2-minute gap between "start commit"
and SolrDeletionPolicy.onCommit.  Will changing the deletion policy make
any difference?  I am using the default deletion policy now.

Bill

2011/5/21 Erick Erickson erickerick...@gmail.com

 Well, committing less often is a possibility <g>. Here's what's probably
 happening. When you pass certain thresholds, segments are merged which can
 take quite some time.  How are you triggering commits? If it's external,
 think about using auto commit instead.

 Best
 Erick
 On May 20, 2011 6:04 PM, Bill Au bill.w...@gmail.com wrote:
  On my Solr 1.4.1 master I am doing commits regularly at a fixed interval.
 I
  noticed that from time to time commit will take longer than the commit
  interval, causing commits to overlap. Then things will get worse as
 commit
  will take longer and longer. Here is the logs for a long commit:
 
 
  [2011-05-18 23:47:30.071] start
 

 commit(optimize=false,waitFlush=false,waitSearcher=false,expungeDeletes=false)
  [2011-05-18 23:49:48.119] SolrDeletionPolicy.onCommit: commits:num=2
  [2011-05-18 23:49:48.119]
 

 commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpa,version=1247782702272,generation=249742,filenames=[_4dqu_2g.del,
  _4e66.tis, _4e3r.tis, _4e59.nrm, _4e68_1.del, _4e4n.prx, _4e4n.fnm,
  _4e67.fnm, _4e3r.frq, _4e3r.tii, _4e6d.fnm, _4e6c.prx, _4e68.fdx,
 _4e68.nrm,
  _4e6a.frq, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt,
 _4e0e.nrm,
  _4e4n.tis, _4e6e.fnm, _4e3r.prx, _4e66.fnm, _4e3r.nrm, _4e0e.prx,
 _4e4c.fdx,
  _4dx1.prx, _4e5v.frq, _4e3r.fdt, _4e4c.tis, _4e41_6.del, _4e6b.tis,
  _4e6b_1.del, _4e4y_3.del, _4e6b.tii, _4e3r.fdx, _4dx1.nrm, _4e4y.frq,
  _4e4c.fdt, _4e4c.tii, _4e6d.fdt, _4e5k.fnm, _4e41.fnm, _4e69.fnm,
 _4e67.fdt,
  _4e0e.tii, _4dty_h.del, _4e6b.fnm, _4e0e_h.del, _4e6d.fdx, _4e67.fdx,
  _4e0e.tis, _4e5v.nrm, _4dx1.fnm, _4e5v.tii, _4dqu.fdt, segments_5cpa,
  _4e5v.prx, _4dqu.fdx, _4e59.fnm, _4e6d.prx, _4e59_5.del, _4e4c.prx,
  _4e4c.nrm, _4e5k.prx, _4e66.fdx, _4dty.frq, _4e6c.frq, _4e5v.tis,
 _4e6e.tii,
  _4e66.fdt, _4e6b.fdx, _4e68.prx, _4e59.fdx, _4e6e.fdt, _4e41.prx,
 _4dx1.tii,
  _4dx1.fdt, _4e6b.fdt, _4e5v_4.del, _4e4n.fdt, _4e6e.fdx, _4dx1.fdx,
  _4e41.nrm, _4e4n.fdx, _4e6e.tis, _4e66.tii, _4e4c.fnm, _4e6b.prx,
 _4e67.prx,
  _4e0e.fnm, _4e4n.nrm, _4e67.nrm, _4e5k.nrm, _4e6a.prx, _4e68.fnm,
  _4e4c_4.del, _4dx1.tis, _4e6e.nrm, _4e59.tii, _4e68.tis, _4e67.frq,
  _4e3r.fnm, _4dty.nrm, _4e4y.prx, _4e6e.prx, _4dty.tis, _4e4y.tis,
 _4e6b.nrm,
  _4e6a.fdt, _4e4n.frq, _4e6d.frq, _4e59.fdt, _4e6a.fdx, _4e6a.fnm,
 _4dqu.tii,
  _4e41.tii, _4e67_1.del, _4e41.tis, _4dty.fdt, _4e69.tis, _4dqu.frq,
  _4dty.fdx, _4dx1.frq, _4e6e.frq, _4e66_1.del, _4e69.prx, _4e6d.tii,
  _4e5k.tii, _4e0e.fdt, _4dqu.tis, _4e6d.tis, _4e69.nrm, _4dqu.prx,
 _4e4y.fnm,
  _4e67.tis, _4e69_1.del, _4e6d.nrm, _4e6c.tis, _4e0e.fdx, _4e6c.tii,
  _4dx1_n.del, _4e5v.fnm, _4e5k.tis, _4e59.tis, _4e67.tii, _4dqu.nrm,
  _4e5k_8.del, _4e6c.fdx, _4e6c.fdt, _4e41.frq, _4e4y.fdx, _4e69.frq,
  _4e6a.tis, _4dty.prx, _4e66.frq, _4e5k.frq, _4e6a.tii, _4e69.tii,
 _4e6c.nrm,
  _4dty.fnm, _4e59.prx, _4e59.frq, _4e66.prx, _4e68.frq, _4e5k.fdx,
 _4e4y.tii,
  _4e6c.fnm, _4e0e.frq, _4e6b.frq, _4e41.fdt, _4e4n_2.del, _4dty.tii,
  _4e4y.fdt, _4e66.nrm, _4e4c.frq, _4e6a.nrm, _4e5k.fdt, _4e3r_i.del,
  _4e5v.fdt, _4e4y.nrm, _4e68.tii, _4e5v.fdx, _4e41.fdx]
  [2011-05-18 23:49:48.119]
 

 commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpb,version=1247782702273,generation=249743,filenames=[_4dqu_2g.del,
  _4e66.tis, _4e59.nrm, _4e3r.tis, _4e4n.fnm, _4e67.fnm, _4e3r.tii,
 _4e6d.fnm,
  _4e68.fdx, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt,
 _4e4n.tis,
  _4e6e.fnm, _4e0e.prx, _4e4c.tis, _4e5v.frq, _4e4y_3.del, _4e6b_1.del,
  _4e4c.tii, _4e6f.fnm, _4e5k.fnm, _4e6c_1.del, _4e41.fnm, _4dx1.fnm,
  _4e5v.nrm, _4e5v.tii, _4e5v.prx, _4e5k.prx, _4e4c.nrm, _4dty.frq,
 _4e66.fdx,
  _4e5v.tis, _4e66.fdt, _4e6e.tii, _4e59.fdx, _4e6b.fdx, _4e41.prx,
 _4e6b.fdt,
  _4e41.nrm, _4e6e.tis, _4e4c.fnm, _4e66.tii, _4e6b.prx, _4e0e.fnm,
 _4e5k.nrm,
  _4e6a.prx, _4e6e.nrm, _4e59.tii, _4e67.frq, _4dty.nrm, _4e4y.tis,
 _4e6a.fdt,
  _4e6b.nrm, _4e59.fdt, _4e6a.fdx, _4e41.tii, _4e41.tis, _4e67_1.del,
  _4dty.fdt, _4dty.fdx, _4e69.tis, _4e66_1.del, _4e6e.frq, _4e5k.tii,
  _4dqu.prx, _4e67.tis, _4e69_1.del, _4e6c.tis, _4e6c.tii, _4e5v.fnm,
  _4e5k.tis, _4e59.tis, _4e67.tii, _4e6c.fdx, _4e4y.fdx, _4e41.frq,
 _4e6c.fdt,
  _4dty.prx, _4e66.frq, _4e69.tii, _4e6c.nrm, _4e59.frq, _4e66.prx,
 _4e5k.fdx,
  _4e68.frq, _4e4y.tii, _4e4n_2.del, _4e41.fdt, _4e6b.frq, _4e4y.fdt,
  _4e66.nrm, _4e4c.frq, _4e3r_i.del, _4e5k.fdt, _4e4y.nrm, _4e41.fdx,
  _4e4n.prx, _4e68_1.del, _4e3r.frq, _4e6f.fdt, _4e6f.fdx, _4e6c.prx,
  _4e68.nrm, _4e6a.frq

Re: very slow commits and overlapping commits

2011-05-23 Thread Bill Au
You can use the postCommit event listener as a callback mechanism to let
you know that a commit has happened.
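
A sketch of what that looks like in solrconfig.xml (the script name and directory
are whatever you want run after each commit):

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">after-commit.sh</str>
  <str name="dir">/opt/solr/bin</str>
  <bool name="wait">true</bool>
</listener>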

Bill

On Sun, May 22, 2011 at 9:31 PM, Jeff Crump jeffrey.cr...@gmail.com wrote:

 I don't have an answer to this but only another question:  I don't think I
 can use auto-commit in my application, as I have to checkpoint my index
 submissions and I don't know of any callback mechanism that would let me
 know a commit has happened.  Is there one?

 2011/5/21 Erick Erickson erickerick...@gmail.com

  Well, committing less often is a possibility <g>. Here's what's probably
  happening. When you pass certain thresholds, segments are merged which
 can
  take quite some time.  How are you triggering commits? If it's external,
  think about using auto commit instead.
 
  Best
  Erick
  On May 20, 2011 6:04 PM, Bill Au bill.w...@gmail.com wrote:
   On my Solr 1.4.1 master I am doing commits regularly at a fixed
 interval.
  I
   noticed that from time to time commit will take longer than the commit
   interval, causing commits to overlap. Then things will get worse as
  commit
   will take longer and longer. Here is the logs for a long commit:
  
  
   [2011-05-18 23:47:30.071] start
  
 
 
 commit(optimize=false,waitFlush=false,waitSearcher=false,expungeDeletes=false)
   [2011-05-18 23:49:48.119] SolrDeletionPolicy.onCommit: commits:num=2
   [2011-05-18 23:49:48.119]
  
 
 
 commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpa,version=1247782702272,generation=249742,filenames=[_4dqu_2g.del,
   _4e66.tis, _4e3r.tis, _4e59.nrm, _4e68_1.del, _4e4n.prx, _4e4n.fnm,
   _4e67.fnm, _4e3r.frq, _4e3r.tii, _4e6d.fnm, _4e6c.prx, _4e68.fdx,
  _4e68.nrm,
   _4e6a.frq, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt,
  _4e0e.nrm,
   _4e4n.tis, _4e6e.fnm, _4e3r.prx, _4e66.fnm, _4e3r.nrm, _4e0e.prx,
  _4e4c.fdx,
   _4dx1.prx, _4e5v.frq, _4e3r.fdt, _4e4c.tis, _4e41_6.del, _4e6b.tis,
   _4e6b_1.del, _4e4y_3.del, _4e6b.tii, _4e3r.fdx, _4dx1.nrm, _4e4y.frq,
   _4e4c.fdt, _4e4c.tii, _4e6d.fdt, _4e5k.fnm, _4e41.fnm, _4e69.fnm,
  _4e67.fdt,
   _4e0e.tii, _4dty_h.del, _4e6b.fnm, _4e0e_h.del, _4e6d.fdx, _4e67.fdx,
   _4e0e.tis, _4e5v.nrm, _4dx1.fnm, _4e5v.tii, _4dqu.fdt, segments_5cpa,
   _4e5v.prx, _4dqu.fdx, _4e59.fnm, _4e6d.prx, _4e59_5.del, _4e4c.prx,
   _4e4c.nrm, _4e5k.prx, _4e66.fdx, _4dty.frq, _4e6c.frq, _4e5v.tis,
  _4e6e.tii,
   _4e66.fdt, _4e6b.fdx, _4e68.prx, _4e59.fdx, _4e6e.fdt, _4e41.prx,
  _4dx1.tii,
   _4dx1.fdt, _4e6b.fdt, _4e5v_4.del, _4e4n.fdt, _4e6e.fdx, _4dx1.fdx,
   _4e41.nrm, _4e4n.fdx, _4e6e.tis, _4e66.tii, _4e4c.fnm, _4e6b.prx,
  _4e67.prx,
   _4e0e.fnm, _4e4n.nrm, _4e67.nrm, _4e5k.nrm, _4e6a.prx, _4e68.fnm,
   _4e4c_4.del, _4dx1.tis, _4e6e.nrm, _4e59.tii, _4e68.tis, _4e67.frq,
   _4e3r.fnm, _4dty.nrm, _4e4y.prx, _4e6e.prx, _4dty.tis, _4e4y.tis,
  _4e6b.nrm,
   _4e6a.fdt, _4e4n.frq, _4e6d.frq, _4e59.fdt, _4e6a.fdx, _4e6a.fnm,
  _4dqu.tii,
   _4e41.tii, _4e67_1.del, _4e41.tis, _4dty.fdt, _4e69.tis, _4dqu.frq,
   _4dty.fdx, _4dx1.frq, _4e6e.frq, _4e66_1.del, _4e69.prx, _4e6d.tii,
   _4e5k.tii, _4e0e.fdt, _4dqu.tis, _4e6d.tis, _4e69.nrm, _4dqu.prx,
  _4e4y.fnm,
   _4e67.tis, _4e69_1.del, _4e6d.nrm, _4e6c.tis, _4e0e.fdx, _4e6c.tii,
   _4dx1_n.del, _4e5v.fnm, _4e5k.tis, _4e59.tis, _4e67.tii, _4dqu.nrm,
   _4e5k_8.del, _4e6c.fdx, _4e6c.fdt, _4e41.frq, _4e4y.fdx, _4e69.frq,
   _4e6a.tis, _4dty.prx, _4e66.frq, _4e5k.frq, _4e6a.tii, _4e69.tii,
  _4e6c.nrm,
   _4dty.fnm, _4e59.prx, _4e59.frq, _4e66.prx, _4e68.frq, _4e5k.fdx,
  _4e4y.tii,
   _4e6c.fnm, _4e0e.frq, _4e6b.frq, _4e41.fdt, _4e4n_2.del, _4dty.tii,
   _4e4y.fdt, _4e66.nrm, _4e4c.frq, _4e6a.nrm, _4e5k.fdt, _4e3r_i.del,
   _4e5v.fdt, _4e4y.nrm, _4e68.tii, _4e5v.fdx, _4e41.fdx]
   [2011-05-18 23:49:48.119]
  
 
 
 commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpb,version=1247782702273,generation=249743,filenames=[_4dqu_2g.del,
   _4e66.tis, _4e59.nrm, _4e3r.tis, _4e4n.fnm, _4e67.fnm, _4e3r.tii,
  _4e6d.fnm,
   _4e68.fdx, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt,
  _4e4n.tis,
   _4e6e.fnm, _4e0e.prx, _4e4c.tis, _4e5v.frq, _4e4y_3.del, _4e6b_1.del,
   _4e4c.tii, _4e6f.fnm, _4e5k.fnm, _4e6c_1.del, _4e41.fnm, _4dx1.fnm,
   _4e5v.nrm, _4e5v.tii, _4e5v.prx, _4e5k.prx, _4e4c.nrm, _4dty.frq,
  _4e66.fdx,
   _4e5v.tis, _4e66.fdt, _4e6e.tii, _4e59.fdx, _4e6b.fdx, _4e41.prx,
  _4e6b.fdt,
   _4e41.nrm, _4e6e.tis, _4e4c.fnm, _4e66.tii, _4e6b.prx, _4e0e.fnm,
  _4e5k.nrm,
   _4e6a.prx, _4e6e.nrm, _4e59.tii, _4e67.frq, _4dty.nrm, _4e4y.tis,
  _4e6a.fdt,
   _4e6b.nrm, _4e59.fdt, _4e6a.fdx, _4e41.tii, _4e41.tis, _4e67_1.del,
   _4dty.fdt, _4dty.fdx, _4e69.tis, _4e66_1.del, _4e6e.frq, _4e5k.tii,
   _4dqu.prx, _4e67.tis, _4e69_1.del, _4e6c.tis, _4e6c.tii, _4e5v.fnm,
   _4e5k.tis, _4e59.tis, _4e67.tii, _4e6c.fdx, _4e4y.fdx, _4e41.frq,
  _4e6c.fdt,
   _4dty.prx, _4e66.frq, _4e69.tii, _4e6c.nrm, _4e59.frq, _4e66.prx,
  _4e5k.fdx,
   _4e68.frq, _4e4y.tii, _4e4n_2.del, _4e41.fdt, _4e6b.frq, _4e4y.fdt,
   _4e66.nrm, _4e4c.frq

very slow commits and overlapping commits

2011-05-20 Thread Bill Au
On my Solr 1.4.1 master I am doing commits regularly at a fixed interval.  I
noticed that from time to time a commit will take longer than the commit
interval, causing commits to overlap.  Then things get worse as commits
take longer and longer.  Here are the logs for a long commit:


[2011-05-18 23:47:30.071] start
commit(optimize=false,waitFlush=false,waitSearcher=false,expungeDeletes=false)
[2011-05-18 23:49:48.119] SolrDeletionPolicy.onCommit: commits:num=2
[2011-05-18 23:49:48.119]
commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpa,version=1247782702272,generation=249742,filenames=[_4dqu_2g.del,
_4e66.tis, _4e3r.tis, _4e59.nrm, _4e68_1.del, _4e4n.prx, _4e4n.fnm,
_4e67.fnm, _4e3r.frq, _4e3r.tii, _4e6d.fnm, _4e6c.prx, _4e68.fdx, _4e68.nrm,
_4e6a.frq, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt, _4e0e.nrm,
_4e4n.tis, _4e6e.fnm, _4e3r.prx, _4e66.fnm, _4e3r.nrm, _4e0e.prx, _4e4c.fdx,
_4dx1.prx, _4e5v.frq, _4e3r.fdt, _4e4c.tis, _4e41_6.del, _4e6b.tis,
_4e6b_1.del, _4e4y_3.del, _4e6b.tii, _4e3r.fdx, _4dx1.nrm, _4e4y.frq,
_4e4c.fdt, _4e4c.tii, _4e6d.fdt, _4e5k.fnm, _4e41.fnm, _4e69.fnm, _4e67.fdt,
_4e0e.tii, _4dty_h.del, _4e6b.fnm, _4e0e_h.del, _4e6d.fdx, _4e67.fdx,
_4e0e.tis, _4e5v.nrm, _4dx1.fnm, _4e5v.tii, _4dqu.fdt, segments_5cpa,
_4e5v.prx, _4dqu.fdx, _4e59.fnm, _4e6d.prx, _4e59_5.del, _4e4c.prx,
_4e4c.nrm, _4e5k.prx, _4e66.fdx, _4dty.frq, _4e6c.frq, _4e5v.tis, _4e6e.tii,
_4e66.fdt, _4e6b.fdx, _4e68.prx, _4e59.fdx, _4e6e.fdt, _4e41.prx, _4dx1.tii,
_4dx1.fdt, _4e6b.fdt, _4e5v_4.del, _4e4n.fdt, _4e6e.fdx, _4dx1.fdx,
_4e41.nrm, _4e4n.fdx, _4e6e.tis, _4e66.tii, _4e4c.fnm, _4e6b.prx, _4e67.prx,
_4e0e.fnm, _4e4n.nrm, _4e67.nrm, _4e5k.nrm, _4e6a.prx, _4e68.fnm,
_4e4c_4.del, _4dx1.tis, _4e6e.nrm, _4e59.tii, _4e68.tis, _4e67.frq,
_4e3r.fnm, _4dty.nrm, _4e4y.prx, _4e6e.prx, _4dty.tis, _4e4y.tis, _4e6b.nrm,
_4e6a.fdt, _4e4n.frq, _4e6d.frq, _4e59.fdt, _4e6a.fdx, _4e6a.fnm, _4dqu.tii,
_4e41.tii, _4e67_1.del, _4e41.tis, _4dty.fdt, _4e69.tis, _4dqu.frq,
_4dty.fdx, _4dx1.frq, _4e6e.frq, _4e66_1.del, _4e69.prx, _4e6d.tii,
_4e5k.tii, _4e0e.fdt, _4dqu.tis, _4e6d.tis, _4e69.nrm, _4dqu.prx, _4e4y.fnm,
_4e67.tis, _4e69_1.del, _4e6d.nrm, _4e6c.tis, _4e0e.fdx, _4e6c.tii,
_4dx1_n.del, _4e5v.fnm, _4e5k.tis, _4e59.tis, _4e67.tii, _4dqu.nrm,
_4e5k_8.del, _4e6c.fdx, _4e6c.fdt, _4e41.frq, _4e4y.fdx, _4e69.frq,
_4e6a.tis, _4dty.prx, _4e66.frq, _4e5k.frq, _4e6a.tii, _4e69.tii, _4e6c.nrm,
_4dty.fnm, _4e59.prx, _4e59.frq, _4e66.prx, _4e68.frq, _4e5k.fdx, _4e4y.tii,
_4e6c.fnm, _4e0e.frq, _4e6b.frq, _4e41.fdt, _4e4n_2.del, _4dty.tii,
_4e4y.fdt, _4e66.nrm, _4e4c.frq, _4e6a.nrm, _4e5k.fdt, _4e3r_i.del,
_4e5v.fdt, _4e4y.nrm, _4e68.tii, _4e5v.fdx, _4e41.fdx]
[2011-05-18 23:49:48.119]
commit{dir=/var/opt/resin3/5062/solr/data/index,segFN=segments_5cpb,version=1247782702273,generation=249743,filenames=[_4dqu_2g.del,
_4e66.tis, _4e59.nrm, _4e3r.tis, _4e4n.fnm, _4e67.fnm, _4e3r.tii, _4e6d.fnm,
_4e68.fdx, _4e68.fdt, _4dqu.fnm, _4e4n.tii, _4e69.fdx, _4e69.fdt, _4e4n.tis,
_4e6e.fnm, _4e0e.prx, _4e4c.tis, _4e5v.frq, _4e4y_3.del, _4e6b_1.del,
_4e4c.tii, _4e6f.fnm, _4e5k.fnm, _4e6c_1.del, _4e41.fnm, _4dx1.fnm,
_4e5v.nrm, _4e5v.tii, _4e5v.prx, _4e5k.prx, _4e4c.nrm, _4dty.frq, _4e66.fdx,
_4e5v.tis, _4e66.fdt, _4e6e.tii, _4e59.fdx, _4e6b.fdx, _4e41.prx, _4e6b.fdt,
_4e41.nrm, _4e6e.tis, _4e4c.fnm, _4e66.tii, _4e6b.prx, _4e0e.fnm, _4e5k.nrm,
_4e6a.prx, _4e6e.nrm, _4e59.tii, _4e67.frq, _4dty.nrm, _4e4y.tis, _4e6a.fdt,
_4e6b.nrm, _4e59.fdt, _4e6a.fdx, _4e41.tii, _4e41.tis, _4e67_1.del,
_4dty.fdt, _4dty.fdx, _4e69.tis, _4e66_1.del, _4e6e.frq, _4e5k.tii,
_4dqu.prx, _4e67.tis, _4e69_1.del, _4e6c.tis, _4e6c.tii, _4e5v.fnm,
_4e5k.tis, _4e59.tis, _4e67.tii, _4e6c.fdx, _4e4y.fdx, _4e41.frq, _4e6c.fdt,
_4dty.prx, _4e66.frq, _4e69.tii, _4e6c.nrm, _4e59.frq, _4e66.prx, _4e5k.fdx,
_4e68.frq, _4e4y.tii, _4e4n_2.del, _4e41.fdt, _4e6b.frq, _4e4y.fdt,
_4e66.nrm, _4e4c.frq, _4e3r_i.del, _4e5k.fdt, _4e4y.nrm, _4e41.fdx,
_4e4n.prx, _4e68_1.del, _4e3r.frq, _4e6f.fdt, _4e6f.fdx, _4e6c.prx,
_4e68.nrm, _4e6a.frq, _4e0e.nrm, _4e3r.prx, _4e66.fnm, _4e3r.nrm, _4e4c.fdx,
_4dx1.prx, _4e3r.fdt, _4e41_6.del, _4e6b.tis, _4e3r.fdx, _4e6b.tii,
_4dx1.nrm, _4e4y.frq, _4e4c.fdt, _4e6d.fdt, _4e69.fnm, _4dty_h.del,
_4e0e.tii, _4e67.fdt, _4e0e_h.del, _4e6b.fnm, _4e6d.fdx, _4e67.fdx,
_4e0e.tis, _4dqu.fdt, segments_5cpb, _4dqu.fdx, _4e59.fnm, _4e59_5.del,
_4e6d.prx, _4e4c.prx, _4e6c.frq, _4e68.prx, _4e6e.fdt, _4dx1.tii, _4dx1.fdt,
_4e4n.fdt, _4e5v_4.del, _4e6e.fdx, _4dx1.fdx, _4e4n.fdx, _4e6f.nrm,
_4e4n.nrm, _4e67.prx, _4e67.nrm, _4e68.fnm, _4e4c_4.del, _4dx1.tis,
_4e68.tis, _4e3r.fnm, _4e6f.prx, _4e4y.prx, _4dty.tis, _4e6e.prx, _4e4n.frq,
_4e6d.frq, _4dqu.tii, _4e6a.fnm, _4dx1.frq, _4dqu.frq, _4e69.prx, _4e6d.tii,
_4e0e.fdt, _4dqu.tis, _4e69.nrm, _4e6d.tis, _4e4y.fnm, _4e0e.fdx, _4e6d.nrm,
_4dx1_n.del, _4dqu.nrm, _4e6f.tii, _4e5k_8.del, _4e6f.frq, _4e69.frq,
_4e6a.tis, _4e6f.tis, _4e5k.frq, _4e6a.tii, _4dty.fnm, _4e59.prx, _4e6c.fnm,
_4e0e.frq, _4dty.tii, 

Re: JVM GC is very frequent.

2010-08-28 Thread Bill Au
Besides frequency, you should also look at duration of GC events.  You may
want to try the concurrent garbage collector if you see many long full GCs.
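
For example, something along these lines in the JVM options, plus GC logging so
you can see the pause times (heap sizes are only illustrative):

java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -jar start.jar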

Bill

2010/8/25 Chengyang atreey...@163.com

 We have about 500 million documents indexed.  The index size is about 10G.
 Running on a 32-bit box. During the pressure testing, we observed that the
 JVM GC is very frequent, about once every 5 minutes. Are there any tips for tuning this?



Re: Solr jam after all my jvm thread pool hang in blocked state

2010-08-23 Thread Bill Au
It would be helpful if you could attach a thread dump.

Bill

On Mon, Aug 23, 2010 at 6:00 PM, AlexxelA alexandre.boudrea...@canoe.cawrote:


 I,

 I'm running Solr 1.3 in production for 1 year now and I never had any
 problem with it until 2 weeks ago.  It happens 6-7 times a day: all of my threads
 but one are in a blocked state.  All threads that are blocked are waiting on
 the Console monitor owned by the Runnable thread.

 We did not change anything on the application / server.  I have monitored
 the thread count and there's no accumulation of threads during the period
 Solr is ok.

 The problem doesn't seem to be related to a high load of queries since it also
 happens during low load periods.

 Anyone got a clue of is going on ?


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-jam-after-all-my-jvm-thread-pool-hang-in-blocked-state-tp1303361p1303361.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Hangs up after couple of hours

2010-08-23 Thread Bill Au
It would be very useful if you could take a thread dump while Solr is
hanging.  That will give an indication of where/why Solr is hanging.

Bill

On Mon, Aug 23, 2010 at 9:32 PM, Manepalli, Kalyan 
kalyan.manepa...@orbitz.com wrote:

 Hi all,
   I am facing a peculiar problem with Solr querying. During our
 indexing process we analyze the existing index. For this we query the index.
 We found that the solr server just hangs on a arbitrary query. If we access
 the admin/stats.jsp, it again resumes executing the queries. The thread
 count and memory utilization looks very normal.

 Any clues on whats going on will be very helpful.

 Thanks
 Kalyan


Re: Can query boosting be used with a custom request handlers?

2010-06-10 Thread Bill Au
You can use the defType param in the boost local params to use a different
query parser for the boosted query.  Here is an example for using dismax:

{!boost b=log(popularity) defType=dismax}foo

I do this with a custom handler that I have implemented for my app.
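
A full request might look something like this (field names are made up; v=$qq
just keeps the user query in a separate parameter):

http://localhost:8983/solr/select?q={!boost b=log(popularity) defType=dismax v=$qq}&qq=foo&qf=title^2 body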

Bill



On Wed, Jun 9, 2010 at 11:37 PM, Andy angelf...@yahoo.com wrote:

 I want to try out the bobo plugin for Solr, which is a custom request
  handler  (http://code.google.com/p/bobo-browse/wiki/SolrIntegration).

 At the same time I want to use BoostQParserPlugin to boost my queries,
 something like {!boost b=log(popularity)}foo

 Can I use the {!boost} feature in conjunction with an external custom
 request handler like the bobo plugin, or does {!boost} only work with the
 standard request handler?






Re: Storing different entities in Solr

2010-05-30 Thread Bill Au
There is only one primary key in a single index.  If the ids of your
different document types do collide, you can simply add a prefix or suffix
to make them unique.
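
For example, something like this when the documents are built (field names are
only illustrative):

<add>
  <doc>
    <field name="id">consultant_42</field>
    <field name="type">consultant</field>
  </doc>
  <doc>
    <field name="id">request_42</field>
    <field name="type">request</field>
  </doc>
</add>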

Bill

On Fri, May 28, 2010 at 1:12 PM, Moazzam Khan moazz...@gmail.com wrote:

 Thanks for all your answers guys. Requests and consultants have a many
 to many relationship so I can't store request info in a document with
 advisorID as the primary key.

 Bill's solution and multicore solutions might be what I am looking
 for. Bill, will I be able to have 2 primary keys (so I can update and
  delete documents)? If yes, can you please give me a link or something
 where I can get more info on this?

 Thanks,
 Moazzam



 On Fri, May 28, 2010 at 11:50 AM, Bill Au bill.w...@gmail.com wrote:
  You can keep different type of documents in the same index.  If each
  document has a type field.  You can restrict your searches to specific
  type(s) of document by using a filter query, which is very fast and
  efficient.
 
  Bill
 
  On Fri, May 28, 2010 at 12:28 PM, Nagelberg, Kallin 
  knagelb...@globeandmail.com wrote:
 
  Multi-core is an option, but keep in mind if you go that route you will
  need to do two searches to correlate data between the two.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: Robert Zotter [mailto:robertzot...@gmail.com]
  Sent: Friday, May 28, 2010 12:26 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Storing different entities in Solr
 
 
  Sounds like you'll want to use a multiple core setup. One core fore each
  type
  of document
 
  http://wiki.apache.org/solr/CoreAdmin
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Storing different entities in Solr

2010-05-28 Thread Bill Au
You can keep different types of documents in the same index if each
document has a type field.  You can restrict your searches to specific
type(s) of document by using a filter query, which is very fast and
efficient.
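
For example (field name is illustrative), a search restricted to one document type:

http://localhost:8983/solr/select?q=some+query&fq=type:consultant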

Bill

On Fri, May 28, 2010 at 12:28 PM, Nagelberg, Kallin 
knagelb...@globeandmail.com wrote:

 Multi-core is an option, but keep in mind if you go that route you will
 need to do two searches to correlate data between the two.

 -Kallin Nagelberg

 -Original Message-
 From: Robert Zotter [mailto:robertzot...@gmail.com]
 Sent: Friday, May 28, 2010 12:26 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Storing different entities in Solr


 Sounds like you'll want to use a multiple core setup. One core fore each
 type
 of document

 http://wiki.apache.org/solr/CoreAdmin
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Jetty, Tomcat or JBoss?

2010-04-20 Thread Bill Au
Solr only uses Servlet and JSP.

Bill

On Sat, Apr 17, 2010 at 9:11 AM, Abdelhamid ABID aeh.a...@gmail.com wrote:

 Solr does use JEE WEB components

 On 4/17/10, Lukáš Vlček lukas.vl...@gmail.com wrote:
 
  Hi,
 
  may be you should be aware that JBoss AS is using Tomcat for web
 container
  (with modified classloader), so if your web application is running inside
  JBoss AS then it is in fact running in Tomcat.
  I don't think Solr uses JEE technologies provided by JEE Application
 server
  (JMS, Transaction services, pooling services, clustered EJB... etc...).
 All
  it requires is web container AFAIK. This being said it will always take
  longer for application server to start and it will require more resources
  as
  opposed to lightweight web container.
 
  Regards,
  Lukas
 
 
  On Sat, Apr 17, 2010 at 11:08 AM, Andrea Gazzarini 
  andrea.gazzar...@atcult.it wrote:
 
   Hi all,
   I have a web application which is basically a (user) search interface
   towards SOLR.
   My index is something like 7GB and has a lot of records so apart other
   things like optiming SOLR schema, config ,clustering etc... I'd like to
  keep
   SOLR installation as light as possible.
   At the moment my SOLR instance is running under JBoss but I saw that
   running under the bundled Jetty it takes a very little amount of memory
  (at
   least at startup and after one hour of usage)
  
   So my questions is: since SOLR is using JEE web components what are the
   drawback of using the following architecture?
  
   -My Application (Full JEE application with web components and EJB) on
   JBoss;
   - SOLR on Jetty or Tomcat
  
   Having said that and supposing that the idea is good, what are the main
   differences / advantages / disadvamtages (from this point of view)
  between
   Tomcat and Jetty?
  
   Best Regards,
   Andrea
  
  
 



 --
 Abdelhamid ABID



Re: Jetty, Tomcat or JBoss?

2010-04-20 Thread Bill Au
I never said they weren't.

Bill

On Tue, Apr 20, 2010 at 5:54 PM, Abdelhamid ABID aeh.a...@gmail.com wrote:

 Which are JEE Web components, aren't they?

 On 4/20/10, Bill Au bill.w...@gmail.com wrote:
 
  Solr only uses Servlet and JSP.
 
 
  Bill
 
 
  On Sat, Apr 17, 2010 at 9:11 AM, Abdelhamid ABID aeh.a...@gmail.com
  wrote:
 
   Solr does use JEE WEB components
  
   On 4/17/10, Lukáš Vlček lukas.vl...@gmail.com wrote:
   
Hi,
   
may be you should be aware that JBoss AS is using Tomcat for web
   container
(with modified classloader), so if your web application is running
  inside
JBoss AS then it is in fact running in Tomcat.
I don't think Solr uses JEE technologies provided by JEE Application
   server
(JMS, Transaction services, pooling services, clustered EJB...
 etc...).
   All
it requires is web container AFAIK. This being said it will always
 take
longer for application server to start and it will require more
  resources
as
opposed to lightweight web container.
   
Regards,
Lukas
   
   
On Sat, Apr 17, 2010 at 11:08 AM, Andrea Gazzarini 
andrea.gazzar...@atcult.it wrote:
   
 Hi all,
 I have a web application which is basically a (user) search
 interface
 towards SOLR.
 My index is something like 7GB and has a lot of records so apart
  other
 things like optiming SOLR schema, config ,clustering etc... I'd
 like
  to
keep
 SOLR installation as light as possible.
 At the moment my SOLR instance is running under JBoss but I saw
 that
 running under the bundled Jetty it takes a very little amount of
  memory
(at
 least at startup and after one hour of usage)

 So my questions is: since SOLR is using JEE web components what are
  the
 drawback of using the following architecture?

 -My Application (Full JEE application with web components and EJB)
 on
 JBoss;
 - SOLR on Jetty or Tomcat

 Having said that and supposing that the idea is good, what are the
  main
 differences / advantages / disadvamtages (from this point of view)
between
 Tomcat and Jetty?

 Best Regards,
 Andrea


   
  
  
  
   --
   Abdelhamid ABID
  
 



 --
 Abdelhamid ABID
 Software Engineer- J2EE / WEB



Re: Snapshooter shooting after commit or optimize

2010-04-12 Thread Bill Au
The lines you have enclosed are commented out by the <!-- and -->

Bill

On Mon, Apr 12, 2010 at 1:32 PM, william pink will.p...@gmail.com wrote:

 Hi,

 I am running Solr 1.2 ( I will be updating in due course)

 I am having a few issues with doing the snapshots after a postCommit or
 postOptimize neither appear to work in my solrconfig.xml I have the
 following


 <!--
  A postCommit event is fired after every commit or optimize command
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">/opt/solr/bin</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
 -->
 -
 <!--
  A postOptimize event is fired only after every optimize command, useful
  in conjunction with index distribution to only distribute optimized
  indicies
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">/opt/solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
 -->

 But a snapshot never gets taken, It's most likely something that I haven't
 spotted but I can't seem to work it out. It's quite possible the route to
 snapshooter might be the issue but I have tried a few different things and
 none have worked.

 Any tips appreciated,

 Thanks

 Will



Re: jmap output help

2010-03-29 Thread Bill Au
Take a heap dump and use jhat to find out for sure.

Bill

On Mon, Mar 29, 2010 at 1:03 PM, Siddhant Goel siddhantg...@gmail.comwrote:

 Gentle bounce

 On Sun, Mar 28, 2010 at 11:31 AM, Siddhant Goel siddhantg...@gmail.com
 wrote:

  Hi everyone,
 
  The output of jmap -histo:live 27959 | head -30 is something like the
  following :
 
  num #instances #bytes  class name
  --
 1:448441  180299464  [C
 2:  5311  135734480  [I
 3:  3623   68389720  [B
 4:445669   17826760  java.lang.String
 5:391739   15669560  org.apache.lucene.index.TermInfo
 6:417442   13358144  org.apache.lucene.index.Term
 7: 587675171496
   org.apache.lucene.index.FieldsReader$LazyField
 8: 329025049760  constMethodKlass
 9: 329023955920  methodKlass
10:  28433512688  constantPoolKlass
11:  23973128048  [Lorg.apache.lucene.index.Term;
12:353053592  [J
13: 33044288  [Lorg.apache.lucene.index.TermInfo;
14: 556712707536  symbolKlass
15: 272822701352  [Ljava.lang.Object;
16:  28432212384  instanceKlassKlass
17:  23432132224  constantPoolCacheKlass
18: 264241056960  java.util.ArrayList
19: 164231051072  java.util.LinkedHashMap$Entry
20:  20391028944  methodDataKlass
21: 14336 917504  org.apache.lucene.document.Field
22: 29587 710088  java.lang.Integer
23:  3171 583464  java.lang.Class
24:   813 492880  [Ljava.util.HashMap$Entry;
25:  8471 474376  org.apache.lucene.search.PhraseQuery
26:  4184 402848  [[I
27:  4277 380704  [S
 
  Is it ok to assume that the top 3 entries (character/integer/byte arrays)
  are referring to the entries inside the solr cache?
 
  Thanks,
 
 
  --
  - Siddhant
 



 --
 - Siddhant



Re: Snapshot / Distribution Process

2010-03-11 Thread Bill Au
Have you started rsyncd on the master?  Make sure that it is enabled before
you start:

http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

You can also try running snappuller with the -V option to get more
debugging info.

Bill

On Wed, Mar 10, 2010 at 4:09 PM, Lars R. Noldan l...@sixfeetup.com wrote:

 Is anyone aware of a comprehensive guide for setting up the Snapshot
 Distribution process on Solr 1.3?

 I'm working through:
 http://wiki.apache.org/solr/CollectionDistribution#The_Snapshot_and_Distribution_Process

 And have run into a roadblock where the solr/bin/snappuller finds the
 appropriate snapshot, but rsync fails.  (according to the logs.)

 Any guidance you can provide, even if it's asking for additional
 troubleshooting information is welcome and appreciated.

 Thanks
 Lars
 --
 l...@sixfeetup.com | +1 (317) 861-5948 x609
 six feet up presents INDIGO : The Help Line for Plone
 More info at http://sixfeetup.com/indigo or call +1 (866) 749-3338


Re: Payloads with Phrase queries

2009-12-15 Thread Bill Au
Lucene 2.9.1 comes with a PayloadTermQuery:
http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/payloads/PayloadTermQuery.html

I have been using that to use the payload as part of the score without any
problem.
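
Constructing one looks roughly like this (a sketch only; the field, the term
and the choice of AveragePayloadFunction are just for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadQueryExample {
    public static Query build() {
        // scores each match by the average payload value at the matching
        // positions, combined with the normal term score; the payload value
        // itself comes from Similarity.scorePayload()
        return new PayloadTermQuery(new Term("payloadTest", "solr"),
                                    new AveragePayloadFunction());
    }
}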

Bill


On Tue, Dec 15, 2009 at 6:31 AM, Raghuveer Kancherla 
raghuveer.kanche...@aplopio.com wrote:

 The interesting thing I am noticing is that the scoring works fine for a
 phrase query like solr rocks.
 This lead me to look at what query I am using in case of a single term.
 Turns out that I am using PayloadTermQuery taking a cue from solr-1485
 patch.

 I changed this to BoostingTermQuery (i read somewhere that this is
 deprecated .. but i was just experimenting) and the scoring seems to work
 as
 expected now for a single term.

 Now, the important question is what is the Payload version of a TermQuery?

 Regards
 Raghu


 On Tue, Dec 15, 2009 at 12:45 PM, Raghuveer Kancherla 
 raghuveer.kanche...@aplopio.com wrote:

  Hi,
  Thanks everyone for the responses, I am now able to get both phrase
 queries
  and term queries to use payloads.
 
  However the the score value for each document (and consequently, the
  ordering of documents) are coming out wrong.
 
  In the solr output appended below, document 4 has a score higher than the
  document 2 (look at the debug part). The results section shows a wrong
 score
  (which is the payload value I am returning from my custom similarity
 class)
  and the ordering is also wrong because of this. Can someone explain this
 ?
 
  My custom query parser is pasted here http://pastebin.com/m9f21565
 
  In the similarity class, I return 10.0 if payload is 1 and 20.0 if
 payload
  is 2. For everything else I return 1.0.
 
  {
   'responseHeader':{
'status':0,
'QTime':2,
'params':{
'fl':'*,score',
'debugQuery':'on',
'indent':'on',
 
 
'start':'0',
'q':'solr',
'qt':'aplopio',
'wt':'python',
'fq':'',
'rows':'10'}},
   'response':{'numFound':5,'start':0,'maxScore':20.0,'docs':[
 
 
{
 'payloadTest':'solr|2 rocks|1',
 'id':'2',
 'score':20.0},
{
 'payloadTest':'solr|2',
 'id':'4',
 'score':20.0},
 
 
{
 'payloadTest':'solr|1 rocks|2',
 'id':'1',
 'score':10.0},
{
 'payloadTest':'solr|1 rocks|1',
 'id':'3',
 'score':10.0},
 
 
{
 'payloadTest':'solr',
 'id':'5',
 'score':1.0}]
   },
   'debug':{
'rawquerystring':'solr',
'querystring':'solr',
 
 
'parsedquery':'PayloadTermQuery(payloadTest:solr)',
'parsedquery_toString':'payloadTest:solr',
'explain':{
'2':'\n7.227325 = (MATCH) fieldWeight(payloadTest:solr in 1),
 product of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 =
 tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 =
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=1)\n',
 
 
'4':'\n11.56372 = (MATCH) fieldWeight(payloadTest:solr in 3),
 product of:\n  14.142136 = (MATCH) btq, product of:\n0.70710677 =
 tf(phraseFreq=0.5)\n20.0 = scorePayload(...)\n  0.81767845 =
 idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest, doc=3)\n',
 
 
'1':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 0),
 product of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 =
 tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 =
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=0)\n',
 
 
'3':'\n3.6136625 = (MATCH) fieldWeight(payloadTest:solr in 2),
 product of:\n  7.071068 = (MATCH) btq, product of:\n0.70710677 =
 tf(phraseFreq=0.5)\n10.0 = scorePayload(...)\n  0.81767845 =
 idf(payloadTest:  solr=5)\n  0.625 = fieldNorm(field=payloadTest, doc=2)\n',
 
 
'5':'\n0.578186 = (MATCH) fieldWeight(payloadTest:solr in 4),
 product of:\n  0.70710677 = (MATCH) btq, product of:\n0.70710677 =
 tf(phraseFreq=0.5)\n1.0 = scorePayload(...)\n  0.81767845 =
 idf(payloadTest:  solr=5)\n  1.0 = fieldNorm(field=payloadTest, doc=4)\n'},
 
 
'QParser':'BoostingTermQParser',
'filter_queries':[''],
'parsed_filter_queries':[],
'timing':{
'time':2.0,
'prepare':{
 'time':1.0,
 
 
 'org.apache.solr.handler.component.QueryComponent':{
  'time':1.0},
 'org.apache.solr.handler.component.FacetComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.MoreLikeThisComponent':{
 
 
  'time':0.0},
 'org.apache.solr.handler.component.HighlightComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.StatsComponent':{
  'time':0.0},
 'org.apache.solr.handler.component.DebugComponent':{
 
 
  'time':0.0}},
'process':{
 'time':1.0,
 'org.apache.solr.handler.component.QueryComponent':{
  'time':0.0},
 

Re: How to instruct MoreLikeThisHandler to sort results

2009-12-03 Thread Bill Au
I had open a Jira and submitted a patch for this:

https://issues.apache.org/jira/browse/SOLR-1545

Bill

On Thu, Dec 3, 2009 at 7:47 AM, Sascha Szott sz...@zib.de wrote:

 Hi Folks,

 is there any way to instruct MoreLikeThisHandler to sort results? I was
 wondering that MLTHandler recognizes faceting parameters among others, but
 it ignores the sort parameter.

 Best,
 Sascha




Re: TermsComponent results don't change after documents removed from index

2009-11-03 Thread Bill Au
Thanks for pointing that out.  The TermsComponent prefix query is running
much faster than the facet prefix query.  I guess there is yet another
reason to optimize the index.

Bill
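
For reference, the optimize step Koji mentions below fits into an indexing
client roughly like this (a sketch; SolrJ 1.3/1.4-era class names, and the
URL and delete query are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PurgeAndOptimize {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.deleteByQuery("id:*");   // placeholder delete query
        server.commit();                // makes the deletes visible to searches
        server.optimize();              // merges segments, so docFreq-based counts drop as well
    }
}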

On Tue, Nov 3, 2009 at 5:09 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Bill Au wrote:

 Should the results of the TermsComponent change after documents have been
 removed from the index?  I am thinking about using the prefix of
 TermsComponent to implement auto-suggest.  But I noticed that the prefix
 counts in TermsComponent don't change after documents have been deleted.
 The deletes are done with the standard update handler using a
 delete-by-query.  Since the TermsComponent is showing the number of
 documents matching the terms, the number should be decreasing when
 documents
 are deleted.

 I can reproduce this using the sample in the tutorial and the
 TermsComponent
 prefix query in the Wiki:
 http://wiki.apache.org/solr/TermsComponent

 The output of the TermsComponent prefix doesn't change even after I
 removed
 all the documents:

 java -Ddata=args -jar post.jar "<delete><query>id:*</query></delete>"

 What am I doing wrong?

 Bill



 This is a feature of Lucene... docFreq is not changed until segments
 containing
 deletions are merged. You can do optimize to correct docFreq.

 Koji

 --
 http://www.rondhuit.com/en/




Re: question about text field and WordDelimiterFilter in example schema.xml

2009-10-27 Thread Bill Au
I have been playing with this using the analysis.jsp.  I am still not clear
why we don't want to catenate at query time.  Here is my example.

With the current text field, the query term iPhone will not match a document
containing the string iphone because iPhone is analyzed into two terms:
i(1) and phone(2).  I am using a lower case filter.

If I set catenateWords to 1, the iPhone is analyzed into:

term position 1: i
term position 2: phone iphone

So that will match a document containing the string iphone

What bad things can happen if I split and catenate at query time?

Bill

On Tue, Oct 20, 2009 at 8:09 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Tue, Oct 20, 2009 at 6:37 PM, Bill Au bill.w...@gmail.com wrote:
  I have a question regarding the use of the WordDelimiterFilter in the
 text
  field in the example schema.xml.  The parameters are set differently for
 the
  indexing and querying.  Namely, catenateWords and catenateNumbers are set
  differently.  Shouldn't the same analysis be done at both index and query
  time?

  That wouldn't work... if you tried to split and catenate at query time then
 foo-bar would generate the tokens foo/foobar,bar  (foo and foobar
 tokens overlapping).
 The Lucene query parser considers this to mean (foo or foobar)
 followed by bar, which is clearly not good.

 It's essentially the same problem that keeps us from using synonym
 expansion at query time with synonyms greater than length 1.

 -Yonik
 http://www.lucidimagination.com

  Bill
 
 <!-- A text field that uses WordDelimiterFilter to enable splitting and
      matching of words on case-change, alpha numeric boundaries, and
      non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
      match a document containing "Wi-Fi".
      Synonyms and stopwords are customized by external files, and
      stemming is enabled.
 -->
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <!-- in this example, we will only use synonyms at query time
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     -->
     <!-- Case insensitive stop word removal.
       add enablePositionIncrements=true in both the index and query
       analyzers to leave a 'gap' for more accurate phrase queries.
     -->
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
   </analyzer>
 </fieldType>
 



Question on ShingleFilter

2009-10-21 Thread Bill Au
I am having a problem with using the ShingleFilter.  My test document is "the
quick brown fox jumps over the lazy dog".  My query is "my quick brown".
Since both have the term quick brown at term position 2, the query should
match the test document, right?  But my query is not returning anything.  I
tried googling for example use of the ShingleFilter but didn't find any...

Bill


question about text field and WordDelimiterFilter in example schema.xml

2009-10-20 Thread Bill Au
I have a question regarding the use of the WordDelimiterFilter in the text
field in the example schema.xml.  The parameters are set differently for the
indexing and querying.  Namely, catenateWords and catenateNumbers are set
differently.  Shouldn't the same analysis be done at both index and query
time?


Bill

<!-- A text field that uses WordDelimiterFilter to enable splitting and
     matching of words on case-change, alpha numeric boundaries, and
     non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
     match a document containing "Wi-Fi".
     Synonyms and stopwords are customized by external files, and
     stemming is enabled.
-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>


Re: doing searches from within an UpdateRequestProcessor

2009-10-13 Thread Bill Au
Thanks for the info.  Just want to be sure that I am on the right track
before I go too deep.

Bill
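
For reference, a rough sketch of the kind of processor Noble describes below
(the factory name and the cleanup logic are made up, and the package of
SolrQueryResponse moved between Solr releases):

import java.io.IOException;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.DeleteUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ReferenceCleanupProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(final SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processDelete(DeleteUpdateCommand cmd) throws IOException {
                SolrIndexSearcher searcher = req.getSearcher();
                // query the searcher here for documents that reference the
                // deleted id and queue them for reindexing (details depend
                // on the schema)
                super.processDelete(cmd);
            }
        };
    }
}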

2009/10/12 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 A custom UpdateRequestProcessor is the solution. You can access the
 searcher in a UpdateRequestProcessor.

 On Tue, Oct 13, 2009 at 4:20 AM, Bill Au bill.w...@gmail.com wrote:
  Is it possible to do searches from within an UpdateRequestProcessor?  The
  documents in my index reference each other.  When a document is deleted,
 I
  would like to update all documents containing a reference to the deleted
  document.  My initial idea is to use a custom UpdateRequestProcessor.  Is
  there a better way to do this?
  Bill
 



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com



two facet.prefix on one facet field in a single query

2009-10-12 Thread Bill Au
Is it possible to have two different facet.prefix on the same facet field in
a single query?  I want to get facet counts for two prefixes, xx and yy.  I
tried using two facet.prefix parameters (i.e. facet.prefix=xx&facet.prefix=yy)
but the second one seems to have no effect.

Bill


doing searches from within an UpdateRequestProcessor

2009-10-12 Thread Bill Au
Is it possible to do searches from within an UpdateRequestProcessor?  The
documents in my index reference each other.  When a document is deleted, I
would like to update all documents containing a reference to the deleted
document.  My initial idea is to use a custom UpdateRequestProcessor.  Is
there a better way to do this?
Bill


Re: two facet.prefix on one facet field in a single query

2009-10-12 Thread Bill Au
It looks like there is a JIRA covering this:

https://issues.apache.org/jira/browse/SOLR-1387

On Mon, Oct 12, 2009 at 11:00 AM, Bill Au bill.w...@gmail.com wrote:

 Is it possible to have two different facet.prefix on the same facet field
 in a single query?  I want to get facet counts for two prefixes, xx and
 yy.  I tried using two facet.prefix parameters (i.e. facet.prefix=xx&facet.prefix=yy)
 but the second one seems to have no effect.

 Bill



Re: cleanup old index directories on slaves

2009-10-05 Thread Bill Au
Have you looked at snapcleaner?

http://wiki.apache.org/solr/SolrCollectionDistributionScripts#snapcleaner
http://wiki.apache.org/solr/CollectionDistribution#snapcleaner

Bill

On Mon, Oct 5, 2009 at 4:58 PM, solr jay solr...@gmail.com wrote:

 Is there a reliable way to safely clean up index directories? This is
 needed
 mainly on slave side as in several situations, an old index directory is
 replaced with a new one, and I'd like to remove those that are no longer in
 use.

 Thanks,

 --
 J



Re: Solr and Garbage Collection

2009-10-03 Thread Bill Au
SUN has recently clarified the issue regarding unsupported unless you pay
for the G1 garbage collector. Here are the updated release notes for Java 6
update 14:
http://java.sun.com/javase/6/webnotes/6u14.html


G1 will be part of Java 7, fully supported without pay.  The version
included in Java 6 update 14 is a beta release.  Since it is beta, SUN does
not recommend using it unless you have a support contract because as with
any beta software there will be bugs.  Non paying customers may very well
have to wait for the official version in Java 7 for bug fixes.

Here is more info on the G1 garbage collector:

http://java.sun.com/javase/technologies/hotspot/gc/g1_intro.jsp


Bill

On Sat, Oct 3, 2009 at 1:28 PM, Mark Miller markrmil...@gmail.com wrote:

 Another option of course, if you're using a recent version of Java 6:

 try out the beta-ish, unsupported unless you pay, G1 garbage collector.
 I've only recently started playing with it, but its supposed to be much
 better than CMS. Its supposedly got much better throughput, its much
 better at dealing with fragmentation issues (CMS is actually pretty bad
 with fragmentation come to find out), and overall its just supposed to
 be a very nice leap ahead in GC. Havn't had a chance to play with it
 much myself, but its supposed to be fantastic. A whole new approach to
 generational collection for Sun, and much closer to the real time GC's
 available from some other vendors.

 Mark Miller wrote:
  siping liu wrote:
 
  Hi,
 
  I read pretty much all posts on this thread (before and after this one).
 Looks like the main suggestion from you and others is to keep max heap size
 (-Xmx) as small as possible (as long as you don't see OOM exception). This
 brings more questions than answers (for me at least. I'm new to Solr).
 
 
 
  First, our environment and problem encountered: Solr1.4 (nightly build,
 downloaded about 2 months ago), Sun JDK1.6, Tomcat 5.5, running on
 Solaris(multi-cpu/cores). The cache setting is from the default
 solrconfig.xml (looks very small). At first we used minimum JAVA_OPTS and
 quickly run into the problem similar to the one orignal poster reported --
 long pause (seconds to minutes) under load test. jconsole showed that it
 pauses on GC. So more JAVA_OPTS get added: -XX:+UseConcMarkSweepGC
 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:SurvivorRatio=2
 -XX:NewSize=128m -XX:MaxNewSize=512m -XX:MaxGCPauseMillis=200, the thinking
 is with mutile-cpu/cores we can get over with GC as quickly as possibe. With
 the new setup, it works fine until Tomcat reaches heap size, then it blocks
 and takes minutes on full GC to get more space from tenure generation.
 We tried different Xmx (from very small to large), no difference in long GC
 time. We never run into OOM.
 
 
  MaxGCPauseMillis doesnt work with UseConcMarkSweepGC - its for use with
  the Parallel collector. That also doesnt look like a good survivorratio.
 
 
 
  Questions:
 
  * In general various cachings are good for performance, we have more RAM
 to use and want to use more caching to boost performance, isn't your
 suggestion (of lowering heap limit) going against that?
 
 
  Leaving RAM for the FileSystem cache is also very important. But you
  should also have enough RAM for your Solr caches of course.
 
  * Looks like Solr caching made its way into tenure-generation on heap,
 that's good. But why they get GC'ed eventually?? I did a quick check of Solr
 code (Solr 1.3, not 1.4), and see a single instance of using WeakReference.
 Is that what is causing all this? This seems to suggest a design flaw in
 Solr's memory management strategy (or just my ignorance about Solr?). I
 mean, wouldn't this be the right way of doing it -- you allow user to
 specify the cache size in solrconfig.xml, then user can set up heap limit in
 JAVA_OPTS accordingly, and no need to use WeakReference (BTW, why not
 SoftReference)??
 
 
  Do you see concurrent mode failure when looking at your gc logs? ie:
 
  174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.618
  secs]174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K),
  4.0975124 secs] 228336K->162118K(241520K)
 
  That means you have still getting major collections with CMS, and you
  don't want that. You might try kicking GC off earlier with something
  like: -XX:CMSInitiatingOccupancyFraction=50
 
  * Right now I have a single Tomcat hosting Solr and other applications.
 I guess now it's better to have Solr on its own Tomcat, given that it's
 tricky to adjust the java options.
 
 
 
  thanks.
 
 
 
 
 
  From: wun...@wunderwood.org
  To: solr-user@lucene.apache.org
  Subject: RE: Solr and Garbage Collection
  Date: Fri, 25 Sep 2009 09:51:29 -0700
 
  30ms is not better or worse than 1s until you look at the service
  requirements. For many applications, it is worth dedicating 10% of your
  processing time to GC if that makes the worst-case pause short.
 
  On the other hand, my experience with the IBM JVM was that the maximum
 query
  

Re: Solr and Garbage Collection

2009-10-03 Thread Bill Au
SUN's initial release notes actually pretty much said that it was
unsupported unless you pay.  They have since revised the release notes to
clear up the confusion.
Bill

On Sat, Oct 3, 2009 at 2:51 PM, Mark Miller markrmil...@gmail.com wrote:

 Ah, yes - thanks for the clarification. Didn't pay attention to how
 ambiguously I was using supported there :)

 Bill Au wrote:
  SUN has recently clarified the issue regarding unsupported unless you pay
  for the G1 garbage collector. Here are the updated release notes for Java 6
 update
  14:
  http://java.sun.com/javase/6/webnotes/6u14.html
 
 
  G1 will be part of Java 7, fully supported without pay.  The version
  included in Java 6 update 14 is a beta release.  Since it is beta, SUN
 does
  not recommend using it unless you have a support contract because as with
  any beta software there will be bugs.  Non paying customers may very well
  have to wait for the official version in Java 7 for bug fixes.
 
  Here is more info on the G1 garbage collector:
 
  http://java.sun.com/javase/technologies/hotspot/gc/g1_intro.jsp
 
 
  Bill
 
  On Sat, Oct 3, 2009 at 1:28 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  Another option of course, if you're using a recent version of Java 6:
 
  try out the beta-ish, unsupported unless you pay, G1 garbage collector.
  I've only recently started playing with it, but its supposed to be much
  better than CMS. Its supposedly got much better throughput, its much
  better at dealing with fragmentation issues (CMS is actually pretty bad
  with fragmentation come to find out), and overall its just supposed to
  be a very nice leap ahead in GC. Havn't had a chance to play with it
  much myself, but its supposed to be fantastic. A whole new approach to
  generational collection for Sun, and much closer to the real time GC's
  available from some other vendors.
 
  Mark Miller wrote:
 
  siping liu wrote:
 
 
  Hi,
 
  I read pretty much all posts on this thread (before and after this
 one).
 
  Looks like the main suggestion from you and others is to keep max heap
 size
  (-Xmx) as small as possible (as long as you don't see OOM exception).
 This
  brings more questions than answers (for me at least. I'm new to Solr).
 
 
  First, our environment and problem encountered: Solr1.4 (nightly
 build,
 
  downloaded about 2 months ago), Sun JDK1.6, Tomcat 5.5, running on
  Solaris(multi-cpu/cores). The cache setting is from the default
  solrconfig.xml (looks very small). At first we used minimum JAVA_OPTS
 and
  quickly run into the problem similar to the one orignal poster reported
 --
  long pause (seconds to minutes) under load test. jconsole showed that it
  pauses on GC. So more JAVA_OPTS get added: -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:SurvivorRatio=2
  -XX:NewSize=128m -XX:MaxNewSize=512m -XX:MaxGCPauseMillis=200, the
 thinking
  is with mutile-cpu/cores we can get over with GC as quickly as possibe.
 With
  the new setup, it works fine until Tomcat reaches heap size, then it
 blocks
  and takes minutes on full GC to get more space from tenure
 generation.
  We tried different Xmx (from very small to large), no difference in long
 GC
  time. We never run into OOM.
 
 
  MaxGCPauseMillis doesnt work with UseConcMarkSweepGC - its for use with
  the Parallel collector. That also doesnt look like a good
 survivorratio.
 
 
  Questions:
 
  * In general various cachings are good for performance, we have more
 RAM
 
  to use and want to use more caching to boost performance, isn't your
  suggestion (of lowering heap limit) going against that?
 
 
  Leaving RAM for the FileSystem cache is also very important. But you
  should also have enough RAM for your Solr caches of course.
 
 
  * Looks like Solr caching made its way into tenure-generation on heap,
 
  that's good. But why they get GC'ed eventually?? I did a quick check of
 Solr
  code (Solr 1.3, not 1.4), and see a single instance of using
 WeakReference.
  Is that what is causing all this? This seems to suggest a design flaw in
  Solr's memory management strategy (or just my ignorance about Solr?). I
  mean, wouldn't this be the right way of doing it -- you allow user to
  specify the cache size in solrconfig.xml, then user can set up heap
 limit in
  JAVA_OPTS accordingly, and no need to use WeakReference (BTW, why not
  SoftReference)??
 
 
  Do you see concurrent mode failure when looking at your gc logs? ie:
 
  174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.618
  secs]174.446: [CMS (concurrent mode failure):
 161928K->162118K(175104K),
  4.0975124 secs] 228336K->162118K(241520K)
 
  That means you have still getting major collections with CMS, and you
  don't want that. You might try kicking GC off earlier with something
  like: -XX:CMSInitiatingOccupancyFraction=50
 
 
  * Right now I have a single Tomcat hosting Solr and other
 applications.
 
  I guess now it's better to have Solr on its own Tomcat, given that it's
  tricky to adjust

Re: snapshot creation and distribution

2009-10-02 Thread Bill Au
A snapshot is a copy of the index at a particular moment in time.  So
changes in earlier snapshots are in the latest one as well.  Nothing is
missed by pulling the latest snapshot.

When triggering snapshooter with the postCommit hook, a commit always
results in a snapshot being created.

Bill

On Fri, Oct 2, 2009 at 11:52 AM, robert@virginmoney.com wrote:


 Hello,

 A couple questions with regard to snapshots and distribution:

 1. If two snapshots are created in between a snappull, are the changes from
 the first snapshot missed by the slave, as it only pulls the most recent
 snapshot?

 2. When triggering snapshooter from the postCommit hook, does a commit
 always result in a snapshot being created, or is there any kind of quiet
 period?

 Many thanks,
 Rob.





Re: TermVector term frequencies for tag cloud

2009-10-02 Thread Bill Au
Have you considered using facet counts for your tag cloud?
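
For example, with SolrJ something along these lines returns term/count pairs
you can size a cloud with (a sketch; the field name "tags", the URL and the
limits are assumptions).  Note the counts here are document counts across the
result set rather than within-document term frequencies:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TagCloudCounts {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                 // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("tags");      // hypothetical field holding the tags
        q.setFacetLimit(50);
        q.setFacetMinCount(1);
        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("tags").getValues()) {
            System.out.println(c.getName() + " " + c.getCount());
        }
    }
}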

Bill

On Fri, Oct 2, 2009 at 11:34 AM, aod...@gmail.com wrote:

 Hello,

 I'm trying to create a tag cloud from a term vector, but the array
 returned (using JSON wt) is quite complex and takes an inordinate
 amount of time to process. Is there a better way to retrieve terms and
 their document TF? The TermVectorComponent allows for retrieval of tf
 and df though I'm only interested in TF. I know the TermsComponent
 gives you DF, but I need TF!

 Any suggestions,

 Thanks,

 Aodh.



Re: Solr Trunk Heap Space Issues

2009-10-01 Thread Bill Au
You probably want to add the following command line option to java to
produce a heap dump:

-XX:+HeapDumpOnOutOfMemoryError

Then you can use jhat to see what's taking up all the space in the heap.

Bill

On Thu, Oct 1, 2009 at 11:47 AM, Mark Miller markrmil...@gmail.com wrote:

 Jeff Newburn wrote:
  I am trying to update to the newest version of solr from trunk as of May
  5th.  I updated and compiled from trunk as of yesterday (09/30/2009).
  When
  I try to do a full import I am receiving a GC heap error after changing
  nothing in the configuration files.  Why would this happen in the most
  recent versions but not in the version from a few months ago.
 Good question. The error means its spending too much time trying to
 garbage collect without making much progress.
 Why so much more garbage to collect just by updating? Not sure...

  The stack
  trace is below.
 
  Oct 1, 2009 8:34:32 AM
 org.apache.solr.update.processor.LogUpdateProcessor
  finish
  INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316,
 167353,
  ...(83 more)]} 0 35991
  Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
  SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.util.Arrays.copyOfRange(Arrays.java:3209)
  at java.lang.String.init(String.java:215)
  at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
  at
 com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
  at
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
  at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
  reamHandlerBase.java:54)
  at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
  java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
  38)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
  241)
  at
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
  FilterChain.java:235)
  at
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
  ain.java:206)
  at
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
  va:233)
  at
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
  va:175)
  at
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
  )
  at
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
  )
  at
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
  :109)
  at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
  at
 
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
  879)
  at
 
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
  ttp11NioProtocol.java:719)
  at
 
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
  2080)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
  va:886)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
  08)
  at java.lang.Thread.run(Thread.java:619)
 
  Oct 1, 2009 8:40:06 AM org.apache.solr.core.SolrCore execute
  INFO: [zeta-main] webapp=/solr path=/update params={} status=500
 QTime=5265
  Oct 1, 2009 8:40:12 AM org.apache.solr.common.SolrException log
  SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
 
 


 --
 - Mark

 http://www.lucidimagination.com






Re: Solr and Garbage Collection

2009-09-28 Thread Bill Au
One way to track expensive queries is to look at the query time, QTime, that
Solr prints for each request in its log.
There are a couple of tools for analyzing gc logs:

http://www.tagtraum.com/gcviewer.html
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPJMETER

They will give you the frequency and duration of minor and major collections.

On a multi-processor/core system with CPU cycles to spare, using the
concurrent collector will reduce (may even eliminate) major collections.  The
trade-off is that CPU utilization on the system will go up.  When I tried it
with one of my Java apps, the system utilization went up so much under heavy
load that it reduced the overall throughput of the app.  Your mileage may
vary.  You will have to measure it for your app to see for yourself.

Bill
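
As for finding the N most expensive queries Jonathan asks about below, a
quick-and-dirty log scan does the trick, something like this sketch (the log
path, the line format and the 500 ms threshold are all assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlowQueryScan {
    public static void main(String[] args) throws Exception {
        Pattern qtime = Pattern.compile("QTime=(\\d+)");
        BufferedReader in = new BufferedReader(new FileReader("solr.log"));  // assumed log location
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = qtime.matcher(line);
            if (m.find() && Integer.parseInt(m.group(1)) > 500) {  // report anything slower than 500 ms
                System.out.println(line);
            }
        }
        in.close();
    }
}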

On Mon, Sep 28, 2009 at 4:49 PM, Jonathan Ariel ionat...@gmail.com wrote:

 How do you track major collections? Even better, how do you log your GC
 behavior with details? Right now I just log total time spent on
 collections,
 but I don't really know on which collections.Regard application performance
 with the ConcMarkSweepGC, I think I didn't experience any impact for now.
 Actually the CPU usage of the solr servers is almost insignificant (it was
 like that before).
 BTW, do you know a good way to track the N most expensive solr queries? I
 would like to measure that on 2 different solr servers with different GC.

 On Mon, Sep 28, 2009 at 4:42 PM, Mark Miller markrmil...@gmail.com
 wrote:

  Do you have your GC logs? Are you still seeing major collections?
 
  Where is the time spent?
 
  Hard to say without some of that info.
 
  The goal of the low pause collector is to finish collecting before the
  tenured space is filled - if it doesn't, a standard major collection
  occurs.
 
  The collector will use recent stats it records to try and pick a good
  time to start - as a fail safe though, it will trigger no matter what at
  a certain percentage. With Java 1.5, it was 68% full that it triggered.
  With 1.6, its 92%.
 
  If your still getting major collections, you might want to see if
  lowering that helps (-XX:CMSInitiatingOccupancyFraction=N). If not,
  you might be near optimal settings.
 
  There is likely not anything else you should mess with - unless using
  the extra thread to collect while your app is running affects your apps
  performance - in that case you might want to look into turning on the
  incremental mode. But you havn't mentioned that, so I doubt it.
 
 
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 
 
  Jonathan Ariel wrote:
   Ok... good news! Upgrading to the newest version of JVM 6 (update 6)
  seems
   to solve this ugly bug. With the upgraded JVM I could run the solr
  servers
   for more than 12 hours on the production environment with the GC
  mentioned
   in the previous e-mails. The results are really amazing. The time spent
  on
   collecting memory dropped from 11% to 3.81%Do you think there is more
 to
   tune there?
  
   Thanks!
  
   Jonathan
  
   On Sun, Sep 27, 2009 at 8:39 PM, Bill Au bill.w...@gmail.com wrote:
  
  
   You are running a very old version of Java 6 (update 6).  The latest
 is
   update 16.  You should definitely upgrade.  There is a bug in Java 6
   starting with update 4 that may result in a corrupted Lucene/Solr
 index:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044
   https://issues.apache.org/jira/browse/LUCENE-1282
  
   The JVM crash occurred in the gc thread.  So it looks like a bug in
 the
  JVM
   itself.  Upgrading to the latest release might help.  Switching to a
   different garbage collector should help.
  
   Bill
  
   On Sat, Sep 26, 2009 at 4:31 PM, Mark Miller markrmil...@gmail.com
   wrote:
  
  
   Jonathan Ariel wrote:
  
   Ok. After the server ran for more than 12 hours, the time spent on
 GC
   decreased from 11% to 3,4%, but 5 hours later it crashed. This is
 the
  
   thread
  
   dump, maybe you can help identify what happened?
  
  
   Well thats a tough ;) My guess is its a bug :)
  
   Your two survivor spaces are filled, so it was likely about to move
   objects into the tenured space, which still has plenty of room for
 them
   (barring horrible fragmentation). Any issues with that type of thing
   should generate an OOM anyway though. You can find people that have
 run
   into similar issues in the past, but a lot of times unreproducible.
   Usually, their bugs are closed and they are told to try a newer JVM.
  
   Your JVM appears to be quite a few versions back. There have been
 many
   garbage collection bugs fixed in the 7 or so updates since your
  version,
   a good handful of them related to CMS.
  
   If you can, my best suggestion at the moment is to upgrade to the
  latest
   and see how that fairs.
  
   If not, you might see if going back to the throughput collector and
   turning on the parallel tenured space collector might meet your needs
   instead. You can work with other params to get that going better

Re: Solr and Garbage Collection

2009-09-27 Thread Bill Au
You are running a very old version of Java 6 (update 6).  The latest is
update 16.  You should definitely upgrade.  There is a bug in Java 6
starting with update 4 that may result in a corrupted Lucene/Solr index:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044
https://issues.apache.org/jira/browse/LUCENE-1282

The JVM crash occurred in the gc thread.  So it looks like a bug in the JVM
itself.  Upgrading to the latest release might help.  Switching to a
different garbage collector should help.

Bill

On Sat, Sep 26, 2009 at 4:31 PM, Mark Miller markrmil...@gmail.com wrote:

 Jonathan Ariel wrote:
  Ok. After the server ran for more than 12 hours, the time spent on GC
  decreased from 11% to 3,4%, but 5 hours later it crashed. This is the
 thread
  dump, maybe you can help identify what happened?
 
 Well thats a tough ;) My guess is its a bug :)

 Your two survivor spaces are filled, so it was likely about to move
 objects into the tenured space, which still has plenty of room for them
 (barring horrible fragmentation). Any issues with that type of thing
 should generate an OOM anyway though. You can find people that have run
 into similar issues in the past, but a lot of times unreproducible.
 Usually, their bugs are closed and they are told to try a newer JVM.

 Your JVM appears to be quite a few versions back. There have been many
 garbage collection bugs fixed in the 7 or so updates since your version,
 a good handful of them related to CMS.

 If you can, my best suggestion at the moment is to upgrade to the latest
 and see how that fairs.

 If not, you might see if going back to the throughput collector and
 turning on the parallel tenured space collector might meet your needs
 instead. You can work with other params to get that going better if you
 have to as well.

 Also, adjusting other settings with the low pause collector might
 trigger something to side step the bug. Not a great option there though ;)

 How many unique fields are you sorting/faceting on? It must be a lot if
 you need 10 gig for 8 million documents. Its kind of rough to have to
 work at such a close limit to your total heap available as a min mem
 requirement.

 --
 - Mark

 http://www.lucidimagination.com


  #
  # An unexpected error has been detected by Java Runtime Environment:
  #
  #  SIGSEGV (0xb) at pc=0x2b4e0f69ea2a, pid=32224, tid=1103812928
  #
  # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b22 mixed mode
  linux-amd64)
  # Problematic frame:
  # V  [libjvm.so+0x265a2a]
  #
  # If you would like to submit a bug report, please visit:
  #   http://java.sun.com/webapps/bugreport/crash.jsp
  #
 
  ---  T H R E A D  ---
 
  Current thread (0x5be47400):  VMThread [stack:
  0x41bad000,0x41cae000] [id=32249]
 
  siginfo:si_signo=SIGSEGV: si_errno=0, si_code=128 (),
  si_addr=0x
 
  Registers:
  RAX=0x2aac929b4c70, RBX=0x0037c985003a095e, RCX=0x0006,
  RDX=0x005c49870037c996
  RSP=0x41cac550, RBP=0x41cac550, RSI=0x2aac929b4c70,
  RDI=0x0037c985003a095e
  R8 =0x2aadab201538, R9 =0x0005, R10=0x0001,
  R11=0x0010
  R12=0x2aac929b4c70, R13=0x2aac9289cf58, R14=0x2aac9289cf40,
  R15=0x2aadab2015ac
  RIP=0x2b4e0f69ea2a, EFL=0x00010206,
 CSGSFS=0x0033,
  ERR=0x
TRAPNO=0x000d
 
  Top of Stack: (sp=0x41cac550)
  0x41cac550:   41cac580 2b4e0f903c5b
  0x41cac560:   41cac590 0003
  0x41cac570:   2aac9289cf50 2aadab2015a8
  0x41cac580:   41cac5c0 2b4e0f72e388
  0x41cac590:   41cac5c0 2aac9289cf40
  0x41cac5a0:   0005 2b4e0fc86330
  0x41cac5b0:    2b4e0fd8c740
  0x41cac5c0:   41cac5f0 2b4e0f903b7f
  0x41cac5d0:   41cac610 0003
  0x41cac5e0:   2aaccb1750f8 2aaccea41570
  0x41cac5f0:   41cac610 2b4e0f931548
  0x41cac600:   2b4e0fc861d8 2aadd4052ab0
  0x41cac610:   41cac640 2b4e0f903d1a
  0x41cac620:   41cac650 0003
  0x41cac630:   5bc7d6d0 2b4e0fd8c740
  0x41cac640:   41cac650 2b4e0f90411c
  0x41cac650:   41cac680 2b4e0fa1d16e
  0x41cac660:    5bc7d6d0
  0x41cac670:   0002 2b4e0fd8c740
  0x41cac680:   41cac6c0 2b4e0fa74640
  0x41cac690:   41cac6b0 5bc7d6d0
  0x41cac6a0:   0002 2b4e0fd8c740
  0x41cac6b0:   0001 2b4e0fd8c740
  0x41cac6c0:   41cac700 2b4e0f9a52da
  0x41cac6d0:   bfc0 
  0x41cac6e0:   2b4e0fd8c740 5bc7d6d0
  

Re: Any way to encrypt/decrypt stored fields?

2009-09-16 Thread Bill Au
That's certainly something that is doable with a filter.  I am not aware of
any available.

Bill

On Wed, Sep 16, 2009 at 10:39 AM, Jay Hill jayallenh...@gmail.com wrote:

 For security reasons (say I'm indexing very sensitive data, medical records
 for example) is there a way to encrypt data that is stored in Solr? Some
 businesses I've encountered have such needs and this is a barrier to them
 adopting Solr to replace other legacy systems. Would it require a
 custom-written filter to encrypt during indexing and decrypt at query time,
 or is there something I'm unaware of already available to do this?

 -Jay



Re: Is it possible to query for everything ?

2009-09-14 Thread Bill Au
For the standard query handler, try [* TO *].
Bill

On Mon, Sep 14, 2009 at 8:46 PM, Jay Hill jayallenh...@gmail.com wrote:

 With dismax you can use q.alt when the q param is missing:
 q.alt=*:*
 should work.

 -Jay


 On Mon, Sep 14, 2009 at 5:38 PM, Jonathan Vanasco jvana...@2xlp.com
 wrote:

  Thanks Jay  Matt
 
  I tried *:* on my app, and it didn't work
 
  I tried it on the solr admin, and it did
 
  I checked the solr config file, and realized that it works on standard,
 but
  not on dismax, queries
 
  So i have my app checking *:* on a standard qt, and then filtering what I
  need on other qts!
 
  I would never have figured this out without you two!
 



Re: UpdateRequestProcessor config location

2009-08-28 Thread Bill Au
 You also need a requestHandler that uses your updateRequestProcessorChain:

  <requestHandler name="/customUpdate" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">custom</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain name="custom">
  ...
  </updateRequestProcessorChain>


Bill

On Fri, Aug 28, 2009 at 11:44 AM, Mark Miller markrmil...@gmail.com wrote:

 Erik Earle wrote:
  I've read through the wiki for this and it explains most everything
 except where in the solrconfig.xml the updateRequestProcessorChain goes.
 
  I tried it at the top level but that doesn't seem to do anything.
 
  http://wiki.apache.org/solr/UpdateRequestProcessor
 
 
 
 
 
 Look at the example schema - it places it at the bottom - try a similar
 location (it should just need to be in the same section or nesting at
 most).

 --
 - Mark

 http://www.lucidimagination.com






Re: Using Lucene's payload in Solr

2009-08-26 Thread Bill Au
While testing my code I discovered that my copyField with PatternTokenizer
does not do what I want.  This is what I am indexing into Solr:

<field name="title">2.0|Solr In Action</field>

My copyField is simply:

   <copyField source="title" dest="titleRaw"/>

field titleRaw is of type title_raw:

<fieldType name="title_raw" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^#]*#(.*)" group="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

For my example input Solr in Action is indexed into the titleRaw field
without the payload.  But the payload is still stored.  So when I retrieve
the field titleRaw I still get back 2.0|Solr in Action where what I really
want is just Solr in Action.

Is it possible to have the copyField strip off the payload while it is
copying, since doing it in the analysis phase is too late?  Or should I
start looking into using UpdateProcessors as Chris had suggested?

Bill
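
For what it's worth, the clone-and-strip UpdateProcessor approach Hoss
describes below would look something like this (a rough sketch using a simple
indexOf instead of a regex; the class name and field names are made up, and
package names differ a little between Solr versions):

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class StripPayloadProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object title = doc.getFieldValue("title");        // e.g. "2.0|Solr In Action"
                if (title != null) {
                    String s = title.toString();
                    int bar = s.indexOf('|');
                    // copy the value minus the leading payload into titleRaw
                    doc.setField("titleRaw", bar >= 0 ? s.substring(bar + 1) : s);
                }
                super.processAdd(cmd);
            }
        };
    }
}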

On Fri, Aug 21, 2009 at 12:04 PM, Bill Au bill.w...@gmail.com wrote:

 I ended up not using an XML attribute for the payload since I need to
 return the payload in query response.  So I ended up going with:

  <field name="title">2.0|Solr In Action</field>

 My payload is numeric so I can pick a non-numeric delimiter (ie '|').
 Putting the payload in front means I don't have to worry about the delimiter
 appearing in the value.  The payload is required in my case so I can simply
 look for the first occurrence of the delimiter and ignore the possibility of
 the delimiter appearing in the value.

 I ended up writing a custom Tokenizer and a copy field with a
  PatternTokenizerFactory to filter out the delimiter and payload.  That is
  straightforward in terms of implementation.  On top of that I can still use
 the CSV loader, which I really like because of its speed.

 Bill.

 On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter 
 hossman_luc...@fucit.org wrote:


 : of the field are correct but the delimiter and payload are stored so
 they
 : appear in the response also.  Here is an example:
 ...
 : I am thinking maybe I can do this instead when indexing:
 :
 : XML for indexing:
  : <field name="title" payload="2.0">Solr In Action</field>
 :
 : This will simplify indexing as I don't have to repeat the payload for
 each

 but now you're into a custom request handler for the updates to deal with
 the custom XML attribute so you can't use DIH, or CSV loading.

 It seems like it might be simpler have two new (generic) UpdateProcessors:
 one that can clone fieldA into fieldB, and one that can do regex mutations
 on fieldB ... neither needs to know about payloads at all, but the first
 can made a copy of 2.0|Solr In Action and the second can strip off the
 2.0| from the copy.

 then you can write a new NumericPayloadRegexTokenizer that takes in two
 regex expressions -- one that knows how to extract the payload from a
 piece of input, and one that specifies the tokenization.

 those three classes seem easier to implemnt, easier to maintain, and more
 generally reusable then a custom xml request handler for your updates.


 -Hoss





frequency of commit when building index from scratch

2009-08-25 Thread Bill Au
Just curious, how often do folks commit when building their Solr/Lucene
index from scratch for index with millions of documents?  Should I just wait
and do a single commit at the end after adding all the documents to the
index?

Bill


Re: frequency of commit when building index from scratch

2009-08-25 Thread Bill Au
That's my gut feeling (start big and go lower if OOM occurs) too.

Bill
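
A rough SolrJ sketch of the commit-every-N approach Edward describes below
(the server URL and the batch size are illustrative):

import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void load(List<SolrInputDocument> docs) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int batchSize = 10000;   // tune to document size and available heap
        int added = 0;
        for (SolrInputDocument doc : docs) {
            server.add(doc);
            if (++added % batchSize == 0) {
                server.commit();   // flush periodically instead of once at the very end
            }
        }
        server.commit();           // pick up the remainder
    }
}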

On Tue, Aug 25, 2009 at 5:34 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Tue, Aug 25, 2009 at 5:29 PM, Bill Aubill.w...@gmail.com wrote:
  Just curious, how often do folks commit when building their Solr/Lucene
  index from scratch for index with millions of documents?  Should I just
 wait
  and do a single commit at the end after adding all the documents to the
  index?
 
  Bill
 

 Bill, in most cases you probably cannot do one large commit as you will
 hit OOM. How many documents can be uncommitted is based on the size of
 the documents. Committing every document is slow. I have done a commit
 every 10,000 mostly. Results may vary. Someone might have a better
 answer than me.



Re: Using Lucene's payload in Solr

2009-08-21 Thread Bill Au
I ended up not using an XML attribute for the payload since I need to return
the payload in query response.  So I ended up going with:

<field name="title">2.0|Solr In Action</field>

My payload is numeric so I can pick a non-numeric delimiter (ie '|').
Putting the payload in front means I don't have to worry about the delimiter
appearing in the value.  The payload is required in my case so I can simply
look for the first occurrence of the delimiter and ignore the possibility of
the delimiter appearing in the value.

I ended up writing a custom Tokenizer and a copy field with a
PatternTokenizerFactory to filter out the delimiter and payload.  That is
straightforward in terms of implementation.  On top of that I can still use
the CSV loader, which I really like because of its speed.

Bill.

On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : of the field are correct but the delimiter and payload are stored so they
 : appear in the response also.  Here is an example:
 ...
 : I am thinking maybe I can do this instead when indexing:
 :
 : XML for indexing:
  : <field name="title" payload="2.0">Solr In Action</field>
 :
 : This will simplify indexing as I don't have to repeat the payload for
 each

 but now you're into a custom request handler for the updates to deal with
 the custom XML attribute so you can't use DIH, or CSV loading.

 It seems like it might be simpler have two new (generic) UpdateProcessors:
 one that can clone fieldA into fieldB, and one that can do regex mutations
 on fieldB ... neither needs to know about payloads at all, but the first
 can made a copy of 2.0|Solr In Action and the second can strip off the
 2.0| from the copy.

 then you can write a new NumericPayloadRegexTokenizer that takes in two
 regex expressions -- one that knows how to extract the payload from a
 piece of input, and one that specifies the tokenization.

 those three classes seem easier to implemnt, easier to maintain, and more
 generally reusable then a custom xml request handler for your updates.


 -Hoss




Re: Issue with Collection Distribution

2009-08-18 Thread Bill Au
I say it is worth upgrading since 1.2 is old.  1.4 is almost ready to be
released.  So you may want to wait a little while longer.  There are many
nice new features in 1.4.  There are performance improvements too.  In the
meantime, you can just get the latest version of the scripts from SVN.
Those should work as is.

Bill

On Tue, Aug 18, 2009 at 6:54 AM, william pink will.p...@gmail.com wrote:

 Hi,

 Sorry for the delayed response didn't even realise I had got a reply, those
 logs are from the slave and the both version of Solr are the same

 Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12

 It maybe worth upgrading them?

 Thank you for the assistance,
 Will

 On Thu, Aug 13, 2009 at 6:28 PM, Bill Au bill.w...@gmail.com wrote:

   Have you checked the solr log on the slave to see if there was any commit
   done?  It looks to me like you are still using an older version of the commit
   script that is not compatible with the newer Solr response format.  If
   that's the case, the commit was actually performed.  It is just that the
  script failed to handle the Solr response.  See
 
  https://issues.apache.org/jira/browse/SOLR-463
  https://issues.apache.org/jira/browse/SOLR-426
 
  Bill
 
  On Thu, Aug 13, 2009 at 12:28 PM, william pink will.p...@gmail.com
  wrote:
 
   Hello,
  
   I am having a few problems with the snapinstaller/commit on the slave,
 I
   have a pull_from_master script which is the following
  
   #!/bin/bash
   cd /opt/solr/solr/bin -v
   ./snappuller -v -P 18983
   ./snapinstaller -v
  
  
   I have been executing snapshooter manually on the master then running
 the
   above script to test but I am getting the following
  
   2009/08/13 17:18:16 notifing Solr to open a new Searcher
   2009/08/13 17:18:16 failed to connect to Solr server
   2009/08/13 17:18:17 snapshot installed but Solr server has not open a
 new
   Searcher
  
   Commit logs
  
   2009/08/13 17:18:16 started by user
   2009/08/13 17:18:16 command: /opt/solr/solr/bin/commit
   2009/08/13 17:18:16 commit request to Solr at
   http://slave-server:8983/solr/update failed:
    2009/08/13 17:18:16 <?xml version="1.0" encoding="UTF-8"?> <response> <lst
    name="responseHeader"><int name="status">0</int><int
    name="QTime">28</int></lst> </response>
   2009/08/13 17:18:16 failed (elapsed time: 0 sec)
  
   Snappinstaller logs
  
   2009/08/13 17:18:16 started by user
   2009/08/13 17:18:16 command: ./snapinstaller -v
   2009/08/13 17:18:16 installing snapshot
   /opt/solr/solr/data/snapshot.20090813171835
   2009/08/13 17:18:16 notifing Solr to open a new Searcher
   2009/08/13 17:18:16 failed to connect to Solr server
   2009/08/13 17:18:17 snapshot installed but Solr server has not open a
 new
   Searcher
   2009/08/13 17:18:17 failed (elapsed time: 1 sec)
  
  
   Is there a way of telling why it is failing?
  
   Many Thanks,
   Will
  
 



Re: How to boost fields with many terms against single-term?

2009-08-18 Thread Bill Au
Lucene's default scoring formula gives shorter fields a higher score:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

Sounds like you want the opposite.  You can write your own Similarity class
overriding the lengthNorm() method:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#lengthNorm%28java.lang.String,%20int%29
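
Something like this (just a sketch -- the numbers are made up and you would
want to tune them against your own data) inverts that bias by damping the
norm of very short fields:

import org.apache.lucene.search.DefaultSimilarity;

public class LongerFieldSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    // default is 1/sqrt(numTokens), which rewards short fields
    if (numTokens <= 1) {
      return 0.5f;   // push single-term values like "home" down
    }
    return super.lengthNorm(fieldName, numTokens);
  }
}

Then point the similarity element in schema.xml at that class.  Keep in mind
norms are written at index time, so you have to reindex after changing this.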

Bill


On Tue, Aug 18, 2009 at 3:02 PM, Fuad Efendi f...@efendi.ca wrote:

 I don't want single-term docs such as "home" to appear at the top for a simple
 search for "a home"; I need "home improvement made easy" at the top... How to
 implement it at query time?

 Thanks!






Re: Using Lucene's payload in Solr

2009-08-14 Thread Bill Au
Thanks for sharing your code, Ken.  It is pretty much the same code that I
have written except that my custom QueryParser extends Solr's
SolrQueryParser instead of Lucene's QueryParser.  I am also using BFTQ
instead of BTQ.  I have tested it and do see the payload being used in the
explain output.

Functionally I have got everything working now.  I still have to decide how I
want to index the payload (using DelimitedPayloadTokenFilter or my own
custom format/code).
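
For reference, the DelimitedPayloadTokenFilter route would look roughly like
this if wired up in code (schema.xml's DelimitedPayloadTokenFilterFactory does
the same thing declaratively; the analyzer and class name here are just
illustrative and untested):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;

public class DelimitedPayloadAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // "Solr|2.0 In|2.0" indexes as tokens "Solr" and "In", each carrying
    // 2.0 as a float payload; the stored value still keeps the raw text.
    return new DelimitedPayloadTokenFilter(ts, '|', new FloatEncoder());
  }
}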

Bill

On Thu, Aug 13, 2009 at 11:31 AM, Ensdorf Ken ensd...@zoominfo.com wrote:

   It looks like things have changed a bit since this subject was last
   brought
   up here.  I see that there is support in Solr/Lucene for indexing
   payload
   data (DelimitedPayloadTokenFilterFactory and
   DelimitedPayloadTokenFilter).
   Overriding the Similarity class is straight forward.  So the last
   piece of
   the puzzle is to use a BoostingTermQuery when searching.  I think
   all I need
    to do is to subclass Solr's LuceneQParserPlugin, which uses SolrQueryParser
   under
   the cover.  I think all I need to do is to write my own query parser
   plugin
   that uses a custom query parser, with the only difference being in
  the
   getFieldQuery() method where a BoostingTermQuery is used instead of a
   TermQuery.
 
  The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
  which gives some more flexibility in terms of how the spans in a
  single document are scored.
 
  
   Am I on the right track?
 
  Yes.
 
   Has anyone done something like this already?
 

 I wrote a QParserPlugin that seems to do the trick.  This is minimally
 tested - we're not actually using it at the moment, but should get you
 going.  Also, as Grant suggested, you may want to sub BFTQ for BTQ below:

 package com.zoominfo.solr.analysis;

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.*;
 import org.apache.lucene.search.*;
 import org.apache.lucene.search.payloads.BoostingTermQuery;
 import org.apache.solr.common.params.*;
 import org.apache.solr.common.util.NamedList;
 import org.apache.solr.request.SolrQueryRequest;
 import org.apache.solr.search.*;

 public class BoostingTermQParserPlugin extends QParserPlugin {
  public static String NAME = "zoom";

  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams,
 SolrParams params, SolrQueryRequest req) {
    System.out.print("BoostingTermQParserPlugin::createParser\n");
return new BoostingTermQParser(qstr, localParams, params, req);
  }
 }

 class BoostingTermQueryParser extends QueryParser {

public BoostingTermQueryParser(String f, Analyzer a) {
super(f, a);

  System.out.print("BoostingTermQueryParser::BoostingTermQueryParser\n");
}

@Override
protected Query newTermQuery(Term term){
System.out.print("BoostingTermQueryParser::newTermQuery\n");
return new BoostingTermQuery(term);
}
 }

 class BoostingTermQParser extends QParser {
  String sortStr;
  QueryParser lparser;

  public BoostingTermQParser(String qstr, SolrParams localParams, SolrParams
 params, SolrQueryRequest req) {
super(qstr, localParams, params, req);
System.out.print("BoostingTermQParser::BoostingTermQParser\n");
  }


  public Query parse() throws ParseException {
System.out.print("BoostingTermQParser::parse\n");
String qstr = getString();

String defaultField = getParam(CommonParams.DF);
if (defaultField==null) {
  defaultField =
 getReq().getSchema().getSolrQueryParser(null).getField();
}

lparser = new BoostingTermQueryParser(defaultField,
 getReq().getSchema().getQueryAnalyzer());

    // these could either be checked & set here, or in the SolrQueryParser
 constructor
String opParam = getParam(QueryParsing.OP);
if (opParam != null) {
  lparser.setDefaultOperator("AND".equals(opParam) ?
 QueryParser.Operator.AND : QueryParser.Operator.OR);
} else {
  // try to get default operator from schema

  
 lparser.setDefaultOperator(getReq().getSchema().getSolrQueryParser(null).getDefaultOperator());
}

return lparser.parse(qstr);
  }


  public String[] getDefaultHighlightFields() {
return new String[]{lparser.getField()};
  }

 }
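
If you want the BFTQ variant Grant mentioned, the only change should be the
newTermQuery override.  I'm going from memory on the constructor and the
PayloadFunction classes (this is trunk/nightly territory), so treat this as an
untested sketch and check the javadoc of your build:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.BoostingFunctionTermQuery;

class BoostingFunctionTermQueryParser extends BoostingTermQueryParser {
  public BoostingFunctionTermQueryParser(String f, Analyzer a) {
    super(f, a);
  }

  @Override
  protected Query newTermQuery(Term term) {
    // averages the payloads of the term's occurrences in a document;
    // Min/Max payload functions also exist
    return new BoostingFunctionTermQuery(term, new AveragePayloadFunction());
  }
}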



Re: Using Lucene's payload in Solr

2009-08-13 Thread Bill Au
Thanks for the tip on BFTQ.  I have been using a nightly build before that
was committed.  I have upgraded to the latest nightly build and will use that
instead of BTQ.

I got DelimitedPayloadTokenFilter to work and see that the terms and payload
of the field are correct but the delimiter and payload are stored so they
appear in the response also.  Here is an example:

XML for indexing:
<field name="title">Solr|2.0 In|2.0 Action|2.0</field>


XML response:
<doc>
<str name="title">Solr|2.0 In|2.0 Action|2.0</str>
</doc>


I want to set payload on a field that has a variable number of words.  So I
guess I can use a copy field with a PatternTokenizerFactory to filter out
the delimiter and payload.

I am thinking maybe I can do this instead when indexing:

XML for indexing:
<field name="title" payload="2.0">Solr In Action</field>

This will simplify indexing as I don't have to repeat the payload for each
word in the field.  I do have to write a payload-aware update handler.  It
looks like I can use Lucene's NumericPayloadTokenFilter in my custom update
handler to set the payload on all the tokens of the field.
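
Roughly what I have in mind on the analysis side (untested sketch; the class
name is made up, and the 2.0f stands in for whatever the hypothetical
payload="..." attribute carried):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.NumericPayloadTokenFilter;

public class SinglePayloadAnalyzer extends Analyzer {
  private final float payload;

  public SinglePayloadAnalyzer(float payload) {
    this.payload = payload;   // e.g. 2.0f pulled from the payload attribute
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // WhitespaceTokenizer emits tokens of type "word", so every token of
    // the field value gets stamped with the same numeric payload.
    return new NumericPayloadTokenFilter(ts, payload, "word");
  }
}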

Any thoughts/comments/suggestions?

Bill


On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll gsing...@apache.org wrote:


 On Aug 11, 2009, at 5:30 PM, Bill Au wrote:

  It looks like things have changed a bit since this subject was last
 brought
 up here.  I see that there is support in Solr/Lucene for indexing payload
 data (DelimitedPayloadTokenFilterFactory and DelimitedPayloadTokenFilter).
 Overriding the Similarity class is straight forward.  So the last piece of
 the puzzle is to use a BoostingTermQuery when searching.  I think all I
 need
 to do is to subclass Solr's LuceneQParserPlugin, which uses SolrQueryParser under
 the cover.  I think all I need to do is to write my own query parser
 plugin
 that uses a custom query parser, with the only difference being in the
 getFieldQuery() method where a BoostingTermQuery is used instead of a
 TermQuery.


 The BTQ is now deprecated in favor of the BoostingFunctionTermQuery, which
 gives some more flexibility in terms of how the spans in a single document
 are scored.


 Am I on the right track?


 Yes.

  Has anyone done something like this already?


 I intend to, but haven't started.

  Since Solr already has indexing support for payload, I was hoping that
 query
 support is already in the works if not available already.  If not, I am
 willing to contribute but will probably need some guidance since my
 knowledge in Solr query parser is weak.



 https://issues.apache.org/jira/browse/SOLR-1337



Re: Using Lucene's payload in Solr

2009-08-13 Thread Bill Au
I need to boost a field differently according to the content of the field.
Here is an example:

<doc>
  <field name="name">Solr</field>
  <field name="category" payload="3.0">information retrieval</field>
  <field name="category" payload="2.0">webapp</field>
  <field name="category" payload="2.0">java</field>
  <field name="category" payload="1.0">xml</field>
</doc>
<doc>
  <field name="name">Tomcat</field>
  <field name="category" payload="3.0">webapp</field>
  <field name="category" payload="2.0">java</field>
</doc>
<doc>
  <field name="name">XMLSpy</field>
  <field name="category" payload="3.0">xml</field>
  <field name="category" payload="2.0">ide</field>
</doc>

A search on category:webapp should return Tomcat before Solr.  A search on
category:xml should return XMLSpy before Solr.
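
On the scoring side, BoostingTermQuery/BFTQ hands each matching term's payload
to Similarity.scorePayload(), so that is where the 3.0 / 2.0 / 1.0 values above
turn into boosts.  A minimal sketch (the scorePayload signature shown is the
2.4-style one; newer nightlies pass extra arguments, so match whatever your
build's Similarity declares):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class CategoryPayloadSimilarity extends DefaultSimilarity {
  @Override
  public float scorePayload(String fieldName, byte[] payload, int offset, int length) {
    if (payload == null || length == 0) {
      return 1.0f;              // no payload, no extra boost
    }
    // float-encoded payloads decode back to the 3.0 / 2.0 / 1.0 values above
    return PayloadHelper.decodeFloat(payload, offset);
  }
}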

Bill

On Thu, Aug 13, 2009 at 12:13 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Aug 13, 2009, at 11:58 AM, Bill Au wrote:

  Thanks for the tip on BFTQ.  I have been using a nightly build before that
 was committed.  I have upgraded to the latest nightly build and will use
 that
 instead of BTQ.

 I got DelimitedPayloadTokenFilter to work and see that the terms and
 payload
 of the field are correct but the delimiter and payload are stored so they
 appear in the response also.  Here is an example:

 XML for indexing:
 <field name="title">Solr|2.0 In|2.0 Action|2.0</field>


 XML response:
 <doc>
 <str name="title">Solr|2.0 In|2.0 Action|2.0</str>
 </doc>



 Correct.



  I want to set payload on a field that has a variable number of words.
  So I
 guess I can use a copy field with a PatternTokenizerFactory to filter out
 the delimiter and payload.

 I am thinking maybe I can do this instead when indexing:

 XML for indexing:
 <field name="title" payload="2.0">Solr In Action</field>


 Hmmm, interesting, what's your motivation vs. boosting the field?




 This will simplify indexing as I don't have to repeat the payload for each
 word in the field.  I do have to write a payload-aware update handler.  It
 looks like I can use Lucene's NumericPayloadTokenFilter in my custom
 update
 handler to set the payload on all the tokens of the field.

 Any thoughts/comments/suggestions?



  Bill


 On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll gsing...@apache.org
 wrote:


 On Aug 11, 2009, at 5:30 PM, Bill Au wrote:

 It looks like things have changed a bit since this subject was last

 brought
 up here.  I see that there is support in Solr/Lucene for indexing
 payload
 data (DelimitedPayloadTokenFilterFactory and
 DelimitedPayloadTokenFilter).
 Overriding the Similarity class is straight forward.  So the last piece
 of
 the puzzle is to use a BoostingTermQuery when searching.  I think all I
 need
 to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser
 under
 the cover.  I think all I need to do is to write my own query parser
 plugin
 that uses a custom query parser, with the only difference being in the
 getFieldQuery() method where a BoostingTermQuery is used instead of a
 TermQuery.


 The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
 which
 gives some more flexibility in terms of how the spans in a single
 document
 are scored.


  Am I on the right track?


 Yes.

 Has anyone done something like this already?



 I intend to, but haven't started.

 Since Solr already has indexing support for payload, I was hoping that

 query
 support is already in the works if not available already.  If not, I am
 willing to contribute but will probably need some guidance since my
 knowledge in Solr query parser is weak.



 https://issues.apache.org/jira/browse/SOLR-1337


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search



