Re: solr-duplicate post management

2009-01-22 Thread S.Selvam Siva
On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : what i need is to log the existing urlid and the new urlid (of course both
 : will not be the same), when a .xml file with the same id (unique field) is
 : posted.
 :
 : I want to make this by modifying the Solr source. Which file do I need to
 : modify so that I could get the above details in the log?
 :
 : I tried with DirectUpdateHandler2.java (which removes the duplicate
 : entries), but my efforts were in vain.

 DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's
 IndexWriter.updateDocument method when you have a uniqueKey and you aren't
 allowing duplicates -- this method doesn't give you any way to access the
 old document(s) that had that existing key.

 The easiest way to make a change like what you are interested in might be
 an UpdateProcessor that does a lookup/search for the uniqueKey of each
 document about to be added to see if it already exists.  that's probably
 about as efficient as you can get, and would be nicely encapsulated.

 You might also want to take a look at SOLR-799, where some work is being
 done to create UpdateProcessors that can do near duplicate detection...

 http://wiki.apache.org/solr/Deduplication
 https://issues.apache.org/jira/browse/SOLR-799
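
A rough sketch of the kind of UpdateProcessor described above. The class name,
package, and log wording are invented, and it assumes a simple string uniqueKey
plus a stored "urlid" field (the field mentioned in the original question), so
treat it as a starting point rather than tested code:

package com.example.solr;

import java.io.IOException;
import java.util.logging.Logger;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class LogDuplicateKeyProcessorFactory extends UpdateRequestProcessorFactory {

  private static final Logger log =
      Logger.getLogger(LogDuplicateKeyProcessorFactory.class.getName());

  @Override
  public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SchemaField keyField = req.getSchema().getUniqueKeyField();
        SolrInputDocument doc = cmd.solrDoc;
        Object newKey = (keyField == null || doc == null)
            ? null : doc.getFieldValue(keyField.getName());
        if (newKey != null) {
          // Look up any document already indexed under the same uniqueKey.
          // Assumes the indexed and stored forms of the key are identical.
          SolrIndexSearcher searcher = req.getSearcher();
          int docId = searcher.getFirstMatch(
              new Term(keyField.getName(), newKey.toString()));
          if (docId != -1) {
            Document old = searcher.doc(docId);
            // "urlid" must be a stored field for old.get() to return it.
            log.info("uniqueKey " + newKey + " already indexed; existing urlid="
                + old.get("urlid") + ", incoming urlid=" + doc.getFieldValue("urlid"));
          }
        }
        super.processAdd(cmd);  // hand off to the rest of the chain (the actual add)
      }
    };
  }
}

The factory would still need to be wired into an updateRequestProcessorChain in
solrconfig.xml so that it runs before the add is executed.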






 -Hoss


Thank you for your response. I will try it out.



-- 
Yours,
S.Selvam


Intermittent high response times

2009-01-22 Thread hbi dev
Hi all,
I have an implementation of Solr (rev. 708837) running on Tomcat 6.

Approx 600,000 docs, 2 fairly content heavy text fields, between 4 and 7
facets (depending on what our front end is requesting, and mostly low unique
values)

1GB of memory allocated, generally I do not see it using all of that up.

For the most part my response times are under 200ms, but I randomly get
times that are around 100,000ms!

Original load testing didn't reveal this. I can see from the logs that we are
getting approx 20 requests per second, so it's not really under much load at
the moment.

Does anyone have any pointers that I can follow or look into?
Please ask if I need to provide any more info.

Thanks in advance

Regards,
Waseem


Re: Intermittent high response times

2009-01-22 Thread Otis Gospodnetic
Hi,

Is there anything special about those queries?  e.g. lots of terms, frequent 
terms, something else?  Is there anything else happening on that server when 
you see such long queries?  Do you see lots of IO or lots of CPU being used 
during those times?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: hbi dev hbi...@googlemail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, January 22, 2009 6:39:39 AM
 Subject: Intermittent high response times
 
 Hi all,
 I have an implmentation of solr (rev.708837) running on tomcat 6.
 
 Approx 600,000 docs, 2 fairly content heavy text fields, between 4 and 7
 facets (depending on what our front end is requesting, and mostly low unique
 values)
 
 1GB of memory allocated, generally I do not see it using all of that up.
 
 For the most part my response times are under 200ms, but I randomly get
 times that are around 100,000ms!
 
 Original load testing didn't reveal this, I can see from the logs we are
 getting approx 20 requests per second so it's not really under much load at
 the moment.
 
 Does anyone have any pointers that I can follow or look into?
 Please ask if I need to provide any more info.
 
 Thanks in advance
 
 Regards,
 Waseem



Re: Intermittent high response times

2009-01-22 Thread hbi dev
Hi,
The criteria rarely vary from those of queries that are much quicker; maybe only
the start row differs. Most of the time the main terms are a single word
or just a blank query (q.alt=*:*).
My request handler does have a lot of predefined filters; this is included
below. Most of this is auto-warmed.
The server also does updates via the DataImportHandler every 5 minutes.
Optimisation is only performed once a day at approximately midnight. These
high response times can happen at any time of day, mostly out of working
hours, which is also when we have the least number of updates + search
traffic.

In terms of CPU and IO usage, as mentioned above they are mostly out of
hours, so I will see if our server admins have set up some SNMP tools to
provide reports for me. Looking at the server right now I can see between 2
and 10% CPU usage.


Here is a small extract from the log:

21-Jan-2009 19:45:39 org.apache.solr.core.SolrCore execute
INFO: [news] webapp=/solr path=/select
params={rows=10&start=40&sort=score+desc,+newsArticleDate_Date+desc,+newsCalculatedImportance+desc&fq=newsArticleDate_Year:1995&hl=true&qt=BR2News}
hits=1106 status=0 QTime=31
21-Jan-2009 19:45:39 org.apache.solr.core.SolrCore execute
INFO: [news] webapp=/solr path=/select
params={rows=10&start=100&sort=score+desc,+newsArticleDate_Date+desc,+newsCalculatedImportance+desc&fq=newsArticleDate_Year:1996&hl=true&qt=BR2News}
hits=8345 status=0 QTime=119234


The request handler is:

<requestHandler name="BR2News" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="hl">false</str>
    <str name="sort">score desc, newsArticleDate_Date desc,
      newsCalculatedImportance desc</str>
    <str name="f.newsArticleDate_Year.facet.sort">false</str>
    <str name="f.newsArticleDate_Month.facet.sort">false</str>
    <str name="f.newsArticleDate_Day.facet.sort">false</str>
    <str name="facet.mincount">1</str>
    <str name="mlt">false</str>
    <str name="wt">xslt</str>
    <str name="tr">newsResults.xsl</str>
  </lst>
  <lst name="appends">
    <str name="fq">news_magJournalCode:BR2 OR news_magJournalCode:CAM OR
      news_magJournalCode:CEI OR news_magJournalCode:CIT OR
      news_magJournalCode:DRN OR news_magJournalCode:EVE OR
      news_magJournalCode:MKT OR news_magJournalCode:MXD OR
      news_magJournalCode:PRA OR news_magJournalCode:PRI OR
      news_magJournalCode:PRS OR news_magJournalCode:PRW OR
      news_magJournalCode:REV OR news_magJournalCode:RSV OR
      news_magJournalCode:WWP OR news_magJournalCode:XMB OR
      news_magJournalCode:XMW OR news_magJournalCode:XX6</str>
    <str name="fq">newsStatus:true</str>
    <str name="fq">newsPublishedDate_Date:[* TO NOW/DAY]</str>
    <str name="fq">newsArticleDate_Date:[* TO NOW/DAY]</str>
  </lst>
  <lst name="invariants">
    <!-- mm=1 ONLY TOUCH THIS IF YOU REALLY KNOW WHAT YOU ARE DOING! -->
    <str name="mm">1</str>
    <!-- mm=1 ONLY TOUCH THIS IF YOU REALLY KNOW WHAT YOU ARE DOING! -->
    <str name="hl.simple.pre"><![CDATA[<span class="hiLite">]]></str>
    <str name="hl.simple.post"><![CDATA[</span>]]></str>
    <str name="f.newsBody.hl.snippets">3</str>
    <str name="f.newsBody.hl.mergeContiguous">true</str>
    <str name="q.alt">*:*</str>
    <str name="echoParams">all</str>
    <float name="tie">0.01</float>
    <str name="fl">newsID,newsTitle,newsTitleAlternate,newsAuthor,news_magJournalCode,news_newsTypeID,newsStatus,magName,newsSeoURLTitle,newsSummary,newsBody,newsSummaryAlternate,newsArticleDate_DateTime,newsDateAdded_DateTime,newsAuthor,score</str>
    <str name="version">2.2</str>
    <str name="qf">newsTitle newsSummary^0.75 newsBody^0.5 newsAuthor^0.1</str>
    <!-- no exact author field as this is already indexed appropriately -->
    <str name="pf">newsTitleExact newsSummaryExact newsBodyExact newsAuthor</str>
    <str name="ps">1</str>
    <str name="hl.fl">newsTitle newsBody newsSummary newsAuthor</str>
    <str name="mlt.fl">newsTitle newsBody newsSummary newsAuthor</str>
    <str name="facet">true</str>
    <str name="facet.field">news_magJournalCode_FacetDetails</str>
    <str name="facet.field">news_newsTypeID_FacetDetails</str>
    <str name="facet.field">sector_FacetDetails</str>
    <str name="facet.field">discipline_FacetDetails</str>
    <str name="facet.field">asset_FacetDetails</str>
    <!-- get facet for today -->
    <str name="facet.query">newsArticleDate_Date:[NOW/DAY TO NOW/DAY]</str>
    <!-- get facet for lastweek (last 7 days) -->
    <str name="facet.query">newsArticleDate_Date:[NOW/DAY-7DAYS TO NOW/DAY]</str>
    <!-- get facet for lastmonth -->
    <str name="facet.query">newsArticleDate_Date:[NOW/DAY-1MONTH TO NOW/DAY]</str>
    <str name="facet.field">newsArticleDate_Year</str>
    <str name="facet.field">newsArticleDate_Month</str>
    <str name="facet.field">newsArticleDate_Day</str>
    <!-- get facet for today -->
    <str name="facet.query">newsDateAdded_Date:[NOW/DAY TO NOW/DAY]</str>
    <!-- get facet for lastweek (last 7 days) -->
    <str name="facet.query">newsDateAdded_Date:[NOW/DAY-7DAYS TO NOW/DAY]</str>
    <!-- get facet for lastmonth -->
    <str name="facet.query">newsDateAdded_Date:[NOW/DAY-1MONTH TO NOW/DAY]</str>
    <str name="facet.field">newsDateAdded_Year</str>
    <str name="facet.field">newsDateAdded_Month</str>
    <str name="facet.field">newsDateAdded_Day</str>
  </lst>
</requestHandler>

On Thu, Jan 22, 2009 at 2:23 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hi,

 Is there anything special about those queries?  e.g. lots 

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Jaco
Hm, I don't know what to do anymore. I tried this:
- Run Tomcat service as local administrator to overcome any permissioning
issues
- Installed the latest nightly build (I noticed that the item I mentioned before,
http://markmail.org/message/yq2ram4f3jblermd, had been committed, which is
good)
- Built a small master and slave core to try it all out
- With each replication, the number of files on slave grows, and the
directories index.xxx.. are not removed
- I tried sending explicit commit commands to the slave, assuming it
wouldn't help, which was true.
- I don't see any reference to SolrDeletion in the log of the slave (it's
there in the log of the master)

Can anybody recommend some action to be taken? I'm building up some quite
large production cores right now, and don't want the slaves to eat up all
hard disk space, of course.

Thanks a lot in advance,

Jaco.

2009/1/21 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@gmail.com

 On Wed, Jan 21, 2009 at 3:42 PM, Jaco jdevr...@gmail.com wrote:
  Thanks for the fast replies!
 
  It appears that I made a (probably classical) error... I didn't make the
  change to solrconfig.xml to include the deletionPolicy when applying
 the
  upgrade. I include this now, but the slave is not cleaning up. Will this
 be
  done at some point automatically? Can I trigger this?
 Unfortunately, no.
 Lucene is supposed to clean up these old commit points automatically
 after each commit. Even if the deletionPolicy is not specified, the
 default is supposed to take effect.
 
  User access rights for the user are OK; this user is allowed to do
 anything
  in the Solr data directory (Tomcat service is running from SYSTEM account
  (Windows)).
 
  Thanks, regards,
 
  Jaco.
 
 
  2009/1/21 Shalin Shekhar Mangar shalinman...@gmail.com
 
  Hi,
 
  There shouldn't be so many files on the slave. Since the empty
 index.x
  folders are not getting deleted, is it possible that the Solr process user
 does
  not have enough privileges to delete files/folders?
 
  Also, have you made any changes to the IndexDeletionPolicy
 configuration?
 
  On Wed, Jan 21, 2009 at 2:15 PM, Jaco jdevr...@gmail.com wrote:
 
   Hi,
  
   I'm running Solr nightly build of 20.12.2008, with patch as discussed
 on
   http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.
  
   On various systems running, I see that the disk space consumed on the
  slave
   is much higher than on the master. One example:
   - Master: 30 GB in 138 files
   - Slave: 152 GB in 3,941 files
  
   Can anybody tell me what to do to prevent this from happening, and how
 to
   clean up the slave? Also, there are quite some empty index.xxx
   directories sitting in the slaves data dir. Can these be safely
 removed?
  
   Thanks a lot in advance, bye,
  
   Jaco.
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 



 --
 --Noble Paul



Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Shalin Shekhar Mangar
On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:

 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)

 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..


How frequently do you optimize your index? Does the number of files decrease
after an optimize?

Can you execute the indexversion command:
/replication?command=indexversion
and then issue the following command with the returned index version:
/replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND
The above will give the list of files being used by that commit point.

Can you compare the list of files given by the above command and with the
files you see in the solr/data/index directory? How many are extra?
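
For the local half of that comparison, a throwaway listing such as the one below
prints the file names and sizes in the slave's index directory (the default
"solr/data/index" path is only an example; pass the real location as an argument):

import java.io.File;

public class ListIndexFiles {
  public static void main(String[] args) {
    // Point this at the slave's data/index directory.
    File indexDir = new File(args.length > 0 ? args[0] : "solr/data/index");
    File[] files = indexDir.listFiles();
    if (files == null) {
      System.err.println("Not a directory: " + indexDir.getAbsolutePath());
      return;
    }
    long total = 0;
    for (File f : files) {
      System.out.println(f.getName() + "\t" + f.length());
      total += f.length();
    }
    System.out.println(files.length + " files, " + total + " bytes total");
  }
}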

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Jeff Newburn
We are seeing something very similar.  Ours is intermittent and usually
happens a great deal on random days. Often it seems to occur during large
index updates on the master.


On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:
 
 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)
 
 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..
 
 
 How frequently do you optimize your index? Does the number of files decrease
 after an optimize?
 
 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:
 /replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND
 The above will give the list of files being used by that commit point.
 
 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?



Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Shalin Shekhar Mangar
On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:

 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.


Jeff, is this also on a Windows box?

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Jeff Newburn
My apologies.  No, we are using a Linux, Tomcat setup.


On 1/22/09 9:15 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.
 
 
 Jeff, is this also on a Windows box?



Re: Intermittent high response times

2009-01-22 Thread wojtekpia

I'm experiencing similar issues. Mine seem to be related to old generation
garbage collection. Can you monitor your garbage collection activity? (I'm
using JConsole to monitor it:
http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html). 

In my system, garbage collection usually doesn't cause any trouble. But once
in a while, the size of the old generation flat-lines for some time (~dozens
of seconds). When this happens, I see really bad response times from Solr
(not quite as bad as you're seeing, but almost). The old-gen flat-lines
always seem to be right before, or right after the old-gen is garbage
collected.
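
For reference, the numbers JConsole graphs can also be read in code through the
standard java.lang.management API. The sketch below only reports on the JVM it
runs in, so to watch Solr's collectors it would have to run inside the servlet
container (or stick with JConsole/remote JMX):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher implements Runnable {
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
        // Cumulative collection count and total time spent collecting, in ms.
        System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
            + " totalTimeMs=" + gc.getCollectionTime());
      }
      try {
        Thread.sleep(5000);  // sample every 5 seconds
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) {
    new GcWatcher().run();
  }
}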
-- 
View this message in context: 
http://www.nabble.com/Intermittent-high-response-times-tp21602475p21608986.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
Jeff ,
Do you see both the empty index. dirs as well as the extra files
in the index?
--Noble

On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.


 On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:

 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)

 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..


 How frequently do you optimize your index? Does the number of files decrease
 after an optimize?

 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:
 /replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND
 The above will give the list of files being used by that commit point.

 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?





-- 
--Noble Paul


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Jeff Newburn
We have both.  A majority of them are just empty but others have almost a
full index worth of files.  I have also noticed that during a lengthy index
update the system will throw errors about how it cannot move one of the
index files.  Essentially on reindex the system does not replicate until an
optimize is done, which changes all of the file names, allowing the file error
to go away.

Jan 22, 2009 10:17:15 AM org.apache.solr.handler.SnapPuller copyAFile
SEVERE: Unable to move index file from: /data/index.20090122101604/_8n.tvx
to: /data/index/_8n.tvx



On 1/22/09 10:23 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
wrote:

 Jeff ,
 Do you see both the empty index. dirs as well as the extra files
 in the index?
 --Noble
 
 On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.
 
 
 On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
 
 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:
 
 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)
 
 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..
 
 
 How frequently do you optimize your index? Does the number of files decrease
 after an optimize?
 
 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:
 /replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND
 The above will give the list of files being used by that commit point.
 
 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?
 
 
 
 



Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
This was reported by another user and was fixed recently. Are you using
a recent version?
--Noble

On Fri, Jan 23, 2009 at 12:00 AM, Jeff Newburn jnewb...@zappos.com wrote:
 We have both.  A majority of them are just empty but others have almost a
 full index worth of files.  I have also noticed that during a lengthy index
 update the system will throw errors about how it cannot move one of the
 index files.  Essentially on reindex the system does not replicate until an
 optimize is done which changes all of the file names allowing the file error
 go away.

 Jan 22, 2009 10:17:15 AM org.apache.solr.handler.SnapPuller copyAFile
 SEVERE: Unable to move index file from: /data/index.20090122101604/_8n.tvx
 to: /data/index/_8n.tvx



 On 1/22/09 10:23 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
 wrote:

 Jeff ,
 Do you see both the empty index. dirs as well as the extra files
 in the index?
 --Noble

 On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.


 On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:

 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)

 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..


 How frequently do you optimize your index? Does the number of files 
 decrease
 after an optimize?

 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:
 /replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND
 The above will give the list of files being used by that commit point.

 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?









-- 
--Noble Paul


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Jeff Newburn
Our version is from a few weeks ago.  Does this contribute to the directory issues
and extra files that are left?


On 1/22/09 10:33 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
wrote:

 This was reported by another user and was fixed recently.Are you using
 a recent version?
 --Noble
 
 On Fri, Jan 23, 2009 at 12:00 AM, Jeff Newburn jnewb...@zappos.com wrote:
 We have both.  A majority of them are just empty but others have almost a
 full index worth of files.  I have also noticed that during a lengthy index
 update the system will throw errors about how it cannot move one of the
 index files.  Essentially on reindex the system does not replicate until an
 optimize is done which changes all of the file names allowing the file error
 go away.
 
 Jan 22, 2009 10:17:15 AM org.apache.solr.handler.SnapPuller copyAFile
 SEVERE: Unable to move index file from: /data/index.20090122101604/_8n.tvx
 to: /data/index/_8n.tvx
 
 
 
 On 1/22/09 10:23 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
 wrote:
 
 Jeff ,
 Do you see both the empty index. dirs as well as the extra files
 in the index?
 --Noble
 
 On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.
 
 
 On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
 
 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:
 
 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before
 (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave (it's
 there in the log of the master)
 
 Can anybody recommend some action to be taken? I'm building up some quite
 large production cores right now, and don't want the slaves to eat up all
 hard disk space of course..
 
 
 How frequently do you optimize your index? Does the number of files
 decrease
 after an optimize?
 
 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:
 
/replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND

 The above will give the list of files being used by that commit point.
 
 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?
 
 
 
 
 
 
 
 



Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
I am not sure if it was completely fixed (this was related to a Lucene bug),
but you can try w/ a recent build and confirm it for us.
I have never encountered these during our tests on Windows XP/Linux.

I have attached a patch which logs the names of the files which could
not get deleted (which may help us diagnose the problem). If you are
comfortable applying a patch you may try it out.

--Noble

On Fri, Jan 23, 2009 at 12:05 AM, Jeff Newburn jnewb...@zappos.com wrote:
 Few weeks ago is our version.  Does this contribute to the directory issues
 and extra files that are left?


 On 1/22/09 10:33 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
 wrote:

 This was reported by another user and was fixed recently.Are you using
 a recent version?
 --Noble

 On Fri, Jan 23, 2009 at 12:00 AM, Jeff Newburn jnewb...@zappos.com wrote:
 We have both.  A majority of them are just empty but others have almost a
 full index worth of files.  I have also noticed that during a lengthy index
 update the system will throw errors about how it cannot move one of the
 index files.  Essentially on reindex the system does not replicate until an
 optimize is done which changes all of the file names allowing the file error
 go away.

 Jan 22, 2009 10:17:15 AM org.apache.solr.handler.SnapPuller copyAFile
 SEVERE: Unable to move index file from: /data/index.20090122101604/_8n.tvx
 to: /data/index/_8n.tvx



 On 1/22/09 10:23 AM, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
 wrote:

 Jeff ,
 Do you see both the empty index. dirs as well as the extra files
 in the index?
 --Noble

 On Thu, Jan 22, 2009 at 10:37 PM, Jeff Newburn jnewb...@zappos.com wrote:
 We are seeing something very similar.  Ours is intermittent and usually
 happens a great deal on random days. Often it seems to occur during large
 index updates on the master.


 On 1/22/09 8:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com 
 wrote:

 On Thu, Jan 22, 2009 at 10:18 PM, Jaco jdevr...@gmail.com wrote:

 Hm, I don't know what to do anymore. I tried this:
 - Run Tomcat service as local administrator to overcome any 
 permissioning
 issues
 - Installed latest nightly build (I noticed that item I mentioned before
 (
 http://markmail.org/message/yq2ram4f3jblermd) had been committed which 
 is
 good
 - Build a small master and slave core to try it all out
 - With each replication, the number of files on slave grows, and the
 directories index.xxx.. are not removed
 - I tried sending explicit commit commands to the slave, assuming it
 wouldn't help, which was true.
 - I don't see any reference to SolrDeletion in the log of the slave 
 (it's
 there in the log of the master)

 Can anybody recommend some action to be taken? I'm building up some 
 quite
 large production cores right now, and don't want the slaves to eat up 
 all
 hard disk space of course..


 How frequently do you optimize your index? Does the number of files
 decrease
 after an optimize?

 Can you execute the indexversion command:
 /replication?command=indexversion
 and then issue the following command with the returned index version:

  /replication?command=filelist&indexversion=INDEX_VERSION_BY_FIRST_COMMAND

 The above will give the list of files being used by that commit point.

 Can you compare the list of files given by the above command and with the
 files you see in the solr/data/index directory? How many are extra?













-- 
--Noble Paul
Index: src/java/org/apache/solr/handler/SnapPuller.java
===
--- src/java/org/apache/solr/handler/SnapPuller.java	(revision 736216)
+++ src/java/org/apache/solr/handler/SnapPuller.java	(working copy)
@@ -587,21 +587,27 @@
   static boolean delTree(File dir) {
 if (dir == null || !dir.exists())
   return false;
+boolean isSuccess  = true;
 File contents[] = dir.listFiles();
 if (contents != null) {
   for (File file : contents) {
 if (file.isDirectory()) {
   boolean success = delTree(file);
-  if (!success)
-return false;
+  if (!success){
+LOG.error("Unable to delete directory : " + file);
+isSuccess = false;
+  }
 } else {
   boolean success = file.delete();
-  if (!success)
+  if (!success){
+LOG.error("Unable to delete file : " + file);
+isSuccess = false;
 return false;
+  }
 }
   }
 }
-return dir.delete();
+return isSuccess && dir.delete();
   }
 
   /**
@@ -853,6 +859,7 @@
 //close the file
 fileChannel.close();
   } catch (Exception e) {/* noop */
+  LOG.error("Error closing the file stream: " + this.saveAs, e);
   }
   try {
 post.releaseConnection();


Re: Random queries extremely slow

2009-01-22 Thread oleg_gnatovskiy

My apologies, this is likely the same issue as "Intermittent high response
times" by hbi dev.



oleg_gnatovskiy wrote:
 
 Hello. Our production servers are operating relatively smoothly most of
 the time running Solr with 19 million listings. However every once in a
 while the same query that used to take 100 miliseconds takes 6000. This
 causes our health check to fail, and the server is taken out of service.
 Once the server is put back in service, queries are back to their regular
 response times. Is there anything we could do to stop this random slowness
 from occurring? 
 

-- 
View this message in context: 
http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21610660.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Newbie Design Questions

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
You are out of luck if you are not using a recent version of DIH

The sub entity will work only if you use the FieldReaderDataSource.
Then you do not need a ClobTransformer either.

The trunk version of DIH can be used w/ Solr 1.3 release

On Thu, Jan 22, 2009 at 12:59 PM, Gunaranjan Chandraraju
chandrar...@apple.com wrote:
 Hi

 Yes, the XML is inside the DB in a clob. Would love to use XPath inside
 SQLEntityProcessor as it will save me tons of trouble for file-dumping
 (given that I am not able to post it).  This is how I set up my DIH for DB
 import.

 <dataConfig>
   <dataSource type="JdbcDataSource" name="data-source-1"
               driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@X"
               user="abc" password="***" batchSize="100"/>
   <document>
     <entity dataSource="data-source-1"
             name="item" processor="SqlEntityProcessor"
             pk="ID"
             stream="false"
             rootEntity="false"
             transformer="ClobTransformer"  <!-- custom clob transformer I
             saw and not the one from 1.4.   -->
             query="select xml_col from xml_table where xml_col IS NOT NULL">
             <!-- horrible query I need to work on making it better -->

       <entity
           dataSource="null"  <!-- this is my problem - if I don't give a
           name here it complains, if I put in null then the code seems to fail with a
           null pointer -->
           name="record"
           processor="XPathEntityProcessor"
           stream="false"
           url="${item.xml_col}"
           forEach="/record">

         <field column="ID" xpath="/record/coreinfo/@a" />
         <field column="type" xpath="/record/coreinfo/@b" />
         <field column="streetname" xpath="/record/address/@c" />

         .. and so on
       </entity>


     </entity>
   </document>
 </dataConfig>


 The problem with this is that it always fails with this error.  I can see
 that the earlier SQL entity extraction and clob transformation are working, as
 the values show in the debug jsp (verbose mode with dataimport.jsp).
  However, no records are extracted for the entity.  When I check the catalina.out
 file, it shows me the following errors for the entity name="record" (the XPath
 entity on top).

 java.lang.NullPointerException at
 org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85).

 I don't have the whole stack trace right now.  If you need it I would be
 happy to recreate and post it.

 Regards,
 Guna

 On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
 chandrar...@apple.com wrote:

 Thanks

 Yes, the source of data is a DB.  However, the xml is also posted on
 updates
 via a publish framework.  So I can just plug in an adapter here to listen
 for
 changes and post to SOLR.  I was trying to use the XPathProcessor inside
 the
 SQLEntityProcessor and this did not work (using 1.3 - I did see support
 in
 1.4).  That is not a show stopper for me and I can just post them via the
 framework and use files for the first time load.

 XPathEntityProcessor works inside SqlEntityProcessor only if a db
 field contains xml.

 However, you can have a separate entity (at the root) to read from the db
 for delta.
 Anyway if your current solution works stick to it.

 I have seen a couple of answers on the backup for crash scenarios.  Just
 wanted to confirm - if I replace the index with the backed-up files then
 I
 can simply start up Solr again and reindex the documents changed
 since
 the last backup? Am I right?  The slaves will also automatically adjust to
 this.

 Yes, you can replace an archived index and Solr should work just fine,
 but the docs added since the last snapshot was taken will be missing
 (of course :) )

 THanks
 Guna


 On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
 chandrar...@apple.com wrote:

 Hi All
 We are considering SOLR for a large database of XMLs.  I have some
 newbie
 questions - if there is a place I can go read about them do let me know
 and
 I will go read up :)

 1. Currently we are able to pull the XMLs from a file systems using
 FileDataSource.  The DIH is convenient since I can map my XML fields
 using
  the XPathProcessor. This works for an initial load. However, after
 the
 initial load, we would like to 'post' changed xmls to SOLR whenever the
 XML
 is updated in a separate system.  I know we can post xmls with 'add'
 however
 I was not sure how to do this and maintain the DIH mapping I use in
 data-config.xml?  I don't want to save the file to the disk and then
 call
 the DIH - would prefer to directly post it.  Do I need to use solrj for
 this?

 What is the source of your new data? is it a DB?


 2.  If my solr schema.xml changes then do I HAVE to reindex all the old
 documents?  Suppose in future we have newer XML documents that contain
 a
 new
  additional xml field. The old documents that are already indexed
  don't
  have this field and (so) I don't need to search on them with this field.
 

Re: Incorrect Scoring

2009-01-22 Thread Yonik Seeley
DisjunctionMax takes the max score of a disjunction... and max across
all fields was slightly higher for the first match.

Try setting tie higher  (add tie=0.2 to your query or to the
defaults in your request handler).
http://wiki.apache.org/solr/DisMaxRequestHandler
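
To see why tie matters, here is a toy calculation of the "max plus tie times
others" combination that shows up in the debug output quoted below; the per-field
scores are invented, only the formula is DisMax's:

public class TieDemo {
  // DisMax per-clause score: best field score plus tie * the remaining field scores.
  static double dismax(double tie, double[] fieldScores) {
    double max = 0.0, sum = 0.0;
    for (double s : fieldScores) {
      sum += s;
      if (s > max) max = s;
    }
    return max + tie * (sum - max);
  }

  public static void main(String[] args) {
    // Doc A: one field matches strongly, nothing else matches.
    double[] docA = {0.26, 0.0, 0.0};
    // Doc B: slightly weaker best field, but several other fields also match.
    double[] docB = {0.21, 0.15, 0.12};

    for (double tie : new double[] {0.01, 0.2}) {
      System.out.printf("tie=%.2f  docA=%.4f  docB=%.4f%n",
          tie, dismax(tie, docA), dismax(tie, docB));
    }
  }
}

With tie=0.01 the single strong field wins; with tie=0.2 the document that matches
in several fields overtakes it.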

-Yonik

On Wed, Jan 21, 2009 at 1:18 PM, Jeff Newburn jnewb...@zappos.com wrote:
 Can someone please make sense of why the following occurs in our system.
 The first item barely matches but scores higher than the second one that
 matches all over the place.  The second one is a MUCH better match but has a
 worse score. These are in the same query results.  All I can see are the
 norms but don¹t know how to fix that.

 Parsed Query Info
  <str name="parsedquery">+((DisjunctionMaxQuery((realBrandName:brown |
 subCategory:brown^20.0 | productDescription:brown | width:brown |
 personality:brown^10.0 | brandName:brown | productType:brown^8.0 |
 productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown |
 productNameSearch:brown | heelHeight:brown | color:brown^10.0 |
 attrs:brown^5.0 | expandedGender:brown^0.5)~0.01)
 DisjunctionMaxQuery((realBrandName:shoe | subCategory:shoe^20.0 |
 productDescription:shoe | width:shoes | personality:shoe^10.0 |
 brandName:shoe | productType:shoe^8.0 | productId:shoes^10.0 |
 size:shoes^1.2 | category:shoe^10.0 | price:shoes | productNameSearch:shoe |
 heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0 |
 expandedGender:shoes^0.5)~0.01))~2)
 DisjunctionMaxQuery((realBrandName:"brown shoe"~1^10.0 | category:"brown
 shoe"~1^10.0 | productNameSearch:"brown shoe"~1 | productDescription:"brown
 shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 | personality:"brown
 shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 | productType:"brown
 shoe"~1^8.0)~0.01)</str>
  <str name="parsedquery_toString">+(((realBrandName:brown |
 subCategory:brown^20.0 | productDescription:brown | width:brown |
 personality:brown^10.0 | brandName:brown | productType:brown^8.0 |
 productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown |
 productNameSearch:brown | heelHeight:brown | color:brown^10.0 |
 attrs:brown^5.0 | expandedGender:brown^0.5)~0.01 (realBrandName:shoe |
 subCategory:shoe^20.0 | productDescription:shoe | width:shoes |
 personality:shoe^10.0 | brandName:shoe | productType:shoe^8.0 |
 productId:shoes^10.0 | size:shoes^1.2 | category:shoe^10.0 | price:shoes |
 productNameSearch:shoe | heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0
 | expandedGender:shoes^0.5)~0.01)~2) (realBrandName:"brown shoe"~1^10.0 |
 category:"brown shoe"~1^10.0 | productNameSearch:"brown shoe"~1 |
 productDescription:"brown shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 |
 personality:"brown shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 |
 productType:"brown shoe"~1^8.0)~0.01</str>


 DebugQuery Info

  <str name="38959">
 0.45851633 = (MATCH) sum of:
  0.45851633 = (MATCH) sum of:
0.19769925 = (MATCH) max plus 0.01 times others of:
  0.19769925 = (MATCH) weight(color:brown^10.0 in 1407), product of:
0.06819186 = queryWeight(color:brown^10.0), product of:
  10.0 = boost
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  0.0023521234 = queryNorm
2.8991618 = (MATCH) fieldWeight(color:brown in 1407), product of:
  1.0 = tf(termFreq(color:brown)=1)
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  1.0 = fieldNorm(field=color, doc=1407)
0.26081708 = (MATCH) max plus 0.01 times others of:
  0.26081708 = (MATCH) weight(subCategory:shoe^20.0 in 1407), product
 of:
0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
  20.0 = boost
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.0023521234 = queryNorm
1.8614997 = (MATCH) fieldWeight(subCategory:shoe in 1407), product
 of:
  1.0 = tf(termFreq(subCategory:shoe)=1)
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.625 = fieldNorm(field=subCategory, doc=1407)

 </str>
  <str name="692583">
 0.4086538 = (MATCH) sum of:
  0.4086538 = (MATCH) sum of:
0.19769925 = (MATCH) max plus 0.01 times others of:
  0.19769925 = (MATCH) weight(color:brown^10.0 in 75829), product of:
0.06819186 = queryWeight(color:brown^10.0), product of:
  10.0 = boost
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  0.0023521234 = queryNorm
2.8991618 = (MATCH) fieldWeight(color:brown in 75829), product of:
  1.0 = tf(termFreq(color:brown)=1)
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  1.0 = fieldNorm(field=color, doc=75829)
0.21095455 = (MATCH) max plus 0.01 times others of:
  0.20865366 = (MATCH) weight(subCategory:shoe^20.0 in 75829), product
 of:
0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
  20.0 = boost
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.0023521234 = queryNorm
1.4891998 = (MATCH) fieldWeight(subCategory:shoe in 75829), product
 of:
  1.0 = 

Re: Random queries extremely slow

2009-01-22 Thread oleg_gnatovskiy

Actually my issue might merit a separate discussion, as I did tuning by
adjusting the heap to different settings to see how it affected things. It
really had no effect; with JDK 1.6, garbage collection is parallel, which
should no longer interfere with requests during garbage collection, and that
holds true based on the tests we ran.



oleg_gnatovskiy wrote:
 
 My aplogies, this is likely the same issue as Intermittent high response
 times  by  hbi dev 
 
 
 
 oleg_gnatovskiy wrote:
 
 Hello. Our production servers are operating relatively smoothly most of
 the time running Solr with 19 million listings. However every once in a
 while the same query that used to take 100 miliseconds takes 6000. This
 causes out health check to fail, and the server is taken out of service.
 Once the server is put back in service, queries are back to their regular
 response times. Is there anything we could do to stop this random
 slowness from occurring? 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21610972.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Random queries extremely slow

2009-01-22 Thread Yonik Seeley
On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy
oleg_gnatovs...@citysearch.com wrote:
 Hello. Our production servers are operating relatively smoothly most of the
 time running Solr with 19 million listings. However every once in a while
 the same query that used to take 100 miliseconds takes 6000.

Anything else happening on the system that may have forced some of the
index files out of operating system disk cache at these times?

-Yonik


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-22 Thread Shalin Shekhar Mangar
On Fri, Jan 23, 2009 at 12:15 AM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 I have attached a patch which logs the names of the files which could
 not get deleted (which may help us diagnose the problem). If you are
 comfortable applying a patch you may try it out.


I've committed this patch to trunk.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Random queries extremely slow

2009-01-22 Thread oleg_gnatovskiy

What are some things that could happen to force files out of the cache on a
Linux machine? I don't know what kinds of events to look for...




yonik wrote:
 
 On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy
 oleg_gnatovs...@citysearch.com wrote:
 Hello. Our production servers are operating relatively smoothly most of
 the
 time running Solr with 19 million listings. However every once in a while
 the same query that used to take 100 miliseconds takes 6000.
 
 Anything else happening on the system that may have forced some of the
 index files out of operating system disk cache at these times?
 
 -Yonik
 
 

-- 
View this message in context: 
http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611240.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Random queries extremely slow

2009-01-22 Thread Walter Underwood
The OS keeps recently accessed disk pages in memory. If another
process does a lot of disk access, like a backup, the OS might
replace the Solr index pages with that process's pages.

What kind of storage: local disk, SAN, NFS?
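
If evicted pages turn out to be the cause, one workaround (a generic trick rather
than anything Solr-specific) is to stream the index files once after the competing
activity or a snapshot install, so the page cache is repopulated before queries
arrive. A rough sketch, with the path as an example:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class WarmIndexCache {
  public static void main(String[] args) throws IOException {
    File dir = new File(args.length > 0 ? args[0] : "solr/data/index");
    byte[] buf = new byte[1 << 20];  // 1 MB read buffer
    File[] files = dir.listFiles();
    if (files == null) return;
    for (File f : files) {
      if (!f.isFile()) continue;
      // Sequentially read the whole file and discard the data; the OS keeps
      // the pages it just read in its cache.
      FileInputStream in = new FileInputStream(f);
      try {
        long read = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
          read += n;
        }
        System.out.println("warmed " + f.getName() + " (" + read + " bytes)");
      } finally {
        in.close();
      }
    }
  }
}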

wunder

On 1/22/09 11:22 AM, oleg_gnatovskiy oleg_gnatovs...@citysearch.com
wrote:

 
 What are some things that could happen to force files out of the cache on a
 Linux machine? I don't know what kinds of events to look for...
 
 
 
 
 yonik wrote:
 
 On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy
 oleg_gnatovs...@citysearch.com wrote:
 Hello. Our production servers are operating relatively smoothly most of
 the
 time running Solr with 19 million listings. However every once in a while
 the same query that used to take 100 miliseconds takes 6000.
 
 Anything else happening on the system that may have forced some of the
 index files out of operating system disk cache at these times?
 
 -Yonik
 
 



Re: Random queries extremely slow

2009-01-22 Thread Otis Gospodnetic
Here is one example: pushing a large newly optimized index onto the server.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, January 22, 2009 2:22:51 PM
 Subject: Re: Random queries extremely slow
 
 
 What are some things that could happen to force files out of the cache on a
 Linux machine? I don't know what kinds of events to look for...
 
 
 
 
 yonik wrote:
  
  On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy
  wrote:
  Hello. Our production servers are operating relatively smoothly most of
  the
  time running Solr with 19 million listings. However every once in a while
  the same query that used to take 100 miliseconds takes 6000.
  
  Anything else happening on the system that may have forced some of the
  index files out of operating system disk cache at these times?
  
  -Yonik
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611240.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query Performance while updating teh index

2009-01-22 Thread Otis Gospodnetic
Oleg,

This is more of an OS-level thing than a Solr thing, it seems from your emails.  
If you send answers to my questions we'll be able to help more.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, January 21, 2009 1:09:21 PM
 Subject: Re: Query Performance while updating teh index
 
 
 What exactly does Solr do when it receives a new Index? How does it keep
 serving while performing the updates? It seems that the part that causes the
 slowdown is this transition.
 
 
 
 
 Otis Gospodnetic wrote:
  
  This is an old and long thread, and I no longer recall what the specific
  suggestions were.
  My guess is this has to do with the OS cache of your index files.  When
  you make the large index update, that OS cache is useless (old files are
  gone, new ones are in) and the OS cache has get re-warmed and this takes
  time.
  
  Are you optimizing your index before the update?  Do you *really* need to
  do that?
  How large is your update, what makes it big, and could you make it
  smaller?
  
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
  - Original Message 
  From: oleg_gnatovskiy 
  To: solr-user@lucene.apache.org
  Sent: Tuesday, January 20, 2009 6:19:46 PM
  Subject: Re: Query Performance while updating teh index
  
  
  Hello again. It seems that we are still having these problems. Queries
  take
  as long as 20 minutes to get back to their average response time after a
  large index update, so it doesn't seem like the problem is the 12 second
  autowarm time. Are there any more suggestions for things we can try?
  Taking
  our servers out of teh loop for as long as 20 minutes is a bit of a
  hassle,
  and a risk.
  -- 
  View this message in context: 
  
 http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
  Sent from the Solr - User mailing list archive at Nabble.com.
  
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21588779.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Random queries extremely slow

2009-01-22 Thread oleg_gnatovskiy

Well this probably isn't the cause of our random slow queries, but might be
the cause of the slow queries after pulling a new index. Is there anything
we could do to reduce the performance hit we take from this happening?



Otis Gospodnetic wrote:
 
 Here is one example: pushing a large newly optimized index onto the
 server.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, January 22, 2009 2:22:51 PM
 Subject: Re: Random queries extremely slow
 
 
 What are some things that could happen to force files out of the cache on
 a
 Linux machine? I don't know what kinds of events to look for...
 
 
 
 
 yonik wrote:
  
  On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy
  wrote:
  Hello. Our production servers are operating relatively smoothly most
 of
  the
  time running Solr with 19 million listings. However every once in a
 while
  the same query that used to take 100 miliseconds takes 6000.
  
  Anything else happening on the system that may have forced some of the
  index files out of operating system disk cache at these times?
  
  -Yonik
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611240.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611454.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query Performance while updating teh index

2009-01-22 Thread oleg_gnatovskiy

We do optimize the index before updates but we get tehse performance issues
even when we pull an empty snapshot. Thus even when our update is tiny, the
performance issues still happen.



Otis Gospodnetic wrote:
 
 This is an old and long thread, and I no longer recall what the specific
 suggestions were.
 My guess is this has to do with the OS cache of your index files.  When
 you make the large index update, that OS cache is useless (old files are
 gone, new ones are in) and the OS cache has get re-warmed and this takes
 time.
 
 Are you optimizing your index before the update?  Do you *really* need to
 do that?
 How large is your update, what makes it big, and could you make it
 smaller?
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, January 20, 2009 6:19:46 PM
 Subject: Re: Query Performance while updating teh index
 
 
 Hello again. It seems that we are still having these problems. Queries
 take
 as long as 20 minutes to get back to their average response time after a
 large index update, so it doesn't seem like the problem is the 12 second
 autowarm time. Are there any more suggestions for things we can try?
 Taking
 our servers out of teh loop for as long as 20 minutes is a bit of a
 hassle,
 and a risk.
 -- 
 View this message in context: 
 http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21611642.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query Performance while updating teh index

2009-01-22 Thread Otis Gospodnetic
OK.  Then it's likely not this.  You saw the other response about looking at GC 
to see if maybe that hits you once in a while and slows whatever queries are in 
flight?  Try jconsole.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, January 22, 2009 2:43:31 PM
 Subject: Re: Query Performance while updating teh index
 
 
 We do optimize the index before updates but we get tehse performance issues
 even when we pull an empty snapshot. Thus even when our update is tiny, the
 performance issues still happen.
 
 
 
 Otis Gospodnetic wrote:
  
  This is an old and long thread, and I no longer recall what the specific
  suggestions were.
  My guess is this has to do with the OS cache of your index files.  When
  you make the large index update, that OS cache is useless (old files are
  gone, new ones are in) and the OS cache has get re-warmed and this takes
  time.
  
  Are you optimizing your index before the update?  Do you *really* need to
  do that?
  How large is your update, what makes it big, and could you make it
  smaller?
  
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
  - Original Message 
  From: oleg_gnatovskiy 
  To: solr-user@lucene.apache.org
  Sent: Tuesday, January 20, 2009 6:19:46 PM
  Subject: Re: Query Performance while updating teh index
  
  
  Hello again. It seems that we are still having these problems. Queries
  take
  as long as 20 minutes to get back to their average response time after a
  large index update, so it doesn't seem like the problem is the 12 second
  autowarm time. Are there any more suggestions for things we can try?
  Taking
  our servers out of teh loop for as long as 20 minutes is a bit of a
  hassle,
  and a risk.
  -- 
  View this message in context: 
  
 http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
  Sent from the Solr - User mailing list archive at Nabble.com.
  
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21611642.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query Performance while updating teh index

2009-01-22 Thread oleg_gnatovskiy

We've tried it. There doesn't seem to be any connection between GC and the
bad performance spikes.


Otis Gospodnetic wrote:
 
 OK.  Then it's likely not this.  You saw the other response about looking
 at GC to see if maybe that hits you once in a while and slows whatever
 queries are in flight?  Try jconsole.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, January 22, 2009 2:43:31 PM
 Subject: Re: Query Performance while updating teh index
 
 
 We do optimize the index before updates but we get tehse performance
 issues
 even when we pull an empty snapshot. Thus even when our update is tiny,
 the
 performance issues still happen.
 
 
 
 Otis Gospodnetic wrote:
  
  This is an old and long thread, and I no longer recall what the
 specific
  suggestions were.
  My guess is this has to do with the OS cache of your index files.  When
  you make the large index update, that OS cache is useless (old files
 are
  gone, new ones are in) and the OS cache has get re-warmed and this
 takes
  time.
  
  Are you optimizing your index before the update?  Do you *really* need
 to
  do that?
  How large is your update, what makes it big, and could you make it
  smaller?
  
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
  - Original Message 
  From: oleg_gnatovskiy 
  To: solr-user@lucene.apache.org
  Sent: Tuesday, January 20, 2009 6:19:46 PM
  Subject: Re: Query Performance while updating the index
  
  
  Hello again. It seems that we are still having these problems. Queries
  take
  as long as 20 minutes to get back to their average response time after
 a
  large index update, so it doesn't seem like the problem is the 12
 second
  autowarm time. Are there any more suggestions for things we can try?
  Taking
  our servers out of the loop for as long as 20 minutes is a bit of a
  hassle,
  and a risk.
  
  
  
 
 
 
 




Re: numFound problem

2009-01-22 Thread Chris Hostetter
: I have a test search which I know should return 34 docs and it does 
: 
: however, numFound says 40 
: 
: with debug enabled, I can see the 40 it has found 
...
: now, I can probably work around it if it had returned me the 40 docs but the 
problem is it returns 34 docs but gives me a numFound of 40 

these statements don't seem consistent.  you're saying that when you 
execute the query, you see a numFound=40, but only 34 docs are included in 
the response ... at first guess, i would assume then that maybe you have 
rows=34 in your query ... but you then say that after enabling debug (i 
assume by adding debugQuery=true to the request) you see all 40 docs.

I don't understand how that is possible.

can you please add echoParams=all&echoHandler=true to your url, and then 
send a reply with that full URL, as well as the full response solr returns 
when you hit it (you can add an fl param that just specifies the 
uniqueKey field for your docs to keep the response small and your data 
private)



-Hoss



Re: numFound problem

2009-01-22 Thread Ron Chan
sorry, I miscounted the number of docs returned 

I was thrown when it first returned numFound=40, lost track after trying a few 
things 

the returned docs are correct and match numFound; there is no problem here 

Sorry for the confusion 


- Original Message - 
From: Chris Hostetter hossman_luc...@fucit.org 
To: solr-user@lucene.apache.org 
Sent: Thursday, 22 January, 2009 20:15:27 GMT +00:00 GMT Britain, Ireland, 
Portugal 
Subject: Re: numFound problem 

: I have a test search which I know should return 34 docs and it does 
: 
: however, numFound says 40 
: 
: with debug enabled, I can see the 40 it has found 
... 
: now, I can probably work around it if it had returned me the 40 docs but the 
problem is it returns 34 docs but gives me a numFound of 40 

these statements don't seem consistent. you're saying that when you 
execute the query, you see a numFound=40, but only 34 docs are included in 
the response ... at first guess, i would assume then that maybe you have 
rows=34 in your query ... but you then say that after enabling debug (i 
assume by adding debugQuery=true to the request) you see all 40 docs. 

I don't understand how that is possible. 

can you please add echoParams=all&echoHandler=true to your url, and then 
send a reply with that full URL, as well as the full response solr returns 
when you hit it (you can add an fl param that just specifies the 
uniqueKey field for your docs to keep the response small and your data 
private) 



-Hoss 



Master failover - seeking comments

2009-01-22 Thread edre...@ha

Hi,

We're looking forward to using Solr in a project.  We're using a typical
setup with one Master and a handful of Slaves.  We're using the Master for
writes and the Slaves for reads.  Standard stuff.

Our concern is with downtime of the Master server.  I read a few posts that
touched on this topic but didn't find anything substantive.  I've got a test
setup in place that appears to work, but I'd like to get some feedback.

Essentially, the plan is to add another Master server, so now we have M1 and
M2.  Both M1 and M2 are also configured to be slaves of each other.  The
plan is to put a load balancer in between the Slaves and the Master servers. 
This way, if M1 goes down, traffic will be routed to M2 automatically.  Once
M1 comes back online, we'll route traffic back to that server.  Because M1
and M2 are replicating each other, all updates are captured.

To test this, I ran the following scenario.

1) Slave 1 (S1) is configured to use M2 as its master.
2) We push an update to M2.
3) We restart S1, now pointing to M1.
4) We wait for M1 to sync from M2
5) We then sync S1 to M1.  
6) Success!

However...

M1 and M2 generate snapshots every time they sync to each other, even if no
new data was pushed to them from a Slave.  We're concerned about this.   

Is this even a problem?  
Are we stuck in some infinite sync loop between the 2 Master machines?  
Will this degrade performance of the Master machines over time?  
Is there anything else I should know about this setup?

Any insights, or alternative suggestions to this setup are quite welcome.

Thanks,
Erik
 



Re: Newbie Design Questions

2009-01-22 Thread Gunaranjan Chandraraju

Thanks

A last question - do you have any approximate date for the release of
1.4? If it's going to be soon enough (within a month or so) then I can
plan for our development around it.


Thanks
Guna

On Jan 22, 2009, at 11:04 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



You are out of luck if you are not using a recent version of DIH

The sub entity will work only if you use the FieldReaderDataSource.
Then you do not need a ClobTransformer either.

The trunk version of DIH can be used w/ Solr 1.3 release

On Thu, Jan 22, 2009 at 12:59 PM, Gunaranjan Chandraraju
chandrar...@apple.com wrote:

Hi

Yes, the XML is inside the DB in a clob. Would love to use  
XPath inside
SQLEntityProcessor as it will save me tons of trouble for file- 
dumping
(given that I am not able to post it).  This is how I setup my DIH  
for DB

import.

dataConfig
dataSource type=JdbcDataSource name=data-source-1
driver=oracle.jdbc.driver.OracleDriver  
url=jdbc:oracle:thin:@X

user=abc password=*** batchSize=100/
 document
   entity dataSource=data-source-1
   name =item processor=SqlEntityProcessor
   pk=ID
   stream=false
   rootEntity=false
   transformer=ClobTransformer  !-- custom clob transformer I saw and not the one from 1.4. --
   query=select xml_col from xml_table where xml_col IS NOT NULL
 !-- horrible query I need to work on making it better --


  entity
 dataSource=null  !-- this is my problem - if I don't give a
name here it complains, if I put in null then the code seems to fail with a
null pointer --
 name=record
 processor=XPathEntityProcessor
 stream=false
 url=${item.xml_col}
  forEach=/record

field column=ID xpath=/record/coreinfo/@a /
field column=type xpath=/record/coreinfo/@b /
field column=streetname xpath=/record/address/@c /

.. and so on
  /entity


   /entity
 /document
/dataConfig


The problem with this is that it always fails with this error.  I can see
that the earlier SQL entity extraction and clob transformation is working as
the values show in the debug jsp (verbose mode with dataimport.jsp).
However no records are extracted for the entity.  When I check the
catalina.out file, it shows me the following errors for entity name=record
(the XPath entity on top).

java.lang.NullPointerException at
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85).


I don't have the whole stack trace right now.  If you need it I  
would be

happy to recreate and post it.

Regards,
Guna

On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
chandrar...@apple.com wrote:


Thanks

Yes the source of data is a DB.  However the xml is also posted on updates
via a publish framework.  So I can just plug in an adapter here to listen
for changes and post to SOLR.  I was trying to use the XPathProcessor inside
the SQLEntityProcessor and this did not work (using 1.3 - I did see support
in 1.4).  That is not a show stopper for me and I can just post them via the
framework and use files for the first-time load.


XPathEntityProcessor works inside SqlEntityProcessor only if a db
field contains xml.

However ,you can have a separate entity (at the root) to read from  
db

for delta.
Anyway if your current solution works stick to it.


I have seen a couple of answers on the backup for crash scenarios.  Just
wanted to confirm - if I replace the index with the backed-up files, can I
simply start up Solr again and reindex the documents changed since the last
backup? Am I right?  The slaves will also automatically adjust to this.


Yes. you can replace an archived index and Solr should work just  
fine.

but the docs added since the last snapshot was taken will be missing
(of course :) )


THanks
Guna


On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
chandrar...@apple.com wrote:


Hi All
We are considering SOLR for a large database of XMLs.  I have  
some

newbie
questions - if there is a place I can go read about them do let  
me know

and
I will go read up :)

1. Currently we are able to pull the XMLs from a file systems  
using
FileDataSource.  The DIH is convenient since I can map my XML  
fields

using
the XPathProcessor. This works for an initial load.  However after
after

the
initial load, we would like to 'post' changed xmls to SOLR  
whenever the

XML
is updated in a separate system.  I know we can post xmls with  
'add'

however
I was not sure how to do this and maintain the DIH mapping I  
use in
data-config.xml?  I don't want to save the file to the disk and  
then

call
the DIH - would prefer to directly post it.  Do I need to use  
solrj for

this?


What is the source of your new data? is it a DB?



2.  If my solr schema.xml changes then 

Re: Embedded Solr updates not showing until restart

2009-01-22 Thread edre...@ha



Grant Ingersoll-6 wrote:
 
 Can you share your code?  Or reduce it down to a repeatable test?
 

I'll try to do this.  For now I'm proceeding with the HTTP route.  We're
going to want to revisit this and I'll likely do it at that time.

Thanks,
Erik



Re: How to select *actual* match from a multi-valued field

2009-01-22 Thread Chris Hostetter

: At a high level, I'm trying to do some more intelligent searching using
: an app that will send multiple queries to Solr. My current issue is
: around multi-valued fields and determining which entry actually
: generated the hit for a particular query.

strictly speaking, this isn't possible with normal queries: the underlying 
data structures do not maintain any history about why a doc matches when 
executing a Query. SpanQuery is a subclass of Query that can give you this 
information, so a custom Solr plugin that used SpanTermQueries and 
SpanNearQueries in place of TermQueries and PhraseQueries could generate 
this kind of information -- but it comes at a cost (SpanQueries are not as 
fast as their traditional counterparts).
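
For anyone who wants to experiment with that route, here is a rough, untested 
sketch in plain Lucene that walks the Spans of a SpanNearQuery to see exactly 
which token positions matched -- the field name, terms and slop are invented 
for illustration:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    public class SpanMatchDump {
      // print the doc id and token positions of every "bob smith" match
      public static void dump(IndexReader reader) throws Exception {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("name", "bob")),
            new SpanTermQuery(new Term("name", "smith"))
        };
        SpanNearQuery near = new SpanNearQuery(clauses, 0, true); // slop 0, in order
        Spans spans = near.getSpans(reader);
        while (spans.next()) {
          System.out.println("doc=" + spans.doc()
              + " start=" + spans.start() + " end=" + spans.end());
        }
      }
    }

With a positionIncrementGap configured on the multi-valued field, those 
start/end positions tell you which value of the field produced the hit.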

The best you can do is use things like score Explanations and hit 
highlighting which mimic the logic used during a query to determine why a 
doc (already identified) matched.

: Jane Smith, Bob Smith, Roger Smith, Jane Doe. If the user performs a
: search for Bob Smith, this document is returned. What I want to know is
: that this document was returned because of Bob Smith, not because of
: Jane or Roger. I've tried using the highlighting settings. They do
: provide some help, as the Jane Doe entry doesn't come back highlighted,
: but both Jane and Roger do. I've tried using hl.requireFieldMatch, but
: that seems to pertain only to fields, not entries within a multi-valued
: field.

FWIW: if you are using q=Bob+Smith then Jane Smith and Roger Smith 
*are* contributing to the result.

However, even if you are using a phrase search (q=%22Bob+Smith%22) i do seem 
to recall that the traditional highlighter highlights all of the terms in 
the fields, even if the whole phrase isn't there -- historically that was 
considered a feature (for the purpose of snippet generation people 
frequently want to see that type of behavior) but i can understand why it 
would cause you problems in your current use case

As mentioned on the wiki, there is an hl.usePhraseHighlighter param you can use 
to trigger a newer SpanScorer based highlighter -- which takes advantage 
of the previously mentioned SpanQuery logic to determine what to 
highlight (even if the queries themselves weren't SpanQueries) ... this 
param gets its name because when dealing with phrase queries, it only 
highlights them if the whole phrase is there.

http://wiki.apache.org/solr/HighlightingParameters

Compare the results of these two URLs when using the example 
configs/data...

http://localhost:8983/solr/select/?hl.fragsize=0hl.usePhraseHighlighter=falsedf=featuresq=%22Solr+Search%22hl.snippets=1000hl.requireFieldMatch=truefl=featureshl=truehl.fl=features
http://localhost:8983/solr/select/?hl.fragsize=0hl.usePhraseHighlighter=truedf=featuresq=%22Solr+Search%22hl.snippets=1000hl.requireFieldMatch=truefl=featureshl=truehl.fl=features

I think that may solve your particular problem.
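
If it is easier to compare from code, the same two requests can be expressed 
with SolrJ along these lines (untested sketch, with the server URL and field 
names taken from the example configs) -- flipping hl.usePhraseHighlighter is 
the only difference between the two URLs above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PhraseHighlightCheck {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("\"Solr Search\"");
        q.set("df", "features");
        q.set("fl", "features");
        q.set("hl", "true");
        q.set("hl.fl", "features");
        q.set("hl.fragsize", "0");
        q.set("hl.snippets", "1000");
        q.set("hl.requireFieldMatch", "true");
        // set this to "false" to see the traditional highlighter's behavior
        q.set("hl.usePhraseHighlighter", "true");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getHighlighting());
      }
    }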


-Hoss



URL-import field type?

2009-01-22 Thread Paul Libbrecht


Hello list,

after searching around for quite a while, including in the  
DataImportHandler documentation on the wiki (which looks amazing), I  
couldn't find a way to indicate to solr that the tokens of that field  
should be the result of analyzing the tokens of the stream at URL-xxx.


I know I was able to imitate that in plain Lucene by crafting a  
particular analyzer filter that was only given the URL as content and  
then passed on the tokens of the stream.


Is this the right way in solr?

thanks in advance.

paul



Re: Performance dead-zone due to garbage collection

2009-01-22 Thread wojtekpia

I'm not sure if you suggested it, but I'd like to try the IBM JVM. Aside from
setting my JRE paths, is there anything else I need to do to run inside the IBM
JVM? (e.g. re-compiling?)


Walter Underwood wrote:
 
 What JVM and garbage collector setting? We are using the IBM JVM with
 their concurrent generational collector. I would strongly recommend
 trying a similar collector on your JVM. Hint: how much memory is in
 use after a full GC? That is a good approximation to the working set.
 
 




Re: Performance dead-zone due to garbage collection

2009-01-22 Thread Walter Underwood
No need to recompile. Install it and change your JAVA_HOME
and things should work. The options are different than for
the Sun JVM. --wunder

On 1/22/09 3:46 PM, wojtekpia wojte...@hotmail.com wrote:

 
 I'm not sure if you suggested it, but I'd like to try the IBM JVM. Aside from
 setting my JRE paths, is there anything else I need to do to run inside the IBM
 JVM? (e.g. re-compiling?)
 
 
 Walter Underwood wrote:
 
 What JVM and garbage collector setting? We are using the IBM JVM with
 their concurrent generational collector. I would strongly recommend
 trying a similar collector on your JVM. Hint: how much memory is in
 use after a full GC? That is a good approximation to the working set.




Re: Newbie Design Questions

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
It is planned to be released in another month or so, but that is never certain.


On Fri, Jan 23, 2009 at 3:57 AM, Gunaranjan Chandraraju
chandrar...@apple.com wrote:
 Thanks

 A last question - do you have any approximate date for the release of 1.4?
 If it's going to be soon enough (within a month or so) then I can plan for
 our development around it.

 Thanks
 Guna

 On Jan 22, 2009, at 11:04 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 You are out of luck if you are not using a recent version of DIH

 The sub entity will work only if you use the FieldReaderDataSource.
 Then you do not need a ClobTransformer either.

 The trunk version of DIH can be used w/ Solr 1.3 release

 On Thu, Jan 22, 2009 at 12:59 PM, Gunaranjan Chandraraju
 chandrar...@apple.com wrote:

 Hi

 Yes, the XML is inside the DB in a clob. Would love to use XPath
 inside
 SQLEntityProcessor as it will save me tons of trouble for file-dumping
 (given that I am not able to post it).  This is how I setup my DIH for DB
 import.

 dataConfig
 dataSource type=JdbcDataSource name=data-source-1
 driver=oracle.jdbc.driver.OracleDriver url=jdbc:oracle:thin:@X
 user=abc password=*** batchSize=100/
  document
   entity dataSource=data-source-1
   name =item processor=SqlEntityProcessor
   pk=ID
   stream=false
   rootEntity=false
   transformer=ClobTransformer  !-- custom clob transformer I
 saw and not the one from 1.4.   --
   query=select xml_col from xml_table where xml_col IS NOT NULL

  !-- horrible query I need to work on making it better --

  entity
 dataSource=null  !-- this is my problem - if I don't give a
 name here it complains, if I put in null then the code seems to fail with
 a
 null pointer --
 name=record
 processor=XPathEntityProcessor
 stream=false
 url=${item.xml_col}
  forEach=/record

field column=ID xpath=/record/coreinfo/@a /
field column=type xpath=/record/coreinfo/@b /
field column=streetname xpath=/record/address/@c /

.. and so on
  /entity


   /entity
  /document
 /dataConfig


 The problem with this is that it always fails with this error.  I can see
 that the earlier SQL entity extraction and clob transformation is working
 as
 the values show in the debug jsp (verbose mode with dataimport.jsp).
 However no records are extracted for entity.  When I check catalina.out
 file, it shows me the following errors for entity name=record. (the
 XPath
 entity on top).

 java.lang.NullPointerException at

 org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85).

 I don't have the whole stack trace right now.  If you need it I would be
 happy to recreate and post it.

 Regards,
 Guna

 On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
 chandrar...@apple.com wrote:

 Thanks

 Yes the source of data is a DB.  However the xml is also posted on
 updates
 via a publish framework.  So I can just plug in an adapter here to listen
 for
 changes and post to SOLR.  I was trying to use the XPathProcessor
 inside
 the
 SQLEntityProcessor and this did not work (using 1.3 - I did see support
 in
 1.4).  That is not a show stopper for me and I can just post them via
 the
 framework and use files for the first time load.

 XPathEntityProcessor works inside SqlEntityProcessor only if a db
 field contains xml.

 However, you can have a separate entity (at the root) to read from db
 for delta.
 Anyway if your current solution works stick to it.

 I have seen a couple of answers on the backup for crash scenarios.  Just
 wanted to confirm - if I replace the index with the backed-up files, can I
 simply start up Solr again and reindex the documents changed since the
 last backup? Am I right?  The slaves will also automatically adjust to
 this.

 Yes. you can replace an archived index and Solr should work just fine.
 but the docs added since the last snapshot was taken will be missing
 (of course :) )

 THanks
 Guna


 On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
 chandrar...@apple.com wrote:

 Hi All
 We are considering SOLR for a large database of XMLs.  I have some
 newbie
 questions - if there is a place I can go read about them do let me
 know
 and
 I will go read up :)

 1. Currently we are able to pull the XMLs from a file systems using
 FileDataSource.  The DIH is convenient since I can map my XML fields
 using
 the XPathProcessor. This works for an initial load.  However after
 the
 initial load, we would like to 'post' changed xmls to SOLR whenever
 the
 XML
 is updated in a separate system.  I know we can post xmls with 'add'
 however
 I was not sure how to do this and maintain the DIH mapping I use in
 data-config.xml?  I don't want to save the file to the disk and then
 call
 the DIH - would prefer to 

Re: URL-import field type?

2009-01-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
where is this url coming from? what is the content type of the stream?
is it plain text or html?

if so, this is a possible enhancement to DIH



On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht p...@activemath.org wrote:

 Hello list,

 after searching around for quite a while, including in the DataImportHandler
 documentation on the wiki (which looks amazing), I couldn't find a way to
 indicate to solr that the tokens of that field should be the result of
 analyzing the tokens of the stream at URL-xxx.

 I know I was able to imitate that in plain-lucene by crafting a particular
 analyzer-filter who was only given the URL as content and who gave further
 the tokens of the stream.

 Is this the right way in solr?

 thanks in advance.

 paul



-- 
--Noble Paul


Re: Date Format in QueryParsing

2009-01-22 Thread Chris Hostetter

: When I parse DateRange query in a custom RequestHandler I get the date in
: format -MM-dd'T'HH:mm:ss, but I would like it with the trailling 'Z' for
: UTC time. Is there a way how to set the desired date format?
...
: Query q = QueryParsing.parseQuery(query, req.getSchema());
: log.debug(q.toString()); // output the dates in -MM-dd'T'HH:mm:ss format

what you are logging is the toString of the internal query object -- for 
field types that have special encodings (SortableIntField, etc...) this is 
going to be gibberish -- it's why there is a static 
QueryParsing.toString(Query,IndexSchema) method, which isn't perfect but 
does a decent job for debugging.

DateField may not seem like a field with special encodings, but the Z is 
missing when you look at that Query object for the same reason -- the 
indexed form suitable for searching and range queries is missing the Z 
so that things sort properly.

in general, if you have an indexed term for a field, and you want to make 
it readable, you should use the methods in the appropriate FieldType to 
convert it (I think the method is indexedToReadable but double check 
that)
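
a rough sketch of what that looks like inside a custom handler (again, double 
check the exact method names against your Solr version):

    import org.apache.solr.schema.FieldType;
    import org.apache.solr.schema.IndexSchema;

    public class ReadableTerm {
      // turn an indexed term back into its external form,
      // e.g. a DateField value with the trailing 'Z'
      public static String toReadable(IndexSchema schema, String field,
                                      String indexedForm) {
        FieldType ft = schema.getFieldType(field);
        return ft.indexedToReadable(indexedForm);
      }
    }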




-Hoss



Re: DocumentId, InternalDocID and Query from QueryResponse

2009-01-22 Thread Chris Hostetter

: I am new to Solr. I would like to know how to get DocumentId,
: InternalDocID and Query from QueryResponse.

I'm going to make some assumptions about what it is you are asking for...

1) by DocumentId, i assume you mean the value of the uniqueKey field you 
define in your schema.xml -- it's a field like any other, so if you want 
it returned for each doc, just ask for it in the fl param.
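
for example, with SolrJ (untested sketch, and it assumes the uniqueKey field 
is called "id"):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class FetchIds {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.set("fl", "id,score");  // only return the uniqueKey (plus score)
        QueryResponse rsp = server.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id"));
        }
      }
    }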

2) by InternalDocId i'm assuming you mean the low level Lucene docid -- 
there is no way to get this info from Solr, there is also 
no use for it in a client, since you can't do anything with it -- internal 
docids can change any time there is a segment merge.

3) if by Query you mean the query string that came from the client (echoed 
back) take a look at the echoParams option which can give back all of the 
request params in the response if you wish.  if you mean the actual 
Query object used to execute the search, there is no way to get that in 
the client -- the parsing and Query object structure are built on the 
server side.



-Hoss



Re: Master failover - seeking comments

2009-01-22 Thread Shalin Shekhar Mangar
On Fri, Jan 23, 2009 at 3:57 AM, edre...@ha edre...@homeaway.com wrote:


 Essentially, the plan is to add another Master server, so now we have M1
 and
 M2.  Both M1 and M2 are also configured to be slaves of each other.  The
 plan is to put a load balancer in between the Slaves and the Master
 servers.


What exactly do you mean by that? The slaves never write to the master, do
they?



 This way, if M1 goes down, traffic will be routed to M2 automatically.
  Once
 M1 comes back online, we'll route traffic back to that server.  Because M1
 and M2 are replicating each other all updates are captured.



 To test this, I ran the following scenario.

 1) Slave 1 (S1) is configured to use M2 as it's master.
 2) We push an update to M2.
 3) We restart S1, now pointing to M1.
 4) We wait for M1 to sync from M2
 5) We then sync S1 to M1.
 6) Success!


How do you co-ordinate all this?



 However...

 M1 and M2 generate snapshots every time they sync to each other, even if no
 new data was pushed to them from a Slave.  We're concerned about this.


Are you using rsync based replication? The Java based replication does not
create snapshots, however it is in 1.4 trunk only.



 Is this even a problem?
 Are we stuck in some infinte sync loop between the 2 Master machines?
 Will this degrade performance of the Master machines over time?
 Is there anything else I should know about this setup?

 Any insights, or alternative suggestions to this setup are quite welcome.


It seems like you are trying to write to Solr directly from your front end
application. This is why you are thinking of multiple masters. I'll let
others comment on how easy/hard/correct the solution would be.

But, do you really need to have live writes? Can they be channeled through a
background process? Since you anyway cannot do a commit per-write, the
advantage of live writes is minimal. Moreover you would need to invest a lot
of time in handling availability concerns to avoid losing updates. If you
log/record the write requests to an intermediate store (or queue), you can
make do with one master (with another host on standby acting as a slave). The
switching between the boxes for writes can be done manually. I know it is
more manual work, but it is a simpler design and we know it works :)
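
To make the idea concrete, here is a very rough sketch of such a background
process with SolrJ. The queue type, batch size and error handling are all
assumptions, not recommendations:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexingWorker implements Runnable {
      private final BlockingQueue<SolrInputDocument> queue;
      private final SolrServer master;  // points at the current write master

      public IndexingWorker(BlockingQueue<SolrInputDocument> queue,
                            SolrServer master) {
        this.queue = queue;
        this.master = master;
      }

      public void run() {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        try {
          while (true) {
            batch.add(queue.take());   // block until at least one doc arrives
            queue.drainTo(batch, 99);  // then grab up to 100 in total
            master.add(batch);         // one add per batch; commit elsewhere
            batch.clear();
          }
        } catch (Exception e) {
          // real code would re-queue the batch and alert, not just print
          e.printStackTrace();
        }
      }
    }

The front end only ever enqueues; only this worker talks to the write master,
so switching masters means repointing a single process.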

-- 
Regards,
Shalin Shekhar Mangar.


How to make Relationships work for Multi-valued Index Fields?

2009-01-22 Thread Gunaranjan Chandraraju

Hi
I may be completely off on this being new to SOLR, but I am not sure
how to index related groups of fields in a document and preserve
their 'grouping'.  I would appreciate any help on this.  A detailed
description of the problem is below.


I am trying to index an entity that can have multiple occurrences in  
the same document - e.g. Address.  The address could be Shipping,  
Home, Office etc.   Each address element has multiple values in it  
like street, state etc.  Thus each address element is a group with
the state and street in one address element being related to each other.


It looks like this in my source xml

record
   coreInfo id=123 , .../
   address street=XYZ1 State=CA ...type=home /
   address street=XYZ2 state=CA ... type=Office/
   address street=XYZ3 state=CA type=Other/
/record

I have setup my DIH to treat these as entities as below

dataConfig
   dataSource type=FileDataSource encoding=UTF-8 /
   document
 entity name =f processor=FileListEntityProcessor
 baseDir=***
 fileName=.*xml
 rootEntity=false
 dataSource=null 
entity
   name=record
   processor=XPathEntityProcessor
   stream=false
   forEach=/record
   url=${f.fileAbsolutePath}
field column=ID xpath=/record/@id /

!-- Address  --
 entity
 name=record_adr
 processor=XPathEntityProcessor
 stream=false
 forEach=/record/address
 url=${f.fileAbsolutePath}
 field column=address_street  xpath=/record/address/@street /
 field column=address_state   xpath=/record/address//@state /
 field column=address_type    xpath=/record/address//@type /

/entity
   /entity
 /entity
   /document
/dataConfig


The problem is as follows.  DIH seems to treat these as entities but  
solr seems to flatten them out on indexing to fields in a document  
(losing the entity part).


So when I search for an ID - in the response all the street fields
are bunched together, followed by all the state fields, the types, etc.
Thus I can't associate which street address corresponds to which  
address type in the response.


What seems harder is this - say I need to query on 'Street' = XYZ1 and
type=Office.  This should NOT return a document since the street for
the office address is XYZ2 and not XYZ1.  However when I query for
address_street:XYZ1 and address_type:Office I get back this document.


The problem seems to be that while DIH allows 'entities' within a
document, the SOLR schema does not preserve them - it 'flattens' them
all out into flat fields on the document.


I could work around the problem by creating SOLR fields like  
home_address_street and office_address_street and do some xpath  
mapping.  However I don't want to do it as we can have multiple  
'other' addresses.  Also I have other fields whose type is not easily  
distinguished like address.


As I mentioned, being new to SOLR I might have completely goofed on the
way to set it up - I'd much appreciate any direction on it. I am using
SOLR 1.3


Regards,
Guna




Re: how can solr search against group of field

2009-01-22 Thread surfer10

definitely disMax does the thing by searching one term against multiple fields. but 
what if my index contains two additional multivalued fields like category id?

i need to search against terms in particular fields of documents and dismax
does this well thru qf=field1,field2
how can i filter results which have only 1 or 2 or 3 in the categoryID
field?

could you please help me figure this out?




Re: How to make Relationships work for Multi-valued Index Fields?

2009-01-22 Thread Shalin Shekhar Mangar
On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju 
chandrar...@apple.com wrote:


 record
   coreInfo id=123 , .../
   address street=XYZ1 State=CA ...type=home /
   address street=XYZ2 state=CA ... type=Office/
   address street=XYZ3 state=CA type=Other/
 /record

 I have setup my DIH to treat these as entities as below

 dataConfig
   dataSource type=FileDataSource encoding=UTF-8 /
   document
 entity name =f processor=FileListEntityProcessor
 baseDir=***
 fileName=.*xml
 rootEntity=false
 dataSource=null 
entity
   name=record
   processor=XPathEntityProcessor
   stream=false
   forEach=/record
   url=${f.fileAbsolutePath}
field column=ID xpath=/record/@id /

!-- Address  --
 entity
 name=record_adr
 processor=XPathEntityProcessor
 stream=false
 forEach=/record/address
 url=${f.fileAbsolutePath}
 field column=address_street
  xpath=/record/address/@street /
 field column=address_state
 xpath=/record/address//@state /
 field column=address_type
  xpath=/record/address//@type /
/entity
   /entity
 /entity
   /document
 /dataConfig


I think the only way is to create a dynamic field for each attribute
(street, state etc.). Write a transformer to copy the fields from your data
config to appropriately named dynamic fields (e.g. street_1, state_1, etc).
To maintain this counter you will need to get/store it with
Context#getSessionAttribute(name, Context.SCOPE_DOC) and
Context#setSessionAttribute(name, val, Context.SCOPE_DOC).

I can't think of an easier way.
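
Something along these lines, as an untested sketch -- the counter name and the
street_N/state_N/type_N naming are just for illustration, and the Context
method signatures should be checked against your DIH version:

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class AddressNumberingTransformer extends Transformer {
      public Object transformRow(Map<String, Object> row, Context context) {
        // per-document counter of address rows seen so far (null on the first)
        Integer n = (Integer) context.getSessionAttribute("adrCount",
            Context.SCOPE_DOC);
        n = (n == null) ? 1 : n + 1;
        context.setSessionAttribute("adrCount", n, Context.SCOPE_DOC);

        // copy each column into a numbered dynamic field,
        // e.g. address_street -> street_1, street_2, ...
        row.put("street_" + n, row.get("address_street"));
        row.put("state_" + n, row.get("address_state"));
        row.put("type_" + n, row.get("address_type"));
        return row;
      }
    }

The transformer would be listed in the transformer attribute of the record_adr
entity, and the schema would need dynamic fields like street_*, state_* and
type_*.
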
-- 
Regards,
Shalin Shekhar Mangar.


Re: How to make Relationships work for Multi-valued Index Fields?

2009-01-22 Thread Shalin Shekhar Mangar
Oops, one more gotcha. The dynamic field support in DIH is only in 1.4 trunk.

On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju 
 chandrar...@apple.com wrote:


 record
   coreInfo id=123 , .../
   address street=XYZ1 State=CA ...type=home /
   address street=XYZ2 state=CA ... type=Office/
   address street=XYZ3 state=CA type=Other/
 /record

 I have setup my DIH to treat these as entities as below

 dataConfig
   dataSource type=FileDataSource encoding=UTF-8 /
   document
 entity name =f processor=FileListEntityProcessor
 baseDir=***
 fileName=.*xml
 rootEntity=false
 dataSource=null 
entity
   name=record
   processor=XPathEntityProcessor
   stream=false
   forEach=/record
   url=${f.fileAbsolutePath}
field column=ID xpath=/record/@id /

!-- Address  --
 entity
 name=record_adr
 processor=XPathEntityProcessor
 stream=false
 forEach=/record/address
 url=${f.fileAbsolutePath}
 field column=address_street
  xpath=/record/address/@street /
 field column=address_state
 xpath=/record/address//@state /
 field column=address_type
  xpath=/record/address//@type /
/entity
   /entity
 /entity
   /document
 /dataConfig


 I think the only way is to create a dynamic field for each attribute
 (street, state etc.). Write a transformer to copy the fields from your data
 config to appropriately named dynamic fields (e.g. street_1, state_1, etc).
 To maintain this counter you will need to get/store it with
 Context#getSessionAttribute(name, Context.SCOPE_DOC) and
 Context#setSessionAttribute(name, val, Context.SCOPE_DOC).

 I can't think of an easier way.
 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.