Re: reader/searcher refresh after replication (commit)
Yes, I consciously let my slaves run away from the master in order to reduce update latency, but every now and then they sync up with the master, which does the heavy lifting. The price you pay is that the slaves do not see the same documents as the master, but that is the case with replication anyhow. In my setup a slave may run ahead of the master with updates; this delta gets zeroed after replication and the game starts again. What you have to take into account is a very small time window where you may go back in time on the slaves (not seeing documents that were already there), but we are talking about seconds, and a couple of documents out of 200 million (only those documents that were soft-committed on the slave during replication, between the commit on the master and the postCommit on the slave). Why do you think something is strange here?

"What are you expecting a BeforeCommitListener could do for you, if one would exist?"

Why should I be expecting something? I just need to read the user commit data as soon as replication is done, and I am looking for a proper/easy way to do it (a postCommit listener is what I use now). What makes me slightly nervous are the life cycle questions, e.g. when I issue an update command before and after the postCommit event, which index gets updated: the one just replicated, or the one that was there just before replication? There are definitely ways to optimize this, for example to force the replication handler to copy only delta files when the index gets updated on both slave and master (there is already a todo somewhere on the Solr replication wiki, I think). Right now the ReplicationHandler copies the complete index when this is detected... I am all ears if there are better proposals for low-latency updates in a multi-server setup...

On Tue, Feb 21, 2012 at 11:53 PM, Em mailformailingli...@yahoo.de wrote: Eks, that sounds strange! Am I getting you right? You have a master which indexes batch-updates from time to time. Furthermore you have some slaves pulling data from that master to keep themselves up-to-date with the newest batch-updates. Additionally your slaves index their own content in soft-commit mode that needs to be available as soon as possible. In consequence, the slaves are not in sync with the master. I am not 100% certain, but chances are good that Solr's replication mechanism only copies those segments that are not in sync with the master. What are you expecting a BeforeCommitListener could do for you, if one would exist? Kind regards, Em

On 21.02.2012 21:10, eks dev wrote: Thanks Mark, Hmm, I would like to have this information asap, not to wait until the first search gets executed (which depends on the user). Is Solr going to create the new searcher as part of the replication transaction? Just to make it clear why I need it... I have a simple master/many-slaves config where the master does batch updates in big chunks (things users can wait longer to see on the search side), but the slaves work in soft-commit mode internally, where I permit them to run away slightly from the master. In order to know where the incremental update should start, I read it from the userData. Basically, ideally, before the commit (after successful replication) ends, I would like to read in these counters to let the incremental update run from the right point... I need to prevent updating the replicated index before I read this information (duplicates can appear). Are there any IndexWriter listeners around? Thanks again, eks.

On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller markrmil...@gmail.com wrote: Post commit calls are made before a new searcher is opened.
Might be easier to try to hook in with a new searcher listener?

On Feb 21, 2012, at 8:23 AM, eks dev wrote: Hi all, I am a bit confused with the IndexSearcher refresh lifecycles... In a master/slave setup, I override the postCommit listener on the slave (Solr trunk version) to read some user information stored in the userCommitData on the master:

@Override
public final void postCommit() {
  // This returns stale information that was present before replication finished
  RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
  Map<String, String> userData = refC.get().getIndexReader().getIndexCommit().getUserData();
}

I expected core.getNewestSearcher(true) to return a refreshed SolrIndexSearcher, but it didn't. When is this information going to be refreshed to the state of the replicated index? I repeat, this is a postCommit listener. What is the way to get the information from the last commit point? Maybe like this? core.getDeletionPolicy().getLatestCommit().getUserData(); Or do I need to explicitly open a new searcher (doesn't Solr do this behind the scenes?): core.openNewSearcher(false, false). Not critical, since reopening a new searcher works, but I would like to understand these lifecycles, i.e. when Solr loads the latest commit point... Thanks, eks

- Mark Miller lucidimagination.com
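Building on the snippet above, here is a minimal sketch of a postCommit listener that reads the user data from the latest commit point instead of from a possibly stale searcher. This is an illustration rather than the confirmed answer; the counter key is a made-up placeholder, and it assumes the trunk-era core.getDeletionPolicy() API quoted above:

@Override
public final void postCommit() {
  try {
    // Read the userData straight from the newest commit point on disk,
    // bypassing whichever searcher happens to be registered right now.
    IndexCommit latest = core.getDeletionPolicy().getLatestCommit();
    Map<String, String> userData = latest.getUserData();
    String counter = userData.get("incremental.update.start"); // hypothetical key
    // ... resume the incremental update from 'counter' ...
  } catch (IOException e) {
    SolrException.log(SolrCore.log, "could not read commit user data", e);
  }
}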
Re: Unique key constraint and optimistic locking (versioning)
Thanks a lot. We will use the UniqueKey feature and build versioning ourselves. Do you think it would be a good idea if we built a versioning feature into Solr/Lucene instead of doing it outside, so that others can benefit from the feature as well? I guess contributions are made according to http://wiki.apache.org/solr/HowToContribute. Is it possible for outsiders (like us) to get an SVN branch at svn.apache.org to prepare contributions, or do we have to use our own SVN? Are there any plans for migrating the Lucene/Solr codebase to Git, which would make it easier to get a separate area to work on the code (making a Git fork) and to suggest the contribution back to core Lucene/Solr (via a Git pull request)? Thanks! Per Steffensen

Em wrote: Hi Per, Solr provides the so-called UniqueKey field. Refer to the wiki to learn more: http://wiki.apache.org/solr/UniqueKey Optimistic locking (versioning) is not provided by Solr out of the box. If you add a new document with the same UniqueKey, it replaces the old one. You have to do the versioning on your own (and keep concurrent updates in mind). Kind regards, Em

On 21.02.2012 13:50, Per Steffensen wrote: Hi, does Solr/Lucene provide any mechanism for a unique key constraint and optimistic locking (versioning)? Unique key constraint: that a client will not succeed in creating a new document in Solr/Lucene if a document already exists with the same value in some field (e.g. an id field). Of course implemented right, so that even if two or more threads concurrently try to create a new document with the same value in this field, only one of them will succeed. Optimistic locking (versioning): that a client will only succeed in updating a document if the updated document is based on the version of the document currently stored in Solr/Lucene. Implemented in the optimistic way, where clients during an update have to tell which version of the document they fetched from Solr and therefore used as the starting point for their updated document. So basically: a version field on the document that clients increase by one before sending it to Solr for update, and some code in Solr that only lets the update succeed if the version number of the updated document is exactly one higher than the version number of the document already stored. Again implemented right, so that even if two or more threads concurrently try to update a document, all with their updated document based on the current version in Solr/Lucene, only one of them will succeed. Or do I have to do stuff like this myself outside Solr/Lucene, e.g. in the client using Solr? Regards, Per Steffensen
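To make the division of labor concrete, here is a minimal client-side sketch of the versioning scheme described above, using SolrJ. The version_l field name is a made-up placeholder, and, as the thread says, Solr itself does not enforce the check, so the read-bump-write below is not atomic without server-side support:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

// A sketch, not a safe implementation: a concurrent writer can still slip in
// between the query and the add, which is exactly the gap discussed above.
void updateWithVersion(SolrServer server, String id) throws Exception {
  SolrDocument current = server.query(new SolrQuery("id:" + id)).getResults().get(0);
  long seen = ((Number) current.getFieldValue("version_l")).longValue(); // hypothetical field
  SolrInputDocument updated = new SolrInputDocument();
  updated.addField("id", id);
  updated.addField("version_l", seen + 1);
  // ... copy/modify the remaining fields here ...
  server.add(updated);  // replaces the old document via the UniqueKey
  server.commit();
  // Detecting a lost update means re-reading and comparing version_l,
  // or building the check into Solr as proposed in this thread.
}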
Re: Unique key constraint and optimistic locking (versioning)
Sorry, I didn't see the Eclipse (using Git) chapter on http://wiki.apache.org/solr/HowToContribute. We might contribute in that area. Thanks! Per Steffensen
Re: reader/searcher refresh after replication (commit)
Sounds much clearer to me than before. :) Ad hoc, I have two ideas.

First: let replication run asynchronously. If shard1 is pulling the new index from the master and therefore very recent documents aren't available there anymore, shard2 will find them in the meantime. As soon as shard1 is up-to-date (including the most recent documents), shard2 can pull its update from the master. However, being out of sync between two shards that should serve the same data has its own problems, I think.

Second: you can have another SolrCore for the most recent documents. This one could be based on a RAMDirectory for reduced latency (or even use NRT features, if available in your Solr version). Your master/slave setup becomes simpler, since you do not have to worry about out-of-sync scenarios anymore. The challenge here is to handle duplicate documents (i.e. newer versions in the RAMDirectory) and proper relevancy, due to shards that are unbalanced by design.

Kind regards, Em
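To make the second idea concrete, here is a minimal sketch of a multi-core layout with an in-memory core for the newest documents; the core names are placeholders, and it assumes a Solr version that ships solr.RAMDirectoryFactory. In solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="main" instanceDir="main"/>
    <core name="recent" instanceDir="recent"/> <!-- only the newest documents -->
  </cores>
</solr>

And in recent/conf/solrconfig.xml:

<!-- keep the low-latency core entirely in RAM; its contents are lost on restart -->
<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>

Queries would then treat the two cores as shards, which is where the duplicate handling mentioned above comes in.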
Fields, Facets, and Search Results
I'm new to Solr and trying to get a proper understanding of what's going on with fields, facets, and search results. I've modified the example schema.xml and solrconfig.xml that come with Solr to reflect some fields I want to experiment with. I've also modified the Velocity templates in Solritas accordingly. I've created some sample docs to post to the index that have the fields/data I want to experiment with. Everything compiles and works, but my search results are not what I expect, and I'm trying to understand why. Every field I have is defined the same way (here are two examples):

<field name="title" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="section_text_content" type="text_general" indexed="true" stored="true" omitNorms="true"/>

The puzzle is that I'm getting search results on every term that's in the title field, but nothing on terms in the section_text_content field. I have no idea why. I thought at first it was because I'd specified the title field to also be a facet, but I removed that and things remain as described (except now, of course, the facet for title is gone). Can anyone provide some insight? Don
how to mock solr server solr_ruby
Hi, I am using solr_ruby in my Ruby code, and for that I start the Solr server using start.jar. Now I want to write mock objects for the Solr connection and for the code in my Ruby file that searches data from Solr. Can anybody suggest how to do this testing without starting the Solr server?
Re: Fields, Facets, and Search Results
Check your schema config file first. It looks like you missed copying the section_text_content field's content to your default search field:

<defaultSearchField>text</defaultSearchField>

<copyField source="section_text_content" dest="text"/>
'location' fieldType indexation impossible
Hi, when I try to index my location field I get this error for each document:

ATTENTION: Error creating document ... Error adding field 'emploi_city_geoloc'='48.85,2.5525'

(so I have 0 documents indexed). Here is my schema.xml:

<field name="emploi_city_geoloc" type="location" indexed="true" stored="false"/>

I really don't understand why it isn't working, because it was working on my local server with the same configuration (Solr 3.5.0) and the same database! If I use geohash instead of location, indexing works, but then my geodist query on the front end doesn't work anymore... Any ideas? Best regards, Xavier
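For what it's worth, the location type in the stock 3.5 example schema is solr.LatLonType, which splits each point into two dynamic subfields at index time; if the schema on the failing server is missing that dynamic field, every add fails with this kind of error. A sketch of the pieces that must be present together (an assumption about the cause, not a confirmed diagnosis):

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<!-- LatLonType writes the latitude and longitude into *_coordinate subfields;
     without this dynamicField, document creation fails -->
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>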
Re: Date filter query
Hi all, thanks for your responses. I tried [NOW/DAY-30DAY+TO+NOW/DAY-1DAY-1SECOND] and it seems to work fine for me. Thanks a lot!
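Spelled out as a full filter-query parameter, with the URL-encoded + signs turned back into spaces (created_dt is a hypothetical field name):

fq=created_dt:[NOW/DAY-30DAY TO NOW/DAY-1DAY-1SECOND]

Both endpoints round to day boundaries, and the trailing -1SECOND shaves one second off the upper bound so it effectively excludes the following day.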
How is Data Indexed in HBase?
Dear all, I wonder how data in HBase is indexed. Solr is used in my system now because data is managed in an inverted index; such an index is suited to retrieving unstructured data in huge amounts. How does HBase deal with this issue? Could I replace Solr with HBase? Thanks so much! Best regards, Bing
Re: Fast Vector Highlighter Working for some records only
Koji Sekiguchi wrote (12/02/22 11:58):

dhaivat wrote: Thanks for the reply, but can you please tell me why it's working for some documents and not for others?

Since Solr 1.4.1 does not recognize the hl.useFastVectorHighlighter flag, it just ignores it; but because hl=true is there, Solr tries to create highlight snippets using the existing (traditional, i.e. non-FVH) Highlighter. The Highlighter (including FVH) sometimes cannot produce snippets for various reasons; for those cases you can use the hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField koji -- Query Log Visualizer for Apache Solr http://soleami.com/

Thank you so much for the explanation. I have updated my Solr version and am using 3.5. Could you please tell me: when I am using a custom tokenizer on the field, do I need to make any changes related to the Solr highlighter? Here is my custom analyzer:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

And here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens using my custom analyzer, and when I try to use the highlighter it doesn't work properly for the contents field. But when I tried Solr's inbuilt tokenizer, the word was highlighted for the same query. Can you please help me out with this? Thanks in advance, Dhaivat
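One thing worth checking with a custom tokenizer (an assumption about the cause, since the inbuilt tokenizer works for you): the FastVectorHighlighter builds its fragments from the term vector offsets, so the tokenizer must report each token's real character positions through OffsetAttribute. A minimal whitespace tokenizer showing the offset bookkeeping, written against the Lucene 3.5 API:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class OffsetAwareTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int pos = 0; // character position in the original input

  public OffsetAwareTokenizer(Reader in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int c;
    while ((c = input.read()) != -1 && Character.isWhitespace(c)) pos++; // skip gaps
    if (c == -1) return false;
    final int start = pos;
    termAtt.setEmpty();
    do {
      termAtt.append((char) c);
      pos++;
    } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
    final int end = pos;
    if (c != -1) pos++; // we consumed one whitespace char past the token
    // these offsets are what FVH reads back from the term vectors
    offsetAtt.setOffset(correctOffset(start), correctOffset(end));
    return true;
  }

  @Override
  public void reset(Reader in) throws IOException {
    super.reset(in);
    pos = 0;
  }
}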
Re: reader/searcher refresh after replication (commit)
You'll *really like* the SolrCloud stuff going into trunk when it's baked for a while. Best, Erick
Re: reader/searcher refresh after replication (commit)
Erick,

"You'll *really like* the SolrCloud stuff going into trunk when it's baked for a while"

How stable is SolrCloud at the moment? I cannot wait to try it out. Kind regards, Em
Re: How to handle to run testcases in ruby code for solr
Hi Erik, I tried the links you gave. While running rake I am getting this error:

Errno::ECONNREFUSED: No connection could be made because the target machine actively refused it. - connect(2)
Solr on netty
Is anybody aware of any effort to port Solr to Netty (or any other async-IO based framework)? Even at medium load (10 parallel clients) with 16 shards, performance seems to deteriorate quite sharply as load increases, compared to an alternative (async-IO based) solution. -Prasenjit
Re: How to handle to run testcases in ruby code for solr
I'm not sure what to suggest at this point... obviously your test setup is trying to hit a Solr server that isn't running. Check the host and port that it is trying, and ensure that Solr is running as your tests expect, or use the mock approach that I just replied about. Note, again, that solr-ruby is deprecated and unsupported at this point. I recommend you give the RSolr project a try if you want support in the future. Erik
solr 3.5 and indexing performance
Hello, I wanted to switch to a new version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed. I'm on 3.1, and after a few tests I discovered that 3.4 does a lot better than 3.5. My schema is really simple, a few fields using the text type:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

All data and configuration are the same: same schema, same solrconfig, same Jetty.

SOLR 3.5:

Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
  commit{dir=/vol/home/mciurla/proj/solr/accordion3.5/example/solr/data/index,segFN=segments_bl,version=1329831219365,generation=417,filenames=[...]}
Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1329831219365
Feb 22, 2012 3:40:47 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[2271874, 2271875, 2271876, 2271877, 2271878, 2271879, 2271880, 2271881, ... (100 adds)]} 0 14213
Feb 22, 2012 3:40:47 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=14213

When on Solr 3.4:

Feb 22, 2012 3:42:56 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
  commit{dir=/vol/home/mciurla/proj/solr/accordion3.4/example/solr/data/index,segFN=segments_29,version=1329918470592,generation=81,filenames=[...]}
Re: reader/searcher refresh after replication (commit)
It's certainly stable enough to start experimenting with, and I know that it's under pretty active development now. I've seen a lot of back-and-forth between Mark Miller and Jamie Johnson, Jamie trying things and Mark responding. It's part of trunk, so be prepared for occasional re-indexing being required; this isn't related to SolrCloud, just the fact that it's only available on trunk. And I'm certain that the more eyes look at it, the better it'll be, so I'd say go for it. I tried out the example here: http://wiki.apache.org/solr/SolrCloud and it went quite well, but I didn't stress it much yet (that's next). Personally, I'd put it through some pretty heavy testing before deploying to production at this point, just because of all the new features on trunk. But having people work with it is the best way to move the effort forward. So feel free! Erick
Re: Solr on netty
By 16 shards do you mean you have 16 nodes, and each single client request causes a distributed search across all of them? How many concurrent requests are your 10 clients making to each node? NIO works well when there are many clients but servicing those client requests needs only intermittent CPU. That's not the pattern we see for search. You *can* easily configure Solr's Jetty to use NIO when accepting client connections, but it won't do you any good, just as switching to Netty wouldn't do anything here. Where NIO could help a little is with the requests that Solr makes to other Solr instances. Solr is already architected for async request-response to other nodes, but the current underlying implementation uses HttpClient 3 (which doesn't have NIO). Anyway, it's unlikely that NIO vs BIO will make much of a difference with the numbers you're talking about (16 shards). Someone else reported that we have the number of connections per host set too low, and they saw big gains by increasing it. There's an issue open to make this configurable in 3x: https://issues.apache.org/jira/browse/SOLR-3079 We should probably up the max connections per host by default. -Yonik lucidimagination.com
Unusually long data import time?
Hello, Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true! Devon Baumgarten
Re: Unusually long data import time?
Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list should be able to better indicate whether 18 hours sounds right for your situation.

-Glen Newton
http://zzzoot.blogspot.com/
Re: solr 3.5 and indexing performance
"I wanted to switch to a new version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed."

Could it be the autoCommit configuration in solrconfig.xml?
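For reference, the block in question lives in the updateHandler section of solrconfig.xml; a sketch with placeholder values. If the 3.5 config commits more aggressively than the 3.4 one being compared, the commit cost alone could explain a large slowdown:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- placeholder: commit after this many docs -->
    <maxTime>60000</maxTime>   <!-- placeholder: or after this many ms -->
  </autoCommit>
</updateHandler>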
SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
We started observing strange failures from the ReplicationHandler when we commit on master (trunk version, 4-5 days old). It works sometimes and sometimes not; I didn't dig deeper yet. It looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Does this look familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
  at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
  at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
  at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
  at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
  at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
  at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
  at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
  at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
  at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
  ... 15 more
Re: Solr on netty
Thanks for the response. Yes, we have 16 shards/partitions, each on one of 16 different nodes, and a separate master Solr receiving continuous parallel requests from 10 client threads running on a single separate machine. Our observation was that performance degraded non-linearly as the load (number of concurrent clients) increased. Some follow-up questions:

1. What is the default maximum number of threads configured when a Solr instance makes calls to the other 16 partitions?
2. How do I increase the maximum number of connections for Solr-to-Solr interactions, as you mentioned in your mail?
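Regarding the second question, the SOLR-3079 patch Yonik pointed to makes this configurable per search handler in solrconfig.xml. A sketch of that configuration with placeholder values; the exact parameter set depends on the build, so check the issue for what your version accepts:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- placeholder value: raises the cap on concurrent connections
       that this node opens to each shard -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="maxConnectionsPerHost">100</int>
  </shardHandlerFactory>
</requestHandler>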
RE: Unusually long data import time?
Oh sure! As best as I can, anyway. I have not set the Java heap size, or really configured it at all. The server, running both SQL Server and Solr, has:

* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting. My schema has these fields:

<field name="Id" type="string" indexed="true" stored="true"/>
<field name="RecordId" type="int" indexed="true" stored="true"/>
<field name="RecordType" type="string" indexed="true" stored="true"/>
<field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true"/>
<field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true"/>
<copyField source="Name" dest="NameFuzzy"/>
<field name="NameType" type="string" indexed="true" stored="true"/>

Custom types:

LikeText: PatternReplaceCharFilterFactory (\W+ = ), KeywordTokenizerFactory, StopFilterFactory (~40 words in stoplist), ASCIIFoldingFilterFactory, LowerCaseFilterFactory, EdgeNGramFilterFactory, LengthFilterFactory (min:3, max:512)

FuzzyText: PatternReplaceCharFilterFactory (\W+ = ), KeywordTokenizerFactory, StopFilterFactory (~40 words in stoplist), ASCIIFoldingFilterFactory, LowerCaseFilterFactory, NGramFilterFactory, LengthFilterFactory (min:3, max:512)

Devon Baumgarten
Re: Fields, Facets, and Search Results
Hi darul, you're right, I was not using defaultSearchField. So, following your suggestions, I added

<defaultSearchField>text</defaultSearchField>

and

<copyField source="section_text_content" dest="text"/>

This required that I add a field named text, which is fine; I did that. Now, when I commit the doc for indexing, I get this error: SOLR returned a #400 Error: Error adding field section_text_content. . .
Solr Performance Improvement and degradation Help
As I've mentioned before, I'm very new to Solr; I'm not a Java guy or an Apache guy, I'm a .NET guy. We have a rather large schema: some 100+ fields plus a large number of dynamic fields. We've been trying to improve performance and finally got around to implementing fast vector highlighting, which gave us an immediate improvement in QTime (nearly 70%) and also improved the overall response time by over 20%. With that, we also bring back an extraordinarily large amount of data in the XML; some results (20 records) come back with a payload between 3 MB and even 17 MB. We have a lot of report text that is used for searching and highlighting. We recently implemented field-list wildcards on two versions of Solr to test them out. This allowed us to leave the report text off the response and decreased the payload significantly, by nearly 85% in the large cases... So we'd expect a performance boost there; however, we are seeing greatly increased response times on these builds of Solr even though the QTime is incredibly fast.

To put it in perspective: our original Solr core is 4.0, I believe the 4.0.0.2010.12.10.08.54.56 version. On our test boxes, we have one running the 4.0.0.2011.11.17 version and one running the 4.0.0.2012.02.16 version. With the older version (not having the wildcard field list), it returns a payload of approximately 13 MB in an average of 1.5 seconds. With the new version (2012.02.16), which is on the same machines as the older version (so network traffic/latency/hardware/etc. are all the same), it returns the reduced payload (approximately 1.5 MB) in an average of 3.5-4 seconds. I will say that we reloaded the core once and briefly saw the 1.5 MB payload come back in 150-200 milliseconds, but within minutes we were back to the 3.5-4 seconds. We also noticed the CPU was being pegged for seconds at a time when running the queries on the new build with the wildcard field list.

We have a lower-scale box running the 2011.11.17 version and had more success for a while: we were getting the 150-200 ms response time on the reduced payload for probably 30 minutes or so, and then it did the same thing, bumping up to 3-4 seconds in response time. Does anyone have any experience with this type of random yet consistent performance degradation, or have insight into what might be causing the issue and how to fix it? We'd love to have not only the performance boost from fast vector highlighting, but also the decreased payload size. Thanks in advance!
Re: How to merge an autofacet with a predefined facet
I'm not sure I understand your solution. When (and how) would the 'word' detection in the full text happen: beforehand (on my own), or during Solr indexing (with Solr)?
Problem parsing queries with forward slashes and multiple fields
I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine:

fieldName:/a
fieldName:/*

But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice:

fieldName:/a fieldName:/a

results in: no field name specified in query and no defaultSearchField defined in schema.xml

SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no defaultSearchField defined in schema.xml
  at org.apache.solr.search.SolrQueryParser.checkNullField(SolrQueryParser.java:106)
  at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:124)
  at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1058)
  at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)
  at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)
  at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:212)
  at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
  at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:74)
  at org.apache.solr.search.QParser.getQuery(QParser.java:143)

fieldName:/* fieldName:/*

results in: null

java.lang.NullPointerException
  at org.apache.solr.schema.IndexSchema$DynamicReplacement.matches(IndexSchema.java:747)
  at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1026)
  at org.apache.solr.schema.IndexSchema.getFieldType(IndexSchema.java:980)
  at org.apache.solr.search.SolrQueryParser.getWildcardQuery(SolrQueryParser.java:172)
  at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1039)
  at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)
  at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)
  at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:212)
  at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
  at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:74)
  at org.apache.solr.search.QParser.getQuery(QParser.java:143)

Any ideas as to what may be wrong and how I can make these work? I'm on a 4.0 snapshot from Nov 29, 2011.
RE: Unusually long data import time?
I changed the heap size (Xmx1582m was as high as I could go). The import is at about 5% now, and from that I now estimate about 13 hours. It's hard to say though.. it keeps going up little by little. If I get approval to use Solr for this project, I'll have them install a 64-bit JVM instead, but is there anything else I can do?

Devon Baumgarten
Application Developer

-Original Message-
From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com]
Sent: Wednesday, February 22, 2012 10:32 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unusually long data import time?

Oh sure! As best as I can, anyway. I have not set the Java heap size, or really configured it at all. The server running both the SQL Server and Solr has:
* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting. My schema has these fields:

<field name="Id" type="string" indexed="true" stored="true"/>
<field name="RecordId" type="int" indexed="true" stored="true"/>
<field name="RecordType" type="string" indexed="true" stored="true"/>
<field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true"/>
<field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true"/>
<copyField source="Name" dest="NameFuzzy"/>
<field name="NameType" type="string" indexed="true" stored="true"/>

Custom types:

*LikeText
PatternReplaceCharFilterFactory (\W+ => " ")
KeywordTokenizerFactory
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
EdgeNGramFilterFactory
LengthFilterFactory (min:3, max:512)

*FuzzyText
PatternReplaceCharFilterFactory (\W+ => " ")
KeywordTokenizerFactory
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
NGramFilterFactory
LengthFilterFactory (min:3, max:512)

Devon Baumgarten

-Original Message-
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, February 22, 2012 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list should be able to better indicate whether 18 hours sounds right for your situation.

-Glen Newton

On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote:

Hello, Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true!

Devon Baumgarten

--
-
http://zzzoot.blogspot.com/
-
How to check if a field is a multivalue field with java
Hello, is there any way to check if a field of a SolrDocument is a multivalued field with Java (SolrJ)? Greets Thomas
Re: Solr HBase - Re: How is Data Indexed in HBase?
Mr Gupta, Thanks so much for your reply! In my use cases, retrieving data by keyword is one of them. I think Solr is a proper choice. However, Solr does not provide complex enough support for ranking. And frequent updating is also not suitable in Solr. So it is difficult to retrieve data randomly based on values other than keyword frequency in text. In this case, I attempt to use HBase. But I don't know how HBase supports high performance when it needs to keep consistency in a large-scale distributed system. Now both of them are used in my system. I will check out ElasticSearch. Best regards, Bing

On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.com wrote:

Bing, It's a classic battle on whether to use Solr or HBase or a combination of both. Both systems are very different, but there is some overlap in their utility. They also differ vastly when it comes to computation power, storage needs, etc. So in the end, it all boils down to your use case. You need to pick the technology that is best suited to your needs. I'm still not clear on your use case though. Btw, if you haven't started using Solr yet, you might want to check out ElasticSearch. I spent over a week researching between Solr and ES and eventually chose ES due to its cool merits. thanks

On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

There is no secondary index support in HBase at the moment. It's on our road map. FYI

On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

Jacques, Yes. But I still have questions about that. In my system, when users search with a keyword arbitrarily, the query is forwarded to Solr. No updating operations other than appending new indexes exist in the Solr-managed data. When I need to retrieve data based on ranking values, HBase is used. And the ranking values need to be updated all the time. Is that correct? My question is that the performance must be low if keeping consistency in a large-scale distributed environment. How does HBase handle this issue? Thanks so much! Bing

On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:

It is highly unlikely that you could replace Solr with HBase. They're really apples and oranges.

On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:

Dear all, I wonder how data in HBase is indexed? Now Solr is used in my system because data is managed in an inverted index. Such an index is suitable for retrieving unstructured and huge amounts of data. How does HBase deal with the issue? May I replace Solr with HBase? Thanks so much! Best regards, Bing
Re: Unusually long data import time?
In my first try with the DIH, I had several sub-entities and it was making six queries per document. My 20M doc load was going to take many hours, most of a day. I rewrote it to eliminate those, and now it makes a single query for the whole load and takes 70 minutes. These are small documents, just the metadata for each book.

wunder
Search Guy
Chegg

On Feb 22, 2012, at 9:41 AM, Devon Baumgarten wrote:

I changed the heap size (Xmx1582m was as high as I could go). The import is at about 5% now, and from that I now estimate about 13 hours. It's hard to say though.. it keeps going up little by little. If I get approval to use Solr for this project, I'll have them install a 64-bit JVM instead, but is there anything else I can do?

Devon Baumgarten
Re: How to merge an autofacet with a predefined facet
If you use the suggested solution, it will detect the words at indexing time. However, Solr's FilterFactory lifecycle keeps no track of whether a file for synonyms, keywords etc. has been changed since Solr's last startup. Therefore a change within these files is not visible until you reload your core. Furthermore, keywords for old documents aren't added automatically if you change your keywords (and reload the core) - you have to write a routine that finds documents matching the new keywords and reindex those documents.

Example: Your keyword list at time t1 contains two words:

keyword
codeword

You are indexing two documents:

doc1: {content: "I am about a secret codeword."}
doc2: {content: "Happy keyword and the gang."}

Your filter will mark codeword in doc1 and keyword in doc2 as words to keep and remove everything else. Therefore their content for your keep-word field contains only:

doc1: {indexedContent: "codeword"}
doc2: {indexedContent: "keyword"}

However, if you add the word gang to your keyword list AND reload your SolrCore, doc2 will still only contain the term keyword until it gets reindexed.

Kind regards, Em

Am 22.02.2012 17:56, schrieb Xavier:

I'm not sure I understand your solution. When (and how) does the 'word' detection in the full text happen: beforehand (on my own) or during (with) Solr indexing?
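For completeness, a minimal sketch of the kind of keep-word analyzer described above (the type name and file name are illustrative, not taken from your schema):

<fieldType name="keepword_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keeps only the tokens listed in keywords.txt and discards everything else -->
    <filter class="solr.KeepWordFilterFactory" words="keywords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>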
Re: How to check if a field is a multivalue field with java
Hi Thomas, With Java (from within a custom handler in Solr) you can get a handle to the IndexSchema from the request, like so:

IndexSchema schema = req.getSchema();
SchemaField sf = schema.getField(fieldname);
boolean isMultiValued = sf.multiValued();

From within SolrJ code, you can use SolrDocument.getFieldValue(), which returns an Object, so you could do an instanceof check - if it's a Collection, it's multivalued, else not.

Object value = sdoc.getFieldValue(fieldname);
boolean isMultiValued = value instanceof Collection;

At least this is what I do; I don't think there is a way to get a handle to the IndexSchema object over SolrJ...

-sujit

On Feb 22, 2012, at 9:41 AM, tschiela wrote:

Hello, is there any way to check if a field of a SolrDocument is a multivalued field with Java (SolrJ)? Greets Thomas
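Put together as a small self-contained sketch (the document and field name are placeholders; note that this only inspects what came back in the response, not the schema):

import java.util.Collection;
import org.apache.solr.common.SolrDocument;

public class MultiValuedCheck {
    // true if the stored value for this field was returned as a collection of values
    public static boolean isMultiValued(SolrDocument sdoc, String fieldname) {
        Object value = sdoc.getFieldValue(fieldname);
        return value instanceof Collection;
    }
}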
Re: How to merge an autofacet with a predefined facet
Btw: Solr has no downtime while reloading the core. It loads the new core and, while loading the new one, it still serves requests with the old one. When the new one is ready (and warmed up), it finally replaces the old core.

Best, Em

Am 22.02.2012 17:56, schrieb Xavier:

I'm not sure I understand your solution. When (and how) does the 'word' detection in the full text happen: beforehand (on my own) or during (with) Solr indexing?
Re: Problem parsing queries with forward slashes and multiple fields
Yury, are you sure your request has proper URL encoding?

Kind regards, Em

Am 22.02.2012 18:25, schrieb Yury Kats:

I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 12:25 PM, Yury Kats wrote: I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it?
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:05 PM, Em wrote: Yury, are you sure your request has a proper url-encoding? Yes
Re: Solr HBase - Re: How is Data Indexed in HBase?
Solr does not provide a complex enough support to rank.

I believe Solr has a bunch of pluggability to write your own custom ranking approach. If you think you can't do your desired ranking with Solr, you're probably wrong and need to ask for help from the Solr community.

retrieving data by keyword is one of them. I think Solr is a proper choice

The key to keyword retrieval is the construction of the data. Among other things, this is one of the key things that Solr is very good at: creating a very efficient organization of the data so that you can retrieve quickly. At their core, Solr, ElasticSearch, Lily and Katta all use Lucene to construct this data. HBase is bad at this.

how HBase support high performance when it needs to keep consistency in a large scale distributed system

HBase is primarily built for retrieving a single row at a time based on a predetermined and known location (the key). It is also very efficient at splitting massive datasets across multiple machines and allowing sequential batch analyses of these datasets. HBase can maintain high performance in this way because consistency only ever exists at the row level. This is what HBase is good at.

You need to focus on what you're doing and then write it out. Figure out how you think the pieces should work together. Read the documentation. Then ask specific questions where you feel the documentation is unclear or you feel confused. Your general questions are very difficult to answer in any kind of really helpful way.

thanks, Jacques

On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote:

Mr Gupta, Thanks so much for your reply! In my use cases, retrieving data by keyword is one of them. I think Solr is a proper choice.
Re: Problem parsing queries with forward slashes and multiple fields
2012/2/22 Yury Kats yuryk...@yahoo.com: On 2/22/2012 12:25 PM, Yury Kats wrote: I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 -Yonik lucidimagination.com
Re: Unusually long data import time?
Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
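For reference, this is the sort of block meant here in solrconfig.xml (the thresholds are only examples):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after this many added docs or this many milliseconds -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>
</updateHandler>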
Re: Problem parsing queries with forward slashes and multiple fields
That's strange. Could you provide a sample dataset? I'd like to try it out. Kind regards, Em Am 22.02.2012 19:17, schrieb Yury Kats: On 2/22/2012 1:05 PM, Em wrote: Yury, are you sure your request has a proper url-encoding? Yes
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:25 PM, Em wrote: That's strange. Could you provide a sample dataset? Data set does not matter. The query fails to parse, long before it gets to the data.
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:24 PM, Yonik Seeley wrote: This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 Oh, so / is a special character now? I don't think it is mentioned as such on any of the wiki pages, or in org.apache.solr.client.solrj.util.ClientUtils
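In the meantime, escaping the slash by hand on the client side works; a minimal sketch (the helper class is hypothetical, and whether your snapshot's ClientUtils.escapeQueryChars already covers '/' is worth checking):

public class SlashEscape {
    // escape '/' so the classic query parser does not treat the term as a regex
    public static String escapeSlashes(String s) {
        return s.replace("/", "\\/");
    }

    public static void main(String[] args) {
        String term = escapeSlashes("/a");
        // prints: fieldName:\/a fieldName:\/a
        System.out.println("fieldName:" + term + " fieldName:" + term);
    }
}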
RE: Unusually long data import time?
Ahmet, I do not. I commented autoCommit out.

Devon Baumgarten

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird.

QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#]

[#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#]

[#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|org.apache.solr.servlet.SolrDispatchFilter|_ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41)
at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
at org.apache.so
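The trace shows the highlighter rewriting a multi-term query into a BooleanQuery, which is where the 1024-clause limit is hit. If simply raising the cap is an acceptable workaround (it does not explain why the query expands so far), that limit lives in solrconfig.xml:

<!-- global cap on the number of clauses a rewritten BooleanQuery may contain; 4096 is only an example -->
<maxBooleanClauses>4096</maxBooleanClauses>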
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:24 PM, Yonik Seeley wrote: Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 Looks like regex matching happens across multiple fields though. Feels like a bug to me?
Re: Unusually long data import time?
Devon, you ought to try updating from many threads (I do not know if DIH can do it - check), but Lucene does a great job if fed from many update threads... It depends on where your time gets lost, but it is usually a) the analysis chain or b) the database. If it is a) and your server has spare CPU cores, you can scale at roughly the number of cores.

On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote:

Ahmet, I do not. I commented autoCommit out.

Devon Baumgarten
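if you feed Solr from your own client instead of DIH, SolrJ already gives you multi-threaded feeding; a sketch (the URL, queue size and thread count are just examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
        // buffers up to 10000 docs and sends them to Solr from 4 parallel threads
        SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        server.add(doc);
        server.commit();
    }
}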
dih and solr cloud
out of curiosity, trying to see if the new cloud features can replace what I use now... how is (batch) update forwarding solved at the cloud level? Imagine a simple one shard, one replica case: if I fire up a DIH update, is this going to be replicated to the replica shard? If yes, is it going to be:
- sent document by document (network; imagine 100Mio+ update commands going to a replica for big batches)
- somehow batched into packages to reduce load
- distributed at index level somehow

This is an important case, solved today with master/slave Solr replication, but it is not mentioned at http://wiki.apache.org/solr/SolrCloud
Re: Fields, Facets, and Search Results
Well, you probably need to clear your index first: remove the index directory, restart your server, and try again. Let me know if it works or not.
Re: Fields, Facets, and Search Results
And check your log file; you may have some errors at the start of your server, due to some mistake - bad syntax in your schema file, for example...
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out
Hi, I am getting the below error while running a delta import, and my index is not updated. Could you please let me know what might be causing this issue? I am using Solr 3.5, and around 60+ documents are supposed to be updated by the delta import.

[org.apache.solr.handler.dataimport.SolrWriter] - Error creating document : SolrInputDocument[...]
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solr/data/5159200/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1108)
at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:636)
at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:303)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:179)
at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:390)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:429)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Re: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out
Hi Uomesh, I was facing similar issues a few days ago and was able to resolve it by deleting the lock file created in the index directory and restarting my Solr server. I have documented the same in one of the posts at http://www.params.me/2011/12/solr-index-lock-issue.html

Hope it helps!
-param

On 2/22/12 2:36 PM, Uomesh uom...@gmail.com wrote:

Hi, I am getting the below error while running a delta import, and my index is not updated. Could you please let me know what might be causing this issue? I am using Solr 3.5, and around 60+ documents are supposed to be updated by the delta import.

[org.apache.solr.handler.dataimport.SolrWriter] - Error creating document : SolrInputDocument[...]
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solr/data/5159200/index/write.lock
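If stale locks after crashes keep recurring, Solr 3.x can also clear them for you at startup. Use this with care, and only when you are certain no other process is writing to the index (it goes in the mainIndex section of solrconfig.xml):

<!-- removes a leftover write.lock when the core starts; dangerous if another writer is still live -->
<unlockOnStartup>true</unlockOnStartup>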
Re: Solr Performance Improvement and degradation Help
As an update to this... I tried running a query against the 4.0.0.2010.12.10.08.54.56 version and the newer 4.0.0.2012.02.16 (both on the same box). So the query params were the same and the returned results were the same, but the 4.0.0.2010.12.10.08.54.56 version returned the results in about 1.6 seconds and the newer (4.0.0.2012.02.16) version returned the results in about 4 seconds. If I add the wildcard field list to the newer version, the time increases anywhere from .5-1 second. These are all averages after running the queries several times over a 30 minute period (allowing for warming and cache). Anybody have any insight into why the newer versions are performing a bit slower?
Re: solr 3.5 and indexing performance
I have it all commented out in the updateHandler; I'm pretty sure there is no default autoCommit:

<updateHandler class="solr.DirectUpdateHandler2">

iorixxx wrote:

I wanted to switch to a new version of Solr, exactly 3.5, but I'm getting a big drop in indexing speed.

Could it be the autoCommit configuration in solrconfig.xml?
result present in Solr 1.4, but missing in Solr 3.5, dismax only
I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4 but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped. Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.02063975 = queryNorm
  6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
    1.0 = tf(phraseFreq=1.0)
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
(no matches)

***Solr 1.4***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
score: 7.449651 = (MATCH) sum of:
  3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
    0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~1), product of:
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.014681898 = queryNorm
    5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
      1.0 = tf(phraseFreq=1.0)
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.109375 = fieldNorm(field=all_search, doc=3469163)
  3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~3 in 3469163), product of:
    0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~3), product of:
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.014681898 = queryNorm
    5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
      1.0 = tf(phraseFreq=1.0)
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.109375 = fieldNorm(field=all_search, doc=3469163)
RE: Unusually long data import time?
Thank you everyone for your patience and suggestions. It turns out I was doing something really unreasonable in my schema. I mistakenly edited the max EdgeNgram size to 512, when I meant to set the lengthFilter max to 512. I brought this to a more reasonable number, and my estimated time to import is now down to 4 hours. Based on the size of my record set, this time is more consistent with Walter's observations in his own project.

Thanks again for your help,
Devon Baumgarten
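For anyone who hits the same thing, this is the shape of the correction (the gram sizes are illustrative; the point is that the 512 belongs on the length filter, not on the n-gram):

<!-- wrong: maxGramSize=512 generates enormous numbers of grams per token -->
<!-- right: keep the grams short and cap token length separately -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
<filter class="solr.LengthFilterFactory" min="3" max="512"/>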
Re: nutch and solr
Thanks for your reply, but it doesn't work. I get the same message: can't convert empty path, and additionally: cannot find class org.apache.nutch.crawl.injector ..

On 22 February 2012 06:14, tamanjit.bin...@yahoo.co.in tamanjit.bin...@yahoo.co.in wrote:

Try this command:

bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name> -threads 10 -depth 2 -topN 1000

Your folder structure will look like this:

nutch folder -- urls -- <folder name> -- <url file>.txt
             |
             -- crawl -- <folder name>

The folder name will be for different domains. So for each domain folder in the urls folder there has to be a corresponding folder (with the same name) in the crawl folder.
Re: 'location' fieldType indexation impossible
Make sure that your schema file is exactly the same on both your local server and the remote server. In particular, there should be a dynamic field definition like:

<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

and you should see a couple of fields appear, like emploi_city_geoloc_0_coordinate and emploi_city_geoloc_1_coordinate, when you index a location type in the field you indicated. This has tripped me up in the past.

If that doesn't apply, then you need to provide more information: more of the stack trace, what you've tried, etc. Because saying "I really don't understand why it isn't working because it was working on my local server with the same configuration (Solr 3.5.0) and the same database !!!" is another way of saying "Something's different between the two versions, I just don't know what yet" G...

So I'd start (make a backup first) by just copying my entire configuration from my local machine to the remote one, restarting Solr and trying again.

Best
Erick

On Wed, Feb 22, 2012 at 5:53 AM, Xavier xav...@audivox.fr wrote:

Hi, When I try to index my location field I get this error for each document:

ATTENTION: Error creating document : Error adding field 'emploi_city_geoloc'='48.85,2.5525'

(so I have 0 files indexed). Here is my schema.xml:

<field name="emploi_city_geoloc" type="location" indexed="true" stored="false"/>

I really don't understand why it isn't working, because it was working on my local server with the same configuration (Solr 3.5.0) and the same database !!! If I try to use geohash instead of location, it works for indexing, but then my geodist query on the front end no longer works... Any ideas?

Best regards, Xavier
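Once the dynamic field is in place and documents index cleanly, a typical Solr 3.x spatial query against that field looks something like this (the point and distance are only examples):

http://localhost:8983/solr/select?q=*:*&sfield=emploi_city_geoloc&pt=48.85,2.5525&fq={!geofilt d=10}&sort=geodist() asc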
Re: Same id on two shards
Hi, I stumbled across this thread after running into the same question. The answers presented here seem a little vague and I was hoping to renew the discussion. I am using a branch of Solr 4, distributed searching over 12 shards. I want the documents in the first shard to always be selected over documents that appear in the other 11 shards. The queries to these shards look something like this:

http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app, ... ,solr_server:/shard_12_app&q=id:

When I execute a query for an ID that I know exists in shard_1 and another shard, I do always get the result from shard 1. Here are some questions that I have:

1. Has anyone rigorously tested the comment in the wiki: "If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic"?
2. Who is relying on this behavior (the document of the first shard is returned) today? When do you notice the wrong document is selected? Do you have a feeling for how frequently your distributed search returns the document from a shard other than the first?
3. Is there a good web source other than the Solr wiki for information about Solr distributed queries?

Thanks, Jerry M.

On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote:

I think the first one to respond is indeed the way it works, but that's only deterministic up to a point (if your small index is in the throes of a commit and everything required for a response happens to be cached on the larger shard ... who knows?)

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/8/2011 4:07 PM, simon wrote:

Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1GB, fits into RAM on that virtual machine) will always respond faster than the other larger shards (over 18GB each). Is this an incorrect assumption on our part? The build system does do everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer time period, but in practice it hasn't happened yet.

Thanks, Shawn
need to support bi-directional synonyms
hello all, i need to support the following: if the user enters sprayer in the desc field, then they get results for BOTH sprayer and washer. And in the other direction: if the user enters washer in the desc field, then they get results for BOTH washer and sprayer. Would I set up my synonym file like this, assuming expand = true?

sprayer => washer
washer => sprayer

thank you, mark
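Or, if I read the SynonymFilterFactory docs right, a single comma-separated group may already be bidirectional when expand is true:

# synonyms.txt - with expand="true", every term in the group maps to all terms in it
sprayer, washer

with the filter configured as:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>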
Trunk build errors
Hi, I am getting numerous errors preventing a build of solrcloud trunk:

[licenses] MISSING LICENSE for the following file:

Any tips to get a clean build working? thanks
Re: Fast Vector Highlighter Working for some records only
Hi dhaivat, I think you may want to use analysis.jsp: http://localhost:8983/solr/admin/analysis.jsp Go to the URL, look into how your custom tokenizer produces tokens, and compare with the output of Solr's inbuilt tokenizer.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/02/22 21:35), dhaivat wrote:

Koji Sekiguchi wrote (12/02/22 11:58):

dhaivat wrote: Thanks for the reply, but can you please tell me why it's working for some documents and not for others?

As Solr 1.4.1 cannot recognize the hl.useFastVectorHighlighter flag, Solr just ignores it, but because hl=true is there, Solr tries to create highlight snippets by using the (existing; traditional; I mean not FVH) Highlighter. Highlighter (including FVH) cannot produce snippets sometimes for some reasons; you can use the hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField

koji

Thank you so much for the explanation. I have updated my Solr version and am using 3.5. Could you please tell me: when I am using a custom Tokenizer on the field, do I need to make any changes related to the Solr highlighter? Here is my custom analyser:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

Here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens using my custom analyser, and when I try to use the highlighter it's not working properly for the contents field. But when I tried Solr's inbuilt tokeniser, I found the word highlighted for a particular query. Can you please help me out with this?

Thanks in advance
Dhaivat
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: The Beatles as musicians : Revolver through the Anthology With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping : as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the :, if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here I write up the issue I ran into (which may or may not have anything to do with what you ran into) http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries, that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from number of tokens the single field's analysis produces? Jonathan On 2/22/2012 2:55 PM, Naomi Dushay wrote: I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from lucene QueryParser to prove the document exists and is found I am completely stumped. 
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
I forgot to include the field definition information:

schema.xml:

<field name="all_search" type="text" indexed="true" stored="false"/>

Solr 3.5:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

Solr 1.4:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4.

Solr 3.5:

position:    1     2     3     4        5      6       7     8
term text:   the   beatl as    musician revolv through the   antholog
keyword:     false false false false    false  false   false false
startOffset: 0     4     12    15       27     36      44    48
endOffset:   3     11    14    24       35     43      47    57
type:        word  word  word  word     word   word    word  word

Solr 1.4:

term position:    1    2     3     4        5      6       7     8
term text:        the  beatl as    musician revolv through the   antholog
term type:        word word  word  word     word   word    word  word
source start,end: 0,3  4,11  12,14 15,24    27,35  36,43   44,47 48,57

- Naomi
Re: String search in Dismax handler
Two things:

1> What version of Solr are you using? qt=dismax isn't going to any request handler, I don't think.
2> What do you get when you add &debugQuery=on? Try that with both results and perhaps that will shed some light. If not, can you post the results?

Best
Erick

On Wed, Feb 22, 2012 at 7:47 AM, mechravi25 mechrav...@yahoo.co.in wrote:

Hi, The string I am searching for is "Pass By Value". I am using qt=dismax (in the request query) as well. When I search the above string with the double quotes, the data is fetched, but the same query string without any double quotes gives no results. Following is the dismax request handler in solrconfig.xml:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">id,score</str>
    <str name="q.alt">*:*</str>
    <str name="f.name.hl.fragsize">0</str>
    <str name="f.name.hl.alternateField">name</str>
    <str name="f.text.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

The same query string works fine with and without double quotes when I use the default request handler. Following is the default request handler in solrconfig.xml:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

Please provide some suggestions as to why the string search without quotes is returning no records when the dismax handler is used. Am I missing out on something? Thanks.
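One more thing worth checking, since the dismax handler config you pasted defines no qf: dismax scores against the fields listed in qf, so a minimal request usually looks something like this (the field names are guesses):

http://localhost:8983/solr/select?defType=dismax&qf=name+text&q=Pass+By+Value&debugQuery=on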
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan, I have the same problem without the colon - I tested that, but didn't mention it. mm can't be the issue either: in Solr 3.5, if I remove one of the occurrences of "the" (doesn't matter which), I get results. Removing any other word does NOT get results. And if the query isn't a phrase query, it gets results. And no, it can't be related to what you refer to as the dismax stopwords problem, since I can demonstrate the problem with a single field.

I have run into problems in the past with a non-alpha character surrounded by spaces tanking my search results for dismax … but I fixed that with this fieldType:

  <!-- single token with punctuation terms removed so dismax doesn't look for punctuation terms in these fields -->
  <!-- On client side, Lucene query parser breaks things up by whitespace *before* field analysis for dismax -->
  <!-- so punctuation terms ( : ;) are stopwords to allow results from other fields when these chars are surrounded by spaces in query -->
  <!-- do not lowercase -->
  <fieldType name="string_punct_stop" class="solr.TextField" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
      <!-- removing punctuation for Lucene query parser issues -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_punctuation.txt" enablePositionIncrements="true" />
    </analyzer>
  </fieldType>

My stopwords_punctuation.txt file is:

  # Punctuation characters we want to ignore in queries
  :
  ;
  /

and I used this type instead of string for fields in my dismax qf. Thus, the punctuation terms in the query are not present for the fields that were formerly string fields.

- Naomi

On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote:

So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: "The Beatles as musicians : Revolver through the Anthology" With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping ':' as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such as to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the ':', if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem.
Here I write up the issue I ran into (which may or may not have anything to do with what you ran into): http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/

Also, you don't say what your 'mm' is in your dismax queries; that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates the number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from the number of tokens the single field's analysis produces?

Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped. Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search:
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
Looks like an issue around the replication IndexWriter reboot, soft commits, and hard commits. I think I've got a workaround for it:

Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision 1292344)
+++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy)
@@ -499,6 +499,17 @@
       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
+      Future[] waitSearcher = new Future[1];
+      solrCore.getSearcher(true, false, waitSearcher, true);
+      if (waitSearcher[0] != null) {
+        try {
+          waitSearcher[0].get();
+        } catch (InterruptedException e) {
+          SolrException.log(LOG, e);
+        } catch (ExecutionException e) {
+          SolrException.log(LOG, e);
+        }
+      }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));

That should allow the searcher that the following commit command prompts to see the *new* IndexWriter.

On Feb 22, 2012, at 10:56 AM, eks dev wrote:

We started observing strange failures from ReplicationHandler when we commit on master (trunk version 4-5 days old). It works sometimes, and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Looks familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
        at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
        at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
        at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
        ... 15 more

- Mark Miller
lucidimagination.com
Re: Solr Highlighting not working with PayloadTermQueries
(12/02/22 7:53), Nitin Arora wrote:

Hi, I'm using Solr and Lucene in my application for search. I'm facing an issue where highlighting via FastVectorHighlighter does not work when I use PayloadTermQueries as clauses of a BooleanQuery. After debugging I found that in DefaultSolrHighlighter.java, fvh.getFieldQuery does not return any term in the termMap.

  FastVectorHighlighter fvh = new FastVectorHighlighter(
      // FVH cannot process the hl.usePhraseHighlighter parameter on a per-field basis
      params.getBool(HighlightParams.USE_PHRASE_HIGHLIGHTER, true),
      // FVH cannot process the hl.requireFieldMatch parameter on a per-field basis
      params.getBool(HighlightParams.FIELD_MATCH, false));
  FieldQuery fieldQuery = fvh.getFieldQuery(query);

The reason for the empty termMap is that PayloadTermQuery is discarded while constructing the FieldQuery:

  void flatten(Query sourceQuery, Collection<Query> flatQueries) {
    if (sourceQuery instanceof BooleanQuery) {
      BooleanQuery bq = (BooleanQuery) sourceQuery;
      for (BooleanClause clause : bq.getClauses()) {
        if (!clause.isProhibited())
          flatten(clause.getQuery(), flatQueries);
      }
    } else if (sourceQuery instanceof DisjunctionMaxQuery) {
      DisjunctionMaxQuery dmq = (DisjunctionMaxQuery) sourceQuery;
      for (Query query : dmq) {
        flatten(query, flatQueries);
      }
    } else if (sourceQuery instanceof TermQuery) {
      if (!flatQueries.contains(sourceQuery))
        flatQueries.add(sourceQuery);
    } else if (sourceQuery instanceof PhraseQuery) {
      if (!flatQueries.contains(sourceQuery)) {
        PhraseQuery pq = (PhraseQuery) sourceQuery;
        if (pq.getTerms().length > 1)
          flatQueries.add(pq);
        else if (pq.getTerms().length == 1) {
          flatQueries.add(new TermQuery(pq.getTerms()[0]));
        }
      }
    }
    // else discard queries
  }

What is the best way to get highlighting working with payload term queries?

Hi Nitin,

Thank you for reporting this problem! Your assumption is correct: FVH discards PayloadTermQueries in the flatten() method. I'm not that familiar with SpanQueries, but it looks like SpanTermQuery, which is the superclass of PayloadTermQuery, has a getTerm() method. Do you think that if flatten() could recognize SpanTermQuery and then add the term to flatQueries, it would solve your problem? If so, please open a JIRA ticket. And if you can, attaching a patch would help a lot!

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
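For what it's worth, the change Koji describes would be roughly the following extra branch in flatten(). This is a sketch only, not a tested patch; it relies on the fact that PayloadTermQuery extends SpanTermQuery:

  // requires: import org.apache.lucene.search.spans.SpanTermQuery;
  } else if (sourceQuery instanceof SpanTermQuery) {
    // unwrap the span/payload term query into a plain TermQuery
    // so FVH can put its term into the termMap
    TermQuery tq = new TermQuery(((SpanTermQuery) sourceQuery).getTerm());
    if (!flatQueries.contains(tq))
      flatQueries.add(tq);
  }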
Do nested entities have a representation in Solr indexes?
The data-config.xml file that I have for indexing database contents has nested entity nodes within a document node, and each of the entities contains field nodes. Lucene indexes consist of documents that contain fields. What about entities? If you change the way entities are structured in a data-config.xml file, in what way (if any) does it change how the contents are stored in the index? When I created the entities I am using, and defined the fields in one of the inner entities to be multivalued, I thought that the fields of that entity type would be grouped logically somehow in the index. But then I remembered that Lucene doesn't have a concept of sub-documents (that I know of), so each of the field values will be added to a list, and the extent of the logical grouping would be that the field values that were indexed together would be at the same position in their respective lists. Am I understanding this right, or do entities as defined in data-config.xml have some kind of representation in the index like document and field do?

Thanks,
Mike
RE: Recovering from database connection resets in DataimportHandler
Could you point me to the most non-intimidating introduction to SolrJ that you know of? I have a passing familiarity with Javascript and, with few exceptions, I haven't developed software that has a graphical user interface of any kind in about 25 years. I like the idea of having finer control over data imported from a database, though.

Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataimportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's actually quite easy to create one, although as always it may be a bit intimidating to get started. This allows you much finer control over error conditions than DIH does, so it may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary tmole...@uw.edu wrote:

I am trying to use Solr's DataImportHandler to index a large number of database records in a SQL Server database that is owned and managed by a group we are collaborating with. The indexing jobs I have run so far, except for the initial very small test runs, have failed due to database connection resets. I have gotten indexing jobs to go further by using CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the connection URL, but I think that in order to index that data I'm going to have to work out how to catch database connection reset exceptions and resubmit the queries that failed. Can anyone suggest a good way to approach this? Or have any of you encountered this problem and worked out a solution to it already?

Thanks,
Mike
Re: distributed deletes working?
I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com
Is there a way to write a DataImportHandler deltaQuery that compares contents still to be imported to contents in the index?
I am working on indexing the contents of a database that I don't have permission to alter. This matters because the DataImportHandler examples that show how to specify a deltaQuery attribute value use database tables that have a last_modified column, and they compare these values with the last_index_time values stored in the dataimport.properties file. The tables in the database I am working with don't have anything like a last_modified column. An indexing job I was running yesterday failed, and I would like to restart it so that it only imports the data that it hasn't already indexed. As a one-off, I could create a list of the keys of the database records that have been indexed and hack in something that reads that list as part of how it figures out what to index, but I was wondering if there is something built in that would allow me to do the same kind of comparison in a likely far more elegant way. What kinds of information do the deltaQuery attributes have access to, apart from the database tables, columns, etc., and do they have access to any information that would help me with what I want to do?

Thanks,
Mike

P.S. While we're on the subject of delta... attributes, can someone explain to me what the difference is between the deltaQuery and the deltaImportQuery attributes?
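On the P.S.: as a hedged sketch of the usual pattern (the item table and its columns here are hypothetical), deltaQuery selects only the primary keys of rows that changed since the last import, while deltaImportQuery fetches the full row for each of those keys via the ${dih.delta.id} variable that DIH fills in per key:

  <entity name="item" pk="id"
          query="SELECT id, name FROM item"
          deltaQuery="SELECT id FROM item
                      WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT id, name FROM item
                            WHERE id = '${dih.delta.id}'">
  </entity>

Without a last_modified-style column there is nothing built in for deltaQuery to compare against, which is why a workaround such as an external key list (or SolrJ, as suggested elsewhere in this thread) comes up.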
Re: distributed deletes working?
Yonik did fix an issue around peer sync and deletes a few days ago - long chance that was involved? Otherwise, neither Sami nor I have replicated these results so far. On Feb 22, 2012, at 8:56 PM, Jamie Johnson wrote: I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
Re: distributed deletes working?
Perhaps if you could give me the steps you're using to test I can find an error in what I'm doing. On Wed, Feb 22, 2012 at 9:24 PM, Mark Miller markrmil...@gmail.com wrote: Yonik did fix an issue around peer sync and deletes a few days ago - long chance that was involved? Otherwise, neither Sami nor I have replicated these results so far. On Feb 22, 2012, at 8:56 PM, Jamie Johnson wrote: I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated:

  The Beatles as musicians : Revolver through the Anthology        (repeated: "the")
  Color-blindness [print/digital]; its dangers and its detection   (repeated: "its")

but this is a PHRASE search. In case it's relevant, both Solr 1.4 and Solr 3.5:

- do NOT use stopwords in the fieldtype;
- mm is 6<-1 6<90% for dismax;
- qs is 1;
- ps is 3.

And both use this filter last:

  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />

… but I believe that filter is only applied to consecutive tokens. Lastly, "Color-blindness [print/digital]; its and its detection" works ("dangers" is removed, rather than one of the repeated "its").

- Naomi

On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote:

So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: "The Beatles as musicians : Revolver through the Anthology" With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping ':' as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such as to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the ':', if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here I write up the issue I ran into (which may or may not have anything to do with what you ran into): http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries; that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates the number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from the number of tokens the single field's analysis produces?

Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped.
Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.02063975 = queryNorm
  6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
    1.0 = tf(phraseFreq=1.0)
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
(no matches)

***Solr 1.4***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 =
Re: Recovering from database connection resets in DataimportHandler
It *just happens* that I wrote a blog on this very topic, see: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

That code contains two rather different methods, one that indexes from a SQL database and one that indexes random files with client-side Tika.

Best
Erick

On Wed, Feb 22, 2012 at 8:51 PM, Mike O'Leary tmole...@uw.edu wrote:

Could you point me to the most non-intimidating introduction to SolrJ that you know of? I have a passing familiarity with Javascript and, with few exceptions, I haven't developed software that has a graphical user interface of any kind in about 25 years. I like the idea of having finer control over data imported from a database, though.

Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataimportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's actually quite easy to create one, although as always it may be a bit intimidating to get started. This allows you much finer control over error conditions than DIH does, so it may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary tmole...@uw.edu wrote:

I am trying to use Solr's DataImportHandler to index a large number of database records in a SQL Server database that is owned and managed by a group we are collaborating with. The indexing jobs I have run so far, except for the initial very small test runs, have failed due to database connection resets. I have gotten indexing jobs to go further by using CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the connection URL, but I think that in order to index that data I'm going to have to work out how to catch database connection reset exceptions and resubmit the queries that failed. Can anyone suggest a good way to approach this? Or have any of you encountered this problem and worked out a solution to it already?

Thanks,
Mike
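As a rough sketch of the SolrJ-plus-JDBC approach (this is not Erick's blog code; the URLs, table, and column names are made up for illustration, and it assumes the SolrJ 3.x CommonsHttpSolrServer client). The point is the part DIH makes hard: resuming after a connection reset instead of restarting the whole job.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class JdbcIndexer {
    public static void main(String[] args) throws Exception {
      // hypothetical Solr URL and JDBC connection string
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      String lastId = "";   // resume point, so a reset doesn't restart the job
      boolean done = false;
      while (!done) {       // a real job would cap the number of retries
        Connection conn = DriverManager.getConnection(
            "jdbc:sqlserver://dbhost;databaseName=mydb;responseBuffering=adaptive",
            "user", "password");
        try {
          Statement stmt = conn.createStatement();
          // read rows in key order so we can resume from the last indexed key
          ResultSet rs = stmt.executeQuery(
              "SELECT id, title, body FROM documents WHERE id > '"
              + lastId + "' ORDER BY id");
          while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            doc.addField("text", rs.getString("body"));
            solr.add(doc);
            lastId = rs.getString("id");
          }
          done = true;      // made it through the whole result set
        } catch (SQLException e) {
          // connection reset: log and loop around, re-querying from lastId
          System.err.println("DB error after id " + lastId + ", reconnecting: " + e);
        } finally {
          try { conn.close(); } catch (SQLException ignore) {}
        }
      }
      solr.commit();
    }
  }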
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay ndus...@stanford.edu wrote:

Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated: "The Beatles as musicians : Revolver through the Anthology"; "Color-blindness [print/digital]; its dangers and its detection" -- but this is a PHRASE search.

Can you take your same phrase queries, simply add some slop to them (e.g. ~3), and ensure they still match with the lucene queryparser? SloppyPhraseQuery has a bit of a history with repeated terms since the Lucene 2.9 you were using:

https://issues.apache.org/jira/browse/LUCENE-3068
https://issues.apache.org/jira/browse/LUCENE-3215
https://issues.apache.org/jira/browse/LUCENE-3412

--
lucidimagination.com
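Concretely, the check being asked for is just the earlier lucene-parser query with slop appended, for example:

  q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3

The exact phrase already matches with the lucene parser, and dismax adds slop (qs/ps) when it builds its phrase queries; so if this sloppy version stops matching, the problem is in SloppyPhraseQuery's handling of repeated terms (the JIRA issues above) rather than in dismax itself.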
default fq in dismax request handler being overridden
I have a dismax request handler with a default fq parameter:

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        sku^9.0 upc^9.1 searchKeyword^1.9 series^2.8 productTitle^1.2 productID^9.0
        manufacturer^4.0 masterFinish^1.5 theme^1.1 categoryName^2.0 finish^1.4
      </str>
      <str name="pf">
        searchKeyword^2.1 text^0.2 productTitle^1.5 manufacturer^4.0 finish^1.9
      </str>
      <str name="bq">isTopSeller:true^1.30</str>
      <str name="bf">linear(popularity,1,2)^3.0</str>
      <str name="fl">productID,manufacturer</str>
      <str name="mm">3&lt;-1 5&lt;-2 6&lt;90%</str>
      <int name="ps">100</int>
      <int name="qs">3</int>
      <str name="fq">discontinued:false</str>
    </lst>
  </requestHandler>

I understand that when I send a search request like

  /select?qt=dismax&q=f-0&sort=score%20desc&fq=type_string:faucet

the fq in the request overrides the default fq defined in the handler. What I would like to know is whether there is a way to always include the fq that is defined in the query handler and not have it be overridden, but appended automatically to any solr searches that use the query handler.
Re: need to support bi-directional synonyms
Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote:

hello all,

i need to support the following:

if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer.

and in the other direction

if the user enters washer in the desc field - then they get results for BOTH washer and sprayer.

would i set up my synonym file like this? assuming expand = true...

  sprayer => washer
  washer => sprayer

thank you,
mark
Re: default fq in dismax request handler being overridden
Think I answered my own question... I need to use an appends list.
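For the record, that would look something like this (a sketch based on the handler config above; the appends list is standard solrconfig.xml, and parameters placed there are added to every request rather than replaced by request parameters):

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <!-- qf, pf, mm, etc. as before, but without the fq -->
    </lst>
    <lst name="appends">
      <str name="fq">discontinued:false</str>
    </lst>
  </requestHandler>

With that in place, a request fq like type_string:faucet is ANDed with discontinued:false instead of replacing it.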
Re: Development inside or outside of Solr?
Hi François Schiettecatte,

Thank you for the reply all the same, but I have chosen to stick with Solr (wrapped with the Tika language API) and make the changes outside Solr.

Best Regards,
Bing
problem with parsering (using Tika) on remote glassfish
Hi all!

I'm using the Tika parser to index my files into Solr. I created my own parser (which extends XMLParser). It uses my own mimetype. I created a jar file which inside looks like this:

  src
  |- main
  |  |- some_packages
  |     |- MyParser.java
  |- resources
     |- META-INF
     |  |- services
     |     |- org.apache.tika.parser.Parser   (which contains some_packages.MyParser.java)
     |- org
        |- apache
           |- tika
              |- mime
                 |- custom-mimetypes.xml

In custom-mimetypes.xml I put the definition of the new mimetype, because my xml files have some special tags.

Now here is the problem: I've been testing parsing and indexing with Solr on glassfish installed on my local machine. It worked just fine. Then I wanted to install it on some remote server. There is the same version of glassfish installed (3.1.1). I copied over the Solr application and its home directory with all libraries (including the tika jars and the jar with my custom parser). Unfortunately it doesn't work. After posting files to Solr I can see in the content-type field that it detected my custom mime type. But none of the fields that are supposed to be there appear, as if the MyParser class was never run. The only fields I get are the ones from Dublin Core. I checked (by simply adding some printlns) that Tika is only using XMLParser.

Has anyone had a similar problem? How do I handle this?

Regards,
Ola
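One thing worth double-checking here (hedged, since the "(which contains some_packages.MyParser.java)" in the listing may just be loose shorthand): a Java service-loader file must list the fully qualified *class name*, not the source file name. That is, META-INF/services/org.apache.tika.parser.Parser should contain exactly:

  some_packages.MyParser

If that file lists anything else, or the jar isn't on the classpath Tika actually uses on the remote glassfish, the custom parser is never registered and Tika falls back to the parsers it can find, which could explain the behaviour described (mimetype detected, but only XMLParser/Dublin Core output).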
Re: need to support bi-directional synonyms
Use a single comma-separated line:

  sprayer, washer

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Regards
Bernd

On 23.02.2012 07:03, remi tassing wrote:

Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer. and in the other direction if the user enters washer in the desc field - then they get results for BOTH washer and sprayer. would i set up my synonym file like this? assuming expand = true...

  sprayer => washer
  washer => sprayer

thank you,
mark
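To spell out why: with expand="true", a comma-separated group is symmetric, so each member expands to all members and the two one-directional "=>" rules aren't needed. A sketch of the analyzer side (file name as in the wiki example):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>

and in synonyms.txt:

  sprayer, washer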
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
thanks Mark, I will give it a go and report back...

On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller markrmil...@gmail.com wrote:

Looks like an issue around the replication IndexWriter reboot, soft commits, and hard commits. I think I've got a workaround for it:

Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision 1292344)
+++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy)
@@ -499,6 +499,17 @@
       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
+      Future[] waitSearcher = new Future[1];
+      solrCore.getSearcher(true, false, waitSearcher, true);
+      if (waitSearcher[0] != null) {
+        try {
+          waitSearcher[0].get();
+        } catch (InterruptedException e) {
+          SolrException.log(LOG, e);
+        } catch (ExecutionException e) {
+          SolrException.log(LOG, e);
+        }
+      }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));

That should allow the searcher that the following commit command prompts to see the *new* IndexWriter.

On Feb 22, 2012, at 10:56 AM, eks dev wrote:

We started observing strange failures from ReplicationHandler when we commit on master (trunk version 4-5 days old). It works sometimes, and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Looks familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
        at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
        at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
        at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
        ... 15 more

- Mark Miller
lucidimagination.com
Re: Development inside or outside of Solr?
Hi Erick,

The example is impressive. Thank you.

For the first, we decided not to do that, as Tika extraction is the time-consuming part of indexing large files, and the dual call makes the situation worse. For the second, for now we chose DSpace to connect to the DB, with Discovery (Solr) as the index/query layer. Thus, we may make our revisions in DSpace.

Best Regards,
Bing
Re: Do nested entities have a representation in Solr indexes?
Hello Mike,

Solr is still too flat. Work is in progress: https://issues.apache.org/jira/browse/SOLR-3076

A good introduction is in Michael's blog http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html but it's only about Lucene queries. A colleague of mine blogged about the same problem, but solved it with an alternative approach: http://blog.griddynamics.com/search/label/Solr

In the end we gave up on term positions/spans and are considering BJQ (block-join queries) as a solution.

Regards

On Thu, Feb 23, 2012 at 5:37 AM, Mike O'Leary tmole...@uw.edu wrote:

The data-config.xml file that I have for indexing database contents has nested entity nodes within a document node, and each of the entities contains field nodes. Lucene indexes consist of documents that contain fields. What about entities? If you change the way entities are structured in a data-config.xml file, in what way (if any) does it change how the contents are stored in the index? When I created the entities I am using, and defined the fields in one of the inner entities to be multivalued, I thought that the fields of that entity type would be grouped logically somehow in the index. But then I remembered that Lucene doesn't have a concept of sub-documents (that I know of), so each of the field values will be added to a list, and the extent of the logical grouping would be that the field values that were indexed together would be at the same position in their respective lists. Am I understanding this right, or do entities as defined in data-config.xml have some kind of representation in the index like document and field do?

Thanks,
Mike

--
Sincerely yours
Mikhail Khludnev
Lucid Certified Apache Lucene/Solr Developer
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
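To make "BJQ" concrete, here is a rough sketch of the block-join idea from Mike McCandless's post, using the Lucene join module as it exists on trunk (in older branches the class was contrib BlockJoinQuery, so names may differ); the field names are hypothetical:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.CachingWrapperFilter;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryWrapperFilter;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.join.ScoreMode;
  import org.apache.lucene.search.join.ToParentBlockJoinQuery;

  // child documents must be indexed in the same block as their parent
  // (IndexWriter.addDocuments), with the parent last and marked type:parent
  Query childQuery = new TermQuery(new Term("color", "red"));
  Filter parentsFilter = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
  // matches parents whose children match childQuery, scoring by the avg child score
  Query parentsMatchingChildren =
      new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.Avg);

Until SOLR-3076 lands, this is a Lucene-level tool only; from Solr, the entities in data-config.xml are flattened exactly as Mike describes.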